Pipelines: Make the DSL closer to the DSL that data scientists expect

Created on 10 Aug 2019 · 11 Comments · Source: kubeflow/pipelines

Developing a model in the current DSL is far from what "usual" model development looks like.

There are a lot of extra things that the data scientist (DS) now needs to worry about and did not need to before, such as pipeline volumes. We should think about a DSL that wraps most of these things and lets the DS worry only about the model's code.

Based on Netflix's Metaflow abstraction, I was thinking of something like the following example, but I am open to discussing other options.

This is the code you would need to write to build a simple pipeline. Dependencies are declared through the arguments of the @step decorator:

class TestFlow(SuperClass):
    @step()
    def s1(self):
        print('s1')
        return 1

    @step()
    def s2(self):
        print('s2')
        return 2

    @step(s1, s2)
    def join(self, *inputs):
        print('s3')
        return max(inputs)

This is the generated pipeline:

[Image: example pipeline run]
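To make the proposal concrete, here is a minimal local sketch of how such a decorator could record dependencies and run the steps in order. Everything in it is an assumption for illustration: `step` and `Flow` (standing in for SuperClass) are hypothetical names, nothing here is part of KFP or Metaflow, and the steps run in-process rather than in containers.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def step(*upstream):
    """Hypothetical decorator: tag a method with the steps it depends on."""
    def decorate(func):
        func.upstream = [u.__name__ for u in upstream]
        return func
    return decorate


class Flow:
    """Hypothetical base class standing in for SuperClass."""

    def run(self):
        # Collect every method of the subclass that was tagged by @step.
        steps = {
            name: attr
            for name, attr in vars(type(self)).items()
            if callable(attr) and hasattr(attr, "upstream")
        }
        graph = {name: set(func.upstream) for name, func in steps.items()}
        results = {}
        # Run each step only after all of its upstream steps have finished.
        for name in TopologicalSorter(graph).static_order():
            func = steps[name]
            inputs = [results[dep] for dep in func.upstream]
            results[name] = func(self, *inputs)
        return results


class TestFlow(Flow):
    @step()
    def s1(self):
        return 1

    @step()
    def s2(self):
        return 2

    @step(s1, s2)
    def join(self, *inputs):
        return max(inputs)


results = TestFlow().run()
print(results["join"])  # 2
```

A real implementation would replace the in-process call with something that schedules each step as its own pipeline task.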

gabrielvcbessa

All 11 comments

@gabrielvcbessa this is interesting. Any thought about how you could compile this down into an Argo workflow?

jlewi on 10 Aug 2019

@gabrielvcbessa this is interesting. Any thought about how you could compile this down into an Argo workflow?

@jlewi In the example above I am just wrapping the specified method with a ContainerOp that uses a python image. This container runs the method and writes the result in a file that is used as a file_output.

You will have problems when handling things that need a special serializer, but it is just a proof of concept.
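The file-based hand-off described above can be sketched locally as follows. This is not the proof-of-concept code itself, only an illustration under stated assumptions: `run_step` and `read_output` are hypothetical helpers, the steps run in the current process instead of a container, and JSON is used as the serializer, which is exactly where values without a JSON representation would need special handling.

```python
import json
import tempfile
from pathlib import Path


def run_step(func, output_path, *args):
    """Call func and write its JSON-serialized result to output_path."""
    result = func(*args)
    Path(output_path).write_text(json.dumps(result))


def read_output(output_path):
    """Load the value that a previous step wrote."""
    return json.loads(Path(output_path).read_text())


def s1():
    return 1


def s2():
    return 2


with tempfile.TemporaryDirectory() as workdir:
    s1_out = Path(workdir) / "s1.json"
    s2_out = Path(workdir) / "s2.json"
    join_out = Path(workdir) / "join.json"

    run_step(s1, s1_out)
    run_step(s2, s2_out)
    # The downstream step receives the upstream values read back from disk.
    run_step(max, join_out, read_output(s1_out), read_output(s2_out))
    joined = read_output(join_out)

print(joined)  # 2
```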

gabrielvcbessa on 10 Aug 2019

@gabrielvcbessa How is this DSL wrapper different from using .after or inferring inputs in the current DSL (which is also the approach in Airflow)?

Or are you referring to writing Python code vs. defining container ops?

yaronha on 11 Aug 2019

@yaronha I am referring to being able to write just simple python code instead of needing to define container ops.

gabrielvcbessa on 11 Aug 2019

@gabrielvcbessa Have you looked into python_op? It is a way to write Python code directly without specifying a container. There are other options, like Nuclio and Fairing, that convert code + dependencies to containers automatically.

I do think that, as you scale, it makes more sense to separate components/steps into independent code elements that can be version controlled and tested independently, re-used with slight parameter changes (saving lots of extra dev work), or tested with different hyper-parameters just by changing the pipeline. Another thing to consider is package dependencies and ops aspects like DB credentials, which may be different for every step (e.g. data prep may depend on data manipulation and DB packages, while training steps may depend on deep learning packages).

With MLRUN we tried to capture the notion that each step is an independent, reusable component, and that you form a workflow by soft-wiring them (steps can be code elements too). We use Python eval to enumerate parameters, which allows passing native Python structures like dicts between steps.

yaronha on 11 Aug 2019

👍1

Thanks for that @yaronha! I will take a look at MLRUN. For now, I think the issue can be closed! :)

gabrielvcbessa on 11 Aug 2019

@gabrielvcbessa You can write the code in a similar style using the Lightweight Python components feature of KFP. It might be even more organic than in your example.

One thing I've noticed in your example is that you're conflating "components" and "tasks".
In your code, each component (a function) is also a task (step). There is no way to use the same component multiple times.
I also see that the data passing looks pretty rudimentary.

Here is how you can do the same (and much more) with the current DSL:

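from kfp.components import func_to_container_op

# Standalone functions take no `self`; the return annotation declares the component's output.
@func_to_container_op
def s1() -> int:
    print('s1')
    return 1

@func_to_container_op
def s2() -> int:
    print('s2')
    return 2

@func_to_container_op
def add2(a: int, b: int) -> int:
    print('add2')
    return a + b

def my_pipeline():
    add2(s1().output, s2().output)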
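The component/task distinction can be illustrated without KFP at all. The sketch below uses hypothetical `Component` and `Task` classes (assumptions for illustration, not KFP classes): a component is a reusable definition, and every call to it creates a separate task.

```python
import itertools


class Component:
    """Hypothetical stand-in for a reusable component: a function plus a name."""

    _counter = itertools.count(1)

    def __init__(self, func):
        self.func = func
        self.name = func.__name__

    def __call__(self, *args):
        # Each call creates a new, uniquely named task from the same component.
        return Task(f"{self.name}-{next(self._counter)}", self, args)


class Task:
    """Hypothetical stand-in for one step in a pipeline graph."""

    def __init__(self, name, component, args):
        self.name = name
        self.component = component
        self.args = args

    def run(self):
        # Resolve upstream tasks to their values before calling the function.
        values = [a.run() if isinstance(a, Task) else a for a in self.args]
        return self.component.func(*values)


def add2(a, b):
    return a + b


add2_op = Component(add2)

first = add2_op(1, 2)
second = add2_op(first, 10)  # same component, second task

print(first.name, second.name)  # add2-1 add2-2
print(second.run())  # 13
```

In the class-based proposal, by contrast, each method can appear in the graph only once, because the method itself is the step.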

Ark-kun on 16 Aug 2019

👍1

@Ark-kun Seems pretty good!

We were also thinking about making PVCs simpler for the end user, something like this:

@volume_claim('data', 1, mount_path='/mnt')
@func_to_container_op
def s1() -> int:
    print('s1')
    return 1


@volume_claim('data', mount_path='/mnt')
@func_to_container_op
def s2() -> int:
    print('s2')
    return 2


@volume_claim('data', mount_path='/mnt')
@func_to_container_op
def max2(a: int, b: int) -> int:
    print('s3')
    return max(a, b)

If a PVC with the same name has already been declared, every other claim with that name would reuse it instead of calling VolumeOp again. What do you think?

This behavior could be controlled with a flag or something similar.

gabrielvcbessa on 16 Aug 2019

We were also thinking about making PVCs simpler for the end user,

My view is that mounting specific PVCs belongs at the task level, not the component level (imagine a shared component trying to use the same hard-coded volume on all clusters).

We already show how to easily mount PVCs in our samples: https://github.com/kubeflow/pipelines/blob/0d898cb40f2bf30b711fce07c3d8769a9dc819b9/samples/core/tfx_cab_classification/tfx_cab_classification.py#L158

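from kfp import dsl
from kfp.onprem import mount_pvc
# Call this inside the pipeline function so that it applies to every op in that pipeline.
dsl.get_pipeline_conf().op_transformers.append(mount_pvc('claim-name', volume_mount_path='/mnt'))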

Ark-kun on 19 Aug 2019

👍1

BTW, my previous example is not how func_to_container_op is normally used. We prefer to use it as a function, so that you have both the original python function and the component:

max2_op = func_to_container_op(max2)

Ark-kun on 19 Aug 2019

👍1

Makes sense. Our idea was to hide everything related to the pipeline, PVCs, and other such details so that the user can write almost the same code in both scenarios, but a lot of things make this kinda hard.

I would love to contribute to this project and to the Python SDK. Is there a contribution guide?

gabrielvcbessa on 19 Aug 2019
