Pipelines: [Discuss][SDK] Future direction of VolumeOp and ResourceOp

Created on 24 Nov 2020 · 13Comments · Source: kubeflow/pipelines

Background: comments in https://github.com/kubeflow/pipelines/pull/4575#pullrequestreview-536585912

There are several questions regarding the way ResourceOp and VolumeOp were implemented. Basically:

They are not as portable as ContainerOp -- the current ability of interpreting them relies on Argo.
They actually entail some side effects to other components in the pipeline. For example, if a container step uses a volume for data passing then it will be harder to use it in another pipeline because that volume will only exist in the current K8S cluster, which undermines the decomposability of the component.

I think we can explore more options and discuss a bit regarding the maturity and what should be the best way to implement such feature in user-facing APIs.

kinfeature

Source

numerology

Most helpful comment

They are not as portable as ContainerOp -- the current ability of interpreting them relies on Argo.

Indeed, they are currently tightly coupled to Argo, but this seems to be the case for most of the current code in the repo. I understand that we want to decouple the backend, but ResourceOp seems to be just a tiny part of this problem.

We've been planning for other parts already, so we'd want to make sure VolumeOp won't be left behind.

I'd love to contribute in the direction proposed by @Ark-kun in #4083. The solution described there is to change what happens in the background, and what that object is translated into [for the workflow engine used].

We could even make ResourceOps spawn "special" ContainerOps that essentially spawn an argoexec container. Then there will be no special thing.

They actually entail some side effects to other components in the pipeline. For example, if a container step uses a volume for data passing then it will be harder to use it in another pipeline because that volume will only exist in the current K8S cluster, which undermines the decomposability of the component.

Correct. But we also see a lot of advanced users who want to actually use ResourceOps and volumes for data passing, so I think we should at least give them the option to be able to do this, if they know what they are doing.

There have been discussions in the past with users wanting to spawn Kubernetes resources (e.g., TFJobs) and have full control over the resource that they will deploy, instead of just a few options using an existing component.

I think implementation-wise, that's the direction we want to go. Welcome contributions on this topic!
But considering interface, why does this feature need to exist in KFP DSL? It should be readily usable using the standard component.yaml like all other components, this would reduce unnecessary interface area for KFP on features we have not been able to abstract --- data passing.

I believe for component.yaml components, you can also accept a JSON patch or YAML patch as a parameter to allow arbitrary manipulation from users.

Plus, we can't expect people to create a component for each and every resource they may want to deploy. My understanding is that this was the original purpose of this feature.

I think it'd make sense creating a K8s resource component that will have a similar interface as resource op to support the generic case, while (depending on use-case), building specialized components that manage resources for specific components.

Specifically for the case of volumes, your argument stands, but there are a lot of other reasons, especially wrt performance and efficiency for which advanced users choose to use PVCs for data passing inside the cluster. Also note that even in the case of inter-cluster execution you can still use volumes if you have an intelligent storage layer that can reference and clone globally immutable snapshots.

Thus, since a lot of users are currently relying on this, I think we should definitely support it, and we can put effort to even make it Argo-agnostic.

Completely agree, volume is also something we want to support. We are pushing against the current interface to support volumes. I believe @Ark-kun has started some early efforts in https://github.com/kubeflow/pipelines/pull/3371, which allows using volume for data passing, while allowing pipelines/components to not become strongly coupled with volumes.

Bobgy on 30 Nov 2020

👍2

All 13 comments

/cc @elikatsis
/cc @Ark-kun
/cc @Bobgy

numerology on 24 Nov 2020

Hi @numerology,
thanks for starting the conversation.

They are not as portable as ContainerOp -- the current ability of interpreting them relies on Argo.

Indeed, they are currently tightly coupled to Argo, but this seems to be the case for most of the current code in the repo. I understand that we want to decouple the backend, but ResourceOp seems to be just a tiny part of this problem.

I'd love to contribute in the direction proposed by @ark-kun in #4083. The solution described there is to change what happens in the background, and what that object is translated into [for the workflow engine used].

We could even make ResourceOps spawn "special" ContainerOps that essentially spawn an argoexec container. Then there will be no special thing.

They actually entail some side effects to other components in the pipeline. For example, if a container step uses a volume for data passing then it will be harder to use it in another pipeline because that volume will only exist in the current K8S cluster, which undermines the decomposability of the component.

Correct. But we also see a lot of advanced users who want to actually use ResourceOps and volumes for data passing, so I think we should at least give them the option to be able to do this, if they know what they are doing.

There have been discussions in the past with users wanting to spawn Kubernetes resources (e.g., TFJobs) and have full control over the resource that they will deploy, instead of just a few options using an existing component.

Plus, we can't expect people to create a component for each and every resource they may want to deploy. My understanding is that this was the original purpose of this feature.

Specifically for the case of volumes, your argument stands, but there are a lot of other reasons, especially wrt performance and efficiency for which advanced users choose to use PVCs for data passing inside the cluster. Also note that even in the case of inter-cluster execution you can still use volumes if you have an intelligent storage layer that can reference and clone globally immutable snapshots.

Thus, since a lot of users are currently relying on this, I think we should definitely support it, and we can put effort to even make it Argo-agnostic.

elikatsis on 25 Nov 2020

They are not as portable as ContainerOp -- the current ability of interpreting them relies on Argo.

Indeed, they are currently tightly coupled to Argo, but this seems to be the case for most of the current code in the repo. I understand that we want to decouple the backend, but ResourceOp seems to be just a tiny part of this problem.

We've been planning for other parts already, so we'd want to make sure VolumeOp won't be left behind.

I'd love to contribute in the direction proposed by @Ark-kun in #4083. The solution described there is to change what happens in the background, and what that object is translated into [for the workflow engine used].

We could even make ResourceOps spawn "special" ContainerOps that essentially spawn an argoexec container. Then there will be no special thing.

They actually entail some side effects to other components in the pipeline. For example, if a container step uses a volume for data passing then it will be harder to use it in another pipeline because that volume will only exist in the current K8S cluster, which undermines the decomposability of the component.

Correct. But we also see a lot of advanced users who want to actually use ResourceOps and volumes for data passing, so I think we should at least give them the option to be able to do this, if they know what they are doing.

There have been discussions in the past with users wanting to spawn Kubernetes resources (e.g., TFJobs) and have full control over the resource that they will deploy, instead of just a few options using an existing component.

I believe for component.yaml components, you can also accept a JSON patch or YAML patch as a parameter to allow arbitrary manipulation from users.

Plus, we can't expect people to create a component for each and every resource they may want to deploy. My understanding is that this was the original purpose of this feature.

Specifically for the case of volumes, your argument stands, but there are a lot of other reasons, especially wrt performance and efficiency for which advanced users choose to use PVCs for data passing inside the cluster. Also note that even in the case of inter-cluster execution you can still use volumes if you have an intelligent storage layer that can reference and clone globally immutable snapshots.

Thus, since a lot of users are currently relying on this, I think we should definitely support it, and we can put effort to even make it Argo-agnostic.

Bobgy on 30 Nov 2020

👍2

re: @elikatsis comment "_But we also see a lot of advanced users who want to actually use ResourceOps and volumes for data passing, so I think we should at least give them the option to be able to do this, if they know what they are doing._", my team are not particularly advanced, but we do need to modify resources, eg. cpu, mem, on the TFX InfraValidator component when using KFP to orchestrate a TFX pipeline. Btw, I got to this discussion looking for help in actually making this work LOL

adriangay on 11 Dec 2020

@adriangay

Thanks for chiming in! I'm curious what's your use case for TFX InfraValidator: are you looking for configure those resource and make sure the InfraValidator launches pod at a node with those resources?

Meanwhile I don't think the current pattern (i.e. ResourceOp) will suffice in that case b/c the interoperability b/w TFX components and KFP ones is a missing piece right now.

numerology on 11 Dec 2020

@numerology Hi!

We want to use InfraValidator when training models in production to perform the basic 'mechanical' testing of serving the model in a 'sandbox' just like it says in the brochure LOL. So yes, we want to set vCPUs and memory to match the containers used when it's served by KFServing. I've been at this for a few days now.

First I was trying to use override_component_config. I also need to run tests outside of the real KFP cluster, and I gave up on that because it can only be specified on KubeflowDagRunnerConfig and I wanted to check the compiled pipeline YAML locally and it's not compatible with InProcessComponentLauncher. I thought I was being clever specifying all three launcher classes in supported_launcher_classes, but config_utils.find_component_launch_info() fails because component_launcher_class.can_launch() is looking for executor type ExecutorContainerSpec in the KubernetesComponentLauncher case sigh

Today I discovered the pipeline_operator_funcs arg of KubeflowDagRunnerConfig and was just starting to try using dsl.ContainerOp to set it, something like this:

    pipeline_operator_funcs_test = dsl.ContainerOp(name='test', image='xxx')
    pipeline_operator_funcs_test.set_cpu_requests('1.0')
    pipeline_operator_funcs_test.set_cpu_limit('10')

But maybe you're telling me I'm not going to be able to do this, even the pipeline is orchestrated by KFP?

adriangay on 12 Dec 2020

@adriangay

Reading from TFX code here https://github.com/tensorflow/tfx/blob/dc163ce3b1a264cb337f7a25a69737cb19770e81/tfx/components/infra_validator/model_server_runners/kubernetes_runner.py#L140 it does not seem to respect anything passed from KFP side. The source of truth regarding the configuration they are using is from this proto message https://github.com/tensorflow/tfx/blob/dc163ce3b1a264cb337f7a25a69737cb19770e81/tfx/proto/infra_validator.proto#L96

I think some serious customization needs to be done to 1) the proto definition and 2) the executor of InfraValidator to support that. Do you mind opening an issue in TFX repo and we can continue the discussion there?

numerology on 12 Dec 2020

@numerology

Hi. Before I cause trouble over on TFX :) I thought I'd let you know what I've managed to accomplish. The following excerpt from a compiled pipeline YAML TFX InfraValidator component shows modification of resources:

      - --component_config
      - '{"__class__": "KubernetesComponentConfig", "__module__": "tfx.orchestration.config.kubernetes_component_config",
        "__tfx_object_type__": "jsonable", "pod": {"spec": {"containers": [{"name":
        "main", "resources": {"limits": {"cpu": 66, "memory": "66Gi"}, "requests":
        {"cpu": 11, "memory": "11Gi"}}}]}}}'

I achieved this using override_component_config as follows:

    override_component_config = kubernetes_component_config.KubernetesComponentConfig(
        client.V1Pod(
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="main", resources={"limits": {"memory": "66Gi", "cpu": 66}, "requests": {"memory": "11Gi", "cpu": 11}}
                    )
                ]
            )
        )
    )

    runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
        kubeflow_metadata_config=metadata_config,
        tfx_image=tfx_image,
        supported_launcher_classes=[
            in_process_component_launcher.InProcessComponentLauncher,
            docker_component_launcher.DockerComponentLauncher,
            kubernetes_component_launcher.KubernetesComponentLauncher
        ],
        component_config_overrides={infraval.id: override_component_config}

    kdr = kubeflow_dag_runner.KubeflowDagRunner(
        config=runner_config, output_filename=OUTPUT_FILENAME
    )

    with mock.patch('tfx.orchestration.launcher.kubernetes_component_launcher.KubernetesComponentLauncher.can_launch') as mocked_can_launch:
        mocked_can_launch.return_value = True
        kdr.run(pipeline)

In order to run this locally on OSX, I cheated, by mocking KubernetesComponentLauncher.can_launch which would fail locally, but running it with a local runner, supplied via supported_launcher_classes.

But, would appreciate any comment about whether I have really achieved this, or whether I'm just fooling myself (because I haven't deployed on a real test cluster... yet).

adriangay on 15 Dec 2020

@adriangay

The current behavior of InfraValidator is as follows:

The code in its executor will be launched using InprocessComponentLauncher, and then
the executor will use KubernetesRunner to call K8S API to create a namespaced pod, to smoke-test your model.

Neither step will actually use the KubernetesComponentConfig being passed. IIUC, KubernetesComponentConfig only applies to components whose executor is written in ExecutorContainerSpec, but it's not applicable here.

numerology on 15 Dec 2020

@numerology

As I mentioned, I am trying to perform functional test prior to deployment in a real K8s/KFP pipeline. I am running a KubeflowDagRunner 'locally' to verify the operation of infraVal. Earlier tests use a local BeamRunner and local Docker launch config, and that executes InfraValidator component via local Docker API. What I'm doing now ius trying to prove that we can modify the container resources, prior to testing for real in an Integration cluster. If I remove kubernetes_component_launcher from supported_launcher_classes then I get error RuntimeError: No launcher info can be found for component "InfraValidator". because I am running a KubeflowDagRunner. I need to use KubeflowDagRunner in order to be able to provide the component_config_overrides [on it's config]. this can't be specified on a local BeamDagRunner. The objective is a functional test of setting container resources... to prove it will work prior to deployment. To be clear, [I think] the effect of what I have done is modify the K8S deployment YAML that Compile generates prior to execution, but executing the pipeline locally (both part of run), ie. not requiring it to actually run on K8s [which is not available locally].

I am happy to accept I may be going about this the wrong way. I am a TFX/KFP noob :)

adriangay on 15 Dec 2020

@adriangay

Sorry I got a bit of confused. What you're doing can perhaps effectively modify the resource spec where the executor of InfraValidator is running, but not the pod spun up to validate the model, right?

If that's the goal here then I think you're on the right track modulo some additional works perhaps.

numerology on 15 Dec 2020

Hi @adriangay, would you mind moving the discussion to your own issue? because that seems deviated from this topic.

Bobgy on 16 Dec 2020

@Bobgy sure, no problem. At this time, my 'problem' is solved, but I will use a new issue in future. I appreciate the guidance I got here.

adriangay on 16 Dec 2020

👍1

Was this page helpful?

0 / 5 - 0 ratings