Background: comments in https://github.com/kubeflow/pipelines/pull/4575#pullrequestreview-536585912
There are several questions regarding the way ResourceOp and VolumeOp were implemented. Basically:
ContainerOp -- the current ability of interpreting them relies on Argo.I think we can explore more options and discuss a bit regarding the maturity and what should be the best way to implement such feature in user-facing APIs.
/cc @elikatsis
/cc @Ark-kun
/cc @Bobgy
Hi @numerology,
thanks for starting the conversation.
They are not as portable as ContainerOp -- the current ability of interpreting them relies on Argo.
Indeed, they are currently tightly coupled to Argo, but this seems to be the case for most of the current code in the repo. I understand that we want to decouple the backend, but ResourceOp seems to be just a tiny part of this problem.
I'd love to contribute in the direction proposed by @ark-kun in #4083. The solution described there is to change what happens in the background, and what that object is translated into [for the workflow engine used].
We could even make ResourceOps spawn "special" ContainerOps that essentially spawn an argoexec container. Then there will be no special thing.
They actually entail some side effects to other components in the pipeline. For example, if a container step uses a volume for data passing then it will be harder to use it in another pipeline because that volume will only exist in the current K8S cluster, which undermines the decomposability of the component.
Correct. But we also see a lot of advanced users who want to actually use ResourceOps and volumes for data passing, so I think we should at least give them the option to be able to do this, if they know what they are doing.
There have been discussions in the past with users wanting to spawn Kubernetes resources (e.g., TFJobs) and have full control over the resource that they will deploy, instead of just a few options using an existing component.
Plus, we can't expect people to create a component for each and every resource they may want to deploy. My understanding is that this was the original purpose of this feature.
Specifically for the case of volumes, your argument stands, but there are a lot of other reasons, especially wrt performance and efficiency for which advanced users choose to use PVCs for data passing inside the cluster. Also note that even in the case of inter-cluster execution you can still use volumes if you have an intelligent storage layer that can reference and clone globally immutable snapshots.
Thus, since a lot of users are currently relying on this, I think we should definitely support it, and we can put effort to even make it Argo-agnostic.
They are not as portable as ContainerOp -- the current ability of interpreting them relies on Argo.
Indeed, they are currently tightly coupled to Argo, but this seems to be the case for most of the current code in the repo. I understand that we want to decouple the backend, but ResourceOp seems to be just a tiny part of this problem.
We've been planning for other parts already, so we'd want to make sure VolumeOp won't be left behind.
I'd love to contribute in the direction proposed by @Ark-kun in #4083. The solution described there is to change what happens in the background, and what that object is translated into [for the workflow engine used].
We could even make ResourceOps spawn "special" ContainerOps that essentially spawn an
argoexeccontainer. Then there will be no special thing.They actually entail some side effects to other components in the pipeline. For example, if a container step uses a volume for data passing then it will be harder to use it in another pipeline because that volume will only exist in the current K8S cluster, which undermines the decomposability of the component.
Correct. But we also see a lot of advanced users who want to actually use ResourceOps and volumes for data passing, so I think we should at least give them the option to be able to do this, if they know what they are doing.
There have been discussions in the past with users wanting to spawn Kubernetes resources (e.g., TFJobs) and have full control over the resource that they will deploy, instead of just a few options using an existing component.
I think implementation-wise, that's the direction we want to go. Welcome contributions on this topic!
But considering interface, why does this feature need to exist in KFP DSL? It should be readily usable using the standard component.yaml like all other components, this would reduce unnecessary interface area for KFP on features we have not been able to abstract --- data passing.
I believe for component.yaml components, you can also accept a JSON patch or YAML patch as a parameter to allow arbitrary manipulation from users.
Plus, we can't expect people to create a component for each and every resource they may want to deploy. My understanding is that this was the original purpose of this feature.
I think it'd make sense creating a K8s resource component that will have a similar interface as resource op to support the generic case, while (depending on use-case), building specialized components that manage resources for specific components.
Specifically for the case of volumes, your argument stands, but there are a lot of other reasons, especially wrt performance and efficiency for which advanced users choose to use PVCs for data passing inside the cluster. Also note that even in the case of inter-cluster execution you can still use volumes if you have an intelligent storage layer that can reference and clone globally immutable snapshots.
Thus, since a lot of users are currently relying on this, I think we should definitely support it, and we can put effort to even make it Argo-agnostic.
Completely agree, volume is also something we want to support. We are pushing against the current interface to support volumes. I believe @Ark-kun has started some early efforts in https://github.com/kubeflow/pipelines/pull/3371, which allows using volume for data passing, while allowing pipelines/components to not become strongly coupled with volumes.
re: @elikatsis comment "_But we also see a lot of advanced users who want to actually use ResourceOps and volumes for data passing, so I think we should at least give them the option to be able to do this, if they know what they are doing._", my team are not particularly advanced, but we do need to modify resources, eg. cpu, mem, on the TFX InfraValidator component when using KFP to orchestrate a TFX pipeline. Btw, I got to this discussion looking for help in actually making this work LOL
@adriangay
Thanks for chiming in! I'm curious what's your use case for TFX InfraValidator: are you looking for configure those resource and make sure the InfraValidator launches pod at a node with those resources?
Meanwhile I don't think the current pattern (i.e. ResourceOp) will suffice in that case b/c the interoperability b/w TFX components and KFP ones is a missing piece right now.
@numerology Hi!
We want to use InfraValidator when training models in production to perform the basic 'mechanical' testing of serving the model in a 'sandbox' just like it says in the brochure LOL. So yes, we want to set vCPUs and memory to match the containers used when it's served by KFServing. I've been at this for a few days now.
First I was trying to use override_component_config. I also need to run tests outside of the real KFP cluster, and I gave up on that because it can only be specified on KubeflowDagRunnerConfig and I wanted to check the compiled pipeline YAML locally and it's not compatible with InProcessComponentLauncher. I thought I was being clever specifying all three launcher classes in supported_launcher_classes, but config_utils.find_component_launch_info() fails because component_launcher_class.can_launch() is looking for executor type ExecutorContainerSpec in the KubernetesComponentLauncher case sigh
Today I discovered the pipeline_operator_funcs arg of KubeflowDagRunnerConfig and was just starting to try using dsl.ContainerOp to set it, something like this:
pipeline_operator_funcs_test = dsl.ContainerOp(name='test', image='xxx')
pipeline_operator_funcs_test.set_cpu_requests('1.0')
pipeline_operator_funcs_test.set_cpu_limit('10')
But maybe you're telling me I'm not going to be able to do this, even the pipeline is orchestrated by KFP?
@adriangay
Reading from TFX code here https://github.com/tensorflow/tfx/blob/dc163ce3b1a264cb337f7a25a69737cb19770e81/tfx/components/infra_validator/model_server_runners/kubernetes_runner.py#L140 it does not seem to respect anything passed from KFP side. The source of truth regarding the configuration they are using is from this proto message https://github.com/tensorflow/tfx/blob/dc163ce3b1a264cb337f7a25a69737cb19770e81/tfx/proto/infra_validator.proto#L96
I think some serious customization needs to be done to 1) the proto definition and 2) the executor of InfraValidator to support that. Do you mind opening an issue in TFX repo and we can continue the discussion there?
@numerology
Hi. Before I cause trouble over on TFX :) I thought I'd let you know what I've managed to accomplish. The following excerpt from a compiled pipeline YAML TFX InfraValidator component shows modification of resources:
- --component_config
- '{"__class__": "KubernetesComponentConfig", "__module__": "tfx.orchestration.config.kubernetes_component_config",
"__tfx_object_type__": "jsonable", "pod": {"spec": {"containers": [{"name":
"main", "resources": {"limits": {"cpu": 66, "memory": "66Gi"}, "requests":
{"cpu": 11, "memory": "11Gi"}}}]}}}'
I achieved this using override_component_config as follows:
override_component_config = kubernetes_component_config.KubernetesComponentConfig(
client.V1Pod(
spec=client.V1PodSpec(
containers=[
client.V1Container(
name="main", resources={"limits": {"memory": "66Gi", "cpu": 66}, "requests": {"memory": "11Gi", "cpu": 11}}
)
]
)
)
)
runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
kubeflow_metadata_config=metadata_config,
tfx_image=tfx_image,
supported_launcher_classes=[
in_process_component_launcher.InProcessComponentLauncher,
docker_component_launcher.DockerComponentLauncher,
kubernetes_component_launcher.KubernetesComponentLauncher
],
component_config_overrides={infraval.id: override_component_config}
kdr = kubeflow_dag_runner.KubeflowDagRunner(
config=runner_config, output_filename=OUTPUT_FILENAME
)
with mock.patch('tfx.orchestration.launcher.kubernetes_component_launcher.KubernetesComponentLauncher.can_launch') as mocked_can_launch:
mocked_can_launch.return_value = True
kdr.run(pipeline)
In order to run this locally on OSX, I cheated, by mocking KubernetesComponentLauncher.can_launch which would fail locally, but running it with a local runner, supplied via supported_launcher_classes.
But, would appreciate any comment about whether I have really achieved this, or whether I'm just fooling myself (because I haven't deployed on a real test cluster... yet).
@adriangay
The current behavior of InfraValidator is as follows:
InprocessComponentLauncher, and thenNeither step will actually use the KubernetesComponentConfig being passed. IIUC, KubernetesComponentConfig only applies to components whose executor is written in ExecutorContainerSpec, but it's not applicable here.
@numerology
As I mentioned, I am trying to perform functional test prior to deployment in a real K8s/KFP pipeline. I am running a KubeflowDagRunner 'locally' to verify the operation of infraVal. Earlier tests use a local BeamRunner and local Docker launch config, and that executes InfraValidator component via local Docker API. What I'm doing now ius trying to prove that we can modify the container resources, prior to testing for real in an Integration cluster. If I remove kubernetes_component_launcher from supported_launcher_classes then I get error RuntimeError: No launcher info can be found for component "InfraValidator". because I am running a KubeflowDagRunner. I need to use KubeflowDagRunner in order to be able to provide the component_config_overrides [on it's config]. this can't be specified on a local BeamDagRunner. The objective is a functional test of setting container resources... to prove it will work prior to deployment. To be clear, [I think] the effect of what I have done is modify the K8S deployment YAML that Compile generates prior to execution, but executing the pipeline locally (both part of run), ie. not requiring it to actually run on K8s [which is not available locally].
I am happy to accept I may be going about this the wrong way. I am a TFX/KFP noob :)
@adriangay
Sorry I got a bit of confused. What you're doing can perhaps effectively modify the resource spec where the executor of InfraValidator is running, but not the pod spun up to validate the model, right?
If that's the goal here then I think you're on the right track modulo some additional works perhaps.
Hi @adriangay, would you mind moving the discussion to your own issue? because that seems deviated from this topic.
@Bobgy sure, no problem. At this time, my 'problem' is solved, but I will use a new issue in future. I appreciate the guidance I got here.
Most helpful comment
We've been planning for other parts already, so we'd want to make sure
VolumeOpwon't be left behind.I think implementation-wise, that's the direction we want to go. Welcome contributions on this topic!
But considering interface, why does this feature need to exist in KFP DSL? It should be readily usable using the standard
component.yamllike all other components, this would reduce unnecessary interface area for KFP on features we have not been able to abstract --- data passing.I believe for component.yaml components, you can also accept a JSON patch or YAML patch as a parameter to allow arbitrary manipulation from users.
I think it'd make sense creating a K8s resource component that will have a similar interface as resource op to support the generic case, while (depending on use-case), building specialized components that manage resources for specific components.
Completely agree, volume is also something we want to support. We are pushing against the current interface to support volumes. I believe @Ark-kun has started some early efforts in https://github.com/kubeflow/pipelines/pull/3371, which allows using volume for data passing, while allowing pipelines/components to not become strongly coupled with volumes.