Hi Argo maintainers! Thank you so much for building such an awesome workflow solution for Kubernetes!
I'm a maintainer of Kubeflow Pipelines (KFP).
Kubeflow Pipelines uses argo workflows under the hood as its workflow engine; see https://github.com/kubeflow/pipelines#acknowledgments.
I realized I wrote too much, so TL;DR here:
We want a mechanism that allows an external plugin service to hook deeply into the argo workflow task (or possibly pod) lifecycle:
The external system generates the argo workflow spec, so it works for us if the above features require changes to the workflow spec.
Kubeflow Pipelines' mission is building a platform for reusable end-to-end ML workflows.
What's special about ML workflows is that people care a lot about data: where did the data come from, what hyperparameters were used for a model, how good is the model? These are the questions we care about.
Therefore, KFP has standardized on using https://github.com/google/ml-metadata (MLMD) as its metadata store. Additionally, we not only want to support users writing code in their pipelines to record metadata; we also want to auto-log any metadata the workflow system already knows (like artifact URL, name, ...).
We have already achieved this goal by building metadata writer -- a service that watches argo workflows and pods, then auto-logs that information into MLMD.
Continuing on this path, we further want to allow using data from MLMD for workflow orchestration. For example, tasks can use placeholders that refer to information in MLMD.
We could insert tasks into user workflows that read from MLMD and expose the data to argo, but because our metadata-writer records metadata asynchronously by watching argo workflows, there's no guarantee the data is available in MLMD before the task that requires it runs.
When thinking about building a future KFP with full MLMD integration on top of argo, we'd want a mechanism that allows an external plugin service to hook deeply into the argo workflow task lifecycle:
For clarification, KFP's current architecture looks like this:
KFP needs two types of lifecycle hooks: one before a task starts (e.g. to decide whether it can be served from cache), and one after a task finishes (e.g. to record its outputs into MLMD).
Potentially, to avoid disrupting non-KFP argo workflows, we could configure argo so that only workflows with a certain label are sent to the lifecycle hook.
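For illustration, opting in could look like this minimal sketch; the label name here is hypothetical, not an existing convention:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: kfp-run-
  labels:
    pipelines.kubeflow.org/lifecycle-hooks: "true"   # hypothetical opt-in label
spec:
  entrypoint: main
  templates:
  - name: main
    container:
      image: alpine:3.12
      command: [echo, hello]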
This is a very early discussion of one primitive feature that could enable us to build what KFP wants on top of argo. I am willing to contribute if we can reach consensus on the enhancement proposal.
What do you think about this? Do you have alternatives you'd suggest instead (existing features, or a different mechanism to integrate)?
Looking forward to your response! : )
My first thought is how would you do lifecycle hooks today:
- The workflow emits Kubernetes events, e.g. WorkflowRunning, which another controller could pick up on and perform processing based on this.

I'm guessing you've considered these already - so I'm interested to know how they don't fit the bill, or how you want to do things differently.
Hi @alexec, maybe I wasn't clear enough. What we want is lifecycle hooks for tasks, not workflows.
The above feature request is an early strawman proposal; if there is anything existing today, or any other design you might come up with, that lets us manage the task lifecycle with our requirements, we are okay with that too.
I just built a POC of one type of interface I'd want for caching: allow an init container in a task to decide whether the Pod should be served from cache; if so, the init container can mark the Pod as skipped and emit the previous artifacts loaded from a database.
https://github.com/kubeflow/pipelines/blob/mlmd2/contrib/ir/samples/05_caching.yaml
I used a few hacks to achieve this with argo today:
- Let the init container emit outputs by adding the workflows.argoproj.io/outputs annotation to its own pod, using data saved previously.
- Let the init container exit with 1, so the main container is skipped.
- Use continueOn: { error: true }, so that the error doesn't stop the entire workflow.

This seems to be enough for my caching requirement. What do you think about this hack?
Can we adjust it to become an official interface in argo?
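To make the hack concrete, here is a minimal sketch of it as a standalone workflow. The cache lookup (/cache/outputs.json) is a stand-in for KFP's MLMD query, bitnami/kubectl is just any image that ships kubectl, and the pod's service account needs RBAC permission to annotate pods:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: caching-hack-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: maybe-cached
        template: cached-step
        continueOn:
          error: true   # the init container's exit 1 must not stop the workflow
  - name: cached-step
    initContainers:
    - name: cache-check
      image: bitnami/kubectl:latest
      env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      command: [sh, -c]
      args:
      - |
        # Placeholder cache lookup; in KFP this would query MLMD.
        # On a hit, copy the saved outputs onto this pod's annotation,
        # then exit 1 so the main container never runs.
        if cached=$(cat /cache/outputs.json 2>/dev/null); then
          kubectl annotate pod "$POD_NAME" \
            "workflows.argoproj.io/outputs=$cached" --overwrite
          exit 1
        fi
    container:
      image: alpine:3.12
      command: [echo, "cache miss: running the real step"]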
Regarding a Pod exit handler, it probably works for me to simply generate two argo steps for each user step, so that the exit handler goes into its own argo step.
Pod-level exit handling seems more of an upstream Kubernetes Pod lifecycle problem than an argo one.
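A rough sketch of the two-steps-per-user-step idea; the record-to-mlmd template name is hypothetical, and {{steps.user-step.status}} is how the handler would learn the outcome:

steps:
- - name: user-step
    template: user-task
    continueOn:
      failed: true              # run the handler step even if the user step fails
- - name: user-step-exit-handler
    template: record-to-mlmd    # hypothetical step that logs the result to MLMD
    arguments:
      parameters:
      - name: status
        value: "{{steps.user-step.status}}"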
I think the kind of caching you're talking about could be done using the new memoization feature.
Kubernetes events are also emitted for node completion. If they were also emitted on start, would that help?
I think the kind of caching you're talking about could be done using the new memoization feature.
I am aware of the new memoization feature in argo, but we are opinionated about KFP's caching behavior:
caching data should come from MLMD, and we need some lifecycle handling even when a step is cached. I don't think we will ever adopt argo memoization.
/cc @Ark-kun
What's your opinion about this?
Kubernetes events are also emitted for node completion. If they were also emitted on start, would that help?
Rethinking this, I don't think that would be helpful; the primary reason we ask for a lifecycle hook on the node start event is to handle caching, and a read-only event doesn't work for us.
Let the init container emit outputs by adding the workflows.argoproj.io/outputs annotation to its own pod, using data saved previously.
Let the init container exit with 1, so the main container is skipped.
Actually, thinking carefully about the features I want to implement in KFP, I think the argo behavior in this POC, https://github.com/kubeflow/pipelines/blob/mlmd2/contrib/ir/samples/05_caching.yaml, is enough for us to build on top of. We do not have any feature requests beyond that.
Do you have any concerns with this hack implementing third-party caching behavior in initContainers? Can we rely on it? Will argo stop using this mechanism?
Can we rely on it?
Short answer - no. This reads an annotation, which is an implementation detail we plan to remove to improve our ability to support larger inputs and outputs.
I think we want to support lifecycle hooks in Argo Workflows that can be configured at namespace level and controller level.
I'm going to assume you need to run a container to do the processing. This would be both simple and flexible.
We already kind of have hooks: the init container runs before the main container and the wait container runs afterwards.
What if you could configure preTask and postTask images at namespace or controller level? These would run before the init container and have the inputs and outputs mounted as read/write volumes.
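Purely as a strawman, the configuration might look something like this - none of these keys exist today, and the image names are made up:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  taskHooks: |
    preTask:
      image: kfp/pre-task:latest    # would run before the init container
    postTask:
      image: kfp/post-task:latest   # would run after the wait container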
More broadly, I think you, me, @jessesuen and @Ark-kun should get on a Zoom so we can learn more about KFP roadmap and get alignment.
Short answer - no. This reads an annotation, which is an implementation detail we plan to remove to improve our ability to support larger inputs and outputs.
Understood, makes sense for supporting larger inputs and outputs.
Let me think about what would be best for us if we do need an officially supported interface in argo.
Off the top of my head, lifecycle containers align with what I'm trying to do.
More broadly, I think you, me, @jessesuen and @Ark-kun should get on a Zoom so we can learn more about KFP roadmap and get alignment.
Yes, I agree we should proceed with a Zoom. I'll organize the KFP-side people who should attend.
Hi @alexec!
I just came up with an idea for an alternative feature that could enable what we want to achieve. I want to get some early feedback from you.
Tasks can consume and return artifacts via JSON metadata that includes the artifact's location.
To be specific, a task can consume a previous task's artifacts via a JSON representation of the artifact; e.g. for a minio artifact, it could look like this (in whatever format argo emits today):
{
  "name": "hello-world",
  "bucket": "ml-pipeline",
  "key": "path/in/the/bucket/object-name",
  "secret": { "name": "pipeline-minio-secret" }
}
Similarly, a task can return artifacts by emitting this JSON representation.
This basic primitive will enable us to fully encapsulate a normal argo task with control steps.
A KFP step can then be translated to something like the following:
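Here is a rough sketch of that translation; kfp-driver and kfp-publisher are placeholder templates that would read from and write to MLMD, passing artifacts around as the JSON representation above:

- name: kfp-step
  dag:
    tasks:
    - name: driver               # resolves inputs from MLMD, decides caching
      template: kfp-driver
    - name: user-task            # the user's actual container
      template: user-task
      dependencies: [driver]
      arguments:
        parameters:
        - name: input-artifacts-json
          value: "{{tasks.driver.outputs.parameters.artifacts-json}}"
    - name: publisher            # records the outputs back into MLMD
      template: kfp-publisher
      dependencies: [user-task]
      arguments:
        parameters:
        - name: output-artifacts-json
          value: "{{tasks.user-task.outputs.parameters.artifacts-json}}"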
In this way, we can control the exact runtime semantics we want to expose to KFP users, while utilizing argo for workflow orchestration.
I'm discussing my design with my teammates, once we reach an agreement, I'll reach out to you for further discussion.
I'm not sure I fully understand your idea, but is it similar to #4341?
@alexec Definitely, that sounds like a step in the right direction.
My feature request would be supporting placeholders like {{inputs.artifacts.hello.key}},
and outputting an artifact by key from a container:
outputs:
  artifacts:
  - name: hello
    keyFrom:
      path: /tmp/hello_world_key.txt
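For completeness, the input side would then use the proposed placeholder; note this placeholder is the requested feature, not something argo supports today:

inputs:
  artifacts:
  - name: hello
container:
  image: alpine:3.12
  command: [sh, -c]
  args: ["echo the input artifact is stored at key {{inputs.artifacts.hello.key}}"]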
Let's get a meeting with @Ark-kun. Do you need me to drive this?
@alexec Thanks! I'll get my design reviewed internally first next week. I can drive the meeting with you after that.
I've gotten a round of feedback internally; I'll soon update the feature requests to reflect the latest discussion.