Hi Argo maintainers! Thank you so much for building such an awesome workflow solution for Kubernetes!
I'm a maintainer of Kubeflow Pipelines (KFP).
Kubeflow Pipelines uses argo workflows under the hood as its workflow engine; see https://github.com/kubeflow/pipelines#acknowledgments.
I realized I wrote too much, so TL;DR here:
We want a mechanism that allows an external plugin service to hook deeply into the argo workflow task (or possibly pod) lifecycle:
The external system generates the argo workflow spec, so it works for us if the above features require changes to the workflow spec.
Kubeflow Pipelines' mission is building a platform for reusable end-to-end ML workflows.
What's special about ML workflows is that people care a lot about data: where did the data come from, what hyperparameters were used for a model, how good is the model? These are the questions we care about.
Therefore, KFP has standardized on using https://github.com/google/ml-metadata (MLMD) as its metadata store. Additionally, we not only want to support users writing code in their pipelines to record metadata; we also want to auto-log any metadata the workflow system already knows (like artifact URL, name, ...).
We have already achieved this goal by building metadata writer -- a service that watches argo workflows and pods, then auto-logs that information into MLMD.
Continuing on this path, we further want to allow using data from MLMD for workflow orchestration. For example, tasks can use placeholders that refer to information in MLMD.
We could insert tasks into user workflows that read from MLMD and expose the data to argo, but because our metadata-writer records metadata asynchronously by watching argo workflows, there's no guarantee the data is available in MLMD before the task that requires it runs.
When thinking about building a future KFP with full MLMD integration on top of argo, we'd want a mechanism that allows an external plugin service to hook deeply into the argo workflow task lifecycle:
For clarification, KFP's current architecture looks like this:
KFP needs two types of lifecycle hooks: one before a task starts (e.g. to decide whether it can be served from cache), and one after a task finishes (e.g. to record its outputs into MLMD).
Potentially, to avoid disrupting non-KFP argo workflows, we could configure argo so that only workflows with a certain label are sent to the lifecycle hook.
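For illustration, opting in could look like this minimal sketch; the label name here is hypothetical, not an existing convention:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: kfp-run-
  labels:
    pipelines.kubeflow.org/lifecycle-hooks: "true"   # hypothetical opt-in label
spec:
  entrypoint: main
  templates:
  - name: main
    container:
      image: alpine:3.12
      command: [echo, hello]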
This is a very early discussion of one primitive feature that could enable us to build what KFP wants on top of argo. I am willing to contribute if we can reach consensus on the enhancement proposal.
What do you think about this? Do you have alternatives you'd suggest instead (existing features, or a different mechanism to integrate)?
Looking forward to your response! : )
My first thought is how would you do lifecycle hooks today:
- The workflow emits Kubernetes events, e.g. WorkflowRunning, which another controller could pick up on and perform processing based on this.

I'm guessing you've considered these already - so I'm interested to know how they don't fit the bill, or how you want to do things differently.
Hi @alexec, maybe I wasn't clear enough. What we want is lifecycle hooks for tasks, not workflows.
The above feature request is an early strawman proposal; if there is anything existing today, or any other design you might come up with, that lets us manage the task lifecycle with our requirements, we are okay with that too.
I just built a POC of one type of interface I'd want for caching: allow an init container in a task to decide whether the Pod should be served from cache; if so, the init container can mark the Pod as skipped and emit the previous artifacts loaded from a database.
https://github.com/kubeflow/pipelines/blob/mlmd2/contrib/ir/samples/05_caching.yaml
I used a few hacks to achieve this with argo today:
- Let the init container emit outputs by adding the workflows.argoproj.io/outputs annotation to its own pod, using data saved previously.
- Let the init container exit with 1, so the main container is skipped.
- Use continueOn: { error: true }, so that the error doesn't stop the entire workflow.

This seems to be enough for my caching requirement. What do you think about this hack?
Can we adjust it to become an official interface in argo?
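To make the hack concrete, here is a minimal sketch of it as a standalone workflow. The cache lookup (/cache/outputs.json) is a stand-in for KFP's MLMD query, bitnami/kubectl is just any image that ships kubectl, and the pod's service account needs RBAC permission to annotate pods:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: caching-hack-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: maybe-cached
        template: cached-step
        continueOn:
          error: true   # the init container's exit 1 must not stop the workflow
  - name: cached-step
    initContainers:
    - name: cache-check
      image: bitnami/kubectl:latest
      env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      command: [sh, -c]
      args:
      - |
        # Placeholder cache lookup; in KFP this would query MLMD.
        # On a hit, copy the saved outputs onto this pod's annotation,
        # then exit 1 so the main container never runs.
        if cached=$(cat /cache/outputs.json 2>/dev/null); then
          kubectl annotate pod "$POD_NAME" \
            "workflows.argoproj.io/outputs=$cached" --overwrite
          exit 1
        fi
    container:
      image: alpine:3.12
      command: [echo, "cache miss: running the real step"]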
Regarding a Pod exit handler, it probably works for me to simply generate two argo steps for each user step, so that the exit handler goes into its own argo step.
Pod-level exit handling seems more of an upstream Kubernetes Pod lifecycle problem than an argo one.
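A rough sketch of the two-steps-per-user-step idea; the record-to-mlmd template name is hypothetical, and {{steps.user-step.status}} is how the handler would learn the outcome:

steps:
- - name: user-step
    template: user-task
    continueOn:
      failed: true              # run the handler step even if the user step fails
- - name: user-step-exit-handler
    template: record-to-mlmd    # hypothetical step that logs the result to MLMD
    arguments:
      parameters:
      - name: status
        value: "{{steps.user-step.status}}"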
I think the kind of caching you're talking about could be done using the new memoization feature.
Kubernetes events are also emitted for node completion. If they were also emitted on start, would that help?
I think the kind of caching you're talking about could be done using the new memoization feature.
I am aware of the new memoization feature in argo, but we are opinionated about KFP's caching behavior:
caching data should come from MLMD, and we need some lifecycle handling even when a step is cached. I don't think we will ever adopt argo memoization.
/cc @Ark-kun
What's your opinion about this?
Kubernetes events are also emitted for node completion. If they were also emitted on start, would that help?
Rethinking this, I don't think that would be helpful; the primary reason we ask for a lifecycle hook on the node start event is to handle caching, and a read-only event doesn't work for us.
Let the init container emit outputs by adding the workflows.argoproj.io/outputs annotation to its own pod, using data saved previously.
Let the init container exit with 1, so the main container is skipped.
Actually, thinking carefully about the features I want to implement in KFP, I think the argo behavior in this POC, https://github.com/kubeflow/pipelines/blob/mlmd2/contrib/ir/samples/05_caching.yaml, is enough for us to build on top of. We do not have any feature requests beyond that.
Do you have any concerns with this hack implementing third-party caching behavior in initContainers? Can we rely on it? Will argo stop using this mechanism?
Can we rely on it?
Short answer - no. This reads an annotation, which is an implementation detail we plan to remove to improve our ability to support larger inputs and outputs.
I think we want to support lifecycle hooks in Argo Workflows that can be configured at namespace level and controller level.
I'm going to assume you need to run a container to do the processing. This would be both simple and flexible.
We already kind of have hooks: the init container runs before the main container and the wait container runs afterwards.
What if you could configure preTask and postTask images at namespace or controller level? These would run before the init container and have the inputs and outputs mounted as read/write volumes.
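Purely as a strawman, the configuration might look something like this - none of these keys exist today, and the image names are made up:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  taskHooks: |
    preTask:
      image: kfp/pre-task:latest    # would run before the init container
    postTask:
      image: kfp/post-task:latest   # would run after the wait container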
More broadly, I think you, me, @jessesuen and @Ark-kun should get on a Zoom so we can learn more about KFP roadmap and get alignment.
Short answer - no. This reads an annotation, which is an implementation detail we plan to remove to improve our ability to support larger inputs and outputs.
Understood, makes sense for supporting larger inputs and outputs.
Let me think about what would be best for us if we do need an officially supported interface in argo.
Off the top of my head, lifecycle containers align with what I'm trying to do.
More broadly, I think you, me, @jessesuen and @Ark-kun should get on a Zoom so we can learn more about KFP roadmap and get alignment.
Yes, I agree we should proceed with a Zoom. I'll organize the KFP-side people who should attend.
Hi @alexec!
I just came up with an idea for an alternative feature that could enable what we want to achieve. I want to get some early feedback from you.
Tasks can consume and return artifacts via JSON metadata that includes the artifact's location.
To be specific, a task can consume a previous task's artifacts via a JSON representation of the artifact; e.g. for a minio artifact, it could look like this (in whatever format argo emits today):
{
  "name": "hello-world",
  "bucket": "ml-pipeline",
  "key": "path/in/the/bucket/object-name",
  "secret": { "name": "pipeline-minio-secret" }
}
Similarly, a task can return artifacts by emitting this JSON representation.
This basic primitive will enable us to fully encapsulate a normal argo task with control steps.
A KFP step can then be translated to something like the following:
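Here is a rough sketch of that translation; kfp-driver and kfp-publisher are placeholder templates that would read from and write to MLMD, passing artifacts around as the JSON representation above:

- name: kfp-step
  dag:
    tasks:
    - name: driver               # resolves inputs from MLMD, decides caching
      template: kfp-driver
    - name: user-task            # the user's actual container
      template: user-task
      dependencies: [driver]
      arguments:
        parameters:
        - name: input-artifacts-json
          value: "{{tasks.driver.outputs.parameters.artifacts-json}}"
    - name: publisher            # records the outputs back into MLMD
      template: kfp-publisher
      dependencies: [user-task]
      arguments:
        parameters:
        - name: output-artifacts-json
          value: "{{tasks.user-task.outputs.parameters.artifacts-json}}"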
In this way, we can control the exact runtime semantics we want to expose to KFP users, while utilizing argo for workflow orchestration.
I'm discussing my design with my teammates, once we reach an agreement, I'll reach out to you for further discussion.
I'm not sure I fully understand your idea, but is it similar to #4341?
@alexec Definitely, that sounds like a step in the right direction.
My feature request would be supporting placeholders like {{inputs.artifacts.hello.key}},
and outputting an artifact by key from a container:
outputs:
  artifacts:
  - name: hello
    keyFrom:
      path: /tmp/hello_world_key.txt
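For completeness, the input side would then use the proposed placeholder; note this placeholder is the requested feature, not something argo supports today:

inputs:
  artifacts:
  - name: hello
container:
  image: alpine:3.12
  command: [sh, -c]
  args: ["echo the input artifact is stored at key {{inputs.artifacts.hello.key}}"]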
Let's get a meeting with @Ark-kun. Do you need me to drive this?
@alexec Thanks! I'll get my design reviewed internally first next week. I can drive the meeting with you after that.
I've gotten a round of feedback internally; I'll soon update the feature requests to reflect the latest discussion.