Argo: Sensors

Created on 22 Feb 2018 · 16 comments · Source: argoproj/argo

Is this a BUG REPORT or FEATURE REQUEST?: FEATURE REQUEST

Hi,

What would be the best way to achieve the same functionality as Airflow sensors with Argo? (https://airflow.incubator.apache.org/_modules/airflow/operators/sensors.html).

Here is one use case:

I implement an Argo workflow to complete successive transformations of my data (the steps of the workflow could for instance be the preprocessing, training, evaluation steps of an ML workflow). But the training step does not in fact start a container where training runs. Instead, it starts a container that calls an external service to run the training algorithm.

Now, the Argo workflow must somehow wait until the external service completes training.

I tried to find several ways to do this but each solution has a drawback:

  1. The training container that calls the external service/cluster could just stay alive and keep calling the external service until the training algorithm is complete. The disadvantage of this approach is that the container must keep running, wasting resources.

  2. I could use the retry functionality of Argo. A container would exit with a non-zero exit code if the training is not yet complete in the external service, and exit successfully otherwise. The disadvantage of this approach is that the Argo UI/CLI would show a large number of failed pods, even though this behavior is intended.
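A retry-based poller along the lines of option 2 might look like the following sketch. The image, the check command, the `CHECK_URL` variable, and the retry limit are all illustrative placeholders; `retryStrategy` is Argo's field for step retries:

```yaml
# Sketch: poll an external training service by retrying a check container.
# Image, command, and CHECK_URL are hypothetical placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: poll-training-
spec:
  entrypoint: wait-for-training
  templates:
  - name: wait-for-training
    retryStrategy:
      limit: 100               # each unsuccessful check shows up as a failed pod
    container:
      image: curlimages/curl:latest
      command: [sh, -c]
      # Exit 0 only once the external service reports completion;
      # any other state exits non-zero so Argo schedules a retry.
      args: ["curl -sf $CHECK_URL | grep -q SUCCEEDED"]
```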

Is there a better solution to this problem?

Thanks!


All 16 comments

Yes, retry would be a poor choice, since you would not want the failures to pollute the actual workflow status. Also because we don't have exponential backoff for retries (yet).

If your external service has the ability to talk back to the kubernetes api server, then in v2.1.0-alpha1 workflows have the ability to suspend themselves through something called a "suspend template". See the following example:
https://github.com/argoproj/argo/blob/master/examples/suspend-template.yaml
Suspended workflows do not take up any kubernetes resources.

When the external training service completes, it would run `argo resume <workflow-name>` so that the workflow can resume.
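Adapted to the training use case, a suspend-template workflow might look like the sketch below. The step and template names, images, and commands are illustrative; `suspend: {}` is the construct from the linked suspend-template example:

```yaml
# Sketch: suspend between submitting training and evaluating results.
# Names, images, and commands are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-
spec:
  entrypoint: pipeline
  templates:
  - name: pipeline
    steps:
    - - name: submit-training       # calls the external service, then exits
        template: submit-training
    - - name: wait-for-training     # workflow suspends here, consuming no pods
        template: wait-for-training
    - - name: evaluate
        template: evaluate
  - name: wait-for-training
    suspend: {}                     # resumed externally via `argo resume`
  - name: submit-training
    container:
      image: my-org/submit:latest   # hypothetical image
      command: ["/submit"]
  - name: evaluate
    container:
      image: my-org/evaluate:latest # hypothetical image
      command: ["/evaluate"]
```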

Here is the original issue and motivation:
https://github.com/argoproj/argo/issues/702

One disadvantage is that the external service would be responsible for making the call back. This means that:

  • The external service would have to know about the workflow calling it.
  • We would have to somehow monitor that the external service does not fail to call back when it tries.
  • Achieving this looks complicated in several cases relevant to ML, such as: submitting a job to a cluster running Spark, or submitting a job to a cloud service such as GCP Cloud ML Engine training, GCP Dataproc, and their AWS/Azure equivalents.

It would be nice to have a solution that does not require the external service to implement the call back feature.

What about this to implement "sensors":

  • The main workflow could use a "resource" template (https://github.com/argoproj/argo/blob/master/examples/k8s-orchestration.yaml) to create a second Argo workflow (the sensor) that then uses exponential-backoff retries (when available) to query the external service until it has completed its operation. The advantage is that the main workflow would not see the "retry failures" of the second one.

This seems a bit hacky though. An alternative would be to have a "sensor" CRD that implements the behavior that we need. An Argo template could use the "sensor" CRD by using a "resource" template, or a new type of "sensor" template to make this easier.
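The "second workflow" variant could be sketched as a resource template in the main workflow that creates the sensor workflow and waits for it to succeed. Everything below other than the `resource` template structure itself (action, `successCondition`, `failureCondition`, inline `manifest`) is an illustrative placeholder:

```yaml
# Sketch: a step in the main workflow creates a "sensor" workflow via a
# resource template and blocks until that workflow succeeds.
# The inner manifest, image, and CHECK_URL are hypothetical.
- name: create-sensor
  resource:
    action: create
    successCondition: status.phase == Succeeded
    failureCondition: status.phase == Failed
    manifest: |
      apiVersion: argoproj.io/v1alpha1
      kind: Workflow
      metadata:
        generateName: training-sensor-
      spec:
        entrypoint: poll
        templates:
        - name: poll
          retryStrategy:
            limit: 100        # exponential backoff, once available
          container:
            image: curlimages/curl:latest
            command: [sh, -c]
            args: ["curl -sf $CHECK_URL | grep -q SUCCEEDED"]
```

This way the retry failures stay in the sensor workflow's status, and the main workflow only sees the final outcome.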

The CRD solution is a viable solution coupled with resource templates. I have seen this done where Argo creates a TensorFlow job (custom resource) and then waits for the status.state field to become Succeeded. See this example:

https://github.com/kubeflow/example-seldon/blob/master/workflows/training-tf-mnist-workflow.yaml#L65

I think the tradeoff would be that a sensor CRD is a ton of upfront work if all you are looking for is a way to wait for some external thing to complete. Also, the sensor CRD in the end, might be forced to be implemented using polling.

Also, keep in mind that resource templates are implemented as a container. It is essentially a convenience abstraction which invokes `kubectl get <resource> <resource-name> -w` and monitors all updates on that resource until `successCondition` is met. This is slightly better than polling because it allows the kube API server to trigger the change event. If there is nothing going on, then there will be no traffic. There will, however, be an open HTTP connection to k8s while it waits.
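Waiting on a custom resource this way (as in the linked Kubeflow example) might look like the sketch below. The TFJob manifest is heavily abbreviated and the exact status field depends on the TFJob API version, so treat the condition expressions as illustrative:

```yaml
# Sketch: create a TFJob via a resource template and block until its
# status indicates success. Manifest and status fields are illustrative.
- name: train
  resource:
    action: create
    successCondition: status.state == Succeeded
    failureCondition: status.state == Failed
    manifest: |
      apiVersion: kubeflow.org/v1alpha1
      kind: TFJob
      metadata:
        generateName: tf-train-
      spec:
        replicaSpecs:
        - replicas: 1
          tfReplicaType: MASTER
```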

Here would be the options, the way I see it:

  • If your requirement is that waiting for the thing to complete should take zero resources, then resource templates would not work for you. The only option I see which takes no resources, is resuming the workflow via Kube API server.
  • If you can accept a minimal amount of resources, then a polling container may work.
  • If polling is undesirable, some type of "sensor CRD" would be an option, but someone would have to implement this controller, and the way it monitors changes may be just as expensive as the polling container.

Got it.

Looks like an advantage of the third option (sensor CRD) would be economy of scale. If there are many "sensor" resources created, a single controller would be able to orchestrate running the containers for all of them.

I am guessing that an efficient implementation of the sensor CRD would be a bit similar to the implementation of Kubernetes CronJobs.

I am currently trying to evaluate Argo against a couple of competing workflow solutions such as Airflow and Brigade, and I notice that Argo currently doesn't integrate well with external triggers for workflows. I did notice another project (argo-ci) specifically for the CI/CD purpose. I have a specific use case where the workflow definition should persist, but an individual execution of the workflow occurs on a regular (daily) cycle or is triggered by external events.

What is the status on integrating event triggers into Argo and what would these look like?

Just to follow here.
I have been building out a Sensor CRD along with a controller that will handle resource, artifact, and time-based events. The sensor acts as a sort of middle layer that waits on any number of dependencies and, once they are resolved, triggers a response. I designed it as a standalone resource so it can be useful for triggering Argo workflows or any other downstream process; one would simply need to create a gateway to take action on the response. I am hoping to open-source the project on GitHub in the next couple of weeks.

Why not just a lambda framework e.g. kubeless, metacontroller to create an Argo workflow in response to some events?

@jlewi, I think the goal for this bug is a bit different. It's more about having a step within a workflow that waits for something to complete (e.g., a job in another system) rather than having a mechanism to trigger workflows.

For the latter, kubeless and metacontroller look like interesting options to investigate. Thanks for the suggestion!

For the former, it would be nice to have a mechanism in Argo that does not require a container to run continuously until the waiting condition is satisfied.

Misunderstood, thanks. Still, why can't you use kubeless/metacontroller for sensors?

I think this is essentially the approach @jessesuen suggested above except using meta-controller should drastically reduce the cost of implementing the sensor.

Here's an example of a CRD to create a GCP endpoint by calling the Cloud Endpoints service. It uses metacontroller to simplify the implementation of the CRD.

So using metacontroller you can easily implement lambdas that monitor resources/jobs in external systems and turn them into K8s conditions.

It's the metacontroller that runs the control loop. So if you wanted to do adaptive polling, then that's something you'd want the metacontroller to support. Specifically, you might file a FR so that the lambda can report in its status when the metacontroller should next invoke it.

Got it. Looks like we could implement a CRD using meta-controller that would just launch a container periodically to verify whether the job in the other system is completed.

One of the parameters of the CRD would be which container to launch. This way, someone who wants to implement a sensor for a new external system wouldn't have to learn about CRDs / metacontroller / etc., and could just provide an executable in a container.
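Such a CRD instance might look like this purely hypothetical sketch. The `Sensor` kind, the API group, and every field name below are invented for illustration; no such API existed at the time:

```yaml
# Hypothetical sketch of a "sensor" custom resource: the controller would
# periodically launch the given container until it exits 0, then mark the
# sensor complete. Kind, group, and all field names are invented.
apiVersion: sensors.example.com/v1alpha1
kind: Sensor
metadata:
  name: wait-for-external-training
spec:
  schedule: "*/2 * * * *"        # how often to launch the check container
  template:                      # user-supplied check, like a CronJob's pod template
    container:
      image: my-org/check-training:latest
      command: ["/check"]        # exit 0 once the external job has completed
```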

If you want workflows with efficient waiting, a combination of the suspend feature that already exists with an event-based wakeup/resume may be appropriate. A very simple form of this would be to trigger a wakeup after a set period of time but one could also wakeup in response to external events. Of course, one could also trigger actions other than wakeup.

I've always wanted to add an eventing framework to Argo. In addition to responding to time or external events, it could process K8s events, which could make it useful for managing K8s clusters and applications.

The implementation that I have been working on solves the case of connecting processes from cross-functional teams (for us these processes just happen to be Argo workflows). I am looking to solve two use cases, and my current implementation ships with triggers for external events (message queues, S3 notifications, etc.), Kubernetes resources, and time-based schedules (with business-logic plugins). @jlewi, I actually started looking at lambda frameworks to do this job; however, they basically work with one single event, whereas I wanted to be able to manage multiple complex dependencies. So in addition to supporting near real-time event processing like lambda frameworks, my solution also handles resolving dependencies from multiple sources for batch jobs. With the exception of time-based events, all other events perform no polling, and therefore the sensor pods take up very few resources while waiting.

As I've mentioned, I'm currently going through processes to publish the code on Github. It should be available soon.

Ed, adding an eventing framework to Argo would be extremely useful. Do you have a proposal handy? I would be happy to help on this.

@magaldima Looking forward to seeing your code!

Thank you, @vicaire!

We're planning an Argo Community Meeting near the end of this month. Let's see if we can kick off a new project :-) All interested parties are welcome to attend.

Please join the Argo Slack channel for updates on Argo community events.
https://join.slack.com/t/argoproj/shared_invite/enQtMzExODU3MzIyNjYzLTA5MTFjNjI0Nzg3NzNiMDZiNmRiODM4Y2M1NWQxOGYzMzZkNTc1YWVkYTZkNzdlNmYyZjMxNWI3NjY2MDc1MzI

There is now a new repo for event management. Let's continue the discussion there.
https://github.com/argoproj/argo-events/issues/2
