DVC: Support for long-running asynchronous stages

Created on 28 Aug 2019 · 14 comments · Source: iterative/dvc

A conversation came up on Discord about what to do when you have a stage in your pipeline that either takes a massive amount of time or needs to be run asynchronously.

For projects that involve large datasets or require a lot of compute on specialized hardware (e.g. training large neural networks on GPUs/TPUs), it's common to have infrastructure where you submit jobs to a cluster or service that will handle scheduling and provisioning resources for you. Client-side, however, these tasks will exit immediately and won't directly produce any outputs, making them difficult to integrate in a DVC pipeline.

Currently the best workaround for our case is to rework the infrastructure so that dvc repro is run on a remote machine that has the resources you need, making every stage synchronous. This might not be an option for some teams, though. What would be the right solution in this case?

Thanks!!

Labels: discussion, feature request, p2-medium


All 14 comments

To solve this, we would have to come up with a special marker for the dvc files, so that they know that the job got deployed and we are waiting for it to be completed. Maybe something like:

cmd: ./spawn_remote_job_and_detach.sh
detached:
    cmd: ./check_if_remote_job_is_done.sh

./check_if_remote_job_is_done.sh would probably need to report proper return codes for us to differentiate the states. E.g. <0 errored, 0 succeeded and ready, >0 still running. Another way would be to do the same but with ./spawn_remote_job_and_detach.sh itself (but we should definitely use a detached: True flag), so that it reports those return codes. E.g. when spawned it returns >0 so dvc knows that the job got deployed and is in progress; then on status or another repro attempt, it runs it again, sees that >0 means not ready yet, and exits. And so on. Thoughts, guys?
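To make the idea concrete, here is a minimal sketch of what such a pair of scripts could look like. The submission and status-query commands (submit-to-cluster, query-cluster-status) and the .job_id handoff file are hypothetical placeholders, not part of any proposal in this thread:

#!/usr/bin/env bash
# spawn_remote_job_and_detach.sh (sketch)
# Submit the job, remember its ID for the check script, and report "in progress".
job_id=$(submit-to-cluster train.py)   # hypothetical: prints a job ID and detaches
echo "$job_id" > .job_id               # hand the ID over to the check script
exit 1                                 # >0: deployed but not finished yet

#!/usr/bin/env bash
# check_if_remote_job_is_done.sh (sketch)
# Exit 0 when the job succeeded and outputs are ready, >0 while it is still running.
# (Shell exit codes cannot be negative, so a failed job needs its own code, e.g. 2.)
job_id=$(cat .job_id)
status=$(query-cluster-status "$job_id")   # hypothetical: prints RUNNING/DONE/FAILED
case "$status" in
  DONE)    exit 0 ;;
  RUNNING) exit 1 ;;
  *)       echo "job $job_id failed" >&2; exit 2 ;;
esac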

@zo7 What you describe seems like a situation that needs some workflow management tool. Have you tried any of them? I don't have any experience with such tools, but there are several that might be able to do the job.

I don't think that adding functionality that is covered by other tools (reinventing the wheel) will make DVC better.

The OP in this issue describes my hurdles well. Currently, the workaround I'm trying is a script which runs on a local machine, submits the job (in my case to an EMR cluster), waits until the job is finished (pinging it every minute) and finally copies the resulting artifacts from S3 to the local machine (a rough sketch of such a wrapper is shown after the list below). However, this fails when my machine idles at some point and the connection is lost... Furthermore, this is a rather inefficient approach for at least two reasons:

  1. The local machine has to be awake and running during the whole time (which can be long)
  2. The resulting artifact (which can be huge) ends up on the local machine's storage
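For illustration only, such a submit-poll-copy wrapper might look roughly like the sketch below, assuming the cluster ID is available as $EMR_CLUSTER_ID. The step definition, bucket, and output names are made up; only the aws CLI subcommands themselves are real:

#!/usr/bin/env bash
set -euo pipefail

# Submit a step to an existing EMR cluster and capture the step ID.
STEP_ID=$(aws emr add-steps --cluster-id "$EMR_CLUSTER_ID" \
    --steps file://step.json \
    --query 'StepIds[0]' --output text)

# Poll until the step finishes; the local machine has to stay awake for this whole loop.
while true; do
    STATE=$(aws emr describe-step --cluster-id "$EMR_CLUSTER_ID" \
        --step-id "$STEP_ID" --query 'Step.Status.State' --output text)
    [[ "$STATE" == "COMPLETED" ]] && break
    [[ "$STATE" == "FAILED" || "$STATE" == "CANCELLED" ]] && exit 1
    sleep 60
done

# Pull the (possibly huge) artifact down so dvc can track it locally.
aws s3 cp s3://my-bucket/output/model.bin ./model.bin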

Thanks @efiop for pointing me to this thread. I copied my reply from discord for better visibility.

@dashohoxha Why don't you consider dvc a workflow management tool? After all, it maintains DAG(s). Do you have in mind somehow using dvc and airflow together?

Do you have in mind somehow using dvc and airflow together?

Yes, exactly.

Currently, the workaround I'm trying is a script which runs on a local machine, submits the job (in my case to an EMR cluster), waits until the job is finished (pinging it every minute) and finally copies the resulting artifacts from S3 to the local machine. However, this fails when my machine idles at some point and the connection is lost...

I don't have any experience with airflow, but I believe that this is exactly the kind of situation it is supposed to solve: running and monitoring tasks/processes on different hosts, and making sure that they follow the dependency constraints (like "don't run this until that task has finished successfully"). Why don't you give it a try? The time needed to get used to it may pay off. Your simple hacks will never be as good and mature as the general solutions provided by airflow and its community.

By the way, it seems that the DAG of airflow is a DAG of processes/tasks and their order (what runs before what). It does not have the concept of input and output dependencies (like DVC does).

@dashohoxha

I'm not experienced with tools like airflow/luigi/digdag, but how would these tools work alongside DVC? It looks like if you were to use one of these, you would be executing your tasks in that framework instead, so intermediate states wouldn't be tracked or versioned with DVC.

@efiop

Adding a detached script that checks periodically might help, but it would still have the problem @drorata pointed out, where it's running on a local machine for a potentially prohibitively long time. I'd also imagine that spawn_remote_job_and_detach needs some way to pass information to check_if_remote_job_is_done (like a job ID) so it knows what to check for.

how would these tools work alongside DVC

I guess dvc repro on a remote host can be one of the tasks of airflow. Airflow can monitor its execution and when it is done it can start another dvc repro on the local host.

My intuition tells me that mixing dvc and airflow doesn't make sense, but I'm lacking know-how about the latter, so take it with a grain of salt. It sounds to me like trying to use both tools will quickly lead to some confusing cyclic setup. I think the title of this issue should be something like: "Support for remote asynchronous stages yielding outputs on remote locations". This would reflect the challenge that dvc runs on machine A, tries to execute a stage on some remote machine/cluster B, and the results of this stage are stored on some third remote environment C.

Once we support concurrent runs/repros we may just run this synchronously.

@Suor Good point. Related to #755

I don't understand how @Suor's comment is related to the underlying use case behind this issue. The main deal here is about integrating dvc with remote infrastructure, while @Suor's comment is about concurrent processes running on the "local" machine where dvc is.

As far as I understand @zo7's concern (and it is also mine), the situation is that one is running dvc from a "local" machine. Assume that one of the stages in a pipeline has to be run on a large Spark cluster, a grid of GPUs, etc. In this setting, it is very much _not_ straightforward how to integrate such a step into the dvc pipeline.

What I'm currently doing in my projects is wrapping the stage which depends on an external resource in a script which checks every x minutes whether the stage is completed or not.

N.B. In my case I'm running the stage on an EMR cluster. Each time, the cluster ID to which I'm submitting my job changes. I mitigate it by reading an environment variable which holds the cluster's ID. This way, this ID is not part of the .dvc file.
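For example (the variable and script names below are made up, not from the comment), the wrapper can read the cluster ID from the environment, so the .dvc file only ever records the wrapper command itself:

export EMR_CLUSTER_ID=j-XXXXXXXXXXXX                 # set outside of dvc, changes per cluster
dvc run -d prepare.py -o model.bin ./run_on_emr.sh   # run_on_emr.sh reads $EMR_CLUSTER_ID internally

Because the changing ID never appears in the recorded cmd, the stage definition stays stable across clusters.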

The current solution is to make your command launch the job, save the metadata somewhere and return an error. After that, on next runs (dvc repro) it will see (through the metadata that it saved) that it already has the worker running and will continue returning errors until the worker is done. When it is done, it will do the rest of the actions and finish successfully.

Adding to @efiop's comment, here's an example of a script with such behavior:

#!/usr/bin/env bash

# name: example.sh

# The "job" already finished on an earlier attempt: finish successfully.
if [[ -f metadata ]] && grep -qs "done" metadata ; then
  echo "Job is done"
  exit 0
fi

# The job was launched earlier and is still running: fail so the stage is retried later.
if [[ -f metadata ]] && grep -qs "running" metadata ; then
  echo "Command is already running"
  exit 1
fi

# First attempt: record the state, launch a toy background "job" and fail immediately.
echo "running" > metadata
{
  sleep 15
  echo "done" > metadata
} &
exit 1

You can then do dvc run example.sh
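Running the toy script above by hand illustrates the intended behavior: repeated attempts fail until the background job completes.

./example.sh   # launches the background "job" and exits 1
./example.sh   # while it is still running: exits 1 again
./example.sh   # after ~15 seconds: metadata says "done", exits 0

As @efiop describes above, re-running the stage keeps failing until the worker is done, and then it finishes successfully.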
