DVC: Support for long-running asynchronous stages

Created on 28 Aug 2019 · 14 comments · Source: iterative/dvc

A conversation came up on Discord about what to do when you have a stage in your pipeline that either takes a massive amount of time or needs to be run asynchronously.

For projects that involve large datasets or require a lot of compute on specialized hardware (e.g. training large neural networks on GPUs/TPUs), it's common to have infrastructure where you submit jobs to a cluster or service that will handle scheduling and provisioning resources for you. Client-side, however, these tasks will exit immediately and won't directly produce any outputs, making them difficult to integrate in a DVC pipeline.

Currently the best workaround for our case is to rework the infrastructure so that dvc repro is run on a remote machine that has the resources you need, making every stage synchronous. This might not be an option for some teams, though. What would be the right solution in this case?

Thanks!!

Labels: discussion, feature request, p2-medium


All 14 comments

To solve this, we would have to come up with a special marker for the dvc files, so that they know that the job got deployed and we are waiting for it to be completed. Maybe something like:

cmd: ./spawn_remote_job_and_detach.sh
detached:
    cmd: ./check_if_remote_job_is_done.sh

./check_if_remote_job_is_done.sh would probably need to report proper return codes for us to differentiate the states. E.g. <0 errored, 0 succeeded and ready, >0 still running. Another way would be to do the same but with ./spawn_remote_job_and_detach.sh itself (but we should definitely use a detached: True flag), so that it reports those return codes. E.g. when spawned it returns >0 so dvc knows that the job got deployed and is in progress; then on status or another repro attempt, it runs it again, sees that >0 means not ready yet, and exits. And so on. Thoughts, guys?
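To make the idea concrete, here is a minimal sketch of what such a pair of scripts could look like. The submission and status-query commands (submit-to-cluster, query-cluster-status) and the .job_id handoff file are hypothetical placeholders, not part of any proposal in this thread:

#!/usr/bin/env bash
# spawn_remote_job_and_detach.sh (sketch)
# Submit the job, remember its ID for the check script, and report "in progress".
job_id=$(submit-to-cluster train.py)   # hypothetical: prints a job ID and detaches
echo "$job_id" > .job_id               # hand the ID over to the check script
exit 1                                 # >0: deployed but not finished yet

#!/usr/bin/env bash
# check_if_remote_job_is_done.sh (sketch)
# Exit 0 when the job succeeded and outputs are ready, >0 while it is still running.
# (Shell exit codes cannot be negative, so a failed job needs its own code, e.g. 2.)
job_id=$(cat .job_id)
status=$(query-cluster-status "$job_id")   # hypothetical: prints RUNNING/DONE/FAILED
case "$status" in
  DONE)    exit 0 ;;
  RUNNING) exit 1 ;;
  *)       echo "job $job_id failed" >&2; exit 2 ;;
esac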

@zo7 What you describe seems like a situation that needs some workflow management tool. Have you tried any of them? I don't have any experience with such tools, but there are several that might be able to do the job.

I don't think that adding functionality that is covered by other tools (reinventing the wheel) will make DVC better.

The OP in this issue describes my hurdles well. Currently, the workaround I'm trying is a script which runs on a local machine, submits the job (in my case to an EMR cluster), waits until the job is finished (pinging it every minute) and finally copies the resulting artifacts from S3 to the local machine (a rough sketch of such a wrapper is shown after the list below). However, this fails when my machine idles at some point and the connection is lost... Furthermore, this is a rather inefficient approach for at least two reasons:

  1. The local machine has to be awake and running during the whole time (which can be long)
  2. The resulting artifact (which can be huge) ends up on the local machine's storage
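For illustration only, such a submit-poll-copy wrapper might look roughly like the sketch below, assuming the cluster ID is available as $EMR_CLUSTER_ID. The step definition, bucket, and output names are made up; only the aws CLI subcommands themselves are real:

#!/usr/bin/env bash
set -euo pipefail

# Submit a step to an existing EMR cluster and capture the step ID.
STEP_ID=$(aws emr add-steps --cluster-id "$EMR_CLUSTER_ID" \
    --steps file://step.json \
    --query 'StepIds[0]' --output text)

# Poll until the step finishes; the local machine has to stay awake for this whole loop.
while true; do
    STATE=$(aws emr describe-step --cluster-id "$EMR_CLUSTER_ID" \
        --step-id "$STEP_ID" --query 'Step.Status.State' --output text)
    [[ "$STATE" == "COMPLETED" ]] && break
    [[ "$STATE" == "FAILED" || "$STATE" == "CANCELLED" ]] && exit 1
    sleep 60
done

# Pull the (possibly huge) artifact down so dvc can track it locally.
aws s3 cp s3://my-bucket/output/model.bin ./model.bin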

Thanks @efiop for pointing me to this thread. I copied my reply from discord for better visibility.

@dashohoxha Why don't you consider dvc a workflow management tool? After all, it maintains DAG(s). Do you have in mind somehow using dvc and airflow together?

Do you have in mind somehow using dvc and airflow together?

Yes, exactly.

Currently, the workaround I'm trying is a script which runs on a local machine, submits the job (in my case to an EMR cluster), waits until the job is finished (pinging it every minute) and finally copies the resulting artifacts from S3 to the local machine. However, this fails when my machine idles at some point and the connection is lost...

I don't have any experience with airflow, but I believe that this is exactly the kind of situation it is supposed to solve: running and monitoring tasks/processes on different hosts, and making sure that they follow the dependency constraints (like "don't run this until that task has finished successfully"). Why don't you give it a try? The time needed to get used to it may pay off. Your simple hacks will never be as good and mature as the general solutions provided by airflow and its community.

By the way, it seems that the DAG of airflow is a DAG of processes/tasks and their order (what runs before what). It does not have the concept of input and output dependencies (like DVC does).

@dashohoxha

I'm not experienced with tools like airflow/luigi/digdag, but how would these tools work alongside DVC? It looks like if you were to use one of these, you would be executing your tasks in that framework instead, so intermediate states wouldn't be tracked or versioned with DVC.

@efiop

Adding a detached script that checks periodically might help, but it would still have the problem @drorata pointed out, where it's running on a local machine for a potentially prohibitively long time. I'd also imagine that spawn_remote_job_and_detach needs some way to pass information to check_if_remote_job_is_done (like a job ID) so it knows what to check for.

how would these tools work alongside DVC

I guess dvc repro on a remote host can be one of the tasks of airflow. Airflow can monitor its execution and when it is done it can start another dvc repro on the local host.

My intuition tells me that mixing dvc and airflow doesn't make sense, but I'm lacking know-how about the latter, so take it with a grain of salt. It sounds to me like trying to use both tools will quickly lead to some confusing cyclic setup. I think the title of this issue should be something like: "Support for remote asynchronous stages yielding outputs on remote locations". This would reflect the challenge that dvc runs on machine A, tries to execute a stage on some remote machine/cluster B, and the results of this stage are stored on some third remote environment C.

Once we support concurrent runs/repros we may just run this synchronously.

@Suor Good point. Related to #755

I don't understand how @Suor's comment is related to the underlying use case behind this issue. The main deal here is about integrating dvc with remote infrastructure, while @Suor's comment is about concurrent processes running on the "local" machine where dvc is.

As far as I understand @zo7's concern (and it is also mine), the situation is that one is running dvc from a "local" machine. Assume that one of the stages in a pipeline has to be run on a large Spark cluster, a grid of GPUs, etc. In this setting, it is very much _not_ straightforward how to integrate such a step into the dvc pipeline.

What I'm currently doing in my projects is wrapping the stage which depends on an external resource in a script which checks every x minutes whether the stage is completed or not.

N.B. In my case I'm running the stage on an EMR cluster. Each time, the cluster ID to which I'm submitting my job changes. I mitigate it by reading an environment variable which holds the cluster's ID. This way, this ID is not part of the .dvc file.
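For example (the variable and script names below are made up, not from the comment), the wrapper can read the cluster ID from the environment, so the .dvc file only ever records the wrapper command itself:

export EMR_CLUSTER_ID=j-XXXXXXXXXXXX                 # set outside of dvc, changes per cluster
dvc run -d prepare.py -o model.bin ./run_on_emr.sh   # run_on_emr.sh reads $EMR_CLUSTER_ID internally

Because the changing ID never appears in the recorded cmd, the stage definition stays stable across clusters.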

The current solution is to make your command launch the job, save the metadata somewhere and return an error. After that, on next runs (dvc repro) it will see (through the metadata that it saved) that it already has the worker running and will continue returning errors until the worker is done. When it is done, it will do the rest of the actions and finish successfully.

Adding to @efiop's comment, here's an example of a script with such behavior:

#!/usr/bin/env bash

# name: example.sh

# The "job" already finished on an earlier attempt: finish successfully.
if [[ -f metadata ]] && grep -qs "done" metadata ; then
  echo "Job is done"
  exit 0
fi

# The job was launched earlier and is still running: fail so the stage is retried later.
if [[ -f metadata ]] && grep -qs "running" metadata ; then
  echo "Command is already running"
  exit 1
fi

# First attempt: record the state, launch a toy background "job" and fail immediately.
echo "running" > metadata
{
  sleep 15
  echo "done" > metadata
} &
exit 1

You can then do dvc run example.sh
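Running the toy script above by hand illustrates the intended behavior: repeated attempts fail until the background job completes.

./example.sh   # launches the background "job" and exits 1
./example.sh   # while it is still running: exits 1 again
./example.sh   # after ~15 seconds: metadata says "done", exits 0

As @efiop describes above, re-running the stage keeps failing until the worker is done, and then it finishes successfully.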
