Sometimes you have a dedicated machine with the proper runtime environment to execute your experiments. It would be great to have an option to send the inputs, run a command on that machine, and retrieve the output files.
Things to look up: `-d` and `-o` to keep track of which files need to be pushed to and retrieved from the remote host. Maybe this could work alongside setting up SSHFS or NFS.
We can introduce an option `--sshlogin` that receives the URI of the node where the computation needs to run.
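A purely hypothetical sketch of how such an invocation could look (`--sshlogin` does not exist in dvc today; the host, dependencies, and output below are made up for illustration):

```sh
# Hypothetical syntax: run the stage on a remote machine,
# pushing the declared deps there and pulling the declared outputs back
dvc run --sshlogin user@gpu-box \
  -d train.py -d data/prepared \
  -o model.pkl \
  python train.py
```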
Can this also be related to data processing on a remote Spark cluster?
We're using EMR clusters to run our Spark jobs. One way it is done is using `aws emr add-steps`. In this case, the job's code is placed on S3, and by providing a cluster ID we can execute the work.
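For reference, a minimal sketch of such a submission with the AWS CLI; the cluster ID, step name, and script path below are placeholders:

```sh
# Submit a Spark step to an existing EMR cluster (IDs and paths are placeholders)
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name="process-data",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://mybucket/code/process.py]
```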
Let me try to simplify. Say that I have a local script `execute_process_on_emr_cluster.sh`. The result of running this script is an EMR step running on a predefined EMR cluster. Furthermore, the result of the computation will be persisted to `s3://mybucket/great_result.parquet`. I would like to be able to do something like:

```sh
dvc run -d execute_process_on_emr_cluster.sh [-d some other dependencies maybe] -o s3://mybucket/great_result.parquet execute_process_on_emr_cluster.sh
```
The problem is that `execute_process_on_emr_cluster.sh` merely returns the ID of the step submitted to EMR, so dvc will complain that no expected output was found. I guess that some "asynchronous" approach is needed here.
@drorata a workaround could be to actively poll the status in `execute_process_on_emr_cluster.sh` and exit when it's done. I'm not sure I understand what an asynchronous mode would look like. Have you used some notification mechanism for this before?
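A rough sketch of that polling workaround, assuming the script captures the step ID from `aws emr add-steps` and then polls `aws emr describe-step` until the step finishes (cluster ID, step definition, and sleep interval are placeholders):

```sh
# Inside execute_process_on_emr_cluster.sh: submit the step, then block until it finishes
STEP_ID=$(aws emr add-steps --cluster-id "$CLUSTER_ID" \
  --steps Type=Spark,Name="process-data",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://mybucket/code/process.py] \
  --query 'StepIds[0]' --output text)

while true; do
  STATE=$(aws emr describe-step --cluster-id "$CLUSTER_ID" --step-id "$STEP_ID" \
    --query 'Step.Status.State' --output text)
  case "$STATE" in
    COMPLETED) break ;;                                                      # output is now on S3
    FAILED|CANCELLED) echo "Step $STEP_ID ended in state $STATE" >&2; exit 1 ;;
  esac
  sleep 30   # the local dvc process has to stay alive for the whole duration
done
```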
The idea suggested by @shcheklein is indeed a workaround. The first issue that comes to my mind is that the local machine (where dvc is running) would have to stay awake during the whole processing, and this can be a very lengthy process.
I don't have a clear picture of how an asynchronous flow should look, but it is probably worth discussing. After all, dvc is designed around handling huge data sets, but that is of little use if there is no way to design stages that process the data in a distributed manner, for example on a Spark or Dask cluster.
I use a notification mechanism in one of my projects: I trigger a process on EMR, and this process emits a message once it has completed. A counterpart waits for that message, and once it is received the second phase kicks in.
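For illustration only, a hedged sketch of such a notification flow using SQS (SQS itself is an assumption here, and the queue URL is a placeholder; any messaging service would work the same way): the EMR job announces completion, and a lightweight waiter long-polls the queue before triggering the next phase.

```sh
# Last action of the EMR job: announce completion (queue URL is a placeholder)
aws sqs send-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/emr-done \
  --message-body "great_result.parquet ready"

# Counterpart: long-poll until a message arrives, then kick off the next phase
aws sqs receive-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/emr-done \
  --wait-time-seconds 20
```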