Dvc: run/repro: Execute a command on another machine

Created on 11 Jan 2019 · 3 Comments · Source: iterative/dvc

Sometimes you have a dedicated machine with the proper runtime environment to execute your experiments. It would be great to have an option to send the input to that machine, run a command there, and retrieve the output files.

Things to look up:

  • Cancelled or failed runs could leave garbage on the remote hosts that needs to be collected
  • Large input/output files could make transfers inefficient
  • Use -d and -o to keep track of which files need to be pushed to and retrieved from the remote host

Maybe, this could work alongside setting up SSHFS or NFS.

We can introduce an option --sshlogin that takes the URI of the node where the computation should run.
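To make the proposal concrete, here is a minimal sketch of what such an option might do under the hood: push the dependencies, run the command remotely, pull the outputs, and clean up. The function name run_remote and its argument layout are hypothetical and are not part of DVC. The sketch assumes plain ssh and rsync are available, and that paths are relative to the project root and contain no whitespace.

```shell
# Hypothetical helper, not a DVC command: push deps, run, pull outs, clean up.
# Usage: run_remote USER@HOST "COMMAND" "DEP1 DEP2 ..." "OUT1 OUT2 ..."
# Assumes paths are relative to the current directory and contain no whitespace.
run_remote() {
    local login="$1" cmd="$2" deps="$3" outs="$4"
    local workdir path status=0

    # Scratch directory on the remote host, so separate runs do not collide.
    workdir=$(ssh "$login" 'mktemp -d') || return 1

    # Push each dependency, preserving its relative path (--relative).
    for path in $deps; do
        rsync -az --relative "$path" "$login:$workdir/" || status=$?
    done

    # Run the command inside the scratch directory.
    if [ "$status" -eq 0 ]; then
        ssh "$login" "cd '$workdir' && $cmd" || status=$?
    fi

    # Pull each output back; "/./" marks where the relative path starts.
    if [ "$status" -eq 0 ]; then
        for path in $outs; do
            rsync -az --relative "$login:$workdir/./$path" ./ || status=$?
        done
    fi

    # Remove the scratch directory even after a failed run.
    ssh "$login" "rm -rf '$workdir'"
    return "$status"
}

# Example (hypothetical host and files):
#   run_remote user@gpu-box "python train.py" "train.py data.csv" "model.pkl"
```

The cleanup step covers failed runs, but a run that is interrupted locally (Ctrl-C, lost connection) would still leave the scratch directory behind, so a real implementation would need something more robust. rsync only transfers changed data, which helps with large files but does not remove the cost of the first transfer.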

enhancement feature request p3-nice-to-have

All 3 comments

Can this also be related to data processing on a remote Spark cluster?

We're using EMR clusters to run our Spark jobs. One way to do this is with aws emr add-steps. In this case, the job's code is placed on S3, and by providing a cluster ID we can execute the work.

Let me try to simplify. Say that I have a local script execute_process_on_emr_cluster.sh. The result of running this script is an EMR step running on a predefined EMR cluster. Furthermore, the result of the computation will be persisted to s3://mybucket/great_result.parquet. I would like to be able to do something like:

dvc run -d execute_process_on_emr_cluster.sh [-d some other dependencies maybe] -o s3://mybucket/great_result.parquet execute_process_on_emr_cluster.sh

The problem is that execute_process_on_emr_cluster.sh merely returns the ID of the step submitted to EMR, so dvc will complain that no expected output was found. I guess that some "asynchronous" approach is needed here.

@drorata a workaround can be to actively poll the status in execute_process_on_emr_cluster.sh and exit when the step is done. I'm not sure I understand what an asynchronous mode could look like. Have you used any notification mechanisms for this before?
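A minimal sketch of that polling workaround, assuming the AWS CLI is installed and configured. The function name wait_for_emr_step, the CLUSTER_ID variable, and the step.json file are hypothetical; only aws emr add-steps and aws emr describe-step are real CLI commands.

```shell
# Hypothetical helper: block until an EMR step reaches a terminal state.
# Usage: wait_for_emr_step CLUSTER_ID STEP_ID [POLL_SECONDS]
# Returns 0 on COMPLETED, 1 on any other terminal state, 2 if the query fails.
wait_for_emr_step() {
    local cluster_id="$1" step_id="$2" interval="${3:-30}" state

    while true; do
        # Ask EMR for the current state of the step as plain text.
        state=$(aws emr describe-step \
            --cluster-id "$cluster_id" --step-id "$step_id" \
            --query 'Step.Status.State' --output text) || return 2

        case "$state" in
            COMPLETED)
                return 0 ;;
            CANCELLED|FAILED|INTERRUPTED)
                echo "step $step_id ended as $state" >&2
                return 1 ;;
            *)
                # PENDING, RUNNING or CANCEL_PENDING: keep waiting.
                sleep "$interval" ;;
        esac
    done
}

# Possible use at the end of execute_process_on_emr_cluster.sh
# (CLUSTER_ID and step.json are placeholders):
#   step_id=$(aws emr add-steps --cluster-id "$CLUSTER_ID" \
#       --steps file://./step.json --query 'StepIds[0]' --output text)
#   wait_for_emr_step "$CLUSTER_ID" "$step_id"
```

With this in place the script exits only after the step has finished, so the output on S3 exists by the time dvc checks for it, and a failed step makes the stage fail instead of silently succeeding.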

The idea suggested by @shcheklein is indeed a workaround. The first issue that comes to mind is that the local machine (where dvc is running) would have to stay awake during the whole processing, which can take a very long time.

I don't have a clear picture of what an asynchronous flow should look like, but it is probably worth discussing. After all, dvc is designed around handling huge data sets, but that is rendered useless if there's no way to design stages that process the data in a distributed manner, for example on a Spark or Dask cluster.

I use a notification mechanism in one of my projects, where I trigger a process on EMR and this process emits a message once it has completed. There's a counterpart waiting for that message, and once it is received the second phase kicks in.
