Sometimes you have a dedicated machine with the proper runtime environment to execute your experiments. It would be great to have an option to send the inputs, run a command on that machine, and retrieve the output files.
Things to look up: `-d` and `-o` to keep track of which files need to be pushed to and retrieved from the remote host. Maybe this could work alongside setting up SSHFS or NFS.
We can introduce an option `--sshlogin` that receives the URI of the node where the computation needs to run.
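A purely hypothetical sketch of how such an invocation could look (`--sshlogin` does not exist in dvc today; the host, dependencies, and output below are made up for illustration):

```sh
# Hypothetical syntax: run the stage on a remote machine,
# pushing the declared deps there and pulling the declared outputs back
dvc run --sshlogin user@gpu-box \
  -d train.py -d data/prepared \
  -o model.pkl \
  python train.py
```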
Can this also be related to data processing on a remote Spark cluster?
We're using EMR clusters to run our Spark jobs. One way it is done is using `aws emr add-steps`. In this case, the job's code is placed on S3, and by providing a cluster ID we can execute the work.
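For reference, a minimal sketch of such a submission with the AWS CLI; the cluster ID, step name, and script path below are placeholders:

```sh
# Submit a Spark step to an existing EMR cluster (IDs and paths are placeholders)
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name="process-data",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://mybucket/code/process.py]
```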
Let me try to simplify. Say that I have a local script `execute_process_on_emr_cluster.sh`. The result of running this script is an EMR step running on a predefined EMR cluster. Furthermore, the result of the computation will be persisted to `s3://mybucket/great_result.parquet`. I would like to be able to do something like:

```sh
dvc run -d execute_process_on_emr_cluster.sh [-d some other dependencies maybe] -o s3://mybucket/great_result.parquet execute_process_on_emr_cluster.sh
```
The problem is that `execute_process_on_emr_cluster.sh` merely returns the ID of the step submitted to EMR, so dvc will complain that no expected output was found. I guess that some "asynchronous" approach is needed here.
@drorata a workaround could be to actively poll the status in `execute_process_on_emr_cluster.sh` and exit when it's done. I'm not sure I understand what an asynchronous mode would look like. Have you used some notification mechanism for this before?
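A rough sketch of that polling workaround, assuming the script captures the step ID from `aws emr add-steps` and then polls `aws emr describe-step` until the step finishes (cluster ID, step definition, and sleep interval are placeholders):

```sh
# Inside execute_process_on_emr_cluster.sh: submit the step, then block until it finishes
STEP_ID=$(aws emr add-steps --cluster-id "$CLUSTER_ID" \
  --steps Type=Spark,Name="process-data",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://mybucket/code/process.py] \
  --query 'StepIds[0]' --output text)

while true; do
  STATE=$(aws emr describe-step --cluster-id "$CLUSTER_ID" --step-id "$STEP_ID" \
    --query 'Step.Status.State' --output text)
  case "$STATE" in
    COMPLETED) break ;;                                                      # output is now on S3
    FAILED|CANCELLED) echo "Step $STEP_ID ended in state $STATE" >&2; exit 1 ;;
  esac
  sleep 30   # the local dvc process has to stay alive for the whole duration
done
```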
The idea suggested by @shcheklein is indeed a workaround. The first issue that comes to my mind is that the local machine (where dvc is running) would have to stay awake during the whole processing, and this can be a very lengthy process.
I don't have a clear picture of how an asynchronous flow should look, but it is probably worth discussing. After all, dvc is designed around handling huge data sets, but that is of little use if there is no way to design stages that process the data in a distributed manner, for example on a Spark or Dask cluster.
I use a notification mechanism in one of my projects: I trigger a process on EMR, and this process emits a message once it has completed. A counterpart waits for that message, and once it is received the second phase kicks in.
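For illustration only, a hedged sketch of such a notification flow using SQS (SQS itself is an assumption here, and the queue URL is a placeholder; any messaging service would work the same way): the EMR job announces completion, and a lightweight waiter long-polls the queue before triggering the next phase.

```sh
# Last action of the EMR job: announce completion (queue URL is a placeholder)
aws sqs send-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/emr-done \
  --message-body "great_result.parquet ready"

# Counterpart: long-poll until a message arrives, then kick off the next phase
aws sqs receive-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/emr-done \
  --wait-time-seconds 20
```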