Dvc: [Feature Request?] `dvc run ...` without actually running?

Created on 19 Jul 2018 · 17 Comments · Source: iterative/dvc

If this is already possible, I apologise. Please let me know how it can be done.

Say I'm working on a script that transforms data1.txt to data2.txt:

(data1.txt) --> [script1.py] --> (data2.txt)

Imagine script1.py takes a very long time to run (hours, days...). What I would like to do is:

  • develop and test script1.py until it reaches a state I am happy with
  • let it run completely (fully generate data2.txt)
  • confirm everything works as expected
  • finally, add it to dvc: dvc run -d data1.txt -d script1.py -o data2.txt python script1.py

Unfortunately, this will cause the script to run again and make me wait for it to finish. I know there is also the --no-exec flag but, from what I understood, it does not calculate the checksums, so dvc status will not know whether files have changed. I would like to have a way to dvc run a workflow like the one above, where all the dependencies and outputs will become dvc tracked with all the checksums calculated, but without actually executing the command.
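
To make the request concrete, here is a minimal Python sketch of "record checksums for files that already exist, without running the command". It is an illustration only, not how DVC is implemented: the names file_md5 and record_stage and the dictionary layout are hypothetical.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_stage(cmd, deps, outs):
    """Build a stage record from files already on disk.

    Hypothetical helper: `cmd` is stored as text and is never executed.
    """
    return {
        "cmd": cmd,
        "deps": [{"path": str(p), "md5": file_md5(p)} for p in deps],
        "outs": [{"path": str(p), "md5": file_md5(p)} for p in outs],
    }
```

Calling record_stage("python script1.py", ["data1.txt", "script1.py"], ["data2.txt"]) after a manual run would capture the current checksums. Nothing ties those checksums to the command, which is the risk of human error raised in the first reply.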

If this is still not possible, I think it would be a very useful feature.

What do you think?

enhancement

All 17 comments

Hi @andrethrill !

You are right, --no-exec doesn't calculate the checksums; it just writes the dvcfile, that is all.

I would like to have a way to dvc run a workflow like the one above, where all the dependencies and outputs will become dvc tracked with all the checksums calculated, but without actually executing the command.

That seems rather dangerous :) But thinking about it, I can definitely see that this hackish feature might be useful. That being said, it should definitely be used with care, since it is prone to human error. We could add something like dvc run --no-exec --save (or just dvc run --save) for this feature. Naming suggestions are very welcome :) I've added this feature to our TODO list and will take a closer look at it soon.

Thanks,
Ruslan

Hi @efiop thanks for the quick feedback as always.

I understand what you mean by hackish :) but do you have a different suggestion for working with scripts under development that take a long time to run and whose output data is rather big, so that I don't want to start caching dozens of intermediate versions of the output data?

Another way I could see this working would be to have a flag like dvc run --overwrite (or a different name) where it would keep only the last version of the output in cache and delete the previous ones. What do you think? Actually, this sounds much less hackish than my first thought.

It almost feels like dvc should have two different stages: dvc run, where it runs the command and calculates the checksums, and dvc commit, where it actually caches the files....

My suggestion would be to use dvc run to run your script and, after you are done debugging (or even sooner), simply call dvc gc -a to clean up the unused cache. Would that work for you?

@efiop I see... I guess that could work yes. I'm still getting familiar with the different ways of achieving things using dvc :)

What do you think about my second comment above of having a run and commit stage?

Ah, sorry, I forgot to previously mention our garbage collector command :) Maybe I'm missing something in your scenario, but dvc gc -a will remove any currently unused cache in your project, so in your scenario it will remove your previous versions of the output leaving only the last one existing because it is actually the one used in the pipeline.
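
The cleanup described above can be pictured with a toy Python sketch. It assumes a flat cache directory whose file names are the checksums; that layout and the collect_garbage name are assumptions made for illustration, not DVC's real cache format or API.

```python
import os


def collect_garbage(cache_dir, used_checksums):
    """Delete cache entries whose file name (a checksum) is not in use.

    Returns the sorted list of checksums that were removed.
    """
    removed = []
    for name in sorted(os.listdir(cache_dir)):
        if name not in used_checksums:
            os.remove(os.path.join(cache_dir, name))
            removed.append(name)
    return removed
```

If three versions of data2.txt were cached while iterating and only the last one is still referenced by the pipeline, passing a set containing just that checksum removes the other two.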

What do you think about my second comment above of having a run and commit stage?

Thank you for the great suggestion! This is actually a very interesting idea. I can definitely see this being useful in a variety of scenarios where copy is the only option for caching your outputs (i.e. the external output scenario for s3 and gs, the current hdfs and ssh driver implementations, even the copy cache type for local outputs, etc.) and so caching has quite an overhead that could be avoided. We could consider adding dvc run --no-commit (or --no-save) to tell dvc not to save the outputs yet, and then add dvc commit stage.dvc (or dvc save stage.dvc) to tell dvc to actually save the outputs of the specified stage to the cache. Would that be suitable for you?

PS. I would probably prefer dvc run --no-save + dvc save to the commit naming, because it makes more sense in relation to the architecture of DVC and doesn't create confusion with dvc add + dvc commit.

@andrethrill I have one more solution for true Git fans :)

  1. Fork a new branch: git checkout -b script1_dev (let's assume from the master branch)
  2. Develop the script and generate, let's say, 3 outputs. Let's say ffffff1, ffffff2, ffffff3 are the output checksums. Only the last run and the ffffff3 checksum are correct and need to be merged.
  3. Merge the last changes into your master: git checkout master; dvc checkout; git merge script1_dev
  4. Resolve the conflicts: code conflicts as usual; for dvc files, keep the checksums from the script1_dev branch if there are conflicts. In many cases, the theirs merge strategy does all the work: git merge -X theirs script1_dev in step 3.
  5. Run dvc checkout to restore the right data files from the cache.

Now you have script1.py integrated into master. One more run (hours, days...) is not needed, since all the checksums were verified in the script1_dev branch and the script1.dvc output points to the correct data file ffffff3. As a result, dvc repro will do nothing.

Another alternative is the --no-save / commit approach which you guys just discussed. At first glance it adds more complexity, but it can be a good solution.

@andrethrill what are your thoughts? Which approach looks more appealing to you?

Oh, sorry. It looks like I answered a slightly different question :)
Let me think a bit more...

So, it looks like the following workflow makes sense to try:

  1. dvc run --no-exec ... - just to add a stub of the stage initially
  2. dvc repro -s stage.dvc - to iterate on that stage without running the full pipeline
  3. dvc gc - to collect unused data files and keep the cache as small as possible. Keep in mind that you need to push data files you actually want to save to some remote before running this.

@efiop what do you think about this? Am I missing something?

It's def better than allowing people save DVC directly, and should be similar to dvc repro --no-save; dvc save. The only overhead here is calculating md5s every time. @andrethrill, please try it and let us know if it's tolerable.

@efiop what do you think about this? Am I missing something?

@shcheklein That is precisely what I've been suggesting with dvc gc -a above.

It's def better than allowing people save DVC directly, and should be similar to dvc repro --no-save; dvc save. The only overhead here is calculating md5s every time.

As I've said previously, the main overhead here would be storing one more version of the output, plus it might not be very handy to remember to call dvc gc every time you run a command. I definitely think that there is a place for dvc run/repro --no-save + dvc save in our architecture, since it still makes sure that everything is tracked and simply leaves the save part for later, which might be very useful in scenarios where copy is the only option for saving to the cache (i.e. local cache.type==copy, the current hdfs and ssh drivers, and maybe some other drivers in the future). Mind that --no-save will still calculate the checksums; it just won't save the data files to the cache, leaving them in the user's workspace.
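
The split described above (checksums now, cache copy later) can be sketched in Python as two separate steps. This is a toy model under stated assumptions: the function names run_no_save and save, the flat cache layout, and the refusal to save a file that changed after its checksum was recorded are all hypothetical choices, not DVC behaviour.

```python
import hashlib
import os
import shutil


def checksum(path):
    """Return the hex MD5 digest of a file (whole-file read; fine for a sketch)."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()


def run_no_save(out_path):
    """Phase 1: record the output checksum; the file stays in the workspace."""
    return {"path": out_path, "md5": checksum(out_path)}


def save(record, cache_dir):
    """Phase 2: copy the output into the cache, keyed by the recorded checksum."""
    # Assumption of this sketch: refuse to cache a file that no longer matches.
    if checksum(record["path"]) != record["md5"]:
        raise ValueError("output changed since its checksum was recorded")
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, record["md5"])
    if not os.path.exists(target):
        shutil.copyfile(record["path"], target)
    return target
```

The expensive copy happens only in save, so intermediate versions produced while debugging never reach the cache unless they are saved explicitly.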

Makes sense. I'm just trying to find an option that avoids new commands :) Is it possible to reuse dvc add instead of dvc save? Basically it would mean that dvc add would always be used to move a file to the cache and create the dvc file if it does not exist yet.

@shcheklein I agree, saving a command is great, but those are details that we will have to figure out once we have confirmation from the user that this feature is indeed needed. That being said, I don't think that dvc add should handle this case at all. An alternative that makes more sense to me would be to have a --save (or smth like that, we can always figure it out later) option for dvc repro. But again, those are details we can discuss when we receive a confirmed feature request.

Thanks everyone for the insightful discussion. It's showing me different ways of using DVC. Below follow my answers:

@efiop

simply call dvc gc -a to cleanup unused cache.

I think this serves a different purpose because, as I understand it, it cleans all the unused cache. Personally, I like the idea of working on two or more different branches and being able to quickly switch between them by just checking out files. If the files that are garbage collected take a long time to generate, that won't be possible anymore.

As a general comment, I would say that I like to keep "useful" files in cache as much as possible. On the other hand, I don't like the idea of keeping trash in cache, and that's why I raised this issue.

At the moment, with the available commands, it seems almost an all-or-nothing situation (I know it's not exactly like that, I'm just trying to make a point).

We could consider adding dvc run --no-commit (or --no-save) to tell dvc not to save the outputs yet, and then add dvc commit stage.dvc (or dvc save stage.dvc) to tell dvc to actually save the outputs of the specified stage to the cache. Would that be suitable for you?

From a user-friendliness perspective, I think this makes the most sense. Especially, given the similarity and close relationship dvc has with git.

@dmpetrov:

@andrethrill I have one more solution for true Git fans :)

That sounds like a reasonable solution as well! (although I like the flexibility of not needing to create a different branch. 😃 ) But will that clean the unused cached files ffffff1 and ffffff2 automatically?

@efiop:

it might not be very handy to remember to call dvc gc every time you run a command

This is a very good point.


I just want to add:

Obviously, I'm just one user, and as such, if I adopt dvc I will use the options that it supplies. :)

At the same time, I may not be the average user but, from my perspective, ever since I first read @dmpetrov's blog post introducing dvc, I have seen it as a "git for data files". In that sense, it seems reasonable and intuitive to make comparisons with git, and to have one command to tell dvc that a certain file/pipeline should be tracked and another command to actually cache/track it.

Whether it is called commit or is just a flag like --no-save, I think you guys are the most knowledgeable to decide that.

We've been thinking about it a lot and decided to change dvc add/run/repro so they will only save checksums and won't actually save files to cache, plus introduce dvc commit that will save files to cache. This will make dvc handle more like git and will make interactions with it more natural. We will introduce those changes starting from v1.0, since they are backward incompatible.

Hi @andrethrill !

We've released 0.28.0 with dvc commit + dvc add/run/repro --no-commit support. Feel free to upgrade and give it a try! Would love to hear about your experience :slightly_smiling_face:

Thanks,
Ruslan

Hi @efiop!

Did an initial test. And it seems to do exactly what I would like! :)

Is there a way to have the flag --no-commit on by default in the dvc config or something? I guess that's the way I would like to work by default.

@andrethrill Currently there is no way to set that behaviour as the default :slightly_frowning_face: We plan to adopt it as the default behaviour starting from 1.0, but it is a great idea to provide an opt-in config option for now. Created https://github.com/iterative/dvc/issues/1627 to track progress on it. Thanks for the feedback! :slightly_smiling_face:
