Here is how I currently integrate DVC into my workflow. Consider this sample repo:
- [repo root]
  - src
    - script1.py
    - script2.py
    - script3.py
  - data
    - file1.csv
    - file1.csv.dvc
    - file2.csv
    - file2.csv.dvc
    - file2_1.csv
    - file2_1.csv.dvc
    - file2_2.csv
    - file2_2.csv.dvc
Where:
- file2.csv is generated by script1.py and depends on file1.csv.
- file2_1.csv is generated by script2.py and depends on file2.csv.
- file2_2.csv is generated by script3.py and depends on file2.csv.
Imagine that I change script1.py. That changes file2.csv, which is a dependency of file2_1.csv and file2_2.csv.
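To make the cascade concrete, here is a minimal Python sketch of the dependency chain described above. The `STAGES` dictionary and the `outdated_outputs` helper are hypothetical names made up for illustration; this is not how DVC stores or resolves stages, only a model of which outputs become stale after a change.

```python
from collections import deque

# Hypothetical model of the sample repo: output -> (script, input files).
STAGES = {
    "data/file2.csv": ("src/script1.py", ["data/file1.csv"]),
    "data/file2_1.csv": ("src/script2.py", ["data/file2.csv"]),
    "data/file2_2.csv": ("src/script3.py", ["data/file2.csv"]),
}


def outdated_outputs(changed, stages):
    """Return every output that transitively depends on `changed`."""
    # Reverse the edges: dependency -> outputs that consume it.
    consumers = {}
    for output, (script, inputs) in stages.items():
        for dep in [script, *inputs]:
            consumers.setdefault(dep, []).append(output)

    # Breadth-first walk downstream from the changed path.
    seen, order, queue = set(), [], deque([changed])
    while queue:
        node = queue.popleft()
        for output in consumers.get(node, []):
            if output not in seen:
                seen.add(output)
                order.append(output)
                queue.append(output)
    return order


print(outdated_outputs("src/script1.py", STAGES))
```

Changing script1.py marks all three generated files as stale, while changing script2.py would only affect file2_1.csv.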
I believe that, at the moment, the only way to update all files made outdated by such a change is to run dvc status, check everything that changed (imagine I have dozens of files in a real project), and individually run dvc repro on each outdated file.
I know that if there were a file3.csv that depended on both file2_1.csv and file2_2.csv, I could just run a single dvc repro file3.csv.dvc. Still, that is not always the case, and it requires the user to be actively aware of these cases.
Is there currently an easy way to repro all the outdated .dvc files in a repository tree?
If not, I think this would be a very nice addition. I would still first show the user what will be executed, so that he/she is aware of it and can confirm.
Hi @andrethrill !
As a workaround, you could tie everything together with a dummy Dvcfile placed in the project root (as an example) that has all your file*.csv as dependencies and has neither a command nor outputs. To generate that file, run:
$ dvc run -d data/file2.csv -d data/file2_1.csv ... -f Dvcfile
and then you will be able to just run
$ dvc repro
to make dvc handle everything for you.
This will essentially tie all the hanging ends (outputs) of your pipeline into a single final stage that is there just for convenience, so you don't even have to wonder which pipeline branch you need to call repro on when you come back to the project after a while.
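If keeping that dummy stage up to date by hand becomes tedious, the command could be regenerated from the contents of the data directory. The following is only a sketch: `dummy_stage_command` is a hypothetical helper, it assumes every CSV directly under the given directory should be listed as a dependency, and it only builds the command string (using the same `-d` and `-f` flags as above) without running anything.

```python
import shlex
from pathlib import Path


def dummy_stage_command(data_dir, stage_file="Dvcfile"):
    """Build a `dvc run` command listing every CSV in data_dir as a dependency."""
    # Assumption: all dependencies are *.csv files directly under data_dir.
    deps = sorted(p.as_posix() for p in Path(data_dir).glob("*.csv"))
    args = ["dvc", "run"]
    for dep in deps:
        args += ["-d", dep]
    args += ["-f", stage_file]
    # shlex.join quotes paths that contain spaces or shell metacharacters.
    return shlex.join(args)
```

Calling `print(dummy_stage_command("data"))` from the project root prints a command you can review before running it yourself.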
That being said, I think we could definitely introduce an option for dvc repro, something like --pipeline, that finds the end stages of the pipeline that the given stage belongs to and traverses it from there (dvc pipeline show --ascii stage.dvc already outputs the whole pipeline for the specified stage, not only its dependencies). Would that be suitable for you?
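To illustrate the idea of "finding the end stages of the pipeline a stage belongs to", here is a small Python sketch. It is not DVC's implementation; the `end_stages` function and the `depends_on` mapping (stage file -> stage files it depends on) are assumptions made up for this example.

```python
def end_stages(target, depends_on):
    """Return the stages in target's pipeline that no other stage depends on."""
    # Treat dependency edges as undirected to collect the whole pipeline.
    neighbours = {stage: set(deps) for stage, deps in depends_on.items()}
    for stage, deps in depends_on.items():
        for dep in deps:
            neighbours.setdefault(dep, set()).add(stage)

    # Depth-first walk to gather every stage connected to the target.
    pipeline, stack = set(), [target]
    while stack:
        stage = stack.pop()
        if stage not in pipeline:
            pipeline.add(stage)
            stack.extend(neighbours.get(stage, ()))

    # An end stage is one that never appears as somebody's dependency.
    has_dependents = {dep for stage in pipeline for dep in depends_on.get(stage, ())}
    return sorted(pipeline - has_dependents)


# Hypothetical stage graph mirroring the sample repo, plus an unrelated stage.
DEPENDS_ON = {
    "file1.csv.dvc": [],
    "file2.csv.dvc": ["file1.csv.dvc"],
    "file2_1.csv.dvc": ["file2.csv.dvc"],
    "file2_2.csv.dvc": ["file2.csv.dvc"],
    "other.csv.dvc": [],
}

print(end_stages("file2.csv.dvc", DEPENDS_ON))
```

Starting from file2.csv.dvc, the sketch reports file2_1.csv.dvc and file2_2.csv.dvc as the end stages, and the unrelated stage is left out because it is not connected to that pipeline.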
Thanks,
Ruslan
Hi @efiop! :)
Thank you very much for the quick reply as usual.
If I understand your proposed workaround correctly, it would require the user to actively remember to update that dummy dvc file every time a .dvc file is created or removed, right? It seems a bit cumbersome from the user's perspective.
The solution you propose at the end seems much more appealing (for me at least), especially since you already seem to have the logic implemented to walk down the repository tree looking for .dvc files and representing the entire pipeline.
At first thought, it seems more logical to me to implement such a feature as an option to the dvc repro command. (Although I’m not sure how that will change in version 1.0, when dvc commands look more “git like”.)
Thanks,
André
> If I understand your proposed workaround correctly, it would require the user to actively remember to update that dummy dvc file every time a .dvc file is created or removed, right? It seems a bit cumbersome from the user's perspective.
Agreed, that workaround is not ideal for this scenario.
> The solution you propose at the end seems much more appealing (for me at least), especially since you already seem to have the logic implemented to walk down the repository tree looking for .dvc files and representing the entire pipeline.
> At first thought, it seems more logical to me to implement such a feature as an option to the dvc repro command. (Although I’m not sure how that will change in version 1.0, when dvc commands look more “git like”.)
Yes, I meant dvc repro -p|--pipeline. I'll send a patch for it soon; it should be released with 0.19 by the end of the week. Regarding v1.0, I'm sure --pipeline will fit nicely there as well 🙂
Thank you for the feedback!
-Ruslan
Hi @efiop,
I was checking the new -p feature. Thanks for that! I have a question:
Is there a way to reproduce all .dvc files in a repository tree (instead of just reproducing the pipeline that a specific .dvc file belongs to)?
Hi @andrethrill !
No, there is no special option for that right now. Currently we rely on the user tying everything together. We should definitely introduce an option for that. I created https://github.com/iterative/dvc/issues/1165 to track progress on it. Thank you for a great suggestion, as always!
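Until such an option exists, one possible stopgap is a small script that walks the repository tree and calls dvc repro on each .dvc file it finds. The sketch below is an assumption-laden illustration, not an official recipe: `plan_repro` and `run_plan` are hypothetical helpers, and the script simply shows the planned commands first and runs them only when explicitly confirmed.

```python
import subprocess
from pathlib import Path


def plan_repro(root="."):
    """Return one `dvc repro <stage file>` command per .dvc file under root."""
    stage_files = sorted(
        path
        for path in Path(root).rglob("*.dvc")
        # Skip the .dvc directory itself and anything stored inside it.
        if path.is_file() and ".dvc" not in path.parent.parts
    )
    return [["dvc", "repro", path.as_posix()] for path in stage_files]


def run_plan(plan, confirmed):
    """Print the planned commands, then execute them only if confirmed."""
    for cmd in plan:
        print(" ".join(cmd))
    if not confirmed:
        return 0
    for cmd in plan:
        # check=True stops at the first failing command.
        subprocess.run(cmd, check=True)
    return len(plan)
```

For example, `run_plan(plan_repro("."), confirmed=False)` only lists what would be executed; passing `confirmed=True` after reviewing the list runs the commands one by one. Stages shared by several branches may be checked more than once with this approach.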
-Ruslan