If this is already possible, I apologise. Please let me know how it can be done.
Say I'm working on a script that transforms `data1.txt` to `data2.txt`:
(data1.txt) --> [script1.py] --> (data2.txt)
Imagine `script1.py` takes a very long time to run (hours, days...). What I would like to do is:
1. Work on `script1.py` until it reaches a state I am happy with
2. Run it to produce the output (`data2.txt`)
3. `dvc run -d data1.txt -d script1.py -o data2.txt python script1.py`
Unfortunately, this will cause the script to run again and make me wait for it to finish. I know there is also the flag `--no-exec` but, from what I understood, it does not calculate the checksums, so `dvc status` will not know whether files changed or not. I would like to have a way to `dvc run` a workflow like the one above, where all the dependencies and outputs become dvc tracked with all the checksums calculated, but without actually executing the command.
If this is still not possible, I think it would be a very useful feature.
What do you think?
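To make the request concrete, here is a minimal sketch of what "calculating the checksums without executing the command" could mean. It is an illustration only, not DVC's implementation: the helper names `file_md5`, `snapshot` and `changed` are hypothetical, and MD5 is used because md5s are what the thread mentions.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths):
    """Record the current checksum of each path without running any command."""
    return {path: file_md5(path) for path in paths}


def changed(paths, recorded):
    """List the paths whose current checksum differs from the recorded one."""
    return [path for path in paths if file_md5(path) != recorded.get(path)]
```

Calling `snapshot(["data1.txt", "script1.py", "data2.txt"])` once the output already exists records the state of all three files; a later `changed(...)` call with the same paths then plays the role that `dvc status` plays in the description above.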
Hi @andrethrill !
You are right, `--no-exec` doesn't calculate the checksums, it just writes the dvcfile, that is all.

> I would like to have a way to dvc run a workflow like the one above, where all the dependencies and outputs will become dvc tracked with all the checksum calculated, but without actually executing the command.

That seems rather dangerous :) But thinking about it, I can definitely see that this hackish feature might be useful. That being said, it should definitely be practiced with care, since it is prone to human errors. We could add something like `dvc run --no-exec --save` (or just `dvc run --save`) for this feature. Naming suggestions are very welcome :) I've added this feature to our TODO list and will take a closer look at it soon.
Thanks,
Ruslan
Hi @efiop thanks for the quick feedback as always.
I understand what you mean by `hackish` :) but do you have a different suggestion for working with scripts under development that take a long time to run and whose output data is rather big, so that I don't want to start caching dozens of intermediate versions of the output data?
Another way I could see this working would be to have a flag like `dvc run --overwrite` (or a different name) that keeps only the last version of the output in the cache and deletes the previous ones. What do you think? Actually, this sounds much less hackish than my first thought.
It almost feels like dvc should have two different stages: `dvc run`, where it runs the command and calculates the checksums, and `dvc commit`, where it actually caches the files...
My suggestion would be to use `dvc run` to run your script and, after you are done debugging (or even sooner), simply call `dvc gc -a` to clean up unused cache. Would that work for you?
@efiop I see... I guess that could work yes. I'm still getting familiar with the different ways of achieving things using dvc :)
What do you think about my second comment above of having a `run` and `commit` stage?
Ah, sorry, I forgot to previously mention our garbage collector command :) Maybe I'm missing something in your scenario, but `dvc gc -a` will remove any currently unused cache in your project, so in your scenario it will remove your previous versions of the output, leaving only the last one, because it is the one actually used in the pipeline.
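The effect described here can be pictured with a toy garbage collector. This is a sketch under loud assumptions, not how `dvc gc` is implemented: it assumes a hypothetical flat cache directory in which each file is named after its checksum, and the function name `collect_garbage` is made up for the illustration.

```python
import os


def collect_garbage(cache_dir, used_checksums):
    """Delete every cached file whose name is not in used_checksums.

    Toy model: the cache is a flat directory of files named by checksum.
    Returns the sorted list of names that were removed.
    """
    removed = []
    for name in sorted(os.listdir(cache_dir)):
        if name not in used_checksums:
            os.remove(os.path.join(cache_dir, name))
            removed.append(name)
    return removed
```

If the cache holds three versions of an output and only the newest is referenced by the pipeline, passing just that checksum in `used_checksums` removes the two older versions and keeps the last one.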
> What do you think about my second comment above of having a run and commit stage?

Thank you for the great suggestion! This is actually a very interesting idea. I can definitely see this being useful in a variety of scenarios where `copy` is the only option for caching your outputs (i.e. the external output scenario for s3, gs, the current hdfs and ssh driver implementations, even the `copy` cache type for local outputs, etc.), where caching has quite an overhead that could be avoided. We could consider adding `dvc run --no-commit` (or `--no-save`) to tell dvc not to save the outputs for the time being, and then add `dvc commit stage.dvc` (or `dvc save stage.dvc`) to tell dvc to actually save the outputs of the specified stage to the cache. Would that be suitable for you?
PS. I would probably prefer `dvc run --no-save` + `dvc save` to the `commit`, because it actually makes more sense in relation to the architecture of DVC and doesn't create confusion with `dvc add` + `dvc commit`.
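The split proposed above, recording checksums first and saving to the cache later, can be sketched as two separate functions. This is a toy model under stated assumptions, not DVC's code: the names `compute_checksum` and `commit_to_cache` are hypothetical, the cache is assumed to be a flat directory keyed by checksum, and refusing to commit a file that changed after its checksum was recorded is a choice made for this sketch only.

```python
import hashlib
import os
import shutil


def compute_checksum(path):
    """Hash the file in the workspace; nothing is copied anywhere."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()


def commit_to_cache(path, checksum, cache_dir):
    """Copy the workspace file into the toy cache under its checksum."""
    if compute_checksum(path) != checksum:
        raise ValueError(f"{path} changed since its checksum was recorded")
    target = os.path.join(cache_dir, checksum)
    if not os.path.exists(target):
        os.makedirs(cache_dir, exist_ok=True)
        # The copy is the expensive step that the proposal postpones.
        shutil.copyfile(path, target)
    return target
```

In this picture, the proposed `--no-commit`/`--no-save` flag corresponds to calling only `compute_checksum`, so the output stays in the workspace and intermediate versions never reach the cache; the proposed `dvc commit`/`dvc save` corresponds to calling `commit_to_cache` once the output is worth keeping.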
@andrethrill I have one more solution for true Git fans :)
1. `git checkout -b script1_dev` (let's assume from the `master` branch)
2. Run the script several times in that branch; `ffffff1`, `ffffff2`, `ffffff3` are the output checksums. Only the last run and the `ffffff3` checksum are correct and need to be merged.
3. Merge into `master`: `git checkout master; dvc checkout; git merge script1_dev`
4. Prefer the `script1_dev` branch if there are conflicts. In many cases, the `theirs` merge strategy does all the work: `git merge -X theirs script1_dev` in step 3.
5. `dvc checkout` to restore the right data files from the cache.

Now you have script1.py integrated into `master`. One more run (hours, days...) is not needed, since all the checksums were verified in the `script1_dev` branch and the `script1.dvc` output points to the correct data files (`ffffff3`). As a result, `dvc repro` will do nothing.
Another alternative is the `--no-save`/`--commit` approach which you guys just discussed. At first glance it adds more complexity, but it can be a good solution.
@andrethrill what are your thoughts? Which approach looks more appealing to you?
Oh, sorry, it looks like I've answered a slightly different question :)
Let me think a bit more...
So, it looks like the following workflow makes sense to try:
1. `dvc run --no-exec ...` - just to add a stub of the stage initially
2. `dvc repro -s stage.dvc` - to iterate on that stage without running the full pipeline
3. `dvc gc` - to collect unused data files and keep the cache as small as possible. Keep in mind that you need to push the data files you actually want to keep to some remote before running this.
@efiop what do you think about this? Am I missing something?
It's def better than allowing people to save DVC directly, and should be similar to `dvc repro --no-save; dvc save`. The only overhead here is calculating md5s every time. @andrethrill, please try it and let us know if it's tolerable.
> @efiop what do you think about this? Am I missing something?

@shcheklein That is precisely what I've been suggesting with `dvc gc -a` above.

> It's def better than allowing people save DVC directly, and should be similar to dvc repro --no-save; dvc save. The only overhead here is calculating md5s every time.

As I've said previously, the main overhead here would be storing one more version of the output, plus it might not be very handy to remember to call `dvc gc` every time you run a command. I definitely think that there is a place for `dvc run/repro --no-save` + `dvc save` in our architecture, since it still makes sure that everything is tracked and simply leaves the `save` part for later, which might be very useful in the scenarios where `copy` is the only option for saving to the cache (i.e. local cache.type==copy, the current hdfs and ssh drivers and maybe some other drivers in the future). Mind that `--no-save` will still calculate the checksums; it just won't actually save the data files to the cache, leaving them in the user's workspace.
Makes sense. I'm just trying to find an option to avoid new commands :) Is it possible to reuse `dvc add` instead of `dvc save`? Basically it would mean that `dvc add` will always be used to move a file to the cache and to create the dvc file if it does not exist yet.
@shcheklein I agree, saving a command is great, but those are the details that we will have to figure out if we have a confirmation from the user that this feature is indeed needed. That being said, I don't think that `dvc add` should handle this case at all. An alternative that makes more sense to me would be to have a `--save` (or smth like that, we can always figure it out later) option for `dvc repro`. But again, those are the details we can discuss when we receive a confirmed feature request.
Thanks everyone for the insightful discussion. It's showing me different ways of using DVC. Below follow my answers:
@efiop

> simply call dvc gc -a to cleanup unused cache.

I think this serves a different purpose because, as I understand it, it cleans all the unused cache. Personally, I like the idea of working on two or more different branches and being able to quickly change between them by just checking out files. If the files that are garbage collected take a long time to generate, that won't be possible anymore.
As a general comment, I would say that I like to keep "useful" files in cache as much as possible. On the other hand, I don't like the idea of keeping trash in cache, and that's why I raised this issue.
At the moment, with the available commands, it seems almost an all-or-nothing situation (I know it's not exactly like that, I'm just trying to make a point).
> We could consider adding dvc run --no-commit(or --no-save) to tell dvc not to save the outputs temporarily and then add dvc commit stage.dvc(or dvc save stage.dvc) to tell dvc to actually save outputs of the specified stage to the cache. Would that be suitable for you?

From a user-friendliness perspective, I think this makes the most sense, especially given the similarity and close relationship dvc has with git.
@dmpetrov:

> @andrethrill I have one more solution for true Git fans :)

That sounds like a reasonable solution as well! (although I like the flexibility of not needing to create a different branch. 😃) But will that clean the unused cached files `ffffff1` and `ffffff2` automatically?
@efiop:

> it might be not very handy to remember to call `dvc gc` every time you run a command

This is a very good point.
I just want to add:
Obviously, I'm just one user, and as such, if I adopt dvc I will use the options that it supplies. :)
At the same time, I may not be the average user but, from my perspective, ever since I first read @dmpetrov's blog post introducing dvc I have seen it as a "git for data files". In that sense, it seems reasonable and intuitive to make comparisons with git, and to have one command to tell dvc that a certain file/pipeline should be tracked and another command to actually cache/track it.
Whether it is called `commit` or is just a flag like `--no-save`, I think you guys are the most knowledgeable to decide.
We've been thinking about it a lot and decided to change `dvc add/run/repro` so they will only save checksums and won't actually save files to cache, plus introduce `dvc commit` that will save files to cache. This will make dvc handle more like git and will make interactions with it more natural. We will introduce those changes starting from v1.0, since they are backward incompatible.
Hi @andrethrill !
We've released 0.28.0 with `dvc commit` + `dvc add/run/repro --no-commit` support. Feel free to upgrade and give it a try! Would love to hear about your experience :slightly_smiling_face:
Thanks,
Ruslan
Hi @efiop!
Did an initial test. And it seems to do exactly what I would like! :)
Is there a way to have the flag `--no-commit` on by default in the `dvc config` or something? I guess that's the way I would like to work by default.
@andrethrill Currently there is no way to set that behaviour as default one :slightly_frowning_face: We plan to adopt that as a default behaviour starting from 1.0, but it is a great idea to provide opt-in config option for now. Created https://github.com/iterative/dvc/issues/1627 to track progress on it. Thanks for the feedback! :slightly_smiling_face: