As the title states. I run dvc repro extract for my extract pipeline stage. This takes two pretty large zip files and extracts them into specified folders. These folders should not be added to the dvc cache, since they can be easily reproduced by extracting the archives, but I declare them as dependencies so that the DAG looks nicer.
My dvc.yaml looks like this. The preprocess stage is only there to show that the uncached folders are meant to be used as dependencies in later stages.
stages:
  extract:
    cmd: tar -xzvf data/thingy10k/10k_tetmesh.tar.gz -C data/thingy10k/ && tar -xzvf data/thingy10k/10k_surface.tar.gz -C data/thingy10k/
    deps:
    - data/thingy10k/10k_surface.tar.gz
    - data/thingy10k/10k_tetmesh.tar.gz
    outs:
    - data/thingy10k/10k_surface:
        cache: false
    - data/thingy10k/10k_tetmesh:
        cache: false
  preprocess:
    cmd: some preprocessing command
    deps:
    - data/thingy10k/10k_tetmesh
    - data/thingy10k/10k_surface
After running dvc repro extract, the hashes of all the files are computed and then the files saved to cache. This is exactly the thing that I was trying to prevent with the cache: false option.
I confirmed that the contents of the output folders were indeed added to the cache by using du -sh .dvc/cache, which went up by exactly the size of the two folders after running the command.
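For anyone who wants to repeat this check from a script rather than with du, here is a minimal sketch. The helper name dir_size_bytes is made up for illustration and is not part of dvc. It sums apparent file sizes, so the number can differ slightly from du -sh, which reports disk usage.

```python
import os


def dir_size_bytes(path):
    """Return the total apparent size in bytes of regular files under path.

    Hypothetical helper for illustration; not part of dvc.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so linked entries are not counted via their targets.
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total
```

Calling dir_size_bytes(".dvc/cache") before and after dvc repro extract and comparing the difference with the size of the two output folders should show the same growth.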
Interestingly, after running dvc gc -a the cache is freed again. Running dvc push (without first running dvc gc -a) to push the cache to my remote storage also reports that everything is up to date, which leads me to believe that dvc recognizes that these directories should in fact not be cached.
I have reproduced this in a local git and dvc repo by first adding the two archives using dvc add and then running the above mentioned extract stage. The archives can be downloaded from this repo https://github.com/Yixin-Hu/TetWild#dataset under Output.
Output of dvc version:
$ dvc version
DVC version: 1.6.0 (pip)
---------------------------------
Platform: Python 3.7.6 on Linux-5.4.0-42-generic-x86_64-with-debian-bullseye-sid
Supports: http, https, ssh, webdav, webdavs
Cache types: hardlink, symlink
Repo: dvc, git
Hi @digitalillusions ! Thank you for reporting this issue! This is a result of our run-cache feature, which started caching even unchanged outputs to be able to restore stages when needed. We definitely had small git-tracked files in mind, but clearly that is not the case here. We could consider somehow explicitly disabling the run-cache for the specified stages. The first idea that comes to my mind is to have some flag for dvc run that would add the flag to the generated stage spec and so dvc would know not to try to save this stage to the run-cache. We already have dvc run --no-run-cache that tells dvc not to use existing run-cache, but it still tries to save the results of this run :slightly_frowning_face:
Thanks @efiop for the response! Ah, that's right, I remember reading about this in the 1.0 update. I also never noticed before, because I set up my pipelines before 1.0 came out and only now made a clean clone of my repo somewhere else and used the repro functionality to reproduce the state.
We could consider somehow explicitly disabling the run-cache for the specified stages. The first idea that comes to my mind is to have some flag for dvc run that would add the flag to the generated stage spec and so dvc would know not to try to save this stage to the run-cache.
I think that would be a good idea. Though to me it would still be somewhat confusing to have to set, e.g., both cache: false and run-cache: false in my dvc.yaml. Or would you specify run-cache: false for the entire pipeline stage? Or only specify run-cache: false, which would imply cache: false?
To me it was also just super unintuitive and frustrating to see 12 GB of data being saved to .dvc/cache when I had obviously set the cache flag to false.
@digitalillusions Yeah, I meant run_cache: False (or something like that) for the whole stage, rather than for every output. That might be useful for other applications as well.
But would you then in addition have to specify cache: false for the individual outputs? Other than that seems like a nice solution to me.
@digitalillusions Correct, we would. The reason is that cache: false is solely for data management purposes, meaning that it won't be affected by dvc checkout/pull/push/etc. So in terms of a dvc run it would look something like dvc run --no-save-run-cache -O my_uncached_out .... Don't like the --no-save-run-cache naming though, will need to figure something out.
The reason is that cache: false is solely for data management purposes
Yeah, cache: has become a bit of a misleading name now :( No ideas yet on how to rename it or how to name new options, though ...
Would it be an option to provide keyword arguments? So instead of having --no-run-cache, you could have --run-cache (ignore | disable) where ignore would be the old --no-run-cache and disable would be to completely disable the run cache for that specific stage.
@digitalillusions Great idea! Yeah, could think about using something like that. :+1:
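A rough sketch of how the keyword-style option could be parsed, using plain argparse rather than dvc's real CLI code. The prog name "run-sketch", the default value "use", and the run_cache_policy helper are assumptions made up for illustration; only ignore and disable come from the proposal above.

```python
import argparse


def build_parser():
    """Build a toy parser with the proposed --run-cache option (not dvc's CLI)."""
    parser = argparse.ArgumentParser(prog="run-sketch")
    parser.add_argument(
        "--run-cache",
        choices=["use", "ignore", "disable"],
        default="use",  # assumed name for the default behaviour
        help="use: restore and save; ignore: old --no-run-cache; "
        "disable: never touch the run-cache for this stage",
    )
    return parser


def run_cache_policy(mode):
    """Map a mode to (restore_from_run_cache, save_to_run_cache)."""
    return {
        "use": (True, True),
        # Old --no-run-cache: do not restore, but still save the results.
        "ignore": (False, True),
        # Proposed new mode: neither restore nor save.
        "disable": (False, False),
    }[mode]
```

With this, build_parser().parse_args(["--run-cache", "disable"]).run_cache yields "disable", and any other value is rejected by argparse through choices.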
I think we should bump the priority on this issue. It seems like no-cache effectively does not exist since 1.0:
#!/bin/bash
rm -rf repo
mkdir repo
pushd repo
git init --quiet
dvc init --quiet
echo hello >> hello
dvc add hello
git add -A
git commit -am "init"
dvc run -n run -d hello -O hello.txt --no-run-cache "cat hello >> hello.txt && echo 'modification' >> hello.txt"
Results in saving hello.txt to cache.
Hello :) I have the same kind of problem:
A preprocessing stage that takes a big folder as input and produces a big folder as output. It would be great if there was an easy and fine-grained way of specifying where the output of a stage ends up. As far as I can see there are the following options for an output of stage:
I understand that the run-cache is useful. So what about the following:
dvc run -o -> dvc stores the full output
dvc run -O5 -> dvc stores hash sums and the last 5 versions of the output temporarily. For run 6, dvc deletes the output of run 1.
dvc run -O -> same as now
Would this be possible?
Thanks
@jimmytudeski Unfortunately there is no LRU functionality in our cache yet, so we would need to figure it out first.
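To make the -O5 idea concrete, here is a small in-memory sketch of keep-last-N retention. It is not dvc code and it is simple insertion-order eviction, not a true LRU; the class name LastNOutputs and its methods are hypothetical.

```python
from collections import OrderedDict


class LastNOutputs:
    """Keep the outputs of the most recent `keep` runs (hypothetical sketch)."""

    def __init__(self, keep):
        if keep < 1:
            raise ValueError("keep must be at least 1")
        self.keep = keep
        self._entries = OrderedDict()

    def save(self, run_id, payload):
        """Store a run's output and return the ids of any evicted runs."""
        # Re-saving an existing run makes it the newest entry.
        self._entries.pop(run_id, None)
        self._entries[run_id] = payload
        evicted = []
        while len(self._entries) > self.keep:
            old_id, _old_payload = self._entries.popitem(last=False)
            evicted.append(old_id)
        return evicted

    def runs(self):
        """Return the retained run ids, oldest first."""
        return list(self._entries)
```

With keep=5, saving runs 1 through 6 evicts run 1 on the sixth save, which matches the behaviour described for -O5.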
A faster, and arguably more proper, fix for this is to pass a run_cache argument to stage.save and, if run_cache is false, not call stage_cache.save https://github.com/iterative/dvc/blob/d3acbfe2b1aae2d05cb9b8732167e27f7c5f740e/dvc/stage/__init__.py#L439 . Stage.run already receives it https://github.com/iterative/dvc/blob/d3acbfe2b1aae2d05cb9b8732167e27f7c5f740e/dvc/stage/__init__.py#L488 , but only passes it to run_stage; it should also pass it to stage.save(). That will make dvc run --no-run-cache and dvc repro --no-run-cache not save the run cache for particular stages. Though granularity still requires run_cache: true/false per-output, as described in my other comment above :(
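The shape of that change could look roughly like the toy model below. This is not the actual dvc source: FakeStageCache and SketchStage are stand-ins invented for illustration, and the real Stage.save and Stage.run have more responsibilities and different signatures.

```python
class FakeStageCache:
    """Stand-in for the run-cache; records which stages were saved."""

    def __init__(self):
        self.saved = []

    def save(self, stage):
        self.saved.append(stage.name)


class SketchStage:
    """Toy stage showing run_cache being forwarded from run() to save()."""

    def __init__(self, name, stage_cache):
        self.name = name
        self.stage_cache = stage_cache
        self.hashes_computed = False

    def save(self, run_cache=True):
        # Stand-in for computing hashes of dependencies and outputs.
        self.hashes_computed = True
        # The proposed fix: only write to the run-cache when allowed.
        if run_cache:
            self.stage_cache.save(self)

    def run(self, run_cache=True):
        # Stand-in for executing the stage command, then saving.
        self.save(run_cache=run_cache)
```

Running SketchStage("extract", cache).run(run_cache=False) still computes hashes but leaves cache.saved empty, which is the behaviour the comment asks for from --no-run-cache.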