As the title states. I run dvc repro extract for my extract pipeline stage. This takes two pretty large tar.gz archives and extracts them into specified folders. These folders should not be added to the dvc cache, since they can be easily reproduced by extracting the archives, but I declare them as dependencies so that the DAG looks nicer.
My dvc.yaml looks like this. The preprocess stage is only there to indicate that the uncached folders should be used as dependencies in later stages.
stages:
  extract:
    cmd: tar -xzvf data/thingy10k/10k_tetmesh.tar.gz -C data/thingy10k/ && tar -xzvf data/thingy10k/10k_surface.tar.gz -C data/thingy10k/
    deps:
    - data/thingy10k/10k_surface.tar.gz
    - data/thingy10k/10k_tetmesh.tar.gz
    outs:
    - data/thingy10k/10k_surface:
        cache: false
    - data/thingy10k/10k_tetmesh:
        cache: false
  preprocess:
    cmd: some preprocessing command
    deps:
    - data/thingy10k/10k_tetmesh
    - data/thingy10k/10k_surface
After running dvc repro extract, the hashes of all the files are computed and then the files are saved to the cache. This is exactly what I was trying to prevent with the cache: false option.
I confirmed that the contents of the output folders were indeed added to the cache by using du -sh .dvc/cache, which went up by exactly the size of the two folders after running the command.
Interestingly, after running dvc gc -a the cache is freed again. Also, running dvc push (without first running dvc gc -a) to push the cache to my remote storage says everything is up to date, which leads me to believe that dvc recognizes that these directories should in fact not be cached.
I have reproduced this in a local git and dvc repo by first adding the two archives using dvc add and then running the above-mentioned extract stage. The archives can be downloaded from this repo https://github.com/Yixin-Hu/TetWild#dataset under Output.
Output of dvc version:
$ dvc version
DVC version: 1.6.0 (pip)
---------------------------------
Platform: Python 3.7.6 on Linux-5.4.0-42-generic-x86_64-with-debian-bullseye-sid
Supports: http, https, ssh, webdav, webdavs
Cache types: hardlink, symlink
Repo: dvc, git
Hi @digitalillusions ! Thank you for reporting this issue! This is a result of our run-cache feature, which started caching even unchanged outputs to be able to restore stages when needed. We definitely had small git-tracked files in mind, but clearly that is not the case here. We could consider somehow explicitly disabling the run-cache for the specified stages. The first idea that comes to my mind is to have some flag for dvc run that would add the flag to the generated stage spec and so dvc would know not to try to save this stage to the run-cache. We already have dvc run --no-run-cache that tells dvc not to use existing run-cache, but it still tries to save the results of this run :slightly_frowning_face:
Thanks @efiop for the response! Ah, that's right, I remember reading about this in the 1.0 update. I also never noticed before, because I set up my pipelines before 1.0 came out and only now made a clean clone of my repo somewhere else and used the repro functionality to reproduce the state.
We could consider somehow explicitly disabling the run-cache for the specified stages. The first idea that comes to my mind is to have some flag for dvc run that would add the flag to the generated stage spec and so dvc would know not to try to save this stage to the run-cache.
I think that would be a good idea. Though to me it would still be somewhat confusing having to, e.g., set both cache: false and run-cache: false in my dvc.yaml. Or would you specify run-cache: false for the entire pipeline stage? Or only specify run-cache: false, which would imply cache: false?
To me it was also just super unintuitive and frustrating to see 12 GB of data being saved to .dvc/cache when I obviously set the cache flag to false.
@digitalillusions Yeah, I meant run_cache: False (or something like that) for the whole stage, rather than for every output. That might be useful for other applications as well.
But would you then in addition have to specify cache: false for the individual outputs? Other than that, it seems like a nice solution to me.
@digitalillusions Correct, we would. The reason is that cache: false is solely for data management purposes, meaning that the output won't be affected by dvc checkout/pull/push/etc. So in terms of dvc run it would look something like dvc run --no-save-run-cache -O my_uncached_out ... . I don't like the --no-save-run-cache naming though, will need to figure something out.
The reason is that cache: false is solely for data management purposes
Yeah, cache: became a bit of a misleading name now :( No ideas yet on how to rename it or how to name new options though ...
Would it be an option to provide keyword arguments? So instead of having --no-run-cache, you could have --run-cache (ignore | disable), where ignore would be the old --no-run-cache and disable would completely disable the run cache for that specific stage.
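To make the keyword-argument proposal concrete, here is a minimal sketch using Python's argparse. It is not DVC's actual CLI: the program name, the use default, and the run_cache_policy helper are all assumptions made for illustration. Only the ignore and disable values come from the suggestion above.

```python
import argparse


def build_parser():
    # Hypothetical parser; only "ignore" and "disable" come from the proposal,
    # "use" is an assumed name for the default behaviour.
    parser = argparse.ArgumentParser(prog="dvc-run-sketch")
    parser.add_argument(
        "--run-cache",
        choices=["use", "ignore", "disable"],
        default="use",
        help="use: restore and save; ignore: do not restore, still save; "
             "disable: neither restore nor save",
    )
    return parser


def run_cache_policy(mode):
    """Map a mode to (restore_from_run_cache, save_to_run_cache)."""
    return {
        "use": (True, True),
        "ignore": (False, True),    # behaviour of the existing --no-run-cache
        "disable": (False, False),  # proposed: skip the run-cache entirely
    }[mode]


args = build_parser().parse_args(["--run-cache", "disable"])
policy = run_cache_policy(args.run_cache)
print(policy)
```

Keeping the mapping in one helper means the restore and save decisions cannot drift apart as more modes are added.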
@digitalillusions Great idea! Yeah, could think about using something like that. :+1:
I think we should bump the priority on this issue. It seems like no-cache effectively does not exist since 1.0:
#!/bin/bash
rm -rf repo
mkdir repo
pushd repo
git init --quiet
dvc init --quiet
echo hello >> hello
dvc add hello
git add -A
git commit -am "init"
dvc run -n run -d hello -O hello.txt --no-run-cache "cat hello >> hello.txt && echo 'modification' >> hello.txt"
Results in saving hello.txt to cache.
Hello :) I have the same kind of problem:
A preprocessing stage that takes a big folder as input and produces a big folder as output. It would be great if there was an easy and fine-grained way of specifying where the output of a stage ends up. As far as I can see there are the following options for an output of a stage:
I understand that the run-cache is useful. So what about the following:
dvc run -o -> dvc stores full output
dvc run -O5 -> dvc stores hashsums and the last 5 versions of the output temporarily. For run 6, dvc deletes the output of run 1.
dvc run -O -> same as now
Would this be possible?
Thanks
@jimmytudeski Unfortunately there is no LRU functionality in our cache yet, so we would need to figure it out first.
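For illustration only, the "keep the last 5 versions" idea could be sketched as below. This is a toy in-memory model, not anything DVC provides. The LastNVersions class and its methods are hypothetical names, and it evicts by insertion order (oldest run first) rather than implementing a true LRU.

```python
from collections import OrderedDict


class LastNVersions:
    """Hypothetical store that keeps only the newest N output versions."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        self._versions = OrderedDict()  # run id -> output hash, oldest first

    def add(self, run_id, output_hash):
        """Record a new version and return the run ids that were evicted."""
        self._versions[run_id] = output_hash
        evicted = []
        while len(self._versions) > self.max_versions:
            old_id, _ = self._versions.popitem(last=False)  # drop the oldest
            evicted.append(old_id)
        return evicted

    def kept(self):
        return list(self._versions)


store = LastNVersions(5)
evictions = [store.add(run, f"hash-{run}") for run in range(1, 7)]
print(store.kept())
```

With six runs and a limit of five, the sixth call to add evicts run 1, matching the behaviour described in the request.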
A faster, and arguably more proper, fix for this is to pass a run_cache argument to stage.save and, if run_cache is false, not call stage_cache.save https://github.com/iterative/dvc/blob/d3acbfe2b1aae2d05cb9b8732167e27f7c5f740e/dvc/stage/__init__.py#L439 . Stage.run already receives it https://github.com/iterative/dvc/blob/d3acbfe2b1aae2d05cb9b8732167e27f7c5f740e/dvc/stage/__init__.py#L488 , but only passes it to run_stage; it should also pass it to stage.save(). That will make dvc run --no-run-cache and dvc repro --no-run-cache not save the run cache for particular stages. Though granularity still requires run_cache: true/false per-output, as described in my other comment above :(
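The shape of that fix can be sketched roughly as follows. This is a simplified stand-in, not DVC's real code: the Stage and StageCache classes here are hypothetical and model only the flow of the run_cache flag from run() into save().

```python
class StageCache:
    """Hypothetical stand-in for the run-cache store."""

    def __init__(self):
        self.saved = []

    def save(self, stage):
        self.saved.append(stage.name)


class Stage:
    """Hypothetical, heavily simplified stage (not DVC's actual class)."""

    def __init__(self, name, stage_cache):
        self.name = name
        self.stage_cache = stage_cache
        self.outs_saved = False

    def save(self, run_cache=True):
        # Output hashes are always recorded for the stage itself.
        self.outs_saved = True
        # Only write to the run-cache when the caller allows it.
        if run_cache:
            self.stage_cache.save(self)

    def run(self, run_cache=True):
        # ... the stage command would be executed here ...
        self.save(run_cache=run_cache)  # forward the flag instead of dropping it


cache = StageCache()
Stage("extract", cache).run(run_cache=False)
Stage("preprocess", cache).run()
print(cache.saved)
```

In this sketch the extract stage still records its outputs but never reaches the run-cache, while preprocess keeps the default behaviour.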