Dvc: DVC Repro incorrectly saving directory to cache

Created on 19 Aug 2020 · 11 Comments · Source: iterative/dvc

Bug Report

As the title states. I run dvc repro extract for my extract pipeline stage. This takes two pretty large zip files and extracts them into specified folders. These folders should not be added to the dvc cache, since they can be easily reproduced by extracting the archives, but I declare them as dependencies so that the DAG looks nicer.

My dvc.yaml looks like this. The preprocess stage is only there to show that the uncached folders are meant to be used as dependencies in later stages.

stages:
  extract:
    cmd: tar -xzvf data/thingy10k/10k_tetmesh.tar.gz -C data/thingy10k/ && tar -xzvf data/thingy10k/10k_surface.tar.gz -C data/thingy10k/
    deps:
    - data/thingy10k/10k_surface.tar.gz
    - data/thingy10k/10k_tetmesh.tar.gz
    outs:
    - data/thingy10k/10k_surface:
        cache: false
    - data/thingy10k/10k_tetmesh:
        cache: false
  preprocess:
    cmd: some preprocessing command
    deps:
    - data/thingy10k/10k_tetmesh
    - data/thingy10k/10k_surface

After running dvc repro extract, the hashes of all the files are computed and then the files saved to cache. This is exactly the thing that I was trying to prevent with the cache: false option.

I confirmed that the contents of the output folders were indeed added to the cache by using du -sh .dvc/cache, which went up by exactly the size of the two folders after running the command.
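The same before/after check can be scripted. This is only a minimal sketch, assuming the default cache location .dvc/cache; the helper name dir_size is made up for illustration, and unlike du it counts a hard-linked file once per path it appears under.

```python
import os


def dir_size(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so link targets are not counted through the link.
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


# Intended use (run from the repo root):
#   before = dir_size(".dvc/cache")
#   ... run `dvc repro extract` ...
#   after = dir_size(".dvc/cache")
#   print(f"cache grew by {after - before} bytes")
```

A path that does not exist yet simply yields a size of 0, so the helper can be called before the cache directory has been created.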

Interestingly, running dvc gc -a frees the cache again. Running dvc push (without first running dvc gc -a) to push the cache to my remote storage also reports that everything is up to date, which leads me to believe that dvc recognizes that these directories should in fact not be cached.

I have reproduced this in a local git and dvc repo by first adding the two archives using dvc add and then running the above mentioned extract stage. The archives can be downloaded from this repo https://github.com/Yixin-Hu/TetWild#dataset under Output.

Please provide information about your setup

Output of dvc version:

$ dvc version
DVC version: 1.6.0 (pip)
---------------------------------
Platform: Python 3.7.6 on Linux-5.4.0-42-generic-x86_64-with-debian-bullseye-sid
Supports: http, https, ssh, webdav, webdavs
Cache types: hardlink, symlink
Repo: dvc, git

Additional Information (if any):

If applicable, please also provide a --verbose output of the command, eg: dvc add --verbose.

feature request p1-important

digitalillusions

All 11 comments

Hi @digitalillusions! Thank you for reporting this issue! This is a result of our run-cache feature, which started caching even unchanged outputs to be able to restore stages when needed. We definitely had small git-tracked files in mind, but clearly that is not the case here. We could consider somehow explicitly disabling the run-cache for the specified stages. The first idea that comes to my mind is to have some flag for dvc run that would add the flag to the generated stage spec, so dvc would know not to try to save this stage to the run-cache. We already have dvc run --no-run-cache, which tells dvc not to use the existing run-cache, but it still tries to save the results of this run 🙁

efiop on 20 Aug 2020

😕1

Thanks @efiop for the response! Ah, that's right, I remember reading about this in the 1.0 update. I also never noticed before, because I set up my pipelines before 1.0 came out and only now made a clean clone of my repo somewhere else and used the repro functionality to reproduce the state.

We could consider somehow explicitly disabling the run-cache for the specified stages. The first idea that comes to my mind is to have some flag for dvc run that would add the flag to the generated stage spec and so dvc would know not to try to save this stage to the run-cache.

I think that would be a good idea. Though to me it would still be somewhat confusing having to, e.g. set both cache: false, run-cache: false in my dvc.yaml. Or would you specify the run-cache: false for the entire pipeline stage? Or only specify run-cache: false which would imply cache: false?

To me it was also just super unintuitive and frustrating to see 12Gb of data being saved to .dvc/cache, when I obviously set the cache flag to false.

digitalillusions on 20 Aug 2020

👍1

@digitalillusions Yeah, I meant run_cache: False (or something like that) for the whole stage, rather than for every output. That might be useful for other applications as well.

efiop on 20 Aug 2020

But would you then in addition have to specify cache: false for the individual outputs? Other than that seems like a nice solution to me.

digitalillusions on 20 Aug 2020

@digitalillusions Correct, we would. The reason is that cache: false is solely for data management purposes, meaning that it won't be affected by dvc checkout/pull/push/etc. So in terms of a dvc run it would look something like dvc run --no-save-run-cache -O my_uncached_out .... Don't like the --no-save-run-cache naming though, will need to figure something out.

efiop on 20 Aug 2020

The reason is that cache: false is solely for data management purposes

Yeah, cache: has become a bit of a misleading name now :( No ideas yet on how to rename it or what to name the new options, though ...

shcheklein on 20 Aug 2020

Would it be an option to provide keyword arguments? So instead of having --no-run-cache, you could have --run-cache (ignore | disable) where ignore would be the old --no-run-cache and disable would be to completely disable the run cache for that specific stage.
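To make the proposal concrete, here is a rough sketch of how such a keyword option could be parsed. Everything in it is hypothetical: --run-cache with these choices is not an existing dvc flag, the "use" default is my own assumption for the current behaviour, and the parser is a toy rather than dvc's real CLI code.

```python
import argparse


def build_parser():
    """Toy parser; `--run-cache` with these choices is a proposal, not a real dvc flag."""
    parser = argparse.ArgumentParser(prog="toy-run")
    parser.add_argument(
        "--run-cache",
        choices=["use", "ignore", "disable"],
        default="use",  # assumed name for today's default behaviour
        help="use: restore from and save to the run-cache; "
        "ignore: do not restore, but still save (the old --no-run-cache); "
        "disable: neither restore nor save",
    )
    return parser


def run_cache_policy(mode):
    """Map a mode to (restore_from_run_cache, save_to_run_cache)."""
    return {
        "use": (True, True),
        "ignore": (False, True),
        "disable": (False, False),
    }[mode]
```

With this layout, ignore keeps the semantics --no-run-cache has today, and disable is the only mode that stops the stage from being written to the run-cache.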

digitalillusions on 21 Aug 2020

👍1

@digitalillusions Great idea! Yeah, could think about using something like that. 👍

efiop on 21 Aug 2020

I think we should bump the priority on this issue. It seems like no-cache effectively does not exist since 1.0:

#!/bin/bash

rm -rf repo
mkdir repo
pushd repo

git init --quiet
dvc init --quiet

echo hello >> hello
dvc add hello
git add -A
git commit -am "init"

dvc run -n run -d hello -O hello.txt --no-run-cache "cat hello >> hello.txt && echo 'modification' >> hello.txt"

Results in saving hello.txt to cache.

pared on 9 Dec 2020

❤1 👍1

Hello :) I have the same kind of problem:
A preprocessing stage that takes a big folder as input and produces a big folder as output. It would be great if there was an easy and fine-grained way of specifying where the output of a stage ends up. As far as I can see, there are the following options for the output of a stage:

dvc does nothing
dvc stores hashsums
dvc stores full output

I understand that the run-cache is useful. So what about the following:

dvc run -o -> dvc stores full output
dvc run -O5 -> dvc stores hashsums and temporarily keeps the last 5 versions of the output. For run 6, dvc deletes the output of run 1.
dvc run -O -> same as now
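
The -O5 idea could behave roughly like the sketch below. It is an illustration only, not anything dvc implements: the class name LastNOutputs and its methods are invented here, outputs are plain bytes held in memory, and md5 is used just as an example hash.

```python
import hashlib
from collections import OrderedDict


class LastNOutputs:
    """Keep a hash for every run, but full contents only for the last `keep` runs."""

    def __init__(self, keep):
        self.keep = keep
        self.hashes = []  # (run number, hex digest), never evicted
        self.contents = OrderedDict()  # run number -> output bytes, oldest first

    def record(self, run, data):
        """Store the hash of `data` and keep the data itself for a limited time."""
        self.hashes.append((run, hashlib.md5(data).hexdigest()))
        self.contents[run] = data
        # Evict the oldest stored outputs once more than `keep` are held.
        while len(self.contents) > self.keep:
            self.contents.popitem(last=False)
```

With keep=5, recording run 6 drops the stored output of run 1 while its hash stays, and keep=0 stores hashes only.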

Would this be possible?
Thanks

jimmytudeski on 12 Dec 2020

@jimmytudeski Unfortunately there is no LRU functionality in our cache yet, so we would need to figure it out first.
A faster, and arguably more proper, fix for this is to pass a run_cache argument to stage.save and, if run_cache is false, not call stage_cache.save https://github.com/iterative/dvc/blob/d3acbfe2b1aae2d05cb9b8732167e27f7c5f740e/dvc/stage/__init__.py#L439 . Stage.run already receives it https://github.com/iterative/dvc/blob/d3acbfe2b1aae2d05cb9b8732167e27f7c5f740e/dvc/stage/__init__.py#L488 , but only passes it to run_stage; it should also pass it to stage.save(). That will make dvc run --no-run-cache and dvc repro --no-run-cache not save the run cache for particular stages. Though granularity still requires run_cache: true/false per-output, as described in my other comment above :(
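
The shape of that change can be shown with stand-in classes. This is a sketch of the idea only, not the actual dvc source: ToyStage and ToyStageCache are invented for illustration, and the real Stage.save and Stage.run do considerably more than shown here.

```python
class ToyStageCache:
    """Stand-in for the run-cache; records which stages were saved to it."""

    def __init__(self):
        self.saved = []

    def save(self, stage):
        self.saved.append(stage.name)


class ToyStage:
    """Stand-in for a pipeline stage; not dvc's real Stage class."""

    def __init__(self, name, stage_cache):
        self.name = name
        self.stage_cache = stage_cache
        self.ran = False

    def save(self, run_cache=True):
        # Hash bookkeeping for deps and outs would still happen here.
        # The proposed change: only write to the run-cache when allowed.
        if run_cache:
            self.stage_cache.save(self)

    def run(self, run_cache=True):
        self.ran = True  # placeholder for executing the stage command
        # Forward the flag to save() instead of dropping it after the run.
        self.save(run_cache=run_cache)
```

Calling run(run_cache=False) on a ToyStage leaves the ToyStageCache untouched, while the default call still saves the stage.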