Dvc: diff: does not behave properly when cache is not available

Created on 24 Feb 2020  路  6Comments  路  Source: iterative/dvc

Version

DVC version: 0.86.5+f67314.mod
Python version: 3.7.6
Platform: Darwin-18.2.0-x86_64-i386-64bit
Binary: False
Package: None
Cache: reflink - supported, hardlink - supported, symlink - supported
Filesystem type (cache directory): ('apfs', '/dev/disk1s1')
Filesystem type (workspace): ('apfs', '/dev/disk1s1')

Reproduce

git clone [email protected]:iterative/example-get-started.git
cd example-get-started
dvc pull
dvc diff baseline-experiment bigrams-experiment

Output

Added:
    data/features/test.pkl
    data/features/train.pkl

Modified:
    auc.metric
    data/features/
    model.pkl

files summary: 2 added, 0 deleted, 2 modified

which is wrong. Files data/features/test.pkl and data/features/train.pkl actually exist in both experiments. The only difference is that one of them cached, one is not.

bug p0-critical

Most helpful comment

So, a little update on my side:
The reason for such original behavior is as follows:

  1. There is .dir cache file for pulled revision
  2. There is no .dir cache file for the other one.
  3. When collecting data for non-existing (in cache) revision, we raise an exception which is silently ignored here
  4. In case of existing one loading dir cache is successful
  5. In diff's _paths_checskums we "unpack" dir cache content for later comparison.
  6. This logic in addition to silent ignoring of DirCacheError causes as to detect deleted or added new files (depending on revision order), instead of detecting actual change (modification).

I see the following solutions for this problem:

  1. Raise DirCacheError further and fail diff if there is no cache file - I don't like this idea, because we fail "diagnostic" operation and require the user to download files that he actually might not want to download
  2. Implement logic in remote/base.get_dir_cache for downloading absent cache files (and only them) - I also don't like this idea, as it will be heavy in implementation and still does not guarantee we get proper results (what if someone gc'd his remote? dir cache will be unavailable)
  3. Make dir outs equal from diff point of view to file outs: just remove this unpacking code - I would go with this option, as it does not require from us downloading anything in the situation presented in the original issue and still is able to inform user about changes that happened in the repository.

@dmpetrov @shcheklein what do you think?

All 6 comments

@shcheklein good catch!

The only difference is that one of them cached, one is not.

Just a clarification: these two files are not cached for master/bigrams-experiment but not for baseline-experiment.

I'm very surprised that dvc diff is reading the cache and depends on its state. We agreed that #2998 will be also included in the first version of the diff. It is crucial for CI/CD scenarios. Reopening it.

PS: @shcheklein the url didn't work by default

git clone [email protected]:iterative/example-get-started.git
Cloning into 'example-get-started'...
[email protected]: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

the url didn't work by default

git clone [email protected]:iterative/example-get-started.git

Hmmm weird. The repo is public. Is it because of the git@ SSH URL? Try with git clone [email protected]:iterative/example-get-started.git @dmpetrov?

If this is the case, we should update in https://dvc.org/doc/command-reference/get#example-compare-different-versions-of-data-or-model

UPDATE: I'm changing this just in case in a regular updates PR to docs.

Worth noting that after pull on baseline, message seems to be ok (that both paths are modified).

So, a little update on my side:
The reason for such original behavior is as follows:

  1. There is .dir cache file for pulled revision
  2. There is no .dir cache file for the other one.
  3. When collecting data for non-existing (in cache) revision, we raise an exception which is silently ignored here
  4. In case of existing one loading dir cache is successful
  5. In diff's _paths_checskums we "unpack" dir cache content for later comparison.
  6. This logic in addition to silent ignoring of DirCacheError causes as to detect deleted or added new files (depending on revision order), instead of detecting actual change (modification).

I see the following solutions for this problem:

  1. Raise DirCacheError further and fail diff if there is no cache file - I don't like this idea, because we fail "diagnostic" operation and require the user to download files that he actually might not want to download
  2. Implement logic in remote/base.get_dir_cache for downloading absent cache files (and only them) - I also don't like this idea, as it will be heavy in implementation and still does not guarantee we get proper results (what if someone gc'd his remote? dir cache will be unavailable)
  3. Make dir outs equal from diff point of view to file outs: just remove this unpacking code - I would go with this option, as it does not require from us downloading anything in the situation presented in the original issue and still is able to inform user about changes that happened in the repository.

@dmpetrov @shcheklein what do you think?

I think I'm fine with 3 as a first step, and may be think more about 2. implementation mentioned in 2 is needed potentially in other cases - e.g. dvc list - https://github.com/iterative/dvc/issues/3394 (not even sure why is it p2 tbh - it's p1 to me), so I think this mechanism will be needed down the road anyway.

thanks for the great research @pared , btw :)

Was this page helpful?
0 / 5 - 0 ratings

Related issues

Suor picture Suor  路  39Comments

drorata picture drorata  路  46Comments

andrethrill picture andrethrill  路  70Comments

pared picture pared  路  73Comments

mdekstrand picture mdekstrand  路  43Comments