DVC version: 0.86.5+f67314.mod
Python version: 3.7.6
Platform: Darwin-18.2.0-x86_64-i386-64bit
Binary: False
Package: None
Cache: reflink - supported, hardlink - supported, symlink - supported
Filesystem type (cache directory): ('apfs', '/dev/disk1s1')
Filesystem type (workspace): ('apfs', '/dev/disk1s1')
git clone [email protected]:iterative/example-get-started.git
cd example-get-started
dvc pull
dvc diff baseline-experiment bigrams-experiment
Added:
data/features/test.pkl
data/features/train.pkl
Modified:
auc.metric
data/features/
model.pkl
files summary: 2 added, 0 deleted, 2 modified
which is wrong. Files data/features/test.pkl and data/features/train.pkl actually exist in both experiments. The only difference is that one of them cached, one is not.
@shcheklein good catch!
The only difference is that one of them cached, one is not.
Just a clarification: these two files are not cached for master/bigrams-experiment but not for baseline-experiment.
I'm very surprised that dvc diff is reading the cache and depends on its state. We agreed that #2998 will be also included in the first version of the diff. It is crucial for CI/CD scenarios. Reopening it.
PS: @shcheklein the url didn't work by default
git clone [email protected]:iterative/example-get-started.git
Cloning into 'example-get-started'...
[email protected]: Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
the url didn't work by default
git clone [email protected]:iterative/example-get-started.git
Hmmm weird. The repo is public. Is it because of the git@ SSH URL? Try with git clone [email protected]:iterative/example-get-started.git @dmpetrov?
If this is the case, we should update in https://dvc.org/doc/command-reference/get#example-compare-different-versions-of-data-or-model
UPDATE: I'm changing this just in case in a regular updates PR to docs.
Worth noting that after pull on baseline, message seems to be ok (that both paths are modified).
So, a little update on my side:
The reason for such original behavior is as follows:
.dir cache file for pulled revision.dir cache file for the other one.diff's _paths_checskums we "unpack" dir cache content for later comparison. DirCacheError causes as to detect deleted or added new files (depending on revision order), instead of detecting actual change (modification).I see the following solutions for this problem:
DirCacheError further and fail diff if there is no cache file - I don't like this idea, because we fail "diagnostic" operation and require the user to download files that he actually might not want to downloadremote/base.get_dir_cache for downloading absent cache files (and only them) - I also don't like this idea, as it will be heavy in implementation and still does not guarantee we get proper results (what if someone gc'd his remote? dir cache will be unavailable)dir outs equal from diff point of view to file outs: just remove this unpacking code - I would go with this option, as it does not require from us downloading anything in the situation presented in the original issue and still is able to inform user about changes that happened in the repository.@dmpetrov @shcheklein what do you think?
I think I'm fine with 3 as a first step, and may be think more about 2. implementation mentioned in 2 is needed potentially in other cases - e.g. dvc list - https://github.com/iterative/dvc/issues/3394 (not even sure why is it p2 tbh - it's p1 to me), so I think this mechanism will be needed down the road anyway.
thanks for the great research @pared , btw :)
Most helpful comment
So, a little update on my side:
The reason for such original behavior is as follows:
.dircache file forpulledrevision.dircache file for the other one.diff's_paths_checskumswe "unpack" dir cache content for later comparison.DirCacheErrorcauses as to detectdeletedoraddednew files (depending on revision order), instead of detecting actual change (modification).I see the following solutions for this problem:
DirCacheErrorfurther and fail diff if there is no cache file - I don't like this idea, because we fail "diagnostic" operation and require the user to download files that he actually might not want to downloadremote/base.get_dir_cachefor downloading absent cache files (and only them) - I also don't like this idea, as it will be heavy in implementation and still does not guarantee we get proper results (what if someonegc'd his remote? dir cache will be unavailable)dirouts equal fromdiffpoint of view to fileouts: just remove this unpacking code - I would go with this option, as it does not require from us downloading anything in the situation presented in the original issue and still is able to inform user about changes that happened in the repository.@dmpetrov @shcheklein what do you think?