DVC version: 0.21.3+b45e22.mod
Running from source
Also happens on official 0.21.3 release from homebrew
Going exactly by the tutorial: https://dvc.org/doc/user-guide/external-outputs
and working with a GS bucket and cache.
Immediately after the dvc run executes, running dvc status shows the output file on GS as changed. dvc repro will execute the command again, and the output will still count as changed afterwards.
Seems like the problem is probably in RemoteGS.exists(self, path_infos)
The function is called with path_infos=[{'scheme': 'gs', 'bucket': 'mybucket', 'key': 'data.txt'}]
However, since it is a cache remote, it lists the files in the cache bucket, whose names are etags and not the names of the files in the output. So exists returns False, and the output file is marked as changed due to not existing.
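To illustrate the mismatch described above, here is a minimal sketch. It is not DVC's actual code: the `exists` helper, the listings, and the cache object name are hypothetical stand-ins, and only the `path_info` dict mirrors the one from the report.

```python
# Hypothetical stand-in for the lookup; not the real RemoteGS.exists.
def exists(listed_keys, path_info):
    """Return True if the key from path_info appears in the listing."""
    return path_info["key"] in listed_keys


# Assumption: objects in the cache bucket are named by checksum,
# not by the name of the output file.
cache_listing = {"cache/5d41402abc4b2a76b9719d911017c592"}
# Assumption: the output bucket holds the file under its real name.
output_listing = {"data.txt"}

path_info = {"scheme": "gs", "bucket": "mybucket", "key": "data.txt"}

# Looking for the output's name in the cache listing can never match.
found_in_cache = exists(cache_listing, path_info)
# The same lookup against the output's own location succeeds.
found_in_output = exists(output_listing, path_info)

print(found_in_cache, found_in_output)
```

In this sketch the first lookup fails only because it consults the wrong listing, which matches the symptom of the output being reported as changed.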
Seems like the issue's origin is in https://github.com/iterative/dvc/commit/3ec0a5c4ae852a2c1fdaad97f32e20189bccc2d3
The remote's changed() method started checking whether the actual file exists in the expected path, which fails when the remote is a cache and its paths are different.
It seems to me like this should also affect other types of remote, not just GS, including S3!
Hi @guysmoilov !
Great catch! Your analysis is spot on! I'll send a patch for this shortly.
Thank you!
Patch is merged. I have found another bug in GS remote, preparing a patch for it right now. Will release new dvc version with all the patches tomorrow. Thank you for the feedback! :slightly_smiling_face:
@efiop Hi, thanks for the quick fix, however I think there's still a problem. Now, when I run dvc status immediately after repro, I get: Corrupted cache file warnings, the cache is deleted, and the file is immediately considered as changed.
I believe this happens in RemoteGS.changed_cache, line 79, where the etag of the output file is compared to the etag of the cache file. When they are found to not be equal, and if I understand correctly they have no reason to be equal, the cache file is considered corrupted.
Or is this the other bug that you found?
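The etag comparison described above can be sketched with an in-memory stand-in. Everything here is hypothetical (`FakeBucket`, `looks_corrupted`, the key names). The sketch assumes only that the store hands out a fresh etag whenever an object is written or copied, which is the behaviour this comment suspects.

```python
import itertools


class FakeBucket:
    """In-memory stand-in for a bucket; not a real GS client."""

    def __init__(self):
        self._objects = {}
        self._counter = itertools.count(1)

    def upload(self, key, data):
        # Assumption: every write gets a new etag, regardless of content.
        self._objects[key] = {"data": data, "etag": f"etag-{next(self._counter)}"}

    def copy(self, src, dst):
        # A copy is a new write of the same bytes.
        self.upload(dst, self._objects[src]["data"])

    def etag(self, key):
        return self._objects[key]["etag"]

    def data(self, key):
        return self._objects[key]["data"]


def looks_corrupted(expected_etag, cached_etag):
    """Hypothetical check: treat any etag mismatch as a corrupted cache."""
    return expected_etag != cached_etag


bucket = FakeBucket()
bucket.upload("data.txt", b"hello")
bucket.copy("data.txt", "cache/entry")  # save the output to the cache

same_bytes = bucket.data("data.txt") == bucket.data("cache/entry")
flagged = looks_corrupted(bucket.etag("data.txt"), bucket.etag("cache/entry"))
print(same_bytes, flagged)
```

Under that assumption the cache entry is byte-for-byte identical to the output, yet the check still flags it as corrupted.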
@guysmoilov This is precisely the bug, you can see our CI failing because of it. Patch is coming today, thank you! :slightly_smiling_face: https://github.com/iterative/dvc/issues/1409
@guysmoilov Turns out, because of the inertia from S3, we've been wrongly using ETag on GS. On S3, the ETag is the only way to identify a file, and it doesn't change when you copy a blob within S3. That is not the case with GS, where the ETag does change after you simply copy a blob from one prefix to another. But, luckily, GS does provide md5_hash, which can and should be used for our purposes. Preparing a patch right now.
NOTE: this does not affect anything when you push/pull to/from gs remote, since we use local md5s there anyway. This problem only affects external output/deps scenario.
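As a rough illustration of why md5_hash is a better identifier here than ETag, the sketch below derives the checksum purely from the object's bytes, so a copy of the same content yields the same value. It assumes md5_hash is reported as the base64 encoding of the MD5 digest. The helper names are hypothetical and the "remote" value is computed locally rather than fetched from GS.

```python
import base64
import hashlib


def local_md5_b64(data: bytes) -> str:
    """Base64-encoded MD5 digest, the assumed format of md5_hash."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def md5_b64_to_hex(md5_hash_b64: str) -> str:
    """Convert a base64 MD5 into the hex form used for local md5s."""
    return base64.b64decode(md5_hash_b64).hex()


content = b"hello"

# Stand-ins for the md5_hash of the output and of its copy in the cache.
output_md5 = local_md5_b64(content)
cache_md5 = local_md5_b64(content)

# Content-derived checksums match after a copy, unlike per-object etags.
unchanged = output_md5 == cache_md5
hex_md5 = md5_b64_to_hex(output_md5)
print(unchanged, hex_md5 == hashlib.md5(content).hexdigest())
```

The hex conversion is included only because a base64 value and a locally computed hex md5 would otherwise not compare equal.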
@guysmoilov Patch is now merged into master. Everything should work correctly from now on. Please feel free to try it out :slightly_smiling_face: We will release new dvc version later today. Thanks a lot for all the feedback and amazing analysis! :slightly_smiling_face: