Hi!
I've run dvc add on a directory of raw data. Its .dvc file is:
md5: 40dc65b691c196a0bc68a102c845cb21
outs:
- cache: true
  md5: ec0e6c9d86ca7e7238837f77464b2f61.dir
  metric: false
  path: ******* # this is directory #1
After that, I ran a dvc run command (which depends on the directory shown above). Its .dvc file is:
cmd: python ../../../src/data/*******.py
deps:
- md5: 00ad5d690b0b1e20f7536ad7b9b13e56
  path: ../../../src/data/*******.py
- md5: ec0e6c9d86ca7e7238837f77464b2f61.dir
  path: ../../raw/******* # this is directory #1
md5: 6d5cc84c0048037818d2022d4caa6824
outs:
- cache: true
  md5: 0e278292ab6d1fe12bbcb890d1003f76.dir
  metric: false
  path: ******* # this is a directory #2
After that, I ran dvc push to S3.
Also, I don't know if this is relevant, but I'm using a shared cache (the cache location is outside of the repo dir).
The problem
On a different machine, I've cloned the git repo and then when I do dvc pull (from S3) I always get the following errors:
Failed to load dir cache '../../******/ec/0e6c9d86ca7e7238837f77464b2f61.dir': [Errno 2] No such file or directory: '******/ec/0e6c9d86ca7e7238837f77464b2f61.dir'
Failed to load dir cache '../../******/0e/278292ab6d1fe12bbcb890d1003f76.dir': [Errno 2] No such file or directory: '******/0e/278292ab6d1fe12bbcb890d1003f76.dir'
(1/26): [##############################] 100% ab985a36268c5fd522e3837ba509bf9d
(2/26): [##############################] 100% 5fcc2be218d299581e7e6eda98471485
(3/26): [##############################] 100% ab7ee23dbafb29fdd9006abf6c0279a4
(4/26): [##############################] 100% ea03c52cf38849e4505288d4d8331f31
(5/26): [##############################] 100% 629389205697cda8caaba595dc07af0c
(6/26): [##############################] 100% 9e11a33bf9e09b6eb8523dd53b4ed9f5
(7/26): [##############################] 100% 2b3ed841e849f223af57c1512c03e88e
(8/26): [##############################] 100% 10e3e0ef98adf6e45b7b4fab507cdf16
(9/26): [##############################] 100% f63931f7e493ee6d16ba6ea0e1fb4dae
(10/26): [##############################] 100% 8dce1da718c77df818261bf627a636fc
(11/26): [##############################] 100% 19b8db23599828aa757c00097d3abe80
(12/26): [##############################] 100% 01775ba2dfb7cd941d2be742c68a1cb3
(13/26): [##############################] 100% 5518bd3663b36d2be860af83f3738faf
(14/26): [##############################] 100% 7d402e8053c0d0af40bafafef4c65864
(15/26): [##############################] 100% bd9193ab817de999b526f01845609ad0
(16/26): [##############################] 100% 304cacd01676e009f08a8bce354dedda
(17/26): [##############################] 100% 72659e880507f87c95146a3ed11f9255
(18/26): [##############################] 100% d70361968ccd4676a1ec87fad4d7dcc6
(19/26): [##############################] 100% d11a29269083bc839bff4aad3509f4c4
(20/26): [##############################] 100% 6792e382941dbb18c89fa57ed7c4a1fc
(21/26): [##############################] 100% d461cb099824287a77e6f369b34c6aaf
(22/26): [##############################] 100% 5b63fd9a7f9193c113085e09677fce5a
(23/26): [##############################] 100% deb3f81a2c3d7ccd3fb98890ffbc8e99
(24/26): [##############################] 100% 7dbc7f9aa5f21a13d72def4f4f3696e3
(25/26): [##############################] 100% 1c9144c607ecb2baf72f2fb5dbd70b0b
(26/26): [##############################] 100% e343f5915e3ca1dae351011a8cd767f4
As you can see, the remaining cached files which are not .dir download smoothly.
After I run dvc checkout no error is thrown. And when I run dvc status it always outputs that the .dvc files related to the two .dir files have changed:
***.dvc
    outs
        changed: data/raw/**** # this is directory #1
***.dvc
    deps
        changed: data/raw/**** # this is directory #1
    outs
        changed: data/processed/**** # this is directory #2
This seems like a bug or I'm being silly and doing something wrong. Is this a known behaviour?
Hi @andrethrill !
Failed to load dir cache '../../******/ec/0e6c9d86ca7e7238837f77464b2f61.dir': [Errno 2] No such file or directory: '******/ec/0e6c9d86ca7e7238837f77464b2f61.dir'
Failed to load dir cache '../../******/0e/278292ab6d1fe12bbcb890d1003f76.dir': [Errno 2] No such file or directory: '******/0e/278292ab6d1fe12bbcb890d1003f76.dir'
These errors should not have been shown. What they mean is that there is no local cache for those *.dir cache files, so dvc has to download them itself, which is a totally normal situation for a fresh pull. I've just sent a patch to tidy that up, so it doesn't look so scary :)
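The paths in those messages hint at how a checksum maps to a cache file: the first two characters become a subdirectory and the remainder, including the .dir suffix, becomes the file name. Below is a minimal Python sketch of that mapping, inferred only from the paths printed in this thread. `cache_path` is a hypothetical helper, not a DVC API, and the cache root is the one the reporter lists elsewhere in this thread.

```python
import posixpath


def cache_path(cache_dir: str, checksum: str) -> str:
    """Hypothetical helper: map a checksum to its location in the cache.

    Assumes the layout visible in this thread:
    <cache_dir>/<first two characters>/<remaining characters>
    """
    return posixpath.join(cache_dir, checksum[:2], checksum[2:])


# Checksum of directory #1 as recorded in the .dvc file.
print(cache_path("../../dvc_caches/project1",
                 "ec0e6c9d86ca7e7238837f77464b2f61.dir"))
# prints ../../dvc_caches/project1/ec/0e6c9d86ca7e7238837f77464b2f61.dir
```

If that file is absent locally, as on a freshly cloned machine, a lookup fails with "No such file or directory" until the file has been fetched from the remote.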
After I run dvc checkout no error is thrown. And when I run dvc status it always outputs that the .dvc files related to the two .dir files have changed:
This is actually strange, and I was not able to reproduce it. We should definitely print better messages in the future so it is easier to understand what is going on. I'm trying to reproduce it myself right now. In the meantime, could you please try to reproduce it again, but this time also make sure that your dirs actually exist when you call dvc status?
Thanks,
Ruslan
Thanks for the feedback @efiop.
I ran dvc pull again from a fresh start on the second machine. I still get the same error message (DVC master branch is not modified)
Continuing, after running dvc pull this is what the cache directory looks like:
tree ../../dvc_caches/project1/
../../dvc_caches/project1/
├── 01
│   └── 775ba2dfb7cd941d2be742c68a1cb3
├── 0e
│   └── 278292ab6d1fe12bbcb890d1003f76.dir
├── 10
│   └── e3e0ef98adf6e45b7b4fab507cdf16
├── 19
│   └── b8db23599828aa757c00097d3abe80
├── 1c
│   └── 9144c607ecb2baf72f2fb5dbd70b0b
├── 2b
│   └── 3ed841e849f223af57c1512c03e88e
├── 30
│   └── 4cacd01676e009f08a8bce354dedda
├── 55
│   └── 18bd3663b36d2be860af83f3738faf
├── 5b
│   └── 63fd9a7f9193c113085e09677fce5a
├── 5f
│   └── cc2be218d299581e7e6eda98471485
├── 62
│   └── 9389205697cda8caaba595dc07af0c
├── 67
│   └── 92e382941dbb18c89fa57ed7c4a1fc
├── 72
│   └── 659e880507f87c95146a3ed11f9255
├── 7d
│   ├── 402e8053c0d0af40bafafef4c65864
│   └── bc7f9aa5f21a13d72def4f4f3696e3
├── 8d
│   └── ce1da718c77df818261bf627a636fc
├── 9e
│   └── 11a33bf9e09b6eb8523dd53b4ed9f5
├── ab
│   ├── 7ee23dbafb29fdd9006abf6c0279a4
│   └── 985a36268c5fd522e3837ba509bf9d
├── bd
│   └── 9193ab817de999b526f01845609ad0
├── d1
│   └── 1a29269083bc839bff4aad3509f4c4
├── d4
│   └── 61cb099824287a77e6f369b34c6aaf
├── d7
│   └── 0361968ccd4676a1ec87fad4d7dcc6
├── de
│   └── b3f81a2c3d7ccd3fb98890ffbc8e99
├── e3
│   └── 43f5915e3ca1dae351011a8cd767f4
├── ea
│   └── 03c52cf38849e4505288d4d8331f31
├── ec
│   └── 0e6c9d86ca7e7238837f77464b2f61.dir
└── f6
    └── 3931f7e493ee6d16ba6ea0e1fb4dae
26 directories, 28 files
So we can see the .dir files were successfully pulled. (surprising for me was that after this, the original repo dir, next to the .dvc files, also had the files already. I thought this would only happen when doing dvc checkout)
Anyway, I ran dvc checkout just to be sure and no error is thrown. And when I run dvc status it always outputs that the .dvc files related to the two .dir files have changed (as before):
***.dvc
    outs
        changed: data/raw/**** # this is directory #1
***.dvc
    deps
        changed: data/raw/**** # this is directory #1
    outs
        changed: data/processed/****
if I dvc repro the files, then dvc status outputs nothing which is the expected behaviour (btw, a minor suggestion: it would be more user friendly to output some message nothing changed, or similar). But then I get more files in cache:
tree ../../dvc_caches/project1/
../../dvc_caches/project1/
├── 01
│   └── 775ba2dfb7cd941d2be742c68a1cb3
├── 0e
│   └── 278292ab6d1fe12bbcb890d1003f76.dir
├── 10
│   └── e3e0ef98adf6e45b7b4fab507cdf16
├── 19
│   └── b8db23599828aa757c00097d3abe80
├── 1c
│   └── 9144c607ecb2baf72f2fb5dbd70b0b
├── 2b
│   └── 3ed841e849f223af57c1512c03e88e
├── 30
│   └── 4cacd01676e009f08a8bce354dedda
├── 55
│   └── 18bd3663b36d2be860af83f3738faf
├── 5a
│   └── 1552a9cb995cf7879aa0756adf8366.dir
├── 5b
│   └── 63fd9a7f9193c113085e09677fce5a
├── 5f
│   └── cc2be218d299581e7e6eda98471485
├── 62
│   └── 9389205697cda8caaba595dc07af0c
├── 67
│   └── 92e382941dbb18c89fa57ed7c4a1fc
├── 72
│   └── 659e880507f87c95146a3ed11f9255
├── 7d
│   ├── 402e8053c0d0af40bafafef4c65864
│   └── bc7f9aa5f21a13d72def4f4f3696e3
├── 8d
│   └── ce1da718c77df818261bf627a636fc
├── 9e
│   └── 11a33bf9e09b6eb8523dd53b4ed9f5
├── ab
│   ├── 7ee23dbafb29fdd9006abf6c0279a4
│   └── 985a36268c5fd522e3837ba509bf9d
├── bd
│   └── 9193ab817de999b526f01845609ad0
├── c5
│   └── a37368da9fba54ccf3720a4a583147.dir
├── d1
│   └── 1a29269083bc839bff4aad3509f4c4
├── d4
│   └── 61cb099824287a77e6f369b34c6aaf
├── d7
│   └── 0361968ccd4676a1ec87fad4d7dcc6
├── de
│   └── b3f81a2c3d7ccd3fb98890ffbc8e99
├── e3
│   └── 43f5915e3ca1dae351011a8cd767f4
├── ea
│   └── 03c52cf38849e4505288d4d8331f31
├── ec
│   └── 0e6c9d86ca7e7238837f77464b2f61.dir
└── f6
    └── 3931f7e493ee6d16ba6ea0e1fb4dae
28 directories, 30 files
Could it be because the directories, themselves, are not git tracked (they don't exist when I git clone), and they are created by DVC when checked out? I was wondering if this can cause the md5 result to be different somehow.
So we can see the .dir files were successfully pulled. (surprising for me was that after this, the original repo dir, next to the .dvc files, also had the files already. I thought this would only happen when doing dvc checkout)
dvc pull is actually dvc fetch + dvc checkout, similar to how git pull combines git fetch and git merge. So this works as it should.
Anyway, I ran dvc checkout just to be sure and no error is thrown. And when I run dvc status it always outputs that the .dvc files related to the two .dir files have changed (as before):
So those dirs don't exist before you actually dvc pull? Could you remove them and run dvc checkout once again and then dvc status to see if anything changes?
if I dvc repro the files, then dvc status outputs nothing which is the expected behaviour (btw, a minor suggestion: it would be more user friendly to output some message nothing changed, or similar). But then I get more files in cache:
Could it be because the directories, themselves, are not git tracked (they don't exist when I git clone), and they are created by DVC when checked out? I was wondering if this can cause the md5 result to be different somehow.
Looks like your directory is indeed changed for some reason and when you run dvc repro, it treats your dvc add'ed directory(ec0e6c9d86ca7e7238837f77464b2f61.dir) as the data source that has changed in place and thus simply re-adds it, that is why you get two more .dir cache files after repro.
Both your directories should not be tracked by git and should be removed by dvc checkout before recreating. Is there any chance you could run diff -r on the original dvc added directory and the one produced by checkout?
Also, could you check that there are only regular files and directories in your original dir (i.e. find ./orig_dir -type b -o -type c -o -type p -o -type l should output nothing)?
Thanks,
Ruslan
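The special-file check requested above can also be done without find. The following Python sketch is one possible equivalent and not part of DVC: `find_special_files` is a hypothetical helper, and it is slightly broader than the find command because it reports anything that is neither a regular file nor a directory (symlinks, FIFOs, sockets, device files).

```python
import os
import stat
from typing import List


def find_special_files(root: str) -> List[str]:
    """Return paths under root that are neither regular files nor directories."""
    special = []
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk does not follow symlinked directories by default;
        # they still show up in dirnames, so both lists are inspected.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode  # lstat: do not follow symlinks
            if not (stat.S_ISREG(mode) or stat.S_ISDIR(mode)):
                special.append(path)
    return sorted(special)


# "./orig_dir" is the placeholder path used in the question above.
print(find_special_files("./orig_dir"))
```

An empty list corresponds to the find command printing nothing.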
One more thing. Judging by the tree output you've provided, the only two files that were added to the cache are new *.dir ones, so looks like either some files are missing in the new directories or our algorithm somehow produced different results for seemingly the same directories. Could you please show me these outputs(you can censor filenames if you really need to, but if you can please leave them as is, so it is easier for me to read):
$ cat 0e/278292ab6d1fe12bbcb890d1003f76.dir
$ cat 5a/1552a9cb995cf7879aa0756adf8366.dir
$ cat c5/a37368da9fba54ccf3720a4a583147.dir
$ cat ec/0e6c9d86ca7e7238837f77464b2f61.dir
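The observation that only two *.dir files were added between the two tree listings can be checked mechanically by snapshotting the cache before and after dvc repro. This is a rough Python sketch, not a DVC feature. `cache_entries` and `new_entries` are hypothetical helpers, and the example sets contain only the .dir entries from the listings in this thread rather than the full cache.

```python
import os
from typing import List, Set


def cache_entries(cache_dir: str) -> Set[str]:
    """Collect every cache file as an 'xx/rest' path relative to cache_dir."""
    entries = set()
    for dirpath, _dirnames, filenames in os.walk(cache_dir):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), cache_dir)
            entries.add(rel.replace(os.sep, "/"))
    return entries


def new_entries(before: Set[str], after: Set[str]) -> List[str]:
    """Entries present in the second snapshot but not in the first."""
    return sorted(after - before)


# The .dir entries visible before and after dvc repro in this thread.
before = {
    "0e/278292ab6d1fe12bbcb890d1003f76.dir",
    "ec/0e6c9d86ca7e7238837f77464b2f61.dir",
}
after = before | {
    "5a/1552a9cb995cf7879aa0756adf8366.dir",
    "c5/a37368da9fba54ccf3720a4a583147.dir",
}
print(new_entries(before, after))
# prints ['5a/1552a9cb995cf7879aa0756adf8366.dir', 'c5/a37368da9fba54ccf3720a4a583147.dir']
```

On a real cache, `cache_entries("../../dvc_caches/project1")` would be called once before and once after the repro.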
So those dirs don't exist before you actually dvc pull? Could you remove them and run dvc checkout once again and then dvc status to see if anything changes?
Nothing changed. But I guess the real test would be to try it now on a third machine with different user names, permissions and so on...
Looks like your directory is indeed changed for some reason and when you run dvc repro, it treats your dvc add'ed directory(ec0e6c9d86ca7e7238837f77464b2f61.dir) as the data source that has changed in place and thus simply re-adds it
I just want to double check something: when I initially do dvc add [path] or dvc run -d [path]..., here path should be path/to/the/dir/i/want/to/track and not path/to/the/dir/i/want/to/track/* or some variation of it, right?
Also, could you check that there are only regular files and directories in your original dir (i.e. find ./orig_dir -type b -o -type c -o -type p -o -type l should output nothing)?
They output nothing. So I guess this is fine.
Is there any chance you could run diff -r on the original dvc added directory and the one produced by checkout?
I've just done that and it outputs nothing. So they are the same.
Could you please show me these outputs(you can censor filenames if you really need to, but if you can please leave them as is, so it is easier for me to read):
I'm sorry I had to censor some names so I changed them all for consistency. I hope it is still clear.
The only differences I notice is the order in which the files are listed. Should this make any difference?
# cat 0e/278292ab6d1fe12bbcb890d1003f76.dir
[{"md5": "19b8db23599828aa757c00097d3abe80", "relpath": "file1.parquet"}, {"md5": "d11a29269083bc839bff4aad3509f4c4","relpath": "file2.parquet"}, {"md5": "deb3f81a2c3d7ccd3fb98890ffbc8e99", "relpath": "file3.parquet"}, {"md5": "6792e382941dbb18c89fa57ed7c4a1fc", "relpath": "file4.parquet"}]
# cat c5/a37368da9fba54ccf3720a4a583147.dir
[{"md5": "d11a29269083bc839bff4aad3509f4c4", "relpath": "file2.parquet"}, {"md5": "19b8db23599828aa757c00097d3abe80", "relpath": "file1.parquet"}, {"md5": "deb3f81a2c3d7ccd3fb98890ffbc8e99", "relpath": "file3.parquet"}, {"md5": "6792e382941dbb18c89fa57ed7c4a1fc", "relpath": "file4.parquet"}]
# cat 5a/1552a9cb995cf7879aa0756adf8366.dir
[{"md5": "5b63fd9a7f9193c113085e09677fce5a", "relpath": "file5.csv"}, {"md5": "7dbc7f9aa5f21a13d72def4f4f3696e3", "relpath": "file6.csv"}, {"md5": "7d402e8053c0d0af40bafafef4c65864", "relpath": "file7.csv"}, {"md5": "304cacd01676e009f08a8bce354dedda", "relpath": "file8.csv"}, {"md5": "1c9144c607ecb2baf72f2fb5dbd70b0b","relpath": "file9.csv"}, {"md5": "ab7ee23dbafb29fdd9006abf6c0279a4", "relpath": "file10.csv"}, {"md5": "d70361968ccd4676a1ec87fad4d7dcc6", "relpath": "file11.csv"}, {"md5": "8dce1da718c77df818261bf627a636fc", "relpath": "file12.csv"}, {"md5": "5518bd3663b36d2be860af83f3738faf", "relpath": "file13.csv"}, {"md5": "629389205697cda8caaba595dc07af0c", "relpath": "file14.csv"}, {"md5": "01775ba2dfb7cd941d2be742c68a1cb3", "relpath": "file15.csv"}]
# cat ec/0e6c9d86ca7e7238837f77464b2f61.dir
[{"md5": "8dce1da718c77df818261bf627a636fc", "relpath": "file12.csv"}, {"md5": "5518bd3663b36d2be860af83f3738faf", "relpath": "file13.csv"}, {"md5": "629389205697cda8caaba595dc07af0c", "relpath": "file14.csv"}, {"md5":"5b63fd9a7f9193c113085e09677fce5a", "relpath": "file5.csv"}, {"md5": "ab7ee23dbafb29fdd9006abf6c0279a4", "relpath":"file10.csv"}, {"md5": "7dbc7f9aa5f21a13d72def4f4f3696e3", "relpath": "file6.csv"}, {"md5": "7d402e8053c0d0af40bafafef4c65864", "relpath": "file7.csv"}, {"md5": "304cacd01676e009f08a8bce354dedda", "relpath": "file8.csv"}, {"md5": "d70361968ccd4676a1ec87fad4d7dcc6", "relpath": "file11.csv"}, {"md5": "1c9144c607ecb2baf72f2fb5dbd70b0b", "relpath": "file9.csv"}, {"md5": "01775ba2dfb7cd941d2be742c68a1cb3", "relpath": "file15.csv"}]
(it seems like I like to find the deepest bugs, sorry @efiop)
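To confirm that two listings differ only in order, they can be compared as relpath-to-md5 mappings instead of as raw lists. The sketch below uses the two parquet listings shown above (0e/... and c5/...), with the file names as censored by the poster. `load_dir_listing` and `as_mapping` are hypothetical helpers and not part of DVC.

```python
import json


def load_dir_listing(path):
    """Hypothetical helper: read a .dir cache file as a list of entries."""
    with open(path) as fobj:
        return json.load(fobj)


def as_mapping(entries):
    """Order-independent view of a listing: relpath -> md5."""
    return {entry["relpath"]: entry["md5"] for entry in entries}


# Contents of 0e/278292ab6d1fe12bbcb890d1003f76.dir (pulled from S3).
pulled = [
    {"md5": "19b8db23599828aa757c00097d3abe80", "relpath": "file1.parquet"},
    {"md5": "d11a29269083bc839bff4aad3509f4c4", "relpath": "file2.parquet"},
    {"md5": "deb3f81a2c3d7ccd3fb98890ffbc8e99", "relpath": "file3.parquet"},
    {"md5": "6792e382941dbb18c89fa57ed7c4a1fc", "relpath": "file4.parquet"},
]
# Contents of c5/a37368da9fba54ccf3720a4a583147.dir (written by dvc repro).
reproduced = [
    {"md5": "d11a29269083bc839bff4aad3509f4c4", "relpath": "file2.parquet"},
    {"md5": "19b8db23599828aa757c00097d3abe80", "relpath": "file1.parquet"},
    {"md5": "deb3f81a2c3d7ccd3fb98890ffbc8e99", "relpath": "file3.parquet"},
    {"md5": "6792e382941dbb18c89fa57ed7c4a1fc", "relpath": "file4.parquet"},
]

print(pulled == reproduced)                          # False: the order differs
print(as_mapping(pulled) == as_mapping(reproduced))  # True: same files, same md5s
```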
I just want to double check something: when I initially do dvc add [path] or dvc run -d [path]..., here path should be path/to/the/dir/i/want/to/track and not path/to/the/dir/i/want/to/track/* or some variation of it, right?
Yes, without the wildcard.
The only differences I notice is the order in which the files are listed. Should this make any difference?
Actually yes, it matters a lot, because we take an md5 sum of those files, so if the order is different then md5 sums are going to be different. This is precisely the bug, we've got it! Looks like we don't ensure enough that the list is sorted. I am working on fixing it right now, 0.10.3 will be released today right after the bug is fixed.
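The effect can be illustrated with a small Python sketch. It hashes a JSON serialization of a listing with and without sorting by relpath. The serialization used here is an assumption made for illustration, so the digests will not match DVC's real checksums, and `listing_md5` is a hypothetical helper rather than DVC's implementation.

```python
import hashlib
import json


def listing_md5(entries, sort_entries):
    """Hash a serialized listing, optionally sorting entries by relpath first."""
    if sort_entries:
        entries = sorted(entries, key=lambda entry: entry["relpath"])
    # Assumed serialization, chosen only to make the example deterministic.
    payload = json.dumps(entries, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


# Two entries from the thread, listed in both possible orders.
a = [
    {"md5": "19b8db23599828aa757c00097d3abe80", "relpath": "file1.parquet"},
    {"md5": "d11a29269083bc839bff4aad3509f4c4", "relpath": "file2.parquet"},
]
b = list(reversed(a))

print(listing_md5(a, False) == listing_md5(b, False))  # False: order changes the hash
print(listing_md5(a, True) == listing_md5(b, True))    # True: sorting makes it stable
```

Sorting on a key that is unique per entry, such as relpath, is what makes the result independent of the order in which the filesystem happens to list the files.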
(it seems like I like to find the deepest bugs, sorry @efiop)
Thank you so much for the feedback and spending your time to investigate bugs! :)
Awesome!
Will it be possible to keep the cache and .dvc files that I already have in original machine? Or do I need to dvc add everything again? (it would save me some pain if I don't)
@andrethrill Unfortunately that is not feasible: the md5 for your original dir was not computed properly in the first place, so your dvc files will still have to change. But you don't need to re-add your files manually. Just run dvc repro and dvc should handle everything for you (note that it will have to re-run the stages down the pipeline, since the md5 for the dir is different from the one they used previously). Although the cache file for your dir was not written properly, the dir itself was created properly in your workspace, so luckily there is no data corruption, and re-adding (or dvc repro-ing) it in place should fix everything. So sorry for the inconvenience :(
@efiop I understand.
And if, after I do as you say, I run the garbage collector, will it remove the other files? I wanted to avoid having "garbage" in the cache.
edit:
is there a way to remove garbage from the remote location?
Unfortunately, the garbage collector currently cannot clean up the cache on the remote location. I've added https://github.com/iterative/dvc/issues/876 for that and will try to add that feature shortly.
Hi @andrethrill !
I've released https://github.com/iterative/dvc/releases/tag/0.11.0 with both this bug fixed and dvc gc --cloud supported. Please upgrade. Closing this issue for now, feel free to reopen.
Thanks,
Ruslan
Everything seems to be working now @efiop.
Thanks so much for all the support!
André