Dvc: [BUG?] Cached dir file pushed to remote throws error when pulled from remote

Created on 10 Jul 2018 · 12 Comments · Source: iterative/dvc

Hi!

I ran dvc add on a directory of raw data. The resulting .dvc file is:

md5: 40dc65b691c196a0bc68a102c845cb21
outs:
- cache: true
  md5: ec0e6c9d86ca7e7238837f77464b2f61.dir
  metric: false
  path: *******                                                     # this is directory #1

Then I ran a dvc run command that depends on the directory shown above. Its .dvc file is:

cmd: python ../../../src/data/*******.py
deps:
- md5: 00ad5d690b0b1e20f7536ad7b9b13e56
  path: ../../../src/data/*******.py
- md5: ec0e6c9d86ca7e7238837f77464b2f61.dir
  path: ../../raw/*******                                                     # this is directory #1
md5: 6d5cc84c0048037818d2022d4caa6824
outs:
- cache: true
  md5: 0e278292ab6d1fe12bbcb890d1003f76.dir
  metric: false
  path: *******                                                                   # this is a directory #2

After that, I ran dvc push to S3.

Also, I don't know if this is relevant, but I'm using a shared cache (the cache location is outside of the repo dir).

The problem

On a different machine, I cloned the git repo, and whenever I run dvc pull (from S3) I get the following errors:

Failed to load dir cache '../../******/ec/0e6c9d86ca7e7238837f77464b2f61.dir': [Errno 2] No such file or directory: '******/ec/0e6c9d86ca7e7238837f77464b2f61.dir'
Failed to load dir cache '../../******/0e/278292ab6d1fe12bbcb890d1003f76.dir': [Errno 2] No such file or directory: '******/0e/278292ab6d1fe12bbcb890d1003f76.dir'
(1/26): [##############################] 100% ab985a36268c5fd522e3837ba509bf9d
(2/26): [##############################] 100% 5fcc2be218d299581e7e6eda98471485
(3/26): [##############################] 100% ab7ee23dbafb29fdd9006abf6c0279a4
(4/26): [##############################] 100% ea03c52cf38849e4505288d4d8331f31
(5/26): [##############################] 100% 629389205697cda8caaba595dc07af0c
(6/26): [##############################] 100% 9e11a33bf9e09b6eb8523dd53b4ed9f5
(7/26): [##############################] 100% 2b3ed841e849f223af57c1512c03e88e
(8/26): [##############################] 100% 10e3e0ef98adf6e45b7b4fab507cdf16
(9/26): [##############################] 100% f63931f7e493ee6d16ba6ea0e1fb4dae
(10/26): [##############################] 100% 8dce1da718c77df818261bf627a636fc
(11/26): [##############################] 100% 19b8db23599828aa757c00097d3abe80
(12/26): [##############################] 100% 01775ba2dfb7cd941d2be742c68a1cb3
(13/26): [##############################] 100% 5518bd3663b36d2be860af83f3738faf
(14/26): [##############################] 100% 7d402e8053c0d0af40bafafef4c65864
(15/26): [##############################] 100% bd9193ab817de999b526f01845609ad0
(16/26): [##############################] 100% 304cacd01676e009f08a8bce354dedda
(17/26): [##############################] 100% 72659e880507f87c95146a3ed11f9255
(18/26): [##############################] 100% d70361968ccd4676a1ec87fad4d7dcc6
(19/26): [##############################] 100% d11a29269083bc839bff4aad3509f4c4
(20/26): [##############################] 100% 6792e382941dbb18c89fa57ed7c4a1fc
(21/26): [##############################] 100% d461cb099824287a77e6f369b34c6aaf
(22/26): [##############################] 100% 5b63fd9a7f9193c113085e09677fce5a
(23/26): [##############################] 100% deb3f81a2c3d7ccd3fb98890ffbc8e99
(24/26): [##############################] 100% 7dbc7f9aa5f21a13d72def4f4f3696e3
(25/26): [##############################] 100% 1c9144c607ecb2baf72f2fb5dbd70b0b
(26/26): [##############################] 100% e343f5915e3ca1dae351011a8cd767f4

As you can see, the remaining cached files which are not .dir download smoothly.

After I run dvc checkout no error is thrown. And when I run dvc status it always outputs that the .dvc files related to the two .dir files have changed:

***.dvc
        outs
                changed:  data/raw/****                                                       # this is directory #1
***.dvc
        deps
                changed:  data/raw/****                                                       # this is directory #1
        outs
                changed:  data/processed/****                                             # this is directory #2

This seems like a bug or I'm being silly and doing something wrong. Is this a known behaviour?

bug

All 12 comments

Hi @andrethrill !

Failed to load dir cache '../../******/ec/0e6c9d86ca7e7238837f77464b2f61.dir': [Errno 2] No such file or directory: '******/ec/0e6c9d86ca7e7238837f77464b2f61.dir'
Failed to load dir cache '../../******/0e/278292ab6d1fe12bbcb890d1003f76.dir': [Errno 2] No such file or directory: '******/0e/278292ab6d1fe12bbcb890d1003f76.dir'

These errors should not have been shown. What they mean is that there is no local cache for those *.dir cache files yet, so dvc has to download them itself, which is totally normal for a fresh pull. I've just sent a patch to tidy that up, so it doesn't look so scary :)
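As an illustration of what "no local cache" means here, the following minimal sketch maps a checksum from a .dvc file to the cache path seen in the error messages (the first two characters become a subdirectory). The helper names cache_relpath and is_cached are hypothetical and are not DVC's internal API; the layout is inferred only from the paths shown in this thread.

```python
import os


def cache_relpath(checksum):
    """Split a checksum into '<first two chars>/<rest>', as in the paths above.

    Hypothetical helper: inferred from the error messages, not taken from DVC.
    """
    return f"{checksum[:2]}/{checksum[2:]}"


def is_cached(cache_dir, checksum):
    """Return True if the cache file for this checksum exists locally."""
    return os.path.isfile(os.path.join(cache_dir, cache_relpath(checksum)))


print(cache_relpath("ec0e6c9d86ca7e7238837f77464b2f61.dir"))
# ec/0e6c9d86ca7e7238837f77464b2f61.dir
```

On a fresh clone is_cached would return False for both .dir checksums, which is the situation the messages report before the files are downloaded.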

After I run dvc checkout no error is thrown. And when I run dvc status it always outputs that the .dvc files related to the two .dir files have changed:

This is actually strange, and I was not able to reproduce it. We should definitely print better messages in the future so it is easier to understand what is going on. I'm trying to reproduce it myself right now; in the meantime, could you please try to reproduce it again, and this time also make sure that your dirs actually exist when you call dvc status?

Thanks,
Ruslan

Thanks for the feedback @efiop.

I ran dvc pull again from a fresh start on the second machine. I still get the same error message (DVC master branch is not modified)

Continuing, after running dvc pull this is what the cache directory looks like:

tree ../../dvc_caches/project1/
../../dvc_caches/project1/
├── 01
│   └── 775ba2dfb7cd941d2be742c68a1cb3
├── 0e
│   └── 278292ab6d1fe12bbcb890d1003f76.dir
├── 10
│   └── e3e0ef98adf6e45b7b4fab507cdf16
├── 19
│   └── b8db23599828aa757c00097d3abe80
├── 1c
│   └── 9144c607ecb2baf72f2fb5dbd70b0b
├── 2b
│   └── 3ed841e849f223af57c1512c03e88e
├── 30
│   └── 4cacd01676e009f08a8bce354dedda
├── 55
│   └── 18bd3663b36d2be860af83f3738faf
├── 5b
│   └── 63fd9a7f9193c113085e09677fce5a
├── 5f
│   └── cc2be218d299581e7e6eda98471485
├── 62
│   └── 9389205697cda8caaba595dc07af0c
├── 67
│   └── 92e382941dbb18c89fa57ed7c4a1fc
├── 72
│   └── 659e880507f87c95146a3ed11f9255
├── 7d
│   ├── 402e8053c0d0af40bafafef4c65864
│   └── bc7f9aa5f21a13d72def4f4f3696e3
├── 8d
│   └── ce1da718c77df818261bf627a636fc
├── 9e
│   └── 11a33bf9e09b6eb8523dd53b4ed9f5
├── ab
│   ├── 7ee23dbafb29fdd9006abf6c0279a4
│   └── 985a36268c5fd522e3837ba509bf9d
├── bd
│   └── 9193ab817de999b526f01845609ad0
├── d1
│   └── 1a29269083bc839bff4aad3509f4c4
├── d4
│   └── 61cb099824287a77e6f369b34c6aaf
├── d7
│   └── 0361968ccd4676a1ec87fad4d7dcc6
├── de
│   └── b3f81a2c3d7ccd3fb98890ffbc8e99
├── e3
│   └── 43f5915e3ca1dae351011a8cd767f4
├── ea
│   └── 03c52cf38849e4505288d4d8331f31
├── ec
│   └── 0e6c9d86ca7e7238837f77464b2f61.dir
└── f6
    └── 3931f7e493ee6d16ba6ea0e1fb4dae

26 directories, 28 files

So we can see the .dir files were successfully pulled. (What surprised me was that, after this, the data files were already present in the repo dir next to the .dvc files. I thought this would only happen after dvc checkout.)

Anyway, I ran dvc checkout just to be sure and no error is thrown. And when I run dvc status it always outputs that the .dvc files related to the two .dir files have changed (as before):

***.dvc
        outs
                changed:  data/raw/****                                                       # this is directory #1
***.dvc
        deps
                changed:  data/raw/****                                                       # this is directory #1
        outs
                changed:  data/processed/****

If I dvc repro the files, then dvc status outputs nothing, which is the expected behaviour (btw, a minor suggestion: it would be more user-friendly to print a message such as "nothing changed"). But then I get more files in the cache:


tree ../../dvc_caches/project1/
../../dvc_caches/project1/
├── 01
│   └── 775ba2dfb7cd941d2be742c68a1cb3
├── 0e
│   └── 278292ab6d1fe12bbcb890d1003f76.dir
├── 10
│   └── e3e0ef98adf6e45b7b4fab507cdf16
├── 19
│   └── b8db23599828aa757c00097d3abe80
├── 1c
│   └── 9144c607ecb2baf72f2fb5dbd70b0b
├── 2b
│   └── 3ed841e849f223af57c1512c03e88e
├── 30
│   └── 4cacd01676e009f08a8bce354dedda
├── 55
│   └── 18bd3663b36d2be860af83f3738faf
├── 5a
│   └── 1552a9cb995cf7879aa0756adf8366.dir
├── 5b
│   └── 63fd9a7f9193c113085e09677fce5a
├── 5f
│   └── cc2be218d299581e7e6eda98471485
├── 62
│   └── 9389205697cda8caaba595dc07af0c
├── 67
│   └── 92e382941dbb18c89fa57ed7c4a1fc
├── 72
│   └── 659e880507f87c95146a3ed11f9255
├── 7d
│   ├── 402e8053c0d0af40bafafef4c65864
│   └── bc7f9aa5f21a13d72def4f4f3696e3
├── 8d
│   └── ce1da718c77df818261bf627a636fc
├── 9e
│   └── 11a33bf9e09b6eb8523dd53b4ed9f5
├── ab
│   ├── 7ee23dbafb29fdd9006abf6c0279a4
│   └── 985a36268c5fd522e3837ba509bf9d
├── bd
│   └── 9193ab817de999b526f01845609ad0
├── c5
│   └── a37368da9fba54ccf3720a4a583147.dir
├── d1
│   └── 1a29269083bc839bff4aad3509f4c4
├── d4
│   └── 61cb099824287a77e6f369b34c6aaf
├── d7
│   └── 0361968ccd4676a1ec87fad4d7dcc6
├── de
│   └── b3f81a2c3d7ccd3fb98890ffbc8e99
├── e3
│   └── 43f5915e3ca1dae351011a8cd767f4
├── ea
│   └── 03c52cf38849e4505288d4d8331f31
├── ec
│   └── 0e6c9d86ca7e7238837f77464b2f61.dir
└── f6
    └── 3931f7e493ee6d16ba6ea0e1fb4dae

28 directories, 30 files

Could it be because the directories themselves are not tracked by git (they don't exist when I git clone) and are created by DVC on checkout? I was wondering if this could somehow cause the md5 result to be different.

So we can see the .dir files were successfully pulled. (surprising for me was that after this, the original repo dir, next to the .dvc files, also had the files already. I thought this would only happen when doing dvc checkout)

dvc pull is actually dvc fetch followed by dvc checkout, similar to how git works. So this works as it should.

Anyway, I ran dvc checkout just to be sure and no error is thrown. And when I run dvc status it always outputs that the .dvc files related to the two .dir files have changed (as before):

So those dirs don't exist before you actually dvc pull? Could you remove them and run dvc checkout once again and then dvc status to see if anything changes?

if I dvc repro the files, then dvc status outputs nothing which is the expected behaviour (btw, a minor suggestion: it would be more user friendly to output some message nothing changed, or similar). But then I get more files in cache:
Could it be because the directories, themselves, are not git tracked (they don't exist when I git clone), and they are created by DVC when checked out? I was wondering if this can cause the md5 result to be different somehow.

It looks like your directory has indeed changed for some reason. When you run dvc repro, it treats your dvc add'ed directory (ec0e6c9d86ca7e7238837f77464b2f61.dir) as a data source that has changed in place and simply re-adds it; that is why you get two more .dir cache files after repro.
Neither of your directories should be tracked by git, and both should be removed by dvc checkout before being recreated. Is there any chance you could run diff -r on the original dvc added directory and the one produced by checkout?

Also, could you check that there are only regular files and directories in your original dir (i.e. find ./orig_dir -type b -o -type c -o -type p -o -type l should output nothing)?
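For anyone who prefers to run this check from Python, here is a rough sketch of the same idea: walk the directory and report anything that is neither a regular file nor a directory. The function name find_special_entries is made up for illustration, and this version is slightly broader than the find command (it also reports sockets).

```python
import os
import stat


def find_special_entries(root):
    """Return paths under root that are neither regular files nor directories."""
    special = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat does not follow symlinks, so links are reported as links.
            mode = os.lstat(path).st_mode
            if not (stat.S_ISREG(mode) or stat.S_ISDIR(mode)):
                special.append(path)
    return sorted(special)


# "./orig_dir" is the placeholder directory name used in the comment above.
for path in find_special_entries("./orig_dir"):
    print(path)
```

As with the find command, an empty result means the directory contains only regular files and subdirectories.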

Thanks,
Ruslan

One more thing. Judging by the tree output you've provided, the only two files that were added to the cache are the new *.dir ones, so it looks like either some files are missing in the new directories or our algorithm somehow produced different results for seemingly identical directories. Could you please show me these outputs (you can censor filenames if you really need to, but if you can, please leave them as is, so it is easier for me to read):

$ cat 0e/278292ab6d1fe12bbcb890d1003f76.dir
$ cat 5a/1552a9cb995cf7879aa0756adf8366.dir
$ cat c5/a37368da9fba54ccf3720a4a583147.dir
$ cat ec/0e6c9d86ca7e7238837f77464b2f61.dir

So those dirs don't exist before you actually dvc pull? Could you remove them and run dvc checkout once again and then dvc status to see if anything changes?

Nothing changed. But I guess the real test would be to try it now on a third machine with different user names, permissions and so on...

Looks like your directory is indeed changed for some reason and when you run dvc repro, it treats your dvc add'ed directory(ec0e6c9d86ca7e7238837f77464b2f61.dir) as the data source that has changed in place and thus simply re-adds it

I just want to double check something: when I initially do dvc add [path] or dvc run -d [path]..., here path should be path/to/the/dir/i/want/to/track and not path/to/the/dir/i/want/to/track/* or some variation of it, right?

Also, could you check that there are only regular files and directories in your original dir(I.e. find ./orig_dir -type b -o -type c -o -type p -o -type l -o -type l should output nothing)?

They output nothing. So I guess this is fine.

Is there any chance you could run diff -r on the original dvc added directory and the one produced by checkout?

I've just done that and it outputs nothing. So they are the same.

Could you please show me these outputs(you can censor filenames if you really need to, but if you can please leave them as is, so it is easier for me to read):

I'm sorry I had to censor some names so I changed them all for consistency. I hope it is still clear.

The only difference I notice is the order in which the files are listed. Should this make any difference?

# cat 0e/278292ab6d1fe12bbcb890d1003f76.dir
[{"md5": "19b8db23599828aa757c00097d3abe80", "relpath": "file1.parquet"}, {"md5": "d11a29269083bc839bff4aad3509f4c4","relpath": "file2.parquet"}, {"md5": "deb3f81a2c3d7ccd3fb98890ffbc8e99", "relpath": "file3.parquet"}, {"md5": "6792e382941dbb18c89fa57ed7c4a1fc", "relpath": "file4.parquet"}]

# cat c5/a37368da9fba54ccf3720a4a583147.dir
[{"md5": "d11a29269083bc839bff4aad3509f4c4", "relpath": "file2.parquet"}, {"md5": "19b8db23599828aa757c00097d3abe80", "relpath": "file1.parquet"}, {"md5": "deb3f81a2c3d7ccd3fb98890ffbc8e99", "relpath": "file3.parquet"}, {"md5": "6792e382941dbb18c89fa57ed7c4a1fc", "relpath": "file4.parquet"}]


# cat 5a/1552a9cb995cf7879aa0756adf8366.dir
[{"md5": "5b63fd9a7f9193c113085e09677fce5a", "relpath": "file5.csv"}, {"md5": "7dbc7f9aa5f21a13d72def4f4f3696e3", "relpath": "file6.csv"}, {"md5": "7d402e8053c0d0af40bafafef4c65864", "relpath": "file7.csv"}, {"md5": "304cacd01676e009f08a8bce354dedda", "relpath": "file8.csv"}, {"md5": "1c9144c607ecb2baf72f2fb5dbd70b0b","relpath": "file9.csv"}, {"md5": "ab7ee23dbafb29fdd9006abf6c0279a4", "relpath": "file10.csv"}, {"md5": "d70361968ccd4676a1ec87fad4d7dcc6", "relpath": "file11.csv"}, {"md5": "8dce1da718c77df818261bf627a636fc", "relpath": "file12.csv"}, {"md5": "5518bd3663b36d2be860af83f3738faf", "relpath": "file13.csv"}, {"md5": "629389205697cda8caaba595dc07af0c", "relpath": "file14.csv"}, {"md5": "01775ba2dfb7cd941d2be742c68a1cb3", "relpath": "file15.csv"}]



# cat ec/0e6c9d86ca7e7238837f77464b2f61.dir
[{"md5": "8dce1da718c77df818261bf627a636fc", "relpath": "file12.csv"}, {"md5": "5518bd3663b36d2be860af83f3738faf", "relpath": "file13.csv"}, {"md5": "629389205697cda8caaba595dc07af0c", "relpath": "file14.csv"}, {"md5":"5b63fd9a7f9193c113085e09677fce5a", "relpath": "file5.csv"}, {"md5": "ab7ee23dbafb29fdd9006abf6c0279a4", "relpath":"file10.csv"}, {"md5": "7dbc7f9aa5f21a13d72def4f4f3696e3", "relpath": "file6.csv"}, {"md5": "7d402e8053c0d0af40bafafef4c65864", "relpath": "file7.csv"}, {"md5": "304cacd01676e009f08a8bce354dedda", "relpath": "file8.csv"}, {"md5": "d70361968ccd4676a1ec87fad4d7dcc6", "relpath": "file11.csv"}, {"md5": "1c9144c607ecb2baf72f2fb5dbd70b0b", "relpath": "file9.csv"}, {"md5": "01775ba2dfb7cd941d2be742c68a1cb3", "relpath": "file15.csv"}]

(it seems like I like to find the deepest bugs, sorry @efiop 😅 )
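To make the comparison concrete, here is a small sketch that loads the two listings for directory #2 (0e/2782... and c5/a373..., copied from the output above with the censored file names) and checks them both as ordered lists and as unordered sets. The variable and function names are illustrative only.

```python
import json

# Contents of 0e/278292ab6d1fe12bbcb890d1003f76.dir (referenced by the .dvc file)
original = json.loads("""
[{"md5": "19b8db23599828aa757c00097d3abe80", "relpath": "file1.parquet"},
 {"md5": "d11a29269083bc839bff4aad3509f4c4", "relpath": "file2.parquet"},
 {"md5": "deb3f81a2c3d7ccd3fb98890ffbc8e99", "relpath": "file3.parquet"},
 {"md5": "6792e382941dbb18c89fa57ed7c4a1fc", "relpath": "file4.parquet"}]
""")

# Contents of c5/a37368da9fba54ccf3720a4a583147.dir (created by dvc repro)
after_repro = json.loads("""
[{"md5": "d11a29269083bc839bff4aad3509f4c4", "relpath": "file2.parquet"},
 {"md5": "19b8db23599828aa757c00097d3abe80", "relpath": "file1.parquet"},
 {"md5": "deb3f81a2c3d7ccd3fb98890ffbc8e99", "relpath": "file3.parquet"},
 {"md5": "6792e382941dbb18c89fa57ed7c4a1fc", "relpath": "file4.parquet"}]
""")


def as_set(entries):
    """Ignore ordering: compare only (relpath, md5) pairs."""
    return {(e["relpath"], e["md5"]) for e in entries}


same_files = as_set(original) == as_set(after_repro)
same_order = original == after_repro
print(same_files, same_order)  # True False
```

The two listings describe exactly the same files with the same checksums; only the position of file1.parquet and file2.parquet is swapped.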

I just want to double check something: when I initially do dvc add [path] or dvc run -d [path]..., here path should be path/to/the/dir/i/want/to/track and not path/to/the/dir/i/want/to/track/* or some variation of it, right?

Yes, without the wildcard.

The only differences I notice is the order in which the files are listed. Should this make any difference?

Actually yes, it matters a lot, because we take the md5 sum of those .dir files, so if the order of the entries differs, the md5 sums differ too. This is precisely the bug; we've got it! It looks like we don't reliably ensure that the list is sorted. I am working on a fix right now; 0.10.3 will be released today, right after the bug is fixed.
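A minimal sketch of why ordering matters, under the assumption that the directory checksum is simply the md5 of the serialized entry list. The function dir_checksum and its serialization are hypothetical, not DVC's actual implementation, so it will not reproduce the checksums shown in this thread.

```python
import hashlib
import json


def dir_checksum(entries, sort=True):
    """Hash a list of {"md5", "relpath"} entries into a '<md5>.dir' checksum.

    Illustrative only: with sort=True the entries are ordered by relpath
    first, so the result no longer depends on directory listing order.
    """
    if sort:
        entries = sorted(entries, key=lambda e: e["relpath"])
    payload = json.dumps(entries, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest() + ".dir"


listing_a = [
    {"md5": "19b8db23599828aa757c00097d3abe80", "relpath": "file1.parquet"},
    {"md5": "d11a29269083bc839bff4aad3509f4c4", "relpath": "file2.parquet"},
]
listing_b = list(reversed(listing_a))  # same files, different order

# Without sorting, the same directory yields two different checksums.
print(dir_checksum(listing_a, sort=False) == dir_checksum(listing_b, sort=False))  # False
# With sorting, the checksum is stable across machines.
print(dir_checksum(listing_a) == dir_checksum(listing_b))  # True
```

Sorting by relpath before hashing is one straightforward way to make the checksum independent of the order in which the filesystem happens to list files.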

(it seems like I like to find the deepest bugs, sorry @efiop 😅 )

Thank you so much for the feedback and spending your time to investigate bugs! :)

Awesome!

Will it be possible to keep the cache and .dvc files that I already have on the original machine? Or do I need to dvc add everything again? (It would save me some pain if I didn't have to.)

@andrethrill Unfortunately that is not feasible: the md5 for your original dir was not computed properly in the first place, so your .dvc files will have to change. But you don't need to re-add your files manually; just run dvc repro and dvc should handle everything for you (note that it will have to re-run the stages down the pipeline, since the md5 for the dir differs from what they used previously). Although the cache file for your dir was not written properly, the dir itself was created properly in your workspace, so luckily there is no data corruption, and re-adding (or dvc repro-ing) it in place should fix everything. Sorry for the inconvenience :(

@efiop I understand.

And if, after I do as you say, I run the garbage collector, will it remove the other files? I wanted to avoid having "garbage" in the cache.

edit:

is there a way to remove garbage from the remote location?

Unfortunately, the garbage collector is currently not able to clean up the cache on the remote location. I've added https://github.com/iterative/dvc/issues/876 for that and will try to add that feature shortly.

Hi @andrethrill !

I've released https://github.com/iterative/dvc/releases/tag/0.11.0 with both this bug fixed and dvc gc --cloud supported. Please upgrade. Closing this issue for now, feel free to reopen.

Thanks,
Ruslan

Everything seems to be working now @efiop.
Thanks so much for all the support!

André
