DVC 0.29.0 / ubuntu / pip
I believe you recently fixed a bug related to re-computing md5s for large files. There might be something similar happening again — or maybe I just need to better understand what triggers md5 to be computed.
$ dvc status
Computing md5 for a large directory project/data/images. This is only done once.
[##############################] 100% project/data/images
This doesn't happen every time I run dvc status, but it does happen at least every time I reboot the machine. I haven't 100% narrowed down what triggers it. Is this expected?
It's not super high priority — it only takes ~30 seconds to re-compute md5s for these directories, which is kind of surprisingly fast. Could it be caching file-level (image-level) md5s and then simply recomputing the directory-level md5?
Hi @colllin. Thanks for reporting. This is not the desired behaviour; there is a PR trying to tackle this one:
> it only takes ~30 seconds to re-compute md5s for these directories, which is kind of surprisingly fast. Could it be caching file-level (image-level) md5s and then simply recomputing the directory-level md5?
Currently, besides the md5, we store the modification time and size of each file, and, for a given inode, we check whether the file's mtime or size has changed. If neither has, we assume that we do not need to recompute the md5. So yes, we are caching file-level md5s.
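To make the check above concrete, here is a minimal sketch of that kind of cache. It is not DVC's actual code: the in-memory `state` dict and the `file_md5` / `cached_md5` helpers are hypothetical names used only for illustration, and a real state db would be persisted on disk.

```python
import hashlib
import os

# Hypothetical in-memory stand-in for a state db: inode -> (mtime_ns, size, md5).
state = {}


def file_md5(path):
    """Hash the file contents in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_md5(path):
    """Return (md5, recomputed); reuse the stored md5 if mtime and size match."""
    st = os.stat(path)
    entry = state.get(st.st_ino)
    if entry is not None and entry[:2] == (st.st_mtime_ns, st.st_size):
        # Same inode, same mtime, same size: assume the contents are unchanged.
        return entry[2], False
    md5 = file_md5(path)
    state[st.st_ino] = (st.st_mtime_ns, st.st_size, md5)
    return md5, True
```

With a scheme like this, a second call on an untouched file only costs one `stat`, which would be consistent with the re-check of a large directory finishing quickly.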
For the record, looks like it can be reproduced this way:
#!/bin/bash
set -x
set -e
rm -rf myrepo
mkdir myrepo
cd myrepo
git init
dvc init
git commit -m"init"
mkdir dir
for i in $(seq 1 1000); do
echo $i > dir/$i
done
dvc add dir
dvc status
dvc status
which produces
+ dvc add dir
Computing md5 for a large directory dir. This is only done once.
[##############################] 100% dir
Adding 'dir' to '.gitignore'.
Saving 'dir' to cache '.dvc/cache'.
Linking directory 'dir'.
[##############################] 100% dir
Saving information to 'dir.dvc'.
To track the changes with git run:
git add .gitignore dir.dvc
+ dvc status
Computing md5 for a large directory dir. This is only done once.
[##############################] 100% dir
Pipeline is up to date. Nothing to reproduce.
+ dvc status
Pipeline is up to date. Nothing to reproduce.
We don't print a progress bar when verifying cache for a directory, so it looks like there is something else that we've forgotten to update, which makes the first dvc status actually compute something once again.
@colllin sorry, I made a mistake, it seems there is something more to this case.
Thanks @efiop, I'll look into that.
@colllin it actually was a bug. After adding files we did not update the directory state, so at the next status we detected that the modification time for the directory had changed, and performed an update for the whole directory. The fix is in review. Thank you for pointing this one out!
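For anyone curious why a stale directory entry triggers the extra work, here is a toy illustration. It is not DVC's code: `dir_signature` and `demo` are hypothetical names, and replacing a file is only a stand-in for whatever touches the directory's entries after its state was recorded (for example, the linking step shown in the output above, which is an assumption on my part).

```python
import os
import tempfile


def dir_signature(path):
    """Hypothetical directory state entry: (inode, mtime in nanoseconds)."""
    st = os.stat(path)
    return st.st_ino, st.st_mtime_ns


def demo():
    with tempfile.TemporaryDirectory() as root:
        data = os.path.join(root, "dir")
        os.mkdir(data)
        target = os.path.join(data, "1")
        with open(target, "w") as fobj:
            fobj.write("1\n")

        # Pin the directory mtime to a fixed past value so the change is unambiguous.
        os.utime(data, ns=(10**18, 10**18))
        saved = dir_signature(data)  # state recorded too early

        # Removing and re-creating an entry updates the directory's own mtime.
        os.remove(target)
        with open(target, "w") as fobj:
            fobj.write("1\n")

        stale = dir_signature(data) != saved  # a later check sees a "changed" directory
        saved = dir_signature(data)           # the fix: refresh the state after the change
        fresh = dir_signature(data) == saved
        return stale, fresh
```

Calling `demo()` returns a pair of flags: the first says whether the early entry went stale, the second whether the refreshed entry matches the directory again.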
Amazingly fast fix. Thank you!!
@colllin Just a heads up: we've rolled back a faulty optimization in 0.41.0, so status might be slower, because it is going to need to validate each file (no md5 computations though; everything is going to be pulled from the state db). We are working on a proper optimization patch right now, which should be ready this week. Just wanted to let you know, so there are no surprises again. :slightly_smiling_face: Sorry for the inconvenience.