DVC version: 0.50.1
Python version: 3.7.1
Platform: Linux-4.15.0-52-generic-x86_64-with-debian-buster-sid
Binary: True
Cache: reflink - False, hardlink - True, symlink - True
Filesystem type (cache directory): ('ext4', '/dev/sda')
Filesystem type (workspace): ('ext4', '/dev/sda')
I have a large directory containing thousands of images, and I added the top-level directory to dvc. But when I add a small file to the image directory and run dvc status, computing md5 takes a long time. Is this expected?
Hi @BoysFight! Yes, this behaviour is expected.
Why is it happening?
The mechanism for calculating the md5 of a directory is based on the whole directory's contents. So if dvc detects that the modification time or size of the directory has changed, it will force a recalculation of the directory checksum. Does that answer your question?
EDIT:
Does that answer your question?
- might have sounded a bit passive-aggressive; sorry if it did, it was not meant to be :)
Some more info:
The thing is that currently our mechanism for calculating md5 relies on the content of the whole directory, and if anything inside has changed, we need to retrigger the whole calculation, even if the change is small. We should probably consider an alternative approach, so that adding one file does not retrigger the md5 recalculation for the whole directory.
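To illustrate the behaviour described above (a minimal sketch, not DVC's actual code): if the directory's saved mtime or total size no longer matches what is on disk, the saved checksum is considered stale and the whole directory must be rehashed.

```python
import os


def dir_needs_rechecksum(path, saved_mtime, saved_size):
    """Hypothetical sketch of the change-detection logic: compare the
    directory's current mtime and total file size against the values
    previously saved, and report whether a full checksum recalculation
    would be triggered."""
    st = os.stat(path)
    current_size = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(path)
        for name in files
    )
    # Any mismatch invalidates the saved checksum for the whole directory.
    return st.st_mtime != saved_mtime or current_size != saved_size
```

Adding or deleting even one small file changes both the directory mtime and the total size, which is why a tiny change invalidates the whole directory.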
EDIT2:
Though the md5 recalculation covers the whole directory, the md5s of particular, already-existing files should be obtained from the state database, which should be much faster than the calculation for a not-yet-added directory.
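As a rough illustration of that state-database fast path (function and cache names here are hypothetical, not DVC's real internals): an unchanged file's md5 can be looked up by its mtime and size instead of re-reading the file.

```python
import hashlib
import os


def file_md5(path, state_cache):
    """Sketch of checksum reuse: look up an unchanged file's md5 in a
    cache keyed by (path, mtime, size), standing in for the state
    database. Only hash the file's bytes on a cache miss."""
    st = os.stat(path)
    key = (path, st.st_mtime, st.st_size)
    if key in state_cache:
        return state_cache[key]  # fast path: no file read needed
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    digest = md5.hexdigest()
    state_cache[key] = digest  # remember for the next status run
    return digest
```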
@BoysFight, could you describe your case some more? Maybe there is a bug.
@BoysFight looks like a potential regression to me.
The checksums are kept in the .dvc/state SQLite database. Can you check whether the time it takes is actually the same, to confirm that it runs the calculation again? And could you please share some details so we can reproduce it on our end:
@pared @shcheklein, Thank you very much for your patient reply.
The directory structure is as follows:
data/dataset
├── dir1
├── dir2
├── dir...
├── dir42
└── label.txt
42 directories, 1 file
There are about 327000 files whose size is varying from 100k to 500k under the dataset directory, and the total size is 54G.
After I add the dataset directory, the dvc status running time is 5.655 seconds:
DEBUG: fetched: [(327174,)]
Pipeline is up to date. Nothing to reproduce.
dvc status -v 4.26s user 2.11s system 112% cpu 5.655 total
Then I delete one image in dataset/dir1/, and the dvc status running time is 1:02:49.83:
data/dataset.dvc:
changed outs:
modified: data/dataset
src/retrain_inception_v3.dvc:
changed deps:
modified: data/dataset
dvc status -v 496.03s user 101.81s system 15% cpu 1:02:49.83 total
We have to wait about one hour after I delete or add a single file in the directory.
@BoysFight thank you very much for the elaboration!
It seems to me that status should not take this much time. 300K files is a lot, but not THAT much. I'll try to reproduce this case and investigate.
Ok, this looks to me like a lack of optimization on our side.
Here is what is happening:
When dvc status runs, get_dir_checksum calls _collect_dir, which is supposed to collect the md5s of all files. We need to make dvc try retrieving each single-file checksum from the state database, especially since we do fill in single-file checksums during directory save.
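A minimal sketch of what that fix amounts to, with invented names (`state` here is just a dict standing in for the .dvc/state SQLite database, and the functions are illustrative, not DVC's real internals): when recomputing a directory checksum, reuse the per-file md5s already recorded in the state, so only new or modified files are actually hashed.

```python
import hashlib
import json
import os


def collect_dir_checksum(path, state):
    """Illustrative sketch of the optimization: collect per-file md5s
    for a directory, consulting a (path, mtime, size) -> md5 mapping
    first and hashing file contents only on a miss, then derive the
    directory checksum from the collected file list."""
    entries = []
    for root, _, files in os.walk(path):
        for name in sorted(files):
            fpath = os.path.join(root, name)
            st = os.stat(fpath)
            key = (fpath, st.st_mtime, st.st_size)
            md5 = state.get(key)
            if md5 is None:  # only hash files not already in the state
                h = hashlib.md5()
                with open(fpath, "rb") as f:
                    h.update(f.read())
                md5 = h.hexdigest()
                state[key] = md5
            entries.append(
                {"relpath": os.path.relpath(fpath, path), "md5": md5}
            )
    # The directory checksum is the md5 of the serialized file list,
    # similar in spirit to how DVC hashes a directory manifest.
    payload = json.dumps(entries, sort_keys=True).encode()
    return hashlib.md5(payload).hexdigest()
```

With this shape, deleting one image out of 327K means the next run hashes zero files: every surviving file hits the state fast path, and only the manifest digest is recomputed.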
@pared Great investigation!
@BoysFight the fix has been merged and should be available in the new dvc version.