Dvc: Small file changes cause md5 computation to take a long time

Created on 12 Jul 2019  ·  7 Comments  ·  Source: iterative/dvc

DVC version: 0.50.1
Python version: 3.7.1
Platform: Linux-4.15.0-52-generic-x86_64-with-debian-buster-sid
Binary: True
Cache: reflink - False, hardlink - True, symlink - True
Filesystem type (cache directory): ('ext4', '/dev/sda')
Filesystem type (workspace): ('ext4', '/dev/sda')

I have a large directory containing thousands of images, and I added the top-level directory to DVC. But when I add a small file to the image directory and run `dvc status`, computing the md5 takes a long time. Is this expected?

bug c5-half-a-day p1-important performance research

All 7 comments

Hi @BoysFight! Yes, this behaviour is expected.

Why is it happening?
The mechanism for calculating the md5 of a directory is based on the whole directory's contents. So if DVC detects that the modification time or size of the directory has changed, it forces a recalculation of the directory checksum. Does that answer your question?
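To make the change-detection step concrete, here is a minimal sketch of the kind of staleness test described above. The function name `dir_needs_rehash` and the saved-value parameters are hypothetical, not DVC's actual API; DVC keeps the saved mtime/size in its state database.

```python
import os

def dir_needs_rehash(path, saved_mtime, saved_size):
    # Hypothetical sketch of the staleness test: if the directory's
    # current mtime or size differs from what was recorded when the
    # checksum was last computed, the checksum must be recalculated.
    st = os.stat(path)
    return st.st_mtime != saved_mtime or st.st_size != saved_size
```

Note that on most filesystems a directory's mtime changes whenever an entry is added or removed, which is why deleting a single image triggers the recalculation.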

EDIT:
Does that answer your question? - might have sounded a bit passive-aggressive, sorry if it did, it was not meant to be :)
Some more info:
The thing is that currently our mechanism for calculating md5 relies on the content of the whole directory, and if anything inside has changed, we need to retrigger the whole calculation, even if the change is small. We should probably consider an alternate way of doing it, so that adding one file does not retrigger the whole directory md5 recalculation.
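The "relies on the content of the whole directory" idea can be sketched as follows: the directory checksum is derived from the sorted list of (relative path, file md5) pairs, so changing any one file changes the directory hash. This is an illustrative simplification, not DVC's exact on-disk format; the function names are made up for the example.

```python
import hashlib
import json
import os

def file_md5(path):
    # Hash one file's contents in chunks to bound memory use.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def dir_md5(root):
    # The directory hash is the md5 of a manifest listing every file
    # and its checksum, so a single changed, added, or deleted file
    # changes the directory checksum.
    entries = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            entries.append({"relpath": rel, "md5": file_md5(full)})
    entries.sort(key=lambda e: e["relpath"])
    manifest = json.dumps(entries, sort_keys=True).encode()
    return hashlib.md5(manifest).hexdigest()
```

With 327,000 files this is exactly the walk that becomes expensive if every per-file `file_md5` call is recomputed instead of being served from a cache.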

EDIT2:
Though the md5 recalculation covers the whole directory, the md5s of particular, already existing files should be obtained from the state database, which should be much faster than the calculation for an unadded directory.
@BoysFight, could you describe your case in more detail? Maybe there is a bug.

@BoysFight looks like a potential regression to me.

  1. It should run faster the second time after the change to the directory, since we don't calculate md5s again; we take them from the .dvc/state SQLite database. Can you check whether the time it takes is actually the same, to confirm that it runs the calculation again?
  2. We have a ticket https://github.com/iterative/dvc/issues/2093 to optimize access to the state database I mentioned - it should make status runs even faster. Please vote for the ticket so that we can prioritize it.
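For readers unfamiliar with the state database mentioned above: it is a SQLite file that maps a file's identity to its known checksum, so unchanged files can skip hashing entirely. The schema below is a hypothetical miniature for illustration, not DVC's real table layout.

```python
import sqlite3

# Hypothetical mini version of the .dvc/state cache: checksums keyed
# by (inode, mtime, size). If all three match, the file has not
# changed and its md5 can be reused without re-reading the file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE state ("
    "  inode INTEGER, mtime TEXT, size TEXT, md5 TEXT,"
    "  PRIMARY KEY (inode, mtime, size))"
)

def get_cached_md5(inode, mtime, size):
    row = conn.execute(
        "SELECT md5 FROM state WHERE inode=? AND mtime=? AND size=?",
        (inode, mtime, size),
    ).fetchone()
    return row[0] if row else None
```

A lookup like this is microseconds per file, versus milliseconds to re-read and hash a 100-500 KB image, which is where the hour-long `dvc status` run comes from when the cache is bypassed.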

Can you please share some details so we can reproduce it on our end:

  • number of files
  • average size of a file
  • time to calculate before, time to calculate after

@pared @shcheklein, Thank you very much for your patient reply.
The directory structure is as follows:

data/dataset
├── dir1
├── dir2
├── dir...
├── dir42
└── label.txt
42 directories, 1 file

There are about 327,000 files, each between 100 KB and 500 KB, under the dataset directory, and the total size is 54 GB.

After I add the dataset directory, the dvc status running time is 5.655 seconds:

DEBUG: fetched: [(327174,)]
Pipeline is up to date. Nothing to reproduce.
dvc status -v 4.26s user 2.11s system 112% cpu 5.655 total

Then I delete one image in dataset/dir1/, and the dvc status running time is 1:02:49.83:

data/dataset.dvc:
changed outs:
modified: data/dataset
src/retrain_inception_v3.dvc:
changed deps:
modified: data/dataset
dvc status -v 496.03s user 101.81s system 15% cpu 1:02:49.83 total

We have to wait about one hour after deleting or adding a single file in the directory.

@BoysFight thank you very much for the elaboration!
It seems to me that status should not take this much time. 300K files is a lot, but not THAT much. I'll try to reproduce this case and investigate.

Ok, this seems to me to be a lack of optimization on our side.
Here is what is happening:

  • The user deletes one file, so the mtime of the directory changes
  • The user invokes dvc status
  • During get_checksum we do not find a state database entry (due to the deletion)
  • DVC invokes get_dir_checksum -> _collect_dir, which is supposed to collect the md5s of the files
  • Here, we submit a task for md5 calculation. Note that this method (get_file_checksum) does not check the state database, so effectively, each time we change a data directory, we calculate md5 for all files again and again.

We need to make DVC try retrieving single-file checksums from the state database, especially since during directory save we do fill in the single-file checksums.
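The proposed fix above can be sketched as a cache-first checksum function: consult the state store before submitting an md5 job, and only hash files whose (mtime, size) actually changed. The function name and the dict-based `state` store are illustrative stand-ins for DVC's real state database.

```python
import hashlib
import os

def checksum_with_cache(path, state):
    # Sketch of the proposed fix: look up the file in the state cache
    # before hashing. `state` is a hypothetical dict keyed by
    # (path, mtime, size); DVC's real cache is the SQLite state DB.
    st = os.stat(path)
    key = (path, st.st_mtime, st.st_size)
    if key in state:
        return state[key]  # cache hit: no re-hash needed
    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    state[key] = md5  # remember it for the next status run
    return md5
```

With this change, deleting one file out of 327,000 would invalidate only the directory-level checksum; the other files' md5s would all be cache hits.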

@pared Great investigation!

@BoysFight the fix should be merged and available in the next dvc version.
