Dvc: Small file changes cause md5 computation to take a long time

Created on 12 Jul 2019  ·  7 Comments  ·  Source: iterative/dvc

DVC version: 0.50.1
Python version: 3.7.1
Platform: Linux-4.15.0-52-generic-x86_64-with-debian-buster-sid
Binary: True
Cache: reflink - False, hardlink - True, symlink - True
Filesystem type (cache directory): ('ext4', '/dev/sda')
Filesystem type (workspace): ('ext4', '/dev/sda')

I have a large directory containing thousands of images, and I added the top-level directory to DVC. But when I add a small file to the image directory and run `dvc status`, computing the md5 takes a long time. Is this expected?

bug c5-half-a-day p1-important performance research

All 7 comments

Hi @BoysFight! Yes, this behaviour is expected.

Why is it happening?
The mechanism for calculating the md5 of a directory is based on the whole directory's contents. So if DVC detects that the modification time or size of the directory has changed, it forces a recalculation of the directory checksum. Does that answer your question?
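To make the change-detection step concrete, here is a minimal sketch of the kind of staleness test described above. The function name `dir_needs_rehash` and the saved-value parameters are hypothetical, not DVC's actual API; DVC keeps the saved mtime/size in its state database.

```python
import os

def dir_needs_rehash(path, saved_mtime, saved_size):
    # Hypothetical sketch of the staleness test: if the directory's
    # current mtime or size differs from what was recorded when the
    # checksum was last computed, the checksum must be recalculated.
    st = os.stat(path)
    return st.st_mtime != saved_mtime or st.st_size != saved_size
```

Note that on most filesystems a directory's mtime changes whenever an entry is added or removed, which is why deleting a single image triggers the recalculation.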

EDIT:
Does that answer your question? - might have sounded a bit passive-aggressive, sorry if it did, it was not meant to be :)
Some more info:
The thing is that currently our mechanism for calculating md5 relies on the content of the whole directory, and if anything inside has changed, we need to retrigger the whole calculation, even if the change is small. We should probably consider an alternate way of doing it, so that adding one file does not retrigger the whole directory md5 recalculation.
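The "relies on the content of the whole directory" idea can be sketched as follows: the directory checksum is derived from the sorted list of (relative path, file md5) pairs, so changing any one file changes the directory hash. This is an illustrative simplification, not DVC's exact on-disk format; the function names are made up for the example.

```python
import hashlib
import json
import os

def file_md5(path):
    # Hash one file's contents in chunks to bound memory use.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def dir_md5(root):
    # The directory hash is the md5 of a manifest listing every file
    # and its checksum, so a single changed, added, or deleted file
    # changes the directory checksum.
    entries = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            entries.append({"relpath": rel, "md5": file_md5(full)})
    entries.sort(key=lambda e: e["relpath"])
    manifest = json.dumps(entries, sort_keys=True).encode()
    return hashlib.md5(manifest).hexdigest()
```

With 327,000 files this is exactly the walk that becomes expensive if every per-file `file_md5` call is recomputed instead of being served from a cache.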

EDIT2:
Though the md5 recalculation covers the whole directory, the md5s of particular, already existing files should be obtained from the state database, which should be much faster than the calculation for an unadded directory.
@BoysFight, could you describe your case in more detail? Maybe there is a bug.

@BoysFight looks like a potential regression to me.

  1. It should run faster the second time after the change to the directory, since we don't calculate md5s again; we take them from the .dvc/state SQLite database. Can you check whether the time it takes is actually the same, to confirm that it runs the calculation again?
  2. We have a ticket https://github.com/iterative/dvc/issues/2093 to optimize access to the state database I mentioned - it should make status runs even faster. Please vote for the ticket so that we can prioritize it.
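For readers unfamiliar with the state database mentioned above: it is a SQLite file that maps a file's identity to its known checksum, so unchanged files can skip hashing entirely. The schema below is a hypothetical miniature for illustration, not DVC's real table layout.

```python
import sqlite3

# Hypothetical mini version of the .dvc/state cache: checksums keyed
# by (inode, mtime, size). If all three match, the file has not
# changed and its md5 can be reused without re-reading the file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE state ("
    "  inode INTEGER, mtime TEXT, size TEXT, md5 TEXT,"
    "  PRIMARY KEY (inode, mtime, size))"
)

def get_cached_md5(inode, mtime, size):
    row = conn.execute(
        "SELECT md5 FROM state WHERE inode=? AND mtime=? AND size=?",
        (inode, mtime, size),
    ).fetchone()
    return row[0] if row else None
```

A lookup like this is microseconds per file, versus milliseconds to re-read and hash a 100-500 KB image, which is where the hour-long `dvc status` run comes from when the cache is bypassed.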

Can you please share some details so we can reproduce it on our end:

  • number of files
  • average size of a file
  • time to calculate before, time to calculate after

@pared @shcheklein, Thank you very much for your patient reply.
The directory structure is as follows:

data/dataset
├── dir1
├── dir2
├── dir...
├── dir42
└── label.txt
42 directories, 1 file

There are about 327,000 files, each between 100 KB and 500 KB, under the dataset directory, and the total size is 54 GB.

After I add the dataset directory, the dvc status running time is 5.655 seconds:

DEBUG: fetched: [(327174,)]
Pipeline is up to date. Nothing to reproduce.
dvc status -v 4.26s user 2.11s system 112% cpu 5.655 total

Then I delete one image in dataset/dir1/, and the dvc status running time is 1:02:49.83:

data/dataset.dvc:
changed outs:
modified: data/dataset
src/retrain_inception_v3.dvc:
changed deps:
modified: data/dataset
dvc status -v 496.03s user 101.81s system 15% cpu 1:02:49.83 total

We have to wait about one hour after deleting or adding a single file in the directory.

@BoysFight thank you very much for the elaboration!
It seems to me that status should not take this much time. 300K files is a lot, but not THAT much. I'll try to reproduce this case and investigate.

Ok, this seems to me to be a lack of optimization on our side.
Here is what is happening:

  • The user deletes one file, so the mtime of the directory changes
  • The user invokes dvc status
  • During get_checksum we do not find a state database entry (due to the deletion)
  • DVC invokes get_dir_checksum -> _collect_dir, which is supposed to collect the md5s of the files
  • Here, we submit a task for md5 calculation. Note that this method (get_file_checksum) does not check the state database, so effectively, each time we change a data directory, we calculate md5 for all files again and again.

We need to make DVC try retrieving single-file checksums from the state database, especially since during directory save we do fill in the single-file checksums.
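The proposed fix above can be sketched as a cache-first checksum function: consult the state store before submitting an md5 job, and only hash files whose (mtime, size) actually changed. The function name and the dict-based `state` store are illustrative stand-ins for DVC's real state database.

```python
import hashlib
import os

def checksum_with_cache(path, state):
    # Sketch of the proposed fix: look up the file in the state cache
    # before hashing. `state` is a hypothetical dict keyed by
    # (path, mtime, size); DVC's real cache is the SQLite state DB.
    st = os.stat(path)
    key = (path, st.st_mtime, st.st_size)
    if key in state:
        return state[key]  # cache hit: no re-hash needed
    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    state[key] = md5  # remember it for the next status run
    return md5
```

With this change, deleting one file out of 327,000 would invalidate only the directory-level checksum; the other files' md5s would all be cache hits.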

@pared Great investigation!

@BoysFight the fix should be merged and available in the next dvc version.
