Dvc: Possible bug related to re-computing md5 for directories

Created on 25 Feb 2019 · 6 Comments · Source: iterative/dvc

DVC 0.29.0 / ubuntu / pip

I believe you recently fixed a bug related to re-computing md5s for large files. There might be something similar happening again — or maybe I just need to better understand what triggers md5 to be computed.

$ dvc status
Computing md5 for a large directory project/data/images. This is only done once.
[##############################] 100% project/data/images

This doesn't happen every time I run dvc status, but it does happen at least every time I reboot the machine. I haven't 100% narrowed down what triggers it. Is this expected?

It's not super high priority — it only takes ~30 seconds to re-compute md5s for these directories, which is kind of surprisingly fast. Could it be caching file-level (image-level) md5s and then simply recomputing the directory-level md5?

bug

All 6 comments

Hi @colllin. Thanks for reporting. This is not the desired behaviour; there is a PR trying to tackle this one: #1526

> it only takes ~30 seconds to re-compute md5s for these directories, which is kind of surprisingly fast. Could it be caching file-level (image-level) md5s and then simply recomputing the directory-level md5?

Currently, besides the md5, we store the modification time and size of each file, and, for a given inode, we check whether the file's mtime or size has changed. If neither has, we assume that we do not need to recompute the md5. So yes, we are caching file-level md5s.
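To make the caching rule above concrete, here is a minimal sketch in Python. It is not DVC's actual state code: the `Md5State` class, its in-memory dictionary, and the `computed` counter are hypothetical names introduced only for illustration. It only shows the idea of keying a cached md5 by inode and reusing it while mtime and size are unchanged.

```python
import hashlib
import os


class Md5State:
    """Hypothetical in-memory stand-in for a state database (not DVC's API)."""

    def __init__(self):
        # inode -> (mtime_ns, size, md5 hex digest)
        self._entries = {}
        # Counts how many times a file was actually hashed.
        self.computed = 0

    def md5(self, path):
        st = os.stat(path)
        entry = self._entries.get(st.st_ino)
        # Reuse the cached digest only if both mtime and size are unchanged.
        if entry is not None and entry[0] == st.st_mtime_ns and entry[1] == st.st_size:
            return entry[2]

        # Otherwise hash the file in chunks and remember the result.
        digest = hashlib.md5()
        with open(path, "rb") as fobj:
            for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
                digest.update(chunk)
        hexdigest = digest.hexdigest()
        self._entries[st.st_ino] = (st.st_mtime_ns, st.st_size, hexdigest)
        self.computed += 1
        return hexdigest
```

With a cache like this, a second pass over unchanged files costs one `stat` call per file instead of a full read, which would be consistent with a directory of images being re-verified in seconds.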

For the record, looks like it can be reproduced this way:

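#!/bin/bash

set -x  # echo each command before running it
set -e  # stop at the first failing command

# Start from a fresh repository
rm -rf myrepo
mkdir myrepo
cd myrepo
git init
dvc init

git commit -m "init"

# Create a directory with 1000 small files and track it with DVC
mkdir dir
for i in $(seq 1 1000); do
    echo "$i" > "dir/$i"
done
dvc add dir
# The first status recomputes the directory md5; the second does not
dvc status
dvc status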

which produces

+ dvc add dir                                                   
Computing md5 for a large directory dir. This is only done once.
[##############################] 100% dir                       
Adding 'dir' to '.gitignore'.                                   
Saving 'dir' to cache '.dvc/cache'.                             
Linking directory 'dir'.                                        
[##############################] 100% dir                       
Saving information to 'dir.dvc'.                                

To track the changes with git run:                              

        git add .gitignore dir.dvc                              
+ dvc status                                                    
Computing md5 for a large directory dir. This is only done once.
[##############################] 100% dir                       
Pipeline is up to date. Nothing to reproduce.                   
+ dvc status                                                    
Pipeline is up to date. Nothing to reproduce.

We don't print a progress bar when verifying the cache for a directory, so it looks like there is something else that we've forgotten to update, which makes the first dvc status actually compute something once again.

@colllin sorry, I made a mistake, it seems there is something more to this case.
Thanks @efiop, I'll look into that.

@colllin it actually was a bug. After adding files we did not update the directory state, so at the next status we detected that the modification time of the directory had changed, and performed an update for the whole directory. The fix is in review. Thank you for pointing this one out!
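The general mechanism can be illustrated with a short Python sketch. This is an assumption-laden simulation, not DVC's code: `dir_state` and `simulate_stale_directory_state` are hypothetical helpers, and moving a file into a "cache" folder and copying it back merely stands in for whatever DVC does when it saves and links a directory. It relies only on the fact that creating or removing entries in a directory updates that directory's mtime.

```python
import os
import shutil
import tempfile
import time


def dir_state(path):
    """Return the (inode, mtime_ns) pair used here as the directory's state."""
    st = os.stat(path)
    return (st.st_ino, st.st_mtime_ns)


def simulate_stale_directory_state():
    """Return (stale, fresh): whether the early state went stale, and whether
    a state refreshed after the changes still matches."""
    with tempfile.TemporaryDirectory() as root:
        data = os.path.join(root, "dir")
        cache = os.path.join(root, "cache")
        os.mkdir(data)
        os.mkdir(cache)

        source = os.path.join(data, "1")
        with open(source, "w") as fobj:
            fobj.write("1\n")

        # State recorded too early, before the directory entries are touched.
        saved = dir_state(data)

        # Small pause so the mtime change is visible on coarse-grained clocks.
        time.sleep(0.05)

        # Stand-in for "save to cache, then link back": both steps change
        # the directory's entries and therefore its mtime.
        cached = os.path.join(cache, "1")
        os.replace(source, cached)
        shutil.copyfile(cached, source)

        stale = dir_state(data) != saved

        # The remedy described above: refresh the state after the changes.
        saved = dir_state(data)
        fresh = dir_state(data) == saved
        return stale, fresh
```

In this simulation the early snapshot no longer matches once the entries have been replaced, which is the kind of mismatch that would make a later status check treat the whole directory as modified.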

Amazingly fast fix. Thank you!!

@colllin Just a heads up: we've rolled back a faulty optimization in 0.41.0, so status might be slower, because it now needs to validate each file (no md5 computations though, everything is going to be pulled from the state db). We are working on a proper optimization patch right now, which should be ready this week. Just wanted to let you know, so there are no surprises again. :slightly_smiling_face: Sorry for the inconvenience.
