The dvc pull command spends time (a significant amount, it seems) calculating its own MD5 hashes. It would be nice if there were an option to tell it to trust the remote and skip this step so that it completes faster. Or it could trust the remote by default and make you opt in to recalculating the hashes if that is what you really want. It seems unlikely that flipping the default behavior here would be a bad breaking change. Or maybe we just don't need dvc pull to _ever_ calculate MD5 hashes. Thoughts?
Hi @BernMcCarty !
Sorry for the delay. This is a great request! It was designed to be a bit paranoid about checksums, but thinking about it right now, dvc has been verifying checksums before pushing for a long time now. Plus, all of the remotes are pretty reliable and do consistency checks while uploading/downloading chunks of data. Overall, it seems pretty reasonable to me to adopt that as a default behaviour.
To accomplish that, we could make RemoteLOCAL.pull (or somewhere around that) push md5 entries into our state db (see self.state.save(path_info, checksum)) after it has successfully downloaded a cache file.
For the record: we currently have some issues with gdrive and cache corruption, so we treat gdrive remotes as untrusted by default.
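A rough sketch of the idea above, not dvc's actual code: after a cache file is downloaded, its known checksum is written straight into the state db instead of being recomputed, unless the remote is marked untrusted. The State class, the pull function, the trust_remote flag and the flat cache layout are all hypothetical stand-ins for illustration; only the state.save(path, checksum) call shape comes from the comment above.

```python
import hashlib
import os
import shutil
import tempfile


def file_md5(path):
    """Hash a file in chunks; this is the expensive step we want to skip."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


class State:
    """Hypothetical stand-in for the state db: path -> (mtime, size, md5)."""

    def __init__(self):
        self._entries = {}

    def save(self, path, checksum):
        st = os.stat(path)
        self._entries[path] = (st.st_mtime_ns, st.st_size, checksum)

    def get(self, path):
        """Return the saved checksum, or None if the file changed since save."""
        entry = self._entries.get(path)
        if entry is None:
            return None
        mtime, size, checksum = entry
        st = os.stat(path)
        if (st.st_mtime_ns, st.st_size) != (mtime, size):
            return None
        return checksum


def pull(remote_dir, cache_dir, checksums, state, trust_remote=True):
    """Copy cache files named by checksum; return how many were re-hashed."""
    os.makedirs(cache_dir, exist_ok=True)
    hashed = 0
    for checksum in checksums:
        src = os.path.join(remote_dir, checksum)
        dst = os.path.join(cache_dir, checksum)
        shutil.copyfile(src, dst)
        if not trust_remote:
            # Untrusted remote (e.g. gdrive in the thread): verify locally.
            hashed += 1
            if file_md5(dst) != checksum:
                os.remove(dst)
                raise ValueError("corrupted cache file: " + checksum)
        # Record the checksum so later lookups do not need to hash the file.
        state.save(dst, checksum)
    return hashed
```

With trust_remote=True, a later state.get(path) returns the checksum without the file ever being hashed locally; with trust_remote=False, every file is hashed once on download and a mismatch removes the file and raises an error.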
@BernMcCarty
The fix for this has been merged to master; the next release should contain it (current is 0.82.1). If your remote is not Google Drive, it will be enabled by default.
NOTE: My personal benchmark: pull of 1k files, 10 MB each, local remote: improvement from 7 m 30 s to around 4 m.