DVC: Computing file/dir hashes multiple times (when reportedly "only done once")

Created on 27 Aug 2020 · 22 comments · Source: iterative/dvc

Bug Report

Performed a dvc init --subdir in an examples/DL folder (part of a larger project),
configured to use a common shared cache.
Did not define a remote (yet).

Created examples/DL/dataset1 and examples/DL/dataset2 subdirectories
cd examples/DL/dataset1
ran 2 dvc imports (images.dvc and labels.dvc from dataset1.git) - remote A

cd examples/DL/dataset2
ran 2 dvc imports (images.dvc and labels.dvc from dataset2.git) - remote B

The files were already in cache... During each import I received the message "Computing file/dir hashes (only done once)". In total this took ~40 minutes (even though all the files were already in cache).

cd examples/DL
dvc pull (expecting it to say all files are up to date, since this is a new dvc repo not tracking anything other than the dvc imports)... But instead I get:
"Computing file/dir hashes (only done once)" - and another 40-minute wait.

dataset1.git and dataset2.git were unchanged between the import and the pull attempt

Repeating dvc pull gives the same behavior (computing hashes).

It appears that if you issue dvc pull in a project that has dvc imports, it ends up recomputing the hashes every time? This doesn't seem right - is there a way to prevent it?

dvc update works as I would expect... But this almost implies that you can't easily have a dvc project that tracks outputs (e.g. a created model) when you have dvc imports? That doesn't sound right either.

Verified that the dvc pull behavior is as expected in the dataset1 repo checkout

Also - I had set dvc config --global core.checksum_jobs 32.
top reveals a single core being used at ~100% - but this machine is virtually idle (24 cores total). I had assumed that option would perform the hash computation in parallel (there are certainly plenty of CPU and I/O resources available), but it didn't seem to have much effect, if any.
htop only revealed 3 threads/processes.

Please provide information about your setup

Output of dvc version:

DVC version: 1.6.1 (conda)
---------------------------------
Platform: Python 3.8.5 on Linux-4.15.0-64-generic-x86_64-with-glibc2.10
Supports: azure, gdrive, gs, hdfs, http, https, s3, ssh, oss
Cache types: hardlink, symlink
Repo: dvc (subdir), git

Additional Information (if any):

If applicable, please also provide a --verbose output of the command, eg: dvc add --verbose.

Labels: awaiting response, bug, p0-critical


All 22 comments

This is happening because .dvc/state that we use for optimization is not shared between projects, so each project needs to verify the cache files before it can be sure that it can use them. We don't have a common state db for the cache because it introduces concurrency issues as well as compatibility issues if one shared cache is used by multiple dvc versions (the cache itself is compatible though). That being said, we set read-only permissions on cache, so we could create a config option that looks something like:

dvc config cache.verify false

(we have a similar one for remotes right now)
that will make dvc trust the cache files if they are still read-only.
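The read-only trust check described here amounts to a simple permission-bit test. A minimal sketch of the idea (illustrative only - the function name is hypothetical and DVC's actual implementation differs):

```python
import os
import stat

# Any write bit set means the file may have been modified since caching.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH

def can_trust_cache_file(path):
    """Return True if a cache file can be trusted without rehashing,
    i.e. it is still read-only for everyone (hypothetical helper)."""
    mode = os.stat(path).st_mode
    return not (mode & WRITE_BITS)
```

This is the kind of check behind the "Assuming ... is unchanged since it is read-only" lines in the debug log further down.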

Let me see if I can provide a quick patch for this to unblock you...

@wdixon Oops, totally forgot that we already do that by default. So this actually raises more questions. Could you please show verbose log for one of the commands? E.g. dvc import. There might be something else that I'm not seeing.

The file names are quasi-sensitive - I have cut out large repetitive sections and replaced some file names...
Hopefully I didn't render the data useless to you.

2020-08-27 15:56:11,488 DEBUG: Check for update is enabled.
2020-08-27 15:56:11,548 DEBUG: Trying to spawn '['daemon', '-q', 'updater']'
2020-08-27 15:56:11,550 DEBUG: Spawned '['daemon', '-q', 'updater']'
2020-08-27 15:56:11,558 DEBUG: fetched: [(3,)]
2020-08-27 15:56:12,146 DEBUG: Removing output 'annotations' of stage: 'annotations.dvc'.
Importing 'annotations (https://github.build.company.com/datasets_A.git)' -> 'annotations'
2020-08-27 15:56:12,149 DEBUG: Computed stage: 'annotations.dvc' md5: 'c42aa44c1c69fcd9ef5ba323959ef1e2'
2020-08-27 15:56:12,149 DEBUG: 'md5' of stage: 'annotations.dvc' changed.
2020-08-27 15:56:12,149 DEBUG: Creating external repo https://github.build.company.com/dataset_A.git@None
2020-08-27 15:56:12,149 DEBUG: erepo: git clone 'https://github.build.company.com/auto-inspection/datasets_A.git' to a temporary dir
2020-08-27 15:56:12,675 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:12,676 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:12,677 DEBUG: Assuming '/projects/shared_dvc_cache/3d/3ca0c4a263ef118f48af7601a0c259' is unchanged since it is read-only
...
2020-08-27 15:56:13,805 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:13,809 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:13,811 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:13,812 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:13,814 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
...
2020-08-27 15:56:14,047 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:14,050 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:14,053 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:14,055 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:14,059 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
...
2020-08-27 15:56:14,061 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:14,064 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:14,067 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:14,068 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:14,069 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:16,347 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
...
2020-08-27 15:56:16,348 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:16,350 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:16,353 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:16,355 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:16,366 DEBUG: Uploading '../../../../../tmp/tmpkou9mx96' to '../../../shared_dvc_cache/.LavfxrrrRmfssWaDV9pJZj.tmp'
2020-08-27 15:56:16,378 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:16,379 DEBUG: Saving '../../../../../tmp/tmpaaophrzudvc-clone/annotations' to '../../../shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir'.
2020-08-27 15:56:16,380 DEBUG: Assuming '/projects/shared_dvc_cache/3d/3ca0c4a263ef118f48af7601a0c259' is unchanged since it is read-only
2020-08-27 15:56:16,381 DEBUG: Path '../../../shared_dvc_cache/3d/3ca0c4a263ef118f48af7601a0c259' inode '4004217241'
2020-08-27 15:56:16,381 DEBUG: fetched: [('1598457193713330944', '7166', '3d3ca0c4a263ef118f48af7601a0c259', '1598558137302253312')]
2020-08-27 15:56:16,382 DEBUG: Assuming '/projects/shared_dvc_cache/eb/c7819625ea551acb9f1200841aea5d' is unchanged since it is read-only
2020-08-27 15:56:16,382 DEBUG: Path '../../../shared_dvc_cache/eb/c7819625ea551acb9f1200841aea5d' inode '4009074115'
2020-08-27 15:56:16,383 DEBUG: fetched: [('1598457194545340160', '7532', 'ebc7819625ea551acb9f1200841aea5d', '1598558137306539264')]
2020-08-27 15:56:16,383 DEBUG: Assuming '/projects/shared_dvc_cache/c9/a99ebb8608a530a1b3e8f0fbd7a1fd' is unchanged since it is read-only
2020-08-27 15:56:16,383 DEBUG: Path '../../../shared_dvc_cache/c9/a99ebb8608a530a1b3e8f0fbd7a1fd' inode '4007035085'
2020-08-27 15:56:16,384 DEBUG: fetched: [('1598457193905330944', '7154', 'c9a99ebb8608a530a1b3e8f0fbd7a1fd', '1598558137310568704')]
2020-08-27 15:56:16,384 DEBUG: Assuming '/projects/shared_dvc_cache/70/4f782a5e8523e4371fef3767a7cad5' is unchanged since it is read-only
2020-08-27 15:56:16,385 DEBUG: Path '../../../shared_dvc_cache/70/4f782a5e8523e4371fef3767a7cad5' inode '4002825734'
2020-08-27 15:56:16,385 DEBUG: fetched: [('1598457193741329920', '7495', '704f782a5e8523e4371fef3767a7cad5', '1598558137314484736')]
2020-08-27 15:56:16,385 DEBUG: Assuming '/projects/shared_dvc_cache/f7/f4a8934f84c94b82c3e227444ab888' is unchanged since it is read-only
2020-08-27 15:56:16,386 DEBUG: Path '../../../shared_dvc_cache/f7/f4a8934f84c94b82c3e227444ab888' inode '4012373233'
2020-08-27 15:56:16,386 DEBUG: fetched: [('1598457193931337216', '7510', 'f7f4a8934f84c94b82c3e227444ab888', '1598558137318408704')]
2020-08-27 15:56:16,387 DEBUG: Assuming '/projects/shared_dvc_cache/00/ce44328e6f935c1421c567a31f1cf9' is unchanged since it is read-only
2020-08-27 15:56:16,387 DEBUG: Path '../../../shared_dvc_cache/00/ce44328e6f935c1421c567a31f1cf9' inode '4016732671'
2020-08-27 15:56:16,387 DEBUG: fetched: [('1598457194573331968', '7440', '00ce44328e6f935c1421c567a31f1cf9', '1598558137322353152')]
2020-08-27 15:56:16,388 DEBUG: Assuming '/projects/shared_dvc_cache/13/f15d0b09fc6aa3d378e5e098bc7540' is unchanged since it is read-only
...
2020-08-27 15:56:17,099 DEBUG: fetched: [('1598457201203309056', '7269', '763dabf8c624620d3c45affdd74bb9a7', '1598558141903678720')]
2020-08-27 15:56:17,099 DEBUG: Assuming '/projects/shared_dvc_cache/cd/b3da4f0d94e276a2ab33ff246c68ff' is unchanged since it is read-only
2020-08-27 15:56:17,100 DEBUG: Path '../../../shared_dvc_cache/cd/b3da4f0d94e276a2ab33ff246c68ff' inode '4015608537'
2020-08-27 15:56:17,100 DEBUG: fetched: [('1598457200987309056', '7494', 'cdb3da4f0d94e276a2ab33ff246c68ff', '1598558141908837120')]
2020-08-27 15:56:17,101 DEBUG: Path '../../../shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' inode '4002893898'
2020-08-27 15:56:17,101 DEBUG: fetched: [('1598457151749455872', '74461', 'ade3ad80edaad12b61cd420ca15f88a2.dir', '1598558142072585472')]
2020-08-27 15:56:17,101 DEBUG: checking if 'annotations'('{'md5': 'ade3ad80edaad12b61cd420ca15f88a2.dir'}') has changed.
2020-08-27 15:56:17,101 DEBUG: 'annotations' doesn't exist.
2020-08-27 15:56:17,102 DEBUG: Assuming '/projects/shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' is unchanged since it is read-only
2020-08-27 15:56:17,102 DEBUG: Assuming '/projects/shared_dvc_cache/3d/3ca0c4a263ef118f48af7601a0c259' is unchanged since it is read-only
2020-08-27 15:56:17,102 DEBUG: Assuming '/projects/shared_dvc_cache/eb/c7819625ea551acb9f1200841aea5d' is unchanged since it is read-only
2020-08-27 15:56:17,103 DEBUG: Assuming '/projects/shared_dvc_cache/c9/a99ebb8608a530a1b3e8f0fbd7a1fd' is unchanged since it is read-only
2020-08-27 15:56:17,103 DEBUG: Assuming '/projects/shared_dvc_cache/70/4f782a5e8523e4371fef3767a7cad5' is unchanged since it is read-only
...
2020-08-27 15:56:17,143 DEBUG: Assuming '/projects/shared_dvc_cache/76/3dabf8c624620d3c45affdd74bb9a7' is unchanged since it is read-only
2020-08-27 15:56:17,143 DEBUG: Assuming '/projects/shared_dvc_cache/cd/b3da4f0d94e276a2ab33ff246c68ff' is unchanged since it is read-only
2020-08-27 15:56:17,143 DEBUG: Checking out 'annotations' with cache 'ade3ad80edaad12b61cd420ca15f88a2.dir'.
2020-08-27 15:56:17,144 DEBUG: Linking directory 'annotations'.
2020-08-27 15:56:17,144 DEBUG: checking if 'annotations/xxx.json'('{'md5': '3d3ca0c4a263ef118f48af7601a0c259'}') has changed.
2020-08-27 15:56:17,145 DEBUG: 'annotations/xxx.json' doesn't exist.
2020-08-27 15:56:17,148 DEBUG: Created 'symlink': ../../../shared_dvc_cache/3d/3ca0c4a263ef118f48af7601a0c259 -> annotations/xxx.json
2020-08-27 15:56:17,148 DEBUG: Path 'annotations/xxx.json' inode '4002056887'
2020-08-27 15:56:17,148 DEBUG: fetched: [('1598457194545340160', '7532', 'ebc7819625ea551acb9f1200841aea5d', '1598558137305811968')]
...
2020-08-27 15:56:17,218 DEBUG: checking if 'annotations/yyy.json'('{'md5': 'd8f642b02fc3180a8a71a007de1bc3cd'}') has changed.
2020-08-27 15:56:17,218 DEBUG: 'annotations/yyy.json' doesn't exist.
2020-08-27 15:56:17,220 DEBUG: Created 'symlink': ../../../shared_dvc_cache/d8/f642b02fc3180a8a71a007de1bc3cd -> annotations/yyy.json
2020-08-27 15:56:17,220 DEBUG: Path 'annotations/qqq.json' inode '4002762119'
2020-08-27 15:56:17,220 DEBUG: fetched: []
...
2020-08-27 15:56:18,536 DEBUG: Path 'annotations' inode '4002056885'
2020-08-27 15:56:18,560 DEBUG: Path 'annotations' inode '4002056885'
2020-08-27 15:56:18,560 DEBUG: fetched: []
2020-08-27 15:56:18,589 DEBUG: Path '/projects/home/wdixon/dvc_test2/annotations' inode '4002056885'
2020-08-27 15:56:18,589 DEBUG: fetched: [('535ae58f02da15d2665e99501402afbc', '3995292', 'ade3ad80edaad12b61cd420ca15f88a2.dir', '1598558178560854784')]
2020-08-27 15:56:18,590 DEBUG: {'annotations': 'modified'}
2020-08-27 15:56:18,616 DEBUG: Path '/projects/home/wdixon/dvc_test2/annotations' inode '4002056885'
2020-08-27 15:56:18,617 DEBUG: fetched: [('535ae58f02da15d2665e99501402afbc', '3995292', 'ade3ad80edaad12b61cd420ca15f88a2.dir', '1598558178589778944')]
2020-08-27 15:56:18,618 DEBUG: Computed stage: 'annotations.dvc' md5: 'e20c94ac9431e2098136243dc90ef900'
2020-08-27 15:56:18,621 DEBUG: Saving 'annotations' to '../../../shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir'.
2020-08-27 15:56:18,622 DEBUG: Assuming '/projects/shared_dvc_cache/3d/3ca0c4a263ef118f48af7601a0c259' is unchanged since it is read-only
2020-08-27 15:56:18,622 DEBUG: Removing '/projects/home/wdixon/dvc_test2/annotations/zzz.json'
2020-08-27 15:56:18,625 DEBUG: Created 'symlink': ../../../shared_dvc_cache/3d/3ca0c4a263ef118f48af7601a0c259 -> annotations/zzz.json
...
2020-08-27 15:56:21,499 DEBUG: Path '../../../shared_dvc_cache/cd/b3da4f0d94e276a2ab33ff246c68ff' inode '4015608537'
2020-08-27 15:56:21,499 DEBUG: fetched: [('1598457200987309056', '7494', 'cdb3da4f0d94e276a2ab33ff246c68ff', '1598558177100770048')]
2020-08-27 15:56:21,537 DEBUG: Path 'annotations' inode '4002056885'
2020-08-27 15:56:21,559 DEBUG: Path 'annotations' inode '4002056885'
2020-08-27 15:56:21,559 DEBUG: fetched: [('535ae58f02da15d2665e99501402afbc', '3995292', 'ade3ad80edaad12b61cd420ca15f88a2.dir', '1598558178617431552')]
2020-08-27 15:56:21,560 DEBUG: Path '../../../shared_dvc_cache/ad/e3ad80edaad12b61cd420ca15f88a2.dir' inode '4002893898'
2020-08-27 15:56:21,560 DEBUG: fetched: [('1598457151749455872', '74461', 'ade3ad80edaad12b61cd420ca15f88a2.dir', '1598558177101371904')]
2020-08-27 15:56:21,565 DEBUG: Saving information to 'annotations.dvc'.
2020-08-27 15:56:21,571 DEBUG: fetched: [(2712,)]
2020-08-27 15:56:21,583 DEBUG: Analytics is enabled.
2020-08-27 15:56:21,585 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpqenvu_hb']'
2020-08-27 15:56:21,587 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpqenvu_hb']'

This was one of the smaller datasets - and here I just imported the JSON data... While the dvc import of the larger datasets took quite a bit of time (even though they are cached), the real concern is that a dvc pull causes the recomputing of the hashes.
In total this takes nearly 40 minutes.

@wdixon I'm not seeing a 40-minute delay in that log though. So I suppose you really meant pull. Could you show dvc pull -v output, please?

Btw, what remote are you using? Could you show dvc remote list output too?

@wdixon I'm not seeing a 40-minute delay in that log though. So I suppose you really meant pull. Could you show dvc pull -v output, please?

I believe the total time for all 4 of the dvc imports was close to 40 min (I just gave you the log for the smallest one, which had only some small JSON files, so the hashes were much faster). I kicked off the pull; will send you the log.

dvc remote list

I had not set a remote yet on the project that I was importing into....
dvc remote list returns nothing.

The datasets, as well as the repo I am importing into, are all configured to use the same local cache_dir.
The 4 imports are from 2 different remotes:

origin s3://dvc-bucket/datasets/dataset_A
origin s3://dvc-bucket/datasets/dataset_B

These point to a MinIO server running on the LAN over a 10Gb network.

And I should point out that nearly all the elapsed time (from the import or pull) is spent on the "Computing file/dir hashes (only done once)" progress bar.

Attached is the log.... Elapsed time for the pull was 33 minutes.

dvc_pull.log

Let me know if there is something else you might need.

This is happening because .dvc/state that we use for optimization is not shared between projects, so each project needs to verify the cache files before it can be sure that it can use them. We don't have a common state db for the cache because it introduces concurrency issues as well as compatibility issues if one shared cache is used by multiple dvc versions (the cache itself is compatible though). That being said, we set read-only permissions on cache, so we could create a config option that looks something like:

dvc config cache.verify false

I found a .dvc/tmp/state... I am not clear on the notion of sharing state. If project_A.image is imported into project_B, aren't you talking about recording the state of the imported files within project_B? That is, I can understand recomputing on the import, but once computed and stored in your state db for project_B, why would that be any different from performing the pull from within project_A - which doesn't recompute the hashes?

@wdixon I think I found it. We are not saving the pulled hashes for dvc imported stages. Investigating further... Thank you so much for the feedback, this is a really good catch! :pray:

Wonderful! - look forward to the update. And thank you for being so responsive!

Ok, indeed, we are accidentally using StateNoop when saving hashes for imported data: https://github.com/iterative/dvc/blob/d5fcf7377d8687606bcc163bef0623e23ef71dd7/dvc/data_cloud.py#L101 . We should really be using cache.local.tree.state instead. The confusion there is also because that _save_pulled_checksums helper really belongs in CacheLocal, not DataCloud. Preparing a PR.

Another thing is that we don't seem to use pre-recorded hashes from the dvcfiles in get_hash in DvcTree. Looking into that as well.

@wdixon Created a WIP PR https://github.com/iterative/dvc/pull/4486 . If you'd like to give it a try please uninstall dvc from conda and install it with

pip install git+https://github.com/efiop/dvc@fix-4485

Please make sure that $ dvc version shows that it is installed by pip and the version is 1.6.3+828d1a.

I'll get back to it tomorrow, need to add better tests.

The PR appears to work very nicely.
I recreated a project using the same 4 dvc imports... In total it took ~3 min (was ~40 min)
The dvc pull from the imported project now runs in ~10 sec! (was 33 minutes)
Thank you very much - this dramatically improves the usability.

I am good with closing the issue - but will leave that to you - maybe once the release is finalized?

@wdixon Glad to hear it works for you! Thank you so much for the feedback!

I am good with closing the issue - but will leave that to you - maybe once the release is finalized?

It will be automatically closed once the PR is merged. I just need to get to adding tests for it first and then we'll do a release the same day. ETA is until Monday.

I was poking around the dvc repo a bit - and you may be able to get a ~10% boost in md5 performance by reducing the memory allocation for each chunk: read bytes from the file into a reusable buffer, such as:

import io
from hashlib import md5

def md5sum(filename, buf_size):
    m = md5()
    buf = bytearray(buf_size)  # reusable buffer; avoids allocating per chunk
    with io.open(filename, "rb", buffering=0) as f:
        n_bytes = f.readinto(buf)
        while n_bytes > 0:
            if n_bytes == buf_size:
                m.update(buf)
            else:
                m.update(buf[0:n_bytes])  # final partial chunk
            n_bytes = f.readinto(buf)
    return m.hexdigest()
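One further tweak along the same lines: wrapping the buffer in a memoryview lets the final partial chunk be hashed without the copy that buf[0:n_bytes] makes (a hypothetical variant, not benchmarked here):

```python
import io
from hashlib import md5

def md5sum_mv(filename, buf_size=512 * 1024):
    m = md5()
    buf = bytearray(buf_size)   # reusable buffer
    view = memoryview(buf)      # zero-copy slicing over the buffer
    with io.open(filename, "rb", buffering=0) as f:
        n_bytes = f.readinto(buf)
        while n_bytes > 0:
            m.update(view[:n_bytes])
            n_bytes = f.readinto(buf)
    return m.hexdigest()
```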

And two other thoughts, mostly based on the observation that the machine was essentially idle while calculating hashes - without knowing how much the physical I/O is limiting throughput (though in my case I believe it was not a limiting factor):

1) You could potentially provide a C routine that releases the GIL and performs the entire file read and md5 calculation outside it, giving you much higher thread parallelization.

2) Utilize process parallelization to improve parallelism - this would not require a custom C routine, but involves a few more platform-specific details to get the best performance.
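For point 2, a rough sketch of what process-level parallelism could look like with the standard library (function names are mine; this is not DVC code). On Linux the default fork start method keeps the per-worker overhead modest:

```python
import hashlib
from concurrent.futures import ProcessPoolExecutor

def _md5_file(path):
    # Hash one file in 512K chunks; runs inside a worker process.
    m = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(512 * 1024), b""):
            m.update(chunk)
    return path, m.hexdigest()

def md5_many(paths, jobs=4):
    # Each worker hashes whole files, so the GIL never serializes the work;
    # the trade-off is process startup and result-pickling overhead.
    with ProcessPoolExecutor(max_workers=jobs) as pool:
        return dict(pool.map(_md5_file, paths))
```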

Anyway - thank you so much for the quick responses

@wdixon Those are great points! I suppose you've found https://github.com/iterative/dvc/blob/4013256c917c7275b222c22c08365639349f23bf/dvc/utils/__init__.py#L45 that we use? Have you tried tinkering with it? If you did, maybe consider submitting a PR. We'll be happy to help too.

Btw, 1.6.4 will be out in a few minutes. Thanks again for all the feedback! :pray:

Well - i took a crack at a standalone C extension using this reference md5 implementation... http://openwall.info/wiki/people/solar/software/public-domain-source-code/md5

Much to my surprise, hashlib beat the C implementation in some cases.
On my system, 512K seemed like the optimal buffer size. Consulting the hashlib documentation, it does seem they explicitly release the GIL when the buffer is larger than 2047 bytes.

I suspect the reference implementation I used is not the fastest, but it does show that with an appropriate buffer size the hashlib performance is quite reasonable and the GIL isn't a substantial bottleneck.

Here are results computed using a 6MB reference file - for 12 threads and 4 threads:

1000 hashes using 12 threads
c version[16k]: 2.69s
python version[16k]: 7.75s
1000 hashes using 12 threads
c version[32k]: 2.19s
python version[32k]: 4.66s
1000 hashes using 12 threads
c version[64k]: 2.09s
python version[64k]: 2.78s
1000 hashes using 12 threads
c version[128k]: 1.83s
python version[128k]: 1.83s
1000 hashes using 12 threads
c version[512k]: 1.78s
python version[512k]: 1.67s
1000 hashes using 12 threads
c version[1024k]: 1.95s
python version[1024k]: 2.71s

=======================

1000 hashes using 4 threads
c version[16k]: 5.12s
python version[16k]: 6.74s
1000 hashes using 4 threads
c version[32k]: 4.69s
python version[32k]: 4.46s
1000 hashes using 4 threads
c version[64k]: 4.35s
python version[64k]: 3.78s
1000 hashes using 4 threads
c version[128k]: 4.30s
python version[128k]: 3.66s
1000 hashes using 4 threads
c version[512k]: 4.26s
python version[512k]: 3.57s
1000 hashes using 4 threads
c version[1024k]: 4.54s
python version[1024k]: 4.54s
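Numbers like the above can be gathered with a small harness along these lines (a sketch of the method, not the exact script used; buffer sizes and thread counts correspond to the runs shown):

```python
import hashlib
import threading
import time

def _hash_repeatedly(path, buf_size, n):
    # Re-hash the same file n times through a reusable buffer.
    buf = bytearray(buf_size)
    view = memoryview(buf)
    for _ in range(n):
        m = hashlib.md5()
        with open(path, "rb", buffering=0) as f:
            read = f.readinto(buf)
            while read:
                m.update(view[:read])  # hashlib releases the GIL for large buffers
                read = f.readinto(buf)

def bench(path, buf_size, n_hashes=1000, n_threads=12):
    # Split n_hashes across n_threads and time the whole batch.
    threads = [
        threading.Thread(
            target=_hash_repeatedly, args=(path, buf_size, n_hashes // n_threads)
        )
        for _ in range(n_threads)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start
```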

