Context is here:
Our data set is an OCR data set with more than 100,000 small images; the total size is about 200 MB. While tracking this data set with DVC we ran into the following problems:
It took a lot of time to add the data set for tracking.
Very slow upload.
Very slow download.
Updating/deleting/adding just one image in the data set causes DVC to recompute a lot of things: hashes, etc.
Sample script that seems to reproduce the user's problem:
#! /bin/bash
rm -rf storage repo
mkdir storage repo
mkdir repo/data
for i in {1..100000}
do
echo ${i} >> repo/data/${i}
done
cd repo
git init
dvc init
dvc remote add -d storage ../storage
dvc add data
dvc commit data.dvc
git add .gitignore data.dvc
git commit -am "init"
dvc push
dvc unprotect data
echo update >> data/update
dvc add data
After adding the update, the MD5 computation for the large directory is retriggered.
@pared when you unprotect, all files inside data are copied, so DVC doesn't have entries for those files in the State DB, hence the recomputation.
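To illustrate why that matters, here is a minimal sketch of the kind of checksum cache being bypassed (hypothetical table layout and helper names, not DVC's actual State DB schema): checksums are keyed by (inode, mtime, size), so when unprotect copies the files those keys change, every lookup misses, and every hash has to be recomputed.

import hashlib
import os
import sqlite3

# Hypothetical cache keyed by (inode, mtime, size). Copying a file during
# unprotect changes its inode and mtime, so the lookup misses and the MD5
# has to be recomputed from scratch.
db = sqlite3.connect("state.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS state (inode INTEGER, mtime TEXT, size INTEGER, md5 TEXT)"
)

def cached_md5(path):
    st = os.stat(path)
    key = (st.st_ino, str(st.st_mtime), st.st_size)
    row = db.execute(
        "SELECT md5 FROM state WHERE inode = ? AND mtime = ? AND size = ?", key
    ).fetchone()
    if row:  # cache hit: the file looks unchanged since the last hash
        return row[0]
    with open(path, "rb") as fobj:
        md5 = hashlib.md5(fobj.read()).hexdigest()
    db.execute("INSERT INTO state VALUES (?, ?, ?, ?)", key + (md5,))
    return md5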
After some research, I think we can remove the _binary_ heuristic in file_md5. Here's how to reproduce it:
Create 100000 files with 3KiB of random content and 10 files with 3GiB of random content:
mkdir data
for file in {0..100000}; do
dd if=/dev/urandom of=data/$file bs=3K count=1
done
mkdir other_data
for file in {0..10}; do
dd if=/dev/urandom of=other_data/$file bs=100M count=30
done
Results ordered from fastest to slowest operation:
md5sum:
time md5sum * > /dev/null
# Files: 100,000
# Size: 3 KiB
#
# real 0m1.568s
# user 0m1.047s
# sys 0m0.496s
# Files: 10
# Size: 3 GiB
#
# real 7m15.048s
# user 1m47.444s
# sys 0m23.405s
git lfs track "data/**" && git add:
# git lfs track "data/**"
time git add data
# Files: 100,000
# Size: 3 KiB
#
# real 1m23.442s
# user 0m39.889s
# sys 0m30.874s
hashlib, reading the whole file:
time python -c "
import os
import hashlib
for file in os.listdir():
with open(file, 'rb') as fobj:
print(file, hashlib.md5(fobj.read()).hexdigest())
" > /dev/null
# Files: 100,000
# Size: 3 KiB
#
# real 0m2.581s
# user 0m1.900s
# sys 0m0.643s
# Files: 10
# Size: 3 GiB
#
# real 8m46.469s
# user 1m2.379s
# sys 0m49.283s
hashlib, reading with chunks:
time python -c "
import os
import hashlib
LOCAL_CHUNK_SIZE = 1024 * 1024
for file in os.listdir():
hash = hashlib.md5()
with open(file, 'rb') as fobj:
while True:
data = fobj.read(LOCAL_CHUNK_SIZE)
if not data:
break
hash.update(data)
print(file, hash.hexdigest())
" > /dev/null
# Files: 100,000
# Size: 3 KiB
#
# real 0m2.753s
# user 0m1.932s
# sys 0m0.802s
# Files: 10
# Size: 3 GiB
#
# real 7m57.565s
# user 1m53.423s
# sys 0m21.162s
hashlib, reading with chunks + CRLF:
time python -c "
import os
import hashlib
LOCAL_CHUNK_SIZE = 1024 * 1024
for file in os.listdir():
hash = hashlib.md5()
with open(file, 'rb') as fobj:
while True:
data = fobj.read(LOCAL_CHUNK_SIZE)
if not data:
break
chunk = data.replace(b'\r\n', b'\n')
hash.update(chunk)
print(file, hash.hexdigest())
" > /dev/null
# Files: 100,000
# Size: 3 KiB
#
# real 0m2.986s
# user 0m2.322s
# sys 0m0.644s
# Files: 10
# Size: 3 GiB
#
# real 7m26.300s
# user 2m34.908s
# sys 0m23.551s
hashlib, reading with chunks + CRLF + binary optimization:
time python -c "
import os
import hashlib
from dvc.istextfile import istextfile
LOCAL_CHUNK_SIZE = 1024 * 1024
for file in os.listdir():
hash = hashlib.md5()
binary = not istextfile(file)
with open(file, 'rb') as fobj:
while True:
data = fobj.read(LOCAL_CHUNK_SIZE)
if not data:
break
if binary:
chunk = data
else:
chunk = data.replace(b'\r\n', b'\n')
hash.update(chunk)
print(file, hash.hexdigest())
" > /dev/null
# Files: 100,000
# Size: 3 KiB
#
# real 0m7.610s
# user 0m6.028s
# sys 0m1.528s
# Files: 10
# Size: 3 GiB
#
# real 7m44.754s
# user 1m53.498s
# sys 0m17.882s
file_md5:
time python -c "
import os
from dvc.utils import file_md5
for file in os.listdir():
print(file, file_md5(file)[0])
" > /dev/null
# Files: 100,000
# Size: 3 KiB
#
# real 0m8.927s
# user 0m7.092s
# sys 0m1.768s
# Files: 10
# Size: 3 GiB
#
# real 7m40.479s
# user 2m7.392s
# sys 0m21.710s
Also, file_md5 is returning both the hexdigest and the digest, but we only use the hexdigest across the code base, so we can remove the digest.
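For illustration, a minimal sketch of what a simplified file_md5 could look like with the istextfile heuristic dropped, the CRLF normalization applied unconditionally (the "chunks + CRLF" variant benchmarked above), and only the hexdigest returned; this is an assumed simplification, not DVC's actual implementation:

import hashlib

LOCAL_CHUNK_SIZE = 1024 * 1024  # same 1 MiB chunk size as in the snippets above

def file_md5_simplified(path):
    # No istextfile() call: every file gets the same dos2unix treatment,
    # and only the hexdigest is returned.
    md5 = hashlib.md5()
    with open(path, 'rb') as fobj:
        while True:
            data = fobj.read(LOCAL_CHUNK_SIZE)
            if not data:
                break
            md5.update(data.replace(b'\r\n', b'\n'))
    return md5.hexdigest()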
Extracts from https://stackoverflow.com/questions/56035696/version-control-for-machine-learning-data-set-with-large-amount-of-images
@shcheklein, dvc add and dvc push took about 2 hours with a 30 Mb upload speed.
Do we know DVC's bottlenecks? Is it possible to give the user an estimate of how long uploading/downloading is going to take, depending on the number of operations and the internet speed?
Hi @shcheklein, if I understood you correctly: we are using Google Cloud Storage as the remote for DVC, with one bucket. The total number of files exceeds 100,000, the total size on disk is 229 MB, and the average file size is about 1.3 KB. Our upload speed is 30 Mb and the download speed is also 30 Mb. I checked uploading our dataset to a similar Google Storage bucket without DVC and it took about 25 minutes.
The problem is related to Google Cloud Storage; it needs more research into why it is taking 5x longer with DVC.
Computing checksums shouldn't take more than a couple of minutes in the worst-case scenario, so there must be another operation hanging the process for such a long time.
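For a rough sense of why 100,000 tiny objects can take far longer than the raw bandwidth suggests, here is a back-of-the-envelope estimate; only the dataset size and link speed come from the thread (and reading "30mb" as 30 Mbit/s is itself an assumption), while the per-request overhead and worker count are purely illustrative guesses:

# Back-of-the-envelope estimate with assumed, illustrative numbers.
total_mb = 229           # dataset size reported above, MB
bandwidth_mbit = 30      # reported link speed, assumed to mean Mbit/s
files = 100_000
per_request_s = 0.1      # assumed per-object request overhead
workers = 4              # assumed number of parallel upload jobs

transfer_s = total_mb * 8 / bandwidth_mbit     # ~61 s of raw data transfer
overhead_s = files * per_request_s / workers   # ~2500 s (~42 min) of request overhead
print(f"raw transfer: {transfer_s / 60:.1f} min, per-object overhead: {overhead_s / 60:.1f} min")

Even with these guesses, per-object overhead dwarfs the raw transfer time, so any extra per-file API calls a tool makes are multiplied by 100,000.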
Need to re-test this with all the new performance patches that have come over the last weeks and see if there is any improvement.
Yep, it would be great to see if the GCS issue is fixed.
So, here is the run with some comments:
3 KB x 100k files
$ time dvc add data # comp md5s
real 1m33.184s
user 1m11.137s
sys 0m27.597s
# A long delay after pbar done and nothing happens
$ time dvc add data # create unpacked dir
real 1m4.460s
user 0m53.056s
sys 0m10.855s
# A long delay before and especially after pbar
$ time dvc add data # 3rd time, still slow
real 0m37.932s
user 0m31.009s
sys 0m6.715s
# A long delay at start before anything printed out
# All subsequent `dvc add`s take the same time
$ time dvc commit data.dvc
real 0m35.182s # About the same as above
user 0m29.368s
sys 0m5.980s
$ time dvc push # to local
real 2m18.288s
user 1m57.271s
sys 1m1.700s
$ time dvc push # second time, nothing to push
real 0m56.521s
user 0m45.159s
sys 0m7.637s
$ time dvc pull # nothing to pull
real 0m57.129s
user 0m48.626s
sys 0m8.521s
# Checkout took the majority of the time
$ rm -rf .dvc/cache && rm -rf data && time dvc pull
real 4m7.259s
user 3m24.639s
sys 1m30.983s
# at the start of checkout pbar hangs at 0% for a while
$ echo update >> data/update && time dvc add data
real 1m35.354s
user 1m12.278s
sys 0m22.342s
# The time is the same as the initial add
Summary:
So bench runs for dirs:
| op | total | in | out | sleep |
|---------------|---------|------|-------|---------|
| add | 81.01 | 0.9 | 2.19 | 37.08 |
| add-2 | 9.4 | 2.08 | 1.92 | 3.86 |
| add-3 | 5.83 | 4.14 | 1.68 | 0 |
| commit-noop | 5.73 | 4.23 | 1.49 | 0 |
| checkout-noop | 5.95 | 0.55 | 1.5 | 0.4 |
| checkout-full | 52.17 | 0.57 | 2.35 | 1.42 |
| push | 45.32 | 1.03 | 1.8 | 0.13 |
| push-noop | 46.88 | 0.97 | 2.17 | 0.87 |
| pull-noop | 10.23 | 0.96 | 1.52 | 0.46 |
| pull | 162.65 | 0.5 | 2.19 | 44.28 |
| add-modified | 62.64 | 1.76 | 2.32 | 56.29 |
| op | total | in | out | sleep |
|---------------|---------|-------|-------|---------|
| add | 222.77 | 1.57 | 2.02 | 141.7 |
| add-2 | 75.97 | 20.71 | 2.46 | 39.42 |
| add-3 | 42.7 | 40.43 | 2.27 | |
| commit-noop | 40.55 | 39.06 | 1.49 | |
| checkout-noop | 43.42 | 1.83 | 2.03 | 4.31 |
| checkout-full | 124.17 | 1.62 | 2.72 | 13.22 |
| push | 232.11 | 4.33 | 2.45 | 0.25 |
| push-noop | 145.57 | 4.6 | 1.86 | 0.76 |
| pull-noop | 85.2 | 4.49 | 1.85 | 4.41 |
| pull | 457.27 | 0.46 | 1.95 | 98.57 |
| add-modified | 204.89 | 22.69 | 2.96 | 158 |
| op | total | in | out | sleep |
|---------------|---------|-------|-------|---------|
| add | 98.16 | 7.38 | 3.37 | 51.67 |
| add-2 | 6.03 | 3.52 | 2.52 | 0 |
| add-3 | 5.83 | 3.54 | 2.29 | 0 |
| commit-noop | 5.65 | 3.48 | 2.17 | 0 |
| checkout-noop | 3.26 | 1.12 | 2.14 | |
| checkout-full | 49.81 | 46.04 | 3.77 | 0 |
| push | 34.84 | 1.2 | 3.53 | 1.29 |
| push-noop | 45.24 | 1.14 | 2.71 | 40.88 |
| pull-noop | 46.08 | 1.12 | 2.34 | 38.63 |
| pull | 100.2 | 1.08 | 3.47 | 56.4 |
| add-modified | 141.81 | 1.15 | 2.8 | 57.16 |
| op | total | in | out | sleep |
|---------------|---------|-------|-------|---------|
| add | 243.38 | 9.58 | 3.44 | 144.25 |
| add-2 | 35.13 | 31.96 | 3.16 | |
| add-3 | 36.76 | 33.06 | 3.69 | 0 |
| commit-noop | 31.33 | 28.44 | 2.89 | 0 |
| checkout-noop | 5.52 | 2.66 | 2.86 | 0 |
| checkout-full | 98.9 | 88.35 | 10.55 | |
| push | 97.57 | 1.89 | 9.83 | 13.7 |
| push-noop | 135.25 | 1.61 | 2.78 | 116.21 |
| pull-noop | 131.7 | 1.56 | 3.26 | 82.1 |
| pull | 198.3 | 1.28 | 6.75 | 109.82 |
| add-modified | 365.22 | 4.48 | 3.2 | 157.99 |
| op | total | in | out | sleep |
|---------------|---------|------|-------|---------|
| add | 81.99 | 1.23 | 3.56 | 36.41 |
| add-2 | 9.81 | 2.91 | 2.63 | 3.13 |
| add-3 | 6.96 | 4.21 | 2.74 | 0 |
| commit-noop | 6.67 | 4.16 | 2.5 | |
| checkout-noop | 3.88 | 1.16 | 2.52 | 0.19 |
| checkout-full | 52.8 | 1.11 | 3.82 | 1.31 |
| push | 46.7 | 1.67 | 3.22 | 0.14 |
| push-noop | 47.87 | 1.55 | 3.36 | 0.77 |
| pull-noop | 7.78 | 1.45 | 2.55 | 1.79 |
| pull | 155.81 | 1.14 | 3.75 | 43.44 |
| add-modified | 63.79 | 2.47 | 3.85 | 55.4 |
| op | total | in | out | sleep |
|---------------|---------|-------|-------|---------|
| add | 220.51 | 1.71 | 3.33 | 137.64 |
| add-2 | 67.87 | 21.03 | 2.71 | 32.8 |
| add-3 | 40.55 | 37.8 | 2.75 | 0 |
| commit-noop | 37.08 | 34.04 | 3.04 | 0 |
| checkout-noop | 7.27 | 2.57 | 2.49 | 2.21 |
| checkout-full | 124.94 | 2.32 | 4.32 | 13.85 |
| push | 223.59 | 5.14 | 3.54 | 0.27 |
| push-noop | 147.53 | 5.01 | 3.54 | 0.58 |
| pull-noop | 48.35 | 4.92 | 2.86 | 18.59 |
| pull | 440.08 | 1.15 | 3.21 | 96.96 |
| add-modified | 201.93 | 21.99 | 3.57 | 154.42 |
And totals only in bar charts:
Some takeaways:
I saved all the output with timestamps, so it can be analyzed to see where we have sleeps and slow ins and outs.
Another thing is that this was tested with cache type copy only.
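A minimal sketch of the kind of gap analysis that timestamped output allows (this assumes each logged line starts with an ISO-format timestamp; it is not the actual bench script):

import sys
from datetime import datetime

# Flag gaps ("sleeps") longer than a threshold between consecutive
# timestamped output lines read from stdin.
THRESHOLD = 2.0  # seconds
prev = None
for line in sys.stdin:
    try:
        ts = datetime.fromisoformat(line.split()[0])
    except (IndexError, ValueError):
        continue  # skip lines without a leading timestamp
    if prev is not None and (ts - prev).total_seconds() > THRESHOLD:
        print(f"sleep of {(ts - prev).total_seconds():.1f}s before: {line.rstrip()}")
    prev = ts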
@Suor do you still have your test scripts? It would be interesting to see what's up with it right now, since we've introduced a lot of optimizations in 1.0. Though, probably dvc-bench is enough.
Ok, closing for now as stale. We've introduced lots of push/pull/fetch/status/add optimizations for directories since the ticket was opened.
I guess the new benches are also run for the old code, so we can compare.