dvc: performance optimization for directories

Created on 8 May 2019 · 15 Comments · Source: iterative/dvc

Context is here:

https://stackoverflow.com/questions/56035696/version-control-for-machine-learning-data-set-with-large-amount-of-images

Our data set is an OCR data set with more than 100,000 small images; the total size is about 200 MB. Using DVC to track this data set, we ran into the following problems:

  • It took a lot of time to add the data set for tracking.
  • Very slow upload.
  • Very slow download.
  • Updating, deleting, or adding just one image in the data set causes DVC to recompute a lot of things: hashes, etc.
Labels: c5-half-a-day, p1-important, performance, research

Most helpful comment

Need to re-test this with all the new performance patches that have come over the last weeks and see if there is any improvement.

All 15 comments

Sample script that seems to reproduce the user's problem:

#! /bin/bash

rm -rf storage repo
mkdir storage repo
mkdir repo/data

for i in {1..100000}
do
  echo ${i} >> repo/data/${i}
done 

cd repo

git init 
dvc init

dvc remote add -d storage ../storage

dvc add data
dvc commit data.dvc
git add .gitignore data.dvc

git commit -am "init"
dvc push

dvc unprotect data
echo update  >> data/update
dvc add data

After adding `update`, the md5 computation for the whole directory is retriggered.

@pared when you unprotect, all files inside `data` are copied, so DVC no longer has entries for those files in the state DB, hence the recomputation.
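
To illustrate why the copies trigger recomputation: the state DB (roughly) keys saved checksums by inode, mtime, and size, so a freshly copied file matches no saved entry and has to be rehashed. A minimal sketch of the idea, with made-up names rather than DVC's actual schema or API:

import hashlib
import os

# Hypothetical in-memory "state db": maps (inode, mtime_ns, size) -> md5.
# DVC keeps something similar in an on-disk database; names here are
# invented for illustration only.
state = {}

def md5_with_state(path):
    st = os.stat(path)
    key = (st.st_ino, st.st_mtime_ns, st.st_size)
    if key in state:
        # Unchanged file: the checksum can be reused without reading the file.
        return state[key]
    md5 = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
            md5.update(chunk)
    state[key] = md5.hexdigest()
    return state[key]

# `dvc unprotect` replaces links with copies: every copy gets a new inode
# and mtime, so no saved key matches and every file gets rehashed.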

After some research, I think we can remove the _binary_ heuristic in file_md5; here's how to reproduce the measurements:

Create 100,000 files with 3 KiB of random content each and 10 files with ~3 GiB of random content each:

mkdir data

for file in {1..100000}; do
  dd if=/dev/urandom of=data/$file bs=3K count=1
done

mkdir other_data

for file in {1..10}; do
  dd if=/dev/urandom of=other_data/$file bs=100M count=30
done

Results ordered from fastest to slowest operation:

  • md5sum:
time md5sum * > /dev/null

# Files: 100,000
# Size:   3 KiB
#
# real    0m1.568s
# user    0m1.047s
# sys     0m0.496s


# Files:  10
# Size:   3 GiB
#
# real    7m15.048s
# user    1m47.444s
# sys     0m23.405s
  • git lfs track "data/**" && git add:
# git lfs track "data/**"
time git add data

# Files: 100,000
# Size:   3 KiB
#
# real    1m23.442s
# user    0m39.889s
# sys     0m30.874s
  • Python's hashlib reading the whole file:
time python -c "
import os
import hashlib

for file in os.listdir():
    with open(file, 'rb') as fobj:
        print(file, hashlib.md5(fobj.read()).hexdigest())
" > /dev/null

# Files: 100,000
# Size:   3 KiB
#
# real    0m2.581s
# user    0m1.900s
# sys     0m0.643s


# Files:  10
# Size:   3 GiB
#
# real    8m46.469s
# user    1m2.379s
# sys     0m49.283s
  • Python's hashlib reading with chunks:
time python -c "
import os
import hashlib

LOCAL_CHUNK_SIZE = 1024 * 1024

for file in os.listdir():
    hash = hashlib.md5()

    with open(file, 'rb') as fobj:
        while True:
            data = fobj.read(LOCAL_CHUNK_SIZE)

            if not data:
                break

            hash.update(data)

    print(file, hash.hexdigest())
" > /dev/null

# Files: 100,000
# Size:   3 KiB
#
# real    0m2.753s
# user    0m1.932s
# sys     0m0.802s


# Files:  10
# Size:   3 GiB
#
# real    7m57.565s
# user    1m53.423s
# sys     0m21.162s
  • Python's hashlib reading with chunks + CRLF
time python -c "
import os
import hashlib

LOCAL_CHUNK_SIZE = 1024 * 1024

for file in os.listdir():
    hash = hashlib.md5()

    with open(file, 'rb') as fobj:
        while True:
            data = fobj.read(LOCAL_CHUNK_SIZE)

            if not data:
                break

            chunk = data.replace(b'\r\n', b'\n')

            hash.update(chunk)

    print(file, hash.hexdigest())
" > /dev/null

# Files: 100,000
# Size:   3 KiB
#
# real    0m2.986s
# user    0m2.322s
# sys     0m0.644s

# Files:  10
# Size:   3 GiB
#
# real    7m26.300s
# user    2m34.908s
# sys     0m23.551s
  • Python's hashlib reading with chunks + CRLF + binary optimization
time python -c "
import os
import hashlib
from dvc.istextfile import istextfile

LOCAL_CHUNK_SIZE = 1024 * 1024

for file in os.listdir():
    hash = hashlib.md5()
    binary = not istextfile(file)

    with open(file, 'rb') as fobj:
        while True:
            data = fobj.read(LOCAL_CHUNK_SIZE)

            if not data:
                break

            if binary:
                chunk = data
            else:
                chunk = data.replace(b'\r\n', b'\n')

            hash.update(chunk)

    print(file, hash.hexdigest())
" > /dev/null

# Files: 100,000
# Size:   3 KiB
#
# real    0m7.610s
# user    0m6.028s
# sys     0m1.528s


# Files:  10
# Size:   3 GiB
#
# real    7m44.754s
# user    1m53.498s
# sys     0m17.882s
  • DVC's file_md5:
time python -c "
import os
from dvc.utils import file_md5

for file in os.listdir():
    print(file, file_md5(file)[0])
" > /dev/null

# Files: 100,000
# Size:   3 KiB
#
# real    0m8.927s
# user    0m7.092s
# sys     0m1.768s

# Files:  10
# Size:   3 GiB
#
# real    7m40.479s
# user    2m7.392s
# sys     0m21.710s

Also, file_md5 returns both the hexdigest and the digest, but we only use the hexdigest across the code base, so we can remove the digest.
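
One way to read the proposal, keeping the CRLF normalization but dropping the istextfile() probe and the unused digest return value (a sketch of the idea, not the actual patch):

import hashlib

LOCAL_CHUNK_SIZE = 1024 * 1024

def file_md5(fname):
    # Return only the hexdigest; always normalize CRLF instead of probing
    # the file with istextfile() first. (CRLF pairs split across chunk
    # boundaries are ignored here for brevity.)
    md5 = hashlib.md5()
    with open(fname, "rb") as fobj:
        for data in iter(lambda: fobj.read(LOCAL_CHUNK_SIZE), b""):
            md5.update(data.replace(b"\r\n", b"\n"))
    return md5.hexdigest()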

Extracts from https://stackoverflow.com/questions/56035696/version-control-for-machine-learning-data-set-with-large-amount-of-images

@shcheklein, dvc add and dvc push took about 2 hours with a 30 Mb/s upload speed.

Do we know DVC's bottlenecks? Is it possible to give the user an estimate of how long uploading/downloading is going to take, depending on the number of operations and the internet speed?
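
As a back-of-the-envelope model (all constants below are assumptions for illustration, not measured DVC numbers), the per-object request overhead dominates the bandwidth term when files are this small:

# Rough transfer-time estimate: bandwidth term + per-object request overhead.
n_files = 100_000
total_bytes = 229 * 1024 * 1024          # ~229 MB data set
bandwidth = 30 * 1_000_000 / 8           # "30 mb" read as 30 Mbit/s, in bytes/s
per_object_overhead = 0.05               # seconds per upload request (a guess)
jobs = 4                                 # parallel upload threads (a guess)

bandwidth_time = total_bytes / bandwidth
overhead_time = n_files * per_object_overhead / jobs
print(f"~{(bandwidth_time + overhead_time) / 60:.0f} minutes")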

Hi @shcheklein, if I understood you correctly: we are using Google Cloud Storage as the remote for DVC, with a single bucket. The total number of files exceeds 100,000, the total size on disk is 229 MB, and the average file size is about 1.3 KB. Our upload speed is 30 Mb/s and the download speed is also 30 Mb/s. I checked uploading our data set to a similar Google Storage bucket without DVC and it took about 25 minutes.

The problem seems related to Google Cloud Storage; it needs more research on why it takes 5x longer with DVC.
Computing checksums shouldn't take more than a couple of minutes even in the worst case, so there must be another operation hanging the process for such a long time.

Need to re-test this with all the new performance patches that have come over the last weeks and see if there is any improvement.

Yep, it would be great to see if the GCS issue is fixed.

So, here is the run with some comments:

3 KB × 100k files

$ time dvc add data  # comp md5s
real    1m33.184s
user    1m11.137s
sys 0m27.597s

# A long delay after pbar done and nothing happens


$ time dvc add data  # create unpacked dir
real    1m4.460s
user    0m53.056s
sys 0m10.855s

# A long delay before and especially after pbar


$ time dvc add data  # 3rd time, still slow
real    0m37.932s
user    0m31.009s
sys 0m6.715s

# A long delay at start before anything printed out
# All subsequent `dvc add`s take the same time


$ time dvc commit data.dvc
real    0m35.182s  # About the same as above
user    0m29.368s
sys 0m5.980s

$ time dvc push  # to local
real    2m18.288s                   
user    1m57.271s                                
sys     1m1.700s           

$ time dvc push  # second time, nothing to push
real    0m56.521s                                 
user    0m45.159s                                
sys     0m7.637s

$ time dvc pull  # nothing to pull
real    0m57.129s
user    0m48.626s
sys 0m8.521s

# Checkout took the majority of the time

$ rm -rf .dvc/cache && rm -rf data && time dvc pull
real    4m7.259s
user    3m24.639s
sys     1m30.983s

# at the start of checkout pbar hangs at 0% for a while


$ echo update >> data/update && time dvc add data
real    1m35.354s
user    1m12.278s
sys 0m22.342s

# The time is the same as the initial add
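
A minimal driver along these lines could repeat the runs end to end (a sketch, not the actual script used here):

import subprocess
import time

# Hypothetical benchmark driver: the repo and the data/ directory are
# assumed to be set up already, as in the script at the top of this issue.
COMMANDS = [
    "dvc add data",        # first add: computes md5s
    "dvc add data",        # second add: creates the unpacked dir
    "dvc add data",        # third add: steady state
    "dvc commit data.dvc",
    "dvc push",
    "dvc push",            # nothing to push
    "dvc pull",            # nothing to pull
]

for cmd in COMMANDS:
    start = time.monotonic()
    subprocess.run(cmd, shell=True, check=True)
    print(f"{cmd}: {time.monotonic() - start:.1f}s")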

Summary:

  • things are bad
  • the "directory not changed" fast check either doesn't work or is insufficient
  • many no-op operations take a long time
  • there are numerous UI failures:

    • hang-ups at the start and/or at the end
    • hang-ups in the middle, e.g. the pbar finishes and nothing happens
    • hang-ups when the pbar starts

So, benchmark runs for directories:

0.6.0 (master)

N = 10,000 files, 1 MB each

| op | total | in | out | sleep |
|---------------|---------|------|-------|---------|
| add | 81.01 | 0.9 | 2.19 | 37.08 |
| add-2 | 9.4 | 2.08 | 1.92 | 3.86 |
| add-3 | 5.83 | 4.14 | 1.68 | 0 |
| commit-noop | 5.73 | 4.23 | 1.49 | 0 |
| checkout-noop | 5.95 | 0.55 | 1.5 | 0.4 |
| checkout-full | 52.17 | 0.57 | 2.35 | 1.42 |
| push | 45.32 | 1.03 | 1.8 | 0.13 |
| push-noop | 46.88 | 0.97 | 2.17 | 0.87 |
| pull-noop | 10.23 | 0.96 | 1.52 | 0.46 |
| pull | 162.65 | 0.5 | 2.19 | 44.28 |
| add-modified | 62.64 | 1.76 | 2.32 | 56.29 |

N = 100,000 files, 100 KB each

| op | total | in | out | sleep |
|---------------|---------|-------|-------|---------|
| add | 222.77 | 1.57 | 2.02 | 141.7 |
| add-2 | 75.97 | 20.71 | 2.46 | 39.42 |
| add-3 | 42.7 | 40.43 | 2.27 | |
| commit-noop | 40.55 | 39.06 | 1.49 | |
| checkout-noop | 43.42 | 1.83 | 2.03 | 4.31 |
| checkout-full | 124.17 | 1.62 | 2.72 | 13.22 |
| push | 232.11 | 4.33 | 2.45 | 0.25 |
| push-noop | 145.57 | 4.6 | 1.86 | 0.76 |
| pull-noop | 85.2 | 4.49 | 1.85 | 4.41 |
| pull | 457.27 | 0.46 | 1.95 | 98.57 |
| add-modified | 204.89 | 22.69 | 2.96 | 158 |

0.40.0

N = 10,000 files, 1 MB each

| op | total | in | out | sleep |
|---------------|---------|-------|-------|---------|
| add | 98.16 | 7.38 | 3.37 | 51.67 |
| add-2 | 6.03 | 3.52 | 2.52 | 0 |
| add-3 | 5.83 | 3.54 | 2.29 | 0 |
| commit-noop | 5.65 | 3.48 | 2.17 | 0 |
| checkout-noop | 3.26 | 1.12 | 2.14 | |
| checkout-full | 49.81 | 46.04 | 3.77 | 0 |
| push | 34.84 | 1.2 | 3.53 | 1.29 |
| push-noop | 45.24 | 1.14 | 2.71 | 40.88 |
| pull-noop | 46.08 | 1.12 | 2.34 | 38.63 |
| pull | 100.2 | 1.08 | 3.47 | 56.4 |
| add-modified | 141.81 | 1.15 | 2.8 | 57.16 |

N = 100,000 files, 100 KB each

| op | total | in | out | sleep |
|---------------|---------|-------|-------|---------|
| add | 243.38 | 9.58 | 3.44 | 144.25 |
| add-2 | 35.13 | 31.96 | 3.16 | |
| add-3 | 36.76 | 33.06 | 3.69 | 0 |
| commit-noop | 31.33 | 28.44 | 2.89 | 0 |
| checkout-noop | 5.52 | 2.66 | 2.86 | 0 |
| checkout-full | 98.9 | 88.35 | 10.55 | |
| push | 97.57 | 1.89 | 9.83 | 13.7 |
| push-noop | 135.25 | 1.61 | 2.78 | 116.21 |
| pull-noop | 131.7 | 1.56 | 3.26 | 82.1 |
| pull | 198.3 | 1.28 | 6.75 | 109.82 |
| add-modified | 365.22 | 4.48 | 3.2 | 157.99 |

0.58.1 (before checkout changes)

N = 10,000 files, 1 MB each

| op | total | in | out | sleep |
|---------------|---------|------|-------|---------|
| add | 81.99 | 1.23 | 3.56 | 36.41 |
| add-2 | 9.81 | 2.91 | 2.63 | 3.13 |
| add-3 | 6.96 | 4.21 | 2.74 | 0 |
| commit-noop | 6.67 | 4.16 | 2.5 | |
| checkout-noop | 3.88 | 1.16 | 2.52 | 0.19 |
| checkout-full | 52.8 | 1.11 | 3.82 | 1.31 |
| push | 46.7 | 1.67 | 3.22 | 0.14 |
| push-noop | 47.87 | 1.55 | 3.36 | 0.77 |
| pull-noop | 7.78 | 1.45 | 2.55 | 1.79 |
| pull | 155.81 | 1.14 | 3.75 | 43.44 |
| add-modified | 63.79 | 2.47 | 3.85 | 55.4 |

N = 100,000 files, 100 KB each

| op | total | in | out | sleep |
|---------------|---------|-------|-------|---------|
| add | 220.51 | 1.71 | 3.33 | 137.64 |
| add-2 | 67.87 | 21.03 | 2.71 | 32.8 |
| add-3 | 40.55 | 37.8 | 2.75 | 0 |
| commit-noop | 37.08 | 34.04 | 3.04 | 0 |
| checkout-noop | 7.27 | 2.57 | 2.49 | 2.21 |
| checkout-full | 124.94 | 2.32 | 4.32 | 13.85 |
| push | 223.59 | 5.14 | 3.54 | 0.27 |
| push-noop | 147.53 | 5.01 | 3.54 | 0.58 |
| pull-noop | 48.35 | 4.92 | 2.86 | 18.59 |
| pull | 440.08 | 1.15 | 3.21 | 96.96 |
| add-modified | 201.93 | 21.99 | 3.57 | 154.42 |

And totals only, as bar charts:

N = 10,000 files, 1 MB each

[bar chart: bench_dir, N = 10k, 1 MB files]

N = 100,000 files, 100 KB each

[bar chart: bench_dir, N = 100k, 100 KB files]

Some takeaways:

  • the checkout change slows things down significantly
  • pull/push have degraded significantly over time (probably due to switching from listings to batch exists checks; this is a local remote, so take it with a grain of salt); see the sketch after this list
  • multithreaded md5 computation doesn't help as much as one might expect
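
The listing-vs-exists point can be illustrated generically (the remote object and its methods below are hypothetical, not DVC's remote API):

# Two ways to find which cache objects are missing on a remote.

def missing_via_listing(local_hashes, remote):
    # One (paged) listing of the remote, then a set difference: roughly one
    # request per page, independent of how many hashes we are checking.
    remote_hashes = set(remote.list_all())
    return [h for h in local_hashes if h not in remote_hashes]

def missing_via_exists(local_hashes, remote):
    # One exists() request per hash: with 100k small files this means 100k
    # round trips, even if nothing needs to be transferred.
    return [h for h in local_hashes if not remote.exists(h)]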

I saved all the output with timestamps, so it can be analyzed to see where the sleeps and the slow ins and outs are.
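
A small sketch of how that timestamped output could be scanned for sleeps, assuming one ISO timestamp at the start of each line (the log format here is an assumption):

import sys
from datetime import datetime

# Report gaps longer than a threshold between consecutive timestamped log
# lines, e.g. "2019-11-05T12:00:01.123 some message".
THRESHOLD = 2.0  # seconds

prev = None
for line in sys.stdin:
    parts = line.split(maxsplit=1)
    if not parts:
        continue
    try:
        ts = datetime.fromisoformat(parts[0])
    except ValueError:
        continue
    if prev is not None and (ts - prev).total_seconds() > THRESHOLD:
        print(f"gap of {(ts - prev).total_seconds():.1f}s before: {line.rstrip()}")
    prev = ts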

Another thing is that this was tested with the copy cache type only.

@Suor do you still have your test scripts? It would be interesting to see where this stands right now, since we've introduced a lot of optimizations in 1.0. Though dvc-bench is probably enough.

Ok, closing for now as stale. We've introduced lots of push/pull/fetch/status/add optimizations for directories since the ticket was opened.

I guess the new benchmarks are also run against old code, so we can compare.
