Ubuntu 18.04, Python 3.6.8, DVC versions 0.35.7 and 0.50.1 installed with pip3. Fresh repos and a cleared cache for each test.
Adding files recursively (`dvc add -R`) gets much slower as the number of files grows. I noticed this when trying to add a directory with 15207 files: I waited an hour and stopped it before it finished. The tests below use subsets of the files to show the behaviour. The slowdown is especially noticeable on the newest version of DVC, while 0.35.7 performs significantly better.
Unfortunately, adding whole directories (without `-R`) is not an option for us, because we need to split the data set between workers before pulling. We're stuck on an older version of DVC because of this issue.
Tested with subsets of files to show the behaviour.

dvc 0.50.1, `dvc add -R`:

- 451 files / 16 s = 28.2 files/s
- 1242 files / 87 s = 14.3 files/s
- 2474 files / 327 s = 7.5 files/s
- 4188 files / 927 s = 4.5 files/s

Adding the whole directory with 4188 files (using the non-recursive option):

- 4188 files / 12 s = 349 files/s

dvc 0.35.7, `dvc add -R`:

- 451 files / 4 s = 113 files/s
- 1242 files / 9 s = 138 files/s
- 2474 files / 20 s = 124 files/s
- 4188 files / 49 s = 85 files/s
- 15204 files / 445 s = 34 files/s

Adding the whole directory with 4188 files (using the non-recursive option):

- 4188 files / 8 s = 524 files/s

Adding the whole directory with 15204 files (using the non-recursive option):

- 15204 files / 26 s = 585 files/s
Hi @filipgrano !
Could you please elaborate on your scenario? We might be able to suggest a workaround for now.
> Hi @filipgrano !
> Could you please elaborate on your scenario? We might be able to suggest a workaround for now.
The simplest scenario I can come up with where this would be a serious issue: imagine two developers working on separate machines.
Dev1 runs an experiment producing 10 thousand large files as results, adds them to DVC, and pushes to the remote with DVC and git. Dev2 then wants to examine a single file of the results. If the files were added recursively, it's possible to pull just that single file; but if the directory was added as a whole, the whole bunch needs to be pulled.
We have quite elaborate scenarios with multi-stage parallel data pipelines producing thousands of files, with the input files divided between workers, but I think the above highlights the issue.
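To make the difference concrete (the paths below are hypothetical): with `add -R` every file gets its own `.dvc` file that can be pulled on its own, whereas a directory-level add produces a single `.dvc` file covering everything.

```bash
# added recursively: one .dvc file per result, so a single file can be fetched
dvc pull results/0001.bin.dvc

# added as a whole directory: only results.dvc exists, so this pulls everything
dvc pull results.dvc
```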
I also found the same performance issue and degradation with `dvc checkout` of a single file when there is a large number of files. I will post benchmarks shortly.

As promised, benchmarks for checking out a single file from large file sets are below. The cache contains all and only the files in the repo. I deleted a single file and ran `dvc checkout filename`.

This is an even worse case: checkout of a single file takes as long as adding all the files recursively.
dvc 0.50.1:

- 1 file from 451 files / 14 s = 0.071 files/s
- 1 file from 1242 files / 88 s = 0.011 files/s
- 1 file from 2474 files / 324 s = 0.003 files/s

dvc 0.35.7:

- 1 file from 451 files / 2 s = 0.500 files/s
- 1 file from 1242 files / 5 s = 0.200 files/s
- 1 file from 2474 files / 15 s = 0.067 files/s
- 1 file from 4188 files / 33 s = 0.030 files/s
> The simplest scenario I can come up with where this would be a serious issue: imagine two developers working on separate machines. Dev1 runs an experiment producing 10 thousand large files as results, adds them to DVC, and pushes to the remote with DVC and git. Dev2 then wants to examine a single file of the results. If the files were added recursively, it's possible to pull just that single file; but if the directory was added as a whole, the whole bunch needs to be pulled.
I meant the scenario where you need to add that number of files recursively, instead of as a whole directory. If you are using `add -R` with a large directory, it is a known issue that it will slow DVC down because of the way we collect and verify DAGs. If you could describe the scenario where you need `add -R`, maybe we could figure out a better workaround for you, so that you could add the directory as a whole instead.
> We have quite elaborate scenarios with multi-stage parallel data pipelines producing thousands of files, with the input files divided between workers, but I think the above highlights the issue.
Got it. So maybe it would be possible to put that parallelization into a script that you would run with `dvc run ... -o output_dir ./my_parallel_script`, so it would count as a whole output?
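A rough sketch of that suggestion (the script, the `inputs` directory, and the `./process_one` helper are hypothetical, only to illustrate keeping the fan-out inside a single stage with one output directory):

```bash
#!/bin/bash
# my_parallel_script: do the per-file parallelism locally so that DVC only
# sees a single output directory (./process_one is a placeholder command).
mkdir -p output_dir
ls inputs | xargs -P 8 -I{} ./process_one "inputs/{}" "output_dir/{}"
```

and then register it as one stage with `dvc run -d inputs -o output_dir ./my_parallel_script`.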
So the culprit for now is the way we collect the DAG. Almost every time a DVC command runs, it collects all the .dvc files in your repo in order to build a DAG. While doing that, we also verify things like the absence of cycles in your DAG. 0.35.7 is pretty old, and we've added some additional checks since then. We'll need to investigate the issue more carefully.
As for checking out one particular .dvc file or a specific set of those, we might optimize that for now by not computing the whole DAG, since we already know the specific targets to check out.
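To make the cost concrete, here is a rough conceptual sketch of that collection step (this is not DVC's actual code; the `Stage` class, its attributes, and the `isin` helper are simplified stand-ins): building the graph compares paths pairwise with an `isin()`-style containment check before validating the result, which is where the quadratic behaviour in the profiles later in this thread comes from.

```python
# Simplified stand-ins, not DVC's real classes: the point is the pairwise
# isin()-style scan plus the cycle check on the resulting graph.
from dataclasses import dataclass
from pathlib import PurePosixPath

import networkx as nx


@dataclass(frozen=True)
class Stage:                      # roughly, one .dvc file
    path: str
    deps: tuple = ()
    outs: tuple = ()


def isin(child, parent):
    c, p = PurePosixPath(child).parts, PurePosixPath(parent).parts
    return len(c) > len(p) and c[: len(p)] == p


def check_overlaps(stages):
    outs = [out for stage in stages for out in stage.outs]
    for i, a in enumerate(outs):
        for b in outs[i + 1:]:    # O(n^2) pairwise checks over all outputs
            if a == b or isin(a, b) or isin(b, a):
                raise RuntimeError(f"outputs {a!r} and {b!r} overlap")


def check_dag(stages):
    check_overlaps(stages)
    graph = nx.DiGraph()
    graph.add_nodes_from(stages)
    for stage in stages:
        for other in stages:      # another O(n^2) scan to wire up edges
            if stage is not other and any(
                dep == out or isin(dep, out)
                for dep in stage.deps for out in other.outs
            ):
                graph.add_edge(stage, other)
    if not nx.is_directed_acyclic_graph(graph):
        raise RuntimeError("pipeline has a cycle")


check_dag([Stage("a.dvc", outs=("data/a",)),
           Stage("b.dvc", deps=("data/a",), outs=("data/b",))])
```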
Maybe we should discuss introducing a feature that allows the user to extract a particular output?
Due to the fact that DVC prevents the user from reusing the same output name, it could be as simple as `dvc pull out_name`, and for a particular file from a data directory, `dvc pull data/dog/1.jpg`.
It has to be noted that introducing such a command would create some problems with how to treat that dependency. What I have in mind is that if we allow the user to download just one image from the whole dataset under the same path (`data/dog/1.jpg`), then DVC, upon a status check, will report that this dependency (the `data` dir) has changed, which might result in the user "overwriting" the whole dataset for the given branch with this single file.
We could probably approach that by treating such a pull in a similar way to `import` and treat it as a "new" dependency (which would require renaming).
Note that this will cause some problems if we would like to `repro` the pipeline and also repro this particular dependency. We would need to do it in 2 steps:
@filipgrano out of curiosity: is your use case some kind of AutoML stuff, producing a huge number of models?
`test.sh`:

```bash
#!/bin/bash
set -evux

rm -rf perf
mkdir perf
pushd perf

git init
dvc init

mkdir data
for i in $(seq 1 5000); do
    echo $i > data/$i.txt
done
```

Running the performance test with `PYTHONPATH=<path-to-dvc-repo>/dvc python -mcProfile -s cumtime -m dvc add -R data`:
```
ncalls tottime percall cumtime percall filename:lineno(function)
2185/1 0.009 0.000 1947.114 1947.114 {built-in method builtins.exec}
1 0.000 0.000 1947.113 1947.113 runpy.py:195(run_module)
1 0.000 0.000 1946.982 1946.982 runpy.py:62(_run_code)
1 0.000 0.000 1946.982 1946.982 __main__.py:1(<module>)
1 0.001 0.001 1946.256 1946.256 main.py:16(main)
1 0.000 0.000 1945.825 1945.825 base.py:58(run_cmd)
1 0.000 0.000 1945.824 1945.824 add.py:14(run)
1 0.000 0.000 1945.824 1945.824 scm_context.py:2(run)
1 0.007 0.007 1945.795 1945.795 add.py:14(add)
2 0.003 0.002 1829.134 914.567 __init__.py:142(check_dag)
2 33.953 16.977 1829.131 914.565 __init__.py:249(graph)
49995000 51.324 0.000 1781.897 0.000 path_info.py:55(isin)
50011525 49.291 0.000 1689.408 0.000 {built-in method builtins.any}
399960000 172.616 0.000 1634.022 0.000 path_info.py:60(<genexpr>)
399960000 132.770 0.000 903.307 0.000 _collections_abc.py:879(__iter__)
399960000 262.493 0.000 770.537 0.000 pathlib.py:594(__getitem__)
362462500 197.918 0.000 570.688 0.000 pathlib.py:735(__eq__)
724930000 304.088 0.000 337.218 0.000 pathlib.py:726(_cparts)
350010000 173.256 0.000 270.897 0.000 pathlib.py:666(_from_parsed_parts)
804198692/404234757 121.447 0.000 237.543 0.000 {built-in method builtins.len}
399960000 116.095 0.000 148.769 0.000 pathlib.py:588(__len__)
1 0.029 0.029 72.795 72.795 add.py:56(_create_stages)
350112991 66.678 0.000 66.678 0.000 {built-in method __new__ of type object at 0x10a070030}
10000 0.108 0.000 65.204 0.007 __init__.py:200(is_tracked)
10001 0.386 0.000 64.193 0.006 util.py:248(__getattr__)
10000 0.126 0.000 63.789 0.006 base.py:111(_set_cache_)
10000 58.315 0.006 58.315 0.006 util.py:163(file_contents_ro)
5000 0.022 0.000 58.205 0.012 stage.py:688(save)
5000 0.081 0.000 56.166 0.011 base.py:203(save)
417586604 46.596 0.000 46.666 0.000 {built-in method builtins.isinstance}
49995000 17.914 0.000 38.922 0.000 pathlib.py:916(parents)
349970000 33.129 0.000 33.129 0.000 pathlib.py:308(casefold_parts)
350050001 31.016 0.000 31.016 0.000 pathlib.py:683(_init)
1 0.000 0.000 29.695 29.695 add.py:43(_find_all_targets)
1 0.047 0.047 29.695 29.695 add.py:46(<listcomp>)
49995000 21.008 0.000 21.008 0.000 pathlib.py:581(__init__)
5000 0.065 0.000 14.127 0.003 stage.py:643(dump)
5009 0.280 0.000 14.099 0.003 __init__.py:117(ignore)
5000 0.236 0.000 11.293 0.002 stage.py:22(dump_stage_file)
5000 0.007 0.000 10.308 0.002 stage.py:726(commit)
5000 0.014 0.000 10.301 0.002 base.py:242(commit)
5000 0.018 0.000 10.269 0.002 base.py:362(save)
150009 0.346 0.000 10.007 0.000 __init__.py:392(relpath)
```
~~If I get it right, `path_info` is extremely slow~~. @Suor @efiop @pared @mroutis any thoughts?

So, it's actually a crazy number of calls to `path_info`.
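For what it's worth, the call count itself is consistent with a full pairwise comparison (assuming, and this is only an assumption, that the ~10,000 paths are the 5,000 outputs plus the 5,000 stage files being checked against each other):

```python
# 49,995,000 is exactly C(10000, 2); dividing the cumulative isin() time from
# the profile above by that count gives the per-call cost.
n = 10_000
pairs = n * (n - 1) // 2
print(pairs)                    # 49995000 -- matches ncalls for isin
print(1781.897 / pairs * 1e6)   # ~35.6 microseconds per isin() call
```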
@shcheklein That rather indicates that we spend a lot of time in `PathInfo`. I'll run a similar test for 0.35.7 so we can try to compare.

EDIT:

My computer, 500 files, DVC version 0.35.7:
```
5993252 function calls (5931707 primitive calls) in 2.643 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
2057/1 0.005 0.000 2.645 2.645 {built-in method builtins.exec}
1 0.000 0.000 2.645 2.645 <string>:1(<module>)
1 0.000 0.000 2.645 2.645 runpy.py:195(run_module)
1 0.000 0.000 2.578 2.578 runpy.py:62(_run_code)
1 0.000 0.000 2.578 2.578 __main__.py:1(<module>)
1 0.000 0.000 2.192 2.192 main.py:15(main)
1 0.000 0.000 2.015 2.015 base.py:55(run_cmd)
1 0.000 0.000 2.015 2.015 add.py:14(run)
1 0.000 0.000 2.015 2.015 scm_context.py:2(run)
1 0.001 0.001 2.015 2.015 add.py:9(add)
72 0.001 0.000 1.717 0.024 __init__.py:1(<module>)
1 0.001 0.001 0.908 0.908 add.py:38(_create_stages)
1485/13 0.006 0.000 0.726 0.056
1485/13 0.004 0.000 0.726 0.056
1417/13 0.005 0.000 0.725 0.056
1317/13 0.003 0.000 0.725 0.056
1752/13 0.001 0.000 0.724 0.056
2 0.000 0.000 0.609 0.304 __init__.py:137(check_dag)
2 0.207 0.103 0.609 0.304 __init__.py:299(graph)
619/46 0.001 0.000 0.548 0.012 {built-in method builtins.__import__}
500 0.001 0.000 0.465 0.001 stage.py:658(save)
```
DVC version: master:

```
82438717 function calls (76873063 primitive calls) in 22.374 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
2393/1 0.007 0.000 22.377 22.377 {built-in method builtins.exec}
1 0.000 0.000 22.377 22.377 <string>:1(<module>)
1 0.000 0.000 22.377 22.377 runpy.py:195(run_module)
1 0.000 0.000 22.315 22.315 runpy.py:62(_run_code)
1 0.000 0.000 22.315 22.315 __main__.py:1(<module>)
1 0.000 0.000 21.745 21.745 main.py:16(main)
1 0.000 0.000 21.518 21.518 base.py:58(run_cmd)
1 0.000 0.000 21.517 21.517 add.py:14(run)
1 0.000 0.000 21.517 21.517 scm_context.py:2(run)
1 0.000 0.000 21.517 21.517 add.py:14(add)
2 0.000 0.000 19.322 9.661 __init__.py:142(check_dag)
2 0.198 0.099 19.322 9.661 __init__.py:249(graph)
499500 0.449 0.000 18.777 0.000 path_info.py:55(isin)
507673 0.568 0.000 17.982 0.000 {built-in method builtins.any}
5494500 1.966 0.000 17.372 0.000 path_info.py:60(<genexpr>)
5494500 1.325 0.000 9.704 0.000 _collections_abc.py:879(__iter__)
5494500 2.906 0.000 8.379 0.000 pathlib.py:585(__getitem__)
5119750 2.205 0.000 5.801 0.000 pathlib.py:726(__eq__)
10240000 2.863 0.000 3.222 0.000 pathlib.py:717(_cparts)
4999500 2.067 0.000 2.961 0.000 pathlib.py:657(_from_parsed_parts)
```
> Got it. So maybe it would be possible to put that parallelization into a script that you would run with `dvc run ... -o output_dir ./my_parallel_script`, so it would count as a whole output?
That would work if we ran on a single machine or with shared storage. Our workers run in k8s on separate machines and use DVC with a local cache and remote storage to synchronize data.
> @filipgrano out of curiosity: is your use case some kind of AutoML stuff, producing a huge number of models?
Yes.
Might be a silly question, but why does the DAG need to be built when simply adding files? How can they be part of any dependencies if they are freshly added?
> Might be a silly question, but why does the DAG need to be built when simply adding files? How can they be part of any dependencies if they are freshly added?
We need to make sure that there are no cycles or duplicated outputs. E.g.

```
dvc add foo -f myfoo.dvc
dvc add foo -f myotherfoo.dvc
```

or

```
dvc run ... -o foo ./mycmd
dvc add foo
```
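Illustrative only (not DVC's code): the duplicated-output case above boils down to mapping each output path to the stage that owns it and raising on the second registration.

```python
def check_duplicated_outputs(stages):
    """stages: mapping of .dvc file name -> list of output paths."""
    owner = {}
    for stage, outs in stages.items():
        for out in outs:
            if out in owner:
                raise RuntimeError(
                    f"output {out!r} is already tracked by {owner[out]!r}")
            owner[out] = stage


check_duplicated_outputs({"myfoo.dvc": ["foo"], "bar.dvc": ["bar"]})           # fine
# check_duplicated_outputs({"myfoo.dvc": ["foo"], "myotherfoo.dvc": ["foo"]})  # raises
```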
Got a 3x-4x improvement with this simple trick:

```python
def isin(self, other):
    if isinstance(other, basestring):  # basestring: py2/3 string compat alias
        other = self.__class__(other)
    elif self.__class__ != other.__class__:
        return False
    for p in self.parents:
        if p == other:
            return True
        # parents get shorter as we walk up, so once `other` is at least as
        # long as the current parent, nothing further up can match
        if len(other.parts) >= len(p.parts):
            return False
```
@Suor can you confirm that it makes sense?
Next is to reduce (or cache?) the number of `isin` calls in the first place, if possible.

@filipgrano it looks promising ^^. We can send an update as soon as I confirm that there are no issues.
@shcheklein I would rather fall back to comparing strings or go over `_parts` from the beginning. Constructing `parents` is way slower.
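A sketch of that suggestion (illustrative only, not the actual patch): compare the already-parsed parts directly instead of materializing `parents`. `_cparts` is the case-folded parts list pathlib itself uses for comparisons (a private attribute, present in the Python versions discussed here).

```python
from pathlib import PurePosixPath


class PathInfo(PurePosixPath):          # simplified stand-in for dvc's PathInfo
    def isin(self, other):
        if isinstance(other, str):
            other = self.__class__(other)
        elif self.__class__ != other.__class__:
            return False
        n = len(other._cparts)
        # `self` is inside `other` iff other's parts are a strict prefix of self's
        return len(self._cparts) > n and self._cparts[:n] == other._cparts


assert PathInfo("data/dog/1.jpg").isin("data")
assert not PathInfo("data2").isin("data")
```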
I wonder if we should support some partial pulling though, because per-file `.dvc` files might be a nuisance for a user.

@Suor `parents` take only 16 s for 49,995,000 calls, 16 s out of ~500 s of execution. I would say it's not a problem at all; 49,995,000 is the problem. I hope all these checks inside `graph` can be rewritten to be linear (using some hashing).
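One way the hashing idea could look (a sketch under the assumption that only overlap/duplication detection is needed, not the full edge construction): hash all output paths into a set once, then walk each path's own ancestors, which is O(n · depth) instead of O(n²) pairwise `isin` calls.

```python
from pathlib import PurePosixPath


def find_overlaps(out_paths):
    """Yield (child, parent) pairs where one tracked path lies inside another."""
    outs = set(out_paths)                    # O(n) to build
    for path in out_paths:
        for ancestor in PurePosixPath(path).parents:
            if str(ancestor) in outs:        # O(depth) lookups per path
                yield path, str(ancestor)


print(list(find_overlaps(["data/a.txt", "data", "other/b.txt"])))
# [('data/a.txt', 'data')]
```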
@Suor of course, I'm not against using `_cparts` if it's easy enough.
If everything is done right, PR #2230 gives a ~20x improvement:

```
2185/1 0.009 0.000 116.230 116.230 {built-in method builtins.exec}
1 0.000 0.000 116.229 116.229 runpy.py:195(run_module)
1 0.000 0.000 116.108 116.108 runpy.py:62(_run_code)
1 0.000 0.000 116.108 116.108 __main__.py:1(<module>)
1 0.000 0.000 115.397 115.397 main.py:16(main)
1 0.000 0.000 115.023 115.023 base.py:58(run_cmd)
1 0.000 0.000 115.022 115.022 add.py:14(run)
1 0.000 0.000 115.022 115.022 scm_context.py:2(run)
1 0.009 0.009 114.992 114.992 add.py:14(add)
1 0.031 0.031 66.866 66.866 add.py:56(_create_stages)
10000 0.125 0.000 62.874 0.006 __init__.py:200(is_tracked)
10001 0.337 0.000 61.609 0.006 util.py:248(__getattr__)
10000 0.158 0.000 61.249 0.006 base.py:111(_set_cache_)
10000 53.306 0.005 53.306 0.005 util.py:163(file_contents_ro)
5000 0.018 0.000 53.066 0.011 stage.py:688(save)
5000 0.089 0.000 51.082 0.010 base.py:203(save)
1 0.000 0.000 31.687 31.687 add.py:43(_find_all_targets)
1 0.066 0.066 31.687 31.687 add.py:46(<listcomp>)
5000 0.080 0.000 15.520 0.003 stage.py:643(dump)
5009 0.278 0.000 13.587 0.003 __init__.py:117(ignore)
5000 0.388 0.000 12.346 0.002 stage.py:22(dump_stage_file)
5000 0.009 0.000 10.467 0.002 main.py:420(dump)
5000 0.025 0.000 10.457 0.002 main.py:441(dump_all)
5000 0.021 0.000 10.277 0.002 main.py:796(dump)
150009 0.357 0.000 10.038 0.000 __init__.py:392(relpath)
5000 0.015 0.000 9.939 0.002 representer.py:82(represent)
5000 0.008 0.000 9.885 0.002 stage.py:726(commit)
5000 0.014 0.000 9.877 0.002 base.py:242(commit)
5000 0.016 0.000 9.845 0.002 base.py:364(save)
```
Most of the time now goes to the `file_contents_ro` check.
@filipgrano please check the latest release, 0.51.1. It should be available to install via pip and should give you a ~20x performance improvement. Please let us know if that works for you.

@filipgrano closing this, since it should be fixed in the latest 0.51+ versions.