Ubuntu 18.04, Python 3.6.8, DVC versions 0.35.7 and 0.50.1 installed with pip3. Fresh repos and a cleared cache for each test.
Adding files recursively (`dvc add -R`) gets much slower as the number of files grows. I noticed this when trying to add a directory with 15207 files: I waited an hour and stopped it before it finished. The tests below use subsets of the files to show the behaviour. The slowdown is especially noticeable on the newest version of DVC, while 0.35.7 performs significantly better.
Unfortunately, adding whole directories (without `-R`) is not an option for us, because we need to split the data set between workers before pulling. We're stuck on an older version of DVC because of this issue.
Tested with subsets of files to show the behaviour.

dvc 0.50.1, `dvc add -R`:

- 451 files / 16 s = 28.2 files/s
- 1242 files / 87 s = 14.3 files/s
- 2474 files / 327 s = 7.5 files/s
- 4188 files / 927 s = 4.5 files/s

Adding the whole directory with 4188 files (using the non-recursive option):

- 4188 files / 12 s = 349 files/s

dvc 0.35.7, `dvc add -R`:

- 451 files / 4 s = 113 files/s
- 1242 files / 9 s = 138 files/s
- 2474 files / 20 s = 124 files/s
- 4188 files / 49 s = 85 files/s
- 15204 files / 445 s = 34 files/s

Adding the whole directory with 4188 files (using the non-recursive option):

- 4188 files / 8 s = 524 files/s

Adding the whole directory with 15204 files (using the non-recursive option):

- 15204 files / 26 s = 585 files/s
Hi @filipgrano !
Could you please elaborate on your scenario? We might be able to suggest a workaround for now.
> Hi @filipgrano !
> Could you please elaborate on your scenario? We might be able to suggest a workaround for now.
The simplest scenario I can come up with where this would be a serious issue: imagine two developers working on separate machines.
Dev1 runs an experiment producing 10 thousand large files as results, adds them to DVC, and pushes to the remote with DVC and git. Dev2 then wants to examine a single file of the results. If the files were added recursively, it's possible to pull just that single file; but if the directory was added as a whole, the whole bunch needs to be pulled.
We have quite elaborate scenarios with multi-stage parallel data pipelines producing thousands of files, with the input files divided between workers, but I think the above highlights the issue.
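To make the difference concrete (the paths below are hypothetical): with `add -R` every file gets its own `.dvc` file that can be pulled on its own, whereas a directory-level add produces a single `.dvc` file covering everything.

```bash
# added recursively: one .dvc file per result, so a single file can be fetched
dvc pull results/0001.bin.dvc

# added as a whole directory: only results.dvc exists, so this pulls everything
dvc pull results.dvc
```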
I also found the same performance issue and degradation with `dvc checkout` of a single file when there is a large number of files. I will post benchmarks shortly.

As promised, benchmarks for checking out a single file from large file sets are below. The cache contains all and only the files in the repo. I deleted a single file and ran `dvc checkout filename`.

This is an even worse case: checkout of a single file takes as long as adding all the files recursively.
dvc 0.50.1:

- 1 file from 451 files / 14 s = 0.071 files/s
- 1 file from 1242 files / 88 s = 0.011 files/s
- 1 file from 2474 files / 324 s = 0.003 files/s

dvc 0.35.7:

- 1 file from 451 files / 2 s = 0.500 files/s
- 1 file from 1242 files / 5 s = 0.200 files/s
- 1 file from 2474 files / 15 s = 0.067 files/s
- 1 file from 4188 files / 33 s = 0.030 files/s
> The simplest scenario I can come up with where this would be a serious issue: imagine two developers working on separate machines. Dev1 runs an experiment producing 10 thousand large files as results, adds them to DVC, and pushes to the remote with DVC and git. Dev2 then wants to examine a single file of the results. If the files were added recursively, it's possible to pull just that single file; but if the directory was added as a whole, the whole bunch needs to be pulled.
I meant the scenario where you need to add that number of files recursively, instead of as a whole directory. If you are using `add -R` with a large directory, it is a known issue that it will slow DVC down because of the way we collect and verify DAGs. If you could describe the scenario where you need `add -R`, maybe we could figure out a better workaround for you, so that you could add the directory as a whole instead.
> We have quite elaborate scenarios with multi-stage parallel data pipelines producing thousands of files, with the input files divided between workers, but I think the above highlights the issue.
Got it. So maybe it would be possible to put that parallelization into a script that you would run with `dvc run ... -o output_dir ./my_parallel_script`, so it would count as a whole output?
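A rough sketch of that suggestion (the script, the `inputs` directory, and the `./process_one` helper are hypothetical, only to illustrate keeping the fan-out inside a single stage with one output directory):

```bash
#!/bin/bash
# my_parallel_script: do the per-file parallelism locally so that DVC only
# sees a single output directory (./process_one is a placeholder command).
mkdir -p output_dir
ls inputs | xargs -P 8 -I{} ./process_one "inputs/{}" "output_dir/{}"
```

and then register it as one stage with `dvc run -d inputs -o output_dir ./my_parallel_script`.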
So the culprit for now is the way we collect the DAG. Almost every time a DVC command runs, it collects all the .dvc files in your repo in order to build a DAG. While doing that, we also verify things like the absence of cycles in your DAG. 0.35.7 is pretty old, and we've added some additional checks since then. We'll need to investigate the issue more carefully.
As for checking out one particular .dvc file or a specific set of those, we might optimize that for now by not computing the whole DAG, since we already know the specific targets to check out.
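To make the cost concrete, here is a rough conceptual sketch of that collection step (this is not DVC's actual code; the `Stage` class, its attributes, and the `isin` helper are simplified stand-ins): building the graph compares paths pairwise with an `isin()`-style containment check before validating the result, which is where the quadratic behaviour in the profiles later in this thread comes from.

```python
# Simplified stand-ins, not DVC's real classes: the point is the pairwise
# isin()-style scan plus the cycle check on the resulting graph.
from dataclasses import dataclass
from pathlib import PurePosixPath

import networkx as nx


@dataclass(frozen=True)
class Stage:                      # roughly, one .dvc file
    path: str
    deps: tuple = ()
    outs: tuple = ()


def isin(child, parent):
    c, p = PurePosixPath(child).parts, PurePosixPath(parent).parts
    return len(c) > len(p) and c[: len(p)] == p


def check_overlaps(stages):
    outs = [out for stage in stages for out in stage.outs]
    for i, a in enumerate(outs):
        for b in outs[i + 1:]:    # O(n^2) pairwise checks over all outputs
            if a == b or isin(a, b) or isin(b, a):
                raise RuntimeError(f"outputs {a!r} and {b!r} overlap")


def check_dag(stages):
    check_overlaps(stages)
    graph = nx.DiGraph()
    graph.add_nodes_from(stages)
    for stage in stages:
        for other in stages:      # another O(n^2) scan to wire up edges
            if stage is not other and any(
                dep == out or isin(dep, out)
                for dep in stage.deps for out in other.outs
            ):
                graph.add_edge(stage, other)
    if not nx.is_directed_acyclic_graph(graph):
        raise RuntimeError("pipeline has a cycle")


check_dag([Stage("a.dvc", outs=("data/a",)),
           Stage("b.dvc", deps=("data/a",), outs=("data/b",))])
```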
Maybe we should discuss introducing a feature that allows the user to extract a particular output?
Due to the fact that DVC prevents the user from reusing the same output name, it could be as simple as `dvc pull out_name`, and for a particular file from a data directory, `dvc pull data/dog/1.jpg`.
It has to be noted that introducing such a command would create some problems with how to treat that dependency. What I have in mind is that if we allow the user to download just one image from the whole dataset under the same path (`data/dog/1.jpg`), then DVC, upon a status check, will report that this dependency (the `data` dir) has changed, which might result in the user "overwriting" the whole dataset for the given branch with this single file.
We could probably approach that by treating such a pull in a similar way to `import` and treat it as a "new" dependency (which would require renaming).
Note that this will cause some problems if we would like to `repro` the pipeline and also repro this particular dependency. We would need to do it in 2 steps:
@filipgrano out of curiosity: is your use case some kind of AutoML stuff, producing a huge number of models?
`test.sh`:

```bash
#!/bin/bash
set -evux

rm -rf perf
mkdir perf
pushd perf

git init
dvc init

mkdir data
for i in $(seq 1 5000); do
    echo $i > data/$i.txt
done
```

Running the performance test with `PYTHONPATH=<path-to-dvc-repo>/dvc python -mcProfile -s cumtime -m dvc add -R data`:
```
ncalls tottime percall cumtime percall filename:lineno(function)
2185/1 0.009 0.000 1947.114 1947.114 {built-in method builtins.exec}
1 0.000 0.000 1947.113 1947.113 runpy.py:195(run_module)
1 0.000 0.000 1946.982 1946.982 runpy.py:62(_run_code)
1 0.000 0.000 1946.982 1946.982 __main__.py:1(<module>)
1 0.001 0.001 1946.256 1946.256 main.py:16(main)
1 0.000 0.000 1945.825 1945.825 base.py:58(run_cmd)
1 0.000 0.000 1945.824 1945.824 add.py:14(run)
1 0.000 0.000 1945.824 1945.824 scm_context.py:2(run)
1 0.007 0.007 1945.795 1945.795 add.py:14(add)
2 0.003 0.002 1829.134 914.567 __init__.py:142(check_dag)
2 33.953 16.977 1829.131 914.565 __init__.py:249(graph)
49995000 51.324 0.000 1781.897 0.000 path_info.py:55(isin)
50011525 49.291 0.000 1689.408 0.000 {built-in method builtins.any}
399960000 172.616 0.000 1634.022 0.000 path_info.py:60(<genexpr>)
399960000 132.770 0.000 903.307 0.000 _collections_abc.py:879(__iter__)
399960000 262.493 0.000 770.537 0.000 pathlib.py:594(__getitem__)
362462500 197.918 0.000 570.688 0.000 pathlib.py:735(__eq__)
724930000 304.088 0.000 337.218 0.000 pathlib.py:726(_cparts)
350010000 173.256 0.000 270.897 0.000 pathlib.py:666(_from_parsed_parts)
804198692/404234757 121.447 0.000 237.543 0.000 {built-in method builtins.len}
399960000 116.095 0.000 148.769 0.000 pathlib.py:588(__len__)
1 0.029 0.029 72.795 72.795 add.py:56(_create_stages)
350112991 66.678 0.000 66.678 0.000 {built-in method __new__ of type object at 0x10a070030}
10000 0.108 0.000 65.204 0.007 __init__.py:200(is_tracked)
10001 0.386 0.000 64.193 0.006 util.py:248(__getattr__)
10000 0.126 0.000 63.789 0.006 base.py:111(_set_cache_)
10000 58.315 0.006 58.315 0.006 util.py:163(file_contents_ro)
5000 0.022 0.000 58.205 0.012 stage.py:688(save)
5000 0.081 0.000 56.166 0.011 base.py:203(save)
417586604 46.596 0.000 46.666 0.000 {built-in method builtins.isinstance}
49995000 17.914 0.000 38.922 0.000 pathlib.py:916(parents)
349970000 33.129 0.000 33.129 0.000 pathlib.py:308(casefold_parts)
350050001 31.016 0.000 31.016 0.000 pathlib.py:683(_init)
1 0.000 0.000 29.695 29.695 add.py:43(_find_all_targets)
1 0.047 0.047 29.695 29.695 add.py:46(<listcomp>)
49995000 21.008 0.000 21.008 0.000 pathlib.py:581(__init__)
5000 0.065 0.000 14.127 0.003 stage.py:643(dump)
5009 0.280 0.000 14.099 0.003 __init__.py:117(ignore)
5000 0.236 0.000 11.293 0.002 stage.py:22(dump_stage_file)
5000 0.007 0.000 10.308 0.002 stage.py:726(commit)
5000 0.014 0.000 10.301 0.002 base.py:242(commit)
5000 0.018 0.000 10.269 0.002 base.py:362(save)
150009 0.346 0.000 10.007 0.000 __init__.py:392(relpath)
```
~~If I get it right, `path_info` is extremely slow~~. @Suor @efiop @pared @mroutis any thoughts?

So, it's actually a crazy number of calls to `path_info`.
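For what it's worth, the call count itself is consistent with a full pairwise comparison (assuming, and this is only an assumption, that the ~10,000 paths are the 5,000 outputs plus the 5,000 stage files being checked against each other):

```python
# 49,995,000 is exactly C(10000, 2); dividing the cumulative isin() time from
# the profile above by that count gives the per-call cost.
n = 10_000
pairs = n * (n - 1) // 2
print(pairs)                    # 49995000 -- matches ncalls for isin
print(1781.897 / pairs * 1e6)   # ~35.6 microseconds per isin() call
```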
@shcheklein That rather indicates that we spend a lot of time in `PathInfo`. I'll run a similar test for 0.35.7 so we can try to compare.

EDIT:

My computer, 500 files, DVC version 0.35.7:
```
5993252 function calls (5931707 primitive calls) in 2.643 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
2057/1 0.005 0.000 2.645 2.645 {built-in method builtins.exec}
1 0.000 0.000 2.645 2.645 <string>:1(<module>)
1 0.000 0.000 2.645 2.645 runpy.py:195(run_module)
1 0.000 0.000 2.578 2.578 runpy.py:62(_run_code)
1 0.000 0.000 2.578 2.578 __main__.py:1(<module>)
1 0.000 0.000 2.192 2.192 main.py:15(main)
1 0.000 0.000 2.015 2.015 base.py:55(run_cmd)
1 0.000 0.000 2.015 2.015 add.py:14(run)
1 0.000 0.000 2.015 2.015 scm_context.py:2(run)
1 0.001 0.001 2.015 2.015 add.py:9(add)
72 0.001 0.000 1.717 0.024 __init__.py:1(<module>)
1 0.001 0.001 0.908 0.908 add.py:38(_create_stages)
1485/13 0.006 0.000 0.726 0.056
1485/13 0.004 0.000 0.726 0.056
1417/13 0.005 0.000 0.725 0.056
1317/13 0.003 0.000 0.725 0.056
1752/13 0.001 0.000 0.724 0.056
2 0.000 0.000 0.609 0.304 __init__.py:137(check_dag)
2 0.207 0.103 0.609 0.304 __init__.py:299(graph)
619/46 0.001 0.000 0.548 0.012 {built-in method builtins.__import__}
500 0.001 0.000 0.465 0.001 stage.py:658(save)
```
DVC version: master:

```
82438717 function calls (76873063 primitive calls) in 22.374 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
2393/1 0.007 0.000 22.377 22.377 {built-in method builtins.exec}
1 0.000 0.000 22.377 22.377 <string>:1(<module>)
1 0.000 0.000 22.377 22.377 runpy.py:195(run_module)
1 0.000 0.000 22.315 22.315 runpy.py:62(_run_code)
1 0.000 0.000 22.315 22.315 __main__.py:1(<module>)
1 0.000 0.000 21.745 21.745 main.py:16(main)
1 0.000 0.000 21.518 21.518 base.py:58(run_cmd)
1 0.000 0.000 21.517 21.517 add.py:14(run)
1 0.000 0.000 21.517 21.517 scm_context.py:2(run)
1 0.000 0.000 21.517 21.517 add.py:14(add)
2 0.000 0.000 19.322 9.661 __init__.py:142(check_dag)
2 0.198 0.099 19.322 9.661 __init__.py:249(graph)
499500 0.449 0.000 18.777 0.000 path_info.py:55(isin)
507673 0.568 0.000 17.982 0.000 {built-in method builtins.any}
5494500 1.966 0.000 17.372 0.000 path_info.py:60(<genexpr>)
5494500 1.325 0.000 9.704 0.000 _collections_abc.py:879(__iter__)
5494500 2.906 0.000 8.379 0.000 pathlib.py:585(__getitem__)
5119750 2.205 0.000 5.801 0.000 pathlib.py:726(__eq__)
10240000 2.863 0.000 3.222 0.000 pathlib.py:717(_cparts)
4999500 2.067 0.000 2.961 0.000 pathlib.py:657(_from_parsed_parts)
```
> Got it. So maybe it would be possible to put that parallelization into a script that you would run with `dvc run ... -o output_dir ./my_parallel_script`, so it would count as a whole output?
That would work if we ran on a single machine or with shared storage. Our workers run in k8s on separate machines and use DVC with a local cache and remote storage to synchronize data.
> @filipgrano out of curiosity: is your use case some kind of AutoML stuff, producing a huge number of models?
Yes.
Might be a silly question, but why does the DAG need to be built when simply adding files? How can they be part of any dependencies if they are freshly added?
> Might be a silly question, but why does the DAG need to be built when simply adding files? How can they be part of any dependencies if they are freshly added?
We need to make sure that there are no cycles or duplicated outputs. E.g.

```
dvc add foo -f myfoo.dvc
dvc add foo -f myotherfoo.dvc
```

or

```
dvc run ... -o foo ./mycmd
dvc add foo
```
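Illustrative only (not DVC's code): the duplicated-output case above boils down to mapping each output path to the stage that owns it and raising on the second registration.

```python
def check_duplicated_outputs(stages):
    """stages: mapping of .dvc file name -> list of output paths."""
    owner = {}
    for stage, outs in stages.items():
        for out in outs:
            if out in owner:
                raise RuntimeError(
                    f"output {out!r} is already tracked by {owner[out]!r}")
            owner[out] = stage


check_duplicated_outputs({"myfoo.dvc": ["foo"], "bar.dvc": ["bar"]})           # fine
# check_duplicated_outputs({"myfoo.dvc": ["foo"], "myotherfoo.dvc": ["foo"]})  # raises
```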
Got a 3x-4x improvement with this simple trick:

```python
def isin(self, other):
    if isinstance(other, basestring):  # basestring: py2/3 string compat alias
        other = self.__class__(other)
    elif self.__class__ != other.__class__:
        return False
    for p in self.parents:
        if p == other:
            return True
        # parents get shorter as we walk up, so once `other` is at least as
        # long as the current parent, nothing further up can match
        if len(other.parts) >= len(p.parts):
            return False
```
@Suor can you confirm that it makes sense?
Next is to reduce (or cache?) the number of `isin` calls in the first place, if possible.

@filipgrano it looks promising ^^. We can send an update as soon as I confirm that there are no issues.
@shcheklein I would rather fall back to comparing strings or go over `_parts` from the beginning. Constructing `parents` is way slower.
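A sketch of that suggestion (illustrative only, not the actual patch): compare the already-parsed parts directly instead of materializing `parents`. `_cparts` is the case-folded parts list pathlib itself uses for comparisons (a private attribute, present in the Python versions discussed here).

```python
from pathlib import PurePosixPath


class PathInfo(PurePosixPath):          # simplified stand-in for dvc's PathInfo
    def isin(self, other):
        if isinstance(other, str):
            other = self.__class__(other)
        elif self.__class__ != other.__class__:
            return False
        n = len(other._cparts)
        # `self` is inside `other` iff other's parts are a strict prefix of self's
        return len(self._cparts) > n and self._cparts[:n] == other._cparts


assert PathInfo("data/dog/1.jpg").isin("data")
assert not PathInfo("data2").isin("data")
```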
I wonder if we should support some partial pulling though, because per-file `.dvc` files might be a nuisance for a user.

@Suor `parents` take only 16 s for 49,995,000 calls, 16 s out of ~500 s of execution. I would say it's not a problem at all; 49,995,000 is the problem. I hope all these checks inside `graph` can be rewritten to be linear (using some hashing).
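One way the hashing idea could look (a sketch under the assumption that only overlap/duplication detection is needed, not the full edge construction): hash all output paths into a set once, then walk each path's own ancestors, which is O(n · depth) instead of O(n²) pairwise `isin` calls.

```python
from pathlib import PurePosixPath


def find_overlaps(out_paths):
    """Yield (child, parent) pairs where one tracked path lies inside another."""
    outs = set(out_paths)                    # O(n) to build
    for path in out_paths:
        for ancestor in PurePosixPath(path).parents:
            if str(ancestor) in outs:        # O(depth) lookups per path
                yield path, str(ancestor)


print(list(find_overlaps(["data/a.txt", "data", "other/b.txt"])))
# [('data/a.txt', 'data')]
```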
@Suor of course, I'm not against using `_cparts` if it's easy enough.
If everything is done right, PR #2230 gives a ~20x improvement:

```
2185/1 0.009 0.000 116.230 116.230 {built-in method builtins.exec}
1 0.000 0.000 116.229 116.229 runpy.py:195(run_module)
1 0.000 0.000 116.108 116.108 runpy.py:62(_run_code)
1 0.000 0.000 116.108 116.108 __main__.py:1(<module>)
1 0.000 0.000 115.397 115.397 main.py:16(main)
1 0.000 0.000 115.023 115.023 base.py:58(run_cmd)
1 0.000 0.000 115.022 115.022 add.py:14(run)
1 0.000 0.000 115.022 115.022 scm_context.py:2(run)
1 0.009 0.009 114.992 114.992 add.py:14(add)
1 0.031 0.031 66.866 66.866 add.py:56(_create_stages)
10000 0.125 0.000 62.874 0.006 __init__.py:200(is_tracked)
10001 0.337 0.000 61.609 0.006 util.py:248(__getattr__)
10000 0.158 0.000 61.249 0.006 base.py:111(_set_cache_)
10000 53.306 0.005 53.306 0.005 util.py:163(file_contents_ro)
5000 0.018 0.000 53.066 0.011 stage.py:688(save)
5000 0.089 0.000 51.082 0.010 base.py:203(save)
1 0.000 0.000 31.687 31.687 add.py:43(_find_all_targets)
1 0.066 0.066 31.687 31.687 add.py:46(<listcomp>)
5000 0.080 0.000 15.520 0.003 stage.py:643(dump)
5009 0.278 0.000 13.587 0.003 __init__.py:117(ignore)
5000 0.388 0.000 12.346 0.002 stage.py:22(dump_stage_file)
5000 0.009 0.000 10.467 0.002 main.py:420(dump)
5000 0.025 0.000 10.457 0.002 main.py:441(dump_all)
5000 0.021 0.000 10.277 0.002 main.py:796(dump)
150009 0.357 0.000 10.038 0.000 __init__.py:392(relpath)
5000 0.015 0.000 9.939 0.002 representer.py:82(represent)
5000 0.008 0.000 9.885 0.002 stage.py:726(commit)
5000 0.014 0.000 9.877 0.002 base.py:242(commit)
5000 0.016 0.000 9.845 0.002 base.py:364(save)
```
Most of the time now goes to the `file_contents_ro` check.
@filipgrano please check the latest release, 0.51.1. It should be available to install via pip and should give you a ~20x performance improvement. Please let us know if that works for you.

@filipgrano closing this, since it should be fixed in the latest 0.51+ versions.