Dvc: Speed up dvc status for large projects

Created on 4 Feb 2020 · 21 Comments · Source: iterative/dvc

Please provide information about your setup
DVC version (i.e. dvc --version), platform and method of installation (pip, homebrew, pkg (Mac), exe (Windows), DEB (Linux), RPM (Linux))

dvc version

DVC version: 0.82.8
Python version: 3.7.6
Platform: Linux-5.0.0-37-generic-x86_64-with-debian-buster-sid
Binary: False
Package: pip
Cache: reflink - not supported, hardlink - not supported, symlink - supported
Filesystem type (cache directory): ('ext4', '/dev/sda1')
Filesystem type (workspace): ('ext4', '/dev/nvme0n1p2')

Expected: dvc status returns in a shorter time; currently it takes 30s.

A few stats:

du -shL my-project # 107G
dvc status -v

log.txt

awaiting response bug p1-important performance research

All 21 comments

I think we should try to reproduce such a big repo and see what takes so much time; I suspect DB access.

@pared @Ykid could we run it with a profiler please:

python -m cProfile -o status.prof -m dvc status -v
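Once status.prof exists, it can be inspected with the standard-library pstats module. The following is a minimal sketch, assuming the profile was written to status.prof as in the command above; the helper name top_functions is only illustrative.

```python
import io
import pstats


def top_functions(prof_path, limit=20):
    """Return a text report of the `limit` most expensive calls,
    ordered by cumulative time (time spent in a call and its callees)."""
    buffer = io.StringIO()
    stats = pstats.Stats(prof_path, stream=buffer)
    # strip_dirs() shortens file paths so the report is easier to read
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()
```

Calling print(top_functions("status.prof")) lists the slowest call chains first, which should make it easier to tell whether the time goes into DB access, git operations, or something else.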

Reading through the Discord dialog and looking at the log, I don't think this is about a big repo. It is probably about many imported files, which create many clones on dvc status.

@Suor grep --include=\*.{dvc,} -rn my-data-dir -e ".git" | wc -l gives me 64, so it should be roughly that many. I have a DVC data registry and import some directories and files from it into the project. So does it sound better if I do dvc get followed by dvc add? Those imported files are not going to be updated very frequently, after all.

The number of files versioned by dvc is around 9k, from find -L my-data-dir -type f ! -name "*.dvc" | wc -l. dvc pull shows 8.31k (the number changes on each line of the dvc pull output, but the 8.31k one is the most time-consuming).

May I know if there is any way to improve it?

So does it sound better if I do dvc get followed by dvc add

This is obviously an issue on our end. I will think about how this may be sped up.

@Ykid Can you say how many different sources you use in those 64 import stages? A source is a (url, rev) pair in deps.repo.

@Suor There is one url, as it is our data registry. I followed the data registries guide to set it up. The number of stages per pair is:

  • 60: (repo, revA)
  • 4: (repo, revB)
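A tally like the one above can be produced with a short script. This is a sketch under assumptions: it expects each .dvc file to be already parsed into a dict (for example with a YAML loader), and it assumes the layout implied by "deps.repo", i.e. each import dependency carries a repo mapping with url and rev keys. The function name count_sources is hypothetical.

```python
from collections import Counter


def count_sources(stages):
    """Count import dependencies per (url, rev) pair.

    `stages` is an iterable of dicts, each one the parsed content
    of a single .dvc file (assumed layout: deps -> repo -> url/rev).
    """
    sources = Counter()
    for stage in stages:
        for dep in stage.get("deps") or []:
            repo = dep.get("repo")
            if repo:  # only imported dependencies carry a `repo` entry
                sources[(repo.get("url"), repo.get("rev"))] += 1
    return sources
```

Stages without deps, or deps without a repo entry, are skipped, so ordinary pipeline stages do not distort the count.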

@Ykid thanks.

A note from Discord: the git repo is big, both its history and the checked-out files:

84K ./tools
1.8M  ./.dvc
988M  ./notebooks
20K ./my-project.egg-info
16K ./configs
870M  ./.git
16M ./my-project
92K ./dockerfiles
28K ./tests
6.3M  ./data
24K ./pipeline
1.9G  .

The last change only caches a single instance of the repo, not all of them. This saves us from needing a new git clone, but it makes a copy of the whole repo (sans DVC cache) each time. The reason is that we do git pull each time, i.e. we modify the directory, so caching cannot be used for it. Things are further complicated by the summon publishing code, which modifies the directory returned by external_repo() and therefore also requires a separate copy.

I am trying to untangle it now.

The last change only caches a single instance of the repo, not all of them

May I know what this means ?

May I know what this means ?

Sorry for the slow response.

This means that when you have many imports from the same repo, dvc makes a fresh copy of its clone many times, while the clone itself is only done once. This is generally not an issue, but since you have a huge git repo (all those notebooks don't really play nice), it takes time.

Anyway, this should be fixed after https://github.com/iterative/dvc/pull/3286 lands. You can try it right now with:

pip install git+https://github.com/Suor/dvc.git@erepo-ro

If you do, can you please tell us how well it works for you?
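The difference between copying per import and reusing one checkout per source can be pictured with a small cache keyed by (url, rev). This is an illustrative sketch only, not DVC's actual code and not necessarily how the pull request implements it; CloneCache and fake_clone are hypothetical names, and the shared checkouts would have to be treated as read-only.

```python
class CloneCache:
    """Hypothetical cache handing out one shared checkout per (url, rev)."""

    def __init__(self, clone):
        self._clone = clone   # callable(url, rev) -> path; the expensive step
        self._paths = {}      # (url, rev) -> path of the shared checkout
        self.clone_calls = 0  # how often the expensive step actually ran

    def get(self, url, rev):
        key = (url, rev)
        if key not in self._paths:
            self.clone_calls += 1
            self._paths[key] = self._clone(url, rev)
        return self._paths[key]


def fake_clone(url, rev):
    # Stand-in for the real git work; just builds a placeholder path.
    return "/tmp/clones/{}@{}".format(url, rev)


# 64 import stages spread over two sources, as reported in this thread
cache = CloneCache(fake_clone)
for rev in ["revA"] * 60 + ["revB"] * 4:
    cache.get("repo", rev)
# cache.clone_calls is now 2: one expensive step per distinct source
```

With this shape, the cost scales with the number of distinct sources rather than with the number of import stages.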

@Suor

DVC version: 0.82.9+f73900
Python version: 3.7.4
Platform: Linux-5.0.0-37-generic-x86_64-with-debian-buster-sid
Binary: False
Package: None
Cache: reflink - not supported, hardlink - not supported, symlink - supported

ERROR: failed to obtain data status - 'Git' object has no attribute 'is_known'

Traceback (most recent call last):
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/command/status.py", line 50, in run
    with_deps=self.args.with_deps,
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/repo/__init__.py", line 31, in wrapper
    ret = f(repo, *args, **kwargs)
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/repo/status.py", line 133, in status
    return _local_status(self, targets, with_deps=with_deps)
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/repo/status.py", line 36, in _local_status
    return _joint_status(stages)
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/repo/status.py", line 25, in _joint_status
    status.update(stage.status(check_updates=True))
  File "/home/user/miniconda3/lib/python3.7/site-packages/funcy/decorators.py", line 39, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/stage.py", line 161, in rwlocked
    return call()
  File "/home/user/miniconda3/lib/python3.7/site-packages/funcy/decorators.py", line 60, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/stage.py", line 1015, in status
    deps_status = self._status(self.deps)
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/stage.py", line 1004, in _status
    ret.update(entry.status())
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/dependency/repo.py", line 59, in status
    current_checksum = self._get_checksum(locked=True)
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/dependency/repo.py", line 50, in _get_checksum
    with self._make_repo(locked=locked) as repo:
  File "/home/user/miniconda3/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/external_repo.py", line 27, in external_repo
    path = _cached_clone(url, rev, for_write=for_write)
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/external_repo.py", line 173, in _cached_clone
    clone_path = _clone_default_branch(url, rev)
  File "/home/user/miniconda3/lib/python3.7/site-packages/funcy/decorators.py", line 39, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/home/user/miniconda3/lib/python3.7/site-packages/funcy/flow.py", line 244, in wrap_with
    return call()
  File "/home/user/miniconda3/lib/python3.7/site-packages/funcy/decorators.py", line 60, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/external_repo.py", line 205, in _clone_default_branch
    if not Git.is_sha(rev) or not git.is_known(rev):
AttributeError: 'Git' object has no attribute 'is_known'

Reopening and escalating the priority. @Suor Please take a look ASAP, we need to release a new version with the fix ASAP as well.

Looks like is_known should've been is_tracked. Surely we should have some sort of test that would have caught this bug.

@Ykid 0.85.0 is out on pip and conda, please upgrade, give it a try and let us know if it fixed the issue for you. Thanks for the feedback! :slightly_smiling_face:

The git-related bug is fixed, but there does not seem to be much performance improvement. :(

DVC version: 0.85.0
Python version: 3.7.6
Platform: Linux-5.0.0-37-generic-x86_64-with-debian-buster-sid
Binary: False
Package: pip
Cache: reflink - not supported, hardlink - not supported, symlink - supported
Filesystem type (cache directory): ('ext4', '/dev/sda1')
Filesystem type (workspace): ('ext4', '/dev/nvme0n1p2')

time dvc status

real    0m30.635s
user    0m21.884s
sys     0m6.815s

@Suor Please take a look.

So the optimization works for me: no unneeded clones or copies are made. git checkout takes significantly more time than expected, though. I need to investigate @Ykid's situation more before jumping to advanced optimizations.

@Ykid I made a branch with more erepo logging. Can you try it to see what is actually happening on your side and how much time it takes?

pip install git+https://github.com/Suor/dvc.git@erepo-log
dvc status -v
# And paste output here

Closing due to inactivity.
