Please provide information about your setup
DVC version (i.e. dvc --version), platform and method of installation (pip, homebrew, pkg Mac, exe (Windows), DEB (Linux), RPM (Linux))
dvc version
DVC version: 0.82.8
Python version: 3.7.6
Platform: Linux-5.0.0-37-generic-x86_64-with-debian-buster-sid
Binary: False
Package: pip
Cache: reflink - not supported, hardlink - not supported, symlink - supported
Filesystem type (cache directory): ('ext4', '/dev/sda1')
Filesystem type (workspace): ('ext4', '/dev/nvme0n1p2')
I expect dvc status to return in a shorter time; currently it takes 30s.
a few stats
du -shL my-project # 107G
dvc status -v
I think we should try to reproduce such a big repo and see what takes so much time. I suspect db access.
@pared @Ykid could we run it with a profiler please:
python -m cProfile -o status.prof -m dvc status -v
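Once status.prof exists, it can be inspected with the standard-library pstats module. The following is a minimal sketch, not part of DVC; the helper name top_cumulative and the limit of 20 entries are illustrative assumptions.

```python
import io
import pstats


def top_cumulative(prof_path, limit=20):
    """Return the `limit` most expensive calls, sorted by cumulative time, as text.

    `prof_path` is the file written by `python -m cProfile -o <file> ...`.
    The helper name and the default limit are illustrative only.
    """
    buf = io.StringIO()
    stats = pstats.Stats(prof_path, stream=buf)
    # strip_dirs() shortens file paths so the report is easier to read.
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return buf.getvalue()


# Usage (assuming status.prof is in the current directory):
#     print(top_cumulative("status.prof"))
```

Sorting by cumulative time tends to surface the high-level call (for example a clone or a checkout) under which most of the 30s is spent, rather than the many small leaf functions.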
Reading through the Discord dialog and looking at the log, I don't think this is about a big repo. This is probably about many imported files, which create many clones on dvc status.
@Suor grep --include=\*.{dvc,} -rn my-data-dir -e ".git" | wc -l gives me 64, so more or less that many. I have a dvc data registry and import some directories and files from there into the project. So would it be better if I do dvc get followed by dvc add? Those imported files are not going to be updated very frequently after all.
The number of files versioned by dvc is around 9k, from find -L my-data-dir -type f ! -name "*.dvc" | wc -l. dvc pull shows 8.31k (the number changes in each line of dvc pull, but 8.31k is the most time-consuming one).
May I know if there's any way to improve it?
So does it sound better if I do dvc get followed by dvc add
This is obviously an issue on our end. I will think about how this may be sped up.
@Ykid Can you say how many different sources you use in those 64 import stages? A source is a pair of (url, rev) in deps.repo.
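One way to count such pairs is sketched below. This is a hedged example, not a DVC command: it assumes each import stage's DVC-file is YAML with a deps list whose entries may carry a repo mapping with url and rev keys, and it assumes PyYAML is installed for the loading step. The function names distinct_sources and load_stages are hypothetical.

```python
from pathlib import Path


def distinct_sources(stages):
    """Collect unique (url, rev) pairs from already-parsed DVC-file dicts.

    Assumes the layout deps -> [ {repo: {url: ..., rev: ...}}, ... ];
    a missing rev is recorded as None.
    """
    pairs = set()
    for stage in stages:
        for dep in stage.get("deps") or []:
            repo = dep.get("repo")
            if repo:
                pairs.add((repo.get("url"), repo.get("rev")))
    return pairs


def load_stages(root):
    """Yield the parsed content of every *.dvc file under `root`.

    Requires PyYAML (an assumption; imported lazily so that
    distinct_sources() works without it).
    """
    import yaml

    for path in Path(root).rglob("*.dvc"):
        with open(path) as f:
            data = yaml.safe_load(f)
        if isinstance(data, dict):
            yield data


# Usage (directory name taken from the commands earlier in this thread):
#     print(len(distinct_sources(load_stages("my-data-dir"))))
```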
@Suor There is one url, as it is our data registry. I followed the data registries guide to set it up. For the number of pairs, here it is:
@Ykid thanks.
A note from Discord: the git repo is big, both the history and the checked-out files:
84K ./tools
1.8M ./.dvc
988M ./notebooks
20K ./my-project.egg-info
16K ./configs
870M ./.git
16M ./my-project
92K ./dockerfiles
28K ./tests
6.3M ./data
24K ./pipeline
1.9G .
The last change only caches a single instance of the repo, not all of them, which saves us from needing a new git clone, but makes a copy of the whole repo sans dvc cache each time. The reason for that is we do git pull each time, i.e. we modify the directory, which means caching cannot be used for it. Things are further complicated by summon publishing stuff, which modifies the directory returned by external_repo() and so also requires a separate copy.
I am trying to untangle it now.
The last change only caches single instance of repo, not all of them
May I know what this means ?
May I know what this means ?
Sorry for slow response.
This means that when you have many imports from the same repo, dvc will make a fresh copy of its clone many times, while the clone itself is only done once. This is not an issue generally, but since you have a huge git repo (all those notebooks don't really play nice), it takes time.
Anyway, this should be fixed after https://github.com/iterative/dvc/pull/3286 lands. You can try it right now with:
pip install git+https://github.com/Suor/dvc.git@erepo-ro
If you do, can you please tell us how well it works for you?
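The clone-once, copy-only-for-write idea discussed above can be illustrated with a small sketch. This is not DVC's implementation: the function cached_clone, the clone_fn callback, and the module-level dictionary are hypothetical names chosen for the example. Readers share one checkout per (url, rev), and only callers that need to modify the directory get a private copy.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical in-process cache: (url, rev) -> path of a shared checkout.
_clones = {}


def cached_clone(url, rev, clone_fn, for_write=False):
    """Return a checkout of `url` at `rev`.

    `clone_fn(url, rev)` is a caller-supplied function that performs the
    actual (expensive) clone and returns a directory path. It runs at most
    once per (url, rev). Read-only callers share that directory; callers
    passing for_write=True receive a private copy they may modify.
    """
    key = (url, rev)
    if key not in _clones:
        _clones[key] = Path(clone_fn(url, rev))
    shared = _clones[key]
    if not for_write:
        return shared
    # Copy only when the caller intends to modify the directory.
    private = Path(tempfile.mkdtemp()) / "repo"
    shutil.copytree(shared, private)
    return private
```

With 64 imports from one (url, rev), a scheme like this would perform one clone and no copies for a read-only operation such as status, instead of one copy of a 1.9G repo per import.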
@Suor
DVC version: 0.82.9+f73900
Python version: 3.7.4
Platform: Linux-5.0.0-37-generic-x86_64-with-debian-buster-sid
Binary: False
Package: None
Cache: reflink - not supported, hardlink - not supported, symlink - supported
ERROR: failed to obtain data status - 'Git' object has no attribute 'is_known'
Traceback (most recent call last):
File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/command/status.py", line 50, in run
with_deps=self.args.with_deps,
File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/repo/__init__.py", line 31, in wrapper
ret = f(repo, *args, **kwargs)
File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/repo/status.py", line 133, in status
return _local_status(self, targets, with_deps=with_deps)
File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/repo/status.py", line 36, in _local_status
return _joint_status(stages)
File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/repo/status.py", line 25, in _joint_status
status.update(stage.status(check_updates=True))
File "/home/user/miniconda3/lib/python3.7/site-packages/funcy/decorators.py", line 39, in wrapper
return deco(call, *dargs, **dkwargs)
File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/stage.py", line 161, in rwlocked
return call()
File "/home/user/miniconda3/lib/python3.7/site-packages/funcy/decorators.py", line 60, in __call__
return self._func(*self._args, **self._kwargs)
File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/stage.py", line 1015, in status
deps_status = self._status(self.deps)
File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/stage.py", line 1004, in _status
ret.update(entry.status())
File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/dependency/repo.py", line 59, in status
current_checksum = self._get_checksum(locked=True)
File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/dependency/repo.py", line 50, in _get_checksum
with self._make_repo(locked=locked) as repo:
File "/home/user/miniconda3/lib/python3.7/contextlib.py", line 112, in __enter__
return next(self.gen)
File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/external_repo.py", line 27, in external_repo
path = _cached_clone(url, rev, for_write=for_write)
File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/external_repo.py", line 173, in _cached_clone
clone_path = _clone_default_branch(url, rev)
File "/home/user/miniconda3/lib/python3.7/site-packages/funcy/decorators.py", line 39, in wrapper
return deco(call, *dargs, **dkwargs)
File "/home/user/miniconda3/lib/python3.7/site-packages/funcy/flow.py", line 244, in wrap_with
return call()
File "/home/user/miniconda3/lib/python3.7/site-packages/funcy/decorators.py", line 60, in __call__
return self._func(*self._args, **self._kwargs)
File "/home/user/miniconda3/lib/python3.7/site-packages/dvc/external_repo.py", line 205, in _clone_default_branch
if not Git.is_sha(rev) or not git.is_known(rev):
AttributeError: 'Git' object has no attribute 'is_known'
Reopening and escalating the priority. @Suor Please take a look ASAP, we need to release a new version with the fix ASAP as well.
Looks like is_known should've been is_tracked. Surely we should have some sort of test which should've found this bug.
Handled in https://github.com/iterative/dvc/pull/3323.
@Ykid 0.85.0 is out on pip and conda, please upgrade, give it a try and let us know if it fixed the issue for you. Thanks for the feedback! :slightly_smiling_face:
The bug related to git is fixed, but there doesn't seem to be much performance improvement. :(
DVC version: 0.85.0
Python version: 3.7.6
Platform: Linux-5.0.0-37-generic-x86_64-with-debian-buster-sid
Binary: False
Package: pip
Cache: reflink - not supported, hardlink - not supported, symlink - supported
Filesystem type (cache directory): ('ext4', '/dev/sda1')
Filesystem type (workspace): ('ext4', '/dev/nvme0n1p2')
time dvc status
real 0m30.635s
user 0m21.884s
sys 0m6.815s
@Suor Please take a look.
So the optimization works for me: no unneeded clones or copies are done. git checkout takes significantly more time than expected though. I need to investigate @Ykid's situation more before jumping on some advanced optimizations.
@Ykid I made a branch, which has more erepo logging, can you try it to see what is actually happening on your side and how much time that takes?
pip install git+https://github.com/Suor/dvc.git@erepo-log
dvc status -v
# And paste output here
Closing due to inactivity.