Dvc: Unexpected error when running dvc gc in a git repo with submodules

Created on 16 Mar 2020 · 5 Comments · Source: iterative/dvc

I ran the following command:

dvc gc -a -v

It fails with this traceback:

2020-03-16 09:53:37,900 ERROR: unexpected error - 'be23d78966ad0171e87879e051edf6eb3f446e12'
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/my_user/data/env/my_project/lib/python3.6/site-packages/dvc/main.py", line 50, in main
    ret = cmd.run()
  File "/home/my_user/data/env/my_project/lib/python3.6/site-packages/dvc/command/gc.py", line 60, in run
    workspace=self.args.workspace,
  File "/home/my_user/data/env/my_project/lib/python3.6/site-packages/dvc/repo/__init__.py", line 27, in wrapper
    ret = f(repo, *args, **kwargs)
  File "/home/my_user/data/env/my_project/lib/python3.6/site-packages/dvc/repo/gc.py", line 75, in gc
    jobs=jobs,
  File "/home/my_user/data/env/my_project/lib/python3.6/site-packages/dvc/repo/__init__.py", line 256, in used_cache
    for stage, filter_info in pairs:
  File "/home/my_user/data/env/my_project/lib/python3.6/site-packages/dvc/repo/__init__.py", line 252, in <genexpr>
    for target in targets
  File "/home/my_user/data/env/my_project/lib/python3.6/site-packages/dvc/repo/__init__.py", line 202, in collect_granular
    return [(stage, None) for stage in self.stages]
  File "/home/my_user/data/env/my_project/lib/python3.6/site-packages/funcy/objects.py", line 28, in __get__
    res = instance.__dict__[self.fget.__name__] = self.fget(instance)
  File "/home/my_user/data/env/my_project/lib/python3.6/site-packages/dvc/repo/__init__.py", line 397, in stages
    for root, dirs, files in self.tree.walk(self.root_dir):
  File "/home/my_user/data/env/my_project/lib/python3.6/site-packages/dvc/ignore.py", line 135, in walk
    dirs[:], files[:] = self.dvcignore(root, dirs, files)
  File "/home/my_user/data/env/my_project/lib/python3.6/site-packages/funcy/objects.py", line 28, in __get__
    res = instance.__dict__[self.fget.__name__] = self.fget(instance)
  File "/home/my_user/data/env/my_project/lib/python3.6/site-packages/dvc/ignore.py", line 115, in dvcignore
    return DvcIgnoreFilter(self.tree)
  File "/home/my_user/data/env/my_project/lib/python3.6/site-packages/dvc/ignore.py", line 93, in __init__
    for root, dirs, files in self.tree.walk(self.tree.tree_root):
  File "/home/my_user/data/env/my_project/lib/python3.6/site-packages/dvc/scm/git/tree.py", line 148, in walk
    yield from self._walk(tree, topdown=topdown)
  File "/home/my_user/data/env/my_project/lib/python3.6/site-packages/dvc/scm/git/tree.py", line 131, in _walk
    yield from self._walk(tree[i], topdown=topdown)
  File "/home/my_user/data/env/my_project/lib/python3.6/site-packages/dvc/scm/git/tree.py", line 121, in _walk
    for i in _iter_tree(tree):
  File "/home/my_user/data/env/my_project/lib/python3.6/site-packages/dvc/scm/git/tree.py", line 26, in _iter_tree
    node = submodules[node.hexsha]
KeyError: 'be23d78966ad0171e87879e051edf6eb3f446e12'

My setup

  • DVC version: 0.88.0
  • Platform: Ubuntu 18.04
  • Method of installation: pip
Labels: bug, p0-critical

All 5 comments

I can reproduce this with a git repo that has submodules, e.g. after running git clone https://github.com/githubtraining/example-dependency:

>>> from dvc.scm.git import Git
>>> git = Git(".")
>>> tree = git.get_tree("HEAD~1")
>>> for root, dnames, fnames in tree.walk("."):
...     print(root)
...     print(dnames)
...     print(fnames)
...
/home/efiop/git/feedstocks
['feedstocks']
['.gitmodules', 'LICENSE', 'README.md']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/efiop/git/dvc/dvc/scm/git/tree.py", line 148, in walk
    yield from self._walk(tree, topdown=topdown)
  File "/home/efiop/git/dvc/dvc/scm/git/tree.py", line 131, in _walk
    yield from self._walk(tree[i], topdown=topdown)
  File "/home/efiop/git/dvc/dvc/scm/git/tree.py", line 121, in _walk
    for i in _iter_tree(tree):
  File "/home/efiop/git/dvc/dvc/scm/git/tree.py", line 26, in _iter_tree
    node = submodules[node.hexsha]
KeyError: 'ba4d4c3b302af35049508e381fdff85072caa200'

So this affects all git repos with submodules, which is really bad.

OK, so we are currently using this hack [0] to work around the fact that item.name is not always a basename. IndexObject defines it that way [1], but Submodule doesn't [2]. Using item.name can therefore cause issues such as [3], because GitPython doesn't pass the name parameter when simply iterating over objects [4]. In other places, when working specifically with submodules, it sets the _name attribute explicitly [5].

[0] https://github.com/iterative/dvc/blob/0.90.0/dvc/scm/git/tree.py#L15
[1] https://github.com/gitpython-developers/GitPython/blob/3.1.0/git/objects/base.py#L170
[2] https://github.com/gitpython-developers/GitPython/blob/3.1.0/git/objects/submodule/base.py#L1123
[3] https://github.com/gitpython-developers/GitPython/issues/597
[4] https://github.com/gitpython-developers/GitPython/blob/3.1.0/git/objects/tree.py#L237
[5] https://github.com/gitpython-developers/GitPython/blob/3.1.0/git/objects/submodule/base.py#L357
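
To make the failure mode concrete, here is a minimal, self-contained Python sketch. It uses neither GitPython nor DVC: FakeNode, iter_tree_by_hexsha and iter_tree_names are hypothetical stand-ins, and the hash values are made up. It assumes the KeyError comes from looking up a submodule entry of an older tree in a mapping keyed by a different set of submodule hashes, and it shows that deriving the basename from the entry's path avoids the lookup entirely. This only illustrates the idea; it is not necessarily how the actual fix was implemented.

```python
import posixpath
from dataclasses import dataclass


@dataclass
class FakeNode:
    """Hypothetical stand-in for a git tree entry (not a GitPython class)."""
    type: str    # "blob", "tree" or "submodule"
    hexsha: str  # made-up hash value for illustration
    path: str    # repo-relative path; git always uses "/" as the separator


def iter_tree_by_hexsha(nodes, submodules_by_sha):
    """Mimics the lookup from the traceback: fails on an unknown hexsha."""
    for node in nodes:
        if node.type == "submodule":
            node = submodules_by_sha[node.hexsha]  # may raise KeyError
        yield node


def iter_tree_names(nodes):
    """Alternative: take the basename from the path, no lookup needed."""
    for node in nodes:
        yield posixpath.basename(node.path), node


# An older tree whose submodule entry points at a hash that is
# missing from the mapping built for the current checkout.
old_tree = [
    FakeNode("blob", "sha-readme", "README.md"),
    FakeNode("submodule", "sha-old", "vendor/feedstocks"),
]
current_submodules = {
    "sha-new": FakeNode("submodule", "sha-new", "vendor/feedstocks"),
}

try:
    list(iter_tree_by_hexsha(old_tree, current_submodules))
    lookup_failed = False
except KeyError:
    lookup_failed = True

names = [name for name, _ in iter_tree_names(old_tree)]
print(lookup_failed, names)
```

In this sketch the hexsha lookup fails for the older tree, while the path-based variant still yields a usable basename for every entry, including the submodule.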

@hoangcao The fix was released in 0.90.1. Please upgrade, give it a try, and let us know whether it works for you. Thanks for the feedback! :pray:

It works. Thanks, @efiop
