There are a few DVC commands that accept the --all-branches and --all-tags options, namely dvc metrics show, dvc gc, dvc fetch, etc. For all of them, we need to be able to analyze the content of DVC metafiles across different Git revisions. Right now this is done by running git checkout (and then dvc checkout if we need to get the content of a file from the cache). This approach is fragile, depends on the current state of the workspace (e.g. there may be uncommitted changes), and is even dangerous.
We should instead employ the Git API (like ls-tree or ls-files?) and the DVC API to get direct access to the necessary files, directories, etc.
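To make the idea concrete, here is a minimal sketch of reading a revision without touching the working tree. It only assumes the standard git ls-tree and git show plumbing commands; the helper names (parse_ls_tree, select_stage_files, list_stage_files, read_file_at) are hypothetical and not part of DVC, and matching stage files by the ".dvc" suffix is a simplifying assumption.

```python
import subprocess


def parse_ls_tree(output):
    """Parse `git ls-tree -r <rev>` output into (mode, type, sha, path) tuples.

    Each line has the form "<mode> <type> <sha>\t<path>". Paths with unusual
    characters are quoted by git unless -z is used; this sketch ignores that.
    """
    entries = []
    for line in output.splitlines():
        if not line:
            continue
        meta, path = line.split("\t", 1)
        mode, obj_type, sha = meta.split(" ")
        entries.append((mode, obj_type, sha, path))
    return entries


def select_stage_files(entries):
    """Keep only blobs whose path ends with ".dvc" (assumed stage-file suffix)."""
    return [
        path
        for _mode, obj_type, _sha, path in entries
        if obj_type == "blob" and path.endswith(".dvc")
    ]


def list_stage_files(repo_dir, rev):
    """List stage files in `rev` without checking it out."""
    out = subprocess.check_output(
        ["git", "ls-tree", "-r", rev], cwd=repo_dir, text=True
    )
    return select_stage_files(parse_ls_tree(out))


def read_file_at(repo_dir, rev, path):
    """Return the raw bytes of `path` as stored in `rev`."""
    return subprocess.check_output(["git", "show", f"{rev}:{path}"], cwd=repo_dir)
```

Something like list_stage_files(".", "master") followed by read_file_at(".", "master", path) would then read the metafiles of a branch while leaving uncommitted changes in the workspace alone.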
Current implementation is here: https://github.com/iterative/dvc/blob/master/dvc/scm/base.py#L79 and is used in two places: https://github.com/iterative/dvc/blob/master/dvc/repo/__init__.py#L239 and https://github.com/iterative/dvc/blob/master/dvc/repo/metrics/show.py#L144.
Directly related issue: https://github.com/iterative/dvc/issues/1009
Could someone assign me to this as we agreed with @dmpetrov, please?
I'm in the process of making a list of places in the code that use the filesystem interface on files checked out by dvc.scm.Base.brancher and hence should be corrected.
As discussed, I'm not going to make any huge refactoring or improvements while working on this issue. Still, it would probably be a good idea to introduce a better Git interface library, like libgit2 or dulwich, and use it to access Git objects directly in all DVC interactions with Git, instead of calling the git executable via the GitPython wrapper and accessing the filesystem. Though no code changes will be made in this direction for now, it might be a good time to start discussing such a refactoring.
Here is a call graph of the functions that use dvc.scm.Base.brancher(). The gray ones don't use the filesystem directly.
It was intended just for me, to get an understanding of the related parts of the DVC codebase, but I think it is better to share it here.
@ei-grad I've invited you as a collaborator to the project. I think I'll be able to assign you after you accept the invitation.
Great! I can even assign myself now :).
So, as of my current understanding, Repo.stages and Repo.find_outs_by_path should be rewritten to use SCM methods to list files and get their contents, and these methods should be implemented for the Base and Git SCM backends. There will also be some small fixes, like passing file-like objects instead of paths into the Stage object methods.
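A rough sketch of what that interface could look like, just to illustrate the shape of the change. All names here (BaseSCM, list_files, open, InMemorySCM, load_stage, collect_stages) are hypothetical and not the actual DVC API; the in-memory backend stands in for a Git backend, which would implement the same two methods on top of Git objects, and load_stage is a placeholder for real stage parsing.

```python
import io
from abc import ABC, abstractmethod


class BaseSCM(ABC):
    """Hypothetical SCM interface: read a revision without checking it out."""

    @abstractmethod
    def list_files(self, rev):
        """Return the paths tracked in `rev`."""

    @abstractmethod
    def open(self, path, rev):
        """Return a file-like object with the content of `path` in `rev`."""


class InMemorySCM(BaseSCM):
    """Toy backend for illustration: {rev: {path: text}}."""

    def __init__(self, revisions):
        self._revisions = revisions

    def list_files(self, rev):
        return sorted(self._revisions[rev])

    def open(self, path, rev):
        return io.StringIO(self._revisions[rev][path])


def load_stage(fobj):
    """Placeholder for stage loading: takes a file-like object, not a path."""
    return fobj.read()


def collect_stages(scm, revs):
    """Load every ".dvc" file (assumed suffix) from each revision via the SCM."""
    return {
        rev: {
            path: load_stage(scm.open(path, rev))
            for path in scm.list_files(rev)
            if path.endswith(".dvc")
        }
        for rev in revs
    }
```

Because load_stage only sees a file-like object, the same code path would work for the working tree (a regular open file) and for any revision served by the SCM backend.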