Dvc: Failure to pull after importing a git repo

Created on 21 Feb 2020  ยท  15Comments  ยท  Source: iterative/dvc

Please provide information about your setup
DVC version(i.e. dvc --version), Platform and method of installation (pip, homebrew, pkg Mac, exe (Windows), DEB(Linux), RPM(Linux))

DVC 0.86.4, conda installation, Windows.

See https://discuss.dvc.org/t/importing-from-a-git-repo-then-pulling/320 for background.

After importing a directory from a git repository (not a dvc repository), dvc pull fails with ERROR: unexpected error - 'ExternalGitRepo' object has no attribute 'cache'.

bug p1-important research

Most helpful comment

Works like a charm

All 15 comments

Script to reproduce:

mkdir git_repo
cd git_repo
git init
mkdir a_dir
touch a_dir/test.txt
git add .
git commit -m "Init git repo"

cd -
mkdir dvc_repo
cd dvc_repo
git init
dvc init
git add .
git commit -m "Init"
dvc import ../git_repo a_dir
dvc pull

Thanks for the bug report @charlesbaynham ! We'll take a look ASAP.

@jorgeorpinel, @efiop, how should this behave for git repo? Not pull at all or like dvc update? Could not find anything on https://man.dvc.org/pull about this.

Am I right in thinking that dvc repos don't put imported files into their cache / remote? So a pull will always need to access the original source?

@charlesbaynham it actually does put it in the cache. Continuing your script:

$ cat a_dir.dvc
...
  repo:
    url: ../git_repo
    rev_lock: 314393c2a7b82004302d5b7f219e724522db88ca
outs:
- md5: 59446444eae8d3c063f64fefd37b43b5
...
$ tree .dvc/cache 
.dvc/cache
โ”œโ”€โ”€ 59
โ”‚ย ย  โ””โ”€โ”€ 446444eae8d3c063f64fefd37b43b5.dir
โ””โ”€โ”€ d4
    โ””โ”€โ”€ 1d8cd98f00b204e9800998ecf8427e
$ cat .dvc/cache/59/446444eae8d3c063f64fefd37b43b5.dir 
[{"md5": "d41d8cd98f00b204e9800998ecf8427e", "relpath": "test.txt"}]

But shouldn't push it to the remote storage, indeed.

@skshetry in this case, I don't understand why dvc pull is even trying to pull from this _import stage_, since the output is already in cache ^ (and workspace).

image

https://dvc.org/doc/command-reference/pull

I think that's the current p0 bug.

  • [ ] Whether there are other problems with Git-tracked imported files, we would need to extensively Q/A re-importing and updating them... And probably dvc fetch\pull and dvc push as well, just in case.

@charlesbaynham it actually does put it in the cache. Continuing your script:

It actually doesn't, just that .dir file, which is more of a temporary file than a proper cache in that particular case. So we will indeed need to have access to the original source.

@skshetry It should git clone (as we usually do) and just copy stuff over, same as if you would've re-imported it again.

It actually doesn't

OK. But in my script you can see that both .dvc/cache/59/446444e...b5.dir (a_dir) AND .dvc/cache/d4/1d8cd98... (a_dir/test.txt) are in cache... what am I missing? ๐Ÿ™‚

Able to reproduce with following tests:

def test_pull_imports(tmp_dir, dvc, scm, git_dir):
    with git_dir.chdir():
        git_dir.scm_gen("foo", "foo", commit="first version")

    dvc.imp(fspath(git_dir), "foo")
    dvc.pull()

def test_pull_imports_erepo_dvc(tmp_dir, dvc, scm, erepo_dir):
    with erepo_dir.chdir():
        erepo_dir.dvc_gen("foo", "foo", commit="first version")

    dvc.imp(fspath(erepo_dir), "foo")
    dvc.pull()

Looks like it never worked before. dvc import already puts the file in the cache, and dvc update should be used for updating. Nevertheless, I'm trying to see what's going on.

@jorgeorpinel, current pull behavior is to download missing files from the remote storage to the cache. The error is obviously a bug, but, to be able to pull from the git-imported artifacts is a change in behavior.

Do we need to support this? Or, we should delegate it to dvc import/update for this?

The error is obviously a bug, but, to be able to pull from the git-imported artifacts is a change in behavior

How else can the bug be fixed? If theres something in the middle that would probably be a good start. Maybe just make pull ignore import stages? I'm not sure tbh, please check with @efiop and @shcheklein. Do we want pull to treat import stages like any other stages? I'm guessing yes if possible.

Do we need to support this? Or, we should delegate it to dvc import/update for this?

Shouldn't it be that dvc pull retrieves the version of the git import specified by rev_lock in the dvc file, in contrast to dvc update / dvc import which git pulls the latest revision and updates rev_lock accordingly?

@charlesbaynham, #3503 implemented exactly how you described. dvc pull will fetch using specified rev_lock. dvc import can be used to import (or, forced re-import), and update will update rev_lock if rev moves. We have update --rev to move to a different revision.

@skshetry @efiop Fantastic, thank you! I'll give it a try.

Works like a charm

Was this page helpful?
0 / 5 - 0 ratings