Please provide information about your setup
DVC version(i.e. dvc --version
), Platform and method of installation (pip, homebrew, pkg Mac, exe (Windows), DEB(Linux), RPM(Linux))
DVC 0.86.4, conda installation, Windows.
See https://discuss.dvc.org/t/importing-from-a-git-repo-then-pulling/320 for background.
After import
ing a directory from a git repository (not a dvc repository), dvc pull
fails with ERROR: unexpected error - 'ExternalGitRepo' object has no attribute 'cache'
.
Script to reproduce:
mkdir git_repo
cd git_repo
git init
mkdir a_dir
touch a_dir/test.txt
git add .
git commit -m "Init git repo"
cd -
mkdir dvc_repo
cd dvc_repo
git init
dvc init
git add .
git commit -m "Init"
dvc import ../git_repo a_dir
dvc pull
Thanks for the bug report @charlesbaynham ! We'll take a look ASAP.
@jorgeorpinel, @efiop, how should this behave for git repo? Not pull at all or like dvc update
? Could not find anything on https://man.dvc.org/pull about this.
Am I right in thinking that dvc repos don't put imported files into their cache / remote? So a pull will always need to access the original source?
@charlesbaynham it actually does put it in the cache. Continuing your script:
$ cat a_dir.dvc
...
repo:
url: ../git_repo
rev_lock: 314393c2a7b82004302d5b7f219e724522db88ca
outs:
- md5: 59446444eae8d3c063f64fefd37b43b5
...
$ tree .dvc/cache
.dvc/cache
โโโ 59
โย ย โโโ 446444eae8d3c063f64fefd37b43b5.dir
โโโ d4
โโโ 1d8cd98f00b204e9800998ecf8427e
$ cat .dvc/cache/59/446444eae8d3c063f64fefd37b43b5.dir
[{"md5": "d41d8cd98f00b204e9800998ecf8427e", "relpath": "test.txt"}]
But shouldn't push it to the remote storage, indeed.
@skshetry in this case, I don't understand why dvc pull
is even trying to pull from this _import stage_, since the output is already in cache ^ (and workspace).
I think that's the current p0
bug.
dvc fetch\pull
and dvc push
as well, just in case.@charlesbaynham it actually does put it in the cache. Continuing your script:
It actually doesn't, just that .dir
file, which is more of a temporary file than a proper cache in that particular case. So we will indeed need to have access to the original source.
@skshetry It should git clone (as we usually do) and just copy stuff over, same as if you would've re-imported it again.
It actually doesn't
OK. But in my script you can see that both .dvc/cache/59/446444e...b5.dir
(a_dir) AND .dvc/cache/d4/1d8cd98...
(a_dir/test.txt) are in cache... what am I missing? ๐
Able to reproduce with following tests:
def test_pull_imports(tmp_dir, dvc, scm, git_dir):
with git_dir.chdir():
git_dir.scm_gen("foo", "foo", commit="first version")
dvc.imp(fspath(git_dir), "foo")
dvc.pull()
def test_pull_imports_erepo_dvc(tmp_dir, dvc, scm, erepo_dir):
with erepo_dir.chdir():
erepo_dir.dvc_gen("foo", "foo", commit="first version")
dvc.imp(fspath(erepo_dir), "foo")
dvc.pull()
Looks like it never worked before. dvc import
already puts the file in the cache, and dvc update
should be used for updating. Nevertheless, I'm trying to see what's going on.
@jorgeorpinel, current pull
behavior is to download missing files from the remote storage to the cache. The error is obviously a bug, but, to be able to pull from the git-imported artifacts is a change in behavior.
Do we need to support this? Or, we should delegate it to dvc import/update
for this?
The error is obviously a bug, but, to be able to pull from the git-imported artifacts is a change in behavior
How else can the bug be fixed? If theres something in the middle that would probably be a good start. Maybe just make pull ignore import stages? I'm not sure tbh, please check with @efiop and @shcheklein. Do we want pull to treat import stages like any other stages? I'm guessing yes if possible.
Do we need to support this? Or, we should delegate it to
dvc import/update
for this?
Shouldn't it be that dvc pull
retrieves the version of the git import specified by rev_lock
in the dvc file, in contrast to dvc update / dvc import
which git pull
s the latest revision and updates rev_lock
accordingly?
@charlesbaynham, #3503 implemented exactly how you described. dvc pull
will fetch using specified rev_lock
. dvc import
can be used to import (or, forced re-import), and update will update rev_lock
if rev
moves. We have update --rev
to move to a different revision.
@skshetry @efiop Fantastic, thank you! I'll give it a try.
Works like a charm
Most helpful comment
Works like a charm