First off, I saw it coming, had my stuff backed up etc, so no harm done.
I was trying to test out dvc get to pull models from a different project and the cache must not have been present in the remote:
models/mlr/production is defined as the output of a dvc file.
$ dvc get <internal_repo> models/mlr/production/ --rev <name of branch>
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
name: models/mlr/production, md5: ebbb5c351028a6e5485dfdec639fbb57.dir
Missing cache for directory '../../../tmp/tmp5jtl3fccdvc-erepo/models/mlr/production'. Cache for files inside will be lost. Would you like to continue? Use '-f' to force. [y/n] y
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
name: models/mlr/production, md5: ebbb5c351028a6e5485dfdec639fbb57.dir
WARNING: Cache 'ebbb5c351028a6e5485dfdec639fbb57.dir' not found. File '.' won't be created.
file '.' is going to be removed. Are you sure you want to proceed? [y/n] y
ERROR: unexpected error - [Errno 2] No such file or directory
Once I saw the File '.' won't be created. file '.' is going to be removed. I knew what was going to happen, but was surprised that dvc lets you do that. (It wiped out my whole git repo ;))
DVC version 0.80.0,
Platform WSL
method of installation: pip
Hi @Persedes !
Wow, that is awful. Investigating right now. Thank you for reporting it! 馃檹
Indeed, the issue is caused by dvc being confused by the trailing / in the models/mlr/production/. A workaround for now is to not include the trailing slash. Will send a fix soon...
Reproduction script:
#!/bin/bash
rm -rf erepo repo git_repo
mkdir erepo repo git_repo
maindir=$(pwd)
set -ex
pushd git_repo
git init --bare --quiet
popd
pushd erepo
git init --quiet
git remote add origin $maindir/git_repo
dvc init -q
mkdir models
echo model >> models/1
dvc add models
git add -A
git commit -am "add stuff"
git push origin master
rm models/1
rm -rf .dvc/cache
popd
pushd repo
dvc get $maindir/erepo models/
Expected behavior:
repo for startersI have found another oddity, which might be related to the bug above. dvc get keeps nesting outputs.
The first get pulls a folder with two tsv files.
The second one creates a prepared folder inside that one. I would have expected dvc get to either do nothing or overwrite the existing tsv files when pulling from a different branch/ commit.
# start in an empty folder
$ ls
# pull output from a folder and define the output path
$ dvc get https://github.com/iterative/example-get-started data/prepared -o dvc_files/
$ ls dvc_files/
test.tsv train.tsv
# running it again creates a prepared folder inside of the defined output_path
$ dvc get https://github.com/iterative/example-get-started data/prepared -o dvc_files/
# prepared gets nested inside of dvc_files
$ ls dvc_files/
prepared test.tsv train.tsv
@efiop , are you working on this one?
https://github.com/iterative/dvc/issues/3105#issuecomment-573854955
Nesting only happens once, this is due to resolve_output. Can't tell what was the original intention behind those lines:
@efiop ?
@pared , I was trying to write a test based on your reproduction script. But didn't understand why you were git pushing. Could you explain it for me?
By the way, here's what I have so far:
def test_error(tmp_dir, dvc, erepo_dir):
with tempfile.TemporaryDirectory() as remote:
git.Repo.init(remote)
with erepo_dir.chdir():
erepo_dir.scm_gen("dir/file", "text", commit="create file")
erepo_dir.dvc_add("dir")
origin = erepo_dir.scm.repo.create_remote("origin", remote)
origin.push()
os.remove(fspath(erepo_dir / "dir" / "file"))
shutil.rmtree(fspath(erepo_dir / ".dvc" / "cache"))
dvc.get(erepo_dir, "dir")
assert git.Repo(remote)
It passes, so I'm not sure if I'm translating it correctly :confused:
@mroutis, git push is unnecessary, dunno how it got there, removing it still makes the bug reproducible.
[EDIT]
@mroutis, I modified your test a little bit:
def test_error(tmp_dir, dvc, erepo_dir):
with erepo_dir.chdir():
erepo_dir.gen("dir/file", "text") # just gen here, not scm_gen
erepo_dir.dvc_add("dir", commit="add dir")
os.remove(fspath(erepo_dir / "dir" / "file"))
shutil.rmtree(fspath(erepo_dir / ".dvc" / "cache"))
try:
with mock.patch("dvc.prompt.confirm", return_value=True):
dvc.get(fspath(erepo_dir), "dir/") # note that we get "dir/" not "dir"
except Exception:
pass
finally:
assert tmp_dir.exists()
@pared , great! Thanks for clearing it out)
@pared , I still don't get it. Tests are passing on my side, I'm loosing it on this one :sweat_smile:
@mroutis even the one that I provided?
@Persedes I did a little patch that should fix the original issue,
as to the second one, the one described in https://github.com/iterative/dvc/issues/3105#issuecomment-573854955. I am afraid that has to stay. I do agree that in this particular case it does not do what you would like it to do, but this behaviour is consistent with what filesystem operations allow as to do.
For example:
#!/bin/bash
rm -rf src clone
mkdir src
echo data >> src/file
cp -r src clone
tree clone
cp -r src clone
tree clone
Will show that the second cp will create clone/src.
If we were to prevent such behaviour, we would also have to forbid get-ing file -> dir that results in dir/file, or we would have to check whether src is directory, but then we would break consistency with filesystem operations.