dvc get can can delete the current directory

Created on 10 Jan 2020  路  12Comments  路  Source: iterative/dvc

First off, I saw it coming, had my stuff backed up etc, so no harm done.
I was trying to test out dvc get to pull models from a different project and the cache must not have been present in the remote:

models/mlr/production is defined as the output of a dvc file.

$ dvc get <internal_repo> models/mlr/production/ --rev <name of branch>
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:    
name: models/mlr/production, md5: ebbb5c351028a6e5485dfdec639fbb57.dir
Missing cache for directory '../../../tmp/tmp5jtl3fccdvc-erepo/models/mlr/production'. Cache for files inside will be lost. Would you like to continue? Use '-f' to force. [y/n] y            
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:   
name: models/mlr/production, md5: ebbb5c351028a6e5485dfdec639fbb57.dir
WARNING: Cache 'ebbb5c351028a6e5485dfdec639fbb57.dir' not found. File '.' won't be created.                                                                                                   
file '.' is going to be removed. Are you sure you want to proceed? [y/n] y                                                                                                                    
ERROR: unexpected error - [Errno 2] No such file or directory

Once I saw the File '.' won't be created. file '.' is going to be removed. I knew what was going to happen, but was surprised that dvc lets you do that. (It wiped out my whole git repo ;))

DVC version 0.80.0,
Platform WSL
method of installation: pip

bug p0-critical

All 12 comments

Hi @Persedes !

Wow, that is awful. Investigating right now. Thank you for reporting it! 馃檹

Indeed, the issue is caused by dvc being confused by the trailing / in the models/mlr/production/. A workaround for now is to not include the trailing slash. Will send a fix soon...

Reproduction script:

#!/bin/bash

rm -rf erepo repo git_repo
mkdir erepo repo git_repo

maindir=$(pwd)
set -ex
pushd git_repo
git init --bare --quiet
popd

pushd erepo
git init --quiet
git remote add origin $maindir/git_repo
dvc init -q

mkdir models
echo model >> models/1

dvc add models
git add -A
git commit -am "add stuff"

git push origin master
rm models/1
rm -rf .dvc/cache

popd

pushd repo
dvc get $maindir/erepo models/

Expected behavior:

  • do not remove whole repo for starters
  • we should probably rethink the output in this case, seems like the problem related to some recent UI issues #2839, #2691

I have found another oddity, which might be related to the bug above. dvc get keeps nesting outputs.

The first get pulls a folder with two tsv files.
The second one creates a prepared folder inside that one. I would have expected dvc get to either do nothing or overwrite the existing tsv files when pulling from a different branch/ commit.

# start in an empty folder
$ ls

# pull output from a folder and define the output path
$ dvc get https://github.com/iterative/example-get-started data/prepared -o dvc_files/
$ ls dvc_files/
test.tsv  train.tsv

# running it again creates a prepared folder inside of the defined output_path
$ dvc get https://github.com/iterative/example-get-started data/prepared -o dvc_files/
# prepared gets nested inside of dvc_files
$ ls dvc_files/
prepared  test.tsv  train.tsv

@efiop , are you working on this one?

https://github.com/iterative/dvc/issues/3105#issuecomment-573854955

Nesting only happens once, this is due to resolve_output. Can't tell what was the original intention behind those lines:

https://github.com/iterative/dvc/blob/fe635a5040c9d593cd1bdef3fa31f4df7af85259/dvc/utils/__init__.py#L336-L344

@efiop ?

@pared , I was trying to write a test based on your reproduction script. But didn't understand why you were git pushing. Could you explain it for me?

By the way, here's what I have so far:

def test_error(tmp_dir, dvc, erepo_dir):
    with tempfile.TemporaryDirectory() as remote:
        git.Repo.init(remote)

        with erepo_dir.chdir():
            erepo_dir.scm_gen("dir/file", "text", commit="create file")
            erepo_dir.dvc_add("dir")
            origin = erepo_dir.scm.repo.create_remote("origin", remote)
            origin.push()

        os.remove(fspath(erepo_dir / "dir" / "file"))
        shutil.rmtree(fspath(erepo_dir / ".dvc" / "cache"))

        dvc.get(erepo_dir, "dir")
        assert git.Repo(remote)

It passes, so I'm not sure if I'm translating it correctly :confused:

@mroutis, git push is unnecessary, dunno how it got there, removing it still makes the bug reproducible.

[EDIT]
@mroutis, I modified your test a little bit:

def test_error(tmp_dir, dvc, erepo_dir):
    with erepo_dir.chdir():
        erepo_dir.gen("dir/file", "text") # just gen here, not scm_gen
        erepo_dir.dvc_add("dir", commit="add dir")

    os.remove(fspath(erepo_dir / "dir" / "file"))
    shutil.rmtree(fspath(erepo_dir / ".dvc" / "cache"))

    try:
        with mock.patch("dvc.prompt.confirm", return_value=True):
            dvc.get(fspath(erepo_dir), "dir/") # note that we get "dir/" not "dir"
    except Exception:
        pass
    finally:
        assert tmp_dir.exists()

@pared , great! Thanks for clearing it out)

@pared , I still don't get it. Tests are passing on my side, I'm loosing it on this one :sweat_smile:

@mroutis even the one that I provided?

@Persedes I did a little patch that should fix the original issue,

as to the second one, the one described in https://github.com/iterative/dvc/issues/3105#issuecomment-573854955. I am afraid that has to stay. I do agree that in this particular case it does not do what you would like it to do, but this behaviour is consistent with what filesystem operations allow as to do.
For example:

#!/bin/bash

rm -rf src clone
mkdir src
echo data >> src/file

cp -r src clone
tree clone
cp -r src clone
tree clone

Will show that the second cp will create clone/src.
If we were to prevent such behaviour, we would also have to forbid get-ing file -> dir that results in dir/file, or we would have to check whether src is directory, but then we would break consistency with filesystem operations.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

ynop picture ynop  路  41Comments

drorata picture drorata  路  46Comments

dmpetrov picture dmpetrov  路  64Comments

kskyten picture kskyten  路  44Comments

jorgeorpinel picture jorgeorpinel  路  45Comments