Dvc: Bug: `dvc add` fails with a modified file (or directory) at the end of a list of files

Created on 3 Dec 2019  路  16Comments  路  Source: iterative/dvc

Using DVC 0.71.0 on RHEL6, installed via conda, configured for system-wide hard links (not sure if it's relevant):

$ dvc --version
0.71.0

$ cat ~/.config/dvc/config
[cache]
protected = true
type = hardlink

I have come across what appears to be a bug, where attempting to dvc add a directory previously under DVC version control whose contents have changed, results in an error, but only when adding along with a list of files, and doesn't occur if the command is repeated (i.e. after all the other files have been added). I have reproduced this in several project directories and via the following minimal example (courtesy of @mroutis):

dvc init --no-scm
mkdir data
echo "foo" > data/foo
dvc add data
echo "bar" > bar.txt
dvc unprotect data
echo "change" > data/foo
dvc add bar.txt data

This results in the following output:

WARNING: Output 'data' of 'data.dvc' changed because it is 'modified'
100%|鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅|Add                                       2.00/2.00 [00:02<00:00,  1.08s/file]
ERROR: file/directory 'data' is specified as an output in more than one stage: data.dvc
    data.dvc

But if the command is re-run:

$ dvc add bar.txt data
Stage is cached, skipping.
100% Add                                                  1.00/1.00 [00:01<00:00,  1.96s/file]

So it appears dvc is somehow mishandling the list of files. Of course, the expected behavior is that it will add the directory successfully on the first try.

Thanks for any effort to track down the source of the bug.

bug p0-critical research

Most helpful comment

@iterative/engineering it seems like an important bug. Should we make it p0?

All 16 comments

Thanks for reporting this, @jaredsampson !

Including the stack trace:

ERROR: file/directory 'data' is specified as an output in more than one stage: data.dvc
    data.dvc
------------------------------------------------------------
Traceback (most recent call last):
  File "/dvc/command/add.py", line 25, in run
    fname=self.args.file,
  File "/dvc/repo/__init__.py", line 36, in wrapper
    ret = f(repo, *args, **kwargs)
  File "/dvc/repo/scm_context.py", line 4, in run
    result = method(repo, *args, **kw)
  File "/dvc/repo/add.py", line 50, in add
    repo.check_modified_graph(stages)
  File "/dvc/repo/__init__.py", line 177, in check_modified_graph
    self._collect_graph(self.stages + new_stages)
  File "/dvc/repo/__init__.py", line 327, in _collect_graph
    raise OutputDuplicationError(str(out), stages)
dvc.exceptions.OutputDuplicationError: file/directory 'data' is specified as an output in more than one stage: data.dvc
    data.dvc
------------------------------------------------------------

I get the same behavior with a flat file foo.txt instead of data/foo as well:

dvc init --no-scm
mkdir data
echo "foo" > foo.txt
dvc add foo.txt
echo "bar" > bar.txt
dvc unprotect foo.txt
echo "change" > foo.txt
dvc add foo.txt bar.txt

yields

WARNING: Output 'foo.txt' of 'foo.txt.dvc' changed because it is 'modified'
100%|鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅|Add                                       2.00/2.00 [00:01<00:00,  1.51file/s]
ERROR: file/directory 'foo.txt' is specified as an output in more than one stage: foo.txt.dvc
    foo.txt.dvc

However, reversing the order of the files in the last command, everything works fine:

$ dvc add foo.txt bar.txt
WARNING: Output 'foo.txt' of 'foo.txt.dvc' changed because it is 'modified'
100% Add                                                  2.00/2.00 [00:01<00:00,  1.47file/s]

Thanks for the additional info, @jaredsampson !

The problem seems to be related to the way we collect the stages.

#!/bin/bash

set -e
set -x

rm -rf myrepo
mkdir myrepo
cd myrepo

dvc init --no-scm

mkdir data

echo "foo" > foo.txt
dvc add foo.txt

echo "bar" > bar.txt
dvc unprotect foo.txt

echo "change" > foo.txt
dvc add foo.txt bar.txt

doesn't reproduce it for me 馃檨I suspect that the original issue might be cause by us not deduping the targets in dvc/repo/add.py, but I haven't looked into it closely enough.

@efiop , change the order of the last call and it should work:

- dvc add foo.txt bar.txt
+ dvc add bar.txt foo.txt

If someone is willing to give it a try, here's a test that you can play with:

diff --git a/tests/func/test_add.py b/tests/func/test_add.py
index 6d51f233..7328c84d 100644
--- a/tests/func/test_add.py
+++ b/tests/func/test_add.py
@@ -24,7 +24,7 @@ from dvc.system import System
 from dvc.utils import file_md5
 from dvc.utils import LARGE_DIR_SIZE
 from dvc.utils import relpath
-from dvc.utils.compat import range
+from dvc.utils.compat import range, pathlib
 from dvc.utils.stage import load_stage_file
 from tests.basic_env import TestDvc
 from tests.utils import get_gitignore_content
@@ -649,3 +649,10 @@ def test_escape_gitignore_entries(git, dvc_repo, repo_dir):
     dvc_repo.add(fname)

     assert ignored_fname in get_gitignore_content()
+
+
+def test_adding_several_files_after_one_has_been_modified(dvc_repo):
+    # https://github.com/iterative/dvc/issues/2886
+    dvc_repo.add('foo')
+    pathlib.Path('foo').write_text('change')
+    dvc_repo.add(['bar', 'foo'])

Might need revisiting the graph building process :grimacing:

@iterative/engineering it seems like an important bug. Should we make it p0?

can confirm that problem still exists on master:

#!/bin/bash

rm -rf repo
mkdir repo

pushd repo
git init --quiet
dvc init -q

set -ex

mkdir data

echo foo>>data/foo
dvc add -q data

echo bar>>bar
echo change>>data/foo

dvc add bar data

Ok, so little investigation helped me to narrow down the issue.
The bug shows up when we are re-add-ing existing DVC-tracked output, but only after calling
dvc.stages first. For example:
Latest version i checked with: 0.86.1

def test_re_add(tmp_dir, scm, dvc):
    tmp_dir.dvc_gen( {"file": "file content"})

    tmp_dir.gen(
            {"file": "modified content"}
    )
    dvc.stages
    dvc.add(["file"])

We will get the same OutputDuplicationError.

  1. Why this is happening?
    repo.stages is cached property, so repo is storing file.dvc. The problem is that when we are re-add-ing some file, the chain of calls is repo.add -> repo._create_stages -> Stage.create which detects that file.dvc exists and removes it, so it can be created again.

So now we don't have file.dvc because it was removed by Stage.create, we have file.dvc in memory as "new" stage that we are creating right now, and because repo's stages were not invalidated, we also have repo.stages==[file.dvc]. That is why we try to check modified graph here during add we have self.stages==[file.dvc] and new_stages==[file.dvc]. While, actually self.stages should not return it (since it was removed by Stage.create.

TLDR
When re-adding we remove stage file, but we don't invalidate cached repo.stages and end up with repo._collect_graph(['file.dvc', 'file.dvc']) which has to fail.

Solution: invalidate repo.stages when removing the stage file.

dvc.stages
dvc.add(["file"])

But we don't do that during the dvc add, right? So what specifically has caused the issue?

@efiop

But we don't do that during the dvc add, right? So what specifically has caused the issue?

Well thats where the devil is. We call stages on repo.check_modified_graph that is why original issue was so strange:

  1. If we re-add only modified artifact, stages will be called only after file.dvc was removed on Stage.create
  2. If we do dvc add other_thingy file (where file exists and has been updated) repo.stages will be called when handling other_thingy, cached, and cause problems when handling updated file.

@pared Got it, makes sense now. So our issue is that we don't reset after creating a stage inside repo.add(), right? I suppose that issue was introduced while moving that multitarget logic from CLI to API.

@efiop
not only repo.add()
To generalize discussion here and under #3349 :
Whenever we Stage.create pre-existing stage and then repo.check_modified_graph (add, imp, run) we are vulnerable to cached stages.
examples:

def test_run_update(tmp_dir, dvc):
    dvc.run(outs=["file"], cmd="echo content > file")
    dvc.stages
    dvc.run(outs=["file"], cmd="echo new_content > file")

def test_imp_update(tmp_dir, dvc, erepo_dir):
    with erepo_dir.chdir():
        erepo_dir.dvc_gen({"file": "file content"}, commit="commit")

    dvc.imp(fspath(erepo_dir), "file")

    dvc.stages
    dvc.imp(fspath(erepo_dir), "file")

@jaredsampson the fix is done and should be included in next release

@pared @jaredsampson Released in 0.86.4, please upgrade and give it a try :slightly_smiling_face:

Terrific. Thanks for the fix!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

dnabanita7 picture dnabanita7  路  3Comments

analystanand picture analystanand  路  3Comments

GildedHonour picture GildedHonour  路  3Comments

shcheklein picture shcheklein  路  3Comments

prihoda picture prihoda  路  3Comments