Using DVC 0.71.0 on RHEL6, installed via conda
, configured for system-wide hard links (not sure if it's relevant):
$ dvc --version
0.71.0
$ cat ~/.config/dvc/config
[cache]
protected = true
type = hardlink
I have come across what appears to be a bug, where attempting to dvc add
a directory previously under DVC version control whose contents have changed, results in an error, but only when adding along with a list of files, and doesn't occur if the command is repeated (i.e. after all the other files have been added). I have reproduced this in several project directories and via the following minimal example (courtesy of @mroutis):
dvc init --no-scm
mkdir data
echo "foo" > data/foo
dvc add data
echo "bar" > bar.txt
dvc unprotect data
echo "change" > data/foo
dvc add bar.txt data
This results in the following output:
WARNING: Output 'data' of 'data.dvc' changed because it is 'modified'
100%|鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅|Add 2.00/2.00 [00:02<00:00, 1.08s/file]
ERROR: file/directory 'data' is specified as an output in more than one stage: data.dvc
data.dvc
But if the command is re-run:
$ dvc add bar.txt data
Stage is cached, skipping.
100% Add 1.00/1.00 [00:01<00:00, 1.96s/file]
So it appears dvc
is somehow mishandling the list of files. Of course, the expected behavior is that it will add the directory successfully on the first try.
Thanks for any effort to track down the source of the bug.
Thanks for reporting this, @jaredsampson !
Including the stack trace:
ERROR: file/directory 'data' is specified as an output in more than one stage: data.dvc
data.dvc
------------------------------------------------------------
Traceback (most recent call last):
File "/dvc/command/add.py", line 25, in run
fname=self.args.file,
File "/dvc/repo/__init__.py", line 36, in wrapper
ret = f(repo, *args, **kwargs)
File "/dvc/repo/scm_context.py", line 4, in run
result = method(repo, *args, **kw)
File "/dvc/repo/add.py", line 50, in add
repo.check_modified_graph(stages)
File "/dvc/repo/__init__.py", line 177, in check_modified_graph
self._collect_graph(self.stages + new_stages)
File "/dvc/repo/__init__.py", line 327, in _collect_graph
raise OutputDuplicationError(str(out), stages)
dvc.exceptions.OutputDuplicationError: file/directory 'data' is specified as an output in more than one stage: data.dvc
data.dvc
------------------------------------------------------------
I get the same behavior with a flat file foo.txt instead of data/foo
as well:
dvc init --no-scm
mkdir data
echo "foo" > foo.txt
dvc add foo.txt
echo "bar" > bar.txt
dvc unprotect foo.txt
echo "change" > foo.txt
dvc add foo.txt bar.txt
yields
WARNING: Output 'foo.txt' of 'foo.txt.dvc' changed because it is 'modified'
100%|鈻堚枅鈻堚枅鈻堚枅鈻堚枅鈻堚枅|Add 2.00/2.00 [00:01<00:00, 1.51file/s]
ERROR: file/directory 'foo.txt' is specified as an output in more than one stage: foo.txt.dvc
foo.txt.dvc
However, reversing the order of the files in the last command, everything works fine:
$ dvc add foo.txt bar.txt
WARNING: Output 'foo.txt' of 'foo.txt.dvc' changed because it is 'modified'
100% Add 2.00/2.00 [00:01<00:00, 1.47file/s]
Thanks for the additional info, @jaredsampson !
The problem seems to be related to the way we collect the stages.
#!/bin/bash
set -e
set -x
rm -rf myrepo
mkdir myrepo
cd myrepo
dvc init --no-scm
mkdir data
echo "foo" > foo.txt
dvc add foo.txt
echo "bar" > bar.txt
dvc unprotect foo.txt
echo "change" > foo.txt
dvc add foo.txt bar.txt
doesn't reproduce it for me 馃檨I suspect that the original issue might be cause by us not deduping the targets in dvc/repo/add.py, but I haven't looked into it closely enough.
@efiop , change the order of the last call and it should work:
- dvc add foo.txt bar.txt
+ dvc add bar.txt foo.txt
If someone is willing to give it a try, here's a test that you can play with:
diff --git a/tests/func/test_add.py b/tests/func/test_add.py
index 6d51f233..7328c84d 100644
--- a/tests/func/test_add.py
+++ b/tests/func/test_add.py
@@ -24,7 +24,7 @@ from dvc.system import System
from dvc.utils import file_md5
from dvc.utils import LARGE_DIR_SIZE
from dvc.utils import relpath
-from dvc.utils.compat import range
+from dvc.utils.compat import range, pathlib
from dvc.utils.stage import load_stage_file
from tests.basic_env import TestDvc
from tests.utils import get_gitignore_content
@@ -649,3 +649,10 @@ def test_escape_gitignore_entries(git, dvc_repo, repo_dir):
dvc_repo.add(fname)
assert ignored_fname in get_gitignore_content()
+
+
+def test_adding_several_files_after_one_has_been_modified(dvc_repo):
+ # https://github.com/iterative/dvc/issues/2886
+ dvc_repo.add('foo')
+ pathlib.Path('foo').write_text('change')
+ dvc_repo.add(['bar', 'foo'])
Might need revisiting the graph building process :grimacing:
@iterative/engineering it seems like an important bug. Should we make it p0?
can confirm that problem still exists on master:
#!/bin/bash
rm -rf repo
mkdir repo
pushd repo
git init --quiet
dvc init -q
set -ex
mkdir data
echo foo>>data/foo
dvc add -q data
echo bar>>bar
echo change>>data/foo
dvc add bar data
Ok, so little investigation helped me to narrow down the issue.
The bug shows up when we are re-add
-ing existing DVC-tracked output, but only after calling
dvc.stages
first. For example:
Latest version i checked with: 0.86.1
def test_re_add(tmp_dir, scm, dvc):
tmp_dir.dvc_gen( {"file": "file content"})
tmp_dir.gen(
{"file": "modified content"}
)
dvc.stages
dvc.add(["file"])
We will get the same OutputDuplicationError.
repo.stages
is cached property, so repo is storing file.dvc
. The problem is that when we are re-add
-ing some file, the chain of calls is repo.add
-> repo._create_stages
-> Stage.create
which detects that file.dvc
exists and removes it, so it can be created again. So now we don't have file.dvc
because it was removed by Stage.create
, we have file.dvc
in memory as "new" stage that we are creating right now, and because repo
's stages were not invalidated, we also have repo.stages
==[file.dvc
]. That is why we try to check modified graph here during add
we have self.stages==[
file.dvc]
and new_stages==[
file.dvc]
. While, actually self.stages should not return it (since it was removed by Stage.create
.
TLDR
When re-add
ing we remove stage file, but we don't invalidate cached repo.stages
and end up with repo._collect_graph(['file.dvc', 'file.dvc'])
which has to fail.
Solution: invalidate repo.stages
when removing the stage file.
dvc.stages
dvc.add(["file"])
But we don't do that during the dvc add
, right? So what specifically has caused the issue?
@efiop
But we don't do that during the dvc add, right? So what specifically has caused the issue?
Well thats where the devil is. We call stages on repo.check_modified_graph
that is why original issue was so strange:
add
only modified artifact, stages
will be called only after file.dvc
was removed on Stage.create
dvc add other_thingy file
(where file exists and has been updated) repo.stages
will be called when handling other_thingy
, cached, and cause problems when handling updated file
.@pared Got it, makes sense now. So our issue is that we don't reset after creating a stage inside repo.add()
, right? I suppose that issue was introduced while moving that multitarget logic from CLI to API.
@efiop
not only repo.add()
To generalize discussion here and under #3349 :
Whenever we Stage.create
pre-existing stage and then repo.check_modified_graph
(add
, imp
, run
) we are vulnerable to cached stages
.
examples:
def test_run_update(tmp_dir, dvc):
dvc.run(outs=["file"], cmd="echo content > file")
dvc.stages
dvc.run(outs=["file"], cmd="echo new_content > file")
def test_imp_update(tmp_dir, dvc, erepo_dir):
with erepo_dir.chdir():
erepo_dir.dvc_gen({"file": "file content"}, commit="commit")
dvc.imp(fspath(erepo_dir), "file")
dvc.stages
dvc.imp(fspath(erepo_dir), "file")
@jaredsampson the fix is done and should be included in next release
@pared @jaredsampson Released in 0.86.4, please upgrade and give it a try :slightly_smiling_face:
Terrific. Thanks for the fix!
Most helpful comment
@iterative/engineering it seems like an important bug. Should we make it p0?