DVC 0.84.0, Mac, conda installation
When re-importing a directory from a git repository having specified an alternative output, dvc fails with an "overlapping outs" error. This does not occur if the output directory was not renamed.
The following code sets up a source git repo with a subdirectory and a dvc repo which imports from it. It
# Set up a source repo
mkdir source_repo
cd source_repo
git init
mkdir src
echo Here is some code > src/code.txt
git add .
git commit -m "Save some code"
cd -
# Set up a dvc repo
mkdir dvc_repo
cd dvc_repo
git init
dvc init
git add .
git commit -m "Init DVC"
echo --- Preparations completed --
read -rsp $'Press any key to continue...\n' -n1 key
# Import "src" directory from the source repo
dvc import ../source_repo src
# Reimport same: works fine
dvc import ../source_repo src
echo ---
echo That import + reimport worked fine, but the next will fail
echo ---
echo Now import the same directory, but rename it
read -rsp $'Press any key to continue...\n' -n1 key
# Now import src again, but rename it
dvc import -o src_renamed ../source_repo src
echo Reimport same...
read -rsp $'Press any key to continue...\n' -n1 key
# Finally, reimport the last
# This command will fail
dvc import -o src_renamed ../source_repo src
No error.
ERROR: failed to import 'src' from '../source_repo'. - Paths for outs:
'src_renamed'('src_renamed.dvc')
'src_renamed/src'('src.dvc')
overlap. To avoid unpredictable behaviour, rerun command with non overlapping outs paths.
This doesn't seem to be a problem with importing files, only directories
This seems to occur because the import command behaves differently if there's already an output directory or not.
If I call e.g. dvc import -o "step_A" "../../sample_steps/step_A" src in an empty directory then I get a directory called step_A which contains the contents of src and a dvc file called step_A.dvc. I think that's expected behavior.
If however, I call it in a directory which already contains a subdir called step_A then I instead get a dvcfile called src.dvc and src is created as a subdir of step_A.
It's this changed behaviour that causes the reimport to fail, since the dvc file that it creates has a different name to the first run, so it's a duplicated output.
Full trace:
ERROR: failed to import 'src' from '../source_repo'. - Paths for outs:
'src_renamed'('src_renamed.dvc')
'src_renamed/src'('src.dvc')
overlap. To avoid unpredictable behaviour, rerun command with non overlapping outs paths.
------------------------------------------------------------
Traceback (most recent call last):
File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/command/imp.py", line 19, in run
rev=self.args.rev,
File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/repo/imp.py", line 6, in imp
return self.imp_url(path, out=out, erepo=erepo, locked=True)
File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/repo/__init__.py", line 31, in wrapper
ret = f(repo, *args, **kwargs)
File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/repo/scm_context.py", line 4, in run
result = method(repo, *args, **kw)
File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/repo/imp_url.py", line 20, in imp_url
self.check_modified_graph([stage])
File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/repo/__init__.py", line 184, in check_modified_graph
self._collect_graph(self.stages + new_stages)
File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/repo/__init__.py", line 337, in _collect_graph
raise OverlappingOutputPathsError(outs[p], out)
dvc.exceptions.OverlappingOutputPathsError: Paths for outs:
'src_renamed'('src_renamed.dvc')
'src_renamed/src'('src.dvc')
overlap. To avoid unpredictable behaviour, rerun command with non overlapping outs paths.
------------------------------------------------------------
@charlesbaynham, this is clearly mentioned in the docs (i know, it's not what you'd expect, and this is an edge-case that happened when re-importing).
-o, --out - specify a path (directory and/or file name) to the desired location to place the imported
data and import stage (DVC-file) in. The default value (when this option isn't used) is the current
working directory (.) and original file name. If an existing directory is specified, then the output
will be placed inside of it.
With #3337, update is on par with re-importing (yet to be released, though). And, I feel, we should encourage users to use it instead of re-importing.
Hi @skshetry ,
Ah thanks for that, sorry I missed it in the docs. Happy to close this issue therefore.
Those changes to update sound like they'll remove the strict need for reimports, but IMO the ability to reimport is still useful. In my use-case, for example, I need to create an import if it doesn't exist or update it if it does. I could write code to detect an existing import, but how will I know that it's the same as the one I want (e.g. what if the url has changed?). Without doing a reimport I have to:
update or delete the .dvc file and run importDo you know of a way to correctly reimport a directory?
@charlesbaynham Thanks for your patience, notification got lost in the flow :) Your command is correct, but as @skshetry noted, import behaves like other tools (e.g. cp) and if directory already exists then it places the output inside of it. So the responsibility for removing is on you in this case. You could do dvc update though (now released) and it will handle it for you. Closing this issue for now.
Most helpful comment
@charlesbaynham, this is clearly mentioned in the docs (i know, it's not what you'd expect, and this is an edge-case that happened when re-importing).
With #3337,
updateis on par with re-importing (yet to be released, though). And, I feel, we should encourage users to use it instead of re-importing.