Dvc: Error when re-importing a renamed directory

Created on 15 Feb 2020 · 6 Comments · Source: iterative/dvc

DVC 0.84.0, Mac, conda installation

Summary

Re-importing a directory from a git repository with an alternative output path (-o) makes dvc fail with an "overlapping outs" error. The error does not occur when the output directory keeps its original name.

Reproduction

The following script sets up a source git repo with a subdirectory and a dvc repo that imports from it. It then:

  • Successfully imports the subdirectory
  • Successfully reimports the subdirectory
  • Successfully imports and renames the subdirectory
  • Fails to reimport the renamed subdirectory

# Set up a source repo
mkdir source_repo
cd source_repo
git init
mkdir src
echo Here is some code > src/code.txt
git add .
git commit -m "Save some code"
cd  -

# Set up a dvc repo
mkdir dvc_repo
cd dvc_repo
git init
dvc init
git add .
git commit -m "Init DVC"

echo --- Preparations completed --
read -rsp $'Press any key to continue...\n' -n1 key

# Import "src" directory from the source repo
dvc import ../source_repo src

# Reimport same: works fine
dvc import ../source_repo src

echo ---
echo That import + reimport worked fine, but the next will fail 
echo ---
echo Now import the same directory, but rename it
read -rsp $'Press any key to continue...\n' -n1 key

# Now import src again, but rename it
dvc import -o src_renamed ../source_repo src

echo Reimport same...
read -rsp $'Press any key to continue...\n' -n1 key

# Finally, reimport the last
# This command will fail
dvc import -o src_renamed ../source_repo src

Expected behavior

No error.

Actual behaviour

ERROR: failed to import 'src' from '../source_repo'. - Paths for outs: 'src_renamed'('src_renamed.dvc') 'src_renamed/src'('src.dvc') overlap. To avoid unpredictable behaviour, rerun command with non overlapping outs paths.

Labels: bug, p0-critical

This doesn't seem to be a problem with importing files, only directories

This seems to occur because the import command behaves differently depending on whether the output directory already exists.

If I call e.g. dvc import -o "step_A" "../../sample_steps/step_A" src in an empty directory then I get a directory called step_A which contains the contents of src and a dvc file called step_A.dvc. I think that's expected behavior.

If however, I call it in a directory which already contains a subdir called step_A then I instead get a dvcfile called src.dvc and src is created as a subdir of step_A.

It's this changed behaviour that causes the reimport to fail: the second run creates a dvc file with a different name from the first run (src.dvc instead of src_renamed.dvc), and its output lies inside the output of the first import, so the two outs overlap.
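
To make the two cases concrete, here is a minimal sketch that models the naming rule described above. It does not call DVC at all; predict_import_target is a hypothetical helper written for this illustration, and the rule it encodes is inferred from the error message and the behaviour reported in this thread, not taken from DVC's source.

```shell
# Hypothetical helper: models the -o rule described in this thread.
# It only prints "<output path> <DVC-file name>" and never runs dvc.
predict_import_target() {
    out=$1        # value passed to -o
    src_path=$2   # path inside the source repo, e.g. "src"
    if [ -d "$out" ]; then
        # -o names an existing directory: the import is placed inside it
        # and the DVC-file is named after the imported path.
        name=$(basename "$src_path")
        echo "${out%/}/$name $name.dvc"
    else
        # Otherwise the output takes the name given to -o.
        echo "$out $(basename "$out").dvc"
    fi
}

demo_dir=$(mktemp -d)
cd "$demo_dir"
predict_import_target src_renamed src   # first run: directory absent
mkdir src_renamed                       # what the first import leaves behind
predict_import_target src_renamed src   # second run: directory present
```

Under these assumptions the first call reports src_renamed with src_renamed.dvc and the second reports src_renamed/src with src.dvc, which is the same pair of outs named in the error.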

Full trace:

ERROR: failed to import 'src' from '../source_repo'. - Paths for outs:
'src_renamed'('src_renamed.dvc')
'src_renamed/src'('src.dvc')
overlap. To avoid unpredictable behaviour, rerun command with non overlapping outs paths.
------------------------------------------------------------
Traceback (most recent call last):
  File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/command/imp.py", line 19, in run
    rev=self.args.rev,
  File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/repo/imp.py", line 6, in imp
    return self.imp_url(path, out=out, erepo=erepo, locked=True)
  File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/repo/__init__.py", line 31, in wrapper
    ret = f(repo, *args, **kwargs)
  File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/repo/scm_context.py", line 4, in run
    result = method(repo, *args, **kw)
  File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/repo/imp_url.py", line 20, in imp_url
    self.check_modified_graph([stage])
  File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/repo/__init__.py", line 184, in check_modified_graph
    self._collect_graph(self.stages + new_stages)
  File "/anaconda/envs/dvc/lib/python3.6/site-packages/dvc/repo/__init__.py", line 337, in _collect_graph
    raise OverlappingOutputPathsError(outs[p], out)
dvc.exceptions.OverlappingOutputPathsError: Paths for outs:
'src_renamed'('src_renamed.dvc')
'src_renamed/src'('src.dvc')
overlap. To avoid unpredictable behaviour, rerun command with non overlapping outs paths.
------------------------------------------------------------

@charlesbaynham, this is clearly mentioned in the docs (I know, it's not what you'd expect, and this is an edge case that happened when re-importing).

-o, --out - specify a path (directory and/or file name) to the desired location to place the imported
data and import stage (DVC-file) in. The default value (when this option isn't used) is the current
working directory (.) and original file name. If an existing directory is specified, then the output
will be placed inside of it.

With #3337, update is on par with re-importing (yet to be released, though). And, I feel, we should encourage users to use it instead of re-importing.

Hi @skshetry ,

Ah thanks for that, sorry I missed it in the docs. Happy to close this issue therefore.

Those changes to update sound like they'll remove the strict need for reimports, but IMO the ability to reimport is still useful. In my use-case, for example, I need to create an import if it doesn't exist or update it if it does. I could write code to detect an existing import, but how will I know that it's the same as the one I want (e.g. what if the url has changed?). Without doing a reimport I have to:

  • Check if there's an existing .dvc file
  • Check if it has the right url
  • Either run update or delete the .dvc file and run import

Do you know of a way to correctly reimport a directory?
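
The three checks listed above could be scripted roughly as follows. This is only a sketch: import_or_update is a hypothetical helper, not part of DVC, and it assumes that the DVC-file is named after the output and that it records the source repo on a line of the form "url: <value>". Both assumptions should be checked against the DVC version in use. Note that it deletes the existing output directory before re-importing.

```shell
# Hypothetical helper, not part of DVC: import on first use, update afterwards.
# Usage: import_or_update <url> <path-in-repo> <out>
import_or_update() {
    url=$1
    path=$2
    out=$3
    stage="$out.dvc"   # assumption: the DVC-file sits next to the output

    # Assumption: the DVC-file stores the source as a "url: <value>" line.
    # Extract every such value and look for an exact match.
    if [ -f "$stage" ] &&
       sed -n 's/^ *url: *//p' "$stage" | grep -qxF "$url"; then
        # Same source as before: refresh the existing import in place.
        dvc update "$stage"
    else
        # Missing, or pointing elsewhere: remove the stale files so that -o
        # names the output instead of an existing directory to import into.
        # WARNING: this deletes "$out" and everything in it.
        rm -rf "$out" "$stage"
        dvc import -o "$out" "$url" "$path"
    fi
}
```

With the reproduction above, the call would be import_or_update ../source_repo src src_renamed. Running it a second time with the same URL takes the update branch; changing the URL takes the remove-and-import branch.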

@charlesbaynham Thanks for your patience, notification got lost in the flow :) Your command is correct, but as @skshetry noted, import behaves like other tools (e.g. cp): if the directory already exists, it places the output inside of it. So the responsibility for removing it is on you in this case. You could run dvc update instead (now released) and it will handle this for you. Closing this issue for now.
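
The cp comparison can be checked without DVC. The sketch below runs the same copy twice in a throwaway directory; the directory and file names are arbitrary and chosen only for this illustration.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"
mkdir src
echo "Here is some code" > src/code.txt

# First copy: "dst" does not exist yet, so "src" is copied AS "dst".
cp -r src dst

# Second copy: "dst" now exists, so "src" is copied INTO it as "dst/src".
cp -r src dst

# List what ended up under "dst".
find dst | sort
```

After the second copy, dst contains both code.txt (from the first run) and a nested src/code.txt (from the second), which is the same layout that produces the overlapping src_renamed and src_renamed/src outs in the report above.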
