I'm getting a rather confusing error when trying to call dvc run:
ERROR: failed to run command - Paths for outs:
'data'('data.dvc')
'data/prepared/prepared_data.csv'('data_preparation.dvc')
overlap. To avoid unpredictable behaviour, rerun command with non overlapping outs paths.
Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
Here are the steps to reproduce the issue:
1. Create a directory myproj, then call git init and dvc init inside of it.
2. Create a directory data and put some data in it. The structure of my data directory is as follows:
   data
   |___ raw
        |___ dir1
             |___ file1
             ...
             |___ fileN
        |___ dir2
             |___ file1
             ...
             |___ fileN
3. Call dvc add data to track the directory using dvc.
4. Create a script myscript.py in the myproj directory which does some preprocessing of the data.
5. Run:
   dvc run -d data/raw -d myscript.py -o data/prepared/prepared_data.csv -f data_preparation.dvc python3 myscript.py
DVC version (i.e. dvc --version): 0.77.3
Platform: KDE neon 5.17
Method of installation: DEB(Linux)
Hi @anferico ! This is the expected behavior. When you dvc add data, dvc will track that whole directory as one entity, so you can't use another dvc-file to output anything inside of it, as it will break the reproducibility and checkout in general.
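To make the "overlap" in the error message concrete, here is a minimal sketch of the idea: two outs overlap when they are the same path or when one lies inside the other. The helper outs_overlap is a hypothetical name used only for illustration; it is not DVC's actual implementation, just a model of the rule described above.

```python
from pathlib import PurePosixPath


def outs_overlap(a: str, b: str) -> bool:
    """Hypothetical helper: True if the paths are equal or one contains the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


# The two outs from the error message: 'data' contains the CSV, so they overlap.
print(outs_overlap("data", "data/prepared/prepared_data.csv"))  # True

# Tracking only data/raw leaves data/prepared free for another dvc-file.
print(outs_overlap("data/raw", "data/prepared/prepared_data.csv"))  # False
```

Under this model, any output written somewhere below an already tracked directory collides with it, which matches the pair of paths listed in the error.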
Hi @anferico ! Good question! As @efiop mentioned, there is indeed nothing wrong with DVC's behavior. I think there is some confusion in your workflow (and maybe we don't communicate it well enough somewhere in our docs). In your case, what you want to do is run:
dvc add data/raw
and then:
dvc run -d data/raw -d myscript.py -o data/prepared/prepared_data.csv -f data_preparation.dvc python3 myscript.py
(you may need to create the data/prepared directory first)
It might be that you actually can do this:
dvc run -d data/raw -d myscript.py -o data/prepared -f data_preparation.dvc python3 myscript.py
Thank you @efiop and @shcheklein, it is actually me who hasn't read the documentation thoroughly.
Concerning the second solution pointed out by @shcheklein, if I specified -o data/prepared, I wouldn't be able to output any other file inside data/prepared through subsequent calls to dvc run, correct?
@anferico no worries :) glad that it helped.
Yes, if you do -o data/prepared then it expects your script to write the _full content_ of that directory every time you run it. In fact, it'll remove the previous version to ensure that old and new files are not mixed and the result is actually reproducible.
But to some extent it's the same as with -o data/prepared/prepared_data.csv: every time you run dvc run it'll remove prepared_data.csv (it's not data loss since it is saved in cache anyway) and expect your script to write it again.
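As a rough illustration of that "remove, then rewrite" behavior, here is a simplified model in Python. It is only a sketch under stated assumptions: rerun_stage and write_csv are hypothetical names, a temporary directory stands in for the project, and nothing here reflects DVC's internals (in particular, the cache is not modeled).

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List


def rerun_stage(out_dir: Path, script: Callable[[Path], None]) -> List[str]:
    """Hypothetical model: drop the old output directory, then run the script."""
    if out_dir.exists():
        shutil.rmtree(out_dir)  # previous version removed so runs never mix
    script(out_dir)  # the script must recreate the full content itself
    return sorted(p.name for p in out_dir.iterdir())


def write_csv(out: Path) -> None:
    """Stand-in for myscript.py: creates the directory and writes one file."""
    out.mkdir(parents=True, exist_ok=True)
    (out / "prepared_data.csv").write_text("a,b\n1,2\n")


with tempfile.TemporaryDirectory() as tmp:
    prepared = Path(tmp) / "data" / "prepared"
    prepared.mkdir(parents=True)
    (prepared / "stale.csv").write_text("old\n")  # left over from an earlier run

    result = rerun_stage(prepared, write_csv)
    print(result)  # ['prepared_data.csv']
```

The stale file does not survive the rerun, which is why a script that owns a whole output directory has to produce everything in it on every run.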
Does that work for you, or do you have different requirements in mind?
(I'm closing this since it's not an issue, but let's keep the discussion going)