Dvc: Feature request: Better support for folder dependencies

Created on 28 Aug 2018 · 14 Comments · Source: iterative/dvc

Say I have the following structure:

script.py
module/submodule/foo.py

Then:

dvc run -d module -o bar.npy python script.py

If I modify foo.py, dvc repro does not notice the change. The reason is that neither the inode nor the mtime of the module directory is updated when foo.py, which sits one level down in module/submodule, is modified.

For now it is possible to add every file in module as a separate dependency, but it would surely be better to have a way to handle folders recursively.
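As an illustration of what handling folders recursively could mean, here is a minimal sketch that derives a checksum from the contents of every file under a directory instead of relying on the directory's own inode or mtime. The function name dir_checksum is hypothetical, and this is not DVC's actual implementation, only one way the idea could be approached:

```python
import hashlib
import os


def dir_checksum(root):
    """Return an MD5 hex digest covering every file under root, recursively.

    Hypothetical helper for illustration only; not part of DVC.
    """
    digest = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sort in place so the walk order is deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Include the relative path so that renames change the digest too.
            digest.update(os.path.relpath(path, root).encode("utf-8"))
            with open(path, "rb") as f:
                digest.update(f.read())
    return digest.hexdigest()
```

With a checksum like this, an edit to module/submodule/foo.py changes the value computed for module, even though the mtime of module itself stays the same.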

bug

All 14 comments

Hi @tdeboissiere !

Great point! Preparing a patch right now.

Thanks,
Ruslan

Hi @tdeboissiere !

The patch is merged, I'm releasing 0.18.6 with it right now. Should be ready in an hour or so. I'll let you know when it is out. Thank you so much for the feedback!

@tdeboissiere 0.18.6 is out :slightly_smiling_face: Please feel free to upgrade and give it a try.

Thanks,
Ruslan

Lightning fast as usual, and the fix seems to work well!

2 observations:

  • The same problem seems to exist for outputs. If I have dvc run -d XXX -o output_folder, then dvc repro output_folder.dvc will not reproduce the steps if something within output_folder is changed.

  • Say I have a pipeline with two stages, and each stage has the same -d folder dependency. In that case, if I change a file in folder which would only be used in the second stage, dvc repro will reproduce the first stage as well because of the -d folder dependency. Solving this problem would require DVC to know in advance all the files which are used at a given stage, which is probably a bit tricky and best left to the user...

Hi @tdeboissiere !

The same problem seems to exist for outputs. If I have dvc run -d XXX -o output_folder, then dvc repro output_folder.dvc will not reproduce the steps if something within output_folder is changed.

Are you experiencing it even with the newer version?

Say I have a pipeline with two stages, and each stage has the same -d folder dependency. In that case, if I change a file in folder which would only be used in the second stage, dvc repro will reproduce the first stage as well because of the -d folder dependency. Solving this problem would require DVC to know in advance all the files which are used at a given stage, which is probably a bit tricky and best left to the user...

Yes, this is something that should be handled by the user. DVC can only know that you are using something if you explicitly specify it with -d. You can work around the scenario you described by splitting the folder into a common part and stage-specific parts, and then listing the common part and the relevant stage-specific part as dependencies of each pipeline stage. For example:

$ dvc run -d common_dir -d dir_1 ...
$ dvc run -d common_dir -d dir_2 ...

Thanks,
Ruslan
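To make the effect of that layout concrete, here is a small sketch that works out which stages are affected when a given file changes. The stage names and the helper stale_stages are hypothetical and are not part of DVC; the directory names follow the example above:

```python
from pathlib import PurePosixPath


def stale_stages(stage_deps, changed_path):
    """Return the stages that have a dependency containing changed_path.

    stage_deps maps a stage name to the paths that stage declares with -d.
    Hypothetical helper for illustration only; not part of DVC.
    """
    changed = PurePosixPath(changed_path)
    stale = []
    for stage, deps in stage_deps.items():
        for dep in deps:
            dep_path = PurePosixPath(dep)
            # A stage is stale if the changed file is the dependency itself
            # or lives anywhere below a dependency directory.
            if changed == dep_path or dep_path in changed.parents:
                stale.append(stage)
                break
    return stale


stage_deps = {
    "stage_1": ["common_dir", "dir_1"],
    "stage_2": ["common_dir", "dir_2"],
}
```

Under these assumptions, a change inside dir_2 marks only stage_2 as stale, while a change inside common_dir marks both stages.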

  • Yes, I have verified that the problem still exists for outputs, even with the new version.
  • Thanks for the tip on my second point! Rather than splitting subdirectories, I think it is simpler if I externally enforce the same git status for all my dependencies at all stages.

Hm, I'll investigate that ASAP, seems like a bug. Thank you for reporting the issue!

Reopening the issue to track the progress on the outputs bug.

@tdeboissiere Got it. Was able to reproduce. Preparing a patch right now and will release 0.18.8 in an hour or so. Thank you!

@tdeboissiere 0.18.8 is out. Please feel free to upgrade and see if the issue persists (I've added a test, but it never hurts to confirm that the original issue is indeed fixed :slightly_smiling_face:).

Thanks,
Ruslan

It does work in the use case that caused me to raise the issue, thanks!

Glad it worked for you! One more question:

Thanks for the tip on my second point! Rather than splitting subdirectories, I think it is simpler if I externally enforce the same git status for all my dependencies at all stages.

Could you please elaborate on what you meant by the git status?

Thanks,
Ruslan

  • It could happen that I run a dvc command for part A of the pipeline, make some modification to the code, then run dvc for part B.
  • To avoid reproducibility issues, I would have to correctly track all dependencies with dvc, which can be tricky when the dependencies are large and/or shared between stages.
  • My point was that it may be simpler to avoid this issue in the first place: for my use case, I plan to enforce that no modifications are made to the code between running part A and part B. To this end, I will check the git status and ensure it is unchanged.
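The check described in the last point could be sketched roughly as follows: take a snapshot of the git state before part A and compare it with a snapshot taken before part B. The helper names code_state and assert_unchanged are hypothetical, and the sketch assumes git is installed and that the pipeline code lives in a git repository:

```python
import subprocess


def code_state(repo="."):
    """Return (HEAD commit, porcelain status output) for the repo at `repo`.

    Hypothetical helper for illustration only; assumes git is on PATH.
    """
    def git(*args):
        result = subprocess.run(
            ["git", "-C", repo, *args],
            check=True, capture_output=True, text=True,
        )
        return result.stdout

    return git("rev-parse", "HEAD").strip(), git("status", "--porcelain")


def assert_unchanged(before, after):
    """Raise RuntimeError if two code_state() snapshots differ."""
    if before != after:
        raise RuntimeError(
            "code changed between stages: %r -> %r" % (before, after)
        )
```

A typical use would be to call code_state() before running part A, call it again before running part B, and pass both results to assert_unchanged. Note that two snapshots of a dirty working tree can compare equal even if an already-modified file was edited again, so a stricter variant would also require the status output to be empty.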

Ah, got it. Thank you for clarifying!
