Dvc: run: consider abandoning current automatic dvc file name generation

Created on 4 Sep 2018 · 17 Comments · Source: iterative/dvc

Currently, if no -f is specified for dvc run, we automatically use the first output name + .dvc as the name of the produced dvc file. While this method is handy, it ties the stage to its output; a better way would be to associate the stage name with the action it performs rather than with the name of the file it outputs. We need to consider abandoning this practice starting from 1.0 and just using Dvcfile as the default name if no -f is specified.

enhancement

All 17 comments

Yep, the current stage naming magic is ugly and fragile. For example, I'm not sure what happens if I just change the order of outputs (or prepend one more output) and execute dvc run again. Will it create a duplicate stage?

Another option would be to ask the user for a name, with a default option provided (it could be the same as we have now, the first output), if a name is not explicitly specified. This way we would make dvc run more like a wizard. It would fit very well if we decide to make the dvc file structure human-editable in the future, or even if we decide to push users more toward editing and managing the files manually.

Using just Dvcfile feels like an error-prone approach. It would be easy to lose some changes. It's safer to make it a requirement to specify a name (you like to be explicit, @efiop, right? :) ).

Another suggestion. In certain cases (python or R scripts are pretty common, right?) we can try to analyze the command we are running and suggest a name based on the script name, not the output data file. This idea makes sense only if we are going to ask users about the name. Asking users about the name makes even more sense if we migrate the current "create stage" logic to dvc add.
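
The script-based suggestion could look roughly like the sketch below. It is only an illustration, not anything DVC implements: `suggest_stage_file` and `SCRIPT_EXTENSIONS` are hypothetical names, and the sketch assumes the first token with a `.py` or `.R` extension is the script being run.

```python
import os
import shlex

# Assumption: only Python and R scripts are recognized.
SCRIPT_EXTENSIONS = (".py", ".r")


def suggest_stage_file(command):
    """Hypothetical helper: derive a stage file name from the script in a command."""
    for token in shlex.split(command):
        stem, ext = os.path.splitext(os.path.basename(token))
        if ext.lower() in SCRIPT_EXTENSIONS:
            return stem + ".dvc"
    # No script found; the caller would have to ask the user for a name.
    return None


print(suggest_stage_file("python plot.py data/scores.csv"))  # plot.dvc
```

A command with no recognizable script, such as `date`, yields `None` in this sketch, which is the case where prompting the user would still be needed.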

I also think the current solution is sub-optimal.
I still think having the .dvc file next to the output (eg. dvc run -o data/output ... creates a data/output.dvc) is the way with the least surprise.

Sorry @shcheklein , forgot to click "comment" and closed the tab.

Yep, the current stage naming magic is ugly and fragile. For example, I'm not sure what happens if I just change the order of outputs (or prepend one more output) and execute dvc run again. Will it create a duplicate stage?

Currently it uses the first output name, so yes, if you change order it will create a duplicate stage. :slightly_frowning_face:
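
A minimal sketch of why reordering produces a duplicate. This is not DVC's actual code: `default_stage_file` is a hypothetical helper that assumes the default name is simply the first output plus `.dvc`, with `Dvcfile` as the fallback for a stage without outputs.

```python
def default_stage_file(outs):
    """Hypothetical helper: name the stage file after the first output."""
    if not outs:
        # Assumption: a stage without outputs falls back to "Dvcfile".
        return "Dvcfile"
    return outs[0] + ".dvc"


first_run = default_stage_file(["plot1.png", "plot2.png"])
# Same outputs, only the order differs.
second_run = default_stage_file(["plot2.png", "plot1.png"])

print(first_run)   # plot1.png.dvc
print(second_run)  # plot2.png.dvc
```

Because the two runs resolve to different file names, the second run writes a new stage file instead of updating the first one.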

Another option would be to ask the user for a name, with a default option provided (it could be the same as we have now, the first output), if a name is not explicitly specified. This way we would make dvc run more like a wizard. It would fit very well if we decide to make the dvc file structure human-editable in the future, or even if we decide to push users more toward editing and managing the files manually.

This is exactly what we do currently.

Using just Dvcfile feels like an error-prone approach. It would be easy to lose some changes. It's safer to make it a requirement to specify a name (you like to be explicit, @efiop, right? :) ).

Yes, I'm thinking more and more about this.

Another suggestion. In certain cases (python or R scripts are pretty common, right?) we can try to analyze the command we are running and suggest a name based on the script name, not the output data file. This idea makes sense only if we are going to ask users about the name. Asking users about the name makes even more sense if we migrate the current "create stage" logic to dvc add.

Yet again, this belongs more to the magic section, which is secondary at best. For now we are talking about the primary structure, and an implicit one seems like the way to go, though I will definitely give it more thought when preparing 1.0.

Thanks,
Ruslan

Hi @sotte !

I also think the current solution is sub-optimal.
I still think having the .dvc file next to the output (eg. dvc run -o data/output ... creates a data/output.dvc) is the way with the least surprise.

Makes sense. The current problem is the -c and -f behavior, which will be solved in v1.0 by storing cwd in the dvc file instead of relying purely on the dvc file location. It looks like the best solution lies between the explicit and implicit approaches: we will eventually keep some stricter magic, such as 'output_path' + .dvc if there is only 1 output, so that the dvc file is placed next to the output file, and will require -f if there are more outputs.

Thank you for the feedback!

-Ruslan
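
The stricter rule described in this reply can be sketched as follows. It is an illustration under stated assumptions, not DVC's implementation: `resolve_stage_file` is a hypothetical function, and treating a stage with zero outputs the same as one with several (an explicit name is required) is an assumption of the sketch.

```python
def resolve_stage_file(outs, fname=None):
    """Hypothetical rule: an explicit -f wins; a single output implies the name."""
    if fname is not None:
        return fname
    if len(outs) == 1:
        # The stage file is placed next to its only output.
        return outs[0] + ".dvc"
    raise ValueError("-f is required unless the stage has exactly one output")


print(resolve_stage_file(["data/output"]))  # data/output.dvc
print(resolve_stage_file(["data/plot1.png", "data/plot2.png"], fname="data/plot.dvc"))  # data/plot.dvc
```

Calling `resolve_stage_file(["data/plot1.png", "data/plot2.png"])` without `fname` raises `ValueError`, so the ambiguous case is surfaced rather than guessed.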

Oh ya, storing cwd is great!

Could storing the .dvc file next to the data even work with more than one output?

So this:

dvc run \
  -d data/scores.csv \
  -o data/plot1.png \
  -o data/plot2.png \
  python plot.py data/scores.csv

could create two files data/plot1.png.dvc and data/plot2.png.dvc which would be identical files. Reproducing the one plot would also reproduce the other one. Sounds good to me.
Also there is no magic involved. Each out file has a corresponding .dvc file which describes how it was created.

No, I don't think that we should duplicate dvcfiles at all, because it creates too much confusion and adds more ways to shoot yourself in the foot. I think it is much better to call your stage plot.dvc and put it somewhere (e.g. in data, if you wish), so that it describes the creation of both plot1.png and plot2.png.

What is the status on this? Or more concretely, when are you planning to store the cwd in the dvc file?
I still think having a .dvc file (which stores the cwd) next to the output file is the best solution: path/to/my/data -> path/to/my/data.dvc. In my opinion this would make dvc much more user friendly (at least for my workflow :)).

Hi @sotte ! Thank you for your interest! This, among many other new features/redesigns, is planned for 1.0 release. We plan to implement and release it in a month or so.

Great. I'll give DVC a try at work once that feature is implemented.

@sotte btw, we recently released a new version in which wdir is introduced at the DVC file level, and you can now specify wdir and -f values that contain a full path. Please give it a try and let us know what you think.

Thanks, I'll give it a try.

@efiop , are you still working on this one? I'd be happy to deprecate Dvcfile once and for all :stuck_out_tongue:

@mroutis Could you elaborate what is the problem with Dvcfile?

@mroutis discussion about it is also here

Dvcfile is a generic name given to a stage without outputs. In such cases, I think it is better to use the --file flag, which gives more context to the operation.

For example:

  • dvc run date -> Dvcfile
  • dvc run -f say_date.dvc date -> say_date.dvc

It's not a big difference, but it will make it easier to determine whether a file is a _dvcfile_ just by looking at the extension (glob.glob('**/*.dvc', recursive=True) or glob.iglob).

Another one would be generating a Dvcfile.dvc
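
The extension-based lookup mentioned above can be tried with a small self-contained sketch. The file names are the ones from this comment plus an illustrative `data/plot.dvc`; the temporary directory layout is an assumption made only for the demonstration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "data"))
    for name in ("Dvcfile", "say_date.dvc", os.path.join("data", "plot.dvc")):
        open(os.path.join(root, name), "w").close()

    # "**" only descends into subdirectories when recursive=True is passed.
    pattern = os.path.join(root, "**", "*.dvc")
    found = sorted(
        os.path.relpath(path, root) for path in glob.glob(pattern, recursive=True)
    )

print(found)
```

The stage named `Dvcfile` does not appear in the result, while the two `.dvc` files do, which is the discoverability gap the comment describes.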

Fixed by https://github.com/iterative/dvc/issues/1871 . We will switch default behaviour to adding stages to dvc.yaml file, so this issue is no longer relevant.
