Extracted from https://github.com/iterative/dvc.org/pull/1951#pullrequestreview-532832838
The new `desc` field made me rethink the ability to have user data in DVC files. These are the current options:
- YAML comments
- `meta` field
- `desc` field (`dvc add/run --desc` can be used to write this)
I see significant overlap. What are we trying to solve and what's the ideal design to implement it?
At the same time these are probably the least relevant fields in the file formats but still, I suspect at least one of these options is redundant.
I think the reasoning is that `desc` is not only relevant to the user, it's also relevant to the viewer. The field can specifically be used to add a text description to an output, which will be displayed in the viewer.
`meta` can contain literally anything and generally only means anything to the user that put something into the `meta` field. So it does not make sense for the viewer to try and figure out what is inside `meta` and then attempt to display whatever that might be.
Doesn't look like there is anything left to say here. Those are just different by definition. Closing as stale, happy to reopen if anyone else has anything to say here.
> desc is not only relevant to the user, it's also relevant to the viewer
> meta can contain literally anything and generally only means anything to the user

These seem like arbitrary distinctions to me. `meta` could mean something to the viewer if it was used that way.
But my question goes beyond that: why is `desc` at the stage level and at the output level, for example? Just seems like an ad hoc implementation (I don't understand the design) but OK, I suppose it's not a big feature? I hope it doesn't cause confusion or other problems in the future...
> These seem like arbitrary distinctions to me. meta could mean something to the viewer if it was used that way.

But the fact is that it doesn't use it. `desc` is meant for stage or output description; `meta` is just whatever you, as a user, want to shove in it. DVC or other tools won't care and won't interpret it or expect any particular format.
> But my question goes beyond that, why is desc at the stage level and at the output level for example? Just seems like an ad hoc implementation (I don't understand the design) but OK, I suppose it's not a big feature? I hope it doesn't cause confusion or other problems in the future...

Because it can describe both a stage and a particular output right now.
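As a rough illustration of the distinction discussed so far, here is a minimal dvc.yaml sketch. The stage name, file names, and the keys inside `meta` are hypothetical; only the placement of `desc` (stage level and output level) and `meta` (stage level) follows what this thread describes.

```yaml
stages:
  train:                       # hypothetical stage name
    desc: Trains the model     # stage-level description
    cmd: python train.py
    deps:
      - train.py
    outs:
      - model.pkl:
          desc: Trained model  # output-level description
    meta:                      # free-form user data; per the thread, DVC does not interpret it
      owner: data-team         # hypothetical key
      ticket: 1234             # hypothetical key
```

The point of the sketch is that `desc` has a fixed meaning and type (a text description), while anything under `meta` only means something to whoever wrote it.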
> the fact is that it doesn't use it

My impression is that was a decision by the same people here. In fact, the Viewer team didn't seem to realize `meta` even existed, if I understood https://github.com/iterative/dvc.org/pull/1951#discussion_r532569582 correctly. Also, in https://github.com/iterative/dvc.org/pull/1951#discussion_r533008967 it was requested that `meta` be displayed (again, if I understand that conversation correctly) so... Overlap.
> it can describe both stage and particular output right now

OK but for .dvc files `desc` only works at the "output" level, not at the file (stage) level. Seems inconsistent.
Also, `meta` only works at the stage level (or file level for .dvc files) so it's not as free as purported.
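To make the inconsistency concrete, this is a sketch of a .dvc file as described in this comment: `desc` sits on the output entry, while `meta` sits at the file level. The path and the `meta` key are hypothetical, and the hash/size fields DVC normally writes are left out on purpose.

```yaml
# data.xml.dvc (hypothetical file name)
outs:
  - path: data.xml           # hash and size fields omitted for brevity
    desc: Raw XML dump       # output-level desc (the only level mentioned for .dvc files)
meta:                        # file-level user data
  source: manual upload      # hypothetical key
```

Compare this with dvc.yaml, where the thread says `desc` is accepted at both the stage and the output level, and `meta` only at the stage level.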
@jorgeorpinel Correct, that was my decision. I started using meta first, but then people from the viewer team said that a separate field makes sense, so I switched. And I now agree that desc should be separate; they have different purposes.
We've spent too much time ranting about it here, and that's counterproductive. I understand that you don't like it, but we need to disagree and commit to this.
I already agreed to disagree some time ago. Not expecting a change, just leaving feedback. The docs probably do need more clarification on all of the details mentioned above though, as it's confusing (not a priority).
p.s. I'm not ranting, I explained my position/questions in detail above.
@jorgeorpinel Thanks for clarifying! Sorry for putting it that way and overreacting. I thought that we've discussed this enough in linked issues/prs, and the reasoning was obvious, but now I understand that I was obviously very wrong, because I didn't consider that you weren't present during all of the discussions about it. So let me try to show a bigger picture here.
In general, `desc` is somewhat related to https://github.com/iterative/dvc/issues/1487 . The idea is to have a description for a particular dataset/artifact that would be preserved during `repro` and other types of updates. `desc` only serves as an aid to understanding particular artifacts, but there will be an additional `label` mechanism (separate from `desc`, just mentioning it since it is related) that would also propagate labels through your pipeline, so that the end result would have all the labels of the input data it was using. We've also added stage-level `desc` because the spaCy guys do that in their files, and it just makes sense to have an aid to understanding what a particular stage does inside.
In the case of pure DVC, `desc` is currently only useful when viewing dvc.yaml files (yes, old *.dvc too for now), and you are right that regular comments would do the same job for that, same as when you use comments in your code/bash scripts. But stage `desc` might also be used in https://github.com/iterative/dvc/issues/3743 or something similar, for a more concentrated (than viewing YAML) representation of your pipeline. There might be a similar thing for output descs too, e.g. `dvc dag --list --outs`, but that's just fantasy at this point. For the most part, right now `desc` is as good as putting comments in YAML files by hand. But we've allocated a special field for it so that people can start using it now and we can figure out good ways to use it in DVC in the future. One more thing: since it is an officially recognized field, it will be available in the (to-be-official in the future) Python API, so people could use it directly (much harder to do with comments). The main difference with `meta` is that `desc` has a particular purpose and its schema/type is officially recognized by the DVC standard, so it won't change.
Viewer is the first such user of `desc`; it will be using the descriptions for outputs. It won't use stage descriptions though, as there is no pipeline support there yet, but when/if that is added, the stage `desc` will be there to use.
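The comment-versus-field point above can be sketched in a small dvc.yaml fragment. The stage and file names are hypothetical. The assumption here is that a tool reads the file with a typical YAML loader, which discards comments but keeps `desc` as ordinary data.

```yaml
stages:
  featurize:                          # hypothetical stage name
    # Builds the feature matrix.      <- plain comment: visible to humans reading the file,
    #                                    but typically lost once the file is parsed
    desc: Builds the feature matrix   # real key: survives parsing, so tools can read it
    cmd: python featurize.py
    outs:
      - features:
          desc: Feature matrix used for training
```

For someone reading the file by hand the two carry the same information; the difference only shows up once another tool (such as the viewer mentioned above) needs to read the description programmatically.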
Thanks for the details! All interesting possibilities. I guess the answer to my question ("What are we trying to solve?") is that we're not sure, but we see potential and will keep an eye on whether/how users employ this.
I'd just clarify that I don't think comments are an alternative actually; I see more overlap with `meta`.
To make my position more constructive, I tried to come up with a specific format for `meta` that would also cover the same needs and be officially recognized, but in that attempt I realized it would be much more complicated than simple `desc` fields.
So the only concern left on my part is consistency, i.e. where in the formats `desc` and `meta` are valid:
- Should `desc` be allowed in `deps`? Why only `outs`? What about other things like `cmd`, `metrics/plots` (which are outputs), etc.? `desc` could be universal if our idea is to explore possibilities.
- Should `desc` be allowed at the file level in .dvc files? Similar to having `desc` at the stage level in dvc.yaml files.
- Should we move `meta` out of the stage level and make it a file-level thing in both formats? That would make it clearly not `desc`, and I think it makes the files much cleaner when you can completely separate the official DVC-recognized structure from any ignored user metadata.
Example:
    stages:
      training:
        desc: Training stage description
        cmd:
          - pip install -r requirements.txt
          - python train.py
            desc: Performs model training. Requires Python (and lib req's)
        deps:
          - train.py
          - features
            desc: We get these features from blah.
        outs:
          - model.pkl:
              desc: Linear/xyz model to predict something
        plots:
          - logs.csv:
              x: epoch
              desc: Logs of the training process

    # Completely ignored but valid user info.
    meta:
      name: Jonny Deep
      pipeline: 'For deployment'