Dvc: Consistency of metadata (meta/desc fields) in dvc.yaml and .dvc files

Created on 4 Dec 2020 · 9 Comments · Source: iterative/dvc

Extracted from https://github.com/iterative/dvc.org/pull/1951#pullrequestreview-532832838

The new desc field made me rethink the ability to have user data in DVC files. Expand below for the current options (this emoji 👉 points to unique characteristics of each one):

YAML comments

  • 👉 Part of the YAML format anyway, I guess
  • Only meaningful to the user
  • DVC should ignore (but preserve) these.

meta field

  • A valid field in dvc.yaml, at the stage level (part of its schema)
  • A valid field in .dvc files, 👉 at the file level (these files were considered stages in 0.x)
  • 👉 Accepts any YAML structure.
  • Only meaningful to the user
  • DVC should ignore (but preserve) these.

desc field

  • A valid field in dvc.yaml, at the stage 👉 and/or output levels (part of its schema)
  • A valid field in .dvc files, 👉 at the output level (I suppose these files are considered something closer to outputs in 1.0+?)
  • 👉 Only accepts strings.
  • Only meaningful to the user
  • 👉 dvc add/run --desc can be used to write this.
  • DVC ignores (but preserves) it otherwise?
  • 👉 DVC Viewer requires it? See https://github.com/iterative/viewer/issues/1371#issuecomment-738549604
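
To make the three options concrete, here is a minimal sketch of a dvc.yaml that uses all of them at once. The stage name, file names, and meta keys are hypothetical; only the placement of desc and meta follows the lists above.

```yaml
stages:
  train:                                  # hypothetical stage name
    # YAML comment: a free-form note for humans reading the file
    desc: Train the baseline model        # stage-level desc, a plain string
    cmd: python train.py
    deps:
      - train.py
    outs:
      - model.pkl:
          desc: Trained baseline model    # output-level desc, a plain string
    meta:                                 # stage-level meta, any YAML structure
      owner: data-team
      tags: [baseline, experiment]
```

All three carry roughly the same kind of information here, which is the overlap in question.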

TL;DR

I see significant overlap. What are we trying to solve and what's the ideal design to implement it?

At the same time, these are probably the least relevant fields in the file formats. Still, I suspect at least one of these options is redundant.

Labels: awaiting response, discussion, p3-nice-to-have, ui

All 9 comments

I think the reasoning is that desc is not only relevant to the user, it's also relevant to the viewer. The field can specifically be used to add a text description to an output, that will be displayed in the viewer.

meta can contain literally anything and generally only means anything to the user that put something into the meta field. So it does not make sense for the viewer to try and figure out what is inside meta and then attempt to display whatever that might be.

Doesn't look like there is anything left to say here. Those are just different by definition. Closing as stale, happy to reopen if anyone else has anything to add.

> desc is not only relevant to the user, it's also relevant to the viewer
> meta can contain literally anything and generally only means anything to the user

These seem like arbitrary distinctions to me. meta could mean something to the viewer if it was used that way.

But my question goes beyond that, why is desc at the stage level and at the output level for example? Just seems like an ad hoc implementation (I don't understand the design) but OK, I suppose it's not a big feature? I hope it doesn't cause confusion or other problems in the future...

> These seem like arbitrary distinctions to me. meta could mean something to the viewer if it was used that way.

But the fact is that it doesn't use it. desc is meant for describing a stage or an output; meta is just whatever you, as a user, want to shove into it. DVC and other tools won't care about it, won't interpret it, and won't expect any particular format.

> But my question goes beyond that, why is desc at the stage level and at the output level for example? Just seems like an ad hoc implementation (I don't understand the design) but OK, I suppose it's not a big feature? I hope it doesn't cause confusion or other problems in the future...

Because it can describe both stage and particular output right now.

> the fact is that it doesn't use it

My impression is that was a decision by the same people here 😬 in fact the Viewer team didn't seem to realize meta even existed if I understood https://github.com/iterative/dvc.org/pull/1951#discussion_r532569582 correctly. Also in https://github.com/iterative/dvc.org/pull/1951#discussion_r533008967 it was requested that meta is displayed (again if I understand that conversation correctly) so... Overlap.

> it can describe both stage and particular output right now

OK but for .dvc files desc only works at the "output" level, not at the file (stage) level. Seems inconsistent.

Also, meta only works at the stage level (or file level for .dvc files) so it's not as free as purported.
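
For reference, the .dvc placement described above can be sketched roughly as follows. The file name, hash, and meta keys are made up for illustration; only the positions of desc and meta come from this thread.

```yaml
# data.xml.dvc (hypothetical file; the md5 value is a placeholder)
outs:
  - md5: d41d8cd98f00b204e9800998ecf8427e
    path: data.xml
    desc: Raw dataset export        # desc sits on the output entry
meta:                               # meta sits at the file level
  owner: data-team
```

So in a .dvc file the two fields sit at different levels, while in dvc.yaml both can appear on the same stage.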

@jorgeorpinel Correct, that was my decision. I started using meta first, but then people from the viewer team said that a separate field makes sense, so I switched. And I now agree that desc should be separate: they have different purposes.

We've spent too much time ranting about it here, and that's counterproductive. I understand that you don't like it, but we need to disagree and commit to this.

I already agreed to disagree some time ago. Not expecting a change, just leaving feedback. The docs probably do need more clarification on all of the details mentioned above though, as it's confusing (not a priority).

p.s. I'm not ranting 😆, I explained my position/questions in detail above.

@jorgeorpinel Thanks for clarifying! Sorry for putting it that way and overreacting. I thought we had discussed this enough in the linked issues/PRs and that the reasoning was obvious, but I now see I was wrong: I didn't consider that you weren't present for all of those discussions. So let me try to show the bigger picture here.

In general, desc is somewhat related to https://github.com/iterative/dvc/issues/1487 . The idea is to have a description for a particular dataset/artifact that is preserved during repro and other kinds of updates. desc only serves as an aid to understanding particular artifacts, but there will be an additional label mechanism (separate from desc, just mentioning it since it is related) that will also propagate labels through your pipeline, so that the end result carries all the labels of the input data it used. We also added stage-level desc because the spaCy folks do that in their files, and it just makes sense to have an aid to understanding what a particular stage does.

In pure DVC, desc is currently only useful when viewing dvc.yaml files (yes, old *.dvc files too, for now), and you are right that regular comments would do the same job there, just as comments do in your code or bash scripts. But stage desc might also be used in https://github.com/iterative/dvc/issues/3743 or something similar, for a more condensed representation of your pipeline than viewing the YAML. There might be a similar thing for output descs too, e.g. dvc dag --list --outs, but that's just fantasy at this point. So for the most part, desc is right now as good as putting comments in YAML files by hand. We allocated a dedicated field for it so that people can start using it now and we can figure out good ways to use it in DVC in the future. One more thing: since it is an officially recognized field, it will be available in the (to-be-official) Python API, so people could use it directly, which is much harder to do with comments. The main difference from meta is that desc has a particular purpose, and its schema/type is officially recognized by the DVC standard, so it won't change.

Viewer is the first such user of desc; it will use the descriptions for outputs. It won't use stage descriptions yet, as there is no pipeline support there, but when/if that is added, the stage desc will be there to use.

Thanks for the details! All interesting possibilities. I guess the answer to my question ("What are we trying to solve?") is that we're not sure, but we see potential and will keep an eye on whether/how users employ this.

I'd just clarify that I don't think comments are an alternative actually, I see more overlap with meta.

To make my position more constructive, I tried to come up with a specific format for meta that would also cover the same needs and be officially recognized but in that attempt I realized it would be much more complicated than simple desc fields.


So the only concern left on my part is consistency, i.e. where in the formats desc and meta are valid:

  • Should desc be allowed in deps? Why only outs? What about other things like cmd, metrics/plots (which are outputs), etc.? desc could be universal if our idea is to explore possibilities.
  • Should desc be allowed at the file level in .dvc files? That would be similar to having desc at the stage level in dvc.yaml files.
  • Ideally, could we get meta out of the stage level and make it a file-level thing in both formats? That would make it clearly not desc, and I think it makes the files much cleaner when you can completely separate the official DVC-recognized structure from any ignored user metadata.

Example:

stages:
  training:
    desc: Training stage description
    cmd:
      - pip install -r requirements.txt
      - python train.py:
          desc: Performs model training. Requires Python (and lib req's)
    deps:
      - train.py
      - features:
          desc: We get these features from blah.
    outs:
      - model.pkl:
          desc: Linear/xyz model to predict something
    plots:
      - logs.csv:
          x: epoch
          desc: Logs of the training process

# Completely ignored but valid user info.
meta:
  name: Jonny Deep
  pipeline: 'For deployment'