Dvc: support meta key at the root level

Created on 24 Nov 2020 · 6 Comments · Source: iterative/dvc

Git itself has no configuration file for storing tags and related metadata. Since dvc.yaml is used to track the whole project, it makes sense to put this metadata into dvc.yaml.

An example of this is:

meta:
  name: DVC_Test_Project
  version: "0.1.0"
  author:
    - "Johnny Chen <[email protected]>"
    - "Jane Doe <[email protected]>"
  description: >-
    this project is a playground of dvc to see
    how it can be used in real world data science.

stages:
  build:
    cmd: echo "a test project"

This idea comes from Julia, where every package has a Project.toml file. See, for example, Pkg.jl/Project.toml.

feature request p3-nice-to-have

All 6 comments

It seems that the following patch works:

https://github.com/iterative/dvc/blob/a7a007e4f97ff08bc27af6be7af66262d17e1ae2/dvc/schema.py#L89-L92

  MULTI_STAGE_SCHEMA = {
+     StageParams.PARAM_META: object,
      STAGES: SINGLE_PIPELINE_STAGE_SCHEMA,
      VARS_KWD: VARS_SCHEMA,
  }
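The effect of that one-line change can be illustrated without DVC itself. What follows is a minimal, stdlib-only sketch and not DVC's actual schema code (schema.py builds on the voluptuous library): validate_top_level and ALLOWED_TOP_LEVEL_KEYS are hypothetical names, and the type given for vars is an assumption. It only shows that once meta is an accepted key with an unconstrained value, a file like the meta example at the top of this issue passes, while an unknown top-level key is still rejected.

```python
# Hypothetical stand-in for the top-level part of a dvc.yaml schema.
# "meta" accepts any value, mirroring `StageParams.PARAM_META: object`
# in the patch. "stages" and "vars" are only type-checked here; the
# types are assumptions, and the real schema validates them in depth.
ALLOWED_TOP_LEVEL_KEYS = {
    "meta": object,
    "stages": dict,
    "vars": list,
}


def validate_top_level(data):
    """Return data unchanged if its top-level keys are allowed, else raise ValueError."""
    if not isinstance(data, dict):
        raise ValueError("dvc.yaml content must be a mapping")
    for key, value in data.items():
        if key not in ALLOWED_TOP_LEVEL_KEYS:
            raise ValueError(f"extra keys not allowed: {key!r}")
        expected = ALLOWED_TOP_LEVEL_KEYS[key]
        # isinstance(value, object) is always True, so "meta" is unconstrained.
        if not isinstance(value, expected):
            raise ValueError(f"{key!r} must be of type {expected.__name__}")
    return data


project = {
    "meta": {"name": "DVC_Test_Project", "version": "0.1.0"},
    "stages": {"build": {"cmd": 'echo "a test project"'}},
}
validate_top_level(project)  # passes because "meta" is an accepted key
```

Removing the "meta" entry from ALLOWED_TOP_LEVEL_KEYS makes the same call raise ValueError, which is roughly the behaviour the patch is meant to change.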

Thanks for the suggestion @johnnychen94! We do support meta at the stage level, and adding support for it at the pipeline level is something we can consider.

One thing to note though is that dvc.yaml is not really used to track an entire project. It would be more accurate to say that it tracks a single pipeline within your project, and you can have multiple dvc.yaml pipeline files in a DVC repo (like in this example: https://dvc.org/doc/command-reference/run#example-separate-stages-in-a-subdirectory)
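To make the "multiple pipeline files" point concrete, here is a small sketch that lists every dvc.yaml in a repository. It is plain directory walking with pathlib and does not use any DVC API; find_pipeline_files is a hypothetical helper name, and skipping .dvc and .git is an assumption about which directories are not of interest.

```python
from pathlib import Path


def find_pipeline_files(repo_root):
    """Return every dvc.yaml under repo_root, sorted, skipping .dvc and .git."""
    root = Path(repo_root)
    skipped = {".dvc", ".git"}
    found = []
    for path in root.rglob("dvc.yaml"):
        # Ignore anything inside DVC's or Git's internal directories.
        if skipped.isdisjoint(path.relative_to(root).parts):
            found.append(path)
    return sorted(found)
```

Calling find_pipeline_files(".") from the repository root returns one path per pipeline file, so a repository with more than one result has no single dvc.yaml that describes the whole project.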

To get something equivalent to Julia's Project.toml, it may make more sense to just define your own top-level metadata file for your project and track it with Git.

DVC does not mandate any particular project structure and can be used for a wide range of use cases (in the same way that git can be used to version almost anything). So it doesn't make as much sense for DVC to define a root "project" level schema, other than the types of configuration info that goes into .dvc (again, similar to how git configuration works).

So it doesn't make as much sense for DVC to define a root "project" level schema, other than the types of configuration info that goes into .dvc (again, similar to how git configuration works).

To be clear, the same is true of Julia's Project.toml file, and of how Git's submodule functionality works.

As an example, here is a project I'm working on. As you can see, each subfolder has its own Project.toml file, and dvc.yaml defines a stage for each subfolder/subproject.

.
├── DnCNN.zip.dvc
├── dvc.lock
├── dvc.yaml
├── evaluate
│   ├── main.jl
│   ├── Manifest.toml
│   └── Project.toml
├── LICENSE
├── params.yaml
├── prepare
│   ├── generate_data.jl
│   ├── main.jl
│   ├── Manifest.toml
│   └── Project.toml
├── README.md
└── train
    ├── compat.jl
    ├── config.json
    ├── main.jl
    ├── Manifest.toml
    ├── model.jl
    ├── Project.toml
    └── train_network.jl

Each Project.toml only defines which packages are used, just like Python's requirements.txt, and the meta info is optional:

[deps]
ArgParse = "c7e460c6-2fb9-53a9-8c5b-16f535851c63"
Augmentor = "02898b10-1f73-11ea-317c-6393d7073e15"
FileIO = "5789e2e9-d7fb-5bc7-8068-2c6fae9b9549"
ImageCore = "a09fc81d-aa75-5fe9-8630-4744c3626534"
ImageMagick = "6218d12a-5da1-5696-b52f-db25d2ecc6d1"
JLD2 = "033835bb-8acc-5ee8-8aae-3f567f8a3819"
ProgressMeter = "92933f4c-e287-5a05-a399-4b506db050ca"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"

[compat]
Augmentor = "0.6"
FileIO = "1"
ImageMagick = "1"
JLD2 = "0.3"
ProgressMeter = "1"
julia = "1"

So yeah, it doesn't make much sense to define a "root" project, since we always interpret this term in a relative way.

For this specific example, it's the dvc.yaml file that unifies the entire project, and naturally I want to add some project info into dvc.yaml. This is exactly why I propose to add some root meta info here.

That said, it's totally fine to use any file to record this information, a meta.yaml, for example. It would just be nice if DVC had some recommendations on this.
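For what a standalone metadata file could look like in practice, here is a rough sketch of loading one and checking a couple of fields. Everything in it is an assumption for illustration: load_project_meta and REQUIRED_FIELDS are hypothetical names, the required fields are borrowed from the meta example in this issue, and it reads JSON (a hypothetical meta.json) only because Python's standard library has no YAML parser. With a YAML library installed, the same check would apply to a meta.yaml.

```python
import json
from pathlib import Path

# Hypothetical convention: these field names follow the meta example in this
# issue, but nothing in DVC or Git defines or requires them.
REQUIRED_FIELDS = ("name", "version")


def load_project_meta(path):
    """Load a standalone metadata file and check that the required fields exist."""
    meta = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [field for field in REQUIRED_FIELDS if field not in meta]
    if missing:
        raise KeyError(f"missing metadata fields: {', '.join(missing)}")
    return meta
```

Such a file is tracked with Git like any other source file, so it needs no support from DVC at all.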

Does p3-nice-to-have mean an accepted proposal that requires community effort to work on? If so, I could give it a try this weekend by adding some tests based on https://github.com/iterative/dvc/issues/4960#issuecomment-733171910, and also some docs to the dvc.org repo.

@johnnychen94 If you'd like to work on this issue feel free to take it!

p3 issues are generally ones that the core team may get to eventually if there's enough user interest or need. But we have limited bandwidth, and there are just higher-priority features & issues that we need to address in the meantime.
