dvc: finalize `dvc tag` feature

Created on 20 Jul 2019 · 11 comments · Source: iterative/dvc

The dvc tag feature was implemented a while ago, but has been hidden from the CLI since then. We need to give it a good thought in the current DVC model (especially with dvc import and dvc get) and see what needs to be done (if anything) for it next.

Related #1766

Labels: p3-nice-to-have, question, research

All 11 comments

To give everyone more context.

The idea was to give users a way to assign custom tags to files that are under DVC control:

dvc tag add v1.0 model.pkl

This command would update the DVC-file in which the model.pkl output is defined, saving the tag name together with the current checksum of the output.
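
For illustration, a rough sketch of what the updated DVC-file could look like after that command; the exact key layout of the hidden implementation may differ, and the checksum is a placeholder:

    # Hypothetical DVC-file contents after `dvc tag add v1.0 model.pkl`
    outs:
    - path: model.pkl
      md5: c157a79031e1c40f85931829bc5fc552        # current checksum of the output
      tags:
        v1.0:
          md5: c157a79031e1c40f85931829bc5fc552    # checksum pinned under the tag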

Why don't we reuse Git tags for that? They are not granular enough (they apply to the commit as a whole, which includes multiple datasets, models, code, etc.). They are Git-specific. They will create additional mess on GitHub/GitLab for projects. They will create additional mess with the -T option across our commands.

Which scenarios do we have in mind where those tags can be useful? In general, it's all about being able to granularly address an artifact and a certain version of that artifact. The top use cases that come to my mind are:

  1. dvc checkout data@v1.0 or something similar, to get a dataset of a certain version, instead of doing something like git checkout -- data.dvc && dvc checkout data.dvc, etc. (see the shell sketch after this list).
  2. Use with dvc import and dvc get to provide a data-artifact-specific way to address it.
  3. See the list of "releases" for the data artifact. It's not easy to achieve with Git tags.
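
A quick shell sketch of use case 1; the @tag addressing is part of the hidden/proposed feature, not a documented command, and the v1.0 tag name is a placeholder:

    # Today: restore an older dataset version via git, then sync the workspace.
    git checkout v1.0 -- data.dvc
    dvc checkout data.dvc

    # Proposed: address the tagged version of the artifact directly.
    dvc checkout data@v1.0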

Anything else I'm missing there?

@villasv @sotte @PeterFogh @prihoda any thoughts on this, guys?

To be honest, it does not look like a critical feature to me, but it can be really nice to have to organize data/models in a proper way. On the other hand, it complicates the tool by introducing a new abstraction.

Hi @shcheklein and @efiop. In my team, we are currently working on putting one of our models (developed using a DVC pipeline) into production in Azure, using Azure blob storage and Azure functions. Version tagging of model and dataset files is an essential part of managing production systems; however, there are many ways to reach the same effect in our use case.

Currently, we have a separate DVC stage which uploads the model files to Azure blob storage and "tags" (i.e. names) them with the current git commit hash. And, as we do not use DVC to upload the model to Azure, I cannot see how a dvc tag command can help our current implementation.
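
A rough sketch of that kind of stage; the upload script and destination path are hypothetical, only the general shape follows the description above:

    # Hypothetical upload stage: a script pushes the trained model to Azure Blob
    # Storage, naming ("tagging") it with the current git commit hash. The command
    # is single-quoted so the hash is resolved each time the stage runs.
    dvc run -d model.pkl -f upload_model.dvc \
        'python upload_to_blob.py model.pkl "models/model-$(git rev-parse --short HEAD).pkl"'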

My opinion:
We will probably not use a dvc tag command, mainly because of our setup of running and storing all DVC files on a remote Dask scheduler/workers, hence we use neither dvc import nor dvc get.
I agree with @shcheklein,

"It does not look like a critical feature to be honest to me, but can be really nice to have to organize data/models in a proper way."

However, if it were up to me, I would focus on the current commands in DVC, as they can fulfil my team's needs for pipeline management.

But still, I'm interested in hearing others' thoughts on dvc tag. Maybe I just overlooked some of its possibilities 😅

@PeterFogh thank you for the feedback!! Can you please give more details on how exactly you connect the produced model with the production system? The intention of dvc get and the dvc.api.open Python API was to give an easy way to fetch an artifact in the production environment, addressing it by the Git repo. No need to copy it into some ad-hoc place and hard-code a path to it.
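
For example, a minimal sketch of what fetching the model could look like in production code with the Python API; the repo URL, artifact path, and revision are hypothetical placeholders:

    # Fetch a pickled model straight from the Git/DVC repo, without copying it
    # to an ad-hoc location first.
    import pickle

    import dvc.api

    with dvc.api.open(
        "model.pkl",                            # path to the artifact in the repo
        repo="https://github.com/org/project",  # hypothetical repo URL
        rev="v1.0",                             # any git revision: tag, branch, commit
        mode="rb",
    ) as f:
        model = pickle.load(f)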

I think this part specifically can be removed if you start using DVC API/dvc get:

Currently, we have a separate DVC stage which uploads the model files to Azure blob storage and "tags" (i.e. names) them with the current git commit hash. And, as we do not use DVC to upload the model to Azure, I cannot see how a dvc tag command can help our current implementation.

I would love to hear your opinion on this.

And just to get a sense of scale: how large are these models? If you run them using Functions, I assume they are small enough?

@shcheklein what about the --all-tags option: is this going to work with SCM tags and DVC tags?

I would assume that --all-tags keeps working with SCM tags. The only difference is that now a DVC-file can contain multiple versions (tags) of an output. So, GC will preserve those if they are committed as part of one of the SCM tags.
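
In other words, the existing behaviour stays tied to Git, e.g. for garbage collection:

    # --all-tags already refers to SCM (git) tags:
    dvc gc --all-tags    # keep cache entries referenced from any git tag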

Some questions for the research:

  • Are we going to have a "tree" for tags?

  • Why not use the meta attribute?

  • Is this going to affect run, e.g. dvc run -d data.csv@v1? How are we going to know which version to use?

Also, as @Suor noted, we might want to reconsider dvc tag in general and its name. E.g. to avoid confusion with Git tags, we might call our tags labels, but then again, that might create confusion with ML labeling 🙂

@mroutis Great questions!

Are we going to have a "tree" for tags?

What is tree?

Why not use the meta attribute?

meta was added later and it is not really meant for anything that dvc itself could heavily use. The main purpose of meta is just to keep user-defined keys from overlapping with our dvc keys.
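
For reference, a small sketch of how meta sits in a DVC-file today; the keys under meta are arbitrary user content and the checksum is a placeholder:

    # DVC preserves the meta block but never interprets it, so user-defined keys
    # cannot clash with DVC's own schema.
    meta:
      owner: data-team
      notes: free-form information that DVC ignores
    outs:
    - path: model.pkl
      md5: c157a79031e1c40f85931829bc5fc552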

Is this going to affect run, e.g. dvc run -d data.csv@v1? How are we going to know which version to use?

Ideally, yes. There are some implementation details that we would have to figure out for that precise case though. E.g. how should dvc replace data.csv with v1 for that run? Should it do that in the workspace with something like dvc checkout? Or in some temporary directory? The current workflow is to run dvc checkout data@v1 and then run dvc run -d data as always, which at least does all the actions explicitly, so there is less confusion going on.
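
A sketch of that explicit two-step workflow; train.py and the output name are hypothetical, and the @tag syntax belongs to the hidden tag feature:

    # First materialize the tagged version of the dependency, then run the stage
    # against it as usual.
    dvc checkout data@v1
    dvc run -d data -o model.pkl 'python train.py data model.pkl'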

@efiop, like the SCM tree -- although I'm not sure if dvc tag is going to be a replacement for git itself to manage experiments (instead of creating branches or tags, use dvc tags) -- but instead of searching .git, using .dvc/cache.

My biggest expectation regarding tags is that when we build a shared/base project that generates datasets that can be reused by other projects (that is, other repositories), we get to import those datasets by tag (and magic tags like latest as in Docker images).
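
A hypothetical sketch of that cross-repository flow; the repo URL and dataset path are placeholders, and the @tag / latest addressing is proposed behaviour, not existing syntax:

    # Import a tagged dataset release from a shared "base" repository.
    dvc import https://github.com/org/shared-datasets data/train.csv@v2.1

    # Magic "latest" tag, analogous to Docker image tags.
    dvc import https://github.com/org/shared-datasets data/train.csv@latest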

For now, we don't share/publish datasets with other projects/people, so we don't have fancy names. If the timestamp is newer, the dataset is newer, and that's enough for us.

I have serious doubts about its usability or necessity. As I see it, we already have a way to address artifacts with Git tags, and we have the dvc get/import and dvc.api.read()/open()/get_url() machinery around it. Another reason I don't want to decouple this from Git tags is that an artifact without a corresponding stage file is volatile: suppose you do dvc checkout data@v1, then run dvc repro, and data is overwritten.

So I don't see much that people can usefully do with these tags. The only thing that makes some sense to me is getting a list of versions of particular artifacts (datasets or models) used in a repo. I am not sure that use case is served properly by such tags, because by decoupling this from Git tags we lose info on how that artifact was produced, e.g. which data/method/hyperparameters were used to train a model.

If the point is having some "in-repo registry" of available artifacts, then we should probably add a separate JSON/YAML file describing what's available, with tags/labels, git SHAs, checksums, and optional fields like description, creation date, etc. This is still a big discussion.
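
A rough sketch of what such a registry file could contain; this is a hypothetical format, not an existing DVC file, and all values are placeholders:

    # In-repo registry of published artifacts (hypothetical).
    artifacts:
      model.pkl:
        versions:
          - label: v1.0
            git_rev: 3f2a9c1                            # commit that produced it
            md5: c157a79031e1c40f85931829bc5fc552
            description: baseline model for the July dataset
            created: 2019-07-01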

I would like to hear about other workflows that would benefit from this feature in any form.

Ok, it seems like this feature is not really desired by anyone. Also, it is not required for get/import, as you can use git revisions, and if it is needed at some point, we will be able to implement it later; it is not a feature that is holding us back. I suggest we don't remove the current implementation for now and keep this ticket open, to see if any requests come in after we have a big announcement for the get/import features.

