Dvc: Feature theme suggestion: data versioning

Created on 3 Jan 2018 · 13 comments · Source: iterative/dvc

I'm not sure if this is already baked in or not. It would be a great feature theme to automatically version data artefacts, especially the final outputs of a workflow. On the one hand this aligns naturally with what git is all about; on the other hand it might be a whole feature theme to consider with great care, rather than a small addition.

Anyway, the motivation is that data processing, and machine learning in particular, is a very iterative process, and we gain a lot by being able to version the code and workflow that created a result along with the result itself. This would seem to elegantly materialize what we call _reproducible data science_.

question

All 13 comments

It would be great to have some clarification about "data artefacts".

I'll try to answer the question based on my understanding... All outputs (`-o` option) and dependencies (`-d` option) are versioned and tightly linked for reproducibility, including recursive dependencies. So you can specify many outputs (artifacts?).

Many outputs example:

```shell
dvc run -d code/featurization.py \
        -d data/Posts-train.tsv -d data/Posts-test.tsv \
        -o data/matrix-train.p -o data/matrix-test.p \
        python code/featurization.py \
            data/Posts-train.tsv data/Posts-test.tsv \
            data/matrix-train.p data/matrix-test.p
```

In the old (released) version it works pretty much the same way, but the syntax and the dependency logic are a bit different.

I think he meant versioning of individual data files, e.g. the ability to roll back some CSV file to a previous version.

But I think that goes in the wrong direction. The script that generated that csv should be versioned and reproducible, hence versioning the code should be enough. That's one of the main reasons to have reproducible pipelines, so you don't have to version outputs.

I can imagine valid arguments that some data files may not belong to the pipeline and instead come from external sources. In that case I think it's best to fall back to git or git-lfs.

@dmpetrov @villasv apologies for having introduced redundant terminology: by artefacts, I meant outputs indeed. I will try to elucidate.

Sometimes you want to preserve not just the skeleton of the pipeline that created an output, but also the output itself, mainly the final output of the pipeline. That's because it may typically take hours to days for the machine learning process to re-create it.

For example, in a certain kind of workflow, you'd later compare that output to newer ones you are creating, and possibly revert to it down the road. Alternatively, if you are adding evaluation steps as you learn more about the data, you may have reason to go back and evaluate against older outputs of the same pipeline without re-building them from scratch. Or you found a bug in your evaluation script and would like to re-evaluate some older outputs. In these cases, waiting hours or days to reproduce a previous version of the output would be counter-productive.

So by versioning outputs, I do mean _providing options to keep some of them some of the time_, granted that traceability to the version of the code/pipeline that had created each one is preserved (and not just having the latest version of outputs files, which is what git-lfs could satisfy).
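As a sketch of how this could look today, with git tags doing the version bookkeeping (tag name, target, and commit message below are invented for illustration):

```shell
# Build the pipeline output and record the pipeline state, then tag
# the commit so this exact output version stays addressable later.
dvc repro data/eval.txt
git add . && git commit -m "baseline model"
git tag baseline_v1

# ...many iterations later, bring back that exact output without
# re-running anything: git restores the small pointer files, and
# `dvc checkout` restores the matching data from the DVC cache.
git checkout baseline_v1
dvc checkout
```

The traceability asked for above falls out for free: the tagged commit contains both the code and the pointers to the cached outputs it produced.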

Hmm. Yeah. Comparing before/after a pipeline change makes sense, even though I usually try my best to avoid that (e.g. by always having the "current" and "best" outputs active in the pipeline). But you're right that sometimes I wish I could revert my changes, and I end up forcing myself to grab a coffee while the change is undone and I reproduce an old result.

Not sure how easy it would be to correctly "check out" the correct output cache at different stages. I think _keep some of them some of the time_ is a sensible request and is more like _extra caching_ rather than the full-fledged versioning of those files that I imagined before.

Thank you guys for the clarification! Yeah, this is the most interesting subject about DVC...

As you can see, there are two different reproducibility "philosophies": version only the code (data can easily be derived from the code) or version both code and data. Makefiles (and their analogs) version only code. DVC versions code and data.

Why version code+data? Two advantages:

  1. Reusing (not re-training) data from previous commits/branches.
  2. Repository sharing (as Git users we might say: reusing data from remotes).

The previous-commits situation was well explained by @matanster: sometimes we don't want to wait 15 minutes or even 5 hours to reproduce a previous version that we had already built yesterday. With DVC you can do dvc checkout classes_5_beta_07, and dvc repro data/eval.txt will reproduce nothing, because the data is already in the cache and all data files (which are actually pointers to the actual files) will be correctly pointed to the right files in the DVC cache directory as a result of the checkout command. This is one of DVC's advantages over Makefiles, where data and models have to be rebuilt after each checkout.
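Those "pointers" are small text files that live in the Git repository while the heavy binaries sit in the local cache keyed by checksum. Roughly, a stage file looks like the sketch below (the exact schema varies by DVC version; the paths and hashes here are invented):

```yaml
# data/eval.txt.dvc -- committed to Git; the real data lives in .dvc/cache
cmd: python code/evaluate.py data/model.p data/eval.txt
deps:
- md5: 3863d0e317dee0a55c4e59d2ec0eef33   # invented hash
  path: data/model.p
outs:
- md5: a3f2c1d4e5b6978877665544332211aa   # invented hash
  path: data/eval.txt
  cache: true
```

Because only these few lines change between commits, `git checkout` is cheap, and `dvc checkout` just has to relink workspace files to the cached copies named by the hashes.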

The repo-sharing scenario is the same, but we avoid rebuilding on different machines: I can train a model on my 12 GB GPU machine, sync the result to cloud storage, and then resync and reuse it from my laptop: dvc sync data/cnn_model.p.
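For what it's worth, in current DVC the `dvc sync` described here is spelled `dvc push`/`dvc pull` against a configured remote. A sketch of the sharing flow (remote name, bucket, and paths are invented):

```shell
# One-time setup: point the project at shared storage.
dvc remote add -d storage s3://my-bucket/dvc-cache

# On the 12 GB GPU machine: train, then share code and data.
dvc repro data/cnn_model.p
git add . && git commit -m "new CNN" && git push   # tiny pointer files
dvc push                                           # cached model data

# On the laptop: fetch pointers, then the data they point to.
git pull
dvc pull data/cnn_model.p.dvc                      # no retraining needed
```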

Okay, so DVC uses the code+data "philosophy". Does it support the code-only philosophy that @villasv described? To some extent, yes. On the local machine it is code+data. But when you share a Git repository you share only code, and the code is still reproducible. If you share the repository AND access to your synced cloud storage, then you make it code+data sharing. This is a DVC feature I'm very proud of: we were able to separate code reproducibility from code+data reproducibility and stay compatible with Git, whereas Git-annex becomes incompatible with Git (please try to share your Git-annex repo through GitHub) and Git LFS requires a special (not open-sourced) Git server.

My personal opinion: code-based reproducibility is the right way to share results, whereas code+data reproducibility is a kind of optimization, and it is the most convenient way to work on models by yourself or in a team (using DVC sharing).

What is implemented today? Both of these scenarios are implemented in the old, released DVC version and in the new one. git checkout resolves file pointers correctly and dvc sync does its job. However, in the old version git merge becomes messy, and dvc repro after a checkout makes mistakes in some circumstances, reproducing steps when it should not. That was one of the motivations to redesign DVC a bit and introduce the new version (the next release) with an API change.

Also, in the new version you will be able to commit outputs to Git directly if needed: dvc run -o model.p -O eval.txt Rscript mycode.R input.csv model.p eval.txt. This is a small but convenient feature, mostly for final evaluation/metrics files.
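For context on the two flags (semantics as in current DVC, where `-O` is `--outs-no-cache`; file names here are illustrative): lowercase `-o` caches the output and commits only a pointer, while capital `-O` skips the cache so the file itself can go into Git:

```shell
# -o model.p : large artifact, cached by DVC, only a pointer goes to Git
# -O eval.txt: small metrics file, not cached, committed to Git directly
dvc run -d input.csv -o model.p -O eval.txt \
    Rscript mycode.R input.csv model.p eval.txt

# Since eval.txt is a plain Git-tracked file, its history diffs nicely:
git add eval.txt && git commit -m "record metrics"
```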

So, to answer the original question: yes, this is already baked in. And this is the most important and interesting part of DVC, with many stories behind it :)

Ah, I see. I imagined that the _extra caching_ was happening but wasn't sure it was being fully exposed as a feature, and it is. These first few paragraphs deserve to become a blog post eventually, if they're not one already :)

Yes, I'm going to publish all this information before the next release.

Hi, I just discovered DVC yesterday, and it seems very close to what we need in our team as well. Thanks for developing it!

I'd like to strongly second that code+data versioning is extremely important in practice. @dmpetrov Your scenario of training cnn_model.p on the GPU machine and then syncing it to the laptops of multiple team members is exactly the kind of situation I'm interested in. I look forward to your blog post explaining this.

However, let's say you change your CNN a bit and then generate a new version of cnn_model.p. How will dvc sync data/cnn_model.p know which version of cnn_model.p to download? Is there a way to select the version? (Adding all versions of cnn_model.p to a Git repository seems clunky, because then each copy of the repository will contain all old versions of all models.)

@alexanderkoller thank you for your feedback! We are taking the final steps toward releasing the new DVC version in mid-March.

dvc sync data/cnn_model.p (dvc cache pull in the new version) won't know about the change until you git pull the meta information from your (changed) repository. Then DVC will see the new model version and you can sync or reproduce.

In general, DVC stores all meta information in your Git repository; dvc repro rebuilds data files based on that meta information, and dvc cache pull/push synchronizes data if you don't want to spend resources on rebuilding and re-training.

@alexanderkoller Regarding the version selection...

"Adding all version of cnn_model.p to a Git repository seems clunky"
I agree. In the previous version (0.8), DVC makes commit for each dvc run ... command. In the coming version 0.9, a user decides what to commit. You can dvc run ... and do git add . && git commit -m 'Alpha=0.025' or revert the produced result.

But it is nice to have the entire history of an ML project. I personally try to keep all the attempts I made. However, I don't keep them as linear changes in a single branch: I make a separate branch for each of my hyperparameter settings and then merge the best branch/params into master. It's like feature branches in software development, except you might have 15 features/branches and only one will ever be merged.
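A sketch of that branch-per-experiment workflow (branch names and the target file are invented):

```shell
# One branch per hyperparameter setting.
git checkout -b alpha_0.025
# ...edit the config, then rebuild only the stages that changed...
dvc repro data/eval.txt
git add . && git commit -m "Alpha=0.025"

git checkout -b alpha_0.05 master
dvc repro data/eval.txt
git add . && git commit -m "Alpha=0.05"

# Merge the winning experiment into master; the rest stay as history.
git checkout master
git merge alpha_0.025
```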

If you see value in your failed (not merged) experiments, you can even push them to origin and share that history with the team.

We are discussing dataset scenarios in #1487.
The discussion is related to some of the questions that @matanster raised: _How to checkout a specific version of a dataset in a convenient way?_

Guys, please feel free to join the discussion.

@dmpetrov, do you think we can close this? Looks like DVC already supports checkout.

@mroutis I think the discussion started after dvc checkout was implemented. Actually, dvc checkout was implemented a long, long time ago; it's one of the basic features. I think the point here was about providing a higher-level way (tags?) to identify specific outputs and a simpler interface (similar to dvc get . model.pkl --rev ...) to check out a specific version of one.

So, it might be resolved with proper git tags and dvc get now.
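With today's commands that could look like the following (repository URL, file name, and tag are placeholders):

```shell
# Download one versioned output from a DVC repo without cloning it.
dvc get https://github.com/example/project model.pkl --rev v1.2

# Or, from inside a checkout of the repo itself:
dvc get . model.pkl --rev v1.2
```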

Btw, I wonder if dvc get . is optimized in some way? :)
