From #2012
The most recent PR https://github.com/iterative/dvc/pull/2160 and a few previously merged PRs have introduced a mechanism to reuse (import) a data artifact (a model, data file, or directory) from one GitHub repository in another:
```
cd myrepo
dvc import https://github.com/iterative/example-get-started model.pkl
```
This creates a DVC-file that has a dependency on https://github.com/iterative/example-get-started (the `repo` field below), and specifically on the model.pkl file produced by the DVC pipeline inside it:
```yaml
md5: 39c9c563afc97491c86db391afd9cf82
wdir: .
deps:
- md5: 3863d0e317dee0a55c4e59d2ec0eef33
  path: model.pkl
  repo:
    url: https://github.com/iterative/example-get-started
outs:
- md5: 3863d0e317dee0a55c4e59d2ec0eef33
  path: model.pkl
  cache: true
  metric: false
  persist: false
```
This command also automatically "executes" the DVC-file; more specifically, it brings the model.pkl file into the workspace.
The existing `dvc import` command has been moved to `dvc import-url`.
Now that we have an external dependency of sorts, the question is: what should we do when we run `dvc status` or `dvc repro`? Pull the latest version every single time, as we do with `dvc import-url`?
After a few discussions we have decided that we need common semantics for all external dependencies, both `import` and `import-url`. The semantics should be aligned with DVC's primary purpose: ensuring reproducibility.
It means that if, say, in one month we decide to do `git checkout <1-month-ago>` in myrepo to get back that same DVC-file, DVC must guarantee that we get the same result (the same model.pkl, first of all) when we run `repro`, even if https://github.com/iterative/example-get-started has changed a lot since then.
The same goes for `import-url`, which can import a file into the workspace via HTTPS, from S3, etc. Right now it _always_ checks and updates the data in the workspace when you run `dvc repro`.
Thus we need to introduce some "lock" mechanism for these "import" DVC-files. They need to capture, somewhere, all the context necessary to get the same data whenever it's needed. This does not necessarily mean new fields; we might end up reusing what we already have and just changing the behavior.
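The locked semantics can be illustrated with a short sketch (the helper names are hypothetical, not DVC's actual API): `status` on an import stage would only compare the workspace file against the checksum captured at import time, never against whatever the upstream repo currently holds.

```python
import hashlib

def file_md5(path):
    """md5 of a file's contents, as stored in the DVC-file 'md5:' fields."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def import_stage_changed(locked_md5, workspace_path):
    """Locked semantics for `dvc status` on an import stage: the workspace
    copy is checked against the md5 recorded at import time, so the result
    is stable even if the remote repo has moved on."""
    return file_md5(workspace_path) != locked_md5
```

The point of the sketch is that upstream changes become invisible to `status`/`repro` until an explicit update command refreshes the lock.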
### Implementation options
1. Keep the lock in the DVC-file produced by `dvc import` itself; more specifically, a `rev-lock` field containing the hash of the git commit the repo was initially fetched from:

```yaml
...
repo:
  url: https://github.com/iterative/example-get-started
  rev-lock: 123456dfdfgdfgdf
...
```
This way each DVC-file is self-contained and has all the information needed to restore the specific version of the repo the data was initially taken from.
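To make "self-contained" concrete, here is a minimal extraction of the `url` and `rev-lock` fields from such a stage file (a hand-rolled key scan for illustration only; real DVC uses a proper YAML parser):

```python
def read_repo_lock(dvcfile_text):
    """Pull the repo 'url' and 'rev-lock' values out of a DVC-file's text.
    Minimal sketch: scans key/value lines instead of parsing full YAML."""
    info = {}
    for line in dvcfile_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key in ("url", "rev-lock"):
            info[key] = value.strip()
    return info

# A DVC-file fragment shaped like the example above.
example = """\
deps:
- md5: 3863d0e317dee0a55c4e59d2ec0eef33
  path: model.pkl
  repo:
    url: https://github.com/iterative/example-get-started
    rev-lock: 123456dfdfgdfgdf
"""
```

Everything needed to re-fetch exactly the same data (`url` plus `rev-lock`) lives in the one file, with no lookup anywhere else.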
Pros/cons for this option:
- ✅ No additional files or concepts; easier to implement (at least initially);
- ✅ We can easily work with multiple versions of the same repo;
- ❌ We are going even further toward DVC-files being machine-controlled, not human-controlled;
- ❌ Consecutive `dvc import`s of different data artifacts from the same GitHub repo may lock onto different versions of the repo. This semantics might be confusing: even though the repo is already installed, the same `dvc import` might fail because the remote repo has changed;
- ❌ Mass update becomes a mess. Since we now have multiple files containing the same rev-lock information, updating a single dependency means updating all of these files. Do we need mass update at all? Yes: if we introduce aliases, it would be nice to have something like `dvc update requests` instead of `dvc update github.com/sdfdsds/requests`;
- ❌ Duplication of information means less control. For example, if I have two dependencies from the same remote repo, a model and a dict, how do I make sure the two DVC-files are always in sync? DVC does not guarantee that, and it would be all too easy to update only one of them.
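The last drawback can be made concrete: since nothing keeps per-file rev-locks consistent, any sanity check has to sweep the DVC-files itself. A hypothetical sketch (function and field names are illustrative, not a DVC API):

```python
from collections import defaultdict

def out_of_sync_repos(stages):
    """stages: iterable of (dvcfile, repo_url, rev_lock) triples.
    Returns the repos whose importing DVC-files disagree on the locked
    revision -- exactly the drift DVC itself would not prevent here."""
    revs_by_repo = defaultdict(set)
    for _dvcfile, url, rev in stages:
        revs_by_repo[url].add(rev)
    return {url: revs for url, revs in revs_by_repo.items() if len(revs) > 1}
```

Whether such drift is a bug or a feature (pinning different artifacts to different repo versions on purpose) is part of the open question above.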
2. A central `.dvc/repos/lock` file (and maybe `.dvc/repos/list` in the future, to support aliases or to work with multiple model/data versions); exact place and names TBD. Every time we do `dvc import github/something data` we update those files automatically, putting the installed repo's revision hash into the lock file.
- ✅ Repo update becomes easier: a single place to update;
- ✅ No information duplication, hence a lower probability of error;
- ✅ Consecutive `dvc import` commands install data from the same revision of the repo. It's easier to predict the outcome;
- ❌ An update affects all stages that depend on the repo. That's fine if we don't want to depend on multiple versions of the same repo, but sometimes it could mean a mass update of a lot of pipelines we actually don't want to touch at all;
- ❌ Calculating status, pull, etc. becomes more complicated: we need to go and check the central lock file. For example, the md5 of the DVC-file itself now has to include some information from the central lock file;
- ❌ Harder to implement (at least initially);
- ❌ Requires introducing aliases to deal with multiple versions of the same repo.
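For contrast, a minimal sketch of the centralized variant (the file format and names are explicitly TBD in this proposal; the sketch just treats `.dvc/repos/lock` as a url-to-revision mapping):

```python
def resolve_rev(repo_lock, url):
    """Every import stage resolves its revision through the single central
    mapping, so consecutive imports from one repo always agree."""
    return repo_lock[url]

def update_repo(repo_lock, url, new_rev):
    """Updating a repo touches one entry -- but every stage importing
    from `url` now follows the new revision, wanted or not."""
    updated = dict(repo_lock)
    updated[url] = new_rev
    return updated
```

Note how both the main pro (one place to update) and the main con (the update is global) fall out of the same single-entry design.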
Anything else I forgot, @efiop @dmpetrov @Suor? @mroutis @pared @jorgeorpinel @villasv @ei-grad @sotte, let's discuss - share your opinions and votes. Let me know if something is not clear; it's a lot of information to digest :)
> Mass update becomes a mess.

Isn't that the same as doing `git pull repo && find -name '*.dvc' | grep repo | sed 's/rev-lock: old-hash/rev-lock: new-hash/' && dvc checkout`?
Yep, but it can potentially touch a lot of files. Imagine you had to update every single Python source file containing `import requests` every time you update the requests library.
Also, what should I do if some of them are not in sync? Should I update all of them anyway?
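The shell one-liner quoted above can also be sketched as a per-file text rewrite; the point of the "mass update" concern is that with per-file locks it has to run over every `*.dvc` file that mentions the repo (illustrative code, not a DVC command):

```python
def bump_rev_lock(dvcfile_text, repo_url, old_rev, new_rev):
    """Rewrite 'rev-lock: old' to 'rev-lock: new' in one DVC-file's text,
    but only if the file actually imports from `repo_url`. Under option 1
    a mass update means applying this to every matching *.dvc file."""
    if repo_url not in dvcfile_text:
        return dvcfile_text
    return dvcfile_text.replace(f"rev-lock: {old_rev}", f"rev-lock: {new_rev}")
```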
What do you mean by _not in sync_, @shcheklein? Also, why is DVC allowing files to be out of sync - is that useful? Is it like having one submodule of requests (say, requests.adapters) at v1 and another submodule (maybe requests.api) at v2?
Yep, you got it right! Since we essentially have a lock per DVC-file, nothing prevents us from locking two files to two different versions (revisions) of the same repository.
It's hard to predict how useful that is. But if we go with the decentralized approach (and it looks like we will), we get this feature out of the box. The question is how much we'll have to pay for it down the road :)