From #2012
The most recent PR https://github.com/iterative/dvc/pull/2160 and a few previously merged PRs have introduced a mechanism to reuse (import) a data artifact (a model, data file, or directory) from one GitHub repository in another:
```
cd myrepo
dvc import https://github.com/iterative/example-get-started model.pkl
```
This creates a DVC-file that has a dependency on https://github.com/iterative/example-get-started (the `repo` field below), and specifically on the model.pkl file produced by the DVC pipeline inside it:
```yaml
md5: 39c9c563afc97491c86db391afd9cf82
wdir: .
deps:
- md5: 3863d0e317dee0a55c4e59d2ec0eef33
  path: model.pkl
  repo:
    url: https://github.com/iterative/example-get-started
outs:
- md5: 3863d0e317dee0a55c4e59d2ec0eef33
  path: model.pkl
  cache: true
  metric: false
  persist: false
```
This command also automatically "executes" the DVC-file; more specifically, it brings the model.pkl file into the workspace.
The existing `dvc import` command has been moved to `dvc import-url`.
Now that we have an external dependency of sorts, the question is: what should we do when we run `dvc status` or `dvc repro`? Pull the latest version every single time, as we do with `dvc import-url`?
After a few discussions we have decided that we need common semantics for all external dependencies, both `import` and `import-url`. The semantics should be aligned with DVC's primary purpose: ensuring reproducibility.
It means that if, say, in one month we decide to do `git checkout <1-month-ago>` in myrepo to get back that same DVC-file, DVC must guarantee that we get the same result (the same model.pkl, first of all) when we run `repro`, even if https://github.com/iterative/example-get-started has changed a lot since then.
The same goes for `import-url`, which can import a file into the workspace via HTTPS, from S3, etc. Right now it _always_ checks and updates the data in the workspace when you run `dvc repro`.
Thus we need to introduce some "lock" mechanism for these "import" DVC-files. They need to capture, somewhere, all the context necessary to get the same data whenever it's needed. This does not necessarily mean new fields; we might end up reusing what we already have and just changing the behavior.
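The locked semantics can be illustrated with a short sketch (the helper names are hypothetical, not DVC's actual API): `status` on an import stage would only compare the workspace file against the checksum captured at import time, never against whatever the upstream repo currently holds.

```python
import hashlib

def file_md5(path):
    """md5 of a file's contents, as stored in the DVC-file 'md5:' fields."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def import_stage_changed(locked_md5, workspace_path):
    """Locked semantics for `dvc status` on an import stage: the workspace
    copy is checked against the md5 recorded at import time, so the result
    is stable even if the remote repo has moved on."""
    return file_md5(workspace_path) != locked_md5
```

The point of the sketch is that upstream changes become invisible to `status`/`repro` until an explicit update command refreshes the lock.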
### Implementation options
1. Keep the lock in the DVC-file produced by `dvc import` itself; more specifically, a `rev-lock` field containing the hash of the git commit the repo was initially fetched from:

```yaml
...
repo:
  url: https://github.com/iterative/example-get-started
  rev-lock: 123456dfdfgdfgdf
...
```
This way each DVC-file is self-contained and has all the information needed to restore the specific version of the repo the data was initially taken from.
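To make "self-contained" concrete, here is a minimal extraction of the `url` and `rev-lock` fields from such a stage file (a hand-rolled key scan for illustration only; real DVC uses a proper YAML parser):

```python
def read_repo_lock(dvcfile_text):
    """Pull the repo 'url' and 'rev-lock' values out of a DVC-file's text.
    Minimal sketch: scans key/value lines instead of parsing full YAML."""
    info = {}
    for line in dvcfile_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key in ("url", "rev-lock"):
            info[key] = value.strip()
    return info

# A DVC-file fragment shaped like the example above.
example = """\
deps:
- md5: 3863d0e317dee0a55c4e59d2ec0eef33
  path: model.pkl
  repo:
    url: https://github.com/iterative/example-get-started
    rev-lock: 123456dfdfgdfgdf
"""
```

Everything needed to re-fetch exactly the same data (`url` plus `rev-lock`) lives in the one file, with no lookup anywhere else.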
Pros/cons for this option:
- ✅ No additional files or concepts; easier to implement (at least initially);
- ✅ We can easily work with multiple versions of the same repo;
- ❌ We are going even further toward DVC-files being machine-controlled, not human-controlled;
- ❌ Consecutive `dvc import`s of different data artifacts from the same GitHub repo may lock onto different versions of the repo. This semantics might be confusing: even though the repo is already installed, the same `dvc import` might fail because the remote repo has changed;
- ❌ Mass update becomes a mess. Since we now have multiple files containing the same rev-lock information, updating a single dependency means updating all of these files. Do we need mass update at all? Yes: if we introduce aliases, it would be nice to have something like `dvc update requests` instead of `dvc update github.com/sdfdsds/requests`;
- ❌ Duplication of information means less control. For example, if I have two dependencies from the same remote repo, a model and a dict, how do I make sure the two DVC-files are always in sync? DVC does not guarantee that, and it would be all too easy to update only one of them.
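The last drawback can be made concrete: since nothing keeps per-file rev-locks consistent, any sanity check has to sweep the DVC-files itself. A hypothetical sketch (function and field names are illustrative, not a DVC API):

```python
from collections import defaultdict

def out_of_sync_repos(stages):
    """stages: iterable of (dvcfile, repo_url, rev_lock) triples.
    Returns the repos whose importing DVC-files disagree on the locked
    revision -- exactly the drift DVC itself would not prevent here."""
    revs_by_repo = defaultdict(set)
    for _dvcfile, url, rev in stages:
        revs_by_repo[url].add(rev)
    return {url: revs for url, revs in revs_by_repo.items() if len(revs) > 1}
```

Whether such drift is a bug or a feature (pinning different artifacts to different repo versions on purpose) is part of the open question above.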
2. A central `.dvc/repos/lock` file (and maybe `.dvc/repos/list` in the future, to support aliases or to work with multiple model/data versions); exact place and names TBD. Every time we do `dvc import github/something data` we update those files automatically, putting the installed repo's revision hash into the lock file.
- ✅ Repo update becomes easier: a single place to update;
- ✅ No information duplication, hence a lower probability of error;
- ✅ Consecutive `dvc import` commands install data from the same revision of the repo. It's easier to predict the outcome;
- ❌ An update affects all stages that depend on the repo. That's fine if we don't want to depend on multiple versions of the same repo, but sometimes it could mean a mass update of a lot of pipelines we actually don't want to touch at all;
- ❌ Calculating status, pull, etc. becomes more complicated: we need to go and check the central lock file. For example, the md5 of the DVC-file itself now has to include some information from the central lock file;
- ❌ Harder to implement (at least initially);
- ❌ Requires introducing aliases to deal with multiple versions of the same repo.
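For contrast, a minimal sketch of the centralized variant (the file format and names are explicitly TBD in this proposal; the sketch just treats `.dvc/repos/lock` as a url-to-revision mapping):

```python
def resolve_rev(repo_lock, url):
    """Every import stage resolves its revision through the single central
    mapping, so consecutive imports from one repo always agree."""
    return repo_lock[url]

def update_repo(repo_lock, url, new_rev):
    """Updating a repo touches one entry -- but every stage importing
    from `url` now follows the new revision, wanted or not."""
    updated = dict(repo_lock)
    updated[url] = new_rev
    return updated
```

Note how both the main pro (one place to update) and the main con (the update is global) fall out of the same single-entry design.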
Anything else I forgot, @efiop @dmpetrov @Suor? @mroutis @pared @jorgeorpinel @villasv @ei-grad @sotte, let's discuss - share your opinions and votes. Let me know if something is not clear; it's a lot of information to digest :)
> Mass update becomes a mess.

Isn't that the same as doing `git pull repo && find -name '*.dvc' | grep repo | sed 's/rev-lock: old-hash/rev-lock: new-hash/' && dvc checkout`?
Yep, but it can potentially touch a lot of files. Imagine you had to update every single Python source file containing `import requests` every time you update the requests library.
Also, what should I do if some of them are not in sync? Should I update all of them anyway?
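The shell one-liner quoted above can also be sketched as a per-file text rewrite; the point of the "mass update" concern is that with per-file locks it has to run over every `*.dvc` file that mentions the repo (illustrative code, not a DVC command):

```python
def bump_rev_lock(dvcfile_text, repo_url, old_rev, new_rev):
    """Rewrite 'rev-lock: old' to 'rev-lock: new' in one DVC-file's text,
    but only if the file actually imports from `repo_url`. Under option 1
    a mass update means applying this to every matching *.dvc file."""
    if repo_url not in dvcfile_text:
        return dvcfile_text
    return dvcfile_text.replace(f"rev-lock: {old_rev}", f"rev-lock: {new_rev}")
```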
What do you mean by _not in sync_, @shcheklein? Also, why is DVC allowing files to be out of sync - is that useful? Is it like having one submodule of requests (say, requests.adapters) at v1 and another submodule (maybe requests.api) at v2?
Yep, you got it right! Since we essentially have a lock per DVC-file, nothing prevents us from locking two files to two different versions (revisions) of the same repository.
It's hard to predict how useful that is. But if we go with the decentralized approach (and it looks like we will), we get this feature out of the box. The question is how much we'll have to pay for it down the road :)