Dvc: import: allow chaining imports somehow? (except circular)

Created on 11 Feb 2020 · 3 Comments · Source: iterative/dvc

More context around https://discordapp.com/channels/485586884165107732/485596304961962003/676405501940072453

As of now, imported data is not cached by default, so you won't be able to import any imported data:

repo1 -> ✔️ dvc import data -> repo2 🙂 -> ✖️ dvc import data -> repo3 🙁

Allowing this somehow could be useful when, for example, you're building a data registry out of other, smaller DVC repos. Right now you have to dvc get and then dvc add those artifacts from scratch in the data registry (so that they can be imported into further DVC repos).
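For reference, the current workaround can be sketched as the two commands the registry maintainer has to run per artifact. This is only an illustration: the helper name `registry_workaround_commands`, the repository URL, and the paths are hypothetical, and the helper only builds the command lines instead of running them.

```python
from typing import List


def registry_workaround_commands(source_repo_url: str, path: str, out: str) -> List[List[str]]:
    """Build the two DVC commands for the workaround (hypothetical helper).

    1. `dvc get` downloads the artifact without keeping a link to its source.
    2. `dvc add` starts tracking the downloaded copy in the registry repo,
       so that other repos can `dvc import` it from there.
    """
    return [
        ["dvc", "get", source_repo_url, path, "-o", out],
        ["dvc", "add", out],
    ]


# Example values are made up for illustration.
commands = registry_workaround_commands(
    "https://example.com/team/repo1.git", "data/images", "images"
)
```

Each list can be passed to `subprocess.run(cmd, check=True)` from inside the registry repo. Note that after `dvc add` the registry no longer knows where the data came from, which is exactly the link that chained imports would preserve.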

Ruslan mentioned something about using "links" to implement this (on Discord).

Labels: enhancement, feature request, p2-medium, research

All 3 comments

Regarding "except circular": this means that repo1 -> import into repo2 -> import back into repo1 cannot be allowed, of course.

So maybe the solution is that when you import an import stage, you simply copy the DVC-file as-is (with its original source repo URL, rev, etc.), and if the original rev_lock exists in the present repo, the import recognizes a circular import and fails.
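A minimal sketch of that idea follows. It is not DVC's implementation: `ImportSource`, `CircularImportError`, and `chain_import` are hypothetical names, and the set of local commit hashes is assumed to be supplied by the caller (for example, collected from Git).

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class ImportSource:
    """Origin of an imported artifact (simplified, hypothetical model)."""
    url: str       # source repo URL
    rev_lock: str  # commit hash the import is pinned to
    path: str      # path of the artifact inside the source repo


class CircularImportError(Exception):
    """Raised when an import would point back at the importing repo."""


def chain_import(
    upstream_origin: Optional[ImportSource],
    upstream_url: str,
    upstream_rev: str,
    path: str,
    local_commits: Set[str],
) -> ImportSource:
    """Decide which source to record when importing `path` from an upstream repo.

    If the upstream artifact is itself an import, its origin is copied as-is;
    otherwise the upstream repo becomes the origin. The import fails when the
    recorded rev_lock is a commit of the present repo.
    """
    if upstream_origin is not None:
        source = upstream_origin  # copy the original import record unchanged
    else:
        source = ImportSource(upstream_url, upstream_rev, path)

    if source.rev_lock in local_commits:
        raise CircularImportError(
            f"{source.path} originates from this repo at {source.rev_lock}"
        )
    return source


# repo2 imports data from repo1 (a plain, non-chained import).
origin = chain_import(None, "https://example.com/repo1", "aaa111", "data", {"bbb222"})

# repo3 imports the same data from repo2: the repo1 origin is carried over.
chained = chain_import(origin, "https://example.com/repo2", "bbb222", "data", {"ccc333"})
```

Importing the data from repo2 back into repo1 would raise `CircularImportError`, because `aaa111` is one of repo1's own commits. The check relies only on commit hashes, so it would miss cases where history was rewritten or the repo was re-created, which is the kind of corner case raised in the next comment.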

Just a note that we need to be careful about this and consider all the possible corner cases (e.g. circular dependencies). If we allow this, dvc will have to behave like a proper package manager when resolving dependencies, which is very hard (remember pip dep resolution PEP?).

This would be a very important feature for me. The use case is the following: files (datasets, pretrained models...) are generated across many different repositories, so I need a central data registry to catalogue the data easily. Also, if some of the original data-creating repositories are renamed or merged together, I only need to change the import in the data registry and don't have to track down every single user of the data.
