- what do you do for a living?
- oh, different things, import, export, family business
(any Hollywood smuggler and now a dvc dev)
This feature was first proposed by @dmpetrov and was discussed several times here and there. Recent conversations/debates with @dmpetrov and @shcheklein resulted in a more or less defined approach, which I will lay out here.
The need: we want things to be more easily reused across repos, so we want a higher level api than just get()/read()/open(), presumably a call that returns a ready-to-use dataframe, an instantiated model, etc.
So here goes the design so far. The exporting side will need to list the available things and write some instantiating glue code, like:
```yaml
# dvc-artifacts.yml
artifacts:
  - name: face_rec_model
    description: Tensorflow face recognition model
    call: export.make_face_rec
    deps: ["data/face_rec_model.dat", "data/extra.dat"]
```
```python
import tflearn

def make_face_rec(face_rec_model_data, extra_data):
    # ... construct a model
    return model
```
Then the importing side might simply do:
```python
import dvc.api

model = dvc.api.summon("face_rec_model", "https://path/to/repo", rev=...)
model.predict(...)
```
Summon is a good (to my liking) and short name; it refers to summoning a genie or a Tesla :)
This will provide a flexible base to export/summon anything. Some questions arise about requirements, repetitive glue code, Python/R interoperability, and easing things for a publisher. I will go through these below.
**Requirements**

Instantiating a model or a dataframe will require some libraries installed, like pandas or PyTorch, possibly at particular versions. In the basic scenario we might ignore that: in most cases a user will know what the artifact is about and will have the libraries installed already. The thing is, we can't provide this seamlessly, since installing python libs inside a dvc.api.summon() call would be surprising and might break things. So requirements management will be separate anyway. A possible way:
```yaml
artifacts:
  - name: some_dataframe
    ...
    reqs:
      - pandas>=1.2.3,<2
      - otherlib>=1.2
```
```console
$ dvc summon-reqs <repo-url> <name>
pandas>=1.2.3,<2
otherlib>=1.2

$ dvc summon-reqs <repo-url> <name> | xargs pip install
```
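For a library user it could also be handy to verify the declared reqs before summoning. A minimal sketch, assuming a `reqs` list has already been fetched (the `check_reqs` helper is hypothetical; it only uses `pkg_resources` from setuptools):

```python
import pkg_resources

def check_reqs(requirements):
    """Raise if any declared requirement is missing or conflicts with what's installed."""
    problems = []
    for req in requirements:  # e.g. "pandas>=1.2.3,<2"
        try:
            pkg_resources.require(req)
        except (pkg_resources.DistributionNotFound, pkg_resources.VersionConflict) as exc:
            problems.append(f"{req}: {exc}")
    if problems:
        raise RuntimeError("unmet summon requirements:\n" + "\n".join(problems))

# check_reqs(["pandas>=1.2.3,<2", "otherlib>=1.2"])
```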
**Repetitive glue code**

There are many common scenarios, like exporting a csv file as a dataframe, which will produce repetitive glue code. This could be handled by providing common export functions in our dvcx (for dvc export) library:
```yaml
artifacts:
  - name: some_dataframe
    call: dvcx.csv_to_df
    deps: ["path/to/some.csv"]
    reqs:
      - dvcx
      - pandas
```
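For illustration, such a helper could be very thin. A sketch of what `dvcx.csv_to_df` might look like (the name comes from the example above; the real dvcx implementation may differ):

```python
import pandas as pd

def csv_to_df(csv_path, **read_csv_kwargs):
    """Instantiate a pandas DataFrame from a csv dependency checked out by dvc."""
    return pd.read_csv(csv_path, **read_csv_kwargs)
```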
In this particular scenario one might even use pandas.read_csv directly. There should be a way to parametrize those:
```yaml
artifacts:
  - name: some_dataframe
    call: pandas.read_csv
    args:
      sep: ";"
      true_values: ["yes", "oui", "ja"]
    deps: ["path/to/some.csv"]
    reqs:
      - dvcx
      - pandas
```
Note that deps are passed as positional arguments here.
**Python/R interoperability**

This is a big concern for @dmpetrov. Summoning will mostly work across languages for simpler cases like dataframes; there is probably no practical way to make PyTorch models usable in R, though there might be for simpler models like linear regression. To address this I suggest providing a dvcx library for R, sharing function and param names, so that the same yaml file will work for R too. To my mind this is a 2nd or even 3rd layer on top of basic summon functionality, so it might be added later.
**Easing things for a publisher**

Raised by @shcheklein. All these additional yaml/python files might stop someone from publishing things. I don't see this as an issue, since the plain .read()/.open()/.get() API still works without any of them. So there is a "plain text alternative": one might just write somewhere in a readme "I have this and this" and provide snippets like:
```python
import dvc.api
import pandas

with dvc.api.open("cities.csv", repo="http://path/to/repo") as f:
    cities = pandas.read_csv(f)
```
This is easy to use and does not require any new concepts, nor any additional effort from our side.
In my opinion the basic summon functionality could be implemented in the near future, as it has a high value/effort ratio and a more or less clear design.
Still all of this was only discussed in my 1-1s with @dmpetrov and @shcheklein, so some input from the rest of the team would be highly desirable. So @efiop @mroutis @pared @jorgeorpinel please join.
Good one, @Suor. Thanks!
Just to start the discussion, the first question I would clarify is why DVC-files are not enough and we need to specify (effectively) types and a list of exports in a new file.
And a related one: if a DVC-file is enough, can we make the existing dvc.api aware of types and not introduce new APIs?
@shcheklein in a general case an exported object might depend on several outs from different .dvc files. Even if we limit ourselves to a single .dvc, I don't like complicating stage files. We also start something representing a catalog with that yaml file.

Existing api calls return strings or file handles; if they suddenly started to return something else, that would be crazy.
(DISCLAIMER: all stuff below is not an attempt to say that @Suor's approach is not right and we should go with DVC-files alone, please, read this as an attempt to challenge some assumptions, get a bit more examples, and understand better the decisions we make)
> in a general case

Would love to have real-life examples before we make this decision. I would not overgeneralize until it's necessary. @dmpetrov you should have some from the model zoo experiment? and maybe from some other experiments? Would be great to see more use cases here, more examples of what we are trying to solve, etc.
> We also start something representing a catalog with that yaml file.

Not sure I get that point? Do you mean something that improves discoverability? Gives a way to a higher level dvc list interface or something? Again, not sure why a DVC-file is not enough here (granted, with some additional changes).
> Existing api calls return strings or file handles; if they suddenly started to return something else, that would be crazy.

That's true. I didn't mean reusing the same API calls literally. Some modification (or maybe even another call) might be required. Probably this is secondary to the first question: what is the best way to export artifacts.
I will let someone else do the input )
Looks like a good API design)

I'm still not buying into the idea that this is needed; it might be my lack of understanding of the problem. However, I'm down for trying new features.

Some questions:

- How would this change the current deploy workflow? Does it make sense for someone to write a Flask REST API using dvc.summon?
- Who is the "target" audience? In what cases would someone need to summon something?
- With the current design, is it possible to retrieve a specific version of such an artifact?
- When _summoning_ something, what would DVC do? (i.e. What would be the implementation for the current design?)
Cool stuff, @Suor!
> With the current design, is it possible to retrieve a specific version of such an artifact?

The rev arg is still there to refer to any git revision, same as in other api calls.
> When summoning something, what would DVC do? (i.e. What would be the implementation for the current design?)

With the example yaml file:

```yaml
# dvc-artifacts.yml
artifacts:
  - name: face_rec_model
    description: Tensorflow face recognition model
    call: export.make_face_rec
    deps: ["data/face_rec_model.dat", "data/extra.dat"]
```

dvc.api.summon("face_rec_model", repo="<repo-url>", rev="<rev>") will:

1. Clone the repo-url to a tmp dir, checkout rev.
2. Read dvc-artifacts.yml and find the face_rec key there.
3. Checkout the deps ("data/face_rec_model.dat" and "data/extra.dat") into that tmp dir.
4. Run something like:

```python
sys.path.insert(0, tmp_dir)
# This is extracted from "call" and "deps"
import export
artifact = export.make_face_rec("data/face_rec_model.dat", "data/extra.dat")
sys.path.remove(tmp_dir)
return artifact
```
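To make the flow concrete, here is a rough end-to-end sketch using only generic tooling (git via subprocess, PyYAML, importlib); the actual dvc internals and error handling will certainly differ:

```python
import importlib
import os
import subprocess
import sys
import tempfile

import yaml

def summon(name, repo, rev=None):
    tmp_dir = tempfile.mkdtemp()
    subprocess.check_call(["git", "clone", repo, tmp_dir])
    if rev:
        subprocess.check_call(["git", "checkout", rev], cwd=tmp_dir)

    # Find the artifact spec in dvc-artifacts.yml
    with open(os.path.join(tmp_dir, "dvc-artifacts.yml")) as f:
        spec = next(a for a in yaml.safe_load(f)["artifacts"] if a["name"] == name)

    # Fetch/checkout the declared deps into the tmp dir
    subprocess.check_call(["dvc", "pull"] + list(spec["deps"]), cwd=tmp_dir)

    # "call: export.make_face_rec" -> import `export`, call `make_face_rec(*deps, **args)`
    module_name, func_name = spec["call"].rsplit(".", 1)
    sys.path.insert(0, tmp_dir)
    try:
        func = getattr(importlib.import_module(module_name), func_name)
        deps = [os.path.join(tmp_dir, d) for d in spec["deps"]]
        return func(*deps, **spec.get("args", {}))
    finally:
        sys.path.remove(tmp_dir)
```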
> - Clone the repo-url to a tmp dir, checkout rev.
> - Read dvc-artifacts.yml and find the face_rec key there.

Wouldn't it be "Read dvc-artifacts.yaml" and then checkout to the revision? Checking out before reading dvc-artifacts.yaml might give us a working directory where dvc-artifacts.yaml doesn't exist:
echo "v1" > foo
dvc add foo
git add foo.dvc
git commit -m "add foo v1"
echo "v2" > foo
dvc add foo
git add foo.dvc
git commit -m "add foo v2"
touch dvc-artifacts.yaml
git add dvc-artifacts.yaml
git commit -m "add artifacts"
If you want to access foo v1, checking out before reading dvc-artifacts.yaml will not work.
The overlap between args and deps is a little bit confusing.
> Wouldn't it be "Read dvc-artifacts.yaml" and then checkout to the revision?

The dvc-artifacts.yml resides in the source repo, so you need to checkout that first. The rev is specified in the api call, not in dvc-artifacts.yml, so a single name there points to all available git revisions, in a way.
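So the same name can be summoned at different revisions, for example (the tags here are hypothetical):

```python
import dvc.api

model_v1 = dvc.api.summon("face_rec_model", repo="https://path/to/repo", rev="v1.0")
model_v2 = dvc.api.summon("face_rec_model", repo="https://path/to/repo", rev="v2.0")
```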
> Checking out before reading dvc-artifacts.yaml might give us a working directory where dvc-artifacts.yaml doesn't exist.

Yes, this is the issue. But if you don't have dvc-artifacts.yml, or don't have the specific artifact referenced there at a specific revision, then the instantiating code (export.py) won't be there either, and you won't know how to instantiate the thing.
It is possible to separate dvc-artifacts.yml from the dvc data repo somehow:
```yaml
# dvc-artifacts.yml in https://github.com/company/dvc-catalog
artifacts:
  - name: face_rec_model
    description: Tensorflow face recognition model
    repo: https://github.com/company/face-rec
    rev: 1.1  # Or should we allow specifying this in a .summon() call?
    call: export.make_face_rec  # refers to export.py in `dvc-catalog` repo
    deps: "data/face_rec_model.dat"  # refers to a file managed by `face-rec` repo

  - name: cats_dogs_model
    description: Tensorflow cats vs dogs model
    repo: https://github.com/company/cats-dogs  # other repo
    rev: v2
    call: ...
    deps: ...
```
This way you can reference artifacts from external sources and wrap them into a summonable form. This could be a future extension. It, however, implies at least two source repos (catalog and data) and might be overkill for many use cases.
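For a summoning user the call would then point at the catalog repo rather than the data repo, something like (a sketch reusing the hypothetical URLs above):

```python
import dvc.api

model = dvc.api.summon("face_rec_model", repo="https://github.com/company/dvc-catalog")
```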
Lots of questions arise here too, like distinguishing source and catalog revs, code refs in call, and overall usability. So let's postpone this.
This extension possibility is, however, another argument for separating artifact declaration files from stage files.
> The overlap between args and deps is a little bit confusing.

It is. I didn't come up with anything better for now.
@Suor thank you for initiating the discussion! This is a really great direction for DVC.
In this issue, only the data reuse/getting part is discussed, which is the most important part from an implementation point of view. I think it makes sense to discuss all the data catalog scenarios here to make better requirements for this feature.
But first, what makes a set of files a data catalog:
A special use case: model zoo.
Ideally, we should be able to create a repository with many models, like https://github.com/cadene/pretrained-models.pytorch, with all the information available through the DVC API.
So, instead of reading a text README, we can use a programmatic API.
https://intake.readthedocs.io/en/latest/
@Suor is proposing to implement everything through Python, which limits interoperability.
I'd suggest creating a more universal approach with different types/plugins: there are dozens of data frame types - each can be used from any language. See https://intake.readthedocs.io/en/latest/plugin-directory.html.
Python can be one of the supported types/plugins and all the ML models (like PyTorch) will be used through this type. So, we won't block future interoperability for common types.
Initially, only two types are needed: CSV and Python (which covers the remaining types for now).
One of the technical challenges that I see today is importing/getting/summoning multi-file resources. For example, to import a PyTorch model you need to get:
@Suor What would be the best way to overcome this issue? Do we need a specific API call? Should we clone the repo and keep it until the importing and instantiation happen? You should have more context on this. Could you please specify this and maybe even open an issue?
@dmpetrov depending on several artifacts is demonstrated in the very first example; they need to be specified though. This is the only way for dvc to know what it should fetch and checkout. I don't know any reliable way to do that implicitly; there are a few ways to do it unreliably though, e.g. monkeypatching open() or inspecting code :).
Depending on any files managed by git is not an issue with the current implementation, we do a clone anyway. I guess we will need to stick to that, because people will tend to forget some utils.py they import or config.yaml they read along the way, and since these are data scientists rather than programmers who will be exporting stuff, we should design the system to be forgiving, DWIM style.
> List all the models (50+ models)

This would look cool in a Jupyter notebook.
> Show metrics available

I guess we can implement this on top of dvc pipelines somehow. We can also show/visualize the whole pipeline: let people see the data and the code used, and all the data transformations.
This looks like an advanced thing to me.
> Upload an updated version of the model back to the repo (advanced case)

I don't think this is possible without breaking a run or an import stage, which has that model as an out. It is only possible for add stages, which means the model pipeline is already broken and we don't know how it was built, etc.
> @Suor is proposing to implement everything through Python, which limits interoperability.

We discussed this, and implementing it whichever way limits interoperability or extensibility. I see how a more declarative way would facilitate eyeballing and maybe some alternative implementations for a very limited number of data types like csv.
Still not sure this is a good idea to have from the start, e.g. we end up with two ways of doing things instead of one.
Different scenarios of data catalog usage. We can think about these scenarios as requirements for this issue.
DVC-repo with pretrained pytorch models (resnet18, resnet50, resnet101): https://github.com/dmpetrov/dvc-pretrained-resnet
Usage:
```python
model = dvc.api.summon('https://github.com/dmpetrov/dvc-pretrained-resnet',
                       artifact='resnet50')  # since there are many artifacts/models
```
Note, type: python specifies the instantiation code. The code might be in that repo, in the current repo, or in a common library (like pd.read_csv, see @Suor's examples). So, it looks like this method can be a part of the DVC codebase since no pytorch-specific stuff is required.
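For illustration, the glue code in such a repo could be a small torchvision wrapper; everything below (file name, function name) is an assumption for the sketch, not taken from the actual repo:

```python
# export.py (hypothetical)
import torch
import torchvision.models as models

def make_resnet50(weights_path):
    """Build a resnet50 and load the DVC-tracked weights file."""
    model = models.resnet50()  # architecture only, no weight download
    model.load_state_dict(torch.load(weights_path, map_location="cpu"))
    model.eval()
    return model
```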
Example of using model zoos: https://github.com/dmpetrov/dvc_for_resnet-50
Repo: https://github.com/iterative/df_sea_ice
Usage:
```python
df = dvc_cat.summon('https://github.com/iterative/df_sea_ice')
```
Note, the instantiation logic should be implemented as a library (dvc_cat?). In the same way we can provide a library for R/Julia/Spark.
Repo: https://github.com/iterative/df_sea_ice_no_header
Usage:
```python
df = dvc_cat.summon('https://github.com/iterative/df_sea_ice_no_header')
```
To bring only a slice of data. A repo example is not ready.
Usage:
```python
df = dvc_cat.summon.df_pattern('https://github.com/iterative/df_slicing',
                               pattern=dvc_cat.FileNamePattern('user-actions-2019-11-*.csv'))
```
The pattern can be more complicated: dvc_cat.DataRangePattern("2019-%m-%d", start_date=x, end_date=y). This is very important for analytical scenarios but it is not needed in the first implementation.
@iterative/engineering please share your opinions about the scenarios. Is something missing?
We are on the same page on the big-picture part. Some thoughts, mostly small or even cosmetic, go below.
We need to answer a couple of questions as I see it:

1. Will a user need to specify the .yml file, or will we have a default one?
2. If we go with a default name, say dvcartifacts.yml or dvcexport.yml, then may we postpone support for multiple yamls in a repo? We might never actually need it.

My answers: no and postpone.
I would also like to keep the .summon() API close to the other calls, so:

```python
model = dvc.api.summon(
    'resnet50',  # artifact name
    repo='https://github.com/dmpetrov/dvc-pretrained-resnet',  # pass by name
    rev='v1.2'  # optional
)
```
We still need to pass the artifact name:

```python
# I prefer `dvcx` for dvc export/extension, also short
df = dvcx.summon('sea_ice', repo='https://github.com/iterative/df_sea_ice_no_header')
```
This, however, presupposes that we have some default way to summon csv files. We might want to avoid this and make it:

```python
# Read csv into dataframe
df = dvcx.to_df('sea_ice', repo='https://github.com/iterative/df_sea_ice_no_header')

# Read csv into Numpy array
arr = dvcx.to_ndarray('sea_ice', repo='https://github.com/iterative/df_sea_ice_no_header')
```
Note that I didn't call it .csv_to_df(): a summoning user should not need to worry about the source format as long as it is supported, be it csv, hdf, json or whatever. He or she cares about the result format, i.e. a dataframe, a numpy array, or whatever it might be on a particular platform.
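Under the hood such a result-oriented helper could just dispatch on the file format after the dep is checked out. A sketch (the extension-based dispatch is an assumption, not a dvcx spec):

```python
import os

import pandas as pd

_READERS = {".csv": pd.read_csv, ".json": pd.read_json, ".h5": pd.read_hdf}

def _to_df(path, **kwargs):
    """Return a DataFrame regardless of whether the checked out dep is csv, json or hdf."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in _READERS:
        raise ValueError(f"unsupported format: {ext}")
    return _READERS[ext](path, **kwargs)
```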
**Model Zoo vs Model Source**

These are two scenarios which slightly contradict each other. In a "model source" scenario we have the model together with the data and code used to build it, and its metrics. All these things are stored in git, and we can potentially show them to the user, which is cool.

However, if we dvc add a model file, then we don't have any history or metrics, which we probably want to provide in an export .yml file. We will need some coordinated changes to keep those relevant, i.e. update the model, update the metrics, maybe some other metadata (dates, authors, refs to data, method description). All of it duplicates data available naturally in a model source repo.

Then again, we may turn a model zoo into a model source by importing models from source repos, if those models are built within dvc repos of course.
On patterns: let's postpone that discussion, though I see the value.
> I prefer dvcx for dvc export/extension

Quick question: why isn't it just dvc.api.summon, dvc.api.summon( ... ).to_df(), etc.?
> Model Zoo vs Model Source

I also think DVC shines best in the model source scenario, but building model registries/catalogs without code or data sounds like a useful thing to do, and it can be achieved with dvc import instead of dvc add, so the connection to the source projects is kept.

UPDATE: Oh, I see this is already mentioned:

> Then again, we may turn a model zoo into a model source by importing models from source repos...

So it seems we agree the Model Zoo concept is most appropriate, as it can indirectly include model sources with DVC.
@Suor thank you for the great feedback and the details that you have noticed! I agree with most of the items you brought up.

It looks like the major unanswered question is model source vs. model zoo. Metrics files and the number of artifact files are related to this, I believe.

First, it looks like the scenario of a model source that was generated by data, code and a corresponding metrics file is a bit restrictive. In reality, metrics files can be created separately (custom or taken from previous runs), and DVC should not have control over this. From this point of view, the difference between a metrics file (csv/json) and the artifact file that we are inventing now is not as dramatic as it might look.

Second, in an ideal scenario, an artifact yaml file should be generated from a dvc pipeline the same way as a metrics file. We will probably need a library to create metrics files and the artifact yaml. Even without artifacts it would be nice to have.
So, why don't we just merge these two concepts? We can say that any output yml file which is marked as a metrics file is a dvc-artifact.

Disadvantage: you need to have a corresponding DVC-file even if you don't use data files at all. Create that file with dvc run -M artifact.yml -O sea_ice.csv.
What do you folks think about treating yaml files the same way as metric ones?
PS: Also, we discussed how to get artifacts but have not discussed how artifacts can be imported. We might need a new command like dvc import-artifact resnet50 http://myrepoaddress.
> > I prefer dvcx for dvc export/extension
>
> Quick question: why isn't it just dvc.api.summon, dvc.api.summon( ... ).to_df(), etc.?
@jorgeorpinel dvc.api.summon.* or dvcx.*: it is a really good question. Which method to use depends on where the deserialization code lives.

For CSV we need code that can deserialize it. That code will depend on libraries like pandas that we don't want to include as dependencies of DVC. So, it is better to extract it into a separate package, dvcx.
In the resnet case, the python code is part of the imported repo (type: python). The user just calls that code for deserialization. In this case, it is okay to leave all the dependency/library management on the user's shoulders. So, we can implement the basic stuff inside DVC without introducing new dependencies.
> Second, in an ideal scenario, an artifact yaml file should be generated from a dvc pipeline

The artifact file is completely human written; it has many things like a name, a description and a ref to human written code. None of this can be generated. A metric, in contrast, is an output, which will be rewritten by a command.
So I don't see how this may be combined:

- If you dvc add a model, then how do you add a metrics file to it?
- There would be no single dvcexport.yml, so all .summon() calls will need to specify the .yml file.
- Will you dvc run some command to generate a .yml file for you?

> Also, we discussed how to get artifacts but have not discussed how artifacts can be imported.

You first need to say what you mean by importing artifacts. For me artifacts are strictly in-memory objects. If you mean importing all their deps plus the instantiating code, then why do we need this?
So for model zoo I see two options:
These are not mutually exclusive, but I would prefer starting with one.
A high-level consideration: do not mix human written and machine generated content in a single place. This inhibits its human writability and readability.
P.S. I know we already do that in stage files, but it's hard to do anything about that now and we had issues with that.
@Suor your points about dvc add, single yml files, missing metrics files and human readability are good, but they could be overcome or are not so crucial.

However, the point about outputs (generating the artifact file) is a great one: it would create duplication (retrain a model when the model description changes, or modify it in two places and commit). It could be better to keep artifact files and metrics files separate. I don't think that any change coordination is needed at this point; the user should decide. Also, it looks like the simplest decision from the implementation standpoint.

PS: Let's discuss importing artifacts later. I think it is a useful feature. The only problem is that it won't work for python imports, only for "catalog native" types such as csv. This discussion can be postponed till the first implementation.
Should we add dvc.api to analytics? Does it make sense?
@mroutis Good question. It does make sense, but we need to take special care with that one. Consider creating a separate ticket.
What about using dvc-formulas.yml or dvc-conjures.yaml instead of artifacts?

The _caster_ (_magician_) uses _formulas_ to _summon_ _agents_.

My favorite would be grimoire.yaml (but quite obscure).

_reading that sentence with the following soundtrack_ https://www.youtube.com/watch?v=Bz6Kc4rcfOk
reference: https://github.com/huggingface/transformers
I think a nice addition would be to pass parameters through summon:

```python
dvc.api.summon("say_hello", params={"name": "world"}, repo="https://github.com/blahblah")
```
This brings more flexibility.
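One way the override could work inside summon (the merge semantics here are an assumption, not a spec): caller-supplied params simply win over the args declared in the yaml file.

```python
def build_call_kwargs(yaml_args, params=None):
    """Merge yaml-declared args with caller-supplied params; params take precedence."""
    kwargs = dict(yaml_args or {})
    kwargs.update(params or {})
    return kwargs

# build_call_kwargs({"name": "everyone"}, {"name": "world"}) -> {"name": "world"}
```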
@mroutis yes, parameter overriding should be supported in the function. Sorry, it was not specified directly.
@mroutis formulas.yml (without the dvc- prefix) is a good one.
@mroutis for the items inside the yml you can use the term data objects; it is a nice addition to data files and data artifacts. These items have types, so object explains the nature of this thingy pretty well.
I like the proposed terminology, @dmpetrov, will do the change :)
So type: csv summons are implemented in https://github.com/iterative/dvcx; it requires the latest master dvc as of now. Things that are not in yet:
docs?
Yes, def. docs! Please see iterative/dvc.org/pull/908 (although I'll probably open a separate issue for a more detailed summon() ref), #3092, and new: iterative/dvcx/issues/1
Docs have a question mark because I see keeping everything in the README as an advantage for smaller projects, both for users and devs. Creating docs makes things slow, which is not good for a new, evolving project. It takes weeks to merge something to docs now.
@Suor README + good docstrings + examples should be enough, I agree. We'll move the stuff that needs to be on dvc.org (for SEO, for awareness of summon's existence, etc.) after that.
> It takes weeks to merge something to docs now.

Yep, we can try to address that together at the next retro.
Yeah "docs" doesn't have to mean the dvc.org website. 2 of the 3 issues are linked are about docstrings. READMEs are great too (and may need more love, do we ever review them?). But all these kinds of docs have different public and goals probably. Good topic for retro indeed.
> weeks to merge something to docs

I checked the last 10 normal PRs in dvc.org. Average merge time is around 24 hours. But yes, it can take much longer sometimes.
@Suor What is the status of this ticket? What are our next steps?
Closing the remaining points in favor of separate issues in dvcx: https://github.com/iterative/dvcx/issues/5 and https://github.com/iterative/dvcx/issues/6.