UPDATE: See https://github.com/iterative/dvc/issues/3369#issuecomment-588657129 for main purpose of this issue
If I have a data registry monorepo with several subrepos inside it, each pointing to its own remote storage path, I want to be able to import and/or get data from a specific subrepo/remote location.
My situation: I have an s3 bucket, `s3://datasets/`, where I store multiple datasets, say `audio/` and `text/`. I want to track these data using dvc and a single, git-based data registry. Then I want to be able to selectively push and pull/import data from s3 through this registry for use in my other dvc projects.
So, I can make subrepos for `audio/` and `text/`, each initialized with its own dvc file and remote, and push data to s3 this way. Then, if I want to only download the audio data into a new project, I can run something like `dvc import [email protected]/datasets/audio` and it will automatically pull from the correct path in s3, corresponding to the default remote for the audio subrepo.
Thank you!
Hi @ehutt
> I have a data registry monorepo with several subrepos inside it, each pointing to its own remote storage
At first I thought this was essentially the same as #2349, but I see it's not about subdirectories but rather about Git subrepos that are DVC projects in a parent Git repo (which is not itself a DVC project). Am I correct? Maybe if you can provide a sample file hierarchy including `.git` and `.dvc` dirs, this could be cleared up.
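For instance, is it something along these lines? (A completely made-up layout, just to check my understanding.)

```
datasets-monorepo/
├── .git/           # single Git repo at the root (not itself a DVC project)
├── audio/
│   ├── .dvc/       # independent DVC project with its own remote
│   └── audio.dvc
└── text/
    ├── .dvc/       # independent DVC project with its own remote
    └── text.dvc
```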
> I have an s3 bucket where I store multiple datasets, say audio/ and text/. I want to track these using dvc...
> So, I can make subrepos for audio/ and text/, each initialized with its own dvc file and remote...
I'm not sure you need to use separate DVC projects for each sub-dataset. You may be overengineering a little. You could first try to follow the simple pattern explained in https://dvc.org/doc/use-cases/data-registries, which just tracks multiple directories in a single DVC repo.
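Very roughly, that simpler setup could look like this (the remote name and bucket path are placeholders, and it assumes the data can sit locally while you add it):

```
git init && dvc init
dvc remote add -d storage s3://datasets/dvc-storage   # one shared remote
dvc add audio    # creates audio.dvc tracking the whole directory
dvc add text     # creates text.dvc
git add . && git commit -m "Track audio and text datasets"
dvc push         # upload the cached data to the remote
```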
BTW, since your data is already in S3, you probably want to `dvc add` the sub-datasets as external outputs in your data registry.
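If I remember the external outputs workflow correctly, it needs an external cache configured next to the data, something like this (paths are illustrative; please double-check against the external dependencies/outputs docs):

```
dvc remote add s3cache s3://datasets/dvc-cache   # cache in the same bucket as the data
dvc config cache.s3 s3cache
dvc add s3://datasets/audio   # track the data where it already lives, no local copy
dvc add s3://datasets/text
```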
> Then I want to be able to selectively push and pull/import data from s3 through this registry for use in my other dvc projects...
> then I can run something like dvc import [email protected]/datasets/audio and it will automatically pull from the correct path in s3
Yep. `dvc get` or `dvc import` would be the commands for this. They take both a `url` (the data registry repo) and a `path` (each data artifact, i.e. `audio` or `text`) argument, so you can be selective with that combination. (Pulling and pushing are for synchronizing data between the cache of a DVC project and its remote storage.)
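For example (the registry URL is hypothetical):

```
# one-off download of a single artifact
dvc get https://github.com/example/dataset-registry audio

# download + create audio.dvc recording the source repo/rev, so it can be updated later
dvc import https://github.com/example/dataset-registry audio
```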
Hmmm I just noticed there's an entire conversation on this in https://discordapp.com/channels/485586884165107732/485596304961962003/679842782046584883
All that said, even if a simple data registry (single DVC+Git repo with subdirs) solves @ehutt's problem, given _SCM: allow multiple DVC repos inside single SCM repo_ #3257 (WIP), checking that `dvc get` and `dvc import` continue to work with this kind of DVC sub-project seems important as well.
Hi @jorgeorpinel thanks for the swift response.
> Hmmm I just noticed there's an entire conversation on this in https://discordapp.com/channels/485586884165107732/485596304961962003/679842782046584883
That's me. @efiop recommended that I use subrepos that are in the same git repo but dvc-initialized independently. This will only be helpful to me if `dvc get` and `dvc import` support dvc subrepos, where there may be multiple dvc files, one for each dataset, rather than at the root.
The problem I face now is that `dvc import` and `dvc get` require a default remote to be defined, but my data registry contains multiple datasets, all stored in different paths of an s3 bucket. I do not want to import an entire s3 bucket whenever I start a new project, nor do I want to have separate git repos for every dataset. This is the dilemma.
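Concretely, the registry's remotes look roughly like this (names and paths are illustrative):

```
dvc remote add audio-remote s3://datasets/audio
dvc remote add text-remote  s3://datasets/text

# each dataset gets pushed to its own location...
dvc push -r audio-remote audio.dvc
dvc push -r text-remote  text.dvc

# ...but dvc get/import from another project has no way to pick one of these remotes
```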
> BTW, since your data is already in S3, you probably want to dvc add the sub-datasets as external outputs in your data registry.
To this point, if I keep all of the data in s3 and manage it as external dependencies/outputs, does this defeat the point of having a github data registry? For example, one of my use cases involves taking raw audio, passing it through an ASR model, and then saving the resulting transcripts. Those transcripts will then be used for downstream NLP tasks. I want to be able to track which audio and model version were used to generate the transcripts, and then link that tracked info to the results of any downstream NLP tasks. If I can do this without the data ever leaving s3, then that is all the better. I just want to make sure the pipeline can be reproduced and the data are versioned.
I am building this data science/ml workflow from scratch and just trying to figure out the best way forward to effectively version our data/models/experiments for optimal reproducibility and clarity. I have not settled on a single way to organize our s3 storage or data registry so I am open to suggestions as to what you have seen work in your experience.
Thanks for your help
> The problem I face now is that dvc import and dvc get require a default remote to be defined, but my data registry contains multiple datasets, all stored in different paths of an s3 bucket
Ahhh, good point! It's actually a common pattern for dataset registry repos to use several remotes and not set any one as default, so you don't accidentally pull/push to the wrong one. So I see this as a great enhancement request; let's see what the core team thinks: #3371
> if I keep all of the data in s3 and manage it as external dependencies/outputs, does this defeat the point of having a github data registry? ... If I can do this without the data ever leaving s3, then that is all the better.
It doesn't defeat the point because you're augmenting an existing data repository with DVC's versioning and pipelining features. But please keep in mind that using external dependencies/outputs changes your workflow: DVC doesn't copy them to your local project or provide any interface for your scripts to access the files as if they were local. Your code will have to connect to S3 and stream or download the data before it can read or write it.
Also note that external outputs specified to `dvc run` require an external cache set up in the same remote location (I think; double-checking on Discord).
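Very roughly, a stage that never pulls the data locally might look like this (the script and paths are made up, and the external cache part is the bit I'm double-checking):

```
# external cache in the same bucket, needed (I believe) for external outputs
dvc remote add s3cache s3://datasets/dvc-cache
dvc config cache.s3 s3cache

# external dependency + external output: your script has to read/write S3 itself
# (e.g. via boto3); DVC won't download the data for it
dvc run -d s3://datasets/audio/raw \
        -d transcribe.py \
        -o s3://datasets/audio/transcripts \
        python transcribe.py
```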
> I am building this data science/ml workflow from scratch...
Maybe downloading the datasets locally and adding them to DVC (one by one if needed; you can delete them locally once no longer needed, see https://github.com/iterative/dvc/issues/3355#issuecomment-587554733) is the most straightforward way to build the dataset and avoid the need for external-data shenanigans.
> Ahhh, good point! It's actually a common pattern for dataset registry repos to use several remotes and not set any one as default, so you don't accidentally pull/push to the wrong one. So I see this as a great enhancement request; let's see what the core team thinks: #3371
Yes! This is exactly what I need. I noticed other people requesting this feature and thought you weren't planning to implement it. But simply allowing a remote to be specified (rather than requiring a default) when calling `dvc import` would solve all my problems.
We may eventually change the workflow to use external dependencies so as to avoid people making unnecessary copies of the data that's on s3 (for privacy/security reasons when working with customer data), but for now I think the data registry + support for importing from other remotes is the most elegant solution.
Any idea if/when you will add this feature? I will be eagerly awaiting any updates!
@ehutt please follow (and upvote with 👍) #3371 for updates on that.
But indeed, it sounds like the team's inclination for the short term is to provide a `--remote` option for `get` and `import`. I find this a bit of a workaround, but it should do the trick. (Please see/follow #2466)
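If that lands, usage could look something like this (entirely hypothetical until the option is actually designed and implemented):

```
dvc import --remote audio-remote \
    https://github.com/example/dataset-registry audio
```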
The other WIP solution (probably the one that will be ready first) is that `dvc init` will soon have a `--subdir` option to let you create DVC projects in subdirectories of a parent DVC project/Git repo (see the #3257 PR). This would change your workflow more along the lines of the chat you had with Ruslan on Discord, though.
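Once that lands, the setup would be roughly (names are illustrative):

```
cd datasets-monorepo/audio
dvc init --subdir                                   # DVC project nested inside the parent Git repo
dvc remote add -d audio-remote s3://datasets/audio
```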
@ehutt, get/import for subrepos is in the latest version, i.e. 1.7.0. Please, let us know if there's any issues. :slightly_smiling_face: