Dvc: Support for multiple .dvc roots in a single git repo

Created on 31 Jul 2019 · 19 Comments · Source: iterative/dvc

This request comes from a large company that has used DVC in the past but moved away mostly due to this issue.

Some companies (e.g. Google) store all of their different projects' and teams' code in a single big git repo.

In this scenario, each project has its own subdirectory in the repo, and they are expected to only make changes to that subdirectory, unless they are contributing code to another project in the organization.

Unfortunately, DVC can only create its .dvc folder in the git repo root. This is problematic for a couple of reasons:

  1. The data science team in question might want to work with DVC, but they are blocked from touching the repo root, either politically or through automatically enforced permissions
  2. DVC can break backwards compatibility, and different projects in the monorepo might use different versions of DVC

I could probably think of a couple other reasons why this might be a problem.

It seems to me that requiring the .dvc folder be in the root folder is pretty arbitrary, and giving the option to put it in other places in the tree would open the way for wider adoption.

feature request p1-important product research

All 19 comments

Hi @guysmoilov !

Great suggestion! Are you talking about dvc init specifically, or about running dvc commands after dvc init --no-scm? The latter can be run anywhere, but subsequent commands from within that subrepo won't be able to use git. Though we would have to fix both in any case.

Btw, I suppose you are not talking about a repo consisting of git submodules, right? We are able to handle those right now. In that case to support this scenario, we'll have to

1) allow dvc init in directories that are under git control but are not git repo root. We could consider doing this by default, but I'm afraid that some people will do that accidentally and then will be confused as to why they are not able to dvc add something in the parent dir. A safer choice is to introduce something like dvc init --sub.
2) make the Git class (and scm in general) search for the git repo root up the tree, instead of just trying to use the dvc repo root. Enabling this behavior by default might also be confusing, so we might introduce a config option, set by dvc init --sub, that would tell dvc to behave this way.

Also, for both parts, we would be able to detect misuse and print some nice hints for users.
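
The upward search described in point 2 can be sketched as follows. This is only an illustration of the idea, not DVC's actual code: `find_git_root` is a hypothetical helper name, and the only assumption is that a git repo root is marked by a `.git` entry.

```python
from pathlib import Path
from typing import Optional


def find_git_root(start: Path) -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `.git`.

    Hypothetical helper for illustration; not part of DVC.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        # `.git` is a directory in a normal clone and a file in
        # submodules and linked worktrees, so test for existence only.
        if (candidate / ".git").exists():
            return candidate
    return None
```

With a helper like this, a dvc project initialized in `repo/team_a` would resolve its git root to `repo` rather than assuming the two roots coincide.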

Overall, seems pretty straightforward. What are your thoughts, guys?

@efiop sounds very reasonable to me.

@efiop Hi Ruslan, I was referring to a normal dvc init, in a normal git repo without submodules.
Just adding the ability to treat separate subdirectories in a normal git repo as separate dvc projects.
(And this is just my experience/opinion, but almost no one uses git submodules, it's really awkward)

@guysmoilov Thanks for clarifying! I agree that not many people use git submodules, but that feature was contributed by a user, so that says something :slightly_smiling_face:

@efiop Our team maintains multiple related projects in the same git repository and would love to get this feature running. I would like to start contributing to this feature. I am new to dvc's code base and will start looking into it this week. Any pointers on what to look for and where would be great.

It would also be great to know which features this change would affect, e.g. would creating multiple roots in the same repository require the GC code to be modified?

@sai-prasanna thank you for your interest in this issue!

To give you some starting point:

  1. You will surely need to comment-out this check
  2. when initializing DvcRepo we are assuming that git root is the same as dvc root here. That should be reconsidered.

Some useful methods should be under the dvc/scm package, which holds the main logic related to source control.

I hope that will give you some grasp on the issue. Please ping us with any further questions.

@pared Thanks, will take a look.

Unfortunately I can't start working on it properly right away. If anyone else wants to work on this immediately, feel free.

Ok, at the request of one of the users, I will try to summarize what is needed to complete this task:

  1. Introduce a --sub flag for dvc init: when this flag is used, we should set the proper config value,
    so one would need to modify the config and provide a new config option that will most likely be a boolean.
  2. In the few places where we initialize SCM (e.g. dvc init or when initializing the git repo) we should be aware of the mentioned config option and provide some logic to search for the Git root up the tree if it's not available in the same dir where we are trying to initialize it. SCM is also used in the analytics module, so that has to be checked too.
  3. We should probably detect an attempt to use dvc init --sub in the same dir where .git is, as it indicates misuse of the flag, and the user should be hinted that they are probably not doing what they want to do.
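
The flag handling and misuse detection summarized above could look roughly like this. It is a sketch under stated assumptions, not the eventual implementation: `plan_init`, `InitError`, and the `sub_repo` config key are made-up names, and the `--no-scm` case is left out for brevity.

```python
from pathlib import Path
from typing import Optional


class InitError(Exception):
    """Raised when the requested init mode does not fit the directory layout."""


def nearest_git_root(path: Path) -> Optional[Path]:
    """Walk up from `path` and return the first directory containing `.git`."""
    path = path.resolve()
    for candidate in (path, *path.parents):
        if (candidate / ".git").exists():
            return candidate
    return None


def plan_init(target: Path, sub: bool) -> dict:
    """Validate an init request and return the config values to store.

    `sub_repo` is a hypothetical key standing in for the boolean
    config option proposed in the thread.
    """
    target = target.resolve()
    git_root = nearest_git_root(target)
    if git_root is None:
        raise InitError("no git repository found at or above the target directory")
    if sub and git_root == target:
        # Point 3: --sub in the git root indicates misuse of the flag.
        raise InitError("--sub was used in the git root; a plain init is probably intended")
    if not sub and git_root != target:
        # Guards against accidentally creating a sub-project.
        raise InitError("target is not the git root; pass --sub to create a sub-project")
    return {"sub_repo": sub}
```

Raising an error keeps the sketch short; printing a hint and continuing, as suggested in the thread, would be an equally valid choice.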

Point from @shcheklein to think about:
What about the use case where there is already a git & dvc root and someone wants to initialize a sub dvc repo somewhere down the tree?

@pared I think we should handle it as git in that case by just ignoring it when collecting stages :slightly_smiling_face: Basically the same as having sub-repo in the .dvcignore of the host repo
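
As an illustration of the "ignore it when collecting stages" idea, here is a minimal sketch that walks a project tree and skips any nested directory that has its own `.dvc` folder. The function name `collect_stage_files` is hypothetical, and the sketch assumes only that stage files end in `.dvc`; it does not reflect how DVC actually collects stages.

```python
import os
from pathlib import Path
from typing import List


def collect_stage_files(repo_root: Path) -> List[Path]:
    """Collect `*.dvc` files under `repo_root`, skipping nested dvc projects.

    Hypothetical helper for illustration; not part of DVC.
    """
    repo_root = repo_root.resolve()
    found: List[Path] = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        current = Path(dirpath)
        # Prune in place so os.walk never descends into internal folders
        # or into a subdirectory that is itself a dvc project root.
        dirnames[:] = [
            d for d in dirnames
            if d not in {".git", ".dvc"} and not (current / d / ".dvc").is_dir()
        ]
        found.extend(current / name for name in filenames if name.endswith(".dvc"))
    return sorted(found)
```

The host project then behaves as if the nested project were listed in its .dvcignore, while commands run inside the nested project would use that project's own root.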

@efiop or, for starters we can simply forbid that :)

@pared but that will break all of my tests, that I run in the dvc root :smile: I might be missing some issues here, but at least to me it feels like git-like behaviour is reasonable. Maybe you'll find some arguments against that.

We discussed this issue during planning and haven't come to any conclusion. My personal opinion is that implementing this will be costly in the long run:

  • most probably lots of code in dvc implicitly presupposes dvc and git having same roots
  • most future features will need to be aware of this possibility
  • existing features will need to be adapted; we'll miss most of this in the initial implementation and will be fixing related bugs for months
  • this will add another factor to the combinatorial explosion of our tests

So I expect this to slow us down permanently or for a long period of time. This leads to a question: how valuable is this? Previously this was mentioned in the context of configurable/partial remotes in #2095 and #2825; maybe that would be enough for many cases?

@Suor Well I don't know about prioritization, but I've talked to several companies who wanted to use DVC but this was a 100% blocker for them. They won't change their whole organization's setup to use DVC. So if you want them to ever be users you'll probably need to support it at some point.

@guysmoilov but if it were possible to use a single dvc repo per git repo and configure remotes by folder, wouldn't that be good enough?

@Suor I wouldn't think so. What makes remotes so special? I might want to use different versions of DVC for different parts of the tree, for example, or have separate caches, etc. Think of completely different teams in the same organization; they might not know each other or even be on the same continent.

Fwiw my use case should be solvable using configurable / partial remotes. We're building a centralized repo for all of our datasets, and some of them live in Box, others live in s3 buckets on different AWS accounts which can't be allowed to cross-contaminate etc. So if there is a robust solution to configurable / partial remotes that can convince our infosec people, I'm happy.

I think supporting separate caches might make the separation even cleaner

@pokey do you have a clear separation, with one remote per project? Or do you want to keep different data types apart (e.g. models in one remote, datasets in another)? It seems to me that configurable remotes are a good feature in their own right; I'm just trying to clarify the best way to achieve it:

  1. being able to specify certain options per "output", aka data artifact: whether we need to push/pull it by default, which remote should be used by default, etc. This means you just change it in the .dvc files or specify it when you run dvc add.
  2. set in the .dvc/config or somewhere nearby certain rules - mapping path (glob) -> remote.
  3. probably "specifying remotes per folder" is similar to the 2. cc @Suor
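
Option 2, a path (glob) -> remote mapping, might be resolved along these lines. This is a sketch only: the rule format, the `pick_remote` name, and the remote names in the usage example are all assumptions, not an existing DVC config schema.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

# Ordered (glob pattern, remote name) pairs; the first match wins.
Rules = List[Tuple[str, str]]


def pick_remote(path: str, rules: Rules, default: Optional[str] = None) -> Optional[str]:
    """Return the remote for `path` according to the first matching rule.

    Hypothetical helper for illustration; not part of DVC.
    """
    for pattern, remote in rules:
        # With fnmatch-style globs, `*` also matches `/`, so a pattern
        # like "datasets/box/*" covers everything below that folder.
        if fnmatchcase(path, pattern):
            return remote
    return default
```

For example, with the rules `[("datasets/box/*", "box"), ("datasets/*", "s3-main")]`, the path `datasets/box/raw/a.csv` resolves to `box` and `datasets/images/x.png` to `s3-main`. Because the first match wins, more specific patterns have to be listed first.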

I would say this feature is more related to:

https://github.com/iterative/dvc/issues/2095
https://stackoverflow.com/questions/58952962/how-to-use-different-remotes-for-different-folders
https://github.com/iterative/dvc/issues/2095#issuecomment-571457021

and a bunch of other related things with push/pull granularity

It's not directly related to the "multiple roots" support, but it is itself a very common issue our users hit.

@pokey please chime in in the ^^ ticket.
