This request comes from a large company that has used DVC in the past but moved away mostly due to this issue.
Some companies (e.g. Google) store all of their different projects' and teams' code in a single big git repo.
In this scenario, each project has its own subdirectory in the repo, and they are expected to only make changes to that subdirectory, unless they are contributing code to another project in the organization.
Unfortunately, DVC can only create its .dvc folder in the git repo root. This is problematic for a couple of reasons:
I could probably think of a couple other reasons why this might be a problem.
It seems to me that requiring the .dvc folder to be in the root folder is pretty arbitrary, and giving the option to put it in other places in the tree would open the way for wider adoption.
Hi @guysmoilov !
Great suggestion! Are you talking about dvc init specifically, or about running dvc commands after dvc init --no-scm? The latter could be run anywhere, but subsequent commands from within that subrepo won't be able to use git. Though we would have to fix both in any case.
Btw, I suppose you are not talking about a repo consisting of git submodules, right? We are able to handle those right now. In that case, to support this scenario, we'll have to:
1) allow dvc init in directories that are under git control but are not the git repo root. We could consider doing this by default, but I'm afraid that some people will do that accidentally and then be confused as to why they are not able to dvc add something in the parent dir. A safer choice is to introduce something like dvc init --sub.
2) make the Git class (and scm in general) search for the git repo root up the tree, instead of just trying to use the dvc repo root. Enabling this behavior by default might also be confusing, so we might introduce a config option, set by dvc init --sub, that would tell dvc to behave this way.
Also, for both parts, we would be able to detect misuse and print some nice hints for users.
Overall, seems pretty straightforward. What are your thoughts, guys?
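To make point 2 a bit more concrete, here is a minimal sketch of searching for the git root up the tree. It is not DVC's actual code: find_git_root is a hypothetical helper name, and it only relies on the standard library.

```python
from pathlib import Path
from typing import Optional


def find_git_root(start: Path) -> Optional[Path]:
    """Return the closest directory at or above `start` that contains `.git`.

    Hypothetical helper for illustration; not part of DVC's API.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        # `.git` is a directory in a normal clone and a file in worktrees
        # and submodules, so only existence is checked here.
        if (candidate / ".git").exists():
            return candidate
    return None
```

Something like this would presumably only run when the proposed config option is set, so the default behavior stays unchanged.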
@efiop sounds very reasonable to me.
@efiop Hi Ruslan, I was referring to a normal dvc init, in a normal git repo without submodules.
Just adding the ability to treat separate subdirectories in a normal git repo as separate dvc projects.
(And this is just my experience/opinion, but almost no one uses git submodules, it's really awkward)
@guysmoilov Thanks for clarifying! I agree that there are not that many people using git submodules, but we've got that feature contributed by a user, so that says something :slightly_smiling_face:
@efiop Our team maintains multiple related projects in the same git repository and would love to get this feature running. I would like to get started contributing to this feature. I am new to dvc's code base and will start looking into it this week. Any pointers on what to look for and where would be great.
It would also be great to know which features this change would affect, e.g. whether creating multiple roots in the same repository would require the GC code to be modified.
@sai-prasanna thank you for your interest in this issue!
To give you some starting point:
DvcRepo: we are assuming that the git root is the same as the dvc root here. That should be reconsidered.
Some useful methods should be under the dvc/scm package, where the main logic related to code version control lives.
I hope that will give you some grasp on the issue. Please ping us with any further questions.
@pared Thanks, will take a look.
Unfortunately I can't start working on it properly right away. If anyone else wants to work on this immediately, feel free.
Ok, at the request of one of the users, I will try to summarize what is needed to complete this task:
--sub flag for dvc init: when this flag is used, we should set the proper config value.
In the few places where we initialize SCM (e.g. dvc init or when initializing git repo) we should be aware of the mentioned config option and provide some logic to search for the Git root up the tree, if it's not available in the same dir as the one we are trying to initialize. SCM is also used in the analytics module, so that has to be checked too.
We should probably detect an attempt to use dvc init --sub in the same dir as .git, as it indicates misuse of the flag, and the user should be hinted that they are probably not doing what they want to do.
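The misuse check from the last point could look roughly like the sketch below. It is only an illustration: check_init_location and InitError are hypothetical names, --sub is the flag proposed in this thread (it does not exist yet), and the messages are placeholders.

```python
from pathlib import Path


class InitError(Exception):
    """Hypothetical error type used only in this sketch."""


def check_init_location(directory: Path, sub: bool) -> str:
    """Validate where `dvc init` runs; return a hint for the user (may be empty)."""
    directory = directory.resolve()
    has_git_here = (directory / ".git").exists()
    has_git_above = any((p / ".git").exists() for p in directory.parents)

    if sub and has_git_here:
        # Misuse: --sub in the git root itself, so only hint, don't fail.
        return "--sub is not needed here: this directory is already the git root."
    if sub and not has_git_above:
        raise InitError("--sub was given, but no git repo was found up the tree.")
    if not sub and not has_git_here:
        raise InitError("not a git repo root; consider --sub or --no-scm.")
    return ""
```

Returning a hint instead of raising in the misuse case keeps a plain dvc init in the git root working as it does today.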
Point from @shcheklein to think about:
What about the use case when there is already a git & dvc root and someone wants to initialize a sub dvc repo somewhere down the tree?
@pared I think we should handle it the way git does in that case, by just ignoring it when collecting stages :slightly_smiling_face: Basically the same as having the sub-repo in the .dvcignore of the host repo.
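A rough sketch of what "ignoring it when collecting stages" could mean, assuming stage files are the ones ending in .dvc and that a nested dvc repo is any subdirectory containing its own .dvc directory. collect_stage_files is a hypothetical name, not DVC's real collection code.

```python
import os
from pathlib import Path
from typing import List


def collect_stage_files(repo_root: Path) -> List[Path]:
    """Collect `*.dvc` files under `repo_root`, skipping nested dvc repos."""
    stages: List[Path] = []
    for current, dirnames, filenames in os.walk(repo_root):
        here = Path(current)
        # Prune in place so os.walk never descends into the repo's own
        # `.dvc`/`.git` dirs or into subdirectories that are dvc roots.
        dirnames[:] = [
            d for d in dirnames
            if d not in (".dvc", ".git") and not (here / d / ".dvc").is_dir()
        ]
        stages.extend(here / name for name in filenames if name.endswith(".dvc"))
    return sorted(stages)
```

With this, running from the host repo sees only its own stages, while the nested repo is still usable from inside its own directory.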
@efiop or, for starters we can simply forbid that :)
@pared but that will break all of my tests, that I run in the dvc root :smile: I might be missing some issues here, but at least to me it feels like git-like behaviour is reasonable. Maybe you'll find some arguments against that.
We discussed this issue during planning and haven't come to any conclusion. My personal opinion is that implementing this will be costly in the long run:
So I expect this to slow us down permanently or for a long period of time. This leads to a question - how valuable is this? Previously this was mentioned in the context of configurable/partial remotes in #2095 and #2825, maybe that would be enough for many cases?
@Suor Well I don't know about prioritization, but I've talked to several companies who wanted to use DVC but this was a 100% blocker for them. They won't change their whole organization's setup to use DVC. So if you want them to ever be users you'll probably need to support it at some point.
@guysmoilov but if it were possible to use a single dvc repo per git repo and configure remotes by folder, wouldn't that be good enough?
@Suor I wouldn't think so. What makes remotes so special? I might want to use different versions of DVC for different parts of the tree, for example, have separate caches, etc. Think completely different teams in the same organization, they might not know each other or not be on the same continent.
Fwiw my use case should be solvable using configurable / partial remotes. We're building a centralized repo for all of our datasets, and some of them live in Box, others live in s3 buckets on different AWS accounts which can't be allowed to cross-contaminate etc. So if there is a robust solution to configurable / partial remotes that can convince our infosec people, I'm happy.
I think supporting separate caches might make the separation even cleaner
@pokey do you have a clear separation - one remote per project? Or do you want to keep different data types separately (e.g. models in one, datasets in another)? It seems to me that configurable remotes are a good feature by themselves - just trying to clarify what is the best way to achieve it:
dvc add
.dvc/config or somewhere nearby certain rules - mapping path (glob) -> remote.
I would say this feature is more related to:
https://github.com/iterative/dvc/issues/2095
https://stackoverflow.com/questions/58952962/how-to-use-different-remotes-for-different-folders
https://github.com/iterative/dvc/issues/2095#issuecomment-571457021
and a bunch of other related things with push/pull granularity
It's not directly related to the "multiple roots" support, but it is itself a very common issue our users hit.
@pokey please chime in in the ^^ ticket.
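For illustration, the "mapping path (glob) -> remote" idea mentioned above could work roughly like this. It is only a sketch: the rule format, the remote names and remote_for are all made up for the example, and nothing like this exists in the DVC config today.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

# Hypothetical rules: (glob pattern, remote name), first match wins.
RULES: List[Tuple[str, str]] = [
    ("datasets/box/*", "box_remote"),
    ("datasets/s3/*", "s3_remote"),
]


def remote_for(
    path: str, rules: List[Tuple[str, str]], default: Optional[str] = None
) -> Optional[str]:
    """Return the remote of the first rule whose glob matches `path`."""
    for pattern, remote in rules:
        # fnmatchcase's `*` also matches `/`, so a rule covers whole subtrees.
        if fnmatchcase(path, pattern):
            return remote
    return default
```

First-match-wins keeps the rules easy to reason about, and paths without a matching rule fall back to whatever default remote is configured.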