Our team is starting to use DVC, working on different projects and different machines using some shared servers (large # of cpu/gpu/memory, etc.).
The shared cache config is somewhat cumbersome, as it's specific to a group of repos... One set of projects and remotes will use one cache, and another set would use a different cache, as there are different restrictions on data and who has access to the information. We have specifically used dvc.local to assign cache.dir to keep path/machine specific attributes out of git; however, this does mean that folks must remember to set the local cache.dir whenever they clone or set up a new training project w/ dvc import (otherwise they end up creating a copy of the data). Users could set it globally, but this is cumbersome when working with different caches and different projects (set/unset/etc/get wrong!).
What might be nice is if the cache could be configured based on the remote... This way a system and/or global configuration could associate the correct shared cache dir.
somewhat better global config:
['remote "project_a"']
url = s3://dvc-bucket/datasets/project_a_dataset
endpointurl = https://some.endpoint.com
profile = some-profile
cache_dir = "/project_a/shared_dvc/cache"
Even better would be to allow configuration based on url matching, avoiding the need for any remote or profile naming convention... This could enable the use of a system or global config that has bucket specific settings... as the bucket itself defines access control, and already implies a grouping of what data might be shared in a cache.
even better system or global config:
['remote "s3://dvc-bucket-a/datasets/*"']
cache_dir = "/project_a/shared_dvc/cache"
profile = some-profile
['remote "s3://dvc-bucket-b/*"']
cache_dir = "/project_b/shared_dvc/cache"
profile = some-profile
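To make the URL-matching idea concrete, here is a minimal Python sketch of how a remote URL could be mapped to a cache directory. It is not how DVC works today: the `cache_dir_for_remote` helper and the pattern table are hypothetical, and glob matching via `fnmatch` is just one assumed way to interpret the `*` in the proposed section names.

```python
from fnmatch import fnmatchcase

# Hypothetical table mirroring the proposed global/system config sections.
CACHE_BY_REMOTE_PATTERN = {
    "s3://dvc-bucket-a/datasets/*": "/project_a/shared_dvc/cache",
    "s3://dvc-bucket-b/*": "/project_b/shared_dvc/cache",
}


def cache_dir_for_remote(remote_url, patterns=CACHE_BY_REMOTE_PATTERN):
    """Return the cache dir of the first pattern matching remote_url, else None."""
    for pattern, cache_dir in patterns.items():
        # fnmatchcase treats "*" as "any run of characters", including "/".
        if fnmatchcase(remote_url, pattern):
            return cache_dir
    return None


print(cache_dir_for_remote("s3://dvc-bucket-a/datasets/project_a_dataset"))
print(cache_dir_for_remote("s3://unrelated-bucket/data"))
```

A remote that matches no pattern gets `None`, which a caller could treat as "fall back to the project's own cache".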
@wdixon thanks! Few questions here:
How do you define which remote to use per project? Do you save the name in the project's config? As a workaround for configuring cache locations, I guess, you can come up with a convention around cache locations (using symlinks to actual location if needed) and save those into the project's config, WDYT? Something like - for a remoteA we always have /mnt/cache/remoteA.
How will the default DVC caching mechanism (when you run dvc add, dvc run, etc) pick which cache to use? The one attached to the default remote? Does it mean we'll need to add --remote to those commands as well?
It feels that this can be a bit more general problem. Let's say we have a project with some configuration, name, etc - we can come up with a mechanism to match it in the config and override settings. In your case we would be matching based on the default remote path/name. I wonder if there are some similar mechanisms in git, or some other tools?
We are pretty new to DVC - so perhaps there is a better way to do things... Currently, each project saves the remote definition in the .dvc/config... The way we envision using these - is defining a curated dataset that is effectively a "data registry"... A separate git/dvc project would define a particular network architecture and import one or more subsets of data registry projects required to do the training and track the model provenance.
We would like to avoid having naming conventions, etc... that provide the mapping - that work on one system - and would break when those mappings/links don't exist on another system.
I suppose multiple remotes complicate things - that is something I hadn't considered.... maybe you would have to adopt a naming convention for projects themselves - and allow the global configuration to have project/pattern matching to allow the assignment of things like cache.dir
git uses a similar approach for selecting proxies... You can configure different URL patterns to use different settings. That is the basis of how I proposed the remote matching.... but maybe if you swap remote url for project name it would be a clean solution?
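For reference, git lets `http.*` options be scoped to a URL (`http.<url>.*`), and when several entries match, the more specific path wins. Below is a simplified Python sketch of that "most specific prefix wins" rule applied to cache selection. It is only an approximation of git's matching, and the `most_specific_setting` helper, the `/shared/default/cache` path, and the `models/resnet` URL are made up for illustration.

```python
def most_specific_setting(url, settings_by_prefix):
    """Return the value whose URL prefix matches `url` and is the longest."""
    best_prefix, best_value = None, None
    for prefix, value in settings_by_prefix.items():
        base = prefix.rstrip("/")
        # Match exactly or at a "/" boundary, so that "s3://dvc-bucket"
        # does not accidentally match "s3://dvc-bucket-b/...".
        if url == base or url.startswith(base + "/"):
            if best_prefix is None or len(base) > len(best_prefix):
                best_prefix, best_value = base, value
    return best_value


# Hypothetical per-URL settings; the longer prefix overrides the shorter one.
settings = {
    "s3://dvc-bucket": "/shared/default/cache",
    "s3://dvc-bucket/datasets": "/project_a/shared_dvc/cache",
}

print(most_specific_setting("s3://dvc-bucket/datasets/project_a_dataset", settings))
print(most_specific_setting("s3://dvc-bucket/models/resnet", settings))
```

Compared with plain glob matching, this avoids depending on the order in which the entries are written.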
thanks @wdixon !
It's totally reasonable to have multiple remotes. We even have a ticket prioritized to have multiple remotes per project- it's a common ask due to a different nature of artifacts (e.g. sensitive data vs models).
Having separate caches is also reasonable to my mind. Even though dvc push won't upload "extra" artifacts, those that belong to a different project, it can be important for security reasons: better to be safe than sorry.
> We would like to avoid having naming conventions, etc... that provide the mapping - that work on one system - and would break when those mappings/links don't exist on another system.
agreed! was suggesting this as a workaround only
> but maybe if you swap remote url for project name it would be a clean solution
it can be a good idea ... like have name=ProjectA optional config setting, and in the global config [project 'name'] section to override/set settings that are relevant for that project on that machine, like cache.dir ... @iterative/engineering what do you think, guys?
> git uses a similar approach for selecting proxies
could you point me to the docs?
@wdixon @shcheklein I think I still don't quite get the situation.
Why not set the cache dir via dvc config cache.dir and keep this information in the repo?
@wdixon
The thought was that machine specific information did not belong in the tracked configuration.
got it, but then, even if we were supporting remote-cache.dir pairs, that wouldn't solve a problem for inter-os projects, you would still need to define it before starting work on a project, right?
Ahh... Ok I think I get it now:
so you would like to be able to define remote "parameters" on local config, so that, for example:
"On the current machine, if a project uses s3://specific/remote, we should automatically use the cache specified for that remote"? So that you define it once, and later DVC would know by itself which remote is associated with which cache on the particular machine that you are working on?
> could you point me to the docs?
https://git-scm.com/docs/git-config, and also with their credential interface https://git-scm.com/docs/gitcredentials
You might also have something similar to this in your .ssh/config file, allowing you to specify identity, credentials, etc.
> @wdixon
> The thought was that machine specific information did not belong in the tracked configuration.
>
> got it, but then, even if we were supporting remote-cache.dir pairs, that wouldn't solve a problem for inter-os projects, you would still need to define it before starting work on a project, right?
Was thinking that this would be in the user's global config or system config. If it's not defined, the users would end up with their own private cache under .dvc/cache - so still workable with just a checkout/pull (not broken) - but would require configuration to point to a shared cache.
> Ahh... Ok I think I get it now:
> so you would like to be able to define remote "parameters" on local config, so that, for example:
> "On the current machine, if a project uses s3://specific/remote, we should automatically use the cache specified for that remote"? So that you define it once, and later DVC would know by itself which remote is associated with which cache on the particular machine that you are working on?
Yes - that was the thought.... but project name could also work as long as we kept the name unique.
If this were a configuration based on project name.... That would imply you would need to define a name (perhaps in the core namespace?)
```ini
# .dvc/config (tracked with the project)
[core]
name = "some_name"

# global or system dvc config
[cache "some_n*"]
dir = /path/to/cache
```
@wdixon yes, that's pretty much what I had in my mind! name can be optional, so people would use only when it's needed.
Hello! Would #2095 address this? (slightly different solution). If so maybe we can merge the issues.
> Hello! Would #2095 address this? (slightly different solution). If so maybe we can merge the issues.
My initial thought was based on bucket or bucket pattern... but based on the thread - having a name associated with DVC - and being able to provide global/system override of these settings is a much more generic solution to just the cache-dir.
.dvc/config:
[core]
name = "some_name"
['remote "origin"']
url = s3://dvc-bucket/dev-repo
endpointurl = https://custom-endpoint.com
profile = your-profile
global or system dvc config:
[cache "some_n*"]
dir = /path/to/cache
url = s3://dvc-bucket/prod-repo
Right, so this is more about configuring the cache, and the remote was being proposed as a way to do that, but now we're thinking about adding a core.name config option instead. Got it.
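As a rough illustration of the `core.name` proposal, the sketch below resolves a cache directory by matching a project's name against `[cache "<pattern>"]` sections in a global config. Everything here is an assumption made for illustration: `core.name` and pattern-scoped cache sections are only proposed in this thread, `resolve_cache_dir` is a hypothetical helper, and Python's `configparser` stands in for DVC's own config handling.

```python
import configparser
from fnmatch import fnmatchcase

# Proposed (not existing) project-level setting.
PROJECT_CONFIG = """
[core]
name = "some_name"
"""

# Proposed (not existing) global or system override.
GLOBAL_CONFIG = """
[cache "some_n*"]
dir = /path/to/cache
"""

# Per-project cache used when nothing matches.
DEFAULT_CACHE_DIR = ".dvc/cache"


def _parse(text):
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser


def resolve_cache_dir(project_text, global_text):
    """Return the cache dir for the project named in project_text."""
    name = _parse(project_text).get("core", "name", fallback="").strip('"')
    if not name:
        return DEFAULT_CACHE_DIR  # unnamed projects keep their private cache
    global_cfg = _parse(global_text)
    for section in global_cfg.sections():
        # A section header such as: cache "some_n*"
        kind, _, pattern = section.partition(" ")
        if kind == "cache" and fnmatchcase(name, pattern.strip('"')):
            return global_cfg.get(section, "dir", fallback=DEFAULT_CACHE_DIR)
    return DEFAULT_CACHE_DIR


print(resolve_cache_dir(PROJECT_CONFIG, GLOBAL_CONFIG))
```

Because the name is optional in this sketch, a project without `core.name` or without a matching global section still works and simply falls back to its own `.dvc/cache`, as described earlier in the thread.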