Dvc: config: set cache per project at a global/system level

Created on 2 Sep 2020 · 13 Comments · Source: iterative/dvc

Our team is starting to use DVC, working on different projects and different machines using some shared servers (large # of CPU/GPU/memory, etc.).

The shared cache config is somewhat cumbersome, as it's specific to a group of repos. One set of projects and remotes will use one cache, and another set would use a different cache, as there are different restrictions on data and who has access to the information. We have specifically used dvc.local to assign cache.dir to keep path/machine-specific attributes out of git; however, this does mean that folks must remember to set the local cache.dir whenever they clone or set up a new training project w/ dvc import (otherwise they end up creating a copy of the data). Users could set it globally, but this is cumbersome when working with different caches and different projects (set/unset/etc./get it wrong!).

What might be nice is if the cache could be configured based on the remote. This way a system and/or global configuration could associate the correct shared cache dir.

somewhat better global config:
['remote "project_a"']
url = s3://dvc-bucket/datasets/project_a_dataset
endpointurl = https://some.endpoint.com
profile = some-profile
cache_dir = /project_a/shared_dvc/cache

Even better would be to allow configuration based on URL matching, avoiding the need for any remote or profile naming convention. This could enable the use of a system or global config that has bucket-specific settings, as the bucket itself defines access control and already implies a grouping of what data might be shared in a cache.

even better system or global config:
['remote "s3://dvc-bucket-a/datasets/*"']
cache_dir = /project_a/shared_dvc/cache
profile = some-profile

['remote "s3://dvc-bucket-b/*"']
cache_dir = /project_b/shared_dvc/cache
profile = some-profile
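
A minimal Python sketch of how the URL-pattern lookup proposed above might behave. This is not an existing DVC feature: `CACHE_RULES` and `resolve_cache_dir` are hypothetical names, and interpreting the patterns as shell-style globs (via the standard library's `fnmatch`) is an assumption.

```python
from fnmatch import fnmatchcase
from typing import Dict, Optional

# Hypothetical machine-level rules: remote URL pattern -> shared cache dir.
# The patterns and paths mirror the proposed config above.
CACHE_RULES: Dict[str, str] = {
    "s3://dvc-bucket-a/datasets/*": "/project_a/shared_dvc/cache",
    "s3://dvc-bucket-b/*": "/project_b/shared_dvc/cache",
}


def resolve_cache_dir(
    remote_url: str, rules: Dict[str, str] = CACHE_RULES
) -> Optional[str]:
    """Return the cache dir of the first pattern matching remote_url, else None."""
    for pattern, cache_dir in rules.items():
        # fnmatchcase is case-sensitive on every OS; "*" also matches "/".
        if fnmatchcase(remote_url, pattern):
            return cache_dir
    # No rule matched: the caller would fall back to the project's own cache.
    return None
```

With rules like these, a project whose remote lives under `s3://dvc-bucket-a/datasets/` would pick up the project A shared cache without any per-clone setup, while a remote in an unlisted bucket would get no override.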

feature request

All 13 comments

@wdixon thanks! Few questions here:

  • How do you define which remote to use per project? Do you save the name in the project's config? As a workaround for configuring cache locations, I guess, you can come up with a convention around cache locations (using symlinks to actual location if needed) and save those into the project's config, WDYT? Something like - for a remoteA we always have /mnt/cache/remoteA.

  • How will the default DVC caching mechanism (when you run dvc add, dvc run, etc) pick which cache to use? The one attached to the default remote? Does it mean we'll need to add --remote to those commands as well?

  • It feels like this could be a more general problem. Let's say we have a project with some configuration, name, etc. We can come up with a mechanism to match it in the config and override settings. In your case we would be matching based on the default remote path/name. I wonder if there is a similar mechanism in git, or some other tools?

We are pretty new to DVC - so perhaps there is a better way to do things... Currently, each project saves the remote definition in the .dvc/config... The way we envision using these - is defining a curated dataset that is effectively a "data registry"... A separate git/dvc project would define a particular network architecture and import one or more subsets of data registry projects required to do the training and track the model provenance.

We would like to avoid having naming conventions, etc... that provide the mapping - that work on one system - and would break when those mappings/links don't exist on another system.

I suppose multiple remotes complicate things; that is something I hadn't considered. Maybe you would have to adopt a naming convention for the projects themselves, and allow the global configuration to have project/pattern matching to allow the assignment of things like cache.dir.

git uses a similar approach for selecting proxies... You can configure different URL patterns to use different settings. That is the basis of how I proposed the remote matching.... but maybe if you swap remote url for project name it would be a clean solution?

thanks @wdixon !

It's totally reasonable to have multiple remotes. We even have a ticket prioritized to have multiple remotes per project- it's a common ask due to a different nature of artifacts (e.g. sensitive data vs models).

Having separate caches is also reasonable to my mind. Even though dvc push won't upload "extra" artifacts (those that belong to a different project), it can be important for security reasons: better to be safe than sorry.

We would like to avoid having naming conventions, etc... that provide the mapping - that work on one system - and would break when those mappings/links don't exist on another system.

agreed! was suggesting this as a workaround only

but maybe if you swap remote url for project name it would be a clean solution

it can be a good idea ... like have name=ProjectA optional config setting, and in the global config [project 'name'] section to override/set settings that are relevant for that project on that machine, like cache.dir ... @iterative/engineering what do you think, guys?

git uses a similar approach for selecting proxies

could you point me to the docs?

@wdixon @shcheklein I think I still don't quite get the situation.

  1. Is it possible that one project will use more than one remote?
  2. If not, why not configure the cache dir via dvc config cache.dir and keep this information in the repo?

  • It is unlikely we would use more than 1 remote in a project definition (I say that now, but we do use multiple remotes in git all the time).
  • The project will be checked out on a variety of systems (even a mixture of Linux and Windows). The cache.dir is specific to a system, and will be an invalid path on other systems. The thought was that machine-specific information did not belong in the tracked configuration.

@wdixon

The thought was that machine specific information did not belong in the tracked configuration.

got it, but then, even if we were supporting remote-cache.dir pairs, that wouldn't solve a problem for inter-os projects, you would still need to define it before starting work on a project, right?

Ahh... Ok I think I get it now:
so you would like to be able to define remote "parameters" on local config, so that, for example:
"On the current machine, if a project uses s3://specific/remote, we should automatically use the cache specified for that remote"? So that you define it once, and later DVC would know by itself which remote is associated with which cache on the particular machine you are working on?

could you point me to the docs?

https://git-scm.com/docs/git-config, and also with their credential interface https://git-scm.com/docs/gitcredentials
You might also have something similar to this in your .ssh/config file, allowing you to specify identity, credentials, etc.

@wdixon

The thought was that machine specific information did not belong in the tracked configuration.

got it, but then, even if we were supporting remote-cache.dir pairs, that wouldn't solve a problem for inter-os projects, you would still need to define it before starting work on a project, right?

Was thinking that this would be in the user's global config or system config. If it's not defined, users would end up with their own private cache under .dvc/cache, so it is still workable with just a checkout/pull (not broken), but it would require configuration to point to a shared cache.

Ahh... Ok I think I get it now:
so you would like to be able to define remote "parameters" on local config, so that, for example:
"On the current machine, if a project uses s3://specific/remote, we should automatically use the cache specified for that remote"? So that you define it once, and later DVC would know by itself which remote is associated with which cache on the particular machine you are working on?

Yes - that was the thought.... but project name could also work as long as we kept the name unique.

If this were a configuration based on project name.... That would imply you would need to define a name (perhaps in the core namespace?)

```ini
# .dvc/config (project)
[core]
name = "some_name"

# global or system DVC config
[cache "some_n*"]
dir = /path/to/cache
```
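
To make the name-based override above concrete, here is a rough Python sketch of the lookup. Everything in it is an assumption for illustration: `core.name` and `[cache "<pattern>"]` sections are only proposed in this thread, `cache_dir_for` is a hypothetical helper, and the standard library's `configparser` stands in for whatever parser DVC actually uses.

```python
import configparser
from fnmatch import fnmatchcase

# Hypothetical project config (would live in .dvc/config).
PROJECT_CONFIG = """
[core]
name = "some_name"
"""

# Hypothetical global or system config with a name-pattern override.
MACHINE_CONFIG = """
[cache "some_n*"]
dir = /path/to/cache
"""


def cache_dir_for(project_cfg: str, machine_cfg: str, default: str = ".dvc/cache") -> str:
    """Pick the cache dir for a project by matching its optional core.name."""
    project = configparser.ConfigParser()
    project.read_string(project_cfg)
    name = project.get("core", "name", fallback=None)
    if name is None:
        return default  # unnamed projects keep their private cache
    name = name.strip('"')

    machine = configparser.ConfigParser()
    machine.read_string(machine_cfg)
    for section in machine.sections():
        # Section names look like: cache "<pattern>"
        kind, _, pattern = section.partition(" ")
        if kind == "cache" and fnmatchcase(name, pattern.strip('"')):
            return machine.get(section, "dir", fallback=default)
    return default


print(cache_dir_for(PROJECT_CONFIG, MACHINE_CONFIG))
```

Because the name is optional and the default is the project's own `.dvc/cache`, a machine without any matching section still works after a plain checkout; it just does not share a cache.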

@wdixon yes, that's pretty much what I had in my mind! name can be optional, so people would use only when it's needed.

Hello! Would #2095 address this? (slightly different solution). If so maybe we can merge the issues.

Hello! Would #2095 address this? (slightly different solution). If so maybe we can merge the issues.

#2095 seems to be about defining a configuration to override the bucket. #4519 is about being able to define how shared caches, which have machine-specific paths, are associated with a given repo or set of repos.

My initial thought was based on bucket or bucket pattern, but based on the thread, having a name associated with the DVC project and being able to provide a global/system override of its settings is a much more generic solution than just the cache dir.

.dvc/config:
[core]
    name = "some_name"
['remote "origin"']
    url = s3://dvc-bucket/dev-repo
    endpointurl = https://custom-endpoint.com
    profile = your-profile

global or system dvc config:
[cache "some_n*"]
    dir = /path/to/cache
    url = s3://dvc-bucket/prod-repo

Right, so this is more about configuring the cache, and the remote was being proposed as a way to do that, but now we're thinking about adding a core.name config option instead. Got it.
