DVC: shared cache and dvc import

Created on 26 Aug 2020 · 9 Comments · Source: iterative/dvc

This is more of a question - related to setting up data registries and the implications of shared cache with dvc import.

Presently I have a few datasets - each created as a separate git/dvc project (each say in the 1000GB range).
Each dataset contains a group of specific images, along with several different annotations types.
Each dataset has been configured to use a separate (independent) shared cache on network attached storage, visible to several shared development servers.

/network/storage/shared_dvc/cache/project_A
/network/storage/shared_dvc/cache/project_B
/network/storage/shared_dvc/cache/project_C

This part is working.

Now the question arises from consuming these registries with a 4th project (project_D). This project contains the code defining a DL network and a training script. The network consumes a composite of information contained in registries project_B and project_C (accomplished with dvc import).

It would seem unnecessary to duplicate the cache storage.
1) Is there a way to share the existing caches for project_B and project_C?
2) Should all these independent DVC/git projects be configured to use the same cache dir?
3) Do we set up a shared cache for project_D, which will have its own independent shared cache/copy, duplicating a subset of project_B and project_C + whatever we are tracking in D?

The datasets eat up storage fairly quickly - looking for guidance to minimize the impact of duplicate copies

All 9 comments

And just to comment.... I did attempt to configure this such that all of the DVC projects point to a single shared cache - and it appears to work (project_A, project_B, project_C, project_D). Any concerns?

@wdixon Using the same shared cache dir is a good approach. The only thing that you need to be aware of is that dvc gc will need to know about all of those projects, or it might delete some files. See dvc gc --projects.
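As a rough sketch of that advice: when garbage-collecting from one project, the other projects that share the cache can be listed with `--projects`, so that files they still reference are kept. The project paths below are hypothetical placeholders, and the command is assumed to be run from inside project_D.

```shell
# Run from inside project_D (hypothetical checkout location).
# --workspace keeps cache files referenced by the current workspace;
# --projects also keeps files referenced by the listed projects.
# The paths below are placeholders for wherever each project is cloned.
dvc gc --workspace \
    --projects /path/to/project_A /path/to/project_B /path/to/project_C
```

Without the `--projects` list, `dvc gc` only considers the project it runs in, so data used solely by the other projects could be removed from the shared cache.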

Thank you. It might be nice to show an example of 2 projects using the same cache in the documentation. It wasn't clear the first time reading the page that the shared cache was meant to be shared across projects. Going back and reading again, I do see it says "everybody's projects".

@wdixon You mean in the https://dvc.org/doc/use-cases/shared-development-server ? So your case is multiple projects and multiple users, right?

Yes, that is the correct URL. And yes, multiple projects and multiple users. Initially I had set up separate caches for each project (which would be fine when the projects are completely independent). However, when I pulled in a portion of a data registry (through a dvc import), that is where I encountered the cache duplication. I ended up trying a single cache directory for all the projects, which seems to be what you recommend, and now I don't face any duplicate cache storage.

Thanks for the help.

@wdixon btw, just in case you missed this - you might want to enable cache.type symlink. If your cache is located on a different volume or NAS that would help you to avoid copy operations from/to cache on dvc add, dvc checkout, etc.
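For illustration, the setup discussed in this thread might look roughly like the following `.dvc/config` fragment, repeated in each project (project_A through project_D). The single cache path is an assumption; the thread only shows the per-project paths. `shared = group` is the option the shared development server use case describes for multi-user caches.

```ini
[cache]
    # Assumed single cache location shared by all projects (hypothetical path).
    dir = /network/storage/shared_dvc/cache
    # Make cache files group-writable so several users can share them.
    shared = group
    # Link workspace files to the cache instead of copying them.
    type = symlink
```

The same values can be set from the command line with `dvc cache dir`, `dvc config cache.shared group` and `dvc config cache.type symlink`.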

@jorgeorpinel should we update the doc a bit to make it explicit that shared cache is also about sharing cache across different projects?

Sure. I should be getting to that use case soon. Will keep this in mind 👍

I do think a bit more on the docs related to sharing cache across projects would be helpful.

I'm adding the info here: https://github.com/iterative/dvc.org/pull/1724/files#diff-7b8425b522dc0dcb5f8845ed84d12ce6L10-R18 PTAL
