Dvc: Better way to add large directories

Created on 24 Oct 2020 · 8 Comments · Source: iterative/dvc

Apologies if this is a duplicate.

Currently, dvc lets you add a directory in two different ways:

  • As a monolith, where there is a single .dvc file in git, pointing to a big json file in the dvc cache
  • As individual files, where you have a bazillion .dvc files, one for each file in the directory

Neither is a great choice for large directories where you need to modify files independently:

  • If you go the monolith route, any single-file change changes the directory dvc hash. Git history can't show you what file changed, unless you're really good about comments.
  • If you go the individual files route (with dvc add --recursive), you get the ability to track changes to individual files, but you end up with so many .dvc files in git that it becomes unwieldy.

A nicer option would be if the monolith, single .dvc file didn't point to a json in the dvc cache, but rather had a line for each file's path and md5 hash.

E.g., instead of mybigdir.dvc looking like

outs:
- md5: 09302c382a910fd2825f99eb437a309d
  path: 'mybigdir'

where the md5 points to a json file in the cache, it would be nice if instead the file looked like

dir: 'mybigdir'
outs:
- md5: 09302c382a910fd2825f99eb437a309d
  path: 'mybigdir/file1.txt'
- md5: b131e8b3550edaa71d010acc112e51d9
  path: 'mybigdir/file2.txt'
...

In this way, I only have one dvc file (more sane), but I also can see which file changed by looking at line differences in git history.
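To make the proposed layout concrete, here is a minimal sketch that walks a directory and renders one md5/path entry per file. It is not part of DVC: the function names (file_md5, build_detailed_outs, render_detailed_dvc) and the top-level dir key are hypothetical, and it hashes raw file bytes, which may not match the hashes DVC itself computes.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file's raw bytes, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_detailed_outs(root):
    """Return sorted (path, md5) pairs for every file under root.

    Paths are relative to root's parent, so they start with the
    directory name, e.g. 'mybigdir/file1.txt'.
    """
    parent = os.path.dirname(os.path.abspath(root))
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, parent).replace(os.sep, "/")
            entries.append((rel, file_md5(full)))
    return sorted(entries)


def render_detailed_dvc(root):
    """Render the hypothetical one-entry-per-file layout as text."""
    dir_name = os.path.basename(os.path.abspath(root))
    lines = [f"dir: '{dir_name}'", "outs:"]
    for rel, md5 in build_detailed_outs(root):
        lines.append(f"- md5: {md5}")
        lines.append(f"  path: '{rel}'")
    return "\n".join(lines) + "\n"
```

Because the entries are sorted and each file occupies its own pair of lines, a change to one file would show up as a one-line hash change in a git diff of the rendered text.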

Labels: awaiting response, feature request

All 8 comments

A good question.
I agree that our detailed cache file needs to be easier to compare. But putting all the details in a .dvc file could cause trouble when tracking really big datasets (over 1 million files). In that case, the .dvc file would grow to hundreds of MB, and Git is not good at dealing with files that big.
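A rough back-of-the-envelope check of that size concern, assuming the two-line md5/path entry format from the proposal and a guessed average path length (both are assumptions, and estimate_manifest_bytes is a hypothetical helper):

```python
def estimate_manifest_bytes(n_files, avg_path_len):
    """Estimate the size of a manifest with one md5/path entry per file."""
    # One entry in the proposed layout: a 32-character hex digest line
    # followed by a quoted path line.
    sample_entry = (
        f"- md5: {'0' * 32}\n"
        f"  path: '{'x' * avg_path_len}'\n"
    )
    return n_files * len(sample_entry.encode("utf-8"))


for path_len in (30, 50, 100):
    size_mb = estimate_manifest_bytes(1_000_000, path_len) / 1e6
    print(f"avg path length {path_len}: ~{size_mb:.0f} MB")
```

Under these assumptions a million-file directory lands at roughly 80 to 150 MB of text, and longer paths push it higher, so the order of magnitude in the comment looks plausible.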

Yeah, that's a good point. There needs to be a happy medium.

What about:

  • dvc add works as it does now, creating a big json in the dvc cache
  • dvc add --recursive works as it does now, creating a .dvc file per child file
  • dvc add --detailed works as I've described, where the single .dvc file has an entry per file.

    • Alternatively, dvc add --depth N adds per-file detail only for the first N folder levels.
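One possible reading of the --depth N idea, sketched in Python: files at most N folders below the tracked directory get their own entry, and anything deeper is rolled up into its ancestor directory at that level. The flag does not exist in DVC, and collapse_to_depth and its semantics are assumptions made purely for illustration.

```python
def collapse_to_depth(paths, depth):
    """Split paths into per-file entries and rolled-up directories.

    A file is listed individually when it sits at most `depth` folders
    below the tracked root; deeper files are represented by their
    ancestor directory one level below that limit.
    """
    files = []
    rolled_up_dirs = set()
    for path in paths:
        parts = path.split("/")
        # Exclude the root directory name and the file name itself.
        folders_below_root = len(parts) - 2
        if folders_below_root <= depth:
            files.append(path)
        else:
            rolled_up_dirs.add("/".join(parts[: depth + 2]))
    return sorted(files), sorted(rolled_up_dirs)


paths = [
    "mybigdir/readme.txt",
    "mybigdir/dogs/labels.csv",
    "mybigdir/dogs/shepherd/img8337.jpg",
    "mybigdir/dogs/bulldog/a/b.jpg",
]
files, dirs = collapse_to_depth(paths, 1)
```

With depth 1, the two shallow files keep individual entries while mybigdir/dogs/shepherd and mybigdir/dogs/bulldog would each be tracked as a single directory-level hash, which bounds how large the git-tracked file can grow.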

> If you go the monolith route, any single-file change changes the directory dvc hash. Git history can't show you what file changed, unless you're really good about comments.

dvc diff/status could, but their support for this is a bit limited right now. Have you tried those?

Overall I agree with @karajan1001: storing that stuff in git will get unwieldy, even though it might be an option for some scenarios. It is not as bad as it would be with binary files (a .dvc file is YAML, which is easy for git to split and store), but we still need very strong reasons to go that route. Let's see if status/diff are enough in this particular case.
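Conceptually, finding out which files changed inside a tracked directory comes down to comparing two path-to-hash listings. A minimal sketch of that comparison follows; it is not DVC's implementation, and diff_manifests and the sample listings are illustrative only.

```python
def diff_manifests(old, new):
    """Compare two {path: md5} mappings and classify each path."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "modified": sorted(
            path for path in old.keys() & new.keys() if old[path] != new[path]
        ),
    }


old = {
    "mybigdir/file1.txt": "09302c382a910fd2825f99eb437a309d",
    "mybigdir/file2.txt": "b131e8b3550edaa71d010acc112e51d9",
}
new = {
    # file1.txt has a different hash, file2.txt is gone, file3.txt is new.
    "mybigdir/file1.txt": "b131e8b3550edaa71d010acc112e51d9",
    "mybigdir/file3.txt": "09302c382a910fd2825f99eb437a309d",
}
changes = diff_manifests(old, new)
```

The same per-file view is available whether the listing lives in git or in the cache; the difference discussed in this thread is only where the listing is stored.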

I have not tried dvc diff enough! I'll give that a try.

However, that doesn't totally solve the issue. dvc diff will let me see which files changed, but I wouldn't be able to cherry-pick some changes and not others. They'd still be a monolith as far as dvc checkout is concerned.

> dvc diff will let me see which files changed, but I wouldn't be able to cherry-pick some changes and not others.

Just to clarify, by "cherry-pick" do you mean pulling/checking out specific versions of specific files inside the directory (without having to use git checkout <rev> and dvc checkout)?

dvc get supports downloading specific files (or entire subdirectories) from inside a versioned directory. So you could try using dvc diff to figure out which revision of the file you are looking for, and then use dvc get --rev to retrieve the specific file itself.

dvc pull can also "cherry-pick" a file or a dir:

dvc pull mybigdir/dogs/shepherd/img8337.jpg
dvc pull mybigdir/dogs/bulldog/

However, today DVC cannot add a file without pulling the entire dir (#4657). It might also be a problem to pull a subset of files based on a filename pattern like img8*.jpg, or to get the list of files to pull from a label-format file.
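Until pattern-based pulls are supported, one workaround is to expand the pattern yourself and pass the resulting paths as explicit targets. A minimal sketch, assuming you already have the list of file paths in the directory (for example from a label file); select_targets is a hypothetical helper, not a DVC function.

```python
import fnmatch


def select_targets(paths, pattern):
    """Return the paths whose file name matches a shell-style pattern."""
    return sorted(
        path
        for path in paths
        if fnmatch.fnmatchcase(path.rsplit("/", 1)[-1], pattern)
    )


known_paths = [
    "mybigdir/dogs/shepherd/img8337.jpg",
    "mybigdir/dogs/shepherd/img9001.jpg",
    "mybigdir/dogs/bulldog/img8001.jpg",
    "mybigdir/dogs/bulldog/notes.txt",
]
targets = select_targets(known_paths, "img8*.jpg")
```

Each selected path could then be given to dvc pull as a target, in the same way as the single-file example above.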

The idea of keeping .dir content in Git can be a good option in some cases. It codifies entire dirs in Git, which simplifies scenarios like DVC Lite (#4539).

@gamis please take a look at dvc diff/status as @efiop suggested and let us know if something is missing - we can create a feature request for these commands and close the current issue. If you think .dir "codification" to Git is the missing part then we probably need to rename this issue.

Thanks, Dmitry. I'll definitely try out dvc diff, but I also think .dir codification could be nice!

@gamis Btw, not sure if by cherry-picking you meant the merge process, but if you did - you might find our experimental merge-driver interesting https://github.com/iterative/dvc/issues/4162
