As @shcheklein suggested, we should consider splitting data into small blocks to track data changes more efficiently. Example: giant file that has one line appended to it.
For a private data sharing/deduplication tool I've used a buzhash/rolling hash for chunking big files into small chunks pretty successfully. The borg backup tool has a nice implementation of a buzhash chunker. Maybe it's worth checking out.
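To make the idea concrete, here is a minimal content-defined chunking sketch. It uses a gear-style rolling hash rather than borg's actual buzhash, and the chunk-size parameters, names, etc. are purely illustrative, not existing DVC or borg code:

```python
# Minimal sketch of content-defined chunking with a gear-style rolling hash
# (same spirit as borg's buzhash chunker, not its actual implementation).
import hashlib
import random

random.seed(0)  # fixed gear table so chunk boundaries are reproducible
GEAR = [random.getrandbits(32) for _ in range(256)]

MIN_CHUNK = 64 * 1024        # never cut before 64 KiB
AVG_MASK = (1 << 20) - 1     # cut when low 20 bits are zero => ~1 MiB average chunks
MAX_CHUNK = 4 * 1024 * 1024  # always cut at 4 MiB


def chunk_stream(fobj):
    """Yield (sha256_hex, chunk_bytes) for a binary file-like object."""
    chunk = bytearray()
    rolling = 0
    while True:
        block = fobj.read(1 << 16)
        if not block:
            break
        for byte in block:  # pure-Python byte loop: fine for a sketch, slow for real use
            chunk.append(byte)
            rolling = ((rolling << 1) + GEAR[byte]) & 0xFFFFFFFF
            if len(chunk) >= MAX_CHUNK or (
                len(chunk) >= MIN_CHUNK and (rolling & AVG_MASK) == 0
            ):
                yield hashlib.sha256(chunk).hexdigest(), bytes(chunk)
                chunk.clear()
                rolling = 0
    if chunk:
        yield hashlib.sha256(chunk).hexdigest(), bytes(chunk)


def chunk_file(path):
    """Return the list of chunk hashes a file would be stored under."""
    with open(path, "rb") as fobj:
        return [digest for digest, _ in chunk_stream(fobj)]
```

With something like this, appending a line to a giant file only produces one or two new chunk objects; all earlier chunks keep their hashes and are already in the cache.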
@Witiko noted that there is `ficlonerange`, which we could use to create a file by reflinking smaller files into it: https://github.com/iterative/dvc/issues/2073#issuecomment-497864614
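For reference, a rough sketch of what reflinking chunks into a single workspace file could look like on Linux. Assumptions to verify: a reflink-capable filesystem (Btrfs, XFS), block-aligned chunk sizes, and the standard `FICLONERANGE` ioctl value; none of this is existing DVC code:

```python
# Sketch: assemble a big file from smaller chunk files via the Linux
# FICLONERANGE ioctl (reflinks), so no data is actually copied.
import fcntl
import os
import struct

# Linux _IOW(0x94, 13, struct file_clone_range); treat as an assumption to verify.
FICLONERANGE = 0x4020940D


def reflink_into(dest_fd, src_path, dest_offset):
    """Clone all of src_path into dest_fd at dest_offset; returns bytes cloned."""
    with open(src_path, "rb") as src:
        size = os.fstat(src.fileno()).st_size
        # struct file_clone_range { __s64 src_fd; __u64 src_offset; __u64 src_length; __u64 dest_offset; }
        # src_length == 0 means "clone up to EOF of the source"
        arg = struct.pack("qQQQ", src.fileno(), 0, 0, dest_offset)
        fcntl.ioctl(dest_fd, FICLONERANGE, arg)
        return size


def assemble(dest_path, chunk_paths):
    """Build dest_path by reflinking the chunk files into it back to back.

    Clone offsets must be block-aligned, so every chunk except the last
    would have to be a multiple of the filesystem block size.
    """
    offset = 0
    with open(dest_path, "wb") as dest:
        for path in chunk_paths:
            offset += reflink_into(dest.fileno(), path, offset)
```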
This is a pretty good solution to the chunking deduplication problem. I've used it with success for backing things up: https://duplicacy.com/.
If anything, you could use its chunking algorithm.
The way Git handles this is that it "looks for files that are named and sized similarly, and stores just the deltas from one version of the file to the next."
See https://git-scm.com/book/en/v2/Git-Internals-Packfiles
So that would be another possible approach — equivalent to chunking down to the byte. Mentioned in #1487 BTW:
- Diff's for dataset versions (see 2.1.). Which files were added\deleted\modified.
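A toy illustration of the delta idea: store the old version plus a small delta instead of a second full copy. It uses difflib only to keep the sketch short; Git's packfiles and real binary-delta formats (xdelta, bsdiff, ...) are much more compact and efficient, and none of the names here are DVC API:

```python
# Illustrative delta storage: 'copy' ops reference byte ranges of the old
# version, 'data' ops carry the new bytes; deleted ranges are simply not copied.
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes):
    """Return a list of ops: ('copy', start, end) into old, or ('data', bytes)."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("data", new[j1:j2]))
    return ops


def apply_delta(old: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, end = op
            out += old[start:end]
        else:
            out += op[1]
    return bytes(out)
```

Here `apply_delta(old, make_delta(old, new)) == new`, so only the old version plus the (usually much smaller) ops list needs to be stored.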
Either approach complicates the file linking from cache to workspace anyway.
@efiop There is a problem with it: what if the new line is inserted at the top of the file, or in the middle?
There are methods that can deal with inserts at various positions. I mentioned this earlier:
https://github.com/iterative/dvc/issues/829#issuecomment-438705585
Most of the chunks/blocks of a file stay the same; only the chunks/blocks that actually change are added. Many backup tools use similar methods.
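A tiny self-contained demo of that point, with toy chunk sizes: boundaries depend only on a small window of local content, so an insert in the middle only disturbs the chunks around it (parameters and helper names are made up for illustration):

```python
# Demo: content-defined boundaries realign right after an insertion,
# so chunks far from the edit keep the same hashes.
import hashlib

WINDOW = 16  # bytes of local context that decide a chunk boundary


def chunk_hashes(data: bytes):
    """Content-defined chunking with toy parameters (~256-byte average chunks)."""
    hashes, start = [], 0
    for i in range(WINDOW, len(data)):
        # boundary when the hash of the last WINDOW bytes ends in a zero byte
        if hashlib.sha256(data[i - WINDOW:i]).digest()[-1] == 0:
            hashes.append(hashlib.sha256(data[start:i]).hexdigest())
            start = i
    hashes.append(hashlib.sha256(data[start:]).hexdigest())
    return hashes


lines = ["row %d: %s" % (i, hashlib.md5(str(i).encode()).hexdigest()) for i in range(2000)]
old = "\n".join(lines).encode()
new = "\n".join(lines[:1000] + ["a line inserted in the middle"] + lines[1000:]).encode()

old_chunks, new_chunks = chunk_hashes(old), chunk_hashes(new)
print(len(old_chunks), "old chunks,", len(new_chunks), "new,",
      len(set(old_chunks) & set(new_chunks)), "shared")
# Only the chunk(s) around the insertion point differ; everything downstream
# realigns because boundaries depend on content, not on absolute offsets.
```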
For the record: another thing to consider is to revisit the `.dir` format so that it can be split into smaller files. E.g. `123456.dir` could look like (it might still be JSON or not):

```
{
    "dir": "654321.dir",
    "foo": "...",
    "bar": "..."
}
```

so that each `.dir` file contains the listing of a single directory (e.g. `os.listdir`), which would let us optimize cases where you are adding, say, 1 file to a giant dataset. Currently, that results in a whole new `.dir` file, which might be a 30M JSON file, i.e. potentially bigger than the file you've added.
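Roughly, a hedged sketch of what building such per-directory listings could look like. The naming scheme, hashing, and helper functions are made up for illustration, not DVC's actual format:

```python
# Sketch of per-directory .dir listings: each listing covers one directory
# level and refers to subdirectories by the hash of *their* listing, so
# adding one file only rewrites the listings along that file's path.
import hashlib
import json
import os


def save_listing(entries: dict, cache_dir: str) -> str:
    """Store a single-directory listing and return its content hash."""
    os.makedirs(cache_dir, exist_ok=True)
    payload = json.dumps(entries, sort_keys=True).encode()
    digest = hashlib.md5(payload).hexdigest()
    with open(os.path.join(cache_dir, digest + ".dir"), "wb") as fobj:
        fobj.write(payload)
    return digest


def build_dir_entry(path: str, cache_dir: str) -> str:
    """Recursively build per-directory listings; returns the root listing hash."""
    entries = {}
    for name in sorted(os.listdir(path)):
        child = os.path.join(path, name)
        if os.path.isdir(child):
            # subdirectory: reference its own .dir listing
            entries[name] = {"dir": build_dir_entry(child, cache_dir)}
        else:
            # file: reference the hash of its contents
            with open(child, "rb") as fobj:
                entries[name] = hashlib.md5(fobj.read()).hexdigest()
    return save_listing(entries, cache_dir)
```

Adding one file under, say, `data/foo/bar/` would then rewrite only the listings for `bar/`, `foo/`, and `data/`; every other `.dir` object keeps its hash and is already in the cache.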
CC @pmrowla, since we've talked about it previously.