As @shcheklein suggested, we should consider splitting data into small blocks to track data changes more efficiently. Example: giant file that has one line appended to it.
For a private data sharing/deduplication tool I've used a buzhash/rolling hash for chunking big files into small chunks pretty successfully. The borg backup tool has a nice implementation of a buzhash chunker. Maybe it's worth checking out.
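To make the idea concrete, here is a minimal content-defined chunking sketch. It uses a gear-style rolling hash rather than borg's actual buzhash, and the chunk-size parameters, names, etc. are purely illustrative, not existing DVC or borg code:

```python
# Minimal sketch of content-defined chunking with a gear-style rolling hash
# (same spirit as borg's buzhash chunker, not its actual implementation).
import hashlib
import random

random.seed(0)  # fixed gear table so chunk boundaries are reproducible
GEAR = [random.getrandbits(32) for _ in range(256)]

MIN_CHUNK = 64 * 1024        # never cut before 64 KiB
AVG_MASK = (1 << 20) - 1     # cut when low 20 bits are zero => ~1 MiB average chunks
MAX_CHUNK = 4 * 1024 * 1024  # always cut at 4 MiB


def chunk_stream(fobj):
    """Yield (sha256_hex, chunk_bytes) for a binary file-like object."""
    chunk = bytearray()
    rolling = 0
    while True:
        block = fobj.read(1 << 16)
        if not block:
            break
        for byte in block:  # pure-Python byte loop: fine for a sketch, slow for real use
            chunk.append(byte)
            rolling = ((rolling << 1) + GEAR[byte]) & 0xFFFFFFFF
            if len(chunk) >= MAX_CHUNK or (
                len(chunk) >= MIN_CHUNK and (rolling & AVG_MASK) == 0
            ):
                yield hashlib.sha256(chunk).hexdigest(), bytes(chunk)
                chunk.clear()
                rolling = 0
    if chunk:
        yield hashlib.sha256(chunk).hexdigest(), bytes(chunk)


def chunk_file(path):
    """Return the list of chunk hashes a file would be stored under."""
    with open(path, "rb") as fobj:
        return [digest for digest, _ in chunk_stream(fobj)]
```

With something like this, appending a line to a giant file only produces one or two new chunk objects; all earlier chunks keep their hashes and are already in the cache.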
@Witiko noted that there is `ficlonerange`, which we could use to create a file by reflinking smaller files into it: https://github.com/iterative/dvc/issues/2073#issuecomment-497864614
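For reference, a rough sketch of what reflinking chunks into a single workspace file could look like on Linux. Assumptions to verify: a reflink-capable filesystem (Btrfs, XFS), block-aligned chunk sizes, and the standard `FICLONERANGE` ioctl value; none of this is existing DVC code:

```python
# Sketch: assemble a big file from smaller chunk files via the Linux
# FICLONERANGE ioctl (reflinks), so no data is actually copied.
import fcntl
import os
import struct

# Linux _IOW(0x94, 13, struct file_clone_range); treat as an assumption to verify.
FICLONERANGE = 0x4020940D


def reflink_into(dest_fd, src_path, dest_offset):
    """Clone all of src_path into dest_fd at dest_offset; returns bytes cloned."""
    with open(src_path, "rb") as src:
        size = os.fstat(src.fileno()).st_size
        # struct file_clone_range { __s64 src_fd; __u64 src_offset; __u64 src_length; __u64 dest_offset; }
        # src_length == 0 means "clone up to EOF of the source"
        arg = struct.pack("qQQQ", src.fileno(), 0, 0, dest_offset)
        fcntl.ioctl(dest_fd, FICLONERANGE, arg)
        return size


def assemble(dest_path, chunk_paths):
    """Build dest_path by reflinking the chunk files into it back to back.

    Clone offsets must be block-aligned, so every chunk except the last
    would have to be a multiple of the filesystem block size.
    """
    offset = 0
    with open(dest_path, "wb") as dest:
        for path in chunk_paths:
            offset += reflink_into(dest.fileno(), path, offset)
```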
This is a pretty good solution to the chunking deduplication problem. I've used it with success for backing things up: https://duplicacy.com/.
If anything, you could use its chunking algorithm.
The way Git handles this is that it "looks for files that are named and sized similarly, and stores just the deltas from one version of the file to the next."
See https://git-scm.com/book/en/v2/Git-Internals-Packfiles
So that would be another possible approach — equivalent to chunking down to the byte. Mentioned in #1487 BTW:
- Diff's for dataset versions (see 2.1.). Which files were added\deleted\modified.
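A toy illustration of the delta idea: store the old version plus a small delta instead of a second full copy. It uses difflib only to keep the sketch short; Git's packfiles and real binary-delta formats (xdelta, bsdiff, ...) are much more compact and efficient, and none of the names here are DVC API:

```python
# Illustrative delta storage: 'copy' ops reference byte ranges of the old
# version, 'data' ops carry the new bytes; deleted ranges are simply not copied.
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes):
    """Return a list of ops: ('copy', start, end) into old, or ('data', bytes)."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("data", new[j1:j2]))
    return ops


def apply_delta(old: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, end = op
            out += old[start:end]
        else:
            out += op[1]
    return bytes(out)
```

Here `apply_delta(old, make_delta(old, new)) == new`, so only the old version plus the (usually much smaller) ops list needs to be stored.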
Either approach complicates the file linking from cache to workspace anyway.
@efiop There is a problem with it: what if the new line is inserted at the top of the file, or in the middle?
There are methods that can deal with inserts at various positions. I mentioned this earlier:
https://github.com/iterative/dvc/issues/829#issuecomment-438705585
Most of the chunks/blocks of a file stay the same; only the chunks/blocks that actually change are added. Many backup tools use similar methods.
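A tiny self-contained demo of that point, with toy chunk sizes: boundaries depend only on a small window of local content, so an insert in the middle only disturbs the chunks around it (parameters and helper names are made up for illustration):

```python
# Demo: content-defined boundaries realign right after an insertion,
# so chunks far from the edit keep the same hashes.
import hashlib

WINDOW = 16  # bytes of local context that decide a chunk boundary


def chunk_hashes(data: bytes):
    """Content-defined chunking with toy parameters (~256-byte average chunks)."""
    hashes, start = [], 0
    for i in range(WINDOW, len(data)):
        # boundary when the hash of the last WINDOW bytes ends in a zero byte
        if hashlib.sha256(data[i - WINDOW:i]).digest()[-1] == 0:
            hashes.append(hashlib.sha256(data[start:i]).hexdigest())
            start = i
    hashes.append(hashlib.sha256(data[start:]).hexdigest())
    return hashes


lines = ["row %d: %s" % (i, hashlib.md5(str(i).encode()).hexdigest()) for i in range(2000)]
old = "\n".join(lines).encode()
new = "\n".join(lines[:1000] + ["a line inserted in the middle"] + lines[1000:]).encode()

old_chunks, new_chunks = chunk_hashes(old), chunk_hashes(new)
print(len(old_chunks), "old chunks,", len(new_chunks), "new,",
      len(set(old_chunks) & set(new_chunks)), "shared")
# Only the chunk(s) around the insertion point differ; everything downstream
# realigns because boundaries depend on content, not on absolute offsets.
```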
For the record: another thing to consider is to revisit the `.dir` format so that it can be split into smaller files. E.g. `123456.dir` could look like (it might still be JSON or not):

```
{
    "dir": "654321.dir",
    "foo": "...",
    "bar": "..."
}
```

so that each `.dir` file contains the listing of a single directory (e.g. `os.listdir`), which would let us optimize cases where you are adding, say, 1 file to a giant dataset. Currently, that results in a whole new `.dir` file, which might be a 30M JSON file, i.e. potentially bigger than the file you've added.
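Roughly, a hedged sketch of what building such per-directory listings could look like. The naming scheme, hashing, and helper functions are made up for illustration, not DVC's actual format:

```python
# Sketch of per-directory .dir listings: each listing covers one directory
# level and refers to subdirectories by the hash of *their* listing, so
# adding one file only rewrites the listings along that file's path.
import hashlib
import json
import os


def save_listing(entries: dict, cache_dir: str) -> str:
    """Store a single-directory listing and return its content hash."""
    os.makedirs(cache_dir, exist_ok=True)
    payload = json.dumps(entries, sort_keys=True).encode()
    digest = hashlib.md5(payload).hexdigest()
    with open(os.path.join(cache_dir, digest + ".dir"), "wb") as fobj:
        fobj.write(payload)
    return digest


def build_dir_entry(path: str, cache_dir: str) -> str:
    """Recursively build per-directory listings; returns the root listing hash."""
    entries = {}
    for name in sorted(os.listdir(path)):
        child = os.path.join(path, name)
        if os.path.isdir(child):
            # subdirectory: reference its own .dir listing
            entries[name] = {"dir": build_dir_entry(child, cache_dir)}
        else:
            # file: reference the hash of its contents
            with open(child, "rb") as fobj:
                entries[name] = hashlib.md5(fobj.read()).hexdigest()
    return save_listing(entries, cache_dir)
```

Adding one file under, say, `data/foo/bar/` would then rewrite only the listings for `bar/`, `foo/`, and `data/`; every other `.dir` object keeps its hash and is already in the cache.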
CC @pmrowla, since we've talked about it previously.