Borg: decompress, dedup, store, load, reassemble, recompress

Created on 21 Jun 2015 · 11 Comments · Source: borgbackup/borg

JS had a crazy (and maybe not easy to implement) idea I just wanted to keep here:

the backup tool could recognize some popular compression formats and decompress them before running the data through deduplication and storing it in the repository. this would vastly increase the chances for deduplication, e.g. for .tar.gz / bz2 / xz if most files in the tar are the same as or similar to stuff we already processed.

at restore time, it would have to reassemble / recompress the original file so that we get back a file that is identical to the original file.


ThomasWaldmann

All 11 comments

maybe a similar effect can be had with no effort by using deduplication-friendly archive formats, where one little change at the beginning of the uncompressed data changes just one or a few blocks rather than the complete compressed stream.
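A minimal sketch of why such a format deduplicates well, assuming fixed-size blocks that are each compressed independently. The block size and helper names are hypothetical and chosen only for the demo; this is not Borg's chunker, and gzip --rsyncable picks its reset points from the content rather than at fixed offsets.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # hypothetical block size, picked only for this demo


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Compress every fixed-size block on its own, with no shared state."""
    return [
        zlib.compress(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]


def count_changed(old_blocks: list, new_blocks: list) -> int:
    """Count block positions whose compressed bytes differ."""
    return sum(1 for old, new in zip(old_blocks, new_blocks) if old != new)


# Two versions of the same data that differ only in the very first byte.
original = b"".join(b"line %d\n" % i for i in range(200_000))
modified = b"L" + original[1:]

old_blocks = compress_blocks(original)
new_blocks = compress_blocks(modified)

# Only the block containing the edit changes; every other compressed block
# is byte-identical and could be deduplicated.
print(len(old_blocks), "blocks,", count_changed(old_blocks, new_blocks), "changed")
```

Fixed-size blocks only survive in-place edits like this one. An insertion or deletion shifts every later block boundary, which is why content-defined boundaries are usually preferred for deduplication.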

ThomasWaldmann on 21 Jun 2015

I use gzip --rsyncable. Do you know any other deduplicatable compression schemes?

lfam on 22 Jun 2015

👍1

No, but that would be an interesting topic to research.

ThomasWaldmann on 22 Jun 2015

Imho a backup should restore bit-identical data 100% of the time, no matter which data you throw at it. This is probably difficult to achieve with compression algorithms that might have slight inconsistencies (e.g. embedded timestamps?) between versions.
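The embedded timestamp concern can be illustrated with Python's standard `gzip` module, as a small sketch rather than anything Borg does. The payload is made up; the `mtime` values stand in for two compression runs at different times.

```python
import gzip

payload = b"the same uncompressed content\n" * 100

# The gzip header stores a modification time in bytes 4..7 (RFC 1952),
# so compressing identical content at different times gives different files.
first = gzip.compress(payload, mtime=1)
second = gzip.compress(payload, mtime=2)

same_content = gzip.decompress(first) == gzip.decompress(second)
same_bytes = first == second
print("same content:", same_content, "/ same compressed bytes:", same_bytes)
```

A tool that stores only the decompressed payload would therefore also have to record header fields like this one to get the original file back.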

I'm also not so sure that simple deduplication would compress data better than dedicated compression algorithms, so unless a lot of duplicated data is compressed in many separate archives, decompressing before deduplication might not be very helpful. In that case a global compression step (create an archive of all input files and store this) would also help.

MarkusTeufelberger on 14 Jul 2015

👍1

@MarkusTeufelberger the use case JS had in mind is archiving NixOS source packages. Over time, there can be a lot of duplication between historical versions of the same package's contents (but as some parts of the content change, the package as a whole is maybe not deduplicatable - at least not if a "streaming compression" of everything is used).

ThomasWaldmann on 14 Jul 2015

archive deduplication should only happen if the tool can perfectly restore them
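One way to check "can perfectly restore" up front is to try the round trip before trusting it. The sketch below assumes a gzip file and Python's `gzip` module as the only recompressor; `find_reproducing_level` is a hypothetical helper, not a Borg function.

```python
import gzip
import struct
from typing import Optional


def find_reproducing_level(blob: bytes) -> Optional[int]:
    """Return a gzip level that recreates `blob` byte for byte, else None."""
    payload = gzip.decompress(blob)
    # The header stores the original mtime as a little-endian uint32 at offset 4.
    (mtime,) = struct.unpack("<I", blob[4:8])
    for level in range(1, 10):
        if gzip.compress(payload, compresslevel=level, mtime=mtime) == blob:
            return level
    # Different compressor, extra header fields, etc.: not reproducible here.
    return None
```

A tool could deduplicate the decompressed payload only when this returns a level, and otherwise store the compressed file unchanged. Files written by other gzip implementations will often return `None`, since header fields and the deflate output itself can differ.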

i suspect zip files will be impossible, but various others may just fit very well (tarball streams)

RonnyPfannschmidt on 14 Jul 2015

👍1

I see issues with reproducing bit-identical data with that method also, so maybe it's better to use some compressor with a compression method optimized for deduplication (see --rsyncable).

ThomasWaldmann on 14 Jul 2015

👍1

I think this should be general being avoided but I could imagine that you could allow to define "data_unpack / data_pack" scripts for single files, which the user has to define and therefor are probably not fully transparent. Like this:

You store a folder /var/xxx-files/ as /var/backup/mytar.tgz and borg gets a file which says that this tgz file has to be feed to a script which returns a "temporary path" (or errror) for create and extract (mount?).

That script could be "un-pack-tgz " and returns /tmp/unpack/file/ as path which then is used by borg to backup this file. Modes could be "unpack", "cleanup", "pack" ... and could be just some simple shell scripts the community provides.

BTW.. to store "tgz" this could simply gunzip/gzip the tar file.
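A rough sketch of what such a hook could look like for the tgz case, written in Python instead of shell. The function names, the `MODES` table and the `payload.tar` file name are all hypothetical; Borg has no such interface. Note that `pack` makes no attempt to reproduce the original gzip header, so the result is generally not bit-identical to the original .tgz.

```python
import gzip
import os
import shutil
import tempfile


def unpack(tgz_path: str) -> str:
    """Gunzip `tgz_path` into a temporary directory and return the tar's path."""
    tmp_dir = tempfile.mkdtemp(prefix="unpack-")
    tar_path = os.path.join(tmp_dir, "payload.tar")
    with gzip.open(tgz_path, "rb") as src, open(tar_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return tar_path


def pack(tar_path: str, tgz_path: str) -> None:
    """Gzip the tar at `tar_path` back into `tgz_path` (header not preserved)."""
    with open(tar_path, "rb") as src, gzip.open(tgz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def cleanup(tar_path: str) -> None:
    """Remove the temporary directory created by `unpack`."""
    shutil.rmtree(os.path.dirname(tar_path))


# Hypothetical dispatch table mirroring the "unpack", "pack", "cleanup" modes.
MODES = {"unpack": unpack, "pack": pack, "cleanup": cleanup}
```

The backup tool would call `MODES["unpack"]` before create, back up the returned path, and call `MODES["cleanup"]` afterwards; extract would run `MODES["pack"]` on the restored tar.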

oderwat on 14 Jul 2015

Decompression would allow e.g. borg.tgz and borg/ to be deduplicated. Not trivial, so probably not a priority at this point, but zsync has achieved an even more impressive goal: rsyncing non-rsyncable gzips, so it is definitely possible.

gmatht on 4 Aug 2015

i think it would be acceptable as opt-in for stream-compressed formats like tar overlaid with bzip/gzip/lzma

i care about bit-identical content of the uncompressed data.

RonnyPfannschmidt on 22 Aug 2015

👍1

considering the complexity of this, the concerns about / potential issues with bit-identical reproduction and that there was no actual work / progress on this for over 2 years, i am closing this.

ThomasWaldmann on 6 Mar 2019
