JS had a crazy (and maybe not easy to implement) idea I just wanted to keep here:
The backup tool could recognize some popular compression formats and decompress them before running the data through deduplication and storing it in the repository. This would vastly increase the chances for deduplication, e.g. for .tar.gz / bz2 / xz, if most files in the tar are the same as or similar to data we have already processed.
At restore time, it would have to reassemble / recompress the original file so that we get back a file that is bit-identical to the original.
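A minimal sketch of that round-trip requirement, assuming gzip input and Python's standard `gzip` module: decompress, then search for compression settings that reproduce the original bytes exactly, and give up if none do. The function name `try_reversible_unpack` is hypothetical. Files written by other gzip implementations can differ in header fields or in the deflate stream itself, in which case this search fails and the caller would store the file unchanged.

```python
import gzip
import struct
import zlib

def try_reversible_unpack(blob):
    """Return (payload, level, mtime) if blob can be rebuilt byte-for-byte, else None."""
    try:
        payload = gzip.decompress(blob)
    except (OSError, EOFError, zlib.error):
        return None  # not a gzip stream, or a damaged one
    # The gzip header stores the modification time in bytes 4..7 (little-endian).
    mtime = struct.unpack("<I", blob[4:8])[0]
    for level in range(1, 10):
        # Only accept settings that reproduce the original bit-identically.
        if gzip.compress(payload, compresslevel=level, mtime=mtime) == blob:
            return payload, level, mtime
    return None  # no known settings reproduce the file; store it as-is
```

At restore time, `gzip.compress(payload, compresslevel=level, mtime=mtime)` with the recorded values would rebuild the file, provided the same gzip/zlib implementation is still in use.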
Maybe a similar effect can be had with no effort by using deduplication-friendly archive formats, where one small change at the beginning of the uncompressed data changes only one or a few blocks rather than the complete compressed stream.
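The effect described here can be illustrated with a small experiment. This is only a sketch: it uses fixed-size chunks as a stand-in for a real content-defined chunker, plain `zlib` as the stream compressor, and made-up sample data. It counts how many chunks two versions share, before and after compression, when a single byte at the start differs.

```python
import hashlib
import zlib

def chunk_hashes(data, size=4096):
    """Hash fixed-size chunks (a simplification of a real chunker)."""
    return {hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)}

def shared_chunks(a, b, size=4096):
    """Number of distinct chunk hashes that both inputs have in common."""
    return len(chunk_hashes(a, size) & chunk_hashes(b, size))

# Made-up sample data: two versions differing in one byte at the very beginning.
old = b"".join(b"line %d of some package source\n" % i for i in range(20000))
new = b"X" + old[1:]

raw_shared = shared_chunks(old, new)
gz_shared = shared_chunks(zlib.compress(old, 6), zlib.compress(new, 6))
print(raw_shared, "raw chunks shared;", gz_shared, "compressed chunks shared")
```

On the uncompressed data every chunk except the first is shared. On the compressed streams the overlap is much smaller, often few or no chunks, though the exact number depends on the data and the zlib version.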
I use gzip --rsyncable. Do you know any other deduplicatable compression schemes?
No, but that would be an interesting topic to research.
Imho a backup should retrieve/restore bit identical data that you throw at it 100% of the time, no matter which data it is. This is probably difficult to achieve with compression algorithms that might have slight inconsistencies (e.g. embedded timestamps?) between versions.
I'm also not so sure that simple deduplication would compress data better than dedicated compression algorithms, so unless a lot of duplicated data is compressed into many separate archives, decompressing before deduplication might not be very helpful. In that case a global compression step (create one archive of all input files and store that) would also help.
@MarkusTeufelberger the use case JS had in mind is archiving NixOS source packages. Over time, there can be a lot of duplication between historical versions of the same package's contents (but because some parts of the content change, the package as a whole may not be deduplicatable, at least not if a "streaming compression" of everything is used).
Deduplication of archive contents should only happen if the tool can perfectly restore the archives.
i suspect zip files will be impossible, but various others may just fit very well (tarball streams)
I see issues with reproducing bit-identical data with that method also, so maybe it's better to use some compressor with a compression method optimized for deduplication (see --rsyncable).
I think this should be general being avoided but I could imagine that you could allow to define "data_unpack / data_pack" scripts for single files, which the user has to define and therefor are probably not fully transparent. Like this:
You store a folder /var/xxx-files/ as /var/backup/mytar.tgz and borg gets a file which says that this tgz file has to be feed to a script which returns a "temporary path" (or errror) for create and extract (mount?).
That script could be "un-pack-tgz
BTW.. to store "tgz" this could simply gunzip/gzip the tar file.
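A rough sketch of how such user-defined hooks might be wired up, assuming in-memory bytes instead of temporary paths to keep it short. All names here (`HOOKS`, `unpack_tgz`, `pack_tgz`, `prepare_for_store`, `restore`) are hypothetical and not part of borg. The pack hook assumes the archive was written with gzip level 9 and a zeroed timestamp; if that assumption does not hold, the file is stored untouched.

```python
import gzip

def unpack_tgz(blob):
    """User-defined unpack hook: gunzip the .tgz to get the plain tar stream."""
    return gzip.decompress(blob)

def pack_tgz(data):
    """User-defined pack hook. Assumption: level 9 and mtime 0 were used originally."""
    return gzip.compress(data, compresslevel=9, mtime=0)

# Hypothetical hook table: file suffix -> (unpack, pack).
HOOKS = {".tgz": (unpack_tgz, pack_tgz)}

def prepare_for_store(name, blob):
    """Return (hook suffix or None, bytes to hand to the deduplicator)."""
    for suffix, (unpack, pack) in HOOKS.items():
        if name.endswith(suffix):
            try:
                data = unpack(blob)
            except Exception:
                break  # hook failed: fall back to storing the file as-is
            if pack(data) == blob:
                return suffix, data  # round trip is bit-identical, safe to unpack
            break
    return None, blob

def restore(hook_suffix, stored):
    """Rebuild the original file from what was stored."""
    if hook_suffix is None:
        return stored
    _, pack = HOOKS[hook_suffix]
    return pack(stored)
```

Checking the round trip at create time means a mismatching hook costs deduplication efficiency but never correctness.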
Decompression would allow e.g. borg.tgz and borg/ to be deduplicated. Not trivial, so probably not a priority at this point, but zsync has achieved an even more impressive goal: rsyncing non-rsyncable gzips, so it is definitely possible.
i think it would be acceptable as opt-in for stream-compressed formats like tar overlaid with bzip/gzip/lzma
i care about bit-identical content of the uncompressed data,
considering the complexity of this, the concerns about / potential issues with bit-identical reproduction and that there was no actual work / progress on this for over 2 years, i am closing this.