Borg: Deduplication not working with data read from stdin?

Created on 17 Jun 2018 · 15 Comments · Source: borgbackup/borg

Have you checked borgbackup docs, FAQ, and open Github issues?

Yes

Is this a BUG / ISSUE report or a QUESTION?

ISSUE

System information. For client/server mode post info for both machines.

Arch Linux, Kernel 4.16.13-2-ARCH

Your borg version (borg -V).

1.1.6

Operating system (distribution) and version.

Arch Linux, rolling release

Hardware / network configuration, and filesystems used.

BTRFS

How much data is handled by borg?

100 MB

Full borg commandline that led to the problem (leave away excludes and passwords)

[user@ideapad fstbkp]$ borg init -e keyfile-blake2 /volumes/bkp/borg
# 72K /volumes/bkp/borg

[user@ideapad fstbkp]$ tar --xattrs -p -c -C /var/lib/pacman/local . | borg create -v -p --stats --stdin-name tar /volumes/bkp/borg::pacman-local-'{now}' -
Enter passphrase for key /home/user/.config/borg/keys/volumes_bkp_borg.2:
------------------------------------------------------------------------------
Archive name: pacman-local-2018-06-17T01:49:45
Archive fingerprint: cae5730bb70d16168417033314f01f6a186ee5c00auseraaac050dee201604b64                                                                                   
Time (start): Sun, 2018-06-17 01:49:51
Time (end):   Sun, 2018-06-17 01:49:52
Duration: 1.07 seconds
Number of files: 1
Utilization of max. archive size: 0%
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:               49.05 MB             25.05 MB             25.05 MB
All archives:               49.05 MB             25.05 MB             25.05 MB

                       Unique chunks         Total chunks
Chunk index:                      20                   20
------------------------------------------------------------------------------

# 24M /volumes/bkp/borg

# Immediately after the above command completes, I re-issue the same command:

[user@ideapad fstbkp]$ tar --xattrs -p -c -C /var/lib/pacman/local . | borg create -v -p --stats --stdin-name tar /volumes/bkp/borg::pacman-local-'{now}' -              
Enter passphrase for key /home/user/.config/borg/keys/volumes_bkp_borg.2:
------------------------------------------------------------------------------                                                                                           
Archive name: pacman-local-2018-06-17T01:50:15
Archive fingerprint: 33be76cfad2654d2aabbfd3ce992cf10cf1c74e3d091dbb400a0ce2c73ad96eb
Time (start): Sun, 2018-06-17 01:50:18
Time (end):   Sun, 2018-06-17 01:50:18
Duration: 0.79 seconds
Number of files: 1
Utilization of max. archive size: 0%
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:               49.05 MB             25.05 MB             25.05 MB
All archives:               98.10 MB             50.11 MB             50.11 MB

                       Unique chunks         Total chunks
Chunk index:                      42                   42
------------------------------------------------------------------------------

# 48M /volumes/bkp/borg

Describe the problem you're observing.

I create a tar file from /var/lib/pacman/local and pipe it to borg, which makes the empty borg repo grow from 76K to 24M in size. Immediately after this first command ends, I simply type the up arrow key followed by enter (to repeat the same command). Once it finishes, I check the borg repo size, and it has grown to 48 MB.

I concede that 2 consecutive runs of the tar command above will produce different content, but the difference probably can only be found on the tar header field mtime and nothing else.
So, piping 2 almost identical chunks of data uses 2x the expected size.

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

This can be easily reproduced by just copying and pasting the commands indicated above.

Include any warning/errors/backtraces from the system logs

None

All 15 comments

borg uses a rolling hash to split files into content-defined chunks that average about 2 MB in size

If you tar a lot of small files and some of them change, borg is unaware of the structure of that tar, and deduplication suffers simply because the chunk boundaries do not line up with what could be deduplicated.

Content-aware chunking could help, but isn't implemented as of now.
The same goes for an import command that would read the tarball into an actual archive.
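The granularity effect described in this comment can be illustrated with a toy model. The sketch below is not borg's chunker: it uses fixed-size chunks (via `split`) as a stand-in for content-defined chunks, and the record layout, tags, and sizes are all invented for the demonstration. Because the changed tag has the same length in both streams, offsets do not shift, so fixed-size chunks are an adequate simplification here.

```shell
#!/bin/sh
# Toy model of the granularity mismatch. Fixed-size chunks stand in for borg's
# content-defined chunks; all sizes and names here are made up for illustration.
set -eu
export LC_ALL=C
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Fake "tar stream": 200 records of 256 bytes. Each record is a 16-byte header
# carrying a PID-like tag, followed by 240 bytes of payload that never changes.
make_stream() {  # $1 = 4-digit tag, $2 = output file
    i=0
    while [ "$i" -lt 200 ]; do
        printf 'HDR.%s.%06d\n%0240d' "$1" "$i" "$i"
        i=$((i + 1))
    done > "$2"
}

# Hash every chunk of a file and print the sorted set of chunk hashes.
chunk_hashes() {  # $1 = file, $2 = chunk size in bytes
    dir=$(mktemp -d "$work/chunks.XXXXXX")
    split -a 4 -b "$2" "$1" "$dir/c."
    for f in "$dir"/c.*; do sha256sum < "$f"; done | sort -u
}

# Count how many chunks of stream B are already present in stream A.
shared_chunks() {  # $1 = chunk size in bytes
    chunk_hashes "$work/A" "$1" > "$work/a.sums"
    chunk_hashes "$work/B" "$1" > "$work/b.sums"
    echo $(( $(comm -12 "$work/a.sums" "$work/b.sums" | wc -l) ))
}

make_stream 4505 "$work/A"
make_stream 4431 "$work/B"
echo "coarse chunks (4096 B): $(shared_chunks 4096) shared"
echo "fine chunks   (128 B):  $(shared_chunks 128) shared"
```

In this model every coarse chunk spans several changed headers, so none of them can be reused, while half of the fine chunks contain only unchanged payload and are shared between the two streams.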

But the files included in the first tar ball are exactly the same as the second tar ball.
In principle (bar a hacker attack or neutrinos flying through my SSD), they could only have changed if I had invoked pacman (Archlinux package manager) before creating the second tar ball, which I haven't.

@elifarley borg is unaware of what a tarball is; it treats the whole tarball as one big file, so it cannot match up the individual files inside it

I did a similar test on my system, with 50MB source directory. The first borg backup compressed this to 10.4MB. The second borg backup only used an additional 2.6MB. So it's better than what you found, but not perfect. If I save the output of the tar command to a file, and backup that file via stdin twice, then the second backup uses almost zero space. So the output of the tar command must change each time it is run.

I've just found out that 2 invocations of tar on the same set of files will have a lot of differences when the --xattrs argument is used, as can be seen by diffoscope:

diffoscope /tmp/pacman-local.tar.{A,B} > /tmp/pacman-local.tar.diffoscope

grep -Pc '^[+-]' /tmp/pacman-local.tar.diffoscope
25678

head -n40 /tmp/pacman-local.tar.diffoscope
--- /tmp/pacman-local.tar.A
+++ /tmp/pacman-local.tar.B
│┄ No file format specific differences found inside, yet data differs (POSIX tar archive)
@@ -1,17 +1,17 @@
-00000000: 2e2f 5061 7848 6561 6465 7273 2e34 3530  ./PaxHeaders.450
-00000010: 352f 2e00 0000 0000 0000 0000 0000 0000  5/..............
+00000000: 2e2f 5061 7848 6561 6465 7273 2e34 3433  ./PaxHeaders.443
+00000010: 312f 2e00 0000 0000 0000 0000 0000 0000  1/..............
 00000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00000060: 0000 0000 3030 3030 3634 3400 3030 3030  ....0000644.0000
 00000070: 3030 3000 3030 3030 3030 3000 3030 3030  000.0000000.0000
 00000080: 3030 3030 3133 3200 3133 3331 3134 3630  0000132.13311460
-00000090: 3336 3700 3031 3037 3531 0020 7800 0000  367.010751. x...
+00000090: 3336 3700 3031 3037 3437 0020 7800 0000  367.010747. x...
 000000a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000000b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000000c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000000d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000000e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000000f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00000100: 0075 7374 6172 0030 3000 0000 0000 0000  .ustar.00.......
@@ -90,24 +90,24 @@
 00000590: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000005a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000005b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000005c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000005d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000005e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 000005f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
-00000600: 2e2f 5061 7848 6561 6465 7273 2e34 3530  ./PaxHeaders.450
-00000610: 352f 6135 3264 6563 2d30 2e37 2e34 2d39  5/a52dec-0.7.4-9
+00000600: 2e2f 5061 7848 6561 6465 7273 2e34 3433  ./PaxHeaders.443
+00000610: 312f 6135 3264 6563 2d30 2e37 2e34 2d39  1/a52dec-0.7.4-9
 00000620: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00000630: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00000640: 0000 0000 0000 0000 0000 0000 0000 0000  ................

So I'll have to cope with that (omit the --xattrs when possible) unless I can find a smarter deduplicating backup tool.

I re-tested without the --xattrs argument and the second tarball added close to zero bytes to the borg repo, as expected.
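The changing `PaxHeaders.4505` / `PaxHeaders.4431` names in the diff look like a process ID embedded in each extended header name, which GNU tar of that era appears to do by default in pax-format archives. A possible way to keep extended headers while making the stream repeatable is GNU tar's `--pax-option` with the `exthdr.name` and `delete` keywords. The sketch below is an assumption-based example, not something tested in this thread: it uses `--format=posix` on a throwaway directory instead of `--xattrs` on the real data, and the `stable_tar` helper name is made up.

```shell
#!/bin/sh
# Sketch: produce a pax-format tar stream that is identical across runs.
# Assumes GNU tar (for --sort=name and --pax-option); stable_tar is a
# hypothetical helper name.
set -eu
src=$(mktemp -d)
trap 'rm -rf "$src"' EXIT
printf 'one\n' > "$src/a"
printf 'two\n' > "$src/b"

stable_tar() {  # $1 = directory to archive; writes the archive to stdout
    # exthdr.name without %p keeps the process ID out of the header names;
    # delete=atime/ctime drops timestamps that can change between runs.
    tar --format=posix --sort=name \
        --pax-option='exthdr.name=%d/PaxHeaders/%f,delete=atime,delete=ctime' \
        -p -c -C "$1" .
}

first=$(stable_tar "$src" | sha256sum)
second=$(stable_tar "$src" | sha256sum)
if [ "$first" = "$second" ]; then
    echo "both tar streams are byte-identical"
else
    echo "tar streams differ"
fi
```

If this holds on the real data, adding `--xattrs` to the helper and piping it into `borg create --stdin-name tar ... -` as in the original command should let the second backup deduplicate against the first.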

@elifarley why not just run borg on it directly? Without the tar middleman, borg will capture all the files and deduplicate all the contents

@RonnyPfannschmidt Well, when not using the --xattrs arg, borg is able to take the second tarball with a near-zero increase in repo size, so it works well on that case.

If I let borg archive each file, the repo size ends up being a few MB bigger (2 MB bigger in my test, see my other comment below).
Besides that, I won't be able to use --anchored --exclude-from=my-excluded-files.txt, which are tar arguments to specify which files should be excluded from archiving, unless borg can understand the same syntax as tar does.

There seems to be a better deduplication algorithm. I fed the same 2 tarballs with --xattrs to another tool, bup, and instead of a 100% increase in repo size, it only increased 32%:

bup init
du -hs ~/.bup/
64K

cat /tmp/pacman-local.tar.A | bup split -n pacman-local -vv
bloom: creating from 1 file (5757 objects).
bup: 48000.00kbytes in 2.53 secs = 18972.68 kbytes/sec

du -hs ~/.bup/
25M

cat /tmp/pacman-local.tar.B | bup split -n pacman-local -vv
bloom: adding 1 file (1734 objects).
bup: 48000.00kbytes in 2.24 secs = 21447.95 kbytes/sec

du -hs ~/.bup/
33M

Maybe bup's algorithm could be copied over to borg, as they are both open source. The problem with bup is that it doesn't provide built-in encryption as borg does.

Edit: Just tested with zbackup, which also provides built-in encryption, and the size increase was 56%.

Really you should only feed a tar file to borg as a last resort. Getting borg to scan the files directly will be much more space efficient and also much faster, since it will completely skip unchanged files.

About the deduplication algorithm: see the borg docs, as you can adjust the parameters borg uses. You will be able to achieve better deduplication, but at the cost of more overhead.
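To get a feel for the trade-off mentioned here, the arithmetic below estimates how many chunks the 49.05 MB stream from the report would produce under two `--chunker-params` settings. It assumes the chunker aims for chunks of roughly 2^HASH_MASK_BITS bytes (the third parameter), that `19,23,21,4095` is the default in borg 1.1, and that `10,23,16,4095` is a finer setting; the `expected_chunks` helper is a made-up name and the result is only a rough estimate.

```shell
#!/bin/sh
# Rough estimate only: assumes the chunker targets about 2^HASH_MASK_BITS bytes
# per chunk, so stream_size / 2^bits approximates the number of chunks.
set -eu

expected_chunks() {  # $1 = stream size in bytes, $2 = HASH_MASK_BITS
    echo $(( $1 / (1 << $2) ))
}

stream=49050000   # the 49.05 MB tar stream from the report
echo "default (19,23,21,4095): ~$(expected_chunks "$stream" 21) chunks"
echo "finer   (10,23,16,4095): ~$(expected_chunks "$stream" 16) chunks"
```

The default estimate is in the same range as the 20 chunks shown in the first run's statistics (the minimum chunk size presumably pushes the real average a bit above 2 MiB). The finer setting gives several hundred chunks for the same data, which is where the extra index and RAM overhead comes from.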

@jdchristensen Your statements don't quite match my tests. When letting borg archive the files instead of piping a tarball to it, the repo gets 86% bigger than it would be when using a tarball, and it is ~120% slower on the first run (and about the same speed after that).
In other words, borg has a higher size overhead AND time overhead when compared to the tarball method. But this could be different if we consider a folder with millions of files. I suppose the time difference would be smaller (because borg with no tarball is slower only on the first run) and the space difference would get bigger (looks like borg has a higher overhead per file than tar)

Consider this legend:
tar-xattrs: tarball with extended attributes (--xattrs)
tar-simple: tarball without extended attributes
borg: no tarball, borg does the archiving on its own

To test how much faster borg would archive my directory compared to tar-simple, let's take /usr/include as input (319MB).
I won't use any compression, because compression time isn't what we are measuring here, so the commands look like:

# tar-simple:
# Clear PageCache, dentries and inodes:
sync; echo 3 > /proc/sys/vm/drop_caches; \
tar -p --sort=name -c -C /usr/include . | borg create -v -p --stats -C none --stdin-name tar /volumes/bkp/tar-simple::test-'{now}' -

# borg:
sync; echo 3 > /proc/sys/vm/drop_caches; \
borg create -v -p --stats -C none /volumes/bkp/borg::test-'{now}' /usr/include

The results show borg is 123% slower than tar-simple on the first run (when the repo is empty), but then it's 1% faster on the second run (when all files are already present on the repo):

tar-simple: 16.15 s total
Run 1: 10.1 seconds
Run 2: 6.05 seconds

borg: 28.52 s total
Run 1: 22.52 seconds
Run 2: 6.00 seconds

Below we can see the size-related results. For this case, I'm not measuring how long each command takes to finish, so there's no need to clear caches, and in fact we can use a previously-generated tarball, as we're not measuring speed now.
Also, as I'm now interested in size and compressibility differences, it makes sense to enable compression.

----------|---|---------|
   mode   |run|repo size|
----------|---|---------|
tar-xattrs| 1 |     24M | (/var/lib/pacman/local)
tar-xattrs| 2 |     48M | (/var/lib/pacman/local)
----------|---|---------|
tar-simple| 1 |     22M | (/var/lib/pacman/local)
tar-simple| 2 |     22M | (/var/lib/pacman/local)
----------|---|---------|
tar-simple| 1 |     32M | (/usr/include)
tar-simple| 2 |     32M | (/usr/include)
----------|---|---------|
borg      | 1 |     23M | (/var/lib/pacman/local)
borg      | 2 |     24M | (/var/lib/pacman/local)
----------|---|---------|
borg      | 1 |     60M | (/usr/include)
borg      | 2 |     60M | (/usr/include)
----------|---|---------|

The command lines were like:

# tar-simple 1:
cat /tmp/pacman-local.tar.A | borg create -v -p --stats -C auto,zstd,22 --stdin-name tar /volumes/bkp/borg::pacman-local-'{now}' -

# tar-simple 2:
cat /tmp/pacman-local.tar.B | borg create -v -p --stats -C auto,zstd,22 --stdin-name tar /volumes/bkp/borg::pacman-local-'{now}' -

# borg 1:
borg create -v -p --stats -C auto,zstd,22 /volumes/bkp/borg::pacman-local-'{now}' /var/lib/pacman/local

# borg 2:
borg create -v -p --stats -C auto,zstd,22 /volumes/bkp/borg::pacman-local-'{now}' /var/lib/pacman/local

Using borg without tar will be faster when the files aren't in the filesystem cache, which will be the typical situation. Try dropping the caches before every test to get realistic numbers. But with a source directory of only 50MB, the difference won't be large.

Using borg without tar will have smaller backups in the typical case when there are some changes to some files. The tar file approach does achieve slightly better compression for the first backup, which is expected, but will likely be worse in the long run. Also, if you use borg to store xattrs directly, rather than with tar, you'll probably see much bigger space savings.

By the way, your command lines above are confusing. In some of them, you use no compression. In others, you use a pregenerated tar file, so the time taken by tar isn't taken into account.

Thanks for pointing that out. I've edited my comment to make things clearer, and I have also tested with a bigger folder (/usr/include - 319M)

As @jdchristensen already pointed out, the effect originally observed (and also that bup did better dedup) is due to a granularity mismatch.

Borg has relatively coarse grained default chunker granularity (~2MB) and your tar stream had fine grained changes, spoiling the deduplication.

Using --chunker-params you could get finer grained chunker granularity, but the reason this is not the default is that it also means much higher resource usage (e.g. RAM needed).

So, my advice would also be to not use tar, use borg's include/exclude patterns and to not change the chunker params.
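As a rough illustration of this advice, borg's own `--exclude-from` option can take the place of tar's exclude list. The sketch below rests on assumptions: the two pattern lines are hypothetical, and borg's pattern syntax (fnmatch-style by default for excludes, with prefixes such as `sh:` and `re:` for other styles) is not a drop-in match for tar's `--anchored` semantics, so an existing tar exclude file would likely need translating. `borg help patterns` describes the details.

```shell
#!/bin/sh
# Sketch: let borg read the files directly and apply an exclude list itself.
# The patterns below are hypothetical examples, not taken from the thread.
set -eu
excludes=$(mktemp)
trap 'rm -f "$excludes"' EXIT

cat > "$excludes" <<'EOF'
# One pattern per line; lines starting with '#' are ignored.
*.tmp
sh:/var/lib/pacman/local/*/mtree
EOF

# Only attempt the backup when borg and the repository are actually present.
if command -v borg >/dev/null 2>&1 && [ -d /volumes/bkp/borg ]; then
    borg create -v --stats --exclude-from "$excludes" \
        /volumes/bkp/borg::pacman-local-'{now}' /var/lib/pacman/local
fi
```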

About the observed tar stream differences you maybe want to ask the tar developers about whether this can be improved.

I am closing this, deduplication IS working on stdin streams.
