Borg: Borg repositories constantly getting corrupted on server grade hardware

Created on 7 Nov 2019 · 70Comments · Source: borgbackup/borg

Have you checked borgbackup docs, FAQ, and open Github issues?

Yes.

Is this a BUG / ISSUE report or a QUESTION?

Issue.

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

Currently 1.1.10 on client and server, using the official binary, but the issue persists all the way from 1.1.4. Always used official binaries from this page.

Operating system (distribution) and version.

The issues persists from Debian 8 (Jessie) to Debian 10 (Buster), on client and/or server.

Hardware / network configuration, and filesystems used.

Various Dell PowerEdges, with ECC RAM, disks in RAID-10 (mdadm), gigabit network, ext4 filesystems.

How much data is handled by borg?

We backup ~1TB with Borg on total, but using several smaller repositories by VM/data type.

Full borg commandline that lead to the problem (leave away excludes and passwords)

borg create -vsx --files-cache=mtime,size --show-rc --one-file-system --list --filter AME -C zstd [email protected]:postgresql::{now} .

Describe the problem you're observing.

We use Borg for backups for almost two years. Most data are backup-ed from an LVM snapshot. Most are backup-ed every hour, but some data (ex: databases) are backed up every 10 minutes. Some VMs (qcow2 disks) are backup-ed up daily (they are shutdown off-hours for the backup).

With this setup, since we started using Borg, we usually experience 1 or 2 problems per month with Borg repository corruption. In every case, we could fix the problem with borg check --repair.

Every backup is done using SSH.

We had this issue backing-up from VMs and physical servers.

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

We could never reproduce the issue.

We checked every disk from every server with a complete S.M.A.R.T. test, with no error detected.

We checked RAM with memtest86 for hours and nothing were found.

We checked that every backup BEFORE the backup job that broke finished successfully.

Destination server never got short on RAM or disk space.

Include any warning/errors/backtraces from the system logs

 + borg create -vsx --files-cache=mtime,size --show-rc --one-file-system --list --filter AME -C zstd [email protected]:postgresql::{now} .
Creating archive at "[email protected]:postgresql::{now}"
M 9.6/main/base/141585/171760
M 9.6/main/base/16507/156896
M 9.6/main/base/16507/1259
M 9.6/main/base/16507/2619
M 9.6/main/base/16507/16551
M 9.6/main/base/16507/16577
M 9.6/main/base/16507/16582
M 9.6/main/base/16507/16606
M 9.6/main/base/16507/16731
M 9.6/main/base/16507/16606_fsm
M 9.6/main/base/16507/17268
M 9.6/main/base/16507/17274
M 9.6/main/base/16507/17326
M 9.6/main/base/16507/17327
M 9.6/main/base/16507/16606_vm
M 9.6/main/pg_clog/0003
A 9.6/main/pg_xlog/000000010000001200000078
M 9.6/main/global/pg_control
Traceback (most recent call last):

  File "borg/remote.py", line 247, in serve

  File "borg/repository.py", line 442, in commit

  File "borg/repository.py", line 789, in compact_segments

AssertionError: Corrupted segment reference count - corrupted index or hints

Borg server: Platform: Linux vm-borg-1 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1+deb9u3 (2019-06-16) x86_64
Borg server: Linux: debian 9.9
Borg server: Borg: 1.1.10  Python: CPython 3.5.7 msgpack: 0.5.6
Borg server: PID: 15222  CWD: /srv/borg/root@vm-postgres-1
Borg server: sys.argv: ['borg', 'serve', '--restrict-to-path', '/srv/borg/root@vm-postgres-1']
Borg server: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --info'
Platform: Linux vm-postgres-1 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1+deb9u3 (2019-06-16) x86_64
Linux: debian 9.9
Borg: 1.1.10  Python: CPython 3.5.7 msgpack: 0.5.6
PID: 3953  CWD: /mnt
sys.argv: ['borg', 'create', '-vsx', '--files-cache=mtime,size', '--show-rc', '--one-file-system', '--list', '--filter', 'AME', '-C', 'zstd', '[email protected]:postgresql::{now}', '.']
SSH_ORIGINAL_COMMAND: None

terminating with error status, rc 2

bug corruption

Source

bd830dbd7432

Most helpful comment

Just a quick feedback after 2 months using Borg 1.1.11 -- we didn't experience any corruption issues. Backups, checks and restores are working smoothly. Thanks again, @ThomasWaldmann!

bd830dbd7432 on 6 May 2020

👍2

All 70 comments

You said you could fix the issue by running borg check --repair - did it output any additional information / warnings / error messages (when using -v)?

ThomasWaldmann on 7 Nov 2019

Some other stuff I noticed (not necessarily related to the issue, but just wondering):

-x and --one-file-system is the same, you only need it once.
why do you use --files-cache=mtime,size? if it is ext4, the default should work ok.

ThomasWaldmann on 7 Nov 2019

Greetings!

Some other stuff I noticed (not necessarily related to the issue, but just wondering):
* `-x` and `--one-file-system` is the same, you only need it once.

This is redundant, you're right.

* why do you use `--files-cache=mtime,size`? if it is ext4, the default should work ok.

Because when using LVM snapshots, files seems (or seemed) to be re-scanned by Borg, and files not changed were printed by --list --filter AME. This happened on ext4 and XFS on the past, and now we standardize on using --files-cache=mtime,size.

bd830dbd7432 on 7 Nov 2019

You said you could fix the issue by running borg check --repair - did it output any additional information / warnings / error messages (when using -v)?

On 2019-02-06 we had a corruption on borg create:

M 9.6/main/base/16507/17338
M 9.6/main/pg_clog/0001
A 9.6/main/pg_xlog/000000010000000800000034
M 9.6/main/global/pg_control
M 9.6/main/pg_subtrans/0018
Traceback (most recent call last):

  File "borg/remote.py", line 248, in serve

  File "borg/repository.py", line 430, in commit

  File "borg/repository.py", line 777, in compact_segments

AssertionError: Corrupted segment reference count - corrupted index or hints

Borg server: Platform: Linux vm-borg-1 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64
Borg server: Linux: debian 9.7
Borg server: Borg: 1.1.8  Python: CPython 3.5.6
Borg server: PID: 23309  CWD: /srv/borg/root@vm-postgres-1
Borg server: sys.argv: ['borg', 'serve', '--restrict-to-path', '/srv/borg/root@vm-postgres-1']
Borg server: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --info'
Platform: Linux vm-postgres-1 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64
Linux: debian 9.7
Borg: 1.1.8  Python: CPython 3.5.6
PID: 25777  CWD: /mnt
sys.argv: ['borg', 'create', '-vsx', '--files-cache=mtime,size', '--show-rc', '--one-file-system', '--list', '--filter', 'AME', '-C', 'zstd', '[email protected]:postgresql::{now}', '.']
SSH_ORIGINAL_COMMAND: None

terminating with error status, rc 2

Result from borg check -vp:

Starting repository check
Starting repository index check                                                                                      
Index mismatch for key b'+\xd7dq\x8c\xf2\xf59)\xdc\xc1\xca\x98\x05\xb9\x08\x9f#5Ji$\xa3\x9c\xc0]\x81\xacK\x89\x10\xd0'. (45623, 21329774) != (-1, -1)
Completed repository check, errors found.

Result from borg check -vp --repair:

'check --repair' is an experimental feature that might result in data loss.
Type 'YES' if you understand this and want to continue: YES
Starting repository check
Starting repository index check                                                                                      
Completed repository check, no problems found.                                                                        
Starting archive consistency check...
Analyzing archive 2019-01-06T23:57:03 (1/227)
Analyzing archive 2019-01-13T23:57:04 (2/227)
(.....) <- Only repeating lines with archive names
Analyzing archive 2019-01-20T23:57:04 (3/227)
Analyzing archive 2019-01-27T23:57:04 (4/227)
Analyzing archive 2019-02-06T13:37:05 (226/227)
Analyzing archive 2019-02-06T13:47:05 (227/227)
Writing Manifest.
Committing repo (may take a while, due to compact_segments)...
Finished committing repo.                                                                                            
Archive consistency check complete, no problems found.

After repairing, we ran borg check -vp again, but it didn't show any error this time (don't have the output saved).

bd830dbd7432 on 7 Nov 2019

Index mismatch for key b'...'. (45623, 21329774) != (-1, -1)

(45623, 21329774) / left is (segment, offset) of rebuilt index (in memory).

(-1, -1) / right is (segment, offset) of on-disk index - (-1, -1) means no entry for that key.

so the committed repo contents had a chunk with that key (id), but it was not in the on-disk index, which is strange.

note: with --repair, these warnings are not emitted, but the in-memory index is written to disk.

ThomasWaldmann on 7 Nov 2019

When looking at the 1.1.8 code, this message is only emitted if the on-disk index size and the in-memory rebuilt index size is the same.

So, if one key is not present on disk, that means that there is another key present on disk. The current code just does not find out which key that is.

I'll change that code to display both in-memory and on-disk key, so that e.g. a corruption of a few bits can be seen, see #4831.

ThomasWaldmann on 7 Nov 2019

You said you did a complete SMART test, does that mean a SMART long test (smartctl -t long)?

ThomasWaldmann on 7 Nov 2019

You said you did a complete SMART test, does that mean a SMART long test (smartctl -t long)?

Yes, using smartctl -t long.

In every disk:

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     14507         -

bd830dbd7432 on 8 Nov 2019

@bd830dbd7432 I once had the embarrassing experience of an usb hub silently corrupting the data on their way from the harddisk to the PC's memory.

borg noticed it due to its 256bit MACs, but there was no indication at all in the kernel log. Disks and memory were also ok, like in your case.

It was quite rare corruption, maybe 1bit per 100GB..1TB. I found the culprit by removing the USB hub, then the problem went away. Inserted it again, problem reappeared. Did this multiple times and it always behaved like that.

You don't have an USB hub in your data path I guess, but it could be also something else.

Another place of failure could be also multi-bit RAM errors (too much for ECC to cope with) or some bit flip that is outside an ECC protected path.

Even CRCs or ECCs used by HW are sometimes too weak to detect corruption. Or some stuff is not protected by such at all.

ThomasWaldmann on 8 Nov 2019

Hello Thomas,

The disks are attached via SATA ports on the server.

The strange thing is that we never noticed any corruption on other backups, in this same server, from different methods (eg: tar). This server aggregates backups for other servers, in different VMs.

For example, we have SQL Server backups being written to this server (inside another VM), with ~100GB, and they are restored nightly on another server. We have other backups (via tar) being written (and already encrypted on source). Daily, we have jobs that copy those backups to other servers, and they are being decrypted correctly, without a sign of corruption.

I don't have much knowledge in Borg internals, but it seem to me that only indexes are being corrupted, aren't they? Data chunks do not seem to be lost, and after borg check --repair, backups are validated by borg check and can be restored. Given that fact that data is much bigger than indexes, what odds are that only indexes are getting corrupted? But I may be talking nonsense...

bd830dbd7432 on 8 Nov 2019

The difference is that content data is only going once over the cable/interface.

Indexes are loaded, modified, stored. And if you do backups very frequently (like every 10mins), indexes get touched often.

But that reminds me: we have cryptographic strength checksums over index data since a while, so on-disk / on-sata-cable corruption should be noticed. It does not help against in-memory corruption that happens before computing the checksum and storing the index though.

ThomasWaldmann on 8 Nov 2019

Could you locate the root cause of the problem?

ThomasWaldmann on 20 Nov 2019

Hello Thomas, thanks for your attention.

I ran memtester inside the VM that Borg stores the repositories, and on the physical server. No problem was detected.

Until now, I couldn't find any hardware issue, and there were no freezing/corruption problem on other VMs running on the same physical server, besides Borg repositories.

A complete Dell hardware diagnostic didn't find any problem either.

I'm just waiting for another corruption to happen, so I can run memtester inside the Borg VM, trying to pinpoint a bad RAM segment mapped to the VM.

bd830dbd7432 on 20 Nov 2019

Above you said you ran memtest86 - I usually use memtest86+, did you use the same?
It requires to be booted, so the server would be offine during the test.

ThomasWaldmann on 21 Nov 2019

Yeah, that was Memtest86+ 5.01. Server was offline through the night for the test.

bd830dbd7432 on 22 Nov 2019

🎉1

so the committed repo contents had a chunk with that key (id), but it was not in the on-disk index, which is strange.

Just a wild guess...

Our VMs and servers have a maintenance window of one hour at night, where patches are applied and they are rebooted if needed (not everything on the same day). This includes servers which Borg make backups and servers which Borg stores the backups.

On 2019-10-21, I changed all Borg backups and repositories' check jobs, so they do not run inside this maintenance window, avoiding a reboot while Borg is busy.

Usually, Borg --check is run everyday.

At 2019-11-04, two different Borg repositories were corrupted, but this problem didn't happen again until today. Output for one of them:

 + borg check -v --last 1 [email protected]:vms
Remote: Starting repository check
Remote: Starting repository index check
Remote: Index mismatch for key b'b\x9a\xb3\x9c\xc7\x87\x0c9\x84\x90\x96\xd3p2\x12\xf4\x1e\x1c\xd5S\xf3#7\xc5\xb6\x1c5U(\xc0\x8d\x18'. (893, 25399177) != (-1, -1)
Remote: Completed repository check, errors found.

(I know this may be another problem, sorry for adding on this issue)

Could this sort of corruption happen while Borg is backing up, pruning, and the process is aborted for a reboot? And this problem be detected later, on other backups and pruning jobs?

Whenever we had corruptions, we always check if Borg were killed by some problem (error, reboot, out of memory, etc), but I don't have a clear evidence of this.

bd830dbd7432 on 4 Dec 2019

Theoretically, reboot, kills or connection breakdowns should not affect repo or indexes.
If the repo or cache is not committed, it should just roll back to last valid (committed) state.

ThomasWaldmann on 5 Dec 2019

Had another corruption this night:

+ export PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/bin:/bin
+ export BORG_UNKNOWN_UNENCRYPTED_REPO_ACCESS_IS_OK=yes
+ export BORG_RSH=ssh -o StrictHostKeyChecking=no
+ trap cleanup EXIT
+ borg init -e none [email protected]:clog
A repository already exists at [email protected]:clog.
+ true
+ lvcreate -n clog-snap -L 3g -s vg0/clog
  Using default stripesize 64.00 KiB.
  Logical volume "clog-snap" created.
+ mount -t ext4 -o ro /dev/vg0/clog-snap /mnt
+ cd /mnt
+ borg create -vsx --files-cache=mtime,size --show-rc --one-file-system --exclude lost+found --list --filter AME -C zstd [email protected]:clog::{now} .
Creating archive at "[email protected]:clog::{now}"
M clogd.db
A clogd.db-wal
M clogd.db-shm
Traceback (most recent call last):

  File "borg/remote.py", line 247, in serve

  File "borg/repository.py", line 442, in commit

  File "borg/repository.py", line 789, in compact_segments

AssertionError: Corrupted segment reference count - corrupted index or hints

Borg server: Platform: Linux vm-borg-1 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u1 (2019-09-20) x86_64
Borg server: Linux: debian 10.1
Borg server: Borg: 1.1.10  Python: CPython 3.5.7 msgpack: 0.5.6
Borg server: PID: 29775  CWD: /srv/borg/root@vm-clog-1
Borg server: sys.argv: ['borg', 'serve', '--restrict-to-path', '/srv/borg/root@vm-clog-1']
Borg server: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --info'
Platform: Linux vm-clog-1 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64
Linux: debian 9.11
Borg: 1.1.10  Python: CPython 3.5.7 msgpack: 0.5.6
PID: 5503  CWD: /mnt
sys.argv: ['borg', 'create', '-vsx', '--files-cache=mtime,size', '--show-rc', '--one-file-system', '--exclude', 'lost+found', '--list', '--filter', 'AME', '-C', 'zstd', '[email protected]:clog::{now}', '.']
SSH_ORIGINAL_COMMAND: None

terminating with error status, rc 2
+ cleanup
+ ORIG=2
+ cd /
+ umount /mnt
+ lvremove -f vg0/clog-snap
  Logical volume "clog-snap" successfully removed
+ exit 2

I made a copy of the repository before repairing.

Running borg check on this copy:

$ borg check -vp borg-clog/
Starting repository check
Starting repository index check                                                                                       
Index mismatch for key b'>\xa2\xb0\xdb\x95\xb1O\x100~)Sdsr\x17\xd2,\xb3\xc8{\xef\xc5\x9b\xd7$\xdbz\x0c\xceS\x85'. (21922, 66184646) != (-1, -1)
Completed repository check, errors found.

bd830dbd7432 on 6 Dec 2019

As the above is still the output of 1.1.10 without #4831, it doesn't give us more information.

ThomasWaldmann on 6 Dec 2019

c3430191e778c85 [xxxxxxx@xxx borg (1.1-maint)]$ /usr/local/bin/borg -V

borg 1.1.10

c3430191e778c85 [xxxxxxx@xxx borg (1.1-maint)]$ /usr/local/bin/borg check /srv/tmp/borg-clog/

Index mismatch for key b'>\xa2\xb0\xdb\x95\xb1O\x100~)Sdsr\x17\xd2,\xb3\xc8{\xef\xc5\x9b\xd7$\xdbz\x0c\xceS\x85'. (21922, 66184646) != (-1, -1)
Completed repository check, errors found.

c3430191e778c85 [xxxxxxx@xxx borg (1.1-maint)]$ /home/xxxxxxx/.local/share/virtualenvs/c3430191e778c85/bin/borg -V

borg 1.1.11.dev95+g21bbac79

c3430191e778c85 [xxxxxxx@xxx borg (1.1-maint)]$ /home/xxxxxxx/.local/share/virtualenvs/c3430191e778c85/bin/borg check /srv/tmp/borg-clog/

ID: 3ea2b0db95b14f10307e295364737217d22cb3c87befc59bd724db7a0cce5385 rebuilt index: (21922, 66184646) committed index: <not found>

bd830dbd7432 on 6 Dec 2019

Was that the complete output of 1.1.11.dev95?

ThomasWaldmann on 7 Dec 2019

Yes. Tested again:

65de1674b53dc75 [xxxxxxxx@xxxx borg (1.1-maint)]$ $HOME/.local/share/virtualenvs/65de1674b53dc75/bin/borg check /srv/tmp/borg-clog/
ID: 3ea2b0db95b14f10307e295364737217d22cb3c87befc59bd724db7a0cce5385 rebuilt index: (21922, 66184646) committed index: <not found>

With check --v:

65de1674b53dc75 [xxxxxxx@xxxxx borg (1.1-maint)]$ $HOME/.local/share/virtualenvs/65de1674b53dc75/bin/borg check -v /srv/tmp/borg-clog/
Starting repository check
Starting repository index check
Index object count match.
ID: 3ea2b0db95b14f10307e295364737217d22cb3c87befc59bd724db7a0cce5385 rebuilt index: (21922, 66184646) committed index: <not found>     
Completed repository check, no problems found.
Starting archive consistency check...
Analyzing archive 2019-11-03T22:12:04 (1/38)
Analyzing archive 2019-11-10T22:12:04 (2/38)
Analyzing archive 2019-11-17T22:12:04 (3/38)
Analyzing archive 2019-11-24T22:12:04 (4/38)
Analyzing archive 2019-11-28T22:12:05 (5/38)
Analyzing archive 2019-11-29T22:12:05 (6/38)
Analyzing archive 2019-11-30T22:12:04 (7/38)
Analyzing archive 2019-12-01T22:12:04 (8/38)
Analyzing archive 2019-12-02T22:12:04 (9/38)
Analyzing archive 2019-12-03T22:12:05 (10/38)
Analyzing archive 2019-12-04T22:12:05 (11/38)
Analyzing archive 2019-12-05T03:12:05 (12/38)
Analyzing archive 2019-12-05T04:12:05 (13/38)
Analyzing archive 2019-12-05T05:12:05 (14/38)
Analyzing archive 2019-12-05T06:12:05 (15/38)
Analyzing archive 2019-12-05T08:12:05 (16/38)
Analyzing archive 2019-12-05T09:12:05 (17/38)
Analyzing archive 2019-12-05T10:12:05 (18/38)
Analyzing archive 2019-12-05T11:12:05 (19/38)
Analyzing archive 2019-12-05T12:12:04 (20/38)
Analyzing archive 2019-12-05T13:12:05 (21/38)
Analyzing archive 2019-12-05T14:12:04 (22/38)
Analyzing archive 2019-12-05T15:12:05 (23/38)
Analyzing archive 2019-12-05T16:12:05 (24/38)
Analyzing archive 2019-12-05T17:12:04 (25/38)
Analyzing archive 2019-12-05T18:12:05 (26/38)
Analyzing archive 2019-12-05T19:12:05 (27/38)
Analyzing archive 2019-12-05T20:12:04 (28/38)
Analyzing archive 2019-12-05T21:12:04 (29/38)
Analyzing archive 2019-12-05T22:12:05 (30/38)
Analyzing archive 2019-12-06T01:12:05 (31/38)
Analyzing archive 2019-12-06T02:12:08 (32/38)
Analyzing archive 2019-12-06T03:12:05 (33/38)
Analyzing archive 2019-12-06T04:12:10 (34/38)
Analyzing archive 2019-12-06T05:12:05 (35/38)
Analyzing archive 2019-12-06T06:12:04 (36/38)
Analyzing archive 2019-12-06T08:12:05 (37/38)
Analyzing archive 2019-12-06T09:12:08 (38/38)
Archive consistency check complete, no problems found.

With check --v --debug (if it makes sense):

65de1674b53dc75 [xxxxxxx@xxx borg (1.1-maint)]$ $HOME/.local/share/virtualenvs/65de1674b53dc75/bin/borg check -v --debug /srv/tmp/borg-clog/
using builtin fallback logging configuration
35 self tests completed in 0.05 seconds
Starting repository check
Verified integrity of /srv/tmp/borg-clog/index.21927
Read committed index of transaction 21927
Segment transaction is    21927
Determined transaction is 21927
Found 8950 segments
Starting repository index check
Index object count match.
ID: 3ea2b0db95b14f10307e295364737217d22cb3c87befc59bd724db7a0cce5385 rebuilt index: (21922, 66184646) committed index: <not found>     
Completed repository check, no problems found.
Starting archive consistency check...
Verified integrity of /srv/tmp/borg-clog/index.21927
TAM-verified manifest
Analyzing archive 2019-11-03T22:12:04 (1/38)
Analyzing archive 2019-11-10T22:12:04 (2/38)
Analyzing archive 2019-11-17T22:12:04 (3/38)
Analyzing archive 2019-11-24T22:12:04 (4/38)
Analyzing archive 2019-11-28T22:12:05 (5/38)
Analyzing archive 2019-11-29T22:12:05 (6/38)
Analyzing archive 2019-11-30T22:12:04 (7/38)
Analyzing archive 2019-12-01T22:12:04 (8/38)
Analyzing archive 2019-12-02T22:12:04 (9/38)
Analyzing archive 2019-12-03T22:12:05 (10/38)
Analyzing archive 2019-12-04T22:12:05 (11/38)
Analyzing archive 2019-12-05T03:12:05 (12/38)
Analyzing archive 2019-12-05T04:12:05 (13/38)
Analyzing archive 2019-12-05T05:12:05 (14/38)
Analyzing archive 2019-12-05T06:12:05 (15/38)
Analyzing archive 2019-12-05T08:12:05 (16/38)
Analyzing archive 2019-12-05T09:12:05 (17/38)
Analyzing archive 2019-12-05T10:12:05 (18/38)
Analyzing archive 2019-12-05T11:12:05 (19/38)
Analyzing archive 2019-12-05T12:12:04 (20/38)
Analyzing archive 2019-12-05T13:12:05 (21/38)
Analyzing archive 2019-12-05T14:12:04 (22/38)
Analyzing archive 2019-12-05T15:12:05 (23/38)
Analyzing archive 2019-12-05T16:12:05 (24/38)
Analyzing archive 2019-12-05T17:12:04 (25/38)
Analyzing archive 2019-12-05T18:12:05 (26/38)
Analyzing archive 2019-12-05T19:12:05 (27/38)
Analyzing archive 2019-12-05T20:12:04 (28/38)
Analyzing archive 2019-12-05T21:12:04 (29/38)
Analyzing archive 2019-12-05T22:12:05 (30/38)
Analyzing archive 2019-12-06T01:12:05 (31/38)
Analyzing archive 2019-12-06T02:12:08 (32/38)
Analyzing archive 2019-12-06T03:12:05 (33/38)
Analyzing archive 2019-12-06T04:12:10 (34/38)
Analyzing archive 2019-12-06T05:12:05 (35/38)
Analyzing archive 2019-12-06T06:12:04 (36/38)
Analyzing archive 2019-12-06T08:12:05 (37/38)
Analyzing archive 2019-12-06T09:12:08 (38/38)
Archive consistency check complete, no problems found.

bd830dbd7432 on 7 Dec 2019

Hmm, looks like your repo has 1 item more than your index.

My suspicion was that 1 index key is not in the repo and 1 repo key is not in the index (that could happen by corruption of key bits). But that seems to not being the case.

ThomasWaldmann on 7 Dec 2019

@bd830dbd7432 - is borg server running inside a VM or on bare metal? Can you describe your stack and backup process a bit more precisely (from the physical disk(s) to the server, and clients; including any volume managers, filesystem(s), mount options, etc..)? Nothing in your original post indicates what steps did you take to set up your back-end so it at least detects and reports ongoing or spreading corruption before corrupted data can be seen by consumers (borg).

borkd on 9 Dec 2019

Hi, @borkd.

Clients

Everything running on Dell PowerEdges with ECC RAM, redundant PSU units.
Bare metal or VMs. Most are VMs, but some virtualization hosts use Borg to backup entire VMs.
Operating System: Debian 9 or Debian 10.
Storage: SAS or SATA disks, enterprise-grade. RAID-1 or RAID-10 using software RAID via mdadm.
Partitioning: bare metal uses simple partitions, and VMs uses LVM.
Filesystem: ext4, with default options.

Borg "server"

Host running on Dell PowerEdge with ECC RAM.
VM using KVM, managed by libvirt.
Operating System: Debian 10, but corruption was happening on Debian 9 too.
Storage: virtual disk using qcow2 format. with VirtIO SCSI interface, cache mode writeback, with discard unmap enabled (VM runs periodic fstrim to discard unused blocks).
Inside the VM, using LVM for volume management.
Borg volume using ext4, using default mount options.

Backup method

For virtualization hosts

Some VMs are backed-up nightly via scripts:

Stop VM.
Backup qcow2 image via Borg.
Run borg check and borg prune after every backup.
Start VM.

For VMs

Some VMs are backed-up every 10 minutes, others every hour:

Do a LVM snapshot.
Mount snapshot.
Backup data using Borg from snapshot.
Umount snapshot.
Delete snapshot.
Once per day, run borg check and borg prune.

Observations

We have observed (and registered) corruptions since 01/2018.
Borg is used since 12/2016.
We always used the same physical server to store backups, but we replaced the disks recently (last year, but after corruptions started).
This server stores other backups, like simple encrypted tarballs and SQL Server backups. Those are checked daily (they are decrypted and restored on other servers) and we never found problems.
This server does not have a history of crashing, freezing, strange error messages from kernel, etc, and its VMs never had strange issues.

Let me know if you need more information. Thanks!

bd830dbd7432 on 9 Dec 2019

"We have observed (and registered) corruptions since 01/2018."

Do you have logs/docs whether you changed the borg version in between it working correctly and not working correctly any more? If so, what version was it before/after?

ThomasWaldmann on 9 Dec 2019

We install/update Borg via Ansible, so we have the history on Git. Approximated dates (we roll updates spreading some days):

2019-05-23 - Borg 1.1.9 -> Borg 1.1.10
2019-03-09 - Borg 1.1.8 -> Borg 1.1.9
2019-01-29 - Borg 1.1.7 -> Borg 1.1.8
2018-08-28 - Borg 1.1.6 -> Borg 1.1.7
2018-06-18 - Borg 1.1.5 -> Borg 1.1.6
2018-04-24 - Borg 1.1.4 -> Borg 1.1.5
2018-01-02 - Borg 1.1.3 -> Borg 1.1.4 <-- first logged corruption after this
2017-12-18 - Borg 1.0.11 -> Borg 1.1.3
2017-09-27 - Borg 1.0.9 -> Borg 1.0.11
2017-01-23 - Borg 1.0.8 -> Borg 1.0.9
2016-11-29 - Started using Borg 1.0.8

bd830dbd7432 on 9 Dec 2019

It could be that one of the many possible components in the path is flaky or quietly failing. Mix of qcow2, writeback cache and ext4 on mdadm mirror while fast, does not offer much in terms of ensuring data integrity once luck runs out.

If you have a chance make the machine running borg server a standby redirecting backups to a fresh, clean repo hosted elsewhere (different power circuit if you can) and check if the errors follow. If you can, do setup the new repo on ZFS, and use writethrough or direct access.

borkd on 9 Dec 2019

Are we sure that the corruption began between 1.1.3 and 1.1.4 and not between 1.0.11 and 1.1.3?

ThomasWaldmann on 9 Dec 2019

This is the first registered corruption that we got. Borg version 1.1.4. I don't remember why Borg downloaded the chunks cache, maybe because before running check --repair (which solved the problem), I deleted ~/.cache/borg on the client.

+ nice ionice -c3 borg create -vsx --files-cache=mtime,size --show-rc --one-file-system --exclude lost+found --list --filter AME -C lzma [email protected]:mysql::1516960621 .
Synchronizing chunks cache...
Archives: 37, w/ cached Idx: 36, w/ outdated Idx: 0, w/o cached Idx: 1.
Reading cached archive chunk index for 1516413422 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516859822 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516845422 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516935422 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516877822 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516910222 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516881421 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1515981421 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516953421 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516586222 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516939021 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516946222 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516924622 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516931822 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516906621 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516499822 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516942622 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516917422 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516327022 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516892222 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516885022 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516672622 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516870622 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516867021 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516903022 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516928221 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516949822 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516759022 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1515376622 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516921022 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516874221 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516899421 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516913821 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516888622 ...
Merging into master chunks index ...
Fetching and building archive index for 1516957021 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516895821 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516863422 ...
Merging into master chunks index ...
Done.
A ibdata1
A ib_logfile0
A xxxxxxx_assembly2/xxxxxxxxx.ibd
A xxxxxxx_assembly2/xxxxxxxxxSummaries.ibd
A xxxxxxx_assembly2/config.ibd
A xxxxxxx_assembly/xxxxxxxxxSummaries.ibd
A xxxxxxx_assembly/config.ibd
A xxxxxxx_assembly/xxxxxxx.ibd
A xxxxxxx_assembly/xxxxxxxxx.ibd
A mysql/innodb_table_stats.ibd
A mysql/innodb_index_stats.ibd
Traceback (most recent call last):

  File "borg/remote.py", line 248, in serve

  File "borg/repository.py", line 423, in commit

  File "borg/repository.py", line 770, in compact_segments

AssertionError: Corrupted segment reference count - corrupted index or hints

Borg server: Platform: Linux haskell 3.16.0-5-amd64 #1 SMP Debian 3.16.51-3+deb8u1 (2018-01-08) x86_64
Borg server: Linux: debian 8.10
Borg server: Borg: 1.1.4  Python: CPython 3.5.4
Borg server: PID: 14173  CWD: /srv/borg/root@vm-mysql-1
Borg server: sys.argv: ['borg', 'serve', '--restrict-to-path', '/srv/borg/root@vm-mysql-1']
Borg server: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --info'
Platform: Linux vm-mysql-1 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04) x86_64
Linux: debian 9.3
Borg: 1.1.4  Python: CPython 3.5.4
PID: 21835  CWD: /mnt
sys.argv: ['borg', 'create', '-vsx', '--files-cache=mtime,size', '--show-rc', '--one-file-system', '--exclude', 'lost+found', '--list', '--filter', 'AME', '-C', 'lzma', '[email protected]:mysql::1516960621', '.']
SSH_ORIGINAL_COMMAND: None

terminating with error status, rc 2

bd830dbd7432 on 9 Dec 2019

https://github.com/borgbackup/borg/blob/master/docs/changes.rst#pre-114-potential-data-corruption-issue

Just to get that off the table (if it does not apply): did you have that problem while running 1.1.3?

ThomasWaldmann on 9 Dec 2019

I also had a look at the changelog for 1.1.4, but nothing there that seems to have to do with your problem:

https://github.com/borgbackup/borg/blob/master/docs/changes.rst#version-114-2017-12-31

ThomasWaldmann on 9 Dec 2019

https://github.com/borgbackup/borg/blob/master/docs/changes.rst#pre-114-potential-data-corruption-issue

Just to get that off the table (if it does not apply): did you have that problem while running 1.1.3?

I have no record of that on our issue tracking system.

However, we only used Borg 1.1.3 for a few days:

2018-01-02 - Borg 1.1.3 -> Borg 1.1.4
2017-12-18 - Borg 1.0.11 -> Borg 1.1.3

bd830dbd7432 on 10 Dec 2019

Another RAM test inside Borg server VM. The VM has 2GB RAM, and for the test, it was allocated 1.6GB. It ran for 2 hours without detecting problems:

 root@vm-borg-1:~# free
              total        used        free      shared  buff/cache   available
Mem:        2042904       92804     1829812        9200      120288     1809976
Swap:       2094076        9692     2084384
root@vm-borg-1:~# timeout 2h memtester 1600m
memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 1600MB (1677721600 bytes)
got  1600MB (1677721600 bytes), trying mlock ...locked.
Loop 1:
  Stuck Address       : ok        
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok        
  Block Sequential    : ok        
  Checkerboard        : ok        
  Bit Spread          : ok        
  Bit Flip            : ok        
  Walking Ones        : ok        
  Walking Zeroes      : ok        
  8-bit Writes        : ok
  16-bit Writes       : ok

Loop 2:
  Stuck Address       : ok        
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok        
  Block Sequential    : ok        
  Checkerboard        : ok        
  Bit Spread          : ok        
  Bit Flip            : ok        
  Walking Ones        : ok        
  Walking Zeroes      : ok        
  8-bit Writes        : ok
  16-bit Writes       : ok

(...)

Loop 17:
  Stuck Address       : ok        
  Random Value        : \root@vm-borg-1:~#

bd830dbd7432 on 11 Dec 2019

I created a script for "stressing" Borg backups, with the goal of reproducing the corruption problem. The script is running on my work machine, and Borg repo is on the same VM that stores production Borg repos, which are having corruption problems.

The script is running for 17 hours straight without a sign of corruption:

#!/bin/bash

set -e -u -x -o pipefail

DIR="/srv/tmp/borg-stress"
REPO="[email protected]:borg-stress"

mkdir -p "$DIR"
borg init -e none "$REPO" || true

trap date EXIT # print date on exit

while :; do
    # A random number between 1 and 100000, for using with file
    # names. Thus, we well have a maximum of 100000 files in the
    # test repo.
    set +x
    # Creating/modifying 10 files at a time for backup.
    for _ in $(seq 10); do
        random=$(shuf -i 1-100000 -n 1)
        # Writing random bytes, with $random size.
        dd if=/dev/urandom of="${DIR}/${random}" bs="${random}" count=1 2>/dev/null
    done
    set -x
    borg create -vps -C zstd "${REPO}::{now}" "$DIR"
    # Probability of running on 1 in 15 (sort of?).
    r=$(( RANDOM % 15 ))
    if (( r == 0 )); then
        borg check -vp "$REPO"
        borg prune -vp --list --keep-minutely 15 "$REPO" # churn!
    fi
done

bd830dbd7432 on 11 Dec 2019

I made some tweaks on the script, and it ran for three days. Couldn't reproduce repository corruption.

#!/bin/bash

set -e -u -x -o pipefail

DIR="/srv/tmp/borg-stress"
REPO="[email protected]:borg-stress"

mkdir -p "$DIR"
borg init -e none "$REPO" || true

trap date EXIT # print date on exit

while :; do
    # A random number between 1 and 500000, for using with file
    # names. Thus, we well have a maximum of 500000 files in the
    # test repo.
    set +x
    for i in $(seq 10000); do
        random=$(shuf -i 1-500000 -n 1)
        dd if=/dev/urandom of="${DIR}/${random}" bs="$((i * 2))" count=1 2>/dev/null
    done
    set -x
    borg create -vps -C none "${REPO}::{now}" "$DIR"
    # Probability of running on 1 in 15 (sort of?).
    r=$(( RANDOM % 15 ))
    if (( r == 0 )); then
        borg check -vp "$REPO"
        borg prune -vp --list --keep-minutely 30 "$REPO" # churn!
    fi
done

bd830dbd7432 on 14 Dec 2019

Recreated Borg server VM from scratch on another physical server, already in production and without any historical issues. Let's see how it goes.

bd830dbd7432 on 21 Dec 2019

👍1

I'm seeing a very similar issue here. I'm still trying to figure out if there is some hardware issue but I already tried #4831.

Output from Borg ~1.0.9~ 1.1.9 (as shipped with Debian 10 (Buster):

$ borg check --verify --debug ~/borg/host03d.old/
using builtin fallback logging configuration
37 self tests completed in 0.07 seconds
Starting repository check
Verified integrity of /home/user/borg/host03d.old/index.12820
Read committed index of transaction 12820
Segment transaction is    12820
Determined transaction is 12820
Found 5591 segments
finished segment check at segment 12820
Starting repository index check
Index mismatch for key b'C\xb9\x06P.\x948\xfb\t0r\xd2\x0c\xed:l]\xa1\xa3l\xb3\x85t\x96\x8b\x1b;\xb2\xb6\xab\x99\x00'. (12703, 201309690) != (-1, -1)
Finished full repository check, errors found.

Output using the 1.1-maint branch:

$ ~/.local/bin/borg --version
borg 1.1.11.dev107+g0d9e88d9
$ ~/.local/bin/borg check --verify --debug ~/borg/host03d.old/
using builtin fallback logging configuration
35 self tests completed in 0.08 seconds
Starting repository check
Verified integrity of /home/user/borg/host03d.old/index.12820
Read committed index of transaction 12820
Segment transaction is    12820
Determined transaction is 12820
Found 5591 segments
Starting repository index check
Index object count match.
ID: 43b906502e9438fb093072d20ced3a6c5da1a36cb38574968b1b3bb2b6ab9900 rebuilt index: (12703, 201309690) committed index: <not found>     
Completed repository check, no problems found.
Starting archive consistency check...
Verified integrity of /home/user/borg/host03d.old/index.12820
Starting cryptographic data integrity verification...
Repo/Chunks index object count vs. segment files object count mismatch.
Repo/Chunks index: 168180 objects != segment files: 168179 objects
Finished cryptographic data integrity verification, verified 168179 chunks with 0 integrity errors.
TAM-verified manifest
Analyzing archive main-2019-11-24T23:21:02 (1/117)
Analyzing archive main-2019-12-01T23:21:01 (2/117)
Analyzing archive main-2019-12-08T23:21:02 (3/117)
Analyzing archive vm_100-2019-12-09T19:47:15 (4/117)
Analyzing archive vm_102-2019-12-10T19:30:02 (5/117)
Analyzing archive vm_103-2019-12-11T20:12:55 (6/117)
…
Analyzing archive main-2020-01-03T15:21:01 (117/117)
Archive consistency check complete, no problems found.

There doesn't appear to be any more information in the output and the check is now considered succeeded. Was that intended?

pgerber on 4 Jan 2020

I modified Borg to write the committed index and the recreated index to disk. When I diff the two, there is no difference yet still a missing item is reported:

user@tocco-borg:/tmp$ ~/.local/bin/borg check --verify --debug ~/borg/host03d.old/
using builtin fallback logging configuration
35 self tests completed in 0.08 seconds
Starting repository check
Verified integrity of /home/user/borg/host03d.old/index.12820
Read committed index of transaction 12820
Segment transaction is    12820
Determined transaction is 12820
Found 5591 segments
Starting repository index check
Index object count match.
ID: 43b906502e9438fb093072d20ced3a6c5da1a36cb38574968b1b3bb2b6ab9900 rebuilt index: (12703, 201309690) committed index: <not found>     
Completed repository check, no problems found.
Starting archive consistency check...
…
user@tocco-borg:/tmp$ ls -lh *.idx
-rw------- 1 user user 14M Jan  4 17:10 disk.idx
-rw------- 1 user user 14M Jan  4 17:10 mem.idx
user@tocco-borg:/tmp$ sort disk.idx >disk.idx.sorted
user@tocco-borg:/tmp$ sort mem.idx >mem.idx.sorted
user@tocco-borg:/tmp$ diff disk.idx.sorted mem.idx.sorted
user@tocco-borg:/tmp$

I also tested a small change that copied the indexes into a new dict. After that no more error was reported:

$ ~/.local/bin/borg check --verify --debug ~/borg/host03d.old/
using builtin fallback logging configuration
35 self tests completed in 0.08 seconds
Starting repository check
Verified integrity of /home/user/borg/host03d.old/index.12820
Read committed index of transaction 12820
Segment transaction is    12820
Determined transaction is 12820
Found 5591 segments
Starting repository index check
Index object count match.
Completed repository check, no problems found.
Starting archive consistency check...
Verified integrity of /home/user/borg/host03d.old/index.12820
Starting cryptographic data integrity verification...
…

Either I'm missing something here or something is going horribly wrong.

pgerber on 4 Jan 2020

@pgerber Debian buster has borgbackup 1.1.9, not 1.0.9.

If you meant that, please edit your post.

ThomasWaldmann on 5 Jan 2020

ID: 43b906502e9438fb093072d20ced3a6c5da1a36cb38574968b1b3bb2b6ab9900
     rebuilt index: (12703, 201309690) committed index: <not found>     
Completed repository check, no problems found.

That looks incorrect. I'll have a look at the code...

Starting archive consistency check...
...
Repo/Chunks index object count vs. segment files object count mismatch.
Repo/Chunks index: 168180 objects != segment files: 168179 objects
...
Archive consistency check complete, no problems found.

Also a bit strange, but might be correct if that one chunk is not used by any archive.

ThomasWaldmann on 5 Jan 2020

/home/user/borg/host03d.old/index.12820 could you give that file to me for further analysis?

The repo index does not contain private data btw, it just maps ID -> (segment, offset). No filenames, no file contents, no file meta data in there.

ThomasWaldmann on 5 Jan 2020

Your "small change" is interesting, should behave the same as before, but does not.

I'ld like to reproduce that with your index file and dig in a bit deeper to see whether that is a bug, a corrupted hash table or a hw issue.

ThomasWaldmann on 5 Jan 2020

BTW, if you could keep the repo (or a copy of it in "unrepaired" state) that produces this strange effect, that might be useful in case I can not reproduce / find the issue just with the index.

ThomasWaldmann on 5 Jan 2020

@pgerber Debian buster has borgbackup 1.1.9, not 1.0.9.

If you meant that, please edit your post.

Corrected.

/home/user/borg/host03d.old/index.12820 could you give that file to me for further analysis?

The repo index does not contain private data btw, it just maps ID -> (segment, offset). No filenames, no file contents, no file meta data in there.

The index as requested: index.12820.xz.

BTW, if you could keep the repo (or a copy of it in "unrepaired" state) that produces this strange effect, that might be useful in case I can not reproduce / find the issue just with the index.

Will do.

pgerber on 5 Jan 2020

👍1

Hello!

Unfortunately, we had another backup corruption on 2019-12-24, three days after reinstalling Borg VM server on another physical server (https://github.com/borgbackup/borg/issues/4829#issuecomment-568184376):

Creating archive at "[email protected]:prometheus::{now}"
A wal/00006332
A wal/00006331
Traceback (most recent call last):

  File "borg/remote.py", line 247, in serve

  File "borg/repository.py", line 442, in commit

  File "borg/repository.py", line 789, in compact_segments

AssertionError: Corrupted segment reference count - corrupted index or hints

Borg server: Platform: Linux vm-borg-1 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64
Borg server: Linux: debian 10.2 
Borg server: Borg: 1.1.10  Python: CPython 3.5.7 msgpack: 0.5.6
Borg server: PID: 28652  CWD: /srv/borg/root@vm-prometheus-1
Borg server: sys.argv: ['borg', 'serve', '--restrict-to-path', '/srv/borg/root@vm-prometheus-1']
Borg server: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --info'
Platform: Linux vm-prometheus-1 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64
Linux: debian 10.1 
Borg: 1.1.10  Python: CPython 3.5.7 msgpack: 0.5.6
PID: 10113  CWD: /mnt
sys.argv: ['borg', 'create', '-vsx', '--files-cache=mtime,size', '--show-rc', '--one-file-system', '--exclude', 'lost+found', '--list', '--filter', 'AME', '-C', 'zstd', '[email protected]:prometheus::{now}', '.']
SSH_ORIGINAL_COMMAND: None

terminating with error status, rc 2

No more corruption after this day, until now.

bd830dbd7432 on 13 Jan 2020

OK, finally having some time to look into this - working on this now...

ThomasWaldmann on 1 Mar 2020

OK, just as an update: I have played with the index file given by @pgerber.

That hashtable has one entry that is found by NSIndex.iteritems(), but not by NSIndex.__getitem__ or NSIndex.__contains__. This is why copying all stuff into a python dict fixes the issue.

The reason why the entry is not found is that hashindex_lookup starts at index 163136, finds an empty bucket at that index and gives up.

That is correct behavior of the algorithm. Finding an empty bucket while looking for a bucket with a matching key means that we are finished, because either we should find a used bucket with another key (hash function collision), a deleted bucket (which once was used for a collision, but then was deleted - a tombstone entry) or a bucket with the correct key.

The bucket with the key we search for is at index 163139.

163136 b'0000000000000000000000000000000000000000000000000000000000000000' b'ffffffff' b'00000000'
163137 b'dbdc7a7f2935228de94cc586d6892df219ffeea1e4019e05d10913fe82fa005f' b'5d2a0000' b'a598fc0a'
163138 b'4fea6a91e72554fc70722f86c80d8862ddd9b747e25c86327346121181ef43ca' b'feffffff' b'58f3401c'
163139 b'43b906502e9438fb093072d20ced3a6c5da1a36cb38574968b1b3bb2b6ab9900' b'9f310000' b'fabdff0b'

Now have to find out the root cause why the index file looks like that.

ThomasWaldmann on 1 Mar 2020

@bd830dbd7432 @pgerber thanks a lot for finding this and helping here in this ticket.

I guess I have found the root cause and fixed it in #4967.

The problem was not that the hashindex lookup was broken in the code, but that the hashtable was brought into an invalid state by hashindex_set writing the new entry to an incorrect index in the hashtable. So it was there (and thus found by iteration), but could not get discovered by lookup.

This happened under these conditions:

trying to hashindex_set some value
hashtable not having too many entries (not too full)
set wants to use an empty bucket and by doing that, minimum amount of empty buckets goes below threshold and triggers a hashtable rebuild ("same size")
hashtable rebuild may change positions of entries in the new table (compared to old table) as well as amount and positions of empty buckets and if that makes the index we wanted to use incorrect for the current set operation, the bug occurred because the index was not recomputed after the hashtable rebuild.

ThomasWaldmann on 1 Mar 2020

Consequences of bug for the hashtables / borg indexes:

set operation puts data at incorrect place
- if that place was empty, only check/lookup fails, no other harm
- worse: if that place was used, another entry gets overwritten in the hashtable
check (key in hashtable) or lookup (hashtable[key]) fails although entry exists, iteration will still find it.
for the worse case, that other entry is lost in the hashtable, iteration also won't find it any more

So, pretty bad as far as the hashtables / indexes are concerned.

Affects:

repo index (mapping chunk id -> segment, offset)
chunks index (mapping chunk id -> size, csize, refcount)

But there are good news:

seems to be not too frequent
borg check repair can rebuild the index from repo segment files, so there is no data loss.

ThomasWaldmann on 1 Mar 2020

Affected branches / releases:

fd06497 introduced the bug into 1.1-maint branch - it affects all borg 1.1.x since 1.1.0b4.
fd06497 introduced the bug into master branch - it affects all borg 1.2.0 alpha releases.
c5cd882 introduced the bug into 1.0-maint branch - it affects all borg 1.0.x since 1.0.11rc1.

The bug was fixed by:

701159a fixes the bug in 1.1-maint branch - will be released with borg 1.1.11.
fa63150 fixes the bug in master branch - will be released with borg 1.2.0a8.
7bb90b6 fixes the bug in 1.0-maint branch. Branch is EOL, so no new release is planned right now.

ThomasWaldmann on 1 Mar 2020

What's the best course to remedy this / minimize impact in existing repositories? Delete caches and repository index, then borg check?

I think deleting caches may be necessary (because iirc check doesn't check cached indices and it really doesn't make much sense to do so, considering that rebuilding them is faster). Repository index should be fixed by a borg check run, since it is supposed to force the index to match segment contents. So deleting repository index is probably unnecessary.

enkore on 1 Mar 2020

Labelled it "corruption" because of the mentioned indexes getting corrupted.

It does not corrupt repository segment files, though (this is where your backup data is).

ThomasWaldmann on 1 Mar 2020

@enkore yes, guess we need to:

install borg version with fix on client and server
delete client-side cache borg delete --cache-only repo
check/repair repository borg check --repair --repository-only repo

Guess that is the minimum required to bring everything into a for-sure valid state.
Might not always be required, but better play safe.

ThomasWaldmann on 1 Mar 2020

And maybe a borg check without --repair might be useful to find other, unrelated issues first.

ThomasWaldmann on 1 Mar 2020

@enkore do you think we run into some problem due to PUT x .... PUT x done against the repo (for both x being the same id) instead of the usual PUT x ... DEL x ... PUT x?

ThomasWaldmann on 1 Mar 2020

No, set/set/set would just overwrite in the same slot. That's the !(idx<0) path.

enkore on 2 Mar 2020

Note: issue has been fixed in master and 1.1-maint.

PR for 1.0-maint is there: #4968 (was merged into 1.0-maint meanwhile)

ThomasWaldmann on 3 Mar 2020

Now the question is of course why one particular user had this happen so frequently, and most others never saw it even once. I suspect this is due to BORG_UNKNOWN_UNENCRYPTED_REPO_ACCESS_IS_OK, i.e. using unencrypted repositories. Because this means that the same data will produce the same hashes and thus the same idx % M in each hash table, the issue will come back with newly created repositories and be somewhat reproducible just by having a few pieces of data in the backup set that collide to problematic idx % M (in the context of this issue). Using encrypted repositories very thoroughly randomizes idx % M and makes it entirely independent of the data. This effectively removes "how many repositories are used" as a dampening factor. Backups in their setup were run a lot as well (hourly to every ten minutes).

The next question to answer is the actual impact of this. If this could have led to permanent data loss, it's a big deal. If this just leads to errors but no data loss, not so much.

What is effective effect of this? Random chunks fail to look up in indices
What can happen if random chunks fail to look up in indices?
- Low-level Repository: Overwriting a chunk probably leads to incorrect accounting of free space.
- borg create: Deduplication against the cache index may fail despite the chunk actually existing. In this case, it is uploaded again. No data loss. Both identical plaintexts are encrypted under independent keys/IVs; no security impact.
- borg extract: Trying to extract a chunk that fails lookup results in an error. borg check --repair would have restored the index and allowed operation to succeed. No data loss.
- borg check repository: This builds an index by replaying the log, so GET/PUT/DEL. The rebuilt index may be subject to this bug. IIRC the new index is iterated for comparison with the old one, which would raise errors. borg check repository IIRC never modifies the log based on index, so no data loss.
- borg check archive: IIRC this fills an index from the repository's object list (based on iteration, so not affected by this bug). Only inserting into an index doesn't trigger this bug. So the index should be correct. Thus orphaned object deletion should not be affected by this bug. No data loss.

That's just a one minute analysis from the top of my head. A more thorough analysis may be warranted. But overall I think this bug may cause loud failures (as described in this thread), but no data loss.

enkore on 3 Mar 2020

Now the question is of course why one particular user had this happen so frequently, and most others never saw it even once. I suspect this is due to BORG_UNKNOWN_UNENCRYPTED_REPO_ACCESS_IS_OK, i.e. using unencrypted repositories. Because this means that the same data will produce the same hashes and thus the same idx % M in each hash table, the issue will come back with newly created repositories and be somewhat reproducible just by having a few pieces of data in the backup set that collide to problematic idx % M (in the context of this issue). Using encrypted repositories very thoroughly randomizes idx % M and makes it entirely independent of the data. This effectively removes "how many repositories are used" as a dampening factor. Backups in their setup were run a lot as well (hourly to every ten minutes).

I've seen curruption several times and all repositories in use here are using encryption mode repokey or repokey-blake2. We've 26 repositories in total and most of them are updated every 2 hours to add a backup.

What is effective effect of this? Random chunks fail to look up in indices

What can happen if random chunks fail to look up in indices?

Here is a collection of corruption errors I've seen so far. Perhaps this helps figuring out possible consequences of the index corruption. That they are related to the issue fixed here is mostly just an assumption:

Corruption 1

Both errors are from the same repo. The issue was fixed using:

rm -rf /root/.cache/borg
borg check --info --repair

log 1:

root Remote: Object with key 4fe9c16ac6552ad50cb5fa6c3c677d17c52475329b3bf6dd0161e84e0885d6a3 not found in repository /backup/borg_repos/db01a.
root Remote: Traceback (most recent call last):
Remote:   File "/usr/lib/python3/dist-packages/borg/remote.py", line 248, in serve
Remote:     res = f(**args)
Remote:   File "/usr/lib/python3/dist-packages/borg/repository.py", line 1046, in get
Remote:     raise self.ObjectNotFound(id, self.path) from None
Remote: borg.repository.Repository.ObjectNotFound: Object with key 4fe9c16ac6552ad50cb5fa6c3c677d17c52475329b3bf6dd0161e84e0885d6a3 not found in repository /backup/borg_repos/db01a.
Remote: Remote: Platform: Linux backup01 4.15.0-70-generic #79-Ubuntu SMP Tue Nov 12 10:36:11 UTC 2019 x86_64
Remote: Linux: Ubuntu 18.04 bionic
Remote: Borg: 1.1.5  Python: CPython 3.6.8
Remote: PID: 38323  CWD: /backup/borg_repos/db01a
Remote: sys.argv: ['/usr/bin/borg', 'serve', '--restrict-to-path', '/backup/borg_repos/db01a']
Remote: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --error'
Remote: Object with key 82 not found in repository [email protected]:~::main-{now:%Y-%m-%dT%H:%M:%S}.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/borg/remote.py", line 548, in get_many
    yield self.caching_repo.get(key)
  File "/usr/lib/python3/dist-packages/borg/repository.py", line 533, in get
    raise self.ObjectNotFound(id_, self.path) from None
borg.repository.Repository.ObjectNotFound: Object with key b'O\xe9\xc1j\xc6U*\xd5\x0c\xb5\xfal<g}\x17\xc5$u2\x9b;\xf6\xdd\x01a\xe8N\x08\x85\xd6\xa3' not found in repository /tmp/borg-tmpwiheuhx5.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 2181, in main
    exit_code = archiver.run(args)
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 2117, in run
    return set_ec(func(args))
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 107, in wrapper
    return method(self, args, repository=repository, **kwargs)
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 326, in do_create
    with Cache(repository, key, manifest, do_files=args.cache_files, lock_wait=self.lock_wait) as cache:
  File "/usr/lib/python3/dist-packages/borg/cache.py", line 103, in __init__
    self.sync()
  File "/usr/lib/python3/dist-packages/borg/cache.py", line 412, in sync
    self.chunks = create_master_idx(self.chunks)
  File "/usr/lib/python3/dist-packages/borg/cache.py", line 374, in create_master_idx
    archive_chunk_idx = fetch_and_build_idx(archive_id, repository, self.key)
  File "/usr/lib/python3/dist-packages/borg/cache.py", line 327, in fetch_and_build_idx
    for item_id, chunk in zip(archive[b'items'], repository.get_many(archive[b'items'])):
  File "/usr/lib/python3/dist-packages/borg/remote.py", line 550, in get_many
    for key_, data in repository_iterator:
  File "/usr/lib/python3/dist-packages/borg/remote.py", line 457, in get_many
    for resp in self.call_many('get', [(id_,) for id_ in ids], is_preloaded=is_preloaded):
  File "/usr/lib/python3/dist-packages/borg/remote.py", line 360, in call_many
    handle_error(error, res)
  File "/usr/lib/python3/dist-packages/borg/remote.py", line 346, in handle_error
    raise Repository.ObjectNotFound(res[0], self.location.orig)
borg.repository.Repository.ObjectNotFound: Object with key 82 not found in repository [email protected]:~::main-{now:%Y-%m-%dT%H:%M:%S}.

Platform: Linux db01a 4.4.0-169-generic #198-Ubuntu SMP Tue Nov 12 10:38:00 UTC 2019 x86_64
Linux: Ubuntu 16.04 xenial
Borg: 1.0.12  Python: CPython 3.5.2
PID: 193515  CWD: /root
sys.argv: ['/usr/bin/borg', 'create', '--error', '--compression', 'lz4', '--lock-wait', '14400', '--', '[email protected]:~::main-{now:%Y-%m-%dT%H:%M:%S}', '/etc', '/root', '/boot/grub', '/etc', '/root', '/usr/local', '/var/backups', '/var/log', '/var/spool']
SSH_ORIGINAL_COMMAND: None

log 2:

root Remote: AssertionError Exception in RPC call
root Remote: Traceback (most recent call last):
Remote:   File "/usr/lib/python3/dist-packages/borg/remote.py", line 248, in serve
Remote:     res = f(**args)
Remote:   File "/usr/lib/python3/dist-packages/borg/repository.py", line 430, in commit
Remote:     self.compact_segments()
Remote:   File "/usr/lib/python3/dist-packages/borg/repository.py", line 777, in compact_segments
Remote:     assert segments[segment] == 0, 'Corrupted segment reference count - corrupted index or hints'
Remote: AssertionError: Corrupted segment reference count - corrupted index or hints
Remote: Remote: Platform: Linux backup01 4.15.0-70-generic #79-Ubuntu SMP Tue Nov 12 10:36:11 UTC 2019 x86_64
Remote: Linux: Ubuntu 18.04 bionic
Remote: Borg: 1.1.5  Python: CPython 3.6.8
Remote: PID: 32104  CWD: /backup/borg_repos/db01a
Remote: sys.argv: ['/usr/bin/borg', 'serve', '--restrict-to-path', '/backup/borg_repos/db01a']
Remote: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --error'
Remote: ('Remote Exception (see remote log for the traceback)', 'AssertionError')
Platform: Linux db01a 4.4.0-169-generic #198-Ubuntu SMP Tue Nov 12 10:38:00 UTC 2019 x86_64
Linux: Ubuntu 16.04 xenial
Borg: 1.0.12  Python: CPython 3.5.2
PID: 183085  CWD: /root
sys.argv: ['/usr/bin/borg', 'create', '--error', '--compression', 'lz4', '--lock-wait', '14400', '--', '[email protected]:~::main-{now:%Y-%m-%dT%H:%M:%S}', '/etc', '/root', '/boot/grub', '/etc', '/root', '/usr/local', '/var/backups', '/var/log', '/var/spool']
SSH_ORIGINAL_COMMAND: None

Corruption 2

Fixed by removing local cache.


Local Exception
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 4455, in main
    exit_code = archiver.run(args)
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 4387, in run
    return set_ec(func(args))
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 152, in wrapper
    return method(self, args, repository=repository, cache=cache_, **kwargs)
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 1443, in do_info
    return self._info_archives(args, repository, manifest, key, cache)
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 1463, in _info_archives
    info = archive.info()
  File "/usr/lib/python3/dist-packages/borg/archive.py", line 386, in info
    stats = self.calc_stats(self.cache)
  File "/usr/lib/python3/dist-packages/borg/archive.py", line 503, in calc_stats
    unique_csize = archive_index.stats_against(cache.chunks)[3]
  File "src/borg/hashindex.pyx", line 408, in borg.hashindex.ChunkIndex.stats_against (src/borg/hashindex.c:6302)
ValueError: stats_against: key contained in self but not in master_index.

Platform: Linux host03d 4.15.18-18-pve #1 SMP PVE 4.15.18-44 (Wed, 03 Jul 2019 11:19:13 +0200) x86_64
Linux: debian 9.9 Borg: 1.1.9  Python: CPython 3.5.3
PID: 23858  CWD: /root
sys.argv: ['/usr/bin/borg', 'info', '--glob', '*', '--lock-wait', '7200', '--last', '300', '--json', '[email protected]:~']
SSH_ORIGINAL_COMMAND: None

Curruption 3

Fixed via repair command.

What interesting here is that these two errors shown below were reported simultaneously. Yet the two errors are for two distinct repositories located on two servers. The last verify was a week earlier and succeed which means the corruption happened within the same week.

$ borg-1 check --verify
Remote: Index mismatch for key b'\xfa\xf6Hs\x0b\x0f\x9bv\xbc\x90\xe9\xd2\xcd\xd5\xd5\x94\xe4`\x16"}8Q!]\xcb\xae\x12S\x0c\xf2i'. (11779, 241957887) != (-1, -1)

Remote: Completed repository check, errors found.

$ borg-2 check --verify                                                                       
Remote: Index mismatch for key b'C\xb9\x06P.\x948\xfb\t0r\xd2\x0c\xed:l]\xa1\xa3l\xb3\x85t\x96\x8b\x1b;\xb2\xb6\xab\x99\x00'. (12703, 201309690) != (-1, -1)

Remote: Completed repository check, errors found.

Curruption 4

Object with key 0b94abb6e7c65a2e7d770ab38656de1ab095371f660b1555f35ff72c521ee601 not found in repository [email protected]:~.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 4253, in main
    exit_code = archiver.run(args)
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 4185, in run
    return set_ec(func(args))
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 152, in wrapper
    return method(self, args, repository=repository, **kwargs)
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 1568, in do_prune
    progress=args.progress).delete(stats, forced=args.forced)
  File "/usr/lib/python3/dist-packages/borg/archive.py", line 821, in delete
    while fetch_async_response(wait=True) is not None:
  File "/usr/lib/python3/dist-packages/borg/archive.py", line 768, in fetch_async_response
    return self.repository.async_response(wait=wait)
  File "/usr/lib/python3/dist-packages/borg/remote.py", line 967, in async_response
    for resp in self.call_many('async_responses', calls=[], wait=True, async_wait=wait):
  File "/usr/lib/python3/dist-packages/borg/remote.py", line 793, in call_many
    handle_error(unpacked)
  File "/usr/lib/python3/dist-packages/borg/remote.py", line 750, in handle_error
    raise Repository.ObjectNotFound(args[0].decode(), self.location.orig)
borg.repository.Repository.ObjectNotFound: Object with key 0b94abb6e7c65a2e7d770ab38656de1ab095371f660b1555f35ff72c521ee601 not found in repository [email protected]:~.

Platform: Linux backup01 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64
Linux: Ubuntu 18.04 bionic
Borg: 1.1.5  Python: CPython 3.6.9
PID: 5286  CWD: /root
sys.argv: ['/usr/bin/borg', 'prune', '[email protected]:~', '--lock-wait', '14400', '--keep-daily', '7', '--keep-weekly', '4', '--prefix', 'postgres-']
SSH_ORIGINAL_COMMAND: None

pgerber on 3 Mar 2020

I've come up with a scenario in which this bug causes data loss:

"borg create" adds a new chunk to the repository and hits this bug while inserting into the cache.
Some time later, another "borg create" encounters the same chunk. Because it fails to look up, it is added again. The reference count of the chunk is thus 1.
Either archive is deleted. This will delete the chunk, due to the incorrect reference count, despite an archive still referencing it. The chunk is lost.

Note that this only works if the same cache is used for 1, 2 and 3, and if there is no hashindex_resize between 1 and 2, because that repopulates the new from iteration, restoring "lost" keys. The same cache has to be used for 3 as well, because a remote synchronized cache isn't affected since it is built from scratch using insertions only.

enkore on 5 Mar 2020

Uuh, right. I'll add something to the changelog to advise everybody to rebuild the repo index and chunks cache.

Will be a bit painful for bigger repos, but better than taking risks.

Rebuilding the chunks cache will also fix the refcounts.

ThomasWaldmann on 5 Mar 2020

Worth pointing out that I can't see how this could cause undetectable data loss (which makes sense, because this issue only affects metadata around the object graph), so if borg check doesn't complain, nothing has been lost.

enkore on 6 Mar 2020

Everybody please review: https://github.com/borgbackup/borg/pull/4989/files

Something incorrect / something missing?

ThomasWaldmann on 6 Mar 2020

It's clear for me.

@ThomasWaldmann, glad you found the cause and came with a fix. Thank you very much.

After Borg 1.1.11 is released, I will upgrade clients and servers, and will report back after some time.

bd830dbd7432 on 6 Mar 2020

The advisory has a few markup issues, it is easier to read as raw markup:

https://raw.githubusercontent.com/borgbackup/borg/1.1.11/docs/changes.rst

See "Pre-1.1.11 potential index corruption / data loss issue" there.

ThomasWaldmann on 8 Mar 2020

Just a quick feedback after 2 months using Borg 1.1.11 -- we didn't experience any corruption issues. Backups, checks and restores are working smoothly. Thanks again, @ThomasWaldmann!

bd830dbd7432 on 6 May 2020

👍2

It's not completely clear from the docs, are we advised to run borg delete cache-only and borg check repair _unconditionally_, or only if there are signs of corruption (such as a failed readonly borg check, or backup errors)?

I ask because of the warning when running borg check repair on 1.1.11 - it's not clear what level of risk is present in running check repair vs not

'check --repair' is an experimental feature that might result in data loss.
Type 'YES' if you understand this and want to continue:

lukefor on 9 May 2020

Well, there is a "you might be affected if ..." section with some typical errors occurring due to the issue.

If you are pretty sure you never had any such (or other strange) issue and doing a check --repair is super expensive for you (e.g. due to repo size or speed), you could just not do that.

If is is not expensive, I'ld just unconditionally do it.

The repair warning message is there because we can't be sure repair can deal with all the issues someone might have with their repo. Usually it works, but no guarantees.

ThomasWaldmann on 9 May 2020

👍1

@bd830dbd7432 Sorry to hihack this thread, but could you by any chance share your LVM snapshot code +launch borg code as a gist ?
I am currently working only with qcow2 disks, but this seems interesting to me.
My goal is to use borg to read the lvm snapshot block device as file (producing a raw file backup), so full disk backups could be done for LVM backed VMs, including boot sector etc.