Yes.
Issue.
Currently 1.1.10 on client and server, using the official binary, but the issue persists all the way from 1.1.4. Always used official binaries from this page.
The issues persists from Debian 8 (Jessie) to Debian 10 (Buster), on client and/or server.
Various Dell PowerEdges, with ECC RAM, disks in RAID-10 (mdadm), gigabit network, ext4 filesystems.
We backup ~1TB with Borg on total, but using several smaller repositories by VM/data type.
borg create -vsx --files-cache=mtime,size --show-rc --one-file-system --list --filter AME -C zstd [email protected]:postgresql::{now} .
We use Borg for backups for almost two years. Most data are backup-ed from an LVM snapshot. Most are backup-ed every hour, but some data (ex: databases) are backed up every 10 minutes. Some VMs (qcow2 disks) are backup-ed up daily (they are shutdown off-hours for the backup).
With this setup, since we started using Borg, we usually experience 1 or 2 problems per month with Borg repository corruption. In every case, we could fix the problem with borg check --repair.
Every backup is done using SSH.
We had this issue backing-up from VMs and physical servers.
We could never reproduce the issue.
We checked every disk from every server with a complete S.M.A.R.T. test, with no error detected.
We checked RAM with memtest86 for hours and nothing were found.
We checked that every backup BEFORE the backup job that broke finished successfully.
Destination server never got short on RAM or disk space.
+ borg create -vsx --files-cache=mtime,size --show-rc --one-file-system --list --filter AME -C zstd [email protected]:postgresql::{now} .
Creating archive at "[email protected]:postgresql::{now}"
M 9.6/main/base/141585/171760
M 9.6/main/base/16507/156896
M 9.6/main/base/16507/1259
M 9.6/main/base/16507/2619
M 9.6/main/base/16507/16551
M 9.6/main/base/16507/16577
M 9.6/main/base/16507/16582
M 9.6/main/base/16507/16606
M 9.6/main/base/16507/16731
M 9.6/main/base/16507/16606_fsm
M 9.6/main/base/16507/17268
M 9.6/main/base/16507/17274
M 9.6/main/base/16507/17326
M 9.6/main/base/16507/17327
M 9.6/main/base/16507/16606_vm
M 9.6/main/pg_clog/0003
A 9.6/main/pg_xlog/000000010000001200000078
M 9.6/main/global/pg_control
Traceback (most recent call last):
File "borg/remote.py", line 247, in serve
File "borg/repository.py", line 442, in commit
File "borg/repository.py", line 789, in compact_segments
AssertionError: Corrupted segment reference count - corrupted index or hints
Borg server: Platform: Linux vm-borg-1 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1+deb9u3 (2019-06-16) x86_64
Borg server: Linux: debian 9.9
Borg server: Borg: 1.1.10 Python: CPython 3.5.7 msgpack: 0.5.6
Borg server: PID: 15222 CWD: /srv/borg/root@vm-postgres-1
Borg server: sys.argv: ['borg', 'serve', '--restrict-to-path', '/srv/borg/root@vm-postgres-1']
Borg server: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --info'
Platform: Linux vm-postgres-1 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1+deb9u3 (2019-06-16) x86_64
Linux: debian 9.9
Borg: 1.1.10 Python: CPython 3.5.7 msgpack: 0.5.6
PID: 3953 CWD: /mnt
sys.argv: ['borg', 'create', '-vsx', '--files-cache=mtime,size', '--show-rc', '--one-file-system', '--list', '--filter', 'AME', '-C', 'zstd', '[email protected]:postgresql::{now}', '.']
SSH_ORIGINAL_COMMAND: None
terminating with error status, rc 2
You said you could fix the issue by running borg check --repair - did it output any additional information / warnings / error messages (when using -v)?
Some other stuff I noticed (not necessarily related to the issue, but just wondering):
-x and --one-file-system is the same, you only need it once.--files-cache=mtime,size? if it is ext4, the default should work ok.Greetings!
Some other stuff I noticed (not necessarily related to the issue, but just wondering):
* `-x` and `--one-file-system` is the same, you only need it once.
This is redundant, you're right.
* why do you use `--files-cache=mtime,size`? if it is ext4, the default should work ok.
Because when using LVM snapshots, files seems (or seemed) to be re-scanned by Borg, and files not changed were printed by --list --filter AME. This happened on ext4 and XFS on the past, and now we standardize on using --files-cache=mtime,size.
You said you could fix the issue by running
borg check --repair- did it output any additional information / warnings / error messages (when using-v)?
On 2019-02-06 we had a corruption on borg create:
M 9.6/main/base/16507/17338
M 9.6/main/pg_clog/0001
A 9.6/main/pg_xlog/000000010000000800000034
M 9.6/main/global/pg_control
M 9.6/main/pg_subtrans/0018
Traceback (most recent call last):
File "borg/remote.py", line 248, in serve
File "borg/repository.py", line 430, in commit
File "borg/repository.py", line 777, in compact_segments
AssertionError: Corrupted segment reference count - corrupted index or hints
Borg server: Platform: Linux vm-borg-1 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64
Borg server: Linux: debian 9.7
Borg server: Borg: 1.1.8 Python: CPython 3.5.6
Borg server: PID: 23309 CWD: /srv/borg/root@vm-postgres-1
Borg server: sys.argv: ['borg', 'serve', '--restrict-to-path', '/srv/borg/root@vm-postgres-1']
Borg server: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --info'
Platform: Linux vm-postgres-1 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64
Linux: debian 9.7
Borg: 1.1.8 Python: CPython 3.5.6
PID: 25777 CWD: /mnt
sys.argv: ['borg', 'create', '-vsx', '--files-cache=mtime,size', '--show-rc', '--one-file-system', '--list', '--filter', 'AME', '-C', 'zstd', '[email protected]:postgresql::{now}', '.']
SSH_ORIGINAL_COMMAND: None
terminating with error status, rc 2
Result from borg check -vp:
Starting repository check
Starting repository index check
Index mismatch for key b'+\xd7dq\x8c\xf2\xf59)\xdc\xc1\xca\x98\x05\xb9\x08\x9f#5Ji$\xa3\x9c\xc0]\x81\xacK\x89\x10\xd0'. (45623, 21329774) != (-1, -1)
Completed repository check, errors found.
Result from borg check -vp --repair:
'check --repair' is an experimental feature that might result in data loss.
Type 'YES' if you understand this and want to continue: YES
Starting repository check
Starting repository index check
Completed repository check, no problems found.
Starting archive consistency check...
Analyzing archive 2019-01-06T23:57:03 (1/227)
Analyzing archive 2019-01-13T23:57:04 (2/227)
(.....) <- Only repeating lines with archive names
Analyzing archive 2019-01-20T23:57:04 (3/227)
Analyzing archive 2019-01-27T23:57:04 (4/227)
Analyzing archive 2019-02-06T13:37:05 (226/227)
Analyzing archive 2019-02-06T13:47:05 (227/227)
Writing Manifest.
Committing repo (may take a while, due to compact_segments)...
Finished committing repo.
Archive consistency check complete, no problems found.
After repairing, we ran borg check -vp again, but it didn't show any error this time (don't have the output saved).
Index mismatch for key b'...'. (45623, 21329774) != (-1, -1)
(45623, 21329774) / left is (segment, offset) of rebuilt index (in memory).
(-1, -1) / right is (segment, offset) of on-disk index - (-1, -1) means no entry for that key.
so the committed repo contents had a chunk with that key (id), but it was not in the on-disk index, which is strange.
note: with --repair, these warnings are not emitted, but the in-memory index is written to disk.
When looking at the 1.1.8 code, this message is only emitted if the on-disk index size and the in-memory rebuilt index size is the same.
So, if one key is not present on disk, that means that there is another key present on disk. The current code just does not find out which key that is.
I'll change that code to display both in-memory and on-disk key, so that e.g. a corruption of a few bits can be seen, see #4831.
You said you did a complete SMART test, does that mean a SMART long test (smartctl -t long)?
You said you did a complete SMART test, does that mean a SMART long test (
smartctl -t long)?
Yes, using smartctl -t long.
In every disk:
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 14507 -
@bd830dbd7432 I once had the embarrassing experience of an usb hub silently corrupting the data on their way from the harddisk to the PC's memory.
borg noticed it due to its 256bit MACs, but there was no indication at all in the kernel log. Disks and memory were also ok, like in your case.
It was quite rare corruption, maybe 1bit per 100GB..1TB. I found the culprit by removing the USB hub, then the problem went away. Inserted it again, problem reappeared. Did this multiple times and it always behaved like that.
You don't have an USB hub in your data path I guess, but it could be also something else.
Another place of failure could be also multi-bit RAM errors (too much for ECC to cope with) or some bit flip that is outside an ECC protected path.
Even CRCs or ECCs used by HW are sometimes too weak to detect corruption. Or some stuff is not protected by such at all.
Hello Thomas,
The disks are attached via SATA ports on the server.
The strange thing is that we never noticed any corruption on other backups, in this same server, from different methods (eg: tar). This server aggregates backups for other servers, in different VMs.
For example, we have SQL Server backups being written to this server (inside another VM), with ~100GB, and they are restored nightly on another server. We have other backups (via tar) being written (and already encrypted on source). Daily, we have jobs that copy those backups to other servers, and they are being decrypted correctly, without a sign of corruption.
I don't have much knowledge in Borg internals, but it seem to me that only indexes are being corrupted, aren't they? Data chunks do not seem to be lost, and after borg check --repair, backups are validated by borg check and can be restored. Given that fact that data is much bigger than indexes, what odds are that only indexes are getting corrupted? But I may be talking nonsense...
The difference is that content data is only going once over the cable/interface.
Indexes are loaded, modified, stored. And if you do backups very frequently (like every 10mins), indexes get touched often.
But that reminds me: we have cryptographic strength checksums over index data since a while, so on-disk / on-sata-cable corruption should be noticed. It does not help against in-memory corruption that happens before computing the checksum and storing the index though.
Could you locate the root cause of the problem?
Hello Thomas, thanks for your attention.
I ran memtester inside the VM that Borg stores the repositories, and on the physical server. No problem was detected.
Until now, I couldn't find any hardware issue, and there were no freezing/corruption problem on other VMs running on the same physical server, besides Borg repositories.
A complete Dell hardware diagnostic didn't find any problem either.
I'm just waiting for another corruption to happen, so I can run memtester inside the Borg VM, trying to pinpoint a bad RAM segment mapped to the VM.
Above you said you ran memtest86 - I usually use memtest86+, did you use the same?
It requires to be booted, so the server would be offine during the test.
Yeah, that was Memtest86+ 5.01. Server was offline through the night for the test.
so the committed repo contents had a chunk with that key (id), but it was not in the on-disk index, which is strange.
Just a wild guess...
Our VMs and servers have a maintenance window of one hour at night, where patches are applied and they are rebooted if needed (not everything on the same day). This includes servers which Borg make backups and servers which Borg stores the backups.
On 2019-10-21, I changed all Borg backups and repositories' check jobs, so they do not run inside this maintenance window, avoiding a reboot while Borg is busy.
Usually, Borg --check is run everyday.
At 2019-11-04, two different Borg repositories were corrupted, but this problem didn't happen again until today. Output for one of them:
+ borg check -v --last 1 [email protected]:vms
Remote: Starting repository check
Remote: Starting repository index check
Remote: Index mismatch for key b'b\x9a\xb3\x9c\xc7\x87\x0c9\x84\x90\x96\xd3p2\x12\xf4\x1e\x1c\xd5S\xf3#7\xc5\xb6\x1c5U(\xc0\x8d\x18'. (893, 25399177) != (-1, -1)
Remote: Completed repository check, errors found.
(I know this may be another problem, sorry for adding on this issue)
Could this sort of corruption happen while Borg is backing up, pruning, and the process is aborted for a reboot? And this problem be detected later, on other backups and pruning jobs?
Whenever we had corruptions, we always check if Borg were killed by some problem (error, reboot, out of memory, etc), but I don't have a clear evidence of this.
Theoretically, reboot, kills or connection breakdowns should not affect repo or indexes.
If the repo or cache is not committed, it should just roll back to last valid (committed) state.
Had another corruption this night:
+ export PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/bin:/bin
+ export BORG_UNKNOWN_UNENCRYPTED_REPO_ACCESS_IS_OK=yes
+ export BORG_RSH=ssh -o StrictHostKeyChecking=no
+ trap cleanup EXIT
+ borg init -e none [email protected]:clog
A repository already exists at [email protected]:clog.
+ true
+ lvcreate -n clog-snap -L 3g -s vg0/clog
Using default stripesize 64.00 KiB.
Logical volume "clog-snap" created.
+ mount -t ext4 -o ro /dev/vg0/clog-snap /mnt
+ cd /mnt
+ borg create -vsx --files-cache=mtime,size --show-rc --one-file-system --exclude lost+found --list --filter AME -C zstd [email protected]:clog::{now} .
Creating archive at "[email protected]:clog::{now}"
M clogd.db
A clogd.db-wal
M clogd.db-shm
Traceback (most recent call last):
File "borg/remote.py", line 247, in serve
File "borg/repository.py", line 442, in commit
File "borg/repository.py", line 789, in compact_segments
AssertionError: Corrupted segment reference count - corrupted index or hints
Borg server: Platform: Linux vm-borg-1 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u1 (2019-09-20) x86_64
Borg server: Linux: debian 10.1
Borg server: Borg: 1.1.10 Python: CPython 3.5.7 msgpack: 0.5.6
Borg server: PID: 29775 CWD: /srv/borg/root@vm-clog-1
Borg server: sys.argv: ['borg', 'serve', '--restrict-to-path', '/srv/borg/root@vm-clog-1']
Borg server: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --info'
Platform: Linux vm-clog-1 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64
Linux: debian 9.11
Borg: 1.1.10 Python: CPython 3.5.7 msgpack: 0.5.6
PID: 5503 CWD: /mnt
sys.argv: ['borg', 'create', '-vsx', '--files-cache=mtime,size', '--show-rc', '--one-file-system', '--exclude', 'lost+found', '--list', '--filter', 'AME', '-C', 'zstd', '[email protected]:clog::{now}', '.']
SSH_ORIGINAL_COMMAND: None
terminating with error status, rc 2
+ cleanup
+ ORIG=2
+ cd /
+ umount /mnt
+ lvremove -f vg0/clog-snap
Logical volume "clog-snap" successfully removed
+ exit 2
I made a copy of the repository before repairing.
Running borg check on this copy:
$ borg check -vp borg-clog/
Starting repository check
Starting repository index check
Index mismatch for key b'>\xa2\xb0\xdb\x95\xb1O\x100~)Sdsr\x17\xd2,\xb3\xc8{\xef\xc5\x9b\xd7$\xdbz\x0c\xceS\x85'. (21922, 66184646) != (-1, -1)
Completed repository check, errors found.
As the above is still the output of 1.1.10 without #4831, it doesn't give us more information.
c3430191e778c85 [xxxxxxx@xxx borg (1.1-maint)]$ /usr/local/bin/borg -V
borg 1.1.10
c3430191e778c85 [xxxxxxx@xxx borg (1.1-maint)]$ /usr/local/bin/borg check /srv/tmp/borg-clog/
Index mismatch for key b'>\xa2\xb0\xdb\x95\xb1O\x100~)Sdsr\x17\xd2,\xb3\xc8{\xef\xc5\x9b\xd7$\xdbz\x0c\xceS\x85'. (21922, 66184646) != (-1, -1)
Completed repository check, errors found.
c3430191e778c85 [xxxxxxx@xxx borg (1.1-maint)]$ /home/xxxxxxx/.local/share/virtualenvs/c3430191e778c85/bin/borg -V
borg 1.1.11.dev95+g21bbac79
c3430191e778c85 [xxxxxxx@xxx borg (1.1-maint)]$ /home/xxxxxxx/.local/share/virtualenvs/c3430191e778c85/bin/borg check /srv/tmp/borg-clog/
ID: 3ea2b0db95b14f10307e295364737217d22cb3c87befc59bd724db7a0cce5385 rebuilt index: (21922, 66184646) committed index: <not found>
Was that the complete output of 1.1.11.dev95?
Yes. Tested again:
65de1674b53dc75 [xxxxxxxx@xxxx borg (1.1-maint)]$ $HOME/.local/share/virtualenvs/65de1674b53dc75/bin/borg check /srv/tmp/borg-clog/
ID: 3ea2b0db95b14f10307e295364737217d22cb3c87befc59bd724db7a0cce5385 rebuilt index: (21922, 66184646) committed index: <not found>
With check --v:
65de1674b53dc75 [xxxxxxx@xxxxx borg (1.1-maint)]$ $HOME/.local/share/virtualenvs/65de1674b53dc75/bin/borg check -v /srv/tmp/borg-clog/
Starting repository check
Starting repository index check
Index object count match.
ID: 3ea2b0db95b14f10307e295364737217d22cb3c87befc59bd724db7a0cce5385 rebuilt index: (21922, 66184646) committed index: <not found>
Completed repository check, no problems found.
Starting archive consistency check...
Analyzing archive 2019-11-03T22:12:04 (1/38)
Analyzing archive 2019-11-10T22:12:04 (2/38)
Analyzing archive 2019-11-17T22:12:04 (3/38)
Analyzing archive 2019-11-24T22:12:04 (4/38)
Analyzing archive 2019-11-28T22:12:05 (5/38)
Analyzing archive 2019-11-29T22:12:05 (6/38)
Analyzing archive 2019-11-30T22:12:04 (7/38)
Analyzing archive 2019-12-01T22:12:04 (8/38)
Analyzing archive 2019-12-02T22:12:04 (9/38)
Analyzing archive 2019-12-03T22:12:05 (10/38)
Analyzing archive 2019-12-04T22:12:05 (11/38)
Analyzing archive 2019-12-05T03:12:05 (12/38)
Analyzing archive 2019-12-05T04:12:05 (13/38)
Analyzing archive 2019-12-05T05:12:05 (14/38)
Analyzing archive 2019-12-05T06:12:05 (15/38)
Analyzing archive 2019-12-05T08:12:05 (16/38)
Analyzing archive 2019-12-05T09:12:05 (17/38)
Analyzing archive 2019-12-05T10:12:05 (18/38)
Analyzing archive 2019-12-05T11:12:05 (19/38)
Analyzing archive 2019-12-05T12:12:04 (20/38)
Analyzing archive 2019-12-05T13:12:05 (21/38)
Analyzing archive 2019-12-05T14:12:04 (22/38)
Analyzing archive 2019-12-05T15:12:05 (23/38)
Analyzing archive 2019-12-05T16:12:05 (24/38)
Analyzing archive 2019-12-05T17:12:04 (25/38)
Analyzing archive 2019-12-05T18:12:05 (26/38)
Analyzing archive 2019-12-05T19:12:05 (27/38)
Analyzing archive 2019-12-05T20:12:04 (28/38)
Analyzing archive 2019-12-05T21:12:04 (29/38)
Analyzing archive 2019-12-05T22:12:05 (30/38)
Analyzing archive 2019-12-06T01:12:05 (31/38)
Analyzing archive 2019-12-06T02:12:08 (32/38)
Analyzing archive 2019-12-06T03:12:05 (33/38)
Analyzing archive 2019-12-06T04:12:10 (34/38)
Analyzing archive 2019-12-06T05:12:05 (35/38)
Analyzing archive 2019-12-06T06:12:04 (36/38)
Analyzing archive 2019-12-06T08:12:05 (37/38)
Analyzing archive 2019-12-06T09:12:08 (38/38)
Archive consistency check complete, no problems found.
With check --v --debug (if it makes sense):
65de1674b53dc75 [xxxxxxx@xxx borg (1.1-maint)]$ $HOME/.local/share/virtualenvs/65de1674b53dc75/bin/borg check -v --debug /srv/tmp/borg-clog/
using builtin fallback logging configuration
35 self tests completed in 0.05 seconds
Starting repository check
Verified integrity of /srv/tmp/borg-clog/index.21927
Read committed index of transaction 21927
Segment transaction is 21927
Determined transaction is 21927
Found 8950 segments
Starting repository index check
Index object count match.
ID: 3ea2b0db95b14f10307e295364737217d22cb3c87befc59bd724db7a0cce5385 rebuilt index: (21922, 66184646) committed index: <not found>
Completed repository check, no problems found.
Starting archive consistency check...
Verified integrity of /srv/tmp/borg-clog/index.21927
TAM-verified manifest
Analyzing archive 2019-11-03T22:12:04 (1/38)
Analyzing archive 2019-11-10T22:12:04 (2/38)
Analyzing archive 2019-11-17T22:12:04 (3/38)
Analyzing archive 2019-11-24T22:12:04 (4/38)
Analyzing archive 2019-11-28T22:12:05 (5/38)
Analyzing archive 2019-11-29T22:12:05 (6/38)
Analyzing archive 2019-11-30T22:12:04 (7/38)
Analyzing archive 2019-12-01T22:12:04 (8/38)
Analyzing archive 2019-12-02T22:12:04 (9/38)
Analyzing archive 2019-12-03T22:12:05 (10/38)
Analyzing archive 2019-12-04T22:12:05 (11/38)
Analyzing archive 2019-12-05T03:12:05 (12/38)
Analyzing archive 2019-12-05T04:12:05 (13/38)
Analyzing archive 2019-12-05T05:12:05 (14/38)
Analyzing archive 2019-12-05T06:12:05 (15/38)
Analyzing archive 2019-12-05T08:12:05 (16/38)
Analyzing archive 2019-12-05T09:12:05 (17/38)
Analyzing archive 2019-12-05T10:12:05 (18/38)
Analyzing archive 2019-12-05T11:12:05 (19/38)
Analyzing archive 2019-12-05T12:12:04 (20/38)
Analyzing archive 2019-12-05T13:12:05 (21/38)
Analyzing archive 2019-12-05T14:12:04 (22/38)
Analyzing archive 2019-12-05T15:12:05 (23/38)
Analyzing archive 2019-12-05T16:12:05 (24/38)
Analyzing archive 2019-12-05T17:12:04 (25/38)
Analyzing archive 2019-12-05T18:12:05 (26/38)
Analyzing archive 2019-12-05T19:12:05 (27/38)
Analyzing archive 2019-12-05T20:12:04 (28/38)
Analyzing archive 2019-12-05T21:12:04 (29/38)
Analyzing archive 2019-12-05T22:12:05 (30/38)
Analyzing archive 2019-12-06T01:12:05 (31/38)
Analyzing archive 2019-12-06T02:12:08 (32/38)
Analyzing archive 2019-12-06T03:12:05 (33/38)
Analyzing archive 2019-12-06T04:12:10 (34/38)
Analyzing archive 2019-12-06T05:12:05 (35/38)
Analyzing archive 2019-12-06T06:12:04 (36/38)
Analyzing archive 2019-12-06T08:12:05 (37/38)
Analyzing archive 2019-12-06T09:12:08 (38/38)
Archive consistency check complete, no problems found.
Hmm, looks like your repo has 1 item more than your index.
My suspicion was that 1 index key is not in the repo and 1 repo key is not in the index (that could happen by corruption of key bits). But that seems to not being the case.
@bd830dbd7432 - is borg server running inside a VM or on bare metal? Can you describe your stack and backup process a bit more precisely (from the physical disk(s) to the server, and clients; including any volume managers, filesystem(s), mount options, etc..)? Nothing in your original post indicates what steps did you take to set up your back-end so it at least detects and reports ongoing or spreading corruption before corrupted data can be seen by consumers (borg).
Hi, @borkd.
qcow2 format. with VirtIO SCSI interface, cache mode writeback, with discard unmap enabled (VM runs periodic fstrim to discard unused blocks).ext4, using default mount options.Some VMs are backed-up nightly via scripts:
qcow2 image via Borg.borg check and borg prune after every backup.Some VMs are backed-up every 10 minutes, others every hour:
borg check and borg prune.Let me know if you need more information. Thanks!
"We have observed (and registered) corruptions since 01/2018."
Do you have logs/docs whether you changed the borg version in between it working correctly and not working correctly any more? If so, what version was it before/after?
We install/update Borg via Ansible, so we have the history on Git. Approximated dates (we roll updates spreading some days):
2019-05-23 - Borg 1.1.9 -> Borg 1.1.10
2019-03-09 - Borg 1.1.8 -> Borg 1.1.9
2019-01-29 - Borg 1.1.7 -> Borg 1.1.8
2018-08-28 - Borg 1.1.6 -> Borg 1.1.7
2018-06-18 - Borg 1.1.5 -> Borg 1.1.6
2018-04-24 - Borg 1.1.4 -> Borg 1.1.5
2018-01-02 - Borg 1.1.3 -> Borg 1.1.4 <-- first logged corruption after this
2017-12-18 - Borg 1.0.11 -> Borg 1.1.3
2017-09-27 - Borg 1.0.9 -> Borg 1.0.11
2017-01-23 - Borg 1.0.8 -> Borg 1.0.9
2016-11-29 - Started using Borg 1.0.8
It could be that one of the many possible components in the path is flaky or quietly failing. Mix of qcow2, writeback cache and ext4 on mdadm mirror while fast, does not offer much in terms of ensuring data integrity once luck runs out.
If you have a chance make the machine running borg server a standby redirecting backups to a fresh, clean repo hosted elsewhere (different power circuit if you can) and check if the errors follow. If you can, do setup the new repo on ZFS, and use writethrough or direct access.
Are we sure that the corruption began between 1.1.3 and 1.1.4 and not between 1.0.11 and 1.1.3?
This is the first registered corruption that we got. Borg version 1.1.4. I don't remember why Borg downloaded the chunks cache, maybe because before running check --repair (which solved the problem), I deleted ~/.cache/borg on the client.
+ nice ionice -c3 borg create -vsx --files-cache=mtime,size --show-rc --one-file-system --exclude lost+found --list --filter AME -C lzma [email protected]:mysql::1516960621 .
Synchronizing chunks cache...
Archives: 37, w/ cached Idx: 36, w/ outdated Idx: 0, w/o cached Idx: 1.
Reading cached archive chunk index for 1516413422 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516859822 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516845422 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516935422 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516877822 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516910222 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516881421 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1515981421 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516953421 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516586222 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516939021 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516946222 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516924622 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516931822 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516906621 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516499822 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516942622 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516917422 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516327022 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516892222 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516885022 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516672622 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516870622 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516867021 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516903022 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516928221 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516949822 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516759022 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1515376622 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516921022 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516874221 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516899421 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516913821 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516888622 ...
Merging into master chunks index ...
Fetching and building archive index for 1516957021 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516895821 ...
Merging into master chunks index ...
Reading cached archive chunk index for 1516863422 ...
Merging into master chunks index ...
Done.
A ibdata1
A ib_logfile0
A xxxxxxx_assembly2/xxxxxxxxx.ibd
A xxxxxxx_assembly2/xxxxxxxxxSummaries.ibd
A xxxxxxx_assembly2/config.ibd
A xxxxxxx_assembly/xxxxxxxxxSummaries.ibd
A xxxxxxx_assembly/config.ibd
A xxxxxxx_assembly/xxxxxxx.ibd
A xxxxxxx_assembly/xxxxxxxxx.ibd
A mysql/innodb_table_stats.ibd
A mysql/innodb_index_stats.ibd
Traceback (most recent call last):
File "borg/remote.py", line 248, in serve
File "borg/repository.py", line 423, in commit
File "borg/repository.py", line 770, in compact_segments
AssertionError: Corrupted segment reference count - corrupted index or hints
Borg server: Platform: Linux haskell 3.16.0-5-amd64 #1 SMP Debian 3.16.51-3+deb8u1 (2018-01-08) x86_64
Borg server: Linux: debian 8.10
Borg server: Borg: 1.1.4 Python: CPython 3.5.4
Borg server: PID: 14173 CWD: /srv/borg/root@vm-mysql-1
Borg server: sys.argv: ['borg', 'serve', '--restrict-to-path', '/srv/borg/root@vm-mysql-1']
Borg server: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --info'
Platform: Linux vm-mysql-1 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04) x86_64
Linux: debian 9.3
Borg: 1.1.4 Python: CPython 3.5.4
PID: 21835 CWD: /mnt
sys.argv: ['borg', 'create', '-vsx', '--files-cache=mtime,size', '--show-rc', '--one-file-system', '--exclude', 'lost+found', '--list', '--filter', 'AME', '-C', 'lzma', '[email protected]:mysql::1516960621', '.']
SSH_ORIGINAL_COMMAND: None
terminating with error status, rc 2
https://github.com/borgbackup/borg/blob/master/docs/changes.rst#pre-114-potential-data-corruption-issue
Just to get that off the table (if it does not apply): did you have that problem while running 1.1.3?
I also had a look at the changelog for 1.1.4, but nothing there that seems to have to do with your problem:
https://github.com/borgbackup/borg/blob/master/docs/changes.rst#version-114-2017-12-31
https://github.com/borgbackup/borg/blob/master/docs/changes.rst#pre-114-potential-data-corruption-issue
Just to get that off the table (if it does not apply): did you have that problem while running 1.1.3?
I have no record of that on our issue tracking system.
However, we only used Borg 1.1.3 for a few days:
2018-01-02 - Borg 1.1.3 -> Borg 1.1.4
2017-12-18 - Borg 1.0.11 -> Borg 1.1.3
Another RAM test inside Borg server VM. The VM has 2GB RAM, and for the test, it was allocated 1.6GB. It ran for 2 hours without detecting problems:
root@vm-borg-1:~# free
total used free shared buff/cache available
Mem: 2042904 92804 1829812 9200 120288 1809976
Swap: 2094076 9692 2084384
root@vm-borg-1:~# timeout 2h memtester 1600m
memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).
pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 1600MB (1677721600 bytes)
got 1600MB (1677721600 bytes), trying mlock ...locked.
Loop 1:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
Walking Zeroes : ok
8-bit Writes : ok
16-bit Writes : ok
Loop 2:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
Walking Zeroes : ok
8-bit Writes : ok
16-bit Writes : ok
(...)
Loop 17:
Stuck Address : ok
Random Value : \root@vm-borg-1:~#
I created a script for "stressing" Borg backups, with the goal of reproducing the corruption problem. The script is running on my work machine, and Borg repo is on the same VM that stores production Borg repos, which are having corruption problems.
The script is running for 17 hours straight without a sign of corruption:
#!/bin/bash
set -e -u -x -o pipefail
DIR="/srv/tmp/borg-stress"
REPO="[email protected]:borg-stress"
mkdir -p "$DIR"
borg init -e none "$REPO" || true
trap date EXIT # print date on exit
while :; do
# A random number between 1 and 100000, for using with file
# names. Thus, we well have a maximum of 100000 files in the
# test repo.
set +x
# Creating/modifying 10 files at a time for backup.
for _ in $(seq 10); do
random=$(shuf -i 1-100000 -n 1)
# Writing random bytes, with $random size.
dd if=/dev/urandom of="${DIR}/${random}" bs="${random}" count=1 2>/dev/null
done
set -x
borg create -vps -C zstd "${REPO}::{now}" "$DIR"
# Probability of running on 1 in 15 (sort of?).
r=$(( RANDOM % 15 ))
if (( r == 0 )); then
borg check -vp "$REPO"
borg prune -vp --list --keep-minutely 15 "$REPO" # churn!
fi
done
I made some tweaks on the script, and it ran for three days. Couldn't reproduce repository corruption.
#!/bin/bash
set -e -u -x -o pipefail
DIR="/srv/tmp/borg-stress"
REPO="[email protected]:borg-stress"
mkdir -p "$DIR"
borg init -e none "$REPO" || true
trap date EXIT # print date on exit
while :; do
# A random number between 1 and 500000, for using with file
# names. Thus, we well have a maximum of 500000 files in the
# test repo.
set +x
for i in $(seq 10000); do
random=$(shuf -i 1-500000 -n 1)
dd if=/dev/urandom of="${DIR}/${random}" bs="$((i * 2))" count=1 2>/dev/null
done
set -x
borg create -vps -C none "${REPO}::{now}" "$DIR"
# Probability of running on 1 in 15 (sort of?).
r=$(( RANDOM % 15 ))
if (( r == 0 )); then
borg check -vp "$REPO"
borg prune -vp --list --keep-minutely 30 "$REPO" # churn!
fi
done
Recreated Borg server VM from scratch on another physical server, already in production and without any historical issues. Let's see how it goes.
I'm seeing a very similar issue here. I'm still trying to figure out if there is some hardware issue but I already tried #4831.
Output from Borg ~1.0.9~ 1.1.9 (as shipped with Debian 10 (Buster):
$ borg check --verify --debug ~/borg/host03d.old/
using builtin fallback logging configuration
37 self tests completed in 0.07 seconds
Starting repository check
Verified integrity of /home/user/borg/host03d.old/index.12820
Read committed index of transaction 12820
Segment transaction is 12820
Determined transaction is 12820
Found 5591 segments
finished segment check at segment 12820
Starting repository index check
Index mismatch for key b'C\xb9\x06P.\x948\xfb\t0r\xd2\x0c\xed:l]\xa1\xa3l\xb3\x85t\x96\x8b\x1b;\xb2\xb6\xab\x99\x00'. (12703, 201309690) != (-1, -1)
Finished full repository check, errors found.
Output using the 1.1-maint branch:
$ ~/.local/bin/borg --version
borg 1.1.11.dev107+g0d9e88d9
$ ~/.local/bin/borg check --verify --debug ~/borg/host03d.old/
using builtin fallback logging configuration
35 self tests completed in 0.08 seconds
Starting repository check
Verified integrity of /home/user/borg/host03d.old/index.12820
Read committed index of transaction 12820
Segment transaction is 12820
Determined transaction is 12820
Found 5591 segments
Starting repository index check
Index object count match.
ID: 43b906502e9438fb093072d20ced3a6c5da1a36cb38574968b1b3bb2b6ab9900 rebuilt index: (12703, 201309690) committed index: <not found>
Completed repository check, no problems found.
Starting archive consistency check...
Verified integrity of /home/user/borg/host03d.old/index.12820
Starting cryptographic data integrity verification...
Repo/Chunks index object count vs. segment files object count mismatch.
Repo/Chunks index: 168180 objects != segment files: 168179 objects
Finished cryptographic data integrity verification, verified 168179 chunks with 0 integrity errors.
TAM-verified manifest
Analyzing archive main-2019-11-24T23:21:02 (1/117)
Analyzing archive main-2019-12-01T23:21:01 (2/117)
Analyzing archive main-2019-12-08T23:21:02 (3/117)
Analyzing archive vm_100-2019-12-09T19:47:15 (4/117)
Analyzing archive vm_102-2019-12-10T19:30:02 (5/117)
Analyzing archive vm_103-2019-12-11T20:12:55 (6/117)
…
Analyzing archive main-2020-01-03T15:21:01 (117/117)
Archive consistency check complete, no problems found.
There doesn't appear to be any more information in the output and the check is now considered succeeded. Was that intended?
I modified Borg to write the committed index and the recreated index to disk. When I diff the two, there is no difference yet still a missing item is reported:
user@tocco-borg:/tmp$ ~/.local/bin/borg check --verify --debug ~/borg/host03d.old/
using builtin fallback logging configuration
35 self tests completed in 0.08 seconds
Starting repository check
Verified integrity of /home/user/borg/host03d.old/index.12820
Read committed index of transaction 12820
Segment transaction is 12820
Determined transaction is 12820
Found 5591 segments
Starting repository index check
Index object count match.
ID: 43b906502e9438fb093072d20ced3a6c5da1a36cb38574968b1b3bb2b6ab9900 rebuilt index: (12703, 201309690) committed index: <not found>
Completed repository check, no problems found.
Starting archive consistency check...
…
user@tocco-borg:/tmp$ ls -lh *.idx
-rw------- 1 user user 14M Jan 4 17:10 disk.idx
-rw------- 1 user user 14M Jan 4 17:10 mem.idx
user@tocco-borg:/tmp$ sort disk.idx >disk.idx.sorted
user@tocco-borg:/tmp$ sort mem.idx >mem.idx.sorted
user@tocco-borg:/tmp$ diff disk.idx.sorted mem.idx.sorted
user@tocco-borg:/tmp$
I also tested a small change that copied the indexes into a new dict. After that no more error was reported:
$ ~/.local/bin/borg check --verify --debug ~/borg/host03d.old/
using builtin fallback logging configuration
35 self tests completed in 0.08 seconds
Starting repository check
Verified integrity of /home/user/borg/host03d.old/index.12820
Read committed index of transaction 12820
Segment transaction is 12820
Determined transaction is 12820
Found 5591 segments
Starting repository index check
Index object count match.
Completed repository check, no problems found.
Starting archive consistency check...
Verified integrity of /home/user/borg/host03d.old/index.12820
Starting cryptographic data integrity verification...
…
Either I'm missing something here or something is going horribly wrong.
@pgerber Debian buster has borgbackup 1.1.9, not 1.0.9.
If you meant that, please edit your post.
ID: 43b906502e9438fb093072d20ced3a6c5da1a36cb38574968b1b3bb2b6ab9900
rebuilt index: (12703, 201309690) committed index: <not found>
Completed repository check, no problems found.
That looks incorrect. I'll have a look at the code...
Starting archive consistency check...
...
Repo/Chunks index object count vs. segment files object count mismatch.
Repo/Chunks index: 168180 objects != segment files: 168179 objects
...
Archive consistency check complete, no problems found.
Also a bit strange, but might be correct if that one chunk is not used by any archive.
/home/user/borg/host03d.old/index.12820 could you give that file to me for further analysis?
The repo index does not contain private data btw, it just maps ID -> (segment, offset). No filenames, no file contents, no file meta data in there.
Your "small change" is interesting, should behave the same as before, but does not.
I'ld like to reproduce that with your index file and dig in a bit deeper to see whether that is a bug, a corrupted hash table or a hw issue.
BTW, if you could keep the repo (or a copy of it in "unrepaired" state) that produces this strange effect, that might be useful in case I can not reproduce / find the issue just with the index.
@pgerber Debian buster has borgbackup 1.1.9, not 1.0.9.
If you meant that, please edit your post.
Corrected.
/home/user/borg/host03d.old/index.12820 could you give that file to me for further analysis?
The repo index does not contain private data btw, it just maps ID -> (segment, offset). No filenames, no file contents, no file meta data in there.
The index as requested: index.12820.xz.
BTW, if you could keep the repo (or a copy of it in "unrepaired" state) that produces this strange effect, that might be useful in case I can not reproduce / find the issue just with the index.
Will do.
Hello!
Unfortunately, we had another backup corruption on 2019-12-24, three days after reinstalling Borg VM server on another physical server (https://github.com/borgbackup/borg/issues/4829#issuecomment-568184376):
Creating archive at "[email protected]:prometheus::{now}"
A wal/00006332
A wal/00006331
Traceback (most recent call last):
File "borg/remote.py", line 247, in serve
File "borg/repository.py", line 442, in commit
File "borg/repository.py", line 789, in compact_segments
AssertionError: Corrupted segment reference count - corrupted index or hints
Borg server: Platform: Linux vm-borg-1 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64
Borg server: Linux: debian 10.2
Borg server: Borg: 1.1.10 Python: CPython 3.5.7 msgpack: 0.5.6
Borg server: PID: 28652 CWD: /srv/borg/root@vm-prometheus-1
Borg server: sys.argv: ['borg', 'serve', '--restrict-to-path', '/srv/borg/root@vm-prometheus-1']
Borg server: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --info'
Platform: Linux vm-prometheus-1 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64
Linux: debian 10.1
Borg: 1.1.10 Python: CPython 3.5.7 msgpack: 0.5.6
PID: 10113 CWD: /mnt
sys.argv: ['borg', 'create', '-vsx', '--files-cache=mtime,size', '--show-rc', '--one-file-system', '--exclude', 'lost+found', '--list', '--filter', 'AME', '-C', 'zstd', '[email protected]:prometheus::{now}', '.']
SSH_ORIGINAL_COMMAND: None
terminating with error status, rc 2
No more corruption after this day, until now.
OK, finally having some time to look into this - working on this now...
OK, just as an update: I have played with the index file given by @pgerber.
That hashtable has one entry that is found by NSIndex.iteritems(), but not by NSIndex.__getitem__ or NSIndex.__contains__. This is why copying all stuff into a python dict fixes the issue.
The reason why the entry is not found is that hashindex_lookup starts at index 163136, finds an empty bucket at that index and gives up.
That is correct behavior of the algorithm. Finding an empty bucket while looking for a bucket with a matching key means that we are finished, because either we should find a used bucket with another key (hash function collision), a deleted bucket (which once was used for a collision, but then was deleted - a tombstone entry) or a bucket with the correct key.
The bucket with the key we search for is at index 163139.
163136 b'0000000000000000000000000000000000000000000000000000000000000000' b'ffffffff' b'00000000'
163137 b'dbdc7a7f2935228de94cc586d6892df219ffeea1e4019e05d10913fe82fa005f' b'5d2a0000' b'a598fc0a'
163138 b'4fea6a91e72554fc70722f86c80d8862ddd9b747e25c86327346121181ef43ca' b'feffffff' b'58f3401c'
163139 b'43b906502e9438fb093072d20ced3a6c5da1a36cb38574968b1b3bb2b6ab9900' b'9f310000' b'fabdff0b'
Now have to find out the root cause why the index file looks like that.
@bd830dbd7432 @pgerber thanks a lot for finding this and helping here in this ticket.
I guess I have found the root cause and fixed it in #4967.
The problem was not that the hashindex lookup was broken in the code, but that the hashtable was brought into an invalid state by hashindex_set writing the new entry to an incorrect index in the hashtable. So it was there (and thus found by iteration), but could not get discovered by lookup.
This happened under these conditions:
Consequences of bug for the hashtables / borg indexes:
key in hashtable) or lookup (hashtable[key]) fails although entry exists, iteration will still find it.So, pretty bad as far as the hashtables / indexes are concerned.
Affects:
But there are good news:
Affected branches / releases:
fd06497 introduced the bug into 1.1-maint branch - it affects all borg 1.1.x since 1.1.0b4.
fd06497 introduced the bug into master branch - it affects all borg 1.2.0 alpha releases.
c5cd882 introduced the bug into 1.0-maint branch - it affects all borg 1.0.x since 1.0.11rc1.
The bug was fixed by:
701159a fixes the bug in 1.1-maint branch - will be released with borg 1.1.11.
fa63150 fixes the bug in master branch - will be released with borg 1.2.0a8.
7bb90b6 fixes the bug in 1.0-maint branch. Branch is EOL, so no new release is planned right now.
What's the best course to remedy this / minimize impact in existing repositories? Delete caches and repository index, then borg check?
I think deleting caches may be necessary (because iirc check doesn't check cached indices and it really doesn't make much sense to do so, considering that rebuilding them is faster). Repository index should be fixed by a borg check run, since it is supposed to force the index to match segment contents. So deleting repository index is probably unnecessary.
Labelled it "corruption" because of the mentioned indexes getting corrupted.
It does not corrupt repository segment files, though (this is where your backup data is).
@enkore yes, guess we need to:
borg delete --cache-only repoborg check --repair --repository-only repoGuess that is the minimum required to bring everything into a for-sure valid state.
Might not always be required, but better play safe.
And maybe a borg check without --repair might be useful to find other, unrelated issues first.
@enkore do you think we run into some problem due to PUT x .... PUT x done against the repo (for both x being the same id) instead of the usual PUT x ... DEL x ... PUT x?
No, set/set/set would just overwrite in the same slot. That's the !(idx<0) path.
Note: issue has been fixed in master and 1.1-maint.
PR for 1.0-maint is there: #4968 (was merged into 1.0-maint meanwhile)
Now the question is of course why one particular user had this happen so frequently, and most others never saw it even once. I suspect this is due to BORG_UNKNOWN_UNENCRYPTED_REPO_ACCESS_IS_OK, i.e. using unencrypted repositories. Because this means that the same data will produce the same hashes and thus the same idx % M in each hash table, the issue will come back with newly created repositories and be somewhat reproducible just by having a few pieces of data in the backup set that collide to problematic idx % M (in the context of this issue). Using encrypted repositories very thoroughly randomizes idx % M and makes it entirely independent of the data. This effectively removes "how many repositories are used" as a dampening factor. Backups in their setup were run a lot as well (hourly to every ten minutes).
The next question to answer is the actual impact of this. If this could have led to permanent data loss, it's a big deal. If this just leads to errors but no data loss, not so much.
That's just a one minute analysis from the top of my head. A more thorough analysis may be warranted. But overall I think this bug may cause loud failures (as described in this thread), but no data loss.
Now the question is of course why one particular user had this happen so frequently, and most others never saw it even once. I suspect this is due to BORG_UNKNOWN_UNENCRYPTED_REPO_ACCESS_IS_OK, i.e. using unencrypted repositories. Because this means that the same data will produce the same hashes and thus the same idx % M in each hash table, the issue will come back with newly created repositories and be somewhat reproducible just by having a few pieces of data in the backup set that collide to problematic idx % M (in the context of this issue). Using encrypted repositories very thoroughly randomizes idx % M and makes it entirely independent of the data. This effectively removes "how many repositories are used" as a dampening factor. Backups in their setup were run a lot as well (hourly to every ten minutes).
I've seen curruption several times and all repositories in use here are using encryption mode repokey or repokey-blake2. We've 26 repositories in total and most of them are updated every 2 hours to add a backup.
- What is effective effect of this? Random chunks fail to look up in indices
- What can happen if random chunks fail to look up in indices?
Here is a collection of corruption errors I've seen so far. Perhaps this helps figuring out possible consequences of the index corruption. That they are related to the issue fixed here is mostly just an assumption:
Corruption 1
Both errors are from the same repo. The issue was fixed using:
rm -rf /root/.cache/borg
borg check --info --repair
log 1:
root Remote: Object with key 4fe9c16ac6552ad50cb5fa6c3c677d17c52475329b3bf6dd0161e84e0885d6a3 not found in repository /backup/borg_repos/db01a.
root Remote: Traceback (most recent call last):
Remote: File "/usr/lib/python3/dist-packages/borg/remote.py", line 248, in serve
Remote: res = f(**args)
Remote: File "/usr/lib/python3/dist-packages/borg/repository.py", line 1046, in get
Remote: raise self.ObjectNotFound(id, self.path) from None
Remote: borg.repository.Repository.ObjectNotFound: Object with key 4fe9c16ac6552ad50cb5fa6c3c677d17c52475329b3bf6dd0161e84e0885d6a3 not found in repository /backup/borg_repos/db01a.
Remote: Remote: Platform: Linux backup01 4.15.0-70-generic #79-Ubuntu SMP Tue Nov 12 10:36:11 UTC 2019 x86_64
Remote: Linux: Ubuntu 18.04 bionic
Remote: Borg: 1.1.5 Python: CPython 3.6.8
Remote: PID: 38323 CWD: /backup/borg_repos/db01a
Remote: sys.argv: ['/usr/bin/borg', 'serve', '--restrict-to-path', '/backup/borg_repos/db01a']
Remote: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --error'
Remote: Object with key 82 not found in repository [email protected]:~::main-{now:%Y-%m-%dT%H:%M:%S}.
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/borg/remote.py", line 548, in get_many
yield self.caching_repo.get(key)
File "/usr/lib/python3/dist-packages/borg/repository.py", line 533, in get
raise self.ObjectNotFound(id_, self.path) from None
borg.repository.Repository.ObjectNotFound: Object with key b'O\xe9\xc1j\xc6U*\xd5\x0c\xb5\xfal<g}\x17\xc5$u2\x9b;\xf6\xdd\x01a\xe8N\x08\x85\xd6\xa3' not found in repository /tmp/borg-tmpwiheuhx5.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 2181, in main
exit_code = archiver.run(args)
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 2117, in run
return set_ec(func(args))
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 107, in wrapper
return method(self, args, repository=repository, **kwargs)
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 326, in do_create
with Cache(repository, key, manifest, do_files=args.cache_files, lock_wait=self.lock_wait) as cache:
File "/usr/lib/python3/dist-packages/borg/cache.py", line 103, in __init__
self.sync()
File "/usr/lib/python3/dist-packages/borg/cache.py", line 412, in sync
self.chunks = create_master_idx(self.chunks)
File "/usr/lib/python3/dist-packages/borg/cache.py", line 374, in create_master_idx
archive_chunk_idx = fetch_and_build_idx(archive_id, repository, self.key)
File "/usr/lib/python3/dist-packages/borg/cache.py", line 327, in fetch_and_build_idx
for item_id, chunk in zip(archive[b'items'], repository.get_many(archive[b'items'])):
File "/usr/lib/python3/dist-packages/borg/remote.py", line 550, in get_many
for key_, data in repository_iterator:
File "/usr/lib/python3/dist-packages/borg/remote.py", line 457, in get_many
for resp in self.call_many('get', [(id_,) for id_ in ids], is_preloaded=is_preloaded):
File "/usr/lib/python3/dist-packages/borg/remote.py", line 360, in call_many
handle_error(error, res)
File "/usr/lib/python3/dist-packages/borg/remote.py", line 346, in handle_error
raise Repository.ObjectNotFound(res[0], self.location.orig)
borg.repository.Repository.ObjectNotFound: Object with key 82 not found in repository [email protected]:~::main-{now:%Y-%m-%dT%H:%M:%S}.
Platform: Linux db01a 4.4.0-169-generic #198-Ubuntu SMP Tue Nov 12 10:38:00 UTC 2019 x86_64
Linux: Ubuntu 16.04 xenial
Borg: 1.0.12 Python: CPython 3.5.2
PID: 193515 CWD: /root
sys.argv: ['/usr/bin/borg', 'create', '--error', '--compression', 'lz4', '--lock-wait', '14400', '--', '[email protected]:~::main-{now:%Y-%m-%dT%H:%M:%S}', '/etc', '/root', '/boot/grub', '/etc', '/root', '/usr/local', '/var/backups', '/var/log', '/var/spool']
SSH_ORIGINAL_COMMAND: None
log 2:
root Remote: AssertionError Exception in RPC call
root Remote: Traceback (most recent call last):
Remote: File "/usr/lib/python3/dist-packages/borg/remote.py", line 248, in serve
Remote: res = f(**args)
Remote: File "/usr/lib/python3/dist-packages/borg/repository.py", line 430, in commit
Remote: self.compact_segments()
Remote: File "/usr/lib/python3/dist-packages/borg/repository.py", line 777, in compact_segments
Remote: assert segments[segment] == 0, 'Corrupted segment reference count - corrupted index or hints'
Remote: AssertionError: Corrupted segment reference count - corrupted index or hints
Remote: Remote: Platform: Linux backup01 4.15.0-70-generic #79-Ubuntu SMP Tue Nov 12 10:36:11 UTC 2019 x86_64
Remote: Linux: Ubuntu 18.04 bionic
Remote: Borg: 1.1.5 Python: CPython 3.6.8
Remote: PID: 32104 CWD: /backup/borg_repos/db01a
Remote: sys.argv: ['/usr/bin/borg', 'serve', '--restrict-to-path', '/backup/borg_repos/db01a']
Remote: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --error'
Remote: ('Remote Exception (see remote log for the traceback)', 'AssertionError')
Platform: Linux db01a 4.4.0-169-generic #198-Ubuntu SMP Tue Nov 12 10:38:00 UTC 2019 x86_64
Linux: Ubuntu 16.04 xenial
Borg: 1.0.12 Python: CPython 3.5.2
PID: 183085 CWD: /root
sys.argv: ['/usr/bin/borg', 'create', '--error', '--compression', 'lz4', '--lock-wait', '14400', '--', '[email protected]:~::main-{now:%Y-%m-%dT%H:%M:%S}', '/etc', '/root', '/boot/grub', '/etc', '/root', '/usr/local', '/var/backups', '/var/log', '/var/spool']
SSH_ORIGINAL_COMMAND: None
Corruption 2
Fixed by removing local cache.
Local Exception
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 4455, in main
exit_code = archiver.run(args)
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 4387, in run
return set_ec(func(args))
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 152, in wrapper
return method(self, args, repository=repository, cache=cache_, **kwargs)
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 1443, in do_info
return self._info_archives(args, repository, manifest, key, cache)
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 1463, in _info_archives
info = archive.info()
File "/usr/lib/python3/dist-packages/borg/archive.py", line 386, in info
stats = self.calc_stats(self.cache)
File "/usr/lib/python3/dist-packages/borg/archive.py", line 503, in calc_stats
unique_csize = archive_index.stats_against(cache.chunks)[3]
File "src/borg/hashindex.pyx", line 408, in borg.hashindex.ChunkIndex.stats_against (src/borg/hashindex.c:6302)
ValueError: stats_against: key contained in self but not in master_index.
Platform: Linux host03d 4.15.18-18-pve #1 SMP PVE 4.15.18-44 (Wed, 03 Jul 2019 11:19:13 +0200) x86_64
Linux: debian 9.9 Borg: 1.1.9 Python: CPython 3.5.3
PID: 23858 CWD: /root
sys.argv: ['/usr/bin/borg', 'info', '--glob', '*', '--lock-wait', '7200', '--last', '300', '--json', '[email protected]:~']
SSH_ORIGINAL_COMMAND: None
Curruption 3
Fixed via repair command.
What interesting here is that these two errors shown below were reported simultaneously. Yet the two errors are for two distinct repositories located on two servers. The last verify was a week earlier and succeed which means the corruption happened within the same week.
$ borg-1 check --verify
Remote: Index mismatch for key b'\xfa\xf6Hs\x0b\x0f\x9bv\xbc\x90\xe9\xd2\xcd\xd5\xd5\x94\xe4`\x16"}8Q!]\xcb\xae\x12S\x0c\xf2i'. (11779, 241957887) != (-1, -1)
Remote: Completed repository check, errors found.
$ borg-2 check --verify
Remote: Index mismatch for key b'C\xb9\x06P.\x948\xfb\t0r\xd2\x0c\xed:l]\xa1\xa3l\xb3\x85t\x96\x8b\x1b;\xb2\xb6\xab\x99\x00'. (12703, 201309690) != (-1, -1)
Remote: Completed repository check, errors found.
Curruption 4
Object with key 0b94abb6e7c65a2e7d770ab38656de1ab095371f660b1555f35ff72c521ee601 not found in repository [email protected]:~.
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 4253, in main
exit_code = archiver.run(args)
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 4185, in run
return set_ec(func(args))
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 152, in wrapper
return method(self, args, repository=repository, **kwargs)
File "/usr/lib/python3/dist-packages/borg/archiver.py", line 1568, in do_prune
progress=args.progress).delete(stats, forced=args.forced)
File "/usr/lib/python3/dist-packages/borg/archive.py", line 821, in delete
while fetch_async_response(wait=True) is not None:
File "/usr/lib/python3/dist-packages/borg/archive.py", line 768, in fetch_async_response
return self.repository.async_response(wait=wait)
File "/usr/lib/python3/dist-packages/borg/remote.py", line 967, in async_response
for resp in self.call_many('async_responses', calls=[], wait=True, async_wait=wait):
File "/usr/lib/python3/dist-packages/borg/remote.py", line 793, in call_many
handle_error(unpacked)
File "/usr/lib/python3/dist-packages/borg/remote.py", line 750, in handle_error
raise Repository.ObjectNotFound(args[0].decode(), self.location.orig)
borg.repository.Repository.ObjectNotFound: Object with key 0b94abb6e7c65a2e7d770ab38656de1ab095371f660b1555f35ff72c521ee601 not found in repository [email protected]:~.
Platform: Linux backup01 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64
Linux: Ubuntu 18.04 bionic
Borg: 1.1.5 Python: CPython 3.6.9
PID: 5286 CWD: /root
sys.argv: ['/usr/bin/borg', 'prune', '[email protected]:~', '--lock-wait', '14400', '--keep-daily', '7', '--keep-weekly', '4', '--prefix', 'postgres-']
SSH_ORIGINAL_COMMAND: None
I've come up with a scenario in which this bug causes data loss:
Note that this only works if the same cache is used for 1, 2 and 3, and if there is no hashindex_resize between 1 and 2, because that repopulates the new from iteration, restoring "lost" keys. The same cache has to be used for 3 as well, because a remote synchronized cache isn't affected since it is built from scratch using insertions only.
Uuh, right. I'll add something to the changelog to advise everybody to rebuild the repo index and chunks cache.
Will be a bit painful for bigger repos, but better than taking risks.
Rebuilding the chunks cache will also fix the refcounts.
Worth pointing out that I can't see how this could cause undetectable data loss (which makes sense, because this issue only affects metadata around the object graph), so if borg check doesn't complain, nothing has been lost.
Everybody please review: https://github.com/borgbackup/borg/pull/4989/files
Something incorrect / something missing?
It's clear for me.
@ThomasWaldmann, glad you found the cause and came with a fix. Thank you very much.
After Borg 1.1.11 is released, I will upgrade clients and servers, and will report back after some time.
The advisory has a few markup issues, it is easier to read as raw markup:
https://raw.githubusercontent.com/borgbackup/borg/1.1.11/docs/changes.rst
See "Pre-1.1.11 potential index corruption / data loss issue" there.
Just a quick feedback after 2 months using Borg 1.1.11 -- we didn't experience any corruption issues. Backups, checks and restores are working smoothly. Thanks again, @ThomasWaldmann!
It's not completely clear from the docs, are we advised to run borg delete cache-only and borg check repair _unconditionally_, or only if there are signs of corruption (such as a failed readonly borg check, or backup errors)?
I ask because of the warning when running borg check repair on 1.1.11 - it's not clear what level of risk is present in running check repair vs not
'check --repair' is an experimental feature that might result in data loss.
Type 'YES' if you understand this and want to continue:
Well, there is a "you might be affected if ..." section with some typical errors occurring due to the issue.
If you are pretty sure you never had any such (or other strange) issue and doing a check --repair is super expensive for you (e.g. due to repo size or speed), you could just not do that.
If is is not expensive, I'ld just unconditionally do it.
The repair warning message is there because we can't be sure repair can deal with all the issues someone might have with their repo. Usually it works, but no guarantees.
@bd830dbd7432 Sorry to hihack this thread, but could you by any chance share your LVM snapshot code +launch borg code as a gist ?
I am currently working only with qcow2 disks, but this seems interesting to me.
My goal is to use borg to read the lvm snapshot block device as file (producing a raw file backup), so full disk backups could be done for LVM backed VMs, including boot sector etc.
Most helpful comment
Just a quick feedback after 2 months using Borg 1.1.11 -- we didn't experience any corruption issues. Backups, checks and restores are working smoothly. Thanks again, @ThomasWaldmann!