borg 1.1.4 hangs trying to do any actions with repo

Created on 26 Feb 2018 · 39Comments · Source: borgbackup/borg

borg 1.1.4 (from apt ppa) on an Ubuntu 16.04 server. I have a repo about 6 months old that suddenly stopped backing up (would get errors about timeouts) so I tried to work with the repo manually and I am unable to run any commands on the repo (list, info, etc), the repo is not locked (tried break-lock JIC) and the process itself is sitting doing nothing. There is only a couple files open in the borg repo (data files), and the opened files do not change. The files data it is looking at/frozen on have no problem with reading/writing. I also have other servers backing up to this network share (each server has it's own borg repo, samba share is from a MyBookLive on the local network mounted locally to /mnt/MyBookLiveInt) and none of the other systems have this issue. Running borg with a -v gives no additional details, it will only ask for passphrase for repo then just sit there at a blinking cursor.

I can try to delete those data files to see what it does but wanted to get some input first on how I should proceed with debugging. I can recreate the repo (luckily I don't need it currently or it would be completely useless) but this hanging problem could be very dangerous if unsolved (and happens when a backup is actually needed)

Source

DaSpawn

Most helpful comment

OK, so borg break-lock is likely ok (maybe we could add a note to docs about the cache lock breaking only working when invoked from the client machine).

What has to be checked is the silent behaviour of borg when there is no repo lock, but a cache lock.

ThomasWaldmann on 1 Mar 2018

👍2

All 39 comments

For interactive usage, make sure output (stdout, stderr) is not redirected, so you see it on your screen in case borg is asking something and waiting for your reply.

For more logging output: borg create --debug --list ...

Please edit your above post to clarify "network share" (samba share? windows share? nfs?).
Also make sure you restart your samba / windows / nfs server to make sure it's not the file server causing strange issues.

As you backup to that network share, I assume you just backup into the local mountpoint of that share and you do not use borg in client/server mode (ssh repo url)?

That ubuntu server is directly installed on hardware? What brand/type of network card does it use?
What is the read speed in MB/s if you just copy files from the repo to a local directory (or /dev/null)?

Don't delete files in your repo.

ThomasWaldmann on 27 Feb 2018

samba share is from a MyBookLive on the local network mounted locally to /mnt/MyBookLiveInt, been restarted and disk checked already.

I just tried:
borg create --debug --list /mnt/MyBookLiveInt/borgBackups::manual-2018-02-27 /etc
using builtin fallback logging configuration
35 self tests completed in 0.16 seconds
Enter passphrase for key /mnt/MyBookLiveInt/borgBackups:
TAM not found and not required

then it hangs there, same 2 data files open only. You are correct it is backing up to local mount point. I have 2 others servers backing up to same network share no problem, and this has been working for almost a year on the one server; just started failing (timing out) recently (possibility of power outage causing a data file corruption is all I can think of since nothing in the setup has changed)

The same server has also been backing up to another borg repo that is done via ssh to a local borg instance and that is still working no problem. Hardware is bonded intel gig cards (and runs multiple network services, if was even a hint of network problems my phone would be off the hook)

The 2 files it is stuck on:
dd if=/mnt/MyBookLiveInt/borgBackups/data/39/399433 of=/dev/null
9104+1 records in
9104+1 records out
4661333 bytes (4.7 MB, 4.4 MiB) copied, 2.43554 s, 1.9 MB/s

dd if=/mnt/MyBookLiveInt/borgBackups/data/39/399436 of=/dev/null
2861+1 records in
2861+1 records out
1464910 bytes (1.5 MB, 1.4 MiB) copied, 0.769619 s, 1.9 MB/s

I tried from another server and that one gets 1.2MB/s for the same files (MyBookLive is not fast, but good enough)

DaSpawn on 27 Feb 2018

This may have been a borg cache issue, when the job was sitting there I hit ctrl+c a couple times and I got a lock error within the /root/.cache/borg/ directory (I had previously ran borg break-lock many times without any change), so I renamed that and a backup appears to be running fine now

so easy to fix in this instance, but really bad it causes borg to hang without error/timeout subsequently

DaSpawn on 27 Feb 2018

Hmm, can you find a way to reproduce this?

ThomasWaldmann on 28 Feb 2018

the backup completed yesterday and I was able to pull a list from the repo now. So I took the current cache, renamed and put the broken one back and the problem persists, so issue appears entirely with local cache (also explains why I didn't realize it was broken only on the server doing the backup since when this first started I was working/checking the repo with my local machine)

there is something a little different reproducing the issue this time though the data files it was hanging on are not the same now (it is hung on 435468, 434962, 434959 now). digging through the cache I found a lock file in one of the folders still...

So I decided to do a break-lock from the server and it was the lock file in the cache directory causing borg to hang the entire time (I think I was trying to do a break-lock from my local machine/another server on the repo instead of on the production server having the issue)

So problem solved, error handling may little better than just hanging indefinitely on a bad cache lock (I would have expected it to error similar to a lock on the repo)

Not sure if you want to close this or not

Thanks for your time!

DaSpawn on 28 Feb 2018

Above you said, you tried break-lock many times before - did you try these from the problematic client or from the server (or both)?

About the lock file in the broken client cache, can you give me a ls -lR so I can see it?

The question in the end is whether a client-side borg break-lock does not work correctly / completely (in all circumstances).

ThomasWaldmann on 28 Feb 2018

I think the break-locks were only tried from another host so it missed the cache locks (I try not to do backup management from production servers, only from workstation)

ll -lR /root/.cache/borg
....(truncated list, only one folder has any lock folders/files)
./4cc11c5bbd0572563379fa4ad0bd38196ca5236d9da41063e2f7c86c199708b3:
total 81380
-rw------- 1 root root 56668366 Oct 11 01:46 chunks
drwx------ 2 root root 4096 Jun 9 2017 chunks.archive.d
-rw------- 1 root root 292 Oct 11 01:46 config
-rw------- 1 root root 26634095 Oct 11 01:46 files
drwx------ 2 root root 4096 Oct 10 22:00 lock.exclusive
-rw------- 1 root root 42 Oct 10 22:00 lock.roster
-rw------- 1 root root 20 Jun 9 2017 README
drwx------ 2 root root 4096 Oct 11 01:46 txn.active
...........
./4cc11c5bbd0572563379fa4ad0bd38196ca5236d9da41063e2f7c86c199708b3/lock.exclusive:
total 0
-rw------- 1 root root 0 Oct 10 22:00 production-lin4.26516-0

The break-lock (on the server doing the backup) works correctly (that is how I pinpointed, I did a break lock from server and it fixed the broken cache), but that does not solve the problem of borg completely hanging forever without any errors due to a lock file in cache (it should error immediately similar to how it errors with a lock file in the repo. borg should not even ask for a repo password if there is an existing lock file in the cache)

DaSpawn on 28 Feb 2018

OK, so borg break-lock is likely ok (maybe we could add a note to docs about the cache lock breaking only working when invoked from the client machine).

What has to be checked is the silent behaviour of borg when there is no repo lock, but a cache lock.

ThomasWaldmann on 1 Mar 2018

👍2

I think I am experiencing exactly the same problem. Was able to run backup using sudo (so apparently user cache) and it worked without a problem. Starting as service was hanging (up to 4 days as it was a server). Trying to investigate I finally run backup as proper root, and it hang.

# borg list --debug
using builtin fallback logging configuration
35 self tests completed in 1.01 seconds
Verified integrity of /drobo/backups/ordi/index.420
TAM-verified manifest
^CKeyboard interrupt

Shows the same behaviour as above and # borg break-lock fixed the issue.

It is indeed very hard to figure out as there is no output at all.

Sorry, I didn't preserve .cache from before break lock.

mchwalisz on 2 Mar 2018

👍1

I did a first try at reproducing this, with a local repo.

I ran a backup with borg, kill -9ed it, removed repo lock manually, then accessed the repo:

$ borg list repo
Enter passphrase for key ...: 
Killed stale lock [email protected].
Removed stale exclusive roster lock for pid 6985.
Removed stale exclusive roster lock for pid 6985.
arch                                 Fri, 2018-03-02 22:29:01 [620c...6d50]

So, the stale lock detection came to the rescue in the local repo case.

ThomasWaldmann on 2 Mar 2018

2nd try with a (pseudo-)remote repo (ssh to localhost).

Again, ran a backup, killed borg serve with -9, borg create terminated itself after that. There was a lock in the repo, but no lock in the cache in that case. removed repo lock, borg list afterwards works with no issue.

ThomasWaldmann on 2 Mar 2018

3rd try with a (pseudo-)remote repo (ssh to localhost):

Again, ran a backup, killed borg create with -9, borg serve terminated itself after that. There was no lock in the repo, but a lock in the cache in that case. tried borg list and it killed the stale lock automatically as in 1st try.

ThomasWaldmann on 2 Mar 2018

So, I can't reproduce, please add a way how to reproduce the problematic case before reopening.

ThomasWaldmann on 2 Mar 2018

@DaSpawn did you disable stale lock removal via BORG_HOSTNAME_IS_UNIQUE=no?

Hmm, even that case works with a quite clear message:

$ BORG_HOSTNAME_IS_UNIQUE=no borg list tw@localhost:/home/tw/borg/repo
Enter passphrase for key ssh://tw@localhost/home/tw/borg/repo: 
Found stale lock [email protected], but not deleting because BORG_HOSTNAME_IS_UNIQUE is not set.

ThomasWaldmann on 2 Mar 2018

I sync to the NFS drive. Maybe that stalled and created problems?

mchwalisz on 3 Mar 2018

Using bormatic to manage jobs, no special vars

How the cache got in this state is not entirely known, could have been power outage during backup and locks on repo cleared remotely from my desktop (missing cache on server) or even a chance I cleared the locks from the repo but there was actually something running on the server that backs up to the repo

The hang can be easily recreated if you create a folder manually in one of the cache folders along with a file indicating server/pid, ie

~/.cache/borg/4cc11c5bbd0572563379fa4ad0bd38196ca5236d9da41063e2f7c86c199708b3/lock.exclusive/production-lin4.26516-0

if you remove that file the hang will clear and borg will actually remove the lock.exclusive folder on next run. the lock.roster file makes no difference there or not

strace:
sudo strace -s 99 -ffp 5108 strace: Process 5108 attached select(0, NULL, NULL, NULL, {0, 858245}) = 0 (Timeout) mkdir("/root/.cache/borg/4cc11c5bbd0572563379fa4ad0bd38196ca5236d9da41063e2f7c86c199708b3/lock.exclusive", 0777) = -1 EEXIST (File exists) stat("/root/.cache/borg/4cc11c5bbd0572563379fa4ad0bd38196ca5236d9da41063e2f7c86c199708b3/lock.exclusive/[email protected]", 0x7ffd09b11fb0) = -1 ENOENT (No such file or directory) open("/root/.cache/borg/4cc11c5bbd0572563379fa4ad0bd38196ca5236d9da41063e2f7c86c199708b3/lock.exclusive", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5 fstat(5, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 getdents(5, /* 3 entries */, 32768) = 88 getdents(5, /* 0 entries */, 32768) = 0 close(5) = 0 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout) mkdir("/root/.cache/borg/4cc11c5bbd0572563379fa4ad0bd38196ca5236d9da41063e2f7c86c199708b3/lock.exclusive", 0777) = -1 EEXIST (File exists) stat("/root/.cache/borg/4cc11c5bbd0572563379fa4ad0bd38196ca5236d9da41063e2f7c86c199708b3/lock.exclusive/[email protected]", 0x7ffd09b11fb0) = -1 ENOENT (No such file or directory) open("/root/.cache/borg/4cc11c5bbd0572563379fa4ad0bd38196ca5236d9da41063e2f7c86c199708b3/lock.exclusive", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5 fstat(5, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 getdents(5, /* 3 entries */, 32768) = 88 getdents(5, /* 0 entries */, 32768) = 0 close(5) = 0

loops that same thing every second

DaSpawn on 3 Mar 2018

That should correspond to my first try and assuming that the hostname is same as local host, borg should kill the stale lock as it did in my first try.

ThomasWaldmann on 4 Mar 2018

not sure why it is not catching the stale lock as in your example but I just tested and can run a mkdir and touch and each time it will hang consistently and rm the lock file and borg acts normal

DaSpawn on 4 Mar 2018

If you use a different hostname or pid, the stale lock killer will not kick in.

ThomasWaldmann on 4 Mar 2018

I having the same issue as well..

[root@server ~]# sshpass -p '*******' borg info --debug [email protected]:/home/backups/server
using builtin fallback logging configuration
35 self tests completed in 0.21 seconds
SSH command line: ['ssh', '[email protected]', 'borg', 'serve', '--umask=077', '--debug']
Remote: using builtin fallback logging configuration
Remote: 35 self tests completed in 0.18 seconds
Remote: using builtin fallback logging configuration
Remote: Initialized logging system for JSON-based protocol
Remote: Resolving repository path b'/home/backups/server'
Remote: Resolved repository path to '/home/backups/server'
Remote: Verified integrity of /home/backups/server/index.255865
TAM-verified manifest

and then it just hangs...

aiso-net on 12 Mar 2018

👍1

The only other thing I can add was I discovered I had the cron job entry twice in the server attempting the same backup at the same time so they most likely tried to lock at the same time causing a strange condition in creation of locking files in the local borg cache

You should be able to run a borg break-lock from any machine that accesses the backup repo and it should fix the cache lock. you may also need to do break-lock on the machine borg is ssh'ing into (the repo host)

DaSpawn on 12 Mar 2018

I did that (no errors) and ran info again and got an error:

[root@server ~]# sshpass -p '****' borg info --debug [email protected]:/home/backups/server
using builtin fallback logging configuration
35 self tests completed in 0.28 seconds
SSH command line: ['ssh', '[email protected]', 'borg', 'serve', '--umask=077', '--debug']
Remote: using builtin fallback logging configuration
Remote: 35 self tests completed in 0.16 seconds
Remote: using builtin fallback logging configuration
Remote: Initialized logging system for JSON-based protocol
Remote: Resolving repository path b'/home/backups/server'
Remote: Resolved repository path to '/home/backups/server'
Remote: Verified integrity of /home/backups/server/index.255865
TAM-verified manifest
security: read previous location 'ssh://[email protected]/home/backups/server'
security: read manifest timestamp '2018-03-10T04:24:31.636849'
security: determined newest manifest timestamp as 2018-03-10T04:24:31.636849
security: repository checks ok, allowing access
RemoteRepository: 194 B bytes sent, 2.20 MB bytes received, 4 messages sent
Local Exception
Traceback (most recent call last):
File "/usr/lib64/python3.4/site-packages/borg/archiver.py", line 4175, in main
exit_code = archiver.run(args)
File "/usr/lib64/python3.4/site-packages/borg/archiver.py", line 4107, in run
return set_ec(func(args))
File "/usr/lib64/python3.4/site-packages/borg/archiver.py", line 147, in wrapper
progress=getattr(args, 'progress', False), lock_wait=self.lock_wait) as cache_:
File "/usr/lib64/python3.4/site-packages/borg/cache.py", line 372, in __new__
return local()
File "/usr/lib64/python3.4/site-packages/borg/cache.py", line 366, in local
lock_wait=lock_wait)
File "/usr/lib64/python3.4/site-packages/borg/cache.py", line 449, in __init__
self.open()
File "/usr/lib64/python3.4/site-packages/borg/cache.py", line 494, in open
self.rollback()
File "/usr/lib64/python3.4/site-packages/borg/cache.py", line 592, in rollback
self._do_open()
File "/usr/lib64/python3.4/site-packages/borg/cache.py", line 487, in _do_open
self.chunks = ChunkIndex.read(fd)
File "src/borg/hashindex.pyx", line 107, in borg.hashindex.IndexBase.read
File "src/borg/hashindex.pyx", line 94, in borg.hashindex.IndexBase.__cinit__
ValueError: Incorrect file length (expected 96336214, got 37748736)

Platform: Linux server 3.10.0-714.10.2.lve1.5.12.el7.x86_64 #1 SMP Fri Feb 2 00:27:48 EST 2018 x86_64 x86_64
Linux: CloudLinux 7.4 Georgy Grechko
Borg: 1.1.4 Python: CPython 3.4.5
PID: 1648 CWD: /root
sys.argv: ['/bin/borg', 'info', '--debug', '[email protected]:/home/backups/sever']
SSH_ORIGINAL_COMMAND: None

aiso-net on 12 Mar 2018

@DaSpawn borg's locking is based on mkdir which is usually an atomic filesystem operation. And even if a fs does not implement that atomically, it would happen in a very short timeframe.

So, "double locks" are impossible or extremely unlikely.

ThomasWaldmann on 12 Mar 2018

@aiso-net ValueError: Incorrect file length (expected 96336214, got 37748736) - that cache file is obviously corrupted. Guess you should clear the cache: borg delete --cache-only repo

ThomasWaldmann on 12 Mar 2018

The server was setup like this for months running the job duplicated in the cron.. it finnally found a way...but that was never the point of my bug report (I obviously did somethin gto cause this)

the entire reason for my bug report was borg hanging without error which is a serious problem for me and obviously a problem for others, even if it is easily solved with a break-lock (which does not in any way solve the problem, it just makes postpones a completely random problem that completely breaks backup routines as no errors are thrown as you would expect to happen with a backup problem)

I am not trying to hate on borg here, it is absolutely spectacular software and I have used if for almost a decade in many places and will continue to do so, I am just trying to help solve an issue others are also having for very different reasons with the same results, a hung backup process without error

DaSpawn on 12 Mar 2018

I'll fix any bug I can reproduce, but see above, I could not.

ThomasWaldmann on 12 Mar 2018

The original cause of the bug is most likely a condition beyond control and/or extremely difficult to reproduce, but the eventual results (hang on any subsequent borg runs due to extraneous cache lock) can be reproduced easily by manually creating a lock folder in the cache (that would be a legitimate lock normally but not removed for some reason)

I will do some more tests to see if issue is in any way reproducible for you to see

DaSpawn on 12 Mar 2018

@DaSpawn that only qualifies if you create a state that could be also created by borg.

borg does not just create empty folders, but there is a info file in it (with hostname/id, pid, tid) which is used by the stale lock killer to automatically remove locks (if borg can know for sure that the lock creating process does not exist any more).

ThomasWaldmann on 12 Mar 2018

borg does not just create empty folders, but there is a info file in it

Presumably the creation of the lock directory and info file is not atomic, though? i.e. something like mkdir lock_dir && echo "123" > lock_dir/pid could potentially be killed between making the directory and filling in the pid. Maybe that is what happened here?

sersorrel on 12 Mar 2018

If it is like that, you should be able to observe that.

BTW, the stale lock killer is only the last resort. Borg avoids leaving behind stale locks (or other unclean process exits) by the use of context managers.

Of course, if one kills -9 it or pulls the plug, the context manager won't help. That's what the stale lock killer is for.

ThomasWaldmann on 12 Mar 2018

Thanks, that fixed the incorrect file length. I dont think mine was caused by running twice, but may have been caused by OOM killer or something along those lines, cause I can see in my logs it running then all of a sudden: Unknown Error: Received SIGHUP

aiso-net on 13 Mar 2018

Isn't the OOM killer sending SIGKILL?

ThomasWaldmann on 13 Mar 2018

I just had the same problem on one of my devices. No idea how to reproduce it tho.

Although it's hopefully unlikely, I'll have a look if my other device is affected as well later today or tomorrow. If it is I'll gladly provide any information I can give you about it before fixing.

Gotos on 17 Mar 2018

I finally got around to checking my second device - it shows the same symptoms. Anything I should check that might help to replicate the bug before I try to use break-lock to fix it?

It's interesting that both my computers are affected, tho. Unfortunately I don't recall when I updated borg on either computer or my server, so no idea if an update caused this...

Gotos on 23 Mar 2018

@Gotos ls -lR on the cache and repo dir (minus everything below repo/data/).

ThomasWaldmann on 23 Mar 2018

Cache: https://pastebin.com/5MqKgvw1
Repo dir: https://pastebin.com/vDGCTmxM

Gotos on 24 Mar 2018

@Gotos there are no locks. not in the repo, not in the caches.

ThomasWaldmann on 24 Mar 2018

Yeah, sorry. That was unrelated. My script calling borg on this device was just broken >.>

Gotos on 24 Mar 2018

I've been affected by this as well (borg version 1.1.4); some process on my machine was running wild and I was getting "Too many open files" messages at the terminal (obviously not Borg's fault, it was some other program). I used CTRL-C to quit my Borg backup process and managed to kill off the unfriendly process, then I restarted borg, which didn't show any progress. Running with --debug --list shows that it hung after printing "TAM-verified manifest" (similar to aiso-net).

I made a copy of my cache directory for testing purposes, then ran break-lock which did fix the problem.

In case it helps, I've posted here the output of ls -lR ~/.cache/borg: https://gist.github.com/ibennetch/fb61138f747a1df5c8bf70b31bdce973. I still have a copy of the cache directory from before break-lock if that could help with further debugging.