My LND node keeps shutting down. Seems like it's triggered by a short network failure preventing the node from reaching bitcoind
bitcoind: Bitcoin Core version v0.20.1lnd and bitocoind are on different serversRestart LND. Repros every 1 to 2 days.
Just keep running.
2020-10-03 02:18:26.258 [INF] NTFN: New block: height=651006, sha=000000000000000000059e4ba6945311b11cf30cea4389ca932ebd0c89f5186c
2020-10-03 02:18:26.258 [INF] UTXN: Attempting to graduate height=651006: num_kids=0, num_babies=0
2020-10-03 02:18:58.221 [INF] CRTR: Processed channels=0 updates=554 nodes=1 in last 59.999815896s
2020-10-03 02:19:24.910 [CRT] SRVR: Health check: chain backend failed after 3 calls <------------
2020-10-03 02:19:24.910 [INF] SRVR: Sending request for shutdown
How to avoid the shutdown?
There seems to be a problem with the health check, some false positives we are investigating.
For now, you can disable the health check with --healthcheck.chainbackend.attempts=0.
Do you have any more logs of the failed attempts? The logic is relatively simple: every N minutes we perform a backend health check, if that fails 3 times, we shutdown with the assumption that lnd will be restarted by some hypervisor-type system. Also the response period before we mark the attempt failed is rather generous at 10 seconds or so (the default).
The current implementation will cal getbestblock for bitcoind backends. I run a few nodes against btcd and haven't yet noticed a spurious shutdown. If you run the HLCK on the debug log-level, then we should be able to see what error lnd is receiving to cause it to think that the heath check failed.
Related: #4671.
another log:
2020-09-29 02:31:28.866 [INF] NTFN: New block: height=650455, sha=000000000000000000057e352676f4e0847106aa5c7a4e35d8e0b319cac185c7
2020-09-29 02:31:28.874 [INF] UTXN: Attempting to graduate height=650455: num_kids=0, num_babies=0
2020-09-29 02:31:29.416 [INF] DISC: GossipSyncer(0334ca3166f220e08f3075b406bd009c4d71575d5131bc688a34049651e94b3d70): querying for 500 new channels
2020-09-29 02:32:34.386 [CRT] SRVR: Health check: chain backend failed after 3 calls
2020-09-29 02:32:34.386 [INF] SRVR: Sending request for shutdown
going to try catch the error message by doing repro again with
debuglevel=HLCK=debug
triggered by a short network failure
If you suspect that this is being caused by a temporary networking glitch, there are a few other config settings you could change:
--healthcheck.chainbackend.attempts: increase to allow more tries before lnd shuts down--healthcheck.chainbackend.timeout: increase the amount of time each call is allowed before we fail due to timeout--healthcheck.chainbackend.backoff: increase the amount of time we backoff between attempts, which could allow the temporary issue more time to resolve itselfgoing to try catch the error message by doing repro again with
Thanks! Will be much easier to debug once we know what's going on with the check itself.
10 second timeout seems too small for default.
Average new block arrival time is 600 seconds. Is there any reason not to have the default at 120 seconds?
Here is the debug log:
2020-10-06 04:32:58.985 [INF] NTFN: New block: height=651471, sha=00000000000000000004eff1ca66b45e292a630901cde715222baf4b9aa6f50b
2020-10-06 04:32:58.986 [INF] UTXN: Attempting to graduate height=651471: num_kids=0, num_babies=0
2020-10-06 04:33:00.192 [INF] CRTR: Processed channels=0 updates=360 nodes=1 in last 1m0.000015454s
2020-10-06 04:33:25.091 [INF] DISC: Broadcasting 468 new announcements in 18 sub batches
2020-10-06 04:33:29.381 [INF] DISC: GossipSyncer(022df3659d511f156e3ee91ef608a095ce3399234d374c7105356a36a904469372): applying new update horizon: start=2042-07-14 01:04:00 +0000 UTC, end=2042-07-14 01:04:00 +0000 UTC, backlog_size=0
2020-10-06 04:34:00.192 [INF] CRTR: Processed channels=1 updates=133 nodes=2 in last 59.999820789s
2020-10-06 04:34:55.091 [INF] DISC: Broadcasting 178 new announcements in 18 sub batches
2020-10-06 04:35:00.191 [INF] CRTR: Processed channels=2 updates=131 nodes=0 in last 59.999928167s
2020-10-06 04:36:00.191 [INF] CRTR: Processed channels=0 updates=195 nodes=2 in last 59.999924549s
2020-10-06 04:36:25.091 [INF] DISC: Broadcasting 319 new announcements in 18 sub batches
2020-10-06 04:36:53.710 [INF] DISC: GossipSyncer(02cf93d75855011806318f235a0d7e02735c6d5b961862c8ae3aaf58415d7c24ca): applying new update horizon: start=2020-10-06 04:36:53 +0000 UTC, end=2156-11-12 11:05:08 +0000 UTC, backlog_size=0
2020-10-06 04:36:55.091 [INF] DISC: GossipSyncer(02a2c53bc475cb92e4ab2f38a5bca56df695034ce90ad78c2f47c05911e3f79e41): applying gossipFilter(start=0001-01-01 00:00:00 +0000 UTC, end=0001-01-01 00:00:00 +0000 UTC)
2020-10-06 04:36:55.091 [INF] DISC: GossipSyncer(031015a7839468a3c266d662d5bb21ea4cea24226936e2864a7ca4f2c3939836e0): applying gossipFilter(start=2020-10-06 04:36:55.091711013 +0000 UTC m=+48033.098810450, end=2156-11-12 11:05:10.091711013 +0000 UTC m=+4295015328.098810450)
2020-10-06 04:37:00.191 [INF] CRTR: Processed channels=0 updates=196 nodes=0 in last 59.999873136s
2020-10-06 04:37:55.091 [INF] DISC: Broadcasting 278 new announcements in 18 sub batches
2020-10-06 04:38:00.192 [INF] CRTR: Processed channels=0 updates=203 nodes=3 in last 1m0.000011358s
2020-10-06 04:38:03.950 [ERR] BTCN: Can't accept connection: unable to accept connection from N.N.N.N:64836: EOF
2020-10-06 04:39:00.191 [INF] CRTR: Processed channels=0 updates=177 nodes=7 in last 59.999835524s
2020-10-06 04:39:25.091 [INF] DISC: Broadcasting 314 new announcements in 18 sub batches
2020-10-06 04:40:00.191 [INF] CRTR: Processed channels=0 updates=265 nodes=0 in last 1m0.000007686s
2020-10-06 04:40:04.822 [DBG] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-10-06 04:40:44.822 [DBG] HLCK: Health check: chain backend, call: 2 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-10-06 04:40:55.091 [INF] DISC: Broadcasting 459 new announcements in 18 sub batches
2020-10-06 04:41:00.191 [INF] CRTR: Processed channels=0 updates=330 nodes=5 in last 59.999848626s
2020-10-06 04:41:24.823 [CRT] SRVR: Health check: chain backend failed after 3 calls
2020-10-06 04:41:24.823 [INF] SRVR: Sending request for shutdown
I'm getting this issue.
lnd has shutdown twice already because of this. backend is bitcoin core 0.18.1 on the same computer.
2020-10-08 08:09:51.192 [CRT] SRVR: Health check: chain backend failed after 3 calls
2020-10-08 08:09:51.192 [INF] SRVR: Sending request for shutdown
2020-10-08 08:09:51.193 [INF] LTND: Received shutdown request.
2020-10-08 08:09:51.193 [INF] LTND: Shutting down...
2020-10-08 08:09:51.193 [INF] LTND: Gracefully shutting down.
I increased the healthcheck.chainbackend.attempts to 10 and healthcheck.chainbackend.timeout to 30s and will keep an eye on it to see if it happens again
Is there any reason not to have the default at 120 seconds?
@alevchuk no reason not to bump the default. It does seem bizarre to me that the call is taking 10 seconds to complete, so we're still investigating to make sure nothing is wrong on our side. If we can't figure anything out, will likely bump the default and see how that goes.
I increased the healthcheck.chainbackend.attempts to 10 and healthcheck.chainbackend.timeout to 30s and will keep an eye on it to see if it happens again
Thanks for the report @sendbitcoin! I'd strongly recommend running lnd with debuglevel=HLCK=debug so that we can see the reason your check is failing. The logging was only bumped to info level in 35a2dbc so you may not see the reason otherwise.
relevant portion of my log
2020-10-06 04:40:04.822 [DBG] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-10-06 04:40:44.822 [DBG] HLCK: Health check: chain backend, call: 2 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-10-06 04:40:55.091 [INF] DISC: Broadcasting 459 new announcements in 18 sub batches
2020-10-06 04:41:00.191 [INF] CRTR: Processed channels=0 updates=330 nodes=5 in last 59.999848626s
2020-10-06 04:41:24.823 [CRT] SRVR: Health check: chain backend failed after 3 calls
2020-10-06 04:41:24.823 [INF] SRVR: Sending request for shutdown
my lnd now running with --healthcheck.chainbackend.timeout=120s started doing a lot a continuous disk writes and some occasional [DBG] HLCK timeouts showed up:
2020-10-10 15:56:05.738 [DBG] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 2m0s, backing off for: 30s
...
2020-10-11 00:10:05.738 [DBG] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 2m0s, backing off for: 30s
...
2020-10-11 01:10:05.738 [DBG] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 2m0s, backing off for: 30s
...
kernel consistency maxed out on iowaiting:
$ vmstat 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 1 0 758488 84440 647044 0 0 0 405 263 536 2 0 0 98 0
0 1 0 758456 84440 647044 0 0 0 388 230 476 2 0 0 98 0
0 1 0 758172 84444 647044 0 0 0 403 243 508 2 0 0 98 0
0 1 0 758140 84452 647044 0 0 0 401 252 512 2 0 0 98 0
0 1 0 758108 84452 647044 0 0 0 416 241 485 2 0 0 98 0
I checked with atop that the writes are coming from lnd only
The health checks don't touch disk at all, just make a GetBestBlock call to bitcoind (although it's possible that GetBestBlock does). I suspect that there's something else up with your node, are you managing normal operation (send payments, open channels etc)? Could be related to #4689, you could disable the health check entirely and see whether this problem persists, but I imagine that it will.
Are you running lnd and bitcoind on the same hardware? How much memory are you working with?
Looks very related to #4689
I'm running lnd and bitcoind on a separate host. 2 GB RAM. Problem reproduces almost every day, going to get back to you if turning off health check had any effect.
If #4689 is not caused by the health check, then it still makes sense that the helthcheck would report LND unhealthy. It's reasonable because bitcoind cannot be reached and it seems plausible to me because in different kinds of software, I've seen client timeouts when there is I/O issue on the client host.
going to get back to you if turning off health check had any effect
Thanks that would be great!
If #4689 is not caused by the health check
Most likely not, because we disabled health checks by default in the version they're running (0.11.1-beta.rc5) because we were worried about false-positives
If you're happy to send me a few hours of logs (carla on the lnd slack) I can also have a look and see if there's anything that stands out.
It's reasonable because bitcoind cannot be reached and it seems plausible to me because in different kinds of software, I've seen client timeouts when there is I/O issue on the client host.
Sure, but being unreachable for 2 minutes? The code here is also pretty simple: send the request, then wait for it to come back. I don't think the healthchecks themselves are causing high I/O as it's just an RPC call, and lnd makes several of these to bitcoind on a normal basis for routing operation.
I experienced the same issue since upgrading to 0.11.1-beta.rcX. Since upgrade to rc5 it seems to be gone. I'll keep an eye on it.
I experienced the same issue since upgrading to 0.11.1-beta.rcX. Since upgrade to rc5 it seems to be gone. I'll keep an eye on it.
Health checks are disable in rc5, so you wouldn't see any shutdowns. Were you also seeing healthcheck timeouts?
Same issue here, after having upgraded from 0.10 to 0.11.99-beta on 13 Oct 2020 (following the security warning of last week), without changing any bitcoin or lnd setting.
the issue: lnd shutdown after few hours of operation while you would expect it to run continuously unattended like in previous version...
Find in attachement two extracts of nohup.out:
Will try meanwhile the work-arounds suggested above.
update 2020.10.17: stable operation of lnd 0.11 since workarounds applied as proposed by @guggero & @carlaKC above
My setup is almost exactly the same as @emplexity , and I was getting the same Health Check shutdowns after upgrading to those versions. I loaded with --healthcheck.chainbackend.attempts=0 and it has been steady since (will report back if it closes out again). I just wanted to also point out that loading lnd with chainbackend.attempts=0 seems to have loaded much faster. I am not sure if this was the result of something else or having something cached.
But from old log with no flag:
2020-10-18 20:09:01.345 [INF] LTND: Database now open (time_to_open=4m4.826767561s)!
Versus with the flag:
2020-10-18 22:21:02.195 [INF] LTND: Database now open (time_to_open=356.557358ms)!
This long load time for lnd started a version or two ago, but I cannot remember if it was the same version where the health checks started shutting down lnd. It also said "this could take a few minutes" so I assumed the long load time was expected.
@floundies That flag shouldn't affect how long it takes to open the database. There're a set of flags that affect that, but all this new flag does it hit an RPC call periodically.
Other questions for those running into this: is lnd the _only_ application that's hitting bitcoind's RPC interface?
lnd is the only application hitting bitcoind
I experience the same issue, and I can reproduce very slow responses to (for example) "getbestblockhash" calls:
$ time bitcoin-cli -conf=/etc/bitcoin/bitcoin.conf getbestblockhash
000000000000000000056cfad3dda5c6ebde81a545f10a2a76c65650779a3dd3
real 0m3.893s
user 0m0.002s
sys 0m0.000s
Sometimes the calls block for minutes (!), and I think this is due to other requests waiting (and blocking). I'll investigate.
I don't think lnd can be blamed for this.
Bitcoin Core acquires a lock (cs_main) at the start of every interesting RPC call (getbestblockhash as an example). The RPC call "uptime" does not acquire this lock, and it is extremely fast on my machine (without any hiccups).
Related discussion: https://bitcoincore.reviews/16426
I have no idea how to fix this issue, but from a lnd perspective I think increasing the backend check values or disabling the feature would help.
Bitcoin Core acquires a lock (cs_main) at the start of every interesting RPC call (getbestblockhash as an example). The RPC call "uptime" does not acquire this lock, and it is extremely fast on my machine (without any hiccups).
Thanks for the info @C-Otto! We suspected something like this would be the case, but didn't actually go look in core itself. I'm pretty surprised that it blocks for minutes though! I'll switchover to a call that doesn't need the lock (just need one which we have for neutrino/btcd as well, but not critical that it's the same endpoint imo). Will get this in for 0.12.
The list of relevant endpoints seems to be defined here:
https://github.com/btcsuite/btcwallet/blob/master/chain/interface.go#L31
Based on this I think it's best to add a new endpoint (like "uptime" maybe?). In reality, I think it's best to speed up the bitcoind instance. In my case the (new) storage backend seems to be too slow for meaningful bitcoind operation.
I'm running c7eea13f95cfa1d5362ad65a1502b70b2d5d6aef but still experience issues with the backend health check.
However, I see this when starting lnd: Version: 0.11.99-beta commit=v0.11.0-beta-307-g9e75c3eadede1816d4be19a7923cd5d0b44b78a0, build=development, logging=default, debuglevel=info
2020-11-12 15:56:35.835 [INF] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-11-12 15:57:15.835 [INF] HLCK: Health check: chain backend, call: 2 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-11-12 16:58:47.772 [INF] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-11-12 16:59:27.772 [INF] HLCK: Health check: chain backend, call: 2 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
"getblockchaininfo" is used according to ngrep. @carlaKC could you have a look? Is it a compilation issue? Can I help debug this?
version: 0.11.99-beta commit=v0.11.0-beta-307-g9e75c3eadede1816d4be19a7923cd5d0b44b78a0
This version looks wrong to me, are you sure you're running the latest compiled version at c7eea13f9?
When I checkout that commit I get the following when I check my version:
./lnd-debug --version
version 0.11.99-beta commit=v0.11.0-beta-436-gc7eea13f95cfa1d5362ad65a1502b70b2d5d6aef-dirty
What's tipping me off is that the value beta-? indicates the number of commits since the last tag, and yours looks too low?
Feel free to ping me on lnd slack btw (carla) if it's easier
I forgot make install, and just verified that "uptime" is used with the current version. Thanks for the quick help :)
Most helpful comment
Thanks for the info @C-Otto! We suspected something like this would be the case, but didn't actually go look in core itself. I'm pretty surprised that it blocks for minutes though! I'll switchover to a call that doesn't need the lock (just need one which we have for neutrino/btcd as well, but not critical that it's the same endpoint imo). Will get this in for 0.12.