Lnd: Health check: chain backend failed after 3 calls. How to avoid the shutdown?

Created on 4 Oct 2020 · 30Comments · Source: lightningnetwork/lnd

Background

My LND node keeps shutting down. Seems like it's triggered by a short network failure preventing the node from reaching bitcoind

Your environment

lnd version 0.11.99-beta commit=v0.11.0-beta-255-g10a84f2c75c9de958c9dd07de830f0546a43d05a
which operating system: Debian GNU/Linux 10 4.19.0-8-amd64
version of bitcoind: Bitcoin Core version v0.20.1
lnd and bitocoind are on different servers

Steps to reproduce

Restart LND. Repros every 1 to 2 days.

Expected behaviour

Just keep running.

Actual behaviour

2020-10-03 02:18:26.258 [INF] NTFN: New block: height=651006, sha=000000000000000000059e4ba6945311b11cf30cea4389ca932ebd0c89f5186c
2020-10-03 02:18:26.258 [INF] UTXN: Attempting to graduate height=651006: num_kids=0, num_babies=0
2020-10-03 02:18:58.221 [INF] CRTR: Processed channels=0 updates=554 nodes=1 in last 59.999815896s

2020-10-03 02:19:24.910 [CRT] SRVR: Health check: chain backend failed after 3 calls  <------------

2020-10-03 02:19:24.910 [INF] SRVR: Sending request for shutdown

Question

How to avoid the shutdown?

Source

alevchuk

Most helpful comment

Bitcoin Core acquires a lock (cs_main) at the start of every interesting RPC call (getbestblockhash as an example). The RPC call "uptime" does not acquire this lock, and it is extremely fast on my machine (without any hiccups).

Thanks for the info @C-Otto! We suspected something like this would be the case, but didn't actually go look in core itself. I'm pretty surprised that it blocks for minutes though! I'll switchover to a call that doesn't need the lock (just need one which we have for neutrino/btcd as well, but not critical that it's the same endpoint imo). Will get this in for 0.12.

carlaKC on 8 Nov 2020

👍2

All 30 comments

There seems to be a problem with the health check, some false positives we are investigating.
For now, you can disable the health check with --healthcheck.chainbackend.attempts=0.

guggero on 4 Oct 2020

👍2

Do you have any more logs of the failed attempts? The logic is relatively simple: every N minutes we perform a backend health check, if that fails 3 times, we shutdown with the assumption that lnd will be restarted by some hypervisor-type system. Also the response period before we mark the attempt failed is rather generous at 10 seconds or so (the default).

Roasbeef on 5 Oct 2020

The current implementation will cal getbestblock for bitcoind backends. I run a few nodes against btcd and haven't yet noticed a spurious shutdown. If you run the HLCK on the debug log-level, then we should be able to see what error lnd is receiving to cause it to think that the heath check failed.

Roasbeef on 5 Oct 2020

Related: #4671.

Roasbeef on 5 Oct 2020

another log:

2020-09-29 02:31:28.866 [INF] NTFN: New block: height=650455, sha=000000000000000000057e352676f4e0847106aa5c7a4e35d8e0b319cac185c7
2020-09-29 02:31:28.874 [INF] UTXN: Attempting to graduate height=650455: num_kids=0, num_babies=0
2020-09-29 02:31:29.416 [INF] DISC: GossipSyncer(0334ca3166f220e08f3075b406bd009c4d71575d5131bc688a34049651e94b3d70): querying for 500 new channels
2020-09-29 02:32:34.386 [CRT] SRVR: Health check: chain backend failed after 3 calls
2020-09-29 02:32:34.386 [INF] SRVR: Sending request for shutdown

going to try catch the error message by doing repro again with

debuglevel=HLCK=debug

alevchuk on 5 Oct 2020

triggered by a short network failure

If you suspect that this is being caused by a temporary networking glitch, there are a few other config settings you could change:

--healthcheck.chainbackend.attempts: increase to allow more tries before lnd shuts down
--healthcheck.chainbackend.timeout: increase the amount of time each call is allowed before we fail due to timeout
--healthcheck.chainbackend.backoff: increase the amount of time we backoff between attempts, which could allow the temporary issue more time to resolve itself

going to try catch the error message by doing repro again with

Thanks! Will be much easier to debug once we know what's going on with the check itself.

carlaKC on 6 Oct 2020

10 second timeout seems too small for default.

Average new block arrival time is 600 seconds. Is there any reason not to have the default at 120 seconds?

Here is the debug log:

2020-10-06 04:32:58.985 [INF] NTFN: New block: height=651471, sha=00000000000000000004eff1ca66b45e292a630901cde715222baf4b9aa6f50b
2020-10-06 04:32:58.986 [INF] UTXN: Attempting to graduate height=651471: num_kids=0, num_babies=0
2020-10-06 04:33:00.192 [INF] CRTR: Processed channels=0 updates=360 nodes=1 in last 1m0.000015454s
2020-10-06 04:33:25.091 [INF] DISC: Broadcasting 468 new announcements in 18 sub batches
2020-10-06 04:33:29.381 [INF] DISC: GossipSyncer(022df3659d511f156e3ee91ef608a095ce3399234d374c7105356a36a904469372): applying new update horizon: start=2042-07-14 01:04:00 +0000 UTC, end=2042-07-14 01:04:00 +0000 UTC, backlog_size=0
2020-10-06 04:34:00.192 [INF] CRTR: Processed channels=1 updates=133 nodes=2 in last 59.999820789s
2020-10-06 04:34:55.091 [INF] DISC: Broadcasting 178 new announcements in 18 sub batches
2020-10-06 04:35:00.191 [INF] CRTR: Processed channels=2 updates=131 nodes=0 in last 59.999928167s
2020-10-06 04:36:00.191 [INF] CRTR: Processed channels=0 updates=195 nodes=2 in last 59.999924549s
2020-10-06 04:36:25.091 [INF] DISC: Broadcasting 319 new announcements in 18 sub batches
2020-10-06 04:36:53.710 [INF] DISC: GossipSyncer(02cf93d75855011806318f235a0d7e02735c6d5b961862c8ae3aaf58415d7c24ca): applying new update horizon: start=2020-10-06 04:36:53 +0000 UTC, end=2156-11-12 11:05:08 +0000 UTC, backlog_size=0
2020-10-06 04:36:55.091 [INF] DISC: GossipSyncer(02a2c53bc475cb92e4ab2f38a5bca56df695034ce90ad78c2f47c05911e3f79e41): applying gossipFilter(start=0001-01-01 00:00:00 +0000 UTC, end=0001-01-01 00:00:00 +0000 UTC)
2020-10-06 04:36:55.091 [INF] DISC: GossipSyncer(031015a7839468a3c266d662d5bb21ea4cea24226936e2864a7ca4f2c3939836e0): applying gossipFilter(start=2020-10-06 04:36:55.091711013 +0000 UTC m=+48033.098810450, end=2156-11-12 11:05:10.091711013 +0000 UTC m=+4295015328.098810450)
2020-10-06 04:37:00.191 [INF] CRTR: Processed channels=0 updates=196 nodes=0 in last 59.999873136s
2020-10-06 04:37:55.091 [INF] DISC: Broadcasting 278 new announcements in 18 sub batches
2020-10-06 04:38:00.192 [INF] CRTR: Processed channels=0 updates=203 nodes=3 in last 1m0.000011358s
2020-10-06 04:38:03.950 [ERR] BTCN: Can't accept connection: unable to accept connection from N.N.N.N:64836: EOF
2020-10-06 04:39:00.191 [INF] CRTR: Processed channels=0 updates=177 nodes=7 in last 59.999835524s
2020-10-06 04:39:25.091 [INF] DISC: Broadcasting 314 new announcements in 18 sub batches
2020-10-06 04:40:00.191 [INF] CRTR: Processed channels=0 updates=265 nodes=0 in last 1m0.000007686s
2020-10-06 04:40:04.822 [DBG] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-10-06 04:40:44.822 [DBG] HLCK: Health check: chain backend, call: 2 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-10-06 04:40:55.091 [INF] DISC: Broadcasting 459 new announcements in 18 sub batches
2020-10-06 04:41:00.191 [INF] CRTR: Processed channels=0 updates=330 nodes=5 in last 59.999848626s
2020-10-06 04:41:24.823 [CRT] SRVR: Health check: chain backend failed after 3 calls
2020-10-06 04:41:24.823 [INF] SRVR: Sending request for shutdown

alevchuk on 7 Oct 2020

I'm getting this issue.
lnd has shutdown twice already because of this. backend is bitcoin core 0.18.1 on the same computer.

2020-10-08 08:09:51.192 [CRT] SRVR: Health check: chain backend failed after 3 calls
2020-10-08 08:09:51.192 [INF] SRVR: Sending request for shutdown
2020-10-08 08:09:51.193 [INF] LTND: Received shutdown request.
2020-10-08 08:09:51.193 [INF] LTND: Shutting down...
2020-10-08 08:09:51.193 [INF] LTND: Gracefully shutting down.

I increased the healthcheck.chainbackend.attempts to 10 and healthcheck.chainbackend.timeout to 30s and will keep an eye on it to see if it happens again

sendbitcoin on 8 Oct 2020

👍1

Is there any reason not to have the default at 120 seconds?

@alevchuk no reason not to bump the default. It does seem bizarre to me that the call is taking 10 seconds to complete, so we're still investigating to make sure nothing is wrong on our side. If we can't figure anything out, will likely bump the default and see how that goes.

I increased the healthcheck.chainbackend.attempts to 10 and healthcheck.chainbackend.timeout to 30s and will keep an eye on it to see if it happens again

Thanks for the report @sendbitcoin! I'd strongly recommend running lnd with debuglevel=HLCK=debug so that we can see the reason your check is failing. The logging was only bumped to info level in 35a2dbc so you may not see the reason otherwise.

carlaKC on 9 Oct 2020

👍1

relevant portion of my log

2020-10-06 04:40:04.822 [DBG] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-10-06 04:40:44.822 [DBG] HLCK: Health check: chain backend, call: 2 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-10-06 04:40:55.091 [INF] DISC: Broadcasting 459 new announcements in 18 sub batches
2020-10-06 04:41:00.191 [INF] CRTR: Processed channels=0 updates=330 nodes=5 in last 59.999848626s
2020-10-06 04:41:24.823 [CRT] SRVR: Health check: chain backend failed after 3 calls
2020-10-06 04:41:24.823 [INF] SRVR: Sending request for shutdown

alevchuk on 11 Oct 2020

my lnd now running with --healthcheck.chainbackend.timeout=120s started doing a lot a continuous disk writes and some occasional [DBG] HLCK timeouts showed up:

2020-10-10 15:56:05.738 [DBG] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 2m0s, backing off for: 30s
...
2020-10-11 00:10:05.738 [DBG] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 2m0s, backing off for: 30s
...
2020-10-11 01:10:05.738 [DBG] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 2m0s, backing off for: 30s
...

kernel consistency maxed out on iowaiting:

$ vmstat 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  1      0 758488  84440 647044    0    0     0   405  263  536  2  0  0 98  0
 0  1      0 758456  84440 647044    0    0     0   388  230  476  2  0  0 98  0
 0  1      0 758172  84444 647044    0    0     0   403  243  508  2  0  0 98  0
 0  1      0 758140  84452 647044    0    0     0   401  252  512  2  0  0 98  0
 0  1      0 758108  84452 647044    0    0     0   416  241  485  2  0  0 98  0

I checked with atop that the writes are coming from lnd only

alevchuk on 11 Oct 2020

The health checks don't touch disk at all, just make a GetBestBlock call to bitcoind (although it's possible that GetBestBlock does). I suspect that there's something else up with your node, are you managing normal operation (send payments, open channels etc)? Could be related to #4689, you could disable the health check entirely and see whether this problem persists, but I imagine that it will.

Are you running lnd and bitcoind on the same hardware? How much memory are you working with?

carlaKC on 13 Oct 2020

Looks very related to #4689

I'm running lnd and bitcoind on a separate host. 2 GB RAM. Problem reproduces almost every day, going to get back to you if turning off health check had any effect.

If #4689 is not caused by the health check, then it still makes sense that the helthcheck would report LND unhealthy. It's reasonable because bitcoind cannot be reached and it seems plausible to me because in different kinds of software, I've seen client timeouts when there is I/O issue on the client host.

alevchuk on 13 Oct 2020

going to get back to you if turning off health check had any effect

Thanks that would be great!

If #4689 is not caused by the health check

Most likely not, because we disabled health checks by default in the version they're running (0.11.1-beta.rc5) because we were worried about false-positives

If you're happy to send me a few hours of logs (carla on the lnd slack) I can also have a look and see if there's anything that stands out.

carlaKC on 13 Oct 2020

It's reasonable because bitcoind cannot be reached and it seems plausible to me because in different kinds of software, I've seen client timeouts when there is I/O issue on the client host.

Sure, but being unreachable for 2 minutes? The code here is also pretty simple: send the request, then wait for it to come back. I don't think the healthchecks themselves are causing high I/O as it's just an RPC call, and lnd makes several of these to bitcoind on a normal basis for routing operation.

Roasbeef on 14 Oct 2020

👍1

I experienced the same issue since upgrading to 0.11.1-beta.rcX. Since upgrade to rc5 it seems to be gone. I'll keep an eye on it.

hpbock on 14 Oct 2020

I experienced the same issue since upgrading to 0.11.1-beta.rcX. Since upgrade to rc5 it seems to be gone. I'll keep an eye on it.

Health checks are disable in rc5, so you wouldn't see any shutdowns. Were you also seeing healthcheck timeouts?

carlaKC on 14 Oct 2020

Same issue here, after having upgraded from 0.10 to 0.11.99-beta on 13 Oct 2020 (following the security warning of last week), without changing any bitcoin or lnd setting.

the issue: lnd shutdown after few hours of operation while you would expect it to run continuously unattended like in previous version...

bitcoind v 0.20.0, bitcoin and lnd on the same machine,
lnd launched with nohup
Running on a Ubuntu 16.04 virtual machine, with 1 TB disk, 14 GB RAM

Find in attachement two extracts of nohup.out:

the occurances of health check warning during the last operation period (approx 4 hrs)
the sequence which triggered the shutdown and few of its ensuing events.

Will try meanwhile the work-arounds suggested above.

shutDown.txt
healthCheck.txt

update 2020.10.17: stable operation of lnd 0.11 since workarounds applied as proposed by @guggero & @carlaKC above

emplexity on 14 Oct 2020

My setup is almost exactly the same as @emplexity , and I was getting the same Health Check shutdowns after upgrading to those versions. I loaded with --healthcheck.chainbackend.attempts=0 and it has been steady since (will report back if it closes out again). I just wanted to also point out that loading lnd with chainbackend.attempts=0 seems to have loaded much faster. I am not sure if this was the result of something else or having something cached.

But from old log with no flag:
2020-10-18 20:09:01.345 [INF] LTND: Database now open (time_to_open=4m4.826767561s)!

Versus with the flag:
2020-10-18 22:21:02.195 [INF] LTND: Database now open (time_to_open=356.557358ms)!

This long load time for lnd started a version or two ago, but I cannot remember if it was the same version where the health checks started shutting down lnd. It also said "this could take a few minutes" so I assumed the long load time was expected.

floundies on 19 Oct 2020

@floundies That flag shouldn't affect how long it takes to open the database. There're a set of flags that affect that, but all this new flag does it hit an RPC call periodically.

Roasbeef on 20 Oct 2020

Other questions for those running into this: is lnd the _only_ application that's hitting bitcoind's RPC interface?

Roasbeef on 20 Oct 2020

lnd is the only application hitting bitcoind

emplexity on 20 Oct 2020

I experience the same issue, and I can reproduce very slow responses to (for example) "getbestblockhash" calls:

$ time bitcoin-cli -conf=/etc/bitcoin/bitcoin.conf getbestblockhash
000000000000000000056cfad3dda5c6ebde81a545f10a2a76c65650779a3dd3

real    0m3.893s
user    0m0.002s
sys 0m0.000s

Sometimes the calls block for minutes (!), and I think this is due to other requests waiting (and blocking). I'll investigate.
I don't think lnd can be blamed for this.

C-Otto on 7 Nov 2020

Bitcoin Core acquires a lock (cs_main) at the start of every interesting RPC call (getbestblockhash as an example). The RPC call "uptime" does not acquire this lock, and it is extremely fast on my machine (without any hiccups).

Example: https://github.com/bitcoin/bitcoin/blob/867dbeba5f91be15ca0d4a7303a71957ff9a37ad/src/rpc/blockchain.cpp#L225

Related discussion: https://bitcoincore.reviews/16426

I have no idea how to fix this issue, but from a lnd perspective I think increasing the backend check values or disabling the feature would help.

C-Otto on 7 Nov 2020

Bitcoin Core acquires a lock (cs_main) at the start of every interesting RPC call (getbestblockhash as an example). The RPC call "uptime" does not acquire this lock, and it is extremely fast on my machine (without any hiccups).

carlaKC on 8 Nov 2020

👍2

The list of relevant endpoints seems to be defined here:
https://github.com/btcsuite/btcwallet/blob/master/chain/interface.go#L31

:x: Start/Stop/WaitForShutdown/BackEnd: not relevant.
:x: GetBestBlock: "getblockchaininfo"
:x: GetBlock - "getblock"
:x: GetBlockHash - "getblockhash"
:x: IsCurrent - uses "GetBestBlock"
:x: FilterBlocks - uses "GetBlock"
:x: BlockStamp - is passive, does not attempt to contact bitcoind
:x: SendRawTransaction - I guess sending a transaction every so often isn't a good idea :)
:x: Rescan - doesn't seem to be a good idea
:x: NotifyReceived - uses "NotifyBlocks"
:x: NotifyBlocks - uses "GetBestBlock"
:x: Notifications - is passive, does not attempt to contact bitcoind

Based on this I think it's best to add a new endpoint (like "uptime" maybe?). In reality, I think it's best to speed up the bitcoind instance. In my case the (new) storage backend seems to be too slow for meaningful bitcoind operation.

C-Otto on 8 Nov 2020

I'm running c7eea13f95cfa1d5362ad65a1502b70b2d5d6aef but still experience issues with the backend health check.

However, I see this when starting lnd: Version: 0.11.99-beta commit=v0.11.0-beta-307-g9e75c3eadede1816d4be19a7923cd5d0b44b78a0, build=development, logging=default, debuglevel=info

2020-11-12 15:56:35.835 [INF] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-11-12 15:57:15.835 [INF] HLCK: Health check: chain backend, call: 2 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-11-12 16:58:47.772 [INF] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-11-12 16:59:27.772 [INF] HLCK: Health check: chain backend, call: 2 failed with: health check: chain backend timed out after: 10s, backing off for: 30s

C-Otto on 12 Nov 2020

"getblockchaininfo" is used according to ngrep. @carlaKC could you have a look? Is it a compilation issue? Can I help debug this?

C-Otto on 12 Nov 2020

version: 0.11.99-beta commit=v0.11.0-beta-307-g9e75c3eadede1816d4be19a7923cd5d0b44b78a0

This version looks wrong to me, are you sure you're running the latest compiled version at c7eea13f9?

When I checkout that commit I get the following when I check my version:

./lnd-debug --version
version 0.11.99-beta commit=v0.11.0-beta-436-gc7eea13f95cfa1d5362ad65a1502b70b2d5d6aef-dirty

What's tipping me off is that the value beta-? indicates the number of commits since the last tag, and yours looks too low?
Feel free to ping me on lnd slack btw (carla) if it's easier

carlaKC on 12 Nov 2020

I forgot make install, and just verified that "uptime" is used with the current version. Thanks for the quick help :)

C-Otto on 12 Nov 2020

🎉1

Was this page helpful?

0 / 5 - 0 ratings