A few days ago I noticed two channels had automatically force closed locally after some commitment disagreement (which I believed to be bug #1908), so I decided to stop running my node until a fix is released to prevent further channels from force closing.
Today I saw that a release candidate with the fix was released and decided to give it a try.
After syncing to the current height for a while, lnd crashed. These are the last two lines in the log:
2019-01-27 11:04:05.368 [INF] HSWC: ChannelPoint(<channel point>): updating commit fee to 2426 sat/kw
2019-01-27 11:04:14.370 [INF] HSWC: ChannelLink(<channel point>) has exited
And here's the stderr of the process from my systemd journal:
Jan 27 11:04:14 lnd[11218]: panic: runtime error: invalid memory address or nil pointer dereference
Jan 27 11:04:14 lnd[11218]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x4f88fc]
Jan 27 11:04:14 lnd[11218]: goroutine 8391 [running]:
Jan 27 11:04:14 lnd[11218]: container/list.(*List).Remove(...)
Jan 27 11:04:14 lnd[11218]: /usr/local/go/src/container/list/list.go:123
Jan 27 11:04:14 lnd[11218]: github.com/lightningnetwork/lnd/lnwallet.(*updateLog).removeUpdate(...)
Jan 27 11:04:14 lnd[11218]: /Users/roasbeef/gocode/src/github.com/lightningnetwork/lnd/lnwallet/channel.go:1170
Jan 27 11:04:14 lnd[11218]: github.com/lightningnetwork/lnd/lnwallet.compactLogs.func1(0x262f040, 0x262f180)
Jan 27 11:04:14 lnd[11218]: /Users/roasbeef/gocode/src/github.com/lightningnetwork/lnd/lnwallet/channel.go:1238 +0x1b4
Jan 27 11:04:14 lnd[11218]: github.com/lightningnetwork/lnd/lnwallet.compactLogs(0x262f040, 0x262f180, 0x4e2, 0x0, 0x4e3, 0x0)
Jan 27 11:04:14 lnd[11218]: /Users/roasbeef/gocode/src/github.com/lightningnetwork/lnd/lnwallet/channel.go:1252 +0x60
lnd v0.5.2-beta-rc1 (GitHub armv7 pre-release)Linux 4.14.79-v7+ #1159 SMP Sun Nov 4 17:50:20 GMT 2018 armv7l GNU/Linux)bitcoind v0.17.0.1I should also mention that I didn't unlock the wallet again yet after this happened, so I can't say whether this is random or consistent.
It seems that my channel database got corrupted somehow, right?
Is it safe to try unlocking the wallet again or would I risk the channel closing, perhaps even losing the funds?
@a123b What Go version are you running?
@a123b Thanks for reporting this! It indeed looks like you are running into a new bug..
Can you try re-running LND with logs set to debug level, and paste the logs here after the crash? Thanks!
So I just restarted and unlocked lnd with debuglevel=debug. It didn't crash again yet, but it didn't try to update the commit fee for the affected channel yet, either. The channel is however currently active.
I noticed that the affected channel's ChannelArbitrator started with state=StateCommitmentBroadcasted, while it is state=StateDefault for all other channels. Looking at yesterday's log, this was also the case before the crash.
Is this normal or could it have something to do with the bug?
Edit: The above section is invalid, I had the wrong channel point in my clipboard when grepping through the log...
I'll just leave the node running for now and check from time to time if it tried a commit fee update or crashed...
Update:
The channel in question has made a successful commit fee update by now, so I believe a simple restart actually fixed the problem.
Update 2:
_Just_ after restarting lnd with normal logging it happened again, this time with a different channel...
I'll just leave debug logging on for now and restart lnd from time to time until I catch this log.
I noticed that another two or three channels have force closed after I updated to the 0.5.2 release candidate. So it seems like the force closing wasn't bug #1908 and might be related to this bug.
The only unusual thing that happened before the force closings started was that when I stopped my lnd systemd unit, it hadn't finished shutting down after the configured stop timeout of 5 minutes, so systemd killed the process with SIGKILL while it was still in the shutdown phase. Just before/while it got killed, it was still doing commit fee updates:
2019-01-18 00:01:05.636 [INF] LTND: Received SIGINT (Ctrl+C).
2019-01-18 00:01:05.637 [INF] LTND: Shutting down...
2019-01-18 00:01:05.637 [INF] LTND: Gracefully shutting down.
2019-01-18 00:01:14.299 [INF] DISC: gossipSyncer(<<chan_11-remote_node_key>>): applying new update horizon: start=1970-01-01 01:00:00 +0100 CET, end=2106-02-07 07:28:15 +0100 CET, backlog_size=64344
2019-01-18 00:01:14.303 [INF] DISC: gossipSyncer(<<chan_11-remote_node_key>>): buffering chan range reply of size=751
2019-01-18 00:01:14.304 [INF] DISC: gossipSyncer(<<chan_11-remote_node_key>>): filtering through 751 chans
2019-01-18 00:01:24.468 [INF] DISC: gossipSyncer(<<chan_11-remote_node_key>>): starting query for 22 new chans
2019-01-18 00:01:24.469 [INF] DISC: gossipSyncer(<<chan_11-remote_node_key>>): querying for 22 new channels
2019-01-18 00:01:54.233 [INF] HSWC: ChannelPoint(<<chan_4-chan_point>>): updating commit fee to 3753 sat/kw
2019-01-18 00:02:52.037 [INF] HSWC: ChannelPoint(<<chan_0-chan_point>>): updating commit fee to 3753 sat/kw
2019-01-18 00:03:07.386 [INF] HSWC: ChannelPoint(<<chan_6-chan_point>>): updating commit fee to 3753 sat/kw
2019-01-18 00:05:10.015 [INF] HSWC: ChannelPoint(<<chan_1-chan_point>>): updating commit fee to 3753 sat/kw
### systemd sent SIGKILL here ###
I didn't think too much of it then, but considering the first two force closes started soon after this incident, I wonder if this is the root cause of all those problems?
I am attaching a (heavily redacted) log file of the two latest force closings here:
lnd.log
debuglevel=debug where everything seemed normalinvalid_commit_sig, then it crashed again with a SIGSEGV, this time without a stack tracedebuglevel=info again, another channel got force closed, I noticed the force closed channels and stopped lndI hope this can shed some light on what's going on here...
I just ran into the same problem, but I don't have additional information.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x9cdc1b]
goroutine 3859679 [running]:
container/list.(*List).Remove(...)
/usr/local/go/src/container/list/list.go:123
github.com/lightningnetwork/lnd/lnwallet.(*updateLog).removeUpdate(...)
/Users/roasbeef/gocode/src/github.com/lightningnetwork/lnd/lnwallet/channel.go:1170
github.com/lightningnetwork/lnd/lnwallet.compactLogs.func1(0xc021e05680, 0xc021e05770)
/Users/roasbeef/gocode/src/github.com/lightningnetwork/lnd/lnwallet/channel.go:1238 +0xfb
github.com/lightningnetwork/lnd/lnwallet.compactLogs(0xc021e05680, 0xc021e05770, 0xdf, 0xe0)
/Users/roasbeef/gocode/src/github.com/lightningnetwork/lnd/lnwallet/channel.go:1252 +0x71
github.com/lightningnetwork/lnd/lnwallet.(*LightningChannel).ReceiveRevocation(0xc0774685b0, 0xc013e979a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/Users/roasbeef/gocode/src/github.com/lightningnetwork/lnd/lnwallet/channel.go:4414 +0x12a6
github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).handleUpstreamMsg(0xc0188cba00, 0x126e840, 0xc013e979a0)
/Users/roasbeef/gocode/src/github.com/lightningnetwork/lnd/htlcswitch/link.go:1559 +0xb7f
github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).htlcManager(0xc0188cba00)
/Users/roasbeef/gocode/src/github.com/lightningnetwork/lnd/htlcswitch/link.go:1036 +0x843
created by github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).Start
/Users/roasbeef/gocode/src/github.com/lightningnetwork/lnd/htlcswitch/link.go:424 +0x387
Thanks again! We _think_ we have identified the problem now, and I have a fix up here: https://github.com/lightningnetwork/lnd/pull/2558
If you could run it and report back that would be godsend 馃檹
I started lnd with your fix (cherry-picked the three commits on top of the v0.5.2-beta-rc3 tag) twice now and (almost) all channels became active without any issues. :+1:
I'll leave it running and restart it every now and then, and report back in case there are any issues.
Conclusion after running the fix for over 24 hours with 7 restarts:
So it seems that this was indeed the problem.
Thanks for fixing it so fast!
Thanks for testing! 馃榾
Most helpful comment
I started
lndwith your fix (cherry-picked the three commits on top of thev0.5.2-beta-rc3tag) twice now and (almost) all channels became active without any issues. :+1:I'll leave it running and restart it every now and then, and report back in case there are any issues.