I run lnd with --watchtower.active, but I can't clean shutdown lnd.
lnd --watchtower.active and then shut down the lnd
Normal exit and the output is LTND: Shutdown complete
It is hanging and can't be shut down, the output is
2020-09-02 11:01:00.714 [INF] LTND: Received interrupt
2020-09-02 11:01:00.715 [INF] LTND: Shutting down...
2020-09-02 11:01:00.715 [INF] LTND: Gracefully shutting down.
Sounds like some watchtower goroutine isn't properly exited.
Could you please forcefully shut down lnd, start it again with the --profile=9736 port, try a clean shutdown and when it hangs again extract a goroutine profile as described here: https://github.com/lightningnetwork/lnd/blob/master/docs/debugging_lnd.md#capturing-pprof-data-with-lnd
That would help us to identify where the problem lies.
Ok, I will send the pprof data to you by slack.
As there is no sensitive/personal information in the profile, I'm posting it here as well so others can take a look too: https://paste.ubuntu.com/p/FcW3QvhJDm/
I looked into some of the goroutines and am not sure if the btcwallet/wallet should still be listening on those channels at this point. But apart from that I didn't see anything suspicious. Any ideas @wpaulino, @Crypt-iQ ?
The goroutine profile shows that lnd is still in the process of starting up, specifically trying to connect to btcd for fee estimation. There's no trace of a shutdown request at all.
Same problem here. Tried with different watchtowers. Unable to stop the LND daemon when connected to a watchtower. When trying "lnd stop", active channels becomes zero but the daemon is not stopping. No error messages.
@peername: I still wasn't able to reproduce this locally. Would you be willing to share your configuration, or at least everything watchtower related?
It usually happens to me when the remote watchtower is offline. There are pending client updates that can't be sent.
Ah, that helps, thanks! Maybe I can reproduce it then.
I looked into this a bit more.
Unfortunately I couldn't reproduce any shutdown hang.
At most it took 10 seconds to shut down if there was a session negotiation still ongoing. But after the 10 seconds the force shutdown kicked in and lnd terminated normally.
But it seems that there are different scenarios where you (@lesterli, @peername and @githorray) experience this.
I think we need more information. Can you please answer the following questions?
--watchtower.active) or client (--wtclient.active)?debuglevel=debug and post more than just the last three lines).Thanks!
I'm seeing this as well and it has been consistent for at least the last two lnd versions. Forced to do a "kill -KILL".
Version: v0.11.0-beta
Full config:
--debuglevel=ARPC=warn,ATPL=warn,BRAR=warn,BTCN=warn,CHBU=warn,CHDB=warn,CHNF=warn,CMGR=warn,CNCT=warn,CRTR=warn,DISC=warn,FNDG=warn,HSWC=warn,INVC=warn,IRPC=warn,LNWL=warn,LTND=warn,NANN=warn,NTFN=info,NTFR=warn,PEER=warn,RPCS=warn,RRPC=warn,SGNR=warn,SPHX=warn,SRVR=warn,SWPR=warn,UTXN=warn,WLKT=warn,WTCL=debug,WTWR=debug
--maxlogfilesize=1000 \
--maxlogfiles=10 \
--alias=XXX \
--color=XXX \
--bitcoin.node=bitcoind \
--bitcoind.rpchost=XXX:XXX \
--bitcoind.zmqpubrawblock=tcp://XXX:XXX \
--bitcoind.zmqpubrawtx=tcp://XXX:XXX \
--externalip=XXX:9735 \
--externalip=[XXX]:9735 \
--externalip=XXX.onion:9735 \
--listen=[::]:XXX \
--restlisten=XXX:XXX \
--rpclisten=XXX:XXX \
--rpclisten=XXX:XXX \
--bitcoin.active \
--bitcoin.mainnet \
--watchtower.active \
--watchtower.listen=[::]:9911 \
--watchtower.externalip=XXX:9911 \
--watchtower.externalip=[XXX]:9911 \
--wtclient.active \
--wtclient.private-tower-uris=XXX@XXX:XXX \
--protocol.wumbo-channels \
--accept-keysend
$ lncli wtclient towers
{
"towers": [
{
"pubkey": "XXX",
"addresses": [
"XXX:XXX",
"YYY:YYY"
],
"active_session_candidate": false,
"num_sessions": 1067,
"sessions": [
]
},
{
"pubkey": "ZZZ",
"addresses": [
"ZZZ:ZZZ",
"ZZZ.onion:ZZZ"
],
"active_session_candidate": false,
"num_sessions": 2,
"sessions": [
]
}
]
}
None of these towers are functional but I'm unable to remove them. I suspect that has something to do with it.
$ lncli wtclient stats
{
"num_backups": 0,
"num_pending_backups": 0,
"num_failed_backups": 0,
"num_sessions_acquired": 0,
"num_sessions_exhausted": 0
}
$ lncli wtclient policy
{
"max_updates": 1024,
"sweep_sat_per_byte": 10
}
I'm not running with tor enabled.
I have trace-level log files for two different occasions that I can send privately, just let me know.
I don't have a any profile data. I have tried to do pkill -5 -x lnd in the past to get a golang call stack dump but systemd never saved any output of that to the journal, lnd was killed though.
@jogc thanks for the additional information. I'm still unable to reproduce this locally.
Can you send me the trace level logs please? I'm guggero in our community Slack or on Keybase.
None of these towers are functional but I'm unable to remove them.
Are you running into https://github.com/lightningnetwork/lnd/issues/4420?
None of these towers are functional but I'm unable to remove them.
Are you running into #4420?
It seems similiar but I'm getting _removing towers is disallowed while a new session negotiation is in progress_ rather than _tower has unacked updates._ Heres what I wrote on slack about it a while back:
If I do a lncli wtclient towers it shows two towers. Both are old stale towers that are not reachable. Trying to remove either of them with lncli wtclient remove gives me [lncli] rpc error: code = Unknown desc = removing towers is disallowed while a new session negotiation is in progress. Neither of them are specified on the lnd command line. How can I get rid of them and start fresh?
Ah I see. That stems from the watchtower client attempting to negotiate a new session with one of the candidates. There's no way for it to know at the moment that these towers no longer exist, so it will wait until it can reach either of them (which will never happen). You'll want to add a new reachable tower, which will be chosen to negotiate a session with, and you should then be able to remove the others.
I was finally able to reproduce it!
The problem occurs if there is no active session and the number of updates is exhausted:
2020-10-27 11:18:49.409 [DBG] WTCL: Processing backup(940a681cf49e2ee12015b0d6b83f810fc5a509d5fe5bf921a74d79c9af3ac7cd, 1907)
2020-10-27 11:18:49.409 [DBG] WTCL: SessionQueue(03cce423735484aedd57882692b132e0d3c84892090936d6b44c41be0771cc1582) deciding to accept backup(940a681cf49e2ee12015b0d6b83f810fc5a509d5fe5bf921a74d79c9af3ac7cd, 1907) seqnum=4 pending=15 max-updates=20
2020-10-27 11:18:49.409 [INF] WTCL: Queued backup(940a681cf49e2ee12015b0d6b83f810fc5a509d5fe5bf921a74d79c9af3ac7cd, 1907) successfully for session 03cce423735484aedd57882692b132e0d3c84892090936d6b44c41be0771cc1582
2020-10-27 11:18:49.409 [DBG] WTCL: Session 03cce423735484aedd57882692b132e0d3c84892090936d6b44c41be0771cc1582 exhausted
2020-10-27 11:18:49.409 [INF] WTCL: Requesting new session.
2020-10-27 11:18:49.409 [DBG] WTCL: Dispatching session negotiation
2020-10-27 11:18:49.410 [DBG] WTCL: Attempting session negotiation with tower=020ecd2e0eb26d73beeec130f2a9d89bdb99a4fb9d2cd1177130b4162b2562f434
2020-10-27 11:18:49.413 [DBG] WTCL: Request for session negotiation with tower=020ecd2e0eb26d73beeec130f2a9d89bdb99a4fb9d2cd1177130b4162b2562f434@127.0.0.1:9911 failed, trying again -- reason: dial tcp 127.0.0.1:9911: connect: connection refused
2020-10-27 11:18:49.413 [DBG] WTCL: Request for session negotiation with tower=020ecd2e0eb26d73beeec130f2a9d89bdb99a4fb9d2cd1177130b4162b2562f434@7ykwxs6ln3o4xjuwcovvpl4mnyg2gbij44i3ve6kofbwbobi2tdx7wyd.onion:9911 failed, trying again -- reason: dial tcp: address 7ykwxs6ln3o4xjuwcovvpl4mnyg2gbij44i3ve6kofbwbobi2tdx7wyd.onion: no suitable address found
2020-10-27 11:18:49.413 [DBG] WTCL: Session negotiation with tower=020ecd2e0eb26d73beeec130f2a9d89bdb99a4fb9d2cd1177130b4162b2562f434 failed, trying again -- reason: session negotiation unsuccessful
2020-10-27 11:18:49.413 [DBG] WTCL: Unable to get new tower candidate, retrying after 10s -- reason: exhausted all tower candidates
If that happens, the sessionQueue is set to nil here: https://github.com/lightningnetwork/lnd/blob/master/watchtower/wtclient/client.go#L848
That causes the main loop to not read from the pipeline anymore, this isn't reached: https://github.com/lightningnetwork/lnd/blob/master/watchtower/wtclient/client.go#L772
Therefore on shutdown, the task pipeline's signalUntilShutdown() has no effect as nobody's listening on https://github.com/lightningnetwork/lnd/blob/master/watchtower/wtclient/task_pipeline.go#L164.
I'm really not sure what to do in this situation, this code base is very unfamiliar to me. @cfromknecht could you take a look please?
@jogc you can solve this problem manually by removing all faulty sessions the way @wpaulino mentioned in the previous comment. The shutdown hang is caused by too many non-submitted backups. Once that queue is cleared, the problem should be gone.
But we obviously still want to fix the issue itself.
@jogc you can solve this problem manually by removing all faulty sessions the way @wpaulino mentioned in the previous comment. The shutdown hang is caused by too many non-submitted backups. Once that queue is cleared, the problem should be gone.
But we obviously still want to fix the issue itself.
Stupid question, but when adding a new tower, is it supposed to keep uploading stuff for 30 minutes straight? I assumed it would just send the latest state if each channel and be done with it (until the state changes for a channel obviously). It just keeps uploading and uploading until I remove the tower again. I mean like 10 uploads per second. Snippet:
2020-11-04 19:34:36.849 [INF] WTCL: SessionQueue(03a7c91ed69b047ff1c055e4a6cf072b7f3ec03b9e95a56b7a579cbe0a356bcf12) uploaded backup(00e996632e9a82d3300c2ad76ebc75a4e25c600f3856975705d44a2c97739a2b, 33992) seqnum=121
2020-11-04 19:34:36.933 [INF] WTCL: SessionQueue(03adeef3a8633ea515602094fcb42ff49833996e2dc75845faab435313f0effdfe) uploaded backup(82b541e318b33eed8024390d83c2b64635f741c2b4192cf3b01e82055bd4ab83, 2831) seqnum=357
2020-11-04 19:34:37.040 [INF] WTCL: SessionQueue(03dcc3aff75781a948f7b05c27d1f184eda776917a7467012005f6945f9c630fb9) uploaded backup(bbc410fd5fb4c4251bc5ecdd2f387c83af31ce75b6f985adc8a90695fac170eb, 40112) seqnum=86
2020-11-04 19:34:37.128 [INF] WTCL: SessionQueue(03650a8bc7527025a7a7bacdb82be53f3f38ea387c04e6af3e4d235d2d9d5fe76e) uploaded backup(6fe98c9059b49bf40c00955891dbad0083883ba043dcc194d06bcc001f3abb4c, 55184) seqnum=66
2020-11-04 19:34:37.257 [INF] WTCL: SessionQueue(02cf46a5446ebe33e3cf540506af3f98b9f56c38d27228f1eed86ea472af6c9788) uploaded backup(6fe98c9059b49bf40c00955891dbad0083883ba043dcc194d06bcc001f3abb4c, 48800) seqnum=238
2020-11-04 19:34:37.355 [INF] WTCL: SessionQueue(0319e84b9a958f96f2533fcc44d8dc63772f1d15b15a332a4759b3e610a2a3e3b4) uploaded backup(6ca4688b1d648c7f79c6f9f53bbba8b93c7ac6cf8a27892703a2881f1367d1c9, 17408) seqnum=332
2020-11-04 19:34:37.496 [INF] WTCL: SessionQueue(035f6044c1b9893275aeecdb34504f0624863e6146de220825ec2e309e7526e18f) uploaded backup(e681892bec75bc3c0525b8116e35c6a7260691059c320cd5e27bca76913ca0b9, 4369) seqnum=78
2020-11-04 19:34:37.683 [INF] WTCL: SessionQueue(022d342d386b03c70b61ac6c857d1b2c883720aafb0c9f0b3432810b9585c45a11) uploaded backup(00e996632e9a82d3300c2ad76ebc75a4e25c600f3856975705d44a2c97739a2b, 33237) seqnum=207
2020-11-04 19:34:37.783 [INF] WTCL: SessionQueue(028254ff16cecee006ea913a6d865056db966af57e57ca9ee4767042a7b3d2d85f) uploaded backup(6fe98c9059b49bf40c00955891dbad0083883ba043dcc194d06bcc001f3abb4c, 53284) seqnum=106
2020-11-04 19:34:37.917 [INF] WTCL: SessionQueue(020e2a1822aa5d5804901f95391a31c7f0575b9cca6a981a14b7c9320b91680f1b) uploaded backup(117052d757e11603674133b54f40b667c1c8e366cc10c14dd63e15d797ecdd0c, 19) seqnum=265
2020-11-04 19:34:38.028 [INF] WTCL: SessionQueue(03b87be3b1d565ba4deea9b2d6222449f7b0294a531d99d935f764187f1917bb81) uploaded backup(6fe98c9059b49bf40c00955891dbad0083883ba043dcc194d06bcc001f3abb4c, 54981) seqnum=75
2020-11-04 19:34:38.245 [INF] WTCL: SessionQueue(024142fb975b5831a9ed5c892341d414a39814ad17849f0ff678901416979986e4) uploaded backup(432aa861bd643492b4e872ae545ac6026c74f9aee3a566b060e9cd8de77c107d, 3582) seqnum=155
2020-11-04 19:34:38.471 [INF] WTCL: SessionQueue(0304129c421ac459b27098c58156dfedae031b60622db2f69b8fdba5da3e2824c7) uploaded backup(2de71ecc6949a9abe3e8556af6a9da78c9330d40ae000dd6b310c887c2e64087, 71296) seqnum=135
2020-11-04 19:34:38.722 [INF] WTCL: SessionQueue(037eeb071e846beb80c4cc66453dc3a20ddfbaf3b34179f1d8c40c292d51eb70d2) uploaded backup(6ca4688b1d648c7f79c6f9f53bbba8b93c7ac6cf8a27892703a2881f1367d1c9, 19498) seqnum=93
2020-11-04 19:34:38.928 [INF] WTCL: SessionQueue(03228756d4382c8ca4b592b3ec74d2a7b9a8a3545d58bff9f1473487f4ee0acdc2) uploaded backup(2de71ecc6949a9abe3e8556af6a9da78c9330d40ae000dd6b310c887c2e64087, 69601) seqnum=198
2020-11-04 19:34:39.159 [INF] WTCL: SessionQueue(0221b355a0d5090bfc557aad7b20d69a0d8907d63eec3da04dc00a6b40abf8d5ec) uploaded backup(4272594c2cbdb52cb75a3077ba50f343896e85c28b7fb425ede5b51fa5278f74, 16194) seqnum=140
2020-11-04 19:34:39.280 [INF] WTCL: SessionQueue(02a6c1a8031f2f970891378a389a5dbace1a192f3885ce2a6ebfa76d5b511c0721) uploaded backup(82b541e318b33eed8024390d83c2b64635f741c2b4192cf3b01e82055bd4ab83, 2883) seqnum=295
2020-11-04 19:34:39.394 [INF] WTCL: SessionQueue(023d70e565c037ab5ca77d37f7064936cb0edde3935f52eb2160e9d18f3dbced5f) uploaded backup(00e996632e9a82d3300c2ad76ebc75a4e25c600f3856975705d44a2c97739a2b, 35217) seqnum=70
2020-11-04 19:34:39.474 [INF] WTCL: SessionQueue(030a9a183a42fa5a035dd6c1ad89db3ddf280cd1019dab3988569fe60dd6e86de7) uploaded backup(930879e167b4ebacc5df6752c78e6a8391d403cdfda97d8cf5946c3e45053337, 4927) seqnum=138
2020-11-04 19:34:39.576 [INF] WTCL: SessionQueue(03973d37ae315447212c1b77f153bc50e12634d262fbe4a4cae1d17e7fb2ca2d1b) uploaded backup(bff175d758172a637858282c67b9d081232bd212941c0e6c06a048581a079796, 39562) seqnum=141
2020-11-04 19:34:39.694 [INF] WTCL: SessionQueue(0395b0e790ceaa0bed947b3492e4538be304d416b4c77597a20889621b8bbeb01a) uploaded backup(80710bf7658d5343d3657faaf12412060c9aed5fcf9bd1e13cd4d0e3f6d51a7d, 3723) seqnum=222
2020-11-04 19:34:39.770 [INF] WTCL: SessionQueue(034c959be792bd4d02f101c60d45ba1b31c6ec3c70df3b0e35d39b7be6affbb2d6) uploaded backup(6fe98c9059b49bf40c00955891dbad0083883ba043dcc194d06bcc001f3abb4c, 53137) seqnum=110
2020-11-04 19:34:39.870 [INF] WTCL: SessionQueue(032680ceaac77fe3792cee5aab3a9633d20e2552b43f7312a0c5e4ee261fd99264) uploaded backup(562a91afb6ba42fd14fc9341329d48f7a95f5d4e18ec02550dff0b95f88d991d, 4091) seqnum=106
2020-11-04 19:34:39.971 [INF] WTCL: SessionQueue(034ec1d023c3e89cc15e8768145b0f02ede278e426781fd4f2657c9c5013308c49) uploaded backup(bb02d622f811e94c00e3d57b9613f9a72b498aa1ae7199bbc55425307334197b, 8685) seqnum=422
$ lncli wtclient stats
{
"num_backups": 62074,
"num_pending_backups": 62074,
"num_failed_backups": 0,
"num_sessions_acquired": 61,
"num_sessions_exhausted": 60
}
md5-62873f975b716c4bd80945e841fbc8fc
$ tail -n5000000 lnd.log|egrep -a 'WTCL.*uploaded backup'|head -n1
2020-11-04 19:06:18.092 [INF] WTCL: SessionQueue(028a93bd3fc97ace386ed25922c2de82cb94d5154dde947d1b7352ba5983c3f1f4) uploaded backup(16c38df135baec90b2bcf417c2c824178d28275e90ec52dc16ef7a79ba97e83a, 1648) seqnum=1
md5-62873f975b716c4bd80945e841fbc8fc
$ tail -n5000000 lnd.log|egrep -a 'WTCL.*uploaded backup'|tail -n1
2020-11-04 19:41:01.822 [INF] WTCL: SessionQueue(03a7c91ed69b047ff1c055e4a6cf072b7f3ec03b9e95a56b7a579cbe0a356bcf12) uploaded backup(930879e167b4ebacc5df6752c78e6a8391d403cdfda97d8cf5946c3e45053337, 5778) seqnum=156
md5-62873f975b716c4bd80945e841fbc8fc
$ tail -n5000000 lnd.log|egrep -a 'WTCL.*uploaded backup'|wc
14320 128880 3093076
Each state update is backed up to ensure the counterparty is caught in any possible state it might try to broadcast to the chain. Since you didn't have any reachable towers, none of these states were backed up and are pending.
Most helpful comment
I was finally able to reproduce it!
The problem occurs if there is no active session and the number of updates is exhausted:
If that happens, the
sessionQueueis set to nil here: https://github.com/lightningnetwork/lnd/blob/master/watchtower/wtclient/client.go#L848That causes the main loop to not read from the pipeline anymore, this isn't reached: https://github.com/lightningnetwork/lnd/blob/master/watchtower/wtclient/client.go#L772
Therefore on shutdown, the task pipeline's
signalUntilShutdown()has no effect as nobody's listening on https://github.com/lightningnetwork/lnd/blob/master/watchtower/wtclient/task_pipeline.go#L164.I'm really not sure what to do in this situation, this code base is very unfamiliar to me. @cfromknecht could you take a look please?
@jogc you can solve this problem manually by removing all faulty sessions the way @wpaulino mentioned in the previous comment. The shutdown hang is caused by too many non-submitted backups. Once that queue is cleared, the problem should be gone.
But we obviously still want to fix the issue itself.