I see quite a lot of htlcs timing out instead of failing on mainnet, which is quite annoying because that leads to channel closing. This could be normal if the counterparty was down, but in that particular case, I was pretty sure the peer was up and operational, so I tried to find out what happened.
lnd: v0.5.0-betaScenario is a payment being relayed through eclair and then through lnd:
... -----> eclair ----- htlcA -----> lnd ----- htlcB -----> ...
htlcA lnd_channel_id=544814:1271:1 expiry=545073 htlc_id=7
htlcB lnd_channel_id=540823:2707:1 expiry=545069 htlc_id=15
From the point of view of eclair, nothing happens until block #545073 gets mined, at which point htlcA has timed out, causing eclair to unilaterally close the channel.
I was able to get lnd's logs (log.zip, thanks @AndrewSamokhvalov), here is what I think are the relevant parts:
"October 9th 2018, 16:39:57.044",DBG,PEER,"Received UpdateAddHTLC(chan_id=b0d69f3c998bfabfc32702e55fbe0ff3e2551d91ca234c692e0fc1b0a1a8745e, id=7, amt=135635112 mSAT, expiry=545073, hash=7202da369b8dae66cb9ac6b0dec53bfe7e254efe53ebfb0d3ef6b6f1ba90d4bc) from 34.239.230.56:9735"
...
"October 9th 2018, 16:39:58.660",DBG,PEER,"Sending UpdateAddHTLC(chan_id=8c8481ae7e2a76af1691599057f2e4bafa7cad382f68eb73a9a1650ee410344d, id=15, amt=135634976 mSAT, expiry=545069, hash=7202da369b8dae66cb9ac6b0dec53bfe7e254efe53ebfb0d3ef6b6f1ba90d4bc) to 47.184.129.94:43178"
...
"October 9th 2018, 16:39:59.664",TRC,HSWC,"ChannelLink(540823:2707:1) revocation window exhausted, unable to send: 1, dangling_opens=([]channeldb.CircuitKey) (len=1 cap=1) {
(channeldb.CircuitKey) (Chan ID=544814:1271:1, HTLC ID=7)
}
, dangling_closes([]channeldb.CircuitKey) {
}
(the above line is repeated ~100k times)
...
"October 9th 2018, 18:25:28.665",DBG,HSWC,"ChannelLink(540823:2707:1) removing Add packet (Chan ID=544814:1271:1, HTLC ID=7) from mailbox"
I think what happened is that the next peer in the route got unresponsive or something, and htlcB never got signed. I'm actually not sure if lnd sent a commit_sig for htlcB or not. It seems like lnd was waiting for a revoke_and_ack that never arrived, and after a while lnd just removed htlcB (not sure what exactly triggered this). So lnd didn't watch htlcB's expiry and didn't fail htlcA between blocks 545069 and 545073, which caused htlcA to eventually timeout.
If lnd sent a commit_sig containing htlcB, it should have watched the blockchain, closed the downstream channel after block 545069 and failed htlcA.
If lnd didn't send a commit_sig, then it should have fast-failed htlcA
In any case the issue being between lnd and the next node, it shouldn't have caused the closing of the upstream channel.
lnd ignored htlcA, and htlcA eventually timed out.
If htlcB was never locked in, then it could be the case that lnd didn't actually have a htlcB in its version of the commitment transaction. We've tested this pretty extensively in our integration tests, and I've seen us successfully cancel back an HTLC that's timed out off-chain back to the on-chain path. We'll only cancel back the HTLC (incoming) once the outgoing HTLC has been fully resolved on-chain (confirmed and swept).
If lnd didn't send a commit_sig, then it should have fast-failed htlcA
Fast failed after the timeout? Or fast fail since the HTLC was never fully extended? Since it seems as though the HTLC was never fully extended, a fast fail isn't an option as if the HTLC does get cleared on that link, then we lose money if the HTLC is ultimately settled.
Do they have a set of logs under the CNCT log trace?
It seems like lnd was waiting for a revoke_and_ack that never arrived, and after a while lnd just removed htlcB (not sure what exactly triggered this).
In that message it was unable to send out the commitment for that outgoing link, as the remote party has never sent a prior revocation, or we lost it for some reason.
What do you mean by remove the HTLC?
@pm47 Version of lnd: v0.5.0-beta 48d016bc78e9f338742b10d936e52e7655fd8770
If htlcB was never locked in, then it could be the case that lnd didn't actually have a htlcB in its version of the commitment transaction.
It definitely seems to be the case that htlcB wasn't cross-signed, but even in this case htlcA shouldn't be left to time out. That's my point actually: in an A->B->C setting, if there is a problem between B and C (whatever the problem), it should never lead to the A-B channel getting closed, that's what cltv_expiry_delta are for.
We'll only cancel back the HTLC (incoming) once the outgoing HTLC has been fully resolved on-chain (confirmed and swept).
I think that is the "nominal" case, but there are corner case like the one we have here, which should also ultimately lead to cancelling the incoming htlc.
If lnd didn't send a commit_sig, then it should have fast-failed htlcA
Fast failed after the timeout? Or fast fail since the HTLC was never fully extended?
If lnd hasn't sent a commit_sig to the next node, then htlcA can be failed immediately. I figured that in this particular scenario where lnd seems to be waiting for a revoke_and_ack before sending the new commit_sig that contains htlcB, it could decide at some point (after a few seconds? minutes?) to:
or
revoke_and_ack.Since it seems as though the HTLC was never fully extended, a fast fail isn't an option as if the HTLC does get cleared on that link, then we lose money if the HTLC is ultimately settled.
How could you lose money if you never sent the commit_sig containing htlcB?
Do they have a set of logs under the CNCT log trace?
@AndrewSamokhvalov I'll let you answer
It seems like lnd was waiting for a revoke_and_ack that never arrived, and after a while lnd just removed htlcB (not sure what exactly triggered this).
In that message it was unable to send out the commitment for that outgoing link, as the remote party has never sent a prior revocation, or we lost it for some reason.
What do you mean by remove the HTLC?
That's just what I read in the logs (removing Add packet), I don't know what that meant exactly for lnd. I explained above what I think it should mean.
That's my point actually: in an A->B->C setting, if there is a problem between B and C (whatever the problem), it should never lead to the A-B channel getting closed, that's what cltv_expiry_delta are for.
If the B-C link is in an indeterminate state (for w/e reason), and this persists for more than the delta, then the channel will indeed be closed. In this case, we can't cancel that HTLC on the outgoing, as ti isn't yet fully locked in and in a partial limbo state (they have a state but we don't or the other way around).
I think that is the "nominal" case, but there are corner case like the one we have here, which should also ultimately lead to cancelling the incoming htlc.
Indeed if we're able to detect it and safely recover.
If lnd hasn't sent a commit_sig to the next node, then htlcA can be failed immediately.
Yeah in the normal case, it depends on exactly what was committed in the prior engagement though. I agree that those are all valid actions we can take at a point like this. The main difficulty is in ensuring that we don't lose funds due to canceling the incoming HTLC before the outgoing is fully resolved.
I've been running into this issue as well sending between two nodes. Maybe. Debugging here with Alex. I have the HTLC here on my send.
{ "active": true, "remote_pubkey": "02529db69fd2ebd3126fb66fafa234fc3544477a23d509fe93ed229bb0e92e4fb8", "channel_point": "b4c57bf511d3d9cd29c046dda4af7230edfabf7df4386cc8fc74c37d25bed73e:0", "chan_id": "594163989040594944", "capacity": "6500000", "local_balance": "123195", "remote_balance": "6374666", "commit_fee": "2139", "commit_weight": "724", "fee_per_kw": "2676", "unsettled_balance": "0", "total_satoshis_sent": "1003", "total_satoshis_received": "124400", "num_updates": "323", "pending_htlcs": [ { "incoming": false, "amount": "200", "hash_lock": "kDv1vJy1S674aWv8t6vSJ8co9k3VffTynyKoOM4M3cQ=", "expiration_height": 547439 }, { "incoming": false, "amount": "1", "hash_lock": "s4bWJfNSlN0H4wcRVg/W3uNYFo/Nq72QPPjZG5v4Hm0=", "expiration_height": 547447 } ], "csv_delay": 781, "private": false },
"active": true,
"remote_pubkey": "02529db69fd2ebd3126fb66fafa234fc3544477a23d509fe93ed229bb0e92e4fb8",
"channel_point": "9a3cd6a722dbc5ee17394ee3e99ce7727d03f69090d3c65d8065ecea0091641e:0",
"chan_id": "599093099740069888",
"capacity": "5210295",
"local_balance": "988894",
"remote_balance": "4219464",
"commit_fee": "1937",
"commit_weight": "724",
"fee_per_kw": "2674",
"unsettled_balance": "0",
"total_satoshis_sent": "108006",
"total_satoshis_received": "1096901",
"num_updates": "157",
"pending_htlcs": [
{
"incoming": true,
"amount": "1",
"hash_lock": "s4bWJfNSlN0H4wcRVg/W3uNYFo/Nq72QPPjZG5v4Hm0=",
"expiration_height": 547303
}
],
"csv_delay": 626,
"private": false
},
HTLC failed after a restart of LND.
Any reason why it didn't accept it?
It failed before expiration.
{
"memo": "Bitrefill Deposit 5bd0ac20d8d5f4000494d388",
"receipt": null,
"r_preimage": "8lvZxfeV3DviixRuUlS/figwClNPNx125xoUv10h1WM=",
"r_hash": "s4bWJfNSlN0H4wcRVg/W3uNYFo/Nq72QPPjZG5v4Hm0=",
"value": "1",
"settled": false,
"creation_date": "1540402208",
"settle_date": "0",
"payment_request": "lnbc10n1pdaptpqpp5kwrdvf0n222d6plrqug4vr7kmm34s950ek4mmypulrv3hxlcreksdzygf5hgun9ve5kcmpqg3jhqmmnd96zqdtzvscxzcejxpjrsep4vc6rqvpsxsungepn8quqcqzysxqr8pqnhnvguuudh80zn7nmfcud0xy23gm0lr5qd2pz3dcfrf93tlqlenn0vvj60t2cufr799n2plcm3gvzs88dj23p8v397rgq4ysx7lwg6cpe2fqq7",
"description_hash": null,
"expiry": "7200",
"fallback_addr": "",
"cltv_expiry": "144",
"route_hints": [
],
"private": false,
"add_index": "2842",
"settle_index": "0",
"amt_paid": "0",
"amt_paid_sat": "0",
"amt_paid_msat": "0"
},
@juscamarena do you have a goroutine dump or debug logs?
No, but if I run into it again I will provide it
I'm not sure what is the status or priority of this, but it is currently the main cause of channel failures on our node, has been for quite some time (e.g. in february it happened 26 times so far).
That's my point actually: in an A->B->C setting, if there is a problem between B and C (whatever the problem), it should never lead to the A-B channel getting closed, that's what cltv_expiry_delta are for.
If the B-C link is in an indeterminate state (for w/e reason), and this persists for more than the delta, then the channel will indeed be closed. In this case, we can't cancel that HTLC on the outgoing, as ti isn't yet fully locked in and in a partial limbo state (they have a state but we don't or the other way around).
If you close B-C, then once your or your peer's commitment tx is confirmed, it won't be in an indeterminate state anymore (HTLC either will or will not be in the commit tx) and you will be able to fail or fulfill it in A-B. All that happening within cltv_expiry_delta.
Hey guys,
I've been poking at _lnd_'s code to try to figure this out and I have been able to find a simple local repro. It looks like this issue happens when _lnd_ sends a commit_sig, doesn't receive a revoke_and_ack but receives a commit_sig (it's possible that other tweaks in lost messages trigger the same bug, but at least I'm able to easily reproduce it with those steps).
Here is the repro (and I attached all the logs, lnd data and bitcoind regtest data captured while running the repro):
htlcswitch/link.go, comment line 1708 (l.cfg.Peer.SendMessage(false, nextRevocation))htlcswitch/link.go, comment line 1744-1748 (if err := l.updateCommitTx())What should have happened:
Once Bob has closed Chan_BC and seen that its CommitTx has been confirmed, Bob should fail the HTLC_AB.
@Roasbeef does it make sense? I tried setting up a unit test showcasing that but I couldn't find an easy way to manipulate Carol's outgoing messages to simulate a message loss while still keeping all the other features (chain_arbitrator in particular).
Yeah that makes sense. We only examine HTLCs to forward once we receive a revocation from the remote party. If we never receive a revocation, then we won't examine which changes are fully locked in to forward onwards (in this case an HTLC settle). If this is only triggered by an incorrect channel state machine (not sending the messages and not retransmitting them on reconnect), then I'm not sure our behavior is incorrect as this would be a violation of the spec.
One thing that would partially resolve this, would be for us to pipeline the pre-image across the relevant sub-systems as soon as we obtain it. I think eclair might already do this, but we don't.
I'm not sure our behavior is incorrect as this would be a violation of the spec
Carol misbehaves so it is ok if channel Alice-Bob gets closed, really?
I didn't make that statement. I'm describing how the state machine is
defined and implementation level details that can result in an event like
this.
On Fri, May 10, 2019, 1:54 PM Pierre-Marie Padiou notifications@github.com
wrote:
I'm not sure our behavior is incorrect as this would be a violation of the
specCarol misbehaves so it is ok if channel Alice-Bob gets closed, really?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/lightningnetwork/lnd/issues/2045#issuecomment-491427246,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAHTWLSZIU2HUGFBGPLV77DPUXOJ3ANCNFSM4F3I5PJA
.
One thing that would partially resolve this, would be for us to pipeline the pre-image across the relevant sub-systems as soon as we obtain it. I think eclair might already do this, but we don't.
I'm not sure this is focusing on the right thing. Here Bob never sees the preimage. However, Bob broadcasts his previous commit tx (not containing the faulty HTLC) and never gave Carol the commitment secret (so Carol can't use the revocation outputs). Once that transaction is confirmed (after a few blocks), Bob can completely forget about this HTLC without loss of funds. That's why at that moment Bob should fail the HTLC with Alice (instead of letting Alice time it out).
From looking at your code, it seems like this could be done in the contractcourt (the part that watches the blockchain) right?
Thanks for the detailed report! We have a fix up that ensures we won't let the incoming link time out unnecessarily.
Thanks, I just tested with my repro steps and it works nicely.
Most helpful comment
Thanks for the detailed report! We have a fix up that ensures we won't let the incoming link time out unnecessarily.