Lnd: HTLC not getting failed upstream after they timeout downstream

Created on 12 Oct 2018 · 17Comments · Source: lightningnetwork/lnd

Background

I see quite a lot of htlcs timing out instead of failing on mainnet, which is quite annoying because that leads to channel closing. This could be normal if the counterparty was down, but in that particular case, I was pretty sure the peer was up and operational, so I tried to find out what happened.

Your environment

version of lnd: v0.5.0-beta

Steps to reproduce

Scenario is a payment being relayed through eclair and then through lnd:

... -----> eclair ----- htlcA -----> lnd ----- htlcB -----> ...

htlcA lnd_channel_id=544814:1271:1 expiry=545073 htlc_id=7
htlcB lnd_channel_id=540823:2707:1 expiry=545069 htlc_id=15

From the point of view of eclair, nothing happens until block #545073 gets mined, at which point htlcA has timed out, causing eclair to unilaterally close the channel.

I was able to get lnd's logs (log.zip, thanks @AndrewSamokhvalov), here is what I think are the relevant parts:

"October 9th 2018, 16:39:57.044",DBG,PEER,"Received UpdateAddHTLC(chan_id=b0d69f3c998bfabfc32702e55fbe0ff3e2551d91ca234c692e0fc1b0a1a8745e, id=7, amt=135635112 mSAT, expiry=545073, hash=7202da369b8dae66cb9ac6b0dec53bfe7e254efe53ebfb0d3ef6b6f1ba90d4bc) from 34.239.230.56:9735"
...
"October 9th 2018, 16:39:58.660",DBG,PEER,"Sending UpdateAddHTLC(chan_id=8c8481ae7e2a76af1691599057f2e4bafa7cad382f68eb73a9a1650ee410344d, id=15, amt=135634976 mSAT, expiry=545069, hash=7202da369b8dae66cb9ac6b0dec53bfe7e254efe53ebfb0d3ef6b6f1ba90d4bc) to 47.184.129.94:43178"
...
"October 9th 2018, 16:39:59.664",TRC,HSWC,"ChannelLink(540823:2707:1) revocation window exhausted, unable to send: 1, dangling_opens=([]channeldb.CircuitKey) (len=1 cap=1) {
 (channeldb.CircuitKey) (Chan ID=544814:1271:1, HTLC ID=7)
}
, dangling_closes([]channeldb.CircuitKey) {
}
(the above line is repeated ~100k times)
...
"October 9th 2018, 18:25:28.665",DBG,HSWC,"ChannelLink(540823:2707:1) removing Add packet (Chan ID=544814:1271:1, HTLC ID=7) from mailbox"

I think what happened is that the next peer in the route got unresponsive or something, and htlcB never got signed. I'm actually not sure if lnd sent a commit_sig for htlcB or not. It seems like lnd was waiting for a revoke_and_ack that never arrived, and after a while lnd just removed htlcB (not sure what exactly triggered this). So lnd didn't watch htlcB's expiry and didn't fail htlcA between blocks 545069 and 545073, which caused htlcA to eventually timeout.

Expected behaviour

If lnd sent a commit_sig containing htlcB, it should have watched the blockchain, closed the downstream channel after block 545069 and failed htlcA.
If lnd didn't send a commit_sig, then it should have fast-failed htlcA

In any case the issue being between lnd and the next node, it shouldn't have caused the closing of the upstream channel.

Actual behaviour

lnd ignored htlcA, and htlcA eventually timed out.

HTLC bug commitments contracts htlcswitch

Source

pm47

Most helpful comment

Thanks for the detailed report! We have a fix up that ensures we won't let the incoming link time out unnecessarily.

Roasbeef on 18 May 2019

👍2

All 17 comments

If htlcB was never locked in, then it could be the case that lnd didn't actually have a htlcB in its version of the commitment transaction. We've tested this pretty extensively in our integration tests, and I've seen us successfully cancel back an HTLC that's timed out off-chain back to the on-chain path. We'll only cancel back the HTLC (incoming) once the outgoing HTLC has been fully resolved on-chain (confirmed and swept).

If lnd didn't send a commit_sig, then it should have fast-failed htlcA

Fast failed after the timeout? Or fast fail since the HTLC was never fully extended? Since it seems as though the HTLC was never fully extended, a fast fail isn't an option as if the HTLC does get cleared on that link, then we lose money if the HTLC is ultimately settled.

Do they have a set of logs under the CNCT log trace?

Roasbeef on 13 Oct 2018

It seems like lnd was waiting for a revoke_and_ack that never arrived, and after a while lnd just removed htlcB (not sure what exactly triggered this).

In that message it was unable to send out the commitment for that outgoing link, as the remote party has never sent a prior revocation, or we lost it for some reason.

What do you mean by remove the HTLC?

Roasbeef on 13 Oct 2018

@pm47 Version of lnd: v0.5.0-beta 48d016bc78e9f338742b10d936e52e7655fd8770

andrewshvv on 13 Oct 2018

If htlcB was never locked in, then it could be the case that lnd didn't actually have a htlcB in its version of the commitment transaction.

It definitely seems to be the case that htlcB wasn't cross-signed, but even in this case htlcA shouldn't be left to time out. That's my point actually: in an A->B->C setting, if there is a problem between B and C (whatever the problem), it should never lead to the A-B channel getting closed, that's what cltv_expiry_delta are for.

We'll only cancel back the HTLC (incoming) once the outgoing HTLC has been fully resolved on-chain (confirmed and swept).

I think that is the "nominal" case, but there are corner case like the one we have here, which should also ultimately lead to cancelling the incoming htlc.

If lnd didn't send a commit_sig, then it should have fast-failed htlcA

Fast failed after the timeout? Or fast fail since the HTLC was never fully extended?

If lnd hasn't sent a commit_sig to the next node, then htlcA can be failed immediately. I figured that in this particular scenario where lnd seems to be waiting for a revoke_and_ack before sending the new commit_sig that contains htlcB, it could decide at some point (after a few seconds? minutes?) to:

simply close the channel, and fail htlcs that were never signed. It seems harsh, but not doing anything there leads to the previous channel getting closed because of the htlc timeout, so instead of closing a channel with the unresponsive downstream peer we close a channel with the valid upstream peer.

just discard unsigned htlcs (e.g. disconnect/reconnect) and fail them in the upstream channel, and then keep waiting for the revoke_and_ack.

Since it seems as though the HTLC was never fully extended, a fast fail isn't an option as if the HTLC does get cleared on that link, then we lose money if the HTLC is ultimately settled.

How could you lose money if you never sent the commit_sig containing htlcB?

Do they have a set of logs under the CNCT log trace?

@AndrewSamokhvalov I'll let you answer

It seems like lnd was waiting for a revoke_and_ack that never arrived, and after a while lnd just removed htlcB (not sure what exactly triggered this).

In that message it was unable to send out the commitment for that outgoing link, as the remote party has never sent a prior revocation, or we lost it for some reason.
What do you mean by remove the HTLC?

That's just what I read in the logs (removing Add packet), I don't know what that meant exactly for lnd. I explained above what I think it should mean.

pm47 on 13 Oct 2018

That's my point actually: in an A->B->C setting, if there is a problem between B and C (whatever the problem), it should never lead to the A-B channel getting closed, that's what cltv_expiry_delta are for.

If the B-C link is in an indeterminate state (for w/e reason), and this persists for more than the delta, then the channel will indeed be closed. In this case, we can't cancel that HTLC on the outgoing, as ti isn't yet fully locked in and in a partial limbo state (they have a state but we don't or the other way around).

I think that is the "nominal" case, but there are corner case like the one we have here, which should also ultimately lead to cancelling the incoming htlc.

Indeed if we're able to detect it and safely recover.

If lnd hasn't sent a commit_sig to the next node, then htlcA can be failed immediately.

Yeah in the normal case, it depends on exactly what was committed in the prior engagement though. I agree that those are all valid actions we can take at a point like this. The main difficulty is in ensuring that we don't lose funds due to canceling the incoming HTLC before the outgoing is fully resolved.

Roasbeef on 15 Oct 2018

I've been running into this issue as well sending between two nodes. Maybe. Debugging here with Alex. I have the HTLC here on my send.

    {
        "active": true,
        "remote_pubkey": "02529db69fd2ebd3126fb66fafa234fc3544477a23d509fe93ed229bb0e92e4fb8",
        "channel_point": "b4c57bf511d3d9cd29c046dda4af7230edfabf7df4386cc8fc74c37d25bed73e:0",
        "chan_id": "594163989040594944",
        "capacity": "6500000",
        "local_balance": "123195",
        "remote_balance": "6374666",
        "commit_fee": "2139",
        "commit_weight": "724",
        "fee_per_kw": "2676",
        "unsettled_balance": "0",
        "total_satoshis_sent": "1003",
        "total_satoshis_received": "124400",
        "num_updates": "323",
        "pending_htlcs": [
            {
                "incoming": false,
                "amount": "200",
                "hash_lock": "kDv1vJy1S674aWv8t6vSJ8co9k3VffTynyKoOM4M3cQ=",
                "expiration_height": 547439
            },
            {
                "incoming": false,
                "amount": "1",
                "hash_lock": "s4bWJfNSlN0H4wcRVg/W3uNYFo/Nq72QPPjZG5v4Hm0=",
                "expiration_height": 547447
            }
        ],
        "csv_delay": 781,
        "private": false
    },

   "active": true,
        "remote_pubkey": "02529db69fd2ebd3126fb66fafa234fc3544477a23d509fe93ed229bb0e92e4fb8",
        "channel_point": "9a3cd6a722dbc5ee17394ee3e99ce7727d03f69090d3c65d8065ecea0091641e:0",
        "chan_id": "599093099740069888",
        "capacity": "5210295",
        "local_balance": "988894",
        "remote_balance": "4219464",
        "commit_fee": "1937",
        "commit_weight": "724",
        "fee_per_kw": "2674",
        "unsettled_balance": "0",
        "total_satoshis_sent": "108006",
        "total_satoshis_received": "1096901",
        "num_updates": "157",
        "pending_htlcs": [
            {
                "incoming": true,
                "amount": "1",
                "hash_lock": "s4bWJfNSlN0H4wcRVg/W3uNYFo/Nq72QPPjZG5v4Hm0=",
                "expiration_height": 547303
            }
        ],
        "csv_delay": 626,
        "private": false
    },

juscamarena on 24 Oct 2018

HTLC failed after a restart of LND.

Any reason why it didn't accept it?

It failed before expiration.

    {
        "memo": "Bitrefill Deposit 5bd0ac20d8d5f4000494d388",
        "receipt": null,
        "r_preimage": "8lvZxfeV3DviixRuUlS/figwClNPNx125xoUv10h1WM=",
        "r_hash": "s4bWJfNSlN0H4wcRVg/W3uNYFo/Nq72QPPjZG5v4Hm0=",
        "value": "1",
        "settled": false,
        "creation_date": "1540402208",
        "settle_date": "0",
        "payment_request": "lnbc10n1pdaptpqpp5kwrdvf0n222d6plrqug4vr7kmm34s950ek4mmypulrv3hxlcreksdzygf5hgun9ve5kcmpqg3jhqmmnd96zqdtzvscxzcejxpjrsep4vc6rqvpsxsungepn8quqcqzysxqr8pqnhnvguuudh80zn7nmfcud0xy23gm0lr5qd2pz3dcfrf93tlqlenn0vvj60t2cufr799n2plcm3gvzs88dj23p8v397rgq4ysx7lwg6cpe2fqq7",
        "description_hash": null,
        "expiry": "7200",
        "fallback_addr": "",
        "cltv_expiry": "144",
        "route_hints": [
        ],
        "private": false,
        "add_index": "2842",
        "settle_index": "0",
        "amt_paid": "0",
        "amt_paid_sat": "0",
        "amt_paid_msat": "0"
    },

juscamarena on 24 Oct 2018

@juscamarena do you have a goroutine dump or debug logs?

Roasbeef on 24 Oct 2018

No, but if I run into it again I will provide it

juscamarena on 26 Oct 2018

I'm not sure what is the status or priority of this, but it is currently the main cause of channel failures on our node, has been for quite some time (e.g. in february it happened 26 times so far).

That's my point actually: in an A->B->C setting, if there is a problem between B and C (whatever the problem), it should never lead to the A-B channel getting closed, that's what cltv_expiry_delta are for.

If the B-C link is in an indeterminate state (for w/e reason), and this persists for more than the delta, then the channel will indeed be closed. In this case, we can't cancel that HTLC on the outgoing, as ti isn't yet fully locked in and in a partial limbo state (they have a state but we don't or the other way around).

If you close B-C, then once your or your peer's commitment tx is confirmed, it won't be in an indeterminate state anymore (HTLC either will or will not be in the commit tx) and you will be able to fail or fulfill it in A-B. All that happening within cltv_expiry_delta.

pm47 on 15 Feb 2019

👍1

Hey guys,

I've been poking at _lnd_'s code to try to figure this out and I have been able to find a simple local repro. It looks like this issue happens when _lnd_ sends a commit_sig, doesn't receive a revoke_and_ack but receives a commit_sig (it's possible that other tweaks in lost messages trigger the same bug, but at least I'm able to easily reproduce it with those steps).

Here is the repro (and I attached all the logs, lnd data and bitcoind regtest data captured while running the repro):

Create an _lnd_ binary that never sends revoke_and_ack:
-> in htlcswitch/link.go, comment line 1708 (l.cfg.Peer.SendMessage(false, nextRevocation))
-> in htlcswitch/link.go, comment line 1744-1748 (if err := l.updateCommitTx())
Disable autopilot
Create 3 nodes: Alice (master branch), Bob (master branch), Carol (modified to not send revoke_and_ack)
Alice initially has 0.30000000 BTC, Bob has 0.20000000 BTC, Carol has 0.10000000 BTC
Run in regtest with Chan_AB (A (5000000 sat) -> B (0 sat)) and Chan_BC (B (2000000 sat) -> C (0 sat))
Carol creates invoice (1000000 sat) that Alice tries to pay (block 250):
-> Bob doesn't receive Carol's revoke_and_ack but receives its next commit_sig, which it considers invalid (rightfully so because it doesn't contain the HTLC)
-> Bob force closes the channel with Carol using only to_local with all channel funds
-> Carol doesn't redeem the HTLC (whereas she could since she has the preimage and Bob's commit_sig)
-> After 10 blocks: Bob still hasn't failed HTLC_AB: he should fail it because his CommitTx for Chan_BC has been confirmed,
so Carol can't steal Chan_BC's funds (can't use the preimage anymore)
-> After 30 more blocks, we reach the CLTV expiry of HTLC_BC but Bob has already used his CommitTx (the one without the HTLC)
-> After 40 more blocks, we reach the CLTV expiry of HTLC_AB and A is forced to close the channel
-> After CSV timeout, everyone gets their funds back (except fees)

What should have happened:

Once Bob has closed Chan_BC and seen that its CommitTx has been confirmed, Bob should fail the HTLC_AB.

@Roasbeef does it make sense? I tried setting up a unit test showcasing that but I couldn't find an easy way to manipulate Carol's outgoing messages to simulate a message loss while still keeping all the other features (chain_arbitrator in particular).

bitcoind_regtest.tar.gz
lnd_data.tar.gz

t-bast on 10 May 2019

Yeah that makes sense. We only examine HTLCs to forward once we receive a revocation from the remote party. If we never receive a revocation, then we won't examine which changes are fully locked in to forward onwards (in this case an HTLC settle). If this is only triggered by an incorrect channel state machine (not sending the messages and not retransmitting them on reconnect), then I'm not sure our behavior is incorrect as this would be a violation of the spec.

One thing that would partially resolve this, would be for us to pipeline the pre-image across the relevant sub-systems as soon as we obtain it. I think eclair might already do this, but we don't.

Roasbeef on 10 May 2019

I'm not sure our behavior is incorrect as this would be a violation of the spec

Carol misbehaves so it is ok if channel Alice-Bob gets closed, really?

pm47 on 10 May 2019

I didn't make that statement. I'm describing how the state machine is
defined and implementation level details that can result in an event like
this.

On Fri, May 10, 2019, 1:54 PM Pierre-Marie Padiou notifications@github.com
wrote:

I'm not sure our behavior is incorrect as this would be a violation of the
spec

Carol misbehaves so it is ok if channel Alice-Bob gets closed, really?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/lightningnetwork/lnd/issues/2045#issuecomment-491427246,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAHTWLSZIU2HUGFBGPLV77DPUXOJ3ANCNFSM4F3I5PJA
.

Roasbeef on 10 May 2019

One thing that would partially resolve this, would be for us to pipeline the pre-image across the relevant sub-systems as soon as we obtain it. I think eclair might already do this, but we don't.

I'm not sure this is focusing on the right thing. Here Bob never sees the preimage. However, Bob broadcasts his previous commit tx (not containing the faulty HTLC) and never gave Carol the commitment secret (so Carol can't use the revocation outputs). Once that transaction is confirmed (after a few blocks), Bob can completely forget about this HTLC without loss of funds. That's why at that moment Bob should fail the HTLC with Alice (instead of letting Alice time it out).

From looking at your code, it seems like this could be done in the contractcourt (the part that watches the blockchain) right?

t-bast on 11 May 2019

Thanks for the detailed report! We have a fix up that ensures we won't let the incoming link time out unnecessarily.

Roasbeef on 18 May 2019

👍2

Thanks, I just tested with my repro steps and it works nicely.

t-bast on 28 May 2019

👍1

Was this page helpful?

0 / 5 - 0 ratings