Lightning: Unilateral close when net connection lags

Created on 11 Mar 2019 · 22Comments · Source: ElementsProject/lightning

Issue and Steps to Reproduce

Yesterday I had a few smaller outage in the internet connection. (About 5-20 minutes for 3-4 times, my ISP did some maintenance work.)

When I've got my connection back, clightning unilaterally closed some channels referring to "sync error".
Here is the listpeers info of one of it:

{removed}

The interesting parts of the lightningd logs:

{removed}

It is not the only case when I saw this behavior. Half an hour earlier my node closed the channel with node {removed}. And a day before closed with {removed}. Same symptoms: lagging internet connection, clightning referring to sync error, unilateral channel closure.

It's unpleasant because I definitely didn't want to close these channels.

`getinfo` output

{removed}

lnd-compat reconnect

Source

gallizoltan

👍1

Most helpful comment

@trueptolemy nice to know and thanks for that but what has that to do with the "sync error" message that makes c-lightning unilaterally close channels? Note: this is the worst bug I've ever encountered.

ghost on 9 May 2019

👍4

All 22 comments

FEATURE REQUEST: please give me a config option to make clightning unable to initiate channel closure (without explicit human request). I've lost a dozen channels in the past few days.

I don't want my node to close any channel under any circumstances. If some extreme situation occurs, then:
1) First and most important: do not close a channel automatically! No hurry, I can do it later manually if it is needed.
2) Inform me about the situation. I can grep error messages from lightningd logs or something.
3) Let me decide if I want to close the channel having regard to the circumstances.

gallizoltan on 11 Mar 2019

based on the logs you've provided, it appears that your channel peer closed the channel on you, i.e. received ERROR from peer. I suspect this is due to an outstanding bug in the data loss protect interop between LND and clightning that I haven't been able to successfully pin down. Can you supply the IO level logs for that peer id: d722e8f0e58966583d0c5e98def56aaf110ac467eb03a75e72ca62e95790c9b3?

given that this closure was initiated by your peer, there's nothing that c-lightning can do to mitigate this particular case.

niftynei on 11 Mar 2019

@niftynei I'm sorry, I don't have any other logs beside I've already provided.

The channel closure surely happened on my side, my clightning node initiated.

At listpeers you can see Tracking our own unilateral close. And if it is not convincing enough, you can take a look at the closing transaction: http://chainquery.com/bitcoin-api/getrawtransaction/c5d543175135c05df3e6c2a52621d6c6056a0dafae6df1ff9e41a0d82d5df1f5/1
All output goes to a witness_v0_keyhash (P2WPKH). Meaning: who initiated the closure, has no money on the channel. It's my node. The listpeers shows "msatoshi_to_us": 0.
(If my counterparty would close the channel on me, all output would go to a witness_v0_scripthash (P2WSH) - a timelocked contract to give time for a possible penalty transaction).

gallizoltan on 12 Mar 2019

Can confirm. While running c-lightning (version: version": "v0.7.0-20-g6f71564) normally without maintenance or whatever, I had the same thing happening yesterday with Yalls.org:
ID: 03e50492eab4107a773141bb419e107bda3de3d55652e6e1a41225f06a0bbf2d56

"CHANNELD_NORMAL:Received error from peer: channel 1601e3910d154aeef2384f3468392155b66aa933ac4563ab9e193992709da833: sync error

ghost on 13 Mar 2019

👍1

got the same error today in one of my channels:

```"status": ["CHANNELD_NORMAL:Received error from peer: channel b14b557979a67c65d2182a81e6027acd32122dbf98778337a794b68d1a9d2ffe: sync error",
"ONCHAIN:Tracking our own unilateral close",
"ONCHAIN:3 outputs unresolved: in 173 blocks will spend DELAYED_OUTPUT_TO_US (bbdede6ce142bcac2d0c21fad0af2143d26198f9c4f607d27e93b4e447107235:1) using OUR_DELAYED_RETURN_TO_WALLET"
],

mloop1 on 19 Mar 2019

Another unilateral closure with sync error today:

2019-03-27T06:20:11.762Z lightningd(24574): 035f5236d7e6c6d16107c1f86e4514e6ccdd6b2c13c2abc1d7a83cd26ecb4c1d0e chan #7388: Peer permanent failure in CHANNELD_NORMAL: lightning_channeld: received ERROR channel be954be818b99c5442f7ebf643fb2f0a9bf170fa9fb50528919739488d3ff71a: sync error
2019-03-27T06:20:11.762Z lightningd(24574):      (tx f0268b09f681c72d7e9f5054fc206f802cf59ecdf61f11a4c9b3cee96e2f1a01)
2019-03-27T06:20:11.766Z lightningd(24574): 035f5236d7e6c6d16107c1f86e4514e6ccdd6b2c13c2abc1d7a83cd26ecb4c1d0e chan #7388: State changed from CHANNELD_NORMAL to AWAITING_UNILATERAL
2019-03-27T06:21:11.027Z lightningd(24574): 035f5236d7e6c6d16107c1f86e4514e6ccdd6b2c13c2abc1d7a83cd26ecb4c1d0e chan #7388: Peer transient failure in AWAITING_UNILATERAL: Reconnected

https://blockstream.info/tx/f0268b09f681c72d7e9f5054fc206f802cf59ecdf61f11a4c9b3cee96e2f1a01

gallizoltan on 27 Mar 2019

I'm also getting multiple sync error and channels closing (eclips.lnd and rompert.com).

Version: v0.7.0-53-g02ddeed

ghost on 2 Apr 2019

👍1

I lost all LND nodes because of sync errors (and some other less frequent errors) probably due to temporary internet connection problems but which were no problem for the c-lightning nodes. I've decided to only open new channels to c-lightning nodes.

ghost on 17 Apr 2019

I'm exactly in the same situation: opening a channel with a lnd node is much more risky that it will get closed sooner or later

mloop1 on 17 Apr 2019

👍2

I have the same problem. There is any solution? I think it's a big problem for c-lightning, most LN nodes are lnd.

poperbu on 30 Apr 2019

👍3

I find this for Lnd:

We send pings every 1 minute, so the connection should be kept alive on your end. Then there's an idle timer on our end after 5 mins, we'll disconnect if you haven't sent anything.

trueptolemy on 8 May 2019

👍2

ghost on 9 May 2019

👍4

@sumBTC sorry, I'm not very familiar with Lnd and go, but I searched "sync error" and browsed some Lnd code. I find the cases about "sync error" caused our channel closed includes:

Lnd tries to reconnect and sends a channel_reestablish message, and expect the reply in 30s! If timeout, Lnd would fail the channel, but it wouldn't take the initiative to close the channel. It sends us a error message about sync error and ask us to close our channel. In fact, when we receive sync error, our channel has been failed by Lnd.
Lnd receives a channel_reestablish reply in 30s from us, but it finds something wrong, like error commit with data loss, like error that Lnd thinks it's unable to properly synchronize states with us, etc...(the error is here, line 887 - 918). It would also fail our channel and send a error message: sync error, to us, and ask us to close our channel like case1!

I think the most cases about unilateral close above is case 1, because of timeout(30s), I have to admit that only 30s is a too strict limit. As for how long it should wait, rfc only said:

MUST wait to receive the other node's channel_reestablish message before sending any other messages for that channel.

I think clightning doesn't have time limit (I don' find).

If the above what I analysised is correct, these unilateral close about sync error may not be a bug. For those two cases, we can't ignore this error message locally because Lnd has already failed our channel and it only waits for closing.

trueptolemy on 22 May 2019

👍2

@trueptolemy Thanks a lot for researching this issue and for the clear explanation, really appreciated.

we can't ignore this error message locally because Lnd has already failed our channel and it only waits for closing.

Wouldn't these 30 seconds max delay not also be an issue for LND nodes that have a temporary internet connection problem? So if a LND node has channels open with say 30 other LND nodes and it experiences an internet connection problem all the channels would fail and be closed? That doesn't sound like a workable system (for LND).

ghost on 23 May 2019

👍3

Note that channel_reestablish is sent only after both sides have managed to get a connection, not before. A 30s timeout after a connection is established seems generous to me. This is not a timeout on how long to reestablish the connection, but a timeout on how long after reestablishing the connection before we reestablish the channel. lightningd sends channel_reestablish first thing on reconnecting, so it is unlikely to be this issue.

@niftynei seems to have copies of some log, so we should be able to see time in those logs to determine whether it is case1 or case2.

Based on comments https://github.com/ElementsProject/lightning/issues/2462#issuecomment-471742623 it seems issue is with option_data_loss_protect, i.e. case2.

ZmnSCPxj on 23 May 2019

👍3

What's the reason to shift this very important issue from milestone 0.7.1 to 0.7.2? Are we the only ones having this problem? Is it too difficult to solve?

ghost on 3 Jul 2019

@sumBTC When did you see this bug last time? Has it happened recently?

I've checked my logs, and it last occurred on the 6th of April on my machine. I'm not sure it is still present. Maybe an old LND version caused it, and already get fixed on LND side.

gallizoltan on 3 Jul 2019

@gallizoltan It happened about a month ago. I lost my channel with Bitkassa and another LND channel with an Italian peer. Have no logs of those events. It was partly due to attacks on my Electrum Server (electrumx) which made my computer extremely slow but my c-lightning channels had no problems.

ghost on 3 Jul 2019

It was partly due to attacks on my Electrum Server (electrumx) which made my computer extremely slow but my c-lightning channels had no problems.

It could be the lnd side erroring the channel due to the lnd 30second window after connection establishment for channel reestablishment: overloaded computer means C-Lightning is slow, maybe unable to send channel reestablishment window before lnd kills the channel. C-lightning itself does not impose a similar timeout. Maybe @niftynei can chime in here about the data_loss_protect problems with lnd if that is the possible root cause?

ZmnSCPxj on 4 Jul 2019

In the mean time I will reopen the two channels that were closed last time. My Electrum Server seems to be working (reasonably) well at the moment.

ghost on 4 Jul 2019

using electrumx against the same bitcoind as lightningd has caused troubles because of electrum DDoS for me in the past. lightningd may end up in a situation where it is not able to sync correctly, maybe this can also cause channel closures? I solved it by remove electrumx, as my root server harddrive (non-ssd) isn't able to keep up with IO demand on electrum and bitcoin.

m-schmoock on 4 Jul 2019

This is fixed in https://github.com/ElementsProject/lightning/pull/2842.

The core issue is that receiving an error is considered fatal to the channel, by the spec. lnd does not do this to errors, and (not surprisingly) as a result they tend to send errors when things aren't fatal.

IMHO the spec needs more clarity on errors; this is on the roadmap for 1.1, and episodes like this make it higher priority.