Lightning: C-Lightning just continuously restart

Created on 11 Jan 2019  Â·  38Comments  Â·  Source: ElementsProject/lightning

@Kukks updated to 0.6.3. Everything was working fine for a while.

2019-01-11T06:36:45.331Z lightningd(1): bitcoin-cli -datadir=/etc/bitcoin -rpcconnect=bitcoind getblock 00000000000000000022ef384e26ce940adf4dd430702c97a532789de1a7ae49 exited with status 1
2019-01-11T06:36:45.335Z lightningd(1): bitcoin-cli -datadir=/etc/bitcoin -rpcconnect=bitcoind getblock 00000000000000000002a0ef6121b8f3d8e5f7b1cc8b4eb9b1ea353a8de80e48 exited 1 (after 14353 other errors) 'error code: -1?error message:?Block not available (pruned data)?'
bitcoin-cli -datadir=/etc/bitcoin -rpcconnect=bitcoind getblock 00000000000000000002a0ef6121b8f3d8e5f7b1cc8b4eb9b1ea353a8de80e48 exited 1 (after 14353 other errors) 'error code: -1
error message:
Block not available (pruned data)
'
lightningd: Fatal signal 6 (version basedon-v0.6.3-rc1-modded)
2019-01-11T06:36:45.350Z lightningd(1): FATAL SIGNAL 6 (version basedon-v0.6.3-rc1-modded)
2019-01-11T06:36:45.350Z lightningd(1): backtrace: common/daemon.c:45 (crashdump) 0x55d3bb0191d1
2019-01-11T06:36:45.350Z lightningd(1): backtrace: (null):0 ((null)) 0x7f2938fcc0e9
0x55d3bb01917d crashdump
        common/daemon.c:40
0x7f2938fcc0e9 ???
        ???:0
Log dumped in crash.log.20190111063645
lightning_channeld: lightning_onchaind: Writing out status 65520lightning_onchaind: Writing out status 65522: Writing out status 65520: : Broken pipe
Broken pipeBroken pipe

lightning_channeld: Writing out status 65522: Broken pipe
lightning_channeld: Writing out status 65522: Broken pipe

Now it is always restarting after trying to rescan old blocks.

pruning

Most helpful comment

Please close this issue once the PR is merged. I deployed this on machines who got the problem, and it fixed everything.

All 38 comments

The node restart every 5 min after trying to scan all those old blocks. But meanwhile manage to accept payments.

So this user is using FastSync (synching from block 551636), and it has always been the case. Why would clightning try to scan before his wallet's birthday and crash?

clightning has a hard time keeping up on initial or catch-up sync because the getblock rpc is veryyyyy slow. This is the main reason pruned nodes aren't very reliable with clightning at the moment. clightning fetches the blocks one by one, slowly, which tends to fall behind bitcoin sync which is really fast. I don't think there's a way to get around this at the moment...

@jb55 this can't be the case.

BTCPayServer start c-lightning only when the bitcoin node is fully synched.

Things was working fine before, why does it breaks now?

Can we add a way to add our own utxo set to clightning then like in .17
Bitcoin core?

On Fri, Jan 11, 2019, 08:17 Nicolas Dorier <[email protected] wrote:

Things was working fine before, why does it breaks now?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/ElementsProject/lightning/issues/2250#issuecomment-453404939,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABu-_l2hlYRn1nCbGNJgbmH1kOxcjOWzks5vCDqngaJpZM4Z64sa
.

What was working fine? If you were starting c-lightning only after bitcoin was fully synced on a pruned node, I don't believe that could have worked.

I think @jonasnick has a script that fetches blocks from a trusted node if bitcoin-cli fails to return a block for a specific height. That's probably the most reliable option at the moment, but may require a trusted third party.

Issue might come from https://github.com/bitcoin/bitcoin/issues/15147 , searching.

What was working fine? If you were starting c-lightning only after bitcoin was fully synced on a pruned node, I don't believe that could have worked.

This has always worked, the vast majority of users of btcpay + clightning (99%?) were running this just fine.

False alert, I was just stupid, I have the blocks up to 551556. So I don't know why suddenly it stops working.

looking at the code again, I think you're right. It looks like it starts from bitcoind's current height?

The scan start at 14353 blocks before 00000000000000000002a0ef6121b8f3d8e5f7b1cc8b4eb9b1ea353a8de80e48 (block 519327)

I guess the problem is from here https://github.com/ElementsProject/lightning/blob/master/lightningd/bitcoind.c#L132

is it just endlessly trying to send the same commands over and over?

Ok so it seems the problem is that bitcoind->first_error_time is never set to 0. So you are bound to get this error at one point.

Not the problem, because it check if error_count is 0.

@NicolasDorier

so I traced this error from process_getblockhash_for_txout to what is most likely the bitcoind_getoutput call in get_txout in gossip_control.

So basically the user was gossip'd a channel that referenced a TXO that clightning didn't have. So clightning called out to getblock, which crashes because it's been pruned. I'm guessing it's channel transactions that contain txos that are behind the FastSync height.

I think fastsync should prune to height 504500 or even 502020 (2018-01-01) to be safe. See also: https://github.com/ElementsProject/lightning/issues/1297#issuecomment-393494943

why was it working before and not now?

still exit with broken pipe?
ok, just some hint might help or not.
but had this issue when try to solve the defunct zombie problem
if i got rid of the zombies after a while my fully synced nodes also
crashed.

Still restarting yes....

i think you already checked at #2145
but Maybe you should also take a lock at waitpid(pid, NULL, 0);
in lightningd

which tends to fall behind bitcoin sync which is really fast. I don't think there's a way to get around this at the moment...

Bitcoin pruning can be told where the calling application has caught up with so far. If the functionality for this isn't sufficient for your application please open an issue.

So on BTCPay side, I solved it by just waiting everything is fully synched before starting c-lightning or LND.

That said, there are several problems with my solution:

  • Old channels can't be properly validated
  • If the LN wallet need a rescan, it will get stuck without any possibility for me to fix the situation just by resynching bitcoin core

I will let other c-lightning or LND people opening an issue about it, as they have probably a better idea about how they would like such bitcoin core feature to be exposed than me.

Ok so the solution I try now is

    t = timemono_between(time_mono(), bitcoind->first_error_time);
-   if (time_greater(t, time_from_sec(60)))
+   if (time_greater(t, time_from_sec(600000)))

Meaning that the scan will fail but not crash the c-lightning node because of gossip.

Btw, I think this is an actual vulnerability of c-lightning ping @cdecker .

I can connect to a peer, send lot's of gossip message with channel which does not exists. Then if my peer for some reason did not used bitcoind for 1 min, then his node would just crash.

More generally, if a network peer can make c-lightning do a call to bitcoind which return an error code, then the network peer has a way to crash the c-lightning node.

So I added did https://github.com/btcpayserver/lightning/commit/4016fac9d40a3cdcb4ce92e3efe782ee94a79113

User says now it is working fine.

Strange fact though: I was expecting he get spammed with logs bitcoind getblock 00000000000000000022ef384e26ce940adf4dd430702c97a532789de1a7ae49 exited with status 1 like before, and that it would stop after a while.

He told me that after update, no more log spam at all. Trying to deploy this on other users who got the problem and will keep up to date.

With that patch you will not notice that you're falling behind on the bitcoind sync, because lightningd will just sit at the last block it was able to process and stay there.

As for backfilling the UTXO, that's probably a good idea, I'll see what I can do :-)

As for why this used to work, and now isn't, it is likely that the node was started before (setting a sync height), then was stopped for a while, and bitcoind pruned the blocks that were directly following that last seen block. When restarting lightningd will attempt to continue from the last seen height, and fail.

Notice that we have a timeout on how long we are willing to wait for the next block, and when the timeout triggers we may actually return an error with some other block hash (despite the original error being the head catchup attempt).

So in this case I'd say this is indeed an issue with pruning removing the needed blocks.

You can check if blockheight in getinfo makes any progress and attempt to do bitcoin-cli blockhash [height] followed by bitcoin-cli getblock [blockhash]. If that fails, then that's the root cause.

As for why this used to work, and now isn't, it is likely that the node was started before (setting a sync height), then was stopped for a while, and bitcoind pruned the blocks that were directly following that last seen block. When restarting lightningd will attempt to continue from the last seen height, and fail.

I am quite sure this can't be the case. In my docker file I only start c-lightning once bitcoind is fully synched. This problem happened to different people just after the 0.6.3 update.

The node in questions are keeping 60GB of block data. (although some of them started synching from 551636)

It is like c-lightning started wanting to fetch block before the "last seen height".

Another person with my update above still have tons of logs:

2019-01-13T11:16:09.620Z lightningd(1): bitcoin-cli -datadir=/etc/bitcoin -rpcconnect=bitcoind getblock 000000000000000000230459f023262a068f85ed215a069d26283596cb54e146 exited with status 1
2019-01-13T11:16:09.637Z lightningd(1): bitcoin-cli -datadir=/etc/bitcoin -rpcconnect=bitcoind getblock 00000000000000000032af4fccbc616c07b2ab985fe77f9b89f3224e0de46bec exited with status 1
2019-01-13T11:16:09.653Z lightningd(1): bitcoin-cli -datadir=/etc/bitcoin -rpcconnect=bitcoind getblock 00000000000000000000039a523bd1e30baebe2aae4f3043453025e49c60bcd8 exited with status 1
2019-01-13T11:16:09.670Z lightningd(1): bitcoin-cli -datadir=/etc/bitcoin -rpcconnect=bitcoind getblock 0000000000000000000ad7075d9a34998dcdfbc28b2861df2ce333e160adda62 exited with status 1
2019-01-13T11:16:09.685Z lightningd(1): bitcoin-cli -datadir=/etc/bitcoin -rpcconnect=bitcoind getblock 0000000000000000001f027a1a520aae182b5bba484edd942960c93158386177 exited with status 1
2019-01-13T11:16:09.701Z lightningd(1): bitcoin-cli -datadir=/etc/bitcoin -rpcconnect=bitcoind getblock 0000000000000000000fe0ae0a853fb33c1ee53052b5a513e054b6c7f168b9d5 exited with status 1
2019-01-13T11:16:09.722Z lightningd(1): bitcoin-cli -datadir=/etc/bitcoin -rpcconnect=bitcoind getblock 0000000000000000000ee357ae68379041f763e1d8d33b5d635fa51992ecb156 exited with status 1
2019-01-13T11:16:09.740Z lightningd(1): bitcoin-cli -datadir=/etc/bitcoin -rpcconnect=bitcoind getblock 0000000000000000001dc5c36260719a12153f8f06860b50b0bd29f267f55164 exited with status 1
2019-01-13T11:16:09.760Z lightningd(1): bitcoin-cli -datadir=/etc/bitcoin -rpcconnect=bitcoind getblock 00000000000000000012877b0843770856093e23fce4ce61ecd113862d2d1f68 exited with status 1
2019-01-13T11:16:09.782Z lightningd(1): bitcoin-cli -datadir=/etc/bitcoin -rpcconnect=bitcoind getblock 0000000000000000002336f41e5be8f824b2a8357a3c1e14108b1c740e490821 exited with status

However, his node seems to be usable and steady because it does not crash anymore.

Is there a way I can get the "last seen height"?

root@vmi228582:~# bitcoin-lightning-cli.sh getinfo
{
  "id": "035736692722a47389d6be965d73a50b1ba63233a1b975aa55db7f90446f5708cb",
  "alias": "",
  "color": "035736",
  "num_peers": 5,
  "num_pending_channels": 0,
  "num_active_channels": 4,
  "num_inactive_channels": 1,
  "address": [
    {
      "type": "ipv4",
      "address": "91.205.175.156",
      "port": 9735
    }
  ],
  "binding": [
    {
      "type": "ipv4",
      "address": "0.0.0.0",
      "port": 9735
    }
  ],
  "version": "basedon-v0.6.3-modded",
  "blockheight": 558345,
  "network": "bitcoin",
  "msatoshi_fees_collected": 6200
}
root@vmi228582:~# bitcoin-cli.sh getblockhash 558345
000000000000000000096df8257eaf022cb13fbc47f16830ca9f0067fe79e1a3
root@vmi228582:~# bitcoin-cli.sh getblock 000000000000000000096df8257eaf022cb13fbc47f16830ca9f0067fe79e1a3
# Found the block

The node on which I am trying, despite the spam about the block not found can receive LN payment just fine. Just logs are getting spammed...

So the status now is:
I can receive payments, logs are spammed with getblock messages BUT this is what I get when trying to send a payment:

2019-01-13T11:36:05.706162590Z 2019-01-13T11:36:05.705Z lightningd(1): bitcoin-cli -datadir=/etc/bitcoin -rpcconnect=bitcoind getblock 000000000000000000187e76c5d16cb4fe7c1bc79873c43fce4fdad58b6e1bb4 exited with status 1
2019-01-13T11:36:05.724464902Z 2019-01-13T11:36:05.724Z lightningd(1): lightning_channeld-0265e5791c617b6e1dfe04137e71c0c959201592bd7fe3f1bb085c35595b72646c chan #21: Adding HTLC too slow: killing channel
2019-01-13T11:36:05.725261236Z 2019-01-13T11:36:05.725Z lightningd(1): 0265e5791c617b6e1dfe04137e71c0c959201592bd7fe3f1bb085c35595b72646c chan #21: Peer transient failure in CHANNELD_NORMAL: lightning_channeld: Owning subdaemon lightning_channeld died (9)
2019-01-13T11:36:05.726209015Z 2019-01-13T11:36:05.726Z lightningd(1): pay(0x56537bde3e68): Try another route
2019-01-13T11:36:05.728342061Z 2019-01-13T11:36:05.728Z lightningd(1): bitcoin-cli -datadir=/etc/bitcoin -rpcconnect=bitcoind getblock 000000000000000000128df3ed8616ab3143f68d4b4a8e97896ed8ffae8f4af7 exited with status 1

bcli_failure is retrying getblock every second while it should not... the case of pruned node is treated correctly by the caller, but somehow this callback is never called if the block is not found. Instead it just stubbornly try to retry the same command.

So the faulty assumption is that bitcoin-cli process should returns error code different from 0 even if the block is not found

bitcoin-cli -datadir=/etc/bitcoin -rpcconnect=bitcoind getblock 00000000000000000002a0ef6121b8f3d8e5f7b1cc8b4eb9b1ea353a8de80e48

Sadly, bitcoin-cli returns an error code if the block is not found which cause an infinite loop, because bcli_failure is retrying it indefinitely.

Please close this issue once the PR is merged. I deployed this on machines who got the problem, and it fixed everything.

I also have the infinite getblock issue but my Bitcoin Core node is not in prune mode. I tried Nicolas' patch and it seems to work. I guess the "pruning" tag may be removed. Or are there two separate issues?

@Kexkey can you include the bitcoin-cli output that you get back?

@Kexkey well I don't know if this condition is only reached by pruned node.
But right now, callbacks of bitcoin-cli expect to process results which may result with error code 1. However this expectation is not fulfilled.

I know it is problematic for pruned node (maybe also problematic if the gossiper is malicious, though I am not sure).
But it is possible the expectation is also broken at other places.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

mloop1 picture mloop1  Â·  4Comments

brunoaduarte picture brunoaduarte  Â·  5Comments

jonasnick picture jonasnick  Â·  3Comments

gallizoltan picture gallizoltan  Â·  3Comments

ZmnSCPxj picture ZmnSCPxj  Â·  4Comments