Lightning: Writing to the gossip file descriptor in connectd

Created on 9 Oct 2020  路  11Comments  路  Source: ElementsProject/lightning

Issue and Steps to Reproduce

My node on VPS just stopped working one day. All channels were offline and I found this error in journalctl

Oct 07 07:54:41 satoshis-box lightningd[23254]: lightning_connectd: Failed writing to gossipctl: Broken pipe (version v0.9.1)
Oct 07 07:54:41 satoshis-box lightningd[23254]: keysend: Lost connection to the RPC socket.
Oct 07 07:54:41 satoshis-box lightningd[23254]: autoclean: Lost connection to the RPC socket.
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28c4487 send_backtrace
Oct 07 07:54:41 satoshis-box lightningd[23254]:         common/daemon.c:38
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28c7934 status_failed
Oct 07 07:54:41 satoshis-box lightningd[23254]:         common/status.c:206
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28bca2b add_gossip_addrs
Oct 07 07:54:41 satoshis-box lightningd[23254]:         connectd/connectd.c:1421
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28bd41b try_connect_peer
Oct 07 07:54:41 satoshis-box lightningd[23254]:         connectd/connectd.c:1470
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28bd6aa connect_to_peer
Oct 07 07:54:41 satoshis-box lightningd[23254]:         connectd/connectd.c:1534
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28be551 recv_req
Oct 07 07:54:41 satoshis-box lightningd[23254]:         connectd/connectd.c:1603
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28c45ce handle_read
Oct 07 07:54:41 satoshis-box lightningd[23254]:         common/daemon_conn.c:31
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28e51f7 next_plan
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ccan/ccan/io/io.c:59
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28e56e6 do_plan
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ccan/ccan/io/io.c:407
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28e5713 io_ready
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ccan/ccan/io/io.c:417
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28e6fdf io_loop
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ccan/ccan/io/poll.c:445
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28bed4f main
Oct 07 07:54:41 satoshis-box lightningd[23254]:         connectd/connectd.c:1679
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x7fa6c3671b96 ???
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ???:0
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0x559fd28ba589 ???
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ???:0
Oct 07 07:54:41 satoshis-box lightningd[23254]: 0xffffffffffffffff ???
Oct 07 07:54:41 satoshis-box lightningd[23254]:         ???:0
Oct 07 07:54:41 satoshis-box systemd[1]: lightningd.service: Main process exited, code=exited, status=1/FAILURE
Oct 07 07:54:41 satoshis-box systemd[1]: lightningd.service: Failed with result 'exit-code'.

Then found one channel force-closed but this is for an unknown reason. This channel https://ln.bigsun.xyz/channel/618125x2153x1

fixed

Most helpful comment

I will check and report if it occurs agian. Thank you very much for your support.

All 11 comments

Seems we fail to write to the gossip file descriptor here:

https://github.com/ElementsProject/lightning/blob/ad049c36c119d1e283ac270d73b60d811c509c59/connectd/connectd.c#L1418-L1423

Seems that the socketpair that is the other side of GOSSIPCTL_FD was closed? I don't think it can be gossipd since then lightningd itself would complain about it disappearing.

@cdecker thank you for the response but I'm afraid I can't help here since I don't understand what actually happens. I'm more like a lightningd user here. NIftynei just suggested to post an issue.

Hey, @cdecker, I have the same error again on my node.

  1. Full log
Oct 24 19:47:52 satoshis-box lightningd[19246]: lightning_connectd: Failed writing to gossipctl: Broken pipe (version v0.9.1)
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a0a487 send_backtrace
Oct 24 19:47:52 satoshis-box lightningd[19246]:         common/daemon.c:38
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a0d934 status_failed
Oct 24 19:47:52 satoshis-box lightningd[19246]:         common/status.c:206
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a02a2b add_gossip_addrs
Oct 24 19:47:52 satoshis-box lightningd[19246]:         connectd/connectd.c:1421
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a0341b try_connect_peer
Oct 24 19:47:52 satoshis-box lightningd[19246]:         connectd/connectd.c:1470
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a036aa connect_to_peer
Oct 24 19:47:52 satoshis-box lightningd[19246]:         connectd/connectd.c:1534
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a04551 recv_req
Oct 24 19:47:52 satoshis-box lightningd[19246]:         connectd/connectd.c:1603
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a0a5ce handle_read
Oct 24 19:47:52 satoshis-box lightningd[19246]:         common/daemon_conn.c:31
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a2b1f7 next_plan
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ccan/ccan/io/io.c:59
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a2b6e6 do_plan
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ccan/ccan/io/io.c:407
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a2b713 io_ready
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ccan/ccan/io/io.c:417
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a2cfdf io_loop
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ccan/ccan/io/poll.c:445
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a04d4f main
Oct 24 19:47:52 satoshis-box lightningd[19246]:         connectd/connectd.c:1679
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x7fb4a5ffdb96 ???
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ???:0
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0x561825a00589 ???
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ???:0
Oct 24 19:47:52 satoshis-box lightningd[19246]: 0xffffffffffffffff ???
Oct 24 19:47:52 satoshis-box lightningd[19246]:         ???:0
Oct 24 19:47:52 satoshis-box lightningd[19246]: autoclean: Lost connection to the RPC socket.
Oct 24 19:47:52 satoshis-box lightningd[19246]: keysend: Lost connection to the RPC socket.
Oct 24 19:47:52 satoshis-box systemd[1]: lightningd.service: Main process exited, code=exited, status=1/FAILURE
Oct 24 19:47:52 satoshis-box systemd[1]: lightningd.service: Failed with result 'exit-code'.
Oct 24 19:48:02 satoshis-box systemd[1]: lightningd.service: Service hold-off time over, scheduling restart.
Oct 24 19:48:02 satoshis-box systemd[1]: lightningd.service: Scheduled restart job, restart counter is at 2.
  1. Node restarted but 1 of 3 opened channels couldn't return back online.

@engenegr What version? Did you try to update to some later version and re-run?

I was wondering if this could be related to something on the disk: is the disk that the lightning-dir is on low on capacity? Is it getting remounted in read-only mode? (dmesg can tell you a bit about this kind of error in the kernel). Notice that after the restart the gossip_store file which backs the gossip daemon is being compressed, so restarting may free some disk space.

Are there any other logs, from lightning_gossipd particularly, around the time the node exits that may shed some more light? (lightning_gossipd doesn't log sensitive stuff, except the occasional payment destination, so feel free to upload journalctl -u lightningd -n 1000 |& grep gossipd somewhere and link it here).

As for the channel not being online: is it being closed, or just "connected": false. In the latter case the peer may just be unreachable, or we don't have an address for it and they'll need to reconnect from their end.

@jsarenik v0.9.1

@cdecker

  1. journalctl -u lightningd -n 1000 |& grep gossipd is empty
  2. I see there is enough space left on VPS
  3. The channels aren't closed. The peer is one of reliables. It is LightningTo.Me service.

I have some success with dmesg

9954162.495456] Out of memory: Kill process 19290 (lightningd) score 202 or sacrifice child
[9954162.497159] Killed process 19734 (lightning_gossi) total-vm:84512kB, anon-rss:66888kB, file-rss:0kB, shmem-rss:0kB
[9954162.516137] oom_reaper: reaped process 19734 (lightning_gossi), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Does it mean I need more RAM for the node?

It does indeed, it looks like the 66MB that lightning_gossip makes it the highest scored target to be killed when you are out of memory (OOM). That's a bit surprising, since the score is higher the more memory a process uses (that and a couple of other things like marking some processes as non-OOM-killable). How much memory do you have on the machine and what else is running on the machine?

We are working on further reducing the memory-footprint, only loading the gossip map when it is needed, so it'll have short bursts of memory use, but the base state is far lower. See https://github.com/ElementsProject/lightning/pull/4093 for details.

An alternative to adding more RAM would be to add a swapfile that can be used to externalize some of the memory that is seldomly used.

I have 1GB instance and usually, it has 760 out of 985MB busy.
I maybe move my node to another instance and try to provide more memory for lightningd. Or at least enable swap, indeed.

If it's just occasionally swamped then swap can help, however it's easy to end up in swap hell if the memory pressure persists or if the other processes on the system have memory-access patterns that are not compatible with swap (random access to memory that was just swapped out, large strides across many VM pages, etc).

I will check and report if it occurs agian. Thank you very much for your support.

I will check and report if it occurs again. Thank you very much for your support.

Was this page helpful?
0 / 5 - 0 ratings