lnd outgoing gossip flood can cause message timeouts?

Created on 22 Feb 2019 · 6 Comments · Source: lightningnetwork/lnd

Background

I run c-lightning under valgrind, which makes it pretty slow. c-lightning also asks for all the gossip, which I suspect is causing part of the problem.

I tried to fund a channel to 038c72e69dda1966fea4b98f3724fc906a097f84bbf6cd4395cd285085db58b8f1 (which runs lnd, unknown version), and it seems to have not been able to send the funding_signed reply in time (through the flood of gossip?), eventually giving up and reconnecting:

2019-02-22T11:03:41.709Z lightningd(1212): lightning_openingd-038c72e69dda1966fea4b98f3724fc906a097f84bbf6cd4395cd285085db58b8f1 chan #7594: peer_out WIRE_OPEN_CHANNEL
2019-02-22T11:03:42.135Z lightningd(1212): lightning_openingd-038c72e69dda1966fea4b98f3724fc906a097f84bbf6cd4395cd285085db58b8f1 chan #7594: peer_in WIRE_ACCEPT_CHANNEL
2019-02-22T11:03:44.092Z lightningd(1212): lightning_openingd-038c72e69dda1966fea4b98f3724fc906a097f84bbf6cd4395cd285085db58b8f1 chan #7594: peer_out WIRE_FUNDING_CREATED
2019-02-22T11:08:45.983Z lightningd(1212): 038c72e69dda1966fea4b98f3724fc906a097f84bbf6cd4395cd285085db58b8f1 chan #7594: Killing openingd: Reconnected

As you can see, we get no response for 5 minutes, and lnd reconnects. It then tries to REESTABLISH but we consider the channel not to have been started.

I fixed this by restarting without valgrind, but this might become an issue for other slow nodes, especially as the network gets bigger...

All 6 comments

We have write timeouts set on the socket: if you don't pull the messages in time, we'll think the socket is borked and hang up.

I think the issue is we don't prioritize channel update or funding related messages over gossip.
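To illustrate the write-timeout behaviour described above, here is a minimal sketch in Python (not lnd's actual Go code). The helper name `send_with_write_timeout`, the chunk size, and the 0.2 s timeout are all assumptions chosen for the demo. A local socket pair stands in for the connection, and the "slow peer" never reads, so the sender's buffer fills and the write eventually times out, at which point a sender would typically treat the connection as dead.

```python
import socket


def send_with_write_timeout(sock, data, timeout_s):
    """Return True if all of data was written, False if the write timed out.

    Hypothetical helper for illustration; not part of lnd or c-lightning.
    """
    sock.settimeout(timeout_s)
    try:
        sock.sendall(data)
        return True
    except socket.timeout:
        return False


# A connected pair of local sockets; slow_peer never calls recv().
sender, slow_peer = socket.socketpair()

chunk = b"\x00" * 65536  # stand-in for a burst of gossip messages
delivered = True
chunks_sent = 0

# Keep writing until the kernel buffers fill and a write times out.
# The upper bound only guards against an unexpectedly large buffer.
while delivered and chunks_sent < 10000:
    delivered = send_with_write_timeout(sender, chunk, 0.2)
    if delivered:
        chunks_sent += 1

timed_out = not delivered
print("write timed out:", timed_out)

# What a sender with write timeouts would do next: hang up.
sender.close()
slow_peer.close()
```

Any urgent message queued behind the stalled gossip write never reaches the peer, which is consistent with the missing funding_signed reply in the report.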

@rustyrussell Thanks for the report! Would never have thought to use valgrind to trigger this, super cool :)

I agree with your assessment, we should be deprioritizing gossip traffic. We've received some other reports of strange behavior around this area, and all seemed to point back to the initial gossip burst and not being able to transmit more important messages: chanreestablish, openchannel, probably pings even.

Had actually justtt finished testing a PoC fix for this at the moment this was opened, so it was great to see another supporting data point. Proposed fix is here: #2690. Interested to see if this allows one to open a channel w/ valgrind and the gossip burst!
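The deprioritization idea discussed here can be sketched as a per-peer send queue with two priority classes. This is only an illustration in Python under stated assumptions, not the implementation in #2690: the class name `PrioritySendQueue` is hypothetical, and the split simply treats the three BOLT 7 gossip messages as low priority and everything else (funding messages, reestablish, pings) as urgent.

```python
import heapq
import itertools

# BOLT 7 gossip message names; everything else is treated as urgent here.
GOSSIP_TYPES = {"channel_announcement", "node_announcement", "channel_update"}


class PrioritySendQueue:
    """Hypothetical per-peer outgoing queue: non-gossip messages drain first."""

    def __init__(self):
        self._heap = []
        # Monotonic counter used as a tie-breaker so that messages of the
        # same priority class leave the queue in FIFO order.
        self._seq = itertools.count()

    def push(self, msg_type, payload=b""):
        priority = 1 if msg_type in GOSSIP_TYPES else 0
        heapq.heappush(self._heap, (priority, next(self._seq), msg_type, payload))

    def pop(self):
        _, _, msg_type, payload = heapq.heappop(self._heap)
        return msg_type, payload

    def __len__(self):
        return len(self._heap)


queue = PrioritySendQueue()

# A gossip burst is already queued when the urgent messages arrive.
for _ in range(3):
    queue.push("channel_announcement")
queue.push("funding_signed")
queue.push("node_announcement")
queue.push("ping")

send_order = [queue.pop()[0] for _ in range(len(queue))]
print(send_order)
```

With this ordering, `funding_signed` and `ping` go out ahead of the queued gossip even though they were enqueued later, so a slow reader is less likely to hit a timeout on the messages that matter.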

I wanted to note it before we fix our gossip "gimme the firehose" problem and it disappears; great work!

I still see a lot of timeouts on my nodes during certain hours of the day. They correlate with a rise in memory consumption, which @wpaulino reasonably attributes to queued-up messages in #2689, and with a sudden rise in network traffic. The network traffic seems to overload my uplink at 3 Mbit/s.

The question is why this happens suddenly for three or four hours, after which the node recovers and runs fine again.

Excellent work! Since the upgrade to 0.6 on the 8th of April, outgoing gossip traffic has dropped dramatically. The flapping peers have also disappeared.
Thank you very much!
