I encountered some db corruption, and I believe the root cause is that lighthouse crashes regularly due to an issue with the tokio runtime. I'm still debugging. One concern is that the VPS I'm running on doesn't have stellar stability reviews, and I'm wondering if that is causing the runtime to go awry. However, I have managed to maintain 100% uptime on my go-ethereum nodes.
I've run into this issue on all of 5828ff1, v1.0, and v1.0.1.
The tokio-runtime segfaults after a period of time:
tokio-runtime-w[3460]: segfault at 470ba01feea ip 00005641f1424f1b sp 00007f59e8ee21f8 error 6 in lighthouse[5641ef75e000+1d9e000]
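For anyone reading the log line above, here is a minimal sketch of how it can be decoded. It assumes the usual x86-64 Linux kernel format (hex fields, with the error value being the page-fault error code). The names `SEGFAULT_RE` and `decode_segfault` are made up for illustration. The computed offset is relative to the start of the mapped region shown in brackets, so it may need adjusting before being fed to a symbolizer, depending on how the binary is mapped.

```python
import re

# Assumed format of the kernel's x86-64 "segfault at ..." message.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<image>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Split a kernel segfault line into its fields (hypothetical helper)."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    err = int(m["err"], 16)
    return {
        "image": m["image"],
        "fault_addr": int(m["addr"], 16),
        # Instruction pointer relative to the start of the mapped region.
        "offset": int(m["ip"], 16) - int(m["base"], 16),
        # x86 page-fault error code bits.
        "page_present": bool(err & 1),  # 0 means the page was not mapped
        "write": bool(err & 2),         # 1 means the access was a write
        "user_mode": bool(err & 4),     # 1 means the fault came from user space
    }


line = (
    "tokio-runtime-w[3460]: segfault at 470ba01feea ip 00005641f1424f1b "
    "sp 00007f59e8ee21f8 error 6 in lighthouse[5641ef75e000+1d9e000]"
)
info = decode_segfault(line)
print(info["image"], hex(info["offset"]), info["write"], info["user_mode"])
```

Read this way, error 6 is a user-mode write to an address that was not mapped.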
Debugging this further, it seems that anytime I get close to maxing out the RAM it crashes.
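To check whether crashes line up with memory pressure, something like the sketch below could be run periodically alongside the node. It assumes the standard `/proc/meminfo` layout on Linux. The helper names and the sample numbers are hypothetical and are not measurements from this machine.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for row in text.splitlines():
        key, sep, rest = row.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def available_fraction(meminfo):
    """Fraction of total RAM the kernel estimates is still available."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


# Hypothetical sample for an 8 GB box under heavy memory pressure.
sample = """MemTotal:        8148000 kB
MemFree:          120000 kB
MemAvailable:     407400 kB
"""
frac = available_fraction(parse_meminfo(sample))
print(f"{frac:.0%} of RAM available")
```

On the box itself, the input would come from `open("/proc/meminfo").read()` instead of the sample string.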
We haven't seen this issue on any other hardware, so it would be super interesting to know what's unique about your setup that's causing this.
It might be a bug in Tokio related to the specific (janky?) hypervisor your VPS provider uses. You could run something like slabbed-or-not to try and work out which hypervisor that might be https://github.com/kaniini/slabbed-or-not
Some of the mysterious database errors we've seen have also been from people running under a hypervisor, which might just be a coincidence, but I'm not sure.
The bug may happen to be fixed by the Tokio 0.3 change (which is almost ready for release in v1.0.2).
I'm running on Contabo VPS.
$ ./slabbed-or-not
Not running under any known container type
Hypervisor: KVM
I'll keep an eye out for the Tokio 0.3 release :)
The tokio 0.3 release has been merged. Let us know if the issue persists.
I believe one cause of this issue was running with a high peer count (likely more than the computer can handle).
I'm going to close this issue, assuming it has been resolved. Please re-open if the issue persists.
It does appear a bit more stable, but I'm still seeing the same issue on 2383bfe with 50 peers. I will try ramping down to 30 to see if it improves.
@AgeManning FYI, it doesn't look like I have the ability to reopen.
Hmm, are you running any strange hardware? What OS?
I only saw one reported issue of segfaults in Tokio recently, and it looks like it's been fixed in 0.3: https://github.com/tokio-rs/tokio/pull/3019
@AgeManning running Ubuntu 20.04 on a VPS with 8GB RAM and 4 Xeon vCores. The crash is relatively consistent; I've reprovisioned a few times only to find the same error. Maybe this week I can figure out how to get a proper core dump to share.
@lightclient which lighthouse version are you running?
@pawanjay176 I've run all of the following: 5828ff1, v1.0 (c6baa0e), v1.0.1 (5a3b94cb), cut-v1.0.2 (2383bfe), and v1.0.2 (f718309).
Are you building the binary locally on the box?
Can you try running the portable version and see if it also happens there?
make build-x86_64-portable
and use the binary at target/x86_64-unknown-linux-gnu/release/lighthouse
@AgeManning okay, I'll give that a shot. I've been building locally and have been using the optimized version.