Lighthouse: Remove lock timeouts from BeaconChain

Created on 1 May 2020 · 5Comments · Source: sigp/lighthouse

Description

Remove old ..._LOCK_TIMEOUT variables from BeaconChain.

Detail

Whilst trying to find a deadlock, we added sometimes to our core Mutex/RwLock so that we'd get errors when things were locking. Now we've identified that the cause was running long-running tasks on the tokio executor, we can clear these. I believe they're all listed here, but part of the issue would be to find others:

https://github.com/sigp/lighthouse/blob/1abb54dabd6af4fa9e79cdc914626e2da3e11a46/beacon_node/beacon_chain/src/beacon_chain.rs#L50-L63

Reasons for removing them can be found here: https://github.com/sigp/lighthouse/issues/1078#issuecomment-621024441

A1 code-quality t Consensus & Verification

Source

paulhauner

👍1

All 5 comments

Whilst trying to find a deadlock, we added sometimes to our core Mutex/RwLock so that we'd get errors when things were locking.

Does the consensus code contain any locking logic? (In #1089 you mentioned consensus code is "a series of pure functions completed abstracted from databases, networking etc.". Having zero locking logic would help auditing.)

JustinDrake on 1 May 2020

Does the consensus code contain any locking logic?

No* it doesn't. We've gone to significant lengths to avoid this. For example all the signatures sets use a generic "getter" closure to obtain the pubkey during signature verification which allows us to abstract away global caches and other messy things:

https://github.com/sigp/lighthouse/blob/36f213c092eb4802e603cff5901269168637c3f6/eth2/state_processing/src/per_block_processing/signature_sets.rs#L71

*: The rayon parallelization code is perhaps a violation here, depends if you see it as a black box or for what it really is. We're getting rid of that anyway, so I think "no" is fair.

paulhauner on 3 May 2020

👍1

I'll take this

michaelsproul on 20 Nov 2020

👍1

I've dropped this back to A1 since it seems a little risky to push this in v1.0.0 if we need to cut it this week. I think we'd need to do some testing on high validator counts and low-spec machines before pushing this to master.

Whilst the lock-timeouts seem unpleasant on paper, they have been useful in practice many times by identifying contention.

paulhauner on 22 Nov 2020

👍1

I was thinking the exact same. I started the refactor last week and realised just how many locks need changing. It's not trivial to verify there aren't any lock ordering deadlocks lurking in there, and I think we should take the time to do it properly.

michaelsproul on 22 Nov 2020

Was this page helpful?

0 / 5 - 0 ratings