Remove old ..._LOCK_TIMEOUT variables from BeaconChain.
Whilst trying to find a deadlock, we added sometimes to our core Mutex/RwLock so that we'd get errors when things were locking. Now we've identified that the cause was running long-running tasks on the tokio executor, we can clear these. I believe they're all listed here, but part of the issue would be to find others:
Reasons for removing them can be found here: https://github.com/sigp/lighthouse/issues/1078#issuecomment-621024441
Whilst trying to find a deadlock, we added sometimes to our core
Mutex/RwLockso that we'd get errors when things were locking.
Does the consensus code contain any locking logic? (In #1089 you mentioned consensus code is "a series of pure functions completed abstracted from databases, networking etc.". Having zero locking logic would help auditing.)
Does the consensus code contain any locking logic?
No* it doesn't. We've gone to significant lengths to avoid this. For example all the signatures sets use a generic "getter" closure to obtain the pubkey during signature verification which allows us to abstract away global caches and other messy things:
*: The rayon parallelization code is perhaps a violation here, depends if you see it as a black box or for what it really is. We're getting rid of that anyway, so I think "no" is fair.
I'll take this
I've dropped this back to A1 since it seems a little risky to push this in v1.0.0 if we need to cut it this week. I think we'd need to do some testing on high validator counts and low-spec machines before pushing this to master.
Whilst the lock-timeouts seem unpleasant on paper, they have been useful in practice many times by identifying contention.
I was thinking the exact same. I started the refactor last week and realised just how many locks need changing. It's not trivial to verify there aren't any lock ordering deadlocks lurking in there, and I think we should take the time to do it properly.