Lighthouse: Validator client race condition leading to missed duties

Created on 17 Mar 2020  路  5Comments  路  Source: sigp/lighthouse

Overview

There's a race condition in the validator client that can lead to blocks and attestations being missed.

Setup

The validator client (specifically, the DutiesService) periodically polls for duties for the current and next epoch (according its slot clock). It stores these duties in its DutiesStore so that the BlockService and AttestationService can know when to produce blocks/attestations.

Previously, before an eth2 spec update (I forget which one, perhaps v0.10.0) the only reason the validator duties could change was due to a fork. However, since that aforementioned spec update we only have a 1-epoch look-ahead on proposer duties (since compute_proposer_index relies upon the effective balance for state.validators, which can change on any epoch boundary).

So, prior to this spec change this race condition existed but it was not obvious or part of routine, best-case operation. After this spec change, this race condition is expected to happen commonly, even without a forking chain.

The race condition

The validator client periodically polls for duties for the current and next epoch (according its slot clock). If it detects a change in duties, it simply updates its internal data-store and continues on. It does not trigger the BlockService or AttestationService to try and produce a block/attestation based upon these duties.

So, we have the following race condition:

  1. In epoch E - 1, we check the duties and see that validator V should produce a block at the 2nd slot of epoch E.
  2. Epoch E arrives.
  3. The BlockService::do_update and the DutiesService::do_update function are both triggered simultaneously.
  4. The BlockService completes first, finding that validator V does not need to produce till the 2nd slot of epoch E (it is the 1st slot now).
  5. Then, DutiesService completes, finding that the block proposing duty for V has moved to the 1st slot of epoch E.

Since the BlockService has already completed it's poll for this slot, it ends up producing a block in the 1st slot.

Potential solutions

  1. Have the BlockService and AttestationService multiple times per slot (e.g., 0%, 50%, 80% through the slot).
  2. Have the DutiesService trigger an update on BlockService and AttestationService whenever it finds a duties change.

(1) is simple to implement, however it is inefficient (will check for changes when none have occurred) and it doesn't completely solve the race-condition (it just mitigates).

(2) is more difficult to implement however it's an efficient and complete solution to the problem. The main obstacle I see is that BlockService and AttestationService already have a reference to DutiesService, so adding references back is circular. We could solve this using a futures channel for each of the BlockService and AttestationService, where the DutiesService holds a sender and publishes an event to the relevant service when there's a duties update. That being said, throwing channels around the place can get a bit messy, so perhaps there's a tidier solution (e.g., the DutiesService holds optional references to the other services which are updated after instantiation).

bug

Most helpful comment

I'm going to try tackling this as part of https://github.com/sigp/lighthouse/issues/923

All 5 comments

@adaszko I was thinking you might be interested and available for this one?

@AgeManning I know you've been deep in the validator client with the aggregation strategy changes. Do you think work on this should be stalled or perhaps start on some non-master branch?

Sure, I can take a look. Feel free to assign the issue to me.

I believe @AgeManning wanted this to start from a non-master branch. I think he was still in the process of creating it, perhaps check with him directly before you dig into this :)

Hey, sorry for the late reply. I've been fighting with smaller issues to get my branch to a usable state to work off.

I've made modifications to the validator client and we might get into conflicts if you solve this in master.

I think everything @paulhauner ul has said here still applies. The current plan is to eventually merge v0.2.0 branch to master. This branch has the latest source of truth for the validator client. The branch currently has warnings, which I've left for myself to address in the coming days.

If possible, it would be great to solve this issue on the v0.2.0 branch and submit a PR which targets that.

I'm going to try tackling this as part of https://github.com/sigp/lighthouse/issues/923

Was this page helpful?
0 / 5 - 0 ratings

Related issues

michaelsproul picture michaelsproul  路  4Comments

JustinDrake picture JustinDrake  路  3Comments

michaelsproul picture michaelsproul  路  4Comments

michaelsproul picture michaelsproul  路  5Comments

paulhauner picture paulhauner  路  5Comments