Diem: [Bug] JSON-RPC Assumes Consistency

Created on 22 Jan 2021 · 3 Comments · Source: diem/diem

Blockchains in general are at best eventually consistent. That is, a client reading from two services should expect that one service may be behind the other. This becomes problematic for clients that connect to multiple services, intentionally or not. Since a single IP address or URL may refer to one service or several, clients /must/ be implemented on the assumption that read-after-read consistency is not guaranteed. Specifically, suppose each read returns a version:

read(server) => 5
read(server) => 6
read(server) => 5

This sequence is entirely possible when there is redundancy behind a single domain, and even more likely in our fault-tolerance case of choosing 1 of n upstreams. Why?

The primary issue is a race against state synchronization:
Validator_i -> VFN_i -> FN_i
vs
Validator_j -> VFN_j -> FN_j

where the JSON-RPC service is backed by (FN_i, FN_j). If we assume i != j, the two paths can take different routes, have different latencies, and even suffer temporarily unavailable connections. A client calling an application-load-balanced JSON-RPC service hosted by (FN_i, FN_j) may therefore get different versions from one read to the next.
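To make the race concrete, here is a minimal simulation of two full nodes behind one endpoint. The `FullNode` and `RoundRobinBalancer` classes are hypothetical stand-ins, not part of any Diem code, and round-robin is only one assumed balancing policy:

```python
import itertools


class FullNode:
    """Hypothetical full node that answers reads at whatever version it has synced."""

    def __init__(self, name, version):
        self.name = name
        self.version = version

    def read(self):
        return self.version


class RoundRobinBalancer:
    """Assumed load balancer: hands each read to the next node in turn."""

    def __init__(self, nodes):
        self._cycle = itertools.cycle(nodes)

    def read(self):
        return next(self._cycle).read()


# FN_i has synced to version 5, FN_j is already at version 6.
service = RoundRobinBalancer([FullNode("FN_i", 5), FullNode("FN_j", 6)])
observed = [service.read() for _ in range(3)]
print(observed)  # [5, 6, 5]
```

The client sees the version go backwards on the third read even though neither node ever regressed.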

Our current JSON-RPC redundancy approach is to query multiple services. Even assuming each service is single-homed, the most up-to-date one can go offline, causing us to redirect to a service that is at an older version.

In any case, if we arrive at an older version, we fail. So three ideas:

1. Require JSON-RPC clients to specify the minimum version they want. That way, if the upstream is stale, it can either respond with "try again" or hold on to the request until its version catches up.
2. Make the JSON-RPC client handle stale responses with retries.
3. Return the stale response, but note that it is stale.
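As a rough sketch of the first idea (the client states a minimum version and a stale upstream answers "try again"), assuming a hypothetical `handle_read` handler and `ReadResponse` shape that do not exist in the current service:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadResponse:
    version: int            # version the node answered at
    retry: bool             # True when the node is behind min_version
    result: Optional[str]   # payload; None when the caller should retry


def handle_read(node_version: int, state: str, min_version: int = 0) -> ReadResponse:
    """Answer a read only if this node has reached the client's minimum version."""
    if node_version < min_version:
        # The node is stale relative to what the client has already seen.
        return ReadResponse(version=node_version, retry=True, result=None)
    return ReadResponse(version=node_version, retry=False, result=state)


fresh = handle_read(node_version=6, state="balance=10", min_version=6)
stale = handle_read(node_version=5, state="balance=7", min_version=6)
print(fresh.retry, stale.retry)  # False True
```

The "hold on to the request" variant would block inside the handler until `node_version` catches up instead of returning immediately; that is not shown here.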

Does JSON-RPC currently support long poll?

bug

davidiw

Most helpful comment

Of the 3 options you listed:
No. 2 was chosen because retries are always needed. If the client side is implemented well, the error raised once the retries are exhausted should also contain the stale response, so the caller can judge whether to propagate the error or use the stale response.

No. 3 is not good because in most cases we just want to retry, so returning the stale response would force every caller to write its own retry logic.
No. 1 may be better for server resource usage, and could be an additional option on top of No. 2.
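A minimal sketch of what No. 2 could look like on the client side, with the final error carrying the stale response as described above. `VersionTrackingClient`, `StaleResponseError`, and the `{"version": ..., "result": ...}` response shape are all hypothetical names for illustration, not the actual client API:

```python
class StaleResponseError(Exception):
    """Raised when every attempt returned a version older than one already seen."""

    def __init__(self, response, min_version):
        super().__init__(
            f"stale after retries: got version {response['version']}, "
            f"need >= {min_version}"
        )
        self.response = response        # last stale response, for the caller to inspect
        self.min_version = min_version


class VersionTrackingClient:
    def __init__(self, fetch, max_retries=3):
        self._fetch = fetch             # callable returning {"version": int, "result": ...}
        self._max_retries = max_retries
        self.last_seen_version = 0

    def read(self):
        response = None
        for _ in range(self._max_retries + 1):
            response = self._fetch()
            if response["version"] >= self.last_seen_version:
                self.last_seen_version = response["version"]
                return response
            # Stale: a real client would likely back off here before retrying.
        raise StaleResponseError(response, self.last_seen_version)


# Upstream answers at version 6, then 5 (stale), then 6 again.
responses = iter([
    {"version": 6, "result": "a"},
    {"version": 5, "result": "b"},
    {"version": 6, "result": "c"},
])
client = VersionTrackingClient(lambda: next(responses), max_retries=1)
first = client.read()    # accepted at version 6
second = client.read()   # version 5 is rejected, the retry returns version 6
```

A caller that prefers old data to no data can catch `StaleResponseError` and read `err.response`; everyone else just lets the error propagate.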