Cockroach: read from a follower with timestamp bound

Created on 19 Jun 2017 · 15 comments · Source: cockroachdb/cockroach

I ended up thinking about this tonight due to a related problem, so here are
some notes. The difficulty is making this zone configurable. Might've missed
something.

Goals

  • assuming a read timestamp far enough in the "past", (usually) be able to read
    from any replica. (think: analytics, time travel queries, backups, queries
    that can't or don't need to pay the latency to a far-away lease holder).
  • configurable on the level of zone configs (i.e. table)

Sketch of implementation

Add a field max_write_age to the zone configs (a value of zero behaves like
MaxUint64). The idea is that the timestamp caches of the affected ranges
maintain a low watermark that never falls below now - max_write_age. Note that
this effectively limits how long transactions can keep writing to approximately
max_write_age. In turn, when running a read-only transaction, once the
current HLC timestamp has passed read_timestamp + max_write_age + max_offset,
any replica can serve the read.
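
Concretely, the eligibility check amounts to a comparison like the following. This is a minimal Go sketch; hlcTimestamp and canServeFollowerRead are hypothetical names for illustration, not actual CockroachDB code:

```go
package followerreads

import "time"

// hlcTimestamp is a stand-in for an HLC timestamp; wall time only, for brevity.
type hlcTimestamp struct {
	WallTime int64 // nanoseconds since the Unix epoch
}

// canServeFollowerRead reports whether a read at readTS may be served by any
// replica: the current HLC time must have passed
// readTS + max_write_age + max_offset, so no write can still land at or
// below readTS.
func canServeFollowerRead(readTS, now hlcTimestamp, maxWriteAge, maxOffset time.Duration) bool {
	// A max_write_age of zero behaves like MaxUint64, i.e. writes are never
	// age-limited and follower reads are never possible.
	if maxWriteAge == 0 {
		return false
	}
	boundary := readTS.WallTime + maxWriteAge.Nanoseconds() + maxOffset.Nanoseconds()
	return now.WallTime > boundary
}
```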

  1. add a field max_write_age to the lease proto.
  2. whenever a lease is proposed, max_write_age is populated with the value
    the proposer believes is current.
  3. lease extensions must not alter max_write_age. If a lease holder realizes
    that the ZoneConfig's max_write_age has changed, it must request a new lease
    (in practice, it only has to do this in case max_write_age increases) and let
    the old one expire (or transfer its lease away). There is room for optimization
    here: the replica could extend the lease with the new max_write_age, but all
    members must enforce the smaller max_write_ages for as long as the "old"
    version is not expired.
  4. Make DistSender aware of max_write_age. When routing a read-only
    BatchRequest whose timestamp is eligible for a follower-served read, consider
    followers, prioritizing those in close proximity.
  5. A follower which receives a read-only batch first checks if the current
    lease is active (not whether it holds the lease itself). If not, it behaves as
    it would today (requests the lease). Otherwise, if it is not the lease holder,
    it checks if the batch timestamp is eligible for a follower-served read based
    on the information in the lease and the current timestamp. If so, it serves it
    (it does not need to update the timestamp cache).
  6. on writes that violate now - max_write_age < write_ts, behave as if there
    were a timestamp cache entry at now.
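
The write-side enforcement in step 6 is the mirror image of the read check. A hedged sketch, reusing the hypothetical hlcTimestamp type from above: a write at or below now - max_write_age is treated exactly as if the timestamp cache contained an entry at now.

```go
// maybeForwardWriteTimestamp applies step 6: writes that violate
// now - max_write_age < write_ts behave as if the timestamp cache held an
// entry at now, which forces the write above that timestamp (the transaction
// then retries or refreshes, as on any timestamp cache bump).
func maybeForwardWriteTimestamp(writeTS, now hlcTimestamp, maxWriteAge time.Duration) hlcTimestamp {
	if maxWriteAge == 0 {
		return writeTS // zero disables the limit
	}
	if writeTS.WallTime <= now.WallTime-maxWriteAge.Nanoseconds() {
		// Same outcome as a timestamp cache hit at `now`.
		return hlcTimestamp{WallTime: now.WallTime + 1}
	}
	return writeTS
}
```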

An interesting observation is that this can also be modified to allow serving
read queries when Raft completely breaks down (think all inter-DC connections
fail): a replica can always serve what is "safe" based on the last known lease.
There is much more work to do to get these replicas to agree on a timestamp,
though. The resulting syntax could be something along the lines of

SELECT (...) AS OF SYSTEM TIME STALE

and DistSender would consult its cache to find the minimal timestamp covered
by all leases (but even that timestamp may not work).

Caveats

  • This relies on clocks, and thus on MaxOffset, plus on not having goroutines
    stalled in inconvenient locations (such a stall would violate MaxOffset too,
    but would be very unlikely to be caught): if a write passes the check but is
    then delayed until the check no longer holds, followers may serve reads that
    are then invalidated by the delayed write. (This does not seem more fragile
    than what we already have with our read leases, though.)
  • if a Range isn't split along a ZoneConfig, the more restrictive
    max_write_age will be in effect.
Labels: A-kv-client, C-enhancement

All 15 comments

cc @arjunravinarayan

For expiration-based leases, knowing the current lease is confirmation that you're reasonably up-to-date. But for epoch-based leases (i.e. for all regular tables), it doesn't tell you much. A replica could be arbitrarily far behind (or even be removed from the range!) and still pass the test in step 5.

Let's ditch the current time from this protocol altogether. Instead, each replica tracks a max_write_timestamp, updated as commands are applied. Leaders also track a max_proposed_write_timestamp, and guarantee not to propose a write with a timestamp less than max_proposed_write_timestamp - max_write_age (TODO: how does this interact with reproposals?). Followers can serve reads with timestamps less than max_write_timestamp - max_write_age. Reads newer than this time would get a NotLeaseHolderError as usual.
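
A minimal sketch of this variant, again with hypothetical names and building on the hlcTimestamp type from the sketch above:

```go
// replicaState holds the per-replica bookkeeping for the clock-free variant.
type replicaState struct {
	maxWriteTimestamp hlcTimestamp  // largest write timestamp applied so far
	maxWriteAge       time.Duration // from the zone config, via the lease
}

// canServeLocally reports whether this follower can serve a read at readTS
// without consulting the lease: only timestamps below
// max_write_timestamp - max_write_age are guaranteed immutable. Anything
// newer gets a NotLeaseHolderError as usual.
func (r *replicaState) canServeLocally(readTS hlcTimestamp) bool {
	return readTS.WallTime < r.maxWriteTimestamp.WallTime-r.maxWriteAge.Nanoseconds()
}
```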

When serving a read older than now - max_write_age but newer than max_write_timestamp - max_write_age, we could trigger an asynchronous dummy write to the range to advance max_write_timestamp and allow future reads at this timestamp to work locally. We could even do this preemptively to ensure steady availability of local reads.
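
Continuing the same sketch, the trigger might look like this (proposeDummyWrite is a hypothetical stand-in for replicating a no-op command):

```go
// maybeTriggerBump fires an async no-op write when a read is safe by the
// clock (older than now - max_write_age) but not yet by the replicated
// max_write_timestamp, so that future reads at this timestamp work locally.
func (r *replicaState) maybeTriggerBump(readTS, now hlcTimestamp) {
	safeByClock := readTS.WallTime < now.WallTime-r.maxWriteAge.Nanoseconds()
	if safeByClock && !r.canServeLocally(readTS) {
		go proposeDummyWrite(r) // advances max_write_timestamp via consensus
	}
}

// proposeDummyWrite stands in for replicating a no-op command whose only
// effect is to advance max_write_timestamp.
func proposeDummyWrite(r *replicaState) { /* propose through Raft */ }
```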

You're right, what I was proposing gives you only consistent, not up-to-date data. I was hoping to avoid having a steady stream of Raft traffic when there isn't any write activity, but that's not possible without something like quorum leases.

Thanks for writing this up, I've been thinking about this lately as well as a side project but hadn't gotten nearly as far into the details.

Do we have any sense of how far in the past our timestamp cache low watermark is in typical deployments? Or how far in the past users that have asked about this are ok with having their reads be? We'd want to have some idea before going too far with this.

Also, I assume that we'd want to use the same timestamp on all ranges that a query hits to ensure ACID consistency, and picking a timestamp (other than just always going with the oldest allowed) is going to be tricky with the max_write_timestamp-per-replica approach.

The informal thinking around the watermark is to trail it at ~10 seconds, but that number isn't particularly well rationalized. We do not raise the watermark eagerly, since there is currently no need to do so, and keeping it as far back as possible retains maximum flexibility. But certain future use cases (e.g. Naiad) have the opposite incentive: Naiad wants the watermark raised as aggressively as possible, because the watermark delays materialization of materialized views until it has passed on _all_ ranges. We could add a debug flag to show empirically where our watermarks are - do you think that information would be useful to know?

The low watermark isn't the effective low watermark. For example, the low watermark could be t1 but there could be an all-encompassing span at t2. I don't think it's so easy to measure, and it depends a lot on the workload and the cache size.

Also, I assume that we'd want to use the same timestamp on all ranges that a query hits to ensure ACID consistency, and picking a timestamp (other than just always going with the oldest allowed) is going to be tricky with the max_write_timestamp-per-replica approach.

Replicas will just refuse what they can't serve, and if we use the eager path as Ben suggested, you should "usually" not get refused, assuming the timestamp you chose is safe based on your local HLC and MaxOffset. For example, if a request that should be safe to serve from a follower ends up at the lease holder (perhaps having been tried at the follower first), the lease holder will, for the next X seconds, eagerly bump max_write_timestamp.

We need to somehow limit the amount of proposals we're sending due to this. If you have a large number of ranges and very little write traffic, bumping max_write_timestamp for all of them amounts to approximately the (outgoing) Raft heartbeat traffic we worked so hard to reduce. Since this information doesn't necessarily have to flow through Raft, we could hoist it up to the node level and, say, send (inverted) bloom filters of RangeIDs which were bumped to a common timestamp. I imagine that would keep us ahead of real-world demands for a while.
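
To illustrate the shape such a node-level message could take (purely hypothetical; a plain set stands in for the bloom filter mentioned above):

```go
type rangeID int64

// coalescedBump carries one timestamp and the set of ranges bumped to it,
// amortizing what would otherwise be one Raft proposal per range. The map is
// a stand-in for a compact encoding such as a bloom filter of RangeIDs.
type coalescedBump struct {
	Timestamp hlcTimestamp
	Ranges    map[rangeID]struct{}
}

// applyCoalescedBump advances max_write_timestamp on every local replica
// named in the message.
func applyCoalescedBump(replicas map[rangeID]*replicaState, msg coalescedBump) {
	for id := range msg.Ranges {
		if r, ok := replicas[id]; ok && msg.Timestamp.WallTime > r.maxWriteTimestamp.WallTime {
			r.maxWriteTimestamp = msg.Timestamp
		}
	}
}
```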

Also, I assume that we'd want to use the same timestamp on all ranges that a query hits to ensure ACID consistency, and picking a timestamp (other than just always going with the oldest allowed) is going to be tricky with the max_write_timestamp-per-replica approach.

Yes, we definitely want to use the same timestamp (unless we introduce some concept of non-transactional batching). But we also have a fallback path if we choose "incorrectly": go to the remote lease holder. So I'd suggest that when we have flexibility in the timestamp, we choose one based on what we see at the first range we touch.
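
In the vocabulary of the earlier sketches, that selection could look roughly like this (hypothetical, and ignoring the clamping against the local HLC a real implementation would need):

```go
// chooseReadTimestamp picks the newest timestamp the first range we touch can
// serve locally; later ranges that can't serve it fall back to their remote
// lease holders.
func chooseReadTimestamp(first *replicaState) hlcTimestamp {
	return hlcTimestamp{
		WallTime: first.maxWriteTimestamp.WallTime - first.maxWriteAge.Nanoseconds() - 1,
	}
}
```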

If you have a large number of ranges and very little write traffic, bumping max_write_timestamp for all of them amounts to approximately the (outgoing) Raft heartbeat traffic we worked so hard to reduce.

That's true if all ranges are receiving read traffic via multiple replicas. But that's not necessarily true. There's a hierarchy of range activity:

  1. Write traffic: consensus with every operation
  2. Read traffic spread throughout the cluster: requires either A) high-latency remote reads from the lease holder, B) periodic (but potentially frequent) consensus updates to spread timestamps, or C) quorum leases
  3. Read traffic via a single node: rebalance the lease holder there and operate in read-only/quiesced mode
  4. Inactive: fully quiesced

@spencerkimball, curious to hear your proposal on this.

Yes, we definitely want to use the same timestamp (unless we introduce some concept of non-transactional batching). But we also have a fallback path if we choose "incorrectly": go to the remote lease holder. So I'd suggest that when we have flexibility in the timestamp, we choose one based on what we see at the first range we touch.

We can fall back to the leaseholder, but occasionally needing to fall back adds a lot of variability in response time, which in turn makes it tough for apps to deliver a reliable response time. It's great for averages, but the tail behavior could make the feature unusable for certain customers.

The low watermark isn't the effective low watermark. For example, the low watermark could be t1 but there could be an all-encompassing span at t2. I don't think it's so easy to measure, and it depends a lot on the workload and the cache size.

FYI (in case you missed it), TimestampCache.lowWater is only used for testing. The timestamp cache is now shared by all replicas and we update the low water for a replica by using a span which encompasses the entire range.

When serving a read older than now - max_write_age but newer than max_write_timestamp - max_write_age, we could trigger an asynchronous dummy write to the range to advance max_write_timestamp and allow future reads at this timestamp to work locally. We could even do this preemptively to ensure steady availability of local reads.

Do we need to do this with a dummy write? Perhaps we could also advance max_write_timestamp whenever we quiesce and then temporarily unquiesce when serving a read which could have been served from a follower if max_write_timestamp were advanced. Using quiesce messages for this purpose has the advantage of not mucking with the Raft log.

TODO: how does this interact with reproposals?

This seems to be the biggest hole in this approach.

Do we need to do this with a dummy write? Perhaps we could also advance max_write_timestamp whenever we quiesce and then temporarily unquiesce when serving a read which could have been served from a follower if max_write_timestamp were advanced. Using quiesce messages for this purpose has the advantage of not mucking with the Raft log.

We need to guarantee that any future lease holder will know about this timestamp. Writes accomplish this since they are replicated via the raft log. Quiesce messages do not since there is no guarantee that they will reach a quorum. We could poll a quorum without going all the way through the raft log, but that would need to be a new mechanism instead of piggybacking on quiescence.

The Storage-Level Change Feed Primitive has strong connections with follower reads. But in particular, there is one requirement that hasn't been discussed here (quote below by @bdarnell):

For transactional writes, the change feed event is emitted when the intent is resolved, instead of when the original write happens. For follower reads, we've discussed implementing close timestamps via the timestamp cache or something like it, with the semantics that "no more writes with a timestamp less than T will be accepted". However, there could still be unresolved intents with timestamps less than T. For change feeds, we require that "no new events with timestamp less than T will be emitted", so we must resolve all intents before we can advance the close timestamp.

The interesting case here is that in which a timestamp is made available for follower reads, but there is still an intent visible at that timestamp. This is fine for follower reads, though a bit awkward, as intent resolution must be carried out and that takes time. Avoiding this situation would (mostly) create parity with the needs of the change feeds primitive, but can be awkward since we won't be able to raise the safe timestamp until all intents are gone and that is hard to accomplish, seeing that we don't know where the intents are.
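
In the same hypothetical vocabulary as the sketches above, the difference can be stated as a constraint on advancing the closed timestamp: change feeds must hold it below the oldest unresolved intent, while follower reads may advance past intents and resolve them lazily at read time.

```go
// nextClosedTimestamp advances the closed timestamp for change feeds: it may
// not pass the oldest unresolved intent, since events for those writes have
// not been emitted yet. A nil oldestIntent means no unresolved intents are
// known below the candidate.
func nextClosedTimestamp(candidate hlcTimestamp, oldestIntent *hlcTimestamp) hlcTimestamp {
	if oldestIntent != nil && oldestIntent.WallTime <= candidate.WallTime {
		// Hold just below the intent until it is resolved.
		return hlcTimestamp{WallTime: oldestIntent.WallTime - 1}
	}
	return candidate
}
```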

Just wanted to chip in with a bit of encouragement. Option to read from non-lease holders is super useful, please keep this up.

Thanks @tomholub. Work has continued (https://github.com/cockroachdb/cockroach/pull/21056) and we expect to include a version of this feature in this fall's 2.1 release.

Zendesk ticket #2720 has been linked to this issue.
