I ended up thinking about this tonight due to a related problem, so here are
some notes. The difficulty is making this zone configurable. Might've missed
something.
Add a field max_write_age to the zone configs (a value of zero behaves like
MaxUint64). The idea is that the timestamp caches of the affected ranges maintain a low watermark that never falls behind (now - max_write_age). Note that this
effectively limits how long transactions can write to approximately
max_write_age. In turn, when running a read-only transaction, once the
current HLC timestamp has passed read_timestamp + max_write_age + max_offset,
any replica can serve reads.
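A minimal sketch of the eligibility check described above. All names are illustrative, not actual CockroachDB identifiers, and timestamps are simplified to int64 nanoseconds (a real HLC timestamp also carries a logical component):

```go
package main

import "fmt"

// canServeFollowerRead reports whether any replica may serve a read at
// readTS: once now has passed readTS + maxWriteAge + maxOffset, no
// lease holder can accept a write that the read would need to observe.
// maxWriteAge == 0 means the feature is disabled for this zone
// (behaves like MaxUint64).
func canServeFollowerRead(readTS, maxWriteAge, maxOffset, now int64) bool {
	if maxWriteAge == 0 {
		return false // disabled
	}
	return now > readTS+maxWriteAge+maxOffset
}

func main() {
	const sec = int64(1e9)
	// A read 15s in the past with max_write_age=10s, max_offset=500ms.
	fmt.Println(canServeFollowerRead(0*sec, 10*sec, sec/2, 15*sec)) // true
	// The same read only 5s in the past is not yet eligible.
	fmt.Println(canServeFollowerRead(0*sec, 10*sec, sec/2, 5*sec)) // false
}
```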
Add max_write_age to the lease proto. When a lease is requested, its max_write_age is populated with the zone config's value. If a lease holder realizes max_write_age has changed, it must request a new lease (at least when max_write_age increases) and let the old one expire first; replicas can't serve "fresher" reads based on the new max_write_age, but all replicas must keep honoring the "old" max_write_age for as long as the old lease could still be in effect. When considering a read-only BatchRequest with a timestamp eligible for a follower-served read, any replica may serve it; when considering a write for which now - max_write_age < write_ts does not hold, behave as if the timestamp cache's low watermark were at now - max_write_age.

An interesting observation is that this can also be modified to allow serving
read queries when Raft completely breaks down (think all inter-DC connections
fail): a replica can always serve what is "safe" based on the last known lease.
There is much more work to do to get these replicas to agree on a timestamp,
though. The resulting syntax could be something along the lines of
SELECT (...) AS OF SYSTEM TIME STALE
and DistSender would consult its cache to find the minimal timestamp covered
by all leases (but even that timestamp may not work).
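A hypothetical sketch of how a DistSender-like component might pick a timestamp for such a STALE read: scan the cached leases of the ranges the query touches and take the newest timestamp that every lease's max_write_age guarantees is immutable. The cache shape and all names are illustrative, not CockroachDB's actual API, and as noted above even the returned timestamp may still be refused:

```go
package main

import "fmt"

type cachedLease struct {
	maxWriteAge int64 // from the lease proto; 0 = follower reads disabled
}

// minimalStaleTimestamp returns the newest timestamp that should be
// safely readable on all touched ranges, and false if any range
// disallows stale reads entirely.
func minimalStaleTimestamp(now, maxOffset int64, leases []cachedLease) (int64, bool) {
	ts := now
	for _, l := range leases {
		if l.maxWriteAge == 0 {
			return 0, false
		}
		if cand := now - l.maxWriteAge - maxOffset; cand < ts {
			ts = cand
		}
	}
	return ts, true
}

func main() {
	leases := []cachedLease{{maxWriteAge: 10}, {maxWriteAge: 30}}
	ts, ok := minimalStaleTimestamp(100, 1, leases)
	fmt.Println(ts, ok) // 69 true: bounded by the widest max_write_age
}
```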
The "old" max_write_age will remain in effect until then.

cc @arjunravinarayan
For expiration-based leases, knowing the current lease is confirmation that you're reasonably up-to-date. But for epoch-based leases (i.e. for all regular tables), it doesn't tell you much. A replica could be arbitrarily far behind (or even be removed from the range!) and still pass the test in step 5.
Let's ditch the current time from this protocol altogether. Instead, each replica tracks a max_write_timestamp, updated as commands are applied. Leaders also track a max_proposed_write_timestamp, and guarantee not to propose a write with a timestamp less than max_proposed_write_timestamp - max_write_age (TODO: how does this interact with reproposals?). Followers can serve reads with timestamps less than max_write_timestamp - max_write_age. Reads newer than this time would get a NotLeaseHolderError as usual.
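An illustrative sketch of the follower side of this protocol, under the assumption that commands are applied in log order; all names (follower, maxWriteTimestamp, and so on) are hypothetical. The leader's promise never to propose below max_proposed_write_timestamp - max_write_age is what makes everything below maxWriteTimestamp - maxWriteAge immutable from the follower's point of view:

```go
package main

import "fmt"

type follower struct {
	maxWriteTimestamp int64 // highest write timestamp applied so far
	maxWriteAge       int64 // from the zone config / lease
}

// applyWrite is called as each replicated command is applied.
func (f *follower) applyWrite(ts int64) {
	if ts > f.maxWriteTimestamp {
		f.maxWriteTimestamp = ts
	}
}

// canServe reports whether this follower can serve a read at readTS
// locally; otherwise it would return NotLeaseHolderError as usual.
func (f *follower) canServe(readTS int64) bool {
	return readTS < f.maxWriteTimestamp-f.maxWriteAge
}

func main() {
	f := &follower{maxWriteAge: 10}
	f.applyWrite(25)
	f.applyWrite(20) // an older timestamp never lowers the max
	fmt.Println(f.canServe(14)) // true: below 25 - 10
	fmt.Println(f.canServe(15)) // false: not strictly below the bound
}
```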
When serving a read older than now - max_write_age but newer than max_write_timestamp - max_write_age, we could trigger an asynchronous dummy write to the range to advance max_write_timestamp and allow future reads at this timestamp to work locally. We could even do this preemptively to ensure steady availability of local reads.
You're right, what I was proposing gives you only consistent, not up-to-date data. I was hoping to avoid having a steady stream of Raft traffic when there isn't any write activity, but that's not possible without something like quorum leases.
Thanks for writing this up, I've been thinking about this lately as well as a side project but hadn't gotten nearly as far into the details.
Do we have any sense of how far in the past our timestamp cache low watermark is in typical deployments? Or how far in the past users that have asked about this are ok with having their reads be? We'd want to have some idea before going too far with this.
Also, I assume that we'd want to use the same timestamp on all ranges that a query hits to ensure ACID consistency, and picking a timestamp (other than just always going with the oldest allowed) is going to be tricky with the max_write_timestamp-per-replica approach.
The informal thinking around the watermark is to trail it at ~10 seconds, but that number isn't particularly well rationalized. We don't raise the watermark eagerly: there's currently no need to, and keeping it as far back as possible retains the maximum flexibility. But there are certain future use cases (e.g. Naiad) with the opposite incentive: Naiad wants the watermark raised as aggressively as possible, since materialization of materialized views is delayed until the watermark has passed on _all_ ranges. We could add a debug flag to show empirically where our watermarks are. Do you think that information would be useful to know?
The low watermark isn't the effective low watermark. For example, the low watermark could be t1 but there could be an all-encompassing span at t2. I don't think it's so easy to measure, and it depends a lot on the workload and the cache size.
> Also, I assume that we'd want to use the same timestamp on all ranges that a query hits to ensure ACID consistency, and picking a timestamp (other than just always going with the oldest allowed) is going to be tricky with the max_write_timestamp-per-replica approach.
Replicas will just refuse what they can't serve, and if we use the eager path as Ben suggested, you should "usually" not get refused, assuming the timestamp you chose is safe based on your local HLC and MaxOffset. For example, if a request ends up at the lease holder but is one that should be safe from the follower (and perhaps was tried there first), the lease holder will, for the next X seconds, eagerly bump max_write_timestamp.
We need to somehow limit the amount of proposals we're sending due to this. If you have a large number of ranges and very little write traffic, bumping max_write_timestamp for all of them amounts to approximately the (outgoing) Raft heartbeat traffic we worked so hard to reduce. Since this information doesn't necessarily have to flow through Raft, we could hoist it up to the node level and, say, send (inverted) bloom filters of RangeIDs which were bumped to a common timestamp. I imagine that'll get us further than reality could go for a while.
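A toy sketch of hoisting these bumps to the node level: instead of one Raft proposal per range, a node periodically sends a single coalesced message saying "all of these ranges are closed up to ts". The message shape is hypothetical, and a plain set of RangeIDs stands in here for the (inverted) bloom filter suggested above:

```go
package main

import "fmt"

// bumpMsg is one coalesced node-level announcement covering many ranges.
type bumpMsg struct {
	closedTS int64
	ranges   map[int64]bool // RangeIDs covered by this bump
}

// apply updates a follower's per-range max_write_timestamp from a
// single bumpMsg, never moving any range's timestamp backward.
func apply(maxWriteTS map[int64]int64, m bumpMsg) {
	for id := range m.ranges {
		if m.closedTS > maxWriteTS[id] {
			maxWriteTS[id] = m.closedTS
		}
	}
}

func main() {
	ts := map[int64]int64{1: 5, 2: 40}
	apply(ts, bumpMsg{closedTS: 30, ranges: map[int64]bool{1: true, 2: true}})
	fmt.Println(ts[1], ts[2]) // 30 40: range 2 was already ahead
}
```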
> Also, I assume that we'd want to use the same timestamp on all ranges that a query hits to ensure ACID consistency, and picking a timestamp (other than just always going with the oldest allowed) is going to be tricky with the max_write_timestamp-per-replica approach.
Yes, we definitely want to use the same timestamp (unless we introduce some concept of non-transactional batching). But we also have a fallback path if we choose "incorrectly": go to the remote lease holder. So I'd suggest that when we have flexibility in the timestamp, we choose one based on what we see at the first range we touch.
> If you have a large number of ranges and very little write traffic, bumping max_write_timestamp for all of them amounts to approximately the (outgoing) Raft heartbeat traffic we worked so hard to reduce.
That's true if all ranges are receiving read traffic via multiple replicas. But that's not necessarily true. There's a hierarchy of range activity:
@spencerkimball, curious to hear your proposal on this.
> Yes, we definitely want to use the same timestamp (unless we introduce some concept of non-transactional batching). But we also have a fallback path if we choose "incorrectly": go to the remote lease holder. So I'd suggest that when we have flexibility in the timestamp, we choose one based on what we see at the first range we touch.
We can fall back to the leaseholder, but occasionally needing to fall back
adds a lot of variability in response time, which makes it tough in turn
for apps to deliver a reliable response time. It's great for averages, but
the tail behavior could make the feature unusable for certain customers.
On Jun 23, 2017 3:24 PM, "Tobias Schottdorf" notifications@github.com wrote:

> @spencerkimball https://github.com/spencerkimball, curious to hear your proposal on this.
> The low watermark isn't the effective low watermark. For example, the low watermark could be t1 but there could be an all-encompassing span at t2. I don't think it's so easy to measure, and it depends a lot on the workload and the cache size.
FYI (in case you missed it), TimestampCache.lowWater is only used for testing. The timestamp cache is now shared by all replicas and we update the low water for a replica by using a span which encompasses the entire range.
> When serving a read older than now - max_write_age but newer than max_write_timestamp - max_write_age, we could trigger an asynchronous dummy write to the range to advance max_write_timestamp and allow future reads at this timestamp to work locally. We could even do this preemptively to ensure steady availability of local reads.
Do we need to do this with a dummy write? Perhaps we could also advance max_write_timestamp whenever we quiesce and then temporarily unquiesce when serving a read which could have been served from a follower if max_write_timestamp were advanced. Using quiesce messages for this purpose has the advantage of not mucking with the Raft log.
> TODO: how does this interact with reproposals?
This seems to be the biggest hole in this approach.
> Do we need to do this with a dummy write? Perhaps we could also advance max_write_timestamp whenever we quiesce and then temporarily unquiesce when serving a read which could have been served from a follower if max_write_timestamp were advanced. Using quiesce messages for this purpose has the advantage of not mucking with the Raft log.
We need to guarantee that any future lease holder will know about this timestamp. Writes accomplish this since they are replicated via the raft log. Quiesce messages do not since there is no guarantee that they will reach a quorum. We could poll a quorum without going all the way through the raft log, but that would need to be a new mechanism instead of piggybacking on quiescence.
The Storage-Level Change Feed Primitive has strong connections with follower reads. But in particular, there is one requirement that hasn't been discussed here (quote below by @bdarnell):
> For transactional writes, the change feed event is emitted when the intent is resolved, instead of when the original write happens. For follower reads, we've discussed implementing close timestamps via the timestamp cache or something like it, with the semantics that "no more writes with a timestamp less than T will be accepted". However, there could still be unresolved intents with timestamps less than T. For change feeds, we require that "no new events with timestamp less than T will be emitted", so we must resolve all intents before we can advance the close timestamp.
The interesting case here is that in which a timestamp is made available for follower reads, but there is still an intent visible at that timestamp. This is fine for follower reads, though a bit awkward, as intent resolution must be carried out and that takes time. Avoiding this situation would (mostly) create parity with the needs of the change feeds primitive, but can be awkward since we won't be able to raise the safe timestamp until all intents are gone and that is hard to accomplish, seeing that we don't know where the intents are.
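A hypothetical sketch contrasting the two requirements. A follower read at ts only needs ts to be closed; if it runs into an unresolved intent at or below ts, it can resolve that intent on demand (awkward, but correct). A change feed, by contrast, must not advance its closed timestamp past any unresolved intent at all. All types and names here are illustrative:

```go
package main

import "fmt"

type intent struct{ ts int64 }

// followerReadNeedsResolve: the read at readTS is allowed once readTS
// is closed, but it must first resolve any intent at or below readTS.
func followerReadNeedsResolve(readTS int64, intents []intent) bool {
	for _, in := range intents {
		if in.ts <= readTS {
			return true
		}
	}
	return false
}

// changefeedClosedTS: the feed may only close up to just below the
// oldest unresolved intent, regardless of the follower-read threshold.
func changefeedClosedTS(candidate int64, intents []intent) int64 {
	for _, in := range intents {
		if in.ts <= candidate {
			candidate = in.ts - 1
		}
	}
	return candidate
}

func main() {
	intents := []intent{{ts: 7}}
	fmt.Println(followerReadNeedsResolve(10, intents)) // true: resolve first
	fmt.Println(changefeedClosedTS(10, intents))       // 6: held back by the intent
}
```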
Just wanted to chip in with a bit of encouragement. Option to read from non-lease holders is super useful, please keep this up.
Thanks @tomholub. Work has continued (https://github.com/cockroachdb/cockroach/pull/21056) and we expect to include a version of this feature in this fall's 2.1 release.
Zendesk ticket #2720 has been linked to this issue.