Describe the problem
Tested using MovR: after applying duplicate covering indexes to the promo_codes table for fast reads across multiple datacenters, the database seems to use only the primary index rather than the index closest to the read.
To Reproduce
Step through MovR in a multi-region config (easier to do in the 19.2 version), create duplicate indexes, and pin them to multiple regions. p99 on 20.1 is ~20ms; on 19.2 it was <5ms.
geo-partitioned-replicas-demo.sh in https://github.com/cockroachlabs/scripted-demos
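For context, the demo applies the duplicate-indexes topology pattern: one covering secondary index per region on promo_codes, each pinned so its leaseholder sits in that region. A minimal sketch of the pattern (the index name, region name, and replica count are illustrative, not copied from the script):

CREATE INDEX promo_codes_idx_us_east ON promo_codes (code)
    STORING (description, creation_time, expiration_time, rules);

ALTER INDEX promo_codes@promo_codes_idx_us_east
    CONFIGURE ZONE USING
        num_replicas = 3,
        constraints = '{"+region=us-east1": 1}',
        lease_preferences = '[[+region=us-east1]]';

Repeating this per region should let the optimizer serve reads from the locally pinned copy, which is what the report above says is not happening.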
Which query always uses the primary index? I ran the script and looked at the SELECT FROM promo_codes query, and it does look like it's using a duplicated index.
Here's an EXPLAIN ANALYZE for the query on n9: linky
It's worth noting that the p90 latency I see is <2ms while the p99 indeed is ~20ms.
The KV execution latencies are very interesting. p90 is <2ms across the board but p99 is ~30ms on a bunch of nodes (and very small on others).


Looping in @nvanbenschoten who is on call for KV.
The cluster I created is at http://radu-multi-region-georeplicas-0001.roachprod.crdb.io:26258/
I'm probably missing something, but how do we know that the impact on p99 latency is due to the SELECT FROM promo_codes query? It looks like we're just watching top-line p99 latency.
Also, we are inserting into this table in an explicit transaction (INSERT INTO promo_codes ...), so this could be responsible. If one of these reads hits an intent, it's going to have to wait for the intent to be resolved. That would be expected.
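To make that concrete, a hedged sketch of the write pattern being described (the column list is abbreviated; the actual workload writes full rows):

BEGIN;
INSERT INTO promo_codes (code, description) VALUES ('promo_xyz', 'demo');
-- Until COMMIT, this row carries a write intent. Any concurrent read of
-- promo_codes that encounters the intent must wait for it to resolve,
-- which would surface as occasional tail (p99) latency.
COMMIT;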
Yes, I think the database not using duplicate indexes properly (which would only be relevant for that particular query) was just a guess. The problem (as far as I understand) is that the overall p99 latency is 20ms when it was <5ms in 19.2.
Got it. @keith-mcclellan Has anything changed in the workload? Have periodic INSERT INTO promo_codes statements always been part of it? If so then we'd always expect some high latency in the tail.
on 19.2 it was <5ms
Have we validated this recently or is this coming from memory?
I've been looking at traces from this cluster and nothing sticks out. The only time I see latency jump above 3ms for any query is when it hits contention and needs to wait on the completion of another transaction.
Given that there is contention in the workload and a few transactions perform a series of statements and take 20-30ms in aggregate, I'm not surprised that we see p99 latency in the 10s of ms, even with partitioning.
I just ran the demo with v19.2.4 and confirmed that I do see p99 latencies below 5ms. Interestingly, p90 and p99.9 latencies appear to be almost identical between v19.2.4 and v20.1.1, so it's just p99 that differs.
SET CLUSTER SETTING "sql.defaults.implicit_select_for_update.enabled" = false; appears to fix the issue in v20.1.1.
I think what's happening is that the first and third most frequent statements are SELECT ... FROM vehicles ... LIMIT ... and the fourth most frequent statement is UPDATE vehicles ... and implicit SFU changes the way these statements interact. Specifically, it causes the UPDATE statement to acquire a lock on its row earlier, which evens out the odds of it winning in situations where these statements contend. This improved fairness is obviously a trade-off, and in this case, it appears to push enough contention under the p99 threshold that we see this jump in latency. On the other hand, it doesn't appear to affect p90 or p99.9 and we see the improved fairness pull down the rest of the tail latencies significantly - p99.99 drops from 230ms to 66ms with implicit SFU.
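A hedged illustration of the contending statement shapes (the table and columns come from the MovR schema; the predicates and values are made up):

-- Reader: a plain scan that acquires no locks.
SELECT * FROM vehicles WHERE city = 'new york' LIMIT 25;

-- Writer: with implicit SFU enabled (the v20.1 default), the row scan
-- feeding this UPDATE locks the matching row up front, as if it were a
-- SELECT ... FOR UPDATE, so the UPDATE wins contended races more often.
UPDATE vehicles SET status = 'in_use'
    WHERE city = 'new york' AND id = '...';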
So I'm not really sure what the next step here is. I wouldn't really consider this a "bug" in the database, but it's certainly a behavioral change in v20.1 that's having an adverse effect on the metric we're interested in for this demo. We are aware of longer-term improvements in the database that may help retain most of the benefits of SFU without the downsides (https://github.com/cockroachdb/cockroach/issues/49684), but they are not on the roadmap for the next release. We could shift around the workload distribution in movr a little bit to push the contention back above the 99th percentile (e.g. halve the number of UPDATE vehicle statements we run). Or we could disable implicit SFU in movr using the enable_implicit_select_for_update session variable until we do get around to addressing https://github.com/cockroachdb/cockroach/issues/49684.
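For reference, both forms of the opt-out mentioned above; the session variable would let movr disable implicit SFU on its own connections without changing the cluster-wide default:

-- Session scope (only this connection):
SET enable_implicit_select_for_update = false;

-- Cluster-wide default, as used earlier in the thread:
SET CLUSTER SETTING sql.defaults.implicit_select_for_update.enabled = false;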
@nstewart may have opinions here.
@nvanbenschoten Thanks for looking into this... so to summarize, I can get the pre-20.1 results by changing that cluster setting, but we really wouldn't want to do that in a real-world scenario in most cases. Correct?
@nstewart what do you suggest? Should we modify the demo to change the query balance or should we consider setting this DCL command ahead of running the demo? Or can we get #49684 prioritized?
I definitely wouldn't change the cluster setting, @keith-mcclellan. To your point, we wouldn't want to do this in a real-world scenario. I will change the query balance as a stop-gap, but obviously we wouldn't have this luxury if this were a customer workload, which MovR is attempting to replicate.
@johnrk @nvanbenschoten do we have any way to see from telemetry if actual customers are running into this? I'm inclined to revisit the priority of #49684 if so, but I don't know anything about the cost or actual customer benefit so I'll defer to the KV team here.
@nvanbenschoten would TPC-E help us catch these types of regressions? If not, would it be helpful to have MovR as a non-micro benchmark as a canary in the coal mine?
In my opinion, this type of change should be called out as "backward-incompatible" in our release notes: https://www.cockroachlabs.com/docs/releases/v20.1.0.html#backward-incompatible-changes. Even though it didn't "break" anything, application changes are required to maintain consistent, expected performance. @johnrk, @nvanbenschoten, if you agree, please work with @rmloveland to update those docs.
I think calling this "backward incompatible" is a bit much. Depending on what you look at (p99 vs. p99.99), it looks like either a regression or a significant improvement. Which of those figures matters more to a user depends on the user.
Yeah, "backward incompatible" may be too much, but it does seem like we should communicate changes like this, with suggested steps, if we know about them.
Even more than terming this backwards incompatible, I find the idea of terming this a regression at all to be controversial. Slight changes to latency distributions (which is what this is) are generally not documented changes. Furthermore, I suspect this indicates an improvement in average and maximum latency. My guess is that, in practice, nobody would actually choose to implement the proposed "workaround".
If anything, I think this highlights that our choice to display arbitrary percentiles of a distribution is a mistake. If we showed the full distribution of latencies, the user would likely be happy with the upgrade rather than alarmed. The reason we display these arbitrary percentiles is a limitation of the way we down-sample histograms for long-term storage. Ideally we'd encode and store the entire histogram (as happens when Prometheus is used to monitor a cockroach cluster); then we could use something like a heatmap or a ridgeline plot to visualize the entire latency distribution.


https://github.com/ryantxu/grafana-ridgeline-panel
https://www.circonus.com/2018/03/grafana-heatmaps-with-irondb-have-landed/
cc @piyush-singh
Ok, thanks for this perspective, Andrew. This is bubbling up from the docs side because we couldn't get the expected p99 in the 20.1 version of the multi-region topology tutorial. Perhaps we should be describing latency improvements differently in that doc.
Maybe we can update the doc to look at p90?
so to summarize I can get the pre 20.1 results by changing that cluster setting, but we really wouldn't want to do that in a real world scenario in most cases.
Yes, that's correct. But as Nate mentioned, that's not a great solution.
do we have any way to see from telemetry if actual customers are running into this
Do we track full-system latency percentiles in telemetry? If not, I don't know how we would be able to figure this out. And even with that, it wouldn't tell us the whole story.
would TPC-E help us catch these types of regressions
Interestingly, TPC-E does have a much more realistic read/write ratio than TPC-C, so it might pick this kind of thing up. If we're running MovR in front of customers, though, I don't think there's a substitute for monitoring its performance directly.
I tend to agree with what @RaduBerinde and @ajwerner said above. "Backwards incompatible" is not the right term here. "Regression" is more appropriate, but only if it's properly contextualized as a regression in a portion of the latency distribution for certain workloads. Unfortunately, when it comes to contention handling, you're rarely going to make an improvement across the board because, at the end of the day, you have multiple transactions competing for the same limited shared resource.
To this point, I was just talking to @sumeerbhola about https://github.com/cockroachdb/cockroach/issues/49684. There might be a way to make this change less invasive by coupling the "upgrade" locking strength with the "unreplicated" lock durability. This should move the needle on p99 here, but it would also make changes elsewhere. For one, it would re-introduce a large class of transaction retries that we considered important to get rid of. It would probably also hurt top-end throughput on a workload like YCSB-A. So it's not the kind of thing that's appropriate for a point release.
I think I found a small patch inside the lockTable implementation that we can make in v20.1.x to recover the p99 latency on this workload (and others like it) without needing to make much of a compromise anywhere else. It's a bit hacky but appears to be working well. The patch drives p99 latencies on the workload back down to ~3.3ms.
By the way, I meant to thank whoever put this script together (@keith-mcclellan?). These kinds of investigations are orders of magnitude easier to work through with an automated reproduction.
Nicely done @nvanbenschoten!
Let's keep this open until the backport lands.
This is now fixed on the release-20.1 branch.