After some initial investigation today, there's one problem so big that it pretty much obscures whatever problems may exist beneath it -- bad replica balance. For the most part, the replicas for any given table mostly fall on the same 3 nodes. The worst offender here is the tpcc.warehouse table, which gets the most load of any table as far as I'm aware (please correct that if I'm wrong). Almost all of the replicas for its 1000 ranges ended up on the same 3 nodes, meaning that the best leaseholder rebalancing in the world couldn't make for good performance. Those 3 nodes are indeed the 3 most loaded along dimensions such as number of BatchRequests processed, CPU usage, and keys written per second, with 1 of them in particular really standing out in goroutines (20k vs 2k for the next highest).
This could be fixed in a number of ways, such as:
ALTER TABLE ... SCATTER that scatters replicas in addition to leases

The last one should get the quickest results and we've talked about wanting it anyway, so I'll work on that tomorrow and see where it leaves things.
I do have a couple questions about oddities I've noticed, though:
r1 has a lot of qps. Way more qps than any other range (460 qps to it vs 50 to the next hottest range in my latest full measurement, and it appears to be growing over time; it's now over 1000 qps). That's way higher than the cluster's rate of NotLeaseholderErrors. Judging by the fact that its keys written per second is basically 0, they're all reads. According to the UI, there is no rebalancing activity going on, neither replicas nor leases. I increased the cluster setting for the number of range cache entries from 1M to 10M and it's had no effect. I enabled slow query tracing to try and see which requests were hitting it, and I'm seeing it repeatedly involved in QueryTxn attempts to resolve intents. It looks pretty fishy to me, but I haven't checked against the code yet. Here's an example trace: https://gist.github.com/a-robinson/17ebf561bf450b40e7e89edd967962ca. Is this expected?

The tpcc.warehouse ranges on n11 (the one with the massive number of goroutines and most qps) are getting significantly more qps than those whose leases are on the other two nodes (https://gist.github.com/a-robinson/0aebc9aea3fb042255d3572e52ce63ac). Under normal operation, I'd expect each warehouse to get the same number of qps. Is that a correct understanding of tpc-c? And given that n11 seems to be the worst bottleneck, I'd expect each range on it to have fewer qps, not more, due to the slowdown. Any idea what might be going on here?

Finally, for future reference (cc @piyush-singh), things that would have been nice to have for the sake of debugging the performance issue:
I ended up manually constructing most of this data with the raft debug endpoint + a few python scripts as well as SHOW TESTING_RANGES + some bash.
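For what it's worth, the replica-balance part boiled down to something like the following (a rough sketch only, not the exact script; the JSON field names are assumptions about the /_status/raft payload rather than its real schema):

# Rough sketch of the replica-balance counting described above. The JSON field
# names ("ranges", "state", "desc", "replicas", "node_id") are assumptions
# about the /_status/raft payload, not its exact schema.
import collections
import json
import sys

def replicas_per_node(raft_debug_path):
    with open(raft_debug_path) as f:
        data = json.load(f)
    counts = collections.Counter()
    for rng in data.get("ranges", {}).values():
        desc = rng.get("state", {}).get("desc", {})
        for replica in desc.get("replicas", []):
            counts[replica["node_id"]] += 1
    return counts

if __name__ == "__main__":
    # Usage: python replicas_per_node.py raft_debug.json
    for node_id, count in replicas_per_node(sys.argv[1]).most_common():
        print("n%s: %d replicas" % (node_id, count))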
And to be clear, the point of the first paragraph above is that leaseholder rebalancing can't fix the problem by itself if all the warehouse range replicas are on the same 3 nodes.
r1 has a lot of qps. Way more qps than any other range (460 qps to it vs 50 to the next hottest range in my latest full measurement, and it appears to be growing over time; it's now over 1000 qps). That's way higher than the cluster's rate of NotLeaseholderErrors. Judging by the fact that its keys written per second is basically 0, they're all reads. According to the UI, there is no rebalancing activity going on, neither replicas nor leases. I increased the cluster setting for the number of range cache entries from 1M to 10M and it's had no effect. I enabled slow query tracing to try and see which requests were hitting it, and I'm seeing it repeatedly involved in QueryTxn attempts to resolve intents. It looks pretty fishy to me, but I haven't checked against the code yet. Here's an example trace: https://gist.github.com/a-robinson/17ebf561bf450b40e7e89edd967962ca. Is this expected?
I'd expect r1 to be mostly idle in the steady state. @nvanbenschoten, @tschottdorf, any theories?
The tpcc.warehouse ranges on n11 (the one with the massive number of goroutines and most qps) are getting significantly more qps than those whose leases are on the other two nodes (https://gist.github.com/a-robinson/0aebc9aea3fb042255d3572e52ce63ac). Under normal operation, I'd expect each warehouse to get the same number of qps. Is that a correct understanding of tpc-c? And given that n11 seems to be the worst bottleneck, I'd expect each range on it to have fewer qps, not more, due to the slowdown. Any idea what might be going on here?
This is not expected. Each warehouse should receive approximately the same qps. No idea what is going on, but it is worth pulling on this thread. It might be indicative of a bug in the load generator.
Finally, for future reference (cc @piyush-singh), things that would have been nice to have for the sake of debugging the performance issue:
Heh, I came up with a similar list when debugging tpcc performance issues. I believe there are issues filed for many of these. @piyush-singh please double check.
This is not expected. Each warehouse should receive approximately the same qps. No idea what is going on, but it is worth pulling on this thread. It might be indicative of a bug in the load generator.
I can't confirm it right now since I tore the cluster down when I was done last night, but I think it could conceivably be due to errors from that node. I recall there being a fairly large (albeit spiky) amount of "Replica Errors" in the UI's graph of RPC errors. It's possible that those were mostly coming from n11's replicas, and that retries were then bumping up the count of batches sent to n11's replicas.
Implementing a version of ALTER TABLE ... SCATTER that scatters replicas in addition to ranges
What does "scatters replicas in addition to ranges" mean?
r1 has a lot of qps.
That's surprising. This is probably due to overly-broad cache invalidation when the DistSender gets a NotLeaseHolder error (or a few other things). It sounds like we might be invalidating both meta1 and meta2 when we should only be invalidating meta2.
We will be showing location of replicas in a debug page shortly (#24855), and @vilterp and I have discussed using this layout to similarly show leaseholder counts.
@petermattis mentioned similar requests for information about ranges in #23379 (nodes to which they were replicated, size, load). I'll make a note to think through how best to show ranges in the UI for our last milestone since we'll be making some layout changes anyways.
Thanks for the write-up and detailed requests, @a-robinson.
r1 has a lot of qps. Way more qps than any other range (460 qps to it vs 50 to the next hottest range in my latest full measurement, and it appears to be growing over time; it's now over 1000 qps). That's way higher than the cluster's rate of NotLeaseholderErrors. Judging by the fact that its keys written per second is basically 0, they're all reads. According to the UI, there is no rebalancing activity going on, neither replicas nor leases. I increased the cluster setting for the number of range cache entries from 1M to 10M and it's had no effect. I enabled slow query tracing to try and see which requests were hitting it, and I'm seeing it repeatedly involved in QueryTxn attempts to resolve intents. It looks pretty fishy to me, but I haven't checked against the code yet. Here's an example trace: https://gist.github.com/a-robinson/17ebf561bf450b40e7e89edd967962ca. Is this expected?
The trace is pretty suspicious for this issue. As you mentioned, we see a large number of QueryTxn requests in the trace. A lot of these look something like:
event:[n11,s11,r55510/2:/Table/55/1/9{00-10}] no pusher found; backing off
30575.736ms 390.389ms operation:dist sender
30575.767ms 0.030ms event:[s11,r55510/2:/Table/55/1/9{00-10},n11] querying next range at /Min
30575.806ms 0.040ms event:[s11,r55510/2:/Table/55/1/9{00-10},n11] r1: sending batch 1 QueryTxn to (n20,s19):8
30575.827ms 0.021ms event:[s11,r55510/2:/Table/55/1/9{00-10},n11] sending request to alex-tpcc-0002:26257
30575.848ms 0.021ms operation:/cockroach.roachpb.Internal/Batch component:gRPC sb:1 span.kind:client txnid:779a7ccb-5b5d-4a8c-8c02-33a5937f9510
30576.756ms 0.908ms operation:/cockroach.roachpb.Internal/Batch component:gRPC sb:1 span.kind:server txnid:779a7ccb-5b5d-4a8c-8c02-33a5937f9510
30576.776ms 0.019ms event:[n20] 1 QueryTxn
30576.781ms 0.005ms event:[n20,s19] executing 1 requests
30626.903ms 50.122ms event:[n20,s19,r1/8:/{Min-System/}] read-only path
30626.914ms 0.011ms event:[n20,s19,r1/8:/{Min-System/}] command queue
30626.918ms 0.004ms event:[n20,s19,r1/8:/{Min-System/}] waiting for read lock
30626.970ms 0.052ms event:[n20,s19,r1/8:/{Min-System/}] read completed
Note that they're addressed to /Min. This trace is coming from https://github.com/cockroachdb/cockroach/blob/f8a7889c598150b45c933c96ee19dc522eb2f8d3/pkg/storage/txnwait/queue.go#L682-L684 and I suspect that we're querying a transaction that does not have a record yet (i.e. pusher.TxnMeta.Key == nil). For what it's worth, @spencerkimball has been dealing with issues like this lately with some of his contentionQueue changes.
things that would have been nice to have for the sake of debugging the performance issue:
@vilterp's replica matrix (https://github.com/cockroachdb/cockroach/pull/24855) would address at least the first two of these needs. Might be worth trying it out here.
What does "scatters replicas in addition to ranges" mean?
Sorry, I guess I can't type late at night. I meant "scatters replicas in addition to leases".
Yes, we would be seeing spurious QueryTxn requests to r1 in the current code because the txn wait queue isn't checking whether there's a key available for the pusher before querying it for dependencies. The lack of a key means the QueryTxn request gets addressed to /Min, which routes it to r1. This is a one-line fix as part of my current PR, but I'll break it out and send it for review.
Before scattering replicas, running tpc-c 5k on 15 nodes:
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
300.3s 29544.7 45.9% 26271.4 12348.0 85899.3 103079.2 103079.2 103079.2
After scattering replicas (and leases), running tpc-c 5k on 15 nodes:
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
300.4s 61047.7 94.9% 1106.1 117.4 369.1 9663.7 19327.4 28991.0
So if our goal is to have a usable workaround, scattering replicas basically gets us there. And if we want to remove the manual aspect of it, it looks like it'll be more impactful to work on stats-based replica rebalancing than lease rebalancing.
@a-robinson Nice improvement. The p99 and pMax latency looks higher than what I remember for tpcc-5k. Would be interesting to see how this compares to tpcc-5k with partitioning.
Running with --ramp=30s to ramp up the traffic more smoothly seems to help a lot with the tail latencies:
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
300.4s 60544.4 94.2% 143.7 134.2 218.1 260.0 453.0 4831.8
Of course, the sample sizes here aren't exactly large, so maybe the difference is just noise.
And to follow up on my questions from the other day:
range-lookup events. It's weird it hasn't dropped closer to 0, but it's not causing any problems anymore.

The hottest ranges in the cluster at this point are all from the tpcc.new_order table, followed by the tpcc.item table. Top 50 ranges in case you're curious, with the rangeIDs in the mid-hundreds being from new_order and the 30-thousands from item:
idx rangeID QPS nodeIDs leaseholder node
1: 496 217.292003 [5, 6, 4] 6
2: 793 216.760419 [5, 10, 6] 5
3: 653 216.523500 [5, 10, 6] 10
4: 448 216.106647 [5, 10, 6] 10
5: 557 215.943809 [15, 3, 1] 15
6: 458 215.132572 [15, 12, 1] 15
7: 533 193.794032 [5, 15, 4] 5
8: 730 188.153795 [5, 6, 3] 6
9: 293 159.987734 [10, 6, 9] 9
10: 413 154.168710 [5, 15, 4] 5
11: 613 152.213182 [15, 6, 1] 15
12: 267 151.527005 [5, 6, 4] 6
13: 797 151.262274 [5, 6, 3] 3
14: 611 150.982270 [6, 4, 3] 6
15: 457 150.963159 [5, 6, 3] 5
16: 724 150.941294 [5, 3, 1] 3
17: 615 150.589744 [5, 12, 4] 12
18: 33644 123.715932 [8, 6, 7] 8
19: 33197 112.783797 [8, 6, 7] 6
20: 388 112.059562 [5, 10, 4] 4
21: 540 97.112390 [5, 6, 4] 4
22: 34061 94.627812 [8, 6, 7] 7
23: 32384 92.206044 [8, 6, 7] 8
24: 32110 86.193180 [8, 6, 7] 6
25: 32675 82.701674 [8, 6, 7] 6
26: 34390 73.451197 [8, 6, 7] 6
27: 34878 70.419347 [8, 6, 7] 7
28: 266 69.683754 [5, 15, 4] 15
29: 31457 67.308302 [8, 6, 7] 6
30: 33275 67.023898 [8, 6, 7] 7
31: 976 65.729066 [5, 10, 6] 5
32: 36672 63.460209 [5, 15, 4] 15
33: 34217 60.512212 [8, 6, 7] 6
34: 493 59.469083 [5, 6, 12] 6
35: 34104 54.001622 [8, 6, 7] 7
36: 32033 53.894249 [8, 6, 7] 6
37: 31546 52.105440 [8, 6, 7] 6
38: 32199 52.043978 [8, 6, 7] 7
39: 34701 50.376507 [8, 6, 7] 8
40: 32436 49.500585 [8, 6, 7] 8
41: 32335 47.352396 [8, 6, 7] 6
42: 32210 46.446430 [8, 6, 7] 8
43: 34580 45.592824 [8, 6, 7] 8
44: 32663 44.654265 [8, 6, 7] 8
45: 33395 44.197098 [8, 6, 7] 8
46: 33678 44.139657 [8, 6, 7] 7
47: 31786 43.393267 [8, 6, 7] 8
48: 1 42.440460 [2, 14, 12] 2
49: 34153 41.976303 [8, 6, 7] 7
50: 33196 40.899344 [8, 6, 7] 8
Hey @a-robinson, we can discuss this in a separate issue, but how did you generate the hot ranges table above? Would love to get it into an endpoint so we can show it in the UI.
The new_order table has warehouse ID as the first component of its primary key. I'm surprised to see such a delta in QPS across those new_order ranges. And I'm guessing this is only a fraction of the new_order ranges.
Hey @a-robinson, we can discuss this in a separate issue, but how did you generate the hot ranges table above? Would love to get it into an endpoint so we can show it in the UI.
hottest_ranges.py from https://gist.github.com/a-robinson/54fbaa6628ae9f1ad9c6185ecd28edb9 run on the output of the /_status/raft endpoint.
Disclaimer: the qps numbers are averages over the last 30 minutes, not instantaneous measurements. Also, that raft debug endpoint is very slow (>1 minute) on a large cluster.
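Roughly, the approach is the following (a simplified sketch rather than the actual hottest_ranges.py from the gist; the queries_per_second field name and the nesting under "state" are assumptions about the raft debug payload, not its real schema):

# Simplified sketch of building a hottest-ranges report from the raft debug
# output. The "queries_per_second" field and the nesting under "state" are
# assumptions about the payload, not the real schema.
import json
import sys

def hottest_ranges(raft_debug_path, top_n=50):
    with open(raft_debug_path) as f:
        ranges = json.load(f).get("ranges", {})
    rows = []
    for range_id, rng in ranges.items():
        state = rng.get("state", {})
        qps = float(state.get("queries_per_second", 0.0))
        nodes = [r["node_id"] for r in state.get("desc", {}).get("replicas", [])]
        rows.append((qps, int(range_id), nodes))
    rows.sort(reverse=True)
    return rows[:top_n]

if __name__ == "__main__":
    # Usage: python hottest_ranges_sketch.py raft_debug.json
    for i, (qps, range_id, nodes) in enumerate(hottest_ranges(sys.argv[1]), 1):
        print("%d: r%d %.2f qps, replicas on nodes %s" % (i, range_id, qps, nodes))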
And I'm guessing this is only a fraction of the new_order ranges.
No, they're all represented there except for one outlier at the very end of the table that's responsible for fewer warehouses. Although that's ignoring the bunch of ranges in the table that are responsible for effectively none of the keyspace (e.g. /1764/3/2642-/1764/3/2642/0; see the quick check sketched after the table below), which makes it look like we're choosing split points incorrectly somewhere (reminiscent of https://github.com/cockroachdb/cockroach/pull/24896):
root@localhost:26257/defaultdb> show testing_ranges from table tpcc.new_order;
+-----------------+-----------------+----------+-------------+--------------+
| Start Key | End Key | Range ID | Replicas | Lease Holder |
+-----------------+-----------------+----------+-------------+--------------+
| NULL | /0 | 192 | {3,4,12} | 12 |
| /0 | /0/1/2101/0 | 36721 | {3,4,12} | 12 |
| /0/1/2101/0 | /138/9/2506 | 266 | {4,5,15} | 15 |
| /138/9/2506 | /272/10/2564 | 36672 | {4,5,15} | 15 |
| /272/10/2564 | /272/10/2564/0 | 337 | {4,5,15} | 15 |
| /272/10/2564/0 | /521/6/2127 | 413 | {4,5,15} | 15 |
| /521/6/2127 | /521/6/2127/0 | 460 | {5,10,15} | 15 |
| /521/6/2127/0 | /770/1/2590 | 533 | {4,5,15} | 15 |
| /770/1/2590 | /770/1/2590/0 | 198 | {3,6,15} | 15 |
| /770/1/2590/0 | /1000 | 293 | {6,9,10} | 6 |
| /1000 | /1018/7/2153 | 36711 | {6,9,10} | 6 |
| /1018/7/2153 | /1018/7/2153/0 | 412 | {4,9,14} | 9 |
| /1018/7/2153/0 | /1267/2/2616 | 458 | {1,12,15} | 1 |
| /1267/2/2616 | /1267/2/2616/0 | 552 | {1,4,15} | 1 |
| /1267/2/2616/0 | /1515/8/2179 | 557 | {1,3,15} | 1 |
| /1515/8/2179 | /1515/8/2179/0 | 215 | {4,5,12} | 12 |
| /1515/8/2179/0 | /1764/3/2642 | 267 | {2,4,6} | 6 |
| /1764/3/2642 | /1764/3/2642/0 | 339 | {4,5,11} | 11 |
| /1764/3/2642/0 | /1948/8/2638 | 388 | {4,5,10} | 4 |
| /1948/8/2638 | /1948/8/2638/0 | 492 | {3,6,14} | 14 |
| /1948/8/2638/0 | /2000 | 611 | {3,4,6} | 3 |
| /2000 | /2197/4/2201 | 36697 | {3,4,6} | 3 |
| /2197/4/2201 | /2197/4/2201/0 | 262 | {2,4,6,15} | 2 |
| /2197/4/2201/0 | /2445/9/2664 | 457 | {5,6,13} | 6 |
| /2445/9/2664 | /2445/9/2664/0 | 536 | {3,6,12} | 12 |
| /2445/9/2664/0 | /2605/5/2267 | 540 | {2,6,8,15} | 8 |
| /2605/5/2267 | /2605/5/2267/0 | 721 | {3,4,6} | 3 |
| /2605/5/2267/0 | /2853/10/2730 | 730 | {3,5,6} | 3 |
| /2853/10/2730 | /2853/10/2730/0 | 331 | {4,6,15} | 15 |
| /2853/10/2730/0 | /2921/10/2590 | 493 | {5,6,12} | 12 |
| /2921/10/2590 | /2921/10/2590/0 | 612 | {1,2,3} | 1 |
| /2921/10/2590/0 | /3000 | 613 | {1,6,15} | 1 |
| /3000 | /3170/6/2153 | 36685 | {1,6,15} | 1 |
| /3170/6/2153 | /3170/6/2153/0 | 614 | {1,6,15} | 1 |
| /3170/6/2153/0 | /3419/1/2616 | 615 | {4,5,12} | 12 |
| /3419/1/2616 | /3419/1/2616/0 | 384 | {4,5,12} | 4 |
| /3419/1/2616/0 | /3667/7/2179 | 496 | {4,5,6} | 6 |
| /3667/7/2179 | /3667/7/2179/0 | 551 | {3,6,7} | 3 |
| /3667/7/2179/0 | /3916/2/2642 | 724 | {1,3,5} | 5 |
| /3916/2/2642 | /3916/2/2642/0 | 794 | {3,5,15} | 15 |
| /3916/2/2642/0 | /4000 | 797 | {3,5,6} | 3 |
| /4000 | /4164/8/2205 | 36692 | {3,5,6} | 3 |
| /4164/8/2205 | /4164/8/2205/0 | 417 | {4,6,11,13} | 6 |
| /4164/8/2205/0 | /4413/3/2668 | 448 | {5,6,10} | 10 |
| /4413/3/2668 | /4413/3/2668/0 | 535 | {3,5,6} | 3 |
| /4413/3/2668/0 | /4661/9/2231 | 653 | {5,6,10} | 10 |
| /4661/9/2231 | /4661/9/2231/0 | 726 | {4,5,13} | 13 |
| /4661/9/2231/0 | /4910/4/2694 | 793 | {5,9,10} | 5 |
| /4910/4/2694 | /4910/4/2694/0 | 845 | {5,10,13} | 10 |
| /4910/4/2694/0 | /4999/10/3001 | 976 | {4,5,10} | 10 |
| /4999/10/3001 | /5000 | 1091 | {4,6,11,13} | 6 |
| /5000 | NULL | 36731 | {4,6,13} | 4 |
+-----------------+-----------------+----------+-------------+--------------+
(52 rows)
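A quick way to flag those degenerate ranges from output like the above (a throwaway sketch; it just checks whether a range's end key is its start key with "/0" appended, which per the observation above means the range covers essentially no keyspace):

# Throwaway check for the nearly empty ranges mentioned above. Expects
# "start_key end_key" pairs on stdin, e.g. pulled out of the
# SHOW TESTING_RANGES output with awk.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 2:
        continue
    start_key, end_key = fields[0], fields[1]
    if end_key == start_key + "/0":
        print("suspicious split: %s - %s" % (start_key, end_key))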
No, they're all represented there except for one outlier at the very end of the table that's responsible for fewer warehouses. Although that's ignoring the bunch of ranges in the table that are responsible for effectively none of the keyspace (e.g. /1764/3/2642-/1764/3/2642/0), which makes it look like we're choosing split points incorrectly somewhere (reminiscent of #24896):
We don't manually split the new_order table. I'm not sure why not. Perhaps there are enough ranges already to not require splitting. The strange nearly empty ranges might indicate badness (or a bug) in the automatic splitting heuristics.
Using partitioning gets basically the same results as with replica scattering:
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
300.4s 60837.6 94.6% 84.7 71.3 130.0 201.3 352.3 1006.6
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
300.4s 60587.1 94.2% 88.9 79.7 130.0 184.5 369.1 906.0
The strange nearly empty ranges might indicate badness (or a bug) in the automatic splitting heuristics.
It looks to me like the tpc-c 5k fixture I'm RESTOREing has the bad splits already in it. It's as if the cluster that created the backup had partitioned the table and was running a version before https://github.com/cockroachdb/cockroach/pull/24896. Either that, or the RESTORE process itself creates the bad splits somehow. The bad splits are consistently there as soon as the data has been restored.
It looks to me like the tpc-c 5k fixture I'm RESTOREing has the bad splits already in it. It's as if the cluster that created the backup had partitioned the table and was running a version before #24896. Either that, or the RESTORE process itself creates the bad splits somehow. The bad splits are consistently there as soon as the data has been restored.
If this is reproducible, can you file a separate issue about this? Seems like something for the bulkio folks to investigate.
Consider this investigated. Resulted in https://github.com/cockroachdb/cockroach/pull/28340 and https://github.com/cockroachdb/cockroach/pull/28340 after the prototype in https://github.com/cockroachdb/cockroach/pull/26608.