Discussion in Gitter:
<@erichocean> if I had to say, the most disappointing thing for me with CRDB is that it refuses to use all of the compute capacity of my cluster
<@erichocean> so at least for me, running the query twice would be faster :)
<@knz> tell me more about using the compute capacity?
<@erichocean> like, I can't get simple counts of a table with 5000+ ranges on a 15 node cluster to use more than 10% of total CPU
<@erichocean> explain(distsql) shows that it's hitting every box
<@erichocean> is there some kind of CPU governor that keeps the CPUs from being overwhelmed or something?
<@knz> no, but with counts we (think we) know from experience that the bottleneck is I/O, not cpu
<@erichocean> maybe caching just isn't as mature
<@knz> well
<@knz> there are also specific count-related optimizations we're working on. Counting is such a common case that we ought to optimize it specifically.
<@erichocean> I guess I'm surprised then that CRDB is so easily IO bound
<@erichocean> I don't have a ton of cores (120 across all nodes)
<@erichocean> I don't know, when working on single node databases, queries are both faster and also seem to be able to hit full CPU utilization much quicker/easier
<@erichocean> it's just an observation, I'm not unhappy or whatever
<@erichocean> I just feel like I'm not getting as much value out of the machines, mostly
<@erichocean> I can get it to peg the CPUs if I do enough inserts at once
<@erichocean> but not really on queries
<@knz> interesting
<@knz> but what kind of queries would you run that touch so many ranges that they would max cpu?
<@knz> (besides count)
<@erichocean> other ones that use agg() functions
<@knz> yeah ok
<@erichocean> I think my mental model is: I asked you to do this query, MELT THE HARDWARE TO GET IT TO ME
<@knz> solid feedback
<@erichocean> but then I sit and wait, and since I've got plenty of time, I can observe the cluster and see that it's, basically, idling
The situation would be better if we could respond to this concern clearly, e.g. "we're not using 100% CPU because ..." or "we'll work on improving this by doing ..."
I think there simply needs to be more documentation or blog posts explaining that this is not a matter of CockroachDB being engineered the wrong way; the fact is that much of what CRDB does is I/O. As the software matures and people begin to understand how fundamentally different it is from MySQL/PostgreSQL/Cassandra, there will be fewer questions like "why is Cockroach slower than XX?" and people will begin to pay more attention to their infrastructure and networking configs.
Here's an example query that, to me at least, should run faster than it does:
select count(*) from products_m1;
This takes 3m12.276121972s to count 12953329 rows. CPU time across my 15 node cluster during the query never exceeds 25%. The table being queried is 190GB in size and has 5095 ranges. The primary key is a short string (less than 15 characters). Each node has 4 cores/8 threads at 3.4GHz, 32GB of RAM, two SSDs (each set up as a store), 10GigE networking, and runs nothing but cockroach.
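As mentioned in the Gitter excerpt above, EXPLAIN (DISTSQL) can be used to confirm the query really fans out to every node. A minimal check (the plan diagram linked in the output shows which nodes host TableReader processors):

```sql
-- Show the physical DistSQL plan for the count query.
EXPLAIN (DISTSQL) SELECT count(*) FROM products_m1;
```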
With just 12.9 million items, even if we double the number of reads to account for consistency, and we assume a unique I/O for every key, that's 860K I/Os per SSD in total, or ~4480 I/Os per second per SSD. I refuse to believe that those are reasonable numbers, and that Cockroach is somehow "I/O bound" on this query—it's off by at least one order of magnitude, and closer to two orders of magnitude.
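Spelled out, here's the arithmetic behind those numbers (assuming 15 nodes x 2 stores = 30 SSDs and the ~192s runtime above):

```sh
# One unique I/O per key, doubled for consistency, spread over 30 SSDs.
echo "2 * 12953329 / 30" | bc        # ~863555 I/Os per SSD in total
echo "2 * 12953329 / 30 / 192" | bc  # ~4497 I/Os per second per SSD
```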
Is there a way to see how many I/Os a given Cockroach node is doing?
> Is there a way to see how many I/Os a given Cockroach node is doing?
You can run sar or iostat on some or all of the nodes while the query is running.
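For example (a minimal sketch; exact flags vary with your sysstat version):

```sh
# Extended per-device I/O statistics, refreshed every second.
iostat -xd 1
# Or the sar equivalent: block device activity at one-second intervals.
sar -d 1
```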
How many total ranges are in your cluster? CockroachDB currently balances ranges across the cluster, but does not ensure the ranges for a particular table are balanced. If your cluster has a lot more than 5k total ranges, the ranges for your products_m1 table may be on a subset of the nodes.
While your primary key is small, the size of each row is relatively large: 190GB / 12.9m == 15.4KB. The select count(*) query has to iterate over all of this data. More math says that your cluster is processing 0.98GB/sec for this query. It would be interesting to use iostat or sar to see what each individual node is doing. With this query, the DistSQL execution engine shouldn't be sending any data over the network, so you should also be able to monitor network traffic and verify that.
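A simple way to watch that, assuming the sysstat tools are installed on the nodes:

```sh
# Per-interface receive/transmit rates at one-second intervals. During the
# query, rxkB/s and txkB/s should stay near baseline if DistSQL is only
# shipping per-node counts rather than row data.
sar -n DEV 1
```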
I have 28982 ranges in total.
I ran fio on a read-write workload to get a sense of what the hardware can do. Running iostat during the query, reads are less than a tenth of what the hardware can handle, and writes are about 1/3: fio reports 20723 reads/sec and 6912 writes/sec, while Cockroach is doing approx. 1000 reads/sec and 2500 writes/sec during the query (per store).
This is the fio command I used to establish baseline IOPS numbers:
./fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.9
Starting 1 process
test: Laying out IO file(s) (1 file(s) / 4096MB)
Jobs: 1 (f=1): [m] [100.0% done] [80912K/27704K /s] [20.3K/6926 iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=30496: Tue May 8 23:16:33 2018
read : io=3071.5MB, bw=82895KB/s, iops=20723 , runt= 37942msec
write: io=1024.6MB, bw=27651KB/s, iops=6912 , runt= 37942msec
cpu : usr=3.12%, sys=11.92%, ctx=913850, majf=0, minf=16
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=786297/w=262279/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
READ: io=3071.5MB, aggrb=82894KB/s, minb=82894KB/s, maxb=82894KB/s, mint=37942msec, maxt=37942msec
WRITE: io=1024.6MB, aggrb=27650KB/s, minb=27650KB/s, maxb=27650KB/s, mint=37942msec, maxt=37942msec
Disk stats (read/write):
sda: ios=783170/262534, merge=151/1922, ticks=1904983/691094, in_queue=2669399, util=99.54%
Here's the iostat 1 numbers for the highest utilization during the query:
avg-cpu: %user %nice %system %iowait %steal %idle
33.46 0.00 4.31 12.93 0.00 49.30
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 770.00 48.00 1494.00 432.00 9720.00 13.17 0.22 0.14 0.44 0.13 0.13 20.80
sda 0.00 874.00 971.00 1483.00 18288.00 10248.00 23.26 1.26 0.52 0.68 0.41 0.31 76.10
If I didn't know the query was running, I couldn't tell from monitoring iostat that it was.
> I have 28982 ranges in total.
I think it is very possible that your products_m1 table is not spread across the cluster. What does the following query show:
SELECT count("Lease Holder"), "Lease Holder"
FROM [SHOW EXPERIMENTAL_RANGES FROM TABLE products_m1]
GROUP BY "Lease Holder"
ORDER BY count("Lease Holder");
The %util on the device is quite high (76%). If there is a significant amount of write activity on the cluster this is to be expected, as we have to sync data to disk. Is it possible to run your select count(*) query while the cluster is otherwise idle?
I haven't used fio recently (so I don't have the parameters paged in), but simulating the same behavior there is possible. This requires the correct mix of syncing and concurrency. It is very possible that something isn't tuned well in CockroachDB in this area.
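Something along these lines might get closer to CockroachDB's write pattern than the pure async run above (a sketch only; the parameters are illustrative, not a calibrated match):

```sh
# Mixed random read/write with an fsync after every write and moderate
# queue depth, to approximate sync-heavy database behavior.
./fio --name=synctest --filename=synctest --ioengine=libaio --direct=1 \
      --bs=4k --size=4G --readwrite=randrw --rwmixread=75 \
      --iodepth=16 --fsync=1
```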
Large analytics-style queries like this have not been a focus for CRDB so far. We're starting to pay more attention to them, but OLTP remains our primary focus.
Here's the result:
+-------+--------------+
| count | Lease Holder |
+-------+--------------+
| 43 | 28 |
| 50 | 22 |
| 54 | 15 |
| 62 | 16 |
| 70 | 3 |
| 71 | 4 |
| 71 | 19 |
| 89 | 1 |
| 89 | 24 |
| 91 | 25 |
| 93 | 8 |
| 97 | 18 |
| 108 | 30 |
| 119 | 7 |
| 121 | 6 |
| 143 | 14 |
| 153 | 27 |
| 161 | 29 |
| 186 | 5 |
| 194 | 21 |
| 205 | 26 |
| 237 | 20 |
| 238 | 13 |
| 248 | 10 |
| 253 | 17 |
| 289 | 23 |
| 318 | 2 |
| 333 | 9 |
| 360 | 12 |
| 556 | 11 |
+-------+--------------+
That looks as I expected: not particularly balanced. There is a 10x difference in leaseholders per store from the least loaded to most loaded store. I think this also implies a 10x difference in how the query is distributed (cc @arjunravinarayan and @RaduBerinde who can confirm that).
@a-robinson This is another indication that we may want to consider balancing the ranges for a particular table across a cluster.
I ran the query on a completely idle cluster, and specifically monitored node 6 which has the most ranges assigned. Here's the iostat -xcd output for the highest %user:
avg-cpu: %user %nice %system %iowait %steal %idle
21.08 0.00 3.89 5.14 0.00 69.89
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 110.00 923.00 356.00 5232.00 80632.00 134.27 7.84 6.13 0.12 21.72 0.26 33.70
sda 0.00 80.00 1344.00 133.00 31192.00 876.00 43.42 0.20 0.14 0.11 0.40 0.11 16.30
Link to entire iostat output: https://gist.github.com/erichocean/4ac96fddd0b4be9bd4545cfa21c3cdd2
Admin UI shows sda as lease holder 11, and sdb as lease holder 12.
So even with the unbalanced lease-holder aspect, Cockroach still isn't pegging either I/O or the CPU on this query on the heaviest node. At worst, the query should complete about 2.5x slower than on a perfectly balanced cluster for this table, but I don't think we're seeing that.
We might also want to schedule multiple tablereaders for the same table if there are a lot of ranges on one machine.
I'll close this one due to age. We've done a lot on this front with vectorized execution. We still don't have a way of adding extra parallelism to OLAP queries, but we don't have immediate plans to do that as we're concentrating on OLTP workloads.