Installation details
Scylla version (or git commit hash): 4.0.0-0.20200514.d95aa77b6
Cluster size: 9
OS (RHEL/CentOS/Ubuntu/AWS AMI): CentOS
I have a table with hundreds of millions of rows of data, all in the same partition.
How do I query the number of items? If I use
select count(*) from tablename;
it takes an unbearably long time.
Can anyone help me, please?
Please send questions to the mailing list; this tracker is for bugs.
Scylla doesn't support fast counting of all rows in a table.
Putting all rows in the same partition will make things a lot worse. You have to use a large number of partitions.
I get it, thanks!
Hi,
You phrased this as a question, not as a specific bug report or a feature request, so it's not appropriate for the issue tracker. The Scylla users mailing list (or even the developers' mailing list) would have been a more suitable location for it.
Nevertheless, since you're already here, I'll try to answer before closing the issue:
In general the answer is unfortunately no - there is no way to get an accurate count of items without actually going through all of them, slowly. There are two reasons why this is the case:
I'm not saying there can't be any opportunities of speeding up "select count(*)". If you have any idea for such opportunities please open a different issue about these specific opportunities. But for the reasons I just explained, you can't expect the counting to be immediate (i.e., just take a pre-calculated counter and return it).
Having explained why you can't efficiently get an accurate count of items, I should add that there are efficient approaches for getting approximate counts very quickly. https://github.com/scylladb/scylla/issues/4320 suggests one such approach, but we haven't implemented it yet.
Another word of advice: As a general rule, it is not a good idea to have "hundreds of millions of rows of data" in the same partition. You should change your data model to have many partitions (e.g., perhaps divide the items into partitions based on their first characters, or whatever) instead of just one. We're trying our best to improve support for huge partitions, but it still has a bunch of problems. The most obvious problem is that this partition will only be held by RF (e.g., 3) CPUs. Even if you have a big cluster with 100 nodes with 100 CPUs each - 10,000 CPUs in total - only 3 of them will be able to work on this partition.
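To illustrate the "many partitions" advice, here is a minimal Python sketch of one possible bucketing scheme. It is an assumption for illustration, not something Scylla prescribes: the `bucket` partition-key column, the `NUM_BUCKETS` value, and the helper names are all hypothetical. The idea is that each item is assigned to a bucket by a stable hash, and the total count becomes the sum of many small single-partition counts that can be spread across the cluster. The in-memory `Counter` stands in for the table; actually sending the queries would need a driver, which is not shown.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 1024  # hypothetical value; choose it based on your data volume


def bucket_for(item_key, num_buckets=NUM_BUCKETS):
    """Map an item key to a stable bucket number in [0, num_buckets)."""
    # crc32 is deterministic across processes, unlike Python's built-in hash().
    return zlib.crc32(item_key.encode("utf-8")) % num_buckets


def per_bucket_count_queries(table, num_buckets=NUM_BUCKETS):
    """Build one single-partition COUNT query per bucket.

    Assumes a hypothetical integer partition-key column named `bucket`.
    """
    return [
        f"SELECT COUNT(*) FROM {table} WHERE bucket = {b};"
        for b in range(num_buckets)
    ]


def total_count(per_bucket_counts):
    """The full count is the sum of the per-bucket counts."""
    return sum(per_bucket_counts)


# In-memory stand-in for the table: count how many keys land in each bucket.
keys = [f"item-{i}" for i in range(10_000)]
rows_per_bucket = Counter(bucket_for(k) for k in keys)
total = total_count(rows_per_bucket.values())
```

Each per-bucket query touches one moderately sized partition, so the queries can run in parallel on different shards, and a failed or slow bucket can be retried on its own. This still scans every row; it only spreads the work out, it does not make the count immediate.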