Scylla: How do I query the number of items in a table that contains many, many, many pieces of data?

Created on 27 May 2020 · 3Comments · Source: scylladb/scylla

Installation details
Scylla version (or git commit hash): 4.0.0-0.20200514.d95aa77b6
Cluster size: 9
OS (RHEL/CentOS/Ubuntu/AWS AMI): CentOS

I have a table with hundreds of millions of rows of data, all in the same partition.
How do I query the number of items? If I use
select count(*) from tablename;
it takes a very, very long time that I can't stand.

Who can help me, please~

Source

Felix-zhoux

Most helpful comment

please send questions to the mailing list, this tracker is for bugs.

scylla doesn't support fast counting of all rows in a table.

Putting all rows in the same partition will make things a lot worse. You have to use a large number of partitions.

avikivity on 27 May 2020

😄1 👍1

All 3 comments

please send questions to the mailing list, this tracker is for bugs.

scylla doesn't support fast counting of all rows in a table.

Putting all rows in the same partition will make things a lot worse. You have to use a large number of partitions.

avikivity on 27 May 2020

😄1 👍1

I get it, thanks!

Felix-zhoux on 27 May 2020

Hi,

You phrased this as a question, not as a specifc bug report or a feature request, so it's not appropriate for the issue tracker. The scylla users mailing list (or even the developers' mailing list) would have been a more suitable location for it.

Nevertheless, since you're already here, I'll try to answer before closing the issue:

In general the answer is unfortunately no - there is no way to get an accurate count of items without actually going through all of them, slowly. There are two reasons why this is the case:

First is the distributed nature of the cluster and how CL is used. Some data might be missing from some of the codes. For example, consider that we have three nodes, row X only exists on node 1 and 2, row Y only exists on nodes 2 and 3, while row Z only exists on nodes 1 and 3. In this example, if you ask each node it will tell you that it has only two rows. Yet if you read with CL=QUORUM, you'll actually read three rows (X, Y and Z).
Another problem is that even inside a single node, Scylla works using the LSM model (i.e., sstables). You may have a million rows of data, where sstable 1 contains some of them, sstable 2 others, sstable 3 may contain rows already in sstable 1 or 2 (so shouldn't be counted again), and sstable 4 may contain tombstones which delete rows in other sstables (so should be reduced from the count), and sstable 5 may contain tombstones which don't match any old data (so shouldn't be removed from the count). We really need to read all these sstables to get an accurate count - and that is what select count(*) does. If you think Scylla can maintain a row count in some side-array or something, consider this: You "INSERT" or "UPDATE" (it's the same...) a row. Should the row counter be increased, or not? This depends on whether the row previously existed or didn't, but Scylla doesn't know this - the whole point of LSM (and what makes Scylla so efficient with writes) is that you don't need to read the previous value of the row before writing a new value.

I'm not saying there can't be any opportunities of speeding up "select count(*)". If you have any idea for such opportunities please open a different issue about these specific opportunities. But for the reasons I just explained, you can't expect the counting to be immediate (i.e., just take a pre-calculated counter and return it).

After explaining why you can't efficiently get an accurate count of items, there are efficient approaches for getting approximate counts very quickly. https://github.com/scylladb/scylla/issues/4320 suggests one approach. But we haven't implemented this yet.

Another word of advice: As a general rule, it is not a good idea to have "hundreds of millions of rows of data" in the same partition. You should change your data model to have many partitions (e.g., perhaps divide the items to partitions based on their first characters, or whatever) instead of just one. We're trying our best to improve support for huge partitions, but it still has a bunch of problems. The most obvious problem is that this partition will only be held by RF (e.g., 3) CPUs. Even if you have a big cluster with 100 nodes with 100 CPUs each - 10,000 CPUs in total - only 3 of them will be able to work on this partition.