Scylla: Invalid aggregation result on table with index

Created on 7 Oct 2020  Â·  9Comments  Â·  Source: scylladb/scylla

When using aggregates (COUNT(*)) on table with index on clustering key and filtering on clustering key, wrong results are returned - returned count does not match number of rows that match WHERE.

cqlsh> CREATE KEYSPACE ks WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> CREATE TABLE ks.t(a int, b int, c int, PRIMARY KEY ((a, b), c));
cqlsh> INSERT INTO ks.t(a, b, c) VALUES (1, 2, 3);

cqlsh> SELECT * FROM ks.t WHERE c = 3 ALLOW FILTERING;

 a | b | c
---+---+---
 1 | 2 | 3

(1 rows)
cqlsh> SELECT COUNT(*) FROM ks.t WHERE c = 3 ALLOW FILTERING;

 count
-------
     1

(1 rows)

cqlsh> CREATE INDEX kst_index ON ks.t(c);
cqlsh> SELECT * FROM ks.t WHERE c = 3;

 a | b | c
---+---+---
 1 | 2 | 3

(1 rows)
cqlsh> SELECT COUNT(*) FROM ks.t WHERE c = 3;

 count
-------
     0

(↑ wrong! - should be 1)

(1 rows)
cqlsh> INSERT INTO ks.t(a, b, c) VALUES (1, 2, 4);
cqlsh> INSERT INTO ks.t(a, b, c) VALUES (1, 2, 5);
cqlsh> SELECT * FROM ks.t WHERE c = 3;

 a | b | c
---+---+---
 1 | 2 | 3

(1 rows)
cqlsh> SELECT COUNT(*) FROM ks.t WHERE c = 3;

 count
-------
     2

(1 rows)

(↑ wrong! - should be 1)

This bug is the root cause of this issue: https://github.com/scylladb/scylla/issues/7043 (at least one of causes...)

Preliminary analysis on the bug

  • It was introduced in some version of Scylla between 3.1.0 and 3.2.0 (doing bisect now to determine the specific commit).
  • It seems that wrong base table read_command is created for the COUNT(*) query:

read_command for SELECT *:

TRACE 2020-10-06 17:15:01,769 [shard 2] storage_proxy - query ks.t cmd=read_command{cf_id=083b3280-07e3-11eb-bbac-000000000006, version=4f1ed180-28ab-3979-8ad8-d91aba6f0378, slice={regular_cols=[], static_cols=[], rows=[{ckp{000400000003}}], options=103, cql_format=4, partition_row_limit=4294967295}, limit=18446744073709551615, timestamp=1601997301}, partition_limit=4294967295}, ranges={{{4881097376275569167, pk{000400000001000400000002}}}}, id=21

read_command for SELECT COUNT(*): ((ckp{000400000003}, +inf) seems wrong)

TRACE 2020-10-06 17:15:07,683 [shard 2] storage_proxy - query ks.t cmd=read_command{cf_id=083b3280-07e3-11eb-bbac-000000000006, version=4f1ed180-28ab-3979-8ad8-d91aba6f0378, slice={regular_cols=[], static_cols=[], rows=[{ckp{000400000003}}], specific=[{pk{000400000001000400000002} : (ckp{000400000003}, +inf)}], options=103, cql_format=4, partition_row_limit=4294967295}, limit=18446744073709551615, timestamp=1601997307}, partition_limit=4294967295}, ranges={{{4881097376275569167, pk{000400000001000400000002}}}}, id=23

Installation details
Scylla version (or git commit hash): 92e78da2d6bb1e7c61cbc8019e692e0f1fe8268f
Cluster size: 1 node
OS (RHEL/CentOS/Ubuntu/AWS AMI): Fedora 32

bug sec-index

Most helpful comment

Just to sum up the investigation @avelanarius @nyh @penberg and I have done:
It seems that the problem is with partition_slice::set_range called in two places here:
https://github.com/scylladb/scylla/blob/f30e86395aea5f8cfae311706794eca893f36001/cql3/statements/select_statement.cc#L538-L550

set_range overrides the partition_slice::_row_ranges and that means losing the filtering of the query.
The solution probably is to add a new member function to partition_slice called add_range which will take a new range as a parameter and instead of setting it alone as a specific ranges for a partition, it would merge this given range with _row_ranges and set such super set as specific ranges for a partition. This new add_range function should be used instead of set_range in the code above.

All 9 comments

@penberg may be interested to know about this. SI is his baby ;)

Also @dekimir may be familiar with COUNT machinery since he's the king of CQL layer ;)

Bisected the regression to this commit: bb08af7e68197f7cb7b12c38a3aaa708d534857f

cc: @psarna

bb08af7e68197f7cb7b12c38a3aaa708d534857f is the first bad commit
commit bb08af7e68197f7cb7b12c38a3aaa708d534857f
Author: Piotr Sarna <[email protected]>
Date:   Mon Jun 17 14:34:16 2019 +0200

    cql3: add proper aggregation to paged indexing

    Aggregated and paged filtering needs to aggregate the results
    from all pages in order to avoid returning partial per-page
    results. It's a little bit more complicated than regular aggregation,
    because each paging state needs to be translated between the base
    table and the underlying view. The routine keeps fetching pages
    from the underlying view, which are then used to fetch base rows,
    which go straight to the result set builder.

    Fixes #4540

 cql3/statements/select_statement.cc | 54 +++++++++++++++++++++++++++++++++++++

Well done @avelanarius!!! Now you just need to fix the issue :)

I found specifically where the issue is, however I don't fully understand the code.

One thing I don't understand is why paging_state returned from find_index_partition_ranges, which is paging state from index not base table:

https://github.com/scylladb/scylla/blob/f30e86395aea5f8cfae311706794eca893f36001/cql3/statements/select_statement.cc#L1002-L1011

is passed into do_execute_base_query (via options.get_paging_state(), as set in line 1006):

https://github.com/scylladb/scylla/blob/f30e86395aea5f8cfae311706794eca893f36001/cql3/statements/select_statement.cc#L538-L550

(lines 545, 546 cause the invalid result to be returned)

@penberg @haaawk please help

Just to sum up the investigation @avelanarius @nyh @penberg and I have done:
It seems that the problem is with partition_slice::set_range called in two places here:
https://github.com/scylladb/scylla/blob/f30e86395aea5f8cfae311706794eca893f36001/cql3/statements/select_statement.cc#L538-L550

set_range overrides the partition_slice::_row_ranges and that means losing the filtering of the query.
The solution probably is to add a new member function to partition_slice called add_range which will take a new range as a parameter and instead of setting it alone as a specific ranges for a partition, it would merge this given range with _row_ranges and set such super set as specific ranges for a partition. This new add_range function should be used instead of set_range in the code above.

tl;dr Our conclusions from yesterday are very likely wrong!

After the meeting on yesterday, I still didn't fully understand how the paging is done. I had a hunch that just fixing partition_slice::set_range() is not the proper fix - instead that something is fundamentally wrong with the paging.

Example case when paging is wrong and set_range does not "override"

I found the following variant of the bug:

cqlsh> CREATE TABLE ks.t3(pk1 int, pk2 int, ck int, PRIMARY KEY((pk1, pk2), ck));
cqlsh> INSERT INTO ks.t3(pk1, pk2, ck) VALUES (1, 1, 1);
cqlsh> SELECT COUNT(*) FROM ks.t3 WHERE pk2 = 1 ALLOW FILTERING;

 count
-------
     1

(1 rows)
cqlsh> CREATE INDEX ks_t3 ON ks.t3(pk2);
cqlsh> SELECT COUNT(*) FROM ks.t3 WHERE pk2 = 1;

 count
-------
     0

(1 rows)
cqlsh> INSERT INTO ks.t3(pk1, pk2, ck) VALUES (1, 1, 2);
cqlsh> INSERT INTO ks.t3(pk1, pk2, ck) VALUES (1, 1, 3);
[... and so on ...]
cqlsh> INSERT INTO ks.t3(pk1, pk2, ck) VALUES (1, 1, 9998);
cqlsh> INSERT INTO ks.t3(pk1, pk2, ck) VALUES (1, 1, 9999);
cqlsh> INSERT INTO ks.t3(pk1, pk2, ck) VALUES (1, 1, 10000);
cqlsh> SELECT COUNT(*) FROM ks.t3 WHERE pk2 = 1;

 count
-------
     0

(1 rows)
cqlsh> INSERT INTO ks.t3(pk1, pk2, ck) VALUES (1, 1, 10001);
cqlsh> SELECT COUNT(*) FROM ks.t3 WHERE pk2 = 1;

 count
-------
     1

(1 rows)
cqlsh> INSERT INTO ks.t3(pk1, pk2, ck) VALUES (1, 1, 10002);
cqlsh> SELECT COUNT(*) FROM ks.t3 WHERE pk2 = 1;

 count
-------
     2

(1 rows)
cqlsh> DROP INDEX ks.ks_t3;
cqlsh> SELECT COUNT(*) FROM ks.t3 WHERE pk2 = 1 ALLOW FILTERING;

 count
-------
 10002

(1 rows)

As you can see the index is not created on the clustering key.

If we analize what is happening at first SELECT COUNT(*) FROM ks.t3 WHERE pk2 = 1; after creating the index, it is not the case that set_range "overrides" some existing filtering on clustering key, because we were not filtering on clustering key in the first place! Making set_range intersect with _row_ranges would not achieve anything, because it is empty.

As you can see in later queries, SELECT COUNT(*) starts to return non-zero results after adding 10000 rows - the default page size. Let's analize what happens in such query:

cqlsh> SELECT COUNT(*) FROM ks.t3 WHERE pk2 = 1;

 count
-------
     2

(1 rows)
Analysis of this query

First, we query the index:
https://github.com/scylladb/scylla/blob/f30e86395aea5f8cfae311706794eca893f36001/cql3/statements/select_statement.cc#L1003

Querying the index returns a 10000-row page and paging_state is set to 10000th row of the index.

https://github.com/scylladb/scylla/blob/f30e86395aea5f8cfae311706794eca893f36001/cql3/statements/select_statement.cc#L1003-L1010

We set the paging_state of internal_options. Now the base query will read rows using this paging_state, which means rows with ck > 10000 (there are two rows).

In next iteration, we query the index again. We get the remaining 2-row page of index. The paging state is now set "past" all rows. When we do the base query, we get nothing (hence the total sum is 2).

I think the solution is to move changing internal_options after base query was done, so that when we do base query we read the rows that correspond to what we queried in index, not one page off.

https://github.com/scylladb/scylla/blob/f30e86395aea5f8cfae311706794eca893f36001/cql3/statements/select_statement.cc#L1006-L1009

(move internal_options.reset before return stop_iteration)

Unfortunately the unit tests are not very thorough and they don't test case when a single partition is large (in context of aggregates). I will continue the work on Monday.

Another bug

I have found another bug, this time with GROUP BY:

cqlsh> CREATE KEYSPACE ks WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> CREATE TABLE ks.t(pk int, ck int, v int, PRIMARY KEY(pk, ck));
cqlsh> INSERT INTO ks.t(pk, ck, v) VALUES (1, 2, 3);
cqlsh> INSERT INTO ks.t(pk, ck, v) VALUES (1, 4, 3);
cqlsh> SELECT pk FROM ks.t WHERE v=3 GROUP BY pk ALLOW FILTERING;

 pk
----
  1

(1 rows)
cqlsh> CREATE INDEX ks_t on ks.t(v);
cqlsh> SELECT pk FROM ks.t WHERE v=3 GROUP BY pk;

 pk
----
  1
  1

(2 rows)

It is somewhat releated, but different code path is involved - find_index_clustering_rows (not find_index_partition_ranges) and the other overload of do_execute_base_query. Have not investigated it further.

On Fri, Oct 9, 2020 at 12:08 PM Piotr Grabowsk wrote:

tl;dr Our conclusions from yesterday are very likely wrong!

After the meeting on yesterday, I still didn't fully understand how the
paging is done. I had a hunch that just fixing
partition_slice::set_range() is not the proper fix - instead that
something is fundamentally wrong with the paging.
Example case when paging is wrong and set_range does not "override"

Very nice detective work.

Despite the venerable Occam's Razor, my guess is that there is more than
one bug at play here. It is still possible that the problem we discussed
yesterday exists in one specific case, and a different problem exists in
another case. In particular, in cases where we count()ed fewer than the
correct items, like 0 instead of 1 in one of your tests yesterday, or 2
instead of 10002 in your new test, it is possible that we missed some
entire pages. But in other cases where we count()ed more than the correct
times, like 2 instead of 1 in one of your tests yesterday, it is possible
that "overriding" row_ranges led us to count too many results that
shouldn't even match. These would be two seperate bugs.

My favorite approach is to begin by writing a suite of tests which attempt
to cover all the possible cases:

  1. Restriction involving a 1. partition key, 2. non-first multiple
    partition key column, 3. clustering key, 4. non-first clustering key
    column 5. regular column
  2. a full select, a count() and and some other aggregators.
  3. Cases involving different amounts of paging of the index and the base
    table - i.e., many vs few matchting partitions, long vs small partitions.
    Note that for this you don't need any huge data sets - you can also have a
    small data set and very small page sizes.

One of the benefits of programmatic testing - in C++ or Python instead of
text files - is that you may not need to write 500 lines of test code to
cover all these cases, because you can have loops or functions.

It seems like you're off to a good start finding many of these cases (which
is great!), but please try to be thorough in your search of the problem
space so that you can at least convince yourself that you covered all the
code paths (since we don't use any formal code coverage tool,
unfortunately).

In any case, it would be good to have tests that cover the bugs that you
already found (I think we have 4 or 5 known-broken requests already).

Backported to 4.2, 4.3. 4.4 had it already.

Was this page helpful?
0 / 5 - 0 ratings