Scylla: Reverse-ordered queries (Alternator's ScanIndexForward=false) with large partitions

Created on 28 Apr 2020 · 28 Comments · Source: scylladb/scylla

In issue #5153 we added support for the ScanIndexForward=false parameter of Query, which requests the query to return items in reversed sort order. This feature already works, and uses the same internal mechanism (query::partition_slice::option::reversed) which we use in CQL's "ORDER BY ... DESC" feature.

However, it has been _claimed_ that our implementation of the query::partition_slice::option::reversed feature is inefficient, and also allocates a lot of memory for long partitions. Ideally, scanning a partition in reverse would be almost as fast as scanning it forward and take O(1) memory (unrelated to the partition length), but probably this is not the case today.

The first thing we should do for this issue is to test the performance and memory use of reverse queries compared to normal-order queries (see also issue #6278 about measuring and optimizing the latter) and check that we actually have a problem. Once we verify a problem, we should fix it.

Unfortunately, neither version 2 nor version 3 of the sstable format has a length field at the end of a row, which makes it hard to skip back to the previous row. But we can still use the promoted index to at least move back in blocks of 64KB, which is efficient (it's not O(1), because the length of the promoted index itself is O(N), but for partitions which aren't huge it's small).
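To make the block-wise idea concrete, here is a minimal sketch in Python. It is not Scylla code: `rows` stands in for the on-disk rows of one partition, and `block_starts` is a hypothetical stand-in for promoted index entries (the position of the first row of each block). The point is only that a reverse scan needs to hold one block at a time, not the whole partition.

```python
from typing import Iterator, List, Sequence


def reverse_scan_by_blocks(rows: Sequence[str], block_starts: List[int]) -> Iterator[str]:
    """Yield rows in reverse order, buffering at most one index block at a time.

    block_starts[i] is the position of the first row of block i
    (a hypothetical stand-in for a promoted index entry).
    """
    # The end of each block is the start of the next one; the last block ends at len(rows).
    block_ends = block_starts[1:] + [len(rows)]
    # Visit blocks from last to first.
    for start, end in zip(reversed(block_starts), reversed(block_ends)):
        block = list(rows[start:end])  # forward read of a single block
        yield from reversed(block)     # reverse only this block in memory


rows = [f"row{i}" for i in range(10)]
block_starts = [0, 4, 8]  # three blocks: rows 0-3, 4-7, 8-9
result = list(reverse_scan_by_blocks(rows, block_starts))
```

Memory use here is bounded by the largest block rather than by the partition length, but the list of block starts still grows with the partition, which is the O(N) caveat mentioned above.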

Beyond a single query page, we will also want to be able to resume readers on the following pages, even when iterating in reverse order. Issue #6278 discusses paging performance in Alternator for forward-ordered queries; we need the same for reverse-ordered queries.

Alternator CQL bug

All 28 comments

We have a more generic issue for the underlying infrastructure to support reading in reverse: https://github.com/scylladb/scylla/issues/1413.

Reverse queries are not just inefficient, they are very dangerous, as they can single-handedly OOM a node. For this reason we have a hard limit on the memory they are allowed to consume, and we abort those that want to consume more. So reverse queries are not guaranteed to succeed; see https://github.com/scylladb/scylla/issues/5804.

So according to what @denesb wrote this is not just a performance issue, a reverse query will outright fail if a partition is larger than 1MB... We need a test for this case, and of course to fix it. I'll update the title and tags accordingly.
There is no real reason why a reverse query should need to read into memory the entire partition... There is no reason to read into memory more than a single promoted index chunk (around 64KB) - or perhaps a few of them at once on spinning disks.

Reverse queries can be implemented using the prev_unfiltered_size in m format sstables.

Note that this is quite difficult, since all of our sstable code is oriented toward reading a byte stream.

We might construct a virtual input_stream that takes row data and presents it in reverse order. The parser will have to cooperate by feeding prev_unfiltered_size back to the stream.
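As a rough illustration of walking backward with a "previous size" field, here is a Python sketch. The byte layout is invented for the example (an 8-byte header of previous-row size and payload length) and is not the real m-format encoding; `encode_rows` and `iter_reverse` are hypothetical names. It only shows how a reader that knows the offset of the last row can hop back one row at a time without buffering the partition.

```python
import struct
from typing import Iterator, List, Tuple

# Hypothetical row header: (size of previous row in bytes, payload length).
HEADER = struct.Struct("<II")


def encode_rows(payloads: List[bytes]) -> Tuple[bytes, int]:
    """Serialize rows forward; return the buffer and the offset of the last row."""
    out = bytearray()
    prev_size = 0      # 0 marks the first row: there is nothing before it
    last_offset = 0
    for payload in payloads:
        last_offset = len(out)
        out += HEADER.pack(prev_size, len(payload)) + payload
        prev_size = HEADER.size + len(payload)
    return bytes(out), last_offset


def iter_reverse(buf: bytes, last_row_offset: int) -> Iterator[bytes]:
    """Yield payloads from the last row to the first by following prev-size links."""
    offset = last_row_offset
    while True:
        prev_size, length = HEADER.unpack_from(buf, offset)
        start = offset + HEADER.size
        yield buf[start:start + length]
        if prev_size == 0:   # reached the first row
            return
        offset -= prev_size  # jump back to the start of the previous row


buf, last = encode_rows([b"a", b"bb", b"ccc"])
reversed_payloads = list(iter_reverse(buf, last))
```

In this sketch each step parses a single row forward after seeking backward, which is roughly the cooperation between parser and stream described above: the parser reports the previous size, and the stream uses it to decide where to read next.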

I can confirm that reverse queries do not work.

I migrated my application from cassandra and an "ORDER BY ... DESC" query stopped working for all partitions with more than ~4k rows.

Here is the error message:
ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for dash.daily_address_balance_changes - received 0 responses and 1 failures from 1 CL=ONE." info={'failures': 1, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}

And an error displayed by _sudo service scylla-server status_
[shard 1] storage_proxy - Exception when communicating with 192.168.1.104: std::runtime_error (Aborting reverse partition read because partition 2015-01-15 is larger than the maximum safe size of 1048576 for reversible partitions.

This behavior seems to be documented here: https://docs.scylladb.com/troubleshooting/reverse-queries/

@Antti-Kaikkonen what version of Scylla are you using? On master, it seems that reverse queries on partitions larger than about 1 MB are still allowed but produce a warning like:

WARN  2020-08-07 16:42:55,076 [shard 0] flat_mutation_reader - Memory usage of reversed read exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit), while reading partition WHW4ON6L1P

But once you pass 100 MB (max_memory_for_unlimited_query_hard_limit) it becomes an error:

ERROR 2020-08-07 16:42:55,112 [shard 0] storage_proxy - Exception when communicating with 127.1.14.168, to read from alternator_alternator_Test_1596807771082.alternator_Test_1596807771082: std::runtime_error (Memory usage of reversed read exceeds hard limit of 104857600 (configured via max_memory_for_unlimited_query_hard_limit), while reading partition WHW4ON6L1P)

@nyh the soft/hard dual limit is quite a recent change that hasn't made it into any release yet.

From journalctl _COMM=scylla

Scylla version 4.0.1-0.20200524.8d9bc57aca6 with build-id 778b5fddea1d144ace993fe09de0ef6f050bafe6 starting ...

Would you mind giving the nightly build a try (Docker)?

Unfortunately being able to (ineffectively?) query 100x bigger partitions doesn't really solve the issue for me. I would be happy to test when there is a better solution. I wouldn't really mind reverse queries being 10x slower as long as the speed doesn't depend on the partition size. For now I will probably create a duplicate table in reversed clustering order if I absolutely need to traverse the results in both directions.

We have had to switch back to Cassandra due to this :-( . I hope you guys can fix it soon. We need to get the last (highest) value in a clustering column. Using max works but is really inefficient and slow for large partitions; it seems to scan the whole partition. What works well in Cassandra is a reverse sort with a limit of 1, which seems to take the last item in the partition directly. In Scylla, however, we get errors because it loads everything into memory to do the reverse sort just to get the top item. I think that when a query has a limit, you could take that many rows from the bottom and reverse-sort only those.

@slivne pls consider Nathan's input and assign it to someone

We can't update to 4.0 because of this, errors due to https://github.com/scylladb/scylla/issues/5804 forced us to revert.

@kharmabum if your reverse queries -- that used to work fine before -- are getting aborted, set max_memory_for_unlimited_query to a high enough number so that it fits your partition sizes to work around this problem.

Introducing this limit out of the blue was too harsh -- we realized that later. So 4.3 will provide two limits to tweak:

  • max_memory_for_unlimited_query_soft_limit: just log a warning if this limit is reached
  • max_memory_for_unlimited_query_hard_limit: abort the query -- the same as max_memory_for_unlimited_query currently.

The soft limit will default to the 1MB limit that max_memory_for_unlimited_query currently defaults to and max_memory_for_unlimited_query_hard_limit will default to a much higher value (100MB) to leave a little bit more headroom. Also these limits will apply to unpaged queries too.
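The two-limit behavior described above can be sketched in a few lines of Python. This is an illustration only, not the Scylla implementation: the class and method names (`ReversedReadBudget`, `consume`) are hypothetical, and the defaults are simply the 1MB and 100MB values quoted in this thread.

```python
import logging

log = logging.getLogger("reverse_read_sketch")

SOFT_LIMIT = 1 * 1024 * 1024    # 1048576 bytes, the default soft limit quoted above
HARD_LIMIT = 100 * 1024 * 1024  # 104857600 bytes, the default hard limit quoted above


class ReversedReadBudget:
    """Track memory used by one reversed read: warn once at the soft limit, abort at the hard limit."""

    def __init__(self, soft_limit: int = SOFT_LIMIT, hard_limit: int = HARD_LIMIT) -> None:
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.used = 0
        self.warned = False

    def consume(self, nbytes: int) -> None:
        """Account for nbytes more buffered data."""
        self.used += nbytes
        if self.used > self.hard_limit:
            # Hard limit: the query is aborted.
            raise RuntimeError(
                f"Memory usage of reversed read exceeds hard limit of {self.hard_limit}"
            )
        if self.used > self.soft_limit and not self.warned:
            # Soft limit: log a warning once and keep going.
            self.warned = True
            log.warning(
                "Memory usage of reversed read exceeds soft limit of %d", self.soft_limit
            )
```

With the defaults, a read that buffers 2MB would log one warning and continue, while one that grows past 100MB would raise. Note that neither limit makes the read any cheaper; they only decide when to complain and when to give up.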

@denesb if the improved limit configuration would help, we could perhaps backport it to 4.2? But I think @nathanleyton hit the nail right on the head: a reverse query with a small limit should work efficiently, and not have to read 100 MB. It's not good enough that we don't fail these queries; they also need to be reasonably efficient, even if we can't make them _as efficient_ as forward queries.

@nyh I agree, we should fix this. But in the meantime, people like @kharmabum who had reverse queries working just fine before can play with the limits to get Scylla not to reject their previously working queries.

@nyh we don't have real reverse sstable parsing, so reverse queries work by reading everything into memory and reversing it. Until we add reverse sstable readers, the best we can do is fail the query if it consumes too much memory.

We did increase the limit, but we can have a lot of records per partition and it keeps coming up. At the moment there is no efficient query to do what we need, and as we are growing, the query performance is degrading rapidly. We can't take it to production like this, I am afraid. In the meantime we are moving back to Cassandra for production, but we will keep an eye on this issue, and if something is done before we launch we will switch back to Scylla. Thanks for the advice and for taking a look at this.

Yes, unfortunately playing with the limit will not help at all with performance.

@Avi Kivity shall we open a new GitHub issue for a real reverse query that doesn't scan from the end? If it's done in C* and enabled by the sstable format, we should support it too.

Such issues don't help. They just sit there and collect dust.

We have had such an issue for a long time: #1413

We hit this problem in production too. Do you have an ETA for fixing this issue? Creating another MV just for reverse order is wasteful in our case :(.

Closing as duplicate of #1413.

Reopening. This is not really a duplicate: it is about the Alternator feature, while #1413 is about an internal implementation detail. We can close this again if we open a new Alternator issue which refers to #1413 at the top of the issue. But we need to track this as an Alternator issue... I'll do this later today.

Closing again. Opened #7586 for the Alternator-specific issue.

Well, you could have renamed it.

Yes, but I also wanted to get rid of most of the discussion.
