Scylla: [FEATURE REQUEST] Allow LIKE operations

Created on 5 May 2019 · 34Comments · Source: scylladb/scylla

Hi,

This is as requested a copy of a feature requested on the dev ML

Copy of the original text:

I would like to ask for Scylla to implement LIKE operations on:

single partitions
range scans

The rationale is that since 3.0 we have optimizations on full table scans and server side filtering so it doesn't look too hard to allow this kind of operation before returning data.
That would also save network traffic in most cases.

The use cases are numerous.
At Numberly we already use full table scans on relatively small datasets that we have to filter on the backend code before returning it to the application. Now that our adoption of Scylla has grown, web developers do need this kind of feature to optimize some "search box" queries on acceptable datasets volumes (often pre-aggregated data which "only" need filtering.

This would:

permit to save developers from implementing this filtering on their backend code as well as Ops to avoid having to setup and operate an Elastic Search stack to handle fuzzy search when the use case fits in a simple LIKE operation.
be a great feature addition for developers and does fit very nicely into the "OLAP + OLTP why not both" direction IMHO.
remove a barrier for devs and reduces the "there's so many constraints in Scylla operations" that we can hear.

Can we make it happen? I'll be glad to help test this of course.

Thanks!

CQL User Request enhancement

Source

ultrabug

👍2

Most helpful comment

Hi, as milestones have been added now - could someone please confirm LIKE operations would be available as _plain_ filters (how I understand the original feature request) - or is it going to require some sort of predefined index?

LIKE will be available as a plain filter, no indexing necessary. Enjoy! :wink:

dekimir on 14 Aug 2019

🎉3 ❤1

All 34 comments

We are highly interested in search too and are sure we'll get there.
However, the current priority is LWT, CDC and UDF.
As a near term solution, CDC can help to clone the data stream to Elastic
cluster. Sorry I don't have a better answer but
I rather not over promise.

On Sun, May 5, 2019 at 9:42 AM Ultrabug notifications@github.com wrote:

Hi,

This is as requested a copy of a feature requested on the dev ML
https://groups.google.com/forum/#!topic/scylladb-dev/r9I5yVqjcCQ

Copy of the original text:

I would like to ask for Scylla to implement LIKE operations on:

single partitions

range scans

The rationale is that since 3.0 we have optimizations on full table scans
and server side filtering so it doesn't look too hard to allow this kind of
operation before returning data.
That would also save network traffic in most cases.

The use cases are numerous.
At Numberly we already use full table scans on relatively small datasets
that we have to filter on the backend code before returning it to the
application. Now that our adoption of Scylla has grown, web developers do
need this kind of feature to optimize some "search box" queries on
acceptable datasets volumes (often pre-aggregated data which "only" need
filtering.

This would:

permit to save developers from implementing this filtering on their
backend code as well as Ops to avoid having to setup and operate an Elastic
Search stack to handle fuzzy search when the use case fits in a simple LIKE
operation.

be a great feature addition for developers and does fit very nicely
into the "OLAP + OLTP why not both" direction IMHO.

remove a barrier for devs and reduces the "there's so many
constraints in Scylla operations" that we can hear.

Can we make it happen? I'll be glad to help test this of course.

Thanks!

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/scylladb/scylla/issues/4477, or mute the thread
https://github.com/notifications/unsubscribe-auth/AANHURPBEPDP7YHMJIFUZN3PT4E5TANCNFSM4HK3U4DQ
.

dorlaor on 6 May 2019

@dorlaor LIKE isn't search, it's just an additional operator. Can be supported in 1-2 patches.

avikivity on 6 May 2019

Is the idea to implement SASI indexing within Scylla? If not, what are the performance expectations, and how do we handle LIKE limitations derived from SASI options (eg, case sensitivity, CONTAINS vs PREFIX, etc.)?

dekimir on 19 May 2019

I don't think this is the direction avi was looking into was adding SASI.

We are looking to extend ALLOW FILTERING - and allow the user to specify a condition
where column_a like "%BLUE%"

We are looking to add the capability and do filtering on the coordinator side after collecting results (not to use a specialized index) - at least at this point.

slivne on 19 May 2019

No SASI, just regular filtering.

avikivity on 19 May 2019

Got it, thanks.

What do you want to do about case sensitivity?

And about escape character?

dekimir on 19 May 2019

Let's use the standard SQL pratice, as long as it doesn't conflict with an existing choice by Cassandra.

For case insensitivity, AFAICT the standard solution is LOWER(foo) LIKE pattern, so we need to recognize this construct (I did not perform an exhaustive search; there is postgres ILIKE and MS-SQL COLLATE and other extensions, but LOWER() appears the most standard).

avikivity on 19 May 2019

I suggest to start without support for case insensitivity (as it greatly expands the scope, need to add a function and recognize more complex query patterns).

avikivity on 19 May 2019

Here are some interesting differences between CQL and SQL LIKE:

Feature|CQL|SQL
---|---|---
LIKE requires index|yes|no
NOT LIKE|unsupported|supported
_ wildcard|no|yes
% wildcard|only start/end of pattern (configured in SASI)|anywhere in pattern
escaping wildcards|no|yes
custom escape character|n/a|yes
case sensitivity|configured in SASI|dictated by collation
empty pattern (foo LIKE '')|forbidden|allowed

If we follow all the SQL rules, we will clearly return different results than Cassandra for some queries. And if we want to be fully compatible with Cassandra, we'll (eventually) need some way to specify case sensitivity and whether % at the end of pattern is a wildcard or not.

dekimir on 19 May 2019

❤1

I think we can go for a superset:

allow SQL patterns
not require index
allow escapes
SQL style case sensitivity
CQL style case insensitivity if defined by a SASI index

However, if the user requests SASI indexing, we can build a local (but non-SASI) index with keys including both the text and the reversed text. Such indexes can be efficiently queried for prefix and suffix patterns. If the user created such an index and specified case insensitivity, we can switch to the CQL interpretation.

I suggest to avoid indexing for now and start with just filtering. It should be simpler and is what's described by the issue. We can add indexing later, esp. when support for computed columns lands (to reverse the indexed fields, and to lowercase them if a case insensitive index is requested).

avikivity on 19 May 2019

❤1

@ultrabug does this make sense to you from a user's perspective?

avikivity on 19 May 2019

I think we can go for a superset:

Even if that makes our results different from Cassandra's?

* SQL style case sensitivity

This will require tons of work, to introduce collations and propagate them through expressions. Can we pick something simpler?

* CQL style case insensitivity if defined by a SASI index

What about the CONTAINS option? (Ie, whether % at the end of a pattern is a wildcard or literal.) That's also configured in SASI creation. We could simply turn it on always, but again, we won't be fully compatible with Cassandra then.

dekimir on 19 May 2019

I think we can go for a superset:

Even if that makes our results different from Cassandra's?

I don't think it will. With a defined SASI index we'll return Cassandra results, in all other cases we'll return SQL results.

* SQL style case sensitivity
This will require tons of work, to introduce collations and propagate them through expressions. Can we pick something simpler?

Let's just start without it.

* CQL style case insensitivity if defined by a SASI index
What about the CONTAINS option? (Ie, whether % at the end of a pattern is a wildcard or literal.) That's also configured in SASI creation. We could simply turn it on always, but again, we won't be fully compatible with Cassandra then.

If a SASI index is NOT specified, then we can do what we like without breaking Cassandra compatibility, since no application written for Cassandra will issue such a query. So we can do what we like (or LIKE).

avikivity on 19 May 2019

Avi Kivity notifications@github.com [19/05/20 10:50]:
> Let's use the standard SQL pratice, as long as it doesn't conflict with an existing choice by Cassandra.
>
> For case insensitivity, AFAICT the standard solution is LOWER(foo) LIKE pattern, so we need to recognize this construct (I did not perform an exhaustive search; there is postgres ILIKE and MS-SQL COLLATE and other extensions, but LOWER() appears the most standard).

ANSI way is COLLATE clause.

foo COLLATE unicode_case_insensitive LIKE pattern

--
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32

kostja on 20 May 2019

Absolutely @avi, that makes sense and sounds perfect for a good startOn May 19, 2019 20:19, Avi Kivity notifications@github.com wrote:@ultrabug does this make sense to you from a user's perspective?

—You are receiving this because you were mentioned.Reply to this email directly, view it on GitHub, or mute the thread.

ultrabug on 22 May 2019

I would like to suggest - if filtering is what we are going for - it might be worthwhile to use UDF filters.
Initially the filters are going to be only defined by us which is like(column_val,expression) that returns boolean. Then, in the future we can extend it to any filter we want.
We can also enjoy it as it will lay the land for UDF in where clause.

eliransin on 2 Jun 2019

I do not think UDF is a good choice unless its an underline hidden implementation detail.
Other databases (MySQL, Apache Cassandra with SASI) have LIKE operation without a need for UDF.
UDF in WHERE is useful, but it's better to keep simple syntax for the common use cases.

tzach on 2 Jun 2019

But keep in mind that the (initial) semantics we discussed above is
different from both SQL and Cassandra. Using UDF might be better than using
the same syntax for different semantics.

dekimir on 2 Jun 2019

Aren't the initial semantics a subset of SQL?

In any case, requiring UDF to implement a standard operation is a terrible choice IMO.

avikivity on 2 Jun 2019

Here is Apache Cassandra example (after a SASI index created)

SELECT first_name, last_name, age, height, created_at FROM sasi
        ... WHERE first_name LIKE 'M%';

https://github.com/apache/cassandra/blob/trunk/doc/SASI.md#equality--prefix-queries

tzach on 2 Jun 2019

@tzach why are you quoting that example?

avikivity on 2 Jun 2019

Aren't the initial semantics a subset of SQL?

Yes, a subset that will never grow to parity, IME.

In any case, requiring UDF to implement a standard operation is a terrible

choice IMO.

Agreed, but in my mind that's not what Eliran proposed. Rather, he proposed
using UDF for a non-standard operation (from the Cassandra user's
standpoint).

>
>

dekimir on 2 Jun 2019

@tzach why are you quoting that example?

Example of how Apache Cassandra added LIKE to CQL without UDF.

tzach on 2 Jun 2019

@tzach ... and?

avikivity on 2 Jun 2019

Aren't the initial semantics a subset of SQL?
Yes, a subset that will never grow to parity, IME.

For sure not at once, but I don't see a real barrier.

In any case, requiring UDF to implement a standard operation is a terrible
choice IMO.
Agreed, but in my mind that's not what Eliran proposed. Rather, he proposed using UDF for a non-standard operation (from the Cassandra user's standpoint).

CQL is following SQL, so it's better to extend CQL in the direction of SQL rather than in non-standard ways.

UDFs are bad for LIKE in many ways:

not standard SQL
not standard CQL
require effort from each user to implement
likely much slower

…

avikivity on 2 Jun 2019

I agree that it can be much slower, on the other hand are we really bound by cpu in filtering operations? (we might need to check that since in the summit we ruled out JIT for queries as an immediate optimization because we don't seem to be bound by cpu for most queries and specifically for scan queries). But except for this the only thing I actually suggested is to filter using a Lua function. I didn't suggest any change to the syntax. For this suggestion I only pointed out that the benefit will be that we will have to write the underlying mechanism for UDF in where clause (not the actual syntax support). I didn't suggest that the user will be aware of the use of UDF but only to use the mechanism.

eliransin on 3 Jun 2019

I don't think that using Lua will be simpler to code than using std::regex.

avikivity on 3 Jun 2019

What about collections? Should we implement queries like

a set element is LIKE "abc"
a map entry is LIKE "abc"
any map value is LIKE "abc"

dekimir on 26 Jun 2019

Lets start with a first version without that

slivne on 26 Jun 2019

I think that for collections we'll need separate syntax:

WHERE ANY VALUE IN my_set LIKE pattern
    AND ANY KEY IN my_map LIKE pattern
    AND ANY VALUE IN my_map LIKE pattern

ANY can be used with other predicates:

WHERE ANY VALUE IN my_map = 17

We need to bikeshed the syntax (this one allows us to make some ANY KEY jokes), but I think it's outside the scope of LIKE and in the scope of equivalents to std::any_of() and std::all_of().

avikivity on 26 Jun 2019

Hi, as milestones have been added now - could someone please confirm LIKE operations would be available as _plain_ filters (how I understand the original feature request) - or is it going to require some sort of predefined index?

I do have very similar requirements as stated above - where I have reasonable data sets, no advanced search requirements ('full text', fuzzy, stemming, stopwords, normalizer, ...) - so filtering server side would be a great fit.

hartmut-co-uk on 14 Aug 2019

Hi, as milestones have been added now - could someone please confirm LIKE operations would be available as _plain_ filters (how I understand the original feature request) - or is it going to require some sort of predefined index?

LIKE will be available as a plain filter, no indexing necessary. Enjoy! :wink:

dekimir on 14 Aug 2019

🎉3 ❤1

Thanks @dekimir !

I'm eager to test this #4855 ;)

ultrabug on 19 Aug 2019

Implementation finished in #4873, closing. I created two more issues for follow-up work: #4910 and #4911.

Please take note of current differences between Cassandra and Scylla LIKE:

In Cassandra, you must create a SASI index to use LIKE. Scylla supports LIKE as a regular filter.
Consequently, Scylla LIKE will often be less performant than Cassandra LIKE.
Scylla treats underscore (_) as a wildcard; Cassandra doesn't.
Scylla treats percent (%) as a wildcard anywhere in the pattern; Cassandra only at the beginning/end.
Scylla interprets backslash (\) as an escape character; Cassandra doesn't.
Cassandra allows case-insensitive LIKE; Scylla doesn't yet (but see #4911).
Scylla allows empty LIKE pattern; Cassandra doesn't.