Hi,
This is as requested a copy of a feature requested on the dev ML
Copy of the original text:
I would like to ask for Scylla to implement LIKE operations on:
The rationale is that since 3.0 we have optimizations on full table scans and server side filtering so it doesn't look too hard to allow this kind of operation before returning data.
That would also save network traffic in most cases.
The use cases are numerous.
At Numberly we already use full table scans on relatively small datasets that we have to filter on the backend code before returning it to the application. Now that our adoption of Scylla has grown, web developers do need this kind of feature to optimize some "search box" queries on acceptable datasets volumes (often pre-aggregated data which "only" need filtering.
This would:
Can we make it happen? I'll be glad to help test this of course.
Thanks!
We are highly interested in search too and are sure we'll get there.
However, the current priority is LWT, CDC and UDF.
As a near term solution, CDC can help to clone the data stream to Elastic
cluster. Sorry I don't have a better answer but
I rather not over promise.
On Sun, May 5, 2019 at 9:42 AM Ultrabug notifications@github.com wrote:
Hi,
This is as requested a copy of a feature requested on the dev ML
https://groups.google.com/forum/#!topic/scylladb-dev/r9I5yVqjcCQCopy of the original text:
I would like to ask for Scylla to implement LIKE operations on:
- single partitions
- range scans
The rationale is that since 3.0 we have optimizations on full table scans
and server side filtering so it doesn't look too hard to allow this kind of
operation before returning data.
That would also save network traffic in most cases.The use cases are numerous.
At Numberly we already use full table scans on relatively small datasets
that we have to filter on the backend code before returning it to the
application. Now that our adoption of Scylla has grown, web developers do
need this kind of feature to optimize some "search box" queries on
acceptable datasets volumes (often pre-aggregated data which "only" need
filtering.This would:
- permit to save developers from implementing this filtering on their
backend code as well as Ops to avoid having to setup and operate an Elastic
Search stack to handle fuzzy search when the use case fits in a simple LIKE
operation.- be a great feature addition for developers and does fit very nicely
into the "OLAP + OLTP why not both" direction IMHO.- remove a barrier for devs and reduces the "there's so many
constraints in Scylla operations" that we can hear.Can we make it happen? I'll be glad to help test this of course.
Thanks!
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/scylladb/scylla/issues/4477, or mute the thread
https://github.com/notifications/unsubscribe-auth/AANHURPBEPDP7YHMJIFUZN3PT4E5TANCNFSM4HK3U4DQ
.
@dorlaor LIKE isn't search, it's just an additional operator. Can be supported in 1-2 patches.
Is the idea to implement SASI indexing within Scylla? If not, what are the performance expectations, and how do we handle LIKE limitations derived from SASI options (eg, case sensitivity, CONTAINS vs PREFIX, etc.)?
I don't think this is the direction avi was looking into was adding SASI.
We are looking to extend ALLOW FILTERING - and allow the user to specify a condition
where column_a like "%BLUE%"
similar to https://www.w3schools.com/sql/sql_like.asp
We are looking to add the capability and do filtering on the coordinator side after collecting results (not to use a specialized index) - at least at this point.
No SASI, just regular filtering.
Got it, thanks.
What do you want to do about case sensitivity?
And about escape character?
Let's use the standard SQL pratice, as long as it doesn't conflict with an existing choice by Cassandra.
For case insensitivity, AFAICT the standard solution is LOWER(foo) LIKE pattern, so we need to recognize this construct (I did not perform an exhaustive search; there is postgres ILIKE and MS-SQL COLLATE and other extensions, but LOWER() appears the most standard).
I suggest to start without support for case insensitivity (as it greatly expands the scope, need to add a function and recognize more complex query patterns).
Here are some interesting differences between CQL and SQL LIKE:
Feature|CQL|SQL
---|---|---
LIKE requires index|yes|no
NOT LIKE|unsupported|supported
_ wildcard|no|yes
% wildcard|only start/end of pattern (configured in SASI)|anywhere in pattern
escaping wildcards|no|yes
custom escape character|n/a|yes
case sensitivity|configured in SASI|dictated by collation
empty pattern (foo LIKE '')|forbidden|allowed
If we follow all the SQL rules, we will clearly return different results than Cassandra for some queries. And if we want to be fully compatible with Cassandra, we'll (eventually) need some way to specify case sensitivity and whether % at the end of pattern is a wildcard or not.
I think we can go for a superset:
However, if the user requests SASI indexing, we can build a local (but non-SASI) index with keys including both the text and the reversed text. Such indexes can be efficiently queried for prefix and suffix patterns. If the user created such an index and specified case insensitivity, we can switch to the CQL interpretation.
I suggest to avoid indexing for now and start with just filtering. It should be simpler and is what's described by the issue. We can add indexing later, esp. when support for computed columns lands (to reverse the indexed fields, and to lowercase them if a case insensitive index is requested).
@ultrabug does this make sense to you from a user's perspective?
I think we can go for a superset:
Even if that makes our results different from Cassandra's?
* SQL style case sensitivity
This will require tons of work, to introduce collations and propagate them through expressions. Can we pick something simpler?
* CQL style case insensitivity if defined by a SASI index
What about the CONTAINS option? (Ie, whether % at the end of a pattern is a wildcard or literal.) That's also configured in SASI creation. We could simply turn it on always, but again, we won't be fully compatible with Cassandra then.
I think we can go for a superset:
Even if that makes our results different from Cassandra's?
I don't think it will. With a defined SASI index we'll return Cassandra results, in all other cases we'll return SQL results.
* SQL style case sensitivityThis will require tons of work, to introduce collations and propagate them through expressions. Can we pick something simpler?
Let's just start without it.
* CQL style case insensitivity if defined by a SASI indexWhat about the CONTAINS option? (Ie, whether
%at the end of a pattern is a wildcard or literal.) That's also configured in SASI creation. We could simply turn it on always, but again, we won't be fully compatible with Cassandra then.
If a SASI index is NOT specified, then we can do what we like without breaking Cassandra compatibility, since no application written for Cassandra will issue such a query. So we can do what we like (or LIKE).
LOWER(foo) LIKE pattern, so we need to recognize this construct (I did not perform an exhaustive search; there is postgres ILIKE and MS-SQL COLLATE and other extensions, but LOWER() appears the most standard).ANSI way is COLLATE clause.
foo COLLATE unicode_case_insensitive LIKE pattern
--
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
Absolutely @avi, that makes sense and sounds perfect for a good startOn May 19, 2019 20:19, Avi Kivity notifications@github.com wrote:@ultrabug does this make sense to you from a user's perspective?
—You are receiving this because you were mentioned.Reply to this email directly, view it on GitHub, or mute the thread.
I would like to suggest - if filtering is what we are going for - it might be worthwhile to use UDF filters.
Initially the filters are going to be only defined by us which is like(column_val,expression) that returns boolean. Then, in the future we can extend it to any filter we want.
We can also enjoy it as it will lay the land for UDF in where clause.
I do not think UDF is a good choice unless its an underline hidden implementation detail.
Other databases (MySQL, Apache Cassandra with SASI) have LIKE operation without a need for UDF.
UDF in WHERE is useful, but it's better to keep simple syntax for the common use cases.
But keep in mind that the (initial) semantics we discussed above is
different from both SQL and Cassandra. Using UDF might be better than using
the same syntax for different semantics.
>
Aren't the initial semantics a subset of SQL?
In any case, requiring UDF to implement a standard operation is a terrible choice IMO.
Here is Apache Cassandra example (after a SASI index created)
SELECT first_name, last_name, age, height, created_at FROM sasi
... WHERE first_name LIKE 'M%';
https://github.com/apache/cassandra/blob/trunk/doc/SASI.md#equality--prefix-queries
@tzach why are you quoting that example?
Aren't the initial semantics a subset of SQL?
Yes, a subset that will never grow to parity, IME.
In any case, requiring UDF to implement a standard operation is a terrible
choice IMO.
Agreed, but in my mind that's not what Eliran proposed. Rather, he proposed
using UDF for a non-standard operation (from the Cassandra user's
standpoint).
>
>
@tzach why are you quoting that example?
Example of how Apache Cassandra added LIKE to CQL without UDF.
@tzach ... and?
Aren't the initial semantics a subset of SQL?
Yes, a subset that will never grow to parity, IME.
For sure not at once, but I don't see a real barrier.
In any case, requiring UDF to implement a standard operation is a terrible
choice IMO.
Agreed, but in my mind that's not what Eliran proposed. Rather, he proposed using UDF for a non-standard operation (from the Cassandra user's standpoint).
CQL is following SQL, so it's better to extend CQL in the direction of SQL rather than in non-standard ways.
UDFs are bad for LIKE in many ways:
I agree that it can be much slower, on the other hand are we really bound by cpu in filtering operations? (we might need to check that since in the summit we ruled out JIT for queries as an immediate optimization because we don't seem to be bound by cpu for most queries and specifically for scan queries). But except for this the only thing I actually suggested is to filter using a Lua function. I didn't suggest any change to the syntax. For this suggestion I only pointed out that the benefit will be that we will have to write the underlying mechanism for UDF in where clause (not the actual syntax support). I didn't suggest that the user will be aware of the use of UDF but only to use the mechanism.
I don't think that using Lua will be simpler to code than using std::regex.
What about collections? Should we implement queries like
Lets start with a first version without that
I think that for collections we'll need separate syntax:
WHERE ANY VALUE IN my_set LIKE pattern
AND ANY KEY IN my_map LIKE pattern
AND ANY VALUE IN my_map LIKE pattern
ANY can be used with other predicates:
WHERE ANY VALUE IN my_map = 17
We need to bikeshed the syntax (this one allows us to make some ANY KEY jokes), but I think it's outside the scope of LIKE and in the scope of equivalents to std::any_of() and std::all_of().
Hi, as milestones have been added now - could someone please confirm LIKE operations would be available as _plain_ filters (how I understand the original feature request) - or is it going to require some sort of predefined index?
I do have very similar requirements as stated above - where I have reasonable data sets, no advanced search requirements ('full text', fuzzy, stemming, stopwords, normalizer, ...) - so filtering server side would be a great fit.
Hi, as milestones have been added now - could someone please confirm
LIKEoperations would be available as _plain_ filters (how I understand the original feature request) - or is it going to require some sort of predefined index?
LIKE will be available as a plain filter, no indexing necessary. Enjoy! :wink:
Thanks @dekimir !
I'm eager to test this #4855 ;)
Implementation finished in #4873, closing. I created two more issues for follow-up work: #4910 and #4911.
Please take note of current differences between Cassandra and Scylla LIKE:
_) as a wildcard; Cassandra doesn't.%) as a wildcard anywhere in the pattern; Cassandra only at the beginning/end.\) as an escape character; Cassandra doesn't.
Most helpful comment
LIKE will be available as a plain filter, no indexing necessary. Enjoy! :wink: