Efcore: Query: Improve translation of String's StartsWith, EndsWith and Contains

Created on 31 Jul 2014  路  34Comments  路  Source: dotnet/efcore

PROVIDERS BEWARE:

Linq translation for methods Contains, EndsWith and StartsWith that we have in the Relational package uses LIKE operator, which may return incorrect results if the value parameter (what we are searching for) contains wildcard characters, e.g. '%' or '_'.

This issue addresses SqlServer and Sqlite providers, but all other providers will still use the old translation. Each provider that can be affected by this should implement their own MethodCallTranslators for Contains, EndsWith and StartsWith.

Currently in EF7 a LINQ query like this:

``` C#
var things = Things.Where(t => t.Name.StartsWith("a"));


Gets translated to SQL like this (note that I am simplifying the query and expanding parameter values for clarity):

``` SQL
SELECT * FROM Things WHERE Name LIKE 'a%' ;

However, in order to return correct results, a LINQ query like this:

``` C#
var underscoreAThings = Things.Where(t => t.Name.StartsWith("_a"));


Should be translated to SQL like this:

``` SQL
SELECT * FROM Things WHERE Name LIKE '~_a%' ESCAPE '~';

The escaping accounts for SQL wildcard characters in the input string which should not be treated as wildcards (we can add a separate Like() method for passing patterns, but that belongs in a separate work item).

When the input string is store correlated (e.g. is another column in the database instead of parameter or a literal in the query) using LIKE in the translation correctly becomes more difficult, e.g. it would be hard to perform the required escaping in SQL.

In general for cases in which LIKE doesn't work well we can fall back to alternative translations that don't rely on LIKE, e.g. for String.StartsWith():

``` C#
var underscoreAThings = Things.Where(t => t.Name.StartsWith(t.Prefix));


``` SQL
SELECT * FROM Things WHERE CHARINDEX(Prefix, Name) = 1 OR Prefix='';

Note that CHARINDEX() won't match an empty string but String.StartsWith("") always return true, that's why we add the Prefix ='' condition.

The main disadvantage of this translation is that it is not sargable. That can be addressed with a hybrid translation, e.g.:

SELECT * FROM Things WHERE Name LIKE Prefix+'%' AND (CHARINDEX(Prefix, Name) = 1 OR Prefix = '');

This should be quick to evaluate using an index because the LIKE condition should be able to take advantage of the index to produce fairly selective results and the second condition will filter out false positives returned by LIKE.

Notice that this alternative removes the need to fiddle with the input value: we no longer need to escape wildcards because in the worse case they will produce false positive matches which the CHARINDEX() based condition will still be able to filter out.

Also notice that based on the current query caching design we wouldn't need to always produce this more complex translation. Instead, we could sniff into the argument of String.StartsWith() and pivot on it to produce different translations, e.g.:

  1. If the value is opaque (i.e. it comes from the store) or if it contains a wildcard character, then produce the condition based on CHARINDEX()
  2. If the value does not contain a wildcard character in the first position then we can emit the condition based on LIKE

Similar approaches can be used for String.EndsWith() and String.Contains(). However for these methods LIKE does not really contribute to the performance since the beginning of the input value cannot be used to perform index lookups, so it should be ok to produce a translation that doesn't use LIKE at all.

closed-fixed providers-beware type-bug

All 34 comments

@anpete here is the bug you asked me to file.

@divega Thanks, keep em coming.

Fixed in 5e1c562ae40c16ef49b90731fd015eccce0c1d44

Is there a way to disable the adding of the call to CHARINDEX()? I'm seeing slow query performance with that against SQL Server and a table containing 7 millions rows with an index. It's taking 11 seconds to run with the call to CHARINDEX() and only one second without it. It seems like maybe the extra function call is confusing the query optimizer.

@jemiller0 can you open a new issue since this issue is closed (and what you're describing looks like something we should dig into).

I know this has been closed for a while, am implementing the same for Npgsql and one point is unclear...

In the original description above, @divega mentions it's possible to "pivot" on the argument of String.StartsWith() and produce different translations based on whether it's opaque (comes from the store) or not. However, looking at the actual code in SqlServerStartsWithOptimizedTranslator, it seems that the complex translation (i.e. the one involving both LIKE and CHARINDEX) is used in all case (except where the pattern is a literal empty string). Did I misunderstand the comments or the code?

But wouldn't it be possible to avoid this entire complexity by translating String.StartsWith()'s pattern as REPLACE(REPLACE(REPLACE(<pattern>, '\', '\\'), '_', '\_'), '%', '\%') + '%'? It simply escapes backslash (the escape character), escapes % and _, and tacks on a % at the end. It definitely ain't very pretty, but it would work regardless of the expression type (constant, parameter, opaque...), and probably adds very little overhead (unless the pattern itself is huge).

Note that in PostgreSQL the three REPLACEs can be replaced by REGEXP_REPLACE(<pattern>, '([\\_%])', '\\\1', 'g'), which does the same with a single regular expression.

What am I missing?

Personally, I think I like the idea of escaping the string like you did above better. However, I did a test on SQL Server using the ESCAPE clause of LIKE and it was really slow. I have no idea why. Maybe it's just a particular query/table that I have that is like that. I'm in favor of only adding the ESCAPE clause or call to CHAR_INDEX() if special characters are present. I would rather keep the SQL as simple as possible because every little bit of complexity always seems to cause problems and slow things down. Also, couldn't you translate the string in .NET rather than in SQL?

@jemiller0 if the pattern happens to be a ConstantExpression, it's definitely possible to do the escaping in C# rather than SQL, but if it's a parameter or some DB column that's not possible. Note that the same is true for checking whether special characters are present - this is only possible with constant patterns. But for that case it definitely makes sense to me to escape things client-side (if needed) and keep things simple and readable.

Note that no ESCAPE clause seems necessary - at least in PostgreSQL the default escape character is backslash, which is as good as any other... Maybe the situation is different with SqlServer.

But to the most important bit - I can't imagine why adding an ESCAPE clause to LIKE (or even adding several REPLACEs around the pattern) would make anything slower... At least for most cases I'd expect the pattern expression to be fully evaluated, and only then the LIKE would get treated. I may try to experiment and measure performance in PostgreSQL, we should probably understand the SqlServer case better...

(The obvious exception to the above would be taking the pattern from a column in the same table, e.g. SELECT a FROM foo WHERE a LIKE b, but I can't imagine indices being usable for these cases anyway).

Note: in PostgreSQL with the default index type (B-tree), it seems that LIKE (and regex matches) can be index-optimized only when the pattern is a literal. So I'm assuming doing REPLACE() (or REGEXP_REPLACE()) will not be good for performance. This is a good reason to do escaping in C# if the pattern is constant.

@roji

In the original description above, @divega mentions it's possible to "pivot" on the argument of String.StartsWith() and produce different translations based on whether it's opaque (comes from the store) or not...

We ended up not implementing that in EF Core. Although it was a possibility and not different from what we do in EF6, there are disadvantages, e.g. we would need to special case it in query caching. Sorry if that caused any confusion. We ended up implementing what I called the "hybrid translation", e.g.:

SELECT * FROM Things WHERE Name LIKE Prefix+'%' AND (CHARINDEX(Prefix, Name) = 1 OR Prefix = '');

I am curious if this translation is sargable in PostgreSQL and if it isn't, why.

FWIW, I don't think there is a default escape character in SQL Server.

Note: in PostgreSQL with the default index type (B-tree), it seems that LIKE (and regex matches) can be index-optimized only when the pattern is a

@roji does that mean that it won't use the index even for the simple LIKE prefix + '%' pattern? :hushed:

I am thinking more and more that we should implement proper LIKE support through a method as you described at #2850 rather than trying to optimize the translation of String.StartsWith() for every possible case.

@divega, could you elaborate a little bit on the sequence wrt query caching? What's especially confusing is that SqlServer's current StartsWith() translator already does pivot on the expression type (i.e. constant expression or not). I don't know anything about query caching (and would appreciate some info), but if the translator can produce different SQL based on the expression type, why not also perform client-side escaping when the expression is constant?

Regarding sargability, I'm pretty sure CHARINDEX (or its PostgreSQL analogue STRPOS) isn't sargable - here are the docs which were the basis of what I posted above, with the sentence:

The optimizer can also use a B-tree index for queries involving the pattern matching operators LIKE and ~ if the pattern is a constant and is anchored to the beginning of the string

I haven't done any tests, but a strict reading would mean that even a concatenation prevents the index from being used. If client-side escaping is used for constant patterns (i.e. if there's no query caching issue), then that's not a problem: we can generate a simple LIKE 'prefix%'. For non-constant patterns I'll have to do some testing... It sure would be nice to benefit from indexing via like you guys are doing (via LIKE + STRPOS).

Regarding general LIKE support (#2850) vs StartsWith(), it seems to me that the two aren't necessarily related (in the sense that both have to be properly implemented, even if some of the problems are going to be similar). Specifically, with general LIKE index use probably isn't going to be a topic of discussion simply because in the general case indices simply aren't used - only in the StartsWith() case...

@roji specifically we would need to "sniff" the value of the argument to StartWith() in SQL generation and caching if we tried to come up with a solution that produces different SQL depending on input data. It would be similar to how today we generate different SQL depending on whether a parameter value is null and then we need to take into account whether each parameter is null to find the right entry on the SQL cache.

In other words, generating different SQL depending on the expression type is business as usual, because we routinely compare the expression tree to match cache entries, but sniffing the values is something that we have currently only do for null vs. not-null and that we are a little averse to extend to other cases :smile:

Thinking about it again, I think this disadvantage is only relevant if we tried to improve the query by avoiding the generation of any escaping syntax if we observe that the argument to match does not contain any wildcard characters. I had in mind when I wrote that sentence in the description of this issue and it was discussed again as a way to generate optimal queries for the scenario in #7429.

@divega, I think I understand, but aren't you already generating different SQL for StartsWith() based on the value of the constant expression? If it's an empty string you seem to be generating the true constant, otherwise you generate a LIKE AND CHARINDEX expression (I hope I'm understanding things correctly).

The proposal here wouldn't be very far from what's already in SqlServer's current translator. First, the translator would check whether the expression is constant or not - as is already done. If it's constant, rather than generating LIKE AND CHARINDEX (or a true constant), the translator would perform escaping in C# and generate a simple LIKE with ESCAPE (the performance implications still need to be studied, but at least in principle). If it's not constant, there doesn't seem to be any choice but to fall back to the current behavior (i.e. generate LIKE AND CHARINDEX).

Would the above be doing something problematic in terms of SQL/query caching that isn't already done now?

@roji I think you are right regarding actual constants. And it should be ok to take their value into account to generate different SQL as long as we also consider their value when doing query cache lookups. What I had in mind is parameters, i.e. captured variables.

@roji and yes, arguably we could more than just skipping the comparison with empty string when we are processing StartsWith() for a constant.

cc @maumar @anpete

We could have sniffing code for constant inputs to StartsWith but I would be very strongly against doing it for parameters also (bloating query cache and the overhead cost we would have to pay when comparing parameter values for EVERY query seems way to much of for the potential benefit). On the other hand I think direct Like function implementation could be a "good enough" solution.

@maumar Well, I wasn't advocating for sniffing parameter values, but since you mentioned it :smile:: I assume you mean that things are not structured in a way that we can easily sniff parameters values for StartsWith() calls alone and record whether they contains wildcard characters, and so we we would end up having to do it for all parameters on every query, correct?

Re sniffing constants, as @roji mentioned, we are already doing it at https://github.com/aspnet/EntityFramework/blob/dev/src/Microsoft.EntityFrameworkCore.SqlServer/Query/ExpressionTranslators/Internal/SqlServerStartsWithOptimizedTranslator.cs#L39. I think we could consider removing the unnecessary CHARINDEX() condition if there are no wildcard in the constant.

I do see that there is a trade off between query compilation and SQL performance that we don't understand very well. And I suspect the real world scenario here is parameters, despite the fact that all the sample queries @jemiller0 provided in #7429 use constants.

@divega yes, just how now we are looking at nullabilities of all parameters and not only the ones involved in null semantics-relevant operations.

I have a newbie question. I'm not even sure how you would use a parameter with EF. I'm writing LINQ queries. As far as I knew, EF always just converted the LINQ queries to SQL queries with constants. However, the queries in my examples were generated using Dynamic LINQ. So, maybe that has something to do with it. I'm using Telerik's RadGrid UI component for ASP.NET Web Forms which generates the WHERE and ORDER BY clauses based on user input. Maybe if I weren't using that, the queries would be using parameters by default. I did notice that calls to .Skip() and .Take() were using parameters.

@jemiller0 in general you can force parameters into your queries by using variables in your query, e.g.

var id = 7;
var query1 = ctx.Customers.Where(c => c.Id == id); // will produce query with parameter
var query2 = ctx.Customers.Where(c => c.Id == 7); // will produce query with constant

@maumar Thanks. I just tested it and see what you mean. I didn't realize that's how it worked. Does EF Core use server-side prepared statements for these? As far as I know EF 6 didn't, at least for INSERT/UPDATE statements and I always wished it did because I think it would have improved performance.

What @maumar said. The general rule is that variables that get captured in lambda expressions are translated as parameters. Skip() and Take() are an exception because their arguments are not lambda expressions but they get parameterized anyway.

Thanks for the data point that Telerik's component generate arguments as constants.

EF6 uses a similar set of rules.

I had to do create a function to clean up the Dynamic LINQ strings that Telerik produces. By default they appeared to be more geared towards LINQ to Objects. I had to clean them up a bunch so that they would get translated into something more reasonable that would work better for use with EF. Since it's using Dynamic LINQ, I'm not sure how I would tell it to use parameters if I wanted to. Note, I'm using the grid's "advanced binding" method. I used to use EntityDataSource, but, found that the advanced method was more flexible and also EntityDataSource isn't supported with EF Core as far as I know. I much prefer using LINQ. The only thing I'm leery of is using Dynamic LINQ which is barely supported by Microsoft. It was in some example code from Microsoft, but, someone made a NuGet package out of it which is what I'm using. As far as I know Entity SQL isn't supported in EF Core. It would be great if Dynamic LINQ was something that was officially supported.

@jemiller0 AFAIK EF Core currently doesn't ever prepare statements (this is tracked by #5459). Note, however, that prepared statements don't really do much in SqlServer which transparently caches query plans server-side. For PostgreSQL, where preparing is vital for performance, Npgsql 3.2.0 (just released) introduces an automatic preparation feature at the driver level, where frequently-used queries will be implicitly prepared without EF Core needing to do anything about it.

Thanks for the valuable discussion. To summarize, at least in Npgsql I'm going to have the StartsWith() translator:

  1. Check whether the pattern is constant or not.
  2. If constant, escape everything client-side in C# and send a simple LIKE (with backslash being the PostgreSQL default escape character)
  3. Otherwise (parameters, store values), send LIKE AND STRPOS (PG's CHARINDEX equivalent) just like you guys are doing today.

As a very minor implementation note, wouldn't it be slightly better to to replace CHARINDEX with LEFT(LEN(<pattern>)), similar to how EndsWith() is currently implemented? This would avoid going through the entire string, searching for the pattern.

@roji I think you found the solution to the problem I've been having with .StartsWith() on SQL Server. I just tried using LEFT(LEN()) like you mentioned and it is FASTER that doing a LIKE alone. 0 seconds versus 2 seconds on the problem query that I had. Awesome!

Also, thanks for he info on prepared statements. I think they would still be good to have for INSERT and UPDATE statements.

BTW, I started playing around with PostgreSQL recently and was impressed by what a small memory footprint it has, at least in the default configuration. Also, of the DBMSs that I've tested, I have found it to have the fastest INSERT speed. It is as if it streams the data directly to disk. The main issue that I have with it is that as far as I know, there's no way to configure it for case insensitivity like you can with SQL Server, at least on Windows. I guess it works on other platforms, but, I haven't had a chance to try it yet.

At any rate, unless I'm missing something, it looks like if the SQL Server provider could use the LEFT(LEN()) method that you mentioned, it looks like it could indeed speed things up. I haven't had to test it much, but, right off the bat, it looks like it cured the problem that I had with my problem query. Maybe it allows the backend to use a hash table lookup or something.

@jemiller0 uh, happy to have helped although I didn't really mean to :) Please note that my proposal was to use LIKE AND LEFT(LEN()) and not LEFT(LEN()) alone, simply because I assumed that LEFT(LEN()) wouldn't be index-optimized. This retains the original logic of using LIKE first for speed, then filtering out false positives with something (CHARINDEX or LEFT(LEN())). So I'm not clear if you're still doing LIKE AND LEFT(LEN()) or have switched to LEFT(LEN()) on its own. I'll let the EF Core team comment further on the usefulness of LEFT(LEN()) for SqlServer.

Regarding prepared statements, I don't think there's any relevance to the statement type (SELECT, INSERT, UPDATE...). All of them greatly benefit from preparation in PostgreSQL, whereas in SqlServer I'm assuming all statement types are implicitly cached without the need for explicit preparation (here's a doc page on this). It may be worth testing to confirm actual SqlServer behavior. Regardless, it seems like a good to implement preparation in EF Core simply to have all providers benefit from it, and EF Core may be in a good position to know what to prepare and what not to prepare. If you want to continue this conversation it may be better to do so in #5459.

Regarding case sensitivity in PostgreSQL, unquoted identifiers are always folded to lowercase, whereas quoted ones maintain case (this is why Npgsql EF Core provider systematically quotes all identifiers). Since this is also off-topic feel free to open an issue in the Npgsql repo to continue discussing.

When I tested it, I used LIKE and LEFT(LEN()) together. The query took 00:00:00.789. I just tried the query without LIKE and it took 00:00:02.021. I ran it several times and it looks like it's consistently faster with LIKE included. So, having the LIKE seems to improve performance. Otherwise, it would be nice to get rid of it altogether and then not have to worry about escaping special characters.

With regard to prepared statements, the main thing I'm thinking about with regard to INSERT/UPDATE statements is that you are only sending the parameters after the first time through. So, it would cut down network traffic.

Regarding case sensitivity, the issue is not case sensitivity of identifiers (I ran into that annoyance as well, but, it's less of an issue), but, of the data in the database itself. I.e. you need to use special PostgreSQL specific column types to get case insensitivity. I.e. you can't configure a case insensitive collation like SQL Server uses by default. Or at least that appears to be the case on Windows. I'm assuming it may work on other platforms.

@jemiller0 re prepared statements, what you say it just as true of SELECT as it is of INSERT/UPDATE: it's very beneficial to have PostgreSQL parse a SELECT statement once and then send only parameters after the first time.

Re case sensitivity you're right, and I'm pretty sure it's no different on non-Windows platforms. You do have the citext extension (which I think you were referring to), but as you say it's not the same as a global case insensitive collation.

@roji You're right about SELECTs. I wasn't thinking correctly. I was thinking about it in the context of some loader applications that I have where I have 1 select and millions of INSERTS/UPDATEs, but, you are right. Especially, when you have things like N+1 queries happening.

The trailing "OR @ParamValue = ''" might break "sargablity". I think, if I'm reading things correctly, this is a bit of left over code from when it was using CHARINDEX (that method needed the OR.... the "LIKE" method does not). See @divega comment on 7 Feb (https://github.com/aspnet/EntityFrameworkCore/issues/474#issuecomment-278119419).

My quick and dirty testing in SQL seems to support this. Here's the setup:

create table LikeTest (
    Value varchar(50) not null primary key
)

declare @i int set @i = 1

while @i < 10000 begin
    insert into LikeTest (Value) values (cast(@i as varchar(8)))

    set @i = @i + 1
end

Here's the test:

declare @p varchar(8) set @p = '23'

select 
    *
from LikeTest
where Value like @p +'%' and left(Value, len(@p)) = @p --or @p = ''

With the "OR" commented out you get:
Table 'LikeTest'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
With the "OR" included you get:
Table 'LikeTest'. Scan count 1, logical reads 41, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

@rmacfadyen thanks for raising this. We have created https://github.com/aspnet/EntityFrameworkCore/issues/11881 to track further optimizations of this translation.

Was this page helpful?
0 / 5 - 0 ratings