Efcore: Incorrect string matching behavior on SQL Server with whitespace patterns

Created on 25 Dec 2019 · 24Comments · Source: dotnet/efcore

I'm getting wrong results from database when trying to filter string (Contains, StartsWith, EndsWith, etc) with spaces.

Steps to reproduce

I'm build query like:
``` C#
var test = " ";
var venues = await context.Venues.Where(x => x.Name.Contains(test)).ToListAsync(cancellationToken);

Which gets translated to:
```SQL
SELECT [v].[Id], [v].[Name]
FROM [Venues] AS [v]
WHERE (@__test_0 = N'') OR CHARINDEX(@__test_0, [v].[Name]) > 0

This query returns all rows from database when @__test_0 contains spaces.

Further technical details

EF Core version: 3.1.0
Database provider: Microsoft.EntityFrameworkCore.SqlServer
Target framework: .NET Core 3.1

closed-fixed customer-reported type-bug

Source

nesterenko-kv

Most helpful comment

@zachrybaker my message above may have been a bit inaccurate - in some cases, the translations of Contains, StartsWith and EndsWith will work correctly with regards to trailing whitespace (i.e. when patterns are constant we will generate LIKE, which doesn't ignore trailing whitespace). Once we implement client-side parameter transformations (which we do intend to do), the same technique can be used for parameter patterns as well.

Re other workaround with len... First, your stack overflow link seems broken. Second, that elaborate workaround still returns a false positive when the value is shorter than the pattern:

SELECT CASE WHEN LEFT('h', LEN(CAST('h ' AS nvarchar(MAX)) + 'x') - 1) = 'h ' THEN 1 ELSE 0 END; -- 1

It's possible to add a condition for that too, but at some point that kind of complexity does not seem appropriate for translating a very common .NET method like StartsWith (and properly has some perf overhead, with all the casting, concatenation, etc. etc). In any case, as you say, it's always possible for people to express this themselves with EF.Functions.Like.

The larger picture here is that it's impossible to handle this well for much more basic cases without a significant perf impact - I'm thinking about simple equality. When users compare two VARCHAR columns, trailing SQL will be truncated. Any attempt to compensate for that with elaborate logic such as the above would prevent index usage - which is the last thing you want to do for the default text equality operator. Since it's not possible to fix this for basic equality, I don't think it makes sense to go so far for StartsWith either.

roji on 30 Apr 2020

👍2

All 24 comments

@nesterenko-kv Please post a small, runnable project or code listing that reproduces the behavior you are seeing so that we can investigate.

ajcvickers on 3 Jan 2020

@ajcvickers repro. Let me know, if you need more details.

nesterenko-kv on 5 Jan 2020

Note for team: I was able to reproduce this with 3.1. More minimal repro and output:

```C#
public class Venue
{
public Guid Id { get; set; }
public string Name { get; set; }
}

public class BloggingContext : DbContext
{
private readonly ILoggerFactory Logger
= LoggerFactory.Create(c => c.AddConsole());//.SetMinimumLevel(LogLevel.Debug));

protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
    optionsBuilder
        .UseLoggerFactory(Logger)
        .EnableSensitiveDataLogging()
        .UseSqlServer(@"Server=(localdb)\mssqllocaldb;Database=Test;ConnectRetryCount=0");
}

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<Venue>();    
}

}

public class Program
{
public static async Task Main()
{
using (var context = new BloggingContext())
{
context.Database.EnsureDeleted();
context.Database.EnsureCreated();

        context.Add(new Venue { Name = "Test" });

        await context.SaveChangesAsync();
    }

    using (var context = new BloggingContext())
    {
        var test = "  ";
        var venues = await context.Set<Venue>().Where(x => x.Name.Contains(test)).ToListAsync();
        Console.WriteLine($"Matched {venues.Count}");
    }
}

}

info: Microsoft.EntityFrameworkCore.Database.Command[20101]
Executed DbCommand (4ms) [Parameters=[@__test_0=' ' (Size = 4000)], CommandType='Text', CommandTimeout='30']
SELECT [v].[Id], [v].[Name]
FROM [Venue] AS [v]
WHERE (@__test_0 = N'') OR (CHARINDEX(@__test_0, [v].[Name]) > 0)
Matched 1
```

ajcvickers on 7 Jan 2020

👍1

Notes from team discussion:

This happens because of the way trailing spaces are handled by SQL Server. This makes ' ' = '' return true!
Ideal fix would be to implement #17598 and sniff for the empty string parameter so that we can generate different SQL. However, #17598 is on the backlog.
Another way to fix would be to check the length of the parameter rather than comparing to the empty string. However, this might cause index misses.

ajcvickers on 11 Jan 2020

Notes from further team discussion:

#17598 is not currently scheduled for EF Core 5. On the other hand, we don't want to wait until after 5.0 to fix this issue. Therefore @roji and @maumar will look into a tactical fix even if we later do #17598.
We should also consider the case where the empty string is coming from a column, although this is likely to be rarer.
Clarification on my last point above: we should check that we don't break hitting indexes with whatever solution we come up with. Probably checking the length will be okay.

ajcvickers on 14 Jan 2020

Summary

On SQL Server, LEN(col) = 0 suffers from the same issue as col = '': it identifies spaces-only strings a having length 0 (incidentally it also does not utilize indexes).

However, col LIKE '' may be a good solution: it only identifies truly empty strings and uses an index. At least on PostgreSQL (where this whole trailing spaces madness doesn't happen), LIKE also uses an index.

We should ideally verify also on Sqlite and MySQL, though I'd have to spend more time to understand how to analyze queries there. We can postpone dealing with this to see whether we end up doing #17598 for 5.0 (which I'd really like to).

SQL Server

Behavior around trailing spaces

CREATE TABLE text_tests (id INT IDENTITY(1,1) PRIMARY KEY, name NVARCHAR(10));
INSERT INTO text_tests (name) VALUES ('');
INSERT INTO text_tests (name) VALUES ('  ');
SELECT COUNT(*) FROM text_tests WHERE name=''; -- Returns both
SELECT LEN(name) FROM text_tests; -- Returns 0 for both
SELECT COUNT(*) FROM text_tests WHERE name LIKE '';  -- Returns only the 1st
SELECT '|' + name + '|' FROM text_tests;  -- Returns || for the first, |  | for the second

Index usage

-- Create schema and insert data
CREATE TABLE data (id INT IDENTITY(1,1) PRIMARY KEY, name NVARCHAR(10));
CREATE INDEX IX_name ON data(name);

BEGIN TRANSACTION;

DECLARE @i INT = 0;
WHILE @i < 50000
BEGIN
    INSERT INTO data (name) VALUES (CAST(@i AS NVARCHAR(MAX)));
    SET @i = @i + 1;
END;

INSERT INTO data (name) VALUES ('');
INSERT INTO data (name) VALUES ('  ');

DECLARE @j INT = @i;
WHILE @j < 100000
BEGIN
    INSERT INTO data (name) VALUES (CAST(@j AS NVARCHAR(MAX)));
    SET @j = @j + 1;
END;

COMMIT;

-- Queries
SET SHOWPLAN_ALL ON;
SELECT id FROM data WHERE name=''; -- Does an index seek with IX_name, TotalSubtreeCost=0.0032842
SELECT id FROM data WHERE LEN(name) = 0; -- Scans the entire table, TotalSubtreeCost=0.48661754
SELECT id FROM data WHERE name LIKE ''; -- Does an index scan with IX_name, TotalSubtreeCost=0.0032842
SET SHOWPLAN_ALL OFF;

PostgreSQL

Behavior around trailing spaces

DROP TABLE IF EXISTS text_types;
CREATE TABLE text_types (
    id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    text TEXT,
    varchar VARCHAR(10),
    char VARCHAR(10));
INSERT INTO text_types (text, varchar, char) VALUES ('', '', '');
INSERT INTO text_types (text, varchar, char) VALUES ('  ', '  ', '  ');

SELECT * FROM text_types WHERE text = ''; -- Returns 1st row only
SELECT * FROM text_types WHERE varchar = ''; -- Returns 1st row only
SELECT * FROM text_types WHERE char = ''; -- Returns 1st row only

SELECT LENGTH(text), LENGTH(varchar), LENGTH(char) FROM text_types; -- 1st row all zeros, 2nd row all 2s 
SELECT '|' || text || '|' || varchar || '|' || char || '|' FROM text_types; -- 1st row no spaces, 2nd row 2 spaces everywhere

Index usage

-- Create schema and insert data
DROP TABLE IF EXISTS data;
CREATE TABLE data (id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY , name TEXT);
CREATE INDEX IX_name ON data(name);

DO $$BEGIN
  FOR i IN 1..50000 LOOP
    INSERT INTO data (name) VALUES (i::TEXT);     
  END LOOP;
  INSERT INTO data (name) VALUES ('');
  INSERT INTO data (name) VALUES ('  ');
  FOR i IN 50001..100000 LOOP
    INSERT INTO data (name) VALUES (i::TEXT);
  END LOOP;
END$$;

-- Queries - same index usage results as SQL Server
EXPLAIN SELECT * FROM data WHERE name='';
EXPLAIN SELECT * FROM data WHERE LENGTH(name) = 0;
EXPLAIN SELECT * FROM data WHERE name LIKE '';

Sqlite

Behavior around trailing spaces

CREATE TABLE text_tests (id INT, name TEXT);
INSERT INTO text_tests (id, name) VALUES (1, '');
INSERT INTO text_tests (id, name) VALUES (2, '  ');
SELECT id FROM text_tests WHERE name=''; -- Returns only id=1
SELECT LENGTH(name) FROM text_tests;  -- Returns 0 and 2
SELECT id FROM text_tests WHERE name LIKE ''; -- Returns only id=1
SELECT '|' || name || '|' FROM text_tests;  -- Returns || for the first, |  | for the second

Index usage

CREATE TABLE data (id INTEGER PRIMARY KEY, name TEXT);
CREATE INDEX IX_name ON data(name);
-- Insert 1..50000
INSERT INTO data
SELECT x, x FROM (
  WITH RECURSIVE
      cnt(x) AS (
          SELECT 1
          UNION ALL
          SELECT x+1 FROM cnt
          LIMIT 50000
      )
  SELECT x FROM cnt
);
INSERT INTO data (id, name) VALUES (200000, '');
INSERT INTO data (id, name) VALUES (200001, '  ');
-- Insert 50001..100000
INSERT INTO data
SELECT x, x FROM (
                     WITH RECURSIVE
                         cnt(x) AS (
                             SELECT 50001
                             UNION ALL
                             SELECT x+1 FROM cnt
                             LIMIT 50000
                         )
                     SELECT x FROM cnt
                 );

SELECT COUNT(*) FROM data;    

EXPLAIN QUERY PLAN SELECT * FROM data WHERE name=''; -- SEARCH TABLE data USING COVERING INDEX IX_name (name=?)
EXPLAIN QUERY PLAN SELECT * FROM data WHERE LENGTH(name) = 0; -- SCAN TABLE data
EXPLAIN QUERY PLAN SELECT * FROM data WHERE name LIKE ''; -- SCAN TABLE data

roji on 15 Jan 2020

👍2

Did the research on Sqlite behavior (added above):

It behaves sanely, unlike SQL Server: trailing whitespace isn't stripped when comparing.
@p LIKE '' works functionally as an empty string check, but it does not use indexes (Sqlite LIKE ignores indexes completely?? /cc @bricelam)
Unfortunately the translation for StartsWith and EndsWith happens in FunctionPreprocessingExpressionVisitor because of a null semantics issue, so it's going to be hard to translate differently per provider, which is what we need here...

@smitpatel @maumar any thoughts here?

roji on 10 Feb 2020

Why cannot we just convert param = N'' in SqlServer provider to @p0 LIKE ''?
It does not have to related with "StartsWith"

smitpatel on 10 Feb 2020

Great workaround, thanks.

If we go down this route, shouldn't we:

Do this conversion for all whitespace-only literals (e.g. param = N' ' => @p LIKE ' ')
Do it not just for parameters (if a user writes column == "" in LINQ today they're probably hitting this bug too)
Where do you want this? SqlServerSqlTranslatingExpressionVisitor or post-processing (which visitor)?

roji on 10 Feb 2020

Since a = b gives incorrect results when either of them have trailing whitespace, we should do it for any case we can improve. That means, if column comparison is also hitting bug then we convert that too.

Equality can be generated outside of SqlTranslator.VisitBinary too. So SqlTranslator wouldn't work. Option would be either SqlExpressionFactory.Equal generates different for string or post-processor would take care of rewriting. PostProcessor would be slightly more preferable as it avoids .Equal not returning SqlBinary.

smitpatel on 10 Feb 2020

👍2

Post-processor indeed sounds better to me (but should be before relational command cache etc.). Will prepare this.

roji on 10 Feb 2020

👍1

I did some research on Contains (what this issue is about originally). It turns out that PostgreSQL, MySQL and Sqlite find empty strings at the beginning of all strings, just like .NET - but not SQL Server.

Empty string searches across databases

SQL Server

SELECT CHARINDEX('f', 'foo'); -- 1
SELECT CHARINDEX('x', 'foo'); -- 0
SELECT CHARINDEX('', 'foo'); -- 0!
SELECT CHARINDEX('', ''); -- 0!
SELECT CHARINDEX('', '   '); -- 0!

Sqlite

SELECT INSTR('foo', 'f'); -- 1
SELECT INSTR('foo', 'x'); -- 0
SELECT INSTR('foo', ''); -- 1
SELECT INSTR('', ''); -- 1
SELECT INSTR('   ', ''); -- 1

MySQL

SELECT INSTR('foo', 'f'); -- 1
SELECT INSTR('foo', 'x'); -- 0
SELECT INSTR('foo', ''); -- 1
SELECT INSTR('', ''); -- 1
SELECT INSTR('   ', ''); -- 1

PostgreSQL

SELECT STRPOS('foo', 'f'); -- 1
SELECT STRPOS('foo', 'x'); -- 0
SELECT STRPOS('foo', ''); -- 1
SELECT STRPOS('', ''); -- 1
SELECT STRPOS('   ', ''); -- 1

Because of the above, SQL Server requires empty-string compensation; but we can remove compensation from Sqlite/Npgsql.
Our current SQL Server compensation (x = '') yields false positives because of the trailing whitespace problem (this is this issue).
If we remove the compensation altogether, we will get false negatives on empty strings.
Above I proposed to change our compensation to x LIKE '', since LIKE does not strip trailing whitespace. I got blocked because some handling of StartsWith/EndsWith happens in preprocessing, and that isn't provider-specific (the code since moved to QueryOptimizingExpressionVisitor).
At this point we started looking into generalizing the whole thing, i.e. changing all string comparisons to use LIKE instead of equality (#19881). We decided to drop this, because we can't do this for parameters/columns (they may contain wildcards), and we don't want a behavior divergence with constants (where this is possible).
For Contains, we can go back to the previous idea of using LIKE with empty string only within null compensation (non-generalized).
For StartsWith/EndsWith I can't think of a solution. The problem isn't just empty strings (where we can compensate with LIKE ''), it's any string with trailing whitespace on either side. For example, our current translation with LEFT/LEN does something like this:

SELECT CASE WHEN LEFT('f', LEN(' ')) = ' ' THEN 1 ELSE 0 END; -- returns true, since whitespace is disregarded

We can't use LIKE to take the whitespace into account because operands may be non-constants. So unless someone has a new idea we can say we won't fix this.

roji on 20 Apr 2020

EF6 translation for Contains on SQL Server:

SELECT 
    [Extent1].[Id] AS [Id], 
    [Extent1].[Name] AS [Name]
    FROM [dbo].[Venues] AS [Extent1]
    WHERE [Extent1].[Name] LIKE @p__linq__0 ESCAPE N'~'

ajcvickers on 21 Apr 2020

Thanks!

Any idea what happens if the pattern parameter has wildcards? This could only work if EF6 escapes them client side before sending the parameter? This could correspond to the "client-side parameter transformations" I mentioned in #17598, we it's definitely another approach to the problem.

roji on 21 Apr 2020

@roji What query (or queries) should I run to get an idea of the wildcards issue?

ajcvickers on 21 Apr 2020

Anything where the pattern being searched contains a LIKE wildcard. For example: ctx.Venues.Where(v => v.Name.Contains("foo%bar")). If I'm understanding this correctly, the actual parameter (@p__linq__0) sent by EF6 should be foo~%bar.

roji on 21 Apr 2020

Set up a quick EF6 test project for Contains (with .NET Core on Linux!), results are below.

tl;dr EF6 seems to be doing the right thing, always using LIKE for Contains, escaping wildcards for both constants and parameters. For columns it falls back to CHARINDEX, but does not compensate for empty string (so it has a false negative bug).

For EF Core, here's what I think we should do:

Switch to LIKE for constant patterns, escaping client-side (like we already do for Sqlite/Npgsql StartsWith/EndsWith)
Keep CHARINDEX for parameter/column patterns, but change empty-string compensation to use LIKE
Consider implementing client-side parameter transformation like EF6, at which point parameter patterns can also switch to LIKE

Constant pattern:

_ = ctx.Blogs.Where(b => b.Name.Contains("foo")).ToList(); //  WHERE [Extent1].[Name] LIKE N'%foo%'
_ = ctx.Blogs.Where(b => b.Name.Contains("fo%o")).ToList(); // WHERE [Extent1].[Name] LIKE N'%fo~%o%' ESCAPE N'~'

Parameter pattern:

var pattern1 = "foo";
_ = ctx.Blogs.Where(b => b.Name.Contains(pattern1)).ToList();

Translates to:

WHERE [Extent1].[Name] LIKE @p__linq__0 ESCAPE N'~'
-- p__linq__0: '%foo%' (Type = String, Size = 4000)

Parameterized pattern with wildcards:

var pattern = "fo%o";
_ = ctx.Blogs.Where(b => b.Name.Contains(pattern)).ToList();

Translates to:

WHERE [Extent1].[Name] LIKE @p__linq__0 ESCAPE N'~'
-- p__linq__0: '%fo~%o%' (Type = String, Size = 4000)

Column pattern:

_ = ctx.Blogs.Where(b => "foo".Contains(b.Name)).ToList();

Translates to:

WHERE ( CAST(CHARINDEX([Extent1].[Name], N'foo') AS int)) > 0

roji on 22 Apr 2020

👍1

Design decisions:

Switch Contains to translate to LIKE instead of CHARINDEX on constants, like we do with StartsWith/EndsWith (with client-side escaping). This will fix both the empty string behavior and allow it to be used with XML columns.
For parameter patterns with Contains/StartsWith/EndsWith, at some point we'll implement client-side parameter transformations (#17598), which would allow us to switch to LIKE for those as well (with client-side escaping). However, as this is a non-trivial query pipeline improvement, it will not be done immediately.
However, we'll fix Contains empty string compensation checks to use LIKE (relevant for both parameter and column patterns).
We will not attempt to compensate for the general SQL Server whitespace truncation logic. In other words, if the user compares two strings, trailing whitespace differences will be disregarded by SQL Server. We don't think there's anything reasonable we can do about that without compromising performance significantly.

roji on 23 Apr 2020

👍1

It is disappointing to see that the design decision is that this won't be fixed any time soon.
It appears there are a few different cases with different issues and solutions.

I'm guessing there is a good reason that the translation logic can't test for the specific scenario of trailing whitespace in an otherwise not empty pattern and then implement a workaround for the sql len() with trailing whitespace issue, as seen on stack overflow?:

Example row: FirstName column value: 'St-phen'
@__fn_1 nvarchar(4000)='St '

Translates to false positive:
LEFT([u].[FirstName], LEN(@__fn_1)) = @__fn_1

But this would translate correctly when using this StartsWith translation: LEN(CAST(@__fn_1 AS nvarchar(MAX)) + 'x') - 1
LEFT([u].[FirstName], LEN(CAST(@__fn_1 AS nvarchar(MAX)) + 'x') - 1) = @__fn_1

On the plus side, this Linq workaround seems to work for me, for my use case, by causing the translation to not use len() when it ends with a space. Just be aware you need a string.IsNullOrEmpty check beforehand, in that case you need a different query anyway:
if (firstName[firstName.Length-1] == ' ') return query.Where(x => x.FirstName.Substring(0, firstName.Length) == firstName); else return query.Where(x => x.FirstName.StartsWith(firstName));

zachrybaker on 30 Apr 2020

Re other workaround with len... First, your stack overflow link seems broken. Second, that elaborate workaround still returns a false positive when the value is shorter than the pattern:

SELECT CASE WHEN LEFT('h', LEN(CAST('h ' AS nvarchar(MAX)) + 'x') - 1) = 'h ' THEN 1 ELSE 0 END; -- 1

roji on 30 Apr 2020

👍2

The current LEN/LEFT based translation is not SARGable. This means that an index cannot be used to satisfy a StartsWith predicate. I am currently facing this as a performance issue in a LINQ to SQL migration. L2S translated StartsWith to LIKE with escaping for wildcards. This has been proposed above as well.

SQL Server is capable of seeking on an index with a parameterized LIKE pattern. The pattern does not need to be constant.

This is implemented by SQL Server inserting an internal function into the query plan. This function takes the (dynamic) LIKE pattern and computes at runtime upper and lower seek bounds. This turns out to be efficient for commonly used patterns.

I'd be curious if LIKE or CHARINDEX are about equal in performance for Contains searches. If yes, then LIKE could be used as a common workhorse for all string searching operations.

I believe that LIKE has better behavior with respect to whitespace as well.

For a simple equality comparison (entity.Column == parameter) this should be translated to a normal = on the SQL side.

GSPP on 28 Jun 2020

😕1

The current LEN/LEFT based translation is not SARGable

Yes, that is well-known (see discussion above, especially https://github.com/dotnet/efcore/issues/19402#issuecomment-618443410). Note that we only generate LEN/LEFT for non-constant patterns, so if you do Where(c => c.Name.StartsWith("A") you should see LIKE being used instead.

The reason we can't use LIKE for non-constant parameters, is that if the pattern contains contains wildcard characters, incorrect results will be returns. So if you attempt to search for names starting with an underscore, you will get all row since underscore will be interpreted by LIKE as a wilcard. For constants we can escape these character in the pattern, but not for non-constants.

There's a plan going forward (#17598) to allow us to transform parameter values before sending them to SQL Server, which would allow us to escape wildcard characters in parameters, similar to what we do today with constants. However, that will not happen for 5.0; note also that we'll never be able to do that for column patterns (but that's probably a rare case).

Hope that explains the situation.

roji on 28 Jun 2020

Re perf, note that as far as I'm aware, indexes can be used in some cases when the LIKE pattern matches at the beginning of the string, but indexes are never used for other patterns (e.g. generalized Contains); that makes sense when you think about how indexes work. However, I haven't personally done any testing on this. Regardless, the idea in general is to try to use LIKE where possible, but that requires escaping as explained above.

roji on 28 Jun 2020

Thanks for explaining, that makes sense. Right, I guess only parameter expressions can feasibly be escaped.

GSPP on 29 Jun 2020

👍1

Was this page helpful?

0 / 5 - 0 ratings