I'm getting wrong results from database when trying to filter string (Contains
, StartsWith
, EndsWith
, etc) with spaces.
I'm build query like:
``` C#
var test = " ";
var venues = await context.Venues.Where(x => x.Name.Contains(test)).ToListAsync(cancellationToken);
Which gets translated to:
```SQL
SELECT [v].[Id], [v].[Name]
FROM [Venues] AS [v]
WHERE (@__test_0 = N'') OR CHARINDEX(@__test_0, [v].[Name]) > 0
This query returns all rows from database when @__test_0
contains spaces.
EF Core version: 3.1.0
Database provider: Microsoft.EntityFrameworkCore.SqlServer
Target framework: .NET Core 3.1
@nesterenko-kv Please post a small, runnable project or code listing that reproduces the behavior you are seeing so that we can investigate.
@ajcvickers repro. Let me know, if you need more details.
Note for team: I was able to reproduce this with 3.1. More minimal repro and output:
```C#
public class Venue
{
public Guid Id { get; set; }
public string Name { get; set; }
}
public class BloggingContext : DbContext
{
private readonly ILoggerFactory Logger
= LoggerFactory.Create(c => c.AddConsole());//.SetMinimumLevel(LogLevel.Debug));
protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
optionsBuilder
.UseLoggerFactory(Logger)
.EnableSensitiveDataLogging()
.UseSqlServer(@"Server=(localdb)\mssqllocaldb;Database=Test;ConnectRetryCount=0");
}
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
modelBuilder.Entity<Venue>();
}
}
public class Program
{
public static async Task Main()
{
using (var context = new BloggingContext())
{
context.Database.EnsureDeleted();
context.Database.EnsureCreated();
context.Add(new Venue { Name = "Test" });
await context.SaveChangesAsync();
}
using (var context = new BloggingContext())
{
var test = " ";
var venues = await context.Set<Venue>().Where(x => x.Name.Contains(test)).ToListAsync();
Console.WriteLine($"Matched {venues.Count}");
}
}
}
info: Microsoft.EntityFrameworkCore.Database.Command[20101]
Executed DbCommand (4ms) [Parameters=[@__test_0=' ' (Size = 4000)], CommandType='Text', CommandTimeout='30']
SELECT [v].[Id], [v].[Name]
FROM [Venue] AS [v]
WHERE (@__test_0 = N'') OR (CHARINDEX(@__test_0, [v].[Name]) > 0)
Matched 1
```
Notes from team discussion:
' ' = ''
return true!Notes from further team discussion:
On SQL Server, LEN(col) = 0
suffers from the same issue as col = ''
: it identifies spaces-only strings a having length 0 (incidentally it also does not utilize indexes).
However, col LIKE ''
may be a good solution: it only identifies truly empty strings and uses an index. At least on PostgreSQL (where this whole trailing spaces madness doesn't happen), LIKE also uses an index.
We should ideally verify also on Sqlite and MySQL, though I'd have to spend more time to understand how to analyze queries there. We can postpone dealing with this to see whether we end up doing #17598 for 5.0 (which I'd really like to).
CREATE TABLE text_tests (id INT IDENTITY(1,1) PRIMARY KEY, name NVARCHAR(10));
INSERT INTO text_tests (name) VALUES ('');
INSERT INTO text_tests (name) VALUES (' ');
SELECT COUNT(*) FROM text_tests WHERE name=''; -- Returns both
SELECT LEN(name) FROM text_tests; -- Returns 0 for both
SELECT COUNT(*) FROM text_tests WHERE name LIKE ''; -- Returns only the 1st
SELECT '|' + name + '|' FROM text_tests; -- Returns || for the first, | | for the second
-- Create schema and insert data
CREATE TABLE data (id INT IDENTITY(1,1) PRIMARY KEY, name NVARCHAR(10));
CREATE INDEX IX_name ON data(name);
BEGIN TRANSACTION;
DECLARE @i INT = 0;
WHILE @i < 50000
BEGIN
INSERT INTO data (name) VALUES (CAST(@i AS NVARCHAR(MAX)));
SET @i = @i + 1;
END;
INSERT INTO data (name) VALUES ('');
INSERT INTO data (name) VALUES (' ');
DECLARE @j INT = @i;
WHILE @j < 100000
BEGIN
INSERT INTO data (name) VALUES (CAST(@j AS NVARCHAR(MAX)));
SET @j = @j + 1;
END;
COMMIT;
-- Queries
SET SHOWPLAN_ALL ON;
SELECT id FROM data WHERE name=''; -- Does an index seek with IX_name, TotalSubtreeCost=0.0032842
SELECT id FROM data WHERE LEN(name) = 0; -- Scans the entire table, TotalSubtreeCost=0.48661754
SELECT id FROM data WHERE name LIKE ''; -- Does an index scan with IX_name, TotalSubtreeCost=0.0032842
SET SHOWPLAN_ALL OFF;
DROP TABLE IF EXISTS text_types;
CREATE TABLE text_types (
id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
text TEXT,
varchar VARCHAR(10),
char VARCHAR(10));
INSERT INTO text_types (text, varchar, char) VALUES ('', '', '');
INSERT INTO text_types (text, varchar, char) VALUES (' ', ' ', ' ');
SELECT * FROM text_types WHERE text = ''; -- Returns 1st row only
SELECT * FROM text_types WHERE varchar = ''; -- Returns 1st row only
SELECT * FROM text_types WHERE char = ''; -- Returns 1st row only
SELECT LENGTH(text), LENGTH(varchar), LENGTH(char) FROM text_types; -- 1st row all zeros, 2nd row all 2s
SELECT '|' || text || '|' || varchar || '|' || char || '|' FROM text_types; -- 1st row no spaces, 2nd row 2 spaces everywhere
-- Create schema and insert data
DROP TABLE IF EXISTS data;
CREATE TABLE data (id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY , name TEXT);
CREATE INDEX IX_name ON data(name);
DO $$BEGIN
FOR i IN 1..50000 LOOP
INSERT INTO data (name) VALUES (i::TEXT);
END LOOP;
INSERT INTO data (name) VALUES ('');
INSERT INTO data (name) VALUES (' ');
FOR i IN 50001..100000 LOOP
INSERT INTO data (name) VALUES (i::TEXT);
END LOOP;
END$$;
-- Queries - same index usage results as SQL Server
EXPLAIN SELECT * FROM data WHERE name='';
EXPLAIN SELECT * FROM data WHERE LENGTH(name) = 0;
EXPLAIN SELECT * FROM data WHERE name LIKE '';
CREATE TABLE text_tests (id INT, name TEXT);
INSERT INTO text_tests (id, name) VALUES (1, '');
INSERT INTO text_tests (id, name) VALUES (2, ' ');
SELECT id FROM text_tests WHERE name=''; -- Returns only id=1
SELECT LENGTH(name) FROM text_tests; -- Returns 0 and 2
SELECT id FROM text_tests WHERE name LIKE ''; -- Returns only id=1
SELECT '|' || name || '|' FROM text_tests; -- Returns || for the first, | | for the second
CREATE TABLE data (id INTEGER PRIMARY KEY, name TEXT);
CREATE INDEX IX_name ON data(name);
-- Insert 1..50000
INSERT INTO data
SELECT x, x FROM (
WITH RECURSIVE
cnt(x) AS (
SELECT 1
UNION ALL
SELECT x+1 FROM cnt
LIMIT 50000
)
SELECT x FROM cnt
);
INSERT INTO data (id, name) VALUES (200000, '');
INSERT INTO data (id, name) VALUES (200001, ' ');
-- Insert 50001..100000
INSERT INTO data
SELECT x, x FROM (
WITH RECURSIVE
cnt(x) AS (
SELECT 50001
UNION ALL
SELECT x+1 FROM cnt
LIMIT 50000
)
SELECT x FROM cnt
);
SELECT COUNT(*) FROM data;
EXPLAIN QUERY PLAN SELECT * FROM data WHERE name=''; -- SEARCH TABLE data USING COVERING INDEX IX_name (name=?)
EXPLAIN QUERY PLAN SELECT * FROM data WHERE LENGTH(name) = 0; -- SCAN TABLE data
EXPLAIN QUERY PLAN SELECT * FROM data WHERE name LIKE ''; -- SCAN TABLE data
Did the research on Sqlite behavior (added above):
@p LIKE ''
works functionally as an empty string check, but it does not use indexes (Sqlite LIKE ignores indexes completely?? /cc @bricelam)@smitpatel @maumar any thoughts here?
Why cannot we just convert param = N''
in SqlServer provider to @p0 LIKE ''
?
It does not have to related with "StartsWith"
Great workaround, thanks.
If we go down this route, shouldn't we:
param = N' '
=> @p LIKE ' '
)column == ""
in LINQ today they're probably hitting this bug too)Since a = b
gives incorrect results when either of them have trailing whitespace, we should do it for any case we can improve. That means, if column comparison is also hitting bug then we convert that too.
Equality can be generated outside of SqlTranslator.VisitBinary too. So SqlTranslator wouldn't work. Option would be either SqlExpressionFactory.Equal generates different for string or post-processor would take care of rewriting. PostProcessor would be slightly more preferable as it avoids .Equal not returning SqlBinary.
Post-processor indeed sounds better to me (but should be before relational command cache etc.). Will prepare this.
I did some research on Contains (what this issue is about originally). It turns out that PostgreSQL, MySQL and Sqlite find empty strings at the beginning of all strings, just like .NET - but not SQL Server.
Empty string searches across databases
SELECT CHARINDEX('f', 'foo'); -- 1
SELECT CHARINDEX('x', 'foo'); -- 0
SELECT CHARINDEX('', 'foo'); -- 0!
SELECT CHARINDEX('', ''); -- 0!
SELECT CHARINDEX('', ' '); -- 0!
SELECT INSTR('foo', 'f'); -- 1
SELECT INSTR('foo', 'x'); -- 0
SELECT INSTR('foo', ''); -- 1
SELECT INSTR('', ''); -- 1
SELECT INSTR(' ', ''); -- 1
SELECT INSTR('foo', 'f'); -- 1
SELECT INSTR('foo', 'x'); -- 0
SELECT INSTR('foo', ''); -- 1
SELECT INSTR('', ''); -- 1
SELECT INSTR(' ', ''); -- 1
SELECT STRPOS('foo', 'f'); -- 1
SELECT STRPOS('foo', 'x'); -- 0
SELECT STRPOS('foo', ''); -- 1
SELECT STRPOS('', ''); -- 1
SELECT STRPOS(' ', ''); -- 1
x = ''
) yields false positives because of the trailing whitespace problem (this is this issue).x LIKE ''
, since LIKE does not strip trailing whitespace. I got blocked because some handling of StartsWith/EndsWith happens in preprocessing, and that isn't provider-specific (the code since moved to QueryOptimizingExpressionVisitor).LIKE ''
), it's any string with trailing whitespace on either side. For example, our current translation with LEFT/LEN does something like this:SELECT CASE WHEN LEFT('f', LEN(' ')) = ' ' THEN 1 ELSE 0 END; -- returns true, since whitespace is disregarded
We can't use LIKE to take the whitespace into account because operands may be non-constants. So unless someone has a new idea we can say we won't fix this.
EF6 translation for Contains on SQL Server:
SELECT
[Extent1].[Id] AS [Id],
[Extent1].[Name] AS [Name]
FROM [dbo].[Venues] AS [Extent1]
WHERE [Extent1].[Name] LIKE @p__linq__0 ESCAPE N'~'
Thanks!
Any idea what happens if the pattern parameter has wildcards? This could only work if EF6 escapes them client side before sending the parameter? This could correspond to the "client-side parameter transformations" I mentioned in #17598, we it's definitely another approach to the problem.
@roji What query (or queries) should I run to get an idea of the wildcards issue?
Anything where the pattern being searched contains a LIKE wildcard. For example: ctx.Venues.Where(v => v.Name.Contains("foo%bar"))
. If I'm understanding this correctly, the actual parameter (@p__linq__0) sent by EF6 should be foo~%bar
.
Set up a quick EF6 test project for Contains (with .NET Core on Linux!), results are below.
tl;dr EF6 seems to be doing the right thing, always using LIKE for Contains, escaping wildcards for both constants and parameters. For columns it falls back to CHARINDEX, but does not compensate for empty string (so it has a false negative bug).
For EF Core, here's what I think we should do:
Constant pattern:
_ = ctx.Blogs.Where(b => b.Name.Contains("foo")).ToList(); // WHERE [Extent1].[Name] LIKE N'%foo%'
_ = ctx.Blogs.Where(b => b.Name.Contains("fo%o")).ToList(); // WHERE [Extent1].[Name] LIKE N'%fo~%o%' ESCAPE N'~'
Parameter pattern:
var pattern1 = "foo";
_ = ctx.Blogs.Where(b => b.Name.Contains(pattern1)).ToList();
Translates to:
WHERE [Extent1].[Name] LIKE @p__linq__0 ESCAPE N'~'
-- p__linq__0: '%foo%' (Type = String, Size = 4000)
Parameterized pattern with wildcards:
var pattern = "fo%o";
_ = ctx.Blogs.Where(b => b.Name.Contains(pattern)).ToList();
Translates to:
WHERE [Extent1].[Name] LIKE @p__linq__0 ESCAPE N'~'
-- p__linq__0: '%fo~%o%' (Type = String, Size = 4000)
Column pattern:
_ = ctx.Blogs.Where(b => "foo".Contains(b.Name)).ToList();
Translates to:
WHERE ( CAST(CHARINDEX([Extent1].[Name], N'foo') AS int)) > 0
Design decisions:
It is disappointing to see that the design decision is that this won't be fixed any time soon.
It appears there are a few different cases with different issues and solutions.
I'm guessing there is a good reason that the translation logic can't test for the specific scenario of trailing whitespace in an otherwise not empty pattern and then implement a workaround for the sql len() with trailing whitespace issue, as seen on stack overflow?:
Example row: FirstName column value: 'St-phen'
@__fn_1 nvarchar(4000)='St '
Translates to false positive:
LEFT([u].[FirstName], LEN(@__fn_1)) = @__fn_1
But this would translate correctly when using this StartsWith translation: LEN(CAST(@__fn_1 AS nvarchar(MAX)) + 'x') - 1
LEFT([u].[FirstName], LEN(CAST(@__fn_1 AS nvarchar(MAX)) + 'x') - 1) = @__fn_1
On the plus side, this Linq workaround seems to work for me, for my use case, by causing the translation to not use len() when it ends with a space. Just be aware you need a string.IsNullOrEmpty check beforehand, in that case you need a different query anyway:
if (firstName[firstName.Length-1] == ' ')
return query.Where(x => x.FirstName.Substring(0, firstName.Length) == firstName);
else
return query.Where(x => x.FirstName.StartsWith(firstName));
@zachrybaker my message above may have been a bit inaccurate - in some cases, the translations of Contains, StartsWith and EndsWith will work correctly with regards to trailing whitespace (i.e. when patterns are constant we will generate LIKE, which doesn't ignore trailing whitespace). Once we implement client-side parameter transformations (which we do intend to do), the same technique can be used for parameter patterns as well.
Re other workaround with len... First, your stack overflow link seems broken. Second, that elaborate workaround still returns a false positive when the value is shorter than the pattern:
SELECT CASE WHEN LEFT('h', LEN(CAST('h ' AS nvarchar(MAX)) + 'x') - 1) = 'h ' THEN 1 ELSE 0 END; -- 1
It's possible to add a condition for that too, but at some point that kind of complexity does not seem appropriate for translating a very common .NET method like StartsWith (and properly has some perf overhead, with all the casting, concatenation, etc. etc). In any case, as you say, it's always possible for people to express this themselves with EF.Functions.Like.
The larger picture here is that it's impossible to handle this well for much more basic cases without a significant perf impact - I'm thinking about simple equality. When users compare two VARCHAR columns, trailing SQL will be truncated. Any attempt to compensate for that with elaborate logic such as the above would prevent index usage - which is the last thing you want to do for the default text equality operator. Since it's not possible to fix this for basic equality, I don't think it makes sense to go so far for StartsWith either.
The current LEN/LEFT
based translation is not SARGable. This means that an index cannot be used to satisfy a StartsWith
predicate. I am currently facing this as a performance issue in a LINQ to SQL migration. L2S translated StartsWith
to LIKE
with escaping for wildcards. This has been proposed above as well.
SQL Server is capable of seeking on an index with a parameterized LIKE
pattern. The pattern does not need to be constant.
This is implemented by SQL Server inserting an internal function into the query plan. This function takes the (dynamic) LIKE
pattern and computes at runtime upper and lower seek bounds. This turns out to be efficient for commonly used patterns.
I'd be curious if LIKE
or CHARINDEX
are about equal in performance for Contains
searches. If yes, then LIKE
could be used as a common workhorse for all string searching operations.
I believe that LIKE
has better behavior with respect to whitespace as well.
For a simple equality comparison (entity.Column == parameter)
this should be translated to a normal =
on the SQL side.
The current LEN/LEFT based translation is not SARGable
Yes, that is well-known (see discussion above, especially https://github.com/dotnet/efcore/issues/19402#issuecomment-618443410). Note that we only generate LEN/LEFT
for non-constant patterns, so if you do Where(c => c.Name.StartsWith("A")
you should see LIKE being used instead.
The reason we can't use LIKE for non-constant parameters, is that if the pattern contains contains wildcard characters, incorrect results will be returns. So if you attempt to search for names starting with an underscore, you will get all row since underscore will be interpreted by LIKE as a wilcard. For constants we can escape these character in the pattern, but not for non-constants.
There's a plan going forward (#17598) to allow us to transform parameter values before sending them to SQL Server, which would allow us to escape wildcard characters in parameters, similar to what we do today with constants. However, that will not happen for 5.0; note also that we'll never be able to do that for column patterns (but that's probably a rare case).
Hope that explains the situation.
Re perf, note that as far as I'm aware, indexes can be used in some cases when the LIKE pattern matches at the beginning of the string, but indexes are never used for other patterns (e.g. generalized Contains); that makes sense when you think about how indexes work. However, I haven't personally done any testing on this. Regardless, the idea in general is to try to use LIKE where possible, but that requires escaping as explained above.
Thanks for explaining, that makes sense. Right, I guess only parameter expressions can feasibly be escaped.
Most helpful comment
@zachrybaker my message above may have been a bit inaccurate - in some cases, the translations of Contains, StartsWith and EndsWith will work correctly with regards to trailing whitespace (i.e. when patterns are constant we will generate LIKE, which doesn't ignore trailing whitespace). Once we implement client-side parameter transformations (which we do intend to do), the same technique can be used for parameter patterns as well.
Re other workaround with len... First, your stack overflow link seems broken. Second, that elaborate workaround still returns a false positive when the value is shorter than the pattern:
It's possible to add a condition for that too, but at some point that kind of complexity does not seem appropriate for translating a very common .NET method like StartsWith (and properly has some perf overhead, with all the casting, concatenation, etc. etc). In any case, as you say, it's always possible for people to express this themselves with EF.Functions.Like.
The larger picture here is that it's impossible to handle this well for much more basic cases without a significant perf impact - I'm thinking about simple equality. When users compare two VARCHAR columns, trailing SQL will be truncated. Any attempt to compensate for that with elaborate logic such as the above would prevent index usage - which is the last thing you want to do for the default text equality operator. Since it's not possible to fix this for basic equality, I don't think it makes sense to go so far for StartsWith either.