Currently, EF Core is not parameterizing IN(...)
queries created from .Contains()
(and maybe other cases). This has a very detrimental impact on SQL Server itself because:
Note: when SQL Server has memory pressure, plan cache is the first thing to empty. So this has a profound impact at scale, doubly so when things have gone sideways.
Here's a reduced down version of the problem:
```c#
var ids = Enumerable.Range(1, 1000).ToList();
var throwAwary = context.Users.Where(u => ids.Contains(u.Id)).ToList();
This results in the following:
```sql
SELECT [u].[Id], [u].[AcceptRateAccepted], [u].[AcceptRateAsked], [u].[AccountId], [u].[AnswerCount], [u].[BronzeBadges], [u].[CreationDate], [u].[DaysVisitedConsecutive], [u].[DaysVisitedTotal], [u].[DisplayName], [u].[Email], [u].[Flags], [u].[GoldBadges], [u].[HasAboutMeExcerpt], [u].[HasReplies], [u].[IsVeteran], [u].[JobSearchStatus], [u].[LastAccessDate], [u].[LastDailySiteAccessDate], [u].[LastEmailDate], [u].[LastLoginDate], [u].[LastLoginIP], [u].[LastModifiedDate], [u].[Location], [u].[OptInEmail], [u].[PreferencesRaw], [u].[ProfileImageUrl], [u].[QuestionCount], [u].[RealName], [u].[Reputation], [u].[ReputationMonth], [u].[ReputationQuarter], [u].[ReputationSinceLastCheck], [u].[ReputationToday], [u].[ReputationWeek], [u].[ReputationYear], [u].[SignupStarted], [u].[SilverBadges], [u].[TeamId], [u].[TeamName], [u].[TimedPenaltyDate], [u].[Title], [u].[UserTypeId], [u].[WebsiteUrl]
FROM [Users] AS [u]
WHERE [u].[Id] IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623, 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636, 637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649, 650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662, 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714, 715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 726, 727, 728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753, 754, 755, 756, 757, 758, 759, 760, 761, 762, 763, 764, 765, 766, 767, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779, 780, 781, 782, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792, 793, 794, 795, 796, 797, 798, 799, 800, 801, 802, 803, 804, 805, 806, 807, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 818, 819, 820, 821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831, 832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857, 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883, 884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909, 910, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 932, 933, 934, 935, 936, 937, 938, 939, 940, 941, 942, 943, 944, 945, 946, 947, 948, 949, 950, 951, 952, 953, 954, 955, 956, 957, 958, 959, 960, 961, 962, 963, 964, 965, 966, 967, 968, 969, 970, 971, 972, 973, 974, 975, 976, 977, 978, 979, 980, 981, 982, 983, 984, 985, 986, 987, 988, 989, 990, 991, 992, 993, 994, 995, 996, 997, 998, 999, 1000);
EF Core version: 2.1.4
Database Provider: Microsoft.EntityFrameworkCore.SqlServer
Operating system: Windows 2016/Windows 10
IDE: Visual Studio 2017 15.9
EF Core should parameterize here. Instead of IN (1, 2, 3, ...)
we should see IN (@__ids1, @__ids2, @__ids3, ...)
or similar. This would allow query plan cache to be shared. For example if we ran this 1,000 times to fetch 1,000,000 users in batches, we'd have 1 plan in cache, whereas today we have 1,000 plans. Let's say a user gets removed (or added!) on page 2 of 1,000...today we'd calculate and cache another 999 plans next run.
To further address the cardinality problem, an approach similar to what Dapper does would be at least a starting point. We generate only lists of certain sizes, let's just say 5, 10, 50, 100, 500, 1000. Here's an example with 3 parameters:
SELECT [u].[Id], [u].[AcceptRateAccepted], [u].[AcceptRateAsked], [u].[AccountId], [u].[AnswerCount], [u].[BronzeBadges], [u].[CreationDate], [u].[DaysVisitedConsecutive], [u].[DaysVisitedTotal], [u].[DisplayName], [u].[Email], [u].[Flags], [u].[GoldBadges], [u].[HasAboutMeExcerpt], [u].[HasReplies], [u].[IsVeteran], [u].[JobSearchStatus], [u].[LastAccessDate], [u].[LastDailySiteAccessDate], [u].[LastEmailDate], [u].[LastLoginDate], [u].[LastLoginIP], [u].[LastModifiedDate], [u].[Location], [u].[OptInEmail], [u].[PreferencesRaw], [u].[ProfileImageUrl], [u].[QuestionCount], [u].[RealName], [u].[Reputation], [u].[ReputationMonth], [u].[ReputationQuarter], [u].[ReputationSinceLastCheck], [u].[ReputationToday], [u].[ReputationWeek], [u].[ReputationYear], [u].[SignupStarted], [u].[SilverBadges], [u].[TeamId], [u].[TeamName], [u].[TimedPenaltyDate], [u].[Title], [u].[UserTypeId], [u].[WebsiteUrl]
FROM [Users] AS [u]
WHERE [u].[Id] IN (@ids1, @ids2, @ids3, @ids4, @ids5);
The 5 Ids are so that 1-5 values all use the same plan. Anything < n
in length repeats the last value (don't use null
!). In this case let's say our values are 1
, 2
, and 3
. Our parameterization would be:
@ids1 = 1;
@ids2 = 2;
@ids3 = 3;
@ids4 = 3;
@ids5 = 3;
This fetches the users we want, is friendly to the cache, and lessens the amount of generated permutations for all layers.
To put this in perspective, at Stack Overflow scale we're generating millions of one-time-use plans needlessly in EF Core (the parameterization in Linq2Sql lowered this to only cardinality permutations). We alleviate the cache issue by enabling ad-hoc query mode on our SQL Servers, but that doesn't lessen the CPU use from all the query hashes involved except in the (very rare) reuse case of the exact same list.
This problem is dangerous because it's also hard to see. If you're looking at the plan cache, you're not going to see it by any sort of impact query analysis. Each query, hash, and plan is different. There's no sane way to group them. It's death by a thousand cuts you can't see. I'm currently finding and killing as many of these in our app as I can, but we should fix this at the source for everyone.
Related issue: https://github.com/aspnet/EntityFrameworkCore/issues/12777
I wrote an extension to EF6 to handle this exact problem for our product and apparently it is extremely similar to what Nick is proposing above. By replacing this where clause: .Where(entity => myEnumerable.Contains(entity.prop))
with this extension method: .BatchedWhereKeyInValues(entity => entity.prop, myEnumerables, batchSize: 100)
, my solution will break the list of values into batches, and then produce a properly parameterized query that generates the desired SQL.
For example, this expression:
query.BatchedWhereKeyInValues(q => q.Id, values: {1,2,5,7,8,9}, batchSize: 3)
would effectively become the following two queryables, where the numbers are supplied as variables that allow EF to parameterize the query rather than embed them as constants:
query.Where(q => q.Id == 1 || q.Id == 2 || q.Id == 5)
query.Where(q => q.Id == 7 || q.Id == 8 || q.Id == 9)
EF then converts those LINQ expressions into this SQL expression:
WHERE Id IN (@p__linq__0, @p__linq__1, @p__linq__2)
Check out the source here: https://gist.github.com/kroymann/e57b3b4f30e6056a3465dbf118e5f13d
@kroymann Works like a charm! Big Thanks!
```c#
public static class QueryableExtension
{
internal static IEnumerable
Expression
{
List
int lastBatchSize = distinctValues.Count % batchSize;
if (lastBatchSize != 0)
{
distinctValues.AddRange(Enumerable.Repeat(distinctValues.Last(), batchSize - lastBatchSize));
}
int count = distinctValues.Count;
for (int i = 0; i < count; i += batchSize)
{
var body = distinctValues
.Skip(i)
.Take(batchSize)
.Select(v =>
{
// Create an expression that captures the variable so EF can turn this into a parameterized SQL query
Expression<Func<TKey>> valueAsExpression = () => v;
return Expression.Equal(keySelector.Body, valueAsExpression.Body);
})
.Aggregate((a, b) => Expression.OrElse(a, b));
if (body == null)
{
yield break;
}
var whereClause = Expression.Lambda<Func<TQuery, bool>>(body, keySelector.Parameters);
yield return queryable.Where(whereClause);
}
}
// doesn't use batching
internal static IQueryable
Expression
{
TKey[] distinctValues = values.Distinct().ToArray();
int count = distinctValues.Length;
for (int i = 0; i < count; ++i)
{
var body = distinctValues
.Select(v =>
{
// Create an expression that captures the variable so EF can turn this into a parameterized SQL query
Expression<Func<TKey>> valueAsExpression = () => v;
return Expression.Equal(keySelector.Body, valueAsExpression.Body);
})
.Aggregate((a, b) => Expression.OrElse(a, b));
var whereClause = Expression.Lambda<Func<TQuery, bool>>(body, keySelector.Parameters);
return queryable.Where(whereClause);
}
return Enumerable.Empty<TQuery>().AsQueryable();
}
}
**Usage:**
```c#
int[] a = Enumerable.Range(1, 10).ToArray();
var queries = QueryableExtension<User>.WhereIn(dbContext.Users, t => t.Id, a, 5);
foreach (var queryable in queries)
{
_ = queryable.ToArray();
}
var everything = QueryableExtension<User>.WhereIn(dbContext.Users, t => t.Id, a);
_ = everything.ToArray();
Also such queries don't aggregate well in logging systems like Application Insights, cause every query has unique sql.
I just wanted to share here some lexperiments I did some time ago with different alternative approaches to Enumerable.Contains using existing EF Core APIs:
https://github.com/divega/ContainsOptimization/
The goal was both to find possible workarounds, and to explore how we could deal with Contains in the future.
I timed the initial construction and first execution of the query with different approaches. This isn't a representative benchmark because there is no data and it doesn't measure the impact of caching, but I think it is still interesting to see the perf sensitivity to the number of parameters.
The code tests 4 approaches:
Test("standard Contains", context.People.Where(p => (ids.Contains(p.Id))));
Test("Parameter rewrite", context.People.In(ids, p => p.Id));
Test("Split function",
context.People.FromSql(
$"select * from People p where p.Id in(select value from string_split({string.Join(",", ids)}, ','))"));
Test("table-valued parameter",
context.People.FromSql(
$"select * from People p where p.Id in(select * from {context.CreateTableValuedParameter(ids, "@p0")})"));
This is included as a baseline and is what happens today when you call Contains with a list: the elements of the list get in-lined as constants in the SQL as literals and the SQL cannot be reused.
This approaches tries to rewrite the expression to mimic what the compiler would produce for a query like this:
var p1 = 1;
var p2 = 2;
var p3 = 3;
context.People.Where(p => (new List<int> {p1, p2, p3}.Contains(p.Id))).ToList();
It also "bucketizes" the list, meaning that it only produces lists of specific lengths (powers of 2), repeating the last element as necessary, with the goal of favoring caching.
This approach only works within the limits of 2100 parameters in SQL Server, and sp_executesql takes a couple of parameters, so the size of the last possible bucket is 2098.
Overall this seems to be the most expensive approach using EF Core, at least on initial creation and execution.
This approach was mentioned by @NickCraver and @mgravell as something that Dapper can leverage. I implemented it using FromSql. It is interesting because it leads to just one (potentially very long) string parameter and it seems to perform very well (at least in the first query), but the STRING_SPLIT function
is only available in SQL Server since version 2016.
For this I took advantage of the ability to create table-valued parameters that contain an IEnumerable<DbDataRecord>
(I used a list rather than a streaming IEnumerable, but not sure that matters in this case) and pass them as a DbParameter in FromSql. I also added a hack to define the table type for the parameter if it isn't defined. Overall this seems to be the second most lightweight approach after the split function.
@divega Just stumbled upon this: https://stackoverflow.com/questions/25228362/how-to-avoid-query-plan-re-compilation-when-using-ienumerable-contains-in-entity
Here's an alternative solution I discovered thanks to a semi-related SO post by @davidbaxterbrowne:
c#
var records = await _context.Items
.AsNoTracking()
.FromSql(
"SELECT * FROM dbo.Items WHERE Id IN (SELECT value FROM OPENJSON({0}))",
JsonConvert.SerializeObject(itemIds.ToList()))
.ToListAsync();
It's specific to SQL Server 2017+ (or Azure SQL Server), and in my testing with a list of 50,000 IDs against a remote Azure SQL database it was 5x faster than a standard batching solution (dividing into lists of 1,000 IDs).
I recently put together an implementation for my day job to address this issue (targeting EF Core v3.1.3
)
I would be willing to put together a PR for this issue if the EF Core team would confirm how they want me to address a few of the limitations I had to work around.
IRelationalCommand
is generated:The IN clause parameter expansion happens at:
RelationalCommandCache.GetRelationalCommand(..)
-> ParameterValueBasedSelectExpressionOptimizer.Optimize(..)
-> InExpressionValuesExpandingExpressionVisitor.Visit(..)
One of the limitations I had to work around is the parameters dictionary being IReadOnlyDictionary<string, object>
. In order to parameterize the IN clause I need the ability to add parameters into this dictionary. I could work around this in a solution PR by casting the dictionary back to IDictionary<>
or by changing the call signatures to pass through the non read-only dictionary interface.
How would the EF Core team like this to be addressed?
Controlling batch size, currently I handle this with two solutions:
2a. I use an IDbOptionsExtension
to control enabling/disabling the IN clause parameterization and to supply the default batch size to pad the parameter count to.
What dependency objects/names would the EF Core team would want me to use for these settings?
2b. I provided my team a .WhereIn(predicate, batchSize)
extension method to use. This extension method behaves identically to a normal .Where(predicate)
and the query method translation process extracts the batch size and stores it. This batch size overrides the default value from the options extension when used.
Would the EF Core team want me to include this extension method in my PR?
I have updated one of @divega 's proposals above for 3.1: https://erikej.github.io/efcore/sqlserver/2020/03/30/ef-core-cache-pollution.html
It is good to see that "advanced" solutions have been proposed here. Is it feasible to create a rather simple but quick fix first? Simply making the list a bunch of parameters (e.g. @p0, @p1, ...
) would be much better than the literals. That way there's one plan for each list count in contrast to one plan for each set of constants.
Might be interesting for others, here's how we solved the problem by using LINQKit.Core
:
var predicate = PredicateBuilder.New<User>();
foreach (var id in ids)
{
predicate = predicate.Or(user => user.Id == id);
}
var throwAway = context.Users.Where(predicate).ToList();
This will generate a parameterized query that looks like this:
SELECT ...
FROM [Users] AS [u]
WHERE [u].[Id] = @id1
OR [u].[Id] = @id2
OR [u].[Id] = @id3
...
So do quoted queries and blackslashes lead to SQL injection attacks? E.g. a value of
legit_value\\'); SELECT * FROM some_table limit 10; --
.
I'm surprised that this is not treated as a security issue.
@cjstevenson - The value you provided as example does not cause any SQL injection. EF Core escapes quotes properly in the SQL.
Generated SQL for example
C#
SELECT [b].[Id], [b].[Value]
FROM [Blogs] AS [b]
WHERE [b].[Value] IN (N'legit_value\''); SELECT * FROM some_table limit 10; -- ')
This is _not_ a security issue. This issue has nothing to do with values and SQL injection, please file a new issue with repro steps if you are seeing a bug. Please refer to security policy if you are submitting a report for security issue.
Thanks for pointing me to the security policy.
@ajcvickers as someone interested in this issue and your plans for 6.0, I was wondering what the categories mean in your EF Core Backlog. This has gone from Proposed to Backlog - does that mean it was proposed for the backlog and now it's in the backlog? And then next step would be to consider moving to Committed Next Release?
@snappyfoo we're still working out the mechanics in our planning process and the Github project board. The important thing is that this issue has the consider-for-next-release label, which means it's definitely on the table for 6.0.
Been investigating solutions in the meantime and found that the string_split
function idea from above can be separated out into its own queryable for flexible reuse. For example...
Add keyless entity to your model:
```C#
public class StringSplitResult
{
public string Value { get; set; }
}
````
Some methods like this on your context:
```C#
public IQueryable
SplitString(string.Join(",", values));
public IQueryable
Set
.FromSqlInterpolated($"SELECT Value FROM string_split({values}, {separator})")
.Select(x => x.Value);
You could do something like this:
```C#
var codes = new[] {"CODE1", "CODE2"};
var item = context.Set<Entity>()
.Where(entity => context.AsQuery(codes).Contains(entity.Code))
.FirstOrDefault();
Which produces SQL:
Microsoft.EntityFrameworkCore.Database.Command: Information: Executed DbCommand (21ms) [Parameters=[p0='CODE1,CODE2' (Size = 4000), p1=',' (Size = 4000)], CommandType='Text', CommandTimeout='30']
SELECT TOP(1) [e].[Id], [e].[Code]
FROM [Entities] AS [e]
WHERE EXISTS (
SELECT 1
FROM (
SELECT Value FROM string_split(@p0, @p1)
) AS [s]
WHERE ([s].[Value] = [e].[Code]) OR ([s].[Value] IS NULL AND [e].[Code] IS NULL))
Most helpful comment
I have updated one of @divega 's proposals above for 3.1: https://erikej.github.io/efcore/sqlserver/2020/03/30/ef-core-cache-pollution.html