I've been investigating an issue when using AutoMapper's Queryable projection extensions on top of EF Core 2.1-rc1-final. Full details, including troubleshooting logs with the SQL statements, are at the issue (https://github.com/AutoMapper/AutoMapper/issues/2639), and EfAutomapperTroubleshooting.zip is a repro.
Basically I've got a Booking which has multiple BookingItems. I have a DTO for each type and have configured them to be used by AutoMapper. When I execute the query on EF6, the JOIN optimization occurs (it's even smart enough that I don't need to specify the .Include(b => b.BookingItems)). On EF Core, however, two DB queries fire (one for Booking and the other for BookingItems). Explicitly specifying the Include could potentially work, but it's ignored by the "IgnoredIncludes" cleverness, so I can't force that behaviour even if I wanted to.
So there are two issues running on EF Core that I'm trying to get past:
Obviously I understand that EF Core is not EF6, but I'm just wondering if this will ever be supported, as it's a common scenario these days. The AutoMapper projection extensions are awesome. I'll try to do this without AutoMapper and see if that can be my workaround for now.
See the linked issue logged with AutoMapper and the downloadable .sln zip in my later comment.
EF Core version: 2.1.0-rc1-final
Database Provider: Microsoft.EntityFrameworkCore.SqlServer
Operating system: Windows 10 Pro, Version: 1803, OS build: 17134.48
IDE: Visual Studio 2017 15.7.1
I also tried it with a straight-up LINQ query and hand-written projections, and the same 2 queries are generated. This isn't really an option for me even if it did work, though, as I'm building a GraphQL service that passes required field names down from the client; I then use the .ProjectTo<T>(null, requiredFieldNames) overload to basically have AutoMapper generate the perfect SQL query for me. It'll be magical if it can work end-to-end and produce the JOIN I'd like.
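For reference, the call in question looks roughly like this (a sketch; requiredFieldNames is assumed to hold the member names derived from the incoming GraphQL selection set):

```c#
// Sketch of the AutoMapper-driven projection described above (usage mirrors
// the overload named in this comment; requiredFieldNames is an assumed input).
using AutoMapper.QueryableExtensions;

var bookingDtos = db.Bookings
    .AsNoTracking()
    .ProjectTo<BookingDto>(null, requiredFieldNames)
    .ToArray();
```

The hand-written equivalent of that projection is below: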
bookingDtos = db.Bookings
    .AsNoTracking()
    .Select(b => new BookingDto
    {
        Id = b.Id,
        Name = b.Name,
        BookingItems = b.BookingItems.Select(bi => new BookingItemDto
        {
            Id = bi.Id,
            Name = bi.Name
        }).ToList()
    })
    .ToArray();
info: Microsoft.EntityFrameworkCore.Database.Command[20101]
Executed DbCommand (4ms) [Parameters=[], CommandType='Text', CommandTimeout='30']
SELECT [b].[Id], [b].[Name]
FROM [Booking] AS [b]
ORDER BY [b].[Id]
info: Microsoft.EntityFrameworkCore.Database.Command[20101]
Executed DbCommand (7ms) [Parameters=[], CommandType='Text', CommandTimeout='30']
SELECT [t].[Id], [b.BookingItems].[Id] AS [Id0], [b.BookingItems].[Name], [b.BookingItems].[BookingId]
FROM [BookingItem] AS [b.BookingItems]
INNER JOIN (
SELECT [b0].[Id]
FROM [Booking] AS [b0]
) AS [t] ON [b.BookingItems].[BookingId] = [t].[Id]
ORDER BY [t].[Id]
It's not N+1, it's just 2 queries. 2 queries instead of 1 is by design. What is your specific issue with that?
Sorry, you're right. So just to clarify, you are saying it's by design to do 2 queries when it can be achieved with one using a JOIN? If that's the case then no worries, it's just surprising.
@benmccallum - Yes, it is by design. While it can be achieved with 1 query, that is not necessarily more performant: a single query for a collection (include or join) causes the dreaded cartesian product.
To give an example, let's assume you have 50 bookings, and each booking has 5 bookingitems. With 1 query, it will do a join between both and cause 250 rows to be returned from the server, and each booking's data will be repeated exactly 5 times. If you add another collection include (assume 10 bookingdetails per booking), the total size of the data is now 2500 rows, with a lot of duplicated data present.
With split queries, we send 1 query for each collection include. So you get 3 queries with 50/250/500 results each, with no duplication of data. Reference includes are still done using a join in a single query. Of course, there is a trade-off based on cardinality which can make either multiple queries or a single query more efficient. Due to join explosion in the case of multiple collection joins, we decided to adopt the 2-query approach in EF Core.
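To picture that duplication concretely, here is a minimal sketch (assuming the Booking/BookingItem model and counts from the example, and a BookingItems DbSet) of the flattened shape a single-query JOIN produces:

```c#
// A single-query JOIN returns one row per (booking, item) pair, so each
// booking's columns repeat once per item: 50 bookings x 5 items = 250 rows.
var flattened =
    from b in db.Bookings
    join bi in db.BookingItems on b.Id equals bi.BookingId
    select new { b.Id, b.Name, ItemId = bi.Id, ItemName = bi.Name };
```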
Thanks heaps for taking the time to explain this @smitpatel. That makes total sense. Closing this out.
There are 2 independent queries, which means the second query does nothing with the first query, so I still don't understand how running 2 queries prevents data duplication!
@elyasnew - They are not independent queries. The 2nd query fetches the related data (for the collection include) for the records in the first query. If you want all the data in one query then you need a join, which will duplicate the data in the first query for every related record in the 2nd query.
@benmccallum - Yes, it is by design. While it can be achieved with 1 query, that is not necessarily more performant: a single query for a collection (include or join) causes the dreaded cartesian product.
To give an example, let's assume you have 50 bookings, and each booking has 5 bookingitems. With 1 query, it will do a join between both and cause 250 rows to be returned from the server, and each booking's data will be repeated exactly 5 times. If you add another collection include (assume 10 bookingdetails per booking), the total size of the data is now 2500 rows, with a lot of duplicated data present.
With split queries, we send 1 query for each collection include. So you get 3 queries with 50/250/500 results each, with no duplication of data. Reference includes are still done using a join in a single query. Of course, there is a trade-off based on cardinality which can make either multiple queries or a single query more efficient. Due to join explosion in the case of multiple collection joins, we decided to adopt the 2-query approach in EF Core.
Thanks for the clarification. I think it is really important to split the queries in some cases where you have too much duplicate data, but can you make it optional in case some people don't want to use it that way? Something like: _db.Blogs.Include(b=>b.Posts).ThenInclude(p=>p.Comments).Where(b=>b.Id == 1).SingleQuery().ToList();
That would be useful especially when returning a single blog; we do not need to split queries for a single blog. I know some scenarios where the duplicate data is larger than the actual data, but for the most part people might want to use a single query, especially when you take into account that round-trip time and statement processing make up most of the query time.
@smitpatel I would like to reopen the discussion about this issue.
I just wrote a single LINQ query and expected a single SQL query to be sent to the database. See the referenced issue for more information. But I was astonished that Entity Framework Core did something totally different and even resorted to joining tuples in memory. This was unexpected.
This said, I do believe that when a developer writes one LINQ query, he expresses the intention to run that LINQ query as a single SQL query, and writes it such that it can be run as a single SQL query against the database. If the developer did not want to offload that operation to the database, he would write two LINQ queries and join the results in the application server.
I do believe that the developer should have a choice. It is the developer's responsibility to write LINQ queries that implement the functional requirements in a sane way. The decision to generate two queries makes it impossible to implement streaming techniques over several hundred thousand records with their related records, while it silently increases the memory requirements of the machine that runs the application server.
Can I somehow persuade the Entity Framework Core team to rethink the original decision? I do not think the Entity Framework Core team has to protect developers from writing bad LINQ queries. For instance, I would not mind if Entity Framework Core threw an exception if it is unable to generate a sensible SQL query from a bad LINQ query.
@smitpatel see related discussion in https://github.com/npgsql/Npgsql.EntityFrameworkCore.PostgreSQL/issues/712
@dpsenner @roji Thanks, this is indeed an interesting discussion. I would like to point out that although EF Core creates a separate query whenever you project or include collections, for each of these cases there is generally an equivalent query (that is, a query that returns the exact same rows) that is translated by EF Core to a single query with JOINs. E.g.:
// this will translate as three SQL queries
db.Blogs.Include(b=>b.Posts).ThenInclude(p=>p.Comments).Where(b=>b.Id == 1).ToList();
// this will translate as a single SQL query
db.Comments.Include(b=>b.Post).ThenInclude(b=>b.Blog).Where(b=>b.Post.Blog.BlogId == 1).ToList();
The difference between the two queries is that you are traversing the navigation properties in the opposite direction.
For the first query, we interpret that your main intent is to get a Blog that has a BlogId equal to 1. Your secondary intent is to get related information about the posts and the comments. Based on this, we translate to one query to get the blog, and two additional queries to get the related posts and comments. In this case, your query is traversing the navigation properties from the one end toward the many end. Had we tried to translate to a single query, we would have caused the number of times the blog's data appears in the rows to explode.
For the second query, we interpret that your primary intent is to get the comments for a specific blog, and your secondary intent is to get the data for the related posts and blogs. Our hypothesis is that you already expect the cardinality of the results to be the same as the number of comments for that blog in the database. You are traversing navigations from the many ends towards the one ends, each time narrowing the cardinality of the additional data you are asking for. So when we translate this to a single query, the number of rows in the result is going to be the same as the number of rows if you were querying for the comments alone. However, we are duplicating the data in all the columns of post and blog.
You are welcome to think that the distinction between the two queries (the direction in which you are traversing relationships) is arbitrary, and we could have used the same translation. However I would argue that we are simply using the order in which navigations are being traversed as a hint on how we translate the query. A useful way to think about how EF Core decides whether to translate to one or multiple SQL queries is that it is a heuristic that we apply to avoid the explosion of the cardinality of the root of the query that you are consuming. In a way, it is an attempt to achieve least astonishment, but focused on that specific aspect of the query.
Also regarding the assertion that this doesn't allow streaming: for the first query, we actually still stream the results of Blog on the same query. We buffer the results from Posts and Comments, but if your database (and connection string) supports multiple active result sets (MARS), we only buffer them incrementally. E.g. we will buffer together all the posts for the first blog, then when you skip to the next blog, we will buffer all the posts for the second blog, etc.
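(For SQL Server, MARS is opt-in via the connection string; a minimal sketch, with placeholder server/database names:)

```c#
// MARS must be enabled explicitly in SQL Server connection strings:
var connectionString =
    "Server=.;Database=Bookings;Trusted_Connection=True;MultipleActiveResultSets=True";
```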
In any case, we are reopening this issue to discuss if there are other ways we can do this in the future. Obviously, it becomes much harder or impossible for you to write a query that gets resolved as single query if the reverse navigation properties are absent in the model, or if one of the leaf entities is an owned entity.
@divega that's very useful information I didn't know - thanks! I'd suggest maybe adding this to the docs, as I'm sure not many people are aware of the distinction and the two SQL generation options. Maybe an "advanced query techniques" page would be appropriate for this kind of thing...
At least IMO that's quite a satisfactory state of affairs... Even if the "default" mode (traversal from one to many) only buffers the one end, it's understandable that in some cases users may wish to avoid it, even at the expense of a cartesian product explosion. You seem to already provide a way to express it, and requiring that reverse navigation be present doesn't seem like a big problem to me. Owned entities may present a more serious problem though.
One last somewhat-related note... Npgsql indeed does not support MARS - you can only have one reader open on a connection at a given point in time. However, PostgreSQL does support SQL server-side cursors, where you start a query with DECLARE and progressively fetch results with FETCH, allowing multiple query resultsets to be traversed simultaneously. At some point I considered implementing MARS support in Npgsql over cursors (via lower-level support in the protocol) but ended up deciding against it for several reasons - and in any case the option is there for users to do so themselves with DECLARE/OPEN. Treating this as a higher-level SQL-based MARS, in theory EF Core could use it to prevent buffering even when MARS isn't supported - I think it's even part of the SQL standard and implemented across databases. That may be a bit optimistic though; IIRC MySQL only allows cursors in stored procedures, and PostgreSQL itself requires cursors to be in a transaction, which also complicates things.
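To make the idea concrete, here is a rough sketch of the DECLARE/FETCH pattern driven directly through Npgsql (connection details and names are illustrative; note the transaction requirement mentioned above):

```c#
using Npgsql;

// PostgreSQL cursors must live inside a transaction. Each open cursor can be
// FETCHed independently, giving MARS-like interleaving on a single connection.
using (var conn = new NpgsqlConnection("Host=localhost;Database=blogs;Username=app"))
{
    conn.Open();
    using (var tx = conn.BeginTransaction())
    {
        using (var declare = new NpgsqlCommand(
            "DECLARE blog_cur CURSOR FOR SELECT id, name FROM blogs", conn, tx))
        {
            declare.ExecuteNonQuery();
        }

        using (var fetch = new NpgsqlCommand("FETCH 100 FROM blog_cur", conn, tx))
        using (var reader = fetch.ExecuteReader())
        {
            while (reader.Read())
            {
                // process one page of blogs; other cursors stay open meanwhile
            }
        }

        tx.Commit();
    }
}
```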
Just dumping a bit more context/options for us to think about...
@roji agreed. Two other "crazy ideas" that come to mind re streaming and MARS:
1. Someday we could just change how DbContext works to open additional connections when MARS is required but not supported and we are only doing reads.
2. Assuming you implement connection multiplexing, you could do something homologous at the DbConnection vs. physical connection level, i.e. they don't need to be 1:1 anymore.
I'm glad to see that ef core team is open to discuss this issue.
@divega If a DbContext begins to spawn connections, I raise serious concerns regarding scalability.
As for the two queries mentioned above, I have totally different expectations of how this should behave.
// this will translate as three SQL queries
db.Blogs.Include(b=>b.Posts).ThenInclude(p=>p.Comments).Where(b=>b.Id == 1).ToList();
// this will translate as a single SQL query
db.Comments.Include(b=>b.Post).ThenInclude(b=>b.Blog).Where(b=>b.Post.Blog.BlogId == 1).ToList();
I expect both LINQ queries to map to one SQL query each. If I did want to run this as separate queries, I could implement it in the application this way:
// first query
var blogs = await dbContext.Blogs.Include(b => b.Posts).ToListAsync();
foreach (var blogPost in blogs.SelectMany(b => b.Posts)) {
    // n queries
    var blogPostComments = await dbContext.Comments.Where(c => c.PostId == blogPost.Id).ToListAsync();
    // do whatever needs to be done
}
or even do fancy stuff with batching:
// process blogs in batches of 100 (InBatchesOf is a custom helper; a sketch follows below)
foreach (var blogs in dbContext.Blogs.InBatchesOf(100)) {
    // fetch all blog posts for the set of blogs in one query and aggregate them into a lookup
    var blogPostsLookup = dbContext.Posts.Where(p => blogs.Select(b => b.Id).ToArray().Contains(p.BlogId)).ToLookup(t => t.BlogId, t => t);
    // then process all blog posts for each blog
    foreach (var blog in blogs) {
        foreach (var blogPost in blogPostsLookup[blog.Id]) {
            // fancy stuff
        }
    }
}
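For completeness, InBatchesOf is not an EF API; a minimal sketch of such a helper, assuming Skip/Take paging (a stable OrderBy should be applied by the caller for deterministic pages):

```c#
using System.Collections.Generic;
using System.Linq;

// Hypothetical batching helper used in the sketch above.
public static class QueryableBatchingExtensions
{
    public static IEnumerable<List<T>> InBatchesOf<T>(this IQueryable<T> source, int batchSize)
    {
        for (var skip = 0; ; skip += batchSize)
        {
            // one round trip per page
            var batch = source.Skip(skip).Take(batchSize).ToList();
            if (batch.Count == 0)
                yield break;

            yield return batch;

            if (batch.Count < batchSize)
                yield break;
        }
    }
}
```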
This said, LINQ-to-SQL translation should map LINQ to SQL, not do fancy logic.
Thanks for the tips about reversing the query traversal to make a single query.
In my case, I'm translating GraphQL queries down to DB queries via AutoMapper's .ProjectTo<TDto>(fieldsToMap), so the query has to be driven top-down, meaning I'll always get the query splitting, even if I'm only requesting one top-level record (a blog in the examples above).
For my situation to be optimized, I'd need to be able to tell EF that I'm only retrieving one blog somehow - or is it smart enough to recognise that the query passes a primary key and/or TOP 1? If not, feature request!
@divega
Someday we could just change how DbContext works to open additional connections when MARS is required but not supported and we are only doing reads.
Definitely an option! Although we'd have to watch out for transactions, which also affect reads...
Assuming you implement connection multiplexing, you could do something homologous at the DbConnection vs. physical connection level. I.e. they don't need to be 1:1 anymore.
Yeah - that resembles your 1st suggestion in a way: since every query can be dispatched to an arbitrary physical connection, there's no longer a constraint on having two queries on the same connection. The same caveat with transactions applies though.
But multiplexing introduces a problem the other way - if you have multiple queries queued up for different producers on the same connection, and the 2nd producer wants to process results but the 1st one still hasn't, then we have a problem... We can either delay the 2nd one (seems unacceptable) or buffer the 1st resultset in memory - in which case we've killed streaming... I still have to think about multiplexing a bit more, but it definitely creates some issues in this area.
@dpsenner, here's my take on your argument:
I expect both linq queries to map to one sql query each. If I did want to run this as separate queries I could implement it in the application in this way:
The problem, again, is that executing this kind of query as a single SQL query can be extremely bad for performance, because of the reasons that have already been explained above: duplication of principal entity fields, leading to tons of unnecessary data being transferred and thrown away. This simply doesn't seem like the right option for the default use case in an average application.
To summarize, when deciding what EF Core should do by default, the decision seems to be between:
1. Always translating to a single SQL query: streaming always works, but principal data is duplicated (the cartesian product problem).
2. Splitting into multiple SQL queries: no duplication, but streaming depends on MARS support.
To me, the decision between the two options isn't trivial. However, since the streaming issue isn't universal (MARS is supported by some providers and could be supported in others), and I don't really subscribe to the idea that one LINQ query should produce one SQL query, the 2nd really seems superior.
One last point is whether the multiple queries produced in the 2nd option are batched. If they're not, we're adding more database network roundtrips, which isn't good - but this could be fixed in EF Core (#10878 tracks this).
Finally, the idea suggested by @divega above could indeed mitigate the streaming concern when provider MARS isn't available, but I agree with your point that it hogs more connections and so could be problematic.
I'm glad that you too share the opinion that entity framework core should not be allowed to spawn connections.
As said, for me Entity Framework is an object-relational mapping framework. As such, it is responsible for mapping LINQ to a SQL command and mapping the relational results to an object model. If a LINQ expression is badly formulated, it should result in a NotImplementedException or NotSupportedException. This behavior is comparable to a badly formulated SQL query: nobody would expect the database engine to autocomplete missing aggregate functions in combination with GROUP BY. When faced with an exception, the developer will figure out a way to formulate a LINQ expression that gets the data from the database and mapped into the application server. I know very well that this is a very strict viewpoint that not many people share. But under this viewpoint, Entity Framework Core no longer has to implement guesswork on LINQ query trees, like the ordering of property traversals mentioned in @divega's comment. Further, developers get the benefit of being able to implement sane and stable application servers.
Of course some developers will write bad queries. They will also be able to write code that crashes. Nobody expects the compiler to protect them from doing so.
I personally don't really agree with much of what you wrote (my views are my own):
As said, for me entity framework is an object relational mapping framework. As such it is responsible to map linq to a sql command and map the relational results to an object model.
I don't think there's anything in the idea of an ORM that requires a single LINQ query to be translated to a single SQL query - ORMs map objects to relational databases, and can do so in various ways. LINQ specifically is a pretty .NET-specific concept, so it seems odd to say that the concept of ORMs somehow necessarily implies something about how LINQ should be translated.
If a linq expression is badly formulated, it results in a NotImplementedException or NotSupportedException. This behavior is comparable to a badly formulated sql query.
I don't really think the analogy holds well - there's a very big difference between an incorrect expression or query, and the question of how to translate a correct one.
I think it's more appropriate to compare to compilers (or more accurately, transpilers), which translate from one language to another. These tools routinely perform optimizations and other manipulations to make sure your code runs in the best way possible - that's one of their main job descriptions. Following your logic, we might say that method inlining is a bad idea, because a C# method must always be translated to exactly one IL/assembly method...
Of course some developers will write bad queries.
Agreed, but we're not talking about a bad query, but rather about what happens with the default, standard EF Core join construct. At the end of the day, if the proposal is that EF Core translate the standard join syntax (Include()) via an SQL construct that results in extremely bad performance for the common scenario, then I think everyone will end up losing.
everyone will end up losing
Of course this should not happen.
SQL construct that results in extremely bad performance
Bad performance can also happen when a queried database column is missing an index or if the index has bad selectivity.
what happens with the default, standard EF Core join construct
Entity Framework Core should behave consistently and not do unexpected things. To me it was, and still is, unexpected that one LINQ query runs as two SQL queries, because I, being a human, would write it as one SQL query. Including hundreds of columns from different tables and joining them all together will always end up with huge result sets; I would not expect Entity Framework Core to do any magic about this. But is it really Entity Framework Core's responsibility to deal with these ~~broken~~ badly designed queries by loading everything into memory and doing the operation in the application server?
SQL construct that results in extremely bad performance
Bad performance can also happen when a queried database column is missing an index or if the index has bad selectivity.
Once again, we're not talking about bad perf because the programmer made a mistake or forgot an index - we're talking about the default API for joining entities in EF Core.
Entity Framework Core should behave consistently and not do unexpected things. To me it was, and still is, unexpected that one LINQ query runs as two SQL queries, because I, being a human, would write it as one SQL query. Including hundreds of columns from different tables and joining them all together will always end up with huge result sets; I would not expect Entity Framework Core to do any magic about this. But is it really Entity Framework Core's responsibility to deal with these ~~broken~~ badly designed queries by loading everything into memory and doing the operation in the application server?
I don't think there's anything inconsistent about EF Core's behavior - again, I'm not aware of any promise or guarantee anywhere to translate one LINQ query to one SQL query. As for finding this behavior unexpected, well, that's subjective... Think again about the example of method inlining and the compiler.
To recap, I do think there's an actual argument for providing streaming whenever possible. As for the two queries being unexpected, I just don't agree, and I'd hate to see EF Core penalize the performance of its regular APIs in the name of a principle which doesn't seem to benefit anyone in concrete terms...
I guess I don't have anything more to contribute to this discussion... Let's see what @divega thinks.
I reopened the issue because we definitely want to discuss making changes. For example, we have discussed always splitting the queries by default, no matter in which direction you traverse the relationships, and using strategies that can be even more efficient but may require buffering by default. If we do that, then it could be useful to provide a way to explicitly opt in to a potentially less efficient strategy that enables streaming, similar to @dpsenner's SingleQuery suggestion but possibly named AsStreaming.
I think I understand where the surprise comes from when you realize that an O/RM may produce more than one SQL statement for a LINQ query, but I agree with @roji that keeping it strictly 1:1 is not a goal (although we do want to avoid 1:N+1). Both LINQ and SQL are declarative and don't provide any guarantees about what strategies will be used to return the results.
While I don't agree that 1 LINQ query should map to 1 SQL query, given the lack of streaming in certain providers we could make the split query optional, based on whether MARS is enabled, in order to preserve streaming.
It could certainly cause the explosion if you have many includes, but I believe that expecting streaming & avoiding the cartesian product both together is a somewhat impossible feature request.
In the split scenario, there are ways to split the query yourself. It is easy for tracking queries, since the StateManager can populate navigations. It is a bit more difficult for non-tracking queries (but still not impossible).
Though given both the issues - streaming and the cartesian product - if MARS is supported then 2 queries should be the optimal default.
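For the non-tracking case, a possible sketch of that manual split (entity and DbSet names assumed from the Booking/BookingItem repro earlier in the thread): fetch the parents, fetch all children in a single query, then stitch them together with a lookup.

```c#
// Manual split query for a no-tracking read: 1 query for bookings,
// 1 query for all their items, stitched in memory with no duplication.
var bookings = db.Bookings.AsNoTracking().ToList();

var bookingIds = bookings.Select(b => b.Id).ToList();
var itemsByBookingId = db.BookingItems.AsNoTracking()
    .Where(bi => bookingIds.Contains(bi.BookingId))
    .ToLookup(bi => bi.BookingId);

foreach (var booking in bookings)
{
    booking.BookingItems = itemsByBookingId[booking.Id].ToList();
}
```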
Thanks for all the valuable insights. I agree that both (streaming vs. query batching) are functional requirements that are orthogonal to each other, and Entity Framework cannot guess the developer's intention. Therefore I like the idea of opting in to either LINQ-to-SQL-query strategy with something like RequireSingleQuery(), AllowManyQueries(), or even UseQueryingStrategy(QueryingStrategy.SingleQuery). I have no objection to making ManyQueries the default strategy. The great benefit here is that such an option gives the developer a lot of control over how Entity Framework Core treats a LINQ query. A big plus one from me!
Glad I bumped into this.
We want to switch from plain SQL commands to Entity Framework, but we're unable to because we have queries with a lot of one-to-many relationships.
Example (only showing the relevant data here):
(from x in pmCtx.Projects
 select new ProjectGridItem()
 {
     Labels = string.Join(";", x.TblLabelsOfProject.Select(t => t.UidLabelNavigation.Label))
 });
This generates a query to fetch the labels per project.
Performance is terrible.
Our SQL command currently uses STUFF to do this, and it takes milliseconds to execute, compared to Entity Framework, which takes 10+ seconds.
I'm guessing a single query would fix my problem?
@aldrashan - The behavior you are seeing is called N+1. EF Core runs N additional queries, one for each result in the first query, because of the string.Join method in your projection, which is not translated to the server at this point. This issue talks about 2 queries, not N+1. The end result of this issue is not going to help your scenario in any direct way.
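A possible workaround sketch, assuming the same model: project the raw label list (which EF can translate), switch to client evaluation with AsEnumerable(), and only then apply string.Join. This should reduce the N+1 pattern to the split-query behavior discussed in this thread:

```c#
// Project the labels server-side, then do the string concatenation client-side.
var gridItems = pmCtx.Projects
    .Select(x => new
    {
        Labels = x.TblLabelsOfProject.Select(t => t.UidLabelNavigation.Label).ToList()
    })
    .AsEnumerable()
    .Select(x => new ProjectGridItem
    {
        Labels = string.Join(";", x.Labels)
    })
    .ToList();
```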
@smitpatel I do hope that's not the case. To reduce the complexity of what we discussed, I deliberately chose not to mention that we are actually seeing N+1 queries when we reference more than one entity. For instance, the following LINQ:
dbContext
    .UserRecords
    .Where(...)
    .Select(record => new
    {
        record.Id,
        record.DisplayName,
        record.FirstName,
        record.LastName,
        record.Note,
        record.IsEnabled,
        RoleNames = record.UserRoles.Select(t => t.Role.Name).ToArray(),
        TagNames = record.UserTags.Select(t => t.Tag.Name).ToArray(),
    })
results in three SQL queries, whereas a single SQL query performs better and, in the case of PostgreSQL, could look like:
select
t1.id, t1.display_name, t1.first_name, t1.last_name, t1.note, t1.is_enabled,
ARRAY(
select t2.name
from role t2
inner join user_role t3 on t3.role_id = t2.id where t3.user_id = t1.id
) as role_names,
ARRAY(
select t2.name
from tag t2
inner join user_tag t3 on t3.tag_id = t2.id where t3.user_id = t1.id
) as tag_names
from user t1
where ...
@dpsenner - The case you mentioned above is selecting from 2 different collection navigations, hence 3 queries, which is in line with this issue: a separate query is issued for each collection nav being projected out. Hence, I do not see how the example you mention now is in any way different from what has been discussed already.
I had overlooked the string.Join() part, sorry for the noise.
See also #14062, which is about concurrent updates between the two queries, when considering multiple or single query approach.
See also #9014, which is essentially a duplicate of this.
We have new findings to share. Recently we implemented new features and wanted to fetch an entity, from here on referred to as e1, along with two of its relations, from here on referred to as e2 and e3, in a single view. In this scenario we observe Entity Framework Core running one query to fetch e1 (e1.Id, e1.Name, e1.Class.Id, e1.Class.Name, e1.Class.Meaning, IsArchived), one query to fetch e2 (TagNames), and N+1 queries to fetch e3 (component.Id, component.Amount, component.AmountUnit, component.Name, component.Sequence) - one for every e1, with WHERE @_outer_Id = component.id.
return await Storage
    .Material
    .Where(e1 => e1.Id == id)
    .Select(e1 => new
    {
        e1.Id,
        e1.Name,
        Class = new
        {
            e1.Class.Id,
            e1.Class.Name,
            e1.Class.Meaning,
        },
        IsArchived = e1.Archive.Any(t1 => t1.During.Contains(now)),
        TagNames = e1.Tags.Select(t1 => t1.DomainObjectTag.Name).ToArray(),
        Components = e1.ItemsWhereComposite.Select(component => new
        {
            component.Id,
            component.Amount,
            AmountUnitId = component.AmountUnit.Id,
            AmountUnitName = component.AmountUnit.Name,
            AmountUnitMeaning = component.AmountUnit.Meaning,
            component.Component.Name,
            component.Sequence,
        }).ToArray(),
    })
    .SingleAsync(); // terminal operator assumed; the original snippet ended at the projection
As a human, I would probably write the query (against Postgres) as follows, because it yields the best behavior for our use case:
SELECT e1.id, e1.name, e2.id, e2.name AS class_name, e2.meaning as class_meaning, (
SELECT CASE
WHEN EXISTS (
SELECT 1
FROM material.material_archive AS t1
WHERE ((t1.during @> tooling.now_utc()) = TRUE) AND (e1.id = t1.material_id))
THEN TRUE::bool ELSE FALSE::bool
END
) AS is_archived,
ARRAY(
select
tag.name
from material.material_tag as material_tag
inner join fas.domain_object_tag as tag
on material_tag.domain_object_tag_id = tag.id
where
material_tag.material_id = e1.id
) as tag_names,
(
select array_agg(bill_of_material)
from
(
select
component.id,
component.sequence,
component.amount ,
unit.name,
unit.meaning,
component_material.id,
component_material.name
from
material.bill_of_material_item as component
inner join material.material as component_material
on component_material.id = component.ingredient_material_id
inner join measuring.unit as unit
on unit.id = component.unit_id
where
component.material_id = e1.id
) bill_of_material
) as components
FROM (
select * from material.f_filter_material_unarchived(NULL)
) AS e1
INNER JOIN material.material_class AS e2 ON e1.material_class_id = e2.id
ORDER BY e1.id
With a few linked issues where the multiple-query approach actually caused problems in the results, we decided to move to a single SQL query for all cases. Yes, it could cause cartesian product explosion, which users would be able to work around with manual loading patterns.
That is sad to hear; thanks for sharing the plan. With this outline we can adapt and fix our implementation accordingly. Unfortunately, that means we are going to write a lot of SQL queries and manual type conversions.
@dpsenner - Use of ARRAY in a query was always provider-specific and cannot be implemented inside the base EF Core provider. Apart from that, as I mentioned earlier, doing the split query manually is not too much code. Not ideal for all cases, but given all the pending features, providing multiple strategies to do the same include (like one query vs. many queries) introduces complexity in the code while providing little value. Of course, the value proposition could change based on feedback, in which case multiple strategies can be revisited in the future.
@dpsenner, regarding your query above, it's not out of the realm of possibility for Npgsql to translate ToArray() in a projection to array_agg() on PostgreSQL (and produce ARRAY() subqueries), but this definitely won't happen anytime soon (and definitely post-EF Core 3.0). As @smitpatel wrote, constructs that are specific to a particular database implementation are always going to be less supported (if at all).
Regardless, while the use of ARRAY() and array_agg() in this way may be elegant, it again creates subqueries, which can be extremely expensive - I suspect this will perform much slower on PostgreSQL than a simple, traditional join (even if the latter may end up sending more data over the network). This is very similar to our discussion in https://github.com/npgsql/Npgsql.EntityFrameworkCore.PostgreSQL/issues/712.
With a few linked issues where the multiple-query approach actually caused problems in the results, we decided to move to a single SQL query for all cases. Yes, it could cause cartesian product explosion, which users would be able to work around with manual loading patterns.
Given this, is there (or will there be) a way to load nested collections without looping? For example, given Trees with Branches with Leaves:
context.Entry(tree).Collection(p => p.Branches).Query().Include(b => b.Leaves).Load();
This would work but could result in a "cartesian explosion" after this change (and even now Includes in some situations aren't great because of #12022, which is why I want to move to manual loads now if I can), and this isn't possible:
context.Entry(tree).Collection(p => p.Branches).Load();
context.Entry(tree).Collection(p => p.Branches).Collection(c => c.Leaves).Load();
This would work but obviously does one query per Branch:
context.Entry(tree).Collection(p => p.Branches).Load();
foreach (var branch in tree.Branches)
{
    context.Entry(branch).Collection(b => b.Leaves).Load();
}
I would expect to be able to generate something that will select Leaves where Branch.Id IN (2, 5, 6, 9, ...) and then have EF associate each Leaf that comes back with the right Branch, but I'm not aware of any way to do that right now.
I would expect to be able to generate something that will select Leaves where Branch.Id IN (2, 5, 6, 9, ...) and then have EF associate each Leaf that comes back with the right Branch but I'm not aware of any way to do that right now.
If you are doing a tracking query then that is very easy to do:
context.Entry(tree).Collection(p => p.Branches).Load();
context.Set<Leaf>().Where(l => tree.Branches.Select(e => e.Id).Contains(l.BranchId)).ToList();
Thanks, that'll work for my purposes, although I have to capture tree.Branches.Select(e => e.Id) in a local, otherwise it evaluates the Contains client-side. I think some way to do this directly involving .Load() would be really useful, as I can easily see another developer (or me, if I forget) re-inlining the tree.Branches.Select(e => e.Id) and ruining performance, or just not understanding what the point of evaluating an IQueryable like this is, so I'll have to comment heavily each time I do it.
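For reference, the working form I mean is this kind of capture (a sketch, with the same Tree/Branch/Leaf names as above):

```c#
// Capture the key list in a local first so Contains translates to SQL IN (...);
// inlining tree.Branches.Select(e => e.Id) caused client evaluation on 2.2.
var branchIds = tree.Branches.Select(b => b.Id).ToList();
context.Set<Leaf>().Where(l => branchIds.Contains(l.BranchId)).ToList();
```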
I have to capture tree.Branches.Select(e => e.Id) in a local otherwise it evaluates the Contains client-side.
Which version of the package are you using? We have already improved that in 3.0 and it should not cause client eval anymore. There could still be bugs, but the plan in 3.0 is to make sure that inlining also does server eval.
Which version of the package are you using? We have already improved that in 3.0 and it should not cause client eval anymore.
I'm using 2.2.3. Glad to hear that's been improved.
I just got a surprise too, seeing my simple query behave differently than in EF6. When I write a query like this, I'm aware of the cartesian product and know it is a much smaller deal than running the sub-query separately for each row.
```c#
return (
    from r in DC.OrganizationUserRoles
    where bytForRoleId == 0 || (bytForRoleId > 0 && r.OrgUserRoleId == bytForRoleId)
    orderby r.OrgUserRoleId
    select new OrgRolePermissionInfo() {
        RoleId = r.OrgUserRoleId,
        RoleName = r.OrgUserRoleName,
        Permissions = (
            from dp in r.RolePermissions
            from p in DC.RolePermissions.Where(x => x.OrgId == forOrgId && x.OrgUserRoleId == dp.OrgUserRoleId && x.PermissionId == dp.PermissionId).DefaultIfEmpty()
            where dp.OrgId == 0
            select new OrgRolePermission() {
                PermissionId = dp.Permission.PermissionId,
                Value = p.PermissionValue,
                Types = dp.Permission.PermissionApplicables.ToList()
            }
        ).ToList()
    }
).ToList();
```
I would think that retrieving the four fields with every Type from PermissionApplicables would be far less costly than issuing a sub-query and running it to get the Types for each RolePermission separately.
This makes me pretty scared, actually, regarding performance.
As some people mentioned before, if I thought it would be more efficient to issue separate queries, I'd write it that way. In fact, I would probably write it with Contains to issue only one sub-query and then populate the Types for each row from the one result set I'd receive, which I would suppose (not sure, but I'd think so anyway) would be faster than issuing N separate queries to the database.
Am I wrong about any of this?
(I'm not sure why my code is not displaying properly... :( Can anyone help with that? I'm not usually commenting much, so I'm not sure if there is a bug on this site or if I'm doing something wrong by using the <> tool, which inserts backticks, between which I inserted the code.)
Did you measure??
No, I did not in this case, as I just found out that it issues so many queries. But going across processes to the database more often than necessary, compiling of the query by the database... I'd definitely expect that to take more time than compiling one query and returning, let's say, 40 bytes of additional data per row.
(Anyone know why my code is not being presented colorized and multiline here?)
OK, so I've spent a couple of hours measuring the difference....
And sure enough, getting one resultset and then - in my case - manually populating the class hierarchy (something EF6 used to do by itself without a problem) is muuuuuuch faster than calling the database 9 times instead. The average time increase of 9 calls vs. 1 call to get the same data is 337%! And if I ask my function to populate all my permission roles (I have 7 in there right now), the difference goes up to 499% - which is of course no surprise, since doing the same thing with 50 (!) queries in EF Core, versus how EF6 did it with just one query to get the same result, is bound to have an impact. Not a good one, either.
And I'm not even mentioning that writing the query backwards to be able to measure this is really not easy, as it is a very unnatural way to design a query. Doing this and debugging it regularly for other queries will take lots of additional time and lots more code... Or could it be done more easily? Here's my testing code from my real app:
(Sorry if it comes out unformatted again. I'm happy to fix it if anyone can point me to what I'm doing wrong in this comment editor.)
var bytForRoleId = (byte)forRole;
bytForRoleId = 0;
var w = new System.Diagnostics.Stopwatch();
w.Start();
var q1 = (
    from r in DC.OrganizationUserRoles
    where bytForRoleId == 0 || (bytForRoleId > 0 && r.OrgUserRoleId == bytForRoleId)
    orderby r.OrgUserRoleId
    select new OrgRolePermissionInfo() {
        RoleId = r.OrgUserRoleId,
        RoleName = r.OrgUserRoleName,
        RoleDescr = r.OrgUserRoleDescr,
        Permissions = (
            from dp in r.RolePermissions
            from p in DC.RolePermissions.Where(x => x.OrgId == forOrgId && x.OrgUserRoleId == dp.OrgUserRoleId && x.PermissionId == dp.PermissionId).DefaultIfEmpty()
            where dp.OrgId == 0
            select new OrgRolePermission() {
                Permission = dp.Permission,
                DefaultValue = dp.PermissionValue,
                Value = p.PermissionId > 0 ? p.PermissionValue : dp.PermissionValue,
                Inherited = p.PermissionId > 0 ? false : true,
                Types = dp.Permission.PermissionApplicables.ToList()
            }
        ).ToList()
    }
).ToList();
w.Stop();
var ms1 = w.ElapsedTicks;
w.Restart();
var q2 = (
    from a in DC.PermissionApplicables
    from rp in DC.RolePermissions.Where(x => x.PermissionId == a.PermissionId).DefaultIfEmpty()
    from p in DC.RolePermissions.Where(x => x.OrgId == forOrgId && x.OrgUserRoleId == rp.OrgUserRoleId && x.PermissionId == rp.PermissionId).DefaultIfEmpty()
    from pp in DC.Permissions.Where(x => x.PermissionId == rp.PermissionId)
    from r in DC.OrganizationUserRoles.Where(x => bytForRoleId == 0 || (bytForRoleId > 0 && x.OrgUserRoleId == bytForRoleId)).DefaultIfEmpty()
    where rp.OrgId == 0 && (pp.PermissionId != 0 && pp.PermissionId != 3)
    orderby r.OrgUserRoleId, a.PermissionId, a.CanWhat
    select new Test {
        RoleId = r.OrgUserRoleId,
        RoleName = r.OrgUserRoleName,
        RoleDescr = r.OrgUserRoleDescr,
        Perm = pp,
        DefaultValue = rp.PermissionValue,
        Value = (p.PermissionId > 0 ? p.PermissionValue : rp.PermissionValue),
        Inherited = p.PermissionId > 0 ? false : true,
        PermId = a.PermissionId,
        CanWhat = a.CanWhat,
        UiText = a.UiText,
        UiExplanation = a.UiExplanation,
        Includes = a.Includes,
        ApplicableToAppUser = a.ApplicableToAppUser
    }
).ToList();
w.Stop();
var ms2 = w.ElapsedTicks;
w.Restart();
Test row;
byte iLastRoleId = 0;
byte iLastPermId = 0;
short iLastCanWhat = 0;
OrgRolePermissionInfo o = null;
var lst = new List<OrgRolePermissionInfo>();
OrgRolePermission rolePerm = null;
for (var i = 0; i < q2.Count; i++) {
    row = q2[i];
    if (row.RoleId != iLastRoleId) {
        iLastRoleId = row.RoleId;
        iLastPermId = 0;
        iLastCanWhat = 0;
        o = new OrgRolePermissionInfo() {
            RoleId = row.RoleId,
            RoleName = row.RoleName,
            RoleDescr = row.RoleDescr,
            Permissions = new List<OrgRolePermission>()
        };
        lst.Add(o);
    }
    if (row.PermId != iLastPermId) {
        iLastPermId = row.PermId;
        iLastCanWhat = 0;
        rolePerm = new OrgRolePermission() {
            Permission = row.Perm,
            DefaultValue = row.DefaultValue,
            Value = row.Value,
            Inherited = row.Inherited,
            Types = new List<PermissionApplicable>()
        };
        o.Permissions.Add(rolePerm);
    }
    if (row.CanWhat != iLastCanWhat) {
        iLastCanWhat = row.CanWhat;
        rolePerm.Types.Add(new PermissionApplicable() {
            PermissionId = row.PermId,
            ApplicableToAppUser = row.ApplicableToAppUser,
            CanWhat = row.CanWhat,
            Includes = row.Includes,
            UiExplanation = row.UiExplanation,
            UiText = row.UiText
        });
    }
}
w.Stop();
var ms3 = w.ElapsedTicks;
Comments on the above code:
First of all, it is really strange how the code block starts several lines after its actual beginning. But at least it is readable now. Anyway...
The q1 query is the "natural" one - the way it feels natural to write, and the one EF6 would issue a single query to the database for and fill correctly.
The q2 query is the one I attempted to write backwards. As I mentioned, that was pretty hard to debug. It gives the same results now, which is enough for the measurement, but I'd not be completely confident in it for production code in this form.
The second query is ordered by the entity keys of the involved entities and then iterated over to populate the list of object hierarchies to return.
I need to write queries like this all the time. Yet, according to @roji, this is not a common case? Can I do it better? I'm not happy at all with a 5-fold (or worse, with more returned objects) decrease in performance. With so many queries, the DB server will not be very scalable. :-(
@marekvse you may have missed it, but the decision has already been made to move to a single SQL query (with classical join) for all cases - see this comment above.
FWIW, whether a single query will perform better than one-query-per-joined-table depends on your specific query, various environmental concerns (e.g. network latency), etc. It's very easy to show scenarios in which the single-query approach will be very bad for perf. However, it's true that a single query has several non-perf advantages (it always allows streaming regardless of MARS support, and it avoids potential subtle transaction isolation issues between split queries), and users should be able to get the previous behavior by coding for it explicitly.
Yes, I indeed did miss that. Thank you, @roji. I just updated from EF Core 2.2.0 to 2.2.4, but I get the same result - the same 50 queries instead of one for the q1 query from the code above. Is there a setting I have to change, or do I have to alter the way the query is written somehow?
Overnight I decided I was going to write a native SQL sproc and get the data that way, which would at least let me write the query naturally and understandably. I'm glad I probably don't have to do that, but I'm not sure what's wrong with my LINQ in q1 to achieve the desired result.
@marekvse, the change @roji is referring to is coming in EF Core 3.0.
I see. Thanks @roji. Will be looking forward to it then :).
Glad to hear this is being fixed in EF Core 3.0.
Executed DbCommand (7ms) [Parameters=[], CommandType='Text', CommandTimeout='30']
SELECT [t].[Id], [b.BookingItems].[Id] AS [Id0], [b.BookingItems].[Name], [b.BookingItems].[BookingId]
FROM [BookingItem] AS [b.BookingItems]
INNER JOIN (
SELECT [b0].[Id]
FROM [Booking] AS [b0]
) AS [t] ON [b.BookingItems].[BookingId] = [t].[Id]
ORDER BY [t].[Id]
@smitpatel @ajcvickers There is an ORDER BY [Id] in the resulting query above although the LINQ doesn't contain any ordering. I'm asking about this as it could cause a performance issue, especially if this Id is used only as a primary key but not as the clustered index.