Efcore: Query: Use WHERE EXISTS for collection navigation Include

Created on 24 Mar 2016  路  15Comments  路  Source: dotnet/efcore

A query of the form context.Posts.Include(p => p.Comments) generates SQL like this:

SELECT [p].[PostId], [p].[Body]
FROM [Posts] AS [p]
ORDER BY [p].[PostId]

SELECT [c].[CommentId], [c].[Body], [c].[PostPostId]
FROM [Comment] AS [c]
INNER JOIN (
    SELECT DISTINCT [p].[PostId]
    FROM [Posts] AS [p]
) AS [p1] ON [c].[PostPostId] = [p1].[PostId]
ORDER BY [p1].[PostId]

This works correctly but there doesn't seem to be any need for the DISTINCT keyword in the subquery in any scenario as the principal key seems to always be included in the projection.

type-bug

Most helpful comment

@anpete the answer on Stack Overflow is misleading and, for SQL Server, wrong. On Stack Overflow I'm one of the users you'd consider "authoritative" for SQL Server, so I'll answer this here (I don't want to link the profiles publicly):

The cases where INNER JOIN is faster than EXISTS are coincidental (cardinality estimation errors and such). Internally, EXISTS _always_ is a join. It is marked specially so that the optimizer knows that only one row is required even if multiple are available. This merges the distinctness into the join.

The SQL Server optimizer has an unfortunate limitation that it cannot _introduce_ an EXISTS-type join. It can only convert it to a normal join. Therefore, EXISTS is more expressive.

I cannot think of any scenario where INNER JOIN is systematically faster. EXISTS is the better code gen strategy.

Also note, that the optimizer generally understands redundant distincts well. They usually are removed if uniqueness can be proven (e.g. through an index). Also, DISTINCT is deleted from EXISTS(DISTINCT ...) because it always does nothing. SQL Server might _add_ a distinct, though.

All 15 comments

This will not cause any SQL Server query plan difference so it's just cosmetics. I think this is needed for correctness in the general case.

But this should be an EXISTS check instead of a INNER JOIN in the first place. We want all comments that are referenced by _at least_ one selected post. This suggests EXISTS.

SQL Server is not capable of introducing EXISTS dues to optimizer limitations. The EXISTS formulation seems cleaner than DISTINCT and possibly faster.

@divega We need the DISTINCT in general because the sub-query may return duplicates (depending on the top-level query. E.g. For this SelectMany query:

``` c#
from c1 in context.Set().OrderBy(c => c.CustomerID).Take(5)
from c2 in context.Set().Include(c => c.Orders)
select c2


We would generate:

``` sql
SELECT [c1].[CustomerID], [c1].[Address], [c1].[City], [c1].[CompanyName], [c1].[ContactName], [c1].[ContactTitle], [c1].[Country], [c1].[Fax], [c1].[Phone], [c1].[PostalCode], [c1].[Region]
FROM (
    SELECT TOP(5) [c0].*
    FROM [Customers] AS [c0]
    ORDER BY [c0].[CustomerID]
) AS [t]
CROSS JOIN [Customers] AS [c1]
ORDER BY [c1].[CustomerID]

SELECT [o].[OrderID], [o].[CustomerID], [o].[EmployeeID], [o].[OrderDate]
FROM [Orders] AS [o]
INNER JOIN (
    SELECT DISTINCT [c1].[CustomerID]
    FROM (
        SELECT TOP(5) [c0].*
        FROM [Customers] AS [c0]
        ORDER BY [c0].[CustomerID]
    ) AS [t]
    CROSS JOIN [Customers] AS [c1]
) AS [c10] ON [o].[CustomerID] = [c10].[CustomerID]
ORDER BY [c10].[CustomerID]

If the DISTINCT is omitted, then the second query will return 4105 Orders cf. just 830.

WRT EXISTS, as @GSPP observes, the second query could be written as:

SELECT [o].[OrderID], [o].[CustomerID], [o].[EmployeeID], [o].[OrderDate]
FROM [Orders] AS [o]
WHERE EXISTS (
    SELECT [c1].[CustomerID]
    FROM (
        SELECT TOP(5) [c0].*
        FROM [Customers] AS [c0]
        ORDER BY [c0].[CustomerID]
    ) AS [t]
    CROSS JOIN [Customers] AS [c1]
    WHERE [c1].[CustomerID] = [t].[CustomerID]
)
ORDER BY [o].[CustomerID]

We went with INNER JOIN because the query generation is slightly easier, but it could be that EXISTS is generally faster (it is hard to find authoritative information about this). This SO points in that direction: http://stackoverflow.com/questions/2177346/can-an-inner-join-offer-better-performance-than-exists

@anpete the answer on Stack Overflow is misleading and, for SQL Server, wrong. On Stack Overflow I'm one of the users you'd consider "authoritative" for SQL Server, so I'll answer this here (I don't want to link the profiles publicly):

The cases where INNER JOIN is faster than EXISTS are coincidental (cardinality estimation errors and such). Internally, EXISTS _always_ is a join. It is marked specially so that the optimizer knows that only one row is required even if multiple are available. This merges the distinctness into the join.

The SQL Server optimizer has an unfortunate limitation that it cannot _introduce_ an EXISTS-type join. It can only convert it to a normal join. Therefore, EXISTS is more expressive.

I cannot think of any scenario where INNER JOIN is systematically faster. EXISTS is the better code gen strategy.

Also note, that the optimizer generally understands redundant distincts well. They usually are removed if uniqueness can be proven (e.g. through an index). Also, DISTINCT is deleted from EXISTS(DISTINCT ...) because it always does nothing. SQL Server might _add_ a distinct, though.

@GSPP Thanks for the info; great to get detail on this from someone who knows the domain.

The main idea of the answer in the SO thread that I was referring to was "If you do an inner join on a recordset with DISTINCT applied (to get rid of the duplicates), EXISTS is usually faster.", which seems to match what you are saying above. Am I missing something?

I'm going to see if we can do anything for RTM here.

@GSPP Thanks for the details! 馃憤

@anpete Makes sense. We should probably have a bug to use EXISTS and discuss it in triage. Sounds like it would be a good improvement.

@divega I was hoping to re-purpose this bug for that :smile:

I am fine with that. I am only concerned that switching to EXISTS could become much more involved than what the original bug proposed (and turned out to be incorrect), in which case we should probably re-triage it.

If you do an inner join on a recordset with DISTINCT applied (to get rid of the duplicates), EXISTS is usually faster.

Yes, it's equal or faster or worse by coincidence.

If you want to play around:

USE tempdb

SELECT *
INTO #o
FROM sys.all_objects

SELECT *
INTO #c
FROM sys.all_columns

SELECT #o.name
FROM #o
JOIN (SELECT DISTINCT object_id FROM #c WHERE name LIKE '%x%') c ON #o.object_id = c.object_id

SELECT #o.name
FROM #o
WHERE EXISTS (SELECT * FROM #c c WHERE c.name LIKE '%x%' AND #o.object_id = c.object_id)

This determines the names of all objects that have a column matching %x%. Semantically, we need distinctness because there might be multiple such columns. The EXISTS does not need the DISTINCT operator because the hash join does that. In my mind this is a missing optimizer feature.

This could potentially solve #5038 which is mainly occurring due to Distinct keyword.
This does not apply when there is order by clause and which is not on principal key properties therefore does not solve #5038

Just to mention this here on the issue.
INNER JOIN cannot be converted to WHERE EXISTS if there is ORDER BY in subquery which has column(s) other than principal key properties since we need to sort by that column in final result.

Shouldn't all columns required for sorting be required in the "driver" query? The EXISTS part never should be required to deliver data.

The INNER JOIN ... DISTINCT also cannot deliver data columns because they would invalidate the distinctness property.

Can you make an example what needs to be in the ORDER BY clause that requires a join?

Update: Why is there a need to sort at all? Does the EF implementation require both queries to be sorted in an aligned way?

Update: Why is there a need to sort at all? Does the EF implementation require both queries to be sorted in an aligned way?

Exactly the case. How EF does collection include is EF triggers 2 queries. 1st query is the main result query. 2nd query is to fetch data from the other table for the include. Both the queries are sorted by principal key properties of relationship so that results from both the queries can be matched while streaming. So when we start enumerating main query, we don't need to scan the all results of 2nd query to find relevant rows for include. When there is some ordering in the result query, we need to include that in first query because that is what final result should be returning. Just to match with that, the 2nd query (where we have inner join.... distinct) also needs to sort by that column and which exist only in the subquery part. Hence it cannot be converted to WHERE EXISTS. The queries in the first post are just example of subset of queries which can arise for collection include. An example for this would be:
Suppose we want posts sorted by time in decreasing order to get most recent post first

SELECT [p].[PostId], [p].[Body]
FROM [Posts] AS [p]
ORDER BY [p].[PostTime] DESC, [p].[PostId]

SELECT [c].[CommentId], [c].[Body], [c].[PostPostId]
FROM [Comment] AS [c]
INNER JOIN (
    SELECT DISTINCT [p].[PostTime], [p].[PostId]
    FROM [Posts] AS [p]
) AS [p1] ON [c].[PostPostId] = [p1].[PostId]
ORDER BY [p1].[PostTime] DESC, [p1].[PostId]

I understand now. Unfortunately, this means that different execution strategies will be optimal for different problems. Either EF needs to guess which strategy is best or it needs to always use a strategy deemed good enough for all cases. This is not a good situation to be in.

Actually, the solution that EF uses right now is very similar to the LEFT JOIN and 1-query solution (I hope you know what I mean). That achieves streaming and 1 query only but it might greatly increase the amount of data transferred.

Maybe EF needs to allow the user to choose which strategy to use?

Until that is implemented, if ever, I actually think the INNER JOIN ... DISTINCT solution is good. The inefficiencies are very minor, usually.

The current solution implemented in #5433 is
if there is no order by which is not on principal key properties, we use Exists else we fall back to Inner join

Was this page helpful?
0 / 5 - 0 ratings