Efcore: Avoid cartesian product without query splitting

Created on 7 Sep 2020  路  13Comments  路  Source: dotnet/efcore

Query splitting was introduced, but I would personally avoid its drawbacks like the plague.

Luckily, we can _usually_ avoid cartesian products through careful aggregate design. Still, sometimes we truly do want to join multiple 1:N relations. Of course, we would rather not risk a huge data set caused by an incidental high number of child entities.

This problem can be solved without resorting to query splitting.

As an example, say we are selecting one Parent and joining both its Sons and its Daughters.

Basically, since the joined siblings are independent, we have no reason to want them multiplied. This can be accomplished by explicitly instructing the database to keep them separate:

  • Join a set of constants. Since we intend to join two 1:N relations, we will join two constants: LEFT JOIN (SELECT 1 AS Id UNION ALL SELECT 2) AS Splitter.
  • To the Son's join condition, add: AND Splitter.Id = 1.
  • To the Daughter's join condition, add: AND Splitter.Id = 2.

This gives us:

SELECT *
FROM Parents p
LEFT JOIN (SELECT 1 AS Id UNION ALL SELECT 2) AS Splitter ON TRUE
LEFT JOIN Sons s ON s.ParentId = p.Id AND Splitter.Id = 1
LEFT JOIN Daughters d ON d.ParentId = p.Id AND Splitter.Id = 2
WHERE p.Id = 1

While transferring a _bit_ more data (mostly in the form of duplication of the Parent), the duplication stays linear and well under control.

When we combine careful aggregate design with avoiding the cartesian product, we have all the tools we need to load reasonable object graphs without introducing the significant drawbacks of split queries.

area-perf customer-reported needs-design type-enhancement

Most helpful comment

We must compare the best possible case to the cartesian product case

That is my point and has been said by @roji at multiple point. This works really well for some scenarios but not all and we need to evaluate pros and cons of both sides.

All 13 comments

@Timovzl thanks for this proposal, we'll study this and respond soon.

@Timovzl this is definitely an interesting proposal and has the potential to improve single queries. However, note that this technique works only for multiple collection navigations being loaded at the same level; nested navigations cannot be loaded in this way, and so even if your proposal is implemented, split queries remain relevant. There are also some scenarios where your technique will cause increased data to be transferred (due to duplication of the parent rows), but I agree the benefits likely outweigh that.

In any case, the 5.0 version is now complete barring bug fixes. We'll look into this more in-depth for 6.0.

Note when investigating - do actual cross-database benchmarks to measure and compare perf.

@roji I did not mean to imply that split queries will become irrelevant. Just that is really nice to have an alternative. :)

However [...] nested navigations cannot be loaded in this way

I do believe this approach can work for multiple levels of nesting as well. It can be applied as follows:

  • In a query, define a parent as an instance of addressing a table where one or more others are being joined onto it.
  • Define a child as an instance of addressing a table where it is being joined onto another. (Note that intermediate levels act as both parent and child.)
  • Define a 1:1 child as a child where we are certain that it can produce no more than 1 row per parent row.
  • Define a 1:N child as a child that is not a 1:1 child.
  • When applying this feature, for any parent with _two or more 1:N children_, apply a splitter quasi-table as demonstrated in the original post.

Some points to note: We'll need different names for each splitter quasi-table, e.g. Splitter1, Splitter2. Each one will consist of a number of rows equal to the number of children being joined.

Not only should this approach work for trees of any breadth and depth, but it should be no worse than performing the same query without it! It only applies where a cartesian product is risked, and it avoids it. I'm sure you will do tests to confirm this. I have had good results with it in practice.

As with many of EF's features, I think it would be neat if this one could be applied either per-query or on the registration of a DbContext. I can imagine wanting to use this wherever it may apply (as defined above), or simply having a single query where it comes in handy.

However [...] nested navigations cannot be loaded in this way

I do believe this approach can work for multiple levels of nesting as well

Sure. I meant to say that this technique works only for multiple collection navigations being loaded at the same level, but cannot help cases where a single navigation is loaded at each level (unless I'm missing something). For example, following your example above, if instead of a Parent with two collection navigations Sons and Daughters we have Customers, with navigation Orders, which itself has another navigation LineItems, the cartesian explosion effect still occurs when loading all customers with all their orders and line items. Unless I'm mistaken, that would still be a case where only split queries can help.

As with many of EF's features, I think it would be neat if this one could be applied either per-query or on the registration of a DbContext. I can imagine wanting to use this wherever it may apply (as defined above), or simply having a single query where it comes in handy.

This is really too early to say, but it the choice between split and single query already works this way. If we end up implementing the above mode, I don't see why it would be any different.

For example, following your example above, if instead of a Parent with two collection navigations Sons and Daughters we have Customers, with navigation Orders, which itself has another navigation LineItems, the cartesian explosion effect still occurs when loading all customers with all their orders and line items. Unless I'm mistaken, that would still be a case where only split queries can help.

You're right in that it doesn't help in this situation, whereas split queries help (at least in theory). But this scenario shouldn't warrant either. Let me illustrate.

If I'm not mistaken, the cartesian explosion is another name for the cartesian product. It refers to the multiplication effect that occurs when loading multiple types of 1:N children onto _the same parent_.

For example, if parent A has 100 sons and 100 daughters, there would be no single correct way to combine the sons and daughters, and so, instead, _every_ combination is returned. Instead of 100 + 100 = 200 rows, we get 10.000! That is an explosion indeed. If loading 200 rows is acceptable, then loading 10.000 is very likely unacceptable. Note also how even a single pathological entry may cause command timeouts in the application.

In your example, by contrast, we might have 1 Customer with 5 Orders, each with 10 LineItems. The best number of entities we can hope to transfer is 1 + 1*5 + 1*5*10 = 56. The complexity is O(N), where N is the product of the counts, i.e. 1 * 5 * 10.

When we do this as a single query, we know that we get a row for each of the LineItems. Each row contains 3 entities, with the Orders and Customers potentially repeated. As such, we get 1 * 5 * 10 = 50 rows, each containing 3 entities, for a total of 150 entities transferred. The complexity is still O(N). The constant factory has simply gone up from 1.12 to 3.

I believe the takeaway here is that an increase from linear to quadratic (e.g. 200 to 10.000) is unacceptable and warrants a solution, such as split queries or the solution we are talking about here. After all, the overhead of performing multiple queries is safer and generally better than the overhead of going from 200 to 10.000 rows. However, in the latter example, there is no real explosion going on. The constant increases a bit, and we transfer a bit more data, but the overhead is generally less than that of performing multiple queries.

In conclusion, you're right that the proposed solution cannot help with nesting where no multiplication occurs, whereas split queries _can_, in theory. However, this situation generally needs no help.

Note that the proposed solution _can_ help when multiplication occurs on multiple levels (like this). :)

I agree with your analysis that the problem is usually considerably more severe when the collections are on the same level, then when they are nested. However, it's important to understand that it ultimately depends on your specific scenario andBut data.

In the example with 1 Customer/5 Orders/10 LineItems, the columns for the single customer are transmitted 50 times to the client. Now, if the customer happens to have a very heavy field - such as a photo - that data gets duplicated 50 times (just for one customer). With the right data, this could easily be worse than even the 100*100 explosion of the first scenario. So any scenario where data gets duplicated is potentially very dangerous, and split queries eliminate that altogether.

But I'm not disagreeing with anything you've written above. We'd have to thoroughly benchmark your proposal across databases to make sure that there's no surprising perf issues in some scenarios (though I don't expect that would happen), and then evaluate the complexity of producing the query and materializing the results back. Beyond that I think this could be a great improvement.

If I'm not mistaken, the cartesian explosion is another name for the cartesian product. It refers to the multiplication effect that occurs when loading multiple types of 1:N children onto the same parent.

It is same for same level or nested. 100 collection items on same level is 10,000 rows then 100 collection item for nested is also 10,000 rows. Repetition of parents excluding the very last entity is same.

@smitpatel We must compare the best possible case to the cartesian product case, to see how much overhead we are causing. 100 sons * 100 daughts = 10.000 children returned, with only 200 distinct children in existence. Returned = (Optimum / 2) ^ 2. Quadratic complexity does not scale.

Having 100 sons who _each_ contain 100 grandsons means there are _actually_ 10.000 descendants in existence! Sure, we're repeating the parent 10.000 times, so there is some growth: the repetition of the parent causes us to return an extra 10.000 entities, and the repetition of the sons another extra 10.000. Returned = 3 * Optimum. That is within a small constant factor of the theoretical optimum. Linear complexity scales.

With the right data, this could easily be worse than even the 100*100 explosion of the first scenario. So any scenario where data gets duplicated is potentially very dangerous, and split queries eliminate that altogether.

@roji Fair point!

I appreciate the level of thought that is being given to this proposal.

We must compare the best possible case to the cartesian product case

That is my point and has been said by @roji at multiple point. This works really well for some scenarios but not all and we need to evaluate pros and cons of both sides.

An idea hit me that might let us avoid even the multiplication on single navigations per level. I haven't experimented with it yet, but the theory seems solid.

Basically, we would like to repeat as little as possible of any parent in the hierarchy. We could do that by sticking to its primary key, the minimum we need to identify it. We do, however, want that parent's data. Using the same splitter semi-table technique, we could self-join the parent _once_, separate from its children, in the same way we separate siblings.

Let's use Splitter.Id = 0 for the self-join:

SELECT p.Id, pSelf.*, s.*, d.*
FROM Parents p
LEFT JOIN (SELECT 0 AS Id UNION ALL SELECT 1 UNION ALL SELECT 2) AS Splitter ON TRUE
LEFT JOIN Parents pSelf ON pSelf.Id = p.Id AND Splitter.Id = 0
LEFT JOIN Sons s ON s.ParentId = p.Id AND Splitter.Id = 1
LEFT JOIN Daughters d ON d.ParentId = p.Id AND Splitter.Id = 2
WHERE p.Id = 1

A simple example results set might look like this:

p.Id,   pSelf.Id,   pSelf.Name, s.Id,   s.ParentId, s.Name, d.Id,   d.ParentId, d.Name
1,  1,      Parent,     NULL,   NULL,       NULL,   NULL,   NULL,       NULL
1,  NULL,       NULL,       1,  1,      Son1,   NULL,   NULL,       NULL
1,  NULL,       NULL,       2,  1,      Son2,   NULL,   NULL,       NULL
1,  NULL,       NULL,       NULL,   NULL,       NULL,   1,  1,      Daughter1
1,  NULL,       NULL,       NULL,   NULL,       NULL,   2,  1,      Daughter2

Provided that the database sends NULLs efficiently (such as indicated by a bit field, with no further data for that field), this has the potential to negate the effect of the multiplication almost entirely.

Of course, we'd need to confirm that the self-join is performed efficiently.

I had this in mind too when I read your original proposal :)

Transferring nulls is generally pretty efficient, but of course we'd need to confirm. Definitely an interesting direction to explore!

Was this page helpful?
0 / 5 - 0 ratings