Runtime: Add Enumerable.*By operators (DistinctBy, ExceptBy, IntersectBy, UnionBy, MinBy, MaxBy)

Created on 21 Oct 2018  路  10Comments  路  Source: dotnet/runtime

I propose to add LINQ methods of the pattern *By. For example:

IEnumerable<TSource> DistinctBy<TSource, TKey>(
     IEnumerable<TSource> source,
     Func<TSource, TKey> keySelector,
     IEqualityComparer<TKey> comparer = null)

This method would behave like Distinct except that equality is determined based on the key provided by keySelector. The key could be any value including anonymous types and value tuples.

A motivating case could be this:

IEnumerable<Order> allOrders = GetTransactions();
IEnumerable<int> completedOrderIDs = GetCompletedOrderIDs();
IEnumerable<Order> remainingOrders =
      allOrders.ExceptBy(completedOrderIDs, order => order.ID, orderID => orderID);

Logic like that is reasonably common in business logic code. It is not easy to implement without ExceptBy. In particular, the following is undesirable because it leads to quadratic cost and repeated enumeration:

IEnumerable<Order> remainingOrders = allOrders.Where(o => !completedOrderIDs.Contains(o.ID));

In the past I have had a need for *By methods many times so I have written them myself. A web search reveals great interest. I believe there is a strong case for adding methods like this.

There has been interest on this issue tracker as well:

Here is the API propsal. All of these methods come with an overload with and without comparer. I kept the existing naming conventions and argument ordering.

IEnumerable<TSource> DistinctBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
IEnumerable<TSource> DistinctBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IEqualityComparer<TKey> comparer)

IEnumerable<TSource> ExceptBy<TSource, TKey>(this IEnumerable<TSource> first, IEnumerable<TKey> second, Func<TSource, TKey> keySelectorFirst)
IEnumerable<TSource1> ExceptBy<TSource1, TSource2, TKey>(this IEnumerable<TSource1> first, IEnumerable<TSource2> second, Func<TSource1, TKey> keySelectorFirst, Func<TSource2, TKey> keySelectorSecond)
IEnumerable<TSource1> ExceptBy<TSource1, TSource2, TKey>(this IEnumerable<TSource1> first, IEnumerable<TSource2> second, Func<TSource1, TKey> keySelectorFirst, Func<TSource2, TKey> keySelectorSecond, IEqualityComparer<TKey> comparer)

IEnumerable<TSource> IntersectBy<TSource, TKey>(this IEnumerable<TSource> first, IEnumerable<TKey> second, Func<TSource, TKey> keySelectorFirst)
IEnumerable<TSource1> IntersectBy<TSource1, TSource2, TKey>(this IEnumerable<TSource1> first, IEnumerable<TSource2> second, Func<TSource1, TKey> keySelectorFirst, Func<TSource2, TKey> keySelectorSecond)
IEnumerable<TSource1> IntersectBy<TSource1, TSource2, TKey>(this IEnumerable<TSource1> first, IEnumerable<TSource2> second, Func<TSource1, TKey> keySelectorFirst, Func<TSource2, TKey> keySelectorSecond, IEqualityComparer<TKey> comparer)

IEnumerable<TSource> UnionBy<TSource, TKey>(this IEnumerable<TSource> first, IEnumerable<TSource> second, Func<TSource, TKey> keySelector)
IEnumerable<TSource> UnionBy<TSource, TKey>(this IEnumerable<TSource> first, IEnumerable<TSource> second, Func<TSource, TKey> keySelector, IEqualityComparer<TKey> comparer)

TResult Max<TSource, TResult>(this IEnumerable<TSource> source, Func<TSource, TResult> selector, IComparer<TResult> comparer)
TResult Max<TSource, TResult>(this IEnumerable<TSource> source, Func<TSource, TResult> selector, IComparer<TResult> comparer, TResult defaultValue)

TSource MaxBy<TSource, TResult>(this IEnumerable<TSource> source, Func<TSource, TResult> selector)
TSource MaxBy<TSource, TResult>(this IEnumerable<TSource> source, Func<TSource, TResult> selector, IComparer<TResult> comparer)
TSource MaxBy<TSource, TResult>(this IEnumerable<TSource> source, Func<TSource, TResult> selector, IComparer<TResult> comparer, TSource defaultValue)

TResult Min<TSource, TResult>(this IEnumerable<TSource> source, Func<TSource, TResult> selector, IComparer<TResult> comparer)
TResult Min<TSource, TResult>(this IEnumerable<TSource> source, Func<TSource, TResult> selector, IComparer<TResult> comparer, TResult defaultValue)

TSource MinBy<TSource, TResult>(this IEnumerable<TSource> source, Func<TSource, TResult> selector)
TSource MinBy<TSource, TResult>(this IEnumerable<TSource> source, Func<TSource, TResult> selector, IComparer<TResult> comparer)
TSource MinBy<TSource, TResult>(this IEnumerable<TSource> source, Func<TSource, TResult> selector, IComparer<TResult> comparer, TSource defaultValue)

Further notes:

  1. DistinctBy: The order of the output elements should be documented to be the same order as the input. Any reasonable implementation that comes to mind does it this way. For compatibility reasons this could not ever be changed anyway after the first version ships. Distinct should be similarly documented if not already done. Distinct already behaves this way.
  2. For ExceptBy and IntersectBy the same is true for the first input. Only items from the first input are ever returned and their order can be kept. Except does the same thing today.
  3. For UnionBy I'm not sure about the order.
  4. For MinBy/MaxBy it should be documented that the first element in the sequence with the minimum/maximum key is returned.
  5. MinBy/MaxBy can take a defaultValue which is used in case the sequence is empty. Overloads without default value throw in that case.
  6. I added overloads to Min/Max to bring them on par with the functionality added by MinBy/MaxBy (comparer and defaultValue).
  7. Note, that the input element types for ExceptBy and IntersectBy can be different. This is because we only ever return items of the first sequence. We need two key selectors in that case. There's a simpler overload that has only one element type as well.
api-suggestion area-System.Linq

Most helpful comment

Not sure why dotnet/runtime#14753 was closed way back when.

It is crazy that LINQ doesn't have these methods built in. I think it's been rare in the 10 years since LINQ's introduction that I've ever had a project where I don't end up writing one or more of these extension methods again and again myself.

I think the argument (mentioned in the other thread) of, "oh well you can do these things by creating a custom comparer" doesn't hold water. Writing a custom comparer is very heavyweight. And, in practice, I can probably count on one hand the number of times I've seen developers actually go and write their own comparers. Instead they usually a) use home-grown extension methods that do these things or b) fall back to writing a foreach loop.

One of the main points of LINQ is brevity- writing what you want to do, not all the plumbing code required to do it. Forcing users to write a separate class each time they want to slice their enumerable a different way is the absolute antithesis of this.

To implement these methods, I'd imagine it could just use something like Jon Skeet's ProjectionEqualityComparer and call the existing non-*By methods. (Probably deserves a separate issue, but that class should be exposed publicly in the framework proper as well)

All 10 comments

Not sure why dotnet/runtime#14753 was closed way back when.

It is crazy that LINQ doesn't have these methods built in. I think it's been rare in the 10 years since LINQ's introduction that I've ever had a project where I don't end up writing one or more of these extension methods again and again myself.

I think the argument (mentioned in the other thread) of, "oh well you can do these things by creating a custom comparer" doesn't hold water. Writing a custom comparer is very heavyweight. And, in practice, I can probably count on one hand the number of times I've seen developers actually go and write their own comparers. Instead they usually a) use home-grown extension methods that do these things or b) fall back to writing a foreach loop.

One of the main points of LINQ is brevity- writing what you want to do, not all the plumbing code required to do it. Forcing users to write a separate class each time they want to slice their enumerable a different way is the absolute antithesis of this.

To implement these methods, I'd imagine it could just use something like Jon Skeet's ProjectionEqualityComparer and call the existing non-*By methods. (Probably deserves a separate issue, but that class should be exposed publicly in the framework proper as well)

@maryamariyan
The original issue is five years old. Seven versions of .NET Core have been released since then.
These methods are extremely convenient (unlike the overloads with IEqualityComparer) and are easy to implement.

In an effort to unify my own By extension methods (with the hope that those would someday would become part of the runtime) I came looking for a reference implementation that I could grab.
As far as I can tell the extrema-finding functions from both MoreLINQ and Ix.NET cause the evaluation of the selector function for single-element collections.
The decision to skip it (when possible) does come with the side effects of "hiding" a possible exception occurring in it, but at the same time- there is no "valid selector guarantee" with the empty collection either- so to me it would make sense to avoid the possibly lengthy selector evaluation (when possible)- but since the normal Enumerable.Min(source, selector) does it as well- might be the dominating factor in the final decision...
Anyway, can anybody point me to an existing implementation of the proposed API - looking through these 100+ pages of search results is quite painful (in more than one ways)- and I see no way of filtering out the "
ByTest.cs" files...

/cc: @eiriktsarpalis @adamsitnik

@terrajobst What needs to happen to get these API suggestions reviewed and get any official feedback from the team? I saw you mention in the standup yesterday that you guys have a clear plate now... :)

@terrajobst What needs to happen to get these API suggestions reviewed and get any official feedback from the team? I saw you mention in the standup yesterday that you guys have a clear plate now... :)

The API process is documented. Each area has an owner and it's their responsibility to work with the issue opener to get the issue either into the state api-ready-for-review or closed.

In this case, that's @eiriktsarpalis and @adamsitnik.

A few remarks:

  • I would support implementing MinBy and MaxBy. They are useful methods without good alternatives at the moment. I'm not sure if we should add the defaultValue overloads.
  • Likewise, I think DistinctBy is useful. It should also come with an IQueryable implementation.
  • Playing devil's advocate: what problem does ExceptBy solve? You assert that a naive workaround would take quadratic time, however most instances I've seen would simply use completedOrderIDs.ToHashSet(); outside of the main pipeline which is effectively equivalent.
  • Why do we need ExceptBy overloads that select keys for the second enumerable? Using your example it could be easily rewritten in terms of the first oveload: allOrders.ExceptBy(completedOrderIDs.Select(orderID => orderID), order => order.ID);
  • If I understand UnionBy correctly, is it basically using keySelector to define custom a equality semantics for TSource? How is that different from passing a custom IComparer to the existing Union method? Why should it even have a different name?
  • I'm confused by the IntersectBy signatures. Am I right to assume they're just mistakenly copied from the ExceptBy APIs?

@eiriktsarpalis Is there a reason you guys haven't reviewed this (or scheduled it for review) yet? I realize it doesn't follow the template but it seems pretty clear what it's asking and obviously you guys are aware of the proposal.

@eiriktsarpalis

How is that different from passing a custom IComparer to the existing Union method?

It's not different in terms of behavior (I assume you meant IEqualityComparer), and the same is true of all those proposed methods (ExceptBy, UnionBy, IntersectBy, DistinctBy). But writing a custom equality comparer every time you want to perform one of those operations based on a given property is time-consuming, and the logic of the comparison isn't immediately visible without looking at the implementation. The *By overloads make the code terser and more readable.

Is there a reason you guys haven't reviewed this (or scheduled it for review) yet? I realize it doesn't follow the template but it seems pretty clear what it's asking and obviously you guys are aware of the proposal.

Couple of reasons. The size of the backlog is significant so it takes us time before we can get to triage everything. I also need to be confident that any proposal has a strong chance of getting approved before I mark it as ready for review: this includes gathering evidence that the addition provides value to users, poking holes in the design, and finalizing the API shape. Following the template is part of that, but that's merely a convention meant to speed up API review that frankly takes a tiny fraction of the whole effort.

It's not different in terms of behavior (I assume you meant IEqualityComparer), and the same is true of all those proposed methods (ExceptBy, UnionBy, IntersectBy, DistinctBy). But writing a custom equality comparer every time you want to perform one of those operations based on a given property is time-consuming, and the logic of the comparison isn't immediately visible without looking at the implementation. The *By overloads make the code terser and more readable.

I suppose my original question is should they contain the By suffix at all, given that they could just be thought of as overloads that accept a key selector instead of a comparer. While most current LINQ methods accepting a keySelector do have a By suffix, none of them have equivalent methods that accept IEqualityComparer instead.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

GitAntoinee picture GitAntoinee  路  3Comments

yahorsi picture yahorsi  路  3Comments

iCodeWebApps picture iCodeWebApps  路  3Comments

bencz picture bencz  路  3Comments

jkotas picture jkotas  路  3Comments