Runtime: Consider adding more LINQ operators from Ix.NET

Created on 30 Nov 2016  路  26Comments  路  Source: dotnet/runtime

https://github.com/dotnet/corefx/issues/13842 introduces TakeLast and SkipLast which existed in Ix.NET (System.Interactive, which is part of the Reactive Extensions project) for a long time. This may be a good time to review other operators we got in Ix.NET and see whether they're worth adding to .NET proper.

I'm pasting a short selection (using the Ix names) below:

Min and Max with IComparer

No overloads of these operators exist with an IComparer<TSource>. The existing overload uses Comparer<TSource>.Default to get the comparer. It seems an omission that we don't provide an option to pass in a custom comparer.

public static TSource Min<TSource>(this IEnumerable<TSource> source, IComparer<TSource> comparer);
public static TSource Max<TSource>(this IEnumerable<TSource> source, IComparer<TSource> comparer);

MinBy and MaxBy to select element(s) with lowest/highest key value

Similar to MinimalBy and MaximalBy in Mathematica. The current overloads with a Func<TSource, TResult> are merely shortcuts for xs.Select(f).Max() rather than providing a means to get the list of values that have the lowest projected value (e.g. the list of people with the lowest value for Age).

public static IList<TSource> MinBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector);
public static IList<TSource> MinBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IComparer<TKey> comparer);
public static IList<TSource> MaxBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector);
public static IList<TSource> MaxBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IComparer<TKey> comparer);

Concat (or Flatten or Unwrap)

Just like Task<Task<T>> can be unwrapped into Task<T> or AggregateException can be flattened. Obviously, this is just a macro for xss.SelectMany(xs => xs), but possibly more descriptive and having better performance.

public static IEnumerable<TSource> Concat<TSource>(this IEnumerable<IEnumerable<TSource>> sources);
public static IEnumerable<TSource> Concat<TSource>(params IEnumerable<TSource>[] sources);

StartWith (Prepend with multiple values)

I noticed we've added Prepend. Ix provides StartWith which is basically the same but takes a params array with elements to prepend. (Note it comes from the dual of StartWith on IObservable<T> where we can start an event sequence with one or more initial values provided from an array.) It could be done quite easily today using new[] { ... }.Concat(xs) but may be deemed a natural extension to Prepend.

public static IEnumerable<TSource> StartWith<TSource>(this IEnumerable<TSource> source, params TSource[] values);

Scan (aggregate yielding intermediate results)

Totally analogous signatures to Aggregate but yields every intermediate computed result. As usual, it comes from the duality with Rx where we deal with streams of events (that may never terminate). This said, it has value on a lazy enumerable sequence, e.g. when processing large logs and providing all intermediate aggregation values as the analysis is going on. Aggregate is pretty much Scan followed by Last, only optimized.

public static IEnumerable<TAccumulate> Scan<TSource, TAccumulate>(this IEnumerable<TSource> source, TAccumulate seed, Func<TAccumulate, TSource, TAccumulate> accumulator);
public static IEnumerable<TSource> Scan<TSource>(this IEnumerable<TSource> source, Func<TSource, TSource, TSource> accumulator);

There are a bunch more that are likely a bit more esoteric and of lesser value, but nonetheless worth having a look at. One family of them deals with sharing side-effects of enumeration in a less aggressive way than the traditional ToArray to ToList approach; see things like Share, Memoize, and Publish.

It's been a while since I've worked on / looked at this code, so some design decisions may have eroded from my memory, but I'll do my best recollecting these if questions come up.

api-needs-work area-System.Linq

Most helpful comment

Thanks. I read all of those as, paraphrasing: "because I don't want to use another library". To me that's not a good enough reason to add such methods. The core frameworks simply cannot scale to add every possible extension method someone might want. For the .NET ecosystem to thrive, developers need to be comfortable using libraries from NuGet and the broader ecosystem.

All 26 comments

+1

Related: MinBy and MaxBy were part of https://github.com/dotnet/corefx/issues/2146, which was closed.

Also relevant: https://github.com/morelinq/morelinq. DistinctBy sounds like a good idea, for example.

@bartdesmet Nice proposal. I have a few comments, though:

  • I don't think MinBy / MaxBy should return ILists, though, seems very un-Linqish (IEnumerables would be better).

Concat (or Flatten or Unwrap)

Obviously, this is just a macro for xss.SelectMany(xs => xs), but possibly more descriptive and having better performance.

  • I think this is redundant. How will this make things faster? The selector in SelectMany is only run once per subcollection, not per item, so it does not have too much overhead. Also, I have recently made changes that speed it up significantly: https://github.com/dotnet/corefx/pull/13942

StartWith (Prepend with multiple values)

  • I think PrependMany would be a better name, and if we add this we should also add EndWith / AppendMany for symmetry.

  • We may want to hold on this until there is support for params IEnumerable<T> in C#. That way, with the following signature:

IEnumerable<T> PrependMany<T>(this IEnumerable<T> source, params IEnumerable<T> prependees)
{
}

var onetwothree = Enumerable.Range(3, 1).PrependMany(1, 2);

the compiler can cache the [1, 2] collection in a static field. If we had the signature as params T[], a defensive copy would have to be allocated each time.

Scan (aggregate yielding intermediate results)

@jamesqo

I don't think MinBy / MaxBy should return ILists, though, seems very un-Linqish (IEnumerables would be better).

I would expect them to return a single value (for example, MoreLinq's version does that). But if it's going to return a sequence, a list does make sense to me, since it can't really be lazy (though I would prefer IReadOnlyList<T>).

Should it return a sequence containing all the values with the minimum/maximum key? Not sure. I think getting just a single value is the common case, and returning a sequence would make that common case more complex and also less efficient.

On the other hand, knowing if there's a tie (and what values are tied) sounds useful too. Maybe both versions of the method should exist? Though in that case I'm not sure about naming (maybe AllMinBy?).

I'm neither firmly for or against most of these, but a MinBy or MaxBy that returns more than one value is horrible, at least under that name. If you've more than one minima or maxima you either need more information that just the value (which that doesn't seem to offer to give) or you need to rethink your comparison criteria.

@JonHanna When you add a selector it becomes feasible that multiple elements may be mapped to the same maxima. For example, if you wanted to find the oldest people in a List<Person>:

struct Person { string Name; int Age; }

var people = new List<Person>
{
    new Person { Name = "John", Age = 25 },
    new Person { Name = "Tommy", Age = 80 },
    new Person { Name = "Mary", Age = 80 }
};

var oldest = people.MaxBy(p => p.Age);

If we took away the selector then returning an IEnumerable wouldn't make sense. That said, it might still be confusing/inconvenient that MaxBy returns a list when Max returns a scalar. Maybe we should rename it to MaximaBy.

Oh, I get that it's possible to have more than one, but I don't think "MaxBy" is a good match for that, and often nor is just the value; one would often want to know indices too.

I'll refrain from debating names here and provide some clarifications on the semantic decisions we made in Ix. (See Wadler's law.)

MinBy and MaxBy in Ix.NET initially returned one element (first match) but we changed it based on feedback from users, especially folks familiar with Mathematica etc. It's quite hard to get the list of minima and maxima from existing operators (short of using Aggregate with a List<T> seed value and all the min/max logic embedded in its selector functions, or using a much more expensive OrderBy), but the opposite is easier (get one from many, albeit at the cost of having an allocation).

These observations / discussions lead me to conclude both cases (return a list a la Mathematica or a single element a la Scala) have a use. Also, getting the index is easy by composition with the overload of Select that provides access to the element's index.

The choice of IList<T> predates the existence of IReadOnlyList<T>. We explicitly chose a return type that reflects the eager nature of the operator. In fact, LINQ has places where the type promises too much, e.g. GroupBy not returning lazy enumerables (though that's totally possible but not trivial; we have a few such implementations laying around here) and thus not working on infinite streams (even if you just want a few groups and a few elements per group). A similar design decision was made in Rx for the return type of the Buffer operator being IObservable<IList<T>>, so Ix reflected that as part of its symmetry.

With regards to IE<T> Concat(IE<IE<T>>) aka Flatten, there may be very little performance to be gained over SelectMany with the identity function, so it's more a matter of expectation as to whether one finds such an operator when needed or expected. Things like Task<Task<T>> and AggregateException have similar notions of flattening. For example, Scala has a flatten function too.

@bartdesmet Regarding Flatten, I've changed my mind over the past week. Common patterns like

List<List<T>> listOfLists = ...
T[] flat = listOfLists.SelectMany(l => l).ToArray();

are currently very hard to optimize because we do not know l => l will always result in an ICollection, and we may not be able to get the count of all the collections cheaply. Therefore, we have to do dynamic resizing during ToArray.

With Flatten, however, we can check if the source implements IEnumerable<ICollection<TSource>> and if so just sum up all of the counts of the collections and preallocate an array. So I guess it would be a good idea to add such a function.

I'd expect MinBy to return a single thing and MinsBy to return zero or more things.

What's the status of this issue? It's still labeled api-needs-work, but there seems to be an agreement on the shape of MaxBy/MinBy, at least. Maybe the issue should be split into separate issues for each method (or each method family) to ease the process of refining the API.

IMO MaxBy and MinBy are sorely needed. They address a very common need, and the current alternatives are horrible:

  • OrderByDescending(...).First() is O(n log n) when MaxBy could be O(n)
  • Aggregate can be leveraged for this, but it's unwieldy and hard to read.
  • As mentioned, there's currently no overload to specify a comparer, and anyway, writing one is much more verbose than just passing a lambda.

DistinctBy, and possibly IntersectBy, UnionBy, ExceptBy and SequenceEqualBy would also be useful.

@thomaslevesque

SequenceEqualBy

That seems unnecessary since the return value doesn't need the type of the sequence before applying the selector. Just do source.Select(...).SequenceEqual(other.Select(...)) The other set operations you mention seem like they would be good additions, though.

That seems unnecessary since the return value doesn't need the type of the sequence before applying the selector. Just do source.Select(...).SequenceEqual(other.Select(...)) The other set operations you mention seem like they would be good additions, though.

It's true that SequenceEqualBy can be achieved with existing methods, but it's less intuitive. Also, a specific implementation could be slightly faster, since it wouldn't have the overhead of the Select iterator.

@thomaslevesque

OrderByDescending(...).First() is O(n log n) when MaxBy could be O(n)

Not anymore, see https://github.com/dotnet/corefx/pull/2401.

@svick nice!

With regards to IE Concat(IE>) aka Flatten, there may be very little performance to be gained over SelectMany with the identity function, so it's more a matter of expectation as to whether one finds such an operator when needed or expected. Things like Task> and AggregateException have similar notions of flattening. For example, Scala has a flatten function too.

This would be great! Just as long as it is not called Merge....

Regarding MinBy and MaxBy, I definitely think a single value should be returned. While multiple results are of course a possibility, I think that the By suffix indicates a desired behavior in the vein of DistinctBy, ExceptBy, and IntersectBy. In my mode of reasoning, which often involves fuzzy logic, The keySelector strongly suggests that you do not care which one is returned, so long as it is maximal or minimal. These operations tend to be inherently reductive (in a highly useful way).

LINQ is all about productivity. Why not make a productive library even more productive?
My vote is then yeah.

So nice suggestion! I want to use Ix.NET methods without Introducing Library via nuget.

But it is so hard to introduce all Ix.Net methods at the same time to dotnet core.How about introduce and implement them individually?

In this issue, we discuss about Scan methods.
https://github.com/dotnet/corefx/issues/31348

without Introducing Library via nuget.

Why?

without Introducing Library via nuget.

Why?

https://github.com/dotnet/corefx/issues/14119#issuecomment-459973238


Because I want to use Ix.NET methods as language builtin Collection extension methods.
And I want to make many C#er can use Ix.NET methods.


I think, there is so big difference between

  • language builtin Collection extension methods (LINQ to Objects)
  • Libary Collection extension methods (like Ix.NET or MoreLinq via nuget.)

Many C#er study and can use LINQ to Objects (language builtin Collection extension methods).
But Libary Collection extension methods are not same. Some people like and use Ix.NET, MoreLinq and others, but the other people don't use them and don't know them.

About LINQ to Objects, code reading, code review and communication are so easy because many people can use them.
But about Library Collection extension methods, they are not easy. Because not all people know them.

There are so many useful methods in Ix.NET. They make so big productivity.

If Ix.NET methods are added in .NET core in the future, I think that many C#er will be able to use them. And code reading, code review and communication about them will become so easy. I think this introduce more productivity for C#er and C# community.


Some people say that "You can create your own Collection extension methods". Many people create their own Collection extension methods individually. Sometime they are so useful, but sometime they are confused. Because they are not same implementation and specification.

If C#(.NET Core) provides Ix.NET methods as language builtin methods, we can use as common implementation and specification methods. It is so useful and clear.


Because I want to use Ix.NET methods as language builtin Collection extension methods. And I want to make many C#er know and can use Ix.NET methods.

Thanks. I read all of those as, paraphrasing: "because I don't want to use another library". To me that's not a good enough reason to add such methods. The core frameworks simply cannot scale to add every possible extension method someone might want. For the .NET ecosystem to thrive, developers need to be comfortable using libraries from NuGet and the broader ecosystem.

Thank you!

For the .NET ecosystem to thrive, developers need to be comfortable using libraries from NuGet and the broader ecosystem.

Yes! I understand and agree above.

Thanks. I read all of those as, paraphrasinAnd I want to make many C#er can use Ix.NET methods.

I'm so sorry. My main suggestion and opiniton are that "I want to make many C#er can use Ix.NET methods."

But I understand next important point at the same time.

The core frameworks simply cannot scale to add every possible extension method someone might want.

By the way, Ix.NET TakeLast and SkipLast methods were introduced in dotnet/runtime#19431. This means that "If the method is useful and need, it will be added as .NET core builtin method.", isn't it? So I will suggest issues that introduce Ix.NET methods indivisually.

So I submited new issue IsEmpty Ix.NET method in https://github.com/dotnet/corefx/issues/35054
And I take part in the issue about Scan like method : https://github.com/dotnet/corefx/issues/31348

So I will not introduce all Ix.NET methods. But I will introduce methods that they are useful and need for .NET core from Ix.NET.

This may be a good time to review other operators we got in Ix.NET and see whether they're worth adding to .NET proper.

I'd like to question the premise here. In Ix there is an existing popular library, available from NuGet, for free, that already enables any developer using .NET to get this functionality. What's the benefit of copying that functionality into Enumerable? I would much rather spend such efforts removing whatever roadblocks exist for folks to partake in the NuGet ecosystem.

I'm totally for growing the ecosystem and ironing out any wrinkles that may hinder adoption of libraries through NuGet.

One of the motivating issues for opening this was the conflict introduced by TakeLast and SkipLast in Ix where the same extension method ended up being provided in .NET, thus having to add #if in Ix to create different build flavors to resolve the ambiguity for users of those operators (while keeping downlevel support). This is the main drawback of multiple libraries providing the same extension methods and may be a topic to consider in the design of "extension everything".

From the list above, the one inconsistency I see in System.Linq is the lack of a comparer on Min and Max, while other operators do provide a customization for comparers. All other proposals do not intersect with existing operators and can be punted to Ix.

As for Ix, given the recent work on System.Linq.Async over there, I think it's totally fair to continue going down the road of contributing such libraries through NuGet and focus on discoverability.

Related to #27687

As already mentioned adding more methods to System.Linq comes at a cost, so we would need to see strong evidence for the utility of any new method on a case by case basis. As it stands this issue is more of a wish list of assorted LINQ methods, making it more difficult to evaluate during API review. I would recommend closing it and creating separate issues for each proposal, so that we can have a more focused debate.

A few thoughts on the individual proposals:

  • I think MaxBy and MinBy are genuinely useful additions, that cannot be easily expressed with the existing API. Already covered by #27687.
  • Concat seems like a useful addition. Would you be able to create a dedicated issue for it?
  • I would not support adding StartWith since it's really just a slightly different Prepend.
  • Scan seems pretty niche. While I can see why it might make sense in Rx-like flows, I don't think it would see much use in the context of LINQ. IMHO Aggregate is pretty niche as well and should best be avoided.
Was this page helpful?
0 / 5 - 0 ratings

Related issues

omariom picture omariom  路  3Comments

omajid picture omajid  路  3Comments

yahorsi picture yahorsi  路  3Comments

sahithreddyk picture sahithreddyk  路  3Comments

matty-hall picture matty-hall  路  3Comments