Thanks to a recent PR, I learned about a curious bipartition in the performance of iterator-based (i.e., relying on yield or Yield) methods: what performs well in an iterating scenario (e.g., foreach (var item in YieldMethod()) { }) might perform badly in a returning one (e.g., var outputCollection = YieldMethod();) and vice versa.
In that PR, I suggested an improved version of System.Linq.Intersect. The new version performs notably better in returning scenarios (around 40% faster under a wide variety of conditions on a not-too-powerful computer; e.g., var output = collection1.Intersect(collection2);) but notably worse in iterating ones (e.g., foreach (var item in collection1.Intersect(collection2)) { }). My proposal couldn't go through because it didn't represent a real improvement (even if all the performance gains were bigger than the losses, a relevant number of scenarios performing notably worse wouldn't be acceptable).
Although modifying the existing System.Linq.Intersect isn't reasonable, creating an alternative method that performs notably faster when merely getting the collection (rather than iterating through it right away) sounds fine to me. Do you want an alternative which performs reasonably well in any case? Use System.Linq.Intersect. Are you sure that you just want to get the collection (and iterate through it right afterwards, at a later point, or even not at all)? Use the notably faster new alternative.
Before going ahead with my proposal, I want to highlight the tremendous importance of the System.Linq methods, which are being systematically used in virtually any situation. Note that I am not the kind of programmer who likes very small code and always relies on this kind of approach; I don't mind writing much bigger code if it performs better or is clearer or more easily modifiable. But I do use System.Linq methods on a regular basis; at least, some of them. Getting notably-better-performing versions of these methods would have a notable impact on my coding approach. Actually, I wasn't using Intersect too much lately precisely because I had confirmed its not-that-good performance under many different conditions.
What I am proposing is a new class (e.g., System.LinqFast) including returning-scenario-optimised versions of all the System.Linq methods. Although I have only analysed the Intersect and Except (virtually identical to Intersect) implementations, I can safely assume that equivalent ideas are applicable to the other methods because of the observed noticeable performance differences between the iterating/returning scenarios. That is, either the current methods are adequately optimised for iteration purposes (and a different version is then likely to perform notably better for the returning scenario) or they might be further optimised on this front (also a positive outcome for this proposal).
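For illustration only, a skeleton of what such a class might look like (all names hypothetical; this is just the shape of the idea, not a concrete implementation):
```C#
using System.Collections.Generic;

namespace System.LinqFast
{
    // Hypothetical skeleton: returning-scenario-optimised counterparts of the System.Linq methods.
    public static class EnumerableFast
    {
        // Same signature as System.Linq.Enumerable.Intersect; different internal trade-offs.
        public static IEnumerable<TSource> Intersect<TSource>(this IEnumerable<TSource> first, IEnumerable<TSource> second)
        {
            // A returning-optimised implementation would go here.
            throw new NotImplementedException();
        }
    }
}
```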
As an example of the kind of modifications which are likely to distinguish each of the aforementioned scenarios, take a look at the Intersect code in my aforementioned PR:
Original code (bearing in mind that comparer is always null and the definition of the Set internal class):
```C#
private static IEnumerable<TSource> IntersectIterator<TSource>(IEnumerable<TSource> first, IEnumerable<TSource> second, IEqualityComparer<TSource> comparer)
{
    Set<TSource> set = new Set<TSource>(comparer);
    foreach (TSource element in second)
    {
        set.Add(element);
    }
    foreach (TSource element in first)
    {
        if (set.Remove(element))
        {
            yield return element;
        }
    }
}
```
Notably-faster-for-returning-scenarios alternative:
```C#
private static IEnumerable<TSource> IntersectIteratorNew<TSource>(IEnumerable<TSource> first, IEnumerable<TSource> second)
{
    IEnumerable<TSource> first2 = first.Distinct();
    IEnumerable<TSource> second2 = second.Distinct();
    foreach (TSource element in first2)
    {
        if (second2.Contains(element))
        {
            yield return element;
        }
    }
}
```
There are also other approaches which are appreciably faster than the current code for returning scenarios. For example:
```C#
private static IEnumerable<TSource> IntersectIteratorAlt<TSource>(IEnumerable<TSource> first, IEnumerable<TSource> second) // method name illustrative
{
    List<TSource> done = new List<TSource>();
    foreach (TSource element in first)
    {
        if (done.Contains(element)) continue;
        foreach (TSource element2 in second)
        {
            if (done.Contains(element2)) continue;
            // null-safe comparison (element2.Equals would throw if element2 were null)
            if ((element == null && element2 == null) || (element2 != null && element2.Equals(element)))
            {
                done.Add(element);
                yield return element;
                break; // avoid yielding the same element again for further matches in second
            }
        }
    }
}
```
The underlying idea behind these codes is that the `yield` statement (when used inside an iterator) penalises the reliance on external (iterating) methods (e.g., `Contains`). This fact makes a-priori-worse-performing alternatives (e.g., adding/removing items to/from a third collection, rather than loops + calling the efficient `Contains` method) ideal under iterating conditions, but certainly improvable for other scenarios.
Since this proposal implies quite relevant modifications and I don't know what the .NET team/community positions on this front are, I'd rather not analyse other methods for the time being. If I see some positive feedback (and certainly before writing the definitive proposal), I wouldn't mind analysing other System.Linq methods (bearing in mind that Except is very similar to Intersect and, consequently, all the aforementioned ideas apply to it too). If we decide to go ahead, I would like to take care of the implementation myself.
Notably-faster-for-returning-scenarios alternative:
In the PR that mentioned this before, it seemed that this was in fact notably slower if one tries to actually obtain the values produced, even in the case of small collection-type inputs. Have you something to suggest otherwise?
More generally, Enumerable already has many cases where it takes a different approach because the type of the source allows a more performant approach.
The new version performs notably better in returning scenarios (around 40% faster under a wide variety of conditions on a not-too-powerful computer; e.g., var output = collection1.Intersect(collection2);) but notably worse in iterating ones (e.g., foreach (var item in collection1.Intersect(collection2)) { }).
The "returning" case really isn't important to optimize for. The only reason you call these methods is in order to iterate through and get the results: just calling the method to get the enumerable but not using it has no value. Further, the costs involved in getting the enumerable are completely outweighed by the costs of iterating, so even if a change improves the time it takes to get the enumerable significantly, that is generally inconsequential in the overall time it takes to use the result. Further, even if the time to get the enumerator mattered, there's actually little to be done to improve it; in general, all that's done to get the enumerable is doing the argument validation on the inputs and then constructing the enumerable, which generally just involves storing some data into fields... IIRC, in your example the only reason the "returning" case was faster was because your method itself was an iterator, which means you were delaying all of the argument validation until MoveNext was called, which is not a valid change.
In the PR that mentioned this before, it seemed that this was in fact notably slower if one tries to actually obtain the values produced
@JonHanna I thought that this point was clear. No. Getting the collection was always much quicker (it is the same doing Method(); or var output = Method();). The scenario you proposed of testing inside a foreach() was the much slower one (not always; but in some cases). Take a look at the last version of my testing code there to confirm this point.
Getting the collection is only worth doing if one is going to use it in some way. The time of doing both is what must be improved.
Indeed, many optimisations that were made to Enumerable made getting the results slower (probably the majority of the significant changes), because a slight extra cost there was paid back by a significant improvement in iteration time in at least some cases.
@stephentoub
The "returning" case really isn't important to optimize for.
I fully disagree with you on this point (but well, this is the reason why I didn't spend too much time on this proposal: because I don't know your exact position). For example, sometimes I do foreach(var item in collection.Intersect(collection2)){} but in many other cases I just need var output = collection.Intersect(collection2). Same thing for other methods.
The only reason you call these methods is in order to iterate through and get the results:
Certainly not true. You might keep using them as arguments for other Linq methods (e.g., var res = collection.Intersect(collection2); and then if(res.First() == whatever){}) or move to a different collection (e.g., var resList = collection.Intersect(collection2).ToList();). As said, these methods are used in a relevant way in many different scenarios (even by people like me, with an a priori different programming approach).
and then if(res.First() == whatever){}
The moment you call res.First(), you're iterating through the result of the intersection.
or move to a different collection (e.g., var resList = collection.Intersect(collection2).ToList();
The moment you call ToList, you're iterating through the result of the intersection.
@JonHanna
because a slight extra cost there was paid back by a significant improvement in iteration time in at least some cases.
I wasn't aware of this fact until yesterday, and this is the whole point of my proposal: accounting for such an improvable reality. You already have the default methods delivering the best balance for both scenarios; why not have another set performing much better for the returning one?
Certainly not true. You might keep using them as arguments for other Linq methods.
Which ultimately will result in iteration.
then if(res.First() == whatever).
Iterates through (there are a few exceptions where we manage to skip that entirely, but the proposed change to Intersect isn't one of them).
or move to a different collection
Likewise.
@stephentoub Logically, it was a random example that I expected to be properly understood. In any case, with First() you are calling a different method, not iterating through the results of the first one.
First() does iterate through the results of the enumerable it is called on. (There are a few exceptions where it can do something else, but the change to Intersect here doesn't offer that).
In any case, with First() you are calling a different method, not iterating through the results of the first one.
You are iterating through it, you're just not doing it directly: First() is doing it. But you're still incurring the costs of iteration, not just the "returning" cost.
it was a random example expected to be properly understood
Please provide a real-world example then where you call Intersect, where the cost of that call is impactful, and where you never directly or indirectly call MoveNext on the resulting enumerable.
@JonHanna @stephentoub Sorry about the bad example. ToList() is basically the same as the foreach(){} scenario, but anything else isn't. For example, add .First() to the generated results in my PR testing code and confirm that the performance of the new alternative is notably better. In fact, I am observing increasingly better results when coupling various Linq calls one after the other.
@stephentoub
You are iterating through it, you're just not doing it directly: First() is doing it
But in a different way than foreach(){} right away. First() is analysing the collection, so it has been fully created and the step-by-step advantage of foreach{} has already disappeared.
OK. I see that we weren't exactly on the same page. I will do some more tests to account for further scenarios and further-validate my proposal. Will be back in a while (lunch time).
First()'s main path (once certain possible fast paths have been ruled out) is essentially foreach(var item in source) return item;
@JonHanna But going through a foreach isn't the problem. The problem is the peculiarities (difficult to understand/control other than via testing) associated with iterating right away through the collection generated via yield, analysing it step by step rather than as a whole. First() takes the whole input.
As said, please let me do some more tests addressing all your concerns (after having my lunch). Additionally, bear in mind that the underlying idea is creating versions of the Linq methods optimised around the iterating/non-iterating differences (not necessarily having exactly the same format; they might even return a different type of collection). In any case, please let me do some tests to address your concerns properly.
Yes, First() has such analysis. It's able to find faster paths for IList<T>, the results of Range(), Repeat(), OrderBy() and some examples of Select(). I don't see how that is applicable here.
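For reference, a heavily simplified sketch of that shape (not the actual corefx source; the fast-path detection is reduced to IList<T> for brevity):
```C#
using System;
using System.Collections.Generic;

public static class FirstSketchExtensions
{
    // Simplified sketch of First()'s structure (illustrative only).
    public static TSource FirstSketch<TSource>(this IEnumerable<TSource> source)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));

        // Fast path: indexable sources are read directly, with no enumerator involved.
        if (source is IList<TSource> list)
        {
            if (list.Count > 0) return list[0];
        }
        else
        {
            // Main path: essentially foreach (var item in source) return item;
            foreach (TSource item in source) return item;
        }
        throw new InvalidOperationException("Sequence contains no elements");
    }
}
```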
I have used the testing code in the aforementioned PR, removed the iterating alternative (TestIterator) and modified TestReturnedCollection as follows:
```C#
// 'target' is assumed to be a delegate type along the lines of:
// public delegate IEnumerable<T> target<T>(IEnumerable<T> first, IEnumerable<T> second);
public static void TestReturnedCollection(int maxCount, target<string> target, IEnumerable<string> first, IEnumerable<string> second)
{
    int count = 0;
    while (count < maxCount)
    {
        count++;
        var test = target(first, second);
        var test2 = test.First();
    }
}
```
And I have tested all the System.Linq methods by replacing var test2 = test.First(); accordingly. Most of these methods didn't have an appreciable effect on the Intersect/Intersect2 performance differences (i.e., the new method continued being equivalently faster than the old one). The Linq methods analysing the whole collection (e.g., ToList(), ToArray(), Count(), Last(), etc.) did provoke the expected inversion of performance (the new approach being notably worse, like what happens in the foreach(){} scenario). There were some cases which performed differently than I was expecting; for example, Take(), Skip(), Select(), ToString(), etc. all allowed the new approach to continue being notably better. And there were various cases which I found particularly surprising: First(), FirstOrDefault() and ElementAtOrDefault(0) (with ElementAtOrDefault(1), the new approach was already slower) made the new alternative much quicker than when not using them at all!
By applying the code I wrote for Intersect and the aforementioned conclusions, it would be possible to create a really fast new method called IntersectTakeFirst. For example: collection1.Intersect(collection2).Take(5).First() would deliver the same result as collection1.IntersectTakeFirst(collection2, 5), but much more slowly. Another option might be setting up a new type (EnumerableFast) only able to deal with other fast alternatives. There are many possibilities and the potential benefits might be very relevant.
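One possible shape of such a fused method, as a minimal sketch (the name and the single-pass strategy are hypothetical, and default-comparer semantics are assumed):
```C#
using System;
using System.Collections.Generic;

public static class IntersectFirstExtensions
{
    // Hypothetical: the same result as first.Intersect(second).First(),
    // computed in one pass without materialising the whole intersection.
    public static TSource IntersectFirst<TSource>(this IEnumerable<TSource> first, IEnumerable<TSource> second)
    {
        var set = new HashSet<TSource>(second);   // EqualityComparer<TSource>.Default semantics assumed
        foreach (TSource element in first)
        {
            if (set.Contains(element)) return element;
        }
        throw new InvalidOperationException("No common element");
    }
}
```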
Have I addressed some of your concerns @JonHanna and @stephentoub ?
Have I addressed some of your concerns @JonHanna and @stephentoub ?
Thank you, but no, this doesn't address my concerns.
First, this issue was opened about creating a specialized LINQ that optimizes for the speed of returning the IEnumerable<T> from calls like Intersect. That only matters if you're not going to use (i.e. call GetEnumerator and MoveNext, explicitly or implicitly) on the result, and your examples do use those, so I've yet to see an example that would justify this. As I previously commented: "Please provide a real-world example then where you call Intersect, where the cost of that call is impactful, and where you never directly or indirectly call MoveNext on the resulting enumerable."
Second, you say that your new iterator is faster for "returning" scenarios, but even if that mattered (and it doesn't), it's not actually faster. Regardless of what the actual private iterator method looks like, the entry point is going to look like:
```C#
public static IEnumerable<TSource> Intersect<TSource>(this IEnumerable<TSource> first, IEnumerable<TSource> second)
{
    if (first == null) throw Error.ArgumentNull(nameof(first));
    if (second == null) throw Error.ArgumentNull(nameof(second));
    return TheCoreIteratorMethod(first, second);
}
```
Calling "TheCoreIteratorMethod" doesn't actually do any of the work; it doesn't even enter the iterator method you wrote. All it does is allocate a compiler-generated iterator class and store the parameters you're passing into it as fields on the instance. It doesn't matter how much work is done inside of TheCoreIteratorMethod, because on the "returning" path as you call it, nothing in there will ever be used... that code is only touched when you call MoveNext on the result of calling GetEnumerator on the result of Intersect. So, since the argument validation needs to be done in the public entry point like this, and since all of the approaches being discussed involve an iterator, the costs in all of these are going to be the same: validate the input arguments, allocate the compiler-generated type, store the arguments into it, and return. Any benchmark which suggests one is faster than the other is flawed.
Third, even if there was a substantial difference in the "returning" time between the cases, it's still completely dwarfed by the time it takes to do anything with the resulting iterator. Consider this microbenchmark:
```C#
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

static class Program
{
    public static void Main()
    {
        var sw = new Stopwatch();
        const int Iters = 10000000;
        IEnumerable<int> input1 = Enumerable.Range(0, 10);
        IEnumerable<int> input2 = Enumerable.Range(5, 10);
        while (true)
        {
            sw.Restart();
            for (int i = 0; i < Iters; i++)
            {
                input1.Intersect(input2);
            }
            sw.Stop();
            double callTime = sw.Elapsed.TotalSeconds;

            sw.Restart();
            for (int i = 0; i < Iters; i++)
            {
                input1.Intersect(input2).First();
            }
            sw.Stop();
            double totalTime = sw.Elapsed.TotalSeconds;

            Console.WriteLine($"{callTime / totalTime * 100:G2}%");
        }
    }
}
```
All this does is an intersect between two small enumerables that have some overlap, and it times how long it takes to call Intersect vs how long it takes to call Intersect and then call First on the result (which uses the result of the Intersect). In this example, on my machine, the Intersect call time is only 2.5% of the total execution time. Even if you could make it go, say, 25% faster (and as I highlighted in my previous paragraph, any optimizations done would apply equally to both the current and your proposed implementations, so there wouldn't be a difference between them), that would still only be saving 0.6% of the total time required to call Intersect(...).First().
Fourth, benchmarks can only be valid when the things being compared are valid, and the proposed implementation is not valid. First, for source inputs that are just enumerables and don't implement any special interface like ICollection<T>, the proposed implementations are calling Contains on the inputs potentially many, many times; this is both invalid because each call will end up iterating through the source and we can only iterate at most once (otherwise sources with side effects will incur those side effects multiple times) and because each of those calls is O(N) in the number of elements, which means this has now been turned into an O(N^2) algorithm. Second, for source inputs that do implement a special interface like ICollection<T>, you can't assume that Contains provides the right semantics. For example, a HashSet<T> can be constructed with a comparer that tells it how to compare its items internally, including in calls to Contains. If you call Intersect(..., null), that's the same as calling Intersect(..., EqualityComparer<TSource>.Default), and the items in the hashset will be compared against the other collection via EqualityComparer<TSource>.Default... but with your implementation, HashSet<T>.Contains is used, and the elements will be compared not with EqualityComparer<TSource>.Default but with the comparer with which the HashSet was constructed.
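A small self-contained demonstration of that comparer mismatch (illustrative values):
```C#
using System;
using System.Collections.Generic;
using System.Linq;

static class ComparerPitfall
{
    static void Main()
    {
        var set = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "A" };
        string[] other = { "a" };

        // Enumerable.Intersect with no comparer uses EqualityComparer<string>.Default:
        Console.WriteLine(other.Intersect(set).Count());  // 0: "a" and "A" differ ordinally

        // ...whereas HashSet<string>.Contains honours the set's own comparer:
        Console.WriteLine(set.Contains("a"));             // True: case-insensitive
    }
}
```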
If there are ways to improve the existing implementation, making at least some cases go faster while not harming other cases, great, please submit PRs for that; we welcome them. But such changes can't improve one case to the detriment of another, and the benchmarks used in the analysis can't just be for the good cases.
@stephentoub
That only matters if you're not going to use (i.e. call GetEnumerator and MoveNext, explicitly or implicitly) on the result, and your examples do use those, so I've yet to see an example that would justify this.
As explained, using First() delivers (surprisingly) faster performance. Quickly looking at its code, it doesn't need to call GetEnumerator/MoveNext in certain cases, which seems to be what is happening with the proposed tests (firstly Intersect() with the new algorithm and then First()). This isn't precisely a simple matter (mainly bearing in mind that different Linq methods behave differently), and that's why I think that it is better to focus the preliminary analyses on tests. I did that and the results I got seem quite descriptive. On the other hand, I did try the testing conditions you are proposing, and the performance isn't better (I comment more on it below). I would have to look at this situation closer to understand the exact conditions allowing First()'s tremendous gain of not iterating at all. But just one scenario (the one in my original code) performing so well deserves, IMO, to be further analysed.
that code is only touched when you call MoveNext
Not necessarily, as explained above for First() or in my comments below. In any case, this eventuality might also be analysed. The reason for starting this issue was becoming aware of the notable performance drawbacks that the yield/Linq implementations imply; there are tons of possible scenarios and just one of them taking advantage of this issue would be very important. The fact is that you can assign the contents of an IEnumerable variable much faster than usual; under normal conditions, such a gain would be lost when looking at those contents, but what if you don't look at them?
Consider this microbenchmark:
Good example highlighting the weaknesses of the one I proposed. As said, I would need to analyse this First() case in more detail to understand exactly the reason for its behaviour (certainly worthwhile given the chance of getting such a good performance improvement as shown by my test). But what about if you replace First with Equals or ToString?
```C#
while (true)
{
    sw.Restart();
    for (int i = 0; i < Iters; i++)
    {
        var output = input1.Intersect(input2); //input1.Intersect2(input2);
        var output2 = output.ToString();
        //if (output2.Equals(output)) { }
    }
    sw.Stop();
    Console.WriteLine(sw.Elapsed.TotalSeconds);
}
```
If you test this modified version of your code, you will continue seeing the new algorithm performing notably better. If I only need to compare a collection with the result of an Intersect operation and I can do it much more quickly, why not enable this option? Or better: doesn't such an eventuality make it worth analysing this whole line more carefully? Imagine just one single scenario being much faster; wouldn't that be excellent news?
the proposed implementations are calling Contains on the inputs potentially many, many times
This was just a preliminary implementation to show the point (although I don't think that it is too improvable either). On the one hand, you have adding to/removing from a collection and, on the other hand, loops + Contains. In my experience (and my tests so far), the second option is likely to always be quicker. But I might analyse the situation deeper, do more tests, try different things, etc. if this is the problem.
because each of those calls is O(N) in the number of elements
How can you compare the cost of adding/removing an element to/from an increasingly bigger collection against using a highly-optimised method like Contains (or another alternative)? The only way to make sure about the best approach is by testing (+ eventually having some relevant experience on these fronts telling you what the likely best alternative is). I insist on my offer of further tests.
I cannot continue trying to address a huge number of concerns (which seems to keep growing every time I answer something) to convince someone who, at the end, has the last word here. If something of what I have written now (or before) seems promising to you and you want me to further analyse/test a specific direction, I wouldn't find any problem with that. If you prefer me to convert this imprecise issue into one focused on a very specific implementation (e.g., finding a set of methods which might deliver notably better performance always, like IntersectFirst), I might also work on that. But you have to understand that I cannot continue defending myself and my proposal from systematic criticism based upon an unquestioned "it cannot be done". I understand that you are trying to do your level best, so please try also to understand my position: if you want me to say "I understand that it is completely impossible", this won't happen because I don't think so. I think that this is a promising direction which might yield beneficial improvements. In any case, I am certainly not willing to fight against everyone to prove my point (mainly when the critics have the final word; I am not insinuating lack of professionalism, just describing the reality I am facing).
But what about if you replace First with Equals or ToString?
ToString() returns 'System.Linq.Enumerable+<IntersectIterator>d__69`1[System.Int32]', I don't know why you think this is a valid benchmark; in what real-world scenario do you need good performance when calling ToString() on an IEnumerable? Maybe you mean String.Join(", ", output), which would at least iterate through the result and collect the items in it, producing something like "1, 2, 3, 4, ..."
In addition, why do you want to increase the performance of an IEnumerable that you never plan to enumerate? Again, I don't know what real-world scenario this happens in.
If I only need to compare a collection with the result of an Intersect operation and I can do it much quickly, why not enabling this option?
I don't think this code is doing what you think it is doing:
```C#
var output = input1.Intersect(input2);
var output2 = output.ToString();
if (output2.Equals(output)) { }
```
For starters, you're doing Equals with output2, which is a string, and output, which is an IEnumerable<int>. This won't "compare a collection with the result of an Intersect operation".
Take this code below:
```C#
IEnumerable<int> input1 = Enumerable.Range(0, 10);
IEnumerable<int> input2 = Enumerable.Range(5, 10);
var output1 = input1.Intersect(input2);
var output2 = input1.Intersect(input2);
Console.WriteLine("{0}", output1.Equals(output2)); // False
```
@mattwarren
ToString() returns 'System.Linq.Enumerable+<IntersectIterator>d__69`1[System.Int32]', I don't know why you think this is a valid benchmark; in what real-world scenario do you need good performance when calling ToString() on an IEnumerable?
You misunderstood the point. The better-performing scenario is var result = Intersect2();. ToString() was just the way to output result without provoking the performance of the proposed approach to drop. This is undoubtedly a quite unorthodox situation, but I was plainly trying to make a point (i.e., it is possible to use Linq-generated results in a non-standard way, avoiding GetEnumerator/MoveNext).
Maybe you mean String.Join(", ", output), which would at least iterate through the result and collect the items in it, producing something like "1, 2, 3, 4, ..."
The whole point here was precisely avoiding iterations. In any case, I didn't test that eventuality and it might also be fine. As said, not too relevant; just trying to make my point.
I don't think this code is doing what you think it is doing:
I think that you should re-read my comment and the previous ones, because you don't seem to understand what this is about. Here goes a quick summary: I wrote a new version of Intersect() performing notably faster but only when returned (e.g., var result = Intersect2();), not when iterated (e.g., foreach(var item in Intersect2()){}). In my previous comment (and the ones before it), I was trying to prove how relevant this fact is, as opposed to what the other commenters here think (i.e., that there is no point in making the "returning version" faster as far as it will have to be iterated anyway).
For starters, you're doing Equals with output2, which is a string, and output, which is an IEnumerable<int>. This won't "compare a collection with the result of an Intersect operation".
This comment (and the one below it) doesn't make too much sense once the intention of my post is properly understood. As explained above, I was plainly providing an "exiting alternative" allowing the performance of the new method to remain fine. I proposed relying on Equals (or another comparison statement) such that the Enumerable variable could be used as such (e.g., without calling GetEnumerator/MoveNext).
@stephentoub Since you aren't proposing anything, and this thread seems to be starting to derail a bit (BTW, @jnm2 feel free to ask me for clarifications regarding any issue you cannot understand), here goes my proposal: I will properly analyse the current main implementation (the Set class) to create a version optimised for what is intended here. That is, I will create a proper first version of the proposed just-returning-values alternative (for the time being, just for Intersect).
I have also found quite curious (perhaps I am the only one) the notable performance differences when testing the code in my original PR under different conditions using additional Linq methods. I will try to gain more insight into all this by adequately analysing the behaviour of, at least, First(), ToString() and Equals(), as opposed to a much worse performing case like ToList().
I will come back in some days.
You should take ToString() and Equals() off that list. Neither is overridden, for one thing; neither is very expensive, for another; and neither is particularly commonly hit in frequent use cases, for a third.
Nobody's application is going to become faster, lighter in memory use, or otherwise any better because they upgraded to a version of the assembly where a.Intersect(b).ToString() is faster.
When the performance of these Enumerable methods was something I was focusing on a lot last year, I would start with a very simple benchmark that would take reasonably sized (around 1000 elements) inputs, carry out the operation in question, and then do an empty foreach on the results.
I'd make my first attempt one that "cheated" in the favour of my idea, because if I can't get an improvement when I'm cheating, there's no point going any further (and a lot of my ideas never got past this point).
I'd make my second attempt one that "cheated" against my idea, because if I can get an improvement even when I'm cheating the benchmark against the idea, then I'm definitely on a winner.
These would be quick throwaway code tests to decide whether an idea was worth pursuing at all. I'd still need to benchmark further, generally with a mixture of input sizes (both binary-round and not, as that can have an effect sometimes), primitive, struct and reference types of elements, and different patterns to the source (generally I'd just use random elements, but in cases where it could have an effect I'd also try steadily increasing or decreasing sets, though I'd weight the results from the random sequences more in considering the outcome).
I might also try more methods that make use of the full results (the ToXXX methods) as well as something that consumed them all in a foreach, though I wouldn't consider the relatively quick cases like First() unless they were specifically covered by the change.
That could give me something like https://github.com/dotnet/corefx/pull/2401#issuecomment-142155644 or like comparisons.zip that could both demonstrate an improvement in likely cases and also that either there wasn't deterioration in others or at least argue that such deterioration was minor and more than offset by the improvement elsewhere.
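For concreteness, a throwaway first-pass harness in that spirit might look like this (all values illustrative: 1000 random elements, an empty foreach to consume the results):
```C#
using System;
using System.Diagnostics;
using System.Linq;

static class QuickBench
{
    static void Main()
    {
        var rnd = new Random(42);
        int[] first = Enumerable.Range(0, 1000).Select(_ => rnd.Next(2000)).ToArray();
        int[] second = Enumerable.Range(0, 1000).Select(_ => rnd.Next(2000)).ToArray();

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 10000; i++)
        {
            // Empty foreach: the cheapest way to consume every produced element.
            foreach (int item in first.Intersect(second)) { }
        }
        sw.Stop();
        Console.WriteLine(sw.Elapsed);
    }
}
```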
@JonHanna Thanks. I will certainly consider all your suggestions.
In any case, I want to highlight that my Equals() and ToString() references were only meant to prove a point (i.e., there are scenarios where you don't need to perform a standard iteration). ToString() is certainly a bit extreme, but Equals shows the case (which I have personally encountered quite a few times) where you generate an IEnumerable only meant to be compared with something else. For example, you just want to know whether collectionA equals collectionB.Intersect(collectionC) or not.
Then your use of Equals() was a bug:
```C#
int[] a = {1, 2, 3};
int[] b = {2, 3, 4};
Console.WriteLine(a.Intersect(b).Equals(a.Intersect(b))); // False
```
You would want to use SequenceEqual() here, possibly with a short-circuit of if (a == b) return true; first if in your given use case it was reasonably likely that you'd end up comparing something with itself.
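A minimal sketch of that suggestion (hypothetical helper name):
```C#
using System.Collections.Generic;
using System.Linq;

static class SequenceComparison
{
    // Hypothetical helper: compares the produced sequences, not the object references.
    public static bool SameSequence<T>(IEnumerable<T> a, IEnumerable<T> b)
    {
        if (ReferenceEquals(a, b)) return true;  // optional short-circuit for the same instance
        return a.SequenceEqual(b);               // iterates both and compares element by element
    }
}
```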
Equals() isn't even coming from the result of the method anyway, but is inherited from object, so the only impact is how quickly the object was created. An improvement there would be welcome if it was still correct in outcome and the total time of T(get result) + T(use result) was reduced, but not otherwise. Otherwise it could be like me deciding to walk to New York to save me the time of getting to the airport.
@JonHanna
Then your use of Equals() was a bug:
Again, I was plainly trying to make a point! You have IEnumerable result = a.IntersectQuick(b), but you cannot iterate through it. What can you do? For example, you might compare it with another IEnumerable via Equals(). I was plainly trying to prove that iterating (at least, iterating through all the elements, which provokes my proposal to perform much worse) isn't strictly required.
Equal() isn't even coming from the result of the method anyway, but inherited from object, so the only impact is how quickly the object was created
This is exactly what I was looking for! Alternatives to what you were claiming as the only way to use an IEnumerable variable (i.e., iterating through it).
I will better focus on creating code delivering exactly what I meant and stop the general, easily misunderstood talking. As said, I will come back in some days.
I'm just confused because I'm racking my brain trying to think of even one case, one, in all the professional and hobbyist code I've ever seen, where the result of a set operation like Intersect is used without somehow calling MoveNext.
I agree with the others here that the MoveNext cases are the only cases that anyone will ever care about, even more so when we're talking about code with demands on performance. I'm eager to be shown what I'm missing but doubtful that a real world example exists.
@jnm2 Thanks for writing your concerns down (much clearer than an emoji).
As said, I have seen myself quite a few scenarios where the output of a Linq query was plainly compared with another collection. For example: var result = collection1.Intersect(collection2); and just wanting to know whether result equals targetCollection or not.
But the fact of being a common scenario or not wasn't the point anyway. What I was trying to prove was that just using an IEnumerable variable without iterating through it was possible. My intention was to justify why the proposed situations might be promising and should be analysed more closely (just one case being notably faster would make it really worthwhile).
In any case, all this abstract talking has certainly been a very bad approach. I will better focus on developing a descriptive enough code to prove my point. Feel free to come back in some days and look at it.
For example:
var result = collection1.Intersect(collection2); and just wanting to know whether result equals targetCollection or not.
It doesn't. That will always return false. Either that comparison should be doing something else and that's a bug, or else the real optimisation is to replace that comparison with the constant false and then delete any code this has either made unreachable or which was there just to produce this false, taking care to make sure any side-effects still happen if necessary.
@varocarbas
As said, I have seen myself quite a few scenarios where the output of a Linq query was plainly compared with another collection. For example: var result = collection1.Intersect(collection2); and just wanting to know whether result equals targetCollection or not.
You're saying this is such an example as I'm looking for, but I don't think it is. The problem is, why would you ever care if the result enumerable instance was the same as the targetCollection enumerable instance? If you're really checking to see if result gives the same sequence as targetCollection, then checking for reference equality is incorrect even as a short-circuit maneuver. It's quite possible that the same _instance_ returns two different sequences in two successive enumerations.
At this point I'm no closer to seeing a single example where this has already been done, let alone been the correct thing to do.
@JonHanna I don't think that extending a prone-to-misunderstanding conversation will have any value to anyone. Since the start, it should have been clear that my only intention was finding scenarios allowing the proposed approach (whose performance becomes horrible under the most common iterating scenarios) to keep being better than the original code. Also from the start, I made clear my test-based way of proceeding (until taking a closer look at all this, which I am planning to do during the next days).
Different Linq methods have proven to deliver different performance (even the same method under different conditions). I want to know better why this is the case and I want to see if I can modify the current approach (i.e., the Set class) to be optimal under the conditions I want. You have kept asking me for immediate examples and I have given you the first answers coming to my mind. When I have said, please let me analyse the whole situation properly such that I can adequately answer all your concerns, you have kept focusing on not adequately understanding the ideas I am trying to convey. You keep defending a common underlying reality which clearly doesn't exist (each method behaves differently in what matters here). I don't know what the benefit (for anyone) of all this is.
I have given you the first answers coming to my mind.
If you aren't happy with your own answers yet, what is this issue for?
@jnm2
As explained to JonHanna in my previous comment, I don't know why you are continuing along this line. If you want to properly understand what I have tried to say, it is really easy. If you want to continue discussing abstract, out-of-context, not-properly-understood issues bringing nothing of value here, perhaps you should consider a chat-based alternative.
Anyone keen on properly understanding my intention should have known a long time (= lots of posts) ago that I was plainly looking for scenarios allowing the proposed approach to continue performing better (without the drop provoked by full-iterating methods, like ToList()). I have (quickly) tested many alternatives and have seen many different behaviours, which I don't fully understand (because I haven't yet done the proper analysis which I have said I would like to do during the next days; unless I have to spend all my time here answering your concerns). Don't you like my preliminary ideas? Wait for my definitive ones or plainly ignore this thread. But why are you focusing on extremely irrelevant, plainly time-wasting issues?
If you aren't happy with your own answers yet, what is this issue for?
@JonHanna My only answer (since quite a few posts ago) has been: I see something worthy here and have to work on it to properly confirm/dismiss my assumptions. You have kept asking me (irrelevant, IMO) questions to prove that my assumptions have to be wrong. I have tried to address all your concerns as I think they deserve to be addressed ("there you have a quick answer; please, let me work on adequately understanding this whole situation"), until I stopped seeing the point of continuing in this direction. So, I have said: "I will work on developing a tangible piece of code for you to understand my point. Please, wait for it". And you have kept asking more.
My ideas are now as clear as at the moment when I started this issue. You don't seem to want to understand them. You don't seem to want to let me work on a clearer way to transmit them to you. So, what is the point of this issue? Good question. I am starting to not understand it myself either. Although, please, don't be confused regarding whose ideas aren't clear.
As said some posts ago, if you (or a .NET Team member) want to close this issue, do it. Ideally, before I waste more time on it.
All of us have been trying to tell you, repeatedly, that this is not worth optimizing, that we've never seen real examples where this is valuable. You then say that this is us not understanding you, and that there are cases where it's important to optimize, but you've not yet shared real examples. Until you're able to provide a compelling example, I'm closing the issue, per your request. Feel free to reopen if you can provide a compelling example for why this is worth optimizing.
As explained to JonHanna in my previous comment, I don't know why you are continuing along this line. If you want to properly understand what I have tried to say, it is really easy. If you want to continue discussing abstract, out-of-context, not-properly-understood issues bringing nothing of value here, perhaps you should consider a chat-based alternative.
Hey, whoa! Let's keep this professional. :) Nothing I have asked is out of context or abstract; indeed, I'm asking for concrete, real world preexisting scenarios. I'm purposely asking hard questions that will need real answers because that's the whole point of a worthwhile discussion and because I respect your intelligence.
You're proposing an alteration to my ecosystem which struck me as all cost and no benefit, and then you asked me to become part of the conversation and tell you why your comments confuse me. Now that I've obliged, it's not quite fair to tell me that I am forcing you to answer me before you have a real world example ready. Take your time.
There are a few possibilities I'm afraid you might not have given weight:
a) from what you've said, it's possible you had coded incorrectly (e.g. Equals != SequenceEqual) and when you begin coding correctly, this proposal will no longer bring performance benefits.
b) once you do have examples of set operation results being used without MoveNext that are correct, that are needful of high performance, that solve a real world problem, that have no desirable alternative to refactor to, and that would be a benefit to the wider audience of the BCL, then the cost of adding this code is justified. Same as any proposal.
Ed: race condition with @stephentoub.
@stephentoub
All of us have been trying to tell you, repeatedly, that this is not worth optimizing,
I will go ahead with the intended proper analysis anyway and will come back here with the results (including if I was wrong; I have no problem recognising my own errors). But I don't want to reopen this issue (can I even do it myself? I had no idea I could do such a thing). I am not interested in getting involved in this kind of thread (and presumably you aren't either). I apologise for having started all this (the whole issue, I mean) and will not repeat the same mistake in the future.
I'm closing the issue, per your request
I only said that you should feel free to do so if you thought that this was the best thing. In any case, I have no complaints, as I am not interested in continuing along these lines.
@jnm2 This is over (although I will deliver the promised output anyway; feel free to come back in some days to get some answers to your questions), but I have some comments on your last post.
You're proposing an alteration to my ecosystem which struck me as all cost and no benefit
The final benefit was very relevant (e.g., var result = AnyLinqJustReturning(); being much quicker than var result2 = AnyStandardLinq();). This whole thread was at a very preliminary stage, just considering minor (or virtually no) modifications of the existing Linq methods, but there was no reason not to come up with a brand-new approach for these cases. What you (and others; perhaps because of me not transmitting everything properly) didn't see was that you were criticising (with the best intentions, I'm not saying the contrary) something which wasn't even born. All this was a set of abstract ideas awaiting a proper understanding or time (for me to create the specific implementations). I didn't try to offend you or anyone, just to help you see that there was no point in continuing in that direction (you don't like my preliminary ideas? No problem. Let me work on a clearer version).
from what you've said, it's possible you had coded incorrectly
No. My original approach (take a look at the associated PR) was plainly ignoring the standard Linq implementation (adding to/removing from a Set collection which, at a later point, makes things more efficient) and avoiding the future iterations (which explains why the current approaches are less efficient at first). The whole idea was using the functionality of the Linq methods under different conditions (just returning a collection) bearing in mind that properly-optimised algorithms are likely to be notably quicker. The only relevant issue determining whether my proposal will succeed/fail is being able to reproduce a Linq-like implementation without the Linq drawbacks. I think that it is possible, but we will see (in some days, quicker than one week I guess, I will come back with my impressions).
I created another more specific issue (i.e., just focused on the first element) to analyse this whole situation better, but finally realised my mistake of having focused on a (partially) faulty algorithm (i.e., similar to the last version of the new Intersect algorithm in my original PR).
I am not saying that there is nothing there (actually I found differences in various LINQ methods), but it isn't as relevant/consistent as I was expecting (better said: hoping without properly analysing). In any case, I have preferred to delete any reference which might have misled someone into thinking that an important performance improvement might have been accomplished by relying on the proposed procedure.
At least, I have learned not to post too-abstract issues (like this one) and to accept that, despite my quite relevant programming and .NET expertise, there are still parts which I have to analyse properly. Thanks for your contributions and apologies for my (not too common in me) somewhat impatient attitude.