Runtime: Creating a list from an IEnumerable<T> is inefficient

Created on 7 Mar 2020 · 15Comments · Source: dotnet/runtime

Logging based on https://twitter.com/AntaoAlmada/status/1235979836289675266?s=20.

When creating a List<T> from an IEnumerable<T> (where the enumerable is not an ICollection<T>), the implementation is quite inefficient as it is basically foreach (item in enumeration) { Add(item); }. For large enumerations, this results in the underlying arrays being created, resized, copied numerous times. On the other hand, Enumerable.ToArray() internally uses LargeArrayBuilder which is a bit more efficient for constructing an array of unknown size. This looks to be because it effectively tracks the individual fragmented arrays and combines them together at the end (thus avoiding much of the repeated copying for very large arrays).

This can result in significant perf differences:

It may be worthwhile investigating if List<T>(IEnumerable<T>) can be improved similarly, naively we could possibly just have it be _items = collection.ToArray().

area-System.Collections tenet-performance

Source

tannergooding

👍2

Most helpful comment

Right, the optimizations in the constructor could have stayed. These special casing optimizations for Linq are problematic. They help the one case that they are special casing, and hurt everything else.

LargeArrayBuilder produces a ton of code bloat. It is dubious optimization that is likely hurting on average. It is likely that there is a lot more cycles spent producing this code bloat than what the optimization saves. I would not want List to have dependency on LargeArrayBuilder.

If you want great performance for large datasets, do not use LINQ on objects. LINQ is optimized for small data sets and convenience, not for high performance on large data sets.

jkotas on 7 Mar 2020

👍4

All 15 comments

Here's the link to the original discussion

There are also some other benchmarks in that thread:

Tyrrrz on 7 Mar 2020

FWIW, this was fixed in https://github.com/dotnet/coreclr/pull/8306 and reverted in https://github.com/dotnet/coreclr/pull/20471

jkotas on 7 Mar 2020

Silly observation, but the twitter thread that lead to this mentioned Enumerable.Range(0, Count) being the target of the ToList/Array call, which caused the worst case performance. Well, couldn't Range be changed to return an ICollection<int> (or even IList<int>) so that Count would be available? I'm not suggesting to change the return type, but the runtime object that is returned could trivially implement these interfaces, and thus allow LINQ optimizations against them.

https://twitter.com/tannergooding/status/1236228321093935104

MarkPflug on 7 Mar 2020

@jkotas, I think this one is slightly different.

The previous fix was for List<T>.AddRange which caused a difference in behavior due to how external observers could percieve the changes.

This would be a change to the List<T>(IEnumerable<T>) constructor, which can't be observed since the user doesn't have an initialized object until after the constructor has returned (and we don't pass this out of the constructor).

tannergooding on 7 Mar 2020

(Just noticed that the previous change also covered the constructor, but I think the latter point still stands, I don't think the constructor is observable in the same way).

tannergooding on 7 Mar 2020

If you want great performance for large datasets, do not use LINQ on objects. LINQ is optimized for small data sets and convenience, not for high performance on large data sets.

jkotas on 7 Mar 2020

👍4

LargeArrayBuilder produces a ton of code bloat. It is dubious optimization that is likely hurting on average. It is likely that there is a lot more cycles spent producing this code bloat than what the optimization saves.

See https://github.com/dotnet/runtime/pull/550#issuecomment-593595522 for some relevant data.

stephentoub on 7 Mar 2020

@MarkPflug While that does look like a simple fix, it's possible that someone somewhere somehow has a dependency on Enumerable.Range() returning a lazy enumerable, and changing that under the covers could break them. I don't actually know if that's completely accurate, but the people on the corefx team are much smarter about this kind of stuff and can make a better estimation of whether or not that break is acceptable.

Joe4evr on 8 Mar 2020

the twitter thread that lead to this mentioned Enumerable.Range(0, Count) being the target of the ToList/Array call, which caused the worst case performance.

On .NET Core? That is already special-cased: the list is presized to the exact right amount, so there isn't unnecessary allocation.
https://github.com/dotnet/runtime/blob/e9b1dbb5048bd079c9421a8bab2b439dba4da588/src/libraries/System.Linq/src/System/Linq/Range.SpeedOpt.cs#L31
There's still an Add per item, but it won't result in growth.

stephentoub on 8 Mar 2020

@joe4evr What I'm suggesting would be a "lazy" IList.
Roughly: https://gist.github.com/MarkPflug/14386a91cf5e2491875ea89b02bf5e01
Would take the same amount of memory as the RangeEnumerator currently does (int start, int count), but would provide efficient Count, Contains, index, IndexOf implementations that would allow LINQ to detect that it can use more efficient algorithms.

@stephentoub I'm confused how that ToList implementation comes into play. Enumerable.Range returns an IEnumerable, so if you did a ToList() on that it would resolve to Enumerable.ToList extension method, not the RangeIterator ToList. Or, is that handled via this IPartion interface? I'm not familiar with that interface, is it internal too?

MarkPflug on 8 Mar 2020

RangeIterator implements the internal IListPartition interface, which ToList checks for:
https://github.com/dotnet/runtime/blob/e9b1dbb5048bd079c9421a8bab2b439dba4da588/src/libraries/System.Linq/src/System/Linq/ToCollection.cs#L30

stephentoub on 8 Mar 2020

I find it interesting that IIListProvider interface has two "I"s in it. It suggests that it started out as returning an IList<T>, but then ended up changing to a List<T>, because that is what Enumerable.ToList returns. If Enumerable.ToList returned an IList<T> instead, this particular implementation could be O(1) instead of O(n). I think the problem with IList is the ReadOnly property, which essentially bisects the interface, and most consumers are looking for a mutable List<T>, when otherwise an immutable IList<T> could be provided more efficiently. Admittedly, this case probably isn't common enough to be concerned with; I think I've only ever used Enumerable.Range in test scenarios.

MarkPflug on 8 Mar 2020

I find it interesting that IIListProvider interface has two "I"s in it. It suggests that it started out as returning an IList, but then ended up changing to a List, because that is what Enumerable.ToList returns.

Enumerable.ToList shipped in .NET Framework 3.5 ~13 years ago; IIListProvider was added as an internal optimization ~4 years ago, well after ToList. Its name comes from it providing either a T[] or a List<T>, both of which are IList<T>.

stephentoub on 8 Mar 2020

I have wanted to raise an issue myself but found this one.

Is it an idea to add extra overloads for the constructor instead of stuffing everything in one single method?
That would lead to smaller methods loaded during runtime which process quicker. Atleast is the orginal type is better known then just IEnumerable

Examples:

        public List(T[] collection) { ... }
        public List(List<T> collection) { ... }
        public List(IList<T> collection) { ... }
        public List(ICollection<T> collection) { ... }