Runtime: Creating a list from an IEnumerable<T> is inefficient

Created on 7 Mar 2020  路  15Comments  路  Source: dotnet/runtime

Logging based on https://twitter.com/AntaoAlmada/status/1235979836289675266?s=20.

When creating a List<T> from an IEnumerable<T> (where the enumerable is not an ICollection<T>), the implementation is quite inefficient as it is basically foreach (item in enumeration) { Add(item); }. For large enumerations, this results in the underlying arrays being created, resized, copied numerous times. On the other hand, Enumerable.ToArray() internally uses LargeArrayBuilder which is a bit more efficient for constructing an array of unknown size. This looks to be because it effectively tracks the individual fragmented arrays and combines them together at the end (thus avoiding much of the repeated copying for very large arrays).

This can result in significant perf differences:
image

It may be worthwhile investigating if List<T>(IEnumerable<T>) can be improved similarly, naively we could possibly just have it be _items = collection.ToArray().

area-System.Collections tenet-performance

Most helpful comment

Right, the optimizations in the constructor could have stayed. These special casing optimizations for Linq are problematic. They help the one case that they are special casing, and hurt everything else.

LargeArrayBuilder produces a ton of code bloat. It is dubious optimization that is likely hurting on average. It is likely that there is a lot more cycles spent producing this code bloat than what the optimization saves. I would not want List to have dependency on LargeArrayBuilder.

If you want great performance for large datasets, do not use LINQ on objects. LINQ is optimized for small data sets and convenience, not for high performance on large data sets.

All 15 comments

Here's the link to the original discussion

There are also some other benchmarks in that thread:

image

Silly observation, but the twitter thread that lead to this mentioned Enumerable.Range(0, Count) being the target of the ToList/Array call, which caused the worst case performance. Well, couldn't Range be changed to return an ICollection<int> (or even IList<int>) so that Count would be available? I'm not suggesting to change the return type, but the runtime object that is returned could trivially implement these interfaces, and thus allow LINQ optimizations against them.

https://twitter.com/tannergooding/status/1236228321093935104

@jkotas, I think this one is slightly different.

The previous fix was for List<T>.AddRange which caused a difference in behavior due to how external observers could percieve the changes.

This would be a change to the List<T>(IEnumerable<T>) constructor, which can't be observed since the user doesn't have an initialized object until after the constructor has returned (and we don't pass this out of the constructor).

(Just noticed that the previous change also covered the constructor, but I think the latter point still stands, I don't think the constructor is observable in the same way).

Right, the optimizations in the constructor could have stayed. These special casing optimizations for Linq are problematic. They help the one case that they are special casing, and hurt everything else.

LargeArrayBuilder produces a ton of code bloat. It is dubious optimization that is likely hurting on average. It is likely that there is a lot more cycles spent producing this code bloat than what the optimization saves. I would not want List to have dependency on LargeArrayBuilder.

If you want great performance for large datasets, do not use LINQ on objects. LINQ is optimized for small data sets and convenience, not for high performance on large data sets.

LargeArrayBuilder produces a ton of code bloat. It is dubious optimization that is likely hurting on average. It is likely that there is a lot more cycles spent producing this code bloat than what the optimization saves.

See https://github.com/dotnet/runtime/pull/550#issuecomment-593595522 for some relevant data.

@MarkPflug While that does look like a simple fix, it's possible that someone somewhere somehow has a dependency on Enumerable.Range() returning a lazy enumerable, and changing that under the covers could break them. I don't actually know if that's completely accurate, but the people on the corefx team are much smarter about this kind of stuff and can make a better estimation of whether or not that break is acceptable.

the twitter thread that lead to this mentioned Enumerable.Range(0, Count) being the target of the ToList/Array call, which caused the worst case performance.

On .NET Core? That is already special-cased: the list is presized to the exact right amount, so there isn't unnecessary allocation.
https://github.com/dotnet/runtime/blob/e9b1dbb5048bd079c9421a8bab2b439dba4da588/src/libraries/System.Linq/src/System/Linq/Range.SpeedOpt.cs#L31
There's still an Add per item, but it won't result in growth.

@joe4evr What I'm suggesting would be a "lazy" IList.
Roughly: https://gist.github.com/MarkPflug/14386a91cf5e2491875ea89b02bf5e01
Would take the same amount of memory as the RangeEnumerator currently does (int start, int count), but would provide efficient Count, Contains, index, IndexOf implementations that would allow LINQ to detect that it can use more efficient algorithms.

@stephentoub I'm confused how that ToList implementation comes into play. Enumerable.Range returns an IEnumerable, so if you did a ToList() on that it would resolve to Enumerable.ToList extension method, not the RangeIterator ToList. Or, is that handled via this IPartion interface? I'm not familiar with that interface, is it internal too?

I find it interesting that IIListProvider interface has two "I"s in it. It suggests that it started out as returning an IList<T>, but then ended up changing to a List<T>, because that is what Enumerable.ToList returns. If Enumerable.ToList returned an IList<T> instead, this particular implementation could be O(1) instead of O(n). I think the problem with IList is the ReadOnly property, which essentially bisects the interface, and most consumers are looking for a mutable List<T>, when otherwise an immutable IList<T> could be provided more efficiently. Admittedly, this case probably isn't common enough to be concerned with; I think I've only ever used Enumerable.Range in test scenarios.

I find it interesting that IIListProvider interface has two "I"s in it. It suggests that it started out as returning an IList, but then ended up changing to a List, because that is what Enumerable.ToList returns.

Enumerable.ToList shipped in .NET Framework 3.5 ~13 years ago; IIListProvider was added as an internal optimization ~4 years ago, well after ToList. Its name comes from it providing either a T[] or a List<T>, both of which are IList<T>.

I have wanted to raise an issue myself but found this one.

Is it an idea to add extra overloads for the constructor instead of stuffing everything in one single method?
That would lead to smaller methods loaded during runtime which process quicker. Atleast is the orginal type is better known then just IEnumerable

Examples:

        public List(T[] collection) { ... }
        public List(List<T> collection) { ... }
        public List(IList<T> collection) { ... }
        public List(ICollection<T> collection) { ... }

Given the points already raised here, do you think we should close this issue? cc @jkotas

Was this page helpful?
0 / 5 - 0 ratings

Related issues

chunseoklee picture chunseoklee  路  3Comments

nalywa picture nalywa  路  3Comments

matty-hall picture matty-hall  路  3Comments

GitAntoinee picture GitAntoinee  路  3Comments

jamesqo picture jamesqo  路  3Comments