Runtime: [Proposal] Adding a version of List<T> with smaller memory footprint and no random access

Created on 22 Feb 2017 · 27Comments · Source: dotnet/runtime

Background: When Add is called on List and it cannot fit any more items in its internal buffer, it allocates a new array twice the size and copies over each of the existing items.

This is an inefficient use of memory; instead, it could keep the old array around, allocate a new one the same size as the number of existing items, and continue reading items into that array with no copying. This is what LargeArrayBuilder<T> does, which is used by Linq's ToArray method today, and the number of allocations has been shown to decrease by 33% as a result.

Here is a visualization of how 200 items will be stored:

Buffers = new T[][]
{
    [0] = new T[8], // Stores items 0-7
    [1] = new T[8], // Stores items 9-16
    [2] = new T[16], // Stores items 17-32
    [3] = new T[32], // Stores items 33-64
    [4] = new T[64], // Stores items 65-128
    [5] = new T[128] // Stores items 129-199, slots 200-255 are zero-inited
}

When one buffer runs out of space we can simply add it to Buffers and continue reading in items to a new buffer, without copying anything or wasting memory.

Proposal: The public version will be a class and implement ICollection<T>. It will be named ArrayBuilder<T>, since in the related issue we decided we're not going to expose the internal ArrayBuilder<T>. It will expose a minimal API and explicitly implement a few ICollection<T> members.

namespace System.Collections.Extended
{
    public class ArrayBuilder<T> : ICollection<T>, IReadOnlyCollection<T>
    {
        public ArrayBuilder();
        public ArrayBuilder(int capacity);
        public ArrayBuilder(IEnumerable<T> items);
        public ArrayBuilder(int capacity, IEnumerable<T> items);

        // ICollection<T> members
        public int Count { get; }

        public void Add(T item);
        public void Clear();
        public bool Contains(T item);
        public void CopyTo(T[] array, int arrayIndex);
        public Enumerator GetEnumerator();

        public T[] ToArray();

        // Explicitly implemented ICollection<T> members
        bool ICollection<T>.IsReadOnly { get; }

        bool ICollection<T>.Remove(T item);
        IEnumerator<T> IEnumerable<T>.GetEnumerator();
        IEnumerator IEnumerable.GetEnumerator();

        public struct Enumerator : IEnumerator<T>, IEnumerator
        {
            public T Current { get; }

            public void Dispose();
            public bool MoveNext();

            void IEnumerator.Reset();
        }
    }
}

The type will not implement IList<T>. Random access will be O(log n) because we'd have to walk over each of the buffers and subtract its size from the index, and each buffer will be double the size as the previous.

Addtional notes:

Roslyn already has an ArrayBuilder<T> type used in a lot of places, so I'm open to changing the name if the name clashes with that project will be a concern.
The namespace will be System.Collections.Extended so System.Collections.Generic stays lean. See https://github.com/dotnet/corefx/issues/16374#issuecomment-281856288

/cc @benaadams @omariom @jkotas @KrzysztofCwalina

api-needs-work area-System.Collections

Source

jamesqo

👍1

Most helpful comment

We need to lead by example with relatively small repos per nuget package for components that do not have to be part of the platform.

@tmat has been doing that for components that he owns (e.g. https://github.com/dotnet/symreader) and developing best practices on how to setup such small repos. We need more of these to get a vibrant .NET ecosystem.

I have said before that I am not happy about random up stack components like serial port or directory services getting added to corefx. These components should be in its own little repos, shipping on its own cadence, etc.

jkotas on 1 Mar 2017

👍2

All 27 comments

Could implement IList<T> random access should still be O(1) as you know the sequence of sizes so could jump straight to correct array after a calculation.

benaadams on 22 Feb 2017

👍2

@benaadams Technically, it could be random access-- I made a version before (here) that allowed getting a buffer from the index. You just need to solve the equation

blocksize * 2^n <= index < blocksize * 2^(n + 1)

which gets you either n = log2(index/blocksize), or n = log2(index) - log2(blocksize). The problem with this is I am not planning to have a fixed block size, due to the capacity constructors-- if someone creates a new ArrayBuilder<T>(200) the class will straight-up allocate a new T[200] instead of a bunch of different arrays like I showed in the description. Then calculating the buffer index for non-powers of 2 would require either 1 log2 and 1 div, or 2 log2s, both of which are going to be pretty slow.

jamesqo on 22 Feb 2017

The equation that computes the segment index could do the following:

If index < buffer[0].Length return 0;
else return YouAlgorithm(index - buffer[0].Length)) + 1;

KrzysztofCwalina on 22 Feb 2017

BTW, I think we should rename https://github.com/dotnet/corefxlab/tree/master/src/System.Collections.Generic.MultiValueDictionary to System.Collections.Extended (or something like that) and add this type there. It's kind of specialized and probably appealing to limited number of developers.

KrzysztofCwalina on 22 Feb 2017

floor(log2(x)) is potentially very fast, given bsr and similar instructions. There are various lookup table methods you can use that are pretty fast too.

I'm not sure if you can get the jit to emit a bsr though....

AndyAyersMS on 22 Feb 2017

Does it need to implement collection interfaces? LargeArrayBuffer seems only to have minimal API.

omariom on 23 Feb 2017

@KrzysztofCwalina

BTW, I think we should rename https://github.com/dotnet/corefxlab/tree/master/src/System.Collections.Generic.MultiValueDictionary to System.Collections.Extended (or something like that) and add this type there. It's kind of specialized and probably appealing to limited number of developers.

Is it? Seems to me like it could be used as a general purpose replacement for List<T> anywhere the indexer is not used, and there is no trade-off in perf (it's a pure win).

@AndyAyersMS

floor(log2(x)) is potentially very fast, given bsr and similar instructions. There are various lookup table methods you can use that are pretty fast too.

Indeed. I am waiting until https://github.com/dotnet/corefx/issues/12425 lands which implements CountLeadingZeroes using a lookup table and then I will open a proposal for implementing IList<T>.

@omariom

Does it need to implement collection interfaces? LargeArrayBuffer seems only to have minimal API.

If you're asking what's the purpose of implementing ICollection<T> when the existing LargeArrayBuilder<T> doesn't, that's because LargeArrayBuilder is a struct/internal type so we haven't needed to treat it as an interface yet. With a public type people may want to pass it around as an IEnumerable<T> or such so it would make more sense to implement the interfaces.

One other thing I had not mentioned is that Roslyn has their own ArrayBuilder<T> type, which is used in thousands of places in their codebase. If adding this type could cause issues with them due to name clashes, I am fine with changing the name.

jamesqo on 23 Feb 2017

given we already have List

KrzysztofCwalina on 23 Feb 2017

@KrzysztofCwalina OK, I'm fine with putting the type in a new namespace then. I've updated the proposal at the top.

jamesqo on 23 Feb 2017

@jamesqo the point from @KrzysztofCwalina was that we do not believe the type is valuable enough to be in CoreFX repo. If you find it valuable, it might be a good idea for a standalone NuGet package outside of CoreFX repo.

karelz on 23 Feb 2017

I was reminded of https://en.wikipedia.org/wiki/Hashed_array_tree

danmosemsft on 25 Feb 2017

👍1

If random access can be very fast, albeit a bit slower, but allocations greatly reduced, why can we not use this method in the STD. List implementation?

chrisaut on 25 Feb 2017

@ianhays and I agree with @KrzysztofCwalina. What about adding this type in CoreFXLab? @jamesqo

safern on 27 Feb 2017

👍1

CoreFXLab is not the best place either - it is gated by MS employees and we do not monitor it entirely :( (I've heard we had PRs opened for CommandLine for 1 year there :().
I would suggest to use entirely separate repo, purely community driven /owned by community members ... @jamesqo would that work for you?

karelz on 27 Feb 2017

CoreFXLab is not the best place either - it is gated by MS employees and we do not monitor it entirely :( (I've heard we had PRs opened for CommandLine for 1 year there :().

@karelz Because that repo isn't as strictly maintained as corefx, PRs should not be submitted there at all? I think I would like to submit a PR there bc if we find a way to modify its design such that it fits well with the existing API (and does not force people to explicitly choose between it and List<T>), then it will be beneficial to have it in corefxlab.

jamesqo on 1 Mar 2017

@karelz, command line PRs sat for a year because there was nobody on our side who thought the project aligns with our priorities. I think a collection extensions package aligns much nicer.

KrzysztofCwalina on 1 Mar 2017

@KrzysztofCwalina is there any guarantee that @jamesqo's class won't become a non-priority in CoreFXLab?
To be safe, I would still suggest separate repo driven by community members. IMO it gives much more flexibility to everyone.

This might be potentially good match for the PowerCollections and friends repo that we discussed some time back (BTW: was there any progress on that?). However, I am not sure what kind of bar / restrictions we would put there - we probably wouldn't want to add just any "random" collection class and then "maintain" it long-term.

karelz on 1 Mar 2017

There is no traction on PowerCollections. I thought the corfxlab library would be such "power collections" home.

There is no guarantee, but I think it's much more likely that such library would be aligned with corfxlab priorities.

KrzysztofCwalina on 1 Mar 2017

I thought we concluded we need a new repo for PowerCollections and friends. Esp. because CoreFXLab does not have shipping mechanism and is for experiments. We would have to artificially distinguish what's experiment and what is shipping quality :( ... we can take that discussion elsewhere ..

karelz on 1 Mar 2017

I don't remember such conclusion, thought maybe we did reach it.
Assuming we did, the conclusion surely was that we will create a repo and help the community with building, CI, publishing, provide critical mass of datastructures, etc. Is that happening? i.e. is anybody actually working on it? If not, then I think corfxlab is a great way to start (developing the code) and once we have the other repo, we can move the code there. ... but sure, let's discuss in person.

KrzysztofCwalina on 1 Mar 2017

We need to lead by example with relatively small repos per nuget package for components that do not have to be part of the platform.

jkotas on 1 Mar 2017

👍2

@jkotas, the proposal is to add these to corfxlab, not to corefx. I agree with your point about corfx.

KrzysztofCwalina on 1 Mar 2017

We have a similar type in System.Reflection.Metadata: https://github.com/dotnet/corefx/blob/master/src/System.Reflection.Metadata/src/System/Reflection/Metadata/BlobBuilder.cs
It's currently a list of bytes, but it could be easily extended to any item type. I was thinking about implementing it and using it in metadata table builders. Was considering name Stream<T>.

tmat on 1 Mar 2017

Re the repo infrastructure: See RepoToolset and repo already using it: https://github.com/dotnet/symreader-converter

tmat on 1 Mar 2017

Since is something not going into CoreFX we are going to close it. We should move this discussion to wherever we want to add the code.

safern on 6 Mar 2017

I ended up creating my own repo for this: https://github.com/jamesqo/BlockList. @jkotas you were right, it doesn't feel that bad to create my own NuGet package instead of putting this code in corefxlab.

jamesqo on 30 Jul 2017

I would rather see a B+-tree implementation of a list. For reasonably large values of b, these perform similarly to block allocating lists, but offer O(log_b n) insertion and lookup times anywhere in the list.