Runtime: Add 64 bits support to Array underlying storage

Created on 8 Mar 2019 · 38Comments · Source: dotnet/runtime

While System.Array API supports LongLength and operator this[long i], the CLR does not allow arrays to be allocated with more than 2^31-1 elements (int.MaxValue).

This limitation has become a daily annoyance when working with HPC or big data. We frequently hit this limit.

Why this matters

.NET arrays is the de facto storage unit of many APIs & collections.
Currently one has to allocate native memory instead, which does not work with managed code (Span could have helped, but it's been limited to int32 as well).
HPC frameworks expects data to be contiguous in memory (i.e. the data cannot be split into separate arrays). E.g. BLAS libraries.
Requires applications to be coded and designed differently if they handle <2G elements or >2G elements. I.e. It does not scale.
It's an arbitrary limitation with little value to the user and application:
- With a desktop with 64GB, why can one only allocate 3.1% of its memory in one go?
- With a data centre machine with 1,536GB, why can one only allocate 0.1% of its memory in one go?

In C++ this is solved with std::size_t (whose typedef changes depending on the target platform). Ideally, .NET would have taken the same route when designing System.Array. Why they haven't is a mystery, given that AMD64 and .NET Framework appeared around the same time.

Proposal
I suggest that when the CLR/JIT runs the .NET application in x64, it allows the array long constructor to allocate more than int.MaxValue items:

Indexing the array with operator this[long i] should work as expected and give access to the entire array.
Indexing the array with operator this[int i] should work as expected but implicitly limit the access to only the first int.MaxValue elements.
The LongLength property should return the total number of elements in the array.
The Length property should return the total number of elements in the array, or throw OverflowException of there are more than int.MaxValue elements (this matches the current behaviour on multi-dimensional arrays with more than int.MaxValue elements).

I naively believe that the above should not break any existing application.

Bonus points for extending 64-bit support to Span and ReadOnlySpan.

api-needs-work api-suggestion area-System.Runtime

Source

GPSnoopy

👍29 🚀3 ❤3

Most helpful comment

This is an interesting idea, but I wonder if it would be better to have a LargeArray<T> or similar class that has these semantics rather than try to shoehorn it into the existing Array class. The reason I suggest this course of action is that the GC currently has a hard dependency on the element count of any variable-length managed object fitting into a 32-bit signed integer. Changing the normal Array class to have a 64-bit length property would at minimum also affect the String type, and it may have other unintended consequences throughout the GC that would hinder its efficiency, even when collecting normal fixed-size objects. Additionally, accessing the existing Array.Length property would no longer be a simple one-instruction dereference; it'd now be an overflow check with associated branching.

If we had a theoretical LargeArray<T> class, it could be created from the beginning with a "correct" API surface, including even using nuint instead of long for the indexer. If it allowed _T_ to be a reference type, we could also eliminate the weird pseudo-covariance / contravariance behavior that existing Array instances have, which would make writing to a LargeArray<T> potentially cheaper than writing to a normal Array.

GrabYourPitchforks on 21 Mar 2019

👍12

All 38 comments

I naively believe that the above should not break any existing application.

One of the simplest breaks becomes any program that is doing the following:

for (int i = 0; i < array.Length; i++)
{
}

This works fine for any existing programs, since CoreCLR actually defines a limit that is just under int.MaxValue. Allowing values that are greater than int.MaxValue would cause an overflow to -2147483648 and either cause unexpected behavior or cause an IndexOutOfRangeException to be thrown.

tannergooding on 8 Mar 2019

👎2

I frequently run into this array when processing large chunks of data from network streams into byte arrays, and always need to implement chunking logic in order to be able to process the data.
It would be great to be able to use long indexes on arrays

giuliojiang on 8 Mar 2019

👍2

@tannergooding That's why I proposed above to keep the existing behaviour of throwing OverflowException on Length when there are more than int.MaxValue elements. Nothing changes there.

I'm suggesting that changing the CLR implementation as proposed above would allow applications and people who want to to use large arrays without breaking existing applications. You are right that simply passing a large array into a library that does not support it will break, but at least this will give us choice. We need to start somewhere, and .NET cannot keep ignoring this problem.

GPSnoopy on 8 Mar 2019

👍5 👎1

Yesterday I happily created an array in Python which contained more than two billion elements. When can I do the same in .NET?

Currently we get an exception if we try to construct an array with more than 2B elements. What's wrong with deferring that exception until something calls the Length property which can no longer return a valid int? @tannergooding's example wouldn't cause problems. Are there other examples which break?

philjdf on 8 Mar 2019

👍3

GrabYourPitchforks on 21 Mar 2019

👍12

@GrabYourPitchforks interesting facts about the GC. I wasn't aware of such limitations.

A LargeArray<T> class could be a temporary solution. My main concern is that it would stay a very niche class with no interoperability with the rest of the .NET ecosystem, and ultimately would be an evolutionary dead end. I do like the idea of nuint/nint though.

My gut feeling is that Array, String and the GC are ripe for a 64 bits overhaul. We should bite the bullet and do it. So far I've been quite impressed by the .NET Core team willingness to revisit old decisions and areas that Microsoft had kept shut in the past (e.g. SIMD/platform specific instructions, float.ToString() roundtrip fixes, bug fixes that change backward compatibility, etc).

GPSnoopy on 4 Apr 2019

👍1

I guess I could palette a LargeArray but if that's going to be implemented than I hope it doesn't create a new type which is not actually an Array, IMHO it would have been much easier to address this if we would have created another sub type of Array internally instead of inventing Span however Span also solves other problems....

juliusfriedman on 5 Apr 2019

These days 2GB arrays are barely enough for many applications to run reliably. RAM prices have stagnated for a few years now. Surely, the industry will resolve this problem sooner or later. As RAM amounts resume increasing at Moores law rate this 2GB array issue will become very commonplace sooner or later.

A LargeArray<T> type might be a good medium term solution. But will 2GB arrays not be very commonplace 10 years from now? Do we then want to litter application code and API surfaces with LargeArray<T>? It would often be a hard choice whether to go for LargeArray<T> or T[].

Thinking in the very long term it seems far better to find a way to fix T[].

GSPP on 22 Apr 2019

👍7

If 64 bit support is implemented there could be a tool that analyzes your code for legacy patterns (e.g. usage of int Length or the typical for loop for (int i = 0; i < array.Length; i++)). The tool should then be able to mass upgrade the source code. This could be a Roslyn analyzer.

GSPP on 22 Apr 2019

👍4

Since this would be such a massive ecosystem breaking change, one other thing you'd probably have to do is analyze the entire graph of all dependencies your application consumes. If the change were made to T[] directly (rather than LargeArray<T> or similar), assemblies would need to mark themselves with something indicating "I support this concept!", and the loader would probably want to block / warn when such assemblies are loaded. Otherwise you could end up in a scenario where two different assemblies _loaded into the same application_ have different views of what an array is, which would result a never-ending bug farm.

GrabYourPitchforks on 25 Apr 2019

👎2

Not if large array was an array... i.e. derived from it if only internally perhaps; like I suggested back in the span threads.

juliusfriedman on 27 Apr 2019

If large array is a glorified array, then you could pass a large array into an existing API that accepts a normal array, and you'd end up right back with the problems as originally described in this thread.

GrabYourPitchforks on 27 Apr 2019

Furthermore, I'm not sure I buy the argument that adding large array would bifurcate the ecosystem. The scenario for large array is that you're operating with enormous data sets (potentially over 2bn elements). By definition you wouldn't be passing this data to legacy APIs anyway since those APIs wouldn't know what to do with that amount of data. Since this scenario is so specialized it almost assumes that you've already accepted that you're limited to calling APIs which have been enlightened.

GrabYourPitchforks on 27 Apr 2019

👍1

You have LongLength on the Array

The only fundamental diff is that one is on the LOH and one is not.

By virtue of the same fact span wouldn't be able to hold more than such either so large span must be needed also...

juliusfriedman on 27 Apr 2019

From what I can gather and summarise from the above, there are two pieces of work.

Update the CLR so that Array can works with 64-bit length and indices. This includes changes to the Array implementation itself, but as comments have pointed above, also to System.String and the Garbage Collector. It is likely to be relatively easy to come up with a fork of coreclr that can achieve this, as a proof of concept with no regard for backward compatibility.
Find a realistic way to achieve backward compatibility. This is the hard part. I think this is unlikely to succeed without compromising some aspect of the CLR. Whether it is Length throwing on overflow, or awkwardly introducing new specific classes like LargeArray.

But the more I think about it, the more I think this issue is missing the point and ultimately the real problem with .NET as it stands. Even if the initial proposal was to be implemented, it would only fix the immediate 64-bit issue with Array but still leave collections and Span with the same indexing and length limitations.

I've started reading The Rust Programming Language (kind of felt overdue) and it struck me that Rust also mimics C++ size_t and ssize_t with usize and isize. C# on the other hand somehow decided not to expose this CPU architectural detail and forces everyone to the lowest common denominator for most of it's API: a 32-bit CPU with 32-bit addressing.

I'd like to emphasis that the 32-bit limitation is purely arbitrary from a user point of view. There is no such thing as a small array and a big array; an image application should not have to be implemented differently whether it works with 2,147,483,647 pixels or 2,147,483,648 pixels. Especially when it's data driven and the application has little control on what the user is up to. Even more frustrating if the hardware has long been capable of it. If you do not believe me or think I'm talking nonsense, I invite you to learn how to program for MS-DOS 16-bit with NEAR and FAR pointers (hint: there is a reason why Doom required a 386 32-bit CPU).

Instead of tinkering around the edges, what is the general appetite for a more ambitious approach to fix this limitation?

Here is a controversial suggestion, bit of a kick in the nest:

Take on @GrabYourPitchforks idea, introduce nint and nuint (I also like size and ssize, but I can imagine a lot of clashes with existing code).
Allow implicit conversion from int to nint (and uint to nuint) but not the reverse,
Change all Length properties on Array, String, Span and collections to return nint (leaving aside the debate of signed vs unsigned and keeping the .NET convention of signed lengths).
Change all indexing operators on Array, String, Span and collection to only take nint.
Remove LongLength properties and old indexing operators.
Take a page from Nullable Reference book and allow this to be an optional compilation feature (this is where it hurts).
Only allow a nint assembly to depend on another assembly if it's also using the new length types.
But allow a global or per-reference "I know what I'm doing" override, in which case calling the old int32 Length property on an Array is undefined (or just wraps around the 64-bit value).
Spend a decade refactoring for loops and var i = 0.

I understand this is far from ideal and can create uncomfortable ecosystem situations (Python2 vs Python3 anyone?). Open to suggestions on how to introduce size types in .NET in a way that doesn't leave .NET and C# more irrelevant on modern hardware each year.

GPSnoopy on 2 Oct 2019

👍1

If we can solve the issue with the GC not being tolerant of variable-length objects bigger than 2 GB,
a couple things that might make LargeArray<T> with a native-word-sized-Length more palatable:

Array length is already represented as a pointer-sized integer in the memory layout of arrays in .NET. On 64-bit platforms, the extra bytes serve as padding and are always zero.
If we were to introduce a LargeArray<T>, we can give it the same memory layout as existing arrays. The only difference is that the bytes that serve as padding for normal arrays would have a meaning for LargeArray<T>.
If the memory layout is the same, code that operates on LargeArray<T> can also operate on normal arrays (and for a smaller LargeArray<T>, vice-versa)
We can enable casting between LargeArray<T> and normal arrays
Casting from an array to LargeArray<T> is easy - we just follow the normal array casting rules (we could get rid of the covariance though, as @GrabYourPitchforks calls out). If we know the element types, it's basically a no-op.
When casting from LargeArray<T> to normal arrays, we would additionally check the size and throw InvalidCastException if the LargeArray<T> instance is too long.
Similar rules would apply when casting to collection types (e.g. you can cast LargeArray<T> to ICollection<T> only if the element count is less than MaxInt).
Existing code operating on arrays doesn't have to worry about Length being longer than MaxInt. Casting rules guarantee this can't happen.
LargeArray<T> would not derive from System.Array, but one could explicitly cast to it (the length-check throwing InvalidCastException would happen in that case). The same for non-generic collection types (e.g. casting to ICollection that has the Int32 Count property would only be allowed for a small LargeArray<T>).

This scheme would allow LargeArray<T> and normal arrays to co-exist pretty seamlessly.

MichalStrehovsky on 2 Oct 2019

👍5

@MichalStrehovsky

Similar rules would apply when casting to collection types (e.g. you can cast LargeArray to ICollection only if the element count is less than MaxInt).

This would make LargeArray<T> incompatible with a lot of older code in these cases:

Methods that currently accept a more specific type than they actually need(like ICollection<T> instead of IEnumerable<T> while the code just enumerates the values).
Probably, some cases where ICollection<T> isn't used directly but inherited from. For instance, methods that accept ISet/IList/IDictionary<...>(I assume, that those interfaces and their implementations would be eventually updated for 64-bit lengths as well) which are inherited from ICollection<T>.

I would go with overflow checks when .Count is called to keep the compatibility and add .LongCount property with a default implementation to the old interfaces.

kasthack on 2 Oct 2019

@kasthack I was trying to come up with a compatible solution where one never has to worry about getting OverflowException in the middle of a NuGet package that one doesn't have the source for. Allowing cast to ICollection<T> to succeed no matter the size is really no different from just allowing arrays to be longer than 2 GB (no need for a LargeArray<T> type). Some code will work, some won't.

With explicit casting, it's clear that we're crossing a boundary into "legacy land" and we need to make sure "legacy land" can do everything it would reasonably be able to do with a normal ICollection<T> before we do the cast.

methods that accept ISet/IList/IDictionary<...>

Arrays don't implement ISet/IDictionary so a cast to these would never succeed. For IList, the same rules would apply as for ICollection (ICollection was just an example above).

MichalStrehovsky on 2 Oct 2019

@GPSnoopy's post makes me wonder whether the following variation might make sense:

Introduce new nint and nuint types, but don't change the signatures of anything to use them. Nothing breaks.
Introduce new arrays types (with fixed covariance), new span types, etc, which use nint and nuint. Keep the old ones and don't touch them. Make it fast and easy to convert between old and new versions of these types (with an exception if your 64-bit value is too big to fit into the 32-bit counterpart), but conversion should be explicit. Nothing breaks, type safety and all that.
Add a C# compiler switch /HeyEveryoneIts2019 which, when you write double[] you get the new type of array instead of the old one, everything's nint and nuint, and the compiler adds conservative/obvious stuff to convert to/from old-style arrays when you call outside assemblies which want old-style arrays. This way if it gets through the conversion without an exception, you won't break any old referenced code.

philjdf on 3 Oct 2019

👍2

It has been proposed that we could make Array.Length and array indices native-sized (nint or IntPtr).

This would be a portability issue. Code would need to be tested on both bitnesses which currently is rarely required for most codebases. Code on the internet would be subtly broken all the time because developers would only test their own bitness.

Likely, there will be language level awkwardness when nint and int come into contact. This awkwardness is a main reason unsigned types are not generally used.

In C languages the zoo of integer types with their loose size guarantees is a pain point.

I don't think we want to routinely use variable-length types in normal code. If nint is introduced as a type it should be for special situations. Likely, it is most useful as a performance optimization or when interoperating with native code.

All arrays should transparently support large sizes and large indexing. There should be no LargeArray<T> and no LargeSpan<T> so that we don't bifurcate the type system. This would entail an enormous duplication of APIs that operate on arrays and spans.

If the object size increase on 32 bit is considered a problem (it might well be) this could be behind a config switch.

Code, that cares about large arrays needs to switch to long.

Likely, it will be fine even in the very long term to keep most code on int. In my experience, over all the code I ever worked on, it is quite rare to have large collections. Most collections are somehow related to a concept that is inherently fairly limited. For example, a list of customers will not have billions of items in it except if you work for one of 10 companies in the entire world. They can use long. Luckily for us, our reality is structured so that most types of objects do not exist in amounts of billions.

I see no realistic way to upgrade the existing collection system to 64 bit indexes. It would create unacceptable compatibility issues. For example, if ICollection<T>.Count becomes 64 bit, all calling code is broken (all arithmetic but also storing indexes somewhere). This must be opt-in.

It would be nicer to have a more fundamental and more elegant solution. But I think this is the best tradeoff that we can achieve.

GSPP on 3 Oct 2019

👍1

Just note there is already a case today where Length can throw. Multi-dimensional arrays can be large enough that the total number of elements is greater than Int.MaxValue. I think this is why LongCount was originally added.

FraserWaters-GR on 8 Oct 2019

Note: Currently, when you write foreach over an array, the C# (and also VB) compiler actually generates a for loop, and store index in 32 bits. This means that existing code must break with array that > 2G, or at least a recompile is required.

I really hopes all size-like parameters among the ecosystem is using nuint (avoid checking for >= 0).
There can be a [SizeAttribute] on all the parameters, and JIT generates the positive guard and bit expansion to allow existing int-size-compiled assembly to run with native-sized-corelib.

huoyaoyuan on 8 Oct 2019

One option is to create a NuGet package for LargeArray<T>, and polish it in practical use. Once polished, make it part of the standard. This is what C++ did to parts of Boost.

But CLR should be ready by then.

lostmsu on 20 Jun 2020

❤1

But CLR should be ready by then.

@lostmsu Are you aware of ongoing work to fix this in CLR that I'm not aware of?

GPSnoopy on 2 Jul 2020

@GPSnoopy nope, that was an argument to start the CLR work ahead of time before BCL, so that BCL could catch up.

lostmsu on 2 Jul 2020

I want to reiterate: there is little appetite within the .NET engineering team for allowing normal T[] / string / Span to be able to represent more than int.MaxValue items. The problems with shoehorning this in to the existing ecosystem have been described at length earlier in this thread.

Any proposal which modifies the implementation of these existing types is likely to stall. Alternative proposals which do not modify the implementation of these existing types are much more actionable.

GrabYourPitchforks on 3 Jul 2020

👎1 👍1

Thanks @GrabYourPitchforks for the update. Although it's not good news: I am also currently reaching the limit of arrays because I am constructing large graphs. If the team does not want to remove the limit for T[], the only option is probably LargeArray<T>, based on an unsafe implementation. This would mean that we would have to free the memory manually e.g. via LargeArray<T>.FreeMemory(). Are there other proposals imaginable?

My opinion after 15 years as a C# developer: I would appreciate it if at least the limit of arrays T[] would be removed. At the same time it would make sense to use ulong as index instead of long.

SommerEngineering on 22 Jul 2020

It's me again. I've been trying to sketch a solution in a hurry. I put the code online at Github:
~~https://gist.github.com/SommerEngineering/e8f61ada40b8ff3ecfe61f5263a666b9~~ (edit: see https://github.com/dotnet/runtime/issues/12221#issuecomment-665550339 for my ExaArray library)~~

This is a class for a one-dimensional array that pretends to be an array that might become larger than the arrays in C#. This is done using chunks of arrays and ulong as index.

Did I miss a technical detail here, or would that be a possible solution?

A wide range of applications in the field of graphs, big data or HPC should require one, two or three-dimensional arrays. According to this we might extend the chunk approach to these dimensions and would satisfy a wide range of developers without losing compatibility.

Dear .NET team: Couldn't you integrate such an approach into .NET 5? My code is obviously free to use without a license. I might write some unit tests tomorrow.

SommerEngineering on 22 Jul 2020

@SommerEngineering there are few reasons to use signed types to index arrays. For example, unsigned index makes it harder to iterate in reverse due to 0-1 >= 0.

lostmsu on 22 Jul 2020

It's also not strictly required to have an "unsafe" large array. You could, for example, imagine a LargeArray type which Array is implicitly convertible to and which has some specialized runtime/language semantics to make things work smoothly.

tannergooding on 22 Jul 2020

I feel like given the underlying limitation to this is because of the GC there's really only two available options, assuming you want a container that has O(1) index access and is large (more than 2^32 entries):
1) You want a managed container that can hold references to objects, but doesn't need to be contiguous memory. I think this could be done via a container similar to the C++ deque where you have one array that holds references to other arrays, Given the largest array size is 2146435071 if you used the maximum size for parent array and the child arrays you could have up to 4607183514018775041 elements, which is about 2^62.
2) You want contiguous memory, and so because of underlying GC limitations we can't hold references in this but we could have a container that keeps a native allocation of memory. This would just be UnmanagedMemoryAccessor but with an array style interface instead of general read/write.

Frassle on 23 Jul 2020

Note for all who are interested in a managed array with more than 2^32 elements: I published the library ExaArray on NuGet yesterday. In theory, it can hold up to 4.4 quintillion (4,410,000,000,000,000,000) elements. When using byte as T, this would require approx. 3.8 EB of memory. I tested it up to 32 GB of memory, though.

NuGet: https://www.nuget.org/packages/ExaArray/
Github: https://github.com/SommerEngineering/ExaArray
Main repo: https://code.tsommer.org/thorsten/ExaArray
License: BSD-3-Clause

Right now only a one-dimensional array is implemented. Thus, jagged arrays are working fine.

This library might helps someone until .NET supports larger arrays.

SommerEngineering on 29 Jul 2020

@SommerEngineering the limits you've used in that library are a little off, you should be able to allocate up to 2,146,435,071 elements in a (non-byte) array, not just 2,100,000,000. Gives you an extra 46 million to work with.

Frassle on 29 Jul 2020

It would be better to use power of two as max chunk capacity to avoid expensive division. Power of two division is optimized into a shift that is a lot cheaper than arbitrary divide operation.

jkotas on 29 Jul 2020

The largest power of two that's smaller than 2,146,435,071 (a) is 1,073,741,824 (b)
a^2) 4,607,183,514,018,775,041
b^2) 1,152,921,504,606,846,976

I mean probably still more than enough elements but is a lot smaller.

Frassle on 29 Jul 2020

Thanks @Frassle and @jkotas for the feedback.

@jkotas: I was not yet familiar with the automatic optimization of divisions of powers of two. Thanks for that.
@Frassle Thanks for the exact limit that .NET accepts. I determined 2,100,000,000 by trial-and-error.

Alright... now we have a classic trade-off situation: Either slightly more performance but with "only" 1.1 quintillion elements or less performance and 4.6 quintillion elements instead. Yes, 1.1 quintillion is still very much. Especially if we put that in relation to the currently available memory: Azure offers a maximum of 5.7TB, Google 11.7TB and Amazon 3.9TB per machine. None of these offers are sufficient to load 1.1 quintillion elements on one machine.

Thus: I added an option to chose a strategy. Default is maximizing performance. I added a unit test for measure the performance difference: Power of two divisions are approx. 1% faster in average. @jkotas: Is this the expected level of improvement? I mean one percent savings is a lot for a scientific simulation that runs for several months.

I released version 1.1.0 with this changes.

SommerEngineering on 29 Jul 2020

The dividend has to be constant to make the optimization kick in. The configurability that you have introduced defeats the point of the optimization.

jkotas on 29 Jul 2020

Thanks @jkotas ... I tested it once quickly: The performance improves up to about 70% (depending on the number of get/set calls). This is such a significant difference that I will give up the option to choose for this. Especially because I will use the library for scientific simulations, this performance difference is priceless. Thanks again for the advice. This evening I will release a new version 1.1.1 with the appropriate changes.

SommerEngineering on 30 Jul 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Getting error while installing the VSTS Agent on Linux RedHat

sahithreddyk · 3Comments

Question: interface call and generated ASM

yahorsi · 3Comments

Add Bson integration

GitAntoinee · 3Comments

JIT: Optimize boxing(x) == boxing(y) to false.

EgorBo · 3Comments

IAsyncDisposable TransactionScope

Timovzl · 3Comments