Runtime: a new GC API for large array allocation

Created on 15 Aug 2018 · 93 Comments · Source: dotnet/runtime

To give users with high perf scenarios more flexibility for array allocations I propose to add a new API in the GC class.

Rationale

Below are mechanisms we would like to support for high perf scenarios

  • coreclr dotnet/runtime#20704;
  • choose whether you want to allocate the object as a gen0 object or in the old generation;
  • choose whether you want to pin the object you are requesting to allocate;

I am also thinking of exposing the large object size threshold to users as a config setting; this API, along with that config, should help a lot with solving the LOH perf issues folks have been seeing.
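As an illustration of what such a config could look like: the threshold knob that later shipped uses the `System.GC.LOHThreshold` setting in `runtimeconfig.json` (shown here only as an example of the shape, not as part of this proposal):

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.LOHThreshold": 120000
    }
  }
}
```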

Proposed APIs

class GC
{
    // generation: -1 means to let GC decide (equivalent to regular new T[])
    // 0 means to allocate in gen0
    // GC.MaxGeneration means to allocate in the oldest generation
    //
    // pinned: true means you want to pin this object that you are allocating
    // otherwise it's not pinned.
    //
    // alignment: only supported if pinned is true.
    // -1 means you don't care which means it'll use the default alignment.
    // otherwise specify a power of 2 value that's >= pointer size
    // the beginning of the payload of the object (&array[0]) will be aligned with this alignment.
    static T[] AllocateArray<T>(int length, int generation=-1, bool pinned=false, int alignment=-1)
    {
        // calls the new AllocateNewArray fcall.
        return AllocateNewArray(typeof(T).TypeHandle.Value, length, generation, pinned, clearMemory: true);
    }

    // Skips zero-initialization of the array if possible. If T contains object references, 
    // the array is always zero-initialized.
    static T[] AllocateUninitializedArray<T>(int length, int generation=-1, bool pinned=false, int alignment=-1)
    {
        return AllocateNewArray(typeof(T).TypeHandle.Value, length, generation, pinned, clearMemory: false);
    }
}
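To make the calling pattern concrete, here is a hedged usage sketch. It sticks to the `length` and `pinned` parameters so it matches both the proposal above and the shape that eventually shipped in .NET 5 (`GC.AllocateArray<T>(int length, bool pinned = false)` and `GC.AllocateUninitializedArray<T>(int length, bool pinned = false)`); the `generation` and `alignment` parameters are proposal-only and are left at their defaults here.

```csharp
using System;

class Example
{
    static void Main()
    {
        // A pinned buffer for interop: the GC guarantees it never moves,
        // so no fixed statement or GCHandle is needed for its lifetime.
        byte[] ioBuffer = GC.AllocateArray<byte>(64 * 1024, pinned: true);

        // A large buffer whose contents will be overwritten immediately,
        // so zero-initialization can be skipped when the GC allows it.
        int[] scratch = GC.AllocateUninitializedArray<int>(1_000_000);
        for (int i = 0; i < scratch.Length; i++) scratch[i] = i;

        Console.WriteLine(ioBuffer.Length);  // 65536
        Console.WriteLine(scratch[999_999]); // 999999
    }
}
```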

Restrictions

Only array allocations are supported via this API

Note that I am returning a T[] because this only supports allocating large arrays. It's difficult to support allocating a non-array object since you'd need to pass in args for constructors, and it's rare for a non-array object to be large anyway. I have seen large strings, but those are unlikely to be used in high perf scenarios, and strings also have multiple constructors... We can revisit if string is proven to be necessary.

Minimal size supported

Even though the size is no longer restricted to >= LOH threshold, I might still have some sort of size limit so it doesn't get too small. I will update this when I have that exact size figured out.

Perf consideration

Cost of getting the type

The cost of "typeof(T).TypeHandle.Value" should be dwarfed by the allocation cost of a large object; however, in the case of allocating a large object without clearing memory, the cost may show up (we need to do some profiling). If that's proven to be a problem, we can implement coreclr dotnet/corefx#5329 to speed it up.

Pinning

We'll provide a pinned heap that is only for objects pinned via this API. So this is for scenarios where you

  • have control over the allocation of the object you want to pin and
  • want to pin it as long as it's alive

Since we will not be compacting this heap, fragmentation may be a problem, so as with normal pinning, it should be used with caution.

I would like to limit T for the pinning case to contain no references. But I am open to discussion on whether it's warranted to allow types with references.
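A property of the pinned heap worth illustrating is that a pinned allocation never moves, so its address stays stable across compacting collections. This is a hedged sketch assuming a runtime where the pinned allocation path exists (the `pinned` parameter shipped in .NET 5):

```csharp
using System;
using System.Runtime.InteropServices;

class PinnedDemo
{
    static void Main()
    {
        // Allocate directly on the pinned heap.
        byte[] pinned = GC.AllocateArray<byte>(256, pinned: true);

        // Take the address, force a full compacting collection, take it again.
        IntPtr before = Marshal.UnsafeAddrOfPinnedArrayElement(pinned, 0);
        GC.Collect(2, GCCollectionMode.Forced, blocking: true, compacting: true);
        IntPtr after = Marshal.UnsafeAddrOfPinnedArrayElement(pinned, 0);

        // The array did not move, because pinned-heap objects are never relocated.
        Console.WriteLine(before == after); // True
    }
}
```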

api-approved area-System.Runtime

Most helpful comment

Video

Looks good as proposed.

namespace System
{
    public partial class GC
    {
        public static T[] AllocateArray<T>(int length, int generation = -1, bool pinned = false, int alignment = -1);
        public static T[] AllocateUninitializedArray<T>(int length, int generation = -1, bool pinned = false, int alignment = -1);
    }
}

All 93 comments

Edited proposal to match naming guidelines

@jkotas supportive of this going to api review?

Nit: The method should be static.

@jkotas supportive of this going to api review?

Yes.

return AllocateNewArray(typeof(T).TypeHandle.Value, length, generation, clearMemory);

I think this should rather be return AllocateNewArray(typeof(T[]), length, generation, clearMemory) ... but that's an implementation detail we can figure out later.

IMO it would be a good place to add alignment control for GC allocations. An additional parameter or an additional overload would serve the purpose very well.

class GC
{
    // generation: -1 means to let GC decide
    // 0 means to allocate in gen0
    // GC.MaxGeneration means to allocate in the oldest generation
    T[] AllocateLargeArray<T>(int length, int generation = -1, int alignment = -1, bool clearMemory = true)
    {
        // calls the new AllocateNewArray fcall.
        return AllocateNewArray(typeof(T).TypeHandle.Value, length, generation, clearMemory);
    }
}

where an alignment value of -1 means the GC decides, and any value > 0 asks for the allocation alignment specified by the caller.

See https://github.com/dotnet/csharplang/issues/1799 [Performance] Proposal - aligned new and stackalloc with alignas(x) for arrays of primitive types and less primitive as well

alignment control for GC allocations

This problem has been discussed in https://github.com/dotnet/corefx/issues/22790 and related issues.

Video

  • Should it just be AllocateArray? In the end, the developer controls the size.
  • What happens if the developer specifies gen-0 but wants to create a 50 MB array? Will the API fail or will it silently promote the object to, say, gen-1?
  • Not clearing the memory is fine, but we want to make sure it shows up visibly at the call site (a plain false isn't good enough). We'd like this to be an overload, such as AllocateLargeArrayUninitialized. The other benefit of having an overload is that it could be constrained to only allow Ts with no references (unmanaged constraint).
  • Is LOH the same as MaxGeneration? If not, how can a developer explicitly allocate on the LOH?

We'd like this to be an overload, such as AllocateLargeArrayUninitialized

Agree. Did you mean AllocateUninitializedArray ?

The other benefit of having an overload is that this could be constrained to only allow Ts

I do not think we want the unmanaged constraint. It would just make this API more painful to use in generic code for no good reason. GC should zero-initialize the array in this case. Note that the array will be zero-initialized in many cases anyway, when the GC does not have an uninitialized block of memory around. The flag is just a hint to the GC that you do not care about the content of the array.

Should it just be AllocateArray? In the end, the developer controls the size.

this is only meant for large array allocation, ie, arrays larger than the LOH threshold.

What happens if the developer specifies gen-0 but wants to create a 50 MB array? Will the API fail or will it silently promote the object to, say, gen-1?

that's something we need to decide. but if it fails to allocate anything in gen0 it would revert to the default behavior (ie, on LOH).

No clearing the memory is fine, but we want to make sure it shows up visibly on the call side

I am not sure why this needs to be an overload but not the other aspects. Why wouldn't there be an AllocateLargeArrayInYoungGen overload too, then?

Is LOH the same as MaxGeneration? If not, how can a developer explicitly allocate on the LOH?

LOH is logically part of MaxGeneration.

I do not think we want the unmanaged constraint. It would just make this API more painful to use in generic code for no good reason.

this API is not for generic code though. I would only expect people with very clear perf intentions to use this. And if you specify to not clear, then, if I were a user, I think it would be more desirable to get an error when that can't be done (ie, the type has references) instead of silently taking much longer.

after the discussion it seems like this API should perhaps take another parameter that indicates whether the operation succeeded or not, eg, AllocateLargeArrayError.TooBigForGen0, AllocateLargeArrayError.MustClearTypesContainsReferences. however I will leave this decision to API folks.

this API is not for generic code though

What makes you think that it is not? It is very natural to use these APIs to implement generic collections optimized for large numbers of elements.

I am not sure why this needs to be an overload but not the other aspects. why wouldn't there be a AllocateLargeArrayInYoungGen overload too, then?

The uninitialized memory has security ramifications so you want to have an easy way to search for it. Generation hint has no security ramifications.

What makes you think that it is not? It is very natural to use these APIs to implement generic collections optimized for large numbers of elements.

do you think the default is not good for "implementing generic collections with a large number of elements" _in general_? I would think it is - you'd want the objects to be cleared so you don't deal with garbage; and most of the time, if you have an object with a large # of elements, it should be on LOH, not gen0.

The uninitialized memory has security ramifications

ahh, yep, makes sense to single out APIs with security ramifications.

The default is fine for most cases. This API is a workaround for cases where the default does not work well and turns into a bottleneck.

Large arrays are used mostly for buffers and collections. I think it is important that this API works well for specialized generic collections.

For example, the email thread from a few months ago that both of us are on had this real-world code fragment:

class ListEx<T> : IList<T>
{
    private T[][] Memory = null;

    public T this[int index]
    {
        get
        {
            // checks removed
            return Memory[index / blockSize][index % blockSize];
        }
    }
    // ...
}
This code artificially allocates a number of smaller arrays to simulate a large array. It does this to avoid landing a short-lived large array in the Gen2 heap. The double indirection has non-trivial cost (element access is several times slower). Changing the implementation of this ListEx<T> to use these APIs and avoiding the double indirection should be straightforward. Also, when T does not contain object references, it is fine for the backing array to be uninitialized.
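Assuming the proposed API is available, the chunked ListEx<T> above could collapse to a single backing array. FlatList<T> below is an illustrative name (not part of the proposal), and it relies on the source's guarantee that AllocateUninitializedArray still zero-initializes when T contains references:

```csharp
using System;

// Sketch: one large backing array instead of T[][] chunks,
// removing the double indirection on every element access.
class FlatList<T>
{
    private readonly T[] _items;

    public FlatList(int capacity)
    {
        // Skips zeroing when T has no object references; otherwise
        // the GC zero-initializes anyway, so this is safe for any T.
        _items = GC.AllocateUninitializedArray<T>(capacity);
    }

    public T this[int index]
    {
        get => _items[index];
        set => _items[index] = value;
    }
}

class Program
{
    static void Main()
    {
        var list = new FlatList<int>(1_000_000);
        list[123_456] = 42;
        Console.WriteLine(list[123_456]); // 42
    }
}
```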

:laughing: I see what the confusion was... by "generic" I meant "general cases" and you meant "code that implements generic collections". What I meant was that this is not an API used in general cases, so even though it's a little harder to use, I don't see that as a problem.

this is only meant for large array allocation, ie, arrays larger than the LOH threshold

@Maoni0, what would be the proposed behavior if the user attempts to create an array smaller than the LOH threshold?

what I meant was that this is not an API used in general cases, so even though it's a little harder to use, I don't see that as a problem.

A little harder to use is fine. The unmanaged constraint would make it very hard to use in my ListEx example (you would have to use reflection to call the API). That is why I think the unmanaged constraint is not good for this API.

what would be the proposed behavior if the user attempts to create an array smaller than the LOH threshold?

e.g.

_longLivedArray = AllocateLargeArray<Vector4>(length: 8000, generation: 2);

Where the goal is more to allocate straight to final generation

I like the suggestion, but I'm curious about two things, and have some thoughts:

  1. I thought the LOH has no generations and objects do not get promoted or compacted?
    I was under the impression that LOH objects are only compacted when I set
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();

I must admit that I truly dislike this. In my opinion there should either be a deterministic method, e.g. GC.CompactLOH(), either blocking or nonblocking, or there should be a setting for how to handle the LOH in terms of GCing (for my applications I would prefer the approach that, if an LOH compaction is necessary because I'm running out of memory, the compaction takes place without manual interaction instead of just getting an OutOfMemoryException). So an LOHCompactingBehavior would be nice. Most of our objects are larger than 85k, more in the direction of 512x512x4. So if I don't implement my own mechanism to call GC.Collect() and do a CompactOnce, the memory gets more and more fragmented even if I have enough memory, right?

  2. Another point is that currently the 85k threshold is somehow an implementation detail that not everyone is aware of.
    So I personally prefer the suggestion from @terrajobst with AllocateArray, but I would rather place it where the developer expects it, namely on Array. There is already a CreateInstance, although not generic, but what would prevent you from putting it there?

In the end I might also be interested in not initializing the array even if it's not an LOH array. We hit this a lot when loading data: I need to allocate a byte array first, which is immediately initialized to 0, but in the end I only need a container to overwrite again.

I would rather place it where the developer expects it to be, namely in the Array

Array is a mainstream type. These are specialized methods for micro-managing the GC that we expect to be used rarely. We avoid placing specialized methods like this on mainstream types. For example, GC.GetGeneration could have been on object, but it is not, for the same reason.

Updated the proposal at the top with feedback incorporated. @Maoni0 Thoughts?

@jkotas: Somehow you are right. But from a certain point of view, as a user I don't want to search through the API to find specialized things. I think it is not so seldom that people allocate more than 85k, right? 85k is not such a big number, so I guess there are many people out there using larger arrays without even knowing there is a difference, as the GC internals are not documented in as much detail as other "classes".

It would be interesting to see how many people know about these internals. Do you have a number on this?

To be honest, I'm fully OK if it's placed in the GC ;-) But I'm a fan of putting the things together where they belong. Something like the GC and the GCSettings seems to me an artifical separation.

I think it is not so seldom that people allocate more than 85k right?

Right. We believe that the right default for >85k arrays is to put them into Gen2. We do not expect a lot of .NET developers to worry about these internals. If they need to worry, we have failed.

The path by which folks discover these APIs is that they find they have a GC performance issue in their app, find the root cause, and get to a documentation page with suggestions for solving different GC performance issues. This API can be one of the suggestions; another can be array pooling.

@jkotas I think I misunderstood what you meant by "unmanaged constraint". You meant you don't want users to have to figure out whether a type contains refs or not (and then call the API only if it doesn't). I do agree that would be a good thing. A (nit) comment I have on the new AllocateUninitializedArray API is that the name sounds like the array will for sure be uninitialized, but in reality it will be initialized if the type contains references, and that (important) part isn't reflected in the name. But AllocateUninitializedArrayWhenAppropriate is probably too long.

I'd like to keep this API only for allocating large objects, because I am not implementing a new `new`. Our implementation for allocating an object with `new` is heavily optimized and I am not going to duplicate all that logic. That's ok for allocating a large object, 'cause they are already expensive to allocate. Of course there's a balance between the GC cost this might save and the allocation perf. My worry with opening up this API for smaller objects is that people may allocate way more in the old gen and end up reducing total perf (ie, allocation is more expensive and there is more cost in GC).

unmanaged constraints

The unmanaged constraint is a new C# language feature: https://github.com/dotnet/csharplang/blob/master/proposals/csharp-7.3/blittable.md

in reality it will be initialized if it contains references

In reality, it will be also initialized if the GC does not have a suitable block of memory to reuse. Naming is hard - I agree that AllocateUninitializedArrayWhenAppropriate feels too long.

I'd like to keep this API for only allocating large objects

Do you mean to enforce this (e.g. fail with an exception when the size is less than X - what should X be?), or just provide guidance and log this in the GC trace (I think we should have uses of these APIs in the GC trace anyway)? I think it should be just guidance and logging.

In reality, it will be also initialized if the GC does not have a suitable block of memory to reuse.

whether the GC happens to have a suitable block of memory to use is completely unpredictable. The point is that if the type contains references, the GC guarantees the array is initialized; whereas if it doesn't contain references, the GC makes no such guarantee at all if you call this API.

Do you mean to enforce this (e.g. fail with exception when the size is less than X - what should X be?)

X is the LOH threshold, which can be exposed as something the user can get. I don't have a very strong opinion on whether to enforce this or not. I can see pros and cons for both. I lean towards enforcing, but I can understand that users probably want the other way.

I have seen cases where folks allocate several arrays (not necessarily above the LOH threshold) and pin them for a very long time. The GC has to step around the pinned arrays, which causes perf issues if they are stuck in a bad place. This would be another case where this API may help, and it is a reason for not enforcing the LOH threshold.

yep, that's certainly a good scenario - obviously it would require you to know beforehand which objects will be pinned; a common situation with pinning is that you allocate objects first, then decide to pin them some time later, at which point the generation is already decided. but yes, if you do know at alloc time, that makes a legit case for using this API.

discussions like this (ie, the kinds of scenarios you'd like to use this API for) are definitely welcome!

The example @benaadams used is a good one. I make plenty of allocations under the LOH limit that I know in advance are going to be long-lived (and/or pinned at some point).

For that matter, it might be advantageous to have ArrayPool<T>/MemoryPool<T> allocate straight to gen2, even for the smallest arrays, since they're likely to live long enough to be promoted anyway.

Love this proposal! Knowing the lifetime in advance and being able to allocate from the start where it is more efficient. On many occasions when allocating an array of structs that I knew would have to stay for the duration of the application, I had to allocate at least 85 KB to make sure it was going to the LOH... Being able to allocate a smaller array directly into gen2 would be great.

Extra question: would we have a way to pin this allocation afterwards, knowing that it is in gen2 and that it would not move anymore, for example? (usage: sharing caches between managed arrays and native code)

@xoofx being in gen2 doesn't mean it would not move anymore. and you can pin the object you get back just like you can pin any other object.

@xoofx being in gen2 doesn't mean it would not move anymore. and you can pin the object you get back just like you can pin any other object.

Oh you are right, gen2 is still being compacted (I mixed it up with my usage of the LOH back in the days when it was not), and absolutely, for the pinning I was probably not awake enough.

I've updated this with the pinning option per discussion on 19936.

@Maoni0
@jkotas

Is there any chance a fourth argument could be added?

static T[] AllocateArray<T>(int length, int generation, bool pinned=false, bool direct=false)

If we developers passed in true for the fourth argument (i.e. "direct=true"), then the GC would go directly to the OS using VirtualAlloc, similar to the way the OS heaps go direct to the OS using VirtualAlloc when the allocation size is over 512KB(32bit)/1024KB(64bit).

Pseudo logic might look something like this:

  • if "direct=true", the GC doesn't use its "new" allocator, instead it simply calls VirtualAlloc

    and sets a bit in the object's header indicating this object's memory is from VirtualAlloc.
    (similar to the way the lock statement or gethashcode uses a bit in the header)

  • then when the GC does its normal garbage collection logic to cleanup objects; if it sees that bit set in the object's header, it would just skip the normal logic and instead simply call VirtualFree on the starting address of the object.

This "direct" argument would basically control whether the T[] got its memory from the new GC heap, or direct from the OS utilizing VirtualAlloc.

This seems like it might be a way to eliminate some of the possible fragmentation issues that might arise in the new heap, by giving us the option to allow the T[] object to not utilize the heap and instead get its memory direct from the OS with VirtualAlloc.

This new API looks great! It seems like it will go a long way toward helping the perf issues, it would just be kind of nice if we could also have this additional argument as an optional way to handle possible fragmentation issues that may arise due the many different varying allocation patterns of the consumers of the API.

Memory for an object coming from VirtualAlloc is not as unorthodox as it first sounds. On the contrary, it's actually very similar to the way the OS heaps go direct to the OS utilizing VirtualAlloc when the allocation size is over 512KB(32bit)/1024KB(64bit).

For _small_ array objects developers would leave this "direct" argument "false", and it would use the new heap.

Several other possible PRO's:

  • It also might help a little on the issue of the cost of clearing the memory for a very large object, because the memory from VirtualAlloc comes already cleared by the OS (i.e. pages zero'd).

  • It also might help with some of the problems caused and associated with "pinning", because the actual need for "pinning" kind of goes away since the memory from VirtualAlloc doesn't move.

  • With this "direct" argument set to "true", the pages of memory that the object consumes are released back to the OS _immediately_ when the object is destroyed. We don't have to wait until the entire segment of memory used for the heap is cleared in the GC before that segment of memory can be released back to the OS. Segments not being released back to the OS immediately is not that big a deal when the segment sizes are small, however when the sizes are gigantic, as is the case with some of the new server garbage collection config options, it can sometimes become a rather serious issue.

For example, if you're in a server GC configuration with gigantic memory segments and you dispose a huge array of 1GB: with this argument set to "true", the GC would call VirtualFree on the object's starting address, and the physical memory (RAM) would _immediately_ go away from the process's working set and _immediately_ show up in the OS's available physical memory (RAM) again. It won't have to wait for the huge segment to become clear at some later time before it can be released back to the OS and show up in the available physical memory (RAM) again.

  • The "copy semantics" between the managed memory world and unmanaged memory world that plagues most of the high-performance memory related scenarios attempted in C#, might be somewhat eliminated with this argument. Instead of having to resort to C++ for high-perf memory scenarios, which requires us to copy the memory back to C#, with this argument, we'd probably be able to stay entirely in the managed world for everything now, even including these high-perf memory scenarios. This just might allow us to possibly enjoy some of the huge benefits of a "no-copy semantics" paradigm.

For example, in a high-perf scenario with a huge array of 2GB, we'd no longer have to copy from C# and C# datatypes to C++, do a bunch of processing, and then copy back from C++ and C++ datatypes to C#. We could now do all our processing on our large T[], never leaving C#, and not having 2 copies of the data (i.e. only 2GB of memory instead of 4GB).

  • It also might help with the issue of dealing with the "exchange type" (i.e. Memory) concept with Span. If we had this argument option, it seems like we could just use the T[] returned from your new API for most everything now, kind of in a sense eliminating the problems of the "exchange type" caused from Span being a stack-only struct. We'd be able to use T[] like a regular object for all our regular stuff like async/await, inline lambdas (hoisted variables), etc.

  • The implications of allowing this one additional argument could possibly be huge, allowing us now for practical purposes to essentially not have to always be battling with the bridge between the managed memory world and unmanaged memory world. In a sense, that battle between the two memory worlds has created many of the long standing issues which has limited the use of C# in many of the extremely high-perf memory related scenarios inherent in many of today's backend server application designs, and has also plagued us developers trying to write really high-perf backend server code in C# for years. Instead, we could now just stay entirely in the managed memory world, never leaving to go out to the unmanaged memory world, and thereby in the process sort of indirectly eliminate those issues due to them no longer being caused in the first place.

  • Basically, the concept of the "direct" argument is not as radical and out there as it first sounds. It's basically just controlling whether the memory backing T[] is just pages of memory directly from VirtualAlloc, versus memory from the new heap (which basically got its segments of memory from VirtualAlloc also).

  • When you stop and think about it, we're basically just giving the developer the ability for extremely large array objects to control whether VirtualAlloc is called to create each object (one-to-one), or whether one big VirtualAlloc is called to create the new heap and the new heap in turn creates all the objects (one-to-many). One-to-one when "direct=true", and one-to-many when "direct=false". What you get in return for one-to-one calling VirtualAlloc per object is the memory for the object is _immediately_ released back to the OS when that object is destroyed, which can be a much bigger concern for very large objects. Additionally, when the memory is released back to the OS _immediately_, it can sometimes help reduce or eliminate fragmentation in some cases.

CON's:

  • Since VirtualAlloc allocates on page boundaries, there can be some waste on the last page of memory when an object isn't large enough to consume all the memory in that last page up to the end of the page's boundary. However, since it was mentioned that the intent of the API is primarily meant for large array allocation; the amount of waste is rather negligible in relation to the overall large size of the object, which may possibly render this CON rather immaterial.

We believe the gain from the memory being released back to the OS _immediately_, far outweighs the small bit of memory waste on the last memory page, when you are dealing exclusively with very large arrays. And, you mention in the comments that the intent of the API is primarily meant for large array allocation.

P.S. String seems like it will also be needed for high-perf scenarios, since very large json strings are now being used in many large scale application designs, like micro-services architectures communicating with rest endpoints talking in large json documents, etc. Since String is basically just a char array, it seems like this new API would also work for creating large char arrays. It seems like there might be a way to simply add a new method to the String class that would allow us to assign the char array (char[]) returned from this new API to the internal m_firstChar field inside the String class, so everything would still work, possibly???

I do not think that the direct argument would help to solve the problem you are describing.

If there is a large amount of unused memory, the GC does release it back to the OS as soon as it notices it today. It can take a while between the time the program stops using the memory and the GC runs and notices that the memory is unused. The direct flag would not help with this problem - the same delay would still be there.

@Maoni0
@jkotas

How is creating a very large object from this new heap with the new API any different from creating the very large object on the regular Large Object Heap?

It seems like the new heap for the new API will have the same fragmentation issues as the regular Large Object Heap.

The new heap isn't going to be compacted, right?

If the new heap for the new API is experiencing severe fragmentation, this new "direct" argument could be an option to avoid it...

@dduerner-ycwang you seem to think that calling VirtualAlloc/VirtualFree for very large objects on the GC heap will somehow give you an advantage for lifetime management - that's not true, and besides, that's what's already happening, as @jkotas explained above. today, for a really large object (say 1GB+) we are already calling VirtualAlloc for it, as it will be living on its own segment (our largest default LOH seg size is 256mb, which means anything larger than that will need a new seg). and when the GC discovers the object is dead (which might be a while, since a gen2 GC may not happen for a while) we will call VirtualFree on it.

@Maoni0
@jkotas

We're not so much worried about lifetime management as we are about the fragmentation issues that potentially still exist. The fragmentation issues that exist with the LOH today seem like they may also exist in this new API. If an array object allocated from this new API comes from a regular style heap that's not compacted, it seems like the same fragmentation issues that exist in the LOH today will still exist? If that's not true and there will be no fragmentation problems, then please forgive us, and we're sorry for bringing up the subject. Our wanting to use VirtualAlloc/VirtualFree is not so much for lifetime management; it's really to combat those fragmentation issues.

In our testing with a server experiencing severe fragmentation, we saw the fragmentation virtually disappear when we called VirtualAlloc to allocate the individual object and called VirtualFree as soon as the object was disposed in the code.

And, fragmentation isn't something we should just sweep under the rug and ignore either, because we saw cases in our testing where the physical memory (RAM) being used up on the server was double to almost triple what it should have been for the objects due almost entirely to that fragmentation. And, when we're talking about large objects, it can add up to be a huge amount of RAM that's no longer available to the server. Calling VirtualFree as soon as the object was disposed in the code basically cut the physical memory (RAM) being used up on the server and not available to do other things virtually in half or more.

We only wanted this "direct" argument so that we might be able to do the same thing to combat fragmentation in this new API...if there will be no fragmentation in this new API, then once again we're sorry for bringing it up.

Kind Regards

we called VirtualAlloc to allocate the individual object and called VirtualFree as soon as the object was disposed in the code

I believe that the key here that you called VirtualFree as soon as the object was disposed in the code. The direct argument would not allow you to do that.

@Maoni0
@jkotas

In a sense it would allow us to do that, because if we were able to get back the array object with the "direct" argument, we the developers could call VirtualFree ourselves in our code when we want to dispose the object.

Then this could give us a way to workaround the fragmentation issues when they arise.

P.S.

  • Or, it would also allow us to do that if we added this to the new API that would call VirtualFree on the T[] object for us:

static void DisposeArray(object o)

Then we developers would be able to call this from our code as a workaround to combat the fragmentation issues when they arise. And if we forgot to call it, we'd still be alright, because the GC would still call VirtualFree on the next GC collection, so we wouldn't leak.

  • Or, it would also allow us to do that if the GC team when it implements this new API could add a "Dispose" to the T[] that would check if the bit we mentioned earlier is set in the object header indicating this object's memory is from VirtualAlloc, and if so call VirtualFree for us. That would allow us developers to wrap the object in a "using" block to combat the fragmentation issues when they arise.

If you are happy to use the Dispose pattern; then you can use VirtualAlloc and VirtualFree with Memory<T> and Span<T> and either IMemoryOwner<T> or MemoryManager<T> as the source, which are disposable
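As a rough illustration of that suggestion, here is a minimal sketch of a `MemoryManager<byte>` that owns native memory and frees it deterministically on Dispose. The `NativeMemoryManager` name is hypothetical, and `Marshal.AllocHGlobal` stands in for a direct VirtualAlloc P/Invoke:

```C#
using System;
using System.Buffers;
using System.Runtime.InteropServices;

// Hypothetical sketch: a MemoryManager<byte> over native memory. Dispose
// returns the block to the OS immediately, so these buffers never sit on
// the GC heap and cannot contribute to LOH fragmentation.
sealed unsafe class NativeMemoryManager : MemoryManager<byte>
{
    private IntPtr _ptr;
    private readonly int _length;

    public NativeMemoryManager(int length)
    {
        _length = length;
        // Marshal.AllocHGlobal stands in for VirtualAlloc here; a direct
        // P/Invoke of VirtualAlloc/VirtualFree would work the same way.
        _ptr = Marshal.AllocHGlobal(length);
    }

    public override Span<byte> GetSpan() => new Span<byte>((void*)_ptr, _length);

    // Native memory never moves, so pinning is effectively a no-op.
    public override MemoryHandle Pin(int elementIndex = 0)
        => new MemoryHandle((byte*)_ptr + elementIndex);

    public override void Unpin() { }

    protected override void Dispose(bool disposing)
    {
        if (_ptr != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(_ptr); // deterministic release, like VirtualFree
            _ptr = IntPtr.Zero;
        }
    }
}
```

The resulting `owner.Memory` flows through any `Memory<byte>`-based API, and disposing the owner (e.g. via `using`) frees the native block immediately instead of waiting for a GC.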

@benaadams

We're not totally sure, but are pretty sure Memory&lt;T&gt; won't let you pass it into legacy functions in our existing code base that only accept byte[] for example, without incurring a copy operation which can be a heavy hit for very large arrays...

if we were able to get back the array object with the "direct" argument, we the developer could call the VirtualFree ourselves in our code

This would not work. The GC needs to know about all memory it is managing. You cannot free the memory without telling it.

static void DisposeArray(object o)

You are basically asking for a classic C/C++ malloc/free. There is an existing API that does that: ArrayPool<T>.Rent/Return. It is not optimized for large arrays today - it is something that can be done without introducing a new API.
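For reference, the Rent/Return pattern being referred to looks roughly like this (the sizes and the `clearArray` choice are illustrative only):

```C#
using System;
using System.Buffers;

// Sketch of the Rent/Return pattern: the pool hands back a buffer at least
// as large as requested and reuses it after Return - the managed analogue
// of malloc/free for large temporary arrays.
byte[] buffer = ArrayPool<byte>.Shared.Rent(1_000_000);
try
{
    // The rented array may be longer than requested, so slice to the
    // logical length rather than using buffer.Length.
    Span<byte> payload = buffer.AsSpan(0, 1_000_000);
    // ... fill and process payload ...
}
finally
{
    // clearArray: true wipes the buffer before reuse (useful for
    // sensitive data, at the cost of a memset).
    ArrayPool<byte>.Shared.Return(buffer, clearArray: false);
}
```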

For example, in a high-perf scenario with a huge array of 2GB, we'd no longer have to copy from C# and C# datatypes to C++, do a bunch of processing, and then copy back from C++ and C++ datatypes to C#. We could now do all our processing on our large T[], never leaving C#, and not having 2 copies of the data (i.e. only 2GB of memory instead of 4GB).
We're not totally sure, but are pretty sure Memory&lt;T&gt; won't let you pass it into legacy functions in our existing code base that only accept byte[] for example, without incurring a copy operation which can be a heavy hit for very large arrays...

@dduerner-ycwang I have already done quite a bit of interop in C# with C++ code...etc. and there is no copy involved when you pass a large C# valuetype array (assuming the valuetype is blittable). The marshalling is pinning the array and passing directly the pointer to the unmanaged code
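A hedged sketch of the no-copy interop described above — the native library name and entry point below are hypothetical; the point is that a blittable array is pinned and passed by pointer, not copied:

```C#
using System.Runtime.InteropServices;

static class NativeLib
{
    // Hypothetical native entry point - the library name and signature are
    // illustrative. For a blittable element type such as byte, the default
    // marshaller pins the array for the duration of the call and passes the
    // raw pointer to native code; no copy is made in either direction.
    [DllImport("processing")]
    public static extern void ProcessBuffer(byte[] data, int length);
}

// byte[] data = new byte[1 << 30];
// NativeLib.ProcessBuffer(data, data.Length); // pinned for the call, not copied
```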

@jkotas

The GC does know about this memory because we got it from the new API. And, if the GC were to look for the bit in the object's header that we talked about earlier, it could just ignore the object if we've already called VirtualFree.

We were under the impression this new API was being introduced primarily for large arrays...what good would the ArrayPool<T>.Rent/Return help if it's not optimized for large arrays? That would seem kind of silly for us to use for large arrays in high-performance scenarios...

if the GC were to look for the bit in the object's header that we talked about earlier, it could just ignore the object if we've already called VirtualFree.

If the GC looked at the object header and you have already called VirtualFree, it would crash with segfault.

..what good would the ArrayPool.Rent/Return help if it's not optimized for large arrays? That would seem kind of silly for us to use for large arrays in high-performance scenarios...

I agree that there would be work required to fix the performance of Rent/Return for large arrays to make it work for your case. It does not require new API though.

If there is an existing API fit for the job and the only problem is that it is not optimized for a given case, we prefer to fix the performance of the existing API, not introduce a new API.

@Maoni0

it's difficult to support allocating a non array object since you'd need to pass in args for constructors

I think this problem could be solved by allowing the new keyword to take parameters, such as new(int generation=-1, bool pinned=false) Constructor(...).
For example:
```C#
class A { decimal n0, n1, n2, ... , n100; }
var a = new(2, true) A();
```

alignment control for GC allocations

This problem has been discussed in dotnet/runtime#22990 and related issues.

@jkotas hey, just thinking again about this, but what would be the problem for adding support for alignment (of the first &T[0]) for this particular API and use case (array)? That would allow scenarios where we could align data on a cache line and that would be actually quite useful in lock free scenarios/avoid false sharing.

Sounds good to me if it works with pinned=true only. It can be yet another optional argument.

@Maoni0 what do you think? Adding alignment = 0 on the original API seems a reasonable change but implementation wise, I don't know enough if it is an issue for the existing code or it would just be an easy adjustment:

```C#
class GC
{
    static T[] AllocateArray<T>(int length, int generation, bool pinned=false, int alignment = 0);
    static T[] AllocateUninitializedArray<T>(int length, int generation=-1, bool pinned=false, int alignment = 0);
}
```

@xoofx would the first array data byte be aligned or the object header (+method table ptr, length etc); I assume the data being aligned would be preferable?

@xoofx would the first array data byte be aligned or the object header (+method table ptr, length etc); I assume the data being aligned would be preferable?

Yes, the alignment has to be on the first element &T[0]

So would the alignment only make sense for value types? Why align a ref type if its fields are unaligned because of the object metadata

So would the alignment only make sense for value types? Why align a ref type if its fields are unaligned because of the object metadata

Yes, alignment is probably more meaningful for value types in the case of this API, which allocates arrays (SIMD, lock free, false sharing, etc.). For reference types it would be less interesting, but you could want to align a batch of references on a cache line and update them one cache line per thread, for example; that could be a scenario... (cases: lock free, false sharing)

So would the alignment only make sense for value types?

The alignment is for the array data. Reference types are all single pointer sized data elements as far as the array data is concerned. The pointers are pointer aligned for regular arrays anyway, but yes it still might not give much additional benefit.

However it still would be important for value types that contain reference types e.g.

struct TaggedPointer
{
    object Obj;
    IntPtr Tag;
}

Align to 16 bytes (on x64) or 8 bytes (on x32); and then it can be used with CMPXCHG16B for lock free swaps avoiding the ABA problem which just switching a pointer by itself doesn't.

@Maoni0 what do you think? Adding alignment = 0 on the original API seems a reasonable change but implementation wise, I don't know enough if it is an issue for the existing code or it would just be an easy adjustment:

class GC
{
    static T[] AllocateArray<T>(int length, int generation, bool pinned=false, int alignment = 0);
    static T[] AllocateUninitializedArray<T>(int length, int generation=-1, bool pinned=false, int alignment = 0);
}

Should the alignment only work when pinned is set to true? If so, what if we set pinned to false but alignment is some value such as 16? Will the array be allocated into the pinned heap automatically? Or will the alignment requirement be ignored? Or will it simply throw an exception?

Or just simply throw an exception?

Usually, I prefer explicit, so I would expect an exception. The act of pinning should be something carefully thought through by the user, as it can have significant implications on the GC compaction story, so better to make the user aware of this.

I am fine with adding alignment support for pinned objects. however it would be good to still keep the property where these pinned objects contain no references. while it's nice to be able to do CMPXCHG16B, I would think using this for data only (eg SIMD) is a much more common scenario.

as far as the error goes, it seems fine to just throw an exception but I don't have a strong opinion on that. perhaps someone from the API review team has an opinion.

however it would be good to still keep the property where these pinned objects contain no references. while it's nice to be able to do CMPXCHG16B, I would think using this for data only (eg SIMD) is a much more common scenario

It is quite common to use Interlocked.CompareExchange with a reference/object directly, and that can be quite useful in lock-free/false-sharing scenarios, even for reference types. So maybe having pin for reference types would be useful. Is there any problem having pin for arrays with ref types?

For multi-threading, and to avoid false-sharing perf degradation, it can for example be super relevant to allocate a struct the size of a cache line (which could contain object references), so that when each thread is working on its own data (associated with its own struct = cache line), it can read/write to its cache line without forcing the other cores/threads to reload from memory.

Example of a benchmark: https://nativecoding.wordpress.com/2015/06/19/multithreading-multicore-programming-and-false-sharing-benchmark/ (3x times faster when false sharing is avoided on a multicore arch)
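A small sketch of the cache-line idea (the 64-byte line size is an assumption; it varies by CPU):

```C#
using System;
using System.Runtime.InteropServices;

// Pad each per-thread slot out to a full 64-byte cache line so one thread's
// writes don't invalidate the line another thread is reading (false sharing).
[StructLayout(LayoutKind.Explicit, Size = 64)]
struct PaddedCounter
{
    [FieldOffset(0)] public long Value;
}

// One slot per thread:
//   var counters = new PaddedCounter[Environment.ProcessorCount];
// With the proposed API, AllocateArray<PaddedCounter>(n, pinned: true,
// alignment: 64) would additionally guarantee &counters[0] starts on a
// line boundary, so no element ever straddles two lines.
```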

alignment support for pinned object - it seems fine to just throw an exception

Agree.

For multi-threading, and to avoid false-sharing perf degradation, it can for example be super relevant to allocate a struct the size of a cache line

Agree. This API is a low-level power tool. I do not think it makes sense to artificially limit how it can be used.

I am not questioning that allowing refs in pinned objects is useful - it obviously can be useful for certain scenarios and I am sure you can find/write a benchmark that is improved greatly due to a perf feature. it's a cost/benefit decision. allowing them to have refs means GC would need to consider them as roots to non pinned objs. it makes things more complicated and can be more limiting. this can be done incrementally though...we can provide pins without refs first.

@Maoni0 @jkotas
Please consider this scenario: I want to allocate a large array of a ref type in gen2, and also allocate its elements in gen2. But your new API can only allocate the array itself; how could I allocate its non-array elements in gen2?

For example:
```C#
object[] obj_arr = GC.AllocateLargeArray<object>(100000, 2);
int i;
for (i = 0; i < obj_arr.Length; i++)
    obj_arr[i] = new object(); // How could I allocate obj_arr[i] in gen2 too?
```

allowing them to have refs means GC would need to consider them as roots to non pinned objs.

I see - it would not "just work" well in a pay-for-play fashion. Disallowing it makes sense then to get started. We can always do the extra work to allow it later.

But your new API could only allocate the array itself, how could I allocate its non-array elements in gen2?

This API is not for this scenario. https://github.com/dotnet/coreclr/issues/4365 and related have discussion about these scenarios.

allowing them to have refs means GC would need to consider them as roots to non pinned objs.

I see - it would not "just work" well in a pay-for-play fashion. Disallowing it makes sense then to get started. We can always do the extra work to allow it later.

Agree, that's fine. We can still workaround this by using int indices instead of references and use an indirect managed array for them.
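A minimal sketch of that workaround (all names hypothetical): the pinned side holds only value-type indices, while object references live in an ordinary managed array the GC scans normally:

```C#
// Hypothetical sketch: the pinned/aligned side stores only int indices
// (no object references), and a separate, ordinary managed array resolves
// those indices to objects. The GC scans the object array normally, and
// the pinned array keeps the "no refs in pinned objects" property.
class Node { public int Payload; }

class IndexedStore
{
    private readonly int[] _indices;   // could come from the proposed AllocateArray<int>(n, pinned: true)
    private readonly Node[] _objects;  // regular GC-scanned array

    public IndexedStore(int capacity)
    {
        _indices = new int[capacity];
        _objects = new Node[capacity];
    }

    public void Put(int slot, Node node)
    {
        _objects[slot] = node;
        _indices[slot] = slot; // pinned side holds only value-type data
    }

    public Node Get(int slot) => _objects[_indices[slot]];
}
```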

@Maoni0 Currently the API review queue is placing priority on things marked as 3.0. Is this a 3.0 effort, or Future?

@bartonjs thanks for checking. I'll mark this for Future.

Can we also consider non-arrays while we're in here?

static T Allocate<T>(int generation=-1, bool pinned=false, int alignment=-1, int granularity=-1);

Specifically, I'm interested in alignment and granularity. Cache coherency does a real number on performance when two threads happen to be updating the same cache line. Being able to ensure an object occupies its own cache line would be great.

Video

Looks good as proposed.

```C#
namespace System
{
    public partial class GC
    {
        public static T[] AllocateArray<T>(int length, int generation=-1, bool pinned=false, int alignment=-1);
        public static T[] AllocateUninitializedArray<T>(int length, int generation=-1, bool pinned=false, int alignment=-1);
    }
}
```

Just to validate: pinned=true is both a guarantee, and the pinning acts as a weak reference? (as there isn't a handle to release)

Whether this manifests in the implementation as a separate heap is an implementation detail 😉

What I mean by this is this would be a perfectly valid use?

_memory = MemoryMarshal.CreateFromPinnedArray(
                      AllocateUninitializedArray<byte>(length: 4096, pinned: true), 
                      0, 
                      4096);

After which any call on _memory.Pin() no-ops, as it's expected to already be pinned.

That's my understanding.

@Maoni0, could you confirm?

the new API just returns an object, and if you pass in pinned as true then as long as the object is rooted it will be pinned.

Perfect!

Can we change the current implementation to:

```C#
class GC
{
    // generation: -1 means to let GC decide (equivalent to regular new T[])
    // 0 means to allocate in gen0
    // GC.MaxGeneration means to allocate in the oldest generation
    //
    // pinned: true means you want to pin this object that you are allocating
    // otherwise it's not pinned.
    //
    // alignment: only supported if pinned is true.
    // -1 means you don't care which means it'll use the default alignment.
    // otherwise specify a power of 2 value that's >= pointer size
    // the beginning of the payload of the object (&array[0]) will be aligned with this alignment.
    public static T[] AllocateArray<T>(int length, int generation=-1, bool pinned=false, int alignment=-1)
    {
        // calls the new AllocateNewArray fcall.
        return AllocateNewArray(typeof(T).TypeHandle.Value, length, generation, pinned, clearMemory: true);
    }

    // Skips zero-initialization of the array if possible. If T contains object references, 
    // the array is always zero-initialized.
    public static T[] AllocateUninitializedArray<T>(int length, int generation=-1, bool pinned=false, int alignment=-1)
    {
        return AllocateNewArray(typeof(T).TypeHandle.Value, length, generation, pinned, clearMemory: false);
    }
}
```

IMHO there are some ppl out there who would like to experiment with the API in the v5.0 alpha stage 👀

@Maoni0

Ping -> When can we expect a PR making this API public? We need it even in v5.0 alpha 👀

@4creators you mean just the AllocateUninitializedArray one or the whole thing? we are still working in incremental steps for the rest.

```C#
class GC
{
    // Skips zero-initialization of the array if possible. If T contains object references, 
    // the array is always zero-initialized.
    public static T[] AllocateUninitializedArray<T>(int length, int generation=-1, bool pinned=false, int alignment=-1)
    {
        return AllocateNewArray(typeof(T).TypeHandle.Value, length, generation, pinned, clearMemory: false);
    }
}
```

Useful minimum could be as the above code

not all of that is implemented. our internal implementation right now is

T[] AllocateUninitializedArray<T>(int length)

if you can illustrate how you'd want to use this we can see if/what is missing.

The important part for me is to provide, besides the allocation of uninitialized data, the allocation of aligned arrays. My application is image/vision/video analysis and transformation.

Is there a PR we can track, or do you need some input or help for specific use cases and testing? Are we still on track for 5.0 here?

Asking, as F# could greatly benefit from uninitialized arrays. I've been doing some optimization analysis there for some performance PRs, and since arrays are immutable (treated as such, but not really) there's a clear (and tested, proven with buffers) benefit to preventing the extra O(n) cycle for zero-initializing an array, which isn't needed, since we know we'll assign values to each member anyway.

Since the next F# is planned to be released together with .NET 5, it'll be great if this got in in time (in a preview) to also make the changes there (I've looked at workarounds, but they're messy at best, and require a dependency on System.Buffers).

PS: I understand it's only for value types and that the api provides a hint, not a guarantee, and it'll be interesting to see how strong this hint will be followed.

PPS: for VS tooling, F# suffers from LOH perf issues related to arrays, ultimately leading to stalls of VS. Forcing an array to stay out of the LOH could help there too, esp. since these arrays are short-lived.
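The pattern described above — every element assigned before any is read — is the textbook case for the API; a minimal C# sketch (illustrative only):

```C#
using System;

// Every slot is written before any slot is read, so the zero-initialization
// pass a regular "new double[n]" performs would be pure overhead.
static double[] Squares(int n)
{
    double[] result = GC.AllocateUninitializedArray<double>(n);
    for (int i = 0; i < n; i++)
        result[i] = (double)i * i; // assign every element; no read-before-write
    return result;
}
```

Note this is a hint, not a guarantee: the runtime may still zero the memory (e.g. for small arrays, or when the OS already zeroed the pages).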

@abelbraaksma is already in .NET 5 https://github.com/dotnet/runtime/pull/33526?

@benaadams, thanks! I must have missed that, I thought I checked every linked issue. Though, that means this should be closed and tagged as implemented/ready?

I haven't done a good job linking to PRs for the implemented parts, the uninit array API was enabled with this PR.

I left this open just because not all options are implemented.

Which are still unimplemented? Alignment and generation?

Reading through that PR, I wonder how strong GC_ALLOC_FLAGS.GC_ALLOC_ZEROING_OPTIONAL is; the added docs don't (yet) say anything about that.

I assume this isn't closed yet since the API is planned to be extended and not all methods have been added yet.

Edit: you already said so ;)

@tannergooding yes.

@Maoni0, thanks for the link! I scanned through the discussion; is the decision "yes or no initialization" purely based on the 256 bytes threshold? Or is the CLR free to init anyway unless it can determine it's safe? (but since it's only value types, I assume it's always safe?)

"yes or no initialization" purely based on the 256 bytes threshold?

This threshold is a performance optimization. The current implementation of the no-zero-init path has non-trivial fixed overhead, so it is only profitable for larger arrays. This threshold is an implementation detail and can change in the future.

Or is the CLR fee to init anyway unless it can determine it's safe?

The memory can be zero-inited anyway for a number of reasons: zero-initialization is required for correctness, the memory was zero-initialized already, the OS zero-initialized it, ... .

@jkotas, thanks, I'll have to measure for the given use cases and see what it does.

The current implementation of the no-zero init path has non-trivial fixed overhead

I would've expected that zeroing memory would mean: reserve/commit a chunk of memory, and init. And that without zeroing, the last step would be skipped. I guess it's not so simple as it seemed to me at first.

Zero-initialization is required for correctness

I see that for ref types, but not value types. I mean, they can certainly be invalid values for the type (DateTime comes to mind), but not "incorrect" from a GC point of view. I'd expect the rules here to match the SkipInitLocals flag, generally speaking.

I should experiment to get a better understanding :).

I see that for ref types, but not value types

Value types can contain reference type fields. For example, struct { int id; string name; } needs zero initialization because of the name field. In theory, you can avoid zero-initializing the id field in this case, but it is much easier and faster to zero initialize everything in cases like this one. SkipInitLocals works the same way.

Thanks, I should've been more specific and say "unmanaged" types. I meant a value type for which no members are reference types at any depth.

This behavior is how I would expect it, sounds perfect!

Moving to Future as this is not required for 5.0 as far as I can see.

@jkotas @Maoni0 considering that .NET 6 will be LTS, any chance the alignment feature can get some priority to make it into that release?

It's quite useful for library developers working with SIMD. I'm commenting wearing my ImageSharp 🎩, but it could be handy for ML.NET folks, and a wide range of other libs in the ecosystem.

