Runtime: Avoid zeroing memory on stackalloc?

Created on 23 Jul 2015  Â·  57Comments  Â·  Source: dotnet/runtime

The current stackalloc implementation uses a tight loop of push instructions to allocate and fill in the memory. For example, stackalloc int[16384] yields:

000007FE8D8A05C4  mov         eax,10000h
000007FE8D8A05C9  push        0
000007FE8D8A05CB  push        0
000007FE8D8A05CD  sub         rax,10h
000007FE8D8A05D1  jne         000007FE8D8A05C9

Given that the use of stackalloc is often performance related, it would be nice if the stack pointer just jumped directly to the end and left the memory as-is. This appears to be permitted by the C# spec, section 18.8: "The content of the newly allocated memory is undefined."

Changing this behavior would nastily break applications that relied on the zeroing (despite spec). I'm not sure how much of a problem this would be. It's a niche feature that tends to be used for very specific reasons, and older JIT versions seemed to leave the memory as-is (according to some postings circa 2003).

area-System.Memory enhancement optimization tenet-performance

Most helpful comment

The current plan is:

  • C# will keep zero initializing by default. Changing the default would be too breaking.
  • We have a set of issues opened to make the zero initialization done by the JIT more efficient or reduce need for it (#13827, dotnet/coreclr#13823, dotnet/coreclr#13825)
  • Folks who really want to get the last bit of performance by avoiding zero initialization can use custom ILLinker step (https://github.com/mono/linker/pull/159) when they know what they are doing. We do this for CoreLib today (via VM hack, but we should switch to the ILLinker), and we plan to experiment with this in CoreFX (https://github.com/dotnet/corefx/pull/25956). Based on the results of these experiments, we may consider introducing a more streamlined way to do this in future. @ahsonkhan You should consider experimenting with it in CoreFXLab as well if you believe that it would help.

I do not think there is anything actionable for 2.1 on this issue currently. Setting the milestone to future.

All 57 comments

In your example, the JIT has been told to force a zero initialization. There is code in the JIT to simply do large frame stack page probing without zero init, but that didn't kick in here.

Changing this behavior would nastily break applications that relied on the zeroing (despite spec).

Note that zeroing is not always done even today. The following code prints 42, that value remains on the stack from Foo:

C# static unsafe void Main() { Foo(); int* p = stackalloc int[16384]; Console.WriteLine(p[0]); } static unsafe void Foo() { int* p = stackalloc int[16384]; for (int i = 0; i < 16384; i++) p[i] = 42; }

Interesting! I should have looked deeper.

If my new understanding is right, the JIT just obeys the method IL header CorILMethod_InitLocals flag when choosing how to allocate, and III.3.47 of the CLI spec suggests that the JIT has no choice but to obey. Darn.

I guess any remaining degrees of freedom would be over on the roslyn side of things.

It's even more interesting if you check the C# specification and you find this text in 19.8:

The content of the newly allocated memory is undefined.

Go figure. I think that the IL spec is a bit messed up for requiring localalloc to follow the localsinit flag. That's why in my above example no zeroing is done, Main doesn't contain any locals and there's no localsinit flag. Perhaps Roslyn should just refrain from using localsinit in unsafe methods, it's of no use anyway.

Not setting localsinit for unsafe methods sounds like a good option. I haven't found anything that would restrict that in the specs yet, given that verifiability is already out the window. I think I might go write up an issue on that.

Another option might be to introduce a separate IL instruction that does not have the same baggage as localloc, allowing stackalloc to be implemented according to its looser C# spec. I think this might run afoul of the blanket implications of localsinit, though:

A “localsinit flag” that indicates whether the local variables and memory pool (§I.12.3.2.4) should be initialized by the CLI.

At best, it would be a weird corner case. Also, while I'm not familiar with the process behind doing such a thing, I would guess a change to Roslyn's localsinit behavior would be easier if acceptable.

@mikedn I'm not seeing this (re localsinit); even declaring the stackalloc'd variable results in a localsinit being emitted?

char* output = stackalloc char[length];
.method private hidebysig static string MultiBlockAsciiString(
    class MemoryBlock block,
    int32 offset,
    int32 length) cil managed
{
  // Code size       17 (0x11)
  .maxstack  4
  .locals init ([0] char* output)
  IL_0000:  ldarg.2
  IL_0001:  conv.u
  IL_0002:  ldc.i4.2
  IL_0003:  mul.ovf.un
  IL_0004:  localloc
  IL_0006:  stloc.0
  IL_0007:  ldarg.0
  IL_0008:  ldarg.1
  IL_0009:  ldarg.2
  IL_000a:  ldloc.0
  IL_000b:  call       string ::MultiBlockAsciiIter(
                                                class MemoryBlock,
                                                int32,
                                                int32,
                                                char*)
  IL_0010:  ret
}

Correction: localsinit being dropped does happen but there are some extra caveats; trying to narrow down

Haven't quite narrowed it down; but if you call a method it needs to be first param; and some sorts of work triggers it; possibly for loop even with the index being not local.

Pattern that seems to work:

1..function that declares via statckalloc and does no other work other than

  1. returning result of another function that takes the stackalloc as first param

Can pass though needed vars from first function to second. Bit of a contort...

See as example: https://github.com/aspnet/KestrelHttpServer/pull/312/files#diff-7042cc5345ca45a84e36433636d7f10fR15

There is code in the JIT to simply do large frame stack page probing without zero init, but that didn't kick in here.

@BruceForstall Could you share where this might be? I saw that CoreCLR generates 0 initialization of stack locals very late in the prolog in one of the Kestrel hotspots.

@choikwa you can do it with stackalloc, but you have to do a weird pattern of stackallocing in one function and nothing else except passing that variable into another function - then the function doing the stackalloc won't zero.

For stackalloc, the code is in CodeGen::genLclHeap(). Zero init of stack locals in the normal prolog is in CodeGen::genZeroInitFrame()

Is there any particular reason why zero init of stack locals must be done at the prologue *generation? Why not expose this to the optimizer which can dead-store-eliminate or push down init. where it's only necessary?

(I believe) we only zero init local vars with GC types that are untracked (thus don't get individual operation GC state change information). There may be other cases, such as potentially uninitialized locals in verifiable code.

Cursory search for zero init of locals:

  • Above method, in the prolog generation by tracking lvMustInit field of LclVarDsc

    • tracking seems to be done at Compiler::fgInterBlockLocalVarLiveness().

I am confused by this comment; I would be thrilled if someone could explain to me the need for volatility and must-init.
    /* Mark all pointer variables live on exit from a 'finally'
       block as either volatile for non-GC ref types or as
       'explicitly initialized' (volatile and must-init) for GC-ref types */
  • Compiler::fgInlinePrependStatements() for zero init of inlinee's locals by inserting assignment trees

The BOTR doesn't seem to describe why the former is needed or GC's relation to the stack locals. It would be nice if there are no hard restrictions to moving zero-init from prolog to earlier pass.

We don't enregister anything live in/out of handlers. That's because an exception in a "try" can happen at any time, and the handler needs to be able to find the current value of the variable in that case. The OS/VM doesn't preserve registers from the "try" to the handler, so the only place we can find the variable is on the stack. Thus, all writes to the variable must be to the stack. We could do better here in some cases, but that's the current status.

We only need to explicitly zero non-arg, GC-ref types because we're going to report them to the GC, and they must always have a valid value (zero is valid). If we knew the value was always defined before entering every EH region that had a handler using that value, we might be able to avoid force zero initializing it in the prolog. This could be improved with some better analysis.

It looks like Compiler::fgInlinePrependStatements() only zero inits locals for cases where the IL for a function requires it for the inlinee.

@BruceForstall @choikwa This issue is about unnecessary zero-initialization for stackalloc in C# / localloc in IL.

We need to do something about this for Span<T> to work end-to-end.

I have look into this some more:

  • Run CoreCLR tests with a hack that suppressed the init-locals bit for everything. No issues in the JIT found. (~30 tests written in IL that were exercising the init-locals bit failed - by design.)
  • Found one place in corelib that depends on the stackalloc memory to be zero-initialized: https://github.com/dotnet/coreclr/blob/97c58ac4fce27b7796206a59eea0ca27cb49fe1a/src/mscorlib/src/System/RtType.cs#L621
  • Observed that clearing the init-locals bit helps code quality with “out” arguments. The storage for “out” arguments is always zero-initialized today because of the init-locals bits, but the zero-initialization is not necessary (if the out arguments do not contain GC pointers).

The current options that I am thinking about are - from the most preferred to least preferred:

  1. Change C# compiler to not set the init-locals bit for unsafe methods; or even for everything in assemblies compiled with /unsafe or with a new compiler switch. Works for existing runtimes. Potentially breaks code like the corelib example above.
  2. Introduce option for ILLinker to clear the zero init locals bit on all methods in assembly. It would be the start of turning ILLinker into a full IL-to-IL optimizer. There are number of optimization ideas that we do not have a good place to put today – they are hard/expensive/breaking/complex to implement in either Roslyn or JIT. https://github.com/dotnet/roslyn/issues/15929
  3. Introduce a new CompilationRelaxations.NoZeroInitLocals flag and make folks to explicitly add it if they need fast stackalloc or other benefits of no init-locals bit. It is a bit odd design because of one piece data invalidates other piece of data. Does not work for existing runtimes.

Having a new intrinsic method that does not zero init, or having flexibility to control this at method granularity, seems to be unnecessary. Assembly level granularity should be sufficient for this.

Change C# compiler to not set the init-locals bit for unsafe methods

Probably works; and unsafe blocks.

for everything in assemblies compiled with /unsafe or with a new compiler switch

Probably a bit heavy?

MethodImpl flag?

[MethodImpl(MethodImplOptions.NoZeroInitLocals)]

[MethodImpl(MethodImplOptions.NoZeroInitLocals)]

This is variant of 3. I think that folks who care about this would want it for all their code, without need to annotate every method.

On 1. just realized you can't use stackalloc outside an unsafe block, so it would switch it off for everything always?

Right - that's the idea.

@russellhadley The ILLink option above is the easiest one to start with. Could you please look into it for ILLinker?

Stephen and I have tried to switch CoreCLR to use managed version of number parsing and formatting. It is straightforward port of C++ version of code. The C++ version of the code is using stack allocated scratch buffers. We found that it is impossible to get C# version close to C++ version because of the zero initialization imposses 15%-20% penalty. The changes are under https://github.com/jkotas/coreclr/tree/corertnumbercode branch for now. The microbenchmark tested is to run ((Int32)12345).ToString("C"); in a loop.

To unblock progress on the CoreLib work, I planning to add a workaround to suppress unnecessary zero initialization for CoreLib, linked to this issue. This workaround should be removed once we have permanent solution for this problem.

I have hit this perf issue before when using stackalloc in a tight loop for scratchspace for crypto blocks (16bytes). Is there an option for adding a new keyword?

stackalloc nozero byte[16];

or similar? I realise that the spec says that stackalloc won't necessarily zero, but I expect we are too far down the path to change behaviour in code that probably doesn't have a lot of safe guards anyway. As always people probably have come to expect it to be zeroed but an extra keyword or different keyword for it would leave existing code and allow an "opt in"

Also related "JIT: consider optimizing small fixed-sized stackalloc" https://github.com/dotnet/coreclr/issues/8542 to help stackalloc functions inline

Example of a stackalloc that is expecting zero initialization. https://github.com/dotnet/corert/blob/master/src/System.Private.CoreLib/src/System/String.Searching.cs#L177
The InitializeProbabilisticMap can leave some indexes unwritten.

Silently turning it off (option 1) would be a breaking change that's going to subtly corrupt data. It would imply verifying by hand all existing unsafe functions/classes/assemblies. init-locals effect is observable not only by stackalloc but also by using pointers to access ordinary locals: int i; *(byte*)&i = 42; return i; is valid unsafe C# that returns 42 today but would return garbage without init-locals. Even out parameters could cause some confusion when garbage can be seen in the locals window while debugging.

I would also like to turn off init-locals in some of my code, but since it's currently always on I've actually optimized some of my code that accesses locals by pointer to exploit the fact that they are zero initialised.

So I'm voting for some variant of option 3, ideally an attribute that could be applied at the assembly/class/method levels (it could be omitted from the output, only used by Roslyn to set init-locals on the functions in scope).

I agree turning it off isn't a good idea. It could silently break a lot of code out in the wild as well. In very unpleasant ways.

However it might not even need a CoreClr change but a Roslyn change instead?

https://github.com/dotnet/csharplang/issues/868

Won't turning off init locals also potentially cause a bug if you have done this?

public unsafe void MyUnsafeFunc()
{
    byte* otherBuffer = stackalloc byte[100];
    byte* output;
    SomeInteropFunction(out output);
}

Before you were passing a null pointer now you are not, if the behaviour or code path changes if you aren't passing a null pointer could there not be trouble?

This could be a bug farm

out shouldn't be testing its input?

Right but, c++ or c sometimes things like openssl say "if its a null ptr then do x otherwise try to write to said pointer".... its not great but it is...

Then its not an out var but a ref var :P

Before you were passing a null pointer now you are not, if the behaviour or code path changes if you aren't passing a null pointer could there not be trouble?

This would never pass a null pointer, it would pass a pointer to initialized memory. With this change, it would pass a pointer to uninitialized memory. Libraries that test the pointer itself, not its underlying value, do not break from this change.

So would the removal of init only effect stackalloc?

FYI the stackalloc was only in there to try to trigger the init local remove. The key is the pointer.

If the input value of output is important then it should be ref and then you init to null (and the C# compiler I believe would force you to). out means its input value is not important.

byte* output = null;
SomeInteropFunction(ref output);

out and ref are functionally the same; though semantically different (i.e ref needs init first, out needs init before return in callee)

If you interop it doesn't change whether you use ref or out so should use the one that agrees; i.e. if the input is important then it should be ref

out output will pass a pointer to the output variable to the method. In either case this will not be a null pointer. In P/invokes your code is equivalent to
SomeInteropFunction(&output);
The value of the 'output' variable does not matter unless the p/invoked function dereferences the pointer.

Sure so the c might be

void MyMethod ( void** output);

So I love the idea of init being ditched. I worry that it will cause the kind of managed/unmanaged bugs that lead to double freeing, segfaults and general nastiness.

That is why from where I am sitting it needs to be opt in. I hate to think what horrible interop code there is in some point of sale machine or similar somewhere that this could break.

Heck there is a lot of historical finance system interop code I would need to go double check.

A key word or an attribute would work fine. Even though init "isn't in the spec" it's observed behaviour for what 10 years ?

Maybe make the assembly opt into this new faster behavior by setting an attribute? That way the core libraries can make use of this enhancement easily.

It could look like:

[assembly: CodeOptimizationCompatibility(Level = CodeOptimizationLevel.V5_0)]

Or similar. The Level could trigger many optimizations that are currently not being done for compat reasons. This would be similar to the SQL Server compatibility level. It's a per-database setting enabling general improvements with each new version. In my opinion the SQL Server compatibility level has been a very useful concept.

I like it, it's opt in so doesn't break anything but you don't have to tag every method.

@benaadams

On 1. just realized you can't use stackalloc outside an unsafe block, so it would switch it off for everything always?

With @VSadov's work on ref like types it will be soon allowed.
Span<byte> buffer = stackalloc byte[10];

Why do we do a tight loop of pushes, rather than a single adjustment of the stack plus a call to ZeroMemory (or equivalent)?

Also what are the numbers for the overall perf improvement from stripping the flag?

CodeGen::genZeroInitFrame() uses different code sequences for different conditions: https://github.com/dotnet/coreclr/blob/6aa119dcdb88ba1157191600a992aa94eee59b22/src/jit/codegencommon.cpp#L6977

@BruceForstall, I would think that. for anything that is > 32 bytes, ZeroMemory would probably be faster (especially for large buffers) as it may take advantage of the underlying hardware (such as using rep movsb or writing up to 512-bits per iteration on hardware that support those sequences as being fast).

It would also be interesting to determine whether non-temporal zeroing would be faster for most cases (I would guess the common scenario is for a user to initialize the bytes themselves, so zeroing without polluting the cache might be better).

In either case, if we had some numbers for how much stripping the init flag would save, it might be possible to convince the compiler team to add a flag.

The init flag impacts all locals, not just stackalloc, and I imagine (for more complicated methods) the JIT isn't able to always elide the zeroing. So knowing the benefit across the board would be useful.

Note that clearing initlocals flag will still leave lots of zero-initializations of structs with GC fields even though they can be avoided in many cases.

I recently opened several items related to this: dotnet/coreclr#13822 (alreday fixed), dotnet/coreclr#13823, dotnet/coreclr#13825, dotnet/coreclr#13827. Also, ILLink now has the ability to clear initlocals unconditionally. I will soon update the version of ILLink used in corefx to have this option. We can try to enable it for all corefx assemblies.

it may take advantage of the underlying hardware (such as using rep movsb

Already does this; though that causes issues as isn't necessarily the most efficient way https://github.com/dotnet/coreclr/issues/13827

I will soon update the version of ILLink used in corefx to have this option. We can try to enable it for all corefx assemblies.

Getting perf numbers for this would be great. If the performance improvement is actually measurable for normal applications (or even things like CoreFX, Roslyn, etc) then it becomes worthwhile to suggest the compiler should have support for stripping the flag integrated into it.

@RussKeldorph, can you please triage/prioritize this issue. Ty

@ahsonkhan What do you mean by triage/prioritize? (It's already triaged) By when do you need it? Can you summarize why it is more important now? I'm just looking for a summary to help us weigh it against other tasks.

@dotnet/jit-contrib

By when do you need it?

2.1, and it is already marked as such. So please ignore my comment. I didn't notice the milestone.

We have some experimental APIs in corefxlab that use stackalloc spans and this could improve their performance.

For example:
https://github.com/dotnet/corefxlab/blob/master/src/System.Buffers.Experimental/System/Buffers/Text/BufferReader.cs#L209
https://github.com/dotnet/corefxlab/blob/master/src/System.Azure.Experimental/System/Azure/CosmosDbAuthorizationHeader.cs#L28
https://github.com/dotnet/corefxlab/blob/master/src/System.Text.Primitives/System/Text/Formatters/Formatter_floats.cs#L73
etc.

The current plan is:

  • C# will keep zero initializing by default. Changing the default would be too breaking.
  • We have a set of issues opened to make the zero initialization done by the JIT more efficient or reduce need for it (#13827, dotnet/coreclr#13823, dotnet/coreclr#13825)
  • Folks who really want to get the last bit of performance by avoiding zero initialization can use custom ILLinker step (https://github.com/mono/linker/pull/159) when they know what they are doing. We do this for CoreLib today (via VM hack, but we should switch to the ILLinker), and we plan to experiment with this in CoreFX (https://github.com/dotnet/corefx/pull/25956). Based on the results of these experiments, we may consider introducing a more streamlined way to do this in future. @ahsonkhan You should consider experimenting with it in CoreFXLab as well if you believe that it would help.

I do not think there is anything actionable for 2.1 on this issue currently. Setting the milestone to future.

Do people at Microsoft still write documentation?
https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/keywords/stackalloc
Doesn't mention Span.
Doesn't say whether it DOES or DOES NOT, in theory or practice zero memory or not.

This is important.

If you think you would benefit from this feature I have created a package that makes it easy to control this flag for a specific method or for all methods within a type/assembly using a custom attribute InitLocals(false).

https://github.com/josetr/InitLocals

Goes without saying that you shouldn't use it in places where you have no tests since you may rely on this zero-initialization behavior.

Also, don't forget to benchmark your code because it's very likely that you don't need it at all.

Was this page helpful?
0 / 5 - 0 ratings