Buffer::BlockCopy is an FCALL that delegates to the native memmove, which may run for a long time depending on how much is being copied.
If GC needs to sync with user threads at an inconvenient time, everything will stop until the memmove is done. GC would wait for the memmove, and everything else would wait for GC.
There was an actual scenario reported where GC pauses could take up to a minute due to this.
(dealing with very large streams, potentially swapped out, ...)
We should either "chunk" large copying into smaller pieces with intermittent GC polling, or just move the whole thing to managed code.
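To make the "chunking" option concrete, here is a minimal sketch in C. It is not the runtime's code: `chunked_memmove`, the `gc_poll_fn` callback, and the caller-chosen chunk size are all hypothetical names used for illustration, and the callback merely stands in for wherever a real GC poll would go. The sketch also ignores that a moving collector could relocate the buffers at a poll, which a real implementation would have to handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical callback standing in for a GC poll / safepoint. */
typedef void (*gc_poll_fn)(void *ctx);

/* Example poll used for illustration: counts how often it was called. */
static void count_polls(void *ctx)
{
    ++*(size_t *)ctx;
}

/* Copy len bytes in pieces of at most chunk bytes, polling between pieces.
 * Overlapping ranges are handled the way memmove handles them: the chunk
 * order is chosen so that no chunk overwrites source bytes not yet read. */
static void chunked_memmove(void *dst, const void *src, size_t len,
                            size_t chunk, gc_poll_fn poll, void *ctx)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    assert(chunk > 0);

    if ((uintptr_t)d <= (uintptr_t)s || (uintptr_t)d >= (uintptr_t)s + len) {
        /* dst is below src, or the ranges are disjoint: copy front to back. */
        size_t off = 0;
        while (len - off > chunk) {
            memmove(d + off, s + off, chunk);
            off += chunk;
            if (poll) poll(ctx);
        }
        memmove(d + off, s + off, len - off);
    } else {
        /* dst overlaps the tail of src: copy back to front. */
        size_t remaining = len;
        while (remaining > chunk) {
            remaining -= chunk;
            memmove(d + remaining, s + remaining, chunk);
            if (poll) poll(ctx);
        }
        memmove(d, s, remaining);
    }
}
```

The final piece is copied without a trailing poll, so a copy that fits in one chunk behaves like a plain memmove. The right chunk size is a tuning question this sketch does not try to answer.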
move the whole thing to managed code.
This. We have prior art in CoreRT. I will give it a shot
Would be interesting to compare. In Mono we emit the @llvm.memmove intrinsic for Buffer.BlockCopy (and we ask llvm to place safepoints for us)
and we ask llvm to place safepoints for us
Do you have safepoints inside the loop that @llvm.memmove expands into when the length is not constant?
@jkotas just checked, and unfortunately we don't. LLVM is able to unroll it for small constants, but otherwise it converts it into a libc call, so the problem remains.
UPD: However, it seems LLVM is able to expand memmove into loops at the IR level (expandMemMoveAsLoop), after which -place-safepoints should be able to insert safepoint placeholders for us. Will check.
This. We have prior art in CoreRT. I will give it a shot
The memcpy routine that is distributed with MSVC (C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.24.28202\crt\src\x64\memcpy.asm) currently defines a "large block" as 128 bytes (for non-overlapping buffers) and uses prefetching and non-temporal stores for this scenario.
Is this something that CoreRT was handling?
That is handled in both CoreCLR and CoreRT. Both delegate to the CRT for blocks over a certain size (with a proper PInvoke frame that avoids the GC starvation problem).
Keeping this open to fix other places where we do large copies in cooperative mode.
This is fixed for all memory copy variants exposed by the framework now.
We have other similar problems for sure. I have opened dotnet/coreclr#27683 on one found via code review. Also, @adamsitnik is going to run an experiment to see whether Benchmark.NET can be used to find these types of issues.
I've performed the experiment and shared my results in https://github.com/dotnet/performance/issues/1049