Runtime: Buffer::BlockCopy may spend too long without GC polling

Created on 9 Oct 2019 · 9 Comments · Source: dotnet/runtime

Buffer::BlockCopy is an FCALL that delegates to the native memmove, which can run for a long time depending on how much data is being copied.

If the GC needs to synchronize with user threads at an inconvenient time, everything stops until the memmove is done: the GC waits for the memmove, and everything else waits for the GC.
There was an actual reported scenario in which GC pauses took up to a minute because of this
(dealing with very large streams, potentially swapped out, ...).

We should either "chunk" large copying into smaller pieces with intermittent GC polling, or just move the whole thing to managed code.
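
To make the "chunking" option concrete, here is a minimal sketch in C++. It is not the runtime's code: the name ChunkedMemmove, the 64 KB chunk size, and the poll callback (a stand-in for whatever GC poll the runtime would issue) are all assumptions made for illustration. The one subtle part is that overlapping buffers must be copied in the right direction, otherwise chunking breaks memmove semantics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>

// Hypothetical chunk size; the issue does not propose a specific value.
constexpr std::size_t kChunkBytes = 64 * 1024;

// Copies `len` bytes with memmove semantics, calling `poll` between chunks.
// `poll` is an assumed stand-in for a runtime GC poll, not a real runtime API.
inline void ChunkedMemmove(void* dst, const void* src, std::size_t len,
                           const std::function<void()>& poll,
                           std::size_t chunk = kChunkBytes)
{
    assert(len == 0 || (dst != nullptr && src != nullptr));
    if (chunk == 0) chunk = kChunkBytes;

    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    const auto da = reinterpret_cast<std::uintptr_t>(d);
    const auto sa = reinterpret_cast<std::uintptr_t>(s);

    if (da <= sa || da - sa >= len) {
        // No overlap, or dst is below src: copying front to back never
        // overwrites source bytes that have not been read yet.
        while (len > 0) {
            const std::size_t n = len < chunk ? len : chunk;
            std::memmove(d, s, n);
            d += n;
            s += n;
            len -= n;
            if (len > 0 && poll) poll();  // safe point between chunks
        }
    } else {
        // dst overlaps the tail of src: copy back to front instead.
        while (len > 0) {
            const std::size_t n = len < chunk ? len : chunk;
            len -= n;
            std::memmove(d + len, s + len, n);
            if (len > 0 && poll) poll();  // safe point between chunks
        }
    }
}
```

With this shape, the longest stretch without a poll is bounded by the time it takes to copy one chunk rather than the whole buffer.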

area-VM-coreclr

All 9 comments

move the whole thing to managed code.

This. We have prior art in CoreRT. I will give it a shot

It would be interesting to compare: in Mono we emit the @llvm.memmove intrinsic for Buffer.BlockCopy (and we ask LLVM to place safepoints for us).

and we ask LLVM to place safepoints for us

Do you have safepoints inside the loop that @llvm.memmove expands into when the length is not constant?

@jkotas I just checked, and unfortunately we don't. LLVM is able to unroll it for small constant lengths, but otherwise it converts it into a libc call, so the problem remains.

UPD: However, it seems LLVM can expand memmove into loops at the IR level (expandMemMoveAsLoop), after which -place-safepoints should be able to insert safepoint placeholders for us. I will check.

This. We have prior art in CoreRT. I will give it a shot

The memcpy routine that is distributed with MSVC (C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.24.28202\crt\src\x64\memcpy.asm) currently defines a "large block" as 128 bytes (for non-overlapping buffers) and uses prefetching and non-temporal stores for this scenario.

Is this something that CoreRT was handling?

Is this something that CoreRT was handling?

That is handled in both CoreCLR and CoreRT. Both delegate to the CRT for blocks over a certain size (with a proper PInvoke frame that avoids the GC starvation problem).
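
The pattern described here can be sketched roughly as follows, in C++. This is only a simplified model and not CoreCLR or CoreRT code: ThreadState, PreemptiveScope, CopyBlock, and the 2048-byte threshold are hypothetical names and values (the thread does not state the real threshold). The idea is that small copies stay inline, while large ones run inside a scope that marks the thread as preemptive, so a GC does not have to wait for the copy to finish. A real runtime would also have to keep the buffers from being moved during the copy and would wait for any in-progress GC before returning to cooperative mode; both are omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical model of a thread's GC mode; not a real runtime type.
struct ThreadState {
    bool preemptive = false;  // true while the GC may proceed without us
    int transitions = 0;      // number of switches to preemptive mode
};

// RAII guard that models the transition frame around a native call.
class PreemptiveScope {
public:
    explicit PreemptiveScope(ThreadState& thread) : thread_(thread) {
        thread_.preemptive = true;
        ++thread_.transitions;
    }
    ~PreemptiveScope() {
        assert(thread_.preemptive);
        thread_.preemptive = false;  // back to cooperative mode
    }
    PreemptiveScope(const PreemptiveScope&) = delete;
    PreemptiveScope& operator=(const PreemptiveScope&) = delete;

private:
    ThreadState& thread_;
};

// Hypothetical cut-over size; the real value is not given in the thread.
constexpr std::size_t kNativeCopyThreshold = 2048;

inline void CopyBlock(ThreadState& thread, void* dst, const void* src,
                      std::size_t len)
{
    if (len <= kNativeCopyThreshold) {
        // Small copy: short enough to finish without a mode switch.
        std::memmove(dst, src, len);
        return;
    }
    // Large copy: delegate to the CRT while marked as preemptive.
    PreemptiveScope scope(thread);
    std::memmove(dst, src, len);
}
```

The threshold exists because a mode transition has a fixed cost, which only pays off once the copy itself is long enough to matter for GC pause times.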

Keeping this open to fix other places where we do large copies in cooperative mode.

This is fixed for all memory copy variants exposed by the framework now.

We have other similar problems for sure. I have opened dotnet/coreclr#27683 on one found via code review. Also, @adamsitnik is going to run an experiment to see whether Benchmark.NET can be used to find these types of issues.

Also, @adamsitnik is going to run an experiment to see whether Benchmark.NET can be used to find these types of issues.

I've performed the experiment and shared my results in https://github.com/dotnet/performance/issues/1049
