Runtime: System.IO.Pipelines: The default MemoryPool is suboptimal

Created on 27 Oct 2018 · 16 comments · Source: dotnet/runtime

MemoryPool<byte>.Shared, which is backed by ArrayPool<T>.Shared (TlsOverPerCoreLockedStacksArrayPool<T>), is used as the default memory pool for pipes.

ArrayPool<T>.Shared only keeps 8 arrays per bucket size per core. Typical pipe usage involves memory spikes, so the default pool will quickly exhaust its buffers and start allocating.

ArrayPool<T>.Shared is optimal when the array is returned shortly after being rented by the same thread (example). It uses Thread Local Storage and per-core stacks. The PipeReader and PipeWriter will typically run concurrently on different threads; this is the worst case for ArrayPool<T>.Shared and will likely consume more CPU than a single global stack would.
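
For reference, the pool a pipe uses is configurable through PipeOptions; the experiments later in this thread presumably swap in alternative MemoryPool<byte> implementations this way. A minimal sketch (the Pipe/PipeOptions API is real; the custom pool is whatever implementation gets plugged in):

```c#
using System.Buffers;
using System.IO.Pipelines;

static class PoolWiring
{
    // The pool is the first PipeOptions constructor parameter; pass
    // MemoryPool<byte>.Shared (the default) or a custom implementation.
    public static Pipe CreatePipe(MemoryPool<byte> pool)
    {
        return new Pipe(new PipeOptions(pool: pool));
    }
}
```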

cc @davidfowl

area-System.IO.Pipelines

All 16 comments

Now that the default ArrayPool supports reclaiming stale buffers, based on experimentation we might want to consider upping the current limits.
cc: @JeremyKuhne

only keeps 8 arrays per bucket size per core. Typical pipe usage involves memory spikes, so the default pool will quickly exhaust its buffers and start allocating.

Just to be clear, though, all buffers are available to all cores; this just impacts how quickly a buffer can be found.

The PipeReader and PipeWriter will typically run concurrently on different threads; this is the worst case for ArrayPool<T>.Shared and will likely consume more CPU than a single global stack would.

If there's only one pipeline, sure, one thread producing (and renting arrays) and another consuming (and returning arrays) would make it more expensive, as the renter would end up fetching buffers from another queue. But if there are many pipelines, they likely involve asynchrony and are distributed fairly across threads, such that renting and returning would be, too.

Can you share the details/profiles where you've seen this policy to be consistently problematic?

I just finished going through PRs https://github.com/dotnet/coreclr/pull/8209 and https://github.com/dotnet/coreclr/pull/17078; I can see that a lot of thought and testing went into the current design.

I have created a benchmark here to reproduce the issue.

The benchmark reproduces a server workload sending medium-sized messages to a number of client connections. Instead of testing the maximum throughput, the test sets up a message rate and inspects GC stats & average CPU using different memory pools.

Here are the results with 64 connections sending 500 msg/s of size 60 KB. The GC stats are for a 3-second interval; the CPU % is summed across all cores, so values above 100% are normal.

Tested on an Intel i7-6820HQ, 4 cores (hyperthreaded, so 8 logical cores)
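
The exact harness is in the linked benchmark; the GC and CPU numbers below can be gathered with something along these lines (a sketch, not the benchmark's actual code):

```c#
using System;
using System.Diagnostics;
using System.Threading;

static class GcCpuSampler
{
    // Samples GC collection counts and process CPU over a fixed interval.
    public static void Sample(TimeSpan interval)
    {
        var proc = Process.GetCurrentProcess();
        int g0 = GC.CollectionCount(0), g1 = GC.CollectionCount(1), g2 = GC.CollectionCount(2);
        TimeSpan cpuBefore = proc.TotalProcessorTime;
        var sw = Stopwatch.StartNew();

        Thread.Sleep(interval);

        proc.Refresh();
        TimeSpan cpuAfter = proc.TotalProcessorTime;
        sw.Stop();

        Console.WriteLine($"Gen 0: {GC.CollectionCount(0) - g0}");
        Console.WriteLine($"Gen 1: {GC.CollectionCount(1) - g1}");
        Console.WriteLine($"Gen 2: {GC.CollectionCount(2) - g2}");
        // CPU time divided by wall time, summed across cores, hence values above 100%.
        Console.WriteLine($"CPU: {(cpuAfter - cpuBefore).TotalMilliseconds / sw.Elapsed.TotalMilliseconds:P0}");
    }
}
```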

MemoryPool.Shared

Gen 0: 303
Gen 1: 147
Gen 2: 66
CPU: 291%

The Gen2 GC count shows that the pool is often full/empty: a lot of buffers are created/discarded. CPU is high because the TLS is useless for this usage, and MemoryPool.Rent ends up iterating the per-core stacks since the pool is often empty.

ArrayMemoryPool (MemoryPool using ArrayPool.Create(1024 * 1024, 10000))

(I know 10000 maxArraysPerBucket is too high; I wanted to show the impact on GC.)

Gen 0: 93
Gen 1: 1
Gen 2: 0
CPU: 292%

The Gen2 GC count goes down with the ConfigurableArrayPool, but the CPU stays high, probably due to spinning on the contended lock.
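
For context, the framework's ArrayMemoryPool<T> type is internal, so the ArrayMemoryPool above is presumably a small MemoryPool<byte> wrapper over ArrayPool<byte>.Create(...), along these lines (a sketch; the type name and the default rent size are illustrative):

```c#
using System;
using System.Buffers;

// Sketch of a MemoryPool<byte> backed by a configurable ArrayPool.
sealed class ConfigurableArrayMemoryPool : MemoryPool<byte>
{
    private readonly ArrayPool<byte> _pool = ArrayPool<byte>.Create(
        maxArrayLength: 1024 * 1024, maxArraysPerBucket: 10_000);

    public override int MaxBufferSize => 1024 * 1024;

    public override IMemoryOwner<byte> Rent(int minBufferSize = -1) =>
        new Owner(_pool, _pool.Rent(minBufferSize == -1 ? 4096 : minBufferSize));

    protected override void Dispose(bool disposing) { }

    private sealed class Owner : IMemoryOwner<byte>
    {
        private readonly ArrayPool<byte> _pool;
        private byte[] _array;

        public Owner(ArrayPool<byte> pool, byte[] array) { _pool = pool; _array = array; }

        public Memory<byte> Memory => _array;

        public void Dispose()
        {
            byte[] array = _array;
            if (array != null) { _array = null; _pool.Return(array); }
        }
    }
}
```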

ConcurrentStackMemoryPool

Gen 0: 102
Gen 1: 1
Gen 2: 0
CPU: 45%

Like with ConfigurableArrayPool, Gen2 is down, but ConcurrentStack scales better than the locked array-based stack, so CPU is lower.
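
The core of a ConcurrentStack-based pool presumably looks roughly like this (illustrative sketch, not the benchmark's code), wrapped in a MemoryPool<byte> adapter like the one sketched above for use with pipes:

```c#
using System.Collections.Concurrent;

// A single global lock-free stack: no per-core stacks or TLS, but also no lock to
// spin on under contention. As noted in the next comment, each Push allocates a
// stack node, which the built-in ArrayPool deliberately avoids.
sealed class ConcurrentStackPool
{
    private readonly ConcurrentStack<byte[]> _buffers = new ConcurrentStack<byte[]>();
    private readonly int _bufferSize;

    public ConcurrentStackPool(int bufferSize) => _bufferSize = bufferSize;

    public byte[] Rent() =>
        _buffers.TryPop(out byte[] buffer) ? buffer : new byte[_bufferSize];

    public void Return(byte[] buffer) => _buffers.Push(buffer);
}
```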

Thanks. This is worth investigating. The ArrayMemoryPool/ConcurrentStackMemoryPool examples have effectively unlimited capacity, so they'll basically never throw arrays away. When I switch ArrayPool's capacity to match, in your repro the Gen0/1/2 counts drop down to match these, and the CPU percentage drops to ~150% instead of ~290%, but that's still 3x the 45% of the ConcurrentStackMemoryPool example (though much better than ArrayMemoryPool's 290%). Note that ConcurrentStackMemoryPool has its own issues, at least ones that would be impactful for ArrayPool, in that every return incurs an allocation, which we definitely wouldn't want for normal ArrayPool usage, and coming up with a wait-free, allocation-free, locality-aware, low-overhead pool is... hard (I've not thought of a good one yet).

@stephentoub
What about a global stack-ish queue?

  • Take the ConcurrentQueue data structure with its lock-free ring buffer segment
  • Keep a fixed segment size instead of exponential growth
  • Implement the segment linked list as a stack instead of a queue

Then the shrinking logic could simply drop whole segments as needed.

I think I'm on to something.

The idea is to create a stack-like data structure by relaxing the strict LIFO ordering. Buffers are allocated by segments (length 256, giving 1 MB segments for 4096-byte buffers). The segment implementation is based on a stripped-down version of the ConcurrentQueue segment.

Push and Pop are decoupled into their own segments. When the write segment is full, it is moved to the full segments stack (ConcurrentStack) and a new segment is fetched from the empty segments queue (ConcurrentQueue). When the read segment is empty, it is moved to the empty segments queue and a full segment is fetched from the full segments stack.

Thread Local Storage is used to keep a local read segment and a local write segment; they fill and consume the global empty/full segment lists.

With the current impl, all threads have their local segments, but the idea is to use the global read/write segment pair by default and promote "hot" threads to use local segments based on usage.

https://gist.github.com/JeffCyr/5b4d2953cb0467a7a209dc3a1acba3c0
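
For readers who don't want to open the gist, here is a structural sketch of the hand-off described above. It is simplified: the gist's segments are lock-free ring buffers derived from ConcurrentQueue's, whereas here each segment is a plain array owned by a single thread, so only the global full/empty lists need to be concurrent. Names are illustrative and the "drop whole segments" trimming is omitted.

```c#
using System.Collections.Concurrent;

sealed class StackishQueuePool
{
    private const int SegmentLength = 256;                        // buffers per segment

    private readonly int _bufferSize;
    private readonly ConcurrentStack<byte[][]> _fullSegments = new ConcurrentStack<byte[][]>();
    private readonly ConcurrentQueue<byte[][]> _emptySegments = new ConcurrentQueue<byte[][]>();

    // One pool instance is assumed; each thread keeps a private read and write segment.
    [ThreadStatic] private static Local t_local;

    private sealed class Local
    {
        public byte[][] WriteSegment = new byte[SegmentLength][];
        public int WriteCount;
        public byte[][] ReadSegment = new byte[SegmentLength][];
        public int ReadCount;
    }

    public StackishQueuePool(int bufferSize) => _bufferSize = bufferSize;

    public byte[] Rent()
    {
        Local local = t_local ?? (t_local = new Local());
        if (local.ReadCount == 0)
        {
            // Local read segment drained: try to grab a published full segment and
            // recycle the drained one to the empty queue.
            if (!_fullSegments.TryPop(out byte[][] full))
                return new byte[_bufferSize];                     // pool empty: allocate
            _emptySegments.Enqueue(local.ReadSegment);
            local.ReadSegment = full;
            local.ReadCount = SegmentLength;
        }
        byte[] buffer = local.ReadSegment[--local.ReadCount];
        local.ReadSegment[local.ReadCount] = null;                // don't keep the buffer alive here
        return buffer;
    }

    public void Return(byte[] buffer)
    {
        Local local = t_local ?? (t_local = new Local());
        if (local.WriteCount == SegmentLength)
        {
            // Local write segment full: publish it and fetch (or create) an empty one.
            _fullSegments.Push(local.WriteSegment);
            local.WriteSegment = _emptySegments.TryDequeue(out byte[][] empty)
                ? empty
                : new byte[SegmentLength][];
            local.WriteCount = 0;
        }
        local.WriteSegment[local.WriteCount++] = buffer;
    }
}
```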

The Pipe benchmark I used earlier was not stable enough, so I created a benchmark similar to yours to test the case where the same thread rents and releases the buffers (your gist was no longer available).

The test starts one thread per core; each rents 8 buffers and releases them in a loop.
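
Roughly the shape of that loop (a sketch; the thread count, 4096-byte buffer size, and iteration count are illustrative):

```c#
using System;
using System.Buffers;
using System.Threading;

static class RentReleaseLoop
{
    // Each thread repeatedly rents 8 buffers, then releases them.
    // (As pointed out later in the thread, a realistic benchmark should
    // also write/read the rented memory.)
    public static void Run(MemoryPool<byte> pool, int iterations)
    {
        var threads = new Thread[Environment.ProcessorCount];
        for (int t = 0; t < threads.Length; t++)
        {
            threads[t] = new Thread(() =>
            {
                var owners = new IMemoryOwner<byte>[8];
                for (int i = 0; i < iterations; i++)
                {
                    for (int j = 0; j < owners.Length; j++)
                        owners[j] = pool.Rent(4096);
                    for (int j = 0; j < owners.Length; j++)
                        owners[j].Dispose();                 // return to the pool
                }
            });
            threads[t].Start();
        }
        foreach (Thread thread in threads) thread.Join();
    }
}
```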

ArrayMemoryPool`1 - 00:00:03.2086832
Gen 0: 367
Gen 1: 1
Gen 2: 0
CPU: 694%
Buffer/s:  19 921 318
StackishQueueMemoryPool`1 - 00:00:01.5675948
Gen 0: 4
Gen 1: 1
Gen 2: 0
CPU: 702%
Buffer/s:  40 743 007

StackishQueue achieves twice the throughput.

I reimplemented the previous Pipe benchmark by using a bounded channel instead (Pipes were a bottleneck).

The test simulates 2000 connections, each trying to send 10000 buffers per second. The channels are bounded at 8 pending buffers each to stay fair with TlsOverPerCoreLockedStacksArrayPool's maximum of 8 buffers per per-core stack (a higher bound completely destroys its perf).
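
A sketch of one simulated connection under those constraints (names and the 4096-byte buffer size are illustrative; the real benchmark drives 2000 of these at a target rate):

```c#
using System.Buffers;
using System.Threading.Channels;
using System.Threading.Tasks;

static class BoundedChannelConnection
{
    // Producer rents and sends buffers; consumer receives and returns them.
    // The channel is bounded at 8 in-flight buffers, so the producer cannot
    // run arbitrarily far ahead of the consumer.
    public static async Task RunConnectionAsync(MemoryPool<byte> pool, int messages)
    {
        var channel = Channel.CreateBounded<IMemoryOwner<byte>>(new BoundedChannelOptions(8));

        Task producer = Task.Run(async () =>
        {
            for (int i = 0; i < messages; i++)
            {
                IMemoryOwner<byte> buffer = pool.Rent(4096);
                buffer.Memory.Span.Fill(0xFF);                   // actually touch the memory
                await channel.Writer.WriteAsync(buffer);
            }
            channel.Writer.Complete();
        });

        Task consumer = Task.Run(async () =>
        {
            while (await channel.Reader.WaitToReadAsync())
                while (channel.Reader.TryRead(out IMemoryOwner<byte> buffer))
                    buffer.Dispose();                            // return to the pool
        });

        await Task.WhenAll(producer, consumer);
    }
}
```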

ArrayMemoryPool

Gen 0: 135
Gen 1: 19
Gen 2: 1
CPU: 618%
Msg/s:  9 805 115
Thread count: 26

StackishQueueMemoryPool

Gen 0: 7
Gen 1: 2
Gen 2: 1
CPU: 748%
Msg/s:  12 483 334
Thread count: 32

Memory allocator benchmarks have to actually read/write the memory like the real workload does to get an accurate picture. The caches and processor affinity matter more than anything else.

This was a mistake made in the first version of ArrayPool. The microbenchmarks were producing awesome results because they did not actually touch the memory like real workloads do. The actual performance in real workloads that touched the memory was often worse than just allocating the memory using new.

@jkotas Good point, writing to the buffer does negatively affect the perf of the stackish queue compared to the stacks of TlsOverPerCoreArrayPool. It must be cache locality, but since rented buffers are just overwritten without being read first, I didn't think cache locality would matter. Do you have an explanation why?

I do not know. The implementation you have shared above is pretty complex. It is hard to predict all the interactions of its different parts by just looking at the code. I think you would need to run this under a profiler that can do cache profiling, like Intel VTune.

@jkotas @stephentoub
I have an explanation for the different performance behavior. It's because of the CPU write-back cache.

Cache locality does not matter for a buffer pool since you don't read the buffer after Rent. However, by reusing the same memory, you can rent/return/rent/return faster than the max memory bandwidth, because not all of the memory writes make it to RAM; some are overwritten in the CPU cache before getting flushed to RAM.

If you don't reuse the same memory blocks, writing to a memory location fills up the CPU caches, and the previous blocks need to be flushed to RAM (or the next cache level) before new data can be written; this adds latency.

By reducing the segment size from 256 to 32 in my previous data structure, the performance is now equivalent to TlsOverPerCoreLockedStacksArrayPool for the 8x Rent/Release tight loop benchmark.

It's good to keep this in mind, but to put things in perspective, the micro benchmark is writing and reading 40 GB/s on my machine. This does not represent a typical .NET application, so optimizing for the L1 cache in a micro benchmark might not pay off in a real-world application where the buffers will be long gone from the L1 cache by the time they are reused.

For the networking Pipeline scenario, the buffers will be rented from a ThreadPool Worker thread and returned by a ThreadPool IO thread (or vice-versa). This will likely be on another core and the L1/L2 caches won't matter anyway.

For the networking Pipeline scenario, the buffers will be rented from a ThreadPool Worker thread and returned by a ThreadPool IO thread (or vice-versa). This will likely be on another core and the L1/L2 caches won't matter anyway.

I think this is not the case. As far as I remember, completion port threads are created per core and the continuation should run on the same core as the initial request. I'm not sure about Linux, though.

Here is a sample showing it's more often not on the same core (.NET Core 2.1 on Windows)
```c#
using System;
using System.Threading;
using System.Threading.Tasks;

static class Program
{
    public static unsafe void Main()
    {
        const int Iteration = 10000;
        int counter = 0;

        var done = new SemaphoreSlim(0);

        for (int i = 0; i < Iteration; i++)
        {
            Task.Run(() =>
            {
                // Core the thread-pool worker is running on when it queues the IO.
                int workerCoreId = Thread.GetCurrentProcessorId();
                var overlapped = new Overlapped();
                var nativeOverlapped = overlapped.Pack((code, bytes, overlap) =>
                {
                    // Compare against the core running the IOCP completion callback.
                    if (Thread.GetCurrentProcessorId() != workerCoreId)
                        Interlocked.Increment(ref counter);

                    Overlapped.Free(overlap);
                    done.Release();
                }, null);

                ThreadPool.UnsafeQueueNativeOverlapped(nativeOverlapped);
            });
        }

        for (int i = 0; i < Iteration; i++)
            done.Wait();

        Console.WriteLine($"IOCP callback on different core: {counter}/{Iteration}");
    }
}
```

IOCP callback on different core: 9008/10000

For the Pipe scenario with inbound data, pipe writing (and buffer renting) will occur on the IO thread pool in the Socket.ReceiveAsync callback, while pipe reads (and buffer release) will occur on the worker thread pool (with the default ReaderScheduler = PipeScheduler.ThreadPool). Reads will be scheduled in the global worker queue, and the core that executes the work item will be random.
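
For reference, the scheduling policy mentioned here is also configurable through PipeOptions; a sketch (the parameter names follow the PipeOptions constructor, the inlineReads flag is illustrative):

```c#
using System.Buffers;
using System.IO.Pipelines;

static class SchedulerWiring
{
    // With the default PipeScheduler.ThreadPool, read continuations are dispatched to the
    // global thread-pool queue, so the renting and returning cores are effectively random.
    // PipeScheduler.Inline runs the read continuation on the thread that completed the
    // write, which changes which core returns the buffers (with its own trade-offs).
    public static Pipe CreatePipe(MemoryPool<byte> pool, bool inlineReads)
    {
        return new Pipe(new PipeOptions(
            pool: pool,
            readerScheduler: inlineReads ? PipeScheduler.Inline : PipeScheduler.ThreadPool));
    }
}
```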

@stephentoub
What about a global stack-ish queue?

  • Take the ConcurrentQueue data structure with its lock-free ring buffer segment
  • Keep a fixed segment size instead of exponential growth
  • Implement the segment linked list as a stack instead of a queue

Then the shrinking logic could simply drop whole segments as needed.

@stephentoub

I got the idea of a BufferedSegment, which drops segments as needed; I've put the code here.

Is this what you mean?

https://gist.github.com/hack2root/20021e2ede4ea29bb8f5c77d906e3463

I think I've finally hit a scenario that shows how badly this behaves. I'm still gathering more data, but it's looking like the shared array pool is a poor default for pipelines in some cases.

Assigning this to myself to take a deeper look.
