Go: runtime: make the page allocator scale

Created on 23 Oct 2019 · 42 comments · Source: golang/go

Over the course of the last year, we've seen several cases where making relatively minor changes to the allocator slow path, which allocates pages from the heap, caused serious performance issues (#32828, #31678). The problem stemmed largely from contention on the heap lock: a central m-based spinlock (so we can't schedule another g while it's waiting!) which guards nearly all operations on the page heap. Since Go 1.11 (and likely earlier), we've observed barging behavior on this lock in applications that frequently allocate larger objects (~1K), which indicates a collapsing lock. Furthermore, these applications tend to stay the same or worsen in performance as the number of available cores on the machine increases. This represents a fundamental scalability bottleneck in the runtime.

Currently the page allocator is based on a treap (and formerly had a small linked-list-based cache for smaller free spans), but I propose we rethink it and rewrite it as something that:

  1. Is more cache-friendly, branch-predictor friendly, etc. on the whole.
  2. Has a lockless fast path.

The former just makes the page allocator faster and less likely to fall into bad paths in the microarchitecture, while the latter directly reduces contention on the lock.

While increased performance across the board is what we want, what we're most interested in solving here is the scalability bottleneck: when we increase the number of cores available to a Go application, we want to see a performance improvement.
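The lock-free fast path eventually took the shape of a per-P page cache (see CL 195701 below). The following is only a rough sketch of that idea, not the runtime's actual implementation: the names (`pageCache`, `refill`, `heapLock`) and the fixed 64-page chunk layout are assumptions made for illustration. Each P owns a small aligned chunk of pages tracked by a single uint64 bitmap, so allocating a page from it needs no lock; only refilling an empty cache takes the heap lock.

```go
package main

import (
	"fmt"
	"math/bits"
	"sync"
)

const pageCachePages = 64 // one uint64 bitmap's worth of pages

// pageCache is a hypothetical per-P cache: a 64-page chunk whose free
// pages are tracked in a single uint64 bitmap. Because each P owns its
// cache exclusively, alloc needs no synchronization.
type pageCache struct {
	base  uintptr // base address of the 64-page chunk
	cache uint64  // bitmap of free pages; bit i set means page i is free
}

// alloc returns the address of one free page, or 0 if the cache is empty.
func (c *pageCache) alloc(pageSize uintptr) uintptr {
	if c.cache == 0 {
		return 0 // empty: caller must refill on the locked slow path
	}
	i := uint(bits.TrailingZeros64(c.cache))
	c.cache &^= 1 << i // clear the bit: page i is now allocated
	return c.base + uintptr(i)*pageSize
}

// heapLock stands in for the central heap lock; only the slow path takes it.
var heapLock sync.Mutex

// refill simulates the slow path: grab a whole 64-page chunk at once
// ("land grab") under the lock, amortizing the lock acquisition over
// up to 64 subsequent lock-free allocations.
func refill(next *uintptr, pageSize uintptr) pageCache {
	heapLock.Lock()
	defer heapLock.Unlock()
	base := *next
	*next += pageCachePages * pageSize
	return pageCache{base: base, cache: ^uint64(0)} // all 64 pages free
}

func main() {
	const pageSize = 8192
	var next uintptr = 0x1000000
	c := refill(&next, pageSize)
	a := c.alloc(pageSize)
	b := c.alloc(pageSize)
	fmt.Printf("%#x %#x\n", a, b) // 0x1000000 0x1002000
}
```

The point of the design is that the common case (allocating a page or two) never touches the heap lock at all; contention is paid once per 64 pages rather than once per allocation.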

Edit: Design doc

I've already built both an out-of-tree and in-tree prototype of a solution which is based on a bitmap representing the whole heap and a radix tree over that bitmap. The key idea with bitmaps here is that we may "land grab" several pages at once and cache them in a P. Two RPC benchmarks, a Google-internal one and one based on Tile38, show up to a 30% increase in throughput and up to a 40% drop in tail latencies (90th, 99th percentile) on a 48-core machine, with the benefit increasing with the number of cores available to the benchmark.
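To make the bitmap idea concrete, here is a sketch of the core operation: finding a run of n contiguous free pages in a free-page bitmap. The function name `findRun` and the single-word scope are invented for illustration; the real allocator searches across words and uses its radix-tree summaries to skip whole chunks that cannot contain a long enough run. The sketch uses the standard doubling trick: repeatedly AND the bitmap with shifted copies of itself so that, at the end, bit i survives only if bits i..i+n-1 were all set.

```go
package main

import (
	"fmt"
	"math/bits"
)

// findRun searches a free-page bitmap (bit i set = page i free) for n
// contiguous free pages within a single uint64 word. It returns the bit
// index where the run starts, or -1 if no such run exists.
func findRun(free uint64, n int) int {
	run := free
	shift := 1
	for shift < n {
		// Shift by min(shift, n-shift) so the accumulated run length
		// grows to exactly n, doubling where possible.
		s := shift
		if n-shift < s {
			s = n - shift
		}
		run &= run >> uint(s)
		shift += s
	}
	if run == 0 {
		return -1
	}
	return bits.TrailingZeros64(run)
}

func main() {
	// Pages 0-2 allocated, 3-8 free, 9 allocated, 10-63 free.
	free := ^uint64(0) &^ uint64(0b1000000111)
	fmt.Println(findRun(free, 4))  // 3: a 4-page run starts at page 3
	fmt.Println(findRun(free, 60)) // -1: no 60-page run fits
}
```

Each bitmap word is scanned with a handful of shifts and ANDs rather than a pointer-chasing treap walk, which is what makes the structure cache- and branch-predictor-friendly.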

All 42 comments

Change https://golang.org/cl/200439 mentions this issue: runtime: place lower limit on trigger ratio

Change https://golang.org/cl/195698 mentions this issue: runtime: count scavenged bits for new allocation for new page allocator

Change https://golang.org/cl/190621 mentions this issue: runtime: add packed bitmap summaries

Change https://golang.org/cl/196640 mentions this issue: runtime: make more page sweeper operations atomic

Change https://golang.org/cl/195697 mentions this issue: runtime: add scavenging code for new page allocator

Change https://golang.org/cl/196643 mentions this issue: runtime: add page cache and tests

Change https://golang.org/cl/190622 mentions this issue: runtime: add new page allocator core

Change https://golang.org/cl/190619 mentions this issue: runtime: add new page allocator constants and description

Change https://golang.org/cl/196642 mentions this issue: runtime: add per-p mspan cache

Change https://golang.org/cl/201763 mentions this issue: runtime: make the scavenger self-paced

Change https://golang.org/cl/190620 mentions this issue: runtime: add mallocbits and tests

Change https://golang.org/cl/196639 mentions this issue: runtime: remove unnecessary large parameter to mheap_.alloc

Change https://golang.org/cl/195701 mentions this issue: runtime: add per-p page allocation cache

Change https://golang.org/cl/201764 mentions this issue: runtime: integrate new page allocator into runtime

Change https://golang.org/cl/195700 mentions this issue: runtime: remove old page allocator

Change https://golang.org/cl/195699 mentions this issue: runtime: add option to scavenge with lock held throughout

Change https://golang.org/cl/201765 mentions this issue: runtime: switch to new page allocator

Change https://golang.org/cl/196638 mentions this issue: runtime: ensure heap memstats are updated atomically

Change https://golang.org/cl/196641 mentions this issue: runtime: refactor mheap_.alloc* into allocSpan

Gopherbot isn't updating the issue with this, but the design doc is at https://go-review.googlesource.com/c/proposal/+/202857. I'll update the first post in the issue once it lands.

CC: @aclements @randall77 @cherrymui

Change https://golang.org/cl/203318 mentions this issue: runtime: fix (*gcSweepBuf).block guarantees

Change https://golang.org/cl/203858 mentions this issue: runtime: compute whether a span needs zeroing in the new page allocator

Change https://golang.org/cl/203859 mentions this issue: runtime: make allocNeedsZero lock-free

https://github.com/golang/go/issues/31222 may be related.
Could you test the new allocator with a reproducer from the ticket above?

@un000 It's possible but unlikely this will help.

Let's assume that the problem in that issue is fully that a long allocation blocks STW for 43 ms. Firstly, 100 MB allocations don't really get any faster with this proposal. In that issue, asking for 100 MB is definitely going to go to the OS and map pages in. No matter what, we're bound by how quickly we can get 100 MB worth of committed address space from the OS.

@un000 Well, it helps maxWait at least, bringing it down to 1.0s from 1.3s on my machine, so perhaps I spoke too soon.

But I'm hitting stalls trying to copy the traces to verify whether there's an improvement in the STW delay. I'll try again later and let you know what I find.

@un000 Yeah OK I was able to inspect the trace files and the delay in that large allocation basically went down from 1.3s to 1.0s, consistent with the reported maxWait value. I don't think this proposal helps much.

I also misread the size of the allocation you were making. It's not 100 MB, it's 100e6 * 8 * 4 bytes, which is 3.2 GB. That's a whole order of magnitude more memory. I'm not sure how much we'll be able to help with allocations that large, which basically just go straight to the OS (unless you've made such a large allocation already before).

Thank you, Michael! ~25% performance is awesome too.

Change https://golang.org/cl/205277 mentions this issue: runtime: make sysReserve return page-aligned memory on js-wasm

Change https://golang.org/cl/205758 mentions this issue: runtime: define darwin/arm64's address space as 33 bits

Change https://golang.org/cl/205757 mentions this issue: runtime: remove MAP_FIXED in sysReserve for raceenabled on darwin

Change https://golang.org/cl/205759 mentions this issue: runtime: map reserved memory as NORESERVE on solaris

Change https://golang.org/cl/205877 mentions this issue: cmd/go: add math/bits to runtime packages in TestNewReleaseRebuildsStalePackagesInGOPATH

Change https://golang.org/cl/205937 mentions this issue: runtime: define maximum supported physical page and huge page sizes

Change https://golang.org/cl/205938 mentions this issue: runtime: make the test addresses for pageAlloc smaller on 32-bit

Change https://golang.org/cl/206199 mentions this issue: runtime: copy some functions from math/bits to runtime/internal/sys

Change https://golang.org/cl/206277 mentions this issue: runtime: fix min/max logic in findScavengeCandidate

Change https://golang.org/cl/206200 mentions this issue: cmd/compile: intrinsify functions added to runtime/internal/sys

Change https://golang.org/cl/206978 mentions this issue: runtime: check summary before scavenging in fast path

@mknyszek Anything else to do for this issue?

Nope! Closing.

Change https://golang.org/cl/219121 mentions this issue: design/35112-scaling-the-page-allocator.md: update proposal
