Consider the following benchmark:
func Benchmark(b *testing.B) {
b.ReportAllocs()
var a, z [1000]*flate.Writer
p := sync.Pool{New: func() interface{} { return &flate.Writer{} }}
for i := 0; i < b.N; i++ {
for j := 0; j < len(a); j++ {
a[j] = p.Get().(*flate.Writer)
}
for j := 0; j < len(a); j++ {
p.Put(a[j])
}
a = z
runtime.GC()
}
}
This currently reports: Benchmark-8 10 263962095 ns/op 663587733 B/op 1017 allocs/op
where I find it surprising that there is around 1000 allocations again every cycle. This seems to indicate that the Pool
is clearing its entire contents upon every GC. A peek at the implementation seems to indicate that this is so.
If the workload is such that it is cycling through a high number of items, then it makes sense to actually not clear everything. A simple heuristic to accomplish this is to keep track of the number of calls to Pool.Get
since the last attempt clearing the pool and ensure that we retain up to that number of items.
\cc @LMMilewski @aclements
When proposing any changes here please consider the discussion at https://groups.google.com/forum/#!msg/golang-dev/kJ_R6vYVYHU/LjoGriFTYxMJ .
I'd like to see that too, a lot of times I use a ghetto chan xxx
as a memory pool just to get around that.
I've also found the sync.Pool clearing behavior to be less than satisfactory. Here's a concrete example taken from a real server.
We have a performance-critical HTTP server. Reducing allocations has been a big part of getting throughput to be consistently high and latency to be consistently low. There are a handful of objects that we cache and reuse between requests as part of the allocation strategy. This server tends to have a ~5 GB heap and runs a GC every ~4 seconds.
One kind of reused object is a certain buffer that we use at the rate of ~10000 / sec. For this use case, a sync.Pool works really well; we get a 99.9% cache hit rate since we reuse buffers tens of thousands of times between GCs.
Another kind of reused object is a gzip.Writer that we use at the rate of ~500 / sec. For this use case, a sync.Pool did not work well for us: the combination of GC frequency and relatively lower usage rate meant that the hit rate when we used a sync.Pool was as low as 50%. Fresh gzip.Writers are extremely expensive to create, so we needed to do better. Now our code uses a buffered channel of *gzip.Writers that it can reuse, but it still uses a sync.Pool as a backup once the channel is drained so that if we suddenly need more writers than the channel size we don't have to keep allocating all of the extras.
Ideally a sync.Pool would be suitable for both of these scenarios.
Finally, the tight coupling between GC frequency and pool efficiency also has weird implications for reasoning about server performance. For instance, if totally unrelated code in our server changes the GC frequency (say, by introducing more allocations or, more perversely, decreasing the overall heap size), the pool hit rates get worse and thus the cost of allocations for these objects increases, even though nothing about the usage or cache-ability of those objects changed.
@ianlancetaylor. Much of the discussion seems to be around the question of "when to drain the pool?" and not much around "how much of the pool to drain?". In the thread, Ingo Oeser asked "Will it be fully or partially drained?", but that question does not seem to have been addressed.
Using the GC seems like a great time to trigger drainages, but the scenario that @cespare describes was the failure mode I imagined could happen. I think there's a worthwhile discussion to be had about "how much to drain".
I'll note my real world experience as well. We use go on embedded systems and we often have soft real-time constraints (aka, low latency processing). In a specific software we were working on, we tried using sync.Pool to reuse allocations for handling packets coming from a high freq UDP server. Unfortunately, whenever the GC was triggered, the full pool was exhausted and this caused a peak of high latency (due to the allocations) that went beyond the soft real-time constraints we had. We had to discard sync.Pool and use a hand-rolled memory pool implementation.
@dsnet It sounds like your proposal is:
count
to sync.Pool
Get
method, atomically increment the count
field.poolCleanup
count
elements distributed across the p'sallPools
count
to 0
Does that sound right? That seems reasonable to me. It basically delays the release of pool elements for one GC cycle.
It won't help @cespare 's case. It may help @rasky 's case. Of course, sync.Pool
can't be a general mechanism that addresses all memory cache needs.
The algorithm to determine how much to drain has some parallels with cache eviction algorithms,
for which most rely on the principal that the past is a good predictor of the future.
In the same way, there are several pieces of information that can be used as a heuristic to develop
a policy for how much of the pool to drain. Ultimately, it comes down to a trade-off between CPU time vs. memory space.
There are two extremes:
My assertion is that neither of these extremes are ideal and well-suited for most use-cases.
The algorithm I proposed (which you correctly summarize) is a simple algorithm that tries to find a better middle-ground between those two extremes. It certainly not a perfect algorithm as it completely ignores Pool.Get statistics from farther than one GC ago. Thus, it suffers when the workload on the server is highly non-uniform. An extension to the algorithm may perform a weighted average over N GC cycles. As long it can be proven that the pool is eventually drained given no Pool.Get traffic, then we know it doesn't leak memory.
More complicated policies can take into account information from the GC like what proportion of the heap the pool is occupying.
However, those algorithms can get very complicated quickly and are much harder to prove that they are correct.
In terms of whether it helps @cespare's or @rasky's use-cases, I believe my simple algorithm helps both.
The degree to how much it helps really depends on the exact work-load.
For @cespare's example with the gzip.Writer Pool, there were ~2000 calls to Pool.Get between each GC (assuming ~4s between GC and ~500/s Pool.Get calls), this means that when the GC occurs, it will preserve at least up to 2000 items in the Pool when drain is called.
Of course, there are probably far less than 2000 items in the pool since some are in active use and the computation of 2000 Pool.Get calls doesn't track how many intervening Pool.Put calls were made. The important property is that it is a useful heuristic of the demand for allocations from that Pool and should (hopefully) increase the hit-rate from 50% to something much higher.
A simpler heuristic would be to only drain some fraction of the pool at each GC.
Doing both would be good, too. It would be less likely to take too much or leave too little.
I believe that @dsnet's suggestion would work for my use case. The gzip.Writer pool that I described never gets very large (I don't have an exact number, but it's always <100 elements) so it seems like @dsnet's proposal would result in the pool never being cleared.
I don't quite see why using the number of Gets as the limit (without regard to Puts) is a good heuristic, though. For instance, in the scenario I described, if the pool grew abnormally large during one GC cycle, it would be trimmed down to 2000 elements and then stay at 2000, even though there are normally only <100 live elements.
ISTM that a better heuristic would be to use the max live set size during the cycle as the limit. This is somewhat more bookkeeping, though. (The straightforward way I can think of doing it would be atomic increment/decrement to track the size plus a CAS loop to track the max size.)
I was hoping using Gets
alone would be a good approximation of "max live set", but your example provides an case where that approximation falls short.
ISTM that a better heuristic would be to use the max live set size during the cycle as the limit. This is somewhat more bookkeeping, though. (The straightforward way I can think of doing it would be atomic increment/decrement to track the size plus a CAS loop to track the max size.)
This could probably be simplified to just a atomic store. It's okay if it is slightly racy.
I think only counting the number of Gets can lead to bad behavior. For example, a sync.Pool could have large number of objects in it, but the only access right now is alternating between Get
and Put
rapidly. In this case, the Pool only needs to retain one object, but the count of Get
calls is a function of the call rate and GC frequency, neither of which seem like they should matter.
Along the lines of what @cespare was suggesting, I think what we want to track is actually a low watermark on the occupancy of the Pool between GCs. Then we would free that many objects from the Pool, since that's how many we didn't need. This is more like the "min dead set" than the "max live set". :) I think tracking how many to free rather than how many to retain is simpler because some of the "live" objects that logically belong to the Pool may not actually be in the Pool when we clean it up.
I'm not sure if we want to actually track this count and free that many objects or create a two-tiered victim cache where:
Get
draws from tier 1 unless it's empty, in which case it draws from tier 2.Put
always puts to tier 1.The victim cache approach keeps cleanup O(1), whereas I think the counting approach would have to be O(n). OTOH, the counting approach is more flexible. For example, we could smooth the count across multiple GC cycles to avoid transient problems. I'm not sure this flexibility is really necessary, though.
Either way, the hot path currently doesn't use any atomic operations or cause any cache contention, and it would be nice to keep it that way. That suggests we should shard whatever the solution is just like the rest of the Pool is sharded.
See #23199 of an example where this optimization can cause indefinite pinning of large buffers. I argue the usage pattern described in #23199 is actually incorrect. In some recent work, I used an array of sync.Pools
to bucketize each item by capacity so that each Pool
contained elements of the approximate same memory cost.
FWIW, I took a stab at implementing the victim cache approach and got completely tangled up in the carefully constructed races surrounding poolCleanup
. I'll give it another go after the holidays, and will probably try to fix the bad (O(n)) STW behavior of poolCleanup
while I'm at it.
@aclements are you planning to tackle this in 1.11? I'm hitting this as well, and it's quite hard to write a correct lockless pool implementation.
I'm hoping to tackle this for 1.11, but we'll see what happens. As you point out, it's not an easy problem. :) I think I have to largely rewrite sync.Pool
.
Should we remove the NeedsDecision label? It seems that we're in agreement that the default behavior is not desired.
As far as I know we don't actually know what we want to do, so I think this does still need a decision.
This seems like a place where some simulations would provide a good idea of what sort of improvements are worth pursuing.
@jimmyfrasche yes, good idea. I did just that: https://github.com/cespare/misc/blob/master/drainalgo/drainalgo.go.
Doing this exercise made me realize that in this discussion we (or at least I) have been discounting an important, observable characteristic of the current implementation: the per-P private item.
By "observable" I mean what a user can see in terms of the pool hit/miss rate.
The sync.Pool implementation is quite sophisticated, but in terms of this user-observable behavior, it can be thought of as consisting of:
private
) per P.All of these are cleared on each GC.
(Note: The per-P item is an important fast-path optimization in the high-contention case; my analysis below is concerned with the cases where the hit rate is low -- generally low-contention situations.)
In my test program, I implemented several different pool strategies:
sync.Pool
: sync.Pool.simple
: a simplified simulation of sync.Pool which has a single shared pool that's cleared on each GC, but no per-P private item.maxlive
: the "max live set" strategy I described in https://github.com/golang/go/issues/22950#issuecomment-348653747.mindead
: the "min dead set" strategy Austin described in https://github.com/golang/go/issues/22950#issuecomment-352935997.We'll consider a really bad scenario for the current implementation:
On my 4-core laptop, this does poorly:
4-cores$ ./drainalgo -getinterval 10ms -gcinterval 100ms -pooltype sync.Pool
...
60s: 5706 gets (95.1/sec) / 66.70% hit rate / 3.01 avg pool size / 599 GCs (9.983/sec)
But on a 72-core server, it is much worse, because of the per-P caches:
72-cores$ ./drainalgo -getinterval 10ms -gcinterval 100ms -pooltype sync.Pool
...
60s: 5863 gets (97.7/sec) / 12.81% hit rate / 8.39 avg pool size / 599 GCs (9.983/sec)
The simplified simulation without the per-P items does better in this case, and (as expected) there's no difference between 4 cores and 72 cores:
4-cores$ ./drainalgo -getinterval 10ms -gcinterval 100ms -pooltype simple
...
60s: 5797 gets (96.6/sec) / 85.01% hit rate / 1.45 avg pool size / 599 GCs (9.983/sec)
72-cores$ ./drainalgo -getinterval 10ms -gcinterval 100ms -pooltype simple
...
60s: 5862 gets (97.7/sec) / 84.68% hit rate / 1.49 avg pool size / 599 GCs (9.983/sec)
The maxlive and mindead algorithms perform much better yet, yielding 97% hit rates even in this extreme scenario while only keeping a tiny number of items in the pool:
4-cores$ ./drainalgo -getinterval 10ms -gcinterval 100ms -pooltype maxlive
...
60s: 5803 gets (96.7/sec) / 97.29% hit rate / 1.76 avg pool size / 599 GCs (9.983/sec)
4-cores$ ./drainalgo -getinterval 10ms -gcinterval 100ms -pooltype mindead
...
60s: 5796 gets (96.6/sec) / 97.19% hit rate / 1.72 avg pool size / 599 GCs (9.983/sec)
In all the scenarios I tested, maxlive and mindead gave similar hit rates and kept a similar number of items in the pool. I think this makes sense because, intuitively, they're equivalent strategies.
My conclusion is that the low hit rate in certain scenarios with the current sync.Pool implementation is partially explained by the clear-the-pool-each-GC strategy that it employs and is also partially explained by the existence of the per-P private single-item cache, and that the latter cause has a more pronounced effect with more CPU cores.
Therefore, if we implement a better high-level strategy, such as mindead, but we keep the per-P items, it may only amount to a partial fix for these scenarios.
That could still be an overall improvement so it may be worth doing (and, separately, the O(n) cleanup would be great to fix).
This might be too complex, but perhaps it would be possible to have two modes of operation (a la sync.Mutex). It could start with a simpler implementation suitable for low-contention cases and switch to an optimized implementation with per-P items when it detects high contention.
It could start with a simpler implementation suitable for low-contention cases and switch to an optimized implementation with per-P items when it detects high contention.
If the per-P cache is significant but contention is low, I think it's worth reconsidering whether that use-case is one that sync.Pool
should address at all.
I generally assume that sync.Pool
is only for cases that are known to be high-contention, since the remaining cases are already well handled by other means (e.g. a mutex-guarded stack) for which it is much easier to reason about the interactions.
The other thing you potentially lose if you drop per-P caches is L2 cache locality, if the pool is accessed frequently enough to reside in L2 cache. (Moving an object from one P to another potentially invalidates all of the cache lines containing that object, at a cost of ~40ns per line depending on the CPU and access pattern.)
Change https://golang.org/cl/100036 mentions this issue: sync: Use a single global pool instead of per-P shared ones
Is there any update on this? It looked very promising.
I wonder if sync.Pool can be leaved as is, and separate type be defined that is not cleared on every GC ?
Maybe we should add a public "Size" field in sync.Pool. During the GC we keep "Size" elements in it. I do not think this can break any actual implementation and it provides a way to reduce allocations.
Change https://golang.org/cl/162919 mentions this issue: sync: do not clear sync.Pools completely every GC
I dont like idea that now i have correct go program, and after next release it become incorrect, because i put in pool things with different memory size
I dont like idea that now i have correct go program, and after next release it become incorrect, because i put in pool things with different memory size
That has always been true. It's not well documented today, but having unbounded sized items in a sync.Pool
is problematic. See #23199.
I dont like idea that now i have correct go program, and after next release it become incorrect, because i put in pool things with different memory size
I'm not sure I follow. If you are already using the Pool like that, how should my CL make it worse? Can you provide a benchmark/test case that shows the problem?
I dont have anything against your CL nor any benchmark, just dont like idea that pool is not cleaned completely at GC.
But as dsnet commented and pointed to #23199, it is already bad idea to put things with different memory sizes in pool.
I still think that leaving current sync.Pool as is, and adding new pool type that is not cleaned completely, is good idea.
Change https://golang.org/cl/166957 mentions this issue: sync: internal fixed size lock-free queue for sync.Pool
Change https://golang.org/cl/166958 mentions this issue: sync: internal dynamically sized lock-free queue for sync.Pool
Change https://golang.org/cl/166961 mentions this issue: sync: smooth out Pool behavior over GC with a victim cache
Change https://golang.org/cl/166959 mentions this issue: sync: add Pool benchmarks to stress STW and reuse
Change https://golang.org/cl/166960 mentions this issue: sync: use lock-free structure for Pool stealing
Most helpful comment
I think only counting the number of Gets can lead to bad behavior. For example, a sync.Pool could have large number of objects in it, but the only access right now is alternating between
Get
andPut
rapidly. In this case, the Pool only needs to retain one object, but the count ofGet
calls is a function of the call rate and GC frequency, neither of which seem like they should matter.Along the lines of what @cespare was suggesting, I think what we want to track is actually a low watermark on the occupancy of the Pool between GCs. Then we would free that many objects from the Pool, since that's how many we didn't need. This is more like the "min dead set" than the "max live set". :) I think tracking how many to free rather than how many to retain is simpler because some of the "live" objects that logically belong to the Pool may not actually be in the Pool when we clean it up.
I'm not sure if we want to actually track this count and free that many objects or create a two-tiered victim cache where:
Get
draws from tier 1 unless it's empty, in which case it draws from tier 2.Put
always puts to tier 1.The victim cache approach keeps cleanup O(1), whereas I think the counting approach would have to be O(n). OTOH, the counting approach is more flexible. For example, we could smooth the count across multiple GC cycles to avoid transient problems. I'm not sure this flexibility is really necessary, though.
Either way, the hot path currently doesn't use any atomic operations or cause any cache contention, and it would be nice to keep it that way. That suggests we should shard whatever the solution is just like the rest of the Pool is sharded.