The GC will use as much CPU as is available. If your program is basically
idle, which it appears to be, the GC will use the idle CPU and CPU load
will naturally go up. If your application is active and load is already
high then the GC will limit its load to 25% of GOMAXPROCS. The mark and
scan phase is concurrent, it is unclear how it is adversely affecting your
idle application.

On Mon, Mar 14, 2016 at 1:54 AM, Dieter Plaetinck [email protected]
wrote:

Hello,
while running the program at
https://github.com/raintank/raintank-metric/tree/master/metric_tank
I'm seeing excessive time spent in mark-and-scan on go 1.6
(STW pauses are fine and ~1ms)
I used https://circleci.com/gh/raintank/raintank-metric/507 in the below
description.

./metric_tank --version
metrics_tank (built with go1.6, git hash 8897ef4f8f8f1a2585ee88ecadee501bfc1a4139)
go version go1.6 linux/amd64
$uname -a #on the host where the app runs
Linux metric-tank-3-qa 3.19.0-43-generic #49~14.04.1-Ubuntu SMP Thu Dec 31 15:44:49 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

I know the app is currently not optimized for GC workload: while i've
gotten allocations down in various parts of the program there are currently
probably about a million or more live pointers referencing pieces of data.
I was going to work on optimizing this when Dave Cheney suggested there's a
problem with the runtime and I should file a bug (
https://groups.google.com/forum/#!topic/golang-nuts/Q0rXKYjy1cg)
I'm seeing MAS times of 15s and more cpu time, 2000~2500 ms clock time. (8
core system) for a heap of about 6.5GB
Here's the log with gctrace and schedtrace enabled:
https://gist.githubusercontent.com/Dieterbe/18453451c5af0cdececa/raw/9c4f2abd85bb7a815c6cda5c1828334d3d29817d/log.txt

at http://dieter.plaetinck.be/files/go/mt3-qa-gc-vs-no-gc.zip you'll find
a zip containing this log, the binary, a cpu profile taken during gc run
1482, and a cpu and heap profile in between run 1482 and 1483

I also have these two dashboards that seem useful. (they both end just
after the spike induced by GC run 1482)

https://snapshot.raintank.io/dashboard/snapshot/MtLqvc4F6015zbs4iMQSPzfizvG7OQjC
shows memory usage, GC runs and STW pause times. it also shows that
incoming load (requests) of the app is constant so this conveys to me that
any extra load is caused by GC, not by changing workload

https://snapshot.raintank.io/dashboard/snapshot/c2zwTZCF7BmfyzEuGF6cHN9GX9aM1V99
this shows the system stats. note the cpu spikes corresponding to the GC
workload.

let me know if there's anything else I can provide,
thanks,
Dieter.

—
Reply to this email directly or view it on GitHub
https://github.com/golang/go/issues/14812.

RLH on 14 Mar 2016

The mark and scan phase is concurrent, it is unclear how it is adversely affecting your idle application.

just a guess, but perhaps the cause is the extra workload induced by the write barrier? (i watched your gophercon talk again today :) Interestingly, when I use top, i haven't been able to ever catch a core running at 100%.

But you're right that there's essentially two things going on, which may or may not be related:

mark and scan takes too long
app slows down during GC runs, while cpu cores don't saturate.

Let me know how I can help.

Dieterbe on 16 Mar 2016

Hi @Dieterbe, could you clarify what the issue is? 15s for 6.5GB is actually pretty good (I get ~5s/GB of CPU time on some benchmarks locally, but this can vary a lot based on heap layout and hardware).

If it's that the CPU utilization goes up during GC, please clarify why this is a problem (the GC has to do its work somehow, and FPGA accelerators for GC are still an open area of research :)

If it's that response time goes up during GC, could you try the CL in #15022? (And, if you're feeling adventurous, there's also https://go-review.googlesource.com/21036 and https://go-review.googlesource.com/21282)

aclements on 3 Apr 2016

Hey @aclements!

15s for 6.5GB is actually pretty good (I get ~5s/GB of CPU time on some benchmarks locally, but this can vary a lot based on heap layout and hardware).

ok , fair enough for me. i just reported this here because @davecheney mentioned in
https://groups.google.com/forum/#!topic/golang-nuts/Q0rXKYjy1cg
that 1.5s for 5GB was unexpected and that i should open a ticket for it. so hence this ticket.

If it's that the CPU utilization goes up during GC, please clarify why this is a problem (the GC has to do its work somehow, and FPGA accelerators for GC are still an open area of research :)

of course, this is by itself not a problem.

If it's that response time goes up during GC, could you try the CL in #15022?

initially the ticket wasn't about this, but it was brought up and is definitely a problem for us. so from now on we may as well consider this the issue at hand.
I recompiled my app with a recompiled go using your patch, and did a test run before and after.
unfortunately i see no change and the latency spikes are still there (details at https://github.com/raintank/raintank-metric/issues/172)
note that i can verify this problem quite early on. e.g. in this case i've seen spikes as early as GC run 270. the issue is there probably much earlier but my app needs to load in a lot of data before i can test. the bug mentioned in #15022 looks like it only activates after a sufficient amount of GC runs.

Dieterbe on 5 Apr 2016

@Dieterbe, would it be possible for you to collect a runtime trace (https://godoc.org/runtime/trace) around one of the periods of increased latency? If you do this with current master, the trace file will be entirely self-contained (otherwise, I'll also need the binary to read the trace file).

I have a hunch about what could be causing this. GC shrinks the stacks, so if many of your goroutines are constantly growing and shrinking the amount of stack they're using by at least a factor of 4, you would see a spike as many goroutines re-grew their stacks after the shrink. This should be more smeared out on master than with Go 1.6 since f11e4eb made shrinking happen concurrently at the beginning of the GC cycle, but if this is the problem I don't think that would have completely mitigated it. (Unfortunately, the trace doesn't say when stack growth happens, so it wouldn't be a smoking gun, but if many distinct goroutines have high latency right after GC that will be some evidence for this theory.)

aclements on 16 May 2016

Hey @aclements
I did curl 'http://localhost:6063/debug/pprof/trace?seconds=20' > trace.bin
about 5~7 seconds in I think (it's a bit hard to tell) is where the GC kicks in and a latency spike was observed
files: http://dieter.plaetinck.be/files/go-gc-team-is-awesome/trace.bin and http://dieter.plaetinck.be/files/go-gc-team-is-awesome/metric_tank for the binary. compiled with official 1.6.2 . hopefully this helps to diagnose. if not let me know, maybe i can get a better trace.

Dieterbe on 16 May 2016

I read through and #9477 and #10345 and wonder if this issue is another similar case? note that this app is centered around a map (https://github.com/raintank/raintank-metric/blob/master/metric_tank/mdata/aggmetrics.go#L13) that has just over 1M values (and each value in turn has a bunch of pointers to things that have more pointers, and lots of strings involved too). optimizing this is on my todo, but in the meantime i wonder if maybe a GC thread blocks the map leaving other application threads (mutators) unable to interact with the map. and since everything in the app needs this map, it could explain the slow downs?

Dieterbe on 30 May 2016

@Dieterbe, it's possible. Could you try the fix I posted for #10345? (https://golang.org/cl/23540)

Note that it's not that the GC thread blocks _the map_. Mutators are free to read and write the map while GC is scanning it; there's no synchronization on the map itself. The issue is that whatever thread gets picked to scan the buckets array of the map is stuck not being able to do anything else until it's scanned the whole bucket array. If there's other mutator work queued up on that thread, it's blocked during this time.

(Sorry I haven't had a chance to dig into the trace you sent.)

aclements on 31 May 2016

👍1

I dug into the trace you posted (thanks!) and took another look at your gctrace. This is pretty clearly an over-assist problem. When the GC is running, the expected 25% of the CPU goes to dedicated GC workers, but nearly all of the rest of the CPU time goes to GC assists triggered by allocation. Hence, when GC is running, even though it's technically concurrent, nearly all of the CPU is being soaked up by GC. You can see this really clearly in the trace:

trace-14812

The GC kicks in in the middle. When the GC isn't running, all of the user goroutine executions are really short and the system keeps up. As soon as GC kicks in, two Ps go to dedicated gcBgMarkWorkers (expected) and everything else turns to a checkerboard of 10–20 ms long user goroutine time slices (not expected). Because of this, the run queue lengths climb rapidly (the second row in the group of three at the top), which is almost certainly what's causing the latency spikes. Right after GC finishes there's a period where the user goroutines are running quickly again but they're packed much more densely as it drains the long run queues and returns to normal.

I don't yet understand why this is happening. The heap is in steady state, so I would expect it to compute an assist ratio that causes it to finish roughly half way between the trigger and the goal, but it seems to be computing a much higher assist ratio and finishing much earlier.

@Dieterbe, could you re-run your application with GODEBUG=gctrace=1,gcpacertrace=1 and grab stderr? It would be awesome if you can run with master, since there's more gcpacertrace output than there was in 1.6, but even the 1.6 output would be helpful.

aclements on 8 Jun 2016

hey @aclements

i recompiled latest master go, then compiled my app with it, and ran with that GODEBUG variable.
this time i just ran it in local dev environment with fake traffic, but the symptoms are visible again (some of the fake requests got slow responses at GC times).
here's the err output: https://gist.github.com/Dieterbe/6a57caf1e6fae129f375146ef95e6276
note at line 459 I pressed return twice.
that's where my fake traffic started and where the symptoms become clear.
the 1st and 2nd GC run after that didn't show it clearly, but then at the 3rd, 4th and 5th and basically every GC run ever that showed response time spikes again.

if it helps, here's a dashboard of the entire run:
https://snapshot.raintank.io/dashboard/snapshot/gNKMeTpufl77H9gK1JCZn5TYvo88uLju
and zoomed in on the region where fake traffic kicked in:
https://snapshot.raintank.io/dashboard/snapshot/QgUlChTZGR4Bg4bw95mHxmN3vozfb334

the "golang GC" graph 2nd row middle and "request handle duration (all steps)" 3rd row on right will probably be most helpful.

Dieterbe on 8 Jun 2016

@Dieterbe, is there a way I can run the load test with fake traffic myself? I built metric_tank (I had to switch github.com/grafana/grafana/pkg/log to origin/metrics_reporting, but otherwise go get just worked), but I don't know how to run it.

aclements on 14 Jun 2016

I had to switch github.com/grafana/grafana/pkg/log to origin/metrics_reporting

i'm not sure what's metrics_reporting, but the log problem is caused due to grafana recently changing up its logging mechanism, you can just checkout the last commit before that (https://github.com/grafana/grafana/commit/22cda198aedfacaa93e2666e54db963ab93825a8) and metric-tank will build fine against it.

as for running it, MT (metric tank) relies on cassandra, elasticsearch and nsqd (from the nsq.io project) at runtime. the easiest (hopefully) way to get it all up and running is by using https://github.com/raintank/raintank-docker, so you can follow that readme, with a few changes:

you need to update screens/metricTank with

-./metric_tank \
+GODEBUG=gctrace=1,gcpacertrace=1 ./metric_tank \

you would also need to create a file screens/benchmark in raintank-docker with contents like the below:

#backfill an hour at 100k, wait for it, and then do realtime at 10k in background, and start http load.
fake_metrics -nsqd-tcp-address nsqd:4150 -statsd-addr statsdaemon:8125 -orgs 100 -keys-per-org 100 -offset 1h -speedup 10 -stop-at-now
fake_metrics -nsqd-tcp-address nsqd:4150 -statsd-addr statsdaemon:8125 -orgs 100 -keys-per-org 100  &
go get github.com/raintank/raintank-metric/inspect-es
go get github.com/tsenart/vegeta
inspect-es --es-addr elasticsearch:9200 -format vegeta-mt | sed 's#localhost:6063#metricTank:6063#' | vegeta attack -rate 500 -duration=120m > attack.out

these tools set up the "contextual circumstances": fake_metrics generates an input stream, inspect-es piped to vegeta triggers a consistent http request load.

you might hit some snags in the process of getting this all up and running though, and i would hate to have you lose much time with this. but i could hop on a call or whatever to help out.

thanks.

Dieterbe on 14 Jun 2016

I think I got it up and running (with many tweaks), but now I'm a little stuck.

What's the easiest way to get a custom-built Go into the metric_tank container? I'd like to add some debugging to the runtime.
How do I monitor the end-to-end latency so I can test any changes?

aclements on 15 Jun 2016

hey @aclements sorry you had to go through that. what i personally do is compile metric-tank manually on the host, so that the binary is $GOPATH/src/github.com/raintank/raintank-metric/metric_tank/metric_tank, and then make raintank-docker/raintank_code/raintank-metric a symlink to $GOPATH/src/github.com/raintank/raintank-metric/metric_tank/ . you can see in fig.yaml, the metricTank entry, that ./raintank_code/raintank-metric from the host gets mapped to /go/src/github.com/raintank/raintank-metric within the container using a docker directory mapping. in other words, the directory in the container is synchronized with the host, so whatever you compile on the host will be available in the container.

there's 2 methods:

A) using vegeta. in the vegeta attack command you already have, probably change the duration from 120m to about 5m or so, to make the benchmark shorter. then it should be as simple as cat attack.out | vegeta report. vegeta has a few other interesting reporting modes as well. and i think you can even run reports "streaming", e.g. get reports while the file is still being written to.

B) if grafana is running properly in the container (the easiest to verify is to attach to the screen session and go to the grafana tab. in fact you may want to check out all tabs just to verify nothing's obviously broken. usually the last tab i switch to is the benchmark one. there you'll have the console where vegeta is executing and where you can run more vegeta commands). but anyway if grafana's running then you can go to http://localhost:3000, log in as admin:admin and load the "metric tank" dashboard. that's the dashboard i've been using to provide the snapshots above. it should show real-time stats coming in (which it loads from graphite, which also runs in a container)

Dieterbe on 16 Jun 2016

I spent much of this week working on this. I still have ideas on my list, but wanted to post an update.

First, I was able to bring up the whole system with a custom runtime and observe the behavior through Grafana and vegeta report. I don't think I was able to quite reproduce the problem locally, though. My laptop simply didn't have the resources, so it was swapping and over-saturating the CPU and causing all sorts of horrible latency. My workstation had the resources (though I discovered if I left it running it would chew through my not-very-large root partition; thanks Docker!), but I'm not sure it exhibited the problem. According to vegeta, max latency was ~350ms, but P99 was just 2ms in several runs. I did see small jumps in max "request handle duration" correlated with GC in Grafana, though they were only around 15ms.

However, I was still able to dig into the execution traces and glean some insight. I fixed some problems with high goroutine churn potentially losing track of assist credit, and with the way we build up spare assist credit causing long assists when the assist ratio is high (which it is in Metric Tank). I also did several more experimental changes, which did get rid of the bad checkerboard pattern in the execution trace and break it up into a much more finely interleaved workload. This is certainly what we want to see in the trace, but it didn't seem to move the overall stats. I have a few more ideas on why that may be, but I'll be out next week, so it may be a little bit before I can try them.

I'm going to mail a bunch of my experimental commits, which should get linked here shortly. @Dieterbe, if you have time, it would be good try running with these, though I'm not sure you'll see much difference.

I did notice one thing you may be able to change on the application side to mitigate some of the problem. I think Metric Tank is allocating very small objects at a very high rate. This is causing the GC scan rate to be much lower than typical, since the memory access pattern is even more random than typical. This is stretching out the GC assists and interacting poorly with some of the tuning that assumes a more typical scan rate. It would be worth doing a heap profile with -alloc_objects to see if there are any sources of large numbers of small allocations that would be easy to address.

aclements on 1 Jul 2016

I don't think I was able to quite reproduce the problem locally, though. My laptop simply didn't have the resources

yeah it's not trivial to consistently trigger clear symptoms (large spikes). But it seems more load always seems to help.

though I discovered if I left it running it would chew through my not-very-large root partition; thanks Docker!)

I run https://github.com/meltwater/docker-cleanup/blob/master/docker-cleanup-volumes.sh periodically in between runs of the stack, this frees up space.

It would be worth doing a heap profile with -alloc_objects to see if there are any sources of large numbers of small allocations that would be easy to address.

Oh boy. I already went through weeks of optimizing memory allocations and what's left now is a bit harder. IIRC the remaining inefficiencies are about allocating a whole bunch of strings (several for each incoming metric). optimizing that (splitting up in the incoming stream in separate data and metadata streams, string interning, using enums, etc) is on the agenda but will require a broader architectural overhaul

I'll be out next week

Enjoy your time off!

Dieterbe on 4 Jul 2016

(though I discovered if I left it running it would chew through my not-very-large root partition; thanks Docker!)

Mounting a directory on another partition into the container might help with this.

uluyol on 4 Jul 2016

Postponing to 1.8.

ianlancetaylor on 7 Jul 2016

CL https://golang.org/cl/25155 mentions this issue.

gopherbot on 23 Jul 2016

Offtopic: @aclements what tool did you use to visualize trace?

wendigo on 22 Sep 2016

@wendigo: go tool trace. Unfortunately, there's not a lot of documentation on it, but if you run go tool trace -h it will give you a high-level overview.

aclements on 22 Sep 2016

👍4

@Dieterbe @aclements Does the fix for #16432 resolve this issue as well?

@Dieterbe in particular, have you been able to resolve this by reducing the allocation load as well?

quentinmit on 29 Sep 2016

@Dieterbe @aclements Does the fix for #16432 resolve this issue as well?

I don't have my metricTank setup any more. I could probably recreate it, but @Dieterbe, if you can test with Go 1.7 (which includes the #16432 fix), that's probably both easier and more reliable.

aclements on 29 Sep 2016

@Dieterbe, would you mind re-testing with current master?

aclements on 21 Nov 2016

Hi @aclements , looks like I have some bad news:

i just did test runs on my laptop:

1.7.3 : https://snapshot.raintank.io/dashboard/snapshot/O5aCmYZfB78Sy4WfovEBYbhzlrMaED99
master : https://snapshot.raintank.io/dashboard/snapshot/ELOxhN8jctpAkSQLxh1O4bL6xOpEC6e5
(to be precise: f88a33aeac427b57a9acb4ecf6e26d7191ab60cc )

check out the 2nd row. specifically the chart middle and right.
you can click on keys in the legend (e.g. 'max') to isolate out those series on the plot.
in both runs, you can see that the spikes in max correspond exactly to all the GC runs.
these spikes (70~100ms) aren't as severe as in prod (where they are many second), i presume because we have much less data in RAM during this test. but they are still way more than the reported GC durations.

FWIW i built go using GOROOT_BOOTSTRAP=/home/dieter/code/go-linux-amd64-bootstrap ./all.bash with a bootstrap go1.7rc4

to doublecheck, i had metrictank report the go version its built with:

./build/metrictank --version
metrictank (built with devel +f88a33a Tue Nov 22 12:26:19 2016 +0000, git hash 0.6.0-12-g9c678bc)

anything I can do to move this forward again? run a trace or whatever? thank you.

i killed the first setup before i looked at vegeta's output, but FWIW the output from the 2nd run (it basically just confirms the stats in the dashboard):

Requests      [total, rate]            72000, 300.00
Duration      [total, attack, wait]    3m59.997277767s, 3m59.996642518s, 635.249µs
Latencies     [mean, 50, 95, 99, max]  775.306µs, 429.842µs, 3.727124ms, 4.887946ms, 127.077661ms
Bytes In      [total, mean]            645690068, 8967.92
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:72000
Error Set:

Dieterbe on 22 Nov 2016

I collected another trace just like last time (curl 'http://localhost:6063/debug/pprof/trace?seconds=20' > trace.bin)
about halfway through (maybe a bit earlier) there was a spike reported as 102ms.

dieter.plaetinck.be/files/go-gc-team-is-awesome/trace-f88a33aeac427b57a9acb4ecf6e26d7191ab60cc
here's a copy of the binary (not sure if you need it, but here it is)
dieter.plaetinck.be/files/go-gc-team-is-awesome/bin-f88a33aeac427b57a9acb4ecf6e26d7191ab60cc

Dieterbe on 22 Nov 2016

Thanks. I was hopeful we would have fixed the problem, but it looks like it's something we still don't understand. If I'm reading the graphs correctly, at least it's not any worse.

these spikes (70~100ms) aren't as severe as in prod (where they are many second), i presume because we have much less data in RAM during this test. but they are still way more than the reported GC durations.

Grafana is reporting just the stop-the-world duration of each GC. It's clearly not reporting the duration of the concurrent part of GC, which is when these latency spikes are happening. The fact that the spikes last longer than the STW duration isn't surprising. The duration of the concurrent part is proportional to heap size, and the spike durations are probably related to the duration of concurrent GC.

I collected another trace just like last time

Thanks. Could you also collect stderr with GODEBUG=gctrace=1,gcpacertrace=1?

aclements on 22 Nov 2016

GODEBUG=gctrace=1,gcpacertrace=1 log:
dieter.plaetinck.be/files/go-gc-team-is-awesome/log-f88a33aeac427b57a9acb4ecf6e26d7191ab60cc

note a lot of the log is filled with http request information.
the stress load starts after gc 118 where you see the flood of logs (there are a handful of http requests before that too, but those are not the stress load)
should also be obvious to tell where the workload stopped.
i also took another snapshot of this run, if it helps: https://snapshot.raintank.io/dashboard/snapshot/kfxi60VDSzLleknm569sU5Pg7DA5E5sk
(only shown here is the time range during the heavy workload)

btw @aclements if you're interested in running it for yourself, let me know, i can hook you up with a way to run the experiment with much less hassle then what you had to go through last time.

otherwise, let me know what else i can do to help diagnose this.

thanks!

Dieterbe on 22 Nov 2016

hi @aclements any ideas? do you think the fix could make it into 1.8? if not, is there anything I can do to make that happen? please let me know! thanks.

Dieterbe on 26 Nov 2016

I think that sync.Pool is involved, but I don't think it explains all of the behavior. This is a bit of a guess based on the end stack trace of a few goroutines. A current CPU profile would help to confirm. @Dieterbe , can you collect and share one?

It looks like the Go runtime clears all pools at the beginning of the GC cycle. This application uses a sync.Pool to cache 32000-byte slices. The end stack traces of G847056, G848339, G848340, G848333, G848329, G848327, G848471, and G848474 (and possibly others) are calls into runtime.gcAssistAlloc via calls to sync.(*Pool).Get.

Furthermore, the application only returns slices to the pool at the end of a request. Many requests start 100 getTargets.func1 goroutines, each of which needs its own slice from the pool. Because everything is slower during the GC cycle, inbound http requests are more likely to overlap which would lead to the need for even more than 100 slices from the sync.Pool.

GC cycles are often around 20 seconds apart, and cover around 500MB of heap growth. That's an allocation rate of 25MB/s. If the allocation rate remained constant during the GC cycle, there would be around 2.5MB of heap growth during the ~100ms long GCs—but the heap often grows by around 6MB during the GC based on the gctrace output. This is consistent with—though does not at all prove—demand for 3.2MB of new sync.Pool contents being allocated during each GC cycle.

I looked through the trace-f88a33aeac427b57a9acb4ecf6e26d7191ab60cc execution trace. There's an inbound http request that starts about 0.2ms before the GC cycle begins. The request handler spawns around 100 goroutines of github.com/raintank/metrictank/api.(*Server).getTargets.func1 .. and a few additional http requests are received which lead to several more of these goroutines being created. The distribution of their runtime durations (in milliseconds) is as follows:

N 141  sum 394.133  mean 2.79527  gmean 1.05768  std dev 3.38325  variance 11.4464

     min 0.016
   1%ile 0.0286933
   5%ile 0.088
  25%ile 0.246
  median 1.436
  75%ile 4.25633
  95%ile 10.7898
  99%ile 15.9215
     max 17.314

⠀⠀⠀⠀⠀⠀⠀⢀⡴⠒⠒⢦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.299
⠀⠀⠀⠀⠀⠀⢠⠎⠀⠀⠀⠀⠹⡄⠀⠀⠀⠀⠀⣀⠤⠒⠒⠲⢤⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠋⠀⠀⠀⠀⠀⠀⠉⠓⠒⠒⠒⠋⠁⠀⠀⠀⠀⠀⠀⠉⠉⠒⠒⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
        0                               10

For comparison, here's the distribution of runtime duration of the github.com/raintank/metrictank/api.(*Server).getTargets.func1 goroutines that were started in response to a request that came in slightly before the GC cycle:

N 100  sum 8.478  mean 0.08478  gmean 0.0730006  std dev 0.0615151  variance 0.00378411

     min 0.032
   1%ile 0.0333467
   5%ile 0.04135
  25%ile 0.051
  median 0.064
  75%ile 0.0935833
  95%ile 0.17195
  99%ile 0.424033
     max 0.488

⠀⠀⠀⠀⠀⠀⠀⠀⡴⠒⣆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 21.109
⠀⠀⠀⠀⠀⠀⠀⡸⠀⠀⠈⢦⡀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠜⠁⠀⠀⠀⠀⠉⠁⠉⠑⠒⠲⠖⠦⠤⠴⠒⠒⠦⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠁
  0                                                             0.5

During the GC, many of those goroutines are executed for around 5ms. When there's no GC active, that same kind of goroutine takes less than half a millisecond to do its work.

This application looks like it has a lot of churn in its goroutines. Each of the hundred goroutines running github.com/raintank/metrictank/api.(*Server).getTargets.func1 is new.

It also looks like many of them are created at close to the same time. After a goroutine exists, when/how is its assist credit made available to other goroutines? It looks like the assist credit for all goroutine structs (alive or dead) is reset at the beginning of the GC cycle, and that credit built up by a goroutine that dies before the cycle is over is available to any _new_ goroutines that start (in the case where they happen to pick up a G struct with a positive credit balance), but it is not flushed to the background mark workers and so is not available to be stolen. Is this an accurate description, @aclements ?

The CPU profile taken with go1.6 shows a lot of time spent in runtime.gcAssistAlloc, confirming that assists are involved—however it's still weird that individual goroutines would be required to do much more than 100µs of assist work (with Go 1.7 or newer).

@Dieterbe , can you share an updated CPU profile including at least one GC cycle, from when the app is behaving as it was when you took the recent execution trace (trace-f88a33aeac427b57a9acb4ecf6e26d7191ab60cc)?

From log-f88a33aeac427b57a9acb4ecf6e26d7191ab60cc:

pacer: sweep done at heap size 489MB; allocated 5MB of spans; swept 118926 pages at +2.472380e-004 pages/byte
pacer: assist ratio=+1.246231e+001 (scan 300 MB in 943->967 MB) workers=2+0
pacer: H_m_prev=507438736 h_t=+9.500000e-001 H_T=989505535 h_a=+9.617449e-001 H_a=995465360 h_g=+1.000000e+000 H_g=1014877472 u_a=+2.554183e-001 u_g=+2.500000e-001 W_a=81587272 goalΔ=+5.000000e-002 actualΔ=+1.174491e-002 u_a/u_g=+1.021673e+000
gc 118 @366.846s 0%: 0.062+117+0.17 ms clock, 0.49+5.0/231/415+1.3 ms cpu, 943->949->488 MB, 967 MB goal, 8 P
pacer: sweep done at heap size 496MB; allocated 8MB of spans; swept 121633 pages at +2.503184e-004 pages/byte
pacer: assist ratio=+1.224037e+001 (scan 299 MB in 953->977 MB) workers=2+0
pacer: H_m_prev=512591264 h_t=+9.500000e-001 H_T=999552964 h_a=+9.624541e-001 H_a=1005936848 h_g=+1.000000e+000 H_g=1025182528 u_a=+2.721321e-001 u_g=+2.500000e-001 W_a=81032240 goalΔ=+5.000000e-002 actualΔ=+1.245414e-002 u_a/u_g=+1.088529e+000
gc 119 @387.007s 0%: 0.055+96+0.062 ms clock, 0.44+17/192/438+0.49 ms cpu, 953->959->488 MB, 977 MB goal, 8 P

@aclements why is the GC finishing so quickly?

If the mutator goroutines were allocating as fast as possible I'd expect them to end up doing 3/4ths of the work, since 1/4th of CPU time is earmarked for the dedicated GC worker goroutines. This would result in the heap growing by 3/4ths of the headroom reserved by the pacer: 18 out of 24MB for the log lines above. Instead, the heap grows by only 6MB during the GC.

When the GC isn't running, the number of active cores basically rounds to zero. This implies that the user goroutines are spending CPU cycles because of the GC ... which could be allocation assistance or populating sync.Pool values that were cleared at the start of the cycle. I don't understand why the GC cycle would complete so quickly (measured in bytes of heap growth), other than:

1) Assists are being rounded up, and assist value is not all being spent on offsetting allocations

2) Something about the GC, possibly the clearing of sync.Pools, is causing additional mutator work which is not assistance to the GC.

rhysh on 29 Nov 2016

Hey @rhysh , thank you so much for your input.

http://dieter.plaetinck.be/files/rhys-profile
http://dieter.plaetinck.be/files/rhys-bin (built with go +f88a33a but the metrictank binary/version might be slightly different in an insignificant way - than previous experiments)

corresponding dashboard covering the same 30s timewindow: https://snapshot.raintank.io/dashboard/snapshot/7AzQlwYvPYCtdI32mnbCyUgu2RLUHWBG

as you can see, the spike happened after 18 seconds

This application looks like it has a lot of churn in its goroutines. Each of the hundred goroutines running github.com/raintank/metrictank/api.(*Server).getTargets.func1 is new.

FWIW, we could switch to a pool of long-running goroutine workers, so that for each incoming http request, we would use channels to send and retrieve work jobs, as opposed to the current model of launching goroutines for each job (and using a waitgroup to wait for them to complete) if that's a better mechanism.
i'm willing to rewrite any portion of the code really, to work around this problem. though it sounds like the problem is with the sync.Pool?

If the allocation rate remained constant during the GC cycle, there would be around 2.5MB of heap growth during the ~100ms long GCs—but the heap often grows by around 6MB during the GC based on the gctrace output. This is consistent with—though does not at all prove—demand for 3.2MB of new sync.Pool contents being allocated during each GC cycle.

so this implies that during the GC there would only be 1 of the slices being allocated? interesting, i would expect at least several to be created at once, after a clear.

Dieterbe on 5 Dec 2016

Thanks for sharing the CPU profile. I didn't see the samples I was hoping for in it (details at the end).

The "render request duration (all steps)" chart shows a request taking 8ms near when the GC took place. The execution traces you shared earlier (https://github.com/golang/go/issues/14812#issuecomment-262276449) showed some requests that were delayed by around 80ms, which matches the grafana dashboard from that time. It looks like the requests didn't quite line up with a garbage collection cycle.

Are you able to share another profile, maybe one that covers several minutes, or that covers at least one request that was delayed by more than 50ms?

For what it's worth, I've had some success with uploading execution traces to GitHub after adding a ".zip" extension. GitHub would probably allow uploads of real zip files as well :)

If the allocation rate remained constant during the GC cycle, there would be around 2.5MB of heap growth during the ~100ms long GCs—but the heap often grows by around 6MB during the GC based on the gctrace output. This is consistent with—though does not at all prove—demand for 3.2MB of new sync.Pool contents being allocated during each GC cycle.

so this implies that during the GC there would only be 1 of the slices being allocated? interesting, i would expect at least several to be created at once, after a clear.

This might indicate demand for 100 slices from the sync.Pool. Each slice created by the pool is 32,000 bytes, totaling 3.2MB.

The gctrace lines indicate an allocation rate of 25MB/s and a duration of around 0.1s—but they also indicate an allocation growth of around 6MB during the concurrent mark phase, which is more than the 2.5MB one would expect if the allocation rate remained constant while the collection was active. Allocations totaling 3.2MB, from 100 slices of 32kB each, go a long way to explaining this difference. The number 100 is of interest because many of the inbound http requests seem to start 100 instances of the github.com/raintank/metrictank/api.(*Server).getTargets.func1 goroutine.

Here's what I saw in the profile:

$ go tool pprof ./bin ./profile
Entering interactive mode (type "help" for commands)
(pprof) nodefraction=0
(pprof) edgefraction=0
(pprof) peek mallocgc
2.12s of 2.12s total (  100%)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context          
----------------------------------------------------------+-------------
                                             0.05s 71.43% |   runtime.newobject
                                             0.01s 14.29% |   runtime.makechan
                                             0.01s 14.29% |   runtime.rawstring
     0.02s  0.94%  0.94%      0.07s  3.30%                | runtime.mallocgc
                                             0.03s 60.00% |   runtime.(*mcache).nextFree
                                             0.01s 20.00% |   runtime.heapBitsSetType
                                             0.01s 20.00% |   runtime.memclrNoHeapPointers
----------------------------------------------------------+-------------
(pprof) peek sync.*Get
2.12s of 2.12s total (  100%)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context          
----------------------------------------------------------+-------------
                                             0.01s 50.00% |   fmt.newPrinter
                                             0.01s 50.00% |   github.com/raintank/metrictank/vendor/github.com/gocql/gocql.newFramer
         0     0%     0%      0.02s  0.94%                | sync.(*Pool).Get
                                             0.01s 50.00% |   sync.(*Pool).getSlow
                                             0.01s 50.00% |   sync.(*Pool).pin
----------------------------------------------------------+-------------

I expected to see frames where runtime.mallocgc calls runtime.gcAssistAlloc, and where sync.(*Pool).Get is called from github.com/raintank/metrictank/api.(*Server).getSeries

rhysh on 8 Dec 2016

This might indicate demand for 100 slices from the sync.Pool. Each slice created by the pool is 32,000 bytes, totaling 3.2MB.

my bad, i somehow mixed up 32kB with 3.2MB (x100). what you say makes sense now. So much sense that I'm not sure i understand what's noteworthy here. Isn't this increase in allocations by the pool, right after every pool wipe perfectly normal? That's how it's supposed to work, right? Seems to me like the allocation rate is supposed to jump up at every GC run because the pools get filled again.

Many requests start 100 getTargets.func1 goroutines, each of which needs its own slice from the pool.

FWIW, this directly corresponds to how many series are requested from metrictank at once per http request. for each series requested in a request, we launch another goroutine.
in real life this is an arbitrary number. In these artificial tests the requests are generated such that some don't request any series, some request a few, and some request many. (by inserting wildcards in the requests. and we artificially load up data such that there's 100 series per org, so wildcards in the right place will request exactly 100 series). this number may have changed over the course of this thread, as i've made changes to the benchmark tool and scripts and it never seemed like an interesting factor. Note that the previous profile was made in a way where reqs-to-series/goroutines was much lower, which explains why the issue didn't manifest itself so clearly. I used the same parameters now as before and have a much cleaner result (in terms of exposing the latency spikes) again for you (with some reqs requesting 100 series), and i will continue to use the same parameters from here on.

Note that the "series request rate" chart in the grafana snapshot directly corresponds to github.com/raintank/metrictank/api.(*Server).getTargets.func1 goroutines being created across all http requests (per second), which tends to be between 1.5 - 2.8 kHz, whereas the "render request duration (all steps)" shows the reqs/s on the right y-axis (which tends to be reported as 250 - 270 Hz, not sure why this is less than what vegeta should be generating which is 300Hz, but I don't think we should worry about this discrepancy, at least not for the purpose of solving the problem at hand here)

without further ado:
http://dieter.plaetinck.be/files/rhys/metrictank-binary
http://dieter.plaetinck.be/files/rhys/cpu.profile

inspect-idx -old -addr http://metrictank:6063 -from 1min cass cassandra:9042 raintank vegeta-render-patterns | vegeta attack -rate 300 -duration 600s | vegeta report                                
Requests      [total, rate]            180000, 300.00
Duration      [total, attack, wait]    9m59.997408023s, 9m59.996606552s, 801.471µs
Latencies     [mean, 50, 95, 99, max]  801.283µs, 425.842µs, 3.880967ms, 5.051549ms, 124.368376ms
Bytes In      [total, mean]            1704992412, 9472.18
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:180000  
Error Set:

https://snapshot.raintank.io/dashboard/snapshot/YJoc5Ty4y205Cy6dZHBzxMuyO0jYHgNE corresponds to the 30s timeframe of the profile.
note:

note the 81ms spike 13 seconds in
the "render request duration per step" displays means of the data on a per-second granularity. here's a variant of that chart that uses max values : https://snapshot.raintank.io/dashboard/snapshot/Tn0m29Xk1KEqj41cIsC2sdKD8ws5zhzz for you this is probably not very useful, it just clarifies a bit more where in the code time is being spent (still on a pretty coarse level though)

Dieterbe on 9 Dec 2016

also not sure if you'ld like a shorter profile that could focus more on just the problematic period.. I think i was able to get a 5s profile that captured a spike right at the beginning
this timerange: https://snapshot.raintank.io/dashboard/snapshot/CwsaHypIDlAR2UcYwZtC3Qd6GLc0iOX7 :
prof5s.zip

Dieterbe on 9 Dec 2016

@aclements is there anything I can do that will increase the odds of this being adressed for Go 1.8 ? thanks? Send me your address and I'll mail you a bribe? :)

Dieterbe on 13 Dec 2016

It looks like the assist credit for all goroutine structs (alive or dead) is reset at the beginning of the GC cycle, and that credit built up by a goroutine that dies before the cycle is over is available to any new goroutines that start (in the case where they happen to pick up a G struct with a positive credit balance), but it is not flushed to the background mark workers and so is not available to be stolen. Is this an accurate description, @aclements ?

That is accurate. In fact, https://golang.org/cl/24703/ was an experiment to change this in the service of this issue, but didn't seem to help when I wrote it.

gc 118 @366.846s 0%: 0.062+117+0.17 ms clock, 0.49+5.0/231/415+1.3 ms cpu, 943->949->488 MB, 967 MB goal, 8 P

@aclements why is the GC finishing so quickly?

My guess would be that idle GC (which is 2/3rds of the GC CPU time in the above line) is driving it to completion quickly. It's not finishing that quickly anyway. In steady state without idle GC I would have expected GC 118 to finish at 955 MB (half way between 943 MB and 967 MB). Given the idle GC time, I'm not at all surprised that it finished in half of that heap growth. The same goes for GC 119.

@aclements is there anything I can do that will increase the odds of this being adressed for Go 1.8 ?

Unfortunately I'm still not sure what to fix. :( If it's not too terrible, it would be worth trying a goroutine pool and seeing if it helps. That's not the right long-term solution, but if it helps, that tells us something.

The other experiment that's worth trying is to simply comment out the poolcleanup() call in runtime/mgc.go. That will disable wiping the sync.Pool. If that helps, we can look into perhaps adding some hysteresis to pool cleanup to avoid these sudden cliffs.

aclements on 13 Dec 2016

it would be worth trying a goroutine pool and seeing if it helps. That's not the right long-term solution

@aclements why not? Would it set us back in any way? One obvious thing I can see is that it would restrict the concurrency, which may not be such a bad thing.

Another thing I'm interested in is using a manual pool that doesn't clear every GC. I'm hesitant to use the commonly cited buffered-channel approach that has a fixed max size, though the solution at https://blog.cloudflare.com/recycling-memory-buffers-in-go/ (see the 3rd piece of code there) looks like it may be a winner due to its dynamic size.

Dieterbe on 14 Dec 2016

One thing that might help is to half the size of the suspect data
structures and see if that cuts the latency in half. We obviously can't do
that (or we would have) but what we can do is artificially double their
size and see if the latency doubles. By applying this doubling one by one
to the various data structures we might be able to identify the critical
structures and better focus our attention on them.

On Wed, Dec 14, 2016 at 3:28 AM, Dieter Plaetinck notifications@github.com
wrote:

it would be worth trying a goroutine pool and seeing if it helps. That's
not the right long-term solution

@aclements https://github.com/aclements why not? Would it set us back
in any way? One obvious thing I can see is that it would restrict the
concurrency, which may not be such a bad thing.

Another thing I'm interested in is using a manual pool that doesn't clear
every GC. I'm hesitant to use the commonly cited buffered-channel approach
that has a fixed max size, though the solution at
https://blog.cloudflare.com/recycling-memory-buffers-in-go/ (see the 3rd
piece of code there) looks like it may be a winner due to its dynamic size.

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/golang/go/issues/14812#issuecomment-266973241, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AA7Wn-o-kYJqunpq6K7Twk0Xi8tqHtuBks5rH6i6gaJpZM4HvxjD
.

RLH on 14 Dec 2016

The five-second profile includes 210ms worth of samples where runtime.gcAssistAlloc is on the stack. Of those, 150ms comes from sync.(*Pool).Get, 30ms comes from github.com/raintank/metrictank/mdata.(*AggMetric).Get, and 30ms comes from github.com/raintank/metrictank/api.(*Server).renderMetrics (first diagram).

Disabling poolcleanup() as @aclements suggests sounds very helpful.

The Pool entries are 32kB and the assist ratio is often around 12 for a total of around 400kB of scan work required to pay for their allocation. That's more than the 64kiB gcOverAssistWork, so the goroutines probably don't end up with excess assist credit.

The Pool's slow path (doing an allocation rather than returning a cached entry) is activated during the GC since the pools are empty at the beginning of the GC cycle. Additionally, the slow path is even slower since it's subject to GC assist work. In the five-second profile, sync.(*Pool).Get is on the stack for 160ms of which 150ms is doing GC assists, with only 10ms spent on non-GC-assist allocation work (second diagram).

go tool pprof -nodefraction=0 -edgefraction=0 -focus=gcAssistAlloc -web metrictank-binary prof5s.zip

screen shot 2016-12-14 at 8 27 09 am

go tool pprof -nodefraction=0 -edgefraction=0 -focus=Pool.*Get -web metrictank-binary prof5s.zip

screen shot 2016-12-14 at 8 41 15 am

rhysh on 14 Dec 2016

it would be worth trying a goroutine pool and seeing if it helps. That's not the right long-term solution
@aclements why not? Would it set us back in any way? One obvious thing I can see is that it would restrict the concurrency, which may not be such a bad thing.

I just mean that it's not the right long-term solution because the runtime should be doing the right thing and not forcing the user to create their own goroutine pool. Goroutine pools widen the semantic gap between the application and the runtime, so not having a pool should be better and I consider it a runtime bug if a user-created pool improves things.

Another thing I'm interested in is using a manual pool that doesn't clear every GC. I'm hesitant to use the commonly cited buffered-channel approach that has a fixed max size, though the solution at https://blog.cloudflare.com/recycling-memory-buffers-in-go/ (see the 3rd piece of code there) looks like it may be a winner due to its dynamic size.

This is essentially what I was suggesting with commenting out poolcleanup(). If that does help, we know the pool cleanup is involved. @rhysh's analysis of the profile strongly indicates that this is involved, and doing an experiment with this commented out would cinch it.

aclements on 15 Dec 2016

@Dieterbe, were you able to experiment with disabling pool cleanup to see if that was the culprit causing latency spikes?

aclements on 6 Jun 2017

Hey . Haven't had time yet . On vacation now. Should be able to give it ago in a week or two maybe 3

Dieterbe on 10 Jun 2017

i updated my go tree to latest master, i presume that's the best starting point? do i comment out the call to poolcleanup or to clearpools ?. because clearpools calls poolcleanup and a bunch of other stuff ?

Dieterbe on 20 Jun 2017

i updated my go tree to latest master, i presume that's the best starting point?

Sounds good

do i comment out the call to poolcleanup or to clearpools ?

Comment out the call to poolcleanup. While it's probably safe to disable all of clearpools, I'd have to think more to be sure, and I don't think the rest of clearpools is the problem.

aclements on 20 Jun 2017

ok i need to document some things for my own reference because i had a hard time reproducing. due to some changes we made, much more memory is being allocated in the http server path compared to last time, so i had to change things to build a large initial heap and do smaller requests (so no more of the "each request creating multiple/100 goroutines"), this way I got 7seconds in between every GC run, each of a few hundred micros STW, but creating app latencies of hundreds of ms as we saw before.

so here's some settings for my own reference

primary-node = false # don't fill up hard drive
GODEBUG: gctrace=1,gcpacertrace=1
retentions = 1s:1d:1min:1441,10s:2d:1min:1441,1min:5d:10min:145,10min:10d:1h:25

date && fakemetrics -carbon-tcp-address localhost:2003 -statsd-addr localhost:8125 -shard-org -keys-per-org 1000 -speedup 100 -stop-at-now -offset 24h && date && echo "now doing realtime" && fakemetrics -carbon-tcp-address localhost:2003 -statsd-addr localhost:8125 -shard-org -keys-per-org 1000

another note, since last time, we switched instrumentation library in metrictank (the api server being tested), it now uses histogram buckets with human-friendly boundaries so if measurements like exactly 800.0ms look suspicious, this is why.
i put some grafana snapshots below. on the 2nd row of charts you can correlate memory usage, GC activity and api request latencies.

now, what you've been waiting for ....

run with stock go

go version devel +be855e3f28 Mon Jun 19 23:26:32 2017 +0000 linux/amd64
https://snapshot.raintank.io/dashboard/snapshot/lA4rKPPTK4ocXo279HRZlO7SWryqvlRM?orgId=2
for gctrace and pacer trace output, mt-logs-before.zip

echo -e "GET http://localhost:6060/render?target=some.id.of.a.metric.1&from=-10s\nX-Org-Id: 1\n\n" | vegeta attack -duration 10m -rate 200 | vegeta report -reporter='hist[0,5ms,10ms,15ms,20ms,25ms,30ms,35ms,40ms,45ms,50ms,55ms,60ms,70ms,75ms,80ms,85ms,90ms,95ms,100ms]'
^CBucket           #      %       Histogram
[0,      5ms]    10941  86.05%  ################################################################
[5ms,    10ms]   79     0.62%   
[10ms,   15ms]   21     0.17%   
[15ms,   20ms]   14     0.11%   
[20ms,   25ms]   13     0.10%   
[25ms,   30ms]   11     0.09%   
[30ms,   35ms]   12     0.09%   
[35ms,   40ms]   13     0.10%   
[40ms,   45ms]   11     0.09%   
[45ms,   50ms]   10     0.08%   
[50ms,   55ms]   11     0.09%   
[55ms,   60ms]   13     0.10%   
[60ms,   70ms]   17     0.13%   
[70ms,   75ms]   12     0.09%   
[75ms,   80ms]   9      0.07%   
[80ms,   85ms]   9      0.07%   
[85ms,   90ms]   8      0.06%   
[90ms,   95ms]   11     0.09%   
[95ms,   100ms]  11     0.09%   
[100ms,  +Inf]   1488   11.70%  ########

run with go patched to remove poolcleanup().

--- a/src/runtime/mgc.go
+++ b/src/runtime/mgc.go
@@ -2045,9 +2045,9 @@ func sync_runtime_registerPoolCleanup(f func()) {

 func clearpools() {
        // clear sync.Pools
-       if poolcleanup != nil {
-               poolcleanup()
-       }
+       //      if poolcleanup != nil {
+       //              poolcleanup()
+       //      }

go version devel +be855e3f28 Mon Jun 19 23:26:32 2017 +0000 linux/amd64
actually I ran the workload (on the same running instance) twice, with a minute in between.
the first one without any specific instrumentation, the other one, with a 60s trace running (while the stresstest was also running)
snapshot of both : https://snapshot.raintank.io/dashboard/snapshot/eLLse46puj9BxM44fvDtIGJT2M3PpT5p?orgId=2
mt logs with gctrace and pacer trace mt-logs-patched.zip
trace file :
trace.bin.zip (spoiler alert : seeing a checkerboard again with lots of MARK ASSIST)

run 1 without tracing.

~/g/s/g/r/m/build ❯❯❯ echo -e "GET http://localhost:6060/render?target=some.id.of.a.metric.1&from=-10s\nX-Org-Id: 1\n\n" | vegeta attack -duration 1m -rate 200 | vegeta report -reporter='hist[0,5ms,10ms,15ms,20ms,25ms,30ms,35ms,40ms,45ms,50ms,55ms,60ms,70ms,75ms,80ms,85ms,90ms,95ms,100ms]'
Bucket           #      %       Histogram
[0,      5ms]    10407  86.72%  #################################################################
[5ms,    10ms]   20     0.17%   
[10ms,   15ms]   19     0.16%   
[15ms,   20ms]   19     0.16%   
[20ms,   25ms]   16     0.13%   
[25ms,   30ms]   12     0.10%   
[30ms,   35ms]   11     0.09%   
[35ms,   40ms]   9      0.07%   
[40ms,   45ms]   10     0.08%   
[45ms,   50ms]   12     0.10%   
[50ms,   55ms]   8      0.07%   
[55ms,   60ms]   10     0.08%   
[60ms,   70ms]   18     0.15%   
[70ms,   75ms]   12     0.10%   
[75ms,   80ms]   8      0.07%   
[80ms,   85ms]   10     0.08%   
[85ms,   90ms]   6      0.05%   
[90ms,   95ms]   13     0.11%   
[95ms,   100ms]  4      0.03%   
[100ms,  +Inf]   1376   11.47%  ########

run 2 with tracing enabled.

echo -e "GET http://localhost:6060/render?target=some.id.of.a.metric.1&from=-10s\nX-Org-Id: 1\n\n" | vegeta attack -duration 1m -rate 200 | vegeta report -reporter='hist[0,5ms,10ms,15ms,20ms,25ms,30ms,35ms,40ms,45ms,50ms,55ms,60ms,70ms,75ms,80ms,85ms,90ms,95ms,100ms]'
Bucket           #      %       Histogram
[0,      5ms]    10510  87.58%  #################################################################
[5ms,    10ms]   61     0.51%   
[10ms,   15ms]   15     0.12%   
[15ms,   20ms]   11     0.09%   
[20ms,   25ms]   14     0.12%   
[25ms,   30ms]   12     0.10%   
[30ms,   35ms]   7      0.06%   
[35ms,   40ms]   11     0.09%   
[40ms,   45ms]   10     0.08%   
[45ms,   50ms]   9      0.07%   
[50ms,   55ms]   9      0.07%   
[55ms,   60ms]   9      0.07%   
[60ms,   70ms]   16     0.13%   
[70ms,   75ms]   10     0.08%   
[75ms,   80ms]   6      0.05%   
[80ms,   85ms]   10     0.08%   
[85ms,   90ms]   9      0.07%   
[90ms,   95ms]   8      0.07%   
[95ms,   100ms]  7      0.06%   
[100ms,  +Inf]   1256   10.47%  #######

let me know if you want me to rerun this experiment and provide anything else.

Dieterbe on 20 Jun 2017

What version of the program did you use for the tests? I'm looking at version raintank/metrictank@edaa819866a9db2a242a749577571bcdb1101a6a.

Looking at the mark assists during the GC at 4.6–5.4 seconds, many of them are for allocations at github.com/raintank/metrictank/vendor/github.com/klauspost/compress/flate.NewWriter:1266. That allocates a flate.Writer, which contains a flate.compressor, which contains large fields such as hashHead [hashSize]uint32 and tokens tokens, which means the flate.Writer's size is around 1 MB.

The gcpacertrace lines show an assist ratio of around 14 to 22, so goroutines making those 1 MB allocations would need to scan 14 to 22 MB of memory. @aclements has said that goroutines working for the GC can scan 1–2 MB per millisecond, which explains why the goroutines making those allocations don't usually finish within 20ms.

@Dieterbe can those allocations be avoided for the sake of testing by sending requests without an Accept-Encoding header (or one that doesn't contain "gzip")? In the long run, can those allocations be pooled?

$ go get github.com/mdempsky/maligned
$ maligned github.com/raintank/metrictank/vendor/github.com/klauspost/compress/flate
.../github.com/raintank/metrictank/vendor/github.com/klauspost/compress/flate/deflate.go:80:17: \
    struct of size 918784 could be 918768
[...]

Aside: is there an open-source command line tool that can print the size of a single struct, specified by name?

rhysh on 20 Jun 2017

What version of the program did you use for the tests?

a slightly out of date version compared to master, 6fb3609ee790ead6cbbaccc59f26d53a2307806b , it was just missing a few minor commits, and also I had a few local non-impactful changes. (let me know if any of this matters, happy to re-run with clean master or whatever)

can those allocations be avoided for the sake of testing by sending requests without an Accept-Encoding header (or one that doesn't contain "gzip")? In the long run, can those allocations be pooled?

I also ran into this (and opened https://github.com/raintank/metrictank/issues/658) but I don't have the right solution yet. I had actually tried with Accept-Encoding: plain which stopped the gzipping but then it was allocating about 3x as much. This looks like a regression in our software that we will need to address, though i was hoping it wouldn't interfere with the troubleshooting of the issue at hand here. does it? especially wrt to the particular test to disable pool cleaning?

Dieterbe on 20 Jun 2017

Use whichever metrictank commit works well for you. I wanted to be sure that I knew what commit was in play so I could correctly interpret the line numbers and call graph.

During the GC cycle, the program is spending a lot of time on GC assists. The arithmetic seems to work out: the assist ratio is around 20, the program is making megabyte-sized allocations, so it needs to do around 20MB of scan work, which takes maybe 20ms.

The logs you shared (logs-patched.txt) show a GC cycle starting every 8 seconds and lasting around 1 second of wall-clock time. When a GC cycle is active, the program seems to do very little other than assist with the collection; most goroutine execution spans have a "MARK ASSIST" span below them for nearly their whole duration.

gc 200 @1104.958s 2%: 0.073+953+0.12 ms clock, 0.58+5425/1905/79+1.0 ms cpu, 2906->2913->1491 MB, 2981 MB goal, 8 P
gc 201 @1111.928s 2%: 0.051+888+0.20 ms clock, 0.41+5066/1776/73+1.6 ms cpu, 2908->2914->1490 MB, 2982 MB goal, 8 P
gc 202 @1118.888s 2%: 0.078+760+0.13 ms clock, 0.63+4330/1520/108+1.0 ms cpu, 2906->2913->1491 MB, 2980 MB goal, 8 P

screen shot 2017-06-20 at 3 00 33 pm

It looks like the program is making a lot of expensive allocations, and maybe that's a bug to be fixed. But the difference between the cost of allocating memory between GC runs and allocating it during a GC run is very large and surprising.

rhysh on 22 Jun 2017

Consider an alternate tuning for the GC pacer, one which effectively has a GC cycle running all the time. With the level of load being handled by the process shown in the execution trace (with GOMAXPROCS=8), one or two GC background mark workers running constantly might generate enough scan credit that the mutator threads wouldn't need to do direct assists; they'd steal it from the background mark workers.

That tuning would likely lead to suboptimal behavior for other programs (more CPU cycles spent on GC, maybe more floating garbage), but I think it shows that the tuning and design of the pacer is on a continuum, and that there might be another point along it that is an acceptable compromise between batch and interactive programs.

Currently the pacer determines the single start time for both the background workers and enabling assists. Could it instead start the background workers (and install the write barriers, etc), but delay enabling the assists until a second memory threshold was reached?

rhysh on 22 Jun 2017

We have a similar problem with our service.
Response latencies for max and p99 are very high. This happens only while GC is running and we think that it's connected to goroutines assisting with GC.

Why?

We see that make() latency jumps when GC is running.
I have disabled GC assist with a small patch and our latencies plummeted.

Chart of an internal latency taken from inside a service. Before and after the patch.
screenshot from 2017-06-23 16-31-57

Chart of an external latency taken from the client. Before and after the patch.
screenshot from 2017-06-23 16-32-44

mkevac on 23 Jun 2017

Goals
1) Limit heap growth. The GC pacer guarantees that allocation can not overrun marking by using assists.
2) Adding time (bytes of allocation) by starting the GC cycle earlier, but without assists, results in allocations quickly eating up the bonus time since the mutator isn't slowed by mark work assists. The GC is back where we started.
3) Write barriers are expensive, the longer they are turned on the less throughput.
4) The GC uses 25% of the CPU when on. If the mutator can't keep up after periodically losing 25% of CPU to the GC then one must either add CPU or reduce mutator requirements. Alternatively it can increase RAM which will decrease the number of GC cycles so the time under pressure is reduced.
5) Theoretically marking can adversely affect the mutator by inducing cache capacity misses though there seems to be no evidence of that here.

I don't see a clear answer yet.

RLH on 23 Jun 2017

I already posted this on golang-nuts, but since it seems like the same problem I will post here. We are still struggling with how to resolve this and would really appreciate any suggestions.

From original post, added slightly more about our hardware:

We are experiencing a problem that I believe may be related to this issue 14812.

Here is what we observe: at Laserlike one of our core user-facing services (the “leaf”) typically responds to a particular rpc in <400ms. During GC we see spikes in latency to >5s on some simple requests. The stop the world periods are short, so the GC spikes appear to happen at other times.

We have been unable to replicate this in a simple program, but we did run our code in a test mode that repros it. In our test environment the server loads ~10 GB of persistent data (never deleted so doesn’t really need to be GCed), and we ask for 8 processors. We are running go version 1.8.3 on kubernetes on GCP machine of type n1-highmem-64. This GCP hardware has 64 virtual CPUs (a hyperthread on a 2.6GHz Xeon E5), with 416GB memory. This is shared among other docker containers, of course, but as the trace file shows, our process uses 8 CPUs (and around 16GB memory). The environment variables of concern are that we set GOMAXPROCS to 8, SetGCPercent to 20, and while generating the profiles we set GODEBUG=gctrace=1,gcpacertrace=1. To create the problem we send the server a single request with >500 tasks..

This google drive folder has leaf-logs-full.redacted.txt as well as other performance tooling files. A snippet from that log here shows normal query responses and timing:

I0719 22:50:22.467367 leaf.go:363] Worker #5 done search for '[redacted]', took 0.013 seconds
I0719 22:50:22.467406 leaf.go:225] Worker #5 starting search for '[redacted]'
I0719 22:50:22.467486 leaf.go:363] Worker #6 done search for '[redacted]', took 0.001 seconds
I0719 22:50:22.467520 leaf.go:225] Worker #6 starting search for '[redacted]'
I0719 22:50:22.468050 leaf.go:363] Worker #9 done search for '[redacted]', took 0.005 seconds

We have observed that if a GC cycle happens to start while serving traffic (which is often) AND there is a large amount of time spent in assist, then our serving latency skyrockets by 10x. In the log the slowdown commences roughly when pacer assist starts at I0719 22:50:31.079283 and then reverts to normal latencies shortly after the gc cycle completes at I0719 22:50:36.806085.

Below I copy parts of the log where we see latencies of up to 729ms on tasks. I also bold the line that shows 32929ms spent on alloc gc assist.

We have captured an attached cpu profile during this time which seems to confirm a large amount of time spent in runtime.gcAssistAlloc.func1.

Pardon our ignorance about GC in golang, but our hypothesis about what may be going wrong is that our large in-memory data structures are causing gc to often go into assist mode, and that for reasons we don’t understand malloc becomes expensive in that mode. Since we also create ~100k new data objects when processing user requests, we are guessing those allocs become very slow. Another piece of evidence for this hypothesis is that we have another (prototype) implementation of this kind of service that makes more use of object pools and doesn’t seem to have as much of slowdown during GC.

Note on data-structures:
The principal data structures can be thought of as:
Map[uint64][]byte (about 10M map entries, the slice lengths between 5K to 50K) (around ~10G total memory usage)
Map[uint64][]uint64 (about 50M map entries, the slice lengths vary between 10 and 100K, in a zipfian distribution, about 3G total memory usage)
These data structures mostly stay as is over the life of the program.

We are trying to figure out how to solve this so would appreciate any advice. An engineer on our team wrote up the following ideas, none of which are that great:
Figure out a simple way to prevent our goroutines slowing down during GC. I had some hopes LockOSThread() could be made to work, but it didn't seem to help in my experiments. I'm not ruling this solution out entirely, but if it's the write barriers that are the main problem, I don't have much hope.
Run at least 2 replicas of all our servers. Manage their GC cycles ourselves, synchronized so that at most one replica is in GC at any given time. The clients should either send all requests to both replicas (and cancel when one replies), or use some more complicated Kubernetes and client logic so a GCing replica is never sent requests. This is the simplest solution that is likely to work, and doesn't require us to change our habits too much. It just costs more. :)
Refactor our current servers and program future servers with the explicit goal of reducing GC burden. Give our servers more memory and increase SetGCPercent() so that Garbage Collection happens less frequently. Use simpler data structures with fewer pointers to reduce the length of time the GC cycle lasts when it does run. This isn't a perfect solution, because the GC will still hit us on occasion, but it reduces the harm.
Throw out much of the convenience of Go and write our own allocators/shared pointers that index into a flat byte buffer, making no use of the GC. (Either do this for particular objects, or just write a general solution.) Make GC cycles so infrequent that we can either turn GC off completely without going OOM, or at least run it only once per day, during off hours. This seems very hard, and we'll be fighting with Go a lot.
Implement the core serving data structures of our service in C++, paired off with another server written in Go that manages populating that server’s data -- the Go server would manage the data in the C++ server via rpc, and query it over rpc.
The nuclear option: decide that Go is simply not a suitable language for our serving stack. Rewrite major servers in C++.

Right now #5 is our (terrible but only feasible-seeming) choice.

Appreciate any advice!
Steve Baker

Logs with gc log lines:

pacer: assist ratio=+3.155887e+001 (scan 2669 MB in 10072->10157 MB) workers=2+0
I0719 22:50:31.097707 leaf.go:363] Worker #3 done search for '[redacted]', took 0.017 seconds
I0719 22:50:31.098036 leaf.go:225] Worker #3 starting search for '[redacted]'
I0719 22:50:31.133959 leaf.go:363] Worker #6 done search for '[redacted]', took 0.069 seconds
I0719 22:50:31.134300 leaf.go:225] Worker #6 starting search for '[redacted]'
I0719 22:50:31.142191 leaf.go:363] Worker #9 done search for '[redacted]', took 0.066 seconds
I0719 22:50:31.142275 leaf.go:225] Worker #9 starting search for '[redacted]'
I0719 22:50:31.173921 leaf.go:363] Worker #10 done search for '[redacted]', took 0.098 seconds
I0719 22:50:31.174009 leaf.go:225] Worker #10 starting search for '[redacted]'
I0719 22:50:31.328796 leaf.go:363] Worker #15 done search for '[redacted]', took 0.260 seconds
I0719 22:50:31.330194 leaf.go:225] Worker #15 starting search for '[redacted]'
I0719 22:50:31.395579 leaf.go:363] Worker #11 done search for '[redacted]', took 0.374 seconds

I0719 22:50:36.317004 leaf.go:363] Worker #5 done search for '[redacted]', took 0.729 seconds
I0719 22:50:36.317190 leaf.go:225] Worker #5 starting search for '[redacted]'
I0719 22:50:36.370191 leaf.go:363] Worker #15 done search for '[redacted]', took 0.192 seconds
I0719 22:50:36.371446 leaf.go:225] Worker #15 starting search for '[redacted]'
I0719 22:50:36.421953 leaf.go:363] Worker #7 done search for '[redacted]', took 0.116 seconds
I0719 22:50:36.422092 leaf.go:225] Worker #7 starting search for '[redacted]'
I0719 22:50:36.570008 leaf.go:363] Worker #12 done search for '[redacted]', took 0.778 seconds
I0719 22:50:36.572970 leaf.go:225] Worker #12 starting search for '[redacted]'
I0719 22:50:36.573571 leaf.go:363] Worker #6 done search for '[redacted]', took 0.710 seconds
I0719 22:50:36.573721 leaf.go:225] Worker #6 starting search for '[redacted]'
pacer: H_m_prev=8875421544 h_t=+1.900000e-001 H_T=10561751637 h_a=+1.983366e-001 H_a=10635742360 h_g=+2.000000e-001 H_g=10650505852 u_a=+9.735779e-001 u_g=+2.500000e-001 W_a=2354135120 goalΔ=+1.000000e-002 actualΔ=+8.336587e-003 u_a/u_g=+3.894312e+000
gc 1072 @4337.737s 39%: 0.37+5688+0.32 ms clock, 2.2+32929/11310/21+1.9 ms cpu, 10072->10143->8661 MB, 10157 MB goal, 8 P

stbaker on 27 Jul 2017

Tossing some more data in here via https://github.com/cockroachdb/cockroach/issues/17243. The TL;DR is that disabling GC assist reduces 99%-tile latencies in one load test by 60%.

The GC uses 25% of the CPU when on.

I have a patch that extends the gctrace=1 messages to report on how much CPU is used during marking. In my tests, while GC CPU usage is relatively low overall (12-15%), CPU usage during marking is in the 55-60% range. If I disable GC assist, CPU usage during marking is limited to 25% as you would expect from the dedicated GC workers.

petermattis on 27 Jul 2017

A quick update: I'm working on a modification to the GC pacer that I believe should substantially reduce assist time. I'm working on a thorough evaluation of it right now to make sure there aren't any surprising consequences and I'm hoping to have a prototype ready for people to try next week.

aclements on 27 Jul 2017

@stbaker, thanks for the detailed information. Would you be able to capture an execution trace using Go 1.9? I wouldn't expect GC to behave very differently in 1.9, but there have been some significant improvements to the tracer itself that would make it much easier to see what's going on.

We have captured an attached cpu profile during this time which seems to confirm a large amount of time spent in runtime.gcAssistAlloc.func1.

I agree. Based on the gctrace and the profile, you're spending a lot of time in assists.

This is almost certainly exacerbated by setting GOGC to 20, which is telling the garbage collector work five times harder than it really wants to. Another interesting thing in your gctrace logs (which is probably also caused by setting GOGC to 20) is that the garbage collector appears to be running nearly all the time. In fact, the huge latency spikes you're seeing might actually be from sweeping, since this has to complete in the time between GC cycles.

Pardon our ignorance about GC in golang, but our hypothesis about what may be going wrong is that our large in-memory data structures are causing gc to often go into assist mode, and that for reasons we don’t understand malloc becomes expensive in that mode.

Assists don't really have to do with the existing data structures. The goal of assists is to ensure the heap doesn't grow too quickly, so the cost of assists is a function of your application's allocation rate. It sounds like your allocation rate is extremely high.

My advice would be to do what you can to increase GOGC closer to the default of 100, and to try to reduce your application's allocation rate (I assume you've already looked at the heap profile, but if not you should definitely check out pprof's -alloc_space and -alloc_objects modes). And hopefully the pacer change I'm working on will help.

aclements on 27 Jul 2017

A quick update: I'm working on a modification to the GC pacer that I believe should substantially reduce assist time. I'm working on a thorough evaluation of it right now to make sure there aren't any surprising consequences and I'm hoping to have a prototype ready for people to try next week.

I'm definitely interested in trying this. Note that I have tested on top of go1.9rc1 and see no difference in GC assist behavior. We have made efforts to reduce the number and size of allocations and doing so provides incremental small benefits, but nothing nearly as significant as the hack to disable GC assist completely.

petermattis on 27 Jul 2017

Peter Mattis,
Is your application the same one @stbaker https://github.com/stbaker is
speaking of or is the application new to this discussion.

On Thu, Jul 27, 2017 at 5:12 PM, Peter Mattis notifications@github.com
wrote:

A quick update: I'm working on a modification to the GC pacer that I
believe should substantially reduce assist time. I'm working on a thorough
evaluation of it right now to make sure there aren't any surprising
consequences and I'm hoping to have a prototype ready for people to try
next week.

I'm definitely interested in trying this. Note that I have tested on top
of go1.9rc1 and see no difference in GC assist behavior. We have made
efforts to reduce the number and size of allocations and doing so provides
incremental small benefits, but nothing nearly as significant as the hack
to disable GC assist completely.

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/golang/go/issues/14812#issuecomment-318487963, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AA7Wn3EXL6Pxl7IOA9gCDC2XB2P77BNUks5sSP1OgaJpZM4HvxjD
.

RLH on 28 Jul 2017

Different application: CockroachDB.

On Fri, Jul 28, 2017 at 7:13 AM Richard L. Hudson notifications@github.com
wrote:

Peter Mattis,
Is your application the same one @stbaker https://github.com/stbaker is
speaking of or is the application new to this discussion.

On Thu, Jul 27, 2017 at 5:12 PM, Peter Mattis notifications@github.com
wrote:

A quick update: I'm working on a modification to the GC pacer that I
believe should substantially reduce assist time. I'm working on a
thorough
evaluation of it right now to make sure there aren't any surprising
consequences and I'm hoping to have a prototype ready for people to try
next week.

I'm definitely interested in trying this. Note that I have tested on top
of go1.9rc1 and see no difference in GC assist behavior. We have made
efforts to reduce the number and size of allocations and doing so
provides
incremental small benefits, but nothing nearly as significant as the hack
to disable GC assist completely.

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/golang/go/issues/14812#issuecomment-318487963, or
mute
the thread
<
https://github.com/notifications/unsubscribe-auth/AA7Wn3EXL6Pxl7IOA9gCDC2XB2P77BNUks5sSP1OgaJpZM4HvxjD

.

—
You are receiving this because you commented.

Reply to this email directly, view it on GitHub
https://github.com/golang/go/issues/14812#issuecomment-318626242, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AF6f98gPRB8ljIjaCUcekOTEeJUXpS90ks5sScJsgaJpZM4HvxjD
.

petermattis on 28 Jul 2017

Just wanted to follow up with the resolution we eventually settled on at Laserlike (@stbaker's company).

To sum up, the one-size-fits-all GC is absolutely not fitting us. I think these are the relevant features of our use case:

High base load of memory, which the non-generational GC must completely scan every cycle.
CPU-intensive background processes that also allocate lots of temporary memory.
CPU-intensive request handling that needs to be responsive.
We're using SetGCPercent(20), but note that the problem does not go away by setting this to 100 (and doubling the high base memory would be a very expensive "fix", anyway).

I believe the _intention_ of how the GC would behave is that it would steal assist time in a balanced way, ramping up gradually as it loses ground, slowing down the program just enough so that it can complete its cycle before the specified memory growth % is reached. Graceful degradation like this would be fine. In _practice_, what is happening is the GC triggers late, panics, and completely usurps all CPU for seconds at a time until it finishes its expensive scan. I don't know enough about how assist works to hazard a guess why, but there is never a gradual ramp up of assist time; it goes from 0 to 95%+ of CPU almost instantly. Golang is proud of its tiny Stop-The-World pauses, but there is very little functional difference between a STW pause and taking over all CPU cycles. From that perspective, our pauses are actually 5s or more, which makes it impossible to function as a performant server.

Thanks to everyone (Marko Kevac, @petermattis, @Dieterbe) who posted with the solution of editing the Go runtime. While we were at it, we actually added several knobs:

A function to turn assist off
A function to adjust gcGoalUtilization of CPU from its default of 0.25
A function to tell when the GC is running (exposing gcphase)

Simply turning assist off wasn't quite enough, as the GC was not able to keep up and we would OOM under load. _Some_ amortized speed degradation was clearly necessary; we just needed it to be spread out rather than hitting us in one massive spike. So we also changed gcGoalUtilization to 0.5 and had the background workers stop whenever they noticed the GC was running. This was enough to completely, dramatically solve our latency issues:

leaf-latency

It's an admirable goal to have a GC that "just works" for everyone and doesn't need tuning, but this is an ideal, and possibly not achievable. Until then, I think it's vital to add some more "knobs" to the GC (and discussions of it need to be more open about the downsides of its design choices - in the top 10 Google results for "golang garbage collection", assist is mentioned exactly once, without definition or elaboration). It hurts the adoption of Go if a scenario like this (that should be solvable) requires editing the Go runtime itself.

dkisman-laserlike on 26 Aug 2017

👍11

@dkisman-laserlike can you please share your patch?

mkevac on 26 Aug 2017

👍1

The code we added to runtime was really really basic; we didn't even bother with atomicity. Partly because importing atomic from runtime causes an import cycle. 😉 But here's our patch (works on Go 1.8.3, not sure about 1.9).
runtime-v3.patch.zip

dkisman-laserlike on 28 Aug 2017

Great data, quantitatively supported proposals are always a pleasure to see.

Algorithmic specific knobs are more likely to lead to legacy issue and
impede future progress than non-algorithmic knobs. GOGC is an example of a
non-algorithmic knob, it makes sense with generational GC as well as other
algorithms such as a simple two space copy scheme or even a traditional
malloc/free approach. While the interpretation of GOGC may vary slightly
between algorithms the general concept is applicable across algorithms.
GOMAXPROCS is another example of an knob that is not algorithm specific.
Different flavors of schedulers, for example work stealing, work dealing,
async preemption, cooperative preemption, and so forth all can use such a
knob. The GC also uses the GOMAXPROCS knob. Furthermore, unlike the OS, the
Go runtime does not know how the hardware resources should be divided
between various hardware processes so it is useful to have a knob that
informs Go about constraints imposed by co-tenants. A useful non-algorithm
specific knob that is in the same spirit is GOMAXHEAP. GOMAXHEAP sets an
upper bound on the amount of RAM available to the GC. Like GOMAXPROCS,
GOMAXHEAP is about HW availability and communicating resource availability.
Knowing the upper bounds allows the GC to balance use of CPU against RAM.
If RAM is plentiful trade RAM for CPU, if RAM is scarce trade CPU for RAM.

This naturally leads into why the GC algorithm is currently optimized as it
is. The GC's use of RAM is based not so much in some software technology
but in the economics of cloud/server hardware. Let's consider how cloud
providers configures instances. One representative cloud configuration
page* has 17 base configurations showing RAM and vCpus, a vCPU is a HW
thread, or a hyperthread in Intel speak. 13 of the configurations are for
more than 1 vCPU and they all allocate 4 GBytes / vCPU. Setting aside the
recent high demand for DDR that has lead to high market prices there seems
no reason to believe that RAM $ / Gigabyte won't continue to keep pace with
or exceed the CPU $ / performance. Your logs indicate the application is
using only 1.5 Gigabyte / HW thread. Pushing this a bit further, the logs
show a GOMAXPROCS of 8 and a live heap of a bit more than 8 Gigabytes.
Using 4 Gigabyte/HW thread it seems reasonable to increase GOGC from 20 to

This should reduce GC frequency by 10x and create a 90 and 99th %tile
graph similar to what you shared while better utilize typical
configurations.

*https://aws.amazon.com/ec2/instance-types/

On Mon, Aug 28, 2017 at 2:14 PM, dkisman-laserlike <[email protected]

wrote:

The code we added to runtime was really really basic; we didn't even
bother with atomicity. Partly because importing atomic from runtime
causes an import cycle. 😉 But here's our patch.
runtime-v3.patch.zip
https://github.com/golang/go/files/1257758/runtime-v3.patch.zip

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/golang/go/issues/14812#issuecomment-325433200, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AA7Wny4yc3_JeyNy1XjpKszX2KNbUjSEks5scwOagaJpZM4HvxjD
.

RLH on 29 Aug 2017

So the recommendation to alleviate this issue is simply to increase GOGC? I can certainly try that, though i'm concerned we'll still have big (probably bigger, since more work will need to be done) latency spikes, just less frequent. I would gladly rather sacrifice more cpu to keep latencies reasonable when GC runs, but it seems like that's not an option?

GOMAXHEAP is about HW availability and communicating resource availability.
Knowing the upper bounds allows the GC to balance use of CPU against RAM.
If RAM is plentiful trade RAM for CPU, if RAM is scarce trade CPU for RAM.

this is interesting, does this actually exist? can't find a reference to GOMAXHEAP in the source repo, on google, or on github

thanks!

Dieterbe on 29 Aug 2017

GOMAXHEAP doesn't exist, at least not yet, but see #16843.

ianlancetaylor on 29 Aug 2017

GOMAXHEAP as a command line flag was just an example. The discussion that
Ian pointed to seems to be a more reasonable and robust API.

GOGC will reduce the frequency of GC but you will see the same degradation
when the GC happens. Put another way it will not help worse case but
99%tile may be helped.

On Tue, Aug 29, 2017 at 1:13 PM, Ian Lance Taylor notifications@github.com
wrote:

GOMAXHEAP doesn't exist, at least not yet, but see #16843
https://github.com/golang/go/issues/16843.

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/golang/go/issues/14812#issuecomment-325731956, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AA7Wn_5tyxmfkBzK0wviqI369O2Qxzv3ks5sdEbAgaJpZM4HvxjD
.

RLH on 29 Aug 2017

Change https://golang.org/cl/59970 mentions this issue: runtime: separate soft and hard heap limits

gopherbot on 29 Aug 2017

Change https://golang.org/cl/59971 mentions this issue: runtime: reduce background mark work to from 25% to 20%

gopherbot on 29 Aug 2017

@dkisman-laserlike and @Dieterbe, would you be able to try out the two CLs I just mailed (https://golang.org/cl/59970 and https://golang.org/cl/59971)? I believe these should significantly reduce mutator latency variance from assists.

I'm still working on an analysis of the effects of these, and it may be necessary to do some other work before these go in (like optimizing the write barrier and possibly reducing the preemption bound for fractional workers), but any evaluation results you can give me would be very appreciated.

aclements on 29 Aug 2017

@aclements I have a different application from @dkisman-laserlike and @Dieterbe, but am also sensitive to the GC assist. I tested go1.9 with your 2 patches and unpatched go1.9 and against go1.8.3 with GC assist disabled:

| | go1.9 | go1.9+59970+59971 | go1.8.3+no-gc-assist |
| --- | --- | --- | --- |
| 99%-tile latency (ms) | 50.3 | 48.2 | 30.4 |

I can gather more data if desired.

PS I tested go1.8.3+no-gc-assist because I had that patch sitting around. I can test that on go1.9, but I'd expect the result to be similar.

petermattis on 30 Aug 2017

The goal of those two fixes was to minimize, if not eliminate, GC assists
when the program was well provisioned and in a steady state. Are the
gctrace lines basically the same between the different runs?

Is there any easy way to figure out when the high latency typically hits?
Is it immediately before the end of the GC or is it spread out during the
entire GC? If it is near or at the end of the GC does it only happen when
the GC overruns the goal?

On Wed, Aug 30, 2017 at 4:31 PM, Peter Mattis notifications@github.com
wrote:

@aclements https://github.com/aclements I have a different application
from @dkisman-laserlike https://github.com/dkisman-laserlike and
@Dieterbe https://github.com/dieterbe, but am also sensitive to the GC
assist. I tested go1.9 with your 2 patches and unpatched go1.9 and against
go1.8.3 with GC assist disabled:
go1.9 go1.9+59970+59971 go1.8.3+no-gc-assist
99%-tile latency (ms) 50.3 48.2 30.4

I can gather more data if desired.

PS I tested go1.8.3+no-gc-assist because I had that patch sitting around.
I can test that on go1.9, but I'd expect the result to be similar.

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/golang/go/issues/14812#issuecomment-326109858, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AA7Wn8Vhy_Bw5wLDjJ5ai-3jbi5QVD27ks5sdcaFgaJpZM4HvxjD
.

RLH on 31 Aug 2017

The goal of those two fixes was to minimize, if not eliminate, GC assists
when the program was well provisioned and in a steady state. Are the
gctrace lines basically the same between the different runs?

I didn't have gctrace turned on for these tests. I'll do so and report back. Is there anything else you'd like to see?

Is there any easy way to figure out when the high latency typically hits?
Is it immediately before the end of the GC or is it spread out during the
entire GC? If it is near or at the end of the GC does it only happen when
the GC overruns the goal?

Well, traces can give insight, though I'd hardly classify inspecting them as easy, especially when there are thousands of operations per second and you're trying to understand the high latency for a handful of the operations. Note that the numbers in my comment above were from tests on a 6-node cluster. I'll try reproducing on a single-node cluster as well in order to simplify the problem.

One bit of debugging info that I've thought about adding is tracking the amount of GC assist time each goroutine performs. Take a snapshot of that value before starting the operation and at the end of the operation and log high latency operations along with the amount of time the goroutine spent in GC assist. One complexity is that an operation sometimes has multiple goroutines executing work on its behalf, but I think that is mostly avoided in the single-node setup.

petermattis on 31 Aug 2017

Change https://golang.org/cl/60790 mentions this issue: cmd/trace: add minimum mutator utilization (MMU) plot

gopherbot on 1 Sep 2017

Ok, @aclements, I tried running our leaf servers under moderate load for a reasonable amount of time (to achieve steady state), using three different binaries:

screen shot 2017-09-01 at 11 12 37 am

The large spikes are during server startup and can be ignored. The first long section is vanilla Go 1.8.3. The second section is our hacked Go 1.8.3 with GC assist turned off, etc. The third section is Go 1.8.3 with your patch (adapted to 1.8.3; we haven't upgraded to 1.9 yet).

The hack clearly wins, but your patch seems to have improved things slightly - except for the weird spikiness at the end, which I unfortunately can't explain. It may be something on our end, unrelated to the test; I'm just including it in case it's important.

It's not quite as "scientific", but I also manually tried some large requests known to trigger the GC misbehaviour, and saw 10x variance (0.4s to 4s) in response times for the patched and unpatched GCs, compared to 0.4s to 1s for the hacked GC. So I'm afraid the worst case behaviour is definitely still happening.

Here are the gctrace and gcpacertrace logs for you, in case you feel like combing through data. Each line has the prefix "[gctype]-[shard#]", where gctype is "unpatched", "patched", or "hacked" for the three tests, and shard# is 0-3 since our server is sharded 4 ways.
gclogs.zip

EDIT: Hmm, unfortunately the "patched" logs got truncated when whatever happened at 7:00 AM happened. I'll try running again and get some better "patched" logs.
EDIT 2: The spikiness was an unrelated issue. The current logs should work, though the loads differ slightly.

dkisman-laserlike on 1 Sep 2017

@RLH Here are gctrace and gcpacertrace logs:

go1.9.txt
go1.9+59970+59971.txt
go1.9-no-gc-assist.txt

The gctrace lines include a patch which computes cpu usage during mark (<gc-%cpu>/<mark-%cpu>) . #59970 and #59971 seem to have reduced assist time but not eliminated it. I believe the system to be well provisioned and not running at capacity. From the go1.9 run:

gc 500 @226.410s 6/60%: 0.25+67+0.30 ms clock, 1.7+192/133/0+2.1 ms cpu, 306->311->158 MB, 317 MB goal, 8 P

From the go1.9+59970+59971 run:

gc 500 @253.473s 4/52%: 0.30+69+0.20 ms clock, 2.4+106/114/71+1.6 ms cpu, 317->326->164 MB, 332 MB goal, 8 P

go1.9-no-gc-assist:

gc 500 @262.197s 4/40%: 0.42+89+0.21 ms clock, 3.4+0/178/109+1.6 ms cpu, 325->336->165 MB, 341 MB goal, 8 P

petermattis on 2 Sep 2017

My reading of the second trace line, is that the recent CLs are not
accomplishing their objectives. The GC starts with a heap of 317 and should
spread work out over 15MB of allocations arriving at the 332 MB goal just
as the GC work is finished. Work finishes early at 326MB or 9MB into the
15MB allocation estimate. This is earlier than expected. The GC starts at
the right time but for some reason we are seeing assists that do not need
to be there.

Not sure what is going on. Sorry if I already asked but do you have a
simple reproducer?

On Sat, Sep 2, 2017 at 8:21 AM, Peter Mattis notifications@github.com
wrote:

@RLH https://github.com/rlh Here are gctrace and gcpacertrace logs:

go1.9.txt https://github.com/golang/go/files/1272149/go1.9.txt
go1.9+59970+59971.txt
https://github.com/golang/go/files/1272152/go1.9.59970.59971.txt
go1.9-no-gc-assist.txt
https://github.com/golang/go/files/1272151/go1.9-no-gc-assist.txt

The gctrace lines include a patch which computes cpu usage during mark (
/) . #59970 and #59971 seem to have reduced assist
time but not eliminated it. I believe the system to be well provisioned and
not running at capacity. From the go1.9 run:

gc 500 @226.410s 6/60%: 0.25+67+0.30 ms clock, 1.7+192/133/0+2.1 ms cpu,
306->311->158 MB, 317 MB goal, 8 P

From the go1.9+59970+59971 run:

gc 500 @253.473s 4/52%: 0.30+69+0.20 ms clock, 2.4+106/114/71+1.6 ms cpu,
317->326->164 MB, 332 MB goal, 8 P

go1.9-no-gc-assist:

gc 500 @262.197s 4/40%: 0.42+89+0.21 ms clock, 3.4+0/178/109+1.6 ms cpu,
325->336->165 MB, 341 MB goal, 8 P

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/golang/go/issues/14812#issuecomment-326740992, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AA7Wnzcjne9lYE3eBkoJERS5Vti0QrnCks5seUhngaJpZM4HvxjD
.

RLH on 5 Sep 2017

Sorry if I already asked but do you have a simple reproducer?

As in a simple reproduction program? No. I can reproduce the above traces fairly easily, though. And the code is open source so I could write up reproduction instructions if that would be helpful.

petermattis on 5 Sep 2017

Instruction on how to reproduce seem the best way forward. Thanks,

On Tue, Sep 5, 2017 at 12:26 PM, Peter Mattis notifications@github.com
wrote:

Sorry if I already asked but do you have a simple reproducer?

As in a simple reproduction program? No. I can reproduce the above traces
fairly easily, though. And the code is open source
https://github.com/cockroachdb/cockroach so I could write up
reproduction instructions if that would be helpful.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/golang/go/issues/14812#issuecomment-327229717, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AA7Wn4B_3SmXTifEtg6ce0YcntM9lA9dks5sfXY3gaJpZM4HvxjD
.

RLH on 5 Sep 2017

@RLH Here you go. See https://github.com/cockroachdb/cockroach/blob/master/CONTRIBUTING.md:

~ go get -d github.com/cockroachdb/cockroach
~ cd $GOPATH/src/github.com/cockroachdb/cockroach
~ make build

You might run into problems with dependencies such as the C compiler and CMake. Happy to help you work through those. Install the load generator with:

~ go get github.com/cockroachdb/loadgen/kv

Run cockroach:

~ rm -fr cockroach-data && GODEBUG=gctrace=1 ./cockroach start --insecure --logtostderr --log-dir=

Run the load generator:

~ kv --read-percent 95 --splits 100 --concurrency 32 'postgres://root@localhost:26257?sslmode=disable'

Run for a few minutes and the heap size will grow to 200-300MB and you should see:

gc 987 @18446554934.453s 0%: 0.086+65+0.13 ms clock, 0.68+184/130/0+1.0 ms cpu, 261->267->137 MB, 274 MB goal, 8 P

When I gathered the gctrace logs above, I was running cockroach and kv on different machines to ensure there was no interaction between the processes.

petermattis on 5 Sep 2017

Some recommended compiler versions would be helpful.
Gcc 4.8.4 is the latest I have easy access to, and clang 3.9 complains that

/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/util/arena.cc:37:31: error: 'alignof' applied to an expression is a GNU extension [-Werror,-Wgnu-alignof-expression]
static const int kAlignUnit = alignof(std::max_align_t);
                              ^
/usr/local/google/home/drchase/work/gocode/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/util/arena.cc:37:44: error: no member named 'max_align_t' in namespace 'std'
static const int kAlignUnit = alignof(std::max_align_t);
                                      ~~~~~^

dr2chase on 5 Sep 2017

@dr2chase Yeah, we require A C++ compiler that supports C++11. Note that GCC prior to 6.0 doesn't work. For clang, I'm using LLVM version 8.1.0.

petermattis on 5 Sep 2017

Version requirements are listed here: https://github.com/cockroachdb/cockroach/blob/master/CONTRIBUTING.md#getting-and-building

Alternatively, if you have Docker on your machine, you can replace the make build step with ./build/builder.sh make build

a-robinson on 5 Sep 2017

Where do I find an 8.1 version of LLVM? I don't see it here: http://releases.llvm.org/

dr2chase on 5 Sep 2017

@dr2chase Heh, that must be some sort of Apple-ism as that is what clang -v reported on my Mac. The Docker builder image @a-robinson mention has clang version 3.8.0-2ubuntu4 (tags/RELEASE_380/final) installed.

petermattis on 6 Sep 2017

This appears to be some sort of library mismatch nonsense.

dr2chase on 6 Sep 2017

It does appear that clang-3.9 is using the GCC libs.

rlh@rlh:~/work/go/src/runtime$ clang-3.9 -v
clang version 3.9.1-4ubuntu3~14.04.2 (tags/RELEASE_391/rc2)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/i686-linux-gnu/4.8
Found candidate GCC installation: /usr/bin/../lib/gcc/i686-linux-gnu/4.8.4
Found candidate GCC installation: /usr/bin/../lib/gcc/i686-linux-gnu/4.9
Found candidate GCC installation: /usr/bin/../lib/gcc/i686-linux-gnu/4.9.3
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/4.7
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/4.7.3
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8.4
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/4.9
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/4.9.3
Found candidate GCC installation: /usr/lib/gcc/i686-linux-gnu/4.8
Found candidate GCC installation: /usr/lib/gcc/i686-linux-gnu/4.8.4
Found candidate GCC installation: /usr/lib/gcc/i686-linux-gnu/4.9
Found candidate GCC installation: /usr/lib/gcc/i686-linux-gnu/4.9.3
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/4.7
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/4.7.3
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/4.8
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/4.8.4
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/4.9
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/4.9.3
Selected GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8

On Tue, Sep 5, 2017 at 6:18 PM, dr2chase notifications@github.com wrote:

This appears to be some sort of library mismatch nonsense.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/golang/go/issues/14812#issuecomment-327319560, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AA7Wn_CjFN3_xnHozLLLyCCHu874un-dks5sfcifgaJpZM4HvxjD
.

RLH on 6 Sep 2017

After hacking of arena.cc to hard coding an alignment value I now have this
running on the pre-11 libs. I should be able to debug from here.
Thanks for the clues.

On Tue, Sep 5, 2017 at 6:41 PM, Rick Hudson rlh@golang.org wrote:

It does appear that clang-3.9 is using the GCC libs.

rlh@rlh:~/work/go/src/runtime$ clang-3.9 -v
clang version 3.9.1-4ubuntu3~14.04.2 (tags/RELEASE_391/rc2)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/i686-linux-gnu/4.8
Found candidate GCC installation: /usr/bin/../lib/gcc/i686-linux-gnu/4.8.4
Found candidate GCC installation: /usr/bin/../lib/gcc/i686-linux-gnu/4.9
Found candidate GCC installation: /usr/bin/../lib/gcc/i686-linux-gnu/4.9.3
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/4.7
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-
linux-gnu/4.7.3
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-
linux-gnu/4.8.4
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/4.9
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-
linux-gnu/4.9.3
Found candidate GCC installation: /usr/lib/gcc/i686-linux-gnu/4.8
Found candidate GCC installation: /usr/lib/gcc/i686-linux-gnu/4.8.4
Found candidate GCC installation: /usr/lib/gcc/i686-linux-gnu/4.9
Found candidate GCC installation: /usr/lib/gcc/i686-linux-gnu/4.9.3
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/4.7
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/4.7.3
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/4.8
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/4.8.4
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/4.9
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/4.9.3
Selected GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8

On Tue, Sep 5, 2017 at 6:18 PM, dr2chase notifications@github.com wrote:

This appears to be some sort of library mismatch nonsense.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/golang/go/issues/14812#issuecomment-327319560, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AA7Wn_CjFN3_xnHozLLLyCCHu874un-dks5sfcifgaJpZM4HvxjD
.

RLH on 6 Sep 2017

Change https://golang.org/cl/24704 mentions this issue: runtime: over-assist in terms of scan work instead of allocation bytes

gopherbot on 6 Nov 2017

Change https://golang.org/cl/24706 mentions this issue: runtime: shorten mark worker preemption to 1ms

gopherbot on 6 Nov 2017

Change https://golang.org/cl/24705 mentions this issue: runtime: adjust trigger based only on CPU utilization

gopherbot on 6 Nov 2017

Change https://golang.org/cl/24708 mentions this issue: runtime: essentially disable dedicated and fractional mark workers

gopherbot on 6 Nov 2017

Change https://golang.org/cl/24707 mentions this issue: runtime: add a control to disable the idle worker

gopherbot on 6 Nov 2017

Change https://golang.org/cl/24703 mentions this issue: runtime: flush assist credit on goroutine exit

gopherbot on 6 Nov 2017

Gopherbot probably shouldn't post in response to old CLs that are being abandoned. :) Only the last of these (CL 24703) is real, and it's just a minor fix.

It would be great if everyone could try current master and report their findings. The fixes for the pacer that everyone tried earlier are now in, along with several other latency improvements.

@dkisman-laserlike, I took another look at the traces you posted (thanks!), and noticed something deeper this time that I think explains why the earlier patches didn't help you (and you probably won't see much improvement on master). I think what's going on is that your application's idle time varies significantly between GC cycles. Since GC scheduling is based on a feedback system, it "gets used to" being able to soak up the application's idle time for a few cycles. Then it hits a cycle that doesn't have idle time, but realizes too late that it started too late and makes up for the loss by having a lot of assists. It would adjust to this over the next few cycles, but then it starts having idle time available again. If it's not too annoying to run your test, you can disable idle GC workers by disabling the two if statements in findrunnable in runtime/proc.go that mention gcBlackenEnabled != 0. I'd be interested in the results.

aclements on 6 Nov 2017

I'll give go-tip a whirl on our workload later this week.

petermattis on 7 Nov 2017

I don't have precisely the same machine setup as previously, so the numbers below differ from ones I've reported earlier. Before running the tests below, I let my load generator run as fast as possible to see the capacity of the system. Then I limited the rate of the load generator to ~50% of the max. Here are the numbers for a 3-node cockroach cluster compiled with go1.9 vs go-tip (devel +33c246f Tue Nov 7 14:18:00 2017 +0000):

go1.9:
_____N_____ops/sec__avg(ms)__p50(ms)__p95(ms)__p99(ms)
   192     14760.2      2.4      0.8     13.6     37.7

go-tip:
_____N_____ops/sec__avg(ms)__p50(ms)__p95(ms)__p99(ms)
   192     14998.5      1.1      0.8      2.9      8.9

Big thumbs up on the reduction in 95%-tile and 99%-tile latencies. For reference, if I don't rate limit the load generator I see:

go1.9:
_____N_____ops/sec__avg(ms)__p50(ms)__p95(ms)__p99(ms)
   192     27537.1      7.0      2.4     26.2     52.4

go-tip:
_____N_____ops/sec__avg(ms)__p50(ms)__p95(ms)__p99(ms)
   192     27368.3      7.0      2.8     27.3     41.9

From the rate limited runs, here are a few lines of gctrace output from go1.9 compiled binary:

gc 299 @270.168s 3%: 0.28+94+0.13 ms clock, 1.9+226/188/0+0.95 ms cpu, 399->410->209 MB, 420 MB goal, 8 P
gc 300 @271.634s 3%: 0.26+108+0.10 ms clock, 2.1+197/216/0+0.83 ms cpu, 398->409->211 MB, 419 MB goal, 8 P
gc 301 @273.202s 3%: 0.21+104+0.11 ms clock, 1.7+196/207/0+0.93 ms cpu, 401->411->210 MB, 422 MB goal, 8 P

And from go-tip:

gc 285 @270.125s 2%: 0.041+120+0.14 ms clock, 0.33+151/231/48+1.1 ms cpu, 417->436->220 MB, 435 MB goal, 8 P
gc 286 @271.713s 2%: 0.091+117+0.16 ms clock, 0.73+19/234/139+1.3 ms cpu, 419->434->216 MB, 441 MB goal, 8 P
gc 287 @273.322s 2%: 0.061+120+0.12 ms clock, 0.48+115/240/74+0.96 ms cpu, 415->430->216 MB, 433 MB goal, 8 P

petermattis on 7 Nov 2017

👍1

@petermattis, thanks for re-running! The first set of numbers looks great. The second set might just be showing the natural queuing delays of a fully loaded system.

Your gctrace is interesting. Just based on those three lines, I suspect you're encountering to some degree the same problem with idle fluctuation that I hypothesized in https://github.com/golang/go/issues/14812#issuecomment-342266846. Can you paste a few more lines of gctrace? A dozen is probably enough.

aclements on 7 Nov 2017

Yeah, I'm not concerned about the second set of numbers. I included those to very that top-end performance wasn't suffering.

Would it be worthwhile to try disabling idle GC workers?

go1.9 gctrace:

gc 290 @256.195s 3%: 0.25+100+0.10 ms clock, 2.0+208/199/0+0.84 ms cpu, 400->410->210 MB, 420 MB goal, 8 P
gc 291 @257.727s 3%: 0.24+104+0.21 ms clock, 1.9+212/207/0+1.7 ms cpu, 400->410->210 MB, 420 MB goal, 8 P
gc 292 @259.293s 3%: 0.25+104+0.11 ms clock, 2.0+191/208/0.24+0.93 ms cpu, 401->411->210 MB, 421 MB goal, 8 P
gc 293 @260.812s 3%: 0.29+106+0.25 ms clock, 2.0+209/202/0+1.7 ms cpu, 401->411->210 MB, 421 MB goal, 8 P
gc 294 @262.345s 3%: 0.078+108+0.18 ms clock, 0.63+214/217/13+1.4 ms cpu, 401->411->211 MB, 420 MB goal, 8 P
gc 295 @263.929s 3%: 0.14+103+3.2 ms clock, 1.0+214/206/0+22 ms cpu, 402->412->210 MB, 422 MB goal, 8 P
gc 296 @265.492s 3%: 0.26+103+0.11 ms clock, 2.1+205/207/0+0.94 ms cpu, 400->410->210 MB, 420 MB goal, 8 P
gc 297 @267.053s 3%: 0.16+103+0.10 ms clock, 1.3+211/204/0+0.86 ms cpu, 401->411->210 MB, 421 MB goal, 8 P
gc 298 @268.609s 3%: 0.18+102+0.19 ms clock, 1.5+216/204/0+1.5 ms cpu, 400->410->210 MB, 420 MB goal, 8 P
gc 299 @270.168s 3%: 0.28+94+0.13 ms clock, 1.9+226/188/0+0.95 ms cpu, 399->410->209 MB, 420 MB goal, 8 P
gc 300 @271.634s 3%: 0.26+108+0.10 ms clock, 2.1+197/216/0+0.83 ms cpu, 398->409->211 MB, 419 MB goal, 8 P
gc 301 @273.202s 3%: 0.21+104+0.11 ms clock, 1.7+196/207/0+0.93 ms cpu, 401->411->210 MB, 422 MB goal, 8 P

go-tip gctrace:

gc 276 @255.549s 2%: 0.041+118+0.16 ms clock, 0.32+116/236/73+1.3 ms cpu, 416->431->216 MB, 434 MB goal, 8 P
gc 277 @257.165s 2%: 0.038+118+0.14 ms clock, 0.30+88/234/86+1.1 ms cpu, 414->429->216 MB, 432 MB goal, 8 P
gc 278 @258.793s 2%: 0.075+125+0.096 ms clock, 0.60+111/248/57+0.77 ms cpu, 415->431->217 MB, 433 MB goal, 8 P
gc 279 @260.381s 2%: 0.043+119+0.17 ms clock, 0.34+92/237/81+1.4 ms cpu, 416->431->216 MB, 435 MB goal, 8 P
gc 280 @261.956s 2%: 0.14+127+0.14 ms clock, 1.1+111/253/57+1.1 ms cpu, 414->431->217 MB, 433 MB goal, 8 P
gc 281 @263.588s 2%: 0.049+132+0.12 ms clock, 0.39+109/264/59+1.0 ms cpu, 416->433->218 MB, 435 MB goal, 8 P
gc 282 @265.224s 2%: 0.051+126+0.14 ms clock, 0.40+76/251/98+1.1 ms cpu, 417->433->217 MB, 437 MB goal, 8 P
gc 283 @266.856s 2%: 0.074+125+0.092 ms clock, 0.59+107/247/59+0.74 ms cpu, 417->432->217 MB, 435 MB goal, 8 P
gc 284 @268.487s 2%: 0.075+128+0.12 ms clock, 0.60+92/255/55+1.0 ms cpu, 417->432->217 MB, 435 MB goal, 8 P
gc 285 @270.125s 2%: 0.041+120+0.14 ms clock, 0.33+151/231/48+1.1 ms cpu, 417->436->220 MB, 435 MB goal, 8 P
gc 286 @271.713s 2%: 0.091+117+0.16 ms clock, 0.73+19/234/139+1.3 ms cpu, 419->434->216 MB, 441 MB goal, 8 P
gc 287 @273.322s 2%: 0.061+120+0.12 ms clock, 0.48+115/240/74+0.96 ms cpu, 415->430->216 MB, 433 MB goal, 8 P

petermattis on 7 Nov 2017

Would it be worthwhile to try disabling idle GC workers?

There isn't as much idle variance in the longer gctrace, but if it's not too much trouble, I would love to know what effect disabling idle GC has and what the trace looks like. Most of the cycles in your longer trace still have more assist time than they're supposed to, which I suspect is related to idle GC.

BTW, this patch should disable idle GC:

diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index 112543db10..febb676a07 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -2290,7 +2290,7 @@ stop:
        // We have nothing to do. If we're in the GC mark phase, can
        // safely scan and blacken objects, and have work to do, run
        // idle-time marking rather than give up the P.
-       if gcBlackenEnabled != 0 && _p_.gcBgMarkWorker != 0 && gcMarkWorkAvailable(_p_) {
+       if false && gcBlackenEnabled != 0 && _p_.gcBgMarkWorker != 0 && gcMarkWorkAvailable(_p_) {
                _p_.gcMarkWorkerMode = gcMarkWorkerIdleMode
                gp := _p_.gcBgMarkWorker.ptr()
                casgstatus(gp, _Gwaiting, _Grunnable)
@@ -2357,7 +2357,7 @@ stop:
        }

        // Check for idle-priority GC work again.
-       if gcBlackenEnabled != 0 && gcMarkWorkAvailable(nil) {
+       if false && gcBlackenEnabled != 0 && gcMarkWorkAvailable(nil) {
                lock(&sched.lock)
                _p_ = pidleget()
                if _p_ != nil && _p_.gcBgMarkWorker == 0 {

aclements on 8 Nov 2017

Disabling GC idle didn't seem to affect latencies. The run below shows slightly better 99%-tile latency and slightly worse 95%-tile latency vs go-tip, but I think this is noise.

go-tip+no-gc-idle:
_____N_____ops/sec__avg(ms)__p50(ms)__p95(ms)__p99(ms)
   192     14999.0      1.1      0.8      3.1      8.4

go-tip:
_____N_____ops/sec__avg(ms)__p50(ms)__p95(ms)__p99(ms)
   192     14998.5      1.1      0.8      2.9      8.9

gctrace output shows no GC idle time as expected and somewhat lower assist times, though I thought the recent GC changes were intended to eliminate GC assist in most cases.

gc 266 @255.887s 3%: 0.084+155+0.16 ms clock, 0.67+82/309/0+1.3 ms cpu, 421->441->221 MB, 442 MB goal, 8 P
gc 267 @257.543s 3%: 0.086+147+0.11 ms clock, 0.69+86/295/0+0.91 ms cpu, 422->441->220 MB, 443 MB goal, 8 P
gc 268 @259.179s 3%: 0.081+150+0.095 ms clock, 0.65+95/300/0+0.76 ms cpu, 420->440->220 MB, 441 MB goal, 8 P
gc 269 @260.741s 3%: 0.16+154+0.10 ms clock, 1.3+91/308/0+0.83 ms cpu, 421->441->221 MB, 441 MB goal, 8 P
gc 270 @262.389s 3%: 0.070+149+0.093 ms clock, 0.56+81/299/0+0.74 ms cpu, 422->441->221 MB, 443 MB goal, 8 P
gc 271 @264.034s 3%: 0.031+148+0.10 ms clock, 0.25+88/296/0+0.80 ms cpu, 421->440->220 MB, 442 MB goal, 8 P
gc 272 @265.676s 3%: 0.081+145+0.094 ms clock, 0.65+88/289/0+0.75 ms cpu, 420->439->220 MB, 440 MB goal, 8 P
gc 273 @267.307s 3%: 0.20+153+0.10 ms clock, 1.6+95/307/0+0.83 ms cpu, 420->440->221 MB, 440 MB goal, 8 P
gc 274 @268.949s 3%: 0.11+154+0.14 ms clock, 0.91+92/308/0+1.1 ms cpu, 421->440->221 MB, 442 MB goal, 8 P
gc 275 @270.515s 3%: 0.055+154+0.097 ms clock, 0.44+82/306/0+0.78 ms cpu, 421->441->221 MB, 442 MB goal, 8 P
gc 276 @272.171s 3%: 0.12+152+0.16 ms clock, 1.0+98/303/0+1.3 ms cpu, 422->441->221 MB, 443 MB goal, 8 P
gc 277 @273.820s 3%: 0.052+157+0.11 ms clock, 0.42+89/314/0+0.94 ms cpu, 421->441->221 MB, 442 MB goal, 8 P

If I also disable GC assist (small patch to src/runtime/malloc.go) I see:

go-tip+no-gc-idle+no-gc-assist:
_____N_____ops/sec__avg(ms)__p50(ms)__p95(ms)__p99(ms)
   192     15000.1      0.9      0.8      2.2      3.9

And the corresponding gctrace:

gc 250 @258.759s 3%: 0.031+187+0.088 ms clock, 0.24+0/375/0+0.70 ms cpu, 431->456->227 MB, 451 MB goal, 8 P
gc 251 @260.447s 3%: 0.058+184+0.20 ms clock, 0.46+0/367/0+1.6 ms cpu, 433->458->226 MB, 454 MB goal, 8 P
gc 252 @262.137s 3%: 0.064+181+0.24 ms clock, 0.51+0/353/0+1.9 ms cpu, 432->455->225 MB, 452 MB goal, 8 P
gc 253 @263.857s 3%: 0.093+188+0.070 ms clock, 0.75+0/376/0+0.56 ms cpu, 430->455->227 MB, 450 MB goal, 8 P
gc 254 @265.592s 3%: 0.11+175+0.12 ms clock, 0.90+0/350/0+0.97 ms cpu, 433->456->224 MB, 454 MB goal, 8 P
gc 255 @267.308s 3%: 0.047+175+0.12 ms clock, 0.37+0/350/0+0.96 ms cpu, 429->453->225 MB, 449 MB goal, 8 P
gc 256 @269.018s 2%: 0.036+180+0.17 ms clock, 0.28+0/360/0+1.3 ms cpu, 430->454->225 MB, 450 MB goal, 8 P
gc 257 @270.690s 2%: 0.068+185+0.11 ms clock, 0.54+0/369/0+0.91 ms cpu, 430->455->225 MB, 450 MB goal, 8 P
gc 258 @272.374s 2%: 0.074+193+0.12 ms clock, 0.52+0/386/0+0.86 ms cpu, 431->457->227 MB, 451 MB goal, 8 P
gc 259 @274.111s 2%: 0.11+184+0.25 ms clock, 0.94+0/367/0+2.0 ms cpu, 433->457->225 MB, 454 MB goal, 8 P

petermattis on 8 Nov 2017

Hi,

I posted on golang-nuts but was pointed to this issue. I've also got a service that's encountering long request latencies due to CPU starvation when the GC runs. Like a couple others have mentioned, I think the GC may act poorly on long linked lists - I have a 1M-long list/map used as an LRU cache.

Here's one trace captured from the service. I think most look pretty similar.
GC forced
pacer: assist ratio=+8.611156e-002 (scan 960 MB in 15690->21268 MB) workers=1++0.000000e+000
pacer: H_m_prev=11151043456 h_t=+9.500000e-001 H_T=21744534739 h_a=+4.765627e-001 H_a=16465215016 h_g=+1.000000e+000 H_g=22302086912 u_a=+9.186874e-001 u_g=+3.000000e-001 W_a=347113784 goalΔ=+5.000000e-002 actualΔ=-4.734373e-001 u_a/u_g=+3.062291e+000
gc 36 @1197.776s 1%: 0.59+2210+0.29 ms clock, 2.3+5915/2192/0+1.1 ms cpu, 15690->15702->10671 MB, 21268 MB goal, 4 P
scvg7: inuse: 15834, idle: 107, sys: 15942, released: 0, consumed: 15942 (MB)
pacer: sweep done at heap size 10851MB; allocated 179MB during sweep; swept 2026864 pages at +1.906868e-004 pages/byte

This is on Go 1.10.3 with 4 cores. The allocation is slow enough compared to the heap size that the GC runs compulsorily every 120 seconds, but changing GOGC to make it run more often doesn't help the CPU issue, just makes the spikes more frequent.

Are there any fixes for the latency spikes on Go's roadmap?

Edit: Also, I don't understand why the GC runs this concurrent phase in such a short duration. If the GCs are spaced 120 seconds apart, why use just ~2.5s of that doing GC? Services typically don't like spikey load, so it seems silly to burst all the GC at once.

brian-armstrong-discord on 9 Aug 2018

A couple of questions.
How did you determine that the GC was correlated with (or cause) the
latency problems? Perhaps you have a trace you could share that shows the
correlations.

Do you see the problem with tip?

Since you suspect long lists, have you experimented with other data
structures to see if the problem is resolved.

Have you reduced this to something that we can duplicate locally on tip?

Do you have a work around and if so could you share your approach?

As for the roadmap, tip (and 1.11) holds a number of changes to the GC as
we will continue to improve GC latency.

Thanks for whatever information you can provide.

On Wed, Aug 8, 2018 at 7:44 PM, brian-armstrong-discord <
[email protected]> wrote:

Hi,

I posted on golang-nuts but was pointed to this issue. I've also got a
service that's encountering long request latencies due to CPU starvation
when the GC runs. Like a couple others have mentioned, I think the GC may
act poorly on long linked lists - I have a 1M-long list/map used as an LRU
cache.

Here's one trace captured from the service. I think most look pretty
similar.
GC forced
pacer: assist ratio=+8.611156e-002 (scan 960 MB in 15690->21268 MB)
workers=1++0.000000e+000
pacer: H_m_prev=11151043456 h_t=+9.500000e-001 H_T=21744534739
h_a=+4.765627e-001 H_a=16465215016 h_g=+1.000000e+000 H_g=22302086912
u_a=+9.186874e-001 u_g=+3.000000e-001 W_a=347113784 goalΔ=+5.000000e-002
actualΔ=-4.734373e-001 u_a/u_g=+3.062291e+000
gc 36 @1197.776s 1%: 0.59+2210+0.29 ms clock, 2.3+5915/2192/0+1.1 ms cpu,
15690->15702->10671 MB, 21268 MB goal, 4 P
scvg7: inuse: 15834, idle: 107, sys: 15942, released: 0, consumed: 15942
(MB)
pacer: sweep done at heap size 10851MB; allocated 179MB during sweep;
swept 2026864 pages at +1.906868e-004 pages/byte

This is on Go 1.10.3 with 4 cores. The allocation is slow enough compared
to the heap size that the GC runs compulsorily every 120 seconds, but
changing GOGC to make it run more often doesn't help the CPU issue, just
makes the spikes more frequent.

Are there any fixes for the latency spikes on Go's roadmap?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/golang/go/issues/14812#issuecomment-411589415, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AA7Wn3TE57REBzLs85chX_Ir8_mlpwx0ks5uO3f5gaJpZM4HvxjD
.

RLH on 13 Aug 2018

Unfortunately I don't have any traces that work. I captured a few but they just render blank pages in the browser, and I didn't have time to dig into what was wrong with them. There's no ambiguity that the spikes are correlated with the GC behavior though since our metrics (and htop) show very clear CPU spikes that occur when Go reports that it has started its GC.

I haven't tried any Go builds but it wouldn't be trivial for us to change out to a nonstandard Go binary on our build system. Are there GC fixes in Go > 1.10.3?

We did come up with a very hacky workaround which is to add more machines to the fleet. By bringing the average CPU utilization under 20%, the impact of the CPU spikes is not as pronounced on request latency. The spikes are still present and still completely saturate the CPU, but the smaller load doesn't seem to be as affected by it. However it seems rather unfortunate if Go can only use 1/5 of the computing resources of a machine it's on because of its GC design.

I'll come up with a repro repo that we can share. I actually think it will be pretty easy given the variables we tested, but I'll need to actually test and confirm to be sure.

brian-armstrong-discord on 13 Aug 2018

Thanks, we await the repro and hopefully a trace.

Could you also indicate what your latency SLO is when you send us the
repro? The gctrace numbers indicate that the GC is taking 1% of CPU and ran
36 times in 1197 seconds which seem reasonable.

The CPU spike at GC is working as intended. If there are idle Ps then the
GC will used them so you should expect full CPU utilization. What is not
intended is for it to starve the mutator which is why a trace will help to
understand what is going on.

On Mon, Aug 13, 2018 at 5:26 PM, brian-armstrong-discord <
[email protected]> wrote:

Unfortunately I don't have any traces that work. I captured a few but they
just render blank pages in the browser, and I didn't have time to dig into
what was wrong with them. There's no ambiguity that the spikes are
correlated with the GC behavior though since our metrics (and htop) show
very clear CPU spikes that occur when Go reports that it has started its GC.

I haven't tried any Go builds but it wouldn't be trivial for us to change
out to a nonstandard Go binary on our build system. Are there GC fixes in
Go > 1.10.3?

We did come up with a very hacky workaround which is to add more machines
to the fleet. By bringing the average CPU utilization under 20%, the impact
of the CPU spikes is not as pronounced on request latency. The spikes are
still present and still completely saturate the CPU, but the smaller load
doesn't seem to be as affected by it. However it seems rather unfortunate
if Go can only use 1/5 of the computing resources of a machine it's on
because of its GC design.

I'll come up with a repro repo that we can share. I actually think it will
be pretty easy given the variables we tested, but I'll need to actually
test and confirm to be sure.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/golang/go/issues/14812#issuecomment-412670266, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AA7Wn8nLwp-QU70jsceys052CUd25OnVks5uQe7_gaJpZM4HvxjD
.

RLH on 14 Aug 2018

The CPU spike at GC is working as intended. If there are idle Ps then the GC will used them so you should expect full CPU utilization.

I know this goes way beyond the scope of the Go runtime, but as it's arguably a common configuration I think it warrants mentioning: such a behavior (CPU spike during GC) is not desirable when using CFS bandwidth control cfs_quota_us/cfs_period_us (i.e. when running in basically any linux container with a CPU limitation set), as the default CFS period is 100ms and a GC-induced spike in CPU usage that burns through the quota will make the kernel stop scheduling the process until the end of the period, thereby turning what would be a harmless concurrent GC phase into a pretty long (in the order of the cfs_period_us, ~100ms) stop-the-world pause.

I am not saying that we need to necessarily do something about this in Go, just that this may not be the most desirable behavior in a pretty common configuration (running a go program in a linux container with cpu limits), and therefore stretching a bit the definition of "working as intended" if the intention is having consistent sub-ms stop-the-world pauses.

CAFxX on 7 Aug 2019

In my experience keeping a large map (10M entries) seriously impacts the system latency. Under heavy loads the typical response time is under 100 microseconds. When GC kicks in, the system stops responding altogether for 3-4 seconds. It happens every ~5 minutes. The system has 4 cores. In the gctrace I see that GC frees ~8GB of memory and it takes most of 4s. In this scenario GC consumes all cores.

The flow is: default HTTP server gets a short HTTP GET (under 200 bytes), handler looks in the map (syncmap), generates a short response (under 200 bytes). The map gets a new entry or two every few hours.

In one approach GC could avoid scanning objects which appear to live long time. Low level malloc/free API in the unsafe package could help me as well.

I switched to BigCache which solved the problem.

larytet on 7 Aug 2019

👍2

If you could turn that large map problem into some sort of a test or benchmark, that would be really useful. There's a short list of few-millisecond hiccup causes, but I don't know of any related to maps.

dr2chase on 7 Aug 2019

@dr2chase This is not quite a "benchmark" you probably have in mind, but it demonstrates the latency problem https://github.com/larytet-go/map-benchmark
This is a simple HTTP server. When allocated memory reaches ~3GB the server stops responding for ~5s. I keep the system around 0% idle time. Arguably the demo - in-memory data base - is not a typical Go micro service. The service is CPU and DDR bound.

The issue is in no way a new one. I assume that the behavior is by design. If GC sees that the system remains busy it will stay put (GC "marker" lags allocations) until the amount of allocated memory crosses a threshold (GCPercent?). The threshold forces the GC's hand. GC will consume all available cores, release the allocated memory, return to hibernation (background). I think that I understand some of the trade offs. GC's main goal is to let the system serve occasional short spikes with minimal latency impact (?). Devops is expected to add nodes until the latency distribution hits the target.

I do have a strong opinion about this. The GC is excellent when the machine has CPUs to spare and/or the service waits for I/O a lot. I personally prefer predictable latency with well defined and reasonable worst case. I wonder why there is no a separate list for long living objects, const pointers, access to runtime/unsafe malloc/free, a buddy allocator in the standard library, more GC tuning options, GC feedback and so on. Is it possible to avoid scanning 10M pointers in a map each and every time GC wakes up? Should I use rate limiter in all services? I agree that my opinion is not a very educated one.

In the trace on a 4 core system I see

SCHED 119544ms: gomaxprocs=4 idleprocs=0 threads=11 spinningthreads=0 idlethreads=2 runqueue=0 [0 0 0 0]
GC forced
gc 15 @120.185s 0%: 0.011+1.8+0.006 ms clock, 0.045+0/1.8/1.5+0.027 ms cpu, 2->2->1 MB, 4 MB goal, 4 P
SCHED 120557ms: gomaxprocs=4 idleprocs=4 threads=12 spinningthreads=0 idlethreads=5 runqueue=0 [0 0 0 0]
SCHED 120050ms: gomaxprocs=4 idleprocs=0 threads=11 spinningthreads=0 idlethreads=2 runqueue=0 [0 0 0 0]
gc 15 @113.396s 6%: 0.010+7082+0.005 ms clock, 0.043+8.9/7071/18652+0.020 ms cpu, 2977->3054->1663 MB, 3245 MB goal, 4 P
SCHED 121066ms: gomaxprocs=4 idleprocs=4 threads=12 spinningthreads=0 idlethreads=5 runqueue=0 [0 0 0 0]

Links
https://github.com/golang/go/issues/9477
Probably relevant https://github.com/golang/go/issues/27410
GC performance http://big-elephants.com/2018-09/unexpected-gc-pauses/

larytet on 7 Aug 2019

👍2 👀1

Sorry for the huge delay in replying, the short answer to your problem is "don't call memstat that often and don't put the call someplace you would mind waiting for an answer". There was an old problem with memstat blocking till the GC finished; we though we had fixed it, but apparently you have found a new problem with memstat blocking. When I remove the calls to memstat from your test program, all is well.

Even when we fix this new bug, frequent calls to memstat are not a great idea, because it actually stops the world. Not for long, but everything. It has overheads.

dr2chase on 13 Jan 2020

We have a software in production that has a similar behavior.
Even without calling memstat, we experience large delays (+100ms for heap size in the range of 7 or 8 GB)
Unfortunately the software is not open-source.

Memory models is composed of a relatively low number of btrees (10) (IIRC https://github.com/emirpasic/gods#redblacktree) that represent partitions of a large dataset (loaded at startup) of key/value pairs, where the value component is a Go structure and the key is a string

Once loaded, the dataset is 95% read and 5% updated

From traces it is clearly evident that large delays are caused by GC sweeping and starving mutators (more frequent) or GC STW phase (rare - please see this example)

genez on 14 Jan 2020

Which version of Go? There's several bugs that cause few-millisecond pauses for one or more threads, some are planned for 1.15, one -- failure to preempt call-free long-running loops -- got fixed in 1.14, so if you test with tip and the latency gets better, that was your problem. We also fixed one of the others in 1.14, but it's limited to just a few milliseconds. See https://github.com/golang/go/issues/27732 for a discussion of those bugs, and traces showing what they look like -- it would be nice to know if what's causing your problems is new or known.

A quick summary of the known bugs (from 27732):

Stack allocations are not divisible work; can take a long time to process. (Workaround for 1.14 added https://go-review.googlesource.com/c/go/+/180817/ -- use go build -gcflags 'all=-smallframes' hello.go)
Mark assist applied to global and stack allocations (generally, roots) doesn't generate any credit.
Sweeping problem (#18155 - experimental fix here: https://go-review.googlesource.com/c/go/+/187817/ )
Reschedule (i.e., quantum expires) that discovers a need for GC does not wake other Ps (or, does not check global run queue) (i.e., the work to do expanded by 1, if any Ps are idle, they one should be wakened to take over doing "real work"). (Fixed in 1.14: https://go-review.googlesource.com/c/go/+/146817/ )
(Related to 4?) Dedicated worker doesn't kick goroutines out of its local run queue.

The workaround for bug 1 will also address bug 2 -- if there's no oversized frames to mark, the size of the accounting error is small, also.

Bug 3 is rare, usually only occurs during program startup or if your code is very very good at not generating garbage; if the heap is completely live, scanning for a free slot can run very long. The workaround is somewhat nutty -- add a counter to the allocator for the problem, and every 64 allocations (counter&63 == 0) drop the allocation on the floor and do it again. This leaves holes in your heap that prevent long stalls for scanning.

If you try these workarounds and one of them works, that's really useful information about which bugs is the bigger problem and needs attention first.

It's also helpful to exclude startup from delay stats for long-running servers; those are a separate category of bug, and (our opinion) not nearly as important as steady-state glitches. They do matter; minor changes in startup heap growth heuristics have a noticeable effect on build speed, so we care, just not as much.

dr2chase on 14 Jan 2020

Thanks for the detailed follow-up
I am using Go 1.13.5 on windows x64

I have a reduced/modified version of the sample program that I can publish if needed.
Unfortunately the simplified version exhibits a less evident issue but still can be usefull for investigations.

Behavior is the same with 1.13 and tip (go version devel +563287ae06 Mon Jan 13 23:35:18 2020 +0000 windows/amd64)

Based on your explanation it look like we are in scenario (3)
We have a relatively large allocation at startup, then we execute a large number of read/update on the data set. Only the second phase is considered in the trace.

I expect that the real live heap is about 95% of the total heap

This is a sample fragment of the output of the trace tool:

I will provide source code and other details as soon as possible

genez on 14 Jan 2020

please check repro code here: https://github.com/genez/collectiontest
please note this is extremely simplified, production code does more computations (json and string handling, logging, etc...) that lead to >50ms latency

in production code we are not using stdlib maps because they do not shrink and cause OOM panics, we use a custom data structure (an rb+ tree where we serialize the items as a byte slice)

on my test bench (Intel i9 16 core, 32 GB of RAM) I can repro >10ms latency
go version go1.13 windows/amd64

it looks like the platform (win, linux) does not matter

genez on 18 Feb 2020

👍1

Short answer it is definitely sweeping, and the patch for sweep delays works nicely.
Unfortunately the next mole to whack is mark assist, and though I think it's technically working as intended, it's sometimes not good.

You can see these in the MMU graph, here's a particularly terrible one (some look great, this one is awful even with an experimental "fix"). By turning the mark-assist toggle on and off you can see where most of the problem comes from:

Screen Shot 2020-02-18 at 5 45 54 PM
Screen Shot 2020-02-18 at 5 46 20 PM

dr2chase on 18 Feb 2020

Thank you for this insight!
I am not sure I can fully grasp what does "turning mark assist off" mean and how to correctly read those two charts

I'm sorry but I am not that confident in runtime internals

genez on 19 Feb 2020

"Turning mark assist off" refers to the theft-of-time toggles for the graph; if you don't count mark-assist "theft" of mutator time, the worst-case delay is below 10ms, and the longest window in which only 50% of the cpu is available is about 25-30ms.

Turning mark assist "on" (in the accounting), the 50% point is pushed out into seconds.

The chart shows "minimum mutator utilization" (MMU) and is available from "go tool trace". The time on the bottom (x-axis) is the size of a window that is slid across the trace, and the fraction (y-axis) is the minimum for all windows (of that size) fraction of time in which the mutator wanted to run but could not. The toggle buttons ("Include") allow you to specify things that are not mutation -- if it's one of those, then the mutator is "not running". "View" allows you to specify either the entire system or worst-case per-goroutine ("system" runs the window across the threads in a stack, takes the fraction for the entire stack, "Per-goroutine" is instead thread-by-thread).

By default the graph shows worst case, and toggling "Percentile" is supposed to get you more nuanced numbers, but I think there is some sort of a bug because I'm not seeing that no matter how hard I fiddle.

One thing to be careful of is that this is a distillation of your app's memory allocation, and it is quite possible that it sees mark assist problems that your app will not. Spending a lot of time in mark assist happens when the mutator (your app) is allocating quickly during GC -- when that happens a tax is applied to allocations to prevent the GC from falling too far behind. If your app does more work that is not allocation -- or even if it does a lot more allocations but they are often smaller than in this benchmark -- then you won't see the mark assist problem.

Spending a lot of time in sweep (your original problem) is a property of a heap that is very well-filled; it's large, and it's 100% live objects. Right after GC completes, a "background sweeper" thread starts to figure out where free space is, but if the mutator allocates before that thread has processed very much memory, then the mutator has to go looking (sweeping) instead. When a lot of the heap is 100% filled, it can look for a while. The patch for that simply says "after you've looked at 100 spans with no luck, give up and allocate a new span".

So, if you're up for patching a Go release, you might try
https://go-review.googlesource.com/c/go/+/187817/
and see if your latency problem goes away (meaning, your real app does other stuff), or if it only gets a little better (you app allocates a lot anyway and we need to improve mark assist.)

Thanks very much for this benchmark. This gives us a nice reproducible case (taken from a real-world app) of sweep being a problem, which will probably get that fix into 1.15, and it provides a nice motivating case for improving mark assist.

dr2chase on 19 Feb 2020

❤2

I just repeated the test on a real production-grade linux machine with the following Go versions:

Go 1.13.8 (the one we are using at the moment in prod)
Go tip origin/master 218f457

It looks like there's a nice improvement
I can upload and share trace output files if needed

I did try to patch the tip sources, but I get compilation errors:

Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.

/opt/go_MASTER/src/runtime/mcentral.go:262:17: too many arguments in call to mheap_.freeSpan
        have (*mspan, bool)
        want (*mspan)

/opt/go_MASTER/src/runtime/mcentral.go:272:19: too many arguments in call to mheap_.alloc
        have (uintptr, spanClass, bool, bool)
        want (uintptr, spanClass, bool)

genez on 24 Feb 2020

Go: runtime: GC causes latency spikes

Most helpful comment

All 116 comments

run with stock go

run with go patched to remove poolcleanup().

Related issues