Go: runtime: first benchmark run is much slower than subsequent runs

Created on 5 Nov 2018  路  5Comments  路  Source: golang/go

$ cd $GOROOT/test/bench/go1
$ go test -bench=TimeFormat -count=5

Want:

The first benchmark run is similar to the others, maybe a bit slower because caches are cold. Sample:

goos: darwin
goarch: amd64
BenchmarkTimeFormat-8        5000000           261 ns/op
BenchmarkTimeFormat-8        5000000           254 ns/op
BenchmarkTimeFormat-8        5000000           253 ns/op
BenchmarkTimeFormat-8        5000000           252 ns/op
BenchmarkTimeFormat-8        5000000           249 ns/op
PASS
ok      _/Users/josh/go/tip/test/bench/go1  8.967s

Got:

The first benchmark run is drastically slower. Sample:

goos: darwin
goarch: amd64
BenchmarkTimeFormat-8        5000000           504 ns/op
BenchmarkTimeFormat-8        5000000           243 ns/op
BenchmarkTimeFormat-8        5000000           244 ns/op
BenchmarkTimeFormat-8        5000000           241 ns/op
BenchmarkTimeFormat-8        5000000           239 ns/op
PASS
ok      _/Users/josh/go/tip/test/bench/go1  10.259s

I noticed this while trying to understand some wild swings (50%+) in go1 benchmark performance that are not limited to the first benchmark run. I don't know whether the root cause is the same or not, but this issue at least reproduces reliably.

Bisect points to CL 139719 / c803ffc67d0c90b24eb5a60a7d573eecc39e7753.

cc @mknyszek

FrozenDueToAge release-blocker

Most helpful comment

I found the issue while I was writing this, so it the rest of the comment reads a little more like "how I figured it out". TL;DR is that we're calling madvise way too often due to some bad logic on my part. CL incoming.

It seems that on x64 Linux it's not so bad.

$ perflock ~/toolchain/bin/go test -bench=TimeFormat -count=5                                                                     
goos: linux
goarch: amd64
BenchmarkTimeFormat-8            5000000               318 ns/op
BenchmarkTimeFormat-8            5000000               297 ns/op
BenchmarkTimeFormat-8            5000000               299 ns/op
BenchmarkTimeFormat-8            5000000               298 ns/op
BenchmarkTimeFormat-8            5000000               296 ns/op
PASS
ok      _/usr/local/google/home/mknyszek/toolchain/test/bench/go1       10.911s        

On Darwin I was able to reproduce it, and I confirmed the cause is my change. If I run the same command with -count=1 and -cpuprofile cpu.prof, then put cpu.prof through pprof, I get:

      flat  flat%   sum%        cum   cum%
     1.42s 47.65% 47.65%      1.42s 47.65%  runtime.memclrNoHeapPointers
     0.68s 22.82% 70.47%      0.68s 22.82%  runtime.madvise
     0.37s 12.42% 82.89%      0.37s 12.42%  runtime.memmove
     0.11s  3.69% 86.58%      0.25s  8.39%  time.Time.AppendFormat
     0.10s  3.36% 89.93%      0.10s  3.36%  runtime.(*mspan).init
     0.09s  3.02% 92.95%      0.09s  3.02%  runtime.(*mspan).refillAllocCache
     0.03s  1.01% 93.96%      0.03s  1.01%  time.(*Location).lookup
     0.03s  1.01% 94.97%      0.03s  1.01%  time.absDate
     0.02s  0.67% 95.64%      0.02s  0.67%  runtime.mmap
     0.02s  0.67% 96.31%      0.02s  0.67%  time.nextStdChunk

The runtime.madvise being so high up definitely points to my change(s). madvise might be more expensive on Darwin than on Linux. If I run with -count=5:

Showing top 10 nodes out of 47                                               
      flat  flat%   sum%        cum   cum%                         
    2080ms 17.48% 17.48%     2080ms 17.48%  runtime.memclrNoHeapPointers
    1670ms 14.03% 31.51%     6310ms 53.03%  time.Time.AppendFormat
    1620ms 13.61% 45.13%     1620ms 13.61%  runtime.memmove  
    1050ms  8.82% 53.95%     1090ms  9.16%  time.nextStdChunk
     800ms  6.72% 60.67%     1050ms  8.82%  time.appendInt
     690ms  5.80% 66.47%      800ms  6.72%  time.(*Location).lookup
     560ms  4.71% 71.18%      560ms  4.71%  runtime.madvise
     530ms  4.45% 75.63%      540ms  4.54%  time.absDate
     430ms  3.61% 79.24%      430ms  3.61%  runtime.(*mspan).refillAllocCache
     420ms  3.53% 82.77%      420ms  3.53%  runtime.(*mspan).init

Then it's not quite so high up, since those madvise calls are likely mostly made during the first run. Maybe during a heap growth.

To make sure, I tracked down where the madvise calls were coming from with some print debugging. Turns out I was wrong! The heap-growth scavenging routine only _actually_ scavenges one enormous span. The real problem arises primarily from a performance bug in a previous change of mine.

When freeing a span, the runtime will may try to coalesce a scavenged span with a neighboring unscavenged span, so on merging, we re-scavenge the whole thing. Freeing a span happens more frequently than you might think, since when we're allocating a new span we search for a "best fit" free span, and if it's not of the same size we trim off the excess. In this particular case, we have one enormous free span which we scavenge. Then, lots of smaller allocations come in, and they end up hitting the space where that scavenged span lived, trimming it and returning the trimmings (which are considered scavenged) each time, only for those trimmings to get allocated out of again.

In this freeing process, due to a bug, we actually end up re-scavenging the span we're freeing if it's already scavenged, even if it didn't coalesce with anything. So, we basically have a _lot_ of unnecessary additional madvise calls happening, which has a big impact if madvise is expensive of the platform.

I hacked together a fix for the freeing logic locally and it does in fact fix this issue. CL incoming.

All 5 comments

FYI @martisch @randall77 this is part of the puzzling go1 benchmark results I mentioned in CL 147360

Thanks for spotting and reporting this, @josharian! I'm sure you've saved countless head-scratching hours :)

Thanks for bisecting and for reporting the issue!

I suspect that the regession you're seeing actually is the result of the runtime scavenging (i.e. releasing pages back to the OS) immediately after the first heap growth. For small heaps this shouldn't really be an issue since the scavenging routine should back out quite fast, but it seems that it's doing more work than expected in this case and perhaps being a little too aggressive.

I'll investigate and see what I can do.

I found the issue while I was writing this, so it the rest of the comment reads a little more like "how I figured it out". TL;DR is that we're calling madvise way too often due to some bad logic on my part. CL incoming.

It seems that on x64 Linux it's not so bad.

$ perflock ~/toolchain/bin/go test -bench=TimeFormat -count=5                                                                     
goos: linux
goarch: amd64
BenchmarkTimeFormat-8            5000000               318 ns/op
BenchmarkTimeFormat-8            5000000               297 ns/op
BenchmarkTimeFormat-8            5000000               299 ns/op
BenchmarkTimeFormat-8            5000000               298 ns/op
BenchmarkTimeFormat-8            5000000               296 ns/op
PASS
ok      _/usr/local/google/home/mknyszek/toolchain/test/bench/go1       10.911s        

On Darwin I was able to reproduce it, and I confirmed the cause is my change. If I run the same command with -count=1 and -cpuprofile cpu.prof, then put cpu.prof through pprof, I get:

      flat  flat%   sum%        cum   cum%
     1.42s 47.65% 47.65%      1.42s 47.65%  runtime.memclrNoHeapPointers
     0.68s 22.82% 70.47%      0.68s 22.82%  runtime.madvise
     0.37s 12.42% 82.89%      0.37s 12.42%  runtime.memmove
     0.11s  3.69% 86.58%      0.25s  8.39%  time.Time.AppendFormat
     0.10s  3.36% 89.93%      0.10s  3.36%  runtime.(*mspan).init
     0.09s  3.02% 92.95%      0.09s  3.02%  runtime.(*mspan).refillAllocCache
     0.03s  1.01% 93.96%      0.03s  1.01%  time.(*Location).lookup
     0.03s  1.01% 94.97%      0.03s  1.01%  time.absDate
     0.02s  0.67% 95.64%      0.02s  0.67%  runtime.mmap
     0.02s  0.67% 96.31%      0.02s  0.67%  time.nextStdChunk

The runtime.madvise being so high up definitely points to my change(s). madvise might be more expensive on Darwin than on Linux. If I run with -count=5:

Showing top 10 nodes out of 47                                               
      flat  flat%   sum%        cum   cum%                         
    2080ms 17.48% 17.48%     2080ms 17.48%  runtime.memclrNoHeapPointers
    1670ms 14.03% 31.51%     6310ms 53.03%  time.Time.AppendFormat
    1620ms 13.61% 45.13%     1620ms 13.61%  runtime.memmove  
    1050ms  8.82% 53.95%     1090ms  9.16%  time.nextStdChunk
     800ms  6.72% 60.67%     1050ms  8.82%  time.appendInt
     690ms  5.80% 66.47%      800ms  6.72%  time.(*Location).lookup
     560ms  4.71% 71.18%      560ms  4.71%  runtime.madvise
     530ms  4.45% 75.63%      540ms  4.54%  time.absDate
     430ms  3.61% 79.24%      430ms  3.61%  runtime.(*mspan).refillAllocCache
     420ms  3.53% 82.77%      420ms  3.53%  runtime.(*mspan).init

Then it's not quite so high up, since those madvise calls are likely mostly made during the first run. Maybe during a heap growth.

To make sure, I tracked down where the madvise calls were coming from with some print debugging. Turns out I was wrong! The heap-growth scavenging routine only _actually_ scavenges one enormous span. The real problem arises primarily from a performance bug in a previous change of mine.

When freeing a span, the runtime will may try to coalesce a scavenged span with a neighboring unscavenged span, so on merging, we re-scavenge the whole thing. Freeing a span happens more frequently than you might think, since when we're allocating a new span we search for a "best fit" free span, and if it's not of the same size we trim off the excess. In this particular case, we have one enormous free span which we scavenge. Then, lots of smaller allocations come in, and they end up hitting the space where that scavenged span lived, trimming it and returning the trimmings (which are considered scavenged) each time, only for those trimmings to get allocated out of again.

In this freeing process, due to a bug, we actually end up re-scavenging the span we're freeing if it's already scavenged, even if it didn't coalesce with anything. So, we basically have a _lot_ of unnecessary additional madvise calls happening, which has a big impact if madvise is expensive of the platform.

I hacked together a fix for the freeing logic locally and it does in fact fix this issue. CL incoming.

Change https://golang.org/cl/147559 mentions this issue: runtime: stop unnecessary span scavenges on free

Was this page helpful?
0 / 5 - 0 ratings

Related issues

longzhizhi picture longzhizhi  路  3Comments

mingrammer picture mingrammer  路  3Comments

ajstarks picture ajstarks  路  3Comments

natefinch picture natefinch  路  3Comments

Miserlou picture Miserlou  路  3Comments