What version of Go are you using (go version)?
λ go version
go version go1.14 windows/amd64
λ go version
go version go1.13.8 windows/amd64
Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (go env)?
go env Output:
λ go env
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\klaus\AppData\Local\go-build
set GOENV=C:\Users\klaus\AppData\Roaming\go\env
set GOEXE=.exe
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=e:\gopath
set GOPRIVATE=
set GOPROXY=https://goproxy.io
set GOROOT=c:\go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=c:\go\pkg\tool\windows_amd64
set GCCGO=gccgo
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=d:\temp\wintemp\go-build155272862=/tmp/go-build -gno-record-gcc-switches
Run benchmark: https://play.golang.org/p/WeuJg6yaOuJ
Tested with go test -bench=. -test.benchtime=10s.
Expected: close or similar benchmark speeds.
Got: a roughly 40% performance regression.
λ benchcmp go113.txt go114.txt
benchmark old ns/op new ns/op delta
BenchmarkCompressAllocationsSingle/flate-32 87026 121741 +39.89%
BenchmarkCompressAllocationsSingle/gzip-32 88654 122632 +38.33%
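The exact benchmark is at the playground link above; as a rough idea of its shape, a minimal sketch of such an allocation-heavy compression benchmark could look like the following (the payload size, compression level, and package name are assumptions for illustration, not the original values):

package bench

import (
	"compress/flate"
	"compress/gzip"
	"io/ioutil"
	"testing"
)

// Assumed payload; the real benchmark uses its own data and size.
var payload = make([]byte, 64<<10)

// Each iteration allocates a brand-new writer (flate's internal tables alone
// account for several hundred KiB), so the measured cost is dominated by
// allocation and GC work rather than by the compression itself.
func BenchmarkCompressAllocationsSingle(b *testing.B) {
	b.Run("flate", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			w, _ := flate.NewWriter(ioutil.Discard, flate.BestSpeed)
			_, _ = w.Write(payload)
			_ = w.Close()
		}
	})
	b.Run("gzip", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			w, _ := gzip.NewWriterLevel(ioutil.Discard, gzip.BestSpeed)
			_, _ = w.Write(payload)
			_ = w.Close()
		}
	})
}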
This is not a purely theoretical benchmark. While suboptimal, this is the easiest way to compress a piece of data, so this pattern will be seen in the wild. It could also indicate a general regression for applications that allocate a lot.
Edit: This is not related to changes in the referenced packages. I see the same thing when using packages outside the stdlib.
Someone suggested removing ReportAllocs. It has an impact, but not a very big one:
λ benchcmp go113-no.txt go114-no.txt
benchmark old ns/op new ns/op delta
BenchmarkCompressAllocationsSingle/flate-32 88082 117794 +33.73%
BenchmarkCompressAllocationsSingle/gzip-32 89279 118946 +33.23%
Turning off DEP seems to reduce the impact, but it is still considerably slower:
λ bcdedit.exe /set {current} nx AlwaysOff, then reboot. Confirm the setting with λ wmic OS Get DataExecutionPrevention_SupportPolicy, which reports 0.
λ benchcmp go113-depoff.txt go114-depoff.txt
benchmark old ns/op new ns/op delta
BenchmarkCompressAllocationsSingle/flate-32 87619 107249 +22.40%
BenchmarkCompressAllocationsSingle/gzip-32 89344 107670 +20.51%
This actually matches my initial measurements before isolating the issue, so I am unsure if this is just some variance based on system state.
cc @mknyszek
/cc @aclements @rsc @randall77
The regression in the benchmark is reproducible on Linux (but to a lesser extent).
Or... hm. I spoke too soon. Five runs show that the benchmark is pretty noisy on Linux, and it's not obvious that there's a regression.
name old time/op new time/op delta
CompressAllocationsSingle/flate-48 182µs ±12% 169µs ± 4% ~ (p=0.151 n=5+5)
CompressAllocationsSingle/gzip-48 170µs ± 7% 184µs ± 9% ~ (p=0.056 n=5+5)
name old alloc/op new alloc/op delta
CompressAllocationsSingle/flate-48 814kB ± 0% 814kB ± 0% +0.00% (p=0.016 n=4+5)
CompressAllocationsSingle/gzip-48 814kB ± 0% 814kB ± 0% +0.00% (p=0.008 n=5+5)
name old allocs/op new allocs/op delta
CompressAllocationsSingle/flate-48 16.0 ± 0% 16.0 ± 0% ~ (all equal)
CompressAllocationsSingle/gzip-48 17.0 ± 0% 17.0 ± 0% ~ (all equal)
That's what I get for drawing conclusions from just one benchmark run. This probably needs many more runs to see a concrete difference, if one exists.
Since this seems much more easily reproducible on Windows, I'm going to try there.
I cannot reproduce this on Linux.
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/leaf/.cache/go-build"
GOENV="/home/leaf/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/leaf/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go-1.14"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go-1.14/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build636896530=/tmp/go-build -gno-record-gcc-switches"
$ ~/go/bin/benchstat bench_1.1{3,4}.txt
name old time/op new time/op delta
CompressAllocationsSingle/flate-8 180µs ± 2% 181µs ± 4% ~ (p=1.000 n=10+10)
CompressAllocationsSingle/gzip-8 183µs ± 7% 204µs ± 1% +11.38% (p=0.000 n=9+10)
Edit: benchmark run with ReportAllocs removed.
While I set up my Windows environment: @klauspost what suggests to you that this is a regression in allocation performance vs. something else?
@mknyszek It is only really doing allocations. The compression part is trivial, but it needs enough content to trigger all the allocations.
Profiles: alloc-pprof.zip
λ go tool pprof cpu113.pprof
Type: cpu
Time: Feb 26, 2020 at 5:05pm (CET)
Duration: 25.46s, Total samples = 1.01mins (238.80%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 39.56s, 65.07% of 60.80s total
Dropped 227 nodes (cum <= 0.30s)
Showing top 10 nodes out of 144
flat flat% sum% cum cum%
11.28s 18.55% 18.55% 14.95s 24.59% runtime.scanobject
5.51s 9.06% 27.62% 5.54s 9.11% runtime.stdcall2
4.86s 7.99% 35.61% 4.86s 7.99% runtime.runGCProg
4.76s 7.83% 43.44% 4.76s 7.83% runtime.memclrNoHeapPointers
3.82s 6.28% 49.72% 3.83s 6.30% runtime.stdcall3
2s 3.29% 53.01% 2.03s 3.34% runtime.runqgrab
1.97s 3.24% 56.25% 1.97s 3.24% runtime.procyield
1.94s 3.19% 59.44% 1.96s 3.22% runtime.(*lfstack).pop
1.86s 3.06% 62.50% 1.90s 3.12% runtime.stdcall1
1.56s 2.57% 65.07% 21.19s 34.85% runtime.gcDrain
λ go tool pprof cpu114.pprof
Type: cpu
Time: Feb 26, 2020 at 5:07pm (CET)
Duration: 26.61s, Total samples = 1.34mins (303.24%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1.07mins, 79.31% of 1.34mins total
Dropped 271 nodes (cum <= 0.01mins)
Showing top 10 nodes out of 95
flat flat% sum% cum cum%
0.54mins 40.51% 40.51% 0.55mins 40.59% runtime.stdcall2
0.17mins 12.93% 53.43% 0.22mins 16.34% runtime.scanobject
0.12mins 8.99% 62.42% 0.12mins 8.99% runtime.memclrNoHeapPointers
0.08mins 5.94% 68.36% 0.08mins 6.00% runtime.stdcall1
0.07mins 4.87% 73.23% 0.07mins 4.87% runtime.runGCProg
0.02mins 1.65% 74.88% 0.02mins 1.65% runtime.procyield
0.02mins 1.28% 76.15% 0.02mins 1.28% runtime.runqgrab
0.02mins 1.17% 77.32% 0.02mins 1.18% runtime.stdcall7
0.01mins 1.00% 78.32% 0.02mins 1.61% runtime.greyobject
0.01mins 0.99% 79.31% 0.03mins 1.90% runtime.findObject
SVGs:
Go 1.13: https://files.klauspost.com/profile002.svg
Go 1.14: https://files.klauspost.com/profile001.svg
FWIW, I ran it on Linux arm64 as well (shared VM, so anything goes), but the results seem consistent:
$ benchcmp go11{3,4}.txt
benchcmp is deprecated in favor of benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
benchmark old ns/op new ns/op delta
BenchmarkCompressAllocationsSingle/flate-8 408292 491127 +20.29%
BenchmarkCompressAllocationsSingle/gzip-8 365706 453792 +24.09%
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/home/leitzler/.cache/go-build"
GOENV="/home/leitzler/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/leitzler/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_arm64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build024325012=/tmp/go-build -gno-record-gcc-switches"
@klauspost Thanks for the profiles! I was about to do this myself.
I disagree that this indicates anything to do with allocation; none of those symbols are involved in allocation. The first allocation-related symbol I see in the top 20 of the diff is right at the bottom (and it's a deleted symbol at that!).
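The signed values in the diff below come from comparing the two CPU profiles; presumably it was produced with pprof's diff mode, along the lines of:
λ go tool pprof -diff_base=cpu113.pprof cpu114.pprof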
(pprof) top 20
Showing nodes accounting for 21.44s, 35.26% of 60.80s total
Dropped 261 nodes (cum <= 0.30s)
Showing top 20 nodes out of 173
flat flat% sum% cum cum%
27.17s 44.69% 44.69% 27.21s 44.75% runtime.stdcall2
-3.74s 6.15% 38.54% -3.75s 6.17% runtime.stdcall3
2.93s 4.82% 43.36% 2.94s 4.84% runtime.stdcall1
2.49s 4.10% 47.45% 2.49s 4.10% runtime.memclrNoHeapPointers
-1.48s 2.43% 45.02% -1.50s 2.47% runtime.(*lfstack).pop
-0.97s 1.60% 43.42% -1s 1.64% runtime.runqgrab
0.94s 1.55% 44.97% 0.95s 1.56% runtime.stdcall7
-0.93s 1.53% 43.44% -0.93s 1.53% runtime.runGCProg
-0.89s 1.46% 41.97% 31.05s 51.07% runtime.gcDrain
-0.88s 1.45% 40.53% 20.40s 33.55% runtime.lock
-0.85s 1.40% 39.13% -1.77s 2.91% runtime.scanobject
-0.64s 1.05% 38.08% -0.64s 1.05% runtime.procyield
-0.55s 0.9% 37.17% -0.53s 0.87% runtime.heapBits.next
0.50s 0.82% 37.99% 0.50s 0.82% runtime.shouldStealTimers (inline)
-0.45s 0.74% 37.25% -0.14s 0.23% runtime.findObject
-0.43s 0.71% 36.55% -1.43s 2.35% runtime.runqsteal
-0.41s 0.67% 35.87% -4.83s 7.94% runtime.findrunnable
-0.38s 0.62% 35.25% -0.39s 0.64% runtime.(*lfstack).push
0.33s 0.54% 35.79% 0.34s 0.56% runtime.scanblock
-0.32s 0.53% 35.26% -0.40s 0.66% runtime.(*treapNode).updateInvariants
What's more interesting to me is the prevalence of stdcall2 in the diff. Peeking at that a few times...
(pprof) peek runtime.stdcall2
Showing nodes accounting for 19.88s, 32.70% of 60.80s total
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
21.65s 79.57% | runtime.semasleep
5.56s 20.43% | runtime.preemptM
27.17s 44.69% 44.69% 27.21s 44.75% | runtime.stdcall2
0.04s 0.15% | runtime.stdcall
----------------------------------------------------------+-------------
(pprof) peek runtime.semasleep
Showing nodes accounting for 19.88s, 32.70% of 60.80s total
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
21.68s 100% | runtime.lock
0.03s 0.049% 0.049% 21.68s 35.66% | runtime.semasleep
21.65s 99.86% | runtime.stdcall2
----------------------------------------------------------+-------------
(pprof) peek runtime.lock
Showing nodes accounting for 19.88s, 32.70% of 60.80s total
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
27.34s 134.02% | runtime.preemptM
-2.33s 11.42% | runtime.(*mheap).alloc_m
-1.57s 7.70% | runtime.findrunnable
-0.79s 3.87% | runtime.stopm
-0.62s 3.04% | runtime.exitsyscallfast_pidle
-0.33s 1.62% | runtime.gcResetMarkState
-0.32s 1.57% | runtime.(*mheap).allocManual
-0.31s 1.52% | runtime.(*mheap).freeSpan.func1
-0.22s 1.08% | runtime.bgscavenge.func2
-0.14s 0.69% | runtime.timerproc
0.14s 0.69% | runtime.goschedImpl
0.13s 0.64% | runtime.(*mheap).allocSpan
-0.11s 0.54% | runtime.startm
-0.07s 0.34% | runtime.addtimer
-0.07s 0.34% | runtime.handoffp
-0.07s 0.34% | runtime.scavengeSleep
-0.06s 0.29% | runtime.gcBgMarkWorker.func2
-0.05s 0.25% | runtime.exitsyscall0
-0.04s 0.2% | runtime.gcstopm
0.04s 0.2% | runtime.gcFlushBgCredit
-0.03s 0.15% | runtime.runSafePointFn
-0.03s 0.15% | runtime.wakeScavenger
0.02s 0.098% | runtime.(*pageAlloc).scavengeOne.func1
-0.02s 0.098% | runtime.addspecial
0.02s 0.098% | runtime.forEachP
-0.02s 0.098% | runtime.acquireSudog
-0.02s 0.098% | runtime.gcSweep
-0.02s 0.098% | runtime.stackcache_clear
-0.02s 0.098% | runtime.stopTheWorldWithSema
0.01s 0.049% | runtime.(*mcentral).cacheSpan
0.01s 0.049% | runtime.freespecial
0.01s 0.049% | runtime.mProf_Free
-0.01s 0.049% | runtime.(*mheap).freeManual
-0.01s 0.049% | runtime.(*mheap).reclaimChunk
-0.01s 0.049% | runtime.nextMarkBitArenaEpoch
-0.01s 0.049% | runtime.scavengeSleep.func1
-0.01s 0.049% | runtime.setprofilebucket
-0.01s 0.049% | runtime.startTheWorldWithSema
0 0% | runtime.(*mcentral).freeSpan
0 0% | runtime.(*mcentral).uncacheSpan
0 0% | runtime.(*mheap).reclaim
0 0% | runtime.mProf_Malloc
0 0% | runtime.schedEnableUser
-0.88s 1.45% 1.45% 20.40s 33.55% | runtime.lock
21.68s 106.27% | runtime.semasleep
-0.58s 2.84% | runtime.procyield
0.14s 0.69% | runtime.osyield
0.03s 0.15% | runtime.semacreate
0.01s 0.049% | runtime.newstack
----------------------------------------------------------+-------------
-0.15s 100% | runtime.checkdead
-0.15s 0.25% 1.69% -0.15s 0.25% | runtime.lockextra
----------------------------------------------------------+-------------
(pprof) peek runtime.preemptM
Showing nodes accounting for 19.88s, 32.70% of 60.80s total
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
38.18s 100% | runtime.preemptone
0.15s 0.25% 0.25% 38.18s 62.80% | runtime.preemptM
27.34s 71.61% | runtime.lock
5.56s 14.56% | runtime.stdcall2
2.92s 7.65% | runtime.stdcall1
1.14s 2.99% | runtime.unlock
0.95s 2.49% | runtime.stdcall7
0.07s 0.18% | runtime.wantAsyncPreempt (inline)
0.03s 0.079% | runtime.gFromTLS
0.02s 0.052% | runtime.isAsyncSafePoint
----------------------------------------------------------+-------------
It looks like a lock in preemptM is either contended or hit very often. I'll try to see exactly what line it's coming from.
@klauspost Your SVGs suggest the same thing (look at the difference between stdcall2 in both) AFAICT.
CC @aclements because asynchronous preemption.
@leitzler Could you share some profiles? I don't readily have access to dedicated arm64 hardware and I'm curious if that's the same thing or a different issue.
@aclements Most of it is coming from os_windows.go:1185 in go1.14. The line is lock(&suspendLock).
Digging deeper with @cherrymui, it looks like most of the preemption requests are coming from runtime.(*gcControllerState).enlistWorker. This benchmark is single-threaded, and enlistWorker only tries to preempt user code instead of taking an idle P. I notice that GOMAXPROCS=32 in the originally reported issue, so it should just take an idle P (since there are lots!) but it doesn't. It preempts user code instead.
There's a comment there referencing #19112, which talks about deadlocks and explains why we can't just take an idle P...
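To make that trade-off concrete, here is a tiny self-contained Go sketch of the decision being described. The identifiers (idlePCount, wakeIdleP, preemptRunningUser, enlistMoreMarkWorkers) are hypothetical stand-ins, not real runtime symbols; only the shape of the choice is taken from the discussion above.

package main

import "fmt"

// Hypothetical stand-ins; none of these are real runtime identifiers.
var idlePCount = 31 // e.g. GOMAXPROCS=32 with a single busy goroutine

func wakeIdleP()          { fmt.Println("wake an idle P to run a mark worker (cheap)") }
func preemptRunningUser() { fmt.Println("preempt user code: suspendLock + SuspendThread on Windows (expensive)") }

// enlistMoreMarkWorkers sketches the choice enlistWorker faces: the cheap path
// (waking an idle P) is effectively off the table because of the deadlock
// concern tracked in #19112, so it falls through to preempting user code.
func enlistMoreMarkWorkers() {
	const idleWakeupEnabled = false // ruled out because of #19112
	if idleWakeupEnabled && idlePCount > 0 {
		wakeIdleP()
		return
	}
	preemptRunningUser()
}

func main() { enlistMoreMarkWorkers() }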
@leitzler Can you please file a new issue? It's almost certainly unrelated to this one, because this is all about Windows-specific code.
@mknyszek I'm no expert on this, but it could be processor-specific AFAICT. I'm running on a Ryzen 3950X here (16c/32t). I will run on my 6c/12t Intel laptop and see if it reproduces as well.
Edit: 6c/12t shows a 24% degradation as well. HOWEVER, running with -cpu=4 it is neutral. I will run a few more tests on my desktop, since my laptop is a bit noisier.
I'll give it another run with profiles later tonight when I get home (in a new issue, depending on how this turns out in the meantime).
The number of threads affects the numbers quite a bit:
commandline: go test -bench=. -test.benchtime=10s -cpu=1,2,4,8,16,32
λ benchcmp 113t.txt 114t.txt
benchmark old ns/op new ns/op delta
BenchmarkCompressAllocationsSingle/flate 83597 65062 -22.17%
BenchmarkCompressAllocationsSingle/flate-2 83767 91519 +9.25%
BenchmarkCompressAllocationsSingle/flate-4 75547 80647 +6.75%
BenchmarkCompressAllocationsSingle/flate-8 79197 84755 +7.02%
BenchmarkCompressAllocationsSingle/flate-16 82064 97530 +18.85%
BenchmarkCompressAllocationsSingle/flate-32 88006 106613 +21.14%
BenchmarkCompressAllocationsSingle/gzip 91624 108324 +18.23%
BenchmarkCompressAllocationsSingle/gzip-2 80929 97885 +20.95%
BenchmarkCompressAllocationsSingle/gzip-4 79115 81557 +3.09%
BenchmarkCompressAllocationsSingle/gzip-8 82221 85995 +4.59%
BenchmarkCompressAllocationsSingle/gzip-16 86477 96064 +11.09%
BenchmarkCompressAllocationsSingle/gzip-32 91024 108272 +18.95%
Even though the test is single-threaded, the total number of threads available has a very big impact.
I am rerunning the benchmark with -test.count=5, but so far the numbers match up. The outliers are not flukes AFAICT.
Edit: To confirm the numbers, here is 5 runs of each:
name old time/op new time/op delta
CompressAllocationsSingle/flate 82.4µs ± 1% 64.5µs ± 0% -21.70% (p=0.008 n=5+5)
CompressAllocationsSingle/flate-2 85.4µs ± 1% 89.6µs ± 2% +4.98% (p=0.008 n=5+5)
CompressAllocationsSingle/flate-4 75.7µs ± 1% 84.7µs ± 1% +11.82% (p=0.008 n=5+5)
CompressAllocationsSingle/flate-8 77.3µs ± 1% 84.7µs ± 0% +9.58% (p=0.008 n=5+5)
CompressAllocationsSingle/flate-16 83.0µs ± 2% 97.4µs ± 2% +17.28% (p=0.008 n=5+5)
CompressAllocationsSingle/flate-32 87.8µs ± 1% 108.7µs ± 0% +23.79% (p=0.008 n=5+5)
CompressAllocationsSingle/gzip 91.2µs ± 0% 107.6µs ± 1% +17.92% (p=0.008 n=5+5)
CompressAllocationsSingle/gzip-2 83.8µs ± 2% 95.5µs ± 5% +13.98% (p=0.008 n=5+5)
CompressAllocationsSingle/gzip-4 79.1µs ± 0% 82.7µs ± 2% +4.47% (p=0.008 n=5+5)
CompressAllocationsSingle/gzip-8 80.3µs ± 1% 87.9µs ± 0% +9.49% (p=0.008 n=5+5)
(The benchmark run terminated early, so the numbers for gzip-16 and gzip-32 should be confirmed as well.)
Yeah, I think this is consistent with our previous analysis. There is no contention if there is just one thread. The more threads there are, the more of them try to preempt each other, and thus the heavier the contention.
@leitzler Could you share some profiles? I don't readily have access to dedicated arm64 hardware and I'm curious if that's the same thing or a different issue.
It isn't dedicated hardware so I'm not sure how much it is worth; it's an 8 vCPU / 8 GB VM. I re-ran the test, 10 times each, alternating between 1.13 and 1.14, and the difference was actually much smaller:
$ benchstat go113.txt go114.txt
name old time/op new time/op delta
CompressAllocationsSingle/flate-8 437µs ± 3% 469µs ± 1% +7.37% (p=0.000 n=10+9)
CompressAllocationsSingle/gzip-8 381µs ± 2% 435µs ± 3% +14.28% (p=0.000 n=9+10)
@mknyszek I don't know why you changed the title to 'compression benchmark'. This is not related to compression.
Change the benchmark to:
for i := 0; i < b.N; i++ {
	b := make([]byte, 800<<10)
	_, _ = io.Copy(ioutil.Discard, bytes.NewBuffer(b))
}
and the result is:
λ benchcmp 113.txt 114.txt
benchmark old ns/op new ns/op delta
BenchmarkAllocation/flate-32 49657 56362 +13.50%
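For reference, a self-contained version of this simplified benchmark (the package name and the BenchmarkAllocation wrapper are assumptions; the loop body is the one quoted above, with the shadowed b renamed to buf for clarity) might look like:

package bench

import (
	"bytes"
	"io"
	"io/ioutil"
	"testing"
)

// BenchmarkAllocation allocates a fresh 800 KiB buffer per iteration and
// streams it to ioutil.Discard, isolating allocation/GC cost from any
// compression work.
func BenchmarkAllocation(b *testing.B) {
	for i := 0; i < b.N; i++ {
		buf := make([]byte, 800<<10)
		_, _ = io.Copy(ioutil.Discard, bytes.NewBuffer(buf))
	}
}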
@klauspost That's true, thanks for pointing it out. I'll fix it again.
@leitzler Looking at the profiles... it looks like it's exactly the same issue. runtime.tgkill is suddenly at the top of the profile and it comes from signalM, which in turn comes from preemptM. Then it follows the same path up to enlistWorker.
I think the answer here is to just fix #19112 and this will mostly go away.
Putting the help wanted label on preemptively (ha) in case it turns out #19112 is too hard to fix (or I just need help fixing it); in that case we need a different solution.
So I think fixing #19112 is feasible... but it's tricky. Anything we do would basically need an invariant that you can't try to change a G's status while holding sched.lock, because the scan path can acquire it in the general case.
The fix is really simple, I think: make injectglist just casgstatus(_Gwaiting, _Grunnable) for all the goroutines in the list _before_ putting them on the global run queue. AFAICT this is the only place in the whole codebase where we try to change a G's status while holding sched.lock, but maybe this is too restrictive. I'm pretty sure it's also safe to do this without sched.lock, because Gs on a gList are only owned by the list.
The alternative is to have enlistWorker hand off the "wakep" to somebody else (like a "p waker" goroutine or sysmon or something) but the delay there may be too high.
@aclements and @ianlancetaylor, WDYT?
Yet another alternative is to not try to solve this problem and instead avoid having enlistWorker fire so often. This would require a heuristic or something for when more mark workers are actually needed, and would require more thought.
@mknyszek Well, what I think is that someone should review https://golang.org/cl/216198, which seems to already do what you want.
@ianlancetaylor Ah. Well, I totally forgot about that patch. :) Happy to review it.
Change https://golang.org/cl/223797 mentions this issue: runtime: prefer to wake an idle P when enlisting bg mark workers
Should this have Milestone 1.15 and backport to 1.14?
@networkimprov Milestone 1.15 definitely. Backport to 1.14 might be possible, it's a bit risky though (just https://golang.org/cl/223797 isn't actually enough; the change it depends on is larger).
@klauspost I haven't gotten around to verifying this myself yet (some noise problems with the Windows box I have access to), but it would still be good to make sure the change fixes the problem for you (perhaps there's a difference between our Windows systems; and it also looks like maybe you have a less noisy system). Could you please try to run your benchmark with Go at the current master branch and https://golang.org/cl/223797 applied on top?
I will attempt to verify it tomorrow 👍
@klauspost The freeze is coming soon, friendly ping! :) Please give it a try if you get the chance or let me know if you can't.
Sorry @mknyszek. I didn't have a build system set up, so that got me sidetracked.
Unfortunately it doesn't seem to make any measurable difference for me, and the original regression still stands.
Go 1.13.8:
BenchmarkCompressAllocationsSingle/flate-32 147931 81671 ns/op 813699 B/op 16 allocs/op
BenchmarkCompressAllocationsSingle/gzip-32 145420 83511 ns/op 813876 B/op 17 allocs/op
Current master (79395c55e2):
BenchmarkCompressAllocationsSingle/flate-32 116026 109378 ns/op 813701 B/op 16 allocs/op
BenchmarkCompressAllocationsSingle/gzip-32 106546 110225 ns/op 813877 B/op 17 allocs/op
Master with 223797 applied:
BenchmarkCompressAllocationsSingle/flate-32 109663 110618 ns/op 813701 B/op 16 allocs/op
BenchmarkCompressAllocationsSingle/gzip-32 107888 111298 ns/op 813876 B/op 17 allocs/op
Hmmm... I see. Thank you for trying! This needs more investigation, then. That CL is still probably worth landing (if only to close #19112).
Moving milestone to 1.16.
Unfortunately, CL 223797 still has some lock ordering issues, so we've decided it's safer to bump this to 1.17.