What version of Go are you using (go version)?
λ go version
go version go1.14 windows/amd64
λ go version
go version go1.13.8 windows/amd64
Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (go env)?
go env Output:
λ go env
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\klaus\AppData\Local\go-build
set GOENV=C:\Users\klaus\AppData\Roaming\go\env
set GOEXE=.exe
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=e:\gopath
set GOPRIVATE=
set GOPROXY=https://goproxy.io
set GOROOT=c:\go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=c:\go\pkg\tool\windows_amd64
set GCCGO=gccgo
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=d:\temp\wintemp\go-build155272862=/tmp/go-build -gno-record-gcc-switches
Run benchmark: https://play.golang.org/p/WeuJg6yaOuJ
Tested with go test -bench=. -test.benchtime=10s.
Expected: close or similar benchmark speeds.
Got: a roughly 40% performance regression.
λ benchcmp go113.txt go114.txt
benchmark old ns/op new ns/op delta
BenchmarkCompressAllocationsSingle/flate-32 87026 121741 +39.89%
BenchmarkCompressAllocationsSingle/gzip-32 88654 122632 +38.33%
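The exact benchmark is at the playground link above; as a rough idea of its shape, a minimal sketch of such an allocation-heavy compression benchmark could look like the following (the payload size, compression level, and package name are assumptions for illustration, not the original values):

package bench

import (
	"compress/flate"
	"compress/gzip"
	"io/ioutil"
	"testing"
)

// Assumed payload; the real benchmark uses its own data and size.
var payload = make([]byte, 64<<10)

// Each iteration allocates a brand-new writer (flate's internal tables alone
// account for several hundred KiB), so the measured cost is dominated by
// allocation and GC work rather than by the compression itself.
func BenchmarkCompressAllocationsSingle(b *testing.B) {
	b.Run("flate", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			w, _ := flate.NewWriter(ioutil.Discard, flate.BestSpeed)
			_, _ = w.Write(payload)
			_ = w.Close()
		}
	})
	b.Run("gzip", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			w, _ := gzip.NewWriterLevel(ioutil.Discard, gzip.BestSpeed)
			_, _ = w.Write(payload)
			_ = w.Close()
		}
	})
}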
This is not a purely theoretical benchmark. While suboptimal, this is the easiest way to compress a piece of data, so this pattern will be seen in the wild. It could also indicate a general regression for applications that allocate a lot.
Edit: This is not related to changes in the referenced packages. I see the same thing when using packages outside the stdlib.
Someone suggested removing ReportAllocs. It has an impact, but not a very big one:
λ benchcmp go113-no.txt go114-no.txt
benchmark old ns/op new ns/op delta
BenchmarkCompressAllocationsSingle/flate-32 88082 117794 +33.73%
BenchmarkCompressAllocationsSingle/gzip-32 89279 118946 +33.23%
Turning off DEP seems to reduce the impact, but it is still considerably slower:
λ bcdedit.exe /set {current} nx AlwaysOff, then reboot. Confirm the setting with λ wmic OS Get DataExecutionPrevention_SupportPolicy, which reports 0.
λ benchcmp go113-depoff.txt go114-depoff.txt
benchmark old ns/op new ns/op delta
BenchmarkCompressAllocationsSingle/flate-32 87619 107249 +22.40%
BenchmarkCompressAllocationsSingle/gzip-32 89344 107670 +20.51%
This actually matches my initial measurements before isolating the issue, so I am unsure if this is just some variance based on system state.
cc @mknyszek
/cc @aclements @rsc @randall77
The regression in the benchmark is reproducible on Linux (but to a lesser extent).
Or... hm. I spoke too soon. Five runs show that the benchmark is pretty noisy on Linux, and it's not obvious that there's a regression.
name old time/op new time/op delta
CompressAllocationsSingle/flate-48 182µs ±12% 169µs ± 4% ~ (p=0.151 n=5+5)
CompressAllocationsSingle/gzip-48 170µs ± 7% 184µs ± 9% ~ (p=0.056 n=5+5)
name old alloc/op new alloc/op delta
CompressAllocationsSingle/flate-48 814kB ± 0% 814kB ± 0% +0.00% (p=0.016 n=4+5)
CompressAllocationsSingle/gzip-48 814kB ± 0% 814kB ± 0% +0.00% (p=0.008 n=5+5)
name old allocs/op new allocs/op delta
CompressAllocationsSingle/flate-48 16.0 ± 0% 16.0 ± 0% ~ (all equal)
CompressAllocationsSingle/gzip-48 17.0 ± 0% 17.0 ± 0% ~ (all equal)
That's what I get for drawing conclusions from just one benchmark run. This probably needs many more runs to see a concrete difference, if one exists.
Since this seems much more easily reproducible on Windows, I'm going to try there.
I cannot reproduce this on Linux.
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/leaf/.cache/go-build"
GOENV="/home/leaf/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/leaf/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go-1.14"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go-1.14/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build636896530=/tmp/go-build -gno-record-gcc-switches"
$ ~/go/bin/benchstat bench_1.1{3,4}.txt
name old time/op new time/op delta
CompressAllocationsSingle/flate-8 180µs ± 2% 181µs ± 4% ~ (p=1.000 n=10+10)
CompressAllocationsSingle/gzip-8 183µs ± 7% 204µs ± 1% +11.38% (p=0.000 n=9+10)
Edit: benchmark run with ReportAllocs removed.
While I set up my Windows environment: @klauspost what suggests to you that this is a regression in allocation performance vs. something else?
@mknyszek It is only really doing allocations. The compression part is trivial, but it needs enough content to trigger all the allocations.
Profiles: alloc-pprof.zip
λ go tool pprof cpu113.pprof
Type: cpu
Time: Feb 26, 2020 at 5:05pm (CET)
Duration: 25.46s, Total samples = 1.01mins (238.80%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 39.56s, 65.07% of 60.80s total
Dropped 227 nodes (cum <= 0.30s)
Showing top 10 nodes out of 144
flat flat% sum% cum cum%
11.28s 18.55% 18.55% 14.95s 24.59% runtime.scanobject
5.51s 9.06% 27.62% 5.54s 9.11% runtime.stdcall2
4.86s 7.99% 35.61% 4.86s 7.99% runtime.runGCProg
4.76s 7.83% 43.44% 4.76s 7.83% runtime.memclrNoHeapPointers
3.82s 6.28% 49.72% 3.83s 6.30% runtime.stdcall3
2s 3.29% 53.01% 2.03s 3.34% runtime.runqgrab
1.97s 3.24% 56.25% 1.97s 3.24% runtime.procyield
1.94s 3.19% 59.44% 1.96s 3.22% runtime.(*lfstack).pop
1.86s 3.06% 62.50% 1.90s 3.12% runtime.stdcall1
1.56s 2.57% 65.07% 21.19s 34.85% runtime.gcDrain
λ go tool pprof cpu114.pprof
Type: cpu
Time: Feb 26, 2020 at 5:07pm (CET)
Duration: 26.61s, Total samples = 1.34mins (303.24%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1.07mins, 79.31% of 1.34mins total
Dropped 271 nodes (cum <= 0.01mins)
Showing top 10 nodes out of 95
flat flat% sum% cum cum%
0.54mins 40.51% 40.51% 0.55mins 40.59% runtime.stdcall2
0.17mins 12.93% 53.43% 0.22mins 16.34% runtime.scanobject
0.12mins 8.99% 62.42% 0.12mins 8.99% runtime.memclrNoHeapPointers
0.08mins 5.94% 68.36% 0.08mins 6.00% runtime.stdcall1
0.07mins 4.87% 73.23% 0.07mins 4.87% runtime.runGCProg
0.02mins 1.65% 74.88% 0.02mins 1.65% runtime.procyield
0.02mins 1.28% 76.15% 0.02mins 1.28% runtime.runqgrab
0.02mins 1.17% 77.32% 0.02mins 1.18% runtime.stdcall7
0.01mins 1.00% 78.32% 0.02mins 1.61% runtime.greyobject
0.01mins 0.99% 79.31% 0.03mins 1.90% runtime.findObject
SVGs:
Go 1.13: https://files.klauspost.com/profile002.svg
Go 1.14: https://files.klauspost.com/profile001.svg
FWIW, I ran it on Linux arm64 as well (shared VM, so anything goes), but the results seem consistent:
$ benchcmp go11{3,4}.txt
benchcmp is deprecated in favor of benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
benchmark old ns/op new ns/op delta
BenchmarkCompressAllocationsSingle/flate-8 408292 491127 +20.29%
BenchmarkCompressAllocationsSingle/gzip-8 365706 453792 +24.09%
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/home/leitzler/.cache/go-build"
GOENV="/home/leitzler/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/leitzler/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_arm64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build024325012=/tmp/go-build -gno-record-gcc-switches"
@klauspost Thanks for the profiles! I was about to do this myself.
I disagree that this indicates anything to do with allocation; none of those symbols are involved in allocation. The first allocation-related symbol I see in the top 20 of the diff is right at the bottom (and it's a deleted symbol at that!).
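The signed values in the diff below come from comparing the two CPU profiles; presumably it was produced with pprof's diff mode, along the lines of:
λ go tool pprof -diff_base=cpu113.pprof cpu114.pprof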
(pprof) top 20
Showing nodes accounting for 21.44s, 35.26% of 60.80s total
Dropped 261 nodes (cum <= 0.30s)
Showing top 20 nodes out of 173
flat flat% sum% cum cum%
27.17s 44.69% 44.69% 27.21s 44.75% runtime.stdcall2
-3.74s 6.15% 38.54% -3.75s 6.17% runtime.stdcall3
2.93s 4.82% 43.36% 2.94s 4.84% runtime.stdcall1
2.49s 4.10% 47.45% 2.49s 4.10% runtime.memclrNoHeapPointers
-1.48s 2.43% 45.02% -1.50s 2.47% runtime.(*lfstack).pop
-0.97s 1.60% 43.42% -1s 1.64% runtime.runqgrab
0.94s 1.55% 44.97% 0.95s 1.56% runtime.stdcall7
-0.93s 1.53% 43.44% -0.93s 1.53% runtime.runGCProg
-0.89s 1.46% 41.97% 31.05s 51.07% runtime.gcDrain
-0.88s 1.45% 40.53% 20.40s 33.55% runtime.lock
-0.85s 1.40% 39.13% -1.77s 2.91% runtime.scanobject
-0.64s 1.05% 38.08% -0.64s 1.05% runtime.procyield
-0.55s 0.9% 37.17% -0.53s 0.87% runtime.heapBits.next
0.50s 0.82% 37.99% 0.50s 0.82% runtime.shouldStealTimers (inline)
-0.45s 0.74% 37.25% -0.14s 0.23% runtime.findObject
-0.43s 0.71% 36.55% -1.43s 2.35% runtime.runqsteal
-0.41s 0.67% 35.87% -4.83s 7.94% runtime.findrunnable
-0.38s 0.62% 35.25% -0.39s 0.64% runtime.(*lfstack).push
0.33s 0.54% 35.79% 0.34s 0.56% runtime.scanblock
-0.32s 0.53% 35.26% -0.40s 0.66% runtime.(*treapNode).updateInvariants
What's more interesting to me is the prevalence of stdcall2 in the diff. Peeking at that a few times...
(pprof) peek runtime.stdcall2
Showing nodes accounting for 19.88s, 32.70% of 60.80s total
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
21.65s 79.57% | runtime.semasleep
5.56s 20.43% | runtime.preemptM
27.17s 44.69% 44.69% 27.21s 44.75% | runtime.stdcall2
0.04s 0.15% | runtime.stdcall
----------------------------------------------------------+-------------
(pprof) peek runtime.semasleep
Showing nodes accounting for 19.88s, 32.70% of 60.80s total
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
21.68s 100% | runtime.lock
0.03s 0.049% 0.049% 21.68s 35.66% | runtime.semasleep
21.65s 99.86% | runtime.stdcall2
----------------------------------------------------------+-------------
(pprof) peek runtime.lock
Showing nodes accounting for 19.88s, 32.70% of 60.80s total
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
27.34s 134.02% | runtime.preemptM
-2.33s 11.42% | runtime.(*mheap).alloc_m
-1.57s 7.70% | runtime.findrunnable
-0.79s 3.87% | runtime.stopm
-0.62s 3.04% | runtime.exitsyscallfast_pidle
-0.33s 1.62% | runtime.gcResetMarkState
-0.32s 1.57% | runtime.(*mheap).allocManual
-0.31s 1.52% | runtime.(*mheap).freeSpan.func1
-0.22s 1.08% | runtime.bgscavenge.func2
-0.14s 0.69% | runtime.timerproc
0.14s 0.69% | runtime.goschedImpl
0.13s 0.64% | runtime.(*mheap).allocSpan
-0.11s 0.54% | runtime.startm
-0.07s 0.34% | runtime.addtimer
-0.07s 0.34% | runtime.handoffp
-0.07s 0.34% | runtime.scavengeSleep
-0.06s 0.29% | runtime.gcBgMarkWorker.func2
-0.05s 0.25% | runtime.exitsyscall0
-0.04s 0.2% | runtime.gcstopm
0.04s 0.2% | runtime.gcFlushBgCredit
-0.03s 0.15% | runtime.runSafePointFn
-0.03s 0.15% | runtime.wakeScavenger
0.02s 0.098% | runtime.(*pageAlloc).scavengeOne.func1
-0.02s 0.098% | runtime.addspecial
0.02s 0.098% | runtime.forEachP
-0.02s 0.098% | runtime.acquireSudog
-0.02s 0.098% | runtime.gcSweep
-0.02s 0.098% | runtime.stackcache_clear
-0.02s 0.098% | runtime.stopTheWorldWithSema
0.01s 0.049% | runtime.(*mcentral).cacheSpan
0.01s 0.049% | runtime.freespecial
0.01s 0.049% | runtime.mProf_Free
-0.01s 0.049% | runtime.(*mheap).freeManual
-0.01s 0.049% | runtime.(*mheap).reclaimChunk
-0.01s 0.049% | runtime.nextMarkBitArenaEpoch
-0.01s 0.049% | runtime.scavengeSleep.func1
-0.01s 0.049% | runtime.setprofilebucket
-0.01s 0.049% | runtime.startTheWorldWithSema
0 0% | runtime.(*mcentral).freeSpan
0 0% | runtime.(*mcentral).uncacheSpan
0 0% | runtime.(*mheap).reclaim
0 0% | runtime.mProf_Malloc
0 0% | runtime.schedEnableUser
-0.88s 1.45% 1.45% 20.40s 33.55% | runtime.lock
21.68s 106.27% | runtime.semasleep
-0.58s 2.84% | runtime.procyield
0.14s 0.69% | runtime.osyield
0.03s 0.15% | runtime.semacreate
0.01s 0.049% | runtime.newstack
----------------------------------------------------------+-------------
-0.15s 100% | runtime.checkdead
-0.15s 0.25% 1.69% -0.15s 0.25% | runtime.lockextra
----------------------------------------------------------+-------------
(pprof) peek runtime.preemptM
Showing nodes accounting for 19.88s, 32.70% of 60.80s total
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
38.18s 100% | runtime.preemptone
0.15s 0.25% 0.25% 38.18s 62.80% | runtime.preemptM
27.34s 71.61% | runtime.lock
5.56s 14.56% | runtime.stdcall2
2.92s 7.65% | runtime.stdcall1
1.14s 2.99% | runtime.unlock
0.95s 2.49% | runtime.stdcall7
0.07s 0.18% | runtime.wantAsyncPreempt (inline)
0.03s 0.079% | runtime.gFromTLS
0.02s 0.052% | runtime.isAsyncSafePoint
----------------------------------------------------------+-------------
It looks like a lock in preemptM is either contended or hit very often. I'll try to see exactly what line it's coming from.
@klauspost Your SVGs suggest the same thing (look at the difference between stdcall2 in both) AFAICT.
CC @aclements because asynchronous preemption.
@leitzler Could you share some profiles? I don't readily have access to dedicated arm64 hardware and I'm curious if that's the same thing or a different issue.
@aclements Most of it is coming from os_windows.go:1185 in go1.14. The line is lock(&suspendLock).
Digging deeper with @cherrymui, it looks like most of the preemption requests are coming from runtime.(*gcControllerState).enlistWorker. This benchmark is single-threaded, and enlistWorker only tries to preempt user code instead of taking an idle P. I notice that GOMAXPROCS=32 in the originally reported issue, so it should just take an idle P (since there are lots!) but it doesn't. It preempts user code instead.
There's a comment there referencing #19112, which talks about deadlocks and explains why we can't just take an idle P...
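To make that trade-off concrete, here is a tiny self-contained Go sketch of the decision being described. The identifiers (idlePCount, wakeIdleP, preemptRunningUser, enlistMoreMarkWorkers) are hypothetical stand-ins, not real runtime symbols; only the shape of the choice is taken from the discussion above.

package main

import "fmt"

// Hypothetical stand-ins; none of these are real runtime identifiers.
var idlePCount = 31 // e.g. GOMAXPROCS=32 with a single busy goroutine

func wakeIdleP()          { fmt.Println("wake an idle P to run a mark worker (cheap)") }
func preemptRunningUser() { fmt.Println("preempt user code: suspendLock + SuspendThread on Windows (expensive)") }

// enlistMoreMarkWorkers sketches the choice enlistWorker faces: the cheap path
// (waking an idle P) is effectively off the table because of the deadlock
// concern tracked in #19112, so it falls through to preempting user code.
func enlistMoreMarkWorkers() {
	const idleWakeupEnabled = false // ruled out because of #19112
	if idleWakeupEnabled && idlePCount > 0 {
		wakeIdleP()
		return
	}
	preemptRunningUser()
}

func main() { enlistMoreMarkWorkers() }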
@leitzler Can you please file a new issue? It's almost certainly unrelated to this one, because this is all about Windows-specific code.
@mknyszek I'm no expert on this, but it could be processor-specific AFAICT. I'm running on a Ryzen 3950X here (16c/32t). I will run on my 6c/12t Intel laptop and see if it reproduces as well.
Edit: 6c/12t shows a 24% degradation as well. HOWEVER, running with -cpu=4 it is neutral. I will run a few more tests on my desktop, since my laptop is a bit noisier.
I'll give it another run with profiles later tonight when I get home (in a new issue, depending on how this turns out in the meantime).
The number of threads affects the numbers quite a bit:
commandline: go test -bench=. -test.benchtime=10s -cpu=1,2,4,8,16,32
λ benchcmp 113t.txt 114t.txt
benchmark old ns/op new ns/op delta
BenchmarkCompressAllocationsSingle/flate 83597 65062 -22.17%
BenchmarkCompressAllocationsSingle/flate-2 83767 91519 +9.25%
BenchmarkCompressAllocationsSingle/flate-4 75547 80647 +6.75%
BenchmarkCompressAllocationsSingle/flate-8 79197 84755 +7.02%
BenchmarkCompressAllocationsSingle/flate-16 82064 97530 +18.85%
BenchmarkCompressAllocationsSingle/flate-32 88006 106613 +21.14%
BenchmarkCompressAllocationsSingle/gzip 91624 108324 +18.23%
BenchmarkCompressAllocationsSingle/gzip-2 80929 97885 +20.95%
BenchmarkCompressAllocationsSingle/gzip-4 79115 81557 +3.09%
BenchmarkCompressAllocationsSingle/gzip-8 82221 85995 +4.59%
BenchmarkCompressAllocationsSingle/gzip-16 86477 96064 +11.09%
BenchmarkCompressAllocationsSingle/gzip-32 91024 108272 +18.95%
Even though the test is single-threaded, the total number of threads available has a very big impact.
I am rerunning the benchmark with -test.count=5, but so far the numbers match up. The outliers are not flukes AFAICT.
Edit: To confirm the numbers, here is 5 runs of each:
name old time/op new time/op delta
CompressAllocationsSingle/flate 82.4µs ± 1% 64.5µs ± 0% -21.70% (p=0.008 n=5+5)
CompressAllocationsSingle/flate-2 85.4µs ± 1% 89.6µs ± 2% +4.98% (p=0.008 n=5+5)
CompressAllocationsSingle/flate-4 75.7µs ± 1% 84.7µs ± 1% +11.82% (p=0.008 n=5+5)
CompressAllocationsSingle/flate-8 77.3µs ± 1% 84.7µs ± 0% +9.58% (p=0.008 n=5+5)
CompressAllocationsSingle/flate-16 83.0µs ± 2% 97.4µs ± 2% +17.28% (p=0.008 n=5+5)
CompressAllocationsSingle/flate-32 87.8µs ± 1% 108.7µs ± 0% +23.79% (p=0.008 n=5+5)
CompressAllocationsSingle/gzip 91.2µs ± 0% 107.6µs ± 1% +17.92% (p=0.008 n=5+5)
CompressAllocationsSingle/gzip-2 83.8µs ± 2% 95.5µs ± 5% +13.98% (p=0.008 n=5+5)
CompressAllocationsSingle/gzip-4 79.1µs ± 0% 82.7µs ± 2% +4.47% (p=0.008 n=5+5)
CompressAllocationsSingle/gzip-8 80.3µs ± 1% 87.9µs ± 0% +9.49% (p=0.008 n=5+5)
(The benchmark run terminated early, so the numbers for gzip-16 and gzip-32 should be confirmed as well.)
Yeah, I think this is consistent with our previous analysis. There is no contention if there is just one thread. The more threads there are, the more of them try to preempt each other, and thus the heavier the contention.
@leitzler Could you share some profiles? I don't readily have access to dedicated arm64 hardware and I'm curious if that's the same thing or a different issue.
It isn't dedicated hardware so I'm not sure how much it is worth; it's an 8 vCPU / 8 GB VM. I re-ran the test, 10 times each, alternating between 1.13 and 1.14, and the difference was actually much smaller:
$ benchstat go113.txt go114.txt
name old time/op new time/op delta
CompressAllocationsSingle/flate-8 437µs ± 3% 469µs ± 1% +7.37% (p=0.000 n=10+9)
CompressAllocationsSingle/gzip-8 381µs ± 2% 435µs ± 3% +14.28% (p=0.000 n=9+10)
@mknyszek I don't know why you changed the title to 'compression benchmark'. This is not related to compression.
Change the benchmark to:
for i := 0; i < b.N; i++ {
	b := make([]byte, 800<<10)
	_, _ = io.Copy(ioutil.Discard, bytes.NewBuffer(b))
}
and the result is:
λ benchcmp 113.txt 114.txt
benchmark old ns/op new ns/op delta
BenchmarkAllocation/flate-32 49657 56362 +13.50%
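For reference, a self-contained version of this simplified benchmark (the package name and the BenchmarkAllocation wrapper are assumptions; the loop body is the one quoted above, with the shadowed b renamed to buf for clarity) might look like:

package bench

import (
	"bytes"
	"io"
	"io/ioutil"
	"testing"
)

// BenchmarkAllocation allocates a fresh 800 KiB buffer per iteration and
// streams it to ioutil.Discard, isolating allocation/GC cost from any
// compression work.
func BenchmarkAllocation(b *testing.B) {
	for i := 0; i < b.N; i++ {
		buf := make([]byte, 800<<10)
		_, _ = io.Copy(ioutil.Discard, bytes.NewBuffer(buf))
	}
}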
@klauspost That's true, thanks for pointing it out. I'll fix it again.
@leitzler Looking at the profiles... it looks like it's exactly the same issue. runtime.tgkill is suddenly at the top of the profile and it comes from signalM, which in turn comes from preemptM. Then it follows the same path up to enlistWorker.
I think the answer here is to just fix #19112 and this will mostly go away.
Putting the help wanted label on preemptively (ha) in case it turns out #19112 is too hard to fix (or I just need help fixing it); in that case we need a different solution.
So I think fixing #19112 is feasible... but it's tricky. Anything we do would basically need an invariant that you can't try to change a G's status while holding sched.lock, because the scan path can acquire it in the general case.
The fix is really simple, I think: make injectglist just casgstatus(_Gwaiting, _Grunnable) for all the goroutines in the list _before_ putting them on the global run queue. AFAICT this is the only place in the whole codebase where we try to change a G's status while holding sched.lock, but maybe this is too restrictive. I'm pretty sure it's also safe to do this without sched.lock, because Gs on a gList are only owned by the list.
The alternative is to have enlistWorker hand off the "wakep" to somebody else (like a "p waker" goroutine or sysmon or something) but the delay there may be too high.
@aclements and @ianlancetaylor, WDYT?
Yet another alternative is to not try to solve this problem and instead avoid having enlistWorker fire so often. This would require a heuristic or something for when more mark workers are actually needed, and would require more thought.
@mknyszek Well, what I think is that someone should review https://golang.org/cl/216198, which seems to already do what you want.
@ianlancetaylor Ah. Well, I totally forgot about that patch. :) Happy to review it.
Change https://golang.org/cl/223797 mentions this issue: runtime: prefer to wake an idle P when enlisting bg mark workers
Should this have Milestone 1.15 and backport to 1.14?
@networkimprov Milestone 1.15 definitely. Backport to 1.14 might be possible, it's a bit risky though (just https://golang.org/cl/223797 isn't actually enough; the change it depends on is larger).
@klauspost I haven't gotten around to verifying this myself yet (some noise problems with the Windows box I have access to), but it would still be good to make sure the change fixes the problem for you (perhaps there's a difference between our Windows systems; and it also looks like maybe you have a less noisy system). Could you please try to run your benchmark with Go at the current master branch and https://golang.org/cl/223797 applied on top?
I will attempt to verify it tomorrow 👍
@klauspost The freeze is coming soon, friendly ping! :) Please give it a try if you get the chance or let me know if you can't.
Sorry @mknyszek. I didn't have a build system set up, so that got me sidetracked.
Unfortunately it doesn't seem to make any measurable difference for me, and the original regression still stands.
Go 1.13.8:
BenchmarkCompressAllocationsSingle/flate-32 147931 81671 ns/op 813699 B/op 16 allocs/op
BenchmarkCompressAllocationsSingle/gzip-32 145420 83511 ns/op 813876 B/op 17 allocs/op
Current master (79395c55e2):
BenchmarkCompressAllocationsSingle/flate-32 116026 109378 ns/op 813701 B/op 16 allocs/op
BenchmarkCompressAllocationsSingle/gzip-32 106546 110225 ns/op 813877 B/op 17 allocs/op
Master with 223797 applied:
BenchmarkCompressAllocationsSingle/flate-32 109663 110618 ns/op 813701 B/op 16 allocs/op
BenchmarkCompressAllocationsSingle/gzip-32 107888 111298 ns/op 813876 B/op 17 allocs/op
Hmmm... I see. Thank you for trying! This needs more investigation, then. That CL is still probably worth landing (if only to close #19112).
Moving milestone to 1.16.
Unfortunately, CL 223797 still has some lock ordering issues, so we've decided it's safer to bump this to 1.17.