On Linux:
$ perflock bash -c 'for i in $(seq 10); do rm -f /tmp/x1 && 9 time go build -o /tmp/x1 cmd/go; done'
1.27u 0.14s 1.10r go build -o /tmp/x1 cmd/go ...
1.23u 0.16s 1.09r go build -o /tmp/x1 cmd/go ...
1.18u 0.17s 1.12r go build -o /tmp/x1 cmd/go ...
1.20u 0.16s 1.11r go build -o /tmp/x1 cmd/go ...
1.22u 0.18s 1.07r go build -o /tmp/x1 cmd/go ...
1.23u 0.13s 1.09r go build -o /tmp/x1 cmd/go ...
1.23u 0.15s 1.08r go build -o /tmp/x1 cmd/go ...
1.25u 0.14s 1.09r go build -o /tmp/x1 cmd/go ...
1.23u 0.14s 1.05r go build -o /tmp/x1 cmd/go ...
1.21u 0.15s 1.08r go build -o /tmp/x1 cmd/go ...
$ perflock bash -c 'for i in $(seq 10); do rm -f /tmp/x2 && 9 time go build -ldflags=-w -o /tmp/x2 cmd/go; done'
0.66u 0.16s 0.63r go build -ldflags=-w -o /tmp/x2 ...
0.72u 0.12s 0.63r go build -ldflags=-w -o /tmp/x2 ...
0.67u 0.13s 0.57r go build -ldflags=-w -o /tmp/x2 ...
0.66u 0.17s 0.57r go build -ldflags=-w -o /tmp/x2 ...
0.67u 0.13s 0.57r go build -ldflags=-w -o /tmp/x2 ...
0.66u 0.16s 0.57r go build -ldflags=-w -o /tmp/x2 ...
0.68u 0.14s 0.57r go build -ldflags=-w -o /tmp/x2 ...
0.66u 0.13s 0.58r go build -ldflags=-w -o /tmp/x2 ...
0.68u 0.13s 0.57r go build -ldflags=-w -o /tmp/x2 ...
0.70u 0.10s 0.59r go build -ldflags=-w -o /tmp/x2 ...
$
Just plain "go list -json cmd/go", which essentially does the prep for a build, takes about 0.12r, so the upper bound on time spent in link here is about 0.45r without DWARF and 0.93r with DWARF. That is, emitting DWARF doubles the amount of time it takes for the linker to execute. We don't pay this cost for "go test" or "go run", because we turn DWARF off for those already.
The Mac shows a similar ~2X slowdown. I thought maybe #12259 was related, but given that Linux is clearly unaffected by that issue, this appears to be a general DWARF problem.
I've put the proposal to turn off DWARF by default (#26074) on hold pending better understanding of where all the time is going (this issue). If we decide that DWARF really is a 2X cost, then we need to go back to #26074 and decide whether that's worth paying on every build when most people make no use of the DWARF info.
/cc @dr2chase @cherrymui @thanm @ianlancetaylor @randall77 @aclements @RLH
As a point of comparison, I ran both a full "go build -a cmd/go" and just the link step to get a sense of the spectrum between a fully non-incremental build and a fully incremental build:
perflock benchcmd -n 10 BuildCmdGoAll sh -c "rm -f /tmp/x1 && go build -o /tmp/x1 -a cmd/go" > dwarf
perflock benchcmd -n 10 BuildCmdGoAll sh -c "rm -f /tmp/x1 && go build -o /tmp/x1 -a -gcflags -dwarf=false -ldflags -w cmd/go" > nodwarf
perflock benchcmd -n 10 LinkCmdGo sh -c "rm -f /tmp/x1 && go build -o /tmp/x1 cmd/go" >> dwarf
perflock benchcmd -n 10 LinkCmdGo sh -c "rm -f /tmp/x1 && go build -o /tmp/x1 -ldflags -w cmd/go" >> nodwarf
benchsave nodwarf dwarf
From my linux/amd64 laptop:
name old time/op new time/op delta
BuildCmdGoAll 10.0s ±11% 10.6s ± 5% +6.41% (p=0.004 n=9+10)
LinkCmdGo 626ms ± 5% 1096ms ± 2% +75.17% (p=0.000 n=10+10)
name old peak-RSS-bytes new peak-RSS-bytes delta
BuildCmdGoAll 163M ± 2% 198M ± 8% +21.33% (p=0.000 n=10+10)
LinkCmdGo 164M ± 2% 192M ± 7% +17.37% (p=0.000 n=10+10)
(https://perf.golang.org/search?q=upload:20180711.1)
This probably explains how we missed it. We were exceedingly careful about compiler performance and kept an eye on overall build performance, but we weren't looking at the effects on the link step in isolation. Perhaps compilebench should include a linker benchmark.
Looking at the profiles, about half of the additional link time comes from compression. An easy thing to try would be to reduce the compression level to favor speed. About a quarter of the time comes from dwarfgeneratedebugsyms (half of that is in writelines, which we may be able to move more of into the compiler, or parallelize in the linker itself). And the other quarter comes from additional time spent in GC (corresponding to the ~20% increase in peak RSS).
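As a rough illustration of the "parallelize in the linker itself" idea, independent output sections could in principle be compressed concurrently. The sketch below is hypothetical: the [][]byte section representation and the compressSections name are invented for illustration and are not cmd/link's actual code.

package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"sync"
)

// compressSections compresses each section in its own goroutine.
func compressSections(sections [][]byte, level int) [][]byte {
	out := make([][]byte, len(sections))
	var wg sync.WaitGroup
	for i, sec := range sections {
		wg.Add(1)
		go func(i int, sec []byte) {
			defer wg.Done()
			var buf bytes.Buffer
			w, err := zlib.NewWriterLevel(&buf, level)
			if err != nil {
				panic(err) // only possible for an invalid level
			}
			w.Write(sec)
			w.Close()
			out[i] = buf.Bytes()
		}(i, sec)
	}
	wg.Wait()
	return out
}

func main() {
	// Fake "sections" standing in for DWARF data.
	secs := [][]byte{
		bytes.Repeat([]byte(".debug_line "), 1<<16),
		bytes.Repeat([]byte(".debug_info "), 1<<16),
	}
	for i, c := range compressSections(secs, zlib.BestSpeed) {
		fmt.Printf("section %d: %d -> %d bytes\n", i, len(secs[i]), len(c))
	}
}

How much this would actually help depends on how much of the DWARF work is independent per section, which the profiles above don't answer by themselves.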
> An easy thing to try would be to reduce the compression level to favor speed.
This is pretty promising. Using zlib.BestSpeed wins back about half of the linker slow-down. Comparing no-DWARF ("old") to BestSpeed compressed DWARF ("new"):
name old time/op new time/op delta
BuildCmdGoAll 10.0s ±11% 10.8s ± 5% +8.03% (p=0.003 n=9+9)
LinkCmdGo 626ms ± 5% 860ms ± 2% +37.36% (p=0.000 n=10+8)
And the impact on binary size is minimal.
config        cmd/go bytes   vs nodwarf
nodwarf           10106953
dwarf             12159049         +20%
dwarf-speed       12408905         +23%
dwarf-nozlib      17766473         +76%
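For a stand-alone feel for the speed/size trade-off between the two levels, here is a small Go sketch (illustrative only, not part of the linker): it compresses whatever file you point it at with zlib.BestSpeed and zlib.DefaultCompression and reports size and elapsed time. Running it on a DWARF-heavy binary such as the /tmp/x1 built above gives a rough idea, though the numbers will not match the linker's exactly.

package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"os"
	"time"
)

func main() {
	// Read the file named on the command line.
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}
	for _, l := range []struct {
		name  string
		level int
	}{
		{"BestSpeed", zlib.BestSpeed},
		{"DefaultCompression", zlib.DefaultCompression},
	} {
		var buf bytes.Buffer
		start := time.Now()
		w, err := zlib.NewWriterLevel(&buf, l.level)
		if err != nil {
			panic(err)
		}
		if _, err := w.Write(data); err != nil {
			panic(err)
		}
		if err := w.Close(); err != nil {
			panic(err)
		}
		fmt.Printf("%-18s %9d bytes  %v\n", l.name, buf.Len(), time.Since(start))
	}
}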
Change https://golang.org/cl/123356 mentions this issue: cmd/link: use BestSpeed for DWARF compression
@aclements benchcmd looks really nice. Where does it come from? Some light googling and GitHub searches didn't turn it up.
Here's a comparison of building vs just linking jujud between Go 1.10 and tip (including BestSpeed DWARF compression):
name old time/op new time/op delta
BuildJujudAll 43.5s ± 0% 41.6s ± 2% -4.29% (p=0.000 n=8+10)
LinkJujud 9.28s ± 1% 11.73s ± 2% +26.36% (p=0.000 n=9+10)
So, we're winning on non-incremental builds, probably because of indexed export info, but the link step in isolation is 26% slower. Not great, though certainly better than 2X or the 40% reported in https://github.com/golang/go/issues/26074#issuecomment-400844992.
Change https://golang.org/cl/126656 mentions this issue: cmd/go: add $GOFLAGS environment variable
Here is what I see today.
> perflock bash -c 'for i in $(seq 10); do rm -f /tmp/x1 && time go build -o /tmp/x1 cmd/go; done'
real 0m0.812s
user 0m0.882s
sys 0m0.204s
real 0m0.767s
user 0m0.862s
sys 0m0.168s
real 0m0.756s
user 0m0.906s
sys 0m0.144s
real 0m0.788s
user 0m0.909s
sys 0m0.153s
real 0m0.735s
user 0m0.854s
sys 0m0.124s
real 0m0.762s
user 0m0.883s
sys 0m0.163s
real 0m0.746s
user 0m0.870s
sys 0m0.115s
real 0m0.767s
user 0m0.832s
sys 0m0.189s
real 0m0.743s
user 0m0.902s
sys 0m0.143s
real 0m0.762s
user 0m0.888s
sys 0m0.145s
> perflock bash -c 'for i in $(seq 10); do rm -f /tmp/x2 && time go build -ldflags=-w -o /tmp/x2 cmd/go; done'
real 0m0.551s
user 0m0.636s
sys 0m0.179s
real 0m0.587s
user 0m0.645s
sys 0m0.152s
real 0m0.576s
user 0m0.747s
sys 0m0.162s
real 0m0.578s
user 0m0.652s
sys 0m0.162s
real 0m0.569s
user 0m0.714s
sys 0m0.133s
real 0m0.545s
user 0m0.666s
sys 0m0.173s
real 0m0.559s
user 0m0.664s
sys 0m0.158s
real 0m0.548s
user 0m0.659s
sys 0m0.200s
real 0m0.561s
user 0m0.668s
sys 0m0.148s
real 0m0.549s
user 0m0.719s
sys 0m0.168s
With DWARF the averages are:
real 0.763s
user 0.878s
sys 0.154s
Without DWARF:
real 0.562s
user 0.677s
sys 0.163s
So today the slowdown from DWARF is around 30%, which is a lot better than the roughly 2X slowdown originally reported.
Looking at the CPU profile when linking cmd/go, I see 4.76% of CPU time in dwarfGenerateDebugInfo, 6.35% in dwarfGenerateDebugSyms, and 12.7% in compressSyms. By way of comparison, runtime.scanObject takes 15.87%.
These numbers seem plausible to me and I think we've answered the question posed by this issue, so closing.