openbsd-amd64-64 trybots are taking 11+ minutes (which causes TryBots as a whole to take 11+ minutes rather than ~5)
We need to figure out what's slow on them, and/or just shard it out more.
/cc @dmitshur @bcmills @andybons
SELECT Builder, AVG(Seconds) as Sec FROM builds.Builds WHERE IsTry=True AND StartTime > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 100 HOUR) and Repo = "go" AND FailureURL = "" GROUP BY 1 ORDER BY Sec DESC;
Row | Builder | Sec |
-- | -- | --
1 | openbsd-amd64-64 | 579.7221071917958 |
2 | linux-amd64-race | 488.40762166018374 |
3 | nacl-386 | 434.8139809606734 |
4 | windows-amd64-2016 | 424.604860819551 |
5 | nacl-amd64p32 | 418.4696299015715 |
6 | windows-386-2008 | 414.7469431190204 |
7 | js-wasm | 371.9747404238125 |
8 | misc-vet-vetall | 358.80661270393875 |
9 | linux-386 | 353.81094730244905 |
10 | linux-amd64 | 345.036077108898 |
11 | misc-compile | 337.44598333253055 |
12 | misc-compile-mips | 335.70810570520416 |
13 | freebsd-amd64-12_0 | 328.52744295724483 |
14 | misc-compile-openbsd | 293.41003601271416 |
15 | misc-compile-netbsd | 292.8116776015307 |
16 | misc-compile-freebsd | 292.80485985481636 |
17 | misc-compile-nacl | 288.17948818259185 |
18 | misc-compile-plan9 | 273.5849724516735 |
19 | misc-compile-ppc | 251.7265086680816
SELECT Builder, Event, AVG(Seconds) as Sec FROM builds.Spans WHERE Builder LIKE 'openbsd-amd64%' AND Error='' And IsTry=True AND StartTime > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 100 HOUR) and Repo = "go" GROUP BY 1, 2 ORDER BY Sec DESC;
Row | Builder | Event | Sec |
-- | -- | -- | --
1 | openbsd-amd64-64 | make_and_test | 534.0490917608572 |
2 | openbsd-amd64-64 | make | 292.82514376291306 |
3 | openbsd-amd64-64 | run_test:go_test:cmd/go | 150.02306724040426 |
4 | openbsd-amd64-64 | run_test:cgo_test | 88.3467336194681 |
5 | openbsd-amd64-64 | run_test:runtime:cpu124 | 78.4045334334468 |
6 | openbsd-amd64-64 | run_test:go_test:cmd/compile/internal/gc | 63.46040540948936 |
7 | openbsd-amd64-64 | run_test:go_test:net | 62.921281942893614 |
8 | openbsd-amd64-64 | run_test:go_test:runtime | 59.65075333531915 |
9 | openbsd-amd64-64 | run_test:cgo_errors | 39.74668672670213 |
10 | openbsd-amd64-64 | get_helper | 36.28992712104489 |
11 | openbsd-amd64-64 | create_gce_buildlet | 35.651753116401366 |
12 | openbsd-amd64-64 | get_buildlet | 33.989285913102044 |
13 | openbsd-amd64-64 | run_test:go_test:cmd/compile/internal/ssa | 31.60199228663415 |
14 | openbsd-amd64-64 | create_gce_instance | 30.614607733554422 |
15 | openbsd-amd64-64 | run_test:go_test:cmd/vet | 25.966327777999997 |
16 | openbsd-amd64-64 | run_test:go_test:net/http | 23.022429393063828 |
17 | openbsd-amd64-64 | write_snapshot_to_gcs | 19.260539691499996 |
18 | openbsd-amd64-64 | run_test:doc_progs | 16.33576500525532 |
19 | openbsd-amd64-64 | run_test:go_test:runtime/pprof | 13.31036519180851 |
20 | openbsd-amd64-64 | run_test:go_test:reflect | 11.999205834765958 |
21 | openbsd-amd64-64 | run_test:go_test:time | 11.741334522617022 |
22 | openbsd-amd64-64 | run_test:sync_cpu | 11.533254003170732 |
23 | openbsd-amd64-64 | write_snapshot_tar | 10.8773563 |
24 | openbsd-amd64-64 | run_test:go_test:cmd/compile | 10.752162821125 |
25 | openbsd-amd64-64 | run_test:go_test:cmd/fix | 10.435803139355556 |
26 | openbsd-amd64-64 | run_tests_multi | 10.243288797263473 |
27 | openbsd-amd64-64 | run_test:go_test:strings | 10.110943746 |
28 | openbsd-amd64-64 | run_test:go_test:cmd/link | 10.061246428116279 |
29 | openbsd-amd64-64 | run_test:go_test:cmd/link/internal/ld | 10.037645228209302 |
30 | openbsd-amd64-64 | run_test:go_test:go/types | 9.813017223727273 |
31 | openbsd-amd64-64 | run_test:go_test:syscall | 9.579807829382979 |
32 | openbsd-amd64-64 | run_test:go_test:strconv | 9.096669728574469 |
33 | openbsd-amd64-64 | run_test:nolibgcc:net | 8.353404652658536 |
34 | openbsd-amd64-64 | run_test:go_test:os/signal | 8.149411695148936 |
35 | openbsd-amd64-64 | run_test:go_test:os | 8.032081740425532 |
36 | openbsd-amd64-64 | run_test:go_test:math | 7.8759232913157895 |
37 | openbsd-amd64-64 | run_test:go_test:net/http/httptrace | 7.7749100352500005 |
38 | openbsd-amd64-64 | run_test:go_test:math/big | 7.75858940580851 |
39 | openbsd-amd64-64 | run_test:go_test:cmd/internal/obj/x86 | 7.6362746259500005 |
40 | openbsd-amd64-64 | get_source_from_gitmirror | 7.515116951666666 |
41 | openbsd-amd64-64 | get_source | 7.277790430666666 |
42 | openbsd-amd64-64 | run_test:bench_go1 | 7.0304439564893615 |
43 | openbsd-amd64-64 | run_test:moved_goroot | 6.851026539365853 |
44 | openbsd-amd64-64 | run_test:go_test:cmd/nm | 6.5756059088085115 |
45 | openbsd-amd64-64 | run_test:go_test:cmd/cover | 6.451060486723404 |
46 | openbsd-amd64-64 | run_test:go_test:cmd/objdump | 6.444223596553191 |
47 | openbsd-amd64-64 | run_test:go_test:runtime/trace | 6.383058027941177 |
48 | openbsd-amd64-64 | run_test:go_test:testing | 5.998117573319149 |
49 | openbsd-amd64-64 | run_test:go_test:cmd/vendor/github.com/google/pprof/internal/driver | 5.980447624906977 |
50 | openbsd-amd64-64 | run_test:wiki | 5.823946847042554
Wow, just running make.bash (which isn't sharded out over N buildlets) is more than twice as slow as other platforms:
SELECT Builder, Event, AVG(Seconds) as Sec FROM builds.Spans WHERE Event = 'make' AND Error='' And IsTry=True AND StartTime > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 100 HOUR) and Repo = "go" GROUP BY 1, 2 ORDER BY Sec DESC;
Row | Builder | Event | Sec |
-- | -- | -- | --
1 | openbsd-amd64-64 | make | 292.82514376291306 |
2 | nacl-386 | make | 176.9535785543913 |
3 | nacl-amd64p32 | make | 169.24032677876087 |
4 | windows-386-2008 | make | 158.65642536708697 |
5 | windows-amd64-2016 | make | 142.23586712976086 |
6 | js-wasm | make | 137.46539279367394 |
7 | linux-386 | make | 134.50720768395655 |
8 | freebsd-amd64-12_0 | make | 124.52324519041304 |
9 | misc-vet-vetall | make | 124.14415335852175 |
10 | linux-amd64-race | make | 123.95929911093478 |
11 | linux-amd64 | make | 123.54718755441306
Likely suspect: #18314 (use a tmpfs on OpenBSD)
I tried doing the memory filesystem on /tmp/ in an OpenBSD 6.4 amd64 instance (via gomote ssh) and it works, but it's still not any faster.
Still 5 minutes ....
bradfitz@gdev:~/src/golang.org/x/build$ time gomote run user-bradfitz-openbsd-amd64-64-0 go/src/make.bash
Building Go cmd/dist using /tmp/workdir/go1.4.
Building Go toolchain1 using /tmp/workdir/go1.4.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for openbsd/amd64.
---
Installed Go for openbsd/amd64 in /tmp/workdir/go
Installed commands in /tmp/workdir/go/bin
real 5m3.824s
user 0m0.136s
sys 0m0.024s
bradfitz@gdev:~/src/golang.org/x/build$ time gomote run -system user-bradfitz-openbsd-amd64-64-0 mount
/dev/sd0a on / type ffs (local, wxallowed)
mfs:85198 on /tmp type mfs (asynchronous, local, nodev, nosuid, size=2097152 512-blocks)
real 0m0.108s
user 0m0.064s
sys 0m0.044s
bradfitz@gdev:~/src/golang.org/x/build$ time gomote run -system user-bradfitz-openbsd-amd64-64-0 df
Filesystem 512-blocks Used Avail Capacity Mounted on
/dev/sd0a 18153212 1652976 15592576 10% /
mfs:85198 2057756 1656516 298356 85% /tmp
real 0m0.107s
user 0m0.096s
sys 0m0.012s
It sees 4 cores:
buildlet$ sysctl hw.ncpufound
hw.ncpufound=4
buildlet$ sysctl -a | grep cpu
kern.ccpu=1948
hw.ncpu=4
hw.cpuspeed=2500
hw.ncpufound=4
hw.ncpuonline=4
machdep.cpuvendor=GenuineIntel
machdep.cpuid=0x306e4
machdep.cpufeature=0x1f9bfbff
buildlet$ dmesg | grep -i core
cpu0: smt 0, core 0, package 0
cpu1: smt 0, core 1, package 0
cpu2: smt 1, core 0, package 0
cpu3: smt 1, core 1, package 0
The kernel we're running is:
OpenBSD 6.4 (GENERIC.MP) #364: Thu Oct 11 13:30:23 MDT 2018
[email protected]:/usr/src/sys/arch/amd64/compile/GENERIC.MP
Are the Spectre/Meltdown mitigations shutting down SMT? Can we disable that for the builders?
/cc @mdempsky
@bradfitz I think you can try setting "sysctl hw.smt=1" to re-enable hyper threading.
It's already enabled:
$ sysctl -a | grep -i smt
hw.smt=1
So, that's not it. It's crazy that OpenBSD is 2x slower. If it were 10% slower I'd assume, "Oh, OpenBSD prioritizes security over performance" and be fine with that. But 2x makes me think we have a configuration problem somewhere.
Have you tried increasing login.conf limits (as I suggested on twitter)?
Which would you increase? We have:
default:\
:path=/usr/bin /bin /usr/sbin /sbin /usr/X11R6/bin /usr/local/bin /usr/local/sbin:\
:umask=022:\
:datasize-max=768M:\
:datasize-cur=768M:\
:maxproc-max=256:\
:maxproc-cur=128:\
:openfiles-max=1024:\
:openfiles-cur=512:\
:stacksize-cur=4M:\
:localcipher=blowfish,a:\
:tc=auth-defaults:\
:tc=auth-ftp-defaults:
The default settings are low. You could try setting datasize-max/cur and stacksize-cur to "unlimited"
@stmuk Wouldn't the resource limits being too low just cause the build to fail rather than to proceed slowly?
Yeah. The issue is speed, not failure to build.
@mdempsky Maybe but it's easy enough to try.
@bradfitz How does your build work? Do you build bootstrap 1.4 with OpenBSD clang and then compile go? If so do you see the slow down with both steps?
@mdempsky Maybe but it's easy enough to try.
This is all very tedious & slow to work on, so I don't eagerly pursue avenues that don't at least make sense. Maybe if I were really desperate. But given limited time, I'd rather spend it on trying to collect system-wide profiling information or otherwise getting more visibility into the problem, rather than just changing random things.
How does your build work? Do you build bootstrap 1.4 with OpenBSD clang and then compile go? If so do you see the slow down with both steps?
We push a pre-built Go 1.4 to it and use that.
@bradfitz Maybe a first step would be to use cmd/dist's GOBUILDTIMELOGFILE to see if any particular steps are slower, or the whole thing is proportionally slower?
$ GOBUILDTIMELOGFILE=/tmp/buildtime.txt ./make.bash
Building Go cmd/dist using /usr/lib/google-golang.
Building Go toolchain1 using /usr/lib/google-golang.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for linux/amd64.
---
Installed Go for linux/amd64 in /usr/local/google/home/mdempsky/wd/go
Installed commands in /usr/local/google/home/mdempsky/wd/go/bin
$ cat /tmp/buildtime.txt
Fri Jan 18 13:37:01 PST 2019 start make.bash
Fri Jan 18 13:37:03 PST 2019 +2.2s start dist bootstrap
Fri Jan 18 13:37:03 PST 2019 +2.6s build toolchain1
Fri Jan 18 13:37:18 PST 2019 +18.0s build go_bootstrap
Fri Jan 18 13:37:28 PST 2019 +27.9s build toolchain2
Fri Jan 18 13:37:45 PST 2019 +44.1s build toolchain3
Fri Jan 18 13:38:00 PST 2019 +59.9s build toolchain
Fri Jan 18 13:38:11 PST 2019 +70.3s end dist bootstrap
@bradfitz Too many negatives in that for me to parse or to motivate me to try and help further. I just regret wasting my time trying to help.
@stmuk, sorry, I didn't mean to waste your time. But with me and @mdempsky both thinking that such a tweak wouldn't do anything, it's not a high priority of mine to try. I appreciate you throwing it out there, even if it's not the answer. I at least went and read the OpenBSD man pages for those knobs.
@bradfitz You were right, the login.conf limit relaxation made no difference whatsoever.
@mdempsky Running on an i5 Gen 5 VirtualBox host with OEL 7.6 and OpenBSD 6.4 guests under Vagrant, I get the unexpected result of a slightly faster OpenBSD build!
There are different compilers in use to build the 1.4 I bootstrapped tip from. OpenBSD has their patched clang 6, whereas Linux has gcc 4.8.5. OpenBSD has a noatime mount, but otherwise no changes were made.
I'm wondering if we are just seeing differences due to the underlying virtualisation. I may experiment with QEMU and more similar C compilers if I get a chance.
go version devel +5538a9a Fri Jan 18 22:41:47 2019 +0000 linux/amd64
real 3m10.367s
user 2m41.822s
sys 0m14.216s
Sat Jan 19 12:37:19 UTC 2019 start make.bash
Sat Jan 19 12:37:20 UTC 2019 +1.4s start dist bootstrap
Sat Jan 19 12:37:20 UTC 2019 +1.4s build toolchain1
Sat Jan 19 12:37:38 UTC 2019 +19.0s build go_bootstrap
Sat Jan 19 12:37:57 UTC 2019 +38.2s build toolchain2
Sat Jan 19 12:38:57 UTC 2019 +98.2s build toolchain3
Sat Jan 19 12:39:48 UTC 2019 +149.1s build toolchain
Sat Jan 19 12:40:29 UTC 2019 +190.6s end dist bootstrap
go version devel +5538a9a Fri Jan 18 22:41:47 2019 +0000 openbsd/amd64
real 2m41.425s
user 1m55.670s
sys 2m0.150s
Sat Jan 19 04:51:44 PST 2019 start make.bash
Sat Jan 19 04:51:46 PST 2019 +2.2s start dist bootstrap
Sat Jan 19 04:51:46 PST 2019 +2.3s build toolchain1
Sat Jan 19 04:52:07 PST 2019 +23.6s build go_bootstrap
Sat Jan 19 04:52:25 PST 2019 +41.4s build toolchain2
Sat Jan 19 04:53:14 PST 2019 +90.0s build toolchain3
Sat Jan 19 04:53:51 PST 2019 +127.2s build toolchain
Sat Jan 19 04:54:26 PST 2019 +162.1s end dist bootstrap
The problem is not the compiler or the VM software or the FS. I'm the maintainer of BaCon which also runs a big bash script and it's slow as hell. Something happens between bash and the OpenBSD base which makes the bash scripts slow. Maybe something related to the memory protections.
@juanfra684, our bash script is a few lines that just calls into a Go program that does the rest. Our performance issue is not bash related.
You're right, sorry for the misunderstanding.
I've built the go port (only make, no tests) on 6.4 and -current, and there is about a 14% difference:
6.4
real 205.50
user 303.09
sys 138.58
-current
real 186.69
user 300.99
sys 73.94
Recently the malloc code was changed to work better with multithreaded programs. (I reverted the malloc changes and the result is the same.)
OpenBSD doesn't have magic knobs to speed things up, but you could tune a few things to help the bootstrap. Firstly, if the VM host is using flash drives for storage, forget mfs. It's not an equivalent in speed to Linux tmpfs and you can usually run the FS operations faster on a simple consumer-grade SSD.
About the mount options, use noatime, softdep. Linux is using relatime and a log backed FS by default, so the comparison with a plain OpenBSD installation is not fair.
You could also add a few entries to /etc/sysctl.conf:
kern.pool_debug=0: 0 is the default for stable versions, but you could forget it if you're comparing the performance of stable with current.
kern.bufcachepercent=80: the default is 20 and the maximum is 80. The correct percentage depends on how much RAM the system has.
kern.maxvnodes: check what value the system has and increase it. The default value is very conservative. It's the limit of vnodes which the system can maintain in memory.
Firstly, if the VM host is using flash drives for storage, forget mfs. It's not an equivalent in speed to Linux tmpfs and you can usually run the FS operations faster on a simple consumer-grade SSD.
Sounds good. Yes, we're using GCE's SSDs, which are fast.
About the mount options, use noatime, softdep. Linux is using relatime and a log backed FS by default, so the comparison with a plain OpenBSD installation is not fair.
We already had noatime, but not softdep.
kern.bufcachepercent=80: the default is 20 and the maximum is 80. The correct percentage depends on how much RAM the system has.
We had 20.
kern.maxvnodes: check what value the system has and increase it. The default value is very very conservative. It's the limit of vnodes which the system can maintain in memory.
We had kern.maxvnodes=37669 but during a build I see it only go up to kern.numvnodes=25417, so it doesn't appear we're running into that limit.
About the mount options, use noatime, softdep. Linux is using relatime and a log backed FS by default, so the comparison with a plain OpenBSD installation is not fair.
We already had noatime, but not softdep.
One of your commands shows: /dev/sd0a on / type ffs (local, wxallowed). That's why I suggested noatime.
kern.bufcachepercent=80: the default is 20 and the maximum is 80. The correct percentage depends on how much RAM the system has.
We had 20.
If you have a few GB of RAM, try with 80. If the system crashes, select something lower. The reason why OpenBSD is not using 80 by default is that there is a bug somewhere which crashes the system when the value is high. Anyway, most users will never hit that crash.
kern.maxvnodes: check what value the system has and increase it. The default value is very very conservative. It's the limit of vnodes which the system can maintain in memory.
We had kern.maxvnodes=37669 but during a build I see it only go up to kern.numvnodes=25417, so it doesn't appear we're running into that limit.
A better way to check the correct value for maxvnodes is to use systat -s1 during the build:
Namei Sys-cache Proc-cache No-cache
Calls hits % hits % miss %
185 131 71 25 14 29 16
No-cache will tell you if you have enough vnodes or not. Probably kern.maxvnodes=100000 is a safe value for your system.
Anyway, those things will only speed up the process slightly. There is an underlying problem on OpenBSD slowing down the build and tests.
One of your commands shows: /dev/sd0a on / type ffs (local, wxallowed). That's why I suggested noatime.
We remount it when the build daemon starts up:
That runs before any build.
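The snippet referenced there isn't reproduced above, but roughly speaking (a hypothetical sketch, not the actual buildlet code), the remount amounts to shelling out to mount(8) with -u to update the options of the already-mounted root filesystem:
package main

import (
	"log"
	"os"
	"os/exec"
)

// remountRootAsync is a hypothetical sketch (not the actual buildlet code):
// it re-mounts the root filesystem with the async option, which is what the
// remount step described above amounts to. On OpenBSD, "mount -u -o async /"
// updates the options of an already-mounted filesystem.
func remountRootAsync() error {
	cmd := exec.Command("/sbin/mount", "-u", "-o", "async", "/")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	if err := remountRootAsync(); err != nil {
		log.Fatalf("remounting / with async: %v", err)
	}
}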
async is faster than softdep and you can't use both, so forget my suggestion about that.
go version devel +5538a9a Fri Jan 18 22:41:47 2019 +0000 linux/amd64
real 3m10.367s
[...]
sys 0m14.216s
go version devel +5538a9a Fri Jan 18 22:41:47 2019 +0000 openbsd/amd64
real 2m41.425s
[...]
sys 2m0.150s
Despite the odd wallclock result, OpenBSD does seem to be spending nearly 9 times as much time in system calls (2m0.150s vs 0m14.216s of sys time)!
I looked at our historical timing information for the builders for the cmd/compile/internal/gc tests. Turns out the median duration for OpenBSD 6.4 amd64 is 64.5 seconds but linux-amd64 is 20.2s and freebsd-amd64 is 19.6s.
So OpenBSD is 3x slower.
I tried comparing FreeBSD against FreeBSD modified to not use the vdso nanotime and didn't see a difference.
Profiling that test above, on OpenBSD I see:
(pprof) top 30
Showing nodes accounting for 890ms, 100% of 890ms total
Showing top 30 nodes out of 210
flat flat% sum% cum cum%
240ms 26.97% 26.97% 240ms 26.97% runtime.nanotime
90ms 10.11% 37.08% 90ms 10.11% runtime.thrsleep
80ms 8.99% 46.07% 80ms 8.99% runtime.memclrNoHeapPointers
80ms 8.99% 55.06% 80ms 8.99% runtime.osyield
50ms 5.62% 60.67% 60ms 6.74% syscall.Syscall
40ms 4.49% 65.17% 40ms 4.49% runtime.kevent
40ms 4.49% 69.66% 60ms 6.74% runtime.scanobject
20ms 2.25% 71.91% 20ms 2.25% runtime.heapBitsSetType
20ms 2.25% 74.16% 280ms 31.46% runtime.mallocgc
20ms 2.25% 76.40% 20ms 2.25% runtime.memmove
20ms 2.25% 78.65% 20ms 2.25% runtime.nextFreeFast
20ms 2.25% 80.90% 20ms 2.25% runtime.thrwakeup
20ms 2.25% 83.15% 20ms 2.25% runtime.walltime
10ms 1.12% 84.27% 30ms 3.37% go/scanner.(*Scanner).Scan
10ms 1.12% 85.39% 10ms 1.12% internal/poll.(*FD).Write
10ms 1.12% 86.52% 50ms 5.62% reflect.Value.call
10ms 1.12% 87.64% 10ms 1.12% regexp.(*Regexp).tryBacktrack
10ms 1.12% 88.76% 10ms 1.12% runtime.(*gcSweepBuf).push
10ms 1.12% 89.89% 10ms 1.12% runtime.heapBits.next
10ms 1.12% 91.01% 10ms 1.12% runtime.jmpdefer
10ms 1.12% 92.13% 10ms 1.12% runtime.mapaccess2
10ms 1.12% 93.26% 10ms 1.12% runtime.newdefer
10ms 1.12% 94.38% 40ms 4.49% runtime.newobject
10ms 1.12% 95.51% 10ms 1.12% runtime.reentersyscall
10ms 1.12% 96.63% 10ms 1.12% runtime.scanblock
10ms 1.12% 97.75% 10ms 1.12% runtime.spanOf
10ms 1.12% 98.88% 420ms 47.19% runtime.systemstack
10ms 1.12% 100% 10ms 1.12% syscall.Syscall6
0 0% 100% 20ms 2.25% bufio.(*Scanner).Text
0 0% 100% 10ms 1.12% bytes.(*Buffer).Grow
On FreeBSD I see:
(pprof) top 30
Showing nodes accounting for 350ms, 100% of 350ms total
Showing top 30 nodes out of 163
flat flat% sum% cum cum%
70ms 20.00% 20.00% 70ms 20.00% runtime.sys_umtx_op
30ms 8.57% 28.57% 50ms 14.29% runtime.scanobject
20ms 5.71% 34.29% 60ms 17.14% runtime.mallocgc
20ms 5.71% 40.00% 20ms 5.71% runtime.memclrNoHeapPointers
10ms 2.86% 42.86% 20ms 5.71% bufio.(*Scanner).Scan
10ms 2.86% 45.71% 30ms 8.57% cmd/compile/internal/gc.TestShiftGeneric
10ms 2.86% 48.57% 10ms 2.86% compress/flate.(*decompressor).huffmanBlock
10ms 2.86% 51.43% 10ms 2.86% debug/dwarf.(*Reader).Next
10ms 2.86% 54.29% 10ms 2.86% go/token.Token.Precedence
10ms 2.86% 57.14% 10ms 2.86% reflect.(*rtype).Kind
10ms 2.86% 60.00% 10ms 2.86% regexp.(*Regexp).tryBacktrack
10ms 2.86% 62.86% 10ms 2.86% regexp.(*bitState).reset
10ms 2.86% 65.71% 10ms 2.86% regexp.(*inputs).clear
10ms 2.86% 68.57% 10ms 2.86% runtime.(*mTreap).rotateRight
10ms 2.86% 71.43% 10ms 2.86% runtime.(*mspan).objIndex
10ms 2.86% 74.29% 20ms 5.71% runtime.(*mspan).sweep
10ms 2.86% 77.14% 10ms 2.86% runtime.growslice
10ms 2.86% 80.00% 10ms 2.86% runtime.heapBitsForAddr
10ms 2.86% 82.86% 10ms 2.86% runtime.kevent
10ms 2.86% 85.71% 10ms 2.86% runtime.memmove
10ms 2.86% 88.57% 10ms 2.86% runtime.pageIndexOf
10ms 2.86% 91.43% 10ms 2.86% runtime.sigprocmask
10ms 2.86% 94.29% 30ms 8.57% runtime.slicebytetostring
10ms 2.86% 97.14% 10ms 2.86% runtime.spanOfUnchecked
10ms 2.86% 100% 10ms 2.86% syscall.Syscall
0 0% 100% 20ms 5.71% bufio.(*Scanner).Text
0 0% 100% 10ms 2.86% bytes.(*Buffer).ReadFrom
0 0% 100% 60ms 17.14% cmd/compile/internal/gc.TestCode
0 0% 100% 70ms 20.00% cmd/compile/internal/gc.TestIntendedInlining
0 0% 100% 10ms 2.86% cmd/compile/internal/gc.TestIntendedInlining.func1
/cc @aclements
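(Profiles like the two above can be captured with go test's -cpuprofile flag or programmatically via runtime/pprof; a generic sketch, not necessarily how these particular profiles were collected:)
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// Write the CPU profile to cpu.prof; inspect it afterwards with
	// "go tool pprof cpu.prof".
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// ... run the workload to be profiled here ...
}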
OpenBSD spends a lot of time in nanotime and thrsleep (via semasleep), so looking at those, I see:
TEXT runtime·nanotime(SB),NOSPLIT,$24
	MOVQ	CLOCK_MONOTONIC, DI	// arg 1 - clock_id
	LEAQ	8(SP), SI		// arg 2 - tp
	MOVL	$87, AX			// sys_clock_gettime
	SYSCALL
	MOVQ	8(SP), AX		// sec
	MOVQ	16(SP), DX		// nsec
	// sec is in AX, nsec in DX
	// return nsec in AX
	IMULQ	$1000000000, AX
	ADDQ	DX, AX
	MOVQ	AX, ret+0(FP)
	RET
//go:nosplit
func semasleep(ns int64) int32 {
	_g_ := getg()

	// Compute sleep deadline.
	var tsp *timespec
	if ns >= 0 {
		var ts timespec
		var nsec int32
		ns += nanotime()
		ts.set_sec(int64(timediv(ns, 1000000000, &nsec)))
		ts.set_nsec(nsec)
		tsp = &ts
	}
semasleep is calling nanotime which multiplies by 1000000000 just to then divide it by 1000000000?
Probably room for improvement there.
Edit: or not. OpenBSD's __thrsleep(2) says it returns EINVAL if the timespec's nanoseconds value is too large. So the timediv is probably inevitable.
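In rough Go terms, the round trip being discussed looks like this (a sketch of the arithmetic only, not the actual runtime code, which does the split via timediv):
package sketch

// nanotime effectively combines clock_gettime's result into a single
// nanosecond count...
func combine(sec, nsec int64) int64 {
	return sec*1000000000 + nsec
}

// ...and semasleep then has to split the absolute deadline back into a
// timespec whose nsec field is below 1e9, because OpenBSD's __thrsleep(2)
// rejects larger nanosecond values with EINVAL.
func split(deadline int64) (sec int64, nsec int32) {
	return deadline / 1000000000, int32(deadline % 1000000000)
}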
Linux and FreeBSD's nanotime implementations have the same IMULQ $1e9, AX instruction, so I doubt that's why OpenBSD's is so slow.
@bradfitz clock_gettime is an unlocked syscall on OpenBSD 6.4, and it's pretty simple. I'm pretty surprised to see it so high on the table.
The main thing that controls its behavior is the setting of sysctl kern.timecounter. Can you show what that is on the builders?
Might be worth just pasting the entirety of dmesg and sysctl -a while you're at it.
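(A quick way to gauge how expensive the time source is from Go's point of view on a given machine is a minimal benchmark like the following, which bottoms out in runtime.walltime/nanotime; a rough sketch, not output from the builders:)
package bench

// Save as e.g. clock_test.go and run with: go test -bench=Now
// time.Now bottoms out in runtime.walltime and runtime.nanotime, so this
// gives a feel for the cost of the kernel/VM time source.

import (
	"testing"
	"time"
)

func BenchmarkNow(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = time.Now()
	}
}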
Does Go use the system's malloc here, or is it something unrelated? 20ms 2.25% 74.16% 280ms 31.46% runtime.mallocgc.
Some time-related syscalls were modified recently; maybe running the tests on a temporary -current installation would be a good idea.
Go does not use the system malloc. It is completely unrelated.
Does Go use the system's malloc here, or is it something unrelated?
No, runtime.mallocgc is the entry point to Go's garbage-collected heap allocator, which is implemented on top of mmap on OpenBSD (and Go's other POSIX-ish OSes).
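To make the distinction concrete, a tiny standalone illustration: ordinary Go allocations like the ones below are served by the Go runtime's own heap, so changes to OpenBSD's libc malloc can't affect them.
package main

import "fmt"

type node struct {
	next *node
	val  int
}

func main() {
	var head *node
	for i := 0; i < 1000; i++ {
		// Each composite literal here is allocated by runtime.mallocgc on
		// Go's garbage-collected heap (backed by mmap), never by libc malloc.
		head = &node{next: head, val: i}
	}
	fmt.Println(head.val) // 999
}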
@mdempsky, oh right, somebody else had mentioned the time source as a likely problem (https://twitter.com/johansglock/status/1086569264935550977) and referenced this pvclock driver (http://openbsd-archive.7691.n7.nabble.com/pvclock-4-td355443.html)
It's using acpitimer0:
# sysctl kern.timecounter
kern.timecounter.tick=1
kern.timecounter.timestepwarnings=0
kern.timecounter.hardware=acpitimer0
kern.timecounter.choice=i8254(0) tsc(-1000) acpitimer0(1000) dummy(-1000000)
OpenBSD 6.4 (GENERIC.MP) #364: Thu Oct 11 13:30:23 MDT 2018 [email protected]:/usr/src/sys/arch/amd64/compile/GENERIC.MP real mem = 3848261632 (3669MB) avail mem = 3722362880 (3549MB) mpath0 at root scsibus0 at mpath0: 256 targets mainbus0 at root bios0 at mainbus0: SMBIOS rev. 2.4 @ 0xf2a70 (16 entries) bios0: vendor Google version "Google" date 01/01/2011 bios0: Google Google Compute Engine acpi0 at bios0: rev 0 acpi0: sleep states S3 S4 S5 acpi0: tables DSDT FACP SSDT APIC WAET SRAT acpi0: wakeup devices acpitimer0 at acpi0: 3579545 Hz, 24 bits acpimadt0 at acpi0 addr 0xfee00000: PC-AT compat cpu0 at mainbus0: apid 0 (boot processor) cpu0: Intel(R) Xeon(R) CPU @ 2.50GHz, 2500.61 MHz, 06-3e-04 cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,HTT,SSE3,PCLMUL,SSSE3,CX16,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ITSC,FSGSBASE,SMEP,ERMS,IBRS,IBPB,STIBP,SSBD,ARAT,XSAVEOPT,MELTDOWN cpu0: 256KB 64b/line 8-way L2 cache cpu0: smt 0, core 0, package 0 mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges cpu0: apic clock running at 970MHz cpu1 at mainbus0: apid 2 (application processor) cpu1: Intel(R) Xeon(R) CPU @ 2.50GHz, 2424.94 MHz, 06-3e-04 cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,HTT,SSE3,PCLMUL,SSSE3,CX16,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ITSC,FSGSBASE,SMEP,ERMS,IBRS,IBPB,STIBP,SSBD,ARAT,XSAVEOPT,MELTDOWN cpu1: 256KB 64b/line 8-way L2 cache cpu1: smt 0, core 1, package 0 cpu2 at mainbus0: apid 1 (application processor) cpu2: Intel(R) Xeon(R) CPU @ 2.50GHz, 2425.00 MHz, 06-3e-04 cpu2: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,HTT,SSE3,PCLMUL,SSSE3,CX16,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ITSC,FSGSBASE,SMEP,ERMS,IBRS,IBPB,STIBP,SSBD,ARAT,XSAVEOPT,MELTDOWN cpu2: 256KB 64b/line 8-way L2 cache cpu2: smt 1, core 0, package 0 cpu3 at mainbus0: apid 3 (application processor) cpu3: Intel(R) Xeon(R) CPU @ 2.50GHz, 2424.95 MHz, 06-3e-04 cpu3: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,HTT,SSE3,PCLMUL,SSSE3,CX16,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ITSC,FSGSBASE,SMEP,ERMS,IBRS,IBPB,STIBP,SSBD,ARAT,XSAVEOPT,MELTDOWN cpu3: 256KB 64b/line 8-way L2 cache cpu3: smt 1, core 1, package 0 ioapic0 at mainbus0: apid 0 pa 0xfec00000, version 11, 24 pins acpiprt0 at acpi0: bus 0 (PCI0) acpicpu0 at acpi0: C1(@1 halt!) acpicpu1 at acpi0: C1(@1 halt!) acpicpu2 at acpi0: C1(@1 halt!) acpicpu3 at acpi0: C1(@1 halt!) 
"ACPI0006" at acpi0 not configured acpicmos0 at acpi0 "QEMU0001" at acpi0 not configured "ACPI0007" at acpi0 not configured "ACPI0007" at acpi0 not configured "ACPI0007" at acpi0 not configured "ACPI0007" at acpi0 not configured pvbus0 at mainbus0: KVM pci0 at mainbus0 bus 0 pchb0 at pci0 dev 0 function 0 "Intel 82441FX" rev 0x02 pcib0 at pci0 dev 1 function 0 "Intel 82371AB PIIX4 ISA" rev 0x03 piixpm0 at pci0 dev 1 function 3 "Intel 82371AB Power" rev 0x03: SMBus disabled virtio0 at pci0 dev 3 function 0 "Qumranet Virtio SCSI" rev 0x00 vioscsi0 at virtio0: qsize 8192 scsibus1 at vioscsi0: 253 targets sd0 at scsibus1 targ 1 lun 0:SCSI4 0/direct fixed serial.Google_PersistentDisk_ sd0: 10240MB, 512 bytes/sector, 20971520 sectors, thin virtio0: msix shared virtio1 at pci0 dev 4 function 0 "Qumranet Virtio Network" rev 0x00 vio0 at virtio1: address 42:01:0a:f0:00:65 virtio1: msix per-VQ isa0 at pcib0 isadma0 at isa0 com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo com0: console com1 at isa0 port 0x2f8/8 irq 3: ns16550a, 16 byte fifo com2 at isa0 port 0x3e8/8 irq 5: ns16550a, 16 byte fifo pckbc0 at isa0 port 0x60/5 irq 1 irq 12 pckbd0 at pckbc0 (kbd slot) wskbd0 at pckbd0 mux 1 pms0 at pckbc0 (aux slot) wsmouse0 at pms0 mux 0 pcppi0 at isa0 port 0x61 spkr0 at pcppi0 vscsi0 at root scsibus2 at vscsi0: 256 targets softraid0 at root scsibus3 at softraid0: 256 targets root on sd0a (26b0025335694106.a) swap on sd0b dump on sd0b
# sysctl -a
kern.ostype=OpenBSD
kern.osrelease=6.4
kern.osrevision=201811
kern.version=OpenBSD 6.4 (GENERIC.MP) #364: Thu Oct 11 13:30:23 MDT 2018
[email protected]:/usr/src/sys/arch/amd64/compile/GENERIC.MP
kern.maxvnodes=6603
kern.maxproc=1310
kern.maxfiles=7030
kern.argmax=262144
kern.securelevel=1
kern.hostname=buildlet.my.domain
kern.hostid=0
kern.clockrate=tick = 10000, tickadj = 40, hz = 100, profhz = 100, stathz = 100
kern.dnsjackport=0
kern.posix1version=200809
kern.ngroups=16
kern.job_control=1
kern.saved_ids=1
kern.boottime=Tue Jan 29 13:18:23 2019
kern.domainname=
kern.maxpartitions=16
kern.rawpartition=2
kern.maxthread=1950
kern.nthreads=57
kern.osversion=GENERIC.MP#364
kern.somaxconn=128
kern.sominconn=80
kern.nosuidcoredump=1
kern.fsync=1
kern.sysvmsg=1
kern.sysvsem=1
kern.sysvshm=1
kern.msgbufsize=98256
kern.malloc.buckets=16,32,64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288
kern.malloc.bucket.16=(calls = 1038 total_allocated = 512 total_free = 151 elements = 256 high watermark = 1280 could_free = 0)
kern.malloc.bucket.32=(calls = 1634 total_allocated = 256 total_free = 72 elements = 128 high watermark = 640 could_free = 0)
kern.malloc.bucket.64=(calls = 4388 total_allocated = 1728 total_free = 33 elements = 64 high watermark = 320 could_free = 0)
kern.malloc.bucket.128=(calls = 16379 total_allocated = 6272 total_free = 36 elements = 32 high watermark = 160 could_free = 113)
kern.malloc.bucket.256=(calls = 2890 total_allocated = 96 total_free = 21 elements = 16 high watermark = 80 could_free = 0)
kern.malloc.bucket.512=(calls = 538 total_allocated = 96 total_free = 2 elements = 8 high watermark = 40 could_free = 0)
kern.malloc.bucket.1024=(calls = 5248 total_allocated = 4688 total_free = 5 elements = 4 high watermark = 20 could_free = 0)
kern.malloc.bucket.2048=(calls = 44 total_allocated = 22 total_free = 1 elements = 2 high watermark = 10 could_free = 0)
kern.malloc.bucket.4096=(calls = 576 total_allocated = 15 total_free = 1 elements = 1 high watermark = 5 could_free = 0)
kern.malloc.bucket.8192=(calls = 345 total_allocated = 211 total_free = 2 elements = 1 high watermark = 5 could_free = 0)
kern.malloc.bucket.16384=(calls = 196 total_allocated = 3 total_free = 0 elements = 1 high watermark = 5 could_free = 0)
kern.malloc.bucket.32768=(calls = 10 total_allocated = 9 total_free = 0 elements = 1 high watermark = 5 could_free = 0)
kern.malloc.bucket.65536=(calls = 183 total_allocated = 4 total_free = 0 elements = 1 high watermark = 5 could_free = 0)
kern.malloc.bucket.131072=(calls = 2 total_allocated = 2 total_free = 0 elements = 1 high watermark = 5 could_free = 0)
kern.malloc.bucket.262144=(calls = 4 total_allocated = 4 total_free = 0 elements = 1 high watermark = 5 could_free = 0)
kern.malloc.bucket.524288=(calls = 1 total_allocated = 1 total_free = 0 elements = 1 high watermark = 5 could_free = 0)
kern.malloc.kmemnames=free,,devbuf,,pcb,rtable,,,,ifaddr,soopts,sysctl,counters,,ioctlops,,,,,iov,mount,,NFS_req,NFS_mount,,vnodes,namecache,UFS_quota,UFS_mount,shm,VM_map,sem,dirhash,ACPI,VM_pmap,,,,file,file_desc,,proc,subproc,VFS_cluster,,,MFS_node,,,Export_Host,NFS_srvsock,,NFS_daemon,ip_moptions,in_multi,ether_multi,mrt,ISOFS_mount,ISOFS_node,MSDOSFS_mount,MSDOSFS_fat,MSDOSFS_node,ttys,exec,miscfs_mount,fusefs_mount,,,,,,,,,pfkey_data,tdb,xform_data,,pagedep,inodedep,newblk,,,indirdep,,,,,,,,,VM_swap,,,,,,UVM_amap,UVM_aobj,,USB,USB_device,USB_HC,,memdesc,,,crypto_data,,IPsec_creds,,,,emuldata,,,,,,,,,ip6_options,NDP,,,temp,NTFS_mount,NTFS_node,NTFS_fnode,NTFS_dir,NTFS_hash,NTFS_attr,NTFS_data,NTFS_decomp,NTFS_vrun,kqueue,,SYN_cache,UDF_mount,UDF_file_entry,UDF_file_id,,AGP_Memory,DRM
kern.malloc.kmemstat.free=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.devbuf=(inuse = 9526, calls = 10617, memuse = 6381K, limblocks = 0, mapblocks = 0, maxused = 6382K, limit = 78644K, spare = 0, sizes = (16,32,64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144))
kern.malloc.kmemstat.pcb=(inuse = 78, calls = 114, memuse = 17K, limblocks = 0, mapblocks = 0, maxused = 18K, limit = 78644K, spare = 0, sizes = (16,32,128,1024))
kern.malloc.kmemstat.rtable=(inuse = 61, calls = 144, memuse = 2K, limblocks = 0, mapblocks = 0, maxused = 3K, limit = 78644K, spare = 0, sizes = (16,32,64,128,256))
kern.malloc.kmemstat.ifaddr=(inuse = 25, calls = 26, memuse = 8K, limblocks = 0, mapblocks = 0, maxused = 8K, limit = 78644K, spare = 0, sizes = (32,64,128,256,4096))
kern.malloc.kmemstat.soopts=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.sysctl=(inuse = 3, calls = 3, memuse = 2K, limblocks = 0, mapblocks = 0, maxused = 2K, limit = 78644K, spare = 0, sizes = (32,128,1024))
kern.malloc.kmemstat.counters=(inuse = 79, calls = 79, memuse = 67K, limblocks = 0, mapblocks = 0, maxused = 67K, limit = 78644K, spare = 0, sizes = (64,128,256,512,1024,4096,8192))
kern.malloc.kmemstat.ioctlops=(inuse = 0, calls = 1469, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 4K, limit = 78644K, spare = 0, sizes = (256,512,1024,2048,4096))
kern.malloc.kmemstat.iov=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.mount=(inuse = 1, calls = 1, memuse = 1K, limblocks = 0, mapblocks = 0, maxused = 1K, limit = 78644K, spare = 0, sizes = (1024))
kern.malloc.kmemstat.NFS_req=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.NFS_mount=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.vnodes=(inuse = 1169, calls = 1172, memuse = 74K, limblocks = 0, mapblocks = 0, maxused = 74K, limit = 78644K, spare = 0, sizes = (64,128,256))
kern.malloc.kmemstat.namecache=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.UFS_quota=(inuse = 1, calls = 1, memuse = 32K, limblocks = 0, mapblocks = 0, maxused = 32K, limit = 78644K, spare = 0, sizes = (32768))
kern.malloc.kmemstat.UFS_mount=(inuse = 5, calls = 5, memuse = 37K, limblocks = 0, mapblocks = 0, maxused = 37K, limit = 78644K, spare = 0, sizes = (64,512,2048,32768))
kern.malloc.kmemstat.shm=(inuse = 2, calls = 2, memuse = 2K, limblocks = 0, mapblocks = 0, maxused = 2K, limit = 78644K, spare = 0, sizes = (256,1024))
kern.malloc.kmemstat.VM_map=(inuse = 2, calls = 2, memuse = 1K, limblocks = 0, mapblocks = 0, maxused = 1K, limit = 78644K, spare = 0, sizes = (256))
kern.malloc.kmemstat.sem=(inuse = 2, calls = 2, memuse = 1K, limblocks = 0, mapblocks = 0, maxused = 1K, limit = 78644K, spare = 0, sizes = (32,128))
kern.malloc.kmemstat.dirhash=(inuse = 18, calls = 45, memuse = 4K, limblocks = 0, mapblocks = 0, maxused = 4K, limit = 78644K, spare = 0, sizes = (16,32,64,128,512))
kern.malloc.kmemstat.ACPI=(inuse = 1822, calls = 12774, memuse = 199K, limblocks = 0, mapblocks = 0, maxused = 292K, limit = 78644K, spare = 0, sizes = (16,32,64,128,2048))
kern.malloc.kmemstat.VM_pmap=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.file=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.file_desc=(inuse = 1, calls = 2, memuse = 1K, limblocks = 0, mapblocks = 0, maxused = 1K, limit = 78644K, spare = 0, sizes = (512))
kern.malloc.kmemstat.proc=(inuse = 61, calls = 335, memuse = 48K, limblocks = 0, mapblocks = 0, maxused = 72K, limit = 78644K, spare = 0, sizes = (16,64,1024,4096,8192))
kern.malloc.kmemstat.subproc=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.VFS_cluster=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.MFS_node=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.Export_Host=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.NFS_srvsock=(inuse = 1, calls = 1, memuse = 1K, limblocks = 0, mapblocks = 0, maxused = 1K, limit = 78644K, spare = 0, sizes = (128))
kern.malloc.kmemstat.NFS_daemon=(inuse = 1, calls = 1, memuse = 16K, limblocks = 0, mapblocks = 0, maxused = 16K, limit = 78644K, spare = 0, sizes = (16384))
kern.malloc.kmemstat.ip_moptions=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.in_multi=(inuse = 11, calls = 11, memuse = 1K, limblocks = 0, mapblocks = 0, maxused = 1K, limit = 78644K, spare = 0, sizes = (32,64,128))
kern.malloc.kmemstat.ether_multi=(inuse = 1, calls = 1, memuse = 1K, limblocks = 0, mapblocks = 0, maxused = 1K, limit = 78644K, spare = 0, sizes = (32))
kern.malloc.kmemstat.mrt=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.ISOFS_mount=(inuse = 1, calls = 1, memuse = 32K, limblocks = 0, mapblocks = 0, maxused = 32K, limit = 78644K, spare = 0, sizes = (32768))
kern.malloc.kmemstat.ISOFS_node=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.MSDOSFS_mount=(inuse = 1, calls = 1, memuse = 16K, limblocks = 0, mapblocks = 0, maxused = 16K, limit = 78644K, spare = 0, sizes = (16384))
kern.malloc.kmemstat.MSDOSFS_fat=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.MSDOSFS_node=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.ttys=(inuse = 390, calls = 390, memuse = 1723K, limblocks = 0, mapblocks = 0, maxused = 1723K, limit = 78644K, spare = 0, sizes = (512,1024,8192))
kern.malloc.kmemstat.exec=(inuse = 0, calls = 300, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 2K, limit = 78644K, spare = 0, sizes = (16,32,256,1024))
kern.malloc.kmemstat.miscfs_mount=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.fusefs_mount=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.pfkey_data=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.tdb=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.xform_data=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.pagedep=(inuse = 1, calls = 1, memuse = 8K, limblocks = 0, mapblocks = 0, maxused = 8K, limit = 78644K, spare = 0, sizes = (8192))
kern.malloc.kmemstat.inodedep=(inuse = 1, calls = 1, memuse = 32K, limblocks = 0, mapblocks = 0, maxused = 32K, limit = 78644K, spare = 0, sizes = (32768))
kern.malloc.kmemstat.newblk=(inuse = 1, calls = 1, memuse = 1K, limblocks = 0, mapblocks = 0, maxused = 1K, limit = 78644K, spare = 0, sizes = (512))
kern.malloc.kmemstat.indirdep=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.VM_swap=(inuse = 7, calls = 7, memuse = 75K, limblocks = 0, mapblocks = 0, maxused = 75K, limit = 78644K, spare = 0, sizes = (16,64,2048,32768,65536))
kern.malloc.kmemstat.UVM_amap=(inuse = 257, calls = 1720, memuse = 21K, limblocks = 0, mapblocks = 0, maxused = 21K, limit = 78644K, spare = 0, sizes = (16,32,64,128,256,8192))
kern.malloc.kmemstat.UVM_aobj=(inuse = 2, calls = 2, memuse = 3K, limblocks = 0, mapblocks = 0, maxused = 3K, limit = 78644K, spare = 0, sizes = (16,2048))
kern.malloc.kmemstat.USB=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.USB_device=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.USB_HC=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.memdesc=(inuse = 1, calls = 1, memuse = 4K, limblocks = 0, mapblocks = 0, maxused = 4K, limit = 78644K, spare = 0, sizes = (4096))
kern.malloc.kmemstat.crypto_data=(inuse = 1, calls = 1, memuse = 1K, limblocks = 0, mapblocks = 0, maxused = 1K, limit = 78644K, spare = 0, sizes = (1024))
kern.malloc.kmemstat.IPsec_creds=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.emuldata=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.ip6_options=(inuse = 1, calls = 1, memuse = 1K, limblocks = 0, mapblocks = 0, maxused = 1K, limit = 78644K, spare = 0, sizes = (128))
kern.malloc.kmemstat.NDP=(inuse = 4, calls = 4, memuse = 1K, limblocks = 0, mapblocks = 0, maxused = 1K, limit = 78644K, spare = 0, sizes = (32))
kern.malloc.kmemstat.temp=(inuse = 55, calls = 4236, memuse = 2321K, limblocks = 0, mapblocks = 0, maxused = 2447K, limit = 78644K, spare = 0, sizes = (16,32,64,128,512,1024,2048,4096,8192,16384,65536,524288))
kern.malloc.kmemstat.NTFS_mount=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.NTFS_node=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.NTFS_fnode=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.NTFS_dir=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.NTFS_hash=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.NTFS_attr=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.NTFS_data=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.NTFS_decomp=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.NTFS_vrun=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.kqueue=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.SYN_cache=(inuse = 2, calls = 2, memuse = 16K, limblocks = 0, mapblocks = 0, maxused = 16K, limit = 78644K, spare = 0, sizes = (8192))
kern.malloc.kmemstat.UDF_mount=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.UDF_file_entry=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.UDF_file_id=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.AGP_Memory=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.malloc.kmemstat.DRM=(inuse = 0, calls = 0, memuse = 0K, limblocks = 0, mapblocks = 0, maxused = 0K, limit = 78644K, spare = 0, sizes = (none))
kern.cp_time=36,0,131,28,18,11730
kern.nchstats.good_hits=18258
kern.nchstats.negative_hits=732
kern.nchstats.bad_hits=1600
kern.nchstats.false_hits=0
kern.nchstats.misses=5831
kern.nchstats.long_names=0
kern.nchstats.pass2=190
kern.nchstats.2passes=284
kern.nchstats.ncs_revhits=9
kern.nchstats.ncs_revmiss=0
kern.nchstats.ncs_dothits=0
kern.nchstats.nch_dotdothits=0
kern.forkstat.forks=325
kern.forkstat.vforks=0
kern.forkstat.tforks=9
kern.forkstat.kthreads=17
kern.forkstat.fork_pages=51946
kern.forkstat.vfork_pages=0
kern.forkstat.tfork_pages=0
kern.forkstat.kthread_pages=0
kern.nselcoll=0
kern.tty.tk_nin=782
kern.tty.tk_nout=32356
kern.tty.tk_rawcc=777
kern.tty.tk_cancc=5
kern.ccpu=1948
kern.fscale=2048
kern.nprocs=50
kern.stackgap_random=262144
kern.allowkmem=0
kern.splassert=1
kern.nfiles=131
kern.ttycount=65
kern.numvnodes=3046
kern.seminfo.semmni=10
kern.seminfo.semmns=60
kern.seminfo.semmnu=30
kern.seminfo.semmsl=60
kern.seminfo.semopm=100
kern.seminfo.semume=10
kern.seminfo.semusz=112
kern.seminfo.semvmx=32767
kern.seminfo.semaem=16384
kern.shminfo.shmmax=33554432
kern.shminfo.shmmin=1
kern.shminfo.shmmni=128
kern.shminfo.shmseg=128
kern.shminfo.shmall=8192
kern.maxclusters=262144
kern.timecounter.tick=1
kern.timecounter.timestepwarnings=0
kern.timecounter.hardware=acpitimer0
kern.timecounter.choice=i8254(0) tsc(-1000) acpitimer0(1000) dummy(-1000000)
kern.maxlocksperuid=1024
kern.bufcachepercent=20
kern.wxabort=0
kern.consdev=tty00
kern.netlivelocks=2
kern.pool_debug=0
kern.global_ptrace=0
kern.audio.record=0
vm.loadavg=0.02 0.02 0.00
vm.psstrings=0x7f7ffffedf70
vm.swapencrypt.enable=1
vm.swapencrypt.keyscreated=0
vm.swapencrypt.keysdeleted=0
vm.nkmempages=32768
vm.anonmin=10
vm.vtextmin=5
vm.vnodemin=10
fs.posix.setuid=1
net.inet.ip.forwarding=0
net.inet.ip.redirect=1
net.inet.ip.ttl=64
net.inet.ip.sourceroute=0
net.inet.ip.directed-broadcast=0
net.inet.ip.portfirst=1024
net.inet.ip.portlast=49151
net.inet.ip.porthifirst=49152
net.inet.ip.porthilast=65535
net.inet.ip.maxqueue=300
net.inet.ip.encdebug=0
net.inet.ip.ipsec-expire-acquire=30
net.inet.ip.ipsec-invalid-life=60
net.inet.ip.ipsec-pfs=1
net.inet.ip.ipsec-soft-allocs=0
net.inet.ip.ipsec-allocs=0
net.inet.ip.ipsec-soft-bytes=0
net.inet.ip.ipsec-bytes=0
net.inet.ip.ipsec-timeout=86400
net.inet.ip.ipsec-soft-timeout=80000
net.inet.ip.ipsec-soft-firstuse=3600
net.inet.ip.ipsec-firstuse=7200
net.inet.ip.ipsec-enc-alg=aes
net.inet.ip.ipsec-auth-alg=hmac-sha1
net.inet.ip.mtudisc=1
net.inet.ip.mtudisctimeout=600
net.inet.ip.ipsec-comp-alg=deflate
net.inet.ip.ifq.len=0
net.inet.ip.ifq.maxlen=2048
net.inet.ip.ifq.drops=0
net.inet.ip.mforwarding=0
net.inet.ip.multipath=0
net.inet.ip.mrtproto=19
net.inet.ip.arpqueued=0
net.inet.ip.arptimeout=1200
net.inet.ip.arpdown=20
net.inet.icmp.maskrepl=0
net.inet.icmp.bmcastecho=0
net.inet.icmp.errppslimit=100
net.inet.icmp.rediraccept=0
net.inet.icmp.redirtimeout=600
net.inet.icmp.tstamprepl=1
net.inet.ipip.allow=0
net.inet.tcp.rfc1323=1
net.inet.tcp.keepinittime=150
net.inet.tcp.keepidle=14400
net.inet.tcp.keepintvl=150
net.inet.tcp.slowhz=2
net.inet.tcp.baddynamic=1,7,9,11,13,15,17,18,19,20,21,22,23,25,37,42,43,49,53,57,67,68,70,77,79,80,87,88,95,101,102,103,104,105,106,107,109,110,111,113,115,117,119,123,129,135,137,138,139,143,152,163,164,177,178,179,191,194,199,201,202,204,206,210,213,220,372,389,427,433,443,444,445,464,465,468,512,513,514,515,521,526,530,531,532,540,543,544,545,548,554,556,587,631,636,646,706,749,750,751,754,760,871,873,888,901,993,995,1080,1109,1127,1433,1434,1524,1525,1529,1723,1900,2049,2105,2106,2108,2110,2111,2112,2120,2121,2401,2600,2601,2602,2603,2604,2605,2606,2607,2608,2627,2983,3031,3109,3260,3306,3389,3517,3689,3690,4190,4444,4500,4559,5002,5060,5222,5269,5280,5298,5353,5354,5432,5680,5900,6000,6001,6002,6003,6004,6005,6006,6007,6008,6009,6010,6514,6566,7000,7001,7002,7003,7004,7005,7006,7007,7008,7009,7326,8025,8026,8140,8953,9418,10050,10051,16992,16993,16994,16995,20005
net.inet.tcp.sack=1
net.inet.tcp.mssdflt=512
net.inet.tcp.rstppslimit=100
net.inet.tcp.ackonpush=0
net.inet.tcp.ecn=0
net.inet.tcp.syncachelimit=10255
net.inet.tcp.synbucketlimit=105
net.inet.tcp.rfc3390=2
net.inet.tcp.reasslimit=32768
net.inet.tcp.sackholelimit=32768
net.inet.tcp.always_keepalive=0
net.inet.tcp.synuselimit=100000
net.inet.tcp.rootonly=2049
net.inet.tcp.synhashsize=293
net.inet.udp.checksum=1
net.inet.udp.baddynamic=7,9,13,18,19,22,37,39,49,53,67,68,69,70,80,88,105,107,109,110,111,123,129,135,137,138,139,143,161,162,163,164,177,178,179,191,194,199,201,202,204,206,210,213,220,372,389,427,444,445,464,468,500,512,513,514,517,518,520,525,533,546,547,548,554,587,623,631,636,646,664,706,749,750,751,993,995,1433,1434,1524,1525,1645,1646,1701,1723,1812,1813,1900,2049,2401,3031,3517,3689,3784,3785,4190,4444,4500,4559,4754,4755,4789,5002,5060,5298,5353,5354,5432,7000,7001,7002,7003,7004,7005,7006,7007,7008,7009,7784,8025,8067,9418,10050,10051,16992,16993,16994,16995,20005,26740
net.inet.udp.recvspace=41600
net.inet.udp.sendspace=9216
net.inet.udp.rootonly=2049
net.inet.gre.allow=0
net.inet.gre.wccp=0
net.inet.esp.enable=1
net.inet.esp.udpencap=1
net.inet.esp.udpencap_port=4500
net.inet.ah.enable=1
net.inet.mobileip.allow=0
net.inet.etherip.allow=0
net.inet.ipcomp.enable=0
net.inet.carp.allow=1
net.inet.carp.preempt=0
net.inet.carp.log=2
net.inet.divert.recvspace=65636
net.inet.divert.sendspace=65636
net.inet6.ip6.forwarding=0
net.inet6.ip6.redirect=1
net.inet6.ip6.hlim=64
net.inet6.ip6.mrtproto=0
net.inet6.ip6.maxfragpackets=200
net.inet6.ip6.log_interval=5
net.inet6.ip6.hdrnestlimit=10
net.inet6.ip6.dad_count=1
net.inet6.ip6.auto_flowlabel=1
net.inet6.ip6.defmcasthlim=1
net.inet6.ip6.use_deprecated=1
net.inet6.ip6.maxfrags=200
net.inet6.ip6.mforwarding=0
net.inet6.ip6.multipath=0
net.inet6.ip6.multicast_mtudisc=0
net.inet6.ip6.neighborgcthresh=2048
net.inet6.ip6.maxdynroutes=4096
net.inet6.ip6.dad_pending=0
net.inet6.ip6.mtudisctimeout=600
net.inet6.ip6.ifq.len=0
net.inet6.ip6.ifq.maxlen=2048
net.inet6.ip6.ifq.drops=0
net.inet6.ip6.soiikey=7fb7b12d4cfa145448ade20fef62b2e2
net.inet6.icmp6.redirtimeout=600
net.inet6.icmp6.nd6_delay=5
net.inet6.icmp6.nd6_umaxtries=3
net.inet6.icmp6.nd6_mmaxtries=3
net.inet6.icmp6.errppslimit=100
net.inet6.icmp6.nd6_maxnudhint=0
net.inet6.icmp6.mtudisc_hiwat=1280
net.inet6.icmp6.mtudisc_lowat=256
net.inet6.icmp6.nd6_debug=0
net.inet6.divert.recvspace=65636
net.inet6.divert.sendspace=65636
net.bpf.bufsize=32768
net.bpf.maxbufsize=2097152
net.mpls.ttl=255
net.mpls.maxloop_inkernel=16
net.mpls.mapttl_ip=1
net.mpls.mapttl_ip6=0
net.pipex.enable=0
net.pipex.inq.len=0
net.pipex.inq.maxlen=256
net.pipex.inq.drops=0
net.pipex.outq.len=0
net.pipex.outq.maxlen=256
net.pipex.outq.drops=0
hw.machine=amd64
hw.model=Intel(R) Xeon(R) CPU @ 2.50GHz
hw.ncpu=4
hw.byteorder=1234
hw.pagesize=4096
hw.disknames=sd0:26b0025335694106
hw.diskcount=1
hw.cpuspeed=2500
hw.vendor=Google
hw.product=Google Compute Engine
hw.serialno=GoogleCloud-48FCB60BD20B13009CC205380F0B84CB
hw.uuid=48fcb60b-d20b-1300-9cc2-05380f0b84cb
hw.physmem=3848261632
hw.usermem=3848249344
hw.ncpufound=4
hw.allowpowerdown=1
hw.smt=1
hw.ncpuonline=4
machdep.console_device=tty00
machdep.bios.diskinfo.128=bootdev = 0xa0000204, cylinders = 1023, heads = 255, sectors = 63
machdep.bios.cksumlen=1
machdep.allowaperture=0
machdep.cpuvendor=GenuineIntel
machdep.cpuid=0x306e4
machdep.cpufeature=0x1f9bfbff
machdep.kbdreset=0
machdep.xcrypt=0
machdep.lidaction=1
machdep.forceukbd=0
machdep.tscfreq=2500608250
machdep.invarianttsc=1
ddb.radix=16
ddb.max_width=80
ddb.max_line=24
ddb.tab_stop_width=8
ddb.panic=1
ddb.console=0
ddb.log=1
ddb.trigger=0
vfs.mounts.ffs has 1 mounted instance
vfs.ffs.max_softdeps=23704
vfs.ffs.sd_tickdelay=2
vfs.ffs.sd_worklist_push=0
vfs.ffs.sd_blk_limit_push=0
vfs.ffs.sd_ino_limit_push=0
vfs.ffs.sd_blk_limit_hit=0
vfs.ffs.sd_ino_limit_hit=0
vfs.ffs.sd_sync_limit_hit=0
vfs.ffs.sd_indir_blk_ptrs=0
vfs.ffs.sd_inode_bitmap=0
vfs.ffs.sd_direct_blk_ptrs=0
vfs.ffs.sd_dir_entry=0
vfs.ffs.dirhash_dirsize=2560
vfs.ffs.dirhash_maxmem=2097152
vfs.ffs.dirhash_mem=23943
vfs.nfs.iothreads=-1
vfs.fuse.fusefs_open_devices=0
vfs.fuse.fusefs_fbufs_in=0
vfs.fuse.fusefs_fbufs_wait=0
vfs.fuse.fusefs_pool_pages=0
@bradfitz Yeah, that seems very plausible to me. It looks like pvclock(4) has been committed upstream: http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/dev/pv/pvclock.c
So it might actually be worth trying a -current snapshot to see if Go builds faster.
@mdempsky, could you send a CL to x/build/env/openbsd-amd64/make.bash that pulls from snapshots or whatever?
I can then deploy a new image based on it.
I ran the build on -current changing only the timecounter entry:
real 190.79
user 307.62
sys 88.26
acpihpet0
real 215.19
user 305.29
sys 150.30
@bradfitz Looks like @dmitshur already added a SNAPSHOT=false variable to make.bash that you just need to hack to SNAPSHOT=true.
@juanfra684 Thanks, that's helpful. And I infer that's inside a KVM guest (since I don't think pvclock is usable otherwise)?
It's disappointing, though, that it only improved real time by ~10%. It seems like there's still more slowdown to identify.
Here are my results on a Celeron 2955U (dual-core 1.4GHz Haswell) running an Ubuntu 18.04.1 host with an OpenBSD 6.4 VM under KVM (my kern.timecounter.hardware=acpihpet0):
GENERIC.MP
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Celeron(R) 2955U @ 1.40GHz, 112.48 MHz, 06-45-01
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,SSE3,PCLMUL,VMX,SSSE3,CX16,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,XSAVE,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,PERF,FSGSBASE,ERMS,INVPCID,IBRS,IBPB,SSBD,ARAT,XSAVEOPT,MELTDOWN
cpu0: 64KB 64b/line 2-way I-cache, 64KB 64b/line 2-way D-cache, 512KB 64b/line 16-way L2 cache
cpu0: ITLB 255 4KB entries direct-mapped, 255 4MB entries direct-mapped
cpu0: DTLB 255 4KB entries direct-mapped, 255 4MB entries direct-mapped
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges
cpu0: apic clock running at 999MHz
cpu1 at mainbus0: apid 1 (application processor)
cpu1: Intel(R) Celeron(R) 2955U @ 1.40GHz, 176.06 MHz, 06-45-01
cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,SSE3,PCLMUL,VMX,SSSE3,CX16,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,XSAVE,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,PERF,FSGSBASE,ERMS,INVPCID,IBRS,IBPB,SSBD,ARAT,XSAVEOPT,MELTDOWN
cpu1: 64KB 64b/line 2-way I-cache, 64KB 64b/line 2-way D-cache, 512KB 64b/line 16-way L2 cache
cpu1: ITLB 255 4KB entries direct-mapped, 255 4MB entries direct-mapped
cpu1: DTLB 255 4KB entries direct-mapped, 255 4MB entries direct-mapped
cpu1: smt 0, core 0, package 1
Duration: 1.10mins, Total samples = 4.73s ( 7.17%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top30
Showing nodes accounting for 3.97s, 83.93% of 4.73s total
Dropped 224 nodes (cum <= 0.02s)
Showing top 30 nodes out of 187
flat flat% sum% cum cum%
0.80s 16.91% 16.91% 0.80s 16.91% runtime.thrsleep
0.73s 15.43% 32.35% 0.73s 15.43% runtime.usleep
0.53s 11.21% 43.55% 0.53s 11.21% runtime.nanotime
0.34s 7.19% 50.74% 0.34s 7.19% runtime.kevent
0.31s 6.55% 57.29% 0.31s 6.55% runtime.memclrNoHeapPointers
0.26s 5.50% 62.79% 0.29s 6.13% syscall.Syscall
0.18s 3.81% 66.60% 0.18s 3.81% runtime.thrwakeup
0.15s 3.17% 69.77% 0.18s 3.81% runtime.scanobject
0.09s 1.90% 71.67% 0.09s 1.90% runtime.nextFreeFast
0.07s 1.48% 73.15% 0.09s 1.90% runtime.lock
0.07s 1.48% 74.63% 0.64s 13.53% runtime.mallocgc
0.07s 1.48% 76.11% 0.07s 1.48% runtime.memmove
0.05s 1.06% 77.17% 0.89s 18.82% runtime.sysmon
0.04s 0.85% 78.01% 0.04s 0.85% indexbytebody
0.04s 0.85% 78.86% 0.05s 1.06% runtime.scanblock
0.03s 0.63% 79.49% 0.03s 0.63% runtime.(*guintptr).cas
0.03s 0.63% 80.13% 0.33s 6.98% runtime.sweepone
0.02s 0.42% 80.55% 0.08s 1.69% go/parser.(*parser).next0
0.02s 0.42% 80.97% 0.03s 0.63% regexp.(*Regexp).tryBacktrack
0.02s 0.42% 81.40% 0.09s 1.90% runtime.exitsyscall
0.02s 0.42% 81.82% 0.47s 9.94% runtime.timerproc
0.02s 0.42% 82.24% 0.03s 0.63% syscall.Syscall6
0.01s 0.21% 82.45% 0.05s 1.06% bufio.(*Scanner).Scan
0.01s 0.21% 82.66% 0.12s 2.54% bytes.(*Buffer).ReadFrom
0.01s 0.21% 82.88% 0.07s 1.48% cmd/compile/internal/gc.TestShiftGeneric
0.01s 0.21% 83.09% 0.07s 1.48% debug/dwarf.(*buf).entry
0.01s 0.21% 83.30% 0.36s 7.61% go/parser.(*parser).parseFile
0.01s 0.21% 83.51% 0.21s 4.44% go/parser.(*parser).parseLiteralValue
0.01s 0.21% 83.72% 0.05s 1.06% go/parser.(*parser).parseParameterList
0.01s 0.21% 83.93% 0.22s 4.65% go/parser.(*parser).parseUnaryExpr
GENERIC.MP_NO_MELTDOWN
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Celeron(R) 2955U @ 1.40GHz, 51.83 MHz, 06-45-01
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,SSE3,PCLMUL,VMX,SSSE3,CX16,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,XSAVE,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,PERF,FSGSBASE,ERMS,INVPCID,IBRS,IBPB,SSBD,ARAT,XSAVEOPT
cpu0: 64KB 64b/line 2-way I-cache, 64KB 64b/line 2-way D-cache, 512KB 64b/line 16-way L2 cache
cpu0: ITLB 255 4KB entries direct-mapped, 255 4MB entries direct-mapped
cpu0: DTLB 255 4KB entries direct-mapped, 255 4MB entries direct-mapped
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges
cpu0: apic clock running at 1000MHz
cpu1 at mainbus0: apid 1 (application processor)
cpu1: Intel(R) Celeron(R) 2955U @ 1.40GHz, 158.44 MHz, 06-45-01
cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,SSE3,PCLMUL,VMX,SSSE3,CX16,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,XSAVE,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,PERF,FSGSBASE,ERMS,INVPCID,IBRS,IBPB,SSBD,ARAT,XSAVEOPT
cpu1: 64KB 64b/line 2-way I-cache, 64KB 64b/line 2-way D-cache, 512KB 64b/line 16-way L2 cache
cpu1: ITLB 255 4KB entries direct-mapped, 255 4MB entries direct-mapped
cpu1: DTLB 255 4KB entries direct-mapped, 255 4MB entries direct-mapped
cpu1: smt 0, core 0, package 1
Duration: 1.01mins, Total samples = 3.07s ( 5.05%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top30
Showing nodes accounting for 2.41s, 78.50% of 3.07s total
Dropped 163 nodes (cum <= 0.02s)
Showing top 30 nodes out of 244
flat flat% sum% cum cum%
0.43s 14.01% 14.01% 0.43s 14.01% runtime.thrsleep
0.27s 8.79% 22.80% 0.27s 8.79% runtime.kevent
0.24s 7.82% 30.62% 0.24s 7.82% runtime.usleep
0.22s 7.17% 37.79% 0.25s 8.14% syscall.Syscall
0.18s 5.86% 43.65% 0.18s 5.86% runtime.memclrNoHeapPointers
0.17s 5.54% 49.19% 0.17s 5.54% runtime.nanotime
0.11s 3.58% 52.77% 0.53s 17.26% runtime.mallocgc
0.09s 2.93% 55.70% 0.10s 3.26% runtime.scanobject
0.07s 2.28% 57.98% 0.07s 2.28% runtime.nextFreeFast
0.06s 1.95% 59.93% 0.06s 1.95% runtime.scanblock
0.05s 1.63% 61.56% 0.06s 1.95% go/scanner.(*Scanner).next
0.05s 1.63% 63.19% 0.05s 1.63% runtime.heapBitsSetType
0.05s 1.63% 64.82% 0.07s 2.28% runtime.lock
0.05s 1.63% 66.45% 0.06s 1.95% strings.Index
0.04s 1.30% 67.75% 0.04s 1.30% runtime.thrwakeup
0.03s 0.98% 68.73% 0.19s 6.19% cmd/compile/internal/gc.TestShiftGeneric
0.03s 0.98% 69.71% 0.31s 10.10% go/parser.(*parser).parseBinaryExpr
0.03s 0.98% 70.68% 0.03s 0.98% runtime.memmove
0.02s 0.65% 71.34% 0.02s 0.65% compress/flate.(*decompressor).huffSym
0.02s 0.65% 71.99% 0.04s 1.30% compress/flate.(*decompressor).huffmanBlock
0.02s 0.65% 72.64% 0.16s 5.21% go/parser.(*parser).next0
0.02s 0.65% 73.29% 0.06s 1.95% go/scanner.(*Scanner).scanNumber
0.02s 0.65% 73.94% 0.02s 0.65% memeqbody
0.02s 0.65% 74.59% 0.12s 3.91% reflect.Value.call
0.02s 0.65% 75.24% 0.13s 4.23% regexp.(*Regexp).backtrack
0.02s 0.65% 75.90% 0.03s 0.98% regexp.(*Regexp).tryBacktrack
0.02s 0.65% 76.55% 0.02s 0.65% runtime.(*gcBitsArena).tryAlloc
0.02s 0.65% 77.20% 0.03s 0.98% runtime.acquireSudog
0.02s 0.65% 77.85% 0.02s 0.65% runtime.casgstatus
0.02s 0.65% 78.50% 0.03s 0.98% runtime.gcWriteBarrier
Regarding kern.timecounter.hardware, I think it's safe to force the tsc timecounter before rebuilding the image, because the hardware is recent enough for the TSC to be invariant (I did the same in the FreeBSD images).
Update: my first "patch" to disable Meltdown wasn't doing anything significant; I changed it to jump directly to the .Lcpu_secure branch and updated the results. The test ran 10s faster, and the nanotime call is also lower.
@juanfra684 Thanks, that's helpful. And I infer that's inside a KVM guest (since I don't think pvclock is usable otherwise)?
Yes, qemu+kvm with an AMD SVM CPU.
@paulzhol Thanks for the test report. I'm not sure I understand what conclusion to draw from your results though.
Are you suggesting OpenBSD's meltdown mitigations might be the cause of the OpenBSD slowdown on GCE? Is there an easy way to inspect the value of cpu_meltdown? E.g., would it appear in dmesg and/or sysctl output?
@mdempsky sorry, my initial results were wrong, I wasn't disabling Meltdown mitigation completely (or rather was still using the split kernel-userspace page tables). I've updated with the proper patch and actual results.
in dmesg it is shown in the cpu flags:
cpu0: FPU,VME ... ARAT,XSAVEOPT,MELTDOWN
As you can see from my attached log, the GENERIC.MP_NO_MELTDOWN kernel doesn't report it, while the GENERIC.MP one (and @bradfitz's) does.
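Concretely, something like this is enough to check a machine (just grepping the boot messages for the flag):

```sh
# If MELTDOWN appears in the cpu feature flags, the kernel considers the CPU
# vulnerable and enables its page-table-separation mitigation.
dmesg | grep -o 'MELTDOWN' | head -1
```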
@mdempsky KPTI affects all syscall-heavy workloads, not OpenBSD specifically. My assumption is that the OpenBSD developers are more strict and may have enabled additional measures (L1TF mitigations on the host will affect VMs significantly, and I'm not sure if there's something that can be turned off in the guest).
It would be interesting to see if you can get a more significant speedup on GCE with tsc as timecounter and Meltdown disabled.
Change https://golang.org/cl/160317 mentions this issue: cmd/debugnewvm: add --just-make flag, more timing info in log
Change https://golang.org/cl/160318 mentions this issue: dashboard: add openbsd-amd64-current
Hooray, pvclock seemed to do the trick! (or something in -current)
Before:
openbsd-amd64-64: make.bash in 5m9s
After:
openbsd-amd64-current: make.bash in 2m23s (I see "pvclock0 at pvbus0" during boot).
Two takeaways here:
Hey @reyk, I thought you'd like to see this. Thanks for the pvclock driver--- it halves the time needed to build Go on GCE!
@bradfitz could you try adding kern.timecounter.hardware=tsc here for the regular 6.4 image?
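Something along these lines (standard sysctl usage: setting it live and persisting it in /etc/sysctl.conf so it survives a reboot of the image):

```sh
# Switch the timecounter on the running system.
sysctl kern.timecounter.hardware=tsc

# Persist the choice so freshly booted builder VMs pick it up too.
echo 'kern.timecounter.hardware=tsc' >> /etc/sysctl.conf
```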
@paulzhol, will do (tomorrow). That's also what Bryan Steele is recommending trying at https://twitter.com/canadianbryan/status/1090365284182097922
@paulzhol, I couldn't wait.
I'm testing a 6.4 image with TSC now. It boots and says 6.4 and ...
Automatic boot in progress: starting file system checks.
/dev/sd0a (f56cff0d5491ef20.a): file system is clean; not checking
setting tty flags
pf enabled
hw.smt: 0 -> 1
kern.timecounter.hardware: acpitimer0 -> tsc
starting network
And it takes 2m24s.
So, same as -CURRENT with pvclock0. So I think we'll keep our builders on 6.4 for now with that sysctl set.
What is making OpenBSD choose acpitimer0 instead of tsc? Should we be worried about any problems by forcing this?
/cc @brynet
Hi, (@brynet / canadianbryan) here. Someone else may be better able to answer the why, but I believe acpitimer0 is chosen because it has a higher tc quality value (1000). That is likely because there's no other high-quality reference timecounter that can be used to calibrate the TSC on this machine; without calibration, tsc remains at (-1000) and is never used.
A manual change is also required in OpenBSD's own vmm(4)/vmd(8) hypervisor for similar reasons.
As for any concerns about forcing this, considering it's an ephemeral VM as you mentioned, it's probably not a problem. But perhaps @reyk or @mbelop can say for certain.
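If it helps, you can inspect the candidates and their quality values from inside the guest; a quick sketch (exact output varies by release):

```sh
# Show the currently selected timecounter and every registered candidate
# with its quality value; a negative quality (e.g. tsc(-1000)) means the
# source is registered but will never be chosen automatically.
sysctl kern.timecounter.hardware kern.timecounter.choice
```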
Change https://golang.org/cl/160319 mentions this issue: env/openbsd-amd64: force tsc time source, double the build speed
Thanks for the feedback! It’s nice to see the performance impact.
A little background:
OpenBSD only uses TSC when it is an invariant TSC that is found in Nehalem-and-later CPUs. A low priority says that a TSC has been found but did not indicate the invariant flag or did not calibrate correctly.
We do not intend to implement support for classic variant TSCs that have to be synchronized across CPUs and cores. The logic is complex and error-prone and OpenBSD lacks support for per-CPU time counters.
While KVM supports invariant TSC, it is usually turned off in KVM-based clouds as it breaks features on the host such as VM live migration. So the workaround of “asking your cloud provider to turn on invariant TSC” will be rejected by the provider for an understandable reason.
For that reason I implemented the pvclock* guest driver, because it allows running OpenBSD in such clouds without problems. pvclock is a KVM interface that exposes the TSC of the host via a shared page, and it also accounts for the additional delay of VM exits. If pvclock exists, you should prefer it even over invariant TSC, as it is theoretically better suited for VMs.
There is a case when the host doesn’t have a real invariant TSC itself. In this case I do not attach pvclock either. But this only happens on old CPUs that you wouldn’t find in a public cloud like GCE or Exoscale anymore.
What is making OpenBSD choose acpitimer0 instead of tsc?
TSC frequency isn't specified on older CPUs, so OpenBSD tries to compute the frequency from another timecounter. However, there are several ways that this bootstrapping can fail, and unfortunately tsc.c doesn't log why it fails.
If I had to guess, the problem is that tsc.c assumes something went wrong if a timecounter read takes more than 50,000 cycles (RECALIBRATE_SMI_THRESHOLD), and acpitimer is so slow under kvm that it always takes more than this, so tsc.c is never able to bootstrap successfully from it.
Should we be worried about any problems by forcing this?
Since tsc.c isn't able to calibrate the TSC frequency, it's assuming it's the same as the CPU frequency. As long as this assumption is right, it's fine. But otherwise, there's a chance of clock skew. In particular, if you're measuring the 2m24s build time on the VM itself, there's a chance it simply looks faster because the clock is bad.
I would try a simple validity test of running rdate -p -v time.nist.gov; sleep 60; rdate -p -v time.nist.gov and just make sure the rdate results advanced by a minute. As long as that's the case, then I think we're good to use tsc on GCE.
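For example, a throwaway script along these lines (assuming rdate prints the "adjust local clock by N seconds" line shown above):

```sh
#!/bin/sh
# Measure clock drift: record the NTP offset, wait ~60s of local time,
# record it again; the change between the two offsets is the drift the
# local clock accumulated over that minute (near zero on a healthy clock).
offset() {
    rdate -p -v time.nist.gov 2>&1 | awk '/adjust local clock/ { print $(NF-1) }'
}
a=$(offset)
sleep 60
b=$(offset)
awk -v a="$a" -v b="$b" 'BEGIN { printf "drift over 60s: %.3f seconds\n", b - a }'
```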
@reyk
A low priority says that a TSC has been found but did not indicate the invariant flag or did not calibrate correctly.
My reading of tsc.c is that we wouldn't even call tc_init(&tsc_timecounter); if the TSC isn't invariant. I think being registered but having a -1000 quality (as seen in @bradfitz's sysctl output above) can only happen if it's invariant/constant but can't be calibrated from another timecounter?
@mdempsky
Calibration is the hardest part of those TSCs. This code has been tweaked multiple times.
We have to figure out the base frequency of the TSC and this is measured with an algorithm that uses delay() and RDTSC on older CPUs. If the calibration fails it leaves the priority at -1000.
Some Skylake-and-later CPUs provide CPUIDs to read the constant TSC frequency and to skip the calibration. This is the most reliable method but it is also not very generally applicable as it needs a table with hardcoded defaults. See tsc_freq_req() - it potentially needs to be updated for newer CPU models. I guess that this might be the problem in what you’ve seen.
Fun fact: I’ve seen kernel code in FreeBSD that parses the frequency in the CPUID model string as a fallback. But this only worked on old constant TSCs that had the same base frequency as the CPU. For example, it would parse 1.7GHz from “Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz” while this CPU has a TSC that runs at about 2.3GHz.
In particular, if you're measuring the 2m24s build time on the VM itself, there's a chance it simply looks faster because the clock is bad.
No, it's being measured from a program (on Linux) that drives the buildlet over the network.
I would try a simple validity test of running rdate -p -v time.nist.gov; sleep 60; rdate -p -v time.nist.gov and just make sure the rdate results advanced by a minute. As long as that's the case, then I think we're good to use tsc on GCE.
Doesn't look great:
SERIAL: 2019/01/30 07:59:18 Running /usr/local/bin/bash with args ["bash" "-c" "rdate -p -v time.nist.gov; sleep 60; rdate -p -v time.nist.gov"] and env [...] in dir /tmp/workdir
Wed Jan 30 07:59:11 PST 2019
rdate: adjust local clock by -7.503640 seconds
Wed Jan 30 08:00:07 PST 2019
rdate: adjust local clock by -21.513980 seconds
Getting off by 21.5 seconds in 60 seconds looks like the TSC calibration definitely didn't work, which explains why it wasn't selected in the first place.
But then I ran it again (new VM boot, different CPU type?) and:
2019/01/30 16:10:40 SERIAL: 2019/01/30 08:10:38 [0xc0001f2c60] Running /usr/local/bin/bash with args ["bash" "-c" "rdate -p -v time.nist.gov; sleep 60; rdate -p -v time.nist.gov"] and env ["HOME=/" "PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin" "PWD=/" "INRC=1" "_=/bin/sh" "WORKDIR=/tmp/workdir" "GOROOT_BOOTSTRAP=/tmp/workdir/go1.4" "TMPDIR=/tmp/workdir/tmp" "GOCACHE=/tmp/workdir/gocache"] in dir /tmp/workdir
Wed Jan 30 08:10:54 PST 2019
rdate: adjust local clock by 0.507475 seconds
Wed Jan 30 08:11:52 PST 2019
rdate: adjust local clock by 0.530718 seconds
2019/01/30 16:11:52 sleep: <nil>, <nil>
There the adjustments were much more reasonable, but from the client's point of view it took from 2019/01/30 16:10:40 to 2019/01/30 16:11:52 ... 72 seconds. But maybe rdate takes 6 seconds per run?
I wonder if we're not specifying which CPU type we want so we're getting something random. (https://cloud.google.com/compute/docs/cpu-platforms)
I'd just run -CURRENT with pvclock instead of worrying about TSC, but even though -CURRENT is effectively 6.4 right now, I worry that in a month or two, CURRENT might introduce some breaking change that Go isn't ready for and we'll be unable to rebuild our builder's VM image if we need something new in it.
So I'd like to do one of:
- fix the TSC calibration:
  - require Skylake when running a VM? https://cloud.google.com/compute/docs/cpu-platforms ... if that'd help OpenBSD?
  - pass some additional sysctl to force a certain TSC frequency?
- somehow backport the pvclock driver to 6.4
- modify our VM-building make.bash to force a particular snapshot date (say, 2019-01-30) instead of relying on the snapshot du jour. But not sure the mirrors even retain all those old snapshots. Likely not.
Recommendations?
I wonder if we're not specifying which CPU type we want so we're getting something random.
Nevermind, it's not random. The docs say:
If you do not actively specify a desired CPU platform, the instance uses the default platform for the zone where the instance will run. You can see a list of zones and their default CPU platforms in the Regions and Zones page.
We run in us-central1-f, which says:
Available CPU platforms:
Intel Xeon E5 v2 (Ivy Bridge) platform (default)
Intel Xeon E5 v3 (Haswell) platform (default from April 5th, 2019)
Intel Xeon E5 v4 (Broadwell) platform
Intel Xeon (Skylake) platform
So we're on Ivy Bridge for now, and will switch to Haswell by default in a couple months here.
We could request a different CPU type if the TSC/OpenBSD is better in one of those.
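For reference, pinning a minimum CPU platform is just a flag at instance creation time; a sketch with placeholder names (our builder code would do the equivalent through the API):

```sh
# Ask GCE for Skylake or newer for a one-off test VM in us-central1-f.
gcloud compute instances create openbsd-tsc-test \
  --zone us-central1-f \
  --min-cpu-platform "Intel Skylake" \
  --image openbsd-amd64-64   # placeholder image name
```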
On third run, the whole rdate+sleep 60+rdate took exactly 60 seconds from client's perspective, with 9.4 second drift:
2019/01/30 16:15:48 WorkDir: /tmp/workdir, <nil>
2019/01/30 16:15:48 SERIAL: 2019/01/30 08:15:51 [0xc0001f4b00] Running /usr/local/bin/bash with args ["bash" "-c" "rdate -p -v time.nist.gov; sleep 60; rdate -p -v time.nist.gov"] and env ["HOME=/" "PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin" "PWD=/" "INRC=1" "_=/bin/sh" "WORKDIR=/tmp/workdir" "GOROOT_BOOTSTRAP=/tmp/workdir/go1.4" "TMPDIR=/tmp/workdir/tmp" "GOCACHE=/tmp/workdir/gocache"] in dir /tmp/workdir
Wed Jan 30 08:15:49 PST 2019
rdate: adjust local clock by -2.915089 seconds
Wed Jan 30 08:16:48 PST 2019
rdate: adjust local clock by -9.437488 seconds
2019/01/30 16:16:48 sleep: <nil>, <nil>
Hi,
I agree with the analysis by @mdempsky that PMTimer is too slow on GCE which prevents TSC calibration.
So I'd like to do one of:
fix the TSC calibration:
- require Skylake when running a VM? https://cloud.google.com/compute/docs/cpu-platforms ... if that'd help OpenBSD?
Probably the easiest until 6.5 arrives in May.
- pass some additional sysctl to force a certain TSC frequency?
You cannot adjust TSC frequency via a sysctl interface. It would be useless for our users in the long run anyway.
- somehow backport the pvclock driver to 6.4
Backporting is straightforward, but would you be able to compile and store the kernel yourself? OpenBSD doesn't backport drivers to -stable due to a short development cycle.
- modify our VM-building make.bash to force a particular snapshot date (say, 2019-01-30) instead of relying on the snapshot du jour. But not sure the mirrors even retain all those old snapshots. Likely not.
Mirrors don't retain snapshots. If you're able to store and fetch the snapshot yourself, that might be a way to go. Then you can schedule updates on your own. FYI, right now LLVM 7 is being integrated into OpenBSD-current.
Recommendations?
Probably the easiest until 6.5 arrives in May.
@mbelop, I take that to mean that OpenBSD does do better on Skylake?
At least my test seems to confirm. On Skylake it boots and says:
Automatic boot in progress: starting file system checks.
/dev/sd0a (f56cff0d5491ef20.a): file system is clean; not checking
setting tty flags
pf enabled
hw.smt: 0 -> 1
kern.timecounter.hardware: tsc -> tsc
So looks like it had already selected TSC on its own with Skylake and the sysctl was a no-op.
But the rdate result wasn't great:
Wed Jan 30 09:02:10 PST 2019
rdate: adjust local clock by -1.932407 seconds
Wed Jan 30 09:03:10 PST 2019
rdate: adjust local clock by -8.596295 seconds
8.5 seconds off in 60 seconds.
But fastest build yet: 2m11s. So, yay Skylake?
Maybe the wall clock drift or rdate result doesn't matter.
Probably the easiest until 6.5 arrives in May.
@mbelop, I take that to mean that OpenBSD does do better on Skylake?
At least my test seems to confirm. On Skylake it boots and says:
Automatic boot in progress: starting file system checks. /dev/sd0a (f56cff0d5491ef20.a): file system is clean; not checking setting tty flags pf enabled hw.smt: 0 -> 1 kern.timecounter.hardware: tsc -> tsc
So looks like it had already selected TSC on its own with Skylake and the sysctl was a no-op.
As long as your hypervisor sets the CPU model of the VCPU to whatever value Skylake uses, we don't calibrate TSC at all, since Skylake and Kaby Lake CPUs provide the TSC frequency in one of the CPUID outputs.
But the rdate result wasn't great:
Supposedly because KVM doesn't compensate for VM switches. That's what the PVCLOCK driver is supposed to do (the shared page is updated on VMENTER), and that's why PVCLOCK is the preferred clock source.
Wed Jan 30 09:02:10 PST 2019 rdate: adjust local clock by -1.932407 seconds Wed Jan 30 09:03:10 PST 2019 rdate: adjust local clock by -8.596295 seconds
8.5 seconds off in 60 seconds.
But fastest build yet: 2m11s. So, yay Skylake?
Maybe the wall clock drift or rdate result doesn't matter.
I don't think it matters a whole lot for building and running regress tests, especially since you probably don't have long-running VMs (i.e. you're probably tearing the VM down after a single build?)
(i.e. you're probably tearing the VM down after a single build?)
Yes.
Should the FreeBSD builders be made to stop using TSC as well? Can Broadwell be set as the minimum CPU type (I think it's available in all the regions)?
So Skylake alone doesn't seem to do the trick for OpenBSD.
I tried booting our original OpenBSD 6.4 image (without the sysctl to set kern.timecounter.hardware) and it comes up and says...
cpu0: Intel(R) Xeon(R) CPU @ 2.00GHz, 1999.32 MHz, 06-55-03
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,HTT,SSE3,PCLMUL,SSSE3,FMA3,CX16,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,ITSC,FSGSBASE,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,MPX,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,AVX512CD,AVX512BW,AVX512VL,IBRS,IBPB,STIBP,SSBD,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES,MELTDOWN
cpu0: 256KB 64b/line 8-way L2 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges
cpu0: apic clock running at 909MHz
(The 2.0 GHz to me implies Skylake based on https://cloud.google.com/compute/docs/cpu-platforms at least)
But then when I run sysctl kern.timecounter.hardware:
kern.timecounter.hardware=acpitimer0
And it's slow.
So it seems I still need to force it to TSC?
Or is GCE not giving me Skylake, despite (seemingly?) asking for it?
@paulzhol, I don't know. You want to investigate and open a separate bug? But they're not slow (same speed as Linux, which does use the kvm clock), so I'm not worried. Whatever FreeBSD is doing by default seems to work.
So Skylake alone doesn't seem to do the trick for OpenBSD.
I tried booting our original OpenBSD 6.4 image (without the sysctl to set kern.timecounter.hardware) and it comes up and says...
cpu0: Intel(R) Xeon(R) CPU @ 2.00GHz, 1999.32 MHz, 06-55-03 cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SS,HTT,SSE3,PCLMUL,SSSE3,FMA3,CX16,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,AES,XSAVE,AVX,F16 C,RDRAND,HV,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,ITSC,FSGSBASE,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,MPX,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,AVX512CD,AVX512BW,AVX512VL,IBRS,IBPB,STIBP,SSBD,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES,MELTDOWN cpu0: 256KB 64b/line 8-way L2 cache cpu0: smt 0, core 0, package 0 mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges cpu0: apic clock running at 909MHz
(The 2.0 GHz to me implies Skylake based on https://cloud.google.com/compute/docs/cpu-platforms at least)
As you can see from 06-55-03, the CPU model is 0x55 (most likely a Skylake X), while models recognized by the kernel are 0x4e (mobile) or 0x5e (desktop). Skylake X is no longer part of the fixed frequency list since this commit: https://github.com/openbsd/src/commit/c0b7f8b44d9faa8a96127235666170c5c93b0b12. Unfortunately, I don't have any further information as to why that is so.
But then when I run sysctl kern.timecounter.hardware:
kern.timecounter.hardware=acpitimer0
And it's slow.
So it seems I still need to force it to TSC?
Or is GCE not giving me Skylake, despite (seemingly?) asking for it?
Change https://golang.org/cl/160457 mentions this issue: buildlet, dashboard: add min CPU platform knob, require Skylake for OpenBSD
As you can see from 06-55-03 the CPU model is 0x55 (most likely a Skylake X) while models recognized by the kernel are 0x4e (mobile) or 0x5e (desktop). Skylake X is no longer part of the fixed frequency list since this commit: openbsd/src@c0b7f8b. Unfortunately, I don't have any further information as to why that is so.
Linux fix is https://github.com/torvalds/linux/commit/b511203093489eb1829cb4de86e8214752205ac6
https://lwn.net/Articles/752159/ says:
* Intel Core X-Series (Skylake) processors use a hardcoded Time Stamp
Counter (TSC) frequency of 25 MHz. In some cases this can be imprecise and
lead to timing-related problems such as time drift, timers being triggered
early, or TSC clock instability. This update mitigates these problems by no
longer using the "native_calibrate_tsc()" function to define the TSC
frequency. Refined calibration is now used to update the clock rate
accordingly in these cases. (BZ#1547854)
So Linux redoes calibration on this CPU. OpenBSD gives up on it.
I'm starting to think mirroring ~today's CURRENT snapshot to our own storage for reproducible VM image builds might be the best path forward, rather than backporting pvclock to 6.4 or worrying about the TSC / CPU issues.
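Roughly like this (bucket name, date, and mirror URL are made up; it assumes the snapshot mirror serves a browsable directory index):

```sh
# Mirror today's amd64 snapshot sets into our own GCS bucket so the VM image
# build stays reproducible after the public mirrors rotate their snapshots.
MIRROR=https://cdn.openbsd.org/pub/OpenBSD/snapshots/amd64/
BUCKET=gs://example-openbsd-snapshots/2019-01-30/amd64/

mkdir -p /tmp/obsd-snap && cd /tmp/obsd-snap
wget -r -np -nd "$MIRROR"        # fetch sets, kernels, and SHA256 files
gsutil -m cp ./* "$BUCKET"       # upload everything to our bucket
```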
@paulzhol, I don't know. You want to investigate and open a separate bug? But they're not slow (same speed as Linux, which does use the kvm clock), so I'm not worried. Whatever FreeBSD is doing by default seems to work.
@bradfitz FreeBSD is not doing the right thing by default; I forced it to use TSC in https://github.com/golang/build/commit/2e001a4022dd101c83c51ef5cb78db7da9ca9ff3#diff-ca6ebc387c9f22d5808dfac3060344bd based on https://lists.freebsd.org/pipermail/freebsd-cloud/2017-January/000080.html (so the 10.4, 11.2 and 12.0 images are affected).
That's why I'm asking if we should revert (or maybe you can request Broadwell/Skylake machines for the FreeBSD builders as well, since no kvmclock is being used).
As you can see from 06-55-03 the CPU model is 0x55 (most likely a Skylake X) while models recognized by the kernel are 0x4e (mobile) or 0x5e (desktop). Skylake X is no longer part of the fixed frequency list since this commit: openbsd/src@c0b7f8b. Unfortunately, I don't have any further information as to why that is so.
Linux fix is torvalds/linux@b511203
https://lwn.net/Articles/752159/ says:
* Intel Core X-Series (Skylake) processors use a hardcoded Time Stamp Counter (TSC) frequency of 25 MHz. In some cases this can be imprecise and lead to timing-related problems such as time drift, timers being triggered early, or TSC clock instability. This update mitigates these problems by no longer using the "native_calibrate_tsc()" function to define the TSC frequency. Refined calibration is now used to update the clock rate accordingly in these cases. (BZ#1547854)
Doesn't really explain much (-;
I take this ^^^^ back since the commit does explain what happened.
So Linux redoes calibration on this CPU. OpenBSD gives up on it.
This isn't correct. We don't give up immediately, we attempt to recalibrate with a PM timer or an HPET timer, except that it doesn't work for you. Then we give up.
I'm starting to think mirroring ~today's CURRENT snapshot to our own storage for reproducible VM image builds might be the best path forward, rather than backporting pvclock to 6.4 or worrying about the TSC / CPU issues.
I can provide a backport patch for 6.4, but I don't know how you would go about integrating it into your test system. I totally support your decision to move to -current, however.
A relevant discussion about TSC calibration (on FreeBSD): http://freebsd.1045724.x6.nabble.com/TSC-calibration-in-virtual-machines-td6266249.html#a6266308
It seems that TSC calibration in virtual machines can sometimes do more harm than good. Should we default to trusting the information provided by a hypervisor?
A relevant discussion about TSC calibration (on FreeBSD): http://freebsd.1045724.x6.nabble.com/TSC-calibration-in-virtual-machines-td6266249.html#a6266308
It seems that TSC calibration in virtual machines can sometimes do more harm than good. Should we default to trusting the information provided by a hypervisor?
Pardon me if this is just taken out of context, but this question sounds silly: if you're running under a hypervisor that has total control of guest memory and execution flow, you're trusting it by definition. There's no question.
This isn't correct. We don't give up immediately, we attempt to recalibrate with a PM timer or an HPET timer, except that it doesn't work for you. Then we give up.
Ah, okay. Thanks.
It'd be nice of course if it worked, or at least whitelisted GCE as having a correctly advertised value like it sounds like FreeBSD is considering.
I'll probably go the -current route.
This isn't correct. We don't give up immediately, we attempt to recalibrate with a PM timer or an HPET timer, except that it doesn't work for you. Then we give up.
Ah, okay. Thanks.
It'd be nice of course if it worked, or at least whitelisted GCE as having a correctly advertised value like it sounds like FreeBSD is considering.
Correctly advertised value of what?
I'll probably go the -current route.
Correctly advertised value of what?
It was my impression from that FreeBSD thread that some hypervisors advertise the TSC frequency. For instance, VMware does:
https://svnweb.freebsd.org/base/head/sys/x86/x86/tsc.c?r1=221178&r2=221214&pathrev=221214
That thread made it sound like GCE does something similar.
Correctly advertised value of what?
It was my impression from that FreeBSD thread that some hypervisors advertise the TSC frequency. For instance, VMware does:
https://svnweb.freebsd.org/base/head/sys/x86/x86/tsc.c?r1=221178&r2=221214&pathrev=221214
That thread made it sound like GCE does something similar.
Yeah, I don't think that's the case. But if you do, please let us know.
Yeah, I don't think that's the case. But if you do, please let us know.
Doesn't GCE (kvm) have this commit https://github.com/qemu/qemu/commit/9954a1582e18b03ddb66f6c892dccf2c3508f4b2 ?
Doesn't GCE (kvm) have this commit qemu/qemu@9954a15 ?
GCE uses KVM but with its own implementation; it doesn't use QEMU. So it might do something like that, but that QEMU code wouldn't tell us whether it does.
So if I'm reading https://www.kernel.org/doc/Documentation/virtual/kvm/timekeeping.txt correctly
3.7) TSC virtualization - VMX
VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
instructions, which is enough for full virtualization of TSC in any manner. In
addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
field specified in the VMCS. Special instructions must be used to read and
write the VMCS field.
My understanding is that if the hypervisor reports a TSC to the guest, then I assume the hypervisor does all the validation required (trapping RDTSC instructions if necessary, keeping the vCPUs' TSCs in sync, etc.) and no calibration should be required (or possible) in the guest. So we are safe to keep forcing TSC on Broadwell/Skylake hosts until something like https://svnweb.freebsd.org/base/head/sys/x86/x86/tsc.c?r1=221178&r2=221214&pathrev=221214 is applied upstream for GCE/KVM?