Split off from #37121
package bench1
import (
"math"
"testing"
)
const N = 64
func BenchmarkFast(b *testing.B) {
var x, y, z [N]float32
for i := 0; i < b.N; i++ {
mulFast(&x, &y, &z)
}
}
func mulFast(x, y, z *[N]float32) {
for i := 0; i < N; i++ {
z[i] = x[i] * y[i]
}
}
func BenchmarkSlow(b *testing.B) {
var z [N]float32
var x, y [N]uint32
for i := 0; i < b.N; i++ {
mulSlow(&x, &y, &z)
}
}
func mulSlow(x, y *[N]uint32, z *[N]float32) {
for i := 0; i < N; i++ {
z[i] = math.Float32frombits(x[i]) * math.Float32frombits(y[i])
}
}
% ~/go1.12.9/bin/go test bench1_test.go -test.bench .\* -test.benchtime=10000000x
goos: darwin
goarch: amd64
BenchmarkFast-16 10000000 55.9 ns/op
BenchmarkSlow-16 10000000 61.1 ns/op
PASS
% ~/go1.12.9/bin/go test bench1_test.go -test.bench .\* -test.benchtime=10000000x -test.cpuprofile=cpu.prof
goos: darwin
goarch: amd64
BenchmarkFast-16 10000000 89.7 ns/op
BenchmarkSlow-16 10000000 223 ns/op
PASS
For some strange reason, code that includes int->float instructions runs a lot slower when profiling is on.
This bug is reproducible from at least 1.11.
Change https://golang.org/cl/219127 mentions this issue: cmd/compile: add rewrite to avoid MOVLi2f
@jyknight suggests lack of a vzeroupper, perhaps in the Darwin signal handler.
@cherrymui reports that gratuitously inserting the VZEROUPPER before the conversion instruction also works. Wheeee!
Can we put the VZEROUPPER in our signal handler return path, when the code that was interrupted was Go code? That would be a more efficient way to ensure avx-cleanliness of float registers for Go.
We have an office full of people playing musical VZEROUPPERs trying to figure out what the problem is. A whole bunch of things don't help, we think the bug is in Darwin....
@dr2chase tried that, it didn't help. I also tried adding VZEROUPPER to asyncPreempt, it didn't help either. I also tried to change sigtramp to immediately return, it also didn't help...
Here's a C/assembly reproducer:
main.c
#include <stdio.h>
#include <stdint.h>
#include <signal.h>
#define N 64
#define SIG SIGFPE
// Does z = y+float64(x) for 64 entries. This is the main benchmark.
void add(uint64_t *x, double *y, double *z);
uint64_t x[N];
double y[N];
double z[N];
void handler(int sig, siginfo_t *info, void *arg) {
}
int main(int argc, char *argv[]) {
// Install handler for a signal.
// sa_sigaction is required; sa_handler doesn't trigger the bug.
struct sigaction act;
act.sa_sigaction = handler;
act.sa_flags |= SA_SIGINFO;
sigaction(SIG, &act, NULL);
// Raise the signal, which dirties the upper bits of YMM registers.
raise(SIG);
// Run the benchmark.
for (int i = 0; i < 40000000; i++) {
add(&x[0], &y[0], &z[0]);
}
return 0;
}
add.s:
.globl _add
_add:
// Uncomment the vzeroupper to "fix" the bug.
//vzeroupper
movq $0, %rax
loop:
cmpq $64, %rax
je done
movq (%rdi, %rax, 8), %rcx
movq %rcx, %xmm0
addsd (%rsi, %rax, 8), %xmm0
movsd %xmm0, (%rdx, %rax, 8)
incq %rax
jmp loop
done:
ret
I also tried adding VZEROUPPER to asyncPreempt, it didn't help either.
That was not true. I had nosplit in mulFast/mulSlow left for previous experiments, which caused it not to actually inject the call. Removing the nosplit makes it work, and it actually faster.
Before:
$ go-tmp test -bench=. .
goos: darwin
goarch: amd64
BenchmarkFast-8 14685382 90.9 ns/op
BenchmarkSlow-8 8348937 184 ns/op
PASS
ok _/tmp/pp 3.121s
After:
$ go-tmp test -bench=. .
goos: darwin
goarch: amd64
BenchmarkFast-8 24309422 48.6 ns/op
BenchmarkSlow-8 25514097 49.0 ns/op
PASS
ok _/tmp/pp 2.553s
Change https://golang.org/cl/219131 mentions this issue: runtime: zero upper bit of Y registers in asyncPreempt on darwin/amd64
@randall77 @jyknight and I tried also commenting out the "raise" line and leaving "vzeroupper" uncommented and the performance was somewhere in between "vzeroupper" being used and not (with raise still in there), which doesn't really make much sense to me. Might be a useful data point.
On our 1.11 builders, my C repro is ~14% slower with the VZEROUPPER commented out.
On 1.12, ~13% slower.
On 1.14, ~11% faster.
On 1.15, ~11% faster.
So for some reason this bug doesn't appear on the builders like it does on my desktop. I could imagine that some virtualization layer hides the bug, and the VZEROUPPER instruction has a cost.
I reported a bug to apple here.
We'll see if we get any traction on that.
@randall77, unfortunately your bug report to Apple seems to return a 404.
The link works fine for me. It is behind a login, though. Not sure if that would give you a 404 or not.
I did login and tried again but in vain

Anyways it is not a problem, was just checking up on this issue as I was re-triaging bugs that are in the Go1.15 milestone. Sorry for the noise :)
Do we think that there is anything to fix in Go here?
No, I think there's nothing left on our end.
The bug I linked to above now says:
Recent Similar Reports:Less than 10
Resolution:Potential fix identified - For a future OS update
Pretty vague, but at least something is happening.
Maybe when something is released, we can circle back and check the fix works, and then maybe remove our workaround (CL 219131) when the buggy versions of OSX become unsupported.
No need to keep the issue open, though.
Apple claims this was fixed in 10.15.6. Anyone want to verify? I'm still on 10.15.2.
(Just build and run the C code and see what happens.)
Apple claims this was fixed in 10.15.6. Anyone want to verify? I'm still on 10.15.2.
(Just build and run the C code and see what happens.)
Nothing happens:
$ time ./a.out
=====
JOB ./a.out
99% cpu
1.97s real
1.96s user
0.00s sys
Try it with and without the VZEROUPPER in add.s commented out. If it's fixed, there shouldn't be any performance difference between the two.
Try it with and without the VZEROUPPER in add.s commented out. If it's fixed, there shouldn't be any performance difference between the two.
Here's with VZEROUPPER commented out:
=====
JOB ./a.out
99% cpu
1.88s real
1.88s user
0.00s sys
So it's still slightly faster than with VZEROUPPER commented (I posted above).
Ok, sounds fixed.
So once we stop supporting releases <10.15.6, we can get rid of the VZEROUPPER patch.
We need a way to ask googlebot to reopen issues triggered on a minimum supported OS level.
We need a way to ask googlebot to reopen issues triggered on a minimum supported OS level.
We don't have the ability to track minimum OS support yet, so the best alternative I can suggest for now is to open a new tracking issue.
@HenkPoley thinks this might have regressed in 10.15.7. See his comment in #41152.
Just for the record. I have no system to test this on. I don't run Go. Just gave a headsup that the _issue in macOS_ that Apple forgets to use VZEROUPPER in their code (causing issues here) seems to have re-appeared. Which was the original issue, isn't it?
I don't think Go need to do anything, except shift that 'remove VZEROUPPER patch once Darwin <10.15.6 is not supported #41152' to =<10.15.7
Right, sorry, it's not that there's any fix to do for this bug. We just need to re-evaluate the point at which the currently checked-in fix will be superfluous.
@randall77 Does this issue still need to be open, or is it enough that #41152 is open?
It's currently milestoned for Go 1.15, so if it needs to stays open, we should move it to a milestone that isn't in the past. If there's something different to do, it might be more clear to open a new issue and reference this one.
Yes, #41152 is all we need now.
Most helpful comment
We have an office full of people playing musical VZEROUPPERs trying to figure out what the problem is. A whole bunch of things don't help, we think the bug is in Darwin....