Go: cmd/compile: optimize overhead from CPU feature detection

Created on 1 Jan 2020 · 7Comments · Source: golang/go

As investigated in https://github.com/golang/go/issues/36196, the overhead of checking for hardware FMA on every iteration of a loop causes it to slow down. @josharian's CL 212360, which introduces a HasCPUFeature intrinsic, somewhat alleviates this overhead, but it is still non-negligble. We should look into lowering or in some cases eliminating the overhead for operations that require CPU feature detection, like population count, FMA, rounding, SSE3, etc...

One method is to hoist the check outside the loop. To quote https://github.com/golang/go/issues/15808#issuecomment-568224699

we’d need an optimization that rewrote:
var b bool = ...
for ... {
  if b {
    ...
  } else {
    ...
  }
}
into
var b bool = ...
if b {
  for ... {
    ...
  }
} else {
  for ... {
    ... 
  }
}
Which is not always going to be an improvement.

For large loops and operations that permit > 2 implementations, the above optimization could result in inflated binaries, but it works well for smaller loops.

Another method is to set a function pointer to the preferred implementation on program initialization, so that all invocations incur an indirect function call overhead, with the benefit that the implementation wouldn't change at runtime. This would be akin to the dispatcher in GCC's function multi-versioning.

It is worth further investigating opportunities for optimization in this space.

NeedsInvestigation Performance

Source

smasher164

👍1

Most helpful comment

The externally usable version of internal/cpu is golang.org/x/sys/cpu.

But, regardless, we should not expect users to manually hoist a CPU-specific flag.

ianlancetaylor on 2 Jan 2020

👍3

All 7 comments

One option would be to allow people to hoist the check manually and then elide the checks inside the loops.

func OnesCount(xs []uint64) int {
    var total int
    if cpu.HasSSE2 {
        for _, x := range xs {
            total += bits.OnesCount64(x)    
        }
    } else {
        for _, x := range xs {
            total += bits.OnesCount64(x)    
        }
    }
    return total
}

egonelbre on 2 Jan 2020

👎1

This becomes even more noticeable when the loop is unrolled at all, because then you don't even have another branch between the things.

I did some testing to find out what the impact was, and because I was not awake enough at the time, I ended up with an implementation that had the untaken branches to call the library functions, but not the popcount instructions, and discovered that the cost of that, plus the cost of the popcount operation without the branches, is much smaller than the cost of the branches and the popcount instructions. I'm not sure why. But the net impact is that the cost of the branch before every popcount, even though it's obviously a completely predictable branch (and perf confirms that it's predicted essentially 100% of the time), ends up being to increase the cost of popcount something like 3x overall.

seebs on 2 Jan 2020

@egonelbre One option would be to allow people to hoist the check manually

One issue with this is that internal/cpu is not a public API, and I am not sure it is something we want to surface to the user. /cc @martisch

smasher164 on 2 Jan 2020

The externally usable version of internal/cpu is golang.org/x/sys/cpu.

But, regardless, we should not expect users to manually hoist a CPU-specific flag.

ianlancetaylor on 2 Jan 2020

👍3

This is fixed by https://github.com/golang/go/commit/fff7509d472778cae5e652dbe2479929c666c24f

rasky on 4 Apr 2020

🚀2

Change https://golang.org/cl/227238 mentions this issue: cmd/compile: use MOVBQZX for OpAMD64LoweredHasCPUFeature

gopherbot on 5 Apr 2020

I tried CL 227238. I'm using AMD Ryzen 5 3500U.

name      old time/op  new time/op  delta
FMA-8     1.16ns ± 0%  0.95ns ± 0%   ~     (p=1.000 n=1+1)
NonFMA-8  0.89ns ± 0%  0.98ns ± 0%   ~     (p=1.000 n=1+1)

name      old time/op  new time/op  delta
FMA-8     1.16ns ± 0%  0.73ns ± 0%   ~     (p=1.000 n=1+1)
NonFMA-8  0.89ns ± 0%  0.96ns ± 0%   ~     (p=1.000 n=1+1)



md5-b65c90cffa015573ee89ad7a2d937f6c



name      old time/op  new time/op  delta
FMA-8     1.16ns ± 0%  1.17ns ± 0%   ~     (p=1.000 n=1+1)
NonFMA-8  0.89ns ± 0%  1.21ns ± 0%   ~     (p=1.000 n=1+1)