Go: cmd/compile: CMOV instruction slower than branch

Created on 10 Jul 2018  Â·  18Comments  Â·  Source: golang/go

What version of Go are you using (go version)?

go1.11beta1 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOCACHE="/home/rk/.cache/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/rk/GoSpace/Projects"
GOPROXY=""
GORACE=""
GOROOT="/home/rk/GoSpace/GO"
GOTMPDIR=""
GOTOOLDIR="/home/rk/GoSpace/GO/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build473408654=/tmp/go-build -gno-record-gcc-switches"
VGOMODROOT=""

What did you do?

I built the code below with go 1.10 and go 1.11.
https://play.golang.org/p/22MEbiXFpzo

What did you expect to see?

The binary built by go 1.11 is as fast as that built by go 1.10.

What did you see instead?

The binary built by go 1.11 is incredibly slower than that built by go 1.10.

Go 1.11 compiles the function "down" to assembly like this:

    MOVQ pos+32(SP), AX
    MOVQ list+8(SP), CX
    MOVL (CX)(AX*4), DX
    LEAQ 1(AX)(AX*1), BX
    MOVQ list+16(SP), SI
    DECQ SI
    JMP L42
L28:
    MOVL DI, (CX)(AX*4)
    LEAQ 1(BX)(BX*1), DI
    MOVQ BX, AX
    MOVQ DI, BX
L42:
    CMPQ BX, SI
    JGE L73
    MOVL 4(CX)(BX*4), DI
    LEAQ 1(BX), R8
    MOVL (CX)(BX*4), R9
    CMPL DI, R9
    CMOVQHI R8, BX
//  JHI L100
L66:
    MOVL (CX)(BX*4), DI
    CMPL DX, DI
    JCS L28
L73:    
    CMPQ BX, SI
    JNE L97
    MOVL (CX)(BX*4), SI
    CMPL DX, SI
    JCC L92
    MOVL SI, (CX)(AX*4)
L88:    
    MOVL DX, (CX)(BX*4)
    RET
L92:
    MOVQ AX, BX
    JMP L88
L97:    
    MOVQ AX, BX
    JMP L88
L100:
//  MOVQ R8, BX
//  JMP L66

If replacing the CMOV instruction with a branch, it can be 80% faster.

FrozenDueToAge Performance release-blocker

Most helpful comment

I see similar things. I suspect there are two different issues - one with N=100 and one with large N.
I'm going to focus on the large N for now.

With a few more data points (old=tip with cmov generation turned off, new=tip):

benchmark                        old ns/op      new ns/op      delta
BenchmarkHeapSort/100-8          1752           2022           +15.41%
BenchmarkHeapSort/1000-8         53765          35192          -34.54%
BenchmarkHeapSort/10000-8        760440         502731         -33.89%
BenchmarkHeapSort/100000-8       9790058        7496200        -23.43%
BenchmarkHeapSort/1000000-8      131420794      130939404      -0.37%
BenchmarkHeapSort/2000000-8      294149147      333858182      +13.50%
BenchmarkHeapSort/10000000-8     2271577039     3583052169     +57.73%
BenchmarkHeapSort/20000000-8     5213875449     9264420912     +77.69%

As the array gets larger, cmov gets relatively slower. My L3 cache is 10MB, which is N=1.25e6, just about at the inflection point.
The cmov is only being used to select the correct child, like this:

if list[kid+1] > list[kid] {
    kid++
}

The array size dependence seems an awful lot like cache effects. If the array we're sorting fits in L3 cache, it is fast. If the array does not fit, it's slower. My suspicion is that when using branches, if the branch is predicted correctly (probably 50% of the time) we can issue the next read (the grandchild in the heap) in parallel with the current read. Using cmov, there's a data dependence from one load to the next, so we never issue two of them in parallel. That might be enough of an effect to explain the result - the cmov serializes the reads, and the cache miss is so slow that the branch mispredict & restart can all be hidden in the cache miss.

It seems a pretty niche case where this effect comes into play. I wouldn't want to remove cmov generation because of it.

All 18 comments

cc @TocarIP @rasky

@PeterRK What processor are you running?

I've microbenchmarked the code here on my Core i7-4510U between go1.10.3 and go1.11beta1 and benchstat returns the following:

name        old time/op  new time/op  delta
HeapSort-4  52.3µs ± 3%  39.8µs ± 2%  -23.89%  (p=0.000 n=10+10)

This would indicate there is actually a 24% speed improvement between go1.10.3 and go1.11beta1. I can confirm that the difference is the use of the CMOV instruction in the down function.

Here is the Benchmark function:

const size = 1000

func makeSlice(size int) []uint32 {
    a := make([]uint32, size)
    for i := range a {
        a[i] = rand.Uint32()
    }
    return a
}

func BenchmarkHeapSort(b *testing.B) {
    a := makeSlice(size)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        HeapSort(a)
    }
}

For larger arrrays (size = 1000000) go1.11beta1 is getting slower.

name        old time/op  new time/op  delta
HeapSort-4  75.9ms ± 2%  86.0ms ± 1%  +13.29%  (p=0.000 n=10+9)

I have tried i7-8700 & i5-7500.
Binary built by go 1.11 have less cache miss and less branch miss,but 80% more execution time.
I think it can not be said 5x slower. @josharian

test with 20M list:
10
11

test code:
https://play.golang.org/p/h8mmeue7v1R

Here's an auto-contained benchmark that can be directly copy-pasted in a _test.go file for whoever wants to try it: https://play.golang.org/p/4VwOmmmX069

Here's the result with count=5 on my machine for sizes from 100 to 1000000 (old is 1.10, new is tip):

name                old time/op  new time/op  delta
HeapSort/100-4      1.79µs ± 3%  2.45µs ± 2%  +36.43%  (p=0.008 n=5+5)
HeapSort/1000-4     52.1µs ± 3%  39.2µs ± 0%  -24.73%  (p=0.016 n=5+4)
HeapSort/10000-4     607µs ± 7%   540µs ± 0%  -11.08%  (p=0.016 n=5+4)
HeapSort/100000-4   6.55ms ± 1%  7.09ms ± 2%   +8.20%  (p=0.008 n=5+5)
HeapSort/1000000-4  75.0ms ± 0%  85.7ms ± 2%  +14.27%  (p=0.008 n=5+5)

On a i7-4510U cpu.

Thanks @ALTree for the self-contained benchmark. I was very interested in the CMOV instruction so I gave it a shot too. I see different results in my CPU (Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz).

name                old time/op  new time/op  delta
HeapSort/100-4      2.03µs ± 1%  2.43µs ± 1%  +19.92%  (p=0.008 n=5+5)
HeapSort/1000-4     60.0µs ± 0%  38.8µs ± 1%  -35.29%  (p=0.008 n=5+5)
HeapSort/10000-4     681µs ± 1%   529µs ± 0%  -22.26%  (p=0.008 n=5+5)
HeapSort/100000-4   7.61ms ± 1%  6.78ms ± 1%  -10.93%  (p=0.008 n=5+5)
HeapSort/1000000-4  87.8ms ± 2%  85.9ms ± 1%   -2.19%  (p=0.008 n=5+5)

Things look more or less same, except the last row, where I see perf improvement rather than a slowdown. For the 100 case, I see a slowdown too.

Sorry if this just adds noise. I only posted because I saw different results in the last case.

This might be related to #25298 as pointed out by @randall77 on golang-dev. See the discussion branch predictor versus CMOV.

The benchmark code has the problem that after the first iteration slice a is already sorted.
Here is a fix: https://play.golang.org/p/b_-IC1OSEWq

The new benchstat looks like this:

name                 old time/op  new time/op  delta
HeapSort/100-4       5.59µs ± 3%  3.74µs ± 4%  -33.14%  (p=0.000 n=10+10)
HeapSort/1000-4      75.0µs ± 0%  50.1µs ± 1%  -33.26%   (p=0.000 n=7+10)
HeapSort/10000-4      879µs ± 2%   610µs ± 0%  -30.61%   (p=0.000 n=10+8)
HeapSort/100000-4    10.8ms ± 0%   8.8ms ± 1%  -18.80%   (p=0.000 n=8+10)
HeapSort/1000000-4    141ms ± 1%   146ms ± 1%   +3.62%  (p=0.000 n=10+10)
HeapSort/10000000-4   2.54s ± 3%   4.23s ± 6%  +66.86%  (p=0.000 n=10+10)

Apparently branch prediction and speculative execution (loads) are faster than CMOV if the slice size exceeds the level 2 cache.

Honestly, I'm not sure what to do. @TocarIP?

go 1.11 vs go 1.10 on different scales:
2018-07-11

Go 1.11 loses its advantage when data size grows. CMOV causes that.

I see similar things. I suspect there are two different issues - one with N=100 and one with large N.
I'm going to focus on the large N for now.

With a few more data points (old=tip with cmov generation turned off, new=tip):

benchmark                        old ns/op      new ns/op      delta
BenchmarkHeapSort/100-8          1752           2022           +15.41%
BenchmarkHeapSort/1000-8         53765          35192          -34.54%
BenchmarkHeapSort/10000-8        760440         502731         -33.89%
BenchmarkHeapSort/100000-8       9790058        7496200        -23.43%
BenchmarkHeapSort/1000000-8      131420794      130939404      -0.37%
BenchmarkHeapSort/2000000-8      294149147      333858182      +13.50%
BenchmarkHeapSort/10000000-8     2271577039     3583052169     +57.73%
BenchmarkHeapSort/20000000-8     5213875449     9264420912     +77.69%

As the array gets larger, cmov gets relatively slower. My L3 cache is 10MB, which is N=1.25e6, just about at the inflection point.
The cmov is only being used to select the correct child, like this:

if list[kid+1] > list[kid] {
    kid++
}

The array size dependence seems an awful lot like cache effects. If the array we're sorting fits in L3 cache, it is fast. If the array does not fit, it's slower. My suspicion is that when using branches, if the branch is predicted correctly (probably 50% of the time) we can issue the next read (the grandchild in the heap) in parallel with the current read. Using cmov, there's a data dependence from one load to the next, so we never issue two of them in parallel. That might be enough of an effect to explain the result - the cmov serializes the reads, and the cache miss is so slow that the branch mispredict & restart can all be hidden in the cache miss.

It seems a pretty niche case where this effect comes into play. I wouldn't want to remove cmov generation because of it.

I can somewhat verify my claim. I can add a prefetch to the benchmark and that mostly fixes the large N behavior. At the head of the loop in down, do:

if kid*2 < last {
    _ = atomic.LoadUint32(&list[kid*2])
}

Now old=nocmov, new=tip:

benchmark                        old ns/op      new ns/op      delta
BenchmarkHeapSort/100-8          2029           2262           +11.48%
BenchmarkHeapSort/1000-8         56825          37950          -33.22%
BenchmarkHeapSort/10000-8        807093         526252         -34.80%
BenchmarkHeapSort/100000-8       10113466       6808430        -32.68%
BenchmarkHeapSort/1000000-8      130448267      101057677      -22.53%
BenchmarkHeapSort/2000000-8      295489785      243496350      -17.60%
BenchmarkHeapSort/4000000-8      721814370      646561035      -10.43%
BenchmarkHeapSort/6000000-8      1202974592     1161344992     -3.46%
BenchmarkHeapSort/8000000-8      1729255015     1735633353     +0.37%
BenchmarkHeapSort/10000000-8     2274368639     2370006748     +4.21%
BenchmarkHeapSort/20000000-8     5265656303     5989478005     +13.75%

There is still some effect, but it is much reduced.

Perhaps we could detect the fact that the result of a CMOV is used as an argument to a load, and suppress the use of a CMOV in those cases. That would allow the chip to continue to speculate the load. Seems like a pretty vague condition, though. And definitely not 1.11 material.
I'm going to punt this issue to 1.12. I think we're at a reasonable state for the release.

Are there µ-arches where the CMOV doesn't block the speculative load? (That seems like the sort of thing that register renaming is supposed to take care of.)

@bcmills I don't think so. The speculative load needs the address from which to load, and I don't know of any archs which speculate the address.

addr := ...
if cond { addr++ }
_ = *addr

With regular branches, cond does not need to be evaluated for the load to be issued. The processor speculates on the branch and either issues the load from addr or addr+1, depending on the branch direction it predicted.

addr := ...
addr = CMOV(cond, addr+1, addr)
_ = *addr

With the CMOV, the address from which you load is data-dependent on cond, so cond must be evaluated before the load is issued.

Change https://golang.org/cl/145717 mentions this issue: cmd/compile: don't use CMOV ops to compute load addresses

Change https://golang.org/cl/168117 mentions this issue: cmd/compile: replace OpSlicemask with branch

Was this page helpful?
0 / 5 - 0 ratings