Go: cmd/compile: optimize bitset tests

Created on 8 May 2019  路  7Comments  路  Source: golang/go

What version of Go are you using (go version)?

go version go1.12.1 windows/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\rillig\AppData\Local\go-build
set GOEXE=.exe
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=C:\Users\rillig\go
set GOPROXY=
set GORACE=
set GOROOT=C:\Program Files\Go
set GOTMPDIR=
set GOTOOLDIR=C:\Program Files\Go\pkg\tool\windows_amd64
set GCCGO=gccgo
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\Users\rillig\Program Files\cygwin\tmp\go-build152819318=/tmp/go-build -gno-record-gcc-switches

What did you do?

https://play.golang.org/p/G1EhaxsYsAV

What did you expect to see?

When a bitset is queried for a single bit, it doesn't matter whether set & q is compared to 0 or to the expected bit. Comparing to 0 saves an instruction on x86_64 and doesn't modify a register.

Therefore I expected the generated code to use TESTQ $0x8, AX instead of the combined ANDQ and CMPQ.

I didn't measure how often this pattern appears in the wild, but when I saw the generated code I immediately wondered why it uses two instructions when the same effect can be achieved with a single instruction and even fewer side effects.

What did you see instead?

  bitset.go:13          MOVQ os.Args+8(SB), AX
  bitset.go:14          NOPL
  bitset.go:14          XORPS X0, X0
  bitset.go:14          MOVUPS X0, 0x40(SP)
  bitset.go:14          LEAQ runtime.types+65664(SB), CX
  bitset.go:14          MOVQ CX, 0x40(SP)
  bitset.go:9           ANDQ $0x8, AX              <-- seems inefficient
  bitset.go:9           CMPQ $0x8, AX              <-- seems inefficient
  bitset.go:9           SETE AL
  bitset.go:14          MOVZX AL, AX
FrozenDueToAge NeedsFix Performance help wanted

Most helpful comment

Yes, it should use BTL. It's smaller, but probably not any faster than TESTQ.
It might be hard to find a benchmark which cares. I'm happy for this to go in as long as the benchmarks don't get worse.

All 7 comments

Indeed. We currently handle x & 8 == 0 but not x & 8 == 8.

Change https://golang.org/cl/175957 mentions this issue: cmd/compile: optimize bitset tests

This optimisation is only valid for x & y == y if y has exactly one bit set (i.e. is a power of two).

See this example: https://play.golang.org/p/fAqS02z3Fgm

@tmthrgd it's right, finding a way to fix it.

@randall77 is there anyway to check a *ssa.Value is power of two? I see there's a function isPowerOfTwo, but it gets an int64 instead.

isPowerOfTwo is probably the right one to use.
You're going to need to make the rules require that one argument is a constant.

It would be nice if you could do this optimization in the generic rules, something like:

(Eq64 (And64 x (Const64 [y])) (Const64 [y])) && isPowerOfTwo(y) -> (Ne64 (And64 x (Const64 [y])) (Const64 [0]))

(And 32/16/8 versions, Ne -> Eq also)
That way all the architectures get the optimization.

@randall77 interesting, with these rules added in generic.rules:

// Optimize bitsets
(Eq(8|16|32|64) (And(8|16|32|64) <t> x (Const(8|16|32|64) <t> [y])) (Const(8|16|32|64) <t> [y])) && isPowerOfTwo(y)
  -> (Neq(8|16|32|64) (And(8|16|32|64) <t> x (Const(8|16|32|64) <t> [y])) (Const(8|16|32|64) <t> [0]))
(Neq(8|16|32|64) (And(8|16|32|64) <t> x (Const(8|16|32|64) <t> [y])) (Const(8|16|32|64) <t> [y])) && isPowerOfTwo(y)
  -> (Eq(8|16|32|64) (And(8|16|32|64) <t> x (Const(8|16|32|64) <t> [y])) (Const(8|16|32|64) <t> [0]))

For return set&8 == 8, the assembly code generate using BTL instead of TESTQ:

"".AllBitsSet STEXT nosplit size=15 args=0x18 locals=0x0
    0x0000 00000 (main.go:8)    TEXT    "".AllBitsSet(SB), NOSPLIT|ABIInternal, $0-24
    0x0000 00000 (main.go:8)    FUNCDATA    $0, gclocals路33cdeccccebe80329f1fdbee7f5874cb(SB)
    0x0000 00000 (main.go:8)    FUNCDATA    $1, gclocals路33cdeccccebe80329f1fdbee7f5874cb(SB)
    0x0000 00000 (main.go:8)    FUNCDATA    $2, gclocals路33cdeccccebe80329f1fdbee7f5874cb(SB)
    0x0000 00000 (main.go:12)   PCDATA  $0, $0
    0x0000 00000 (main.go:12)   PCDATA  $1, $0
    0x0000 00000 (main.go:12)   MOVQ    "".set+8(SP), AX
    0x0005 00005 (main.go:12)   BTL $3, AX
    0x0009 00009 (main.go:12)   SETCS   "".~r2+24(SP)
    0x000e 00014 (main.go:12)   RET

I benched with no speed improvement compare to ANDQ + COMPQ.

Yes, it should use BTL. It's smaller, but probably not any faster than TESTQ.
It might be hard to find a benchmark which cares. I'm happy for this to go in as long as the benchmarks don't get worse.

Was this page helpful?
0 / 5 - 0 ratings