go version)?go version go1.12.1 windows/amd64
yes
go env)?set GOARCH=amd64 set GOBIN= set GOCACHE=C:\Users\rillig\AppData\Local\go-build set GOEXE=.exe set GOFLAGS= set GOHOSTARCH=amd64 set GOHOSTOS=windows set GOOS=windows set GOPATH=C:\Users\rillig\go set GOPROXY= set GORACE= set GOROOT=C:\Program Files\Go set GOTMPDIR= set GOTOOLDIR=C:\Program Files\Go\pkg\tool\windows_amd64 set GCCGO=gccgo set CC=gcc set CXX=g++ set CGO_ENABLED=1 set GOMOD= set CGO_CFLAGS=-g -O2 set CGO_CPPFLAGS= set CGO_CXXFLAGS=-g -O2 set CGO_FFLAGS=-g -O2 set CGO_LDFLAGS=-g -O2 set PKG_CONFIG=pkg-config set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\Users\rillig\Program Files\cygwin\tmp\go-build152819318=/tmp/go-build -gno-record-gcc-switches
https://play.golang.org/p/G1EhaxsYsAV
When a bitset is queried for a single bit, it doesn't matter whether set & q is compared to 0 or to the expected bit. Comparing to 0 saves an instruction on x86_64 and doesn't modify a register.
Therefore I expected the generated code to use TESTQ $0x8, AX instead of the combined ANDQ and CMPQ.
I didn't measure how often this pattern appears in the wild, but when I saw the generated code I immediately wondered why it uses two instructions when the same effect can be achieved with a single instruction and even fewer side effects.
bitset.go:13 MOVQ os.Args+8(SB), AX bitset.go:14 NOPL bitset.go:14 XORPS X0, X0 bitset.go:14 MOVUPS X0, 0x40(SP) bitset.go:14 LEAQ runtime.types+65664(SB), CX bitset.go:14 MOVQ CX, 0x40(SP) bitset.go:9 ANDQ $0x8, AX <-- seems inefficient bitset.go:9 CMPQ $0x8, AX <-- seems inefficient bitset.go:9 SETE AL bitset.go:14 MOVZX AL, AX
Indeed. We currently handle x & 8 == 0 but not x & 8 == 8.
Change https://golang.org/cl/175957 mentions this issue: cmd/compile: optimize bitset tests
This optimisation is only valid for x & y == y if y has exactly one bit set (i.e. is a power of two).
See this example: https://play.golang.org/p/fAqS02z3Fgm
@tmthrgd it's right, finding a way to fix it.
@randall77 is there anyway to check a *ssa.Value is power of two? I see there's a function isPowerOfTwo, but it gets an int64 instead.
isPowerOfTwo is probably the right one to use.
You're going to need to make the rules require that one argument is a constant.
It would be nice if you could do this optimization in the generic rules, something like:
(Eq64 (And64 x (Const64 [y])) (Const64 [y])) && isPowerOfTwo(y) -> (Ne64 (And64 x (Const64 [y])) (Const64 [0]))
(And 32/16/8 versions, Ne -> Eq also)
That way all the architectures get the optimization.
@randall77 interesting, with these rules added in generic.rules:
// Optimize bitsets
(Eq(8|16|32|64) (And(8|16|32|64) <t> x (Const(8|16|32|64) <t> [y])) (Const(8|16|32|64) <t> [y])) && isPowerOfTwo(y)
-> (Neq(8|16|32|64) (And(8|16|32|64) <t> x (Const(8|16|32|64) <t> [y])) (Const(8|16|32|64) <t> [0]))
(Neq(8|16|32|64) (And(8|16|32|64) <t> x (Const(8|16|32|64) <t> [y])) (Const(8|16|32|64) <t> [y])) && isPowerOfTwo(y)
-> (Eq(8|16|32|64) (And(8|16|32|64) <t> x (Const(8|16|32|64) <t> [y])) (Const(8|16|32|64) <t> [0]))
For return set&8 == 8, the assembly code generate using BTL instead of TESTQ:
"".AllBitsSet STEXT nosplit size=15 args=0x18 locals=0x0
0x0000 00000 (main.go:8) TEXT "".AllBitsSet(SB), NOSPLIT|ABIInternal, $0-24
0x0000 00000 (main.go:8) FUNCDATA $0, gclocals路33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (main.go:8) FUNCDATA $1, gclocals路33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (main.go:8) FUNCDATA $2, gclocals路33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (main.go:12) PCDATA $0, $0
0x0000 00000 (main.go:12) PCDATA $1, $0
0x0000 00000 (main.go:12) MOVQ "".set+8(SP), AX
0x0005 00005 (main.go:12) BTL $3, AX
0x0009 00009 (main.go:12) SETCS "".~r2+24(SP)
0x000e 00014 (main.go:12) RET
I benched with no speed improvement compare to ANDQ + COMPQ.
Yes, it should use BTL. It's smaller, but probably not any faster than TESTQ.
It might be hard to find a benchmark which cares. I'm happy for this to go in as long as the benchmarks don't get worse.
Most helpful comment
Yes, it should use
BTL. It's smaller, but probably not any faster thanTESTQ.It might be hard to find a benchmark which cares. I'm happy for this to go in as long as the benchmarks don't get worse.