Author: Michael Jones
Last updated: May 2016
Careful bounds checking for array and slice operations is inherently in conflict with efficiency of operation.
While the Go compiler aspires to minimize redundant bounds checks, the common Push and Pop stack operations are situation where a special notation in source code would be a clarity benefit for the programmers writing and reading code and a compiler optimization benefit in terms of efficient code generation. This proposal explains that benefit.
The Push() operation adds data to a stack by copying items to the end of a storage area (indexed write) and then advances the stack pointer (the index). For a stack implemented as an array, this might look like:
var sp int
var stack [STACKSIZE]float64
:
//push
stack[sp] = 17.0
sp++
:
//pop
sp--
x := stack[sp]
Note that advancing the stack pointer with sp++ must not move beyond the end of the stack storage area just as retreating it with sp-- must not move before the start of the storage area. These can be tested in the increment and decrement, or equivalently, accesses can be tested on every indexed access to the storage area ("stack[sp]") as is the case of the Go compiler.
This example could potentially be simpler when the stack storage area is a slice because the stack pointer is implicitly the end of the slice's active region, the remaining space is the slice's unused underlying capacity, and the minimum depth is naturally zero; push and pop could easily become native actions on a slice.
It is often the case that multiple values need to be pushed-to or popped-from a stack. Since the first days of C-language programming, this has resulted in the idiomatic group Push():
*sp++ = a;
*sp++ = b;
*sp++ = c;
*sp++ = d;
and group Pop():
d = *--sp;
c = *--sp;
b = *--sp;
a = *--sp;
...which is speedy given indexed addressing modes in CPU instruction sets and the direct expressions provided by C. For Go, though, we cannot do this. We are forced into one of two paths, each with four bounds checks and four additions:
stack[sp] = a
sp++
stack[sp] = b
sp++
stack[sp] = c
sp++
stack[sp] = d
sp++
:
or else
stack[sp] = a
stack[sp+1] = b
stack[sp+2] = c
stack[sp+3] = d
sp += 4
The opportunity here is that the bounds check for the final Push() in a polyadic Push() sequence is a comprehensive replacement for each of the other bounds checks. That is, if sp+3 is safe to reference, then sp, sp+1, and sp+2 must also be safe to reference. The same notion applies to Pop(), where the lowest element in the stack, say sp-3, is a perfect test of the safety of access at offsets sp-2, sp-1, and sp.
This means that even a many valued push or pop need only do one bounds check to provide the full safety of the Go Language guarantee. The resulting performance gain scales linearly with the number of items pushed or popped as a group.
No specific notation is proposed. Here are a few musings, but part of the discussion needs to center on how to most naturally express this notion in Go. The proposer likes the second of these best, but perhaps better ways are possible.
StackPush(slice, a, b, c, d)
StackPop(slice, d, c, b, a) or "a, b, c, d" if preferred
Stack(slice) = a, b, c, d // "absorbing" multiple assignment to define push cardinality
d, c, b, a = Stack(slice) // ...and the converse which is implicitly Pop()
Note that this is not an unbalanced assignment. It means that the cardinality of variables on the "variable side" determines the group size being added to or removed from the stack. The first example here means "push four items" because there are four variables. It might be seen as a weakness that it does not easily generalize to cases with a stack on each side of the assignment operator because there is no place to specify the "four."
GO(slice).Push(4) = a, b, c, d // cast
a, b, c, d = GO(slice).Pop(4)
The result of the cast here takes the user's data structure to a built-in realm where all of the otherwise namespace polluting method names could be defined. I show Push and Pop, but we could have other compiler-known methods such as Len(), Reverse(), Sort(), etc. without taking name-space and freedom for function options away from the developer.
slice = Append(slice, a, b, c, d...)
d, c, b, a = Extract(slice, 4)
or
Extract(slice, d, c, b, a)
Extract(slice, d, c, b, a...) // strange overloading of ... here, but maybe ok
The existing one-at-a-time redundant bounds checks of stack access cost ~12% of overall runtime in several finely tuned compute-intensive applications using the latest and most efficient versions of the Go compiler. The proposal here seeks to regain this wasted execution time by making a class of redundant bounds tests easy to avoid.
Adding compiler-aware support for variadic Push and Pop on slices offers code clarity, exposes the structure of sequential accesses to the compiler, makes suppression of redundant bounds checks trivial, and allows more efficient runtime performance.
Have you played with the bounds check elimination work that @brtzsnr did for SSA? If you try with tip, then your Extract could look a bit like:
_ = slice[3]
a = slice[0]
b = slice[1]
c = slice[2]
d = slice[3]
I believe that should generate a single bounds check and indexed addressing. The cost is a single "assertion" at the top, but the savings are: no new intrinsics. :)
I don't like any proposal that modifies the language to
fix current compiler inefficiencies.
See some tips how to get rid of bound checks in this doc https://docs.google.com/document/d/1vdAEAjYdzjnPA9WDOQ1e4e05cYVMpqSxJYZT33Cqw2g/edit#
For example
stack[sp] = a
stack[sp+1] = b
stack[sp+2] = c
stack[sp+3] = d
can be rewritten as:
s := stack[sp:sp+4]
s[0] = a
s[1] = b
s[2] = c
s[3] = d
which should generate a single bound check,
In anycase try with to see where the bound checks are introduced -gcflags="-d=ssa/check_bce/debug=1"
Thanks for the comments here.
I am going to try the suggestions of josharian and brtzsnr right now.
minux, I appreciate your position. The question for me is "would first class interpretation of a slice as a stack" bring a welcome combination of performance and coding clarity / error reduction / etc. Maybe. Maybe not. Absolutely agree that language-change bandaids would be a mistake in any case.
I don't think this meets the bar for a language change, as much as I am sympathetic to this specific problem.
We should rather focus on getting better at bounds-checks elimination. Would it be possible to see the offending code so that compiler folks can look into compiler work, if necessary (perhaps after you tried the suggestions)? If this impacts performance by 12% I'd assume it to be part of extremely "hot" code, so perhaps even a small code snippet should show the problem?
Finally a request for clarification: I'm not sure I understand the difference between push(stack, a, b, c) and append(stack, a, b, c). Is there one?
Update:
Followed josharian's guidance (with several variations) and found an improved outcome. Still of course paying the one bounds check, but not more. I accept that there is a way to program around the issue, so the proposal's advantage now is mostly about clarity of expression. (See comments in the example below)
$ go test -bench=RichardsTwoWayIterative
BenchmarkRichardsTwoWayIterative04-8 50000000 32.3 ns/op
BenchmarkRichardsTwoWayIterative05-8 20000000 74.4 ns/op
BenchmarkRichardsTwoWayIterative06-8 10000000 211 ns/op
BenchmarkRichardsTwoWayIterative07-8 2000000 748 ns/op
BenchmarkRichardsTwoWayIterative08-8 500000 2778 ns/op
BenchmarkRichardsTwoWayIterative09-8 100000 15425 ns/op
BenchmarkRichardsTwoWayIterative10-8 10000 102282 ns/op
BenchmarkRichardsTwoWayIterative11-8 3000 493416 ns/op
BenchmarkRichardsTwoWayIterative12-8 500 2567382 ns/op
BenchmarkRichardsTwoWayIterative13-8 100 14133869 ns/op
BenchmarkRichardsTwoWayIterative14-8 20 82740477 ns/op
BenchmarkRichardsTwoWayIterative15-8 2 562817334 ns/op
BenchmarkRichardsTwoWayIterative16-8 1 3404951675 ns/op
PASS
ok queen 24.209s
$ go test -bench=RichardsTwoWayIterative -gcflags=-B
BenchmarkRichardsTwoWayIterative04-8 50000000 30.7 ns/op
BenchmarkRichardsTwoWayIterative05-8 20000000 67.3 ns/op
BenchmarkRichardsTwoWayIterative06-8 10000000 193 ns/op
BenchmarkRichardsTwoWayIterative07-8 2000000 643 ns/op
BenchmarkRichardsTwoWayIterative08-8 500000 2550 ns/op
BenchmarkRichardsTwoWayIterative09-8 100000 14659 ns/op
BenchmarkRichardsTwoWayIterative10-8 20000 91065 ns/op
BenchmarkRichardsTwoWayIterative11-8 3000 458559 ns/op
BenchmarkRichardsTwoWayIterative12-8 500 2373090 ns/op
BenchmarkRichardsTwoWayIterative13-8 100 13105999 ns/op
BenchmarkRichardsTwoWayIterative14-8 20 76434452 ns/op
BenchmarkRichardsTwoWayIterative15-8 3 477616045 ns/op
BenchmarkRichardsTwoWayIterative16-8 1 3162263098 ns/op
PASS
ok queen 25.485s
Robert:
Push and append are the same...the "extract" is all that is added in that case with names meant to suggest that append/extract == push/pop. Sorry for confusion on that point.
Sample code is always available. It was in the email I sent to Rob that lead to this proposal! ;-) Here is a simple example:
func richardsTwoWayIterative(n int) (count int) {
var c, l, r, p int
// var cc, ll, rr, pp [MaxN]int
var stack [4 * MaxN]int
for q0 := 0; q0 < n-2; q0++ {
for q1 := q0 + 2; q1 < n; q1++ {
bit0 := 1 << uint(q0)
bit1 := 1 << uint(q1)
c = bit0 | bit1 | (-1 << uint(n))
l = ((bit0 << 1) | bit1) << 1
r = ((bit0 >> 1) | bit1) >> 1
p = ^(l | c | r)
d := 4
for d >= 4 {
for p != 0 {
lsb := p & -p
p ^= lsb
nc := c | lsb
if nc == -1 {
count++
} else {
nl := (l | lsb) << 1
nr := (r | lsb) >> 1
np := ^(nl | nc | nr)
if np != 0 {
if p != 0 {
// STACK(slice) = c, l, r, p // https://github.com/golang/go/issues/15526
// cc[d], ll[d], rr[d], pp[d] = c, l, r, p
// d++
// _ = stack[d+3]
stack[d+3] = p
stack[d] = c
stack[d+1] = l
stack[d+2] = r
d += 4
// s := stack[d : d+4]
// s[0], s[1], s[2], s[3] = c, l, r, p
// d += 4
}
c, l, r, p = nc, nl, nr, np
}
}
}
// c, l, r, p = STACK(slice) // https://github.com/golang/go/issues/15526
// d--
// c, l, r, p = cc[d], ll[d], rr[d], pp[d]
// _ = stack[d-4]
c = stack[d-4]
p = stack[d-1]
r = stack[d-2]
l = stack[d-3]
d -= 4
// s := stack[d-4 : d]
// c, l, r, p = s[0], s[1], s[2], s[3]
// d -= 4
}
}
}
// double result due to 2-way horizontal symmetry
count <<= 1
return
}
Regarding clarity of expression: How about factoring out the stack ops into two function calls ("push", "pop") and see if they are getting inlined?
And if they aren't getting inlined, patch in CL 22782 and run with -gcflags="-m -m" to find out why.
Tried this. Summary:
stack[d+3] = p
stack[d] = c
stack[d+1] = l
stack[d+2] = r
d += 4
$ go test -bench=RichardsTwoWayIterative -benchtime=10s
BenchmarkRichardsTwoWayIterative04-8 500000000 32.9 ns/op
BenchmarkRichardsTwoWayIterative05-8 200000000 74.5 ns/op
BenchmarkRichardsTwoWayIterative06-8 50000000 217 ns/op
BenchmarkRichardsTwoWayIterative07-8 20000000 745 ns/op
BenchmarkRichardsTwoWayIterative08-8 5000000 2758 ns/op
BenchmarkRichardsTwoWayIterative09-8 1000000 15411 ns/op
BenchmarkRichardsTwoWayIterative10-8 200000 103947 ns/op
BenchmarkRichardsTwoWayIterative11-8 30000 486789 ns/op
BenchmarkRichardsTwoWayIterative12-8 5000 2599768 ns/op
BenchmarkRichardsTwoWayIterative13-8 1000 14371944 ns/op
BenchmarkRichardsTwoWayIterative14-8 200 81969665 ns/op
BenchmarkRichardsTwoWayIterative15-8 30 502210871 ns/op
BenchmarkRichardsTwoWayIterative16-8 3 3354597203 ns/op
type STACK struct {
sp int
stack [4 * MaxN]int
}
func (s *STACK) push4(c, l, r, p int) {
s.stack[s.sp+3] = p
s.stack[s.sp+0] = c
s.stack[s.sp+1] = l
s.stack[s.sp+2] = r
s.sp += 4
return
}
$ go test -bench=RichardsTwoWayIterative -benchtime=10s
BenchmarkRichardsTwoWayIterative04-8 500000000 35.3 ns/op
BenchmarkRichardsTwoWayIterative05-8 200000000 79.1 ns/op
BenchmarkRichardsTwoWayIterative06-8 50000000 246 ns/op
BenchmarkRichardsTwoWayIterative07-8 20000000 810 ns/op
BenchmarkRichardsTwoWayIterative08-8 5000000 3112 ns/op
BenchmarkRichardsTwoWayIterative09-8 1000000 17465 ns/op
BenchmarkRichardsTwoWayIterative10-8 200000 112961 ns/op
BenchmarkRichardsTwoWayIterative11-8 30000 546747 ns/op
BenchmarkRichardsTwoWayIterative12-8 5000 2711926 ns/op
BenchmarkRichardsTwoWayIterative13-8 1000 15259405 ns/op
BenchmarkRichardsTwoWayIterative14-8 200 87839617 ns/op
BenchmarkRichardsTwoWayIterative15-8 20 548524594 ns/op
BenchmarkRichardsTwoWayIterative16-8 3 3769988645 ns/op
PASS
func push(stack *[4 * MaxN]int, sp *int, c, l, r, p int) {
stack[(*sp)+3] = p
stack[(*sp)+0] = c
stack[(*sp)+1] = l
stack[(*sp)+2] = r
*sp = (*sp) + 4
}
$ go test -bench=RichardsTwoWayIterative -benchtime=10s
BenchmarkRichardsTwoWayIterative04-8 300000000 40.5 ns/op
BenchmarkRichardsTwoWayIterative05-8 100000000 103 ns/op
BenchmarkRichardsTwoWayIterative06-8 50000000 273 ns/op
BenchmarkRichardsTwoWayIterative07-8 20000000 923 ns/op
BenchmarkRichardsTwoWayIterative08-8 5000000 3534 ns/op
BenchmarkRichardsTwoWayIterative09-8 500000 24871 ns/op
BenchmarkRichardsTwoWayIterative10-8 100000 135996 ns/op
BenchmarkRichardsTwoWayIterative11-8 20000 695837 ns/op
BenchmarkRichardsTwoWayIterative12-8 5000 3612700 ns/op
BenchmarkRichardsTwoWayIterative13-8 1000 18441152 ns/op
BenchmarkRichardsTwoWayIterative14-8 100 107092404 ns/op
BenchmarkRichardsTwoWayIterative15-8 20 676950315 ns/op
BenchmarkRichardsTwoWayIterative16-8 3 4516315653 ns/op
$ go test -bench=RichardsTwoWayIterative -gcflags=-B -benchtime=10s
BenchmarkRichardsTwoWayIterative04-8 500000000 31.0 ns/op
BenchmarkRichardsTwoWayIterative05-8 200000000 70.2 ns/op
BenchmarkRichardsTwoWayIterative06-8 100000000 211 ns/op
BenchmarkRichardsTwoWayIterative07-8 20000000 687 ns/op
BenchmarkRichardsTwoWayIterative08-8 5000000 2743 ns/op
BenchmarkRichardsTwoWayIterative09-8 1000000 16873 ns/op
BenchmarkRichardsTwoWayIterative10-8 200000 97011 ns/op
BenchmarkRichardsTwoWayIterative11-8 30000 495367 ns/op
BenchmarkRichardsTwoWayIterative12-8 5000 2550674 ns/op
BenchmarkRichardsTwoWayIterative13-8 1000 14377273 ns/op
BenchmarkRichardsTwoWayIterative14-8 200 83683313 ns/op
BenchmarkRichardsTwoWayIterative15-8 20 523547845 ns/op
BenchmarkRichardsTwoWayIterative16-8 3 3341482554 ns/op
I acknowledge that the code snippet is not itself an award winner in terms of "clarity of exposition." It is the 4th in a developmental series and most of the comments are out at this point as it is focused on what changes between the versions. This series is just one of six or so ways to solve the same problem with a huge range of variations in performance and intricacy. The final version is very much faster, is dependent on -gcflags=-B, and rather completely inscrutable, with much like this...
:
s_left[i] = left
s_rigt[i] = rigt
PROC2:
bit = -bits & bits
bits ^= bit
board[i] = bit
s_bits[i] = bits
i++
mask ^= bit
left = (left | bit) << 1
rigt = (rigt | bit) >> 1
if i == posa {
if mask&topb != 0 {
goto BACK2
}
if mask&1 != 0 {
if ((left | rigt) & 1) != 0 {
goto BACK2
}
bits = 1
} else {
bits = mask & ^(left | rigt)
if bits == 0 {
goto BACK2
}
}
goto NEXT3
} else {
goto NEXT2
}
BACK2:
i--
bits = s_bits[i]
if bits != 0 {
mask = s_mask[i]
left = s_left[i]
:
Now within a few percentage points with the inlined functions, so going to that for clarity.
stack.sp = 4
for stack.sp >= 4 {
for p != 0 {
lsb := p & -p
p ^= lsb
nc := c | lsb
if nc == -1 {
count++
} else {
nl := (l | lsb) << 1
nr := (r | lsb) >> 1
np := ^(nl | nc | nr)
if np != 0 {
if p != 0 {
stack.push4(c, l, r, p)
}
c, l, r, p = nc, nl, nr, np
}
}
}
c, l, r, p = stack.pop4()
}
Still stickling with the appeal of slice-as-fast-stack, but benchmarks now show performance close enough with awkwardness hidden in functions. I am happy enough with this outcome.
It sounds like you're happy enough with the status quo, so can we close this proposal? You re-opened it.
Yes, sir. All fine. SSA for the win!
On Jul 31, 2016 8:51 PM, "Rob Pike" [email protected] wrote:
It sounds like you're happy enough with the status quo, so can we close
this proposal? You re-opened it.—
You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
https://github.com/golang/go/issues/15526#issuecomment-236467936, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AHgypaNNhAMeaCzDdkqkF1Zj4MEZ2Qc5ks5qbUMfgaJpZM4IWf6i
.
Most helpful comment
I don't like any proposal that modifies the language to
fix current compiler inefficiencies.