Performance would generally improve somewhat if arguments to and results from function and method calls used registers instead of the stack. Projected improvements based on a limited prototype are in the range of 5-10%.
The running CL for the prototype: https://go-review.googlesource.com/c/28832/
The prototype, because it is a prototype, uses a pragma that should be unnecessary in the final design, though helpful during development.
problems/tradeoffs noted below, through 2017-01-19 (https://github.com/golang/go/issues/18597#issuecomment-273928086)
This will reduce the usefulness of panic tracebacks because it will confuse/corrupt the per-frame argument information there. This was already a problem with SSA and register allocation and spilling; this change will make it worse. Perhaps only results should be returned in registers.
Does this provide enough performance boost to justify "breaking" (how much?) existing assembly language?
Given that this breaks existing assembly language, why aren't we also doing a more thorough revision of the calling conventions to include callee-saves registers?
This should use the per-platform ABIs, to make interaction with native code go more smoothly.
Because this preserves the memory layout of the existing calling convention, it will consume more stack space than it otherwise might if that layout had been optimized away.
And the responses to the above, roughly:
In compiler work, 5% (as a geometric mean across benchmarks) is a big deal.
Panic tracebacks are a problem; we will work on that. One possible mitigation is to use DWARF information to make tracebacks much better in general, including properly named and interpreted primitive values. A simpler mitigation for targeted debugging could be an annotation indicating that a function should be compiled to store arguments back to the stack (old style) to ensure that that particular function's frame is adequately informative. This is also going to be an issue for improved inlining because, viewed at a low level, intermediate frames will disappear from backtraces.
The scope of required assembly-language changes is expected to be quite small; from Go-side function declarations the compiler ought to be able to create shims for the assembly to use around function entry, function exit, and surrounding assembly-language CALLs to named functions. The remaining case is assembly-language CALL via a pointer, and these are rare, especially outside the runtime. Thus, the need to bundle changes because they are disruptive is reduced, because the changes aren't that disruptive.
Incorporating callee-saves registers introduces a garbage-collector interaction that is not necessarily insurmountable, but other garbage-collected languages (e.g., Haskell) have been sufficiently intimidated by it that they elected not to use callee-saves registers. Because the assembler can also modify stack frames to include callee-save spill areas and introduce entry/exit shims to save/restore callee-save registers, this appears to have lower impact than initially estimated, which further reduces the need to bundle changes. In addition, if we stake out potential callee-save registers early, developers can look ahead and adapt their code before it is required.
If each platform's ABI were used instead of this adaptation of the existing calling conventions, the assembly-language impact would be larger, the garbage-collector interactions would be larger, and either best-treatment for Go's multivalue returns would suffer or the cgo story would be sprinkled with asterisks and gotchas. As an alternative (a different way of obtaining a better story for cgo), would we consider a special annotation for cgo-related functions to indicate that exactly the platform calling conventions were used, with no compromises for Go performance?
CL https://golang.org/cl/35054 mentions this issue.
5% for amd64 doesn't feel like it justifies the cost of breaking all the assembly written to date.
Additionally, passing arguments in registers will make displaying function arguments in tracebacks much harder, if not impossible. One reason I like Go so much is how Go's traceback makes debugging so much easier by providing enough contextual information about the frames.
Frankly, I'd have expected a much larger performance improvement for switching the calling convention to use registers. If the benefit is indeed only 5-10%, IMHO it's not worth the trouble. We should pursue better inlining and automatic vectorization instead.
Intel CPUs have invested so much circuitry in optimizing stack operations, though, so I'm not too surprised. The benefit should be larger on RISC architectures.
I agree with Minux: optimising function calls for a 10% speedup vs. inlining and all the code-generation benefits that unlocks doesn't feel like a good investment.
I'm still thinking about the proposal, but I will note that for asm functions with go prototypes and normal stack usage, we can do automated rewrites. Which is also a reminder that the scope for this proposal should include updating vet's asmdecl check.
I wouldn't worry too much about assembly functions, as in the worst case we can have an opt-in mechanism for assembly functions and then develop a mode in cmd/asm to help people rewrite (or, better yet, automate the translation in cmd/asm so that people won't even notice the difference; we can have a new textflag for new-style assembly functions).
However, I do care deeply about arguments in tracebacks.
I tried patch set 30 of CL 28832 on this example:
https://play.golang.org/p/UV-E4wyL2T
And the result is:
panic: 0
goroutine 1 [running]:
panic(0x456a80, 0xc42000e118)
$GOROOT/src/runtime/panic.go:531 +0x1cf
main.F(0x50)
/tmp/x.go:8 +0x6b
main.F(0x0)
/tmp/x.go:6 +0xa
main.F(0x0)
/tmp/x.go:6 +0xa
main.F(0xc420052000)
/tmp/x.go:6 +0xa
main.F(0x0)
/tmp/x.go:6 +0xa
main.F(0x7)
/tmp/x.go:6 +0xa
main.F(0x7f6c34290000)
/tmp/x.go:6 +0xa
main.F(0x60)
/tmp/x.go:6 +0xa
main.F(0xc420070058)
/tmp/x.go:6 +0xa
main.F(0x0)
/tmp/x.go:6 +0xa
main.F(0xc4200001a0)
/tmp/x.go:6 +0xa
main.main()
/tmp/x.go:13 +0x9
I'd like to hear if we have any plans for restoring the existing
behavior of showing the first few arguments for each frame.
@minux Yes, we will have all the information necessary to print args as we do now. I suspect it just isn't implemented yet.
I'm curious to know how you plan to implement it without incurring the overhead of storing the initial value into memory.
The values enter in registers but they will be spilled to the arg slots if they are live at any call. So all live arguments will still be correct. Even modified live arguments should work (except for a few weird corner cases where multiple values derived from the same input are live).
Dead arguments are not correct even today. What you see in tracebacks is stale. They may get more wrong with this implementation, so it may be worth marking/not displaying dead args.
Things will get trickier if we allow callee-saved registers. We're thinking about it, but it isn't in the design doc yet.
Just passing arguments in registers while still reserving stack slots for them looks like MS's ABI. Isn't that negating a lot of the benefit of passing arguments in registers in the first place? Passing arguments in registers is precisely about not generating memory traffic for register arguments.
Anyway, I'd like to read the full design docs of this and see the
discussion of various design decisions and trade offs made.
When we pass an argument in a register, the corresponding stack slot is reserved but not initialized, so there's no memory traffic for it. Only if the callee needs to spill it will that slot actually get initialized.
I'm not sure how "full" the design docs are.
I've got a CL up, and here's a more readable version of it that I'll try to keep up to date: https://gist.github.com/dr2chase/5a1107998024c76de22e122ed836562d
(Repeating some of that gist) I reviewed a bunch of ABIs, and the combination of
caused me to decide to try something other than the standard ABI(s). The new goal was to minimize change/cost while still getting plausible benefits, so I decided to keep as much of the existing design as possible.
The main loss in reserving stack space is increased stack size; we only spill if we must, but if we must, we spill to the "old" location.
As far as backtraces go, we need to improve our DWARF output for debuggers, and I think we will, and then use that to produce readable backtraces. That should get this as right as it can be (up to variables not being live) and would be more generally accessible than binary data.
So actually turning this on may be gated by DWARF improvements.
Because we still spill arguments to the stack when they're live across calls, this means register parameters won't help performance for most non-leaf functions (or any function of moderate size).
Therefore, it seems register parameters will mostly help small leaf functions. And if that's true, I imagine inlining those functions will actually provide even more speedup, because then the compiler can optimize across function call boundaries.
I'm not sure improving DWARF would help traceback. True, DWARF provides an elaborate way to specify the location of arguments (and variables), but for the case of traceback we can only use SP-relative addressing for the arguments: even though DWARF allows us to describe that at a certain point in the program a certain argument is in a certain register, at the time of traceback such information is useless because registers are almost guaranteed to be clobbered.
What I'd like to see is a concrete result showing that changing the ABI is worth the effort as compared to, say, better inlining support. Specifically, I'd like to see evidence that register arguments help large functions that cannot be inlined.
If we have to change the ABI, I think allowing some callee-saved registers
will actually provide more benefit. And we need to cache g in a register on
amd64 (like the RISC architectures). I remember Russ proposed that we
maintain some essentially callee-saved register to save g. We should
investigate general callee-saved ABIs. Prior experience shows that
introducing callee-saved registers in a register-rich architecture helps
performance considerably, and it's not restricted to small functions.
Additionally, if we can align our callee-saved registers with the platform
ABI, cgo callbacks will be faster because we always save all platform
callee-saved registers at entry.
if we can align our callee-saved registers with the platform ABI, cgo callbacks will be faster
Not to mention more reliable. A lot of the signal trampolines in the runtime could be a whole lot simpler if they didn't have to deal with the ABI mismatch between signal handlers (which must use the platform ABI) and the runtime's Go functions. (I'm still not entirely convinced that there aren't more calling-convention bugs lurking.)
One problem with callee-saved registers is that we must figure out how to make the GC cope with them. If we can only save non-pointers in callee-saved registers across function calls, the benefit will be much smaller.
This is complicated by the problem that, when the callee decides to spill a callee-saved register, how does it know whether the value should be stored in a pointer slot or not?
Pointer tagging is out of the question as we don't box everything. My initial idea is to have the caller reserve some stack space above the outgoing-arguments area for all callee-saved registers, with a (conditional) stack map for those slots. The callee would always save each register to its designated slot. To avoid clearing the save area, we need more PCDATA to tell the GC which of the callee-saved registers have been spilled to their respective slots. This solution does negate some of the benefit of having callee-saved registers, but coupled with better inlining, I think it could still outperform register arguments alone.
Of course, one simple solution is to allow only pointers in callee-saved
registers.
At each call, have the PCDATA for that call record, for each callee-saved register, whether it holds 1) a pointer; 2) a non-pointer; 3) whatever it held on function entry. Also at each call, record the list of callee-saved registers saved from the caller, and where they are.
Now, on stack unwind, you can see the list of callee-saved registers. Then you look to the caller to see whether the value is a pointer or not. Of course you may have to look at the caller's caller, etc., but since the number of callee-saved registers is small and fixed it's easy to keep track of what you need to do as you walk up the stack.
5% for amd64 doesn't feel like it justifies the cost of breaking all the assembly written to date.
Much like SSA, I expect this change will have a much more meaningful effect on minority architectures.
Those fancy Intel chips can resolve push/pop pairs halfway through the instruction pipeline (without even emitting a uop!), while a load from the stack on a Cortex-A57 has a 4-cycle result latency (http://infocenter.arm.com/help/topic/com.arm.doc.uan0015b/Cortex_A57_Software_Optimization_Guide_external.pdf).
Note: the gc compiler doesn't ever generate push or pop instructions. It always uses mov to access the stack.
Result latency doesn't matter much on an out-of-order core because L1D on Intel chips has similar latency.
I suspect the benefit could be higher, but for non-leaf functions we still need to spill the argument registers to the stack, which negates most of the benefit for large and non-leaf functions.
Result latency doesn't matter much on an out-of-order core because L1D on Intel chips has similar latency.
It especially doesn't matter if you have shadow registers, but AFAIK none of the arm/arm64 designs have 'em.
I suspect the benefit could be higher, but for non-leaf functions we still need to spill the argument registers to the stack, which negates most of the benefit for large and non-leaf functions.
You can't just move them to callee-saves if they're still live? Is that for GC, or a requirement for traceback?
In the current ABI, every register is caller-save. My counterproposal is introducing callee-save registers but keeping argument passing on the stack to preserve the current traceback behavior.
In the current ABI, every register is caller-save. My counterproposal is introducing callee-save registers but keeping argument passing on the stack to preserve the current traceback behavior.
Ah. I guess I assumed, incorrectly, that this proposal included making some registers callee-save.
Therefore, it seems register parameters will mostly help small leaf functions. And if that's true, I imagine inlining those functions will actually provide even more speedup, because then the compiler can optimize across function call boundaries.
Inlining doesn't help on function pointer calls.
@dr2chase
there's no memory traffic unless there's a spill, and then we must spill somewhere
That's not strictly true. Empty stack slots still decrease cache locality by forcing the actual in-use stack across a larger number of cache lines. There's no extra memory traffic between the CPU and cache unless there's a spill, but reserving stack slots may well increase memory traffic on the bus.
most of the existing ABIs (all but Arm64) are unfriendly to multiple return values in registers
Not true. It would be trivial to extend the SysV AMD64 ABI for multiple return values, for example: it already has two return registers (rax and rdx), and it's easy enough to define extensions of the ABI that pass additional Go return values in other caller-save registers.
For example, we could do something like:
rax: varargs count; 1st return
rbx: callee-saved
rcx: 4th argument; 3rd return
rdx: 3rd argument; 2nd return
rsp: stack pointer
rbp: frame pointer
rsi: 2nd argument
rdi: 1st argument
r8: 5th argument; 4th return
r9: 6th argument; 5th return
r11: temporary register
r12-r14: callee-saved
r15: GOT base pointer
Am I understanding correctly that this proposal is calling for structs to always be unpacked into registers before a call? Has any thought been given to passing structs as read-only references? I think this is how most (all?) of the ELF ABIs handle structs, particularly larger ones.
This way the callee gets to decide whether it needs to create a local copy of anything, avoiding copying in many cases. It is also presumably fairly common for only a subset of struct fields to be accessed, so unpacking all or part of the struct may be unnecessarily expensive (particularly if the struct has boolean fields). Obviously the reference would have to be to something on the stack (copied there, if necessary, by the caller) or to read-only memory.
For Go in particular it seems like this would be a nice fit because the only way to get const-like behaviour is to pass structs by value and internally passing them by reference instead would potentially make that idiom a lot cheaper.
Arrays and maybe slices could also be handled the same way.
Inlining doesn't help on function pointer calls.
Yes, but I doubt passing arguments in registers helps much either (at least in the current proposal, where they might still spill to memory). A better fix could be de-virtualization or speculation. On the other hand, having callee-saved registers could help more (we can have more callee-saved registers than most functions have arguments).
It might be interesting to gather empirical data about the distribution of pointer vs non-pointer arguments. Or perhaps more precisely, GC-relevant vs non-GC-relevant arguments, since there are some pointers that GC doesn't care about.
Ian's suggestion for tracking pointer vs non-pointer vs inbound in PCDATA is nice and clean and flexible, but even simpler and cheaper would be to simply partition into fixed sets of GC-able registers and non-GC-able registers.
I don't know if I made this point adequately well, but the estimated performance gains are based on benchmarks where X% of calls are hand-enabled (tied to function) to pass arguments in registers, spills and all. Rick helped me with VTune to figure out the smallest set of functions to opt-in to get a meaningful percentage of calls covered. The 5 and 10% improvement estimates are a conservative scale-up of the observed improvements to 100% coverage of calls.
I'm in the process of attempting to get both -gcflags=-l=4 and the current experiment working simultaneously so I can estimate their combined benefit in the same way; there are bugs in the interaction.
The current version of the proposal does call for structs to be unpacked field-by-field. This is influenced by a desire to get strings, slices, and interfaces to land entirely within registers, combined with lack of support for selecting a "field" from a register in the current backend and some worries about how this might work on amd64p32. I don't think that "field selection" from a register is something that we've implemented (yet).
One thing we could definitely do without impairing traceback is returning
values in registers. This also doesn't have GC complications as there is no
GC safe point immediately following a return instruction.
We do have to make sure to support at least three/four pointer return
values for the common case of return &T (or an interface) and an error.
I don't think there are any GC complications for register arguments, as the callee knows which arguments are pointers. Ian's suggested approach is for general callee-saved registers. Partitioning the callee-save registers into pointer and non-pointer classes is also a viable solution, but we need to gather the types of hot values that live across a function call, not the types of function input arguments.
we need to gather the types of hot values that live across a function call, not types of function input arguments.
Right, thanks.
@josharian
even simpler and cheaper would be to simply partition into fixed sets of GC-able registers and non-GC-able registers.
That isn't obvious to me. Partitioning registers adds complexity to the register allocator, while PCDATA tracking adds complexity to the GC. Either approach seems straightforward, but both require additional code.
@mundaym
It is also presumably fairly common for only a subset of struct fields to be accessed so unpacking all or part of the struct may be unnecessarily expensive (particularly if the struct has boolean fields).
That assumes that you've constructed the struct somewhere in memory in the first place. For large and/or long-lived structs, that's a valid assumption — but for many small structs (given reasonably effective inlining) it is not.
FWIW, the AMD64 psABI passes structs of "up to four eightbytes" in registers.
Related idea: sometimes the compiler can choose whether to use an integer or a pointer, especially for loop induction variables; we might also need to take this into account.
Additionally, if the compiler is sure that the GC can find another pointer to a slice (e.g. the slice itself is on the caller's stack), then the induction pointer to its backing array (when saved in callee-saved registers) doesn't have to be labeled as pointer.
the induction pointer to its backing array (when saved in callee-saved registers) doesn't have to be labeled as pointer.
I think it would, since stack copying still needs to update it.
Returning values in registers is slightly more difficult when runtime shims are needed (e.g., reflectcall). On the call side, there's already a "natural" potential second entrypoint for all functions that do a stack size check (provided that the original spill-site memory layout is retained) that makes it relatively easy to implement go statements.
Notes:
Do we need separate proposals for all the other great ideas put forth as alternatives?
Do we need separate proposals for all the other great ideas put forth as alternatives?
I say we continue the conversation here, and once it settles, someone--you, I imagine :)--digests all the input and comes back with an updated proposal taking it all into account.
Josh, it took a fair amount of work to obtain the credible-to-me 5 and 10% estimates for the existing proposal, and I took the time to do it because the potential impacts of doing this work are relatively large. Anything that doesn't have a large cost-to-other-people I think just gets put on a list of stuff to do, and we prioritize by roughly-expected bang-for-buck.
My view on this is that a 5-10% gain in general (geomean) compiled code performance is worthwhile (back in the price-workstations-by-benchmark-performance days, we'd have been giddily happy for such a gain, and that was back when pipelines were short and simple) and the possibility that this might have an annoying impact makes me want to do it sooner, rather than later, since the more users we have writing "interesting" code, the more concrete is poured around our feet. Low-impact changes we can do later.
I've been trying to get a feel for what the "actual problems" are with this proposal, where "hey, what about this other optimization" is not an actual problem. The ones that I see are:
I would add one more specific concern to your list:
I'm not saying that you necessarily need to fundamentally alter the proposal. It might well be that the outcome is that the proposal gets an "Alternatives" section added explaining why the various alternative suggestions here were declined in favor of the current one.
I've been trying to get a feel for what the "actual problems" are with this proposal, where "hey, what about this other optimization" is not an actual problem.
It depends to what extent the "what about this other optimization" suggestions interact with the original proposal. At least at first blush, that interaction is non-trivial. And having potentially better optimizations available is an "actual problem", since the work required to implement, debug, and migrate to a new ABI will dwarf the (not inconsiderable) work it took to validate your 5-10% number. That says to me it is worth at least taking some of the alternatives seriously, and updating the proposal in light of them.
@bcmills
if the new convention isn't a superset of the platform ABI, we may lose the opportunity to simplify assembly functions. (The migration is likely a lot of work, and I doubt we'll do it twice.)
As an aside: the s390x ELF ABI requires a 160-byte (!) caller-allocated save area, so I would rather not adhere strictly to it in Go due to the effect it would have on goroutine stack growth. Probably fine to keep the register allocations the same though.
For callee-saved pointer registers, we need callers to do some work. In particular, they need to make sure to zero any such registers which are dead before a call. Otherwise, callees will dutifully pass the values in those registers on to the GC which will then preserve the (maybe otherwise collectible) objects they point to. That's another cost that must be borne by callee-saved pointer registers.
Maybe that cost goes away with Ian's strategy, I'm not sure.
I agree that it's important to get this right so we only have to do it once. I don't want to force people to rewrite their assembly twice. To the extent that we can do meaningful experiments to gauge the effect of ABI-modifying optimizations (e.g. callee-saved registers), we should do that.
If the caller knows the pointer is dead at the call site, it just needs to record that the register contains a non-pointer at that time.
Basically, we need separate PCDATA for each call instruction in the program that records whether each callee-saved register is used (owned) by the enclosing function and, if used, whether it contains a pointer. For each function using callee-saved registers, there will be FUNCDATA telling the GC where it saves the incoming callee-saved registers.
When unwinding the stack, the traceback routine maintains a vector of callee-saved register contents and reloads their contents as necessary as it walks down the stack, passing newly loaded pointer contents to the GC as roots.
The goal of the proposal is that assembly-language meddling be minimized
(readable proposal here https://gist.github.com/dr2chase/5a1107998024c76de22e122ed836562d ).
The assumption is that there will be a new tag set for adapted assembly language; the assembler will look for that tag, and if it does not see it, the code will automatically be made to work, provided that it lacks indirect CALLs (and perhaps it will error out for unadapted code containing an indirect CALL).
For call-free assembly language, the prologue/epilogue is modified to call adapters emitted by the compiler to handle loads/stores from the argument/result portion of the stack.
For calls to named functions, one option is to surround the call with calls to compiler-generated wrappers. Doing this for every function leads to a lot of mostly-unused extra boilerplate to be discarded by the linker; a slightly more complex scheme requires some back-and-forth with the compiler to ensure that these function-type-specific adapters are generated on demand.
And indirect calls in Go asm are rare, certainly within Google where we can (and did) search for them.
So I'd like to suggest that separating these args-in-registers and callee-saves is not necessarily "doing it twice", though it would be for last-cycle performance-critical assembly.
Adding callee-saves seems to guarantee a lot more meddling in existing assembly language, unless we come up with some clever plan. If the clever plan were a tool, the tool must notice writes to callee-save registers, modify stack frame to add spill area for those registers, include FUNCDATA to indicate presence of spill. How does the tool know where in the assembly frame to put the spills?
Alternately, we could choose the callee-saves registers in this turn of the crank and strongly suggest that anyone rewriting their assembly language now avoid using them, or preemptively add the save/restore code, so that when we do start using them for callee-saves in the future there is not a roadblocking need to repair code in a hurry.
Another earlier proposal by Keith appears to get some of the benefits of callee-save for direct calls with less impact on assembly language; there, the compiler records which registers a function does not clobber, records this in the export data, and for each direct call instruction its unclobbered registers can be left unspilled. There's still interaction with GC, and this does nothing for indirect calls, but it has reduced source code impact.
For helping Cgo, might we be better off conforming exactly to the platform ABI for such calls, and only for those calls?
I wouldn't worry too much about assembly function migration. In the worst case, we can implement automatic wrappers for them. (New-style assembly sets a new textflag, and the linker will generate wrappers for old-style assembly functions on the fly as needed. This scheme will work for both register arguments and callee-saved registers.)
My main concern for the current proposal is still argument display in tracebacks.
My position is that if we just use registers for return values and implement callee-saved registers, we can get the benefit of register arguments (potentially much more, because callee-saved registers don't necessarily need spilling across function calls) but will not lose the benefit of arguments shown in tracebacks. Even if it's not a performance win, I still prefer the debuggability provided by argument values in tracebacks (e.g. invalid arguments like nil stand out, and it's usually the first few arguments that would be passed in registers).
My main concern for the current proposal is still argument display in tracebacks.
@minux, can you expand on the problem you see with tracebacks? As I understand it, in the current proposal, any "controlled" traceback (where the goroutine is at a safe-point) works exactly like it does right now because the arguments will all be spilled to exactly the same places on the stack (unless, perhaps, their last use is before the first call?). Tracebacks caused by signals (nil pointer, divide-by-zero) may be a problem for printing the arguments of the inner-most frame, though all of the other frames will be fine. Is there some other problem you see?
Maybe we need a more elaborate way to communicate argument locations to the runtime for traceback. I think this is going to be more of a problem for inlining than for register arguments; we currently have no hope of recovering arguments for inlined frames (so, at least for now, we're not; perfect is the enemy of the good and all that.)
This unnecessary spilling will have performance costs (and not to mention the stack usage is the same as before). That's why I proposed callee-saved registers and using registers only for return values instead. They have nicer interactions with traceback without losing performance.
This unnecessary spilling will have performance costs (and not to mention the stack usage is the same as before).
Backing up a step, are you concerned about tracebacks with the current proposal (which does potentially unnecessary spilling, partly to support finding arguments for things like tracebacks), or with an alternate proposal that does less spilling?
Adding callee-saves seems to guarantee a lot more meddling in existing assembly language, unless we come up with some clever plan. If the clever plan were a tool, the tool would have to notice writes to callee-save registers, modify the stack frame to add a spill area for those registers, and include FUNCDATA to indicate the presence of the spill. How does the tool know where in the assembly frame to put the spills?
We don't need to notice writes to callee-save registers: we can spill them unconditionally. (If that's a performance problem, the trivial workaround is to rewrite the assembly function to use the new calling convention.)
We would need to do the spills at the point where we change calling conventions: when we go to call across conventions, we spill all the callee-saved registers, then copy all of the arguments to the stack, call the assembly function, copy return-parameters to registers, restore callee-saved registers, and we're done.
IIUC, that would only require FUNCDATA to indicate the spills at the call sites, for which we would generate the same FUNCDATA as for any other spill (indicating that those registers do not contain GC roots after the spill occurs).
Alternately, we could choose the callee-saves registers in this turn of the crank and strongly suggest that anyone rewriting their assembly language now avoid using them, or preemptively add the save/restore code, so that when we do start using them for callee-saves in the future there is not a roadblocking need to repair code in a hurry.
I like that idea. We're already trending a bit in that direction (see https://go-review.googlesource.com/#/c/35068/).
For helping Cgo, might we be better off conforming exactly to the platform ABI for such calls, and only for those calls?
Which calls do you mean, exactly? There are (unfortunately) several routes by which Go functions can end up being called from C ABI functions: at least explicit cgo calls and signal handlers. At any rate, I do think that (at least for platform ABIs that don't have crazy stack bloat) we should try to conform to the platform ABI.
To clarify my thinking: introducing register arguments in the most optimal case will make traceback arguments incorrect, but spilling the register arguments at function calls to solve the traceback problem will reduce the performance benefit of passing arguments in registers. Therefore, I suggest that we still pass arguments on the stack (so that traceback behavior is unchanged) and use callee-saved registers and register return values to get the speedup from changing the ABI.
Right — as I explained in https://github.com/golang/go/issues/18597#issuecomment-272600640, implementing callee-saved registers is not that hard. Even automatically adding code to existing assembly to spill and restore callee-saved registers is easy: notice the write to a callee-saved register (old-style functions are trivially identifiable as those not setting a NEWABI textflag), enlarge the frame, spill all callee-saved registers to the stack, and generate the FUNCDATA for those slots. And make the RET pseudo-instruction restore the callee-saved registers.
Note that we control the frame sizes and the definitions of pseudo-instructions (like RET), so we can make them do whatever is easiest for users.
For user-written NEWABI assembly functions, we will need to introduce some new FUNCDATA tags to indicate the stack slots used to save callee-saved registers; in fact, we could even add a special pseudo MOVQ instruction for that.
Thinking about tracebacks. At any given point in the program, we know whether an argument is still valid or not; we shouldn't show invalid arguments in tracebacks, but we may have some arguments unavailable.
The most interesting function arguments are in two categories. One is "things that could have been GC'd but weren't yet", which favors passing in registers without extra spilling for tracebacks. The other is "values which were passed further down the stack". The latter could be traced and displayed via FUNCDATA annotations using a more general version of @ianlancetaylor's suggested approach: if we know where the contents of the registers (and stack slots) came from, we can reconstruct the arguments which are still around because they were passed as arguments (or stored to local variables) further down the call chain.
Does DWARF already support that kind of value-propagation annotation?
Can you clarify "things that could have been GC'd but weren't yet" ?
Do you mean pointers that are functionally dead, and if a GC has occurred might now point to reused memory, or something else? That's a bad idea unless there is some indication (to debugger and/or traceback) that the pointer is stale and that examining its referent might lead to confusion.
Do you mean pointers that are functionally dead, and if a GC has occurred might now point to reused memory, or something else?
I mean pointers for which:
If function arguments are passed on the stack (or spilled for the lifetime of the function), then collecting those pointers requires extra work. On the other hand, if function arguments are passed in registers (and only spilled if they have live references after the spill), then collecting those pointers is automatic — and preserving them for display in tracebacks requires extra work.
Additionally, I'd want the arguments shown in the traceback to be the initial arguments to the function, not updated values. Otherwise the meaning of an argument also depends on the exact line number, which significantly reduces the value of showing it.
The main value of having the arguments is spotting plainly wrong arguments like 0xdeaddeaddead or nil without looking at the source code. If we only flush the arguments to the stack at call sites, then there is no guarantee the arguments shown are the initial ones; they could have changed. Context-dependent argument values are not only confusing, they are also less useful (when you don't know the version of Go used to build the program, context-dependent argument values are useless because you can't refer to the code; and this happens a lot when triaging new issues).
We're already in a state where what is displayed in the traceback is context-dependent.
func f(x *int) {
	x = new(int)
	g()
	// ... use x ...
}
If there is a panic during the call to g, the value displayed in the stack traceback for x's argument slot will show the newly allocated pointer, not the original value of x at the start of the function.
The reason this happens is that the newly allocated pointer is spilled to the argument slot for x. We used to do this to ensure that the old value x pointed to could be garbage collected during the call to g. Now that (as of go 1.8) we use our precise liveness information for arguments (and have runtime.KeepAlive available to override if necessary), we could revisit this decision. It would cost some stack space, but probably not much.
In any case, all that is pretty tangential to the proposal at hand. Not making tracebacks worse is certainly worth discussing here. Making them better should be a separate proposal.
But my point is that making tracebacks better might conflict with the current proposal, which means we can't really discuss these two problems in isolation.
That the proposals conflict is ok. And we can discuss those conflicts here. But without a concrete proposal about how you'd like to make tracebacks better it's hard to see exactly what those conflicts would be.
I'm interested in knowing how this would interact with non-optimized compilation. Right now non-optimized compilations registerize sparingly, which is great for debuggers since the Go compiler isn't good at emitting information about registerization. Putting a lot more things into registers without getting better at writing the appropriate debug symbols would be bad.
I'm interested in knowing how this would interact with non-optimized compilation. Right now non-optimized compilations registerize sparingly, which is great for debuggers since the Go compiler isn't good at emitting information about registerization.
Presumably part of the changes for the calling convention would be making the compiler emit more accurate DWARF info for register parameters.
On hold until @dr2chase is ready to proceed.
Current plan is 1.10; time got reallocated to loop unrolling and to whatever help was required to improve the DWARF support in general, so that we have the ability to say what we're doing with parameters.
There's been a lot of churn in the compiler in the last couple of months -- new source position, pushing the AST through, making things more ready for running in parallel, and moving liveness into the SSA form -- so I am okay with waiting a release.
Would the proposed calling convention omit frame pointers for leaf functions? I can see this varying based on whether the function stores the state of a callee-save register, takes the address of a local variable, etc. In that case, is there still a feasible way to obtain the call stack?
A naive proposal:
maybe a good optimization would be, instead of changing the calling convention, to focus on eliminating stack-allocated local variables by using registers, and on sharing as much stack-frame space as possible between variables and the arguments/return values of subcalls.
This way (1) registers would speed things up, (2) stack frames would be smaller, and (3) the existing calling convention would be preserved.
How would that work when a function has more than one caller?
I could be wrong, but a register-based calling convention (CC) could give a bigger performance boost if additional SSA rules were added that actually take advantage of it.
Currently, a large number of small and average-sized functions have many mem->reg->mem patterns that make certain optimizations on amd64 not worth it.
It is not fair to say that 10% is not significant: it's 10% with an optimizer that treats GOARCH=amd64 as if it were GOARCH=386. The potential gain is higher.
Lately, I was comparing 6g with gccgo, and even without machine-dependent optimizations, aggressive inlining, and constant folding, there was about a 25-40% performance difference in some allocation-free benchmarks. I do believe this difference includes CC impact (because most other parts of the output machine code look nearly the same).
As noted above, the performance benefit for a register-based calling convention is higher on RISC architectures like ppc64le & ppc64. If the above CL is not stale I would be willing to try it out. To me this is one of the biggest performance issues with golang on ppc64le.
Inlining does help but doesn't handle all cases if there are conditions that inhibit inlining. Also functions written in asm can't be inlined AFAIK.
The CL is very stale. First we have to get to a good place with debugger information (seems very likely for 1.10 [oops, 1.11]), and then we have to redo some of the experiment (it will go more quickly this time) with a better set of benchmarks. Some optimizations that looked promising on the go1 benchmarks turn out to look considerably less profitable when run against a wider set of benchmarks.
@dr2chase
Is the method used to determine the 5% performance improvement documented somewhere in this discussion? I'm curious what the function sample space actually looks like.
I think the measurement we want is how much does this help once inlining is fully enabled by default.
The estimate was based on counting call frequencies, looking at performance improvement on a small number of popular calls "registerized", and extrapolating that to the full number of calls.
I agree that we want to try better inlining first (it's in the queue for 1.11, now that we have that debugging information fixed), since that will cut the call count (and thus reduce the benefit) of this somewhat more complicated optimization. Mid-stack inlining, however, was one of the optimizations that looked a lot less good when applied to the larger suite of benchmarks. One problem with the larger benchmark suite is selection effect -- anyone who wrote a benchmark for parts of their application cares enough about a performance to write that benchmark, and has probably used it already to hand-optimize around rough spots in the Go implementation, so we'll see reduced gains from optimizing those rough spots.
and has probably used it already to hand-optimize around rough spots in the Go implementation
From my point of view, the goal of having a better inliner is to have more readable/maintainable (less "hand-optimized") code, not really faster code. So that we can build code addressing the problems we're trying to solve, not the shortcomings of the compiler.
There is an updated proposal at #40724, which addresses many of the issues raised here. Closing this proposal in favor of that one.
Most helpful comment
Additionally, passing arguments in registers will make displaying function arguments in tracebacks much harder, if not impossible. One reason I like Go so much is how Go's tracebacks make debugging so much easier by providing enough contextual information about the frames.
Frankly, I'd have expected a much larger performance improvement from switching the calling convention to use registers. If the benefit is indeed only 5-10%, IMHO it's not worth the trouble. We should pursue better inlining and automatic vectorization instead.
Intel CPUs have invested so much circuitry in optimizing stack operations that I'm not too surprised, though. The benefit should be larger on RISC architectures.