This is a speculative idea from discussion with @aclements. Jotting it down so it doesn't get lost and to open discussion.
Spinning off many goroutines that grow their stack and then exit can be a source of performance pain. See e.g. #18138. One approach to fixing this is tracking and using statistics about stack growth behavior per spawn site.
Another possible approach is to:
The idea here is that the parent goroutine's stack is likely to have plenty of free space, so if the spawn is short-lived, we can avoid doing any stack allocation for it at all, much less multiple stack growths.
If the spawn lives long enough that the parent goroutine starts running again, then this is worse. So it might be worth tracking some basic statistics on the fly to decide whether to do this.
One common pattern where this may be problematic is where a parent spawns a goroutine and blocks on receiving the result of that goroutine from an unbuffered channel. In the basic form of this design, this would always force a stack-rip. A few ways I can think to get around this:
For solution 2 or 3, we could always do something like _defer objects, which are manually allocated and freed by the runtime.
Cilk (which inspired this idea) structurally enforces something like solution 2, but the spawning code always has to pre-arrange a place to put the result.
How about allocating the child goroutine's stack in the same allocation but not right next to the parent's stack? That is, suppose the parent has 100K allocated stack space, say (base, base+100K), and is currently using 60K. When it spawns the child, we can put the child's stack at, say, base+90K, and adjust the parent's stack bound. When either the parent's stack grows to 90K, or the child's stack grows to 10K, (or the parent fires up more goroutines which uses up the stack), relocate the stack.
This way, both goroutines are runnable (if not otherwise blocked), and the scheduler can choose whatever appropriate.
When it spawns the child, we can put the child's stack at, say, base+90K, and adjust the parent's stack bound.
If and when https://github.com/golang/go/issues/25999#issuecomment-400841088 materializes then this could even be done precisely by knowing how much stack space both parent and child actually require.
In Go terms I believe Cilk basically turned the go command into a call but
the calling frame was prepared in such a way that the parent's goroutine
stack could be copied, stolen, and then resumed by another P. I never
figured out the actual mechanism used in Cilk stealing but Go's ability to
relocate stacks is unique and seems to be part of the solution for Go. I
suspect that we could copy the child goroutine instead but that seems
racier.
This would also bound the number of goroutines the scheduler would have to
manage as well as a lot of other nice properties discussed in the Cilk
papers. Go should preserve these bounds if we do this.
On Fri, Sep 7, 2018 at 6:40 PM, Carlo Alberto Ferraris <
[email protected]> wrote:
When it spawns the child, we can put the child's stack at, say, base+90K,
and adjust the parent's stack bound.If and when #25999 (comment)
https://github.com/golang/go/issues/25999#issuecomment-400841088
materializes then this could even be done precisely by knowing how much
both parent and child need in term of stack space.—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/golang/go/issues/27345#issuecomment-419584853, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AA7Wnx2xixDi0c2XgX3C78od7v5KuIijks5uYvXUgaJpZM4WSA6H
.
This would also bound the number of goroutines the scheduler would have to
manage as well as a lot of other nice properties discussed in the Cilk
papers. Go should preserve these bounds if we do this.
I think it's worth striving for similar bounds, but Cilk was able to achieve these bounds strictly only because it had a very strict fork/join model of parallelism, where the parent function couldn't even return until all of its child threads had returned/exited. Because of Go's ad hoc parallelism, the best we could do (which would still be quite good) is to say, if your program is structured like X, then it will have some space and time bounds.
Let the parent keep running until it blocks (like today), and then start running the new goroutine on the parent's stack. We'd need somewhere to stash the goroutine's argument frame. Here we could at least bound the amount of buffering for the argument frames, since we could always fall back to spawning a full goroutine.
@pjweinb pointed out to me today that we could switch to the child when the parent blocks or tries to start a second child goroutine. Bounding it to a single pending goroutine makes the bookkeeping much simpler and may go towards ensuring Cilk-like space bounds. We could also combine this with @cherrymui's idea, which would give us a place to put the pending argument frame: when the parent go's the child, we immediately assign most of the parent's remaining stack to the child, put the goroutine's argument frame in its section of the stack, shrink the parent's stack bounds, and let the parent keep running. When the parent blocks, tries to start another child, or encounters its stack bound, we switch to the child (without moving the parent's stack at that time).
This seems promising- any updates about working towards improving goroutine speculative stack spawning?
Sadly, no. Too much other work on our collective plates right now.
Most helpful comment
How about allocating the child goroutine's stack in the same allocation but not right next to the parent's stack? That is, suppose the parent has 100K allocated stack space, say
(base, base+100K), and is currently using 60K. When it spawns the child, we can put the child's stack at, say,base+90K, and adjust the parent's stack bound. When either the parent's stack grows to 90K, or the child's stack grows to 10K, (or the parent fires up more goroutines which uses up the stack), relocate the stack.This way, both goroutines are runnable (if not otherwise blocked), and the scheduler can choose whatever appropriate.