Go: runtime: use parent goroutine's stack for new goroutines

Created on 29 Aug 2018 · 7Comments · Source: golang/go

This is a speculative idea from discussion with @aclements. Jotting it down so it doesn't get lost and to open discussion.

Spinning off many goroutines that grow their stack and then exit can be a source of performance pain. See e.g. #18138. One approach to fixing this is tracking and using statistics about stack growth behavior per spawn site.

Another possible approach is to:

switch immediately in the scheduler from parent goroutine to spawned goroutine
use the (free part of) the parent's stack as the spawn's stack
relocate the spawn's stack as needed

The idea here is that the parent goroutine's stack is likely to have plenty of free space, so if the spawn is short-lived, we can avoid doing any stack allocation for it at all, much less multiple stack growths.

If the spawn lives long enough that the parent goroutine starts running again, then this is worse. So it might be worth tracking some basic statistics on the fly to decide whether to do this.

NeedsInvestigation Performance

Source

josharian

👍4 👀1

Most helpful comment

How about allocating the child goroutine's stack in the same allocation but not right next to the parent's stack? That is, suppose the parent has 100K allocated stack space, say (base, base+100K), and is currently using 60K. When it spawns the child, we can put the child's stack at, say, base+90K, and adjust the parent's stack bound. When either the parent's stack grows to 90K, or the child's stack grows to 10K, (or the parent fires up more goroutines which uses up the stack), relocate the stack.

This way, both goroutines are runnable (if not otherwise blocked), and the scheduler can choose whatever appropriate.

cherrymui on 7 Sep 2018

👍4

All 7 comments

One common pattern where this may be problematic is where a parent spawns a goroutine and blocks on receiving the result of that goroutine from an unbuffered channel. In the basic form of this design, this would always force a stack-rip. A few ways I can think to get around this:

Encourage people to use a single-element channel instead. Not so great for existing code.
Statically recognize this "send and exit" pattern and compile it into a combined operation that does them in the opposite order (the difference is unobservable). The value would still have to be buffered somewhere until the parent reached the channel receive. This may be tricky because there's no way to bound the amount of buffering necessary.
Let the parent keep running until it blocks (like today), and then start running the new goroutine on the parent's stack. We'd need somewhere to stash the goroutine's argument frame. Here we could at least bound the amount of buffering for the argument frames, since we could always fall back to spawning a full goroutine.

For solution 2 or 3, we could always do something like _defer objects, which are manually allocated and freed by the runtime.

Cilk (which inspired this idea) structurally enforces something like solution 2, but the spawning code always has to pre-arrange a place to put the result.

aclements on 6 Sep 2018

This way, both goroutines are runnable (if not otherwise blocked), and the scheduler can choose whatever appropriate.

cherrymui on 7 Sep 2018

👍4

When it spawns the child, we can put the child's stack at, say, base+90K, and adjust the parent's stack bound.

If and when https://github.com/golang/go/issues/25999#issuecomment-400841088 materializes then this could even be done precisely by knowing how much stack space both parent and child actually require.

CAFxX on 8 Sep 2018

In Go terms I believe Cilk basically turned the go command into a call but
the calling frame was prepared in such a way that the parent's goroutine
stack could be copied, stolen, and then resumed by another P. I never
figured out the actual mechanism used in Cilk stealing but Go's ability to
relocate stacks is unique and seems to be part of the solution for Go. I
suspect that we could copy the child goroutine instead but that seems
racier.

This would also bound the number of goroutines the scheduler would have to
manage as well as a lot of other nice properties discussed in the Cilk
papers. Go should preserve these bounds if we do this.

On Fri, Sep 7, 2018 at 6:40 PM, Carlo Alberto Ferraris <
[email protected]> wrote:

When it spawns the child, we can put the child's stack at, say, base+90K,
and adjust the parent's stack bound.

If and when #25999 (comment)
https://github.com/golang/go/issues/25999#issuecomment-400841088
materializes then this could even be done precisely by knowing how much
both parent and child need in term of stack space.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/golang/go/issues/27345#issuecomment-419584853, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AA7Wnx2xixDi0c2XgX3C78od7v5KuIijks5uYvXUgaJpZM4WSA6H
.

RLH on 9 Sep 2018

This would also bound the number of goroutines the scheduler would have to
manage as well as a lot of other nice properties discussed in the Cilk
papers. Go should preserve these bounds if we do this.

I think it's worth striving for similar bounds, but Cilk was able to achieve these bounds strictly only because it had a very strict fork/join model of parallelism, where the parent function couldn't even return until all of its child threads had returned/exited. Because of Go's ad hoc parallelism, the best we could do (which would still be quite good) is to say, if your program is structured like X, then it will have some space and time bounds.

Let the parent keep running until it blocks (like today), and then start running the new goroutine on the parent's stack. We'd need somewhere to stash the goroutine's argument frame. Here we could at least bound the amount of buffering for the argument frames, since we could always fall back to spawning a full goroutine.

@pjweinb pointed out to me today that we could switch to the child when the parent blocks or tries to start a second child goroutine. Bounding it to a single pending goroutine makes the bookkeeping much simpler and may go towards ensuring Cilk-like space bounds. We could also combine this with @cherrymui's idea, which would give us a place to put the pending argument frame: when the parent go's the child, we immediately assign most of the parent's remaining stack to the child, put the goroutine's argument frame in its section of the stack, shrink the parent's stack bounds, and let the parent keep running. When the parent blocks, tries to start another child, or encounters its stack bound, we switch to the child (without moving the parent's stack at that time).

aclements on 14 Sep 2018

This seems promising- any updates about working towards improving goroutine speculative stack spawning?