Why: At a basic level, posix_spawn(2) is a subset of fork(2). A new child process created by fork(2): 1) gets an exact copy of everything the parent process had in memory, and 2) gets a copy of all the file descriptors the parent process had open. posix_spawn(2) preserves #2, but not #1. In some cases, say when shelling out to a command, copying the parent's memory is unnecessary. With copy-on-write, fork is less expensive, but the copy is still unnecessary work. See the sketch below for the typical case.

What's out there: https://github.com/rtomayko/posix-spawn#benchmarks

I am wondering whether it makes sense to have this API and let developers decide which one to use (fork/exec vs. posix_spawn).
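For context, the pattern this is about is a plain shell-out, which in Go today always goes through fork/exec. A minimal example:

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Shelling out like this goes through fork+exec under the hood, so
	// the kernel has to duplicate (or at least COW-map) the parent's
	// entire address space even though the child immediately execs.
	out, err := exec.Command("ls", "-l").Output()
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```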
Some related discussion: https://groups.google.com/d/topic/golang-dev/66rHnYuMaeM/discussion
Thanks for the info. I am pasting the implementation of the posix-spawn gem in case it's a helpful reference. It looks like it forces the use of vfork on Linux: https://github.com/rtomayko/posix-spawn/blob/master/ext/posix-spawn.c#L399-L404 https://github.com/rtomayko/posix-spawn/blob/master/ext/posix-spawn.c#L418
The problem with an MADV_DONTFORK'ed heap is that it's difficult for us to make sure the ForkExec code doesn't use the heap at all. I think posix_spawn is the way to go (vfork was removed from the modern POSIX standard, so it should be avoided when possible).
_Labels changed: added priority-later, performance, removed priority-triage._
_Status changed to Accepted._
I see this issue on a regular basis for machines that still have free physical memory (certainly enough for the process I would like to invoke), but are being called via os/exec by a Go process with a very large virtual memory footprint. The invocation fails due to insufficient virtual memory to fork the parent Go process. I think calling this a "performance" issue isn't accurate.
A Go daemon with a large virtual memory footprint that needs to invoke small command-line utilities is a common use case. It would be nice to get this assigned to the 1.5 release.
I took a short look. On GNU/Linux, posix_spawn is a C library function, not a system call. vfork is a special case of clone: you pass the CLONE_VFORK flag. This means that a program that cares can already use vfork on GNU/Linux, by setting the Cloneflags field in os/exec.Cmd.SysProcAttr or os.ProcAttr.Sys. So while this would be nice to fix, I don't see a pressing need.
To fix it, we need to edit syscall/exec_linux.go to pass CLONE_VFORK when that is safe. It is pretty clearly safe if the only thing the child needs to do after the fork is fiddle with file descriptors and call exec. If that is the case, as can be determined by looking at the sys fields, then we could add CLONE_VFORK to the clone flags. If somebody wants to try that out, that would be nice.
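A minimal sketch of that experiment, using the Cloneflags escape hatch mentioned above (Linux-only; whether CLONE_VM must accompany CLONE_VFORK is exactly what the rest of this thread digs into):

```go
package main

import (
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("/bin/true")
	// Ask clone(2) to suspend the parent until the child execs.
	// CLONE_VM is deliberately left out here; see the experiments below.
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_VFORK,
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```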
> If somebody wants to try that out, that would be nice.

@ianlancetaylor I see this issue on a regular basis, so I will give your suggestion a try. Thanks!
I'm facing a similar problem. I'm running a Go app that allocates about 14GB of VM and can't spawn a simple 'ps' command despite having at least 300MB of system RAM still available. It would be really great if this issue were fixed in 1.5.
Hmm, I gave this a quick try a few days ago, but gave up for now.
I failed to determine why the child hangs after the clone syscall. And if the child hangs, the parent won't continue either in the CLONE_VFORK case.
I only activated CLONE_VFORK if everything in syscall.SysProcAttr was set to its zero value. But even such simple cases are not so simple, it seems. If someone wants to work on this with me, just ping me here.
Did you pass CLONE_VM as well as CLONE_VFORK? I think that without CLONE_VM the parent won't be able to see when the child has exec'ed. Though I don't know why the child would hang.
@ianlancetaylor yes, I passed both. But I guess the process needs to be single-threaded for this to work, which Go doesn't seem to guarantee at the moment. http://ewontfix.com/7/ has more info on this, if someone wants to continue here (e.g. my future self).
@ianlancetaylor
I'm a bit confused, this appears to work: http://play.golang.org/p/Bop1efiPJ4
Is this test flawed? I hadn't gotten around to trying this with our "real" code yet.
Edit: I even tried adding runtime.GOMAXPROCS(runtime.NumCPU()) and it still works.
@tarndt CLONE_VM is not passed in your example. If you add it, the Go program calling execve hangs, which is exactly what I have seen in my tests.
My current plan is to use madvise(..., MADV_DONTFORK) on the heap, but I haven't figured out yet how to do the file-descriptor juggling in a safe manner without affecting the parent process and while using only the stack.
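For illustration, the madvise part could look roughly like this, assuming the large data lives in an mmap'd region you control rather than in the Go heap (which is exactly the difficulty raised earlier in this thread):

```go
package main

import "golang.org/x/sys/unix"

func main() {
	// Allocate a large region outside the Go heap via anonymous mmap.
	buf, err := unix.Mmap(-1, 0, 1<<30,
		unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_PRIVATE|unix.MAP_ANONYMOUS)
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(buf)

	// Exclude the region from fork(2): children won't inherit it, so
	// fork no longer has to account for (or COW-map) these pages.
	if err := unix.Madvise(buf, unix.MADV_DONTFORK); err != nil {
		panic(err)
	}

	// ... fill and use buf, then fork/exec children cheaply ...
}
```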
@tarndt If you use CLONE_VFORK without CLONE_VM, is that really any faster?
If it is faster, and it works, then I suppose we could use it.
There is one reason not to use vfork: when the child needs to dup a FUSE-backed file descriptor, the dup could block, and in the case of vfork it would also block the parent for an indefinite amount of time. See https://groups.google.com/forum/#!msg/golang-nuts/11rdExWP6ac/rauEcCB66FUJ ([go-nuts] syscall: dup2 blocks contrary to runtime's expectations for fuse file systems) for the discussion.

This is an edge case, but still worth considering when switching to vfork.
I've run into this same problem: a Go program with a large virtual memory footprint is failing to fork/exec despite plenty of free memory.
My experiments with CLONE_VFORK|CLONE_VM end up with the parent being mysteriously zombified. With just CLONE_VM I get:
fatal error: runtime: stack split at bad time
For all those affected, a temporary workaround on Linux, until this is fixed properly, is one of the following:

```
sysctl -w vm.overcommit_memory=1
sysctl -w vm.swappiness=1
```

With the second option the swap will almost never actually be used, but it participates in the calculation by which the Linux kernel decides whether it can afford to satisfy an allocation when the default overcommit_memory=0 is in use.

I can confirm the above works (if you buy your devops people enough :beers: they let you do it).
Thank you so much, @redbaron your solution worked.
Hello,
what's the follow-up for this issue? I have a process at 20 GB VMM and ~4 GB RSS, and I think spawning will become a problem for me very soon.
@napsy There is no follow-up. Somebody needs to work on this and try out the various cases. I don't think it's going to matter very much for your program; it should work OK either way. What we are talking about here is just a possible performance improvement to os/exec. Of course, if that is not true, please do let us know.
@ianlancetaylor as far as I understand, spawning a new process from a parent with such a large memory footprint could cause problems. If I misunderstood the problem, please correct me.
@napsy, you're assuming you will hit the problem. Instead of making it a hypothetical, try making your process use more VMM and more RSS and see if you do.
Wat. This is already a confirmed problem. It wants to be fixed regardless of whether it impacts this one user.
@anacrolix Nobody is saying it should not be fixed. We're just saying that there is no reason for someone to simply assume they will encounter this.
That said, it would be nice to have a simple test case showing the problem. We don't have that now.
I've run into this problem recently. Here's a test case that I made for it:
https://gist.github.com/jd3nn1s/24896f55f20497a972914412f23ab23a
As the allowed overcommit is based on some heuristic, here are some details of my setup:

```
devbox:~$ free
             total       used       free     shared    buffers     cached
Mem:       4030672    2508596    1522076       7760     217200    1836996
-/+ buffers/cache:     454400    3576272
Swap:      1048572       2356    1046216

devbox:~$ uname -a
Linux devbox 3.19.0-25-generic #26~14.04.1-Ubuntu SMP Fri Jul 24 21:16:20 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
```
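Condensed, such a reproduction amounts to something like the following (a sketch in the spirit of the linked gists, not either gist verbatim):

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	var ballast [][]byte
	for gb := 0; gb <= 10; gb++ {
		start := time.Now()
		if err := exec.Command("/bin/true").Run(); err != nil {
			// With overcommit disabled, this is where fork fails
			// with "cannot allocate memory".
			fmt.Println("exec failed:", err)
		}
		fmt.Printf("%dGB: %v\n", gb, time.Since(start))

		// Grow the parent's footprint by 1GB and touch every page
		// so the memory is resident, not merely reserved.
		b := make([]byte, 1<<30)
		for i := range b {
			b[i] = 1
		}
		ballast = append(ballast, b)
	}
	_ = ballast
}
```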
Here's a test case that works relatively well for me: https://gist.github.com/neelance/460f8a31f2391d2f3aafd7052348f66a
Its output:

```
0GB: 669.091µs
1GB: 1.884526ms
2GB: 3.384228ms
3GB: 4.528614ms
4GB: 5.735291ms
5GB: 8.574143ms
6GB: 9.779468ms
7GB: 11.032376ms
8GB: 13.148973ms
9GB: 15.560206ms
10GB: 16.600329ms
```
I've seen even worse latencies in production (up to several hundred milliseconds), but it is hard to simulate that in a test.
Here's a patch that works for the test case and the core package tests, I still have to give it a try in production: https://github.com/neelance/go/commit/b7edfba429d982e3e065d637334bcc63ad49f8f9
Test case output after patch:

```
0GB: 521.237µs
1GB: 515.594µs
2GB: 524.1µs
3GB: 499.79µs
4GB: 520.94µs
5GB: 504.268µs
6GB: 510.416µs
7GB: 546.991µs
8GB: 478.579µs
9GB: 524.383µs
10GB: 465.099µs
```
@ianlancetaylor I can bring it upstream if desired. This stuff was partially new to me, so I'd like to get feedback. E.g. I picked the register R12 at random (R11 didn't work); maybe there's a better pick.
Thanks for looking at this. Why do you need to preserve the return address?
http://ewontfix.com/7/ referred to above describes a possible security issue if setuid is called from another thread while vforking. Probably not a problem unless https://github.com/golang/go/issues/1435 is resolved and setuid is implemented.
@ianlancetaylor The return address needs to be preserved because SYSCALL returns twice, once in the child and once in the parent. However, when using CLONE_VM there is just one stack, which both use. CLONE_VFORK blocks the parent thread, so the child goes first: it executes RET and thus pops the return address from the stack. It eventually calls SYS_EXECVE, which detaches it from the shared memory space, so the parent may continue execution. The parent now hits RET again, but the return address would not be there any more, unless you restore it from a register. I got that trick from glibc.
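For the curious, the heart of that trick in Go's assembler looks something like this (an abridged sketch of the idea, not the patch verbatim; argument setup and error handling are omitted):

```
// Raw clone(2) wrapper that is safe under CLONE_VM|CLONE_VFORK: the
// return address is saved in the callee-saved register R12 before the
// child's RET consumes the shared stack slot, and pushed back so the
// parent's RET still works after the child has exec'ed.
TEXT ·rawVforkSyscall(SB),NOSPLIT|NOFRAME,$0-32
	MOVQ	a1+8(FP), DI    // clone flags
	MOVQ	trap+0(FP), AX  // SYS_CLONE
	POPQ	R12             // save return address
	SYSCALL                 // returns twice; the child runs first
	PUSHQ	R12             // restore return address for the parent
	MOVQ	AX, r1+16(FP)
	RET
```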
@neelance I tried your test (https://gist.github.com/neelance/460f8a31f2391d2f3aafd7052348f66a) with and without CLONE_VFORK (not CLONE_VM) and observe a significant speedup.
Without CLONE_VFORK:

```
0GB: 1.672371ms
1GB: 2.349266ms
2GB: 5.790721ms
3GB: 8.103551ms
```

With CLONE_VFORK (still no CLONE_VM):

```
0GB: 13.88µs
1GB: 15.294µs
2GB: 15.186µs
3GB: 12.553µs
```

I'm doing this on Go 1.7 with GOARCH=amd64.
Why is this the case? If the slowness is because calling clone without CLONE_VM will copy the memory, doesn't that mean that it should be slow even if you include CLONE_VFORK?
One thing I noticed in the documentation for clone is this:

> Another difference for the raw system call is that the child_stack argument may be zero, in which case copy-on-write semantics ensure that the child gets separate copies of stack pages when either process modifies the stack. In this case, for correct operation, the CLONE_VM option should not be specified.

And Go does set child_stack to zero, AFAICT. Does that mean this COW behavior is kicking in? Maybe the slowness is because the parent process is triggering copies of the heap with a GC or something else?
If not, is there any other reason why CLONE_VFORK alone should make it faster? Any downside to using it?
Hmm, interesting observation. Yeah, maybe it is the parent process causing more COW work before the child process does its exec. CLONE_VFORK alone may already solve this without high risk.

What changes did you test? Have you used https://github.com/neelance/go/commit/b7edfba429d982e3e065d637334bcc63ad49f8f9 or have you simply specified CLONE_VFORK in the otherwise unmodified code? I think in theory that should be possible.

It would also be interesting to have numbers for the CLONE_VFORK + CLONE_VM case for comparison.
@neelance, do you want to get this into Go 1.9 in Feb?
@bradfitz Yes, I think it would be good to get this upstream. We should figure out whether CLONE_VFORK is enough. If yes, then that would be a low-risk, high-reward change. If we want to include CLONE_VM as well, then it definitely needs a review by someone who knows a lot about this domain. My patch above seems to work, but it was new territory for me.
@neelance my test was without your changes. It was just the gist I included with and without CLONE_VFORK.
Wow, nice, so the only change it makes is to have the parent wait until the child does its exec. That sounds to me like a very safe thing to do.

Could you do a test run with my full patch to see if there is any additional gain from CLONE_VM?
@neelance I feel super stupid. My first test was incorrect: CLONE_VFORK alone is marginally faster, but not anywhere near CLONE_VM|CLONE_VFORK. Here are all 3:
Unmodified Go 1.7
Running https://gist.github.com/arya/7e23e8654e87a6e80608ade43ee31041#file-without_vfork-go

```
$ forker
0GB: 752.606µs
1GB: 6.114471ms
2GB: 8.482259ms
3GB: 12.444931ms
```

Unmodified Go 1.7, application code adds CLONE_VFORK
Running https://gist.github.com/arya/7e23e8654e87a6e80608ade43ee31041#file-with_vfork-go

```
$ forker
0GB: 977.023µs
1GB: 2.883356ms
2GB: 6.791622ms
3GB: 10.0657ms
```

Go 1.7 with this patch applied: https://github.com/neelance/go/commit/b7edfba429d982e3e065d637334bcc63ad49f8f9
Running https://gist.github.com/arya/7e23e8654e87a6e80608ade43ee31041#file-without_vfork-go

```
$ forker
0GB: 634.004µs
1GB: 624.716µs
2GB: 531.193µs
3GB: 902.282µs
```
I'm surprised, though, that memory is still copied despite child_stack being set to zero.
As far as I understand the manpage of clone, you can use it in 3 ways:

1. Without CLONE_VM; then COW makes sure that the two processes don't interfere.
2. With CLONE_VM and a provided child_stack. Memory is shared, but the child uses the given stack, thus not interfering with the stack of the parent.
3. With CLONE_VM and CLONE_VFORK. Memory is shared, but the parent does not run until the child has detached from the memory space via exec. You have to consider that the child modifies the stack that the parent uses afterwards (my patch).

@neelance That makes sense to me. AFAICT the third option (your patch) is the most feasible and performant. The first option seems to be what's in master and suffers from a large amount of copying; CLONE_VFORK avoids some of the copies, but not many, apparently. The second option seems to me (as a novice to the internals) much more difficult to get right given the nature of Go and its management of memory. Is that accurate?
Yes, I also think that the second option is harder to implement.
I work on an app that makes heavy use of subprocesses to manipulate iptables and ipsets (since that's their only supported API). After observing poor performance when my process is using a lot of RAM, I found this issue.
I tried adding CLONE_VFORK to our use of exec.Command, but it seemed to make the throughput of my app worse! Maybe it's a matter of the process getting paused until the execve happens, lowering the throughput of goroutines running on other threads too.

FWIW, the previous version of our app was written in Python, and we observed a dramatic improvement when we switched from Python's default fork/exec strategy to using posix_spawn via FFI.

Measuring in an app that's under load with work going on in other threads, I see Cmd.Start() take 0-1ms at 40MB VSS vs 50-60ms at 1.4GB VSS. That amounts to a 50x difference in throughput for my app, and given the presence of the fork lock, there doesn't seem to be a way around it, even when using multiple goroutines.
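A measurement harness in this spirit might look like the following (a sketch; the spinning goroutines and the 1GB ballast are stand-ins for real load, not the app's actual code):

```go
package main

import (
	"fmt"
	"os/exec"
	"runtime"
	"time"
)

func main() {
	// Simulate work going on in other threads while we measure.
	for i := 0; i < runtime.NumCPU(); i++ {
		go func() {
			for x := 0; ; x++ {
				if x%100000000 == 0 {
					runtime.Gosched()
				}
			}
		}()
	}

	// Ballast to inflate the process's memory footprint.
	ballast := make([]byte, 1<<30)
	for i := range ballast {
		ballast[i] = 1
	}

	// Time only Start(), the part serialized by syscall.ForkLock.
	for i := 0; i < 20; i++ {
		cmd := exec.Command("/bin/true")
		t := time.Now()
		if err := cmd.Start(); err != nil {
			panic(err)
		}
		fmt.Printf("Start(): %v\n", time.Since(t))
		cmd.Wait()
	}
	runtime.KeepAlive(ballast)
}
```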
@fasaxc Yes, using CLONE_VFORK without CLONE_VM only adds additional waiting for the subprocess to exec, without saving time anywhere else.

Would you mind applying the whole patch https://github.com/neelance/go/commit/f2077098297cfdd4cbbd5fd2302ae1ae3730dc0f to your GOROOT, then doing go install -a syscall just to be sure, and then rebuilding your app with that? You don't need to modify your exec.Command. I'd be interested to know whether the patch also improves your use case.

By "improve" I specifically mean the latency under high RAM usage. You are right that in a low-RAM situation it may lower the throughput. Please check whether it is still 50x when using the full patch.
That patch makes a dramatic improvement. I'm measuring 99%ile latency of 1ms vs 60ms before and a drop from 100% CPU to 20% CPU usage.
Yey, I'm happy to hear that. Any downsides that you see? What about the low-memory situation?
@neelance It seems to improve latency at small VSS size too (~40MB): 800us 99th %ile vs 2600us
Cool. So there is no reason not to bring this upstream. I'll create a CL today or tomorrow.
@bradfitz CL ready at https://go-review.googlesource.com/#/c/37439/
CL https://golang.org/cl/37439 mentions this issue.
@neelance nice work!
gitlab blogged about this patch giving a 30x improvement to the p99 latency of the git service they developed :) https://about.gitlab.com/2018/01/23/how-a-fix-in-go-19-sped-up-our-gitaly-service-by-30x/