Go: os: StartProcess ETXTBSY race on Unix systems

Created on 18 Oct 2017 · 18 Comments · Source: golang/go

Modern Unix systems appear to have a fundamental design flaw in the interaction between multithreaded programs, fork+exec, and the prohibition on executing a program if that program is open for writing.

Below is a simple multithreaded C program. It creates 20 threads all doing the same thing: write an exit 0 shell script to /var/tmp/fork-exec-N (for different N), and then fork and exec that script. Repeat ad infinitum. Note that the shell script fds are opened O_CLOEXEC, so that an fd being written by one thread does not leak into the fork+exec's shell script of a different thread.

On my Linux workstation, this program produces a never-ending stream of ETXTBSY errors. The problem is that O_CLOEXEC is not enough. The fd being written by one thread _can_ leak into the forked child of a second thread, and it stays there until that child calls exec. If the first thread closes the fd and calls exec before the second thread's child does exec, then the first thread's exec will get ETXTBSY, because somewhere in the system (specifically, in the child of the second thread), there is an fd still open for writing the first thread's shell script, and according to modern Unix rules, one must not exec a program if there exists any fd anywhere open for writing that program.

Five years ago this bit us because cmd/go installed cmd/cgo (that is, copied the binary from a temporary location to somewhere/bin/cgo) and then executed it. To fix this we put a sleep+retry loop around the fork+exec of cgo when it gets ETXTBSY. Now (as of last week or so) we don't ever install cmd/cgo and execute it in the same cmd/go process, so that specific race is gone, although as I write this cmd/go still has the sleep+retry loop, which I intend to remove.

Last week this bit us again because cmd/go updated a build stamp in the binary, closed it, and executed it. The resulting flaky ETXTBSY failures were reported as #22220. A pending CL fixes this by not updating the build stamp in temporary binaries, which are the main ones we execute. There's still one case where we write+execute a program, which is go test -cpuprofile x.prof pkg. The cpuprofile flag (and a few others) cause cmd/go to leave the pkg.test in the current directory for debugging purposes but also run the test. Luckily running the test is currently the final thing cmd/go does, and it waits for any other fork+exec'ed programs to finish before fork+exec'ing the test. So the race cannot happen in this case.

In general this race is going to happen every time anyone writes a program that both writes and executes a program. It's easy to imagine other build systems running into this, but also programs that do things like unzip a zip file and then run a program inside it - think a program supervisor or mini container runtime. As soon as there are multiple threads doing fork+exec at the same time, and one of them is doing fork+exec of a program that was previously open for write in the same process, you have a mysterious flaky problem.

It seems like maybe Go should take care of this, if possible. We've now hit it twice in cmd/go, five years apart, and at least this past time it took the better part of a day to figure out. (I don't remember how long it took five years ago, in part because I don't remember anything about discovering it five years ago. I also don't want to rediscover all this five years from now.)

There are a few hacks we could use:

- In os.StartProcess, if we see ETXTBSY, sleep 100ms and try again, maybe a few times, up to say 1 second of sleeping. In general we don't know how long to sleep.
- Arrange with a locking mechanism that close must never complete during a fork+exec sequence. The end of the fork+exec sequence needs to be the point where we know the close-on-exec fds have been closed. Unfortunately there is no portable way to identify that point.
  - If the exec fails and the child tells us and exits, we can wait for the exit. That's easy.
  - If the exec succeeds, we find out because the exec closes the child's end of the status pipe, and we get EOF.
    - If we know that an OS does close-on-exec work in increasing fd order, then we could also track the maximum fd we've opened and move the status pipe above that. Then seeing the status pipe close would mean all other fds are closed too.
    - If the OS had a "close all fds above x", we could use that. (I don't know of any that do, but it sure would help.)
  - It may not be OK to block all closes on a wedged fork+exec (in general an exec'ed program may be loaded from some slow network server).
- Note that vfork(2) is not a solution. Vfork is defined as the parent does not continue executing until the child is no longer using the parent's memory image. In the case of a successful exec, at least on Linux, vfork releases the memory image before doing any of the close-on-exec work, so the parent continues running before the child has closed the fds we care about.

None of these seem great. The ETXTBSY sleep, up to 1 second, might be the best option. It would certainly reduce the flake rate and in many cases would probably make it undetectable. It would not help exec of very slow-to-load programs, but that's not the common case.

I wondered how Java deals with this, and the answer seems to be that Java doesn't deal with this. https://bugs.openjdk.java.net/browse/JDK-8068370 was filed in 2014 and is still open.

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <errno.h>
#include <stdint.h>

void* runner(void*);

int
main(void)
{
    int i;
    pthread_t pid[20];

    for(i=1; i<20; i++)
        pthread_create(&pid[i], 0, runner, (void*)(uintptr_t)i);
    runner(0);
    return 0;
}

char script[] = "#!/bin/sh\nexit 0\n";

void*
runner(void *v)
{
    int i, fd, pid, status;
    char buf[100], *argv[2];

    i = (int)(uintptr_t)v;
    snprintf(buf, sizeof buf, "/var/tmp/fork-exec-%d", i);
    argv[0] = buf;
    argv[1] = 0;
    for(;;) {
        fd = open(buf, O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0777);
        if(fd < 0) {
            perror("open");
            exit(2);
        }
        write(fd, script, strlen(script));
        close(fd);
        pid = fork();
        if(pid < 0) {
            perror("fork");
            exit(2);
        }
        if(pid == 0) {
            execve(buf, argv, 0);
            exit(errno);
        }
        if(waitpid(pid, &status, 0) < 0) {
            perror("waitpid");
            exit(2);
        }
        if(!WIFEXITED(status)) {
            perror("waitpid not exited");
            exit(2);
        }
        status = WEXITSTATUS(status);
        if(status != 0)
            fprintf(stderr, "exec: %d %s\n", status, strerror(status));
    }
    return 0;
}

NeedsInvestigation

rsc

All 18 comments

/cc @aclements @ianlancetaylor @crawshaw

rsc on 18 Oct 2017

Change https://golang.org/cl/71570 mentions this issue: cmd/go: skip updateBuildID on binaries we will run

gopherbot on 18 Oct 2017

Change https://golang.org/cl/71571 mentions this issue: cmd/go: delete ETXTBSY hack that is no longer needed

gopherbot on 18 Oct 2017

Userspace workarounds seem flawed or less than ideal. This is a kernel
problem, like O_CLOEXEC. Perhaps lobby for an O_CLOFORK that's similar
but close on fork instead. The writer would open, write, close, fork,
exec so wouldn't make use of it, but any other thread that forks
wouldn't carry the FD with it so the writer's close would succeed in
nailing the sole, final, reference to the "open file description", as
POSIX calls it.

RalphCorderoy on 18 Oct 2017

O_CLOFORK is a good idea. Does anybody want to suggest that to the Linux kernel maintainers? I expect that if someone can get it into Linux it will flow through to the other kernels.

I'm going to repeat a hack I described elsewhere that I believe would work for pure Go programs.

- record the highest file descriptor returned by syscall.Open, syscall.Socket, syscall.Dup, etc.
- add a new RWMutex in syscall: forkMutex
- during syscall.Close, acquire a read lock on forkMutex
- in syscall.forkAndExecInChild:
  - acquire a write lock on forkMutex, and
  - open a pipe in the parent (as we already do if UidMappings is set), and
  - in the child, loop through the descriptors up to the highest one, closing each one that is marked close-on-exec, then close the pipe to the parent
  - in the parent, when the pipe is closed, release the forkMutex lock

The effect of this should be that when syscall.Close returns, we know for sure that there is no forked child that has an open copy of the descriptor.

The disadvantages are that all forks are serialized, and that all forks waste time closing descriptors that will shortly be closed anyhow. Also, of course, forks temporarily block closes, but that is unlikely to be significant.

ianlancetaylor on 18 Oct 2017

> O_CLOFORK is a good idea. Does anybody want to suggest that to the Linux kernel maintainers?

I'm happy to have a go, but I'm a nobody on that list. I was assuming folks here might have the ear of a Google kernel developer or two in that area that would vet the idea and suggest it to the list if worthy. :-)

> during syscall.Close, acquire a read lock on forkMutex

And syscall.Dup2 and Dup3 as they may cause newfd to close.

Do syscall.Open _et al_ also synchronise with forkMutex somehow? I'm wondering if they can be creating more FDs, either above or below the highwater mark, whilst forkAndExecInChild is looping, closing close-on-exec ones.

RalphCorderoy on 18 Oct 2017

Is there a place to file a feature request against the Linux kernel? I know nothing about the kernel development process. I hear it uses git.

Agree about Dup2 and Dup3.

As far as I can see it doesn't matter if syscall.Open and friends create a new FD while the child is looping, because the child won't see the new descriptor anyhow.

ianlancetaylor on 18 Oct 2017

@ianlancetaylor thanks, yes, the explicit closes would solve the problem with slow execs, which would be nice. That might make this actually palatable. You also don't even need the extra pipe if you use vfork in this approach.

I agree with @RalphCorderoy that there's a race between the "maintain the max" and "fork", in that Open might create a new fd, then fork runs in a different thread before Open can update the max. But since fds are created lowest-available, it should suffice for the child to assume that max is, say, 10 larger than it is.

Also note that this need not be an RWMutex (and for that matter the current syscall.ForkMutex need not be an RWMutex either). It just needs to be an "either-or" mutex. An RWMutex allows N readers or 1 writer. The mutex we need would allow N of type A or N of type B, just never a mix. If we built that (not difficult, I don't think), then programs that never fork would not serialize any of their closes, and programs that fork a lot but don't close things would not serialize any of their forks.
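One way such an "either-or" mutex might look, sketched with a `sync.Mutex` and a `sync.Cond`. The type name `eitherOr`, the kind constants, and the `countMixes` checker are all hypothetical, and the sketch makes no attempt at fairness: a steady stream of one kind can starve the other.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Hypothetical kinds of holder; the names are illustrative only.
const (
	kindClose = 0
	kindFork  = 1
)

// eitherOr admits any number of holders of one kind at a time,
// but never holders of both kinds at once.
type eitherOr struct {
	mu     sync.Mutex
	cond   *sync.Cond
	active [2]int // current holders of each kind
}

func newEitherOr() *eitherOr {
	e := &eitherOr{}
	e.cond = sync.NewCond(&e.mu)
	return e
}

// lock waits until no holder of the other kind remains,
// then registers the caller as a holder of its own kind.
func (e *eitherOr) lock(kind int) {
	e.mu.Lock()
	for e.active[1-kind] > 0 {
		e.cond.Wait()
	}
	e.active[kind]++
	e.mu.Unlock()
}

// unlock removes the caller and wakes waiters once its kind has drained.
func (e *eitherOr) unlock(kind int) {
	e.mu.Lock()
	e.active[kind]--
	if e.active[kind] == 0 {
		e.cond.Broadcast()
	}
	e.mu.Unlock()
}

// countMixes runs workers goroutines of each kind and counts how often a
// holder sees the other kind inside the critical section (should be 0).
func countMixes(e *eitherOr, workers, iters int) int32 {
	var inside [2]int32
	var mixes int32
	var wg sync.WaitGroup
	for w := 0; w < 2*workers; w++ {
		kind := w % 2
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				e.lock(kind)
				atomic.AddInt32(&inside[kind], 1)
				if atomic.LoadInt32(&inside[1-kind]) != 0 {
					atomic.AddInt32(&mixes, 1)
				}
				atomic.AddInt32(&inside[kind], -1)
				e.unlock(kind)
			}
		}()
	}
	wg.Wait()
	return mixes
}

func main() {
	fmt.Println("mixed holders observed:", countMixes(newEitherOr(), 8, 1000))
}
```

With this shape, closes would call `lock(kindClose)` and forks `lock(kindFork)`, so a program that only ever does one of the two never waits on the other kind.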

O_CLOFORK would require having fcntl F_SETFL/F_GETFL support for that bit too, and it would complicate fork a little more than it already is. An alternative that would be equally fine for us would be a "close all fd's above" or "tell me the maximum fd of my process" syscall. I don't know if a new bit or a new syscall is more likely.

rsc on 18 Oct 2017

I should maybe also note that macOS fixes this problem by putting #if 0 around the ETXTBSY check in the kernel implementation of exec. That would be a third option for Linux although probably less likely than the other two.

rsc on 18 Oct 2017

I've emailed the linux-kernel mailing list. Will reference an archive once it appears.
If they're unpersuaded, then there's the POSIX folks at Open Group; they have a bug tracker.

RalphCorderoy on 18 Oct 2017

linux-kernel mailing-list archive of post: https://marc.info/?l=linux-kernel&m=150834137201488

RalphCorderoy on 18 Oct 2017

What's the plan here for Go 1.10?

@RalphCorderoy, looks like you never got a reply, eh?

bradfitz on 30 Nov 2017

Looks like Solaris and macOS and OpenBSD have O_CLOFORK already. Hopefully it will catch on further.

ianlancetaylor on 20 Dec 2018

I'm currently running into this (I think?) on Ubuntu, using Go 1.13.5, calling ioutil.WriteFile to write a binary, immediately followed by exec.Command. Is there a suggestion for the best way to detect this in user space? Stat the file until you don't get ETXTBSY?

kevinburkemeter on 6 Dec 2019

A colleague pointed me to this bug in context of a wider discussion about O_CLOFORK. When each fork is expected to proceed to exec (as is the case here), it is possible to solve the problem via open file description locks in 4 extra syscalls, without requiring any cooperation between threads.

The high-level algorithm for writing a file for execution is as follows:

1. open an fd with O_WRONLY | O_CLOEXEC
2. write into fd
3. place open file description lock on the fd
4. close the fd
5. open a new fd with O_RDONLY | O_CLOEXEC (same path as step 1)
6. place open file description lock on it
7. close the fd

If an fd opened in step 1 leaked to another process as a result of a concurrent thread issuing a fork(), we wait for it to be closed at step 6. An fd opened at step 5 may also leak, but won't cause ETXTBSY as it is open read-only.

The diff to the program shown in the opening comment would be just:

@@ -41,6 +44,20 @@ runner(void *v)
                        exit(2);
                }
                write(fd, script, strlen(script));
+               if (flock(fd, LOCK_EX) < 0) {
+                       perror("flock");
+                       exit(2);
+               }
+               close(fd);
+               fd = open(buf, O_RDONLY|O_CLOEXEC, 0777);
+               if(fd < 0) {
+                       perror("open (readonly)");
+                       exit(2);
+               }
+               if (flock(fd, LOCK_SH) < 0) {
+                       perror("flock (readonly)");
+                       exit(2);
+               }
                close(fd);
                pid = fork();
                if(pid < 0) {

amonakov on 18 Oct 2020

@amonakov Thanks for the comment. That is an interesting suggestion.

I guess that to make this work automatically in Go we would have to detect when an executable file is opened with write access. Unfortunately this would seem to require an extra fstat system call for every file opened for write access. That is not so great. Perhaps we could restrict it to only calls that use O_CREATE as that is likely the most common case that causes problems.

But then there seems to be a race condition. The fork can happen at any time. If the fork happens after we call open but before we call flock, then it seems that the same problem can occur. In the problematic case the fork doesn't know anything about the file that we are writing. The problem is that the file is held open by the child process. Using the flock technique makes this much less likely to be a problem, but I don't think it completely eliminates the problem.

ianlancetaylor on 19 Oct 2020

> ... make this work automatically in Go ...

I don't think that would work: permission bits could be changed independently after close(). In any case, my solution has two assumptions: that the file was opened with O_CLOEXEC, and that long-lived forks do not appear. For that reason I'd say it's not appropriate to roll it up into some standard function. It could live as a separate close-like function where the purpose and requirements could be clearly documented.

> But then there seems to be a race condition. The fork can happen at any time. If the fork happens after we call open but before we call flock, then it seems that the same problem can occur.

No, the forked child shares the open file description with the parent, so a later flock in the parent still affects it.

amonakov on 19 Oct 2020

@amonakov Thanks.

For what it's worth, all files opened using the Go standard library have O_CLOEXEC set. And Go doesn't support long-lived forks, as fork doesn't work well with multi-threaded programs, and all Go programs are multi-threaded. So I don't think those are issues.

That said, personally I would not want to add new API to close an executable file. That seems awkward and hard to understand. I'd much rather persuade kernels to support O_CLOFORK. Of course any particular program can use your technique.

ianlancetaylor on 19 Oct 2020
