go version)?$ go version go version go1.11.3 linux/amd64
Yes, it also reproduces with the latest tip.
go env)?go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/prashant/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/prashant/go"
GOPROXY=""
GORACE=""
GOROOT="/home/prashant/.gimme/versions/go1.11.3.linux.amd64"
GOTMPDIR=""
GOTOOLDIR="/home/prashant/.gimme/versions/go1.11.3.linux.amd64/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build250167096=/tmp/go-build -gno-record-gcc-switches"
Created a unix listener, called File() to get the underlying file descriptor, then ran Accept in a goroutine. After some while, tried to call Close on the listener, but the Close blocks indefinitely.
Repro:
https://github.com/prashantv/unix-close-race/blob/master/main.go
Close to return, and cause the Accept (in a separate goroutine) to unblock, and return an error.
In Go 1.11.3, the Close blocks indefinitely:
goroutine 1 [semacquire]:
internal/poll.runtime_Semacquire(0xc0000a4028)
/home/prashant/.gimme/versions/go1.11.3.linux.amd64/src/runtime/sema.go:61 +0x39
internal/poll.(*FD).Close(0xc0000a4000, 0xc0000a4000, 0x0)
/home/prashant/.gimme/versions/go1.11.3.linux.amd64/src/internal/poll/fd_unix.go:109 +0xdb
net.(*netFD).Close(0xc0000a4000, 0xc0000b5e58, 0xc00008a030)
/home/prashant/.gimme/versions/go1.11.3.linux.amd64/src/net/fd_unix.go:184 +0x4f
net.(*UnixListener).close(0xc000080540, 0xc0000d0020, 0x5)
/home/prashant/.gimme/versions/go1.11.3.linux.amd64/src/net/unixsock_posix.go:186 +0x65
net.(*UnixListener).Close(0xc000080540, 0x1, 0x1)
/home/prashant/.gimme/versions/go1.11.3.linux.amd64/src/net/unixsock.go:270 +0x47
main.testCloseAfterAccept(0x511ec0, 0xc000080540)
/home/prashant/go/src/github.com/prashantv/unix-close-race/main.go:66 +0x111
main.main()
/home/prashant/go/src/github.com/prashantv/unix-close-race/main.go:43 +0x1ce
In Go 1.10.6, the Close returns, but the Accept goroutine is not unblocked,
2018/12/14 14:58:02 Got fd from unix ln5
2018/12/14 14:58:02 wait for accept
2018/12/14 14:58:02 about to accept
2018/12/14 14:58:02 close
2018/12/14 14:58:02 wait for accept to end
<blocks>
It looks like the behaviour change between 1.10 and 1.11 may be caused by https://go-review.googlesource.com/c/go/+/119955/ (fix for #24942)
I added a flag to the repro, if you pass --use-new-fd, instead of calling Accept and Close on the original unix listener (which we called File() on), it uses a new listener from the copied file descriptor. This mitigates the issue (both in Go 1.10 and Go 1.11). It seems like calling File() duplicates the file descriptor, but somehow affects the original file descriptor causing issues with Accept + Close.
cc @witriew who originally ran into this issue.
This was observed in osrg/gobgp as well (at least in previous versions). This is my working theory so far after debugging that (in Go 1.11.1).
The problem is actually with the fd := f.Fd() line in the reproduction. This call makes the FD non-blocking:
func (f *File) Fd() uintptr {
if f == nil {
return ^(uintptr(0))
}
// If we put the file descriptor into nonblocking mode,
// then set it to blocking mode before we return it,
// because historically we have always returned a descriptor
// opened in blocking mode. The File will continue to work,
// but any blocking operation will tie up a thread.
if f.nonblock {
f.pfd.SetBlocking()
}
return uintptr(f.pfd.Sysfd)
}
The pfd being modifed in SetBlocking() is the duplicated FD, but note that it eventually calls fcntl(fd, F_SETFL, flag &^= O_NONBLOCK) (thus clearing the nonblocking flag on the duplicate FD).
But, the gotcha is that fcntl operates on the underlying file description, not the file descriptor. The duplicate and the original file descriptors (different integers) both share the same file description (and the same file status flags), so the nonblocking flag is cleared on the original unixLn socket by this call as well.
Thus, the unixLn socket reads are actually blocking calls, but the original pFD on the socket (poll.FD) has its isBlocking == 0. This triggers a deadlock on Close() during Accept() when it waits for all readers (the accept call) to close on on the pfd.csema semaphore.
func (fd *FD) Close() error {
...
// Wait until the descriptor is closed. If this was the only
// reference, it is already closed. Only wait if the file has
// not been set to blocking mode, as otherwise any current I/O
// may be blocking, and that would block the Close.
// No need for an atomic read of isBlocking, increfAndClose means
// we have exclusive access to fd.
if fd.isBlocking == 0 {
runtime_Semacquire(&fd.csema)
}
return err
}
See "File status flags" in fcntl(2) for reference.
@dylanplecki Thanks, I think your analysis is exactly correct.
The File method is documented as saying "The returned os.File's file descriptor is different from the connection's. Attempting to change properties of the original using this duplicate may or may not have the desired effect." The Fd method is documented as saying "On Unix systems this will cause the SetDeadline methods to stop working." The latter is a shorthand for a complex statement that amounts to "the returned file descriptor is put into blocking mode." This changes the property of the duplicate which changes the property of the original.
I don't know what we can do about this without losing features that are more important. I'm open to suggestions.
Before that, though: what are you really trying to do? Is this something that could be done more appropriately with net.ListenConfig?
@dylanplecki That makes perfect sense, thanks for the analysis! We noticed that the Fd was the culprit, didn't realize that the SetBlocking the duplicated FD affected the original FD as well, that was the missing piece that leads to a simple workaround for our use-case -- put the file back into non-blocking mode using syscall.SetNonblock.
@ianlancetaylor We're implementing a graceful restart feature (similar to HAProxy), so we need to pass file descriptors from the old process to the new using a unix domain socket. I don't think we can use net.ListenConfig here.
I think the issue comes down to:
File() returning a new file descriptor, but still referencing the same underlying file description -- the documentation mentions a new file descriptor, and that changes may not affect the original. Maybe it should be more explicit in mentioning that they may share underlying state.Fd() on the duplicated *os.File sets the file description into blocking mode, but the original internal/poll.FD of the listener still has isBlocking == 1, since it doesn't "realize" the underlying file descriptor is no longer non-blocking. On Close(), we enter this block:https://github.com/golang/go/blob/master/src/internal/poll/fd_unix.go#L102
// Wait until the descriptor is closed. If this was the only
// reference, it is already closed. Only wait if the file has
// not been set to blocking mode, as otherwise any current I/O
// may be blocking, and that would block the Close.
// No need for an atomic read of isBlocking, increfAndClose means
// we have exclusive access to fd.
if fd.isBlocking == 0 {
runtime_Semacquire(&fd.csema)
}
As the comment says, this causes Close() to block, leading to the behaviour we see in the bug.
Even the original internal/poll.FD was updated to no longer be considered non-blocking as part of the File() + Fd() operations, the user may trigger changes that cause it to get out-of-sync (e.g., using syscall.SetNonblock).
I think you can use SyscallConn to get the fd without side effects.
https://golang.org/pkg/net/#UnixListener.SyscallConn
I think I just ran into this issue in CL 154664. Closeing the file returned by File() before Closeing the connection worked for me.
(See line 86 of sendfile_test.go.)
There is an issue for sendfile.
https://github.com/golang/go/issues/28330
This was observed in
osrg/gobgpas well (at least in previous versions). This is my working theory so far after debugging that (in Go 1.11.1).The problem is actually with the
fd := f.Fd()line in the reproduction. This call makes the FD non-blocking:func (f *File) Fd() uintptr { if f == nil { return ^(uintptr(0)) } // If we put the file descriptor into nonblocking mode, // then set it to blocking mode before we return it, // because historically we have always returned a descriptor // opened in blocking mode. The File will continue to work, // but any blocking operation will tie up a thread. if f.nonblock { f.pfd.SetBlocking() } return uintptr(f.pfd.Sysfd) }The
pfdbeing modifed inSetBlocking()is the duplicated FD, but note that it eventually callsfcntl(fd, F_SETFL, flag &^= O_NONBLOCK)(thus clearing the nonblocking flag on the duplicate FD).But, the gotcha is that
fcntloperates on the underlying file description, not the file descriptor. The duplicate and the original file descriptors (different integers) both share the same file description (and the same file status flags), so the nonblocking flag is cleared on the originalunixLnsocket by this call as well.Thus, the
unixLnsocket reads are actually blocking calls, but the original pFD on the socket (poll.FD) has itsisBlocking == 0. This triggers a deadlock onClose()duringAccept()when it waits for all readers (the accept call) to close on on thepfd.csemasemaphore.func (fd *FD) Close() error { ... // Wait until the descriptor is closed. If this was the only // reference, it is already closed. Only wait if the file has // not been set to blocking mode, as otherwise any current I/O // may be blocking, and that would block the Close. // No need for an atomic read of isBlocking, increfAndClose means // we have exclusive access to fd. if fd.isBlocking == 0 { runtime_Semacquire(&fd.csema) } return err }See "File status flags" in fcntl(2) for reference.
This post helps me a lot. And I also found that not only accept will block close, but alse read,write and other ops that will increase lock reference. Here is my test code:
package main
import (
"fmt"
"log"
"net"
"time"
)
func main() {
l, _ := net.Listen("tcp", ":8888")
for {
conn, _ := l.Accept()
log.Println("get conn", conn.RemoteAddr())
go func() {
go func() {
for {
time.Sleep(1 * time.Second)
conn.Close()
fmt.Println("here")
}
}()
f, _ := conn.(*net.TCPConn).File()
_ = f.Fd()
f.Close()
arr := make([]byte, 1000)
for {
n, err := conn.Read(arr)
fmt.Println(n, err)
time.Sleep(3 * time.Second)
}
}()
}
}
you will find conn.Close() will be blocked until read get some data.
Most helpful comment
This was observed in
osrg/gobgpas well (at least in previous versions). This is my working theory so far after debugging that (in Go 1.11.1).The problem is actually with the
fd := f.Fd()line in the reproduction. This call makes the FD non-blocking:The
pfdbeing modifed inSetBlocking()is the duplicated FD, but note that it eventually callsfcntl(fd, F_SETFL, flag &^= O_NONBLOCK)(thus clearing the nonblocking flag on the duplicate FD).But, the gotcha is that
fcntloperates on the underlying file description, not the file descriptor. The duplicate and the original file descriptors (different integers) both share the same file description (and the same file status flags), so the nonblocking flag is cleared on the originalunixLnsocket by this call as well.Thus, the
unixLnsocket reads are actually blocking calls, but the original pFD on the socket (poll.FD) has itsisBlocking == 0. This triggers a deadlock onClose()duringAccept()when it waits for all readers (the accept call) to close on on thepfd.csemasemaphore.See "File status flags" in fcntl(2) for reference.