Go: proposal: cmd/cgo: unsafe FFI calls

Created on 9 Nov 2020 · 14 comments · Source: golang/go

The Glasgow Haskell Compiler (GHC) differentiates between "safe" and "unsafe" FFI calls. "safe" FFI calls are allowed to block and call back into Haskell, but have a substantial overhead. "unsafe" FFI calls are not allowed to block, but are as fast as a C function call.

While Go FFI will always be slower due to stack switching, this seems to account for only a small amount of the overhead that others have observed. If the C function is guaranteed to be short-running, a significant speedup can be obtained by making a direct call into C code, without involving the Go scheduler. Of course, if the C function blocks, this is bad, but in many cases, it can be guaranteed not to. Calling back into Go from an unsafe FFI call is undefined behavior, but in many cases such calls are known not to occur.

Proposal

All 14 comments

This will produce programs that usually work but sometimes hang for inexplicable reasons. I don't think we've come close to the theoretical limit on speeding up cgo calls without removing safety.

Can you point to some documentation for GHC unsafe calls? Thanks.

@ianlancetaylor Here is a section in the GHC User Guide about guaranteed call safety: https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/ffi-chap.html#guaranteed-call-safety

The Haskell 2010 Report specifies that safe FFI calls must allow foreign calls to safely call into Haskell code. In practice, this means that the garbage collector must be able to run while these calls are in progress, moving heap-allocated Haskell values around arbitrarily.

This greatly constrains library authors since it implies that it is not safe to pass any heap object reference to a safe foreign function call. For instance, it is often desirable to pass unpinned ByteArray#s directly to native code to avoid making an otherwise-unnecessary copy. However, this can only be done safely if the array is guaranteed not to be moved by the garbage collector in the middle of the call.

The Chapter does not require implementations to refrain from doing the same for unsafe calls, so strictly Haskell 2010-conforming programs cannot pass heap-allocated references to unsafe FFI calls either.

In previous releases, GHC would take advantage of the freedom afforded by the Chapter by performing safe foreign calls in place of unsafe calls in the bytecode interpreter. This meant that some packages which worked when compiled would fail under GHCi (e.g. #13730).

However, since version 8.4 this is no longer the case: GHC guarantees that garbage collection will never occur during an unsafe call, even in the bytecode interpreter, and further guarantees that unsafe calls will be performed in the calling thread.

What we'd need to decide to do this is very compelling evidence that
(1) the speed difference is significant,
(2) the speed difference cannot be reduced by other optimization work, and
(3) this happens often in situations where the difference is critical.

Does anyone have any data about these?

Leaving open for another week, but in the absence of evidence that the current cgo isn't fast enough, this is headed for a likely decline.

@rsc There is plenty of evidence. The Tor Project decided that CGo was a very poor choice for their incremental rewrite of the Tor daemon. Filippo Valsorda wrote Rustigo to reduce call overhead when invoking Rust cryptographic routines. Rustigo was a disgusting hack, but it was over 15 times faster than CGo, which translated into a significant improvement in benchmarks.

Yes, deadlocking if the invoked function blocks is not great. But there are cases where the invoked function is absolutely guaranteed not to block. Fast cryptographic routines are one example. Graphics APIs such as Vulkan are another, and I recall yet another involving database access. In these cases, if performance matters, the choice isn't "use CGo or a disgusting assembler hack". It's "use a disgusting assembler hack or reimplement the hot code path in a different language".

For reference, the current cgo overhead:

goos: windows
goarch: amd64
pkg: misc/cgo/test
cpu: AMD Ryzen Threadripper 2950X 16-Core Processor

name                             time/op
CgoCall/add-int-32               49.3ns ± 2%
CgoCall/one-pointer-32           91.8ns ± 0%
CgoCall/eight-pointers-32         343ns ± 1%
CgoCall/eight-pointers-nil-32    89.4ns ± 2%
CgoCall/eight-pointers-array-32  3.90µs ± 1% // known bug
CgoCall/eight-pointers-slice-32  2.70µs ± 0%

GODEBUG=cgocheck=0
name                             time/op
CgoCall/add-int-32               51.1ns ± 0%
CgoCall/one-pointer-32           49.8ns ± 2%
CgoCall/eight-pointers-32        66.1ns ± 0%
CgoCall/eight-pointers-nil-32    66.2ns ± 0%
CgoCall/eight-pointers-array-32  1.62µs ± 1% // known bug
CgoCall/eight-pointers-slice-32   418ns ± 2%

@DemiMarie Earlier @rsc listed three things that we would want evidence for (https://github.com/golang/go/issues/42469#issuecomment-737422241).

The performance of calls across the cgo boundary matters most when those calls themselves, rather than the other code on the caller side or the callee side, are a significant part of the performance cost. That implies that the code is making a lot of calls. How often is that the case? And I'll note that I believe we can continue to make the cgo call process faster.

@egonelbre Those numbers suggest that the overhead is due to pointer checking, but this proposal doesn't address pointer checking at all.

@ianlancetaylor, yeah, kind of. The more complicated the data structure, the more time it'll take to check. For the benchmark I tried to find the most complex struct possible, https://github.com/golang/go/blob/master/misc/cgo/test/test.go#L120; however, I would suspect that such structs are the exception. Nevertheless, there probably is a way to speed up such structures as well. I would suspect that for most codebases the overhead will be in entersyscall/exitsyscall rather than cgocheck.

Both Filippo's and the Tor Project's examples seem to predate a few cgo optimizations, so I'm not sure how applicable they are. For Filippo's example the current cgo call overhead is ~50ns, which seems negligible compared to the ~20µs cost of the function itself.

PS: while reinvestigating cgo entersyscall, I noticed that there might be ways to save at least ~4ns (https://go-review.googlesource.com/c/go/+/278793).

Tor Project needed to use callbacks from C into Go extensively, which are extremely slow.

> Tor Project needed to use callbacks from C into Go extensively, which are _extremely_ slow.

Can you point to some benchmarks? That would give folks something to aim at rather than talking across each other.

@davecheney Sadly no, but I do remember reading that they took ~1-2 milliseconds at one point.

Related: https://github.com/golang/go/issues/16051

If the issue is that these functions need to be preemptible, would another alternative be to give these "non-blocking" foreign functions some way to yield to the scheduler?

Callbacks from C to Go have gotten much faster over the last few releases (including the upcoming 1.16 release).
