Go: proposal: cmd/cgo: unsafe FFI calls

Created on 9 Nov 2020 · 14 comments · Source: golang/go

The Glasgow Haskell Compiler (GHC) differentiates between "safe" and "unsafe" FFI calls. "safe" FFI calls are allowed to block and call back into Haskell, but have a substantial overhead. "unsafe" FFI calls are not allowed to block, but are as fast as a C function call.

While Go FFI will always be slower due to stack switching, this seems to account for only a small amount of the overhead that others have observed. If the C function is guaranteed to be short-running, a significant speedup can be obtained by making a direct call into C code, without involving the Go scheduler. Of course, if the C function blocks, this is bad, but in many cases, it can be guaranteed not to. Calling back into Go from an unsafe FFI call is undefined behavior, but in many cases such calls are known not to occur.

Proposal

All 14 comments

This will produce programs that usually work but sometimes hang for inexplicable reasons. I don't think we've come close to the theoretical limit on speeding up cgo calls without removing safety.

Can you point to some documentation for GHC unsafe calls? Thanks.

@ianlancetaylor Here is a section in the GHC User Guide about guaranteed call safety: https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/ffi-chap.html#guaranteed-call-safety

The Haskell 2010 Report specifies that safe FFI calls must allow foreign calls to safely call into Haskell code. In practice, this means that the garbage collector must be able to run while these calls are in progress, moving heap-allocated Haskell values around arbitrarily.

This greatly constrains library authors since it implies that it is not safe to pass any heap object reference to a safe foreign function call. For instance, it is often desirable to pass unpinned ByteArray#s directly to native code to avoid making an otherwise-unnecessary copy. However, this can only be done safely if the array is guaranteed not to be moved by the garbage collector in the middle of the call.

The Chapter does not require implementations to refrain from doing the same for unsafe calls, so strictly Haskell 2010-conforming programs cannot pass heap-allocated references to unsafe FFI calls either.

In previous releases, GHC would take advantage of the freedom afforded by the Chapter by performing safe foreign calls in place of unsafe calls in the bytecode interpreter. This meant that some packages which worked when compiled would fail under GHCi (e.g. #13730).

However, since version 8.4 this is no longer the case: GHC guarantees that garbage collection will never occur during an unsafe call, even in the bytecode interpreter, and further guarantees that unsafe calls will be performed in the calling thread.

What we'd need to decide to do this is very compelling evidence that
(1) the speed difference is significant,
(2) the speed difference cannot be reduced by other optimization work, and
(3) this happens often in situations where the difference is critical.

Does anyone have any data about these?

Leaving open for another week, but in the absence of evidence that the current cgo isn't fast enough, this is headed for a likely decline.

@rsc There is plenty of evidence. The Tor Project decided that CGo was a very poor choice for their incremental rewrite of the Tor daemon. Filippo Valsorda wrote Rustigo to reduce call overhead when invoking Rust cryptographic routines. Rustigo was a disgusting hack, but it was over 15 times faster than CGo, which translated into a significant improvement in benchmarks.

Yes, deadlocking if the invoked function blocks is not great. But there are cases where the invoked function is absolutely guaranteed not to block. Fast cryptographic routines are one example. Graphics APIs such as Vulkan are another, and I recall yet another involving database access. In these cases, if performance matters, the choice isn't "use CGo or a disgusting assembler hack". It's "use a disgusting assembler hack or reimplement the hot code path in a different language".

For reference, the current cgo overhead:

goos: windows
goarch: amd64
pkg: misc/cgo/test
cpu: AMD Ryzen Threadripper 2950X 16-Core Processor

name                             time/op
CgoCall/add-int-32               49.3ns ± 2%
CgoCall/one-pointer-32           91.8ns ± 0%
CgoCall/eight-pointers-32         343ns ± 1%
CgoCall/eight-pointers-nil-32    89.4ns ± 2%
CgoCall/eight-pointers-array-32  3.90µs ± 1% // known bug
CgoCall/eight-pointers-slice-32  2.70µs ± 0%

GODEBUG=cgocheck=0
name                             time/op
CgoCall/add-int-32               51.1ns ± 0%
CgoCall/one-pointer-32           49.8ns ± 2%
CgoCall/eight-pointers-32        66.1ns ± 0%
CgoCall/eight-pointers-nil-32    66.2ns ± 0%
CgoCall/eight-pointers-array-32  1.62µs ± 1% // known bug
CgoCall/eight-pointers-slice-32   418ns ± 2%

@DemiMarie Earlier @rsc listed three things that we would want evidence for (https://github.com/golang/go/issues/42469#issuecomment-737422241).

The performance of calls across the cgo boundary matters most when those calls themselves, rather than the other code on the caller side or the callee side, are a significant part of the performance cost. That implies that the code is making a lot of calls. How often is that the case? And I'll note that I believe we can continue to make the cgo call process faster.

@egonelbre Those numbers suggest that the overhead is due to pointer checking, but this proposal doesn't address pointer checking at all.

@ianlancetaylor, yeah, kind of. The more complicated the data structure, the more time it'll take to check. For the benchmark I tried to find the most complex struct possible, https://github.com/golang/go/blob/master/misc/cgo/test/test.go#L120; however, I would suspect that such structs are the exception. Nevertheless, there probably is a way to speed up such structures as well. I would suspect that for most codebases the overhead will be in entersyscall/exitsyscall rather than cgocheck.

Both Filippo's and the Tor Project's examples seem to predate a few cgo optimizations, so I'm not sure how applicable they are. For Filippo's example the current cgo call overhead is ~50ns, which seems negligible compared to the ~20µs cost of the function itself.

PS: while reinvestigating cgo entersyscall, I noticed that there might be ways to save at least ~4ns (https://go-review.googlesource.com/c/go/+/278793).

Tor Project needed to use callbacks from C into Go extensively, which are extremely slow.

> Tor Project needed to use callbacks from C into Go extensively, which are _extremely_ slow.

Can you point to some benchmarks? That would give folks something to aim at rather than talking across each other.

@davecheney Sadly no, but I do remember reading that they took ~1-2 milliseconds at one point.

Related: https://github.com/golang/go/issues/16051

If the issue is that these functions need to be preemptible, would another alternative be to give these "non-blocking" foreign functions some way to yield to the scheduler?

Callbacks from C to Go have gotten much faster over the last few releases (including the upcoming 1.16 release).
