Grpc-go: Severe performance regression when using ServeHTTP

Created on 4 Mar 2016  Â·  24Comments  Â·  Source: grpc/grpc-go

Extracting from discussion in #549. See https://github.com/cockroachdb/rpc-bench; relevant parts as of this writing:

name                 time/op
GRPCServe_1K-4         72.6µs ±10%
GRPCServe_64K-4         785µs ± 8%
GRPCServeHTTP_1K-4      178µs ± 2%
GRPCServeHTTP_64K-4    1.32ms ± 2%

name                 speed
GRPCServe_1K-4       28.3MB/s ± 9%
GRPCServe_64K-4       167MB/s ± 7%
GRPCServeHTTP_1K-4   11.5MB/s ± 2%
GRPCServeHTTP_64K-4  99.4MB/s ± 2%

name                 alloc/op
GRPCServe_1K-4         15.6kB ± 0%
GRPCServe_64K-4         711kB ± 0%
GRPCServeHTTP_1K-4     88.5kB ± 0%
GRPCServeHTTP_64K-4     801kB ± 0%

name                 allocs/op
GRPCServe_1K-4           96.0 ± 0%
GRPCServe_64K-4           162 ± 0%
GRPCServeHTTP_1K-4        166 ± 0%
GRPCServeHTTP_64K-4       671 ± 0%
P3 Help Wanted Performance

Most helpful comment

@tamird, the grpc-go team is undergoing a changing of the guard at the moment so this isn't much of a priority.

And I don't officially work on grpc-go. It's not surprising that ServeHTTP is slower considering it's shoehorned into the existing ServerTransport interface somewhat awkwardly.

I started working on a new design for all of grpc-go in https://github.com/bradfitz/grpc-go but it's not ready yet for wider discussion or testing. It was mostly a proof of concept. But it's something I intend to get back to. @philips et al at CoreOS are also interested.

All 24 comments

@bradfitz any progress on this?

No, I've been busy with Go 1.7 stuff for the May 1st freeze. Hopefully I can spend a bit of time on this soon, but then I'm out for a month starting in a couple weeks.

@bradfitz is there some way in which we could help? :)

is there some way in which we could help? :)

Sure. Look at the http2 code and find ways to improve it, while not hurting readability or maintainability.

Any pointers as to how to start? Is there any particular test you pprof to find bottlenecks? :)

Honestly I forget. That code was mostly written for correctness and not performance. There's probably a lot of low-hanging fruit in there but there might not even be enough benchmarks yet to start using pprof & friends. You'd probably have to write at least a few new ones.

@iamqizhao this is a serious issue - can we include it for GA?

@tamird Sorry. We are approaching GA very closely and so far this is treated as an experimental effort and we do not expect the real usage for this feature at the current stage. This is in Brad's plate and I am going to talk to him after GA to see how to make progress on this.

This is still a major inconvenience for us. I've updated the performance numbers at https://github.com/cockroachdb/rpc-bench; the relevant ones are:

name                         time/op
GRPCServe_1K-24                40.5µs ± 1%
GRPCServe_64K-24                365µs ± 1%
GRPCServe_Stream_1K-24         22.7µs ± 1%
GRPCServe_Stream_64k-24         298µs ± 1%
GRPCServeHTTP_1K-24             137µs ± 1%
GRPCServeHTTP_64K-24            585µs ± 1%
GRPCServeHTTP_Stream_1K-24     38.7µs ± 1%
GRPCServeHTTP_Stream_64k-24     455µs ± 1%

name                         speed
GRPCServe_1K-24              50.5MB/s ± 1%
GRPCServe_64K-24              359MB/s ± 1%
GRPCServe_Stream_1K-24       90.3MB/s ± 1%
GRPCServe_Stream_64k-24       439MB/s ± 1%
GRPCServeHTTP_1K-24          14.9MB/s ± 1%
GRPCServeHTTP_64K-24          224MB/s ± 1%
GRPCServeHTTP_Stream_1K-24   52.9MB/s ± 1%
GRPCServeHTTP_Stream_64k-24   288MB/s ± 1%

name                         alloc/op
GRPCServe_1K-24                16.3kB ± 0%
GRPCServe_64K-24                711kB ± 0%
GRPCServe_Stream_1K-24         11.6kB ± 0%
GRPCServe_Stream_64k-24         707kB ± 0%
GRPCServeHTTP_1K-24            34.2kB ± 0%
GRPCServeHTTP_64K-24            763kB ± 0%
GRPCServeHTTP_Stream_1K-24     12.4kB ± 0%
GRPCServeHTTP_Stream_64k-24     739kB ± 0%

name                         allocs/op
GRPCServe_1K-24                  97.0 ± 0%
GRPCServe_64K-24                  147 ± 0%
GRPCServe_Stream_1K-24           22.0 ± 0%
GRPCServe_Stream_64k-24          72.0 ± 0%
GRPCServeHTTP_1K-24               169 ± 0%
GRPCServeHTTP_64K-24              275 ± 0%
GRPCServeHTTP_Stream_1K-24       29.0 ± 0%
GRPCServeHTTP_Stream_64k-24       138 ± 0%

So ServeHTTP's throughput is 50%-70% lower than Serve's, and its latency is 60%-350% higher.

@tamird, the grpc-go team is undergoing a changing of the guard at the moment so this isn't much of a priority.

And I don't officially work on grpc-go. It's not surprising that ServeHTTP is slower considering it's shoehorned into the existing ServerTransport interface somewhat awkwardly.

I started working on a new design for all of grpc-go in https://github.com/bradfitz/grpc-go but it's not ready yet for wider discussion or testing. It was mostly a proof of concept. But it's something I intend to get back to. @philips et al at CoreOS are also interested.

Is there any update on this?

@menghanl would you be able to take a look?

@dfawley this is the bug I mentioned in person at CoreOS Fest. Would be fantastic if we could reach serving parity on the HandlerTransport to the one with a pure socket listener.

I agree, but unfortunately we won't have any time to spend on this for at least another quarter. It will be a pretty sizable effort. It's possible we may get some contributions for this before then, but I wouldn't want to set any expectations at this point.

at least another quarter

Now that one quarter has passed, any chance to make any progress on this?

@mwitkow, @tamird, @glerchundi, I have been poking at this periodically and learning more about it since June.

The goal of throwing away the grpc http2 server and using x/net/http2 instead would be a major undertaking, and would require adding significant new features and hooks and (likely) optimizations into x/net/http2 to reach parity with the existing grpc http2 server. The handler-based transport was/is an experiment (and is documented as such), and isn't something fully supported. The core grpc-go team does not currently have the cycles to support this and keep up with our existing commitments -- let alone to completely replace our default transport with it -- but we would be happy to review PRs that improve features or performance of the handler-based transport.

Re: @bradfitz's comment:

It's not surprising that ServeHTTP is slower considering it's shoehorned into the existing
ServerTransport interface somewhat awkwardly.

I agree this is an awkward fit: the handler-based transport is mimicking a full transport for each RPC. As we continue to dig in and clean up some of the code and interfaces between the grpc layer and the transport layer (e.g.s #1745, #1854), it could happen that the handler-based transport becomes a more natural fit, and performance _may_ be improved as a result if that is a factor in its performance differences.

Is this performance still an issue given the use of grpc.ServerHTTP?

I'm weighing on whether we should wrap gRPC's server with net/http server as I need to define / as healthcheck so that it will pass GCLB?

@tamird is this still an issue for you guys?

I don't work on CockroachDB anymore, but it should be trivial for you to replicate my results using rpc-bench.

I can post new results on Monday if you'd like. Has any work been done in this area? I haven't seen any mention.

This issue is labeled as requiring an update from the reporter, and no update has been received after 7 days. If no update is provided in the next 7 days, this issue will be automatically closed.

I hacked around and got the rpc-tests working and using the standard lib http2 (instead of the x/net/http2. Current results for GRPC Server vs GRPC ServeHTTP:

BenchmarkGRPCServe_1K-10                           55292             22533 ns/op          90.89 MB/s
BenchmarkGRPCServe_64K-10                           4573            226846 ns/op         577.80 MB/s
BenchmarkGRPCServe_Stream_1K-10                   101989             11717 ns/op         174.79 MB/s
BenchmarkGRPCServe_Stream_64k-10                    5167            221615 ns/op         591.44 MB/s
BenchmarkGRPCServeHTTP_1K-10                       19066             63799 ns/op          32.10 MB/s
BenchmarkGRPCServeHTTP_64K-10                       3297            365040 ns/op         359.06 MB/s
BenchmarkGRPCServeHTTP_Stream_1K-10                59731             19344 ns/op         105.87 MB/s
BenchmarkGRPCServeHTTP_Stream_64k-10                3696            307412 ns/op         426.37 MB/s
BenchmarkGobRPC_1K-10                              90954             12879 ns/op         159.02 MB/s
BenchmarkGobRPC_64K-10                             10990            111328 ns/op        1177.34 MB/s
BenchmarkProtoHTTP1_1K-10                           7953            139164 ns/op          14.72 MB/s
BenchmarkProtoHTTP1_64K-10                          1718            601093 ns/op         218.06 MB/s
BenchmarkProtoHTTP2_1K-10                           9948            126565 ns/op          16.18 MB/s
BenchmarkProtoHTTP2_64K-10                          2184            565510 ns/op         231.78 MB/s

The ProtoRPC benchmarks aren't running, but I'm not that fussed about those.

This implies the results are still in the sme ballpark as the previous runs (grpc 1.27.1, go 1.13). For me the overhead is worth being able to run on regular mux , single port, without cmux (which gets confusing fast). But clearly it can be improved.
My guess is it is quite a big job to get enough of an understanding of this to help out, I'm willing to give it a go. If anyone can point me at some files that would help. I'll start taking a look anyway.

@tcolgate I could never get the benchmarks to run. Admittedly, I didn't spend a ton of time on it. If you create a repo with working benchmarks, I wouldn't mind taking a crack at it as well or helping.

Uploaded here: https://github.com/tcolgate/rpc-bench
It's probably worth putting some benchmarks in grpc-go directly, if they aren't there already. I started poking around the code. @bradfitz attempted replacement is probably a good way to get to understand what's going on, though it's not rebasable, there's less code and it's simpler. It will at least be an aide to understanding.

Was this page helpful?
0 / 5 - 0 ratings