Grpc-go: Connection latency significantly affects throughput

Created on 10 Jan 2017  Â·  66Comments  Â·  Source: grpc/grpc-go

I am working on a system that uses GRPC to send 1MB blobs between clients and servers and have observed some poor throughput when connection latency is high (180ms round trip is typical between Australia and the USA).

The same throughput issues are not present when I take GRPC out of the equation. I have prepared a self-contained program that reproduces the problem on a local machine by simulating latency at the net.Listener level. It can run either using GRPC or just plain HTTP/2 POST requests. In each case the payload and latency are the same, but—as you can see from the data below—GRPC becomes several orders of magnitude slower as latency increases.

The program and related files: https://gist.github.com/adg/641d04ef335782648502cb32a03b2b07

The output of a typical run:

$ ./run.sh 
Duration    Latency Proto

6.977221ms  0s  GRPC
4.833989ms  0s  GRPC
4.714891ms  0s  GRPC
3.884165ms  0s  GRPC
5.254322ms  0s  GRPC

8.507352ms  0s  HTTP/2.0
936.436µs   0s  HTTP/2.0
453.471µs   0s  HTTP/2.0
252.786µs   0s  HTTP/2.0
265.955µs   0s  HTTP/2.0

107.32663ms 1ms GRPC
102.51629ms 1ms GRPC
100.235617ms    1ms GRPC
100.444982ms    1ms GRPC
100.881221ms    1ms GRPC

12.423725ms 1ms HTTP/2.0
3.02918ms   1ms HTTP/2.0
2.775928ms  1ms HTTP/2.0
4.161895ms  1ms HTTP/2.0
2.951534ms  1ms HTTP/2.0

195.731175ms    2ms GRPC
190.571784ms    2ms GRPC
188.810298ms    2ms GRPC
190.593822ms    2ms GRPC
190.015888ms    2ms GRPC

19.18046ms  2ms HTTP/2.0
4.663983ms  2ms HTTP/2.0
5.45113ms   2ms HTTP/2.0
5.56255ms   2ms HTTP/2.0
5.582744ms  2ms HTTP/2.0

378.653747ms    4ms GRPC
362.14625ms 4ms GRPC
357.95833ms 4ms GRPC
364.525646ms    4ms GRPC
364.954077ms    4ms GRPC

33.666184ms 4ms HTTP/2.0
8.68926ms   4ms HTTP/2.0
10.658349ms 4ms HTTP/2.0
10.741361ms 4ms HTTP/2.0
10.188444ms 4ms HTTP/2.0

719.696194ms    8ms GRPC
699.807568ms    8ms GRPC
703.794127ms    8ms GRPC
702.610461ms    8ms GRPC
710.592955ms    8ms GRPC

55.66933ms  8ms HTTP/2.0
18.449093ms 8ms HTTP/2.0
17.080567ms 8ms HTTP/2.0
20.597944ms 8ms HTTP/2.0
17.318133ms 8ms HTTP/2.0

1.415272339s    16ms    GRPC
1.350923577s    16ms    GRPC
1.355653965s    16ms    GRPC
1.338834603s    16ms    GRPC
1.358419144s    16ms    GRPC

102.133898ms    16ms    HTTP/2.0
39.144638ms 16ms    HTTP/2.0
40.82348ms  16ms    HTTP/2.0
35.133498ms 16ms    HTTP/2.0
39.516466ms 16ms    HTTP/2.0

2.630821843s    32ms    GRPC
2.46741086s 32ms    GRPC
2.507019279s    32ms    GRPC
2.476177935s    32ms    GRPC
2.49266693s 32ms    GRPC

179.271675ms    32ms    HTTP/2.0
72.575954ms 32ms    HTTP/2.0
67.23265ms  32ms    HTTP/2.0
70.651455ms 32ms    HTTP/2.0
67.579124ms 32ms    HTTP/2.0

I theorize that there is something wrong with GRPC's flow control mechanism, but that's just a guess.

P1 Performance

Most helpful comment

I don't believe this should be closed until the process of setting the window size is automatic and the process of making gRPC run fast is clearly documented.

All 66 comments

cc @iamqizhao (and @rsc @robpike @bradfitz who may be interested)

If you observe from 1ms, both grpc and http2 are increasing proportionally. The only issue with grpc is that it is started with much large value (i.e., 107ms vs 12ms). We need to figure out what is the expected value there (this counters my previous observation with http2.0...).

@apolcyn, assign this to you and please help. Thank you.

@iamqizhao note the relationship between the simulated latency and the request time.

For HTTP/2, with 0 latency a request takes ~1ms. With 1ms of latency (both up and down) you get requests around 3ms (after the first request, when the connection is established). With 2ms of latency, requests are ~5ms. For 4ms latency, requests take ~9ms. See the pattern? So the actual time spent making HTTP/2 requests, minus the introduced latency, is constant (around 1ms).

Contrast this to the GRPC behavior, where the introduced latency multiplies the overall request time.

We need to figure out what is the expected value there (this counters my previous observation with http2.0...).

Starting with zero latency, the GRPC requests take 4-6ms. What we should see for GRPC requests is a similar pattern, where 1ms of latency yields 8ms requests (6ms + 1ms + 1ms), 2ms latency = 10ms requests, and so on.

as I said, your code cached the string used by http2. For example,

12.423725ms 1ms HTTP/2.0
3.02918ms 1ms HTTP/2.0
2.775928ms 1ms HTTP/2.0
4.161895ms 1ms HTTP/2.0
2.951534ms 1ms HTTP/2.0

the 1st line is much larger than the rest. I believe this means the rest
takes advantage of caching. When I said they are both increased
proportionally, I meant the first line of the result from http2.

On Mon, Jan 9, 2017 at 6:51 PM, Andrew Gerrand notifications@github.com
wrote:

@iamqizhao https://github.com/iamqizhao note the relationship between
the simulated latency and the request time.

For HTTP/2, with 0 latency a request takes ~1ms. With 1ms of latency (both
up and down) you get requests around 3ms (after the first request, when the
connection is established). With 2ms of latency, requests are ~4ms. For 4ms
latency, requests take ~9ms. See the pattern? So the actual time spent
making HTTP/2 requests, minus the introduced latency, is constant (around
1ms).

Contrast this to the GRPC behavior, where the introduced latency
multiplies the overall request time.

We need to figure out what is the expected value there (this counters my
previous observation with http2.0...).

Starting with zero latency, the GRPC requests take 4-6ms. What we should
see for GRPC requests is a similar pattern, where 1ms of latency yields 8ms
requests (6ms + 1ms + 1ms), 2ms latency = 10ms requests, and so on.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/grpc/grpc-go/issues/1043#issuecomment-271472457, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AJpyLMwAwSMdNAK_VHa1KDXNyjBMIwl0ks5rQvI7gaJpZM4Le51s
.

Why would latency have any affect on proto marshaling overhead?

12.423725ms 1ms HTTP/2.0
3.02918ms   1ms HTTP/2.0
2.775928ms  1ms HTTP/2.0
4.161895ms  1ms HTTP/2.0
2.951534ms  1ms HTTP/2.0

The additional 8-10ms in the first request is certainly setting up the HTTP/TLS connection, which increases proportionally with latency as it involves several round trips.

Err, I see -- I did not look into how you set up the http transport.

yes, it seems this is due to flow control issue -- if I enlarge the flow control window above 1M, this issue goes away.

Perhaps a bit more is happening. At 0s latency grpc is still at most 1/10th the speed of https. Does grpc adjust window to rtt?

I checked in a variant of adg's code that shows the packet breakup and timing.

go get -d -u rsc.io/tmp/grpcbench
cd $GOPATH/src/rsc.io/tmp/grpcbench
bash run.sh

It only shows the output for 32ms round trip time. This is what HTTP/2 POST looks like (the actual output shows packet boundaries within a given millisecond as <-93+89+1024+1024+1024, but I've added all those up for easier reading):

0.036s
    <-184
0.079s
    ->1297
0.080s
    <-93
0.117s
    ->51
0.118s
    <-12470
0.119s
    <-53573
0.151s
    ->56
    <-38
0.184s
    ->120
181.526851ms    32ms    HTTP/2.0

0.184s
    <-46
0.217s
    ->83
    <-1024
0.218s
    <-64665
0.250s
    ->43
65.878218ms 32ms    HTTP/2.0

There is some connection setup in the first one, and then a whole bunch of uninterrupted data transfer. The second one omits the setup but still has the nice fast data transfer.

Compare with GRPC, again two connections:

0.070s
    ->9
0.104s
    ->13
0.105s
    <-65702
0.138s
    ->9
0.171s
    ->13
    <-16393
0.206s
    ->13
    <-16393
0.243s
    ->13
    <-16393
0.281s
    ->13
    <-16392
0.316s
    ->13
    <-16393
0.351s
    ->13
    <-15360
0.352s
    <-1024
0.353s
    <-9
0.384s
    ->13
    <-16393
0.417s
    ->13
    <-16392
0.450s
    ->13
    <-16384
0.451s
    <-9
0.483s
    ->13
    <-2048
0.485s
    <-14345
0.516s
    ->13
    <-16393
0.549s
    ->13
    <-16392
0.582s
    ->13
    <-16393
0.615s
    ->13
    <-16393
0.648s
    ->13
    <-16393
0.681s
    ->13
0.714s
    ->13
    <-16392
0.747s
    ->13
    <-16393
0.780s
    ->13
    <-16393
0.813s
    ->13
    <-16393
0.846s
    ->13
    <-16392
0.879s
    ->13
    <-16393
0.912s
    ->13
    <-16393
0.945s
    ->13
    <-16393
0.980s
    ->13
    <-16392
1.014s
    ->13
    <-8192
1.015s
    <-8201
1.051s
    ->13
1.052s
    <-16393
1.086s
    ->13
    <-16393
1.120s
    ->13
    <-1024
1.121s
    <-15368
1.154s
    ->13
    <-16393
1.187s
    ->13
    <-16393
1.220s
    ->13
1.221s
    <-16393
1.254s
    ->13
1.287s
    ->13
    <-16392
1.320s
    ->13
1.321s
    <-16393
1.354s
    ->13
    <-16393
1.387s
    ->13
    <-16393
1.420s
    ->13
1.421s
    <-16392
1.458s
    ->13
    <-16393
1.490s
    ->13
    <-16393
1.527s
    ->13
    <-7168
1.528s
    <-9225
1.560s
    ->13
    <-13312
1.561s
    <-3080
1.598s
    ->13
    <-16393
1.635s
    ->13
    <-16393
1.671s
    ->13
    <-16393
1.704s
    ->13
1.705s
    <-16392
1.737s
    ->13
    <-16393
1.770s
    ->13
    <-16393
1.803s
    ->13
    <-16393
1.838s
    ->13
1.870s
    ->13
1.871s
    <-16392
1.904s
    ->13
    <-16393
1.938s
    ->13
    <-13312
1.939s
    <-3081
1.971s
    ->13
    <-16393
2.004s
    ->13
    <-16392
2.039s
    ->13
    <-16393
2.072s
    ->13
    <-16393
2.104s
    ->13
    <-16393
2.139s
    ->13
2.140s
    <-16392
2.173s
    ->13
    <-16393
2.205s
    ->13
2.206s
    <-16393
2.239s
    ->13
    <-16393
2.273s
    ->13
2.274s
    <-16392
2.306s
    ->13
    <-34
2.338s
    ->13
2.371s
    ->23
2.404s
    ->13
2.437s
    ->14
2.470s
    ->13
2.504s
    ->33
2.501003152s    32ms    GRPC

2.507s
    <-1024
2.508s
    <-64563
2.538s
    ->13
2.571s
    ->13
    <-16393
2.604s
    ->13
    <-15360
2.605s
    <-1033
2.638s
    ->13
    <-16393
2.671s
    ->13
    <-16392
2.704s
    ->13
    <-7168
2.705s
    <-9225
2.738s
    ->13
    <-16393
2.771s
    ->13
    <-16393
2.804s
    ->13
    <-4096
2.805s
    <-12296
2.837s
    ->13
    <-16384
2.838s
    <-9
2.871s
    ->13
    <-16393
2.904s
    ->13
2.905s
    <-16393
2.937s
    ->13
    <-16392
2.971s
    ->13
    <-16393
3.004s
    ->13
3.005s
    <-16393
3.039s
    ->13
3.040s
    <-16393
3.072s
    ->13
3.106s
    ->13
    <-16392
3.139s
    ->13
    <-16393
3.171s
    ->13
    <-16393
3.204s
    ->13
3.205s
    <-16393
3.238s
    ->13
    <-16392
3.271s
    ->13
    <-16393
3.304s
    ->13
3.305s
    <-16393
3.338s
    ->13
    <-16393
3.371s
    ->13
    <-14336
3.372s
    <-2056
3.405s
    ->13
    <-16393
3.438s
    ->13
    <-16393
3.471s
    ->13
    <-10240
3.472s
    <-6153
3.505s
    ->13
    <-16392
3.538s
    ->13
    <-16393
3.571s
    ->13
    <-1024
3.572s
    <-15369
3.605s
    ->13
    <-16393
3.638s
    ->13
3.671s
    ->13
3.672s
    <-16392
3.705s
    ->13
    <-16393
3.738s
    ->13
    <-16393
3.771s
    ->13
    <-1024
3.772s
    <-15369
3.805s
    ->13
    <-16392
3.838s
    ->13
    <-16393
3.871s
    ->13
3.872s
    <-16393
3.905s
    ->13
    <-16393
3.938s
    ->13
    <-16392
3.971s
    ->13
    <-1024
3.972s
    <-15369
4.004s
    ->13
    <-16393
4.041s
    ->13
    <-16393
4.078s
    ->13
    <-16392
4.113s
    ->13
    <-16393
4.146s
    ->13
    <-16393
4.179s
    ->13
    <-16393
4.212s
    ->13
4.245s
    ->13
    <-16392
4.278s
    ->13
    <-16393
4.311s
    ->13
    <-16393
4.344s
    ->13
    <-11264
4.345s
    <-5129
4.377s
    ->13
    <-16392
4.410s
    ->13
    <-16393
4.443s
    ->13
    <-5120
4.445s
    <-11273
4.476s
    ->13
    <-16393
4.509s
    ->13
    <-16392
4.542s
    ->13
    <-16393
4.575s
    ->13
    <-16393
4.608s
    ->13
    <-16393
4.641s
    ->13
    <-13312
4.642s
    <-2048
4.643s
    <-1032
4.674s
    ->13
    <-34
4.707s
    ->13
4.740s
    ->11
4.772s
    ->13
4.805s
    ->14
4.838s
    ->13
4.872s
    ->11
2.367555959s    32ms    GRPC

It looks like there are two problems. The first one is that GRPC appears to have added its own flow control on top of HTTP/2's flow control (on top of TCP's flow control), and it has a fixed 16 kB window. The second is that it looks like HTTP/2 POST is compressing while GRPC is not: in addition to having better (no added) flow control, the posts transfer only about 64 kB of data in total. Either that or the test is broken.

@rsc Thanks for investigation. Nice findings. In gRPC-Go, there is no HTTP/2 flow control (FYI, Brad and I have not finished the migration to the standard HTTP2 package yet and gRPC-Go still uses its own native HTTP2 implementation). And yes gRPC does not apply compression by default.

The root case that the latency of gRPC-Go is not equal to network RTT + constant processing time is because of its flow control mechanism. Currently, it takes a simplistic form -- it has a fixed 16KB window update as Russ discovered. It does not tune according to network RTT and incoming message sizes so that it favors low latency networks and sucks in long-haul networks (as shown in this issue). The current mechanism is the initial version and was consistent across languages. Recently, gRPC-C added some more intelligence to its flow control and more is under design. I summarize the effort we will put in this area as follows:
i) window size tuning upon message size. This should fix this issue;
ii) window size tuning according to network RTT (under design);
iii) public surface to allow user to tune flow control parameters and enforce a server-scoped flow control limit (under design).

Can user client code change the fixed flow control window size?

Can user client code enable compression?

Can user client code change the fixed flow control window size?

Currently they cannot. This is the 3rd effort I mentioned in the previous reply.

Can user client code enable compression?

Yes, with grpc.WithCompressor dial option.

This change enables compression for GRPC and uses a randomly-generated payload to defeat the effect of the compression:

diff --git a/grpcbench/main.go b/grpcbench/main.go
index 101677d..c03ac0f 100644
--- a/grpcbench/main.go
+++ b/grpcbench/main.go
@@ -1,6 +1,7 @@
 package main

 import (
+       "crypto/rand"
        "crypto/tls"
        "flag"
        "fmt"
@@ -41,6 +42,7 @@ func main() {
                                grpc.WithBlock(),
                                grpc.WithTimeout(3 * time.Second),
                                grpc.WithInsecure(),
+                               grpc.WithCompressor(grpc.NewGZIPCompressor()),
                        }
                        conn, err := grpc.Dial(*addr, opts...)
                        if err != nil {
@@ -59,7 +61,12 @@ func main() {

                ctx := context.Background()

-               msg := strings.Repeat(" ", *msgSize)
+               randomBytes := make([]byte, *msgSize)
+               _, err := rand.Read(randomBytes)
+               if err != nil {
+                       log.Fatal(err)
+               }
+               msg := string(randomBytes)
                for i := 0; i < *numRuns; i++ {
                        t1 := time.Now()
                        var err error
@@ -86,7 +93,7 @@ func main() {

        var server *grpc.Server
        if *useGRPC {
-               server = grpc.NewServer()
+               server = grpc.NewServer(grpc.RPCDecompressor(grpc.NewGZIPDecompressor()))
                helloworld.RegisterGreeterServer(server, greeter{})
        }
        l, err := net.Listen("tcp", *addr)

The numbers appear to be more or less the same:

2.477774562s    32ms    GRPC
2.462765459s    32ms    GRPC

176.592079ms    32ms    HTTP/2.0
69.485653ms 32ms    HTTP/2.0

I believe you really want different bytes for each message, not just an
initial random setting.

On Tue, Jan 10, 2017 at 2:41 PM, Andrew Gerrand notifications@github.com
wrote:

This change enables compression for GRPC and uses a randomly-generated
payload to defeat the effect of the compression:

diff --git a/grpcbench/main.go b/grpcbench/main.go
index 101677d..c03ac0f 100644
--- a/grpcbench/main.go
+++ b/grpcbench/main.go
@@ -1,6 +1,7 @@
package main

import (

  • "crypto/rand"
    "crypto/tls"
    "flag"
    "fmt"
    @@ -41,6 +42,7 @@ func main() {
    grpc.WithBlock(),
    grpc.WithTimeout(3 * time.Second),
    grpc.WithInsecure(),
  • grpc.WithCompressor(grpc.NewGZIPCompressor()),
    }
    conn, err := grpc.Dial(*addr, opts...)
    if err != nil {
    @@ -59,7 +61,12 @@ func main() {
            ctx := context.Background()
  • msg := strings.Repeat(" ", *msgSize)
  • randomBytes := make([]byte, *msgSize)
  • _, err := rand.Read(randomBytes)
  • if err != nil {
  • log.Fatal(err)
  • }
  • msg := string(randomBytes)
    for i := 0; i < *numRuns; i++ {
    t1 := time.Now()
    var err error
    @@ -86,7 +93,7 @@ func main() {

    var server *grpc.Server
    if *useGRPC {
    
  • server = grpc.NewServer()
  • server = grpc.NewServer(grpc.RPCDecompressor(grpc.NewGZIPDecompressor()))
    helloworld.RegisterGreeterServer(server, greeter{})
    }
    l, err := net.Listen("tcp", *addr)

The numbers appear to be more or less the same:

2.477774562s 32ms GRPC
2.462765459s 32ms GRPC

176.592079ms 32ms HTTP/2.0
69.485653ms 32ms HTTP/2.0

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/grpc/grpc-go/issues/1043#issuecomment-271721362, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AFM2PeI2jDuAtw4tioQcWaVe5UgC2tLZks5rRAkGgaJpZM4Le51s
.

How would that change things?

If they are using a learned dictionary compessor, you would only be sending
the data once.

On Tue, Jan 10, 2017 at 2:49 PM, Andrew Gerrand notifications@github.com
wrote:

How would that change things?

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/grpc/grpc-go/issues/1043#issuecomment-271723316, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AFM2PWHWjZ7ZA-KVhq1AkaZFT42UFtpFks5rRAsJgaJpZM4Le51s
.

On investigating this further, I found an issue with the grpcbench program. It was not sending complete HTTP/2 requests, which explains the dramatic difference between GRPC and HTTP/2. GRPC still has performance issues, but it is only about 2x slower than x/net/http2 (not orders of magnitude as first reported).

I updated grpcbench to correctly exercise HTTP/2, and also threw HTTP/1.1 into the mix. See this pull request.

Interestingly, while HTTP/2 is 2x faster than GRPC, HTTP/1.1 is more than 10x faster than HTTP/2. I talked to @bradfitz about this and it is because there's a fixed initial flow control window which never changes, with TODOs to improve this. It would make sense for GRPC to adopt x/net/http2 so that when those TODOs are done, GRPC users should benefit.

Here's the data from a new run of grpcbench:

Duration    Latency Proto

9.491853ms  0s  GRPC
4.495137ms  0s  GRPC
5.391501ms  0s  GRPC
4.726007ms  0s  GRPC
5.12824ms   0s  GRPC

14.558358ms 0s  HTTP/2.0
6.382529ms  0s  HTTP/2.0
6.178314ms  0s  HTTP/2.0
5.51633ms   0s  HTTP/2.0
6.029431ms  0s  HTTP/2.0

11.587129ms 0s  HTTP/1.1
5.397509ms  0s  HTTP/1.1
4.237196ms  0s  HTTP/1.1
5.960211ms  0s  HTTP/1.1
108.997722ms    0s  HTTP/1.1

93.566806ms 1ms GRPC
99.017055ms 1ms GRPC
93.487803ms 1ms GRPC
98.34198ms  1ms GRPC
99.387167ms 1ms GRPC

59.484852ms 1ms HTTP/2.0
51.734641ms 1ms HTTP/2.0
54.935499ms 1ms HTTP/2.0
156.577757ms    1ms HTTP/2.0
51.694602ms 1ms HTTP/2.0

15.841433ms 1ms HTTP/1.1
5.704361ms  1ms HTTP/1.1
107.550483ms    1ms HTTP/1.1
5.40012ms   1ms HTTP/1.1
5.800495ms  1ms HTTP/1.1

192.232563ms    2ms GRPC
188.990258ms    2ms GRPC
180.028459ms    2ms GRPC
189.284675ms    2ms GRPC
192.21624ms 2ms GRPC

107.48685ms 2ms HTTP/2.0
92.618437ms 2ms HTTP/2.0
103.681046ms    2ms HTTP/2.0
94.787457ms 2ms HTTP/2.0
96.490496ms 2ms HTTP/2.0

25.557564ms 2ms HTTP/1.1
7.750258ms  2ms HTTP/1.1
6.587036ms  2ms HTTP/1.1
107.893995ms    2ms HTTP/1.1
6.236492ms  2ms HTTP/1.1

355.213791ms    4ms GRPC
349.819232ms    4ms GRPC
333.278482ms    4ms GRPC
338.551819ms    4ms GRPC
338.591875ms    4ms GRPC

184.903654ms    4ms HTTP/2.0
179.907175ms    4ms HTTP/2.0
180.843634ms    4ms HTTP/2.0
169.588748ms    4ms HTTP/2.0
167.390974ms    4ms HTTP/2.0

32.01951ms  4ms HTTP/1.1
12.554588ms 4ms HTTP/1.1
9.637975ms  4ms HTTP/1.1
112.462356ms    4ms HTTP/1.1
8.400449ms  4ms HTTP/1.1

669.925997ms    8ms GRPC
682.715267ms    8ms GRPC
669.499247ms    8ms GRPC
653.192967ms    8ms GRPC
686.181873ms    8ms GRPC

365.49212ms 8ms HTTP/2.0
330.625694ms    8ms HTTP/2.0
340.397932ms    8ms HTTP/2.0
329.523659ms    8ms HTTP/2.0
336.895354ms    8ms HTTP/2.0

50.715689ms 8ms HTTP/1.1
15.353749ms 8ms HTTP/1.1
17.007966ms 8ms HTTP/1.1
114.972155ms    8ms HTTP/1.1
18.499317ms 8ms HTTP/1.1

1.272841685s    16ms    GRPC
1.285137518s    16ms    GRPC
1.278880998s    16ms    GRPC
1.286303382s    16ms    GRPC
1.280546303s    16ms    GRPC

688.093256ms    16ms    HTTP/2.0
637.301763ms    16ms    HTTP/2.0
623.836108ms    16ms    HTTP/2.0
658.244228ms    16ms    HTTP/2.0
633.323672ms    16ms    HTTP/2.0

87.381827ms 16ms    HTTP/1.1
24.798529ms 16ms    HTTP/1.1
122.792087ms    16ms    HTTP/1.1
24.228707ms 16ms    HTTP/1.1
22.021362ms 16ms    HTTP/1.1

2.500501533s    32ms    GRPC
2.41936603s 32ms    GRPC
2.458976246s    32ms    GRPC
2.45729285s 32ms    GRPC
2.475112293s    32ms    GRPC

1.325008844s    32ms    HTTP/2.0
1.185842114s    32ms    HTTP/2.0
1.22962957s 32ms    HTTP/2.0
1.195622848s    32ms    HTTP/2.0
1.188314041s    32ms    HTTP/2.0

150.393105ms    32ms    HTTP/1.1
36.553453ms 32ms    HTTP/1.1
40.572305ms 32ms    HTTP/1.1
40.395134ms 32ms    HTTP/1.1
146.914462ms    32ms    HTTP/1.1

I think this method of simulated latency might be unfair.

As is, it looks like the server is basically sleeping for the full latency (e.g., 32ms) before each write to the network. AFAICS this blocking behavior might be realistic when e.g., sending the large 1M message from a fast sender to a slow receiver, or possibly sending across a faulty network, but I think it's probably unfair otherwise. It looks like the 13 byte window updates coming from the server (receiver) are taking much longer than they should, with effects of small window/update-threshold larger than they should be.

FWIW, looking into the logs posted by @rsc:

At first the client has a fresh 64K window and so it sends the first 64K of it's 1M message. It doesn't arrive at the server until 32ms later because of latency. The window update isn't sent until ~64ms later (took 32ms to send settings ack at 0.138?).

When server receives 64K at 0.138 (32ms after send), it should be ready to give 4 window updates of 16K each right away (GRPC uses 16K threshold to send update), and then the server should be again be able to again send 64K when all of these arrive (32ms later). Because each write is blocking though, the window updates are spread out with 32ms between each.

0.105s
    <-65702
0.138s
    ->9
0.171s
    ->13
    <-16393
0.206s
    ->13
    <-16393
0.243s
    ->13
    <-16393
0.281s
    ->13
    <-16392

I tried simulating latency on my localhost by using:

tc qdisc add dev lo root handle 1:0 netem delay 100msec

running the test here with 0ms latency induced from the golang test code, and got:

Duration    Latency Proto

3.41255655s 0s  GRPC
3.412687395s    0s  GRPC
3.411504574s    0s  GRPC
3.411832411s    0s  GRPC
3.411427349s    0s  GRPC

4.241488937s    0s  HTTP/2.0
3.413378604s    0s  HTTP/2.0
3.412536441s    0s  HTTP/2.0
3.413333573s    0s  HTTP/2.0
3.413284508s    0s  HTTP/2.0

1.637170729s    0s  HTTP/1.1
602.145344ms    0s  HTTP/1.1
601.863744ms    0s  HTTP/1.1
406.623978ms    0s  HTTP/1.1
406.15212ms 0s  HTTP/1.1

My main observation was that now the network writes calls can happen much more asynchronously (not blocking the app for full RTT on each write). The strict ping-pong between window updates and data transfers is gone (previously gRPC's relatively small window size and frequent window updates were seeing a really large penalty for this).

Printed out the sums of the transfers and their times again with this "tc induced 100ms latency" (200 RTT). see https://gist.github.com/apolcyn/ad0068220c74130427f0030cdc095e55

e.g., at line 220 from that log: (middle of gRPC message send)

  • note how the server is able to now send multiple window updates within a short time period, as I believe is realistic - so the client's view of window size stays higher.
1.994s


     <-27648

     ->13

     <-21530

     ->26

2.193s


2.193s


     <-2048

2.194s


2.194s


     <-14345

     ->13

     <-22528

Interesting. The 5-6x slow down still makes GRPC (and x/net/http2) unusable
for us, unfortunately.

I intend to work with @bradfitz on improving HTTP/2, because it's a
codebase I understand.

Sorry if unclear in last two comments, just to summarize:

  • Definitely issues with grpc and net/http2 vs. http1 for large message because of the http2 flow control window size. Possible solutions noted by @iamqizhao
  • net/http2 and grpc look like they're both using the same fixed window size - their performance in this situation looks about the same.
  • Differences in when receiver sends window refresh to sender should be irrelevant. Mainly I think the latency simulation should be changed to rule out things like that.

CockroachDB just stumbled upon this same issue. GRPC has severe bandwidth limitations on high latency networks. The scenario I encountered was sending ~60MB of data between two servers with a 60ms RTT time (using streaming RPC to send the data in 1MB chunks). I naively expected this to take a few seconds as nc-based tests were able to demonstrate 40MB/s. Instead, the transfer took ~60s, just as you would expect with the fixed 64KB stream window size.

iii) public surface to allow user to tune flow control parameters and enforce a server-scoped flow control limit (under design).

Is there any progress I can follow or help I can give? Perhaps a short term unsupported solution which allows configuration of initialWindowSize and initialConnWindowSize on a per-server/client basis.

Is there an issue open to follow up on the migration to the standard HTTP2 package? I couldn't find it.

@petermattis

Is there any progress I can follow or help I can give? Perhaps a short term unsupported solution which allows configuration of initialWindowSize and initialConnWindowSize on a per-server/client basis.

cc @menghanl. Maybe some temporary API options to client conns and grpc servers is an option? There's #1073 as another option

@c4milo

Is there an issue open to follow up on the migration to the standard HTTP2 package? I couldn't find it.

cc @sailat . But no I think this might be brought up in some other issues but there is no issue about that.

Maybe some temporary API options to client conns and grpc servers is an option?

Something like: https://github.com/grpc/grpc-go/compare/master...petermattis:pmattis/configurable-window-size?expand=1?

@petermattis would the temporary API be ok, yes I think https://github.com/grpc/grpc-go/compare/master...petermattis:pmattis/configurable-window-size?expand=1 makes sense.

@apolcyn I'm on vacation for the next few days. I can spiff up that branch for a PR when I get back if nobody beats me to it.

A one line diff that raises the defaultWindowSize by 2048 improves throughput by 5.5x when the actual real (not simulated) latency is 2.7msec; e.g. a 512MB total transfer chunked into 1MB messages goes from taking 32 seconds to taking just 5.8 seconds.

Needless to say, this makes a huge difference in the viability of actually using gRPC.

```~/go/src/google.golang.org/grpc$ git diff
diff --git a/transport/control.go b/transport/control.go
index 64d22f8..5ce8cf3 100644
--- a/transport/control.go
+++ b/transport/control.go
@@ -43,8 +43,8 @@ import (
)

const (

  • // The default value of flow control window size in HTTP2 spec.
  • defaultWindowSize = 65535
  • // 2048 * the default value of flow control window size in HTTP2 spec.
  • defaultWindowSize = 2048 * 65535
    // The initial window size for flow control.
    initialWindowSize = defaultWindowSize // for an RPC
    initialConnWindowSize = defaultWindowSize * 16 // for a connection
    ```

@apolcyn @petermattis Any chance of getting that DialOption API (for setting intial and conn windows sizes) merged?

@apolcyn @petermattis Any chance of getting that DialOption API (for setting intial and conn windows sizes) merged?

This slipped off my plate. I'll try to get back to it this week, though I'm more than happy for someone else to pick it up and run with it. I think the main thing to finish off is the addition of a few tests and documentation comments.

PS Changing defaultWindowSize violates the HTTP2 spec. You should really be changing initialWindowSize and initialConnWindowSize.

PS Changing defaultWindowSize violates the HTTP2 spec. You should really be changing initialWindowSize and initialConnWindowSize.

Done.

I measured the same 5.5x improved throughput with that change. Now #1153 is spec compliant.

I think the main thing to finish off is the addition of a few tests and documentation comments.

The branch Peter made https://github.com/grpc/grpc-go/compare/master...petermattis:pmattis/configurable-window-size?expand=1 is pretty straight forward. @petermattis What tests and doc comments do you think are needed?

(another CockroachDB dev here)

Just making sure everybody's aware of the seemingly ongoing work in #1126 around BDP detection.

FWIW, when we've forked grpc to increase the window size for CockroachDB, I've played with some values for the per-stream and per-conn windows sizes and with some values I got a failure in TestServerWithMisbehavingClient.
When making the two values equal, I got a timeout in a different test.

@ejona86 @zinuga can we get some traction here?
There's a PR that is attempting to fix this https://github.com/grpc/grpc-go/pull/1126 but we are also hitting this. Hand-cranking a initialWindowSize number to something magical is our workaround, but this is pretty bad for gRPC rep: https://news.ycombinator.com/item?id=14211213

Could we (at the very least) start by fixing the initial window size setting? There's a super simple two line fix PR that can be merged today, it just needs someone with merge rights to do it --

https://github.com/grpc/grpc-go/pull/1153

I've been using my fork with this change for a month in production. 5x better performance makes gRPC viable again.

We have PR under-work that shows significant performance improvement. https://github.com/grpc/grpc-go/pull/1073

Meanwhile, I'll have another PR out to provide DialOptions to be able to set window size on connection and stream.

Created a PR to make window size configurable: #1210

Merged the PR. Making severity of this bug P2.

@petermattis @andreimatei Did cockroachdb 1.0 ship with a workaround for this (#1043) problem? If so, what was your workaround and how effective is it for WAN high-latency links?

CockroachDB 1.0 shipped with a 2MB stream window and a 32MB connection window. This let us use enough bandwidth over WAN links for our needs, but I don't have any numbers.

The problems that we have now, though, is that it's too easy to saturate a connection window with slow stream that exhaust their windows, and then new "high priority" RPCs don't get any window.
We really want a dynamic connection window, but don't really know what to do. More discussion in
https://github.com/cockroachdb/cockroach/issues/14948

If your connection window is filling up because of slow or blocked readers, then you could try using a bigger ratio of connection:stream window sizes (either raise the connection limit or lower the stream limit -- or both). This would enable more streams to be in flight at any time.

Reading the description of the problem in cockroachdb/cockroach#14948, it sounds like you have a dependency between multiple readers from the same connection -- with this approach, I believe it's possible to hit this kind of deadlock under enough load no matter what strategy we use for flow control aside from an infinite connection window. I'm not sure what the motivations are for not reading data that's pending, but if you could pull it off the stream and store it in another buffer (or throw it away if you're running out of memory and re-request it later if necessary?), that would be the only way to completely avoid the problem.

@apolcyn's PR #1073 could help, but there are a couple caveats:

  • Our defaults will still have the same ratio as what you're setting (32MB:2MB vs. 1MB:64KB), so you would still need to increase the connection window size to make the deadlock less likely.
  • This only helps with large messages -- if you're sending many small messages, this will not result in an increased stream window size, and throughput will still be a problem without manually increasing the window size.

We've also been talking about dynamic window size based upon the BDP of the connection, but there are no plans yet to implement them.

EDIT: @apolcyn also had a suggestion to use multiple connections for this kind of use case. With that approach, not reading from one connection would not impact with the others.

@MakMukhi is taking over @apolcyn's change. Reassigning and increasing priority.

1248

Somewhat related to the issue above but I just noticed that gRPC sends a lot of packets. I haven't yet dug into the details to understand why, but it's visible in the trace posted above by @rsc's test. In my case I'm sending a small request and receiving a small response (~100 bytes each) but I see 18 packets go on the wire, most of which are 9-15 bytes (11 if we discount empty packets that are just acks).

[after priming the connection by sending a first RPC]
// this packet actually contains the PB of the request
01:04:32.687525 00:1c:73:3d:9a:c1 > 00:1c:73:74:35:d8, ethertype IPv4 (0x0800), length 204: (tos 0x0, ttl 61, id 5674, offset 0, flags [DF], proto TCP (6), length 190)
    10.95.4.132.52612 > 172.24.12.82.6042: Flags [P.], seq 421:559, ack 315, win 4096, options [nop,nop,TS val 557400299 ecr 4525713], length 138
01:04:32.688483 00:1c:73:3d:9a:c1 > 00:1c:73:74:35:d8, ethertype IPv4 (0x0800), length 81: (tos 0x0, ttl 61, id 22052, offset 0, flags [DF], proto TCP (6), length 67)
    10.95.4.132.52612 > 172.24.12.82.6042: Flags [P.], seq 559:574, ack 315, win 4096, options [nop,nop,TS val 557400299 ecr 4525713], length 15
01:04:32.690579 00:1c:73:74:35:d8 > 00:1c:73:00:00:99, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 64, id 59436, offset 0, flags [DF], proto TCP (6), length 52)
    172.24.12.82.6042 > 10.95.4.132.52612: Flags [.], ack 574, win 243, options [nop,nop,TS val 4526466 ecr 557400299], length 0
01:04:32.690647 00:1c:73:74:35:d8 > 00:1c:73:00:00:99, ethertype IPv4 (0x0800), length 75: (tos 0x0, ttl 64, id 59437, offset 0, flags [DF], proto TCP (6), length 61)
    172.24.12.82.6042 > 10.95.4.132.52612: Flags [P.], seq 315:324, ack 574, win 243, options [nop,nop,TS val 4526466 ecr 557400299], length 9
01:04:32.696567 00:1c:73:3d:9a:c1 > 00:1c:73:74:35:d8, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 61, id 3587, offset 0, flags [DF], proto TCP (6), length 52)
    10.95.4.132.52612 > 172.24.12.82.6042: Flags [.], ack 324, win 4096, options [nop,nop,TS val 557400311 ecr 4526466], length 0
01:04:32.697151 00:1c:73:3d:9a:c1 > 00:1c:73:74:35:d8, ethertype IPv4 (0x0800), length 81: (tos 0x0, ttl 61, id 30455, offset 0, flags [DF], proto TCP (6), length 67)
    10.95.4.132.52612 > 172.24.12.82.6042: Flags [P.], seq 574:589, ack 324, win 4096, options [nop,nop,TS val 557400311 ecr 4526466], length 15
01:04:32.697257 00:1c:73:74:35:d8 > 00:1c:73:00:00:99, ethertype IPv4 (0x0800), length 75: (tos 0x0, ttl 64, id 59438, offset 0, flags [DF], proto TCP (6), length 61)
    172.24.12.82.6042 > 10.95.4.132.52612: Flags [P.], seq 324:333, ack 589, win 243, options [nop,nop,TS val 4526467 ecr 557400311], length 9
01:04:32.702782 00:1c:73:3d:9a:c1 > 00:1c:73:74:35:d8, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 61, id 12564, offset 0, flags [DF], proto TCP (6), length 52)
    10.95.4.132.52612 > 172.24.12.82.6042: Flags [.], ack 333, win 4095, options [nop,nop,TS val 557400317 ecr 4526467], length 0
01:04:32.703466 00:1c:73:3d:9a:c1 > 00:1c:73:74:35:d8, ethertype IPv4 (0x0800), length 81: (tos 0x0, ttl 61, id 719, offset 0, flags [DF], proto TCP (6), length 67)
    10.95.4.132.52612 > 172.24.12.82.6042: Flags [P.], seq 589:604, ack 333, win 4096, options [nop,nop,TS val 557400317 ecr 4526467], length 15
01:04:32.703564 00:1c:73:74:35:d8 > 00:1c:73:00:00:99, ethertype IPv4 (0x0800), length 75: (tos 0x0, ttl 64, id 59439, offset 0, flags [DF], proto TCP (6), length 61)
    172.24.12.82.6042 > 10.95.4.132.52612: Flags [P.], seq 333:342, ack 604, win 243, options [nop,nop,TS val 4526469 ecr 557400317], length 9
01:04:32.708048 00:1c:73:74:35:d8 > 00:1c:73:00:00:99, ethertype IPv4 (0x0800), length 77: (tos 0x0, ttl 64, id 59440, offset 0, flags [DF], proto TCP (6), length 63)
    172.24.12.82.6042 > 10.95.4.132.52612: Flags [P.], seq 342:353, ack 604, win 243, options [nop,nop,TS val 4526470 ecr 557400317], length 11
// this packet actually contains the PB of the response
01:04:32.708159 00:1c:73:74:35:d8 > 00:1c:73:00:00:99, ethertype IPv4 (0x0800), length 143: (tos 0x0, ttl 64, id 59441, offset 0, flags [DF], proto TCP (6), length 129)
    172.24.12.82.6042 > 10.95.4.132.52612: Flags [P.], seq 353:430, ack 604, win 243, options [nop,nop,TS val 4526470 ecr 557400317], length 77
01:04:32.709612 00:1c:73:3d:9a:c1 > 00:1c:73:74:35:d8, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 61, id 31736, offset 0, flags [DF], proto TCP (6), length 52)
    10.95.4.132.52612 > 172.24.12.82.6042: Flags [.], ack 342, win 4095, options [nop,nop,TS val 557400323 ecr 4526469], length 0
01:04:32.709915 00:1c:73:3d:9a:c1 > 00:1c:73:74:35:d8, ethertype IPv4 (0x0800), length 81: (tos 0x0, ttl 61, id 15498, offset 0, flags [DF], proto TCP (6), length 67)
    10.95.4.132.52612 > 172.24.12.82.6042: Flags [P.], seq 604:619, ack 342, win 4096, options [nop,nop,TS val 557400323 ecr 4526469], length 15
01:04:32.710010 00:1c:73:74:35:d8 > 00:1c:73:00:00:99, ethertype IPv4 (0x0800), length 75: (tos 0x0, ttl 64, id 59442, offset 0, flags [DF], proto TCP (6), length 61)
    172.24.12.82.6042 > 10.95.4.132.52612: Flags [P.], seq 430:439, ack 619, win 243, options [nop,nop,TS val 4526470 ecr 557400323], length 9
01:04:32.714292 00:1c:73:3d:9a:c1 > 00:1c:73:74:35:d8, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 61, id 34032, offset 0, flags [DF], proto TCP (6), length 52)
    10.95.4.132.52612 > 172.24.12.82.6042: Flags [.], ack 353, win 4095, options [nop,nop,TS val 557400327 ecr 4526470], length 0
01:04:32.714810 00:1c:73:3d:9a:c1 > 00:1c:73:74:35:d8, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 61, id 58116, offset 0, flags [DF], proto TCP (6), length 52)
    10.95.4.132.52612 > 172.24.12.82.6042: Flags [.], ack 430, win 4093, options [nop,nop,TS val 557400327 ecr 4526470], length 0
01:04:32.715478 00:1c:73:3d:9a:c1 > 00:1c:73:74:35:d8, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 61, id 56796, offset 0, flags [DF], proto TCP (6), length 52)
    10.95.4.132.52612 > 172.24.12.82.6042: Flags [.], ack 439, win 4095, options [nop,nop,TS val 557400328 ecr 4526470], length 0

What are all these small 9-15 byte packets?

1248 is merged now. Let us know if this helps for your use-case.

Since this issue is about large messages on high latency networks we're closing this. However, we are continuing our work on flow control optimization.

This specific issue was tested against our benchmark:

{  
   "scenarios":[  
      {  
         "name":"go_protobuf_sync_unary_ping_pong_secure_2097152db",
         "warmup_seconds":5,
         "benchmark_seconds":30,
         "num_servers":1,
         "server_config":{  
            "async_server_threads":0,
            "security_params":{  
               "use_test_ca":true,
               "server_host_override":"foo.test.google.fr"
            },
            "payload_config":{  
               "simple_params":{  
                  "resp_size":2097152,
                  "req_size":2097152
               }
            },
            "server_type":"SYNC_SERVER"
         },
         "client_config":{  
            "client_type":"SYNC_CLIENT",
            "security_params":{  
               "use_test_ca":true,
               "server_host_override":"foo.test.google.fr"
            },
            "payload_config":{  
               "simple_params":{  
                  "resp_size":2097152,
                  "req_size":2097152
               }
            },
            "client_channels":1,
            "async_client_threads":1,
            "outstanding_rpcs_per_channel":1,
            "rpc_type":"UNARY",
            "load_params":{  
               "closed_loop":{  

               }
            },
            "histogram_params":{  
               "max_possible":60000000000.0,
               "resolution":0.01
            }
         },
         "num_clients":1
      }
   ]
}

The QPS went up from 0.1 to 0.9.

I re-ran rsc's grpcbench against the master. You can ignore the HTTP/2 results (due to the reason explained at https://github.com/grpc/grpc-go/issues/1043#issuecomment-272052075), and compare the GRPC results with the previous. I think the new results confirm the improvement.

Duration    Latency Proto

5.330994ms  0s  GRPC
4.009134ms  0s  GRPC
3.763ms         0s  GRPC
3.527573ms  0s  GRPC
3.38252ms   0s  GRPC

12.472857ms 0s  HTTP/2.0
5.688769ms  0s  HTTP/2.0
6.409501ms  0s  HTTP/2.0
108.958873ms    0s  HTTP/2.0
6.5151ms    0s  HTTP/2.0

10.245881ms 0s  HTTP/1.1
3.852141ms  0s  HTTP/1.1
3.36458ms   0s  HTTP/1.1
3.510561ms  0s  HTTP/1.1
144.650239ms    0s  HTTP/1.1

14.371479ms 1ms GRPC
12.052484ms 1ms GRPC
13.174659ms 1ms GRPC
14.036349ms 1ms GRPC
11.440052ms 1ms GRPC

64.310362ms 1ms HTTP/2.0
55.559025ms 1ms HTTP/2.0
53.085947ms 1ms HTTP/2.0
55.087295ms 1ms HTTP/2.0
55.589108ms 1ms HTTP/2.0

17.123458ms 1ms HTTP/1.1
5.117912ms  1ms HTTP/1.1
4.775117ms  1ms HTTP/1.1
5.077654ms  1ms HTTP/1.1
4.713239ms  1ms HTTP/1.1

24.798243ms 2ms GRPC
21.212411ms 2ms GRPC
24.497939ms 2ms GRPC
23.300286ms 2ms GRPC
23.151755ms 2ms GRPC

109.657809ms    2ms HTTP/2.0
95.343032ms 2ms HTTP/2.0
90.149519ms 2ms HTTP/2.0
97.416299ms 2ms HTTP/2.0
97.81356ms  2ms HTTP/2.0

22.973429ms 2ms HTTP/1.1
5.949754ms  2ms HTTP/1.1
6.118488ms  2ms HTTP/1.1
6.083907ms  2ms HTTP/1.1
6.043363ms  2ms HTTP/1.1

43.104757ms 4ms GRPC
38.983007ms 4ms GRPC
43.200068ms 4ms GRPC
40.585982ms 4ms GRPC
42.068339ms 4ms GRPC

195.434971ms    4ms HTTP/2.0
177.185796ms    4ms HTTP/2.0
174.232438ms    4ms HTTP/2.0
174.821399ms    4ms HTTP/2.0
178.38124ms 4ms HTTP/2.0

33.23372ms  4ms HTTP/1.1
8.903388ms  4ms HTTP/1.1
111.990624ms    4ms HTTP/1.1
8.636087ms  4ms HTTP/1.1
9.185476ms  4ms HTTP/1.1

77.66905ms  8ms GRPC
80.779861ms 8ms GRPC
79.943702ms 8ms GRPC
77.576871ms 8ms GRPC
75.829066ms 8ms GRPC

358.309164ms    8ms HTTP/2.0
330.676264ms    8ms HTTP/2.0
335.010852ms    8ms HTTP/2.0
326.09494ms 8ms HTTP/2.0
328.849144ms    8ms HTTP/2.0

52.560471ms 8ms HTTP/1.1
13.88925ms  8ms HTTP/1.1
114.867623ms    8ms HTTP/1.1
13.123327ms 8ms HTTP/1.1
14.003239ms 8ms HTTP/1.1

158.107222ms    16ms    GRPC
154.017713ms    16ms    GRPC
155.516585ms    16ms    GRPC
153.444526ms    16ms    GRPC
150.978996ms    16ms    GRPC

702.038344ms    16ms    HTTP/2.0
642.113313ms    16ms    HTTP/2.0
635.178695ms    16ms    HTTP/2.0
640.926487ms    16ms    HTTP/2.0
636.483788ms    16ms    HTTP/2.0

86.444409ms 16ms    HTTP/1.1
22.05304ms  16ms    HTTP/1.1
19.950935ms 16ms    HTTP/1.1
125.511693ms    16ms    HTTP/1.1
23.579457ms 16ms    HTTP/1.1

345.991851ms    32ms    GRPC
286.761732ms    32ms    GRPC
290.791148ms    32ms    GRPC
282.957969ms    32ms    GRPC
280.404998ms    32ms    GRPC

1.30835792s 32ms    HTTP/2.0
1.183078984s    32ms    HTTP/2.0
1.196037184s    32ms    HTTP/2.0
1.187915833s    32ms    HTTP/2.0
1.189489888s    32ms    HTTP/2.0

150.990974ms    32ms    HTTP/1.1
42.193952ms 32ms    HTTP/1.1
38.220967ms 32ms    HTTP/1.1
36.277676ms 32ms    HTTP/1.1
142.413035ms    32ms    HTTP/1.1

Improvement, yes, but I wouldn't call it fixed. Even at a modest 8ms latency, GRPC (>=75ms) is almost 4X slower than HTTP/1.1 (>=20ms). At 32ms GRPC (>= 280ms) is 7X slower than HTTP/1.1 (>= 36ms). Those used to be 40X and 30X, so yes improvement, but still more than users are going to want to give up.

The numbers show that we are not there yet and we plan on iterating over it. Reopening the issue to track it here.

This benchmark is flawed. Every call to the throttler's Write method incurs the network latency before it returns (see lines 128, 89 & 102 in the linked file), whereas a normal net.Conn's Write function will return as soon as the data is committed to send. When the grpc server sends the response to the client, it writes control frames (to absorb the request payload), headers, data frames, and trailers in separate calls to Write. Since each call has a fixed 32ms delay, that explains why our performance appears so poor in these benchmarks.

@MakMukhi is going to see whether he can improve the throttler to only perform the sleep in the middle of transmitting the data between the writer and reader (instead of in-line in the Write call), and if so, we will provide updated numbers with this benchmark.

In the process of brainstorming performance ideas, we have come up with another improvement for unary RPCs -- we will send a very large window update to the server for the stream immediately after the request is sent. This will enable the server to send the entire response once it is ready, without waiting for a window update after the first 64k is received by the client. This should eliminate one round-trip for large response messages.

Tried several failed attempts to fix the benchmarks but none has worked yet. At this point we're inclined towards running the client and server code on VMs in different continents and compare the numbers. I'll post those numbers by tomorrow.

Results from running the same benchmark on VMs in different continents (server in Asia and client in North America).
Latency(RTT): 152ms

GRPC:
Duration: 1.154495933s for GRPC
Duration: 462.041517ms for GRPC
Duration: 460.35183ms for GRPC
Duration: 460.48016ms for GRPC
Duration: 460.286916ms for GRPC

HTTP 1:
Duration: 1.572714757s for HTTP 1
Duration: 305.216448ms for HTTP 1
Duration: 156.844879ms for HTTP 1
Duration: 157.044419ms for HTTP 1
Duration: 157.040266ms for HTTP 1

HTTP2:
Duration: 3.251843565s for HTTP 2
Duration: 2.596755423s for HTTP 2
Duration: 2.595688722s for HTTP 2
Duration: 2.596232422s for HTTP 2
Duration: 2.594843281s for HTTP 2

I took @rsc's benchmark and copied it into a branch here. I modified the throttler to make the reader apply a delay corresponding to when Write was called, plus a latency, and I wrapped both the client and the server net.Conns. Here are my numbers with a one-way latency of 76, which matches up almost exactly with the real-world samples gathered by @MakMukhi:

459.528466ms    76ms    GRPC
459.264934ms    76ms    GRPC
460.386324ms    76ms    GRPC
458.850075ms    76ms    GRPC
460.898012ms    76ms    GRPC

2.904975973s    76ms    HTTP/2.0
2.595898404s    76ms    HTTP/2.0
2.595900343s    76ms    HTTP/2.0
2.594949489s    76ms    HTTP/2.0
2.594792549s    76ms    HTTP/2.0

471.882055ms    76ms    HTTP/1.1
156.190616ms    76ms    HTTP/1.1
157.006169ms    76ms    HTTP/1.1
156.707524ms    76ms    HTTP/1.1
156.684356ms    76ms    HTTP/1.1

So we're approximately 3x the time of HTTP1, which is about twice as good as the previous measurement showed, but still more than what I would expect, which would be 2x of HTTP1, because of flow control.

Some explanation of the numbers: the reason it's 3x instead of 2x is because of the size of the request (1MB) very closely matching our connection window. Only once 25% of the connection window worth of data is received by the server does it send a window update back to the client. The timing looks like this:

 0ms - Client initiates connection.  Server conn window: 1MB, stream window: 64KB (defaults).
     - Client sends headers and the first 64k-<delta> of request data, where
       <delta> represents a tiny header we put on the data.
     - Server conn window: 960KB, stream window: 0.

32ms - Server receives first 64KB-<delta> of data.
     - Server sends 1MB+<delta>-64KB window update for the stream to let the
       client know to send the entire message.

64ms - Client receives window update.  Server conn window: 960KB, stream window: 960KB+<delta>.
     - Client sends remaining request data -- except for <delta>, before running
       out of _connection_ window quota.
     - Server conn window: 0MB, stream window: <delta>

96ms - Server receives data.
     - Server sends 1MB connection window updates (split in 4 updates)

128ms - Client receives window updates.  Server conn window: 1MB, stream window: <delta>.
      - Client sends remaining <delta> of payload.

160ms - Server receives remainder of request payload.  Sends response.
192ms - Client receives server response; done.

If we set the initial connection window size to 2MB manually (via grpc.InitialConnWindowSize(2<<20)), that removes one hop, and we see ~132ms. If we also set the initial stream window size to 2MB (via grpc.InitialWindowSize(2<<20)), then we can perform the entire RPC in a single hop.

Following is our near-future plan to increase performance further:

  1. Further optimize Unary RPCs to expect a large response from the server: This is a fairly straight-forward change however it won't affect this benchmark.
  2. Batching: There's a bug in the transport that prevents batching from taking place. The fix for this is quite simple too. Again this won't have a major impact on this benchmark.
  3. BDP estimation and dynamic flow control: This is the next thing we're taking up now and hope to bring these benchmark numbers further down.

I don't believe this should be closed until the process of setting the window size is automatic and the process of making gRPC run fast is clearly documented.

I don't believe this should be closed until gRPC is as fast as http1 on the 76msec benchmark; out of the box, with no tuning required.

I agree. Everyone in Australia should.

On Mon, Jun 5, 2017 at 2:43 PM, Jason E. Aten, Ph.D. <
[email protected]> wrote:

I don't believe this should be closed until gRPC is as fast as http1 on
the 76msec benchmark; out of the box, with no tuning required.

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/grpc/grpc-go/issues/1043#issuecomment-306318981, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AFM2PTm0xvaAZkmDrd-Hus9s-G4pGcT-ks5sBHaNgaJpZM4Le51s
.

  1. We are not claiming that we'll stop working on gRPC performance after this; we're actively looking into strategies to dynamically update flow control.
  2. We acknowledge appropriate documentation should be written.
  3. However, we'd rather open a separate issue for both of them. It is important to close this issue since it contains misinformation (broken benchmarks that show gRPC's performance much worse than it is.)

@MakMukhi I think there is benefit in creating a tracking bug specifically to address throughput issues on high-latency links.Issue #1280 does not mention latency at all (I know this is what "BDP estimation and flow control" refers to). In my experience, issues focused on actual user problems yield better solutions.

@adg That's a good point. I've updated #1280 to mention that. Feel free to update it further.

It is important to close this issue since it contains misinformation (broken benchmarks that show gRPC's performance much worse than it is.)

The above is not a legitimate reason to close this issue--in fact just the opposite. The most recent measurements in this thread show a 3x throughput loss that has not been addressed, and thus the title is still accurate, and thus this remains, still, an open issue. The #1280 issue is not nearly as specific, and looses a ton of useful context for those trying to understand the defects of gRPC in its current form.

I just re-tested against tip (8de2dff78c3b968a51c99ec526d934f686537437), on an actual high latency 109msec (one-way) link, between New Jersey and Surry Hills, New South Wales, Australia. Changing the initial window sizes, as in the pull request #1153, doubles the obtained throughput. So #1153 was also pre-maturely closed without being implemented, and it gives a 2x throughput boost.

@robpike, @adg, @glycerine

I'm concerned that if we optimize performance for this one workload, we may do so at the expense of other, more-realistic scenarios. Flow control exists as a way to trade-off bandwidth for application memory usage and tail latency. It would be trivial to change the defaults of grpc-go to maximize bandwidth utilization at the cost of the other two, but that is not our goal. While 64MB seems like a reasonable setting for the window size when you think about it in terms of a single connection, if you consider that servers may have thousands of open connections at once, that quickly turns into a production problem waiting to happen.

At the end of the day, grpc-go needs to be a reasonable out-of-the-box solution for most users, while providing configuration knobs for advanced users to optimize for either speed or memory footprint based upon their individual needs. Also note that any changes to the default behavior has to be made very cautiously, as they will impact all our users.

To that end, our plans for the next month or so are as follows:

  • Add a collection of benchmarks to measure our performance for different types of workloads under various network conditions (high and low bandwidth and latency). Performance metrics will include throughput, QPS, and tail latency (as well as localized metrics like allocations, syscalls, and CPU usage).

  • Provide a setting to limit grpc's memory usage per connection. The default setting for this is TBD.

  • Provide a setting to limit grpc's memory usage per stream. The default setting for this is likely to be the measured BDP of the connection to optimize for maximum performance per stream (and if not, we will provide a way to configure it that way).

  • Implement other optimizations where we find opportunities, like batching pending window updates and pre-emptively sending a large window update along with a unary request.

  • Document how grpc-go manages flow control, how all the various settings impact that, and how to optimize settings for different use cases.

  • Further in the future, provide a setting to limit grpc's memory usage per-Server and per-Client.

This bug will remain closed due to the misinformation / FUD caused by the benchmark it was predicated upon. Significant improvements have already been made for these types of scenarios and settings were added to allow users to easily trade-off memory usage to achieve better performance. I'll file a few extra issues to track things more granularly than #1280 and link them here later today. Thanks for your understanding and patience.

Thanks, @dfawley, that sounds like a great plan.

Should I assume that since this bug is closed, there are no remaining issues with latency significantly affecting throughput?

I only ask because from @dfawley s comments it sounds like there is more to do.

@dfawley did you get around to creating the "few extra issues" that you mentioned? If so can you link them please?

Sorry, I forgot to circle back and link the issues here. You should refer to our performance label for an up-to-date list of our perf issues.

Since this bug was popular, we've added auto-window-size updating based upon BDP estimation, which we believe should completely resolve this issue. Most of our performance work remaining at this point is optimizations to improve CPU utilization by reducing syscalls and avoiding allocations (thereby reducing GC), and reducing lock contention to support higher QPS on each connection.

Was this page helpful?
0 / 5 - 0 ratings