Grpc-go: Unexpected transport closing: too many pings from client

Created on 25 Feb 2018  Â·  33Comments  Â·  Source: grpc/grpc-go

What version of gRPC are you using?

583a6303969ea5075e9bd1dc4b75805dfe66989a

What version of Go are you using (go version)?

1.10

What operating system (Linux, Windows, …) and version?

Linux AMD64, Kernel 4.10

What did you do?

When I have the server configured with GZIP compression as so:

gzcp := grpc.NewGZIPCompressor()
grpcServer := grpc.NewServer(grpc.RPCCompressor(gzcp))

Then when serving thousands of concurrent requests a second, clients will occasionally be disconnected with

rpc error: code = Unavailable desc = transport is closing

I see no errors from the server, and the both the client and server are far from overloaded (<10% CPU usage etc). Not all clients are affected at once, it will just be one connection which gets this error.

While trying to debug this, I disabled GZIP compression so I could use more easily look at packet captures. I am unable to reproduce this error once the GZIP compressor is no longer in use.

This issue is mostly to ask what the best way to proceed with diagnosing the problem is, or if there are any reasons why having a compressor would change the behavior of the system (aside from CPU usage which I don't think is a problem).

P1 Bug

All 33 comments

Sorry, I can reproduce this even without the GZIP compressor. I still don't know how to debug it though. The only error I get on the client with loglevel set to info is:

pickfirstBalancer: HandleSubConnStateChange: 0xc421bc6160, TRANSIENT_FAILURE

Turn on transport level logs by setting these two env variables:
export GRPC_GO_LOG_VERBOSITY_LEVEL=2
export GRPC_GO_LOG_SEVERITY_LEVEL=info

May be the transport on the server of client side sees some error and
closes.

On Sat, Feb 24, 2018 at 4:26 PM, Michael Andersen notifications@github.com
wrote:

Sorry, I can reproduce this even without the GZIP compressor. I still
don't know how to debug it though. The only error I get on the client with
loglevel set to info is:

pickfirstBalancer: HandleSubConnStateChange: 0xc421bc6160, TRANSIENT_FAILURE

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/grpc/grpc-go/issues/1882#issuecomment-368271626, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ATtnR6yLJFRNEv-Nn5h7ILZfqwbUZoMFks5tYKisgaJpZM4SSF8Z
.

The client respects those variables, but the server does not, perhaps the etcd client in the server (which uses grpc for its own stuff) is overriding it? Is there a way to set the log level in the code?

EDIT: I see, etcd calls SetLoggerV2, pity that is global, but I can work around that.

I finally got a new piece of information. The server prints this shortly before the client disconnects:

ERROR: 2018/02/24 19:16:02 transport: Got too many pings from the client, closing the connection.
ERROR: 2018/02/24 19:16:02 transport: Error while handling item. Err: connection error: desc = "transport is closing"

What does "too many pings" mean?

The client prints this with the new verbosity:

INFO: 2018/02/24 19:16:02 Client received GoAway with http2.ErrCodeEnhanceYourCalm.
INFO: 2018/02/24 19:16:02 pickfirstBalancer: HandleSubConnStateChange: 0xc420b7c4b0, TRANSIENT_FAILURE

Incidentally, neither the server nor the client passes any keep alive parameters, everything is default

The client is pinging the server too many times. Perhaps the keepalive
settings on the client and the server are not the same.

On Sat, Feb 24, 2018, 7:18 PM Michael Andersen notifications@github.com
wrote:

I finally got a new piece of information. The server prints this shortly
before the client disconnects:

ERROR: 2018/02/24 19:16:02 transport: Got too many pings from the client, closing the connection.
ERROR: 2018/02/24 19:16:02 transport: Error while handling item. Err: connection error: desc = "transport is closing"

What does "too many pings" mean?

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/grpc/grpc-go/issues/1882#issuecomment-368279607, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ATtnR11QehqgSkWt1PprRu28OaEXAN7Lks5tYND3gaJpZM4SSF8Z
.

But both sides are built with the same version of grpc, neither specify any keepalive parameters and it only happens under heavy load.

If it's not keepalive then the only thing inside gRPC that sends pings is bdp estimator which will be triggered on receiving data. However, every time the server sends data frames it resets its pingStrike counter. To quickly check if it's the bdp estimator or not, try turning it off by setting WithInitialWindowSize and WIthInitialConnectionWindowSize dial options on the client.

Thanks for the suggestion. I have set both of those options (to 1MB) and will leave it running overnight. It typically takes somewhere between 2 minutes and 20 minutes under load to trigger, so a few hours should be conclusive.

It's been running without any problem for about 9 hours, so I think that confirms that the BDP estimator is at fault. I know nothing about how pings or BDP estimation works inside grpc, but surely if the pings are sent when data goes from client to the server and the pingStrike counter is only reset when data is sent the other way then if the RPC operations spike in latency (e.g mine go to hundreds of milliseconds) then you can exceed the counter simply because the server isn't yet responding to any RPC calls so isn't doing any resets?

A bdp estimator ping is sent by the client when it receives data from the server. And the server must have reset its counter for that dataFrame. In an old version the server was buggy and was resetting its counter only once for a message but you don't seem to be using that.
Thanks for narrowing it down. I'll take a look at it Monday. I don't suppose it would be easy for you to create a repro of it but if you find a way can you please upload it.

I made a simple reproducer where a program connects to itself and invokes an operation that has roughly the latency I observed with roughly the concurrency I had, but it doesn't seem to trigger it, sorry. Let me know if there is anything else I can do to help

Thanks for trying to reproduce it. We might need to collaborate to debug it
further. I'll look at the code again tomorrow to see if I can find any
obvious mistakes.

On Sun, Feb 25, 2018, 4:17 PM Michael Andersen notifications@github.com
wrote:

I made a simple reproducer where a program connects to itself and invokes
an operation that has roughly the latency I observed with roughly the
concurrency I had, but it doesn't seem to trigger it, sorry. Let me know if
there is anything else I can do to help

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/grpc/grpc-go/issues/1882#issuecomment-368359055, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ATtnR2G2PhzC3C28Mkgu9vdg9P-T_2Zgks5tYfgigaJpZM4SSF8Z
.

I didn't succeed in making a standalone reproducer but I can reliably reproduce it in-situ even on completely different hardware. More than willing to help track this down, just let me know how.

Hey, that's great that you can reliably reproduce it in your set up.
Following is what we know:

Here's where the client sends a bdp ping: https://github.com/grpc/grpc-go/blob/f0a1202acdc5c4702be05098d5ff8e9b3b444442/transport/http2_client.go#L840

As you see this is inside of the data handler which is called every time the client gets a dataFrame.
Moreover,client sends ping only if it already got an ack back from a previous ping.

On the server-side whenever a dataFrame is sent the server resets the ping counter:
https://github.com/grpc/grpc-go/blob/f0a1202acdc5c4702be05098d5ff8e9b3b444442/transport/http2_server.go#L901

  • Client sees data and sends ping.
  • Client sees more data doesn't send ping.
  • Client sees more data doesn't send ping.
  • Server sees ping, but the counter was reset.
  • Server sends back ping-ack and now sets the ping strike counter again.
  • Client sees ping-ack and records bdp.
  • The whole process is repeated.

I would start with making sure the faulty pings are always bdp pings. (Assuming you can update your local code):

  1. Make sure there's nothing else that's pinging the server: In the server's handlePing, right after this line

I'd add the following:

if f.Data != bdpPing.data {
  panic("Some other pinger?")
}
  1. Perhaps, print out time whenever server sends data and receives ping.

I'm sorry I have not been able to find out anything obvious yet. But my best guess so far is that something else is pinging the server.
Let's see what you find and we'll go from there. I'd like to stare at these print statements and logs once you have them.

Also, thanks for helping out. :)

@immesys, Random thought along the lines of @MakMukhi's reasoning: you aren't connecting through any kind of proxy, are you?

@dfawley no, there is no http2 proxy involved. I was initially running in kubernetes, but I have reproduced it on bare metal with no fancy networking at all.

EDIT: actually I lie there is probably some overlay networking still going on, but that would be at layer 3, nothing should be injecting HTTP2 pings

Hmm, sounds like there _should_ be no pings in the system except BDP pings, then. Hopefully you can try the suggestions from @MakMukhi and get back to us. Thanks!

Ok I reproduced it with the debug statements in place.
Here are the changes I made https://gist.github.com/immesys/be704af11354ba943f35b801abc9caf8
Here are the 2000 lines preceding the event. I included two files, one with all the log output and one filtered to only include log output from the specific http2Server that tripped (there are 10 concurrent connections): https://gist.github.com/immesys/aff2b454b9e0eab843e062e76bfb65c3

Let me know if I can re-run this with different print statements

@dfawley yeah in the entire log the OTHER PING DATA print statement never triggers, so I think this is just BDP pings.

Is there a reason the ping strikes are only reset when sending headers in itemHandler? Why not when sending dataFrame?

To save you reading the whole logs, here is the important sequence:

1519942464739172903 BDPDEBUG[http2Server=0xc423778f00] writing data1
1519942464739261537 BDPDEBUG[http2Server=0xc423778f00] handlePing
1519942464739266862 BDPDEBUG[http2Server=0xc423778f00] addPingStrikes2
1519942464739318578 BDPDEBUG[http2Server=0xc423778f00] writing data1
1519942464739452781 BDPDEBUG[http2Server=0xc423778f00] handlePing
1519942464739459106 BDPDEBUG[http2Server=0xc423778f00] addPingStrikes2
1519942464739535670 BDPDEBUG[http2Server=0xc423778f00] writing data1
1519942464739659435 BDPDEBUG[http2Server=0xc423778f00] handlePing
1519942464739674955 BDPDEBUG[http2Server=0xc423778f00] addPingStrikes2
1519942464739678469 BDPDEBUG[http2Server=0xc423778f00] >maxPingStrikes

Basically because we are writing three data frames and no header frames, we don't reset the ping strikes.

They are reset while scheduling dataFrame here:
https://github.com/grpc/grpc-go/blob/master/transport/http2_server.go#L901
But I think you're right if we moved this to just before writing dataFrame
on the wire it would work out. Can you try doing that?

On Thu, Mar 1, 2018 at 2:41 PM, Michael Andersen notifications@github.com
wrote:

To save you reading the whole logs, here is the important sequence:

1519942464739172903 BDPDEBUG[http2Server=0xc423778f00] writing data1
1519942464739261537 BDPDEBUG[http2Server=0xc423778f00] handlePing
1519942464739266862 BDPDEBUG[http2Server=0xc423778f00] addPingStrikes2
1519942464739318578 BDPDEBUG[http2Server=0xc423778f00] writing data1
1519942464739452781 BDPDEBUG[http2Server=0xc423778f00] handlePing
1519942464739459106 BDPDEBUG[http2Server=0xc423778f00] addPingStrikes2
1519942464739535670 BDPDEBUG[http2Server=0xc423778f00] writing data1
1519942464739659435 BDPDEBUG[http2Server=0xc423778f00] handlePing
1519942464739674955 BDPDEBUG[http2Server=0xc423778f00] addPingStrikes2
1519942464739678469 BDPDEBUG[http2Server=0xc423778f00] >maxPingStrikes

Basically because we are writing three data frames and no header frames,
we don't reset the ping strikes.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/grpc/grpc-go/issues/1882#issuecomment-369756425, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ATtnR9_TJ7joHxuljxqyp2kjDJH0pojtks5taHkAgaJpZM4SSF8Z
.

Should it be before writing it to the wire or just after (the header is just after) ?

I'd put it right before, but it shouldn't matter as much.

On Thu, Mar 1, 2018 at 2:52 PM, Michael Andersen notifications@github.com
wrote:

Should it be before writing it to the wire or just after (the header is
just after) ?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/grpc/grpc-go/issues/1882#issuecomment-369759047, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ATtnRw5oyiJb42TnfzK7L8UjQXon4iEtks5taHuSgaJpZM4SSF8Z
.

It's been running for a while with no problems. I'll run it overnight and see. Does this completely remove the race condition or just make it harder to hit?

Still didn't hit this case now after four hours. I stopped the test because I needed the machine for a different test. Is it always the case that every data frame gets an interleaved reply 1:1 ? You can't queue a few frames in the TCP window, clearing the reset strikes after each and then receive the three pings in response all together? I don't know the code, just wondering

I suspect the problem was a ping ack being scheduled between the two steps
where write method resets the counter and schedules data. This scheduling
of ping-ack sets the counter back.
If this indeed was the case then moving the reset instruction in
itemHandler will eliminate the race.

Multiple data frames get queued up in gRPC itself and I'm sure in the OS
kernel too.

On Thu, Mar 1, 2018 at 6:24 PM, Michael Andersen notifications@github.com
wrote:

Still didn't hit this case now after four hours. I stopped the test
because I needed the machine for a different test. Is it always the case
that every data frame gets an interleaved reply 1:1 ? You can't queue a few
frames in the TCP window, clearing the reset strikes after each and then
receive the three pings in response all together? I don't know the code,
just wondering

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/grpc/grpc-go/issues/1882#issuecomment-369799471, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ATtnR81g0qy4vfeZaUvmQ3YinIBywbxGks5taK1mgaJpZM4SSF8Z
.

Is there any way to figure out why this was happening to @immesys so reproducibly, but none of our tests encountered it?

Well it only happens after a few minutes under very high load. I think it has to do with the order that goroutines get scheduled. We could probably make a reproducer by putting in some sleeps in the code in the right places

Similar/same issue here. We've got a one client, one 'relay' server and one server. If we create many concurrent requests (thousands as its mentioned above), we got same error on client:

rpc error: code = Unavailable desc = transport is closing

Thanks to @MakMukhi

... try turning it off by setting WithInitialWindowSize and WIthInitialConnectionWindowSize dial options on the client.

That solve our problem.

@PraserX We'd ideally like BDP estimator to be turned on. Is it possible for you to turn it back on and patch #1905 to see if that fixes your problem as well.

@MakMukhi It seems that it is working perfectly. Thank you for response.

Was this page helpful?
0 / 5 - 0 ratings