Grpc-go: backoff/timeout does not take into account actual GRPC connection established, causes denial of service.

Created on 28 Oct 2016 · 23Comments · Source: grpc/grpc-go

When I connect a GRPC client to an IP and port number which has no listeners the client properly tries to reconnect and respects the dial timeout and fallback.

However if I run an socat -v TCP-LISTEN:8002,fork SYSTEM:'echo hi' and then attempt to connect my grpc-go client to it... it'll keep reconnecting at a fast rate in what seems an infinite loop, effectively DoS'ing the host its running on since TCP connections stay around with TIME_WAIT for tcp_fin_timeout duration.

WARN[0008] transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
WARN[0008] transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
WARN[0008] transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
[infinite]

This can cause issues when using a TCP load balancer like ELB which always has a TCP listen port open regardless of whether there are any backend instances available as the connection is immediately closed after an accept.

Even in the case where someone isn't running a load balancer, if for any reason the server it tries to connect misbehaves it will cause the above.

Note: I'm using grpc.WithTimeout and grpc.WithBlock.

P1 Duplicate Bug

Source

cdesousa

👍7

Most helpful comment

+1 this is still an issue. I have a very basic use case of gRPC and the log output is flooded with this message every 4 minutes while it is sitting idle. This looks confusing for someone who just started using gRPC and therefore warrants a Google search which eventually lands people here. I am still wondering if I’m doing something wrong.

2017/02/22 09:37:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:41:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:45:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:49:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:53:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:57:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.

ahmetb on 22 Feb 2017

👍10

All 23 comments

This issue is the same as in #120, but @cdesousa's here description is much better.

jellevandenhooff on 31 Oct 2016

@jellevandenhooff in #120 the main topic is how annoying the message are, however here I instead see that:

grpc-go has "smart" functionality to handle reconnection with goroutines
such functionality does not cover the case of a (unarguably misbehaving) server accepting and immediately closing connections

Since grpc-go wants to automate reconnection, I think it should cover also the case of this type of misbehaving servers to prevent self-DoS scenarios. It should be possible to cover not only the dialing stage but the reconnection stage (think about the IRC "reconnecting too fast" error message).

gm42 on 31 Oct 2016

I have dug a bit into this, it turns out that the tight-looping (DoS-ing) behaviour is to be found in the transportMonitor() goroutine, which does not take in account an EOF-serving service.

Steps to reproduce the problem:

start a gRPC example server on port 8081
connect the client
stop the server (e.g. Ctrl+C)
start a dummy EOF-server in place of it: socat -v TCP-LISTEN:8081,fork SYSTEM:'echo hi'

The client will generate an avalanche of reconnections, quickly exhausting all file descriptors.

gm42 on 31 Oct 2016

👍6

@iamqizhao, what do you think of adding logic to the transport monitor to back-off before attempting to reconnect again if the first read in the reader goroutine fails?

jellevandenhooff on 13 Dec 2016

2017/02/22 09:37:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:41:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:45:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:49:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:53:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:57:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.

ahmetb on 22 Feb 2017

👍10

We will be adding verbosity levels to logging soon. Error like that won't
spam logs after that.
No plans yet to support cases where the server or client misbehaves.

In this particular case, seems like the underlying transport breaks due to
something (may be the environment configuration doesn't allow for
long-living ideal connections or something). However, the aforementioned
monitor routine reconnects.

On Wed, Feb 22, 2017 at 10:04 AM, Ahmet Alp Balkan <[email protected]

wrote:

+1 this is still an issue. I have a very simple use case of gRPC and the
log output is flooded with this message every 4 minutes while it is sitting
idle. This looks confusing for someone who just started using gRPC and
therefore warrants a Google search which eventually lands people here. I am
still wondering if I’m doing something wrong.

2017/02/22 09:37:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:41:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:45:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:49:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:53:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:57:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/grpc/grpc-go/issues/954#issuecomment-281751094, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ATtnR6xwfuoJ8xHMLzi5a5vENRZ3BRC7ks5rfHjCgaJpZM4Kjh0O
.

MakMukhi on 23 Feb 2017

seems like the underlying transport breaks due to something

I think we need to dig this one deeper. I understand the logger work is going on but this one might be pointing to an actual problem.

This is basically just me running multiple grpc services on my laptop and it starts happening after a while such as 2 minutes after services don't get any requests:

DEBU[2017-02-23T12:06:08-08:00] request completed                             elapsed=1.512785ms method=POST path="/coffee" service=web

2017/02/23 12:08:24 transport: http2Client.notifyError got notified that the client transport was broken EOF.

ahmetb on 23 Feb 2017

@MakMukhi Do you have an ETA and/or gRPC version number for when the log spam will cease?

jba on 17 Mar 2017

👍2

I'm getting these logs when using the gcloud vision client (cloud.google.com/go/vision)

transport: http2Client.notifyError got notified that the client transport was broken EOF.

montanaflynn on 4 Apr 2017

👍3

I am also getting these logs when using Google's Cloud Speech and Cloud Translate APIs.

Is there an update on this?

dmonay on 17 May 2017

👍2

I have an easy repro for this, I just initialized one of cloud.google.com/go libraries, did not make any requests, and created an interactive prompt to read in from the user, and just waited for a couple of minutes in the prompt.

Simplest repro:

import iamadmin "cloud.google.com/go/iam/admin/apiv1"

client,err := iamadmin.NewIamClient(ctx)
if err != nil {panic(err)}

scanner := bufio.NewScanner(os.Stdin)
for scanner.Scan() {fmt.Println(scanner.Text())} // <-- do not hit Return when you run the program
if err := scanner.Err(); err != nil {panic(err)}

shortly after, you'll start seeing:

2017/06/12 12:37:39 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/06/12 12:41:39 transport: http2Client.notifyError got notified that the client transport was broken EOF.
...

ahmetb on 12 Jun 2017

The logging API changes are in: https://github.com/grpc/grpc-go/pull/922
We will get it merged by end of next week.

After this, we will change the gRPC code to use the correct log level.

menghanl on 12 Jun 2017

@menghanl any updates on the zero-delay reconnect logic?

jellevandenhooff on 12 Jun 2017

@jellevandenhooff
If I remember correctly, the zero-delay reconnect issue only happens with a misbehaving server. It should not be an issue in a correctly configured environment.
Also, in the current gRPC code structure, it's not trivial to add backoff in this situation.
Considering the effort needed to fix and the severity of the problem, we are putting it at a low priority.

The logging issue is more annoying for users, so we will get that fixed soon.

menghanl on 12 Jun 2017

Let me know if you need a repro program. I am assuming the fix will get rid of this error message without any change in the caller code. If that's not the case, we're not fixing it right.

ahmetb on 19 Jun 2017

❤2

Hi @menghanl, I don't think #1340 fixes this issue. Though it helps with the log messages, the file descriptors in @gm42's comment still get exhausted.

jellevandenhooff on 13 Jul 2017

Sorry, reopened.

menghanl on 13 Jul 2017

2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42022 - 405 0B in 17.907µs 2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42024 - 405 0B in 24.254µs 2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42026 - 405 0B in 12.321µs 2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42028 - 405 0B in 50.646µs 2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42030 - 405 0B in 20.908µs 2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42032 - 405 0B in 10.143µs 2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42034 - 405 0B in 16.86µs 2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42036 - 405 0B in 14.035µs

I'm using the chi framework and I see a flood of these when I enable router.Use(middleware.Logger) in my server and a backend GRPC service it uses is offline.

wedgeV on 17 Jul 2017

We discussed this a bit today and weren't sure exactly what the right behavior is in this case. As mentioned above, backoff applies only to cases where the server does not respond, not to the case where it can be reached but then immediately fails. Related, the pick-first balancer, which restarts from the top of the list of backends upon any backend failure, would continually pick this same problem server if it is at the top. We will need to do some more investigation on this.

dfawley on 18 Jul 2017

A side effect (it seems) of these rapid retries, leaving in hundreds of lingering TCP sessions, in case of a mis-behaving remote server (which in our case was due to mis-configuration, pointing to a wrong server) is GC getting triggered every 15secs or so and taking up to 70% of a core for about 15secs. Below is a snip of perf report. Any way to expedite fix for the issue?:
Samples: 49K of event 'cycles', Event count (approx.): 21149591915
29.95% emsd emsd [.] runtime.scanobject
21.11% emsd emsd [.] runtime.memclrNoHeapPointers
10.52% emsd emsd [.] runtime.greyobject
9.67% emsd emsd [.] runtime.heapBitsForObject
2.76% emsd emsd [.] runtime.gcDrain
1.52% emsd emsd [.] runtime.scanblock
[snip]

vtubati on 16 Sep 2017

This and similar problems have come up a few times lately, so we would like to do something soon. The tentative plan is:

Add a dial option to allow the user to choose to wait until a settings frame is received from the server before considering the connection established. This is what the java client does, and it will prevent this problem when connecting to a server that is accepting connections, but isn't a gRPC endpoint or isn't healthy enough to send settings frames.
Make this option opt-out after one release, instead of opt-in.
Remove this option after one more release.

If a server is healthy enough to send a settings frame, and then disconnects, and that happens repeatedly...I'm not sure there's much we can do about that without changes to the gRPC spec.

One incompatibility with this behavior that I'm aware of is cmux -- there is a workaround, but it would require a small code change. If anyone knows of other possible problems with this approach, please let us know.

dfawley on 18 Sep 2017

Thanks for the details of the fix being thought about. Wonder if you are in a position to share when the patch is likely to be available. Thanks.

vtubati on 18 Sep 2017

Let's consolidate this with #1444.

dfawley on 19 Sep 2017

Was this page helpful?

0 / 5 - 0 ratings