When I connect a GRPC client to an IP and port number which has no listeners the client properly tries to reconnect and respects the dial timeout and fallback.
However if I run an socat -v TCP-LISTEN:8002,fork SYSTEM:'echo hi' and then attempt to connect my grpc-go client to it... it'll keep reconnecting at a fast rate in what seems an infinite loop, effectively DoS'ing the host its running on since TCP connections stay around with TIME_WAIT for tcp_fin_timeout duration.
WARN[0008] transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
WARN[0008] transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
WARN[0008] transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
[infinite]
This can cause issues when using a TCP load balancer like ELB which always has a TCP listen port open regardless of whether there are any backend instances available as the connection is immediately closed after an accept.
Even in the case where someone isn't running a load balancer, if for any reason the server it tries to connect misbehaves it will cause the above.
Note: I'm using grpc.WithTimeout and grpc.WithBlock.
This issue is the same as in #120, but @cdesousa's here description is much better.
@jellevandenhooff in #120 the main topic is how annoying the message are, however here I instead see that:
Since grpc-go wants to automate reconnection, I think it should cover also the case of this type of misbehaving servers to prevent self-DoS scenarios. It should be possible to cover not only the dialing stage but the reconnection stage (think about the IRC "reconnecting too fast" error message).
I have dug a bit into this, it turns out that the tight-looping (DoS-ing) behaviour is to be found in the transportMonitor() goroutine, which does not take in account an EOF-serving service.
Steps to reproduce the problem:
Ctrl+C)socat -v TCP-LISTEN:8081,fork SYSTEM:'echo hi'The client will generate an avalanche of reconnections, quickly exhausting all file descriptors.
@iamqizhao, what do you think of adding logic to the transport monitor to back-off before attempting to reconnect again if the first read in the reader goroutine fails?
+1 this is still an issue. I have a very basic use case of gRPC and the log output is flooded with this message every 4 minutes while it is sitting idle. This looks confusing for someone who just started using gRPC and therefore warrants a Google search which eventually lands people here. I am still wondering if I’m doing something wrong.
2017/02/22 09:37:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:41:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:45:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:49:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:53:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:57:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
We will be adding verbosity levels to logging soon. Error like that won't
spam logs after that.
No plans yet to support cases where the server or client misbehaves.
In this particular case, seems like the underlying transport breaks due to
something (may be the environment configuration doesn't allow for
long-living ideal connections or something). However, the aforementioned
monitor routine reconnects.
On Wed, Feb 22, 2017 at 10:04 AM, Ahmet Alp Balkan <[email protected]
wrote:
+1 this is still an issue. I have a very simple use case of gRPC and the
log output is flooded with this message every 4 minutes while it is sitting
idle. This looks confusing for someone who just started using gRPC and
therefore warrants a Google search which eventually lands people here. I am
still wondering if I’m doing something wrong.2017/02/22 09:37:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:41:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:45:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:49:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:53:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:57:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/grpc/grpc-go/issues/954#issuecomment-281751094, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ATtnR6xwfuoJ8xHMLzi5a5vENRZ3BRC7ks5rfHjCgaJpZM4Kjh0O
.
seems like the underlying transport breaks due to something
I think we need to dig this one deeper. I understand the logger work is going on but this one might be pointing to an actual problem.
This is basically just me running multiple grpc services on my laptop and it starts happening after a while such as 2 minutes after services don't get any requests:
DEBU[2017-02-23T12:06:08-08:00] request completed elapsed=1.512785ms method=POST path="/coffee" service=web
2017/02/23 12:08:24 transport: http2Client.notifyError got notified that the client transport was broken EOF.
@MakMukhi Do you have an ETA and/or gRPC version number for when the log spam will cease?
I'm getting these logs when using the gcloud vision client (cloud.google.com/go/vision)
transport: http2Client.notifyError got notified that the client transport was broken EOF.
I am also getting these logs when using Google's Cloud Speech and Cloud Translate APIs.
Is there an update on this?
I have an easy repro for this, I just initialized one of cloud.google.com/go libraries, did not make any requests, and created an interactive prompt to read in from the user, and just waited for a couple of minutes in the prompt.
Simplest repro:
import iamadmin "cloud.google.com/go/iam/admin/apiv1"
client,err := iamadmin.NewIamClient(ctx)
if err != nil {panic(err)}
scanner := bufio.NewScanner(os.Stdin)
for scanner.Scan() {fmt.Println(scanner.Text())} // <-- do not hit Return when you run the program
if err := scanner.Err(); err != nil {panic(err)}
shortly after, you'll start seeing:
2017/06/12 12:37:39 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/06/12 12:41:39 transport: http2Client.notifyError got notified that the client transport was broken EOF.
...
The logging API changes are in: https://github.com/grpc/grpc-go/pull/922
We will get it merged by end of next week.
After this, we will change the gRPC code to use the correct log level.
@menghanl any updates on the zero-delay reconnect logic?
@jellevandenhooff
If I remember correctly, the zero-delay reconnect issue only happens with a misbehaving server. It should not be an issue in a correctly configured environment.
Also, in the current gRPC code structure, it's not trivial to add backoff in this situation.
Considering the effort needed to fix and the severity of the problem, we are putting it at a low priority.
The logging issue is more annoying for users, so we will get that fixed soon.
Let me know if you need a repro program. I am assuming the fix will get rid of this error message without any change in the caller code. If that's not the case, we're not fixing it right.
Hi @menghanl, I don't think #1340 fixes this issue. Though it helps with the log messages, the file descriptors in @gm42's comment still get exhausted.
Sorry, reopened.
2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42022 - 405 0B in 17.907µs
2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42024 - 405 0B in 24.254µs
2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42026 - 405 0B in 12.321µs
2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42028 - 405 0B in 50.646µs
2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42030 - 405 0B in 20.908µs
2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42032 - 405 0B in 10.143µs
2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42034 - 405 0B in 16.86µs
2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42036 - 405 0B in 14.035µs
I'm using the chi framework and I see a flood of these when I enable router.Use(middleware.Logger) in my server and a backend GRPC service it uses is offline.
We discussed this a bit today and weren't sure exactly what the right behavior is in this case. As mentioned above, backoff applies only to cases where the server does not respond, not to the case where it can be reached but then immediately fails. Related, the pick-first balancer, which restarts from the top of the list of backends upon any backend failure, would continually pick this same problem server if it is at the top. We will need to do some more investigation on this.
A side effect (it seems) of these rapid retries, leaving in hundreds of lingering TCP sessions, in case of a mis-behaving remote server (which in our case was due to mis-configuration, pointing to a wrong server) is GC getting triggered every 15secs or so and taking up to 70% of a core for about 15secs. Below is a snip of perf report. Any way to expedite fix for the issue?:
Samples: 49K of event 'cycles', Event count (approx.): 21149591915
29.95% emsd emsd [.] runtime.scanobject
21.11% emsd emsd [.] runtime.memclrNoHeapPointers
10.52% emsd emsd [.] runtime.greyobject
9.67% emsd emsd [.] runtime.heapBitsForObject
2.76% emsd emsd [.] runtime.gcDrain
1.52% emsd emsd [.] runtime.scanblock
[snip]
This and similar problems have come up a few times lately, so we would like to do something soon. The tentative plan is:
Add a dial option to allow the user to choose to wait until a settings frame is received from the server before considering the connection established. This is what the java client does, and it will prevent this problem when connecting to a server that is accepting connections, but isn't a gRPC endpoint or isn't healthy enough to send settings frames.
Make this option opt-out after one release, instead of opt-in.
Remove this option after one more release.
If a server is healthy enough to send a settings frame, and then disconnects, and that happens repeatedly...I'm not sure there's much we can do about that without changes to the gRPC spec.
One incompatibility with this behavior that I'm aware of is cmux -- there is a workaround, but it would require a small code change. If anyone knows of other possible problems with this approach, please let us know.
Thanks for the details of the fix being thought about. Wonder if you are in a position to share when the patch is likely to be available. Thanks.
Let's consolidate this with #1444.
Most helpful comment
+1 this is still an issue. I have a very basic use case of gRPC and the log output is flooded with this message every 4 minutes while it is sitting idle. This looks confusing for someone who just started using gRPC and therefore warrants a Google search which eventually lands people here. I am still wondering if I’m doing something wrong.