Hello! I am working with a server-side streaming gRPC application, and in accordance with gRPC's best practices for long-lived RPCs, I have given the server a max connection age. Once this age is reached, the server severs ties with the client, and the client side's onError is called with the status code UNAVAILABLE. gRPC defines that status code as:
* The service is currently unavailable. This is most likely a
* transient condition and may be corrected by retrying with
* a backoff. Note that it is not always safe to retry
* non-idempotent operations.
Once the client receives this status code, it reconnects to the server and continues. I am able to get this to work with linkerd disabled. However, when linkerd is enabled, I get a different status code: rather than UNAVAILABLE, I get INTERNAL. (INTERNAL ERROR, a.k.a. not good.)
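For reference, the client handles the stream roughly like this. This is only a minimal sketch, not the exact code from my repro; `CounterReply`, `resubscribeWithBackoff`, and `log` are placeholder names:

```scala
import io.grpc.Status
import io.grpc.stub.StreamObserver

// Sketch of the client-side observer for the server-streaming call.
val observer = new StreamObserver[CounterReply] {
  override def onNext(reply: CounterReply): Unit = {
    // handle a streamed message
  }

  override def onError(t: Throwable): Unit = {
    val status = Status.fromThrowable(t)
    if (status.getCode == Status.Code.UNAVAILABLE)
      resubscribeWithBackoff() // transient: reconnect and continue
    else
      log.error(s"stream failed with non-retriable status: $status")
  }

  override def onCompleted(): Unit = () // server ended the stream cleanly
}
```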
Here is a simple reproduction I threw together:
https://github.com/byblakeorriver/SimpleGrpcLinkerd
Here are the logs from the linkerd-proxy (the complete logs). About 80 seconds in is where I get INTERNAL instead of UNAVAILABLE.
https://gist.github.com/byblakeorriver/e07d45b7d2dc705bfe96fe6df97b994d
linkerd check output: I don't have permissions to run this right now. If I ask nicely I can probably get it.
@byblakeorriver Huge thanks for the repro! This should give us enough to track this down.
@byblakeorriver just to confirm, in this scenario you are only injecting the client, right?
@zaharidichev IIRC, the problem exists in a full-meshed configuration as well, but this was a narrower repro.
@olix0r @byblakeorriver So after a fair bit of spelunking, here is what I think is happening: when the server reaches the 1m graceful shutdown period, it simply sends a GOAWAY frame to the client, which indicates that no new streams are allowed. Thirty seconds after that point the connection is severed, according to the maxConnectionAgeGrace threshold.
The Rust H2 implementation at that point just closes the codec and errors with a Cause(Io) error, which ultimately translates to an INTERNAL error. In contrast, Netty keeps track of the fact that a GOAWAY frame was received with NoError on it, so when the connection is severed, that is the reported status. The H2 codec does not do that, and after reading the GOAWAY section of the HTTP/2 RFC, it is a bit unclear whether this is actually correct.
Not sure whether we actually need to _fix_ anything here. One thing we could do is map this error to UNAVAILABLE, but I do not think this is desirable. After all, this is an ungraceful termination, and even conceptually it does not directly translate to UNAVAILABLE. Wdyt?
On another note... if you get rid of maxConnectionAgeGrace, you will continue receiving frames for that stream until the end of time, because the connection is never terminated and GOAWAY allows in-progress streams to finish.
@zaharidichev Thanks for digging into this. A few questions:
> What is happening here is that when the server reaches the 1m graceful shutdown period, it simply sends a GOAWAY frame to the client, which indicates that no new streams are allowed. Thirty seconds after that point the connection is severed, according to the maxConnectionAgeGrace threshold.
Would you mind walking us through the series of frames sent here, especially between the proxy and server?
Specifically, I would expect a stream reset (with the unavailable status), but I'm surprised that we see a full connection teardown via GOAWAY. The server side sends the GOAWAY? Why? Does it send this without the proxy involved?
If we replaced the client with a Tonic client, what behavior would we get (without the proxy in the mix)?
@olix0r I did a bit more digging and I am not entirely convinced that all of this is correct. But to elaborate on the sequence of frames:
So after the initial handshake and stream initialization, the stream is sitting in the HalfClosedLocal state. The server is configured as follows:

```scala
import java.util.concurrent.TimeUnit
import io.grpc.netty.NettyServerBuilder

def start(): Unit = {
  server = NettyServerBuilder
    .forPort(80)
    .addService(CounterService.newInstance())
    .maxConnectionAge(1L, TimeUnit.MINUTES)       // <-- causes a GOAWAY to be sent
    .maxConnectionAgeGrace(30L, TimeUnit.SECONDS) // <-- causes the connection teardown 30 secs after the GOAWAY was sent
    .build()
    .start()
}
```
I will take a look at what happens when using tonic to rule out the proxy being the problem.
OK. I have a much better understanding of this situation after discussing this further...
The problem arises from the use of maxConnectionAge on the server side. The feature is, aiui, intended to force clients to balance load, assuming that clients balance by connection (and not by stream).
This is somewhat at odds with linkerd's load balancer, which natively balances by streams and not connections. Ideally, the server would limit each _stream_'s lifetime and send a trailers message with a grpc-status. This would work across implementations and is "correct" from the protocol point of view.
When the server sends a GOAWAY, that informs linkerd to stop sending _new_ streams over this connection, but it does not tell us to stop using existing streams. We expect each stream to be terminated by the server, and we keep reading until the server eventually closes the connection; that close propagates as an IO error, which we translate into an unknown HTTP/2 error.
We can consider introducing gRPC-specific error handling into the proxy to emit such errors as grpc-status: UNAVAILABLE, but I'm not 100% sure this is the correct thing to do.
@byblakeorriver Can you explain more about your choice to use maxConnectionAge? Is this a standard (documented?) practice that all gRPC servers are expected to do? Or a clever workaround for a balancing problem?
In my view, the server is misbehaving by not terminating streams individually; but if this is a standard practice we can consider ways to paper over the misbehavior in Linkerd.
@olix0r I got the idea to use maxConnectionAge from Eric Anderson's talk at KubeCon.
https://www.youtube.com/watch?v=Naonb2XD_2Q (around 5:36). The talk was about common issues that arise when having long-lived RPCs. Since my particular RPC could live forever, I went ahead and added the max connection age setting to force a break between the client and server. When this occurs, I would expect to receive status UNAVAILABLE. That status, defined as follows by gRPC:
* The service is currently unavailable. This is most likely a
* transient condition and may be corrected by retrying with
* a backoff. Note that it is not always safe to retry
* non-idempotent operations.
would let the client know that it is okay to try reconnecting.
The issue with not using maxConnectionAge is that if my RPC never goes away, the server can become overburdened, because there is (afaik) no way to know how many streams a given server is actively serving. Since there is no way to determine this, my clients could just keep being directed to that overloaded server. So my solution was to periodically drop the RPC, the way that Eric suggested. This gives me the ability to shorten the life of the RPC.
I guess I could technically notify the client by way of a separate protocol message, gracefully close the connection, and then reconnect. I would still have maxConnectionAgeGrace to bound the time it takes to do so, but I would still have the situation where, if the connection could not be closed gracefully, the server would send status UNKNOWN and I would be in the same boat.
@byblakeorriver Thanks, that's helpful context. That definitely is the easiest way to balance load, given the tools that Netty offers out of the box. However, it's a somewhat blunt approach, akin to just restarting the pod every so often (but without the obvious startup and scheduling overhead); and it's especially hard for middle-boxes like Linkerd's proxy to remain totally transparent in this situation:
If we have a situation like
[Client -> Linkerd Proxy] -> Server
# or, more generally
[Client -> Proxy] -> [Proxy -> Server]
The client's proxy maintains an HTTP/2 connection with the Client and with each pod in the Server's deployment. As requests are sent by the client, the proxy load balances each request to the "least loaded" Server (as measured by a weighted moving average of latency, combined with the number of open requests).
Because the proxy operates at the request level, that connection from the Client isn't associated with any one Server. So when a Server terminates its connection due to maxConnectionAge, the Client's proxy sees this as a response stream failure: the proxy was in the middle of streaming a response, and it was interrupted by a network error rather than an end-of-stream message. So, we forward that response error (and not a closed connection) to the Client.
So, when Linkerd is in the mix, maxConnectionAge will generally perform worse for your application... even if we can better hide this error from your application, you'll needlessly be disrupting connections that do not need to be disrupted. However, you may still want to have your server enforce a maximum _stream_ age. Ideally, this would be something that you could implement as a gRPC middleware that gracefully terminates a stream after a given time period (see the sketch below).
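For illustration only, here is a rough sketch of what such middleware could look like with grpc-java (written in Scala to match the repro). The interceptor name is made up, it does not cancel the timer when a call finishes early, and a real implementation would need to guard against closing an already-closed call:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import io.grpc.{Metadata, ServerCall, ServerCallHandler, ServerInterceptor, Status}

// Hypothetical per-stream lifetime limit: after maxAge, the call is closed
// with UNAVAILABLE trailers, so only this stream is terminated rather than
// the whole HTTP/2 connection.
class MaxStreamAgeInterceptor(maxAge: Long, unit: TimeUnit) extends ServerInterceptor {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  override def interceptCall[ReqT, RespT](
      call: ServerCall[ReqT, RespT],
      headers: Metadata,
      next: ServerCallHandler[ReqT, RespT]
  ): ServerCall.Listener[ReqT] = {
    scheduler.schedule(new Runnable {
      def run(): Unit =
        // Closing the call sends trailers with a grpc-status for this stream only.
        call.close(Status.UNAVAILABLE.withDescription("max stream age reached"), new Metadata())
    }, maxAge, unit)
    next.startCall(call, headers)
  }
}
```

The server would then register it via something like `.intercept(new MaxStreamAgeInterceptor(1L, TimeUnit.MINUTES))` instead of (or alongside) maxConnectionAge, so the client's onError sees a proper grpc-status rather than a connection-level failure.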
All that said, we may be able to hide this failure from your application by special-casing connection errors for gRPC streams so that we synthesize a grpc-status: UNAVAILABLE that your application knows how to handle. @zaharidichev is doing some research to understand how other grpc implementations handle connection errors to ensure this is generally safe/proper for Linkerd to replicate.
@olix0r @byblakeorriver All of this makes sense. I did a bit more digging and I can confirm that go-grpc handles these connection errors by surfacing an UNAVAILABLE status. This comes as a bit of a surprise to me, but we can certainly special-case this in the proxy. The relevant piece of Go code is right here.
I can confirm that when using go-grpc for the client and not having anything meshed, I get the unavailable status:
rpc error: code = Unavailable desc = transport is closing
Awesome! Thank you guys for helping out with this. @olix0r I will definitely consider using my own stream life termination technique in place of relying on the status code. Also, @zaharidichev if you have any particular questions for the folks at gRPC, they are very active on their Gitter. I usually get a response within a couple of hours.
@byblakeorriver If you are keen, you can give this branch a go: https://github.com/linkerd/linkerd2-proxy/pull/493
@zaharidichev I probably won't be able to try today, but as soon as I can I will. Thanks again!
@byblakeorriver This will be released with today's edge release (edge-20.4.5). You can either upgrade your cluster from this version of the CLI (i.e. via :; curl -sL https://run.linkerd.io/install-edge | sh) or you can test this on a single workload by setting an annotation like config.linkerd.io/proxy-version: edge-20.4.5.
Note: this release is not yet live, but should be within the next few hours.
Awesome news! Thank you guys so much for the help!
@olix0r - We're still seeing this on the latest edge (our own service name is edge; edge-7fbdfb68c8-7gwld here is our app pod). Note the gRPC error it got back talking to other services:
edge-7fbdfb68c8-7gwld edge {"error":"rpc error: code = Unknown desc = OK: HTTP status code 200; transport: missing content-type field","level":"error","msg":"service_call_failed","operation":"get_event","process":{"host":"edge-7fbdfb68c8-7gwld","name":"edge","runtime":4985.396622537,"version":"1"},"time":"2020-09-08T08:21:07Z"}
linkerd-proxy [ 4985.29219347s] WARN ThreadId(06) outbound:accept{peer.addr=10.202.167.190:49128}:source{target.addr=172.20.23.57:50051}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
edge-7fbdfb68c8-7gwld edge {"error":"rpc error: code = Unknown desc = OK: HTTP status code 200; transport: missing content-type field","level":"error","msg":"service_call_failed","operation":"ingest_delta","process":{"host":"edge-7fbdfb68c8-7gwld","name":"edge","runtime":4985.528984901,"version":"1"},"time":"2020-09-08T08:21:07Z"}
edge-7fbdfb68c8-7gwld linkerd-proxy [ 4985.161562870s] WARN ThreadId(06) outbound:accept{peer.addr=10.202.167.190:34128}:source{target.addr=172.20.82.1:50051}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
edge-7fbdfb68c8-7gwld edge {"error":"rpc error: code = Unknown desc = OK: HTTP status code 200; transport: missing content-type field","level":"error","msg":"service_call_failed","operation":"get_event","process":{"host":"edge-7fbdfb68c8-7gwld","name":"edge","runtime":4985.775839788,"version":"1"},"time":"2020-09-08T08:21:08Z"}
linkerd version:
Client version: edge-20.9.1
Server version: edge-20.9.1
linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor
linkerd-identity-data-plane
---------------------------
√ data plane proxies certificate match CA
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
@steve-gray Thanks. Are you using response streams and maxConnectionAge? If not, would you mind opening a new issue so we can make sure we're focused on the right set of conditions?
We're seeing it on unary and streaming gRPC calls. I'll move the above over to the other issue referenced above: #4227