Our application shows a couple of errors that seem to be related to `linkerd-proxy` (without linkerd injection these errors don't happen):
```
err: {
"code": 4,
"metadata": {
"_internal_repr": {}
},
"details": "Deadline Exceeded"
}
```
followed by
```
err: {
"code": 1,
"metadata": {
"_internal_repr": {
":status": [
"502"
],
"content-length": [
"0"
],
"date": [
"Tue, 11 Jun 2019 13:01:30 GMT"
]
}
},
"details": "Received http2 header with status: 502"
}
```
On `linkerd-proxy` pods we can see the following errors:
```
DBUG [ 388.710739s] proxy={client=in dst=10.24.26.128:9090 proto=Http2} linkerd2_proxy::proxy::http::h2 http2 conn error: http2 error: protocol error: unspecific protocol error detected
ERR! [ 388.710816s] proxy={server=in listen=0.0.0.0:4143 remote=10.24.23.122:35662} linkerd2_proxy::app::errors unexpected error: http2 error: protocol error: unspecific protocol error detected
```
Based on the logs it seems like `service A` sends a request to `service B` via `linkerd-proxy`, and `service B` processes the request successfully (both the `service B` logs and `service B`'s `linkerd-proxy` show no errors, and data is written further downstream to the database). However, this is where the issue is reported: instead of reporting the request as successful, the `linkerd-proxy` of `service A` shows the `linkerd2_proxy::app::errors unexpected error: http2 error: protocol error: unspecific protocol error detected` error, and the `service A` logs show the application errors above.
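For reference, the numeric `code` values in those errors are the standard gRPC status codes: 4 is DEADLINE_EXCEEDED and 1 is CANCELLED. A minimal sketch of how a Node/TypeScript client could log these codes by name, assuming the `@grpc/grpc-js` package (the error shape here mirrors what the application prints above; the handler itself is hypothetical):
```ts
import { status } from "@grpc/grpc-js";

// Map a numeric gRPC status code to its symbolic name,
// e.g. 4 -> "DEADLINE_EXCEEDED", 1 -> "CANCELLED".
function statusName(code: number): string {
  const name = (status as Record<number, string>)[code];
  return name ?? `UNKNOWN_CODE_${code}`;
}

// Hypothetical error handler for the failing calls shown above.
function logGrpcError(err: { code: number; details?: string }): void {
  console.error(`gRPC call failed: ${statusName(err.code)} (${err.code}) ${err.details ?? ""}`);
}
```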
### How can it be reproduced?
Not 100% sure. We were seeing this issue when we first started using linkerd, but after we applied the fix described in https://github.com/linkerd/linkerd2/issues/2813#issuecomment-496641996 we stopped seeing it until today.
I've attempted upgrading linkerd with `--disable-h2-upgrade` and I can still see the issue.
### Logs, error output, etc
#### `linkerd check` output
```
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ control plane namespace exists
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles
linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date
control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match
Status check results are √
```
`--proxy-version=fix-2863-0` (which contained the memory leak fix)

Not a solution, but internally we think that linkerd is adding the status field wrapped in `:` and gRPC doesn't like that.
There isn't a pattern or much consistency with this; every time we run the job the error is returned at a different point in the flow.
This seems to be a very similar issue to what is described in https://github.com/linkerd/linkerd2/issues/2801
@calinah Thanks for the excellent bug report. A few questions:
`proxy={client=in` makes me believe we are seeing the server side? Are there any logs on the client side?

@olix0r The application itself uses HTTP/2, and the application logs I've shared are from the `service A` client.
As for the `linkerd-proxy` log I've shared, it's a bit more complex. I've simplified the flow, but the way it works is: main service ---> service A ---> service B ---> database
```
DBUG [ 388.710739s] proxy={client=in dst=10.24.26.128:9090 proto=Http2} linkerd2_proxy::proxy::http::h2 http2 conn error: http2 error: protocol error: unspecific protocol error detected
ERR! [ 388.710816s] proxy={server=in listen=0.0.0.0:4143 remote=10.24.23.122:35662} linkerd2_proxy::app::errors unexpected error: http2 error: protocol error: unspecific protocol error detected
```
This log is shown in the `linkerd-proxy` of service A (10.24.26.128), but where you see `server` it actually means the main service (10.24.23.122).
@calinah Thanks for the context. I'm curious if we can get any insight into what's happening on the _outbound_ side of service A, if anything. (The logs you shared are from the _inbound_ side.) Perhaps tap can help us: `linkerd tap -n $namespace deploy/$deploy_a ...` I think we may do a poor job of surfacing errors at the moment, but it would be helpful to notice whether you see _any_ corresponding outbound requests from service A and, if so, whether they succeed or fail...
@olix0r https://gist.github.com/calinah/b4803e93ecfe202783fdcae88f1c673d - service B coming up
For readability of those logs: service A - 10.24.26.128, main service - 10.24.23.122, service B - 10.24.16.81.
hey @olix0r any luck or updates with this? Anything I can test/do to help?
@calinah thanks, we've been looking at a few similar reports and trying to narrow them down. Can you set the `config.linkerd.io/proxy-log-level: info,h2=trace,linkerd2_proxy=trace` annotation on the _service A_ pod spec? This will output _a lot_ of logs, but if we can isolate a single protocol error in these logs, we can hopefully determine whether these reports are related or not.
@calinah We have another h2 fix which (1) fixes a case where the proxy would erroneously emit 502s and (2) adds better diagnostics to logs when the h2 server emits errors. These changes will be in the upcoming edge-19.6.3.
If you'd like to test this change in the meantime, I've pushed a proxy image that you can test against by setting the pod spec annotations (or setting the proper linkerd inject flags):
```
config.linkerd.io/proxy-version: ver-protocol-error-fix-0
config.linkerd.io/proxy-log-level: warn,linkerd2_proxy=info,h2=debug
```
I'd be curious if you see any new behavior with this change.
Hi @calinah. Just following up, any luck testing/reproducing this with the latest edge-19.6.3?
Hi @siggy, if the latest edge-19.6.3 contains the same changes as `config.linkerd.io/proxy-version: ver-protocol-error-fix-0` and/or `config.linkerd.io/proxy-version: ver-protocol-error-fix-1`, then no luck with either. I've documented my results here: https://github.com/linkerd/linkerd2/issues/2801#issuecomment-504118860. Let me know if it's still worth trying edge-19.6.3 and I'll give it a go later today.
@calinah Thanks for the update. I'm assuming edge-19.6.3 did not contain any fixes beyond ver-protocol-error-fix-* (@olix0r please confirm). I'd hold off on further testing until we come back with additional proxy fixes.
@siggy @calinah we have a follow-up fix in review/testing... hopefully for this week's edge.
@olix0r just a quick update on our side. I've noticed that you guys released edge-19.6.4 so I've installed it on our staging cluster and ran a test and I can confirm we're still seeing the same error messages (not sure if the changes were supposed to fix this issue as well)
Sorry, the http2 fixes didn't make it into last week's edge (HTTP2 is complicated ;_;).
@calinah We just released edge-19.7.1 with what we believe is the fix. Can you let us know if this addressed the issue?
@siggy Unfortunately it did not :(. The logs are slightly different now: we're no longer getting a 502 on the client and the proxy is no longer showing an unspecified protocol error. Here's what I can see on the `linkerd-proxy` of the service:
```
DBUG [ 11464.242602s] admin={server=tap listen=0.0.0.0:4190 remote=10.24.22.34:52942} h2::codec::framed_write send; frame=Data { stream_id: StreamId(1) }
DBUG [ 11466.641653s] proxy={server=in listen=0.0.0.0:4143 remote=10.24.14.67:48272} h2::codec::framed_read received; frame=Reset { stream_id: StreamId(17), error_code: CANCEL }
DBUG [ 11466.641980s] proxy={client=in dst=10.24.14.63:9090 proto=Http2} h2::codec::framed_write send; frame=Reset { stream_id: StreamId(17), error_code: CANCEL }
DBUG [ 11466.665680s] proxy={server=in listen=0.0.0.0:4143 remote=10.24.14.67:48272} h2::codec::framed_read received; frame=Reset { stream_id: StreamId(19), error_code: CANCEL }
DBUG [ 11466.665882s] proxy={client=in dst=10.24.14.63:9090 proto=Http2} h2::codec::framed_write send; frame=Reset { stream_id: StreamId(19), error_code: CANCEL }
DBUG [ 11466.697931s] proxy={server=in listen=0.0.0.0:4143 remote=10.24.14.67:48272} h2::codec::framed_read received; frame=Headers { stream_id: StreamId(25), flags: (0x4: END_HEADERS) }
DBUG [ 11466.697987s] proxy={server=in listen=0.0.0.0:4143 remote=10.24.14.67:48272} h2::codec::framed_read received; frame=Data { stream_id: StreamId(25), flags: (0x1: END_STREAM) }
.
.
.
DBUG [ 11561.651715s] admin={server=tap listen=0.0.0.0:4190 remote=10.24.22.34:52942} h2::codec::framed_write send; frame=Data { stream_id: StreamId(1) }
DBUG [ 11561.651781s] admin={server=tap listen=0.0.0.0:4190 remote=10.24.22.34:52942} h2::codec::framed_write send; frame=Data { stream_id: StreamId(1) }
DBUG [ 11580.002991s] proxy={client=in dst=10.24.14.63:9090 proto=Http2} h2::codec::framed_write send; frame=GoAway { last_stream_id: StreamId(0), error_code: NO_ERROR }
DBUG [ 11580.003038s] proxy={client=in dst=10.24.14.63:9090 proto=Http2} h2::proto::connection Connection::poll; connection error=NO_ERROR
DBUG [ 11580.003216s] admin={client=control dst=linkerd-destination.linkerd.svc.cluster.local:8086 remote=10.27.245.209:8086} h2::codec::framed_write send; frame=Reset { stream_id: StreamId(25), error_code: CANCEL }
DBUG [ 11584.525965s] proxy={server=in listen=0.0.0.0:4143 remote=10.24.14.67:48272} h2::codec::framed_read received; frame=GoAway { last_stream_id: StreamId(0), error_code: NO_ERROR }
DBUG [ 11817.283192s] proxy={server=out listen=127.0.0.1:4140 remote=10.24.14.63:53262} h2::codec::framed_read received; frame=Ping { ack: false, payload: [0, 0, 0, 0, 0, 0, 0, 2] }
```
Also, if I run `linkerd tap deploy $service-name` I can see a DeadlineExceeded error:
```
end id=1:20 proxy=out src=10.24.14.63:55724 dst=10.24.24.5:8529 tls=not_provided_by_service_discovery duration=44µs response-length=102B
req id=1:21 proxy=in src=10.24.14.67:39414 dst=10.24.14.63:9090 tls=true :method=POST :authority=$service-name:9090 :path=/$methodName
req id=1:22 proxy=in src=10.24.14.67:39414 dst=10.24.14.63:9090 tls=true :method=POST :authority=$service-name:9090 :path=/$methodName
req id=1:23 proxy=in src=10.24.14.67:39414 dst=10.24.14.63:9090 tls=true :method=POST :authority=$service-name:9090 :path=/$methodName
req id=1:24 proxy=in src=10.24.14.67:39414 dst=10.24.14.63:9090 tls=true :method=POST :authority=$service-name:9090 :path=/$methodName
rsp id=1:1 proxy=in src=10.24.14.67:39414 dst=10.24.14.63:9090 tls=true :status=200 latency=13261µs
end id=1:1 proxy=in src=10.24.14.67:39414 dst=10.24.14.63:9090 tls=true grpc-status=DeadlineExceeded duration=27µs response-length=0B
rsp id=1:2 proxy=in src=10.24.14.67:39414 dst=10.24.14.63:9090 tls=true :status=200 latency=8009µs
end id=1:2 proxy=in src=10.24.14.67:39414 dst=10.24.14.63:9090 tls=true grpc-status=DeadlineExceeded duration=5µs response-length=0B
req id=1:25 proxy=out src=10.24.14.63:54942 dst=10.24.24.5:8529 tls=not_provided_by_service_discovery :method=GET :authority=$db-hostname:8529 :path=/$MethodName
```
On the application side all I get is either:
```
err: {
"message": "Deadline Exceeded",
"code": 4
}
```
or
```
err: {
"code": 4,
"metadata": {
"_internal_repr": {}
},
"details": "Deadline Exceeded"
}
```
I do think the changes helped with the bug, but there's still something off. I also tried creating a lot less data than in all the other tests, to rule out a connection timeout of some sort.
Let me know if I can provide more details or test anything else.
So, according to the logs, it does look like the original issue has been fixed. There are no more 502s introduced by the proxy, and no more connections being abruptly closed with a protocol error.
The DeadlineExceeded codes come from the gRPC libraries (not from the proxy). It's typical for a gRPC client to wait for the configured deadline and, if no answer arrives, send an HTTP/2 CANCEL and return DeadlineExceeded. That explains why the proxy sees a bunch of cancels.
The remaining question is whether these deadlines are exceeded in normal operation, or only when linkerd is injected. If the latter, it's likely a different issue from the one reported here.
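To make that behaviour concrete, here is a rough sketch of a per-call deadline on the client side, assuming a Node/TypeScript client built with `@grpc/grpc-js` and `@grpc/proto-loader`; the proto file, package, service, method, and 2-second deadline below are made up for illustration, not taken from this thread:
```ts
import * as grpc from "@grpc/grpc-js";
import * as protoLoader from "@grpc/proto-loader";

// Hypothetical proto definition; substitute the real service here.
const packageDef = protoLoader.loadSync("service_b.proto");
const proto = grpc.loadPackageDefinition(packageDef) as any;

const client = new proto.example.ServiceB(
  "service-b:9090",
  grpc.credentials.createInsecure()
);

// Per-call deadline of 2 seconds. If it expires before a response arrives,
// the client cancels the HTTP/2 stream (RST_STREAM with CANCEL) and the
// callback receives code 4, DEADLINE_EXCEEDED, like the errors above.
const deadline = new Date(Date.now() + 2000);

client.someMethod({}, { deadline }, (err: grpc.ServiceError | null, reply: unknown) => {
  if (err && err.code === grpc.status.DEADLINE_EXCEEDED) {
    console.error("Deadline exceeded:", err.details);
  } else if (err) {
    console.error("RPC failed:", err);
  } else {
    console.log("RPC succeeded:", reply);
  }
});
```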
@calinah Thanks again for all the detail, we just released edge-19.7.2 with additional fixes. Can you let us know if this addressed the issue?
Thank you @siggy, I will check out edge-19.7.2 tomorrow. I've been testing with 19.7.1 (and without linkerd) all day today, and as @seanmonstar suggested, it seems like the new errors we're seeing might be down to a separate downstream supplier, due to new changes they've implemented on their side in the last couple of days. I've been working to pinpoint the exact issue and will update shortly when I know more.
@siggy @seanmonstar @olix0r I am super pleased to confirm that the changes you guys have rolled out have fixed the 502 issues. I've solved our internal issue and have managed to test multiple times with edge-19.7.1 and once with edge-19.7.2. Thank you for your assistance 🙏
Great!!!