Title: gRPC 1.20.1 upgrade causes gRPC connections to fail
Description:
I use Envoy's gRPC-Web filter to integrate with a few gRPC applications built in Python. These applications use a secured (TLS) connection.
Since these applications were upgraded to gRPC v1.20.1 (from v1.19.x), Envoy has been unable to get proper responses from them. The version upgrade is the only change, and other applications have no problem communicating with the updated services.
The error I am getting is gRPC status 14 with this message:
upstream connect error or disconnect/reset before headers. reset reason: connection termination
I tried running this against the current stable version of Envoy as well as yesterday's dev build and got the same results.
I have also been unable to reproduce this issue in isolation. I created a small gRPC application using Python and gRPC 1.20.1, and I could communicate with it over gRPC without problems. My little test service was unsecured but otherwise pretty similar, so I suspect this may be a TLS issue.
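For reference, my test service was roughly along these lines (a minimal sketch, not the real service; the helloworld_* modules are the stock generated stubs and purely illustrative here):

# Minimal unsecured test server, built against grpcio 1.20.1.
from concurrent import futures
import time

import grpc
import helloworld_pb2
import helloworld_pb2_grpc


class Greeter(helloworld_pb2_grpc.GreeterServicer):
    def SayHello(self, request, context):
        return helloworld_pb2.HelloReply(message="Hello, %s!" % request.name)


server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
helloworld_pb2_grpc.add_GreeterServicer_to_server(Greeter(), server)
server.add_insecure_port("[::]:50051")  # unsecured, unlike the real services
server.start()
try:
    while True:
        time.sleep(3600)
except KeyboardInterrupt:
    server.stop(0)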
The only other relevant thing I can think of is that all of these services are behind (TCP) load balancers.
Config:
My config is more or less the same as in the example. I fixed the indentation of the bottom part and (of course) updated addresses to match my services.
Logs:
These are the logs of a failing request (minus the request itself):
[2019-05-09 19:20:34.438][19][debug][client] [source/common/http/codec_client.cc:26] [C3] connecting
[2019-05-09 19:20:34.438][19][debug][connection] [source/common/network/connection_impl.cc:695] [C3] connecting to <THE IP ADDRESS>:443
[2019-05-09 19:20:34.438][19][debug][connection] [source/common/network/connection_impl.cc:704] [C3] connection in progress
[2019-05-09 19:20:34.438][19][debug][http2] [source/common/http/http2/codec_impl.cc:725] [C3] setting stream-level initial window size to 268435456
[2019-05-09 19:20:34.438][19][debug][http2] [source/common/http/http2/codec_impl.cc:747] [C3] updating connection-level initial window size to 268435456
[2019-05-09 19:20:34.438][19][debug][pool] [source/common/http/conn_pool_base.cc:20] queueing request due to no available connections
[2019-05-09 19:20:34.438][19][debug][http] [source/common/http/conn_manager_impl.cc:1062] [C0][S2159400524055061755] request end stream
[2019-05-09 19:20:34.454][19][debug][connection] [source/common/network/connection_impl.cc:543] [C3] connected
[2019-05-09 19:20:34.454][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.470][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.470][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.470][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.471][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.471][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.471][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.472][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.472][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C3] handshake error: 2
[2019-05-09 19:20:34.506][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:131] [C3] handshake complete
[2019-05-09 19:20:34.506][19][debug][client] [source/common/http/codec_client.cc:64] [C3] connected
[2019-05-09 19:20:34.506][19][debug][pool] [source/common/http/http2/conn_pool.cc:96] [C3] creating stream
[2019-05-09 19:20:34.506][19][debug][router] [source/common/router/router.cc:1211] [C0][S2159400524055061755] pool ready
[2019-05-09 19:20:34.507][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:175] [C3]
[2019-05-09 19:20:34.507][19][debug][connection] [source/common/network/connection_impl.cc:511] [C3] remote close
[2019-05-09 19:20:34.507][19][debug][connection] [source/common/network/connection_impl.cc:192] [C3] closing socket: 0
[2019-05-09 19:20:34.507][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:244] [C3] SSL shutdown: rc=0
[2019-05-09 19:20:34.507][19][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:175] [C3]
[2019-05-09 19:20:34.507][19][debug][client] [source/common/http/codec_client.cc:82] [C3] disconnect. resetting 1 pending requests
[2019-05-09 19:20:34.507][19][debug][client] [source/common/http/codec_client.cc:105] [C3] request reset
[2019-05-09 19:20:34.507][19][debug][pool] [source/common/http/http2/conn_pool.cc:237] [C3] destroying stream: 0 remaining
[2019-05-09 19:20:34.507][19][debug][router] [source/common/router/router.cc:672] [C0][S2159400524055061755] upstream reset: reset reason connection termination
[2019-05-09 19:20:34.507][19][debug][http] [source/common/http/conn_manager_impl.cc:1137] [C0][S2159400524055061755] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2019-05-09 19:20:34.507][19][debug][http] [source/common/http/conn_manager_impl.cc:1329] [C0][S2159400524055061755] encoding headers via codec (end_stream=true):
':status', '200'
'content-type', 'application/grpc-web-text+proto'
'grpc-status', '14'
'grpc-message', 'upstream connect error or disconnect/reset before headers. reset reason: connection termination'
'access-control-allow-origin', 'http://localhost:8081'
'access-control-expose-headers', 'grpc-status,grpc-message'
'date', 'Thu, 09 May 2019 19:20:34 GMT'
'server', 'envoy'
[2019-05-09 19:20:34.507][19][debug][pool] [source/common/http/http2/conn_pool.cc:150] [C3] client disconnected
[2019-05-09 19:20:34.507][19][debug][pool] [source/common/http/http2/conn_pool.cc:175] [C3] destroying primary client
[2019-05-09 19:20:39.091][1][debug][main] [source/server/server.cc:147] flushing stats
I tried a bunch of things on the Envoy config side, but nothing made any difference.
Could anyone point me at what might be causing the issue or what might help me get to the bottom of this?
@lizan any thoughts?
@lizan @mattklein123 I am experiencing a similar issue. I can provide a reproducible example if needed.
@gyang274 a reproducible example will be very helpful!
@dio I actually fixed the problem by using
hosts: [{ socket_address: { address: 0.0.0.0, port_value: 50051 }}]
instead of
hosts: [{ socket_address: { address: host.docker.internal, port_value: 50051 }}]
in the clusters settings of envoy.yaml, and starting the Docker container with --network=host. This results in successful communication, although I get a warning that this option is deprecated.
Btw, this is on Ubuntu 16.04 with gRPC 1.20.3 and google-protobuf 3.8.0-rc1.
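For context, that hosts line sits inside my cluster definition, which roughly follows the grpc-web example (the cluster name and timeout here are illustrative):

clusters:
- name: my_grpc_service        # illustrative name
  connect_timeout: 0.25s
  type: logical_dns
  http2_protocol_options: {}
  lb_policy: round_robin
  hosts: [{ socket_address: { address: 0.0.0.0, port_value: 50051 }}]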
@gyang274 The non-deprecated way of specifying hosts is shown in the following example - https://github.com/envoyproxy/envoy/blob/master/examples/grpc-bridge/config/s2s-grpc-envoy.yaml#L35-L43
load_assignment:
  cluster_name: local_service_grpc
  endpoints:
  - lb_endpoints:
    - endpoint:
        address:
          socket_address:
            address: host.docker.internal
            port_value: 50051
My issue is definitely not the same one as @gyang274 's.
While I run Envoy in Docker, the service I am contacting runs separately, in a K8s cluster.
The host and port are correct (verified by curling the host at the IP in my logs at the configured port) and they are accessible from the container.
I updated my config to be similar to what @moderation described and I am still seeing the same issue.
As stated in my original report, the only change (quite reproducibly; a rollback fixed it, then another update broke things again) is the update of that service to the new gRPC version.
To rule out basic issues, I also verified that this error looks different from what I get when connecting to a host that does not exist, or to an HTTP/2 host that does not speak gRPC.
We may be able to do some more experimentation soon (I'd personally like to see if this is a TLS issue), but it would be great to get some pointers on where to start.
@TBoshoven gRPC 1.21.0 has just been released - would be interesting to see if you still have the issue using this - https://github.com/grpc/grpc/releases/tag/v1.21.0
I'll try to get it tested within a week.
Unfortunately, after updating the service to gRPC 1.21.1, things still fail in the same way.
Tested using the latest Envoy dev Docker container.
I can add some more detail to what @TBoshoven said - I've spent much of the day testing combinations here, and with the same server code, swapping gRPC 1.19.0 for 1.20.0 breaks things in the way he describes.
I did some packet capturing with Wireshark, and in the 1.20 case, the 19th packet is a [FIN, ACK] packet rather than application data, and then everything further is different. I'm not sure the captures are really of much use though.
I should note that this is only a problem with TLS enabled; without TLS, 1.20 also works just fine.
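To be concrete about what we toggle on the server side, the difference is roughly this (a sketch, not our actual code; cert/key paths and the port are placeholders):

import grpc
from concurrent import futures

server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
# ... register servicers here ...

# Plaintext variant: works behind Envoy with both gRPC 1.19 and 1.20.
# server.add_insecure_port("[::]:50051")

# TLS variant (what we actually run): breaks behind Envoy after moving to 1.20.
with open("server.key", "rb") as f:
    private_key = f.read()
with open("server.crt", "rb") as f:
    certificate_chain = f.read()
credentials = grpc.ssl_server_credentials([(private_key, certificate_chain)])
server.add_secure_port("[::]:50051", credentials)

server.start()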
@alertedsnake @TBoshoven TLS enabled where? between client and envoy, or between envoy and upstream gRPC server, or both?
Can you post, or email me if you don't want to disclose, a full config with a trace (not debug) log?
@lizan sorry, I should have mentioned that. TLS enabled between envoy and the upstream gRPC service. It uses authentication, so it only listens with TLS. If we have the gRPC service listen without TLS, everything works fine.
I'm using @TBoshoven 's envoy config, and I don't know anything about envoy here at all, so I'm not sure how to provide you the logs you're looking for.
The logs I provided in my initial report are still valid.
Here's my cleaned-up config, with a bunch of stuff that I poked at left in as comments:
admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          codec_type: auto
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match:
                  prefix: "/prefix"
                route:
                  cluster: the_service
                  max_grpc_timeout: 0s
              cors:
                allow_origin:
                - "http://localhost:8081"
                allow_methods: GET, PUT, DELETE, POST, OPTIONS
                allow_headers: keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,custom-header-1,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout,authorization
                max_age: "1728000"
                expose_headers: grpc-status,grpc-message
          http_filters:
          - name: envoy.grpc_web
          - name: envoy.cors
          - name: envoy.router
  clusters:
  # Stuff I have poked at
  # - name: the_service
  #   connect_timeout: 5s
  #   type: logical_dns
  #   http2_protocol_options:
  #     hpack_table_size: 0
  #   lb_policy: round_robin
  #   load_assignment:
  #     cluster_name: local_service_grpc
  #     endpoints:
  #     - lb_endpoints:
  #       - endpoint:
  #           address:
  #             socket_address:
  #               address: hostname
  #               port_value: 443
  #   tls_context:
  #     allow_renegotiation: true
  #     sni: hostname
  #     common_tls_context:
  #       validation_context:
  #         trusted_ca:
  #           filename: /etc/ssl/cert.pem
  - name: the_service
    connect_timeout: 5s
    type: logical_dns
    http2_protocol_options: {}
    lb_policy: round_robin
    hosts:
    - socket_address:
        address: hostname
        port_value: 443
    tls_context: {}
The uncommented part of this worked with gRPC 1.19 and not 1.20.
@lizan This is still an issue. Is there anything else we can provide to help identify and resolve the problem?
EDIT: Just read that you asked for a trace log. Let me see what I can do.
This is very useful. I was not aware of the trace verbosity level.
Here's what I believe is the relevant part of the log. If needed, I can give you the full one as well.
envoy_1 | [2019-06-17 08:55:51.125][12][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:157] [C1] handshake complete
envoy_1 | [2019-06-17 08:55:51.125][12][debug][client] [source/common/http/codec_client.cc:64] [C1] connected
envoy_1 | [2019-06-17 08:55:51.125][12][debug][pool] [source/common/http/http2/conn_pool.cc:96] [C1] creating stream
envoy_1 | [2019-06-17 08:55:51.125][12][debug][router] [source/common/router/router.cc:1400] [C0][S11007887549020777307] pool ready
envoy_1 | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:558] [C1] send data: bytes=24
envoy_1 | [2019-06-17 08:55:51.125][12][trace][connection] [source/common/network/connection_impl.cc:392] [C1] writing 24 bytes, end_stream false
envoy_1 | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:558] [C1] send data: bytes=21
envoy_1 | [2019-06-17 08:55:51.125][12][trace][connection] [source/common/network/connection_impl.cc:392] [C1] writing 21 bytes, end_stream false
envoy_1 | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:511] [C1] sent frame type=4
envoy_1 | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:558] [C1] send data: bytes=13
envoy_1 | [2019-06-17 08:55:51.125][12][trace][connection] [source/common/network/connection_impl.cc:392] [C1] writing 13 bytes, end_stream false
envoy_1 | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:511] [C1] sent frame type=8
envoy_1 | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:558] [C1] send data: bytes=332
envoy_1 | [2019-06-17 08:55:51.125][12][trace][connection] [source/common/network/connection_impl.cc:392] [C1] writing 332 bytes, end_stream false
envoy_1 | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:511] [C1] sent frame type=1
envoy_1 | [2019-06-17 08:55:51.125][12][trace][connection] [source/common/network/connection_impl.cc:392] [C1] writing 22 bytes, end_stream false
envoy_1 | [2019-06-17 08:55:51.125][12][trace][http2] [source/common/http/http2/codec_impl.cc:511] [C1] sent frame type=0
envoy_1 | [2019-06-17 08:55:51.125][12][trace][connection] [source/common/network/connection_impl.cc:541] [C1] write ready
envoy_1 | [2019-06-17 08:55:51.125][12][trace][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:234] [C1] ssl write returns: 412
envoy_1 | [2019-06-17 08:55:51.126][12][trace][connection] [source/common/network/connection_impl.cc:494] [C1] read ready
envoy_1 | [2019-06-17 08:55:51.126][12][trace][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:73] [C1] ssl read returns: 0
envoy_1 | [2019-06-17 08:55:51.126][12][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:201] [C1]
envoy_1 | [2019-06-17 08:55:51.126][12][trace][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:148] [C1] ssl read 0 bytes into 0 slices
envoy_1 | [2019-06-17 08:55:51.126][12][debug][connection] [source/common/network/connection_impl.cc:518] [C1] remote close
envoy_1 | [2019-06-17 08:55:51.126][12][debug][connection] [source/common/network/connection_impl.cc:188] [C1] closing socket: 0
envoy_1 | [2019-06-17 08:55:51.126][12][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:270] [C1] SSL shutdown: rc=0
envoy_1 | [2019-06-17 08:55:51.126][12][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:201] [C1]
envoy_1 | [2019-06-17 08:55:51.126][12][debug][client] [source/common/http/codec_client.cc:82] [C1] disconnect. resetting 1 pending requests
envoy_1 | [2019-06-17 08:55:51.126][12][debug][client] [source/common/http/codec_client.cc:105] [C1] request reset
envoy_1 | [2019-06-17 08:55:51.126][12][trace][main] [source/common/event/dispatcher_impl.cc:144] item added to deferred deletion list (size=1)
envoy_1 | [2019-06-17 08:55:51.126][12][debug][pool] [source/common/http/http2/conn_pool.cc:237] [C1] destroying stream: 0 remaining
The line that struck me as odd is [connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:234] [C1] ssl write returns: 412
According to this, 412 means "SSL protocol or certificate type is not supported." At first I wasn't sure whether the number in that log entry refers to this error or to the number of bytes written. However, on multiple runs I get slight variations on this number, and the writes logged just above it (24 + 21 + 13 + 332 + 22 bytes) add up to exactly 412, so I'm fairly sure it's just the byte count and not an error.
About the cert: curl accepts the certificate just fine (TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384 with a P-256 curve which is part of the default accepted values according to the docs).
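For what it's worth, here is a quick sanity check I can run to see what the upstream actually negotiates (TLS version, cipher, and ALPN) independently of both curl and Envoy; "hostname" is a placeholder for the real upstream host:

import socket
import ssl

host = "hostname"  # placeholder for the real upstream host

ctx = ssl.create_default_context()
ctx.set_alpn_protocols(["h2", "http/1.1"])  # offer h2, as an HTTP/2 client would

with socket.create_connection((host, 443)) as sock:
    with ctx.wrap_socket(sock, server_hostname=host) as tls:
        print("version:", tls.version())
        print("cipher:", tls.cipher())
        print("alpn:", tls.selected_alpn_protocol())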
I ended up switching to grpcwebproxy. It successfully connects to the backend services.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
I get the point of the stale bot, but this is a real issue that has not been solved, and so it shouldn't be closed.
I had a similar connect issue and alpn_protocols: h2 was the fix for me. See https://github.com/envoyproxy/envoy/issues/7925.
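In the config posted earlier in this thread, that setting would go on the cluster's upstream TLS context, roughly like this (a sketch based on the v2 UpstreamTlsContext; "hostname" as above):

tls_context:
  sni: hostname
  common_tls_context:
    alpn_protocols: ["h2"]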
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.
I have a similar issue; alpn_protocols: h2 didn't help either.