When using Connect with Envoy, Envoy uses HTTPS to connect to the Consul agent for the first time only when CONSUL_HTTP_SSL is true, even if CONSUL_HTTP_ADDR is set to use HTTPS. After it has connected for the first time (with CONSUL_HTTP_SSL enabled), it is able to connect again even without setting CONSUL_HTTP_SSL.
With CONSUL_HTTP_ADDR=https://127.0.0.1:8501 (without setting CONSUL_HTTP_SSL), Envoy fails to connect to the Consul agent:
Agent log:
agent: grpc: Server.Serve failed to complete security handshake from "127.0.0.1:50030": tls: first record does not look like a TLS handshake
Envoy log:
router decoding headers:
':method', 'POST'
':path', '/envoy.service.discovery.v2.AggregatedDiscoveryService/StreamAggregatedResources'
':authority', 'local_agent'
':scheme', 'http'
'te', 'trailers'
'content-type', 'application/grpc'
'x-consul-token', ''
'x-envoy-internal', 'true'
'x-forwarded-for', '192.168.121.186'
...
Unknown error code 104 details Connection reset by peer
When CONSUL_HTTP_SSL is set to true, the ':scheme', 'http' line becomes ':scheme', 'https'.
Here is a full Ansible playbook that reproduces this problem:
https://gist.github.com/akhayyat/4a6a5718425ac4addfef3fa9bb932c65
Client info
agent:
check_monitors = 0
check_ttls = 0
checks = 2
services = 2
build:
prerelease =
revision = 2cf0a3c8
version = 1.7.1
consul:
acl = disabled
bootstrap = false
known_datacenters = 1
leader = false
leader_addr = 192.168.121.170:8300
server = true
raft:
applied_index = 61
commit_index = 61
fsm_pending = 0
last_contact = 21.84253ms
last_log_index = 61
last_log_term = 2
last_snapshot_index = 0
last_snapshot_term = 0
latest_configuration = [{Suffrage:Voter ID:61915f30-6707-76d2-8ab2-d8f5ab82e1be Address:192.168.121.186:8300} {Suffrage:Voter ID:3919f158-7d07-ddfd-530e-8b048e5aff4f Address:192.168.121.170:8300} {Suffrage:Voter ID:41f37e2a-bda7-f15a-b714-2cfbd8193f72 Address:192.168.121.134:8300}]
latest_configuration_index = 0
num_peers = 2
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 2
runtime:
arch = amd64
cpu_count = 1
goroutines = 104
max_procs = 1
os = linux
version = go1.13.7
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 2
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 6
members = 3
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 5
members = 3
query_queue = 0
query_time = 1
Server info
agent:
check_monitors = 0
check_ttls = 0
checks = 2
services = 2
build:
prerelease =
revision = 2cf0a3c8
version = 1.7.1
consul:
acl = disabled
bootstrap = false
known_datacenters = 1
leader = false
leader_addr = 192.168.121.170:8300
server = true
raft:
applied_index = 73
commit_index = 73
fsm_pending = 0
last_contact = 38.422636ms
last_log_index = 74
last_log_term = 2
last_snapshot_index = 0
last_snapshot_term = 0
latest_configuration = [{Suffrage:Voter ID:61915f30-6707-76d2-8ab2-d8f5ab82e1be Address:192.168.121.186:8300} {Suffrage:Voter ID:3919f158-7d07-ddfd-530e-8b048e5aff4f Address:192.168.121.170:8300} {Suffrage:Voter ID:41f37e2a-bda7-f15a-b714-2cfbd8193f72 Address:192.168.121.134:8300}]
latest_configuration_index = 0
num_peers = 2
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 2
runtime:
arch = amd64
cpu_count = 1
goroutines = 104
max_procs = 1
os = linux
version = go1.13.7
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 2
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 6
members = 3
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 5
members = 3
query_queue = 0
query_time = 1
Operating system: Debian 10, amd64.
Hi @akhayyat
This looks to be an actual bug, and this is not working as expected. Tagging for the team to look through :)
Thank you for finding this!
I had a quick look at the code, and here is what I found: https://github.com/hashicorp/consul/blob/32daa2b27c3e07446021de484e23fa9c8e1465d0/command/connect/envoy/envoy.go#L414-L426
c.grpcAddr will default to the value of this environment variable: https://github.com/hashicorp/consul/blob/32daa2b27c3e07446021de484e23fa9c8e1465d0/api/api.go#L69-L73
So the scheme is set to https only if CONSUL_GRPC_ADDR contains an https address; an https address in CONSUL_HTTP_ADDR is not taken into account.
This might be the correct behaviour. I believe CONSUL_GRPC_ADDR is the address used by envoy to connect. When CONSUL_GRPC_ADDR is not set, it looks like we default c.grpcAddr to localhost:PORT: https://github.com/hashicorp/consul/blob/32daa2b27c3e07446021de484e23fa9c8e1465d0/command/connect/envoy/envoy.go#L347-L359.
Maybe in this case we need to be defaulting it to an address with a scheme, so that the subsequent checks detect https correctly. I'm not yet sure how we would find the scheme at this place, but I will have another look.
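To make the idea above concrete, here is a minimal sketch of what "defaulting to an address with a scheme" could look like. It is not Consul's actual code: `resolveGRPCTarget`, `grpcTarget`, `stripScheme`, and the port value 8502 used in `main` are all hypothetical names and assumptions made for illustration. The sketch only shows that, when no gRPC address is given, the TLS decision could be inherited from the HTTP address scheme or from CONSUL_HTTP_SSL.

```go
package main

import (
	"fmt"
	"strings"
)

// grpcTarget is a hypothetical result type: the address Envoy should dial
// and whether TLS should be used for that connection.
type grpcTarget struct {
	Addr   string
	UseTLS bool
}

// stripScheme removes a leading http:// or https:// prefix, if present.
func stripScheme(addr string) string {
	lower := strings.ToLower(addr)
	for _, prefix := range []string{"https://", "http://"} {
		if strings.HasPrefix(lower, prefix) {
			return addr[len(prefix):]
		}
	}
	return addr
}

// resolveGRPCTarget is a hypothetical helper, not Consul's implementation.
// If grpcAddr is set, its own scheme decides whether TLS is used.
// If it is empty, the address defaults to localhost:defaultPort and the TLS
// decision falls back to the HTTP settings instead of silently being false.
func resolveGRPCTarget(grpcAddr, httpAddr string, httpSSL bool, defaultPort int) grpcTarget {
	if grpcAddr != "" {
		useTLS := strings.HasPrefix(strings.ToLower(grpcAddr), "https://")
		return grpcTarget{Addr: stripScheme(grpcAddr), UseTLS: useTLS}
	}
	useTLS := httpSSL || strings.HasPrefix(strings.ToLower(httpAddr), "https://")
	return grpcTarget{Addr: fmt.Sprintf("localhost:%d", defaultPort), UseTLS: useTLS}
}

func main() {
	// The case from this issue: https HTTP address, CONSUL_HTTP_SSL unset,
	// no gRPC address. The port 8502 is an assumed default for illustration.
	t := resolveGRPCTarget("", "https://127.0.0.1:8501", false, 8502)
	fmt.Printf("%s tls=%v\n", t.Addr, t.UseTLS)
}
```

With this kind of fallback, the case reported here would select TLS even though only CONSUL_HTTP_ADDR carries the https scheme. Whether the gRPC listener should really inherit the HTTP scheme is a design question for the maintainers.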
Thanks for looking into this issue.
What I considered buggy is that this behavior is inconsistent with the documentation of the CONSUL_HTTP_ADDR environment variable, which states:
If the https:// scheme is used, CONSUL_HTTP_SSL is implied to be true.
In my case, with CONSUL_HTTP_ADDR set to use the https:// scheme, the behavior was different when CONSUL_HTTP_SSL was set or not, so it was not exactly implied.
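As a small illustration of the documented rule quoted above, the following sketch models "https:// implies CONSUL_HTTP_SSL" as a single function. `sslImplied` is a hypothetical helper written for this comment, not part of the Consul API; it takes the two environment variable values as plain strings so the three cases can be compared directly.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sslImplied is a hypothetical helper modelling the documented rule:
// TLS is used if the address has an https:// scheme, or if the SSL
// variable parses as a true boolean. An unset or invalid SSL value
// counts as false.
func sslImplied(httpAddr, httpSSL string) bool {
	if strings.HasPrefix(strings.ToLower(httpAddr), "https://") {
		return true
	}
	enabled, err := strconv.ParseBool(httpSSL)
	return err == nil && enabled
}

func main() {
	// https scheme, SSL variable unset: the case reported in this issue.
	fmt.Println(sslImplied("https://127.0.0.1:8501", ""))
	// No scheme, SSL variable set to true.
	fmt.Println(sslImplied("127.0.0.1:8501", "true"))
	// No scheme, SSL variable unset.
	fmt.Println(sslImplied("127.0.0.1:8500", ""))
}
```

Under the documented rule the first two cases should behave identically, which is not what I observed with Envoy.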
You are correct, there is definitely a bug here.
I enabled TLS in my Nomad/Consul clusters and all of them started having this issue on the Nomad Agents.
failed to complete security handshake from "127.0.0.1
The only way I was able to resolve the issue was by rebooting the VM.
I also had to ensure the following is defined in the Nomad config:
tls:
  http: false
  rpc: true
  rpc_upgrade_mode: true