Consul: Connect with Envoy: gRPC over HTTPS in a new cluster only works when CONSUL_HTTP_SSL is true

Created on 20 Mar 2020 · 5 comments · Source: hashicorp/consul

Overview of the Issue

When using Connect with Envoy, Envoy connects to the Consul agent over HTTPS the first time only when CONSUL_HTTP_SSL is true, even if CONSUL_HTTP_ADDR is set to an https:// address. After it has connected once (with CONSUL_HTTP_SSL enabled), it can reconnect even without CONSUL_HTTP_SSL set.

Reproduction Steps

  1. Create a cluster with HTTP (8500), HTTPS (8501), and GRPC (8502) ports enabled, and with Connect enabled
  2. Define a service with a sidecar proxy configured to use CONSUL_HTTP_ADDR=https://127.0.0.1:8501 (without setting CONSUL_HTTP_SSL)
  3. Envoy fails to connect to the Consul agent:

    Agent log:

    agent: grpc: Server.Serve failed to complete security handshake from "127.0.0.1:50030": tls: first record does not look like a TLS handshake
    

    Envoy log:

    router decoding headers:
    ':method', 'POST'
    ':path', '/envoy.service.discovery.v2.AggregatedDiscoveryService/StreamAggregatedResources'
    ':authority', 'local_agent'
    ':scheme', 'http'
    'te', 'trailers'
    'content-type', 'application/grpc'
    'x-consul-token', ''
    'x-envoy-internal', 'true'
    'x-forwarded-for', '192.168.121.186'
    ...
    Unknown error code 104 details Connection reset by peer
    

When CONSUL_HTTP_SSL is set to true, the ':scheme', 'http' line becomes ':scheme', 'https'.

Here is a full Ansible playbook that reproduces this problem:
https://gist.github.com/akhayyat/4a6a5718425ac4addfef3fa9bb932c65
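The precedence the documentation describes can be sketched in shell (a minimal, hypothetical illustration; the address mirrors the reproduction steps above, and `implied_ssl` is an invented variable name):

```shell
# Hypothetical sketch: per the docs, an https:// scheme in
# CONSUL_HTTP_ADDR should imply CONSUL_HTTP_SSL=true, so the explicit
# variable should not be needed in step 2 of the reproduction.
CONSUL_HTTP_ADDR="https://127.0.0.1:8501"

case "$CONSUL_HTTP_ADDR" in
  https://*) implied_ssl=true ;;                      # scheme implies TLS
  *)         implied_ssl="${CONSUL_HTTP_SSL:-false}" ;;
esac

echo "implied_ssl=$implied_ssl"
```

If Envoy bootstrap honored this implication, the `:scheme` header above would be `https` without setting CONSUL_HTTP_SSL.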

Consul info for both Client and Server


Client info

agent:
        check_monitors = 0
        check_ttls = 0
        checks = 2
        services = 2
build:  
        prerelease =
        revision = 2cf0a3c8
        version = 1.7.1
consul: 
        acl = disabled
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = 192.168.121.170:8300
        server = true
raft:   
        applied_index = 61
        commit_index = 61
        fsm_pending = 0
        last_contact = 21.84253ms
        last_log_index = 61
        last_log_term = 2
        last_snapshot_index = 0
        last_snapshot_term = 0
        latest_configuration = [{Suffrage:Voter ID:61915f30-6707-76d2-8ab2-d8f5ab82e1be Address:192.168.121.186:8300} {Suffrage:Voter ID:3919f158-7d07-ddfd-530e-8b048e5aff4f Address:192.168.121.170:8300} {Suffrage:Voter ID:41f37e2a-bda7-f15a-b714-2cfbd8193f72 Address:192.168.121.134:8300}]
        latest_configuration_index = 0
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 2
runtime:
        arch = amd64
        cpu_count = 1
        goroutines = 104
        max_procs = 1
        os = linux  
        version = go1.13.7
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 2
        failed = 0  
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 6
        members = 3 
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0  
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 5
        members = 3 
        query_queue = 0
        query_time = 1


Server info

agent:
        check_monitors = 0
        check_ttls = 0
        checks = 2  
        services = 2
build:
        prerelease =
        revision = 2cf0a3c8
        version = 1.7.1
consul:
        acl = disabled
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = 192.168.121.170:8300
        server = true
raft:
        applied_index = 73
        commit_index = 73
        fsm_pending = 0
        last_contact = 38.422636ms
        last_log_index = 74
        last_log_term = 2
        last_snapshot_index = 0
        last_snapshot_term = 0
        latest_configuration = [{Suffrage:Voter ID:61915f30-6707-76d2-8ab2-d8f5ab82e1be Address:192.168.121.186:8300} {Suffrage:Voter ID:3919f158-7d07-ddfd-530e-8b048e5aff4f Address:192.168.121.170:8300} {Suffrage:Voter ID:41f37e2a-bda7-f15a-b714-2cfbd8193f72 Address:192.168.121.134:8300}]
        latest_configuration_index = 0
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 2
runtime:
        arch = amd64
        cpu_count = 1
        goroutines = 104
        max_procs = 1
        os = linux
        version = go1.13.7
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 2
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 6
        members = 3 
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0  
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 5
        members = 3 
        query_queue = 0
        query_time = 1

Operating system and Environment details

Debian 10, amd64.

Labels: type/bug


All 5 comments

Hi @akhayyat
This looks to be an actual bug; it is not working as expected. Tagging for the team to look through :)

Thank you for finding this!

I had a quick look at the code, and here is what I found: https://github.com/hashicorp/consul/blob/32daa2b27c3e07446021de484e23fa9c8e1465d0/command/connect/envoy/envoy.go#L414-L426

c.grpcAddr will default to the value of this environment variable: https://github.com/hashicorp/consul/blob/32daa2b27c3e07446021de484e23fa9c8e1465d0/api/api.go#L69-L73

So the scheme is set to https when CONSUL_GRPC_ADDR contains an https address, but CONSUL_HTTP_ADDR is never consulted.

This might be the correct behaviour. I believe CONSUL_GRPC_ADDR is the address used by envoy to connect. When CONSUL_GRPC_ADDR is not set, it looks like we default c.grpcAddr to localhost:PORT: https://github.com/hashicorp/consul/blob/32daa2b27c3e07446021de484e23fa9c8e1465d0/command/connect/envoy/envoy.go#L347-L359.

Maybe in this case we need to be defaulting it to an address with a scheme, so that the subsequent checks detect https correctly. I'm not yet sure how we would find the scheme at this place, but I will have another look.
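One possible shape of that default (a hypothetical sketch, not the actual patch; `defaultGRPCScheme` is an invented helper name) would be to fall back to the scheme of the configured HTTP address when the gRPC address has none:

```go
package main

import (
	"fmt"
	"strings"
)

// defaultGRPCScheme sketches the idea discussed above: when
// CONSUL_GRPC_ADDR is unset, derive the scheme for the defaulted
// localhost gRPC address from CONSUL_HTTP_ADDR instead of assuming
// plain http.
func defaultGRPCScheme(httpAddr string) string {
	if strings.HasPrefix(strings.ToLower(httpAddr), "https://") {
		return "https"
	}
	return "http"
}

func main() {
	fmt.Println(defaultGRPCScheme("https://127.0.0.1:8501")) // https
	fmt.Println(defaultGRPCScheme("127.0.0.1:8500"))         // http
}
```

With something like this, the subsequent scheme checks in envoy.go would detect https correctly even when only CONSUL_HTTP_ADDR carries the scheme.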

Thanks for looking into this issue.

What I considered buggy is that this behavior is inconsistent with the documentation of the CONSUL_HTTP_ADDR environment variable, which states:

If the https:// scheme is used, CONSUL_HTTP_SSL is implied to be true.

In my case, with CONSUL_HTTP_ADDR set to use the https:// scheme, the behavior was different when CONSUL_HTTP_SSL was set or not, so it was not exactly implied.

You are correct, there is definitely a bug here.

I enabled TLS in my Nomad/Consul clusters and all of them started having this issue on the Nomad Agents.

failed to complete security handshake from "127.0.0.1

The only way I was able to resolve the issue was by rebooting the VM.

I also had to ensure the following is defined in the nomad config


  tls {
    http = false
    rpc  = true
    rpc_upgrade_mode = true
  }
