Consul: [ERR] consul: failed to establish leadership: unknown CA provider ""

Created on 14 Nov 2018 · 21 Comments · Source: hashicorp/consul

Overview of the Issue

I'm trying to use consul connect and now my cluster is partially broken.

Reproduction Steps

I'm not sure how exactly to reproduce. I've been poking at this cluster too much to have clear steps. I'll try more tomorrow.

I put this in /etc/consul.d/config.hcl:

connect {
  enabled = true
}

When I upgraded consul from 1.2.3 to 1.3.0 and added that config, something went wrong and leader election failed.

I manually recovered by poking peers.json and now the cluster has a leader:

Node      ID                                    Address             State     Voter  RaftProtocol
consul    375e6536-a4d0-5770-3b7d-98dbe4a65686  192.168.0.100:8300  follower  true   3
consul-a  65000f53-2857-4daf-19b8-61e5bb4492c0  192.168.0.101:8300  follower  true   3
consul-b  ef31e65f-4536-14c0-4c8d-fdb1f6725922  192.168.0.102:8300  leader    true   3
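
For reference, the peers.json I wrote looked roughly like this (the Raft protocol 3 format with id/address/non_voter per server, as described in the outage recovery guide; adjust IDs and addresses to your own cluster):

```json
[
  {
    "id": "375e6536-a4d0-5770-3b7d-98dbe4a65686",
    "address": "192.168.0.100:8300",
    "non_voter": false
  },
  {
    "id": "65000f53-2857-4daf-19b8-61e5bb4492c0",
    "address": "192.168.0.101:8300",
    "non_voter": false
  },
  {
    "id": "ef31e65f-4536-14c0-4c8d-fdb1f6725922",
    "address": "192.168.0.102:8300",
    "non_voter": false
  }
]
```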

However, something is still wrong with the cluster. The UI is showing stale data (https://github.com/hashicorp/consul/issues/4923, maybe?) and vault is stuck in standby.

$ vault status
Key                    Value
---                    -----
Seal Type              shamir
Initialized            true
Sealed                 false
Total Shares           1
Threshold              1
Version                0.11.3
Cluster Name           vault-cluster-641f43ea
Cluster ID             e9570865-65db-1ea6-738b-c759d68fbfdd
HA Enabled             true
HA Cluster             n/a
HA Mode                standby
Active Node Address    <none>

$ vault secrets enable pki
Error enabling: Error making API request.

URL: POST http://127.0.0.1:8200/v1/sys/mounts/pki
Code: 500. Errors:

* local node not active but active cluster node not found

I would like to use vault for consul connect, but was just trying to get it working with the simplest setup first. How do I migrate to vault from this broken connect provider?

From my reading of the docs, the CA setup was all supposed to be automatic, so I don't know how to do it manually. I think I need to run consul connect ca set-config -config-file ca.json, but I don't know what to put in ca.json (also, why no HCL support here?).
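
My best guess at a ca.json, based on the API docs, would be something like this (field names are my guess; the Config keys shown are for the built-in consul provider, and presumably a Vault setup would use "Provider": "vault" with its own Config keys):

```json
{
  "Provider": "consul",
  "Config": {
    "LeafCertTTL": "72h",
    "RotationPeriod": "2160h"
  }
}
```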

Consul info for both Client and Server


Client info

agent:
    check_monitors = 0
    check_ttls = 1
    checks = 1
    services = 1
build:
    prerelease = 
    revision = e8757838
    version = 1.3.0
consul:
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 8
    goroutines = 52
    max_procs = 8
    os = linux
    version = go1.11.1
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 168
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1708
    members = 5
    query_queue = 0
    query_time = 1


Server info

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease = 
    revision = e8757838
    version = 1.3.0
consul:
    bootstrap = false
    known_datacenters = 1
    leader = true
    leader_addr = 192.168.0.102:8300
    server = true
raft:
    applied_index = 6311623
    commit_index = 6311623
    fsm_pending = 0
    last_contact = 0
    last_log_index = 6311623
    last_log_term = 897
    last_snapshot_index = 6311248
    last_snapshot_term = 700
    latest_configuration = [{Suffrage:Voter ID:375e6536-a4d0-5770-3b7d-98dbe4a65686 Address:192.168.0.100:8300} {Suffrage:Voter ID:65000f53-2857-4daf-19b8-61e5bb4492c0 Address:192.168.0.101:8300} {Suffrage:Voter ID:ef31e65f-4536-14c0-4c8d-fdb1f6725922 Address:192.168.0.102:8300}]
    latest_configuration_index = 1
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 897
runtime:
    arch = amd64
    cpu_count = 8
    goroutines = 94
    max_procs = 8
    os = linux
    version = go1.11.1
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 168
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 2
    member_time = 1708
    members = 7
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 228
    members = 3
    query_queue = 0
    query_time = 1

Operating system and Environment details

I'm running everything inside docker containers on a single host.

Log Fragments

consul_b_1  | bootstrap_expect > 0: expecting 3 servers
consul_b_1  |     2018/11/14 04:04:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:05:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:06:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:07:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:08:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:09:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:10:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
consul_b_1  |     2018/11/14 04:11:31 [ERR] consul: failed to establish leadership: unknown CA provider ""
needs-investigation theme/connect type/bug

All 21 comments

Hey @WyseNynja, thanks for the bug report. That config you gave for connect (enabled = true) should be enough to get things working in both 1.2.3 and 1.3.0, and the empty provider type should be an impossible state to get into since the provider defaults to "consul" in both of those versions.

I tried a few things to reproduce the state you show where it's failing to establish leadership but couldn't get the same result:

  • Upgrading a lone server from 1.2.3 with Connect disabled to 1.3.0 with just the connect { enabled = true } config.
  • A rolling upgrade of a set of 3 servers from 1.2.3 with Connect disabled to 1.3.0 with the above config.
  • Partially upgrading a set of 3 servers so that 2 were running 1.3.0 and the other was on 1.2.3, then getting one of the 1.3.0 servers to become leader and bootstrap the CA before restarting it to cause the 1.2.3 server to become leader again (in case some state was backwards incompatible here).

Any steps you have to help reproduce this are appreciated. It may take a fresh cluster to do so if you've got things back to a working state though, as it sounds like the invalid CA config that was preventing the leader election has been fixed (or wasn't an issue on another server).

@kyhavlov I'm experiencing the same issue (without Vault). After upgrading the cluster to 1.3.0 (and then 1.3.1) and enabling connect I receive the same error:

2018/11/27 17:40:05 [ERR] consul: failed to establish leadership: unknown CA provider ""

After adding some debug statements to initializeCAConfig() I can see that the CA config returned from the FSM state is non-nil but empty, and so the empty config is returned.

2018/11/27 17:40:05 [DEBUG] consul: (agy) initializeCAConfig state.CAConfig: &structs.CAConfiguration{ClusterID:"", Provider:"", Config:map[string]interface {}(nil), RaftIndex:structs.RaftIndex{CreateIndex:0x0, ModifyIndex:0x0}}
2018/11/27 17:40:05 [DEBUG] consul: (agy) initializeCAConfig modIndex: 0x0
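
To illustrate the failure mode, here's a simplified, hypothetical sketch (not Consul's actual code; the type and function names are stand-ins) of how a non-nil but zero-valued config falls through the provider switch and produces exactly this error:

```go
package main

import "fmt"

// CAConfiguration is a simplified stand-in for structs.CAConfiguration.
type CAConfiguration struct {
	Provider string
	Config   map[string]interface{}
}

// newCAProvider mimics the leader's provider lookup: an empty Provider
// string matches no case and produces the observed error.
func newCAProvider(conf *CAConfiguration) (string, error) {
	switch conf.Provider {
	case "consul":
		return "consul CA provider", nil
	case "vault":
		return "vault CA provider", nil
	default:
		return "", fmt.Errorf("unknown CA provider %q", conf.Provider)
	}
}

func main() {
	// A non-nil but zero-valued config, as in the debug output above.
	_, err := newCAProvider(&CAConfiguration{})
	fmt.Println(err) // unknown CA provider ""
}
```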

I attempted to set the ca_provider to consul as well, but this doesn't seem to make a difference.

$ consul connect ca get-config
{
    "Provider": "",
    "Config": null,
    "CreateIndex": 0,
    "ModifyIndex": 0
}

Dumping the leader's agent config shows that connect is enabled and, depending on whether I set the provider, it's either "" or "consul".

$ curl -s localhost:8500/v1/agent/self | jq .DebugConfig.ConnectEnabled
true
$ curl -s localhost:8500/v1/agent/self | jq .DebugConfig.ConnectCAProvider
""

I cannot reproduce this on test clusters with the same scenario. The problematic cluster is moderate in size with ~1500 nodes.

I have rotated all of the consul servers in this cluster with new machines and the issue remains.

Attempting to update the CA config also does not work:

$ cat connect_config.json
{
  "Provider": "consul",
  "Config": {
    "LeafCertTTL": "72h",
    "RotationPeriod": "2160h",
    "PrivateKey": "",
    "RootCert": ""
  }
}
$ curl -v -X PUT -d @connect_config.json http://127.0.0.1:8500/v1/connect/ca/configuration
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 8500 (#0)
> PUT /v1/connect/ca/configuration HTTP/1.1
> User-Agent: curl/7.35.0
> Host: 127.0.0.1:8500
> Accept: */*
> Content-Length: 135
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 135 out of 135 bytes
< HTTP/1.1 500 Internal Server Error
< Vary: Accept-Encoding
< Date: Tue, 27 Nov 2018 22:44:41 GMT
< Content-Length: 57
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host 127.0.0.1 left intact
rpc error making call: internal error: CA provider is nil

While the server is in the failed-to-establish-leadership state, leadership is not given up and an election doesn't occur. Reads and writes seem to work, but reads that are forced to be consistent fail.

$ curl -s -X PUT -d "foo" http://localhost:8500/v1/kv/bar; echo
true
$ curl -s -X GET http://localhost:8500/v1/kv/bar | jq .
[
  {
    "LockIndex": 0,
    "Key": "bar",
    "Flags": 0,
    "Value": "Zm9v",
    "CreateIndex": 284603937,
    "ModifyIndex": 284603961
  }
]






$ curl -s -X GET -d "foo" http://localhost:8500/v1/kv/bar?consistent=; echo
rpc error making call: Not ready to serve consistent reads

Killing the leader when in this state does not trigger an election. The follower nodes do report that there is no leader (as expected).

I'm unsure what the expected behaviour should be when in this state.

Regarding this comment: https://github.com/hashicorp/consul/issues/4954#issuecomment-442194100

And the error message:

2018/11/27 17:40:05 [ERR] consul: failed to establish leadership: unknown CA provider ""

It may be worth investigating if https://github.com/hashicorp/consul/issues/5016 is showing a symptom of a similar bug. Adding a new server in this case and restoring from a snapshot manually in the case of #5016 could be hitting the same condition. That case includes a repro.

The Docker image referenced in https://github.com/hashicorp/consul/issues/5016 doesn't seem to include the snapshot (unless I'm missing something obvious).

I can get a test cluster in the same state as my broken one by importing the raft db. But I cannot reproduce any other way.

I should also note that at no point did I manually restore a snapshot.

Re: the snapshot - that's correct; there's no snapshot built into the image currently - we had been mounting the snapshot from a local directory into the container as a volume and then running the restore from there.

The issue we are seeing there is also from a 1.2.1 to 1.2.4 upgrade; snapshot restores were working ok with 1.2.1 but are producing that error with 1.2.4. It's a little different than this case, which is why I wanted to open a separate issue for it (although they do sound related).

Thanks @agy and @PurrBiscuit for the information in both cases...we're continuing to look into this.

@agy Can you clarify what version you upgraded from to get to 1.3.0? Our current hunch is that you could be seeing a symptom of https://github.com/hashicorp/consul/pull/4535, which was fixed in 1.2.3. If you were ever on a previous version utilizing Connect CA configuration that wrote to the state store, you'd be seeing this, as the state store would have been corrupted.

If this is the case, we're considering adding something like a -force option to connect ca set-config that would allow you to override the CA configuration without going through the rotation mechanism, which would fail with the invalid configuration (which it seems you have, based on what we've seen) in your state store.

Alternatively (the better option we think) we could add automatic handling of this corrupted CA configuration which would then allow you to bypass this issue by treating this as a nil configuration.
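
Conceptually, the automatic handling could look something like the following hedged sketch (not the actual patch; the type and function names are stand-ins): a stored config with an empty Provider is treated as if no config existed and replaced with the defaults.

```go
package main

import "fmt"

// CAConfiguration is a simplified stand-in for structs.CAConfiguration.
type CAConfiguration struct {
	Provider string
	Config   map[string]interface{}
}

// normalizeCAConfig sketches the "treat corrupted config as nil" idea:
// a stored config with an empty Provider is replaced by a default
// consul-provider config instead of being passed through as-is.
func normalizeCAConfig(stored *CAConfiguration) *CAConfiguration {
	if stored == nil || stored.Provider == "" {
		return &CAConfiguration{
			Provider: "consul",
			Config:   map[string]interface{}{},
		}
	}
	return stored
}

func main() {
	// A corrupted, empty config is normalized back to the default provider.
	fixed := normalizeCAConfig(&CAConfiguration{})
	fmt.Println(fixed.Provider) // consul
}
```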

If we added something like that could you potentially jump to 1.4.1-dev (master)?

@pearkes 0.8.3 -> 1.3.0 -> 1.3.1.

Note: I had tested this upgrade path on a newly provisioned test cluster, and I have run the same upgrade in all the testing that I've done so far.

Since I'm able to reproduce this issue on a test cluster by importing the raft store for the current broken cluster, I can test whatever fixes that you propose. I agree that your alternate, "better" solution is preferable.

Upgrading to 1.4.x is problematic because I have not had the opportunity to test the new ACL system.

Unfortunately, since I rotated all the members of the broken cluster I cannot verify if I had enabled connect when the cluster was 1.3.0 or only once it was 1.3.1.

@agy Our concern and assumption was that it corrupted the state in 1.2.0 - 1.2.2. If you never ran those versions (regardless of where you're coming from now) that is relatively confusing but doesn't necessarily make the fix different.

@agy can you also clarify from this comment:

I can get a test cluster in the same state as my broken one by importing the raft db

What operation did you do here? consul snapshot save/restore or did you copy the actual raft DB (something in the data directory, if so which file(s))?

@pearkes I have done both.

The snapshot save/restore fails on restore with:

Error restoring snapshot: Unexpected response code: 500 (unknown CA provider "")

I have also copied over a tarball of the Consul data_dir from the broken cluster to the test cluster. I wouldn't normally ever do this, but it was the only way I was able to get a test cluster into the same state as the broken one.

What I do is:

  • Have three test nodes with 1.3.1 installed.
  • Duplicate the configuration from the broken cluster (replacing node names, addresses, etc where appropriate).
  • Remove node-id.
  • Start all three of the nodes.
  • Stop one of the nodes.
  • Create peers.json with the newly created node-ids and IPs.
  • Start the stopped node.

As previously mentioned, this is far from ideal and is only to allow me to attempt some workarounds/fixes.

File listing:

$ tar tf consul.tar
consul/
consul/proxy/
consul/proxy/snapshot.json
consul/serf/
consul/serf/remote.snapshot
consul/serf/local.snapshot
consul/serf/local.keyring
consul/serf/remote.keyring
consul/raft/
consul/raft/snapshots/
consul/raft/snapshots/12554-284583919-1543427019761/
consul/raft/snapshots/12554-284583919-1543427019761/state.bin
consul/raft/snapshots/12554-284583919-1543427019761/meta.json
consul/raft/snapshots/12554-284600350-1543430224806/
consul/raft/snapshots/12554-284600350-1543430224806/state.bin
consul/raft/snapshots/12554-284600350-1543430224806/meta.json
consul/raft/raft.db
consul/raft/peers.info
consul/node-id
consul/checkpoint-signature

@WyseNynja you mentioned that you had a similar issue when upgrading 1.2.3 to 1.3.0. Did you have connect enabled pre-1.2.3 ?

I don't believe so. I'm running 1.4 now without issue. I am able to rebuild this cluster easily and it didn't happen when I started fresh.

I opened #5061 which should fix this - any older versions from 1.2.3 forward will be able to cherry pick this fix using these steps: https://github.com/hashicorp/consul/issues/5016#issuecomment-444728950

I hit this issue when enabling connect in 1.4.0, too.

2018/12/06 15:46:41 [ERR] consul: failed to establish leadership: unknown CA provider ""
2018/12/06 15:47:22 [ERR] http: Request GET /v1/kv/vault/core/lock?consistent=, error: Not ready to serve consistent reads from=127.0.0.1:45468
2018/12/06 15:47:31 [ERR] http: Request GET /v1/kv/vault/core/lock?consistent=, error: Not ready to serve consistent reads from=127.0.0.1:45506
2018/12/06 15:47:41 [ERR] http: Request GET /v1/kv/vault/core/lock?consistent=, error: Not ready to serve consistent reads from=127.0.0.1:45442

#5061 solved my problem.

I was able to reproduce this error with the following server configuration

{
  "addresses": {
    "dns": "0.0.0.0",
    "http": "127.0.0.1",
    "https": "0.0.0.0",
    "grpc": "0.0.0.0"
  },
  "bootstrap_expect": 5,
  "ca_file": "/opt/consul/certs/ca_cert.pem",
  "cert_file": "/opt/consul/certs/local_cert.pem",
  "data_dir": "/opt/consul/data",
  "discard_check_output": null,
  "discovery_max_stale": null,
  "enable_script_checks": false,
  "enable_local_script_checks": false,
  "encrypt": "72Tle7Mf5E72Zpq/cLz9+g==",
  "encrypt_verify_incoming": true,
  "encrypt_verify_outgoing": true,
  "key_file": "/opt/consul/certs/local_key.pem",
  "log_level": "DEBUG",
  "log_file": "/opt/consul/log/consul.log",
  "log_rotate_bytes": 1048576,
  "pid_file": "/opt/consul/pid/consul.pid",
  "ports": {},
  "retry_join": [
    "10.126.0.178",
    "10.126.0.146",
    "10.126.0.150",
    "10.126.0.144",
    "10.126.0.145"
  ],
  "server": true,
  "start_join": [
    "10.126.0.178",
    "10.126.0.146",
    "10.126.0.150",
    "10.126.0.144",
    "10.126.0.145"
  ],
  "verify_incoming": true,
  "verify_incoming_https": true,
  "verify_incoming_rpc": true,
  "verify_outgoing": true,
  "connect": {
    "ca_config": {
      "private_key": "/opt/consul/certs/ca_key.pem",
      "root_cert": "/opt/consul/certs/ca_cert.pem",
      "csr_max_per_second": 100,
      "csr_max_concurrent": 4,
      "leaf_cert_ttl": "4h"
    },
    "ca_provider": "consul",
    "enabled": true
  }
}

This is on consul 1.5.1

My agents report

[Err] consul.watch:  Watch (type: connect_leaf) errored: Unexpected response code: 500 (rpc error making call: internal error: CA provider is nil), retry in 5s
[Err] consul.watch:  Watch (type: connect_leaf) errored: Unexpected response code: 500 (rpc error making call: internal error: CA provider is nil), retry in 20s
[Err] consul.watch:  Watch (type: connect_leaf) errored: Unexpected response code: 500 (rpc error making call: internal error: CA provider is nil), retry in 45s
[Err] consul.watch:  Watch (type: connect_leaf) errored: Unexpected response code: 500 (rpc error making call: internal error: CA provider is nil), retry in 1m20s
[Err] consul.watch:  Watch (type: connect_leaf) errored: Unexpected response code: 500 (rpc error making call: internal error: CA provider is nil), retry in 2m5s

I can remove this configuration section, but when I do I start getting registration errors that RPC resources are exhausted (try again). So I assumed I _should_ be increasing the number of certificate requests per second (and in parallel), as I'm kicking off some ~175 sidecars across the whole cluster on deployment.
