We have a Consul setup spanning 4 DCs.
Two DCs have a 3-server cluster, and two DCs have a single-server "cluster".
One 3-server cluster (management) is connected to all the other DCs, but the other DCs (staging, production, ci) can't connect to each other.
One of the management Consul servers was replaced. It left, but when the new server replaced it, the initial server was still part of the peer list. I had to use consul operator to remove it manually, but that was not sufficient to trigger a leader election.
The election only happened after restarting all the servers in the cluster (see the sketch below).
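For reference, the cleanup looked roughly like this (a sketch: the raft address is the replaced server's, taken from the logs below, and systemd is assumed as the service manager):
$ consul operator raft list-peers
$ consul operator raft remove-peer -address=10.10.11.125:8300
# even after removing the peer, no election happened until every server in the cluster was restarted:
$ sudo systemctl restart consul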
We noticed that DNS resolution using Consul was failing in our staging and ci DCs after a few days.
Restarting the server seems to solve the issue.
After looking at the logs, it seems that the server is trying to connect to the server which has been removed.
But this server doesn't appear when running consul members -wan.
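A quick way to confirm the mismatch (assuming logs go to syslog; the address is the removed server's):
$ grep 'RPC failed to server 10.10.11.125' /var/log/syslog
$ consul members -wan | grep 10.10.11.125
The first command keeps matching while the second returns nothing.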
Client info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 1
    services = 4
build:
    prerelease =
    revision = 9a494b5f
    version = 1.0.6
consul:
    bootstrap = true
    known_datacenters = 4
    leader = true
    leader_addr = 10.13.11.146:8300
    server = true
raft:
    applied_index = 796429
    commit_index = 796429
    fsm_pending = 0
    last_contact = 0
    last_log_index = 796429
    last_log_term = 8
    last_snapshot_index = 794634
    last_snapshot_term = 8
    latest_configuration = [{Suffrage:Voter ID:b1530d34-c999-ebe3-fcf5-38971630b07b Address:10.13.11.146:8300}]
    latest_configuration_index = 1
    num_peers = 0
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 8
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 101
    max_procs = 1
    os = linux
    version = go1.9.3
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 1
    event_time = 15
    failed = 0
    health_score = 0
    intent_queue = 1
    left = 0
    member_time = 17
    members = 1
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 381
    members = 8
    query_queue = 0
Server info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 1
    services = 4
build:
    prerelease =
    revision = 9a494b5f
    version = 1.0.6
consul:
    bootstrap = false
    known_datacenters = 4
    leader = false
    leader_addr = 10.10.13.178:8300
    server = true
raft:
    applied_index = 56167849
    commit_index = 56167849
    fsm_pending = 0
    last_contact = 16.7967ms
    last_log_index = 56167849
    last_log_term = 1008
    last_snapshot_index = 56165684
    last_snapshot_term = 1008
    latest_configuration = [{Suffrage:Voter ID:1f872edb-b088-6708-04a3-fc396ef2d360 Address:10.10.13.178:8300} {Suffrage:Voter ID:9ceaa926-af42-0b01-7085-f3ef11b6f6f4 Address:10.10.14.47:8300} {Suffrage:Voter ID:40e76095-2c8d-0b7e-37d9-15d13a3ac9f5 Address:10.10.11.65:8300}]
    latest_configuration_index = 55612905
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 1008
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 131
    max_procs = 1
    os = linux
    version = go1.9.3
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 318
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1363
    members = 20
    query_queue = 0
    query_time = 46
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 381
    members = 8
    query_queue = 0
    query_time = 1
Ubuntu 16.04
Consul 1.0.6
Aug 26 22:52:56 consul001-0de9 consul[24178]: 2018/08/26 22:52:56 [ERR] consul: RPC failed to server 10.10.11.125:8300 in DC "mgt-jp1-aws": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 26 22:52:56 consul001-0de9 consul[24178]: 2018/08/26 22:52:56 [ERR] consul.acl: Failed to get policy from ACL datacenter: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
$ consul members -wan
Node Address Status Type Build Protocol DC Segment
consul001-01cd.prd-jp1-aws 10.11.13.245:8302 alive server 1.0.6 2 prd-jp1-aws <all>
consul001-0311.stg-jp1-aws 10.12.11.62:8302 alive server 1.0.6 2 stg-jp1-aws <all>
consul001-0336.mgt-jp1-aws 10.10.11.65:8302 alive server 1.0.6 2 mgt-jp1-aws <all>
consul001-091f.prd-jp1-aws 10.11.14.174:8302 alive server 1.0.6 2 prd-jp1-aws <all>
consul001-09c1.mgt-jp1-aws 10.10.14.47:8302 alive server 1.0.6 2 mgt-jp1-aws <all>
consul001-0a82.mgt-jp1-aws 10.10.13.178:8302 alive server 1.0.6 2 mgt-jp1-aws <all>
consul001-0ad0.prd-jp1-aws 10.11.11.188:8302 alive server 1.0.6 2 prd-jp1-aws <all>
consul001-0de9.ci-jp1-aws 10.13.11.146:8302 alive server 1.0.6 2 ci-jp1-aws <all>
Try performing 'consul force-leave XYZ' on the server that is complaining about reaching the missing server, where XYZ is the node name of the missing server.
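For reference, with the node name as it appears in consul members -wan (the name here is hypothetical), that would be:
$ consul force-leave consul001-xxxx.mgt-jp1-aws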
The server doesn't show up in consul members at all; would this have an effect?
In the management DC, I already used consul force-leave and consul operator raft remove-peer to make sure the server was no longer there.
But the other DCs were still trying to access it.
I ran consul force-leave on this server, but I believe it had no effect, as they are in different DCs, and the issue still persists.
In other DCs, I've restarted the consul servers, and they don't have any errors anymore.
Aug 27 18:45:11 consul001-0de9 consul[24178]: 2018/08/27 09:45:11 [ERR] consul: RPC failed to server 10.10.11.125:8300 in DC "mgt-jp1-aws": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 27 18:45:11 consul001-0de9 consul[24178]: 2018/08/27 09:45:11 [ERR] dns: rpc error: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 27 18:45:11 consul001-0de9 consul[24178]: 2018/08/27 09:45:11 [ERR] consul: RPC failed to server 10.10.11.125:8300 in DC "mgt-jp1-aws": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 27 18:45:11 consul001-0de9 consul[24178]: 2018/08/27 09:45:11 [ERR] consul.acl: Failed to get policy from ACL datacenter: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 27 18:45:11 consul001-0de9 consul[24178]: 2018/08/27 09:45:11 [ERR] consul: RPC failed to server 10.10.11.125:8300 in DC "mgt-jp1-aws": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 27 18:45:11 consul001-0de9 consul[24178]: 2018/08/27 09:45:11 [ERR] dns: rpc error: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 27 18:45:11 consul001-0de9 consul[24178]: 2018/08/27 09:45:11 [ERR] consul: RPC failed to server 10.10.11.125:8300 in DC "mgt-jp1-aws": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 27 18:45:11 consul001-0de9 consul[24178]: 2018/08/27 09:45:11 [ERR] dns: rpc error: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
$ consul members
Node Address Status Type Build Protocol DC Segment
consul001-0de9 10.13.11.146:8301 alive server 1.0.6 2 ci-jp1-aws <all>
$ consul members -wan
Node Address Status Type Build Protocol DC Segment
consul001-01cd.prd-jp1-aws 10.11.13.245:8302 alive server 1.0.6 2 prd-jp1-aws <all>
consul001-0311.stg-jp1-aws 10.12.11.62:8302 alive server 1.0.6 2 stg-jp1-aws <all>
consul001-0336.mgt-jp1-aws 10.10.11.65:8302 alive server 1.0.6 2 mgt-jp1-aws <all>
consul001-091f.prd-jp1-aws 10.11.14.174:8302 alive server 1.0.6 2 prd-jp1-aws <all>
consul001-09c1.mgt-jp1-aws 10.10.14.47:8302 alive server 1.0.6 2 mgt-jp1-aws <all>
consul001-0a82.mgt-jp1-aws 10.10.13.178:8302 alive server 1.0.6 2 mgt-jp1-aws <all>
consul001-0ad0.prd-jp1-aws 10.11.11.188:8302 alive server 1.0.6 2 prd-jp1-aws <all>
consul001-0de9.ci-jp1-aws 10.13.11.146:8302 alive server 1.0.6 2 ci-jp1-aws <all>
Do you see that this left server persists for a long period of time, or does it get reaped eventually (>30m)? It is possible we need to detect the failed reads and do something proactively, but it _should_ get reaped pretty quickly as-is.
The left server disappeared from all consul members commands quickly, but after 36h the consul server from the second cluster was still trying to reach it when doing DNS resolution.
I had to manually restart the server that was trying to reach the departed server.
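If reap timing were the issue, the window for failed WAN members is controlled by the agent's reconnect_timeout_wan option (default 72h, minimum 8h), e.g. (config path hypothetical):
$ cat /etc/consul.d/reap.json
{
  "reconnect_timeout_wan": "8h"
}
Here, though, the node had already disappeared from the member lists, so the stale address was apparently being held outside the serf pool.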
We experienced the same issue with Consul 0.8.1: a node in datacenter dc_b failed and had to be forcibly removed from the cluster. A new node was launched to replace it. By all obvious measures the replacement appeared successful: 'consul members -wan' in datacenters dc_a and dc_b listed the new node and not the old one.
However, we continued to see cross-datacenter queries fail intermittently, with these messages in the consul server logs in datacenter dc_a:
2018/09/27 18:10:37 [ERR] consul: RPC failed to server xx.yy.252.3:8300 in DC "dc_b": rpc error: failed to get conn: rpc error: lead thread didn't get connection
2018/09/27 18:11:43 [WARN] consul.rpc: RPC request for DC "dc_b", no path found
Thanks to @MiLk for discovering a workaround - we restarted each server in dc_a and cross-dc queries went back to working consistently.
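A sketch of that workaround, assuming systemd-managed servers and restarting one at a time:
# on each Consul server in dc_a, in turn:
$ sudo systemctl restart consul
# confirm it has rejoined before moving to the next one:
$ consul members -wan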
In our closed issue https://github.com/hashicorp/consul/issues/4794 we saw exactly the same thing; only a full cluster restart solved the problem.
This issue is still happening with Consul 1.4.0
In a different region, I have 2 consul datacenters.
I have a mgt-cn1-aws DC which holds the ACL configuration, and a prd-cn1-aws DC.
Both have 3 Consul servers running 1.4.0.
One of the EC2 instances in mgt-cn1-aws was replaced, and DNS resolution and API calls started to fail randomly in prd-cn1-aws.
After looking at the logs, I see a lot of 'consul: RPC failed to server 10.20.11.88:8300 in DC "mgt-cn1-aws": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection', indicating that it's trying to reach the now-removed server.
$ consul members -wan
Node Address Status Type Build Protocol DC Segment
consul001-00ad.prd-cn1-aws 10.21.11.216:8302 alive server 1.4.0 2 prd-cn1-aws <all>
consul001-00d9.mgt-cn1-aws 10.20.13.72:8302 alive server 1.4.0 2 mgt-cn1-aws <all>
consul001-0485.mgt-cn1-aws 10.20.11.184:8302 alive server 1.4.0 2 mgt-cn1-aws <all>
consul001-078d.prd-cn1-aws 10.21.13.251:8302 alive server 1.4.0 2 prd-cn1-aws <all>
consul001-0af1.mgt-cn1-aws 10.20.12.216:8302 alive server 1.4.0 2 mgt-cn1-aws <all>
consul001-0feb.prd-cn1-aws 10.21.12.207:8302 alive server 1.4.0 2 prd-cn1-aws <all>
consul001-00ad:~$ consul operator raft list-peers
Node ID Address State Voter RaftProtocol
consul001-078d 0bcdf6c8-8e58-773a-a85b-95ed3807f26e 10.21.13.251:8300 follower true 3
consul001-0feb 446b825f-9643-8245-c513-82d0450adc1f 10.21.12.207:8300 follower true 3
consul001-00ad f3c48ad0-6bd9-c0b7-68c9-299016782eca 10.21.11.216:8300 leader true 3
consul001-0af1:~$ consul operator raft list-peers
Node ID Address State Voter RaftProtocol
consul001-00d9 34252ac9-fa03-f805-4551-bed502f6a20e 10.20.13.72:8300 follower true 3
consul001-0af1 145a50ab-4c03-6763-66af-aeeb281d8237 10.20.12.216:8300 leader true 3
consul001-0485 ee4c0301-a2e2-4e8e-92dc-d9e220cba788 10.20.11.184:8300 follower true 3
What needs to be done to correctly propagate the list of Consul servers from one DC to another?
@MiLk
What needs to be done to correctly propagate the list of Consul servers from one DC to another?
The only way we've found so far is to restart all Consul servers.
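For completeness, the WAN server list itself is normally propagated through the serf WAN gossip pool, which servers join with consul join -wan or the retry_join_wan config option, e.g. (config path hypothetical; addresses taken from this thread):
$ consul join -wan 10.20.12.216
$ cat /etc/consul.d/wan.json
{
  "retry_join_wan": ["10.20.12.216", "10.20.13.72"]
}
As the outputs above show, the WAN pool was already correct here; it's the stale cached server address that only a restart cleared.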