Consul: Consul server keeps trying to reach a left server from another DC

Created on 27 Aug 2018 · 9 comments · Source: hashicorp/consul

Overview of the Issue

We have a Consul setup spanning 4 DCs.
Two DCs run a 3-server cluster, and two DCs run a single-server "cluster".
One 3-server cluster (management) is connected to all the other DCs, but the other DCs (staging, production, ci) cannot reach each other directly.

One of the management Consul servers was replaced. It left the cluster, but after the new server took its place, the initial server was still in the peer list. I had to remove it manually with consul operator raft remove-peer, but that alone was not enough for a leader election to happen.
After restarting all the servers in the cluster, the election succeeded.

We noticed that DNS resolution through Consul started failing in our staging and ci DCs after a few days.
Restarting the server seems to resolve the issue.
Looking at the logs, it seems the server is trying to connect to a server that has been removed,
but that server doesn't appear in the output of consul members -wan.
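
As a quick cross-check, the WAN server list can also be read through the HTTP API instead of the CLI. A minimal Go sketch, assuming the standard github.com/hashicorp/consul/api client and a local agent on the default address:

    // List WAN-joined servers, equivalent to `consul members -wan`.
    package main

    import (
        "fmt"
        "log"

        "github.com/hashicorp/consul/api"
    )

    func main() {
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        // true selects the WAN gossip pool (port 8302); false the LAN pool.
        members, err := client.Agent().Members(true)
        if err != nil {
            log.Fatal(err)
        }
        for _, m := range members {
            fmt.Printf("%s\t%s:%d\tstatus=%d\n", m.Name, m.Addr, m.Port, m.Status)
        }
    }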

Reproduction Steps

Steps to reproduce this issue:

  1. Create 2 clusters, one with 3 server nodes and one with 1 server node.
  2. Replace one of the 3 servers in the first cluster by bringing it down and then starting a replacement.
  3. Observe that the server in the second cluster is still trying to connect to the server that left.

Consul info for both Client and Server


Client info

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 1
    services = 4
build:
    prerelease =
    revision = 9a494b5f
    version = 1.0.6
consul:
    bootstrap = true
    known_datacenters = 4
    leader = true
    leader_addr = 10.13.11.146:8300
    server = true
raft:
    applied_index = 796429
    commit_index = 796429
    fsm_pending = 0
    last_contact = 0
    last_log_index = 796429
    last_log_term = 8
    last_snapshot_index = 794634
    last_snapshot_term = 8
    latest_configuration = [{Suffrage:Voter ID:b1530d34-c999-ebe3-fcf5-38971630b07b Address:10.13.11.146:8300}]
    latest_configuration_index = 1
    num_peers = 0
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 8
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 101
    max_procs = 1
    os = linux
    version = go1.9.3
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 1
    event_time = 15
    failed = 0
    health_score = 0
    intent_queue = 1
    left = 0
    member_time = 17
    members = 1
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 381
    members = 8
    query_queue = 0


Server info

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 1
    services = 4
build:
    prerelease =
    revision = 9a494b5f
    version = 1.0.6
consul:
    bootstrap = false
    known_datacenters = 4
    leader = false
    leader_addr = 10.10.13.178:8300
    server = true
raft:
    applied_index = 56167849
    commit_index = 56167849
    fsm_pending = 0
    last_contact = 16.7967ms
    last_log_index = 56167849
    last_log_term = 1008
    last_snapshot_index = 56165684
    last_snapshot_term = 1008
    latest_configuration = [{Suffrage:Voter ID:1f872edb-b088-6708-04a3-fc396ef2d360 Address:10.10.13.178:8300} {Suffrage:Voter ID:9ceaa926-af42-0b01-7085-f3ef11b6f6f4 Address:10.10.14.47:8300} {Suffrage:Voter ID:40e76095-2c8d-0b7e-37d9-15d13a3ac9f5 Address:10.10.11.65:8300}]
    latest_configuration_index = 55612905
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 1008
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 131
    max_procs = 1
    os = linux
    version = go1.9.3
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 318
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1363
    members = 20
    query_queue = 0
    query_time = 46
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 381
    members = 8
    query_queue = 0
    query_time = 1

Operating system and Environment details

Ubuntu 16.04
Consul 1.0.6

Log Fragments

Aug 26 22:52:56 consul001-0de9 consul[24178]:     2018/08/26 22:52:56 [ERR] consul: RPC failed to server 10.10.11.125:8300 in DC "mgt-jp1-aws": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 26 22:52:56 consul001-0de9 consul[24178]:     2018/08/26 22:52:56 [ERR] consul.acl: Failed to get policy from ACL datacenter: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
$ consul members -wan
Node                         Address            Status  Type    Build  Protocol  DC            Segment
consul001-01cd.prd-jp1-aws   10.11.13.245:8302  alive   server  1.0.6  2         prd-jp1-aws   <all>
consul001-0311.stg-jp1-aws   10.12.11.62:8302   alive   server  1.0.6  2         stg-jp1-aws   <all>
consul001-0336.mgt-jp1-aws   10.10.11.65:8302   alive   server  1.0.6  2         mgt-jp1-aws   <all>
consul001-091f.prd-jp1-aws   10.11.14.174:8302  alive   server  1.0.6  2         prd-jp1-aws   <all>
consul001-09c1.mgt-jp1-aws   10.10.14.47:8302   alive   server  1.0.6  2         mgt-jp1-aws   <all>
consul001-0a82.mgt-jp1-aws   10.10.13.178:8302  alive   server  1.0.6  2         mgt-jp1-aws   <all>
consul001-0ad0.prd-jp1-aws   10.11.11.188:8302  alive   server  1.0.6  2         prd-jp1-aws   <all>
consul001-0de9.ci-jp1-aws    10.13.11.146:8302  alive   server  1.0.6  2         ci-jp1-aws    <all>

All 9 comments

Try running 'consul force-leave XYZ' on the server that is complaining about reaching the missing server, where XYZ is the node name of the missing server.

The server doesn't show up in consul members at all; would this have any effect?

In the management DC, I already used consul force-leave and consul operator raft remove-peer to make sure the server was no longer there.
But the other DCs were still trying to reach it.
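
For reference, both cleanup steps can also be driven through the Go API client; a minimal sketch, assuming github.com/hashicorp/consul/api, with the node name and peer address below as placeholders:

    package main

    import (
        "log"

        "github.com/hashicorp/consul/api"
    )

    func main() {
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        // Equivalent of `consul force-leave <node>` on the local agent.
        if err := client.Agent().ForceLeave("consul001-dead"); err != nil {
            log.Printf("force-leave: %v", err)
        }

        // Equivalent of `consul operator raft remove-peer -address=<ip:port>`.
        if err := client.Operator().RaftRemovePeerByAddress("10.10.11.125:8300", nil); err != nil {
            log.Printf("remove-peer: %v", err)
        }
    }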

I ran consul force-leave on this server, but I believe it had no effect since they are in different DCs, and the issue still persists.
In the other DCs, I've restarted the consul servers, and they no longer show any errors.

Aug 27 18:45:11 consul001-0de9 consul[24178]:     2018/08/27 09:45:11 [ERR] consul: RPC failed to server 10.10.11.125:8300 in DC "mgt-jp1-aws": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 27 18:45:11 consul001-0de9 consul[24178]:     2018/08/27 09:45:11 [ERR] dns: rpc error: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 27 18:45:11 consul001-0de9 consul[24178]:     2018/08/27 09:45:11 [ERR] consul: RPC failed to server 10.10.11.125:8300 in DC "mgt-jp1-aws": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 27 18:45:11 consul001-0de9 consul[24178]:     2018/08/27 09:45:11 [ERR] consul.acl: Failed to get policy from ACL datacenter: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 27 18:45:11 consul001-0de9 consul[24178]:     2018/08/27 09:45:11 [ERR] consul: RPC failed to server 10.10.11.125:8300 in DC "mgt-jp1-aws": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 27 18:45:11 consul001-0de9 consul[24178]:     2018/08/27 09:45:11 [ERR] dns: rpc error: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 27 18:45:11 consul001-0de9 consul[24178]:     2018/08/27 09:45:11 [ERR] consul: RPC failed to server 10.10.11.125:8300 in DC "mgt-jp1-aws": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 27 18:45:11 consul001-0de9 consul[24178]:     2018/08/27 09:45:11 [ERR] dns: rpc error: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
$ consul members
Node            Address            Status  Type    Build  Protocol  DC            Segment
consul001-0de9  10.13.11.146:8301  alive   server  1.0.6  2         ci-jp1-aws    <all>
$ consul members -wan
Node                         Address            Status  Type    Build  Protocol  DC            Segment
consul001-01cd.prd-jp1-aws   10.11.13.245:8302  alive   server  1.0.6  2         prd-jp1-aws   <all>
consul001-0311.stg-jp1-aws   10.12.11.62:8302   alive   server  1.0.6  2         stg-jp1-aws   <all>
consul001-0336.mgt-jp1-aws   10.10.11.65:8302   alive   server  1.0.6  2         mgt-jp1-aws   <all>
consul001-091f.prd-jp1-aws   10.11.14.174:8302  alive   server  1.0.6  2         prd-jp1-aws   <all>
consul001-09c1.mgt-jp1-aws   10.10.14.47:8302   alive   server  1.0.6  2         mgt-jp1-aws   <all>
consul001-0a82.mgt-jp1-aws   10.10.13.178:8302  alive   server  1.0.6  2         mgt-jp1-aws   <all>
consul001-0ad0.prd-jp1-aws   10.11.11.188:8302  alive   server  1.0.6  2         prd-jp1-aws   <all>
consul001-0de9.ci-jp1-aws    10.13.11.146:8302  alive   server  1.0.6  2         ci-jp1-aws    <all>

Do you see this left server persist for a long period of time (>30m), or does it get reaped eventually? It is possible we need to detect the failed reads and do something proactive, but it _should_ get reaped pretty quickly as-is.

The left server disappeared from the output of all consul members commands quickly, but 36 hours later the consul server in the second cluster was still trying to reach it during DNS resolution.

I had to manually restart the server that was trying to reach the left server.

We experienced the same issue with consul 0.8.1 - a node in datacenter dc_b failed and had to be forcibly removed from the cluster. A new node was launched to replace it. By all obvious measures the replacement appeared successful: 'consul members -wan' in datacenters dc_a and dc_b listed the new node and not the old one.

However, we continued to see cross-datacenter queries fail intermittently, with these messages in the consul server logs in datacenter dc_a:

2018/09/27 18:10:37 [ERR] consul: RPC failed to server xx.yy.252.3:8300 in DC "dc_b": rpc error: failed to get conn: rpc error: lead thread didn't get connection
2018/09/27 18:11:43 [WARN] consul.rpc: RPC request for DC "dc_b", no path found

Thanks to @MiLk for discovering a workaround - we restarted each server in dc_a and cross-dc queries went back to working consistently.

In our closed issue https://github.com/hashicorp/consul/issues/4794 we saw exactly the same thing; only a full cluster restart solved the problem.

This issue is still happening with Consul 1.4.0

In a different region, I have 2 Consul datacenters.
I have a mgt-cn1-aws DC which holds the ACL configuration, and a prd-cn1-aws DC.
Both have 3 Consul servers running 1.4.0.

One of the EC2 instances in mgt-cn1-aws was replaced, and DNS resolution and API calls started to fail randomly in prd-cn1-aws.
Looking at the logs, I see a lot of 'consul: RPC failed to server 10.20.11.88:8300 in DC "mgt-cn1-aws": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection', indicating that it's still trying to reach the now-removed server.

$ consul members -wan
Node                        Address            Status  Type    Build  Protocol  DC           Segment
consul001-00ad.prd-cn1-aws  10.21.11.216:8302  alive   server  1.4.0  2         prd-cn1-aws  <all>
consul001-00d9.mgt-cn1-aws  10.20.13.72:8302   alive   server  1.4.0  2         mgt-cn1-aws  <all>
consul001-0485.mgt-cn1-aws  10.20.11.184:8302  alive   server  1.4.0  2         mgt-cn1-aws  <all>
consul001-078d.prd-cn1-aws  10.21.13.251:8302  alive   server  1.4.0  2         prd-cn1-aws  <all>
consul001-0af1.mgt-cn1-aws  10.20.12.216:8302  alive   server  1.4.0  2         mgt-cn1-aws  <all>
consul001-0feb.prd-cn1-aws  10.21.12.207:8302  alive   server  1.4.0  2         prd-cn1-aws  <all>
consul001-00ad:~$ consul operator raft list-peers
Node            ID                                    Address            State     Voter  RaftProtocol
consul001-078d  0bcdf6c8-8e58-773a-a85b-95ed3807f26e  10.21.13.251:8300  follower  true   3
consul001-0feb  446b825f-9643-8245-c513-82d0450adc1f  10.21.12.207:8300  follower  true   3
consul001-00ad  f3c48ad0-6bd9-c0b7-68c9-299016782eca  10.21.11.216:8300  leader    true   3
consul001-0af1:~$ consul operator raft list-peers
Node            ID                                    Address            State     Voter  RaftProtocol
consul001-00d9  34252ac9-fa03-f805-4551-bed502f6a20e  10.20.13.72:8300   follower  true   3
consul001-0af1  145a50ab-4c03-6763-66af-aeeb281d8237  10.20.12.216:8300  leader    true   3
consul001-0485  ee4c0301-a2e2-4e8e-92dc-d9e220cba788  10.20.11.184:8300  follower  true   3
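
For what it's worth, the same Raft configuration can be fetched programmatically; a minimal Go sketch, assuming the github.com/hashicorp/consul/api client:

    // Equivalent of `consul operator raft list-peers`.
    package main

    import (
        "fmt"
        "log"

        "github.com/hashicorp/consul/api"
    )

    func main() {
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        cfg, err := client.Operator().RaftGetConfiguration(nil)
        if err != nil {
            log.Fatal(err)
        }
        for _, s := range cfg.Servers {
            fmt.Printf("%s\t%s\tleader=%v\tvoter=%v\n", s.Node, s.Address, s.Leader, s.Voter)
        }
    }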

What needs to be done to correctly propagate the list of consul servers from one DC to another?

@MiLk

What needs to be done to correctly propagate the list of consul servers from one DC to another?

The only way we've found so far is to restart all the Consul servers.
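
Until there is a real fix, one way to catch the broken state early is to probe a cross-DC query directly, since that is the code path that fails here. A hedged Go sketch, assuming github.com/hashicorp/consul/api; the datacenter name below is a placeholder:

    package main

    import (
        "log"

        "github.com/hashicorp/consul/api"
    )

    func main() {
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        // Any RPC routed to the remote DC will exercise the server router;
        // listing catalog nodes is cheap.
        opts := &api.QueryOptions{Datacenter: "mgt-jp1-aws"}
        nodes, _, err := client.Catalog().Nodes(opts)
        if err != nil {
            // Errors like "no path found" or "lead thread didn't get
            // connection" suggest the local servers still route to a
            // removed peer, and a server restart may be needed.
            log.Fatalf("cross-DC query failed: %v", err)
        }
        log.Printf("cross-DC query OK: %d nodes in remote DC", len(nodes))
    }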
