Consul: Consul cluster fails to recover from an outage

Created on 25 Mar 2019 · 9 Comments · Source: hashicorp/consul

Overview of the Issue

The Consul cluster has enough nodes to recover, but it doesn't.

Reproduction Steps

Steps to reproduce this issue:

  1. Create a cluster with 3 server nodes
  2. Shutdown 2 nodes
  3. Create 2 new nodes

Consul version: 1.4.4

I ran the scenario above many times. Sometimes the remaining survivor is elected leader, but mostly the cluster stays broken.

I assumed the nodes would eventually return and the cluster would become healthy again, but when the nodes come back, the last survivor still remembers its dead peers and the election goes into a loop.
consul operator raft list-peers -stale on the living nodes:

consul-bb8f:
Node         ID                                    Address           State     Voter  RaftProtocol
consul-hd50  87eabf82-2272-31f3-822c-163f5f8b5cfd  xx.xx.0.22:8300  leader    true   3
consul-bb8f  c80c5e87-e3e1-3c62-a2d5-9ecabe1627c4  xx.xx.0.21:8300  follower  false  3
consul-bbl6  be674976-1b0a-3d09-bdc1-087070fb15b4  xx.xx.0.18:8300  follower  false  3

consul-hd50
Node         ID                                    Address           State     Voter  RaftProtocol
consul-hd50  87eabf82-2272-31f3-822c-163f5f8b5cfd  xx.xx.0.22:8300  leader    true   3
consul-bb8f  c80c5e87-e3e1-3c62-a2d5-9ecabe1627c4  xx.xx.0.21:8300  follower  false  3
consul-bbl6  be674976-1b0a-3d09-bdc1-087070fb15b4  xx.xx.0.18:8300  follower  false  3

consul-bbl6
Node         ID                                    Address           State     Voter  RaftProtocol
<dead> consul-28lf  a823b90d-c8d7-374d-a137-1978953a9908  xx.xx.0.7:8300   follower  true   3
consul-bbl6  be674976-1b0a-3d09-bdc1-087070fb15b4  xx.xx.0.18:8300  follower  true   3

Constant election logs

consul-bb8f:
2019/03/25 08:37:03 [ERR] agent: failed to sync remote state: rpc error making call: No cluster leader
2019/03/25 08:37:06 [ERR] raft-net: Failed to flush response: write tcp xx.xx.0.21:8300->xx.xx.0.22:55442: write: broken pipe
2019/03/25 08:37:06 [INFO] consul: New leader elected: consul-hd50
2019/03/25 08:37:14 [ERR] http: Request PUT /v1/kv/data, error: rpc error making call: No cluster leader from=127.0.0.1:52808
INFO startup-script: Error! Failed writing data: Unexpected response code: 500 (rpc error making call: No cluster leader)
2019/03/25 08:37:14 [INFO] consul: New leader elected: consul-hd50

consul-hd50
2019/03/25 08:37:14 [INFO] consul: New leader elected: consul-hd50
2019/03/25 08:37:14 [ERR] raft: peer {Nonvoter be674976-1b0a-3d09-bdc1-087070fb15b4 xx.xx.0.18:8300} has newer term, stopping replication
2019/03/25 08:37:14 [INFO] raft: Node at xx.xx.0.22:8300 [Follower] entering Follower state (Leader: "")
2019/03/25 08:37:14 [ERR] consul: failed to wait for barrier: node is not the leader
2019/03/25 08:37:14 [INFO] consul: cluster leadership lost
2019/03/25 08:37:19 [WARN] consul.coordinate: Batch update failed: node is not the leader
2019/03/25 08:37:21 [ERR] agent: failed to sync remote state: No cluster leader
2019/03/25 08:37:23 [WARN] raft: Heartbeat timeout from "" reached, starting election
2019/03/25 08:37:23 [INFO] raft: Node at xx.xx.0.22:8300 [Candidate] entering Candidate state in term 134
2019/03/25 08:37:23 [INFO] raft: Election won. Tally: 1
2019/03/25 08:37:23 [INFO] raft: Node at xx.xx.0.22:8300 [Leader] entering Leader state
2019/03/25 08:37:23 [INFO] raft: Added peer c80c5e87-e3e1-3c62-a2d5-9ecabe1627c4, starting replication
2019/03/25 08:37:23 [INFO] raft: Added peer be674976-1b0a-3d09-bdc1-087070fb15b4, starting replication
2019/03/25 08:37:23 [INFO] consul: cluster leadership acquired
2019/03/25 08:37:23 [INFO] consul: New leader elected: consul-hd50

consul-bbl6
2019/03/25 08:37:55 [INFO] consul: New leader elected: consul-hd50
2019/03/25 08:37:56 [ERR] http: Request GET /v1/kv/stack, error: No cluster leader from=127.0.0.1:49678
2019/03/25 08:37:57 [ERR] agent: failed to sync remote state: No cluster leader
2019/03/25 08:37:59 [WARN] raft: Election timeout reached, restarting election
2019/03/25 08:37:59 [INFO] raft: Node at xx.xx.0.18:8300 [Candidate] entering Candidate state in term 350
2019/03/25 08:37:59 [WARN] raft: Unable to get address for server id a823b90d-c8d7-374d-a137-1978953a9908, using fallback address xx.xx.0.7:8300: Could not find address for server id a823b90d-c8d7-374d-a137-1978953a9908
2019/03/25 08:38:00 [ERR] raft: Failed to make RequestVote RPC to {Voter a823b90d-c8d7-374d-a137-1978953a9908 xx.xx.0.7:8300}: dial tcp <nil>->xx.xx.0.7:8300: i/o timeout
2019/03/25 08:38:01 [INFO] consul: New leader elected: consul-hd50

The issue seems to be that the last survivor keeps requesting votes from the dead peer. When I replace the last survivor, the cluster becomes healthy.

I guess this could be recovered manually when such an outage occurs, but I'm looking for an automatic solution. How can I detect such a situation?
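For detection, one rough sketch (not an official recipe) is to poll the local agent: if the status endpoint reports no leader while the Raft configuration still lists servers that the agent's member list shows as failed, the cluster is most likely in the state described above. Below is a sketch using the official Go API client (github.com/hashicorp/consul/api); the comparison against serf status code 1 for "alive" is an assumption worth double-checking.

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Talk to the local agent (127.0.0.1:8500 by default).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// 1. Is there a leader at all?
	leader, err := client.Status().Leader()
	if err != nil || leader == "" {
		fmt.Printf("no cluster leader reachable (leader=%q, err=%v)\n", leader, err)
	}

	// 2. Does the Raft configuration still contain servers the agent considers dead?
	raftCfg, err := client.Operator().RaftGetConfiguration(&api.QueryOptions{AllowStale: true})
	if err != nil {
		log.Fatal(err)
	}
	members, err := client.Agent().Members(false) // LAN members, including failed ones
	if err != nil {
		log.Fatal(err)
	}
	alive := map[string]bool{}
	for _, m := range members {
		alive[m.Name] = m.Status == 1 // assumption: serf "alive" status code
	}
	for _, s := range raftCfg.Servers {
		if !alive[s.Node] {
			fmt.Printf("stale Raft peer: %s (%s) at %s, voter=%v\n", s.Node, s.ID, s.Address, s.Voter)
		}
	}
}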

In some cases the cluster does come back healthy on its own. I'm not sure whether that always happens after a period of time or just at random, perhaps after a rebalance event. I'm still trying to reproduce it.


All 9 comments

The underlying consensus algorithm protecting the data in Consul is raft. The thesis of that algorithm is that you must maintain a majority of your voting server instances at all times. If you dip below quorum (by unexpectedly and permanently losing 2 of 3 nodes for example) the algorithm is designed to specifically NOT recover to avoid data corruption from a possible split-brain scenario.

There is a guide that covers this sort of scenario, with some ways to intervene administratively (with caveats): https://learn.hashicorp.com/consul/day-2-operations/advanced-operations/outage#failure-of-multiple-servers-in-a-multi-server-cluster
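For reference, the guide linked above describes (among other interventions) the peers.json recovery procedure for exactly this kind of multi-server loss: with every server stopped, a JSON file is placed at raft/peers.json inside each remaining server's data directory and is read (and then deleted) on the next start. A sketch of what that file could look like here, using the Raft protocol 3 format; the IDs and addresses are simply copied from the list-peers output earlier in this issue, so treat them as illustrative:

[
  { "id": "87eabf82-2272-31f3-822c-163f5f8b5cfd", "address": "xx.xx.0.22:8300", "non_voter": false },
  { "id": "c80c5e87-e3e1-3c62-a2d5-9ecabe1627c4", "address": "xx.xx.0.21:8300", "non_voter": false },
  { "id": "be674976-1b0a-3d09-bdc1-087070fb15b4", "address": "xx.xx.0.18:8300", "non_voter": false }
]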

But other nodes DO replace them.
Also, I had instances where I lost 2 nodes and the remaining one was elected leader of a single-node cluster.
In some instances the cluster recovered once the 2 new nodes were up, and in others it got stuck in the election loop.

There seem to be inconsistencies.

I too am seeing this issue in a similar setup to the OP's. This happened after an upgrade from Consul v1.2.2, where I had been waiting to see a resolution for #4741.

When there is a hard outage of one server node, it appears to stay in the list of peers shown by the consul operator CLI. The remaining two nodes re-elect a leader, but the failed node is never reaped from the list of peers. It seems there could be a configuration setting for the agents to handle this kind of condition?

If either of the two remaining server nodes is then restarted (SIGINT), it never appears in the peer list again and is never allowed to take part in voting, which leaves the cluster in a down state. The only recovery at this point is to take everything down and reload it. If this is done AFTER manually running consul operator raft remove-peer to remove the hard-down node, then all expected behaviour resumes without issue.

From some of the log output it looks like maybe there has been a change with how SIGINT or SIGTERM is handled? Not sure this is completely relevant to the OP question.

How can we get failed server nodes to be immediately reaped from the peer set?

I am facing this issue too. When a node dies, Consul keeps trying to connect to it. When a node is gracefully shut down, everything works as expected. Maybe a configuration option to reap a node when its socket refuses connections?
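For what it's worth, Autopilot's dead-server cleanup is the built-in mechanism closest to what is being asked for here: with cleanup_dead_servers enabled (it is on by default) the leader reaps a failed server from the peer set once a replacement has stabilized. The important caveat is that it only works while a leader exists, so it cannot rescue a cluster that has already lost quorum. A sketch of adjusting it through the Go API client follows (the same settings are available via consul operator autopilot set-config); the threshold values are illustrative only.

package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Read the current Autopilot settings, tweak them, and write them back.
	conf, err := client.Operator().AutopilotGetConfiguration(nil)
	if err != nil {
		log.Fatal(err)
	}
	conf.CleanupDeadServers = true                                              // reap failed servers (requires a leader)
	conf.LastContactThreshold = api.NewReadableDuration(200 * time.Millisecond) // illustrative value
	conf.ServerStabilizationTime = api.NewReadableDuration(10 * time.Second)    // illustrative value

	if err := client.Operator().AutopilotSetConfiguration(conf, nil); err != nil {
		log.Fatal(err)
	}
	log.Println("autopilot configuration updated")
}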

Hey there,
We wanted to check in on this request since it has been inactive for at least 60 days.
If you think this is still an important issue in the latest version of Consul
or its documentation please reply with a comment here which will cause it to stay open for investigation.
If there is still no activity on this issue for 30 more days, we will go ahead and close it.

Feel free to check out the community forum as well!
Thank you!

This is definitely still an issue, but it doesn't happen as often. I'm not entirely sure what triggers it either, but it does seem to require restarting all nodes, or at least restarting/redeploying the services. After that's done, it takes a while for the cluster to fix itself.

As @rboyer mentioned, that is the expected behaviour. That is why, if you care about things like ACL roles/policies/tokens, data in your K/V store, and other Consul features that store state (such as Consul Connect), you should take snapshots periodically (e.g. with a cron job). By restarting all the servers or redeploying them you simply lose all that data.

Hey there,
We wanted to check in on this request since it has been inactive for at least 60 days.
If you think this is still an important issue in the latest version of Consul
or its documentation please reply with a comment here which will cause it to stay open for investigation.
If there is still no activity on this issue for 30 more days, we will go ahead and close it.

Feel free to check out the community forum as well!
Thank you!

This is an issue in many scenarios where the system is down for an extended period during an outage and needs to rejoin once the outage is fixed. In that case we know the node was down or there was a network failure, and we need a way to programmatically rejoin the nodes. Otherwise it requires a support case where somebody logs into the system and fixes it using the manual steps mentioned by @rboyer. So the ask is whether we can have an API to resurrect the cluster.
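There is no single "resurrect the cluster" API today, but the manual steps split into two programmable pieces: while a leader can still be elected, a stale peer can be dropped through the operator endpoint (the API behind consul operator raft remove-peer); once quorum is gone entirely, the peers.json procedure from the outage guide remains the documented path. Below is a sketch of the first piece with the Go API client; the dead peer's address is copied from the list-peers output above and is illustrative only.

package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Address of the failed server to drop, taken from the list-peers output above.
	deadPeer := "xx.xx.0.7:8300"
	if err := client.Operator().RaftRemovePeerByAddress(deadPeer, nil); err != nil {
		log.Fatalf("failed to remove peer %s: %v", deadPeer, err)
	}
	log.Printf("removed stale peer %s from the Raft configuration", deadPeer)
}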
