Consul: Consul is stuck at "No cluster leader" when the other server leaves the cluster

Created on 27 Jul 2016 · 4 comments · Source: hashicorp/consul

Hi,

Sorry if this is a classic (it looks to me like a classic), but I can't seem to find answers; I've checked on IRC and all.

Not that it should matter, but I'm running Consul in Docker. I have a remote server with IP 192.168.0.1 and my local computer with IP 192.168.0.6.

Both are running consul 0.6.4.

So on my computer I create a server with consul agent -server -bootstrap-expect 1 -data-dir /consul/data -bind 192.168.0.6. It starts all right, electing itself as leader, as expected. I can insert a new key in the KV store and query it.
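(A quick way to check that the election actually happened, assuming the agent's HTTP port 8500 is published from the container:)

# ask the local agent who the current Raft leader is; a non-empty address means a leader exists
curl http://127.0.0.1:8500/v1/status/leader
"192.168.0.6:8300"

# list the Raft peers this server currently knows about
curl http://127.0.0.1:8500/v1/status/peers
["192.168.0.6:8300"]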

Then on the remote server, I start a second server with consul agent -server -data-dir /consul/data -bind 192.168.0.1 -retry-join 192.168.0.6. It starts all right and joins the cluster. I can query the key I previously added from both hosts. All is going fine, as expected.

From my understanding (but maybe this is where I am wrong; I can't seem to find satisfying answers), what happens here is that the servers expect at least 1 server (with bootstrap-expect 1) but can scale up as other servers join the cluster. As far as I see it, the more servers join the cluster, the better: it means more redundancy and even higher availability, right?
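(For comparison, the pattern I've seen in the docs for a highly available setup is to bootstrap an odd number of servers and tell each one how many to expect. A sketch for three servers, where the same command runs on each machine with its own -bind address:)

# no leader is elected until all 3 expected servers have joined
consul agent -server -bootstrap-expect 3 -data-dir /consul/data \
    -bind 192.168.0.6 -retry-join 192.168.0.1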

But then, when I kill one of the Consul agents, for instance the remote one, I get these messages in the logs of the remaining server (on my local computer):

consul_1  |     2016/07/27 08:19:05 [INFO] consul: cluster leadership lost
consul_1  |     2016/07/27 08:19:05 [INFO] memberlist: Suspect gouda has failed, no acks received
consul_1  |     2016/07/27 08:19:06 [WARN] raft: Heartbeat timeout reached, starting election
consul_1  |     2016/07/27 08:19:06 [INFO] raft: Node at 192.168.0.6:8300 [Candidate] entering Candidate state
consul_1  |     2016/07/27 08:19:07 [ERR] raft: Failed to make RequestVote RPC to 192.168.0.1:8300: dial tcp 192.168.0.1:8300: getsockopt: connection refused
consul_1  |     2016/07/27 08:19:07 [ERR] raft: Failed to heartbeat to 192.168.0.1:8300: dial tcp 192.168.0.1:8300: getsockopt: connection refused
consul_1  |     2016/07/27 08:19:07 [INFO] memberlist: Suspect gouda has failed, no acks received
consul_1  |     2016/07/27 08:19:08 [WARN] raft: Election timeout reached, restarting election
consul_1  |     2016/07/27 08:19:08 [INFO] raft: Node at 192.168.0.6:8300 [Candidate] entering Candidate state
consul_1  |     2016/07/27 08:19:08 [ERR] raft: Failed to make RequestVote RPC to 192.168.0.1:8300: dial tcp 192.168.0.1:8300: getsockopt: connection refused
consul_1  |     2016/07/27 08:19:09 [INFO] memberlist: Suspect gouda has failed, no acks received
consul_1  |     2016/07/27 08:19:10 [WARN] raft: Election timeout reached, restarting election
consul_1  |     2016/07/27 08:19:10 [INFO] raft: Node at 192.168.0.6:8300 [Candidate] entering Candidate state

(For information, 'gouda' is the name of the remote server.) So it detects that the remote server has failed, and if I run consul members on my local computer, I see:

Node       Address           Status  Type    Build  Protocol  DC
gouda      192.168.0.1:8301  failed  server  0.6.4  2         dc1
nschoe-PC  192.168.0.6:8301  alive   server  0.6.4  2         dc1

So OK, the remote server is in the failed state. The problem is that now I can't query the cluster from my local computer: it returns 'No cluster leader'. And the logs keep repeating in an endless loop:

[INFO] raft: Node at 192.168.0.6:8300 [Candidate] entering Candidate state
consul_1  |     2016/07/27 08:19:08 [ERR] raft: Failed to make RequestVote RPC to 192.168.0.1:8300: dial tcp 192.168.0.1:8300: getsockopt: connection refused
consul_1  |     2016/07/27 08:19:09 [INFO] memberlist: Suspect gouda has failed, no acks received
consul_1  |     2016/07/27 08:19:10 [WARN] raft: Election timeout reached, restarting election

So what is going on: 192.168.0.6 (the local computer, with Consul still running) enters Candidate state, all right. Then it fails to contact remote 192.168.0.1 for a vote (normal: it is down), it suspects it has failed, but then the election times out. Why won't the sole remaining node elect itself as leader?

It was started with -bootstrap-expect 1 and successfully elected itself as the leader when I first started it, but after a second server joined and then failed, it is not able to elect itself back as leader.

Am I missing something or is there a weird behavior going on?
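Edit, after working through the answers below: one detail that matters, if I read the docs right, is how the second server goes away. Killing the container leaves it in the failed state, still counted in the Raft peer set; a graceful leave removes it from the peer set, so the quorum shrinks back. A sketch:

# on the remote server, instead of killing the Consul container:
consul leave
# the local server then reports gouda as "left" instead of "failed",
# and the remaining server can elect itself again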


All 4 comments

Hum, okay, maybe I've just understood something I previously missed: it's about quorum. I knew that the needed quorum is (N/2)+1, but I thought the N was the bootstrap-expect number; maybe it's not. Maybe it's the number of nodes in the cluster. So it means it changes each time a server joins or leaves the cluster, right?

So at first, when I had only 1 server, the quorum was 1/2 + 1 = 1, so it could elect itself alone. Then when a new server successfully joined the cluster, the quorum changed to 2/2 + 1 = 2, so when that server then left, the election can only be carried if there are at least 2 servers left, right?
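(To double-check the arithmetic: the division is integer division, i.e. quorum = floor(N/2) + 1. A quick shell loop, nothing Consul-specific:)

for n in 1 2 3 4 5; do
  echo "servers=$n quorum=$(( n / 2 + 1 )) can_lose=$(( n - (n / 2 + 1) ))"
done
# servers=1 quorum=1 can_lose=0
# servers=2 quorum=2 can_lose=0
# servers=3 quorum=2 can_lose=1
# servers=4 quorum=3 can_lose=1
# servers=5 quorum=3 can_lose=2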

Am I correct (in which case I perfectly understand the "problem" I'm facing), or am I wrong again?

Okay, I'm now sure that this is what happens, as per the docs (under the last section, "Deployment Table").

I'm okay with closing the issue now. (I'm just waiting for confirmation from a more experienced user, but as far as I'm concerned, the issue can be closed. Sorry for the trouble!)

Hi @nschoe, that's correct - the quorum requirements change as servers are added and removed.
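For anyone who finds this issue while stuck in the same "No cluster leader" state: the documented way out for this Consul version, as I understand the outage recovery guide, is to manually shrink the Raft peer set to the surviving servers. A rough sketch, using the -data-dir from this thread:

# stop the surviving server first, then rewrite the peer set to contain only itself
echo '["192.168.0.6:8300"]' > /consul/data/raft/peers.json

# restart it; with a single peer the quorum is back to 1, so it elects itself
consul agent -server -bootstrap-expect 1 -data-dir /consul/data -bind 192.168.0.6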

Hey there,

This issue has been automatically locked because it is closed and there hasn't been any activity for at least 30 days.

If you are still experiencing problems, or still have questions, feel free to open a new one :+1:.
