Consul: Infinite re-election loop on leader node rejoin

Created on 28 Mar 2017 · 2 comments · Source: hashicorp/consul

consul version for Server

Consul v0.7.0
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

consul info for Server

Server:

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 1
    services = 1
build:
    prerelease = 
    revision = 'a189091
    version = 0.7.0
consul:
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = 10.2.2.1:8300
    server = true
raft:
    applied_index = 334
    commit_index = 334
    fsm_pending = 0
    last_contact = 90.3972ms
    last_log_index = 334
    last_log_term = 23
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:10.2.2.1:8300 Address:10.2.2.1:8300} {Suffrage:Voter ID:10.2.3.1:8300 Address:10.2.3.1:8300} {Suffrage:Voter ID:10.2.1.1:8300 Address:10.2.1.1:8300}]
    latest_configuration_index = 331
    num_peers = 2
    protocol_version = 1
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 23
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 66
    max_procs = 4
    os = linux
    version = go1.6.3
serf_lan:
    encrypted = false
    event_queue = 0
    event_time = 5
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 1
    member_time = 6
    members = 4
    query_queue = 0
    query_time = 1
serf_wan:
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1

Operating system and Environment details

  • Linux node1 4.4.19-29.55.amzn1.x86_64 #1 SMP Mon Aug 29 23:29:40 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
  • Each Consul server is running within a Docker container with the following parameters:
docker run \
  --memory=128m \
  --cpu-shares=128 \
  --net=custom-net-1 \
  -e GOMAXPROCS=4 \
  --publish=8500:8500 \
  --publish=10.2.2.1:8400:8400 \
  --publish=10.2.2.1:8300-8302:8300-8302 \
  --publish=10.2.2.1:8300-8302:8300-8302/udp \
  custom-consul-image:0.7.0 \
  # Consul server parameters ...

Note that the Docker container does not mount /var/consul/data, so leaving the cluster purges the Raft database and all other data.
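If persistence across container restarts were desired, the data directory could be bind-mounted from the host instead. A minimal sketch of the change, assuming a hypothetical host path /opt/consul/data (all other flags as in the command above):

```shell
# Sketch: keep /var/consul/data (Raft log, snapshots) on the host so it
# survives container removal. /opt/consul/data is a hypothetical host path.
docker run \
  --memory=128m \
  --cpu-shares=128 \
  --net=custom-net-1 \
  -v /opt/consul/data:/var/consul/data \
  custom-consul-image:0.7.0 \
  # Consul server parameters ...
```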

  • Consul server configuration. The server configuration file /etc/consul.d/config.json and the health-check configuration /etc/consul.d/node.json (the check command has been modified), respectively:
{
  "data_dir": "/var/consul/data",
  "ui_dir": "/var/consul/ui",
  "disable_update_check": true
}
{
  "checks": [
    {
      "id": "node-health",
      "name": "Node health checks",
      "docker_container_id": "agent",
      "shell": "/bin/sh",
      "script": "/usr/bin/healthcheck",
      "interval":"30s",
      "timeout":"15s"
    }
  ]
}

Description of the Issue (and unexpected/desired result)

After the Consul leader gracefully leaves the cluster and the server subsequently rejoins with the same IP address, replicated logs keep getting rejected. The expected behavior is that replicated logs are committed on the server that rejoined the cluster.

Reproduction steps

1. Create a three-node cluster with an identical configuration on each node:

consul agent \
  -config-dir=/etc/consul.d \
  -client=0.0.0.0 \
  -node="${ipsubnet}/24" \
  -log-level=debug \
  -retry-interval 5s \
  -advertise "${ipaddress}" \
  -dc dc1-3-node \
  -retry-join 10.2.1.1 -retry-join 10.2.2.1 -retry-join 10.2.3.1 \
  -server \
  -bootstrap-expect 3

Where:

  • Variable ${ipsubnet} takes one of: 10.2.1.0/24, 10.2.2.0/24, 10.2.3.0/24.
  • Variable ${ipaddress} takes one of: 10.2.1.1, 10.2.2.1, 10.2.3.1 respectively.
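Once all three servers are up, the membership and Raft peer set can be verified from any node. A quick check (the `consul operator raft -list-peers` form is, to my knowledge, the 0.7-era syntax):

```shell
# All three agents should report as "alive", and the Raft configuration
# should list the three servers as voters.
consul members
consul operator raft -list-peers
```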

2. Wait until the cluster finishes the bootstrapping stage (in my case the node with IP 10.2.1.1 takes leadership of the cluster and starts the replication routines):

2017/03/28 10:34:21 [INFO] raft: Node at 10.2.1.1:8300 [Leader] entering Leader state
2017/03/28 10:34:21 [INFO] raft: Added peer 10.2.2.1:8300, starting replication
2017/03/28 10:34:21 [INFO] raft: Added peer 10.2.3.1:8300, starting replication
2017/03/28 10:34:21 [INFO] consul: cluster leadership acquired
2017/03/28 10:34:21 [INFO] consul: New leader elected: 10.2.1.0/24,172.28.5.71,c2bd8b49b4ae

3. Gracefully remove the leader node from the Consul cluster:

[[email protected]]$ consul leave

Logs from the former Consul leader:

2017/03/28 10:39:37 [INFO] raft: Removed ourself, transitioning to follower
2017/03/28 10:39:37 [INFO] raft: Node at 10.2.1.1:8300 [Follower] entering Follower state (Leader: "")
2017/03/28 10:39:37 [INFO] consul: cluster leadership lost
...
2017/03/28 10:39:37 [INFO] agent: shutdown complete
2017/03/28 10:39:37 [DEBUG] http: Shutting down http server (0.0.0.0:8500)

4. Wait until a new leader is reelected:

2017/03/28 10:39:48 [INFO] raft: Election won. Tally: 2
2017/03/28 10:39:48 [INFO] raft: Node at 10.2.3.1:8300 [Leader] entering Leader state
2017/03/28 10:39:48 [INFO] raft: Added peer 10.2.2.1:8300, starting replication
2017/03/28 10:39:48 [INFO] consul: cluster leadership acquired
2017/03/28 10:39:48 [INFO] consul: New leader elected: 10.2.3.0/24,172.28.5.211,808195027ade
2017/03/28 10:39:48 [INFO] raft: pipelining replication to peer {Voter 10.2.2.1:8300 10.2.2.1:8300}

5. Rejoin the detached member to the Consul cluster (with the same IP address and -bootstrap-expect set to 3):

2017/03/28 11:06:06 [INFO] serf: EventMemberJoin: 10.2.2.0/24,172.28.5.16,19883068ab3c 10.2.2.1
2017/03/28 11:06:06 [INFO] serf: EventMemberJoin: 10.2.3.0/24,172.28.5.211,808195027ade 10.2.3.1
2017/03/28 11:06:06 [INFO] consul: Adding LAN server 10.2.2.0/24,172.28.5.16,19883068ab3c (Addr: tcp/10.2.2.1:8300) (DC: dc1-3-santa_clara)
2017/03/28 11:06:06 [DEBUG] memberlist: Initiating push/pull sync with: 10.2.3.1:8301
2017/03/28 11:06:06 [INFO] agent: (LAN) joined: 3 Err: <nil>
2017/03/28 11:06:06 [INFO] agent: Join completed. Synced with 3 initial agents
2017/03/28 11:06:06 [INFO] consul: Existing Raft peers reported by 10.2.2.0/24,172.28.5.16,19883068ab3c, disabling bootstrap mode

6. Notice that leader re-election never ends, due to heartbeat timeout expirations caused by the accumulation of uncommitted logs on the rejoined member:

$ grep -i 'log not found' 10-2-1-1-consul-after.log
2017/03/28 11:06:06 [WARN] raft: Failed to get previous log: 247 log not found (last: 0)
2017/03/28 11:06:48 [WARN] raft: Failed to get previous log: 261 log not found (last: 260)
2017/03/28 11:07:48 [WARN] raft: Failed to get previous log: 269 log not found (last: 268)
2017/03/28 11:08:48 [WARN] raft: Failed to get previous log: 278 log not found (last: 277)
...
2017/03/28 11:14:07 [WARN] raft: Failed to get previous log: 331 log not found (last: 330)
2017/03/28 11:15:07 [WARN] raft: Failed to get previous log: 342 log not found (last: 341)
2017/03/28 11:16:07 [WARN] raft: Failed to get previous log: 353 log not found (last: 352)
...
2017/03/28 11:28:44 [WARN] raft: Failed to get previous log: 472 log not found (last: 465)
2017/03/28 11:28:44 [WARN] raft: Failed to get previous log: 476 log not found (last: 475)
$ grep -i heartbeat 10-2-1-1-consul-after.log
2017/03/28 11:11:59 [WARN] raft: Heartbeat timeout from "10.2.3.1:8300" reached, starting election
2017/03/28 11:13:08 [WARN] raft: Heartbeat timeout from "10.2.2.1:8300" reached, starting election
2017/03/28 11:19:15 [WARN] raft: Heartbeat timeout from "10.2.2.1:8300" reached, starting election
2017/03/28 11:21:27 [WARN] raft: Heartbeat timeout from "10.2.3.1:8300" reached, starting election
2017/03/28 11:23:32 [WARN] raft: Heartbeat timeout from "10.2.3.1:8300" reached, starting election
2017/03/28 11:27:41 [WARN] raft: Heartbeat timeout from "10.2.3.1:8300" reached, starting election

To my limited understanding, only the commitment of replicated logs resets the leader heartbeat timer on the follower nodes; since nothing commits, the expired timer triggers a leader re-election over and over.
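The loop is visible in the numbers: tallying the heartbeat-timeout warnings per peer shows elections being restarted repeatedly. A small awk sketch over the sample lines from the grep output above (against the real file you would run the same awk program on 10-2-1-1-consul-after.log instead of the here-document):

```shell
# Tally election restarts per peer from the heartbeat-timeout warnings.
# Sample lines are copied from the grep output above.
summary=$(awk '/Heartbeat timeout from/ {
    gsub(/"/, "", $8)   # field 8 is the quoted peer address
    count[$8]++
  }
  END {
    for (p in count)
      printf "%s started %d elections\n", p, count[p]
  }' <<'EOF'
2017/03/28 11:11:59 [WARN] raft: Heartbeat timeout from "10.2.3.1:8300" reached, starting election
2017/03/28 11:13:08 [WARN] raft: Heartbeat timeout from "10.2.2.1:8300" reached, starting election
2017/03/28 11:19:15 [WARN] raft: Heartbeat timeout from "10.2.2.1:8300" reached, starting election
2017/03/28 11:21:27 [WARN] raft: Heartbeat timeout from "10.2.3.1:8300" reached, starting election
2017/03/28 11:23:32 [WARN] raft: Heartbeat timeout from "10.2.3.1:8300" reached, starting election
2017/03/28 11:27:41 [WARN] raft: Heartbeat timeout from "10.2.3.1:8300" reached, starting election
EOF
)
printf '%s\n' "$summary"
```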

Log Fragments

Logs from the server before the leave from the cluster: 10-2-1-1-consul-before.log.
Logs from the server after the rejoin to the cluster: 10-2-1-1-consul-after.log.
Logs from another server node: 10-2-3-1-consul.log.

Note: I validated this scenario on versions 0.6.4 and 0.7.1 as well and was always able to reproduce it.

Labels: internal-cleanup, bug

All 2 comments

Hi @ybubnov, the log warnings are actually OK (the leader will send older logs or a snapshot). I think what's happening here is that the server that's coming back is replaying the logs and removing itself, getting stuck in a loop. If you don't do the leave, it should work OK to just restart the server, but we will have to think about the right way to fix this.
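Based on the comment above, a workaround sketch: restart the server process without a graceful leave, so no removal entry for this server is ever written to the Raft log. Assuming a hypothetical container name consul-server and the server-mode default of leave_on_terminate=false:

```shell
# Workaround sketch: bounce the server WITHOUT `consul leave`, so the Raft
# log never records this server removing itself from the configuration.
docker stop consul-server    # SIGTERM; with leave_on_terminate=false this
                             # does not issue a graceful leave
docker start consul-server   # rejoins with the same IP and on-disk state
```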

Is there any solution to this? Can't get the server that was reset to join back to the cluster.

