consul version for Server:

Consul v0.7.0
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
consul info for Server:
agent:
check_monitors = 0
check_ttls = 0
checks = 1
services = 1
build:
prerelease =
revision = 'a189091
version = 0.7.0
consul:
bootstrap = false
known_datacenters = 1
leader = false
leader_addr = 10.2.2.1:8300
server = true
raft:
applied_index = 334
commit_index = 334
fsm_pending = 0
last_contact = 90.3972ms
last_log_index = 334
last_log_term = 23
last_snapshot_index = 0
last_snapshot_term = 0
latest_configuration = [{Suffrage:Voter ID:10.2.2.1:8300 Address:10.2.2.1:8300} {Suffrage:Voter ID:10.2.3.1:8300 Address:10.2.3.1:8300} {Suffrage:Voter ID:10.2.1.1:8300 Address:10.2.1.1:8300}]
latest_configuration_index = 331
num_peers = 2
protocol_version = 1
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 23
runtime:
arch = amd64
cpu_count = 1
goroutines = 66
max_procs = 4
os = linux
version = go1.6.3
serf_lan:
encrypted = false
event_queue = 0
event_time = 5
failed = 0
health_score = 0
intent_queue = 0
left = 1
member_time = 6
members = 4
query_queue = 0
query_time = 1
serf_wan:
encrypted = false
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1
members = 1
query_queue = 0
query_time = 1
docker run \
--memory=128m \
--cpu-shares=128 \
--net=custom-net-1 \
-e GOMAXPROCS=4 \
--publish=8500:8500 \
--publish=10.2.2.1:8400:8400 \
--publish=10.2.2.1:8300-8302:8300-8302 \
--publish=10.2.2.1:8300-8302:8300-8302/udp \
custom-consul-image:0.7.0 \
# Consul server parameters ...
Note: the Docker container does not mount /var/consul/data from the host, so a leave from the cluster (and removal of the container) purges the raft database and all data.
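If the raft database needed to survive container removal, the data directory could be bind-mounted from the host. A sketch based on the docker run command above (the host path /opt/consul/data is an assumption, not part of the original setup):

```shell
# Sketch: bind-mount the data directory so the raft database survives
# container removal (host path /opt/consul/data is hypothetical).
docker run \
  --volume=/opt/consul/data:/var/consul/data \
  --net=custom-net-1 \
  --publish=8500:8500 \
  --publish=10.2.2.1:8400:8400 \
  --publish=10.2.2.1:8300-8302:8300-8302 \
  --publish=10.2.2.1:8300-8302:8300-8302/udp \
  custom-consul-image:0.7.0
```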
Agent configuration from /etc/consul.d/config.json and health-check configuration from /etc/consul.d/node.json (commands have been modified), respectively:

{
"data_dir": "/var/consul/data",
"ui_dir": "/var/consul/ui",
"disable_update_check": true
}
{
"checks": [
{
"id": "node-health",
"name": "Node health checks",
"docker_container_id": "agent",
"shell": "/bin/sh",
"script": "/usr/bin/healthcheck",
"interval":"30s",
"timeout":"15s"
}
]
}
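For script checks like the one above, Consul derives the check status from the script's exit code: 0 is passing, 1 is warning, and any other code is critical. A tiny sketch of that mapping (the classify_exit helper is illustrative, not part of Consul):

```shell
# Map a check script's exit status to the health state Consul would record:
# 0 -> passing, 1 -> warning, anything else -> critical.
classify_exit() {
  case "$1" in
    0) echo "passing" ;;
    1) echo "warning" ;;
    *) echo "critical" ;;
  esac
}

classify_exit 0   # passing
classify_exit 1   # warning
classify_exit 2   # critical
```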
After the Consul leader leaves the cluster and the server subsequently rejoins with the same IP address, replicated logs keep getting rejected. The expected behavior is that replicated logs are committed on the rejoined server.
consul agent \
-config-dir=/etc/consul.d \
-client=0.0.0.0 \
-node="${ipsubnet}/24" \
-log-level=debug \
-retry-interval 5s \
-advertise "${ipaddress}" \
-dc dc1-3-node \
-retry-join 10.2.1.1 -retry-join 10.2.2.1 -retry-join 10.2.3.1 \
-server \
-bootstrap-expect 3
Where:
${ipsubnet} takes one of: 10.2.1.0/24, 10.2.2.0/24, 10.2.3.0/24.
${ipaddress} takes one of: 10.2.1.1, 10.2.2.1, 10.2.3.1, respectively.

Logs from the initial leader (10.2.1.1 takes the leadership of the cluster and starts replication routines):

2017/03/28 10:34:21 [INFO] raft: Node at 10.2.1.1:8300 [Leader] entering Leader state
2017/03/28 10:34:21 [INFO] raft: Added peer 10.2.2.1:8300, starting replication
2017/03/28 10:34:21 [INFO] raft: Added peer 10.2.3.1:8300, starting replication
2017/03/28 10:34:21 [INFO] consul: cluster leadership acquired
2017/03/28 10:34:21 [INFO] consul: New leader elected: 10.2.1.0/24,172.28.5.71,c2bd8b49b4ae
$ consul leave
Logs from the former Consul leader:
2017/03/28 10:39:37 [INFO] raft: Removed ourself, transitioning to follower
2017/03/28 10:39:37 [INFO] raft: Node at 10.2.1.1:8300 [Follower] entering Follower state (Leader: "")
2017/03/28 10:39:37 [INFO] consul: cluster leadership lost
...
2017/03/28 10:39:37 [INFO] agent: shutdown complete
2017/03/28 10:39:37 [DEBUG] http: Shutting down http server (0.0.0.0:8500)
Logs from the new leader (10.2.3.1):

2017/03/28 10:39:48 [INFO] raft: Election won. Tally: 2
2017/03/28 10:39:48 [INFO] raft: Node at 10.2.3.1:8300 [Leader] entering Leader state
2017/03/28 10:39:48 [INFO] raft: Added peer 10.2.2.1:8300, starting replication
2017/03/28 10:39:48 [INFO] consul: cluster leadership acquired
2017/03/28 10:39:48 [INFO] consul: New leader elected: 10.2.3.0/24,172.28.5.211,808195027ade
2017/03/28 10:39:48 [INFO] raft: pipelining replication to peer {Voter 10.2.2.1:8300 10.2.2.1:8300}
Logs from the rejoined server 10.2.1.1 (started with the same parameters, --bootstrap-expect set to 3):

2017/03/28 11:06:06 [INFO] serf: EventMemberJoin: 10.2.2.0/24,172.28.5.16,19883068ab3c 10.2.2.1
2017/03/28 11:06:06 [INFO] serf: EventMemberJoin: 10.2.3.0/24,172.28.5.211,808195027ade 10.2.3.1
2017/03/28 11:06:06 [INFO] consul: Adding LAN server 10.2.2.0/24,172.28.5.16,19883068ab3c (Addr: tcp/10.2.2.1:8300) (DC: dc1-3-santa_clara)
2017/03/28 11:06:06 [DEBUG] memberlist: Initiating push/pull sync with: 10.2.3.1:8301
2017/03/28 11:06:06 [INFO] agent: (LAN) joined: 3 Err: <nil>
2017/03/28 11:06:06 [INFO] agent: Join completed. Synced with 3 initial agents
2017/03/28 11:06:06 [INFO] consul: Existing Raft peers reported by 10.2.2.0/24,172.28.5.16,19883068ab3c, disabling bootstrap mode
$ grep -i 'log not found' 10-2-1-1-consul-after.log
2017/03/28 11:06:06 [WARN] raft: Failed to get previous log: 247 log not found (last: 0)
2017/03/28 11:06:48 [WARN] raft: Failed to get previous log: 261 log not found (last: 260)
2017/03/28 11:07:48 [WARN] raft: Failed to get previous log: 269 log not found (last: 268)
2017/03/28 11:08:48 [WARN] raft: Failed to get previous log: 278 log not found (last: 277)
...
2017/03/28 11:14:07 [WARN] raft: Failed to get previous log: 331 log not found (last: 330)
2017/03/28 11:15:07 [WARN] raft: Failed to get previous log: 342 log not found (last: 341)
2017/03/28 11:16:07 [WARN] raft: Failed to get previous log: 353 log not found (last: 352)
...
2017/03/28 11:28:44 [WARN] raft: Failed to get previous log: 472 log not found (last: 465)
2017/03/28 11:28:44 [WARN] raft: Failed to get previous log: 476 log not found (last: 475)
$ grep -i heartbeat 10-2-1-1-consul-after.log
2017/03/28 11:11:59 [WARN] raft: Heartbeat timeout from "10.2.3.1:8300" reached, starting election
2017/03/28 11:13:08 [WARN] raft: Heartbeat timeout from "10.2.2.1:8300" reached, starting election
2017/03/28 11:19:15 [WARN] raft: Heartbeat timeout from "10.2.2.1:8300" reached, starting election
2017/03/28 11:21:27 [WARN] raft: Heartbeat timeout from "10.2.3.1:8300" reached, starting election
2017/03/28 11:23:32 [WARN] raft: Heartbeat timeout from "10.2.3.1:8300" reached, starting election
2017/03/28 11:27:41 [WARN] raft: Heartbeat timeout from "10.2.3.1:8300" reached, starting election
To my limited understanding, only log-commit events reset the leader heartbeat timer on the follower nodes. The expired timer then triggers leader re-election over and over.
Logs from the server before the leave from the cluster: 10-2-1-1-consul-before.log.
Logs from the server after the rejoin to the cluster: 10-2-1-1-consul-after.log.
Logs from another server node: 10-2-3-1-consul.log.
Note: I validated this scenario on versions 0.6.4 and 0.7.1 as well and was always able to reproduce it.
Hi @ybubnov, the log warnings are actually OK (the leader will send older logs or a snapshot). I think what's happening here is that the server that's coming back is replaying the logs and removing itself, getting stuck in a loop. If you don't do the leave, it should work OK to just restart the server, but we will have to think about the right way to fix this.
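Based on that explanation, the workaround would be to restart the server process without issuing consul leave, so the node keeps its raft state and rejoins as an existing peer. A sketch, assuming a hypothetical container name consul-server:

```shell
# Workaround sketch: restart WITHOUT `consul leave` so the node keeps its
# raft log and peer identity ("consul-server" is a hypothetical name).
docker restart consul-server

# Avoid this sequence, which removes the node from the raft configuration
# before the restart:
#   docker exec consul-server consul leave
#   docker restart consul-server
```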
Is there any solution to this? Can't get the server that was reset to join back to the cluster.