We register external services in Consul and use Consul ESM to perform their health checks. We also use prepared queries together with the DNS interface to fail over to another datacenter when a service becomes unhealthy in the local one. The problem we're observing is that when Consul ESM marks a service as unhealthy, sometimes only the leader reflects the update, so stale queries against follower servers incorrectly keep returning the unhealthy endpoint. It's not a matter of being slow to update; the followers simply never update.
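For context, the failover setup looks roughly like the sketch below; the service name and NearestN value are placeholders rather than our exact config:
# Sketch of the failover setup (service name and NearestN are placeholders):
# create a prepared query that fails over to the nearest other datacenter
curl -s -X POST 'http://localhost:8500/v1/query' -d '{
  "Name": "foo",
  "Service": {
    "Service": "foo",
    "OnlyPassing": true,
    "Failover": { "NearestN": 1 }
  }
}'
# resolve it through the DNS interface; when the local instances are
# unhealthy, the answers should come from the failover datacenter instead
dig @127.0.0.1 -p 8600 foo.query.consul +short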
We believe we have traced this to inconsistent CreateIndex and ModifyIndex values for the health checks across the server nodes:
➜ ~ pdsh -w 99com82,99com103,99com122 "curl -s 'http://localhost:8500/v1/health/checks/foo?stale' | jq -c '.[] | [.CreateIndex,.ModifyIndex]'"
99com103: [1573507,1573507]
99com82: [1491640,1491640]
99com122: [1491639,2428888]
➜ ~
Consistent queries are forwarded to the leader and return the expected results:
➜ ~ pdsh -w 99com82,99com103,99com122 "curl -s 'http://localhost:8500/v1/health/checks/foo' | jq -c '.[] | [.CreateIndex,.ModifyIndex]'"
99com103: [1491639,2428888]
99com82: [1491639,2428888]
99com122: [1491639,2428888]
➜ ~
Meanwhile, the servers all agree on the current index value:
➜ ~ pdsh -w 99com82,99com103,99com122 "curl -i -s 'http://localhost:8500/v1/health/checks/foo?stale' | grep X-Consul-Index"
99com103: X-Consul-Index: 2443513
99com122: X-Consul-Index: 2443513
99com82: X-Consul-Index: 2443513
➜ ~
We don't understand how these services got into this state, but they never recover from it. Newly registered services do not exhibit the behavior, and deregistering and re-registering an impacted service fixes that service.
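For anyone else hitting this, the per-service workaround is just a catalog deregister followed by a re-register, roughly like this (node name, address, and port are placeholders; add an X-Consul-Token header if ACLs require it):
# Workaround sketch (node/address/port are placeholders)
curl -s -X PUT 'http://localhost:8500/v1/catalog/deregister' -d '{
  "Node": "foo-node",
  "ServiceID": "foo"
}'
curl -s -X PUT 'http://localhost:8500/v1/catalog/register' -d '{
  "Node": "foo-node",
  "Address": "10.0.0.10",
  "Service": { "ID": "foo", "Service": "foo", "Port": 80 }
}'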
We don't know how to reproduce this scenario, but we have detected the problem in 3 separate clusters. Given a 3-node cluster, we can find inconsistencies with the Prometheus query (sum(consul_catalog_service_node_healthy) by (cluster,service_id) > 0) != 3.
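The same detection can be scripted against the Prometheus HTTP API, e.g. (the prometheus:9090 address is a placeholder):
# List the cluster/service pairs where the servers disagree on health
curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=(sum(consul_catalog_service_node_healthy) by (cluster,service_id) > 0) != 3' \
  | jq -r '.data.result[].metric | "\(.cluster) \(.service_id)"'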
This example is from a 1.7.4 cluster, but we're observing the same behavior in a 1.8.0 cluster as well.
Server info
cbroglie@99com82:~$ consul info
agent:
check_monitors = 0
check_ttls = 1
checks = 1
services = 1
build:
prerelease =
revision = d149d7e9
version = 1.7.4+1
consul:
acl = enabled
bootstrap = false
known_datacenters = 2
leader = false
leader_addr = [2a06:98c0:1400:100::c629:990d]:8300
server = true
raft:
applied_index = 2443786
commit_index = 2443786
fsm_pending = 0
last_contact = 53.901869ms
last_log_index = 2443786
last_log_term = 74
last_snapshot_index = 2436056
last_snapshot_term = 74
latest_configuration = [{Suffrage:Voter ID:432355ab-4fa6-502d-8837-3c1bf3719ff2 Address:[2a06:98c0:1400:100::c629:990d]:8300} {Suffrage:Voter ID:89b25347-d6b8-53de-b7d1-b96e7e0b4183 Address:[2a06:98c0:1400:100::c629:98fa]:8300} {Suffrage:Voter ID:1b23682c-c1b0-5067-bfd8-4aab2efe1d5d Address:[2a06:98c0:1400:100::c629:98e5]:8300}]
latest_configuration_index = 0
num_peers = 2
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 74
runtime:
arch = amd64
cpu_count = 96
goroutines = 1112
max_procs = 96
os = linux
version = go1.13.12
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 16
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1654
members = 311
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 116
members = 8
query_queue = 0
query_time = 1
cbroglie@99com82:~$
cbroglie@99com103:~$ consul info
agent:
check_monitors = 0
check_ttls = 1
checks = 1
services = 1
build:
prerelease =
revision = d149d7e9
version = 1.7.4+1
consul:
acl = enabled
bootstrap = false
known_datacenters = 2
leader = false
leader_addr = [2a06:98c0:1400:100::c629:990d]:8300
server = true
raft:
applied_index = 2444037
commit_index = 2444037
fsm_pending = 0
last_contact = 5.948419ms
last_log_index = 2444037
last_log_term = 74
last_snapshot_index = 2436056
last_snapshot_term = 74
latest_configuration = [{Suffrage:Voter ID:432355ab-4fa6-502d-8837-3c1bf3719ff2 Address:[2a06:98c0:1400:100::c629:990d]:8300} {Suffrage:Voter ID:89b25347-d6b8-53de-b7d1-b96e7e0b4183 Address:[2a06:98c0:1400:100::c629:98fa]:8300} {Suffrage:Voter ID:1b23682c-c1b0-5067-bfd8-4aab2efe1d5d Address:[2a06:98c0:1400:100::c629:98e5]:8300}]
latest_configuration_index = 0
num_peers = 2
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 74
runtime:
arch = amd64
cpu_count = 96
goroutines = 1102
max_procs = 96
os = linux
version = go1.13.12
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 16
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1654
members = 311
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 116
members = 8
query_queue = 0
query_time = 1
cbroglie@99com103:~$
cbroglie@99com122:~$ consul info
agent:
check_monitors = 0
check_ttls = 1
checks = 1
services = 1
build:
prerelease =
revision = d149d7e9
version = 1.7.4+1
consul:
acl = enabled
bootstrap = false
known_datacenters = 2
leader = true
leader_addr = [2a06:98c0:1400:100::c629:990d]:8300
server = true
raft:
applied_index = 2444054
commit_index = 2444054
fsm_pending = 0
last_contact = 0
last_log_index = 2444054
last_log_term = 74
last_snapshot_index = 2436054
last_snapshot_term = 74
latest_configuration = [{Suffrage:Voter ID:432355ab-4fa6-502d-8837-3c1bf3719ff2 Address:[2a06:98c0:1400:100::c629:990d]:8300} {Suffrage:Voter ID:89b25347-d6b8-53de-b7d1-b96e7e0b4183 Address:[2a06:98c0:1400:100::c629:98fa]:8300} {Suffrage:Voter ID:1b23682c-c1b0-5067-bfd8-4aab2efe1d5d Address:[2a06:98c0:1400:100::c629:98e5]:8300}]
latest_configuration_index = 0
num_peers = 2
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Leader
term = 74
runtime:
arch = amd64
cpu_count = 96
goroutines = 1235
max_procs = 96
os = linux
version = go1.13.12
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 16
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1654
members = 311
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 116
members = 8
query_queue = 0
query_time = 1
cbroglie@99com122:~$
Consul ESM logs when the check transitions to critical:
Jul 23 09:40:14 99com81 consul-esm[60990]: 2020/07/23 09:40:14 [WARN] agent: Check 'node/foo/http-check' is now critical
Jul 23 09:40:14 99com81 consul-esm[60990]: 2020/07/23 09:40:14 [TRACE] Registered check status to the catalog with ID http-check
And the corresponding Consul agent logs showing the txn write:
2020-07-23T09:40:14.117Z [DEBUG] agent.tlsutil: IncomingRPCConfig: version=2
2020-07-23T09:40:14.117Z [DEBUG] agent.tlsutil: IncomingRPCConfig: version=2
2020-07-23T09:40:14.133Z [DEBUG] agent.http: Request finished: method=GET url=/v1/health/node/foo?dc=bar from=127.0.0.1:65410 latency=1.061054ms
2020-07-23T09:40:14.140Z [DEBUG] agent.http: Request finished: method=PUT url=/v1/txn?dc=bar from=127.0.0.1:65410 latency=6.606603ms
After this, only the leader reflects the update, yet we don't see any errors in the followers' logs.
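For reference, the check update behind that PUT /v1/txn is a txn check operation of roughly this shape (the exact payload ESM sends is our assumption based on the logs, not a capture):
# Assumed shape of ESM's catalog update via the txn endpoint
curl -s -X PUT 'http://localhost:8500/v1/txn?dc=bar' -d '[
  {
    "Check": {
      "Verb": "set",
      "Check": {
        "Node": "foo",
        "CheckID": "http-check",
        "Name": "http-check",
        "Status": "critical"
      }
    }
  }
]'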
@cbroglie Probably a duplicate of https://github.com/hashicorp/consul/issues/6181. We looked carefully but never found the underlying issue (the good news is that it does not occur often: we hit it only once across 10+ clusters that had been running for years without ever being shut down).
We had this issue once on an old cluster, and it never came back. To fix it, we proceeded as described here: https://github.com/hashicorp/consul/issues/7939#issuecomment-632071578
Thanks @pierresouchay, but I don't think this is a duplicate of #6181, as we don't see anything similar in the logs, and we're observing the issue in all 3 of our clusters.
Adding some more details we uncovered: on a given server, every service in this broken state has the exact same CreateIndex value, but that value differs from server to server. The original example was:
➜ ~ pdsh -w 99com82,99com103,99com122 "curl -s 'http://localhost:8500/v1/health/checks/foo?stale' | jq -c '.[] | [.CreateIndex,.ModifyIndex]'"
99com103: [1573507,1573507]
99com82: [1491640,1491640]
99com122: [1491639,2428888]
➜ ~
And we see those same 3 CreateIndex values (1573507, 1491640, 1491639) for every broken service, e.g.:
➜ ~ pdsh -w 99com82,99com103,99com122 "curl -s 'http://localhost:8500/v1/health/checks/abc?stale' | jq -c '.[] | [.CreateIndex,.ModifyIndex]'"
99com82: [1491640,1491640]
99com122: [1491639,1568724]
99com103: [1573507,1573507]
➜ ~ pdsh -w 99com82,99com103,99com122 "curl -s 'http://localhost:8500/v1/health/checks/xyz?stale' | jq -c '.[] | [.CreateIndex,.ModifyIndex]'"
99com103: [1573507,1573507]
99com82: [1491640,1491640]
99com122: [1491639,2360138]
➜ ~
We've managed to reproduce the issue in dev: https://github.com/szechuen/consul-8377
Waiting for cluster to be ready...
Creating health check...
true
Querying health check -- this should be consistent across servers...
server1:
[{"Node":"node1","CheckID":"check1","Name":"check1","Status":"passing","Notes":"","Output":"","ServiceID":"service1","ServiceName":"service1","ServiceTags":[],"Type":"","Definition":{},"CreateIndex":14,"ModifyIndex":14}]
server2:
[{"Node":"node1","CheckID":"check1","Name":"check1","Status":"passing","Notes":"","Output":"","ServiceID":"service1","ServiceName":"service1","ServiceTags":[],"Type":"","Definition":{},"CreateIndex":14,"ModifyIndex":14}]
server3:
[{"Node":"node1","CheckID":"check1","Name":"check1","Status":"passing","Notes":"","Output":"","ServiceID":"service1","ServiceName":"service1","ServiceTags":[],"Type":"","Definition":{},"CreateIndex":14,"ModifyIndex":14}]
Manually triggering snapshots with different indexes...
server3
server2
server1
Leaving and restarting servers...
server3
server2
server1
Checking Raft stats after restarts...
server1:
{"applied_index":"39","commit_index":"39","fsm_pending":"0","last_contact":"38.1564ms","last_log_index":"39","last_log_term":"3","last_snapshot_index":"16","last_snapshot_term":"2","latest_configuration":"[{Suffrage:Voter ID:9fdd52d7-930e-930a-6fd2-cb857558468b Address:127.0.0.1:8203} {Suffrage:Nonvoter ID:cc2d4d2f-7af7-c7c9-c140-aac43846a294 Address:127.0.0.1:8202} {Suffrage:Nonvoter ID:c7378a5f-b11f-d2e3-3831-00191cae5a9e Address:127.0.0.1:8201}]","latest_configuration_index":"0","num_peers":"0","protocol_version":"3","protocol_version_max":"3","protocol_version_min":"0","snapshot_version_max":"1","snapshot_version_min":"0","state":"Follower","term":"3"}
server2:
{"applied_index":"39","commit_index":"39","fsm_pending":"0","last_contact":"11.8772ms","last_log_index":"39","last_log_term":"3","last_snapshot_index":"15","last_snapshot_term":"2","latest_configuration":"[{Suffrage:Voter ID:9fdd52d7-930e-930a-6fd2-cb857558468b Address:127.0.0.1:8203} {Suffrage:Nonvoter ID:cc2d4d2f-7af7-c7c9-c140-aac43846a294 Address:127.0.0.1:8202} {Suffrage:Nonvoter ID:c7378a5f-b11f-d2e3-3831-00191cae5a9e Address:127.0.0.1:8201}]","latest_configuration_index":"0","num_peers":"0","protocol_version":"3","protocol_version_max":"3","protocol_version_min":"0","snapshot_version_max":"1","snapshot_version_min":"0","state":"Follower","term":"3"}
server3:
{"applied_index":"39","commit_index":"39","fsm_pending":"0","last_contact":"0","last_log_index":"39","last_log_term":"3","last_snapshot_index":"14","last_snapshot_term":"2","latest_configuration":"[{Suffrage:Voter ID:9fdd52d7-930e-930a-6fd2-cb857558468b Address:127.0.0.1:8203} {Suffrage:Nonvoter ID:cc2d4d2f-7af7-c7c9-c140-aac43846a294 Address:127.0.0.1:8202} {Suffrage:Nonvoter ID:c7378a5f-b11f-d2e3-3831-00191cae5a9e Address:127.0.0.1:8201}]","latest_configuration_index":"0","num_peers":"0","protocol_version":"3","protocol_version_max":"3","protocol_version_min":"0","snapshot_version_max":"1","snapshot_version_min":"0","state":"Leader","term":"3"}
Querying health check -- this should now be INconsistent across servers...
server1:
[{"Node":"node1","CheckID":"check1","Name":"check1","Status":"passing","Notes":"","Output":"","ServiceID":"service1","ServiceName":"service1","ServiceTags":[],"Type":"","Definition":{},"CreateIndex":16,"ModifyIndex":16}]
server2:
[{"Node":"node1","CheckID":"check1","Name":"check1","Status":"passing","Notes":"","Output":"","ServiceID":"service1","ServiceName":"service1","ServiceTags":[],"Type":"","Definition":{},"CreateIndex":15,"ModifyIndex":15}]
server3:
[{"Node":"node1","CheckID":"check1","Name":"check1","Status":"passing","Notes":"","Output":"","ServiceID":"service1","ServiceName":"service1","ServiceTags":[],"Type":"","Definition":{},"CreateIndex":14,"ModifyIndex":14}]
Performing CAS check update...
{"Results":[{"Check":{"Node":"node1","CheckID":"check1","Name":"check1","Status":"critical","Notes":"","Output":"","ServiceID":"service1","ServiceName":"service1","ServiceTags":null,"Type":"","Definition":{},"CreateIndex":14,"ModifyIndex":40}}],"Errors":null}
Querying health check -- note that status is only updated on leader...
server1:
[{"Node":"node1","CheckID":"check1","Name":"check1","Status":"passing","Notes":"","Output":"","ServiceID":"service1","ServiceName":"service1","ServiceTags":[],"Type":"","Definition":{},"CreateIndex":16,"ModifyIndex":16}]
server2:
[{"Node":"node1","CheckID":"check1","Name":"check1","Status":"passing","Notes":"","Output":"","ServiceID":"service1","ServiceName":"service1","ServiceTags":[],"Type":"","Definition":{},"CreateIndex":15,"ModifyIndex":15}]
server3:
[{"Node":"node1","CheckID":"check1","Name":"check1","Status":"critical","Notes":"","Output":"","ServiceID":"service1","ServiceName":"service1","ServiceTags":[],"Type":"","Definition":{},"CreateIndex":14,"ModifyIndex":40}]
(Added Raft stats after restarts)
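For reference, the CAS step above is a txn check operation of roughly this shape (a sketch reconstructed from the output; the script in the repo is authoritative). The outcome is consistent with the CAS index comparison being evaluated against each server's local copy of the check when its FSM applies the write, so the update only takes effect where the local ModifyIndex matches:
# CAS sketch: only servers whose local copy of check1 has ModifyIndex 14
# end up applying this update
curl -s -X PUT 'http://localhost:8500/v1/txn' -d '[
  {
    "Check": {
      "Verb": "cas",
      "Check": {
        "Node": "node1",
        "CheckID": "check1",
        "Name": "check1",
        "Status": "critical",
        "ModifyIndex": 14
      }
    }
  }
]'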
I'm working on a fix for this now - the reproduction case above was super helpful in finding the issue (checks/nodes/services being restored with the wrong index when restoring from a snapshot). I'll link the PR here when the fix is up.
Awesome, thanks for the update