We register external services in Consul and use Consul ESM to perform their health checks. We also use prepared queries together with the DNS interface to fail over to another datacenter when a service becomes unhealthy in the local one. The problem we're observing is that when Consul ESM marks a service as unhealthy, sometimes only the leader reflects the update, so stale queries against follower servers incorrectly keep returning the unhealthy endpoint. It's not a matter of being slow to update; the followers simply never update.
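For context, the failover setup looks roughly like the sketch below; the service name and NearestN value are placeholders rather than our exact config:
# Sketch of the failover setup (service name and NearestN are placeholders):
# create a prepared query that fails over to the nearest other datacenter
curl -s -X POST 'http://localhost:8500/v1/query' -d '{
  "Name": "foo",
  "Service": {
    "Service": "foo",
    "OnlyPassing": true,
    "Failover": { "NearestN": 1 }
  }
}'
# resolve it through the DNS interface; when the local instances are
# unhealthy, the answers should come from the failover datacenter instead
dig @127.0.0.1 -p 8600 foo.query.consul +short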
We believe we have traced this to inconsistent CreateIndex and ModifyIndex values for the health checks across the server nodes:
➜ ~ pdsh -w 99com82,99com103,99com122 "curl -s 'http://localhost:8500/v1/health/checks/foo?stale' | jq -c '.[] | [.CreateIndex,.ModifyIndex]'"
99com103: [1573507,1573507]
99com82: [1491640,1491640]
99com122: [1491639,2428888]
➜ ~
Consistent queries are forwarded to the leader and return the expected results:
➜ ~ pdsh -w 99com82,99com103,99com122 "curl -s 'http://localhost:8500/v1/health/checks/foo' | jq -c '.[] | [.CreateIndex,.ModifyIndex]'"
99com103: [1491639,2428888]
99com82: [1491639,2428888]
99com122: [1491639,2428888]
➜ ~
Meanwhile, the servers all agree on the current index value:
➜ ~ pdsh -w 99com82,99com103,99com122 "curl -i -s 'http://localhost:8500/v1/health/checks/foo?stale' | grep X-Consul-Index"
99com103: X-Consul-Index: 2443513
99com122: X-Consul-Index: 2443513
99com82: X-Consul-Index: 2443513
➜ ~
We don't understand how these services got into this state, but they never recover from it. Newly registered services do not exhibit the behavior, and deregistering and re-registering an impacted service fixes that service.
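For anyone else hitting this, the per-service workaround is just a catalog deregister followed by a re-register, roughly like this (node name, address, and port are placeholders; add an X-Consul-Token header if ACLs require it):
# Workaround sketch (node/address/port are placeholders)
curl -s -X PUT 'http://localhost:8500/v1/catalog/deregister' -d '{
  "Node": "foo-node",
  "ServiceID": "foo"
}'
curl -s -X PUT 'http://localhost:8500/v1/catalog/register' -d '{
  "Node": "foo-node",
  "Address": "10.0.0.10",
  "Service": { "ID": "foo", "Service": "foo", "Port": 80 }
}'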
We don't know how to reproduce this scenario, but we have detected the problem in 3 separate clusters. Given a 3-node cluster, we can find inconsistencies with the Prometheus query (sum(consul_catalog_service_node_healthy) by (cluster,service_id) > 0) != 3.
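The same detection can be scripted against the Prometheus HTTP API, e.g. (the prometheus:9090 address is a placeholder):
# List the cluster/service pairs where the servers disagree on health
curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=(sum(consul_catalog_service_node_healthy) by (cluster,service_id) > 0) != 3' \
  | jq -r '.data.result[].metric | "\(.cluster) \(.service_id)"'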
This example is from a 1.7.4 cluster, but we're observing the same behavior in a 1.8.0 cluster as well.
Server info
cbroglie@99com82:~$ consul info
agent:
check_monitors = 0
check_ttls = 1
checks = 1
services = 1
build:
prerelease =
revision = d149d7e9
version = 1.7.4+1
consul:
acl = enabled
bootstrap = false
known_datacenters = 2
leader = false
leader_addr = [2a06:98c0:1400:100::c629:990d]:8300
server = true
raft:
applied_index = 2443786
commit_index = 2443786
fsm_pending = 0
last_contact = 53.901869ms
last_log_index = 2443786
last_log_term = 74
last_snapshot_index = 2436056
last_snapshot_term = 74
latest_configuration = [{Suffrage:Voter ID:432355ab-4fa6-502d-8837-3c1bf3719ff2 Address:[2a06:98c0:1400:100::c629:990d]:8300} {Suffrage:Voter ID:89b25347-d6b8-53de-b7d1-b96e7e0b4183 Address:[2a06:98c0:1400:100::c629:98fa]:8300} {Suffrage:Voter ID:1b23682c-c1b0-5067-bfd8-4aab2efe1d5d Address:[2a06:98c0:1400:100::c629:98e5]:8300}]
latest_configuration_index = 0
num_peers = 2
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 74
runtime:
arch = amd64
cpu_count = 96
goroutines = 1112
max_procs = 96
os = linux
version = go1.13.12
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 16
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1654
members = 311
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 116
members = 8
query_queue = 0
query_time = 1
cbroglie@99com82:~$
cbroglie@99com103:~$ consul info
agent:
check_monitors = 0
check_ttls = 1
checks = 1
services = 1
build:
prerelease =
revision = d149d7e9
version = 1.7.4+1
consul:
acl = enabled
bootstrap = false
known_datacenters = 2
leader = false
leader_addr = [2a06:98c0:1400:100::c629:990d]:8300
server = true
raft:
applied_index = 2444037
commit_index = 2444037
fsm_pending = 0
last_contact = 5.948419ms
last_log_index = 2444037
last_log_term = 74
last_snapshot_index = 2436056
last_snapshot_term = 74
latest_configuration = [{Suffrage:Voter ID:432355ab-4fa6-502d-8837-3c1bf3719ff2 Address:[2a06:98c0:1400:100::c629:990d]:8300} {Suffrage:Voter ID:89b25347-d6b8-53de-b7d1-b96e7e0b4183 Address:[2a06:98c0:1400:100::c629:98fa]:8300} {Suffrage:Voter ID:1b23682c-c1b0-5067-bfd8-4aab2efe1d5d Address:[2a06:98c0:1400:100::c629:98e5]:8300}]
latest_configuration_index = 0
num_peers = 2
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 74
runtime:
arch = amd64
cpu_count = 96
goroutines = 1102
max_procs = 96
os = linux
version = go1.13.12
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 16
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1654
members = 311
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 116
members = 8
query_queue = 0
query_time = 1
cbroglie@99com103:~$
cbroglie@99com122:~$ consul info
agent:
check_monitors = 0
check_ttls = 1
checks = 1
services = 1
build:
prerelease =
revision = d149d7e9
version = 1.7.4+1
consul:
acl = enabled
bootstrap = false
known_datacenters = 2
leader = true
leader_addr = [2a06:98c0:1400:100::c629:990d]:8300
server = true
raft:
applied_index = 2444054
commit_index = 2444054
fsm_pending = 0
last_contact = 0
last_log_index = 2444054
last_log_term = 74
last_snapshot_index = 2436054
last_snapshot_term = 74
latest_configuration = [{Suffrage:Voter ID:432355ab-4fa6-502d-8837-3c1bf3719ff2 Address:[2a06:98c0:1400:100::c629:990d]:8300} {Suffrage:Voter ID:89b25347-d6b8-53de-b7d1-b96e7e0b4183 Address:[2a06:98c0:1400:100::c629:98fa]:8300} {Suffrage:Voter ID:1b23682c-c1b0-5067-bfd8-4aab2efe1d5d Address:[2a06:98c0:1400:100::c629:98e5]:8300}]
latest_configuration_index = 0
num_peers = 2
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Leader
term = 74
runtime:
arch = amd64
cpu_count = 96
goroutines = 1235
max_procs = 96
os = linux
version = go1.13.12
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 16
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1654
members = 311
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 116
members = 8
query_queue = 0
query_time = 1
cbroglie@99com122:~$
Consul ESM logs when the check transitions to critical:
Jul 23 09:40:14 99com81 consul-esm[60990]: 2020/07/23 09:40:14 [WARN] agent: Check 'node/foo/http-check' is now critical
Jul 23 09:40:14 99com81 consul-esm[60990]: 2020/07/23 09:40:14 [TRACE] Registered check status to the catalog with ID http-check
And the corresponding Consul agent logs showing the txn write:
2020-07-23T09:40:14.117Z [DEBUG] agent.tlsutil: IncomingRPCConfig: version=2
2020-07-23T09:40:14.117Z [DEBUG] agent.tlsutil: IncomingRPCConfig: version=2
2020-07-23T09:40:14.133Z [DEBUG] agent.http: Request finished: method=GET url=/v1/health/node/foo?dc=bar from=127.0.0.1:65410 latency=1.061054ms
2020-07-23T09:40:14.140Z [DEBUG] agent.http: Request finished: method=PUT url=/v1/txn?dc=bar from=127.0.0.1:65410 latency=6.606603ms
After this, only the leader reflects the update, yet we don't see any errors in the followers' logs.
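For reference, the check update behind that PUT /v1/txn is a txn check operation of roughly this shape (the exact payload ESM sends is our assumption based on the logs, not a capture):
# Assumed shape of ESM's catalog update via the txn endpoint
curl -s -X PUT 'http://localhost:8500/v1/txn?dc=bar' -d '[
  {
    "Check": {
      "Verb": "set",
      "Check": {
        "Node": "foo",
        "CheckID": "http-check",
        "Name": "http-check",
        "Status": "critical"
      }
    }
  }
]'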
@cbroglie Probably a duplicate of https://github.com/hashicorp/consul/issues/6181. We looked carefully but never found the underlying issue (the good news is that it does not occur often: we hit it only once across 10+ clusters that had been running for years without ever being shut down).
We had this issue once on an old cluster, and it never came back. To fix it, we proceeded as described here: https://github.com/hashicorp/consul/issues/7939#issuecomment-632071578
Thanks @pierresouchay, but I don't think this is a duplicate of #6181, as we don't see anything similar in the logs, and we're observing the issue in all 3 of our clusters.
Adding some more details we uncovered: on a given server, every service in this broken state has the exact same CreateIndex value, but that value differs from server to server. The original example was:
➜ ~ pdsh -w 99com82,99com103,99com122 "curl -s 'http://localhost:8500/v1/health/checks/foo?stale' | jq -c '.[] | [.CreateIndex,.ModifyIndex]'"
99com103: [1573507,1573507]
99com82: [1491640,1491640]
99com122: [1491639,2428888]
➜ ~
And we see those same 3 CreateIndex values (1573507, 1491640, 1491639) for every broken service, e.g.:
➜ ~ pdsh -w 99com82,99com103,99com122 "curl -s 'http://localhost:8500/v1/health/checks/abc?stale' | jq -c '.[] | [.CreateIndex,.ModifyIndex]'"
99com82: [1491640,1491640]
99com122: [1491639,1568724]
99com103: [1573507,1573507]
➜ ~ pdsh -w 99com82,99com103,99com122 "curl -s 'http://localhost:8500/v1/health/checks/xyz?stale' | jq -c '.[] | [.CreateIndex,.ModifyIndex]'"
99com103: [1573507,1573507]
99com82: [1491640,1491640]
99com122: [1491639,2360138]
➜ ~
We've managed to reproduce the issue in dev: https://github.com/szechuen/consul-8377
Waiting for cluster to be ready...
Creating health check...
true
Querying health check -- this should be consistent across servers...
server1:
[{"Node":"node1","CheckID":"check1","Name":"check1","Status":"passing","Notes":"","Output":"","ServiceID":"service1","ServiceName":"service1","ServiceTags":[],"Type":"","Definition":{},"CreateIndex":14,"ModifyIndex":14}]
server2:
[{"Node":"node1","CheckID":"check1","Name":"check1","Status":"passing","Notes":"","Output":"","ServiceID":"service1","ServiceName":"service1","ServiceTags":[],"Type":"","Definition":{},"CreateIndex":14,"ModifyIndex":14}]
server3:
[{"Node":"node1","CheckID":"check1","Name":"check1","Status":"passing","Notes":"","Output":"","ServiceID":"service1","ServiceName":"service1","ServiceTags":[],"Type":"","Definition":{},"CreateIndex":14,"ModifyIndex":14}]
Manually triggering snapshots with different indexes...
server3
server2
server1
Leaving and restarting servers...
server3
server2
server1
Checking Raft stats after restarts...
server1:
{"applied_index":"39","commit_index":"39","fsm_pending":"0","last_contact":"38.1564ms","last_log_index":"39","last_log_term":"3","last_snapshot_index":"16","last_snapshot_term":"2","latest_configuration":"[{Suffrage:Voter ID:9fdd52d7-930e-930a-6fd2-cb857558468b Address:127.0.0.1:8203} {Suffrage:Nonvoter ID:cc2d4d2f-7af7-c7c9-c140-aac43846a294 Address:127.0.0.1:8202} {Suffrage:Nonvoter ID:c7378a5f-b11f-d2e3-3831-00191cae5a9e Address:127.0.0.1:8201}]","latest_configuration_index":"0","num_peers":"0","protocol_version":"3","protocol_version_max":"3","protocol_version_min":"0","snapshot_version_max":"1","snapshot_version_min":"0","state":"Follower","term":"3"}
server2:
{"applied_index":"39","commit_index":"39","fsm_pending":"0","last_contact":"11.8772ms","last_log_index":"39","last_log_term":"3","last_snapshot_index":"15","last_snapshot_term":"2","latest_configuration":"[{Suffrage:Voter ID:9fdd52d7-930e-930a-6fd2-cb857558468b Address:127.0.0.1:8203} {Suffrage:Nonvoter ID:cc2d4d2f-7af7-c7c9-c140-aac43846a294 Address:127.0.0.1:8202} {Suffrage:Nonvoter ID:c7378a5f-b11f-d2e3-3831-00191cae5a9e Address:127.0.0.1:8201}]","latest_configuration_index":"0","num_peers":"0","protocol_version":"3","protocol_version_max":"3","protocol_version_min":"0","snapshot_version_max":"1","snapshot_version_min":"0","state":"Follower","term":"3"}
server3:
{"applied_index":"39","commit_index":"39","fsm_pending":"0","last_contact":"0","last_log_index":"39","last_log_term":"3","last_snapshot_index":"14","last_snapshot_term":"2","latest_configuration":"[{Suffrage:Voter ID:9fdd52d7-930e-930a-6fd2-cb857558468b Address:127.0.0.1:8203} {Suffrage:Nonvoter ID:cc2d4d2f-7af7-c7c9-c140-aac43846a294 Address:127.0.0.1:8202} {Suffrage:Nonvoter ID:c7378a5f-b11f-d2e3-3831-00191cae5a9e Address:127.0.0.1:8201}]","latest_configuration_index":"0","num_peers":"0","protocol_version":"3","protocol_version_max":"3","protocol_version_min":"0","snapshot_version_max":"1","snapshot_version_min":"0","state":"Leader","term":"3"}
Querying health check -- this should now be INconsistent across servers...
server1:
[{"Node":"node1","CheckID":"check1","Name":"check1","Status":"passing","Notes":"","Output":"","ServiceID":"service1","ServiceName":"service1","ServiceTags":[],"Type":"","Definition":{},"CreateIndex":16,"ModifyIndex":16}]
server2:
[{"Node":"node1","CheckID":"check1","Name":"check1","Status":"passing","Notes":"","Output":"","ServiceID":"service1","ServiceName":"service1","ServiceTags":[],"Type":"","Definition":{},"CreateIndex":15,"ModifyIndex":15}]
server3:
[{"Node":"node1","CheckID":"check1","Name":"check1","Status":"passing","Notes":"","Output":"","ServiceID":"service1","ServiceName":"service1","ServiceTags":[],"Type":"","Definition":{},"CreateIndex":14,"ModifyIndex":14}]
Performing CAS check update...
{"Results":[{"Check":{"Node":"node1","CheckID":"check1","Name":"check1","Status":"critical","Notes":"","Output":"","ServiceID":"service1","ServiceName":"service1","ServiceTags":null,"Type":"","Definition":{},"CreateIndex":14,"ModifyIndex":40}}],"Errors":null}
Querying health check -- note that status is only updated on leader...
server1:
[{"Node":"node1","CheckID":"check1","Name":"check1","Status":"passing","Notes":"","Output":"","ServiceID":"service1","ServiceName":"service1","ServiceTags":[],"Type":"","Definition":{},"CreateIndex":16,"ModifyIndex":16}]
server2:
[{"Node":"node1","CheckID":"check1","Name":"check1","Status":"passing","Notes":"","Output":"","ServiceID":"service1","ServiceName":"service1","ServiceTags":[],"Type":"","Definition":{},"CreateIndex":15,"ModifyIndex":15}]
server3:
[{"Node":"node1","CheckID":"check1","Name":"check1","Status":"critical","Notes":"","Output":"","ServiceID":"service1","ServiceName":"service1","ServiceTags":[],"Type":"","Definition":{},"CreateIndex":14,"ModifyIndex":40}]
(Added Raft stats after restarts)
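For reference, the CAS step above is a txn check operation of roughly this shape (a sketch reconstructed from the output; the script in the repo is authoritative). The outcome is consistent with the CAS index comparison being evaluated against each server's local copy of the check when its FSM applies the write, so the update only takes effect where the local ModifyIndex matches:
# CAS sketch: only servers whose local copy of check1 has ModifyIndex 14
# end up applying this update
curl -s -X PUT 'http://localhost:8500/v1/txn' -d '[
  {
    "Check": {
      "Verb": "cas",
      "Check": {
        "Node": "node1",
        "CheckID": "check1",
        "Name": "check1",
        "Status": "critical",
        "ModifyIndex": 14
      }
    }
  }
]'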
I'm working on a fix for this now - the reproduction case above was super helpful in finding the issue (checks/nodes/services being restored with the wrong index when restoring from a snapshot). I'll link the PR here when the fix is up.
Awesome, thanks for the update