Consul: Healthchecks resynced every ~2mins, causes consul index to grow

Created on 11 Jul 2017 · 24 Comments · Source: hashicorp/consul

Hey,

This is very much related to #1244
I'm using blocking HTTP queries, and I've encountered a strange behavior where the Consul index is incremented even when health checks didn't change their status (or their output).
So I looked at the value of X-Consul-Index, searched for an offending health check that had that exact same number (via /health/checks/any), and expected to find that the check's state had changed.
But all I could find in the log is
https://gist.github.com/shays10/d0bdb43e0ae3664af1686aa139f18965

All checks are synced again, and this happens every 90-120 seconds. I'm guessing it happens when antiEntropy kicks in and finds that the service is synced but the checks are not.
I managed to find a correlation between the timestamp of the sync (according to the log) and the time where the ModifyIndex was bumped in all the checks.
Any idea on how to find the root cause? I'm having lots of checks on a lot of nodes and this causes the index to go wild.

I tried to make sure, as best I can, that there isn't a flaky health check that fails randomly or returns a different output.
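For reference, the churn described above shows up with a small blocking-query loop like the sketch below. The agent address, service name, and helper names are assumptions for illustration; the point is that each long poll returns an X-Consul-Index, and a watcher wakes up whenever that index moves, even if the check payload is semantically unchanged.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strconv"
)

// shouldFire reports whether a blocking query result represents an
// update worth acting on: the index moved. This is exactly why a
// growing index causes spurious wakeups for watchers.
func shouldFire(prev, cur uint64) bool {
	return cur != prev
}

// watchChecks long-polls /v1/health/checks/<service> on an assumed
// local agent and prints every index bump. Against the cluster in
// this issue, it would print every ~2 minutes even with stable checks.
func watchChecks(service string) error {
	var index uint64
	for {
		url := fmt.Sprintf(
			"http://127.0.0.1:8500/v1/health/checks/%s?index=%d&wait=5m",
			service, index)
		resp, err := http.Get(url)
		if err != nil {
			return err
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		cur, err := strconv.ParseUint(resp.Header.Get("X-Consul-Index"), 10, 64)
		if err != nil {
			return err
		}
		if shouldFire(index, cur) {
			fmt.Printf("index %d -> %d (%d bytes)\n", index, cur, len(body))
		}
		index = cur
	}
}

func main() {
	// "myservice" is a placeholder; requires a running local agent.
	if err := watchChecks("myservice"); err != nil {
		fmt.Println(err)
	}
}
```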

Client: 0.8.3
Server: 0.8.3

Client:

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 11
    services = 11
build:
    prerelease = 
    revision = ea2a82b
    version = 0.8.3
consul:
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 8
    goroutines = 60
    max_procs = 2
    os = linux
    version = go1.8.1
serf_lan:
    encrypted = false
    event_queue = 0
    event_time = 243
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1496
    members = 32
    query_queue = 0
    query_time = 1

Server:

agent:
    check_monitors = 3
    check_ttls = 0
    checks = 3
    services = 1
build:
    prerelease = 
    revision = ea2a82b
    version = 0.8.3
consul:
    bootstrap = false
    known_datacenters = 1
    leader = true
    leader_addr = 172.31.9.104:8300
    server = true
raft:
    applied_index = 2553095
    commit_index = 2553095
    fsm_pending = 0
    last_contact = 0
    last_log_index = 2553095
    last_log_term = 435
    last_snapshot_index = 2551334
    last_snapshot_term = 435
    latest_configuration = [{Suffrage:Voter ID:172.31.13.71:8300 Address:172.31.13.71:8300} {Suffrage:Voter ID:172.31.11.117:8300 Address:172.31.11.117:8300} {Suffrage:Voter ID:172.31.9.104:8300 Address:172.31.9.104:8300}]
    latest_configuration_index = 1180420
    num_peers = 2
    protocol_version = 2
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 435
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 247
    max_procs = 2
    os = linux
    version = go1.8.1
serf_lan:
    encrypted = false
    event_queue = 0
    event_time = 243
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1496
    members = 32
    query_queue = 0
    query_time = 1
serf_wan:
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 92
    members = 3
    query_queue = 0
    query_time = 1
Labels: health-checks, bug

All 24 comments

A quick update: I don't know what the root cause was, but the problem was resolved after we downgraded to 0.7.5.

@shays10 thanks for the update. This kind of thing is usually related to some changing output, but the downgrade fixing it doesn't really fit in with that explanation. Do you have ACLs enabled?

@slackpad Thanks for your reply! No, ACLs are currently disabled in our cluster

Hi @shays10

A quick update: I don't know what the root cause was, but the problem was resolved after we downgraded to 0.7.5.

Did you downgrade the clients, servers, or both?

Hey again! Both.

I've just faced the same issue with the latest Consul.
It seems that the index is incremented on every HTTP check run.
Are HTTP response headers considered part of the check output?

client

agent:
        check_monitors = 0
        check_ttls = 0
        checks = 1
        services = 2
build:
        prerelease =
        revision = 75ca2ca
        version = 0.9.2
consul:
        known_servers = 3
        server = false
runtime:
        arch = amd64
        cpu_count = 1
        goroutines = 44
        max_procs = 1
        os = linux
        version = go1.8.3
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 315
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 638
        members = 33
        query_queue = 0
        query_time = 35

server

agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease =
        revision = 75ca2ca
        version = 0.9.2
consul:
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = 172.31.24.147:8300
        server = true
raft:
        applied_index = 15877761
        commit_index = 15877761
        fsm_pending = 0
        last_contact = 22.483µs
        last_log_index = 15877761
        last_log_term = 784
        last_snapshot_index = 15873877
        last_snapshot_term = 784
        latest_configuration = [{Suffrage:Voter ID:11325065-d2e6-788a-6a23-7d7c0af1a7d9 Address:172.31.22.9:8300} {Suffrage:Voter ID:78ca8bae-0f85-70de-329e-bf947b058a54 Address:172.31.24.147:8300} {Suffrage:Voter ID:f1193568-725f-7048-05a4-12bbd1a1d1e8 Address:172.31.25.174:8300}]
        latest_configuration_index = 15617842
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 784
runtime:
        arch = amd64
        cpu_count = 1
        goroutines = 225
        max_procs = 1
        os = linux
        version = go1.8.3
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 315
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 638
        members = 33
        query_queue = 0
        query_time = 35
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 1
        members = 1
        query_queue = 0
        query_time = 1

No. https://github.com/hashicorp/consul/blob/master/agent/check.go#L400-L415
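The linked code records only the status line and the response body as the check's Output; headers never reach it. A minimal sketch of that formatting (hypothetical helper name; the format string matches the `Output` field visible later in this thread, not the exact source):

```go
package main

import "fmt"

// formatCheckOutput sketches how the agent builds an HTTP check's
// Output in agent/check.go: status line plus body. Response headers
// such as Date or x-request-id are never included, so they cannot by
// themselves bump a check's ModifyIndex.
func formatCheckOutput(url, status string, body []byte) string {
	return fmt.Sprintf("HTTP GET %s: %s Output: %s", url, status, body)
}

func main() {
	out := formatCheckOutput("http://172.31.17.136:3000/", "200 OK",
		[]byte(`{"status":"ok","message":"V2. Follow me, I know a shortcut"}`))
	fmt.Println(out)
}
```

Note that a timestamp or request ID in the *body*, by contrast, would change Output on every probe and cause exactly this kind of index churn.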

Have you got a timestamp in the body?

Have you got a timestamp in the body?

Nope.
Sometimes it's empty body and sometimes it's simple json (without any arrays).
Same response every time from the same service.

@ponimas Can you provide the check definition and a full check request/response for testing?

@magiconair

{
  "http": "http://172.31.17.136:3000/",
  "interval": "60s",
  "timeout": "1s"
}

As far as I understand, the request is a simple GET request,
and the response is:

curl -i http://172.31.17.136:3000/

HTTP/1.1 200 OK
Date: Tue, 05 Sep 2017 05:08:32 GMT
Content-Type: application/json; charset=utf-8
x-request-id: 15045881129851FD6B543
Content-Length: 60
Server: Jetty(9.2.z-SNAPSHOT)

{"status":"ok","message":"V2. Follow me, I know a shortcut"}

curl -i /v1/health/checks/susanin

HTTP/1.1 200 OK
Content-Type: application/json
Date: Tue, 05 Sep 2017 05:17:39 GMT
Transfer-Encoding: chunked
X-Consul-Index: 15897185
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0
[
...
{
        "CheckID": "service:registrator.susanin-prod-cc8b.vpc:susanin:3000",
        "CreateIndex": 15616870,
        "ModifyIndex": 15897062,
        "Name": "Service 'susanin' check",
        "Node": "susanin-prod-cc8b",
        "Notes": "",
        "Output": "HTTP GET http://172.31.17.136:3000/: 200 OK Output: {\"status\":\"ok\",\"message\":\"V2. Follow me, I know a shortcut\"}",
        "ServiceID": "registrator.susanin-prod-cc8b.vpc:susanin:3000",
        "ServiceName": "susanin",
        "ServiceTags": [
            "urlprefix-/internal/susanin strip=/internal/susanin"
        ],
        "Status": "passing"
    },
...
]

@magiconair is there any news on this issue?

Consul 0.9.2 - same behaviour. Is it possible that the Date header may trigger the index to grow?

PS C:\tools\curl> .\curl.exe -k --resolve ztest.service.consul:443:127.0.0.1 https://ztest.service.consul
Test consul site
PS C:\tools\curl> .\curl.exe -k -I --resolve ztest.service.consul:443:127.0.0.1 https://ztest.service.consul
HTTP/1.1 200 OK
Content-Length: 16
Content-Type: text/html
Last-Modified: Thu, 22 Jun 2017 14:10:40 GMT
Accept-Ranges: bytes
ETag: "b64af75361ebd21:0"
Server: Microsoft-IIS/8.5
X-Powered-By: ASP.NET
Date: Thu, 14 Sep 2017 11:50:27 GMT

PS C:\tools\curl> .\curl.exe -k -I --resolve ztest.service.consul:443:127.0.0.1 https://ztest.service.consul
HTTP/1.1 200 OK
Content-Length: 16
Content-Type: text/html
Last-Modified: Thu, 22 Jun 2017 14:10:40 GMT
Accept-Ranges: bytes
ETag: "b64af75361ebd21:0"
Server: Microsoft-IIS/8.5
X-Powered-By: ASP.NET
Date: Thu, 14 Sep 2017 11:50:37 GMT

I'm currently a bit sick and also preparing for HashiConf next week. Should be able to catch my breath after that.

@magiconair is that gonna be fixed in 1.0?

@ponimas If I can find the root cause then probably. However, usually this is triggered by the Output field being different. I'll try to have a look this week but it is a short week for me.

@ponimas does that x-request-id: 15045881129851FD6B543 field change on every GET? That would cause some churn if so.

@slackpad yup, it changes. But it's a header.
AFAIU, per https://github.com/hashicorp/consul/issues/3259#issuecomment-327017362, it shouldn't affect anything.

@ponimas ah ok - I thought you were actually running that curl command as the check.

@slackpad I think you should have a look at @yfouquet's patch.
In our cluster, the X-Consul-Index is increasing continuously while the state is not varying that much. We have lots of checks on services with tags, and we see a very large amount of sync traffic at the cluster level, probably for this reason.

Btw, our indexes are still growing insanely fast without any health check output. However, Consul watchers and consul-template do not trigger on them; in general they trigger only on actual changes. There must be some undocumented magic here, as I always thought the trigger event for such blocking queries was an X-Consul-Index change.
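One plausible way a watcher can stay quiet while the index churns is to dedup on the payload itself: wake up on every index bump, but only run the handler when the bytes actually differ. This is a hypothetical sketch of that idea, not consul-template's actual implementation:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// dedup fires only when the payload content changes, regardless of how
// often X-Consul-Index moves. Hypothetical helper, for illustration.
type dedup struct {
	last [32]byte
	seen bool
}

func (d *dedup) changed(payload []byte) bool {
	h := sha256.Sum256(payload)
	if d.seen && h == d.last {
		return false // index may have moved, but the data did not
	}
	d.last, d.seen = h, true
	return true
}

func main() {
	var d dedup
	fmt.Println(d.changed([]byte(`[{"Status":"passing"}]`))) // first result: true
	fmt.Println(d.changed([]byte(`[{"Status":"passing"}]`))) // same bytes: false
}
```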

Closing this as fixed by https://github.com/hashicorp/consul/pull/3642 which looks to be the root cause of this.

We've upgraded to consul 1.0.2 but we're seeing this issue with the nomad health check.

Almost every 2 minutes it does a synced check of the Nomad Client HTTP Check.
Running Nomad 0.7.1.

Jan 29 00:28:45 box consul[110606]: 2018/01/29 00:28:45 [INFO] agent: Synced check "413645baec3b17ac7b399423d71b16c9fe5b206e"
Jan 29 00:30:17 box consul[110606]: 2018/01/29 00:30:17 [INFO] agent: Synced check "413645baec3b17ac7b399423d71b16c9fe5b206e"
Jan 29 00:32:28 box consul[110606]: 2018/01/29 00:32:28 [INFO] agent: Synced check "413645baec3b17ac7b399423d71b16c9fe5b206e"
Jan 29 00:34:40 box consul[110606]: 2018/01/29 00:34:40 [INFO] agent: Synced check "413645baec3b17ac7b399423d71b16c9fe5b206e"
Jan 29 00:37:02 box consul[110606]: 2018/01/29 00:37:02 [INFO] agent: Synced check "413645baec3b17ac7b399423d71b16c9fe5b206e"
Jan 29 00:39:13 box consul[110606]: 2018/01/29 00:39:13 [INFO] agent: Synced check "413645baec3b17ac7b399423d71b16c9fe5b206e"

After a lot of debugging I found the cause (of my issue) and submitted a PR.
cc: @slackpad

