Hey,
This is very much related to #1244
I'm using blocking HTTP queries and I've encountered a strange behavior where the Consul index is incremented even though the health checks didn't change their status (or their output).
So I took the value of X-Consul-Index, searched for an offending health check with that exact same number (via /health/checks/any), and expected to find that the check's state had changed.
But all I could find in the log is this:
https://gist.github.com/shays10/d0bdb43e0ae3664af1686aa139f18965
All checks are synced again, and this happens every 90-120 seconds. I'm guessing it happens when antiEntropy kicks in and finds that the service is synced but the checks are not.
I managed to find a correlation between the timestamp of the sync (according to the log) and the time when the ModifyIndex was bumped on all the checks.
Any idea how to find the root cause? We have lots of checks on a lot of nodes, and this makes the index go wild.
I've done my best to make sure there isn't a flaky health check that fails randomly or returns different output.
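To illustrate what I mean, a minimal blocking-query loop like this (just a sketch against the raw HTTP API; the agent address and service name are placeholders) shows the index advancing even though the response body never changes:

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	// Placeholders: local agent address and a service name.
	const base = "http://127.0.0.1:8500/v1/health/checks/my-service"
	index := "0"
	for {
		// Block until X-Consul-Index changes, up to the wait limit.
		resp, err := http.Get(base + "?index=" + index + "&wait=5m")
		if err != nil {
			fmt.Println("query failed:", err)
			return
		}
		body, _ := ioutil.ReadAll(resp.Body)
		resp.Body.Close()

		next := resp.Header.Get("X-Consul-Index")
		// With the behavior described above, the index keeps moving
		// even though the body (status + output) is identical.
		fmt.Printf("index %s -> %s (%d bytes)\n", index, next, len(body))
		index = next
	}
}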
Client: 0.8.3
Server: 0.8.3
Client:
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 11
    services = 11
build:
    prerelease =
    revision = ea2a82b
    version = 0.8.3
consul:
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 8
    goroutines = 60
    max_procs = 2
    os = linux
    version = go1.8.1
serf_lan:
    encrypted = false
    event_queue = 0
    event_time = 243
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1496
    members = 32
    query_queue = 0
    query_time = 1
Server:
agent:
    check_monitors = 3
    check_ttls = 0
    checks = 3
    services = 1
build:
    prerelease =
    revision = ea2a82b
    version = 0.8.3
consul:
    bootstrap = false
    known_datacenters = 1
    leader = true
    leader_addr = 172.31.9.104:8300
    server = true
raft:
    applied_index = 2553095
    commit_index = 2553095
    fsm_pending = 0
    last_contact = 0
    last_log_index = 2553095
    last_log_term = 435
    last_snapshot_index = 2551334
    last_snapshot_term = 435
    latest_configuration = [{Suffrage:Voter ID:172.31.13.71:8300 Address:172.31.13.71:8300} {Suffrage:Voter ID:172.31.11.117:8300 Address:172.31.11.117:8300} {Suffrage:Voter ID:172.31.9.104:8300 Address:172.31.9.104:8300}]
    latest_configuration_index = 1180420
    num_peers = 2
    protocol_version = 2
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 435
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 247
    max_procs = 2
    os = linux
    version = go1.8.1
serf_lan:
    encrypted = false
    event_queue = 0
    event_time = 243
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1496
    members = 32
    query_queue = 0
    query_time = 1
serf_wan:
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 92
    members = 3
    query_queue = 0
    query_time = 1
A quick update: I don't know what the root cause was, but the problem was resolved after we downgraded to 0.7.5.
@shays10 thanks for the update. This kind of thing is usually related to some changing output, but the downgrade fixing it doesn't really fit in with that explanation. Do you have ACLs enabled?
@slackpad Thanks for your reply! No, ACLs are currently disabled in our cluster
Hi @shays10

> A quick update: I don't know what the root cause was, but the problem was resolved after we downgraded to 0.7.5.

Did you downgrade the clients, the servers, or both?
Hey again! Both.
I've just faced the same issue with the latest Consul.
It seems that the index is incremented on every HTTP check.
Are HTTP response headers considered part of the output?
client
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 1
    services = 2
build:
    prerelease =
    revision = 75ca2ca
    version = 0.9.2
consul:
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 44
    max_procs = 1
    os = linux
    version = go1.8.3
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 315
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 638
    members = 33
    query_queue = 0
    query_time = 35
server
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 75ca2ca
    version = 0.9.2
consul:
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = 172.31.24.147:8300
    server = true
raft:
    applied_index = 15877761
    commit_index = 15877761
    fsm_pending = 0
    last_contact = 22.483µs
    last_log_index = 15877761
    last_log_term = 784
    last_snapshot_index = 15873877
    last_snapshot_term = 784
    latest_configuration = [{Suffrage:Voter ID:11325065-d2e6-788a-6a23-7d7c0af1a7d9 Address:172.31.22.9:8300} {Suffrage:Voter ID:78ca8bae-0f85-70de-329e-bf947b058a54 Address:172.31.24.147:8300} {Suffrage:Voter ID:f1193568-725f-7048-05a4-12bbd1a1d1e8 Address:172.31.25.174:8300}]
    latest_configuration_index = 15617842
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 784
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 225
    max_procs = 1
    os = linux
    version = go1.8.3
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 315
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 638
    members = 33
    query_queue = 0
    query_time = 35
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1
No. https://github.com/hashicorp/consul/blob/master/agent/check.go#L400-L415
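In rough terms those lines do the following (a paraphrase, not the exact source): only the status line and a size-capped read of the body end up in the output, so headers like Date or x-request-id can't be the culprit.

package main

import (
	"fmt"
	"io"
	"io/ioutil"
	"net/http"
)

// checkOutput paraphrases the linked agent/check.go logic (not the exact
// source): output = "HTTP GET <url>: <status> Output: <body>", where the
// body read is size-capped. Response headers are never included.
func checkOutput(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	// The real check caps the body with a fixed-size buffer; a plain
	// LimitReader stands in for that here.
	body, err := ioutil.ReadAll(io.LimitReader(resp.Body, 4*1024))
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("HTTP GET %s: %s Output: %s", url, resp.Status, body), nil
}

func main() {
	out, err := checkOutput("http://172.31.17.136:3000/")
	if err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println(out)
}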
Have you got a timestamp in the body?
> Have you got a timestamp in the body?
Nope.
Sometimes it's an empty body and sometimes it's simple JSON (without any arrays).
Same response every time from the same service.
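A quick loop like this (a sketch; the URL is the service's health endpoint) is enough to confirm the body is byte-identical across requests:

package main

import (
	"crypto/sha256"
	"fmt"
	"io/ioutil"
	"net/http"
	"time"
)

// Hash a few consecutive responses to confirm the body never changes.
func main() {
	const url = "http://172.31.17.136:3000/"
	var last [32]byte
	for i := 0; i < 10; i++ {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println("request failed:", err)
			return
		}
		body, _ := ioutil.ReadAll(resp.Body)
		resp.Body.Close()

		sum := sha256.Sum256(body)
		if i > 0 && sum != last {
			fmt.Println("body changed between requests!")
			return
		}
		last = sum
		time.Sleep(time.Second)
	}
	fmt.Println("body identical across all requests")
}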
@ponimas Can you provide the check definition and a full check request/response for testing?
@magiconair
{
    "http": "http://172.31.17.136:3000/",
    "interval": "60s",
    "timeout": "1s"
}
As far as I understand, the request is a plain GET, and the response is:
curl -i http://172.31.17.136:3000/
HTTP/1.1 200 OK
Date: Tue, 05 Sep 2017 05:08:32 GMT
Content-Type: application/json; charset=utf-8
x-request-id: 15045881129851FD6B543
Content-Length: 60
Server: Jetty(9.2.z-SNAPSHOT)
{"status":"ok","message":"V2. Follow me, I know a shortcut"}
curl -i /v1/health/checks/susanin
HTTP/1.1 200 OK
Content-Type: application/json
Date: Tue, 05 Sep 2017 05:17:39 GMT
Transfer-Encoding: chunked
X-Consul-Index: 15897185
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0

[
  ...
  {
    "CheckID": "service:registrator.susanin-prod-cc8b.vpc:susanin:3000",
    "CreateIndex": 15616870,
    "ModifyIndex": 15897062,
    "Name": "Service 'susanin' check",
    "Node": "susanin-prod-cc8b",
    "Notes": "",
    "Output": "HTTP GET http://172.31.17.136:3000/: 200 OK Output: {\"status\":\"ok\",\"message\":\"V2. Follow me, I know a shortcut\"}",
    "ServiceID": "registrator.susanin-prod-cc8b.vpc:susanin:3000",
    "ServiceName": "susanin",
    "ServiceTags": [
      "urlprefix-/internal/susanin strip=/internal/susanin"
    ],
    "Status": "passing"
  },
  ...
]
@magiconair is there any news on this issue?
Consul 0.9.2, same behaviour. Is it possible that the Date header triggers the index to grow?
PS C:\tools\curl> .\curl.exe -k --resolve ztest.service.consul:443:127.0.0.1 https://ztest.service.consul
Test consul site

PS C:\tools\curl> .\curl.exe -k -I --resolve ztest.service.consul:443:127.0.0.1 https://ztest.service.consul
HTTP/1.1 200 OK
Content-Length: 16
Content-Type: text/html
Last-Modified: Thu, 22 Jun 2017 14:10:40 GMT
Accept-Ranges: bytes
ETag: "b64af75361ebd21:0"
Server: Microsoft-IIS/8.5
X-Powered-By: ASP.NET
Date: Thu, 14 Sep 2017 11:50:27 GMT

PS C:\tools\curl> .\curl.exe -k -I --resolve ztest.service.consul:443:127.0.0.1 https://ztest.service.consul
HTTP/1.1 200 OK
Content-Length: 16
Content-Type: text/html
Last-Modified: Thu, 22 Jun 2017 14:10:40 GMT
Accept-Ranges: bytes
ETag: "b64af75361ebd21:0"
Server: Microsoft-IIS/8.5
X-Powered-By: ASP.NET
Date: Thu, 14 Sep 2017 11:50:37 GMT
I'm currently a bit sick and also preparing for HashiConf next week. Should be able to catch my breath after that.
@magiconair is that gonna be fixed in 1.0?
@ponimas If I can find the root cause then probably. However, usually this is triggered by the Output field being different. I'll try to have a look this week but it is a short week for me.
@ponimas does that x-request-id: 15045881129851FD6B543 field change on every GET? That would cause some churn if so.
@slackpad Yup, it changes. But it's a header.
AFAIU (https://github.com/hashicorp/consul/issues/3259#issuecomment-327017362) it shouldn't affect anything.
@ponimas ah ok - I thought you were actually running that curl command as the check.
@slackpad I think you should have a look at @yfouquet's patch.
In our cluster, the X-Consul-Index is increasing continuously while the state is barely varying. We have lots of checks on services with tags, and we see a very large amount of syncing at the cluster level, probably for this reason.
BTW, our indexes are still growing insanely fast without any health check output. However, Consul watches and consul-template do not trigger on them; in general they trigger only on actual changes. There must be some undocumented magic here, as I always thought the trigger event for such blocking queries was an X-Consul-Index change.
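If I had to guess at that magic: the watcher presumably compares the decoded payload itself and only fires when it actually differs, so pure index churn is swallowed. Something along these lines (my speculation, not the actual consul-template code; the service name is a placeholder):

package main

import (
	"crypto/sha256"
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	const base = "http://127.0.0.1:8500/v1/health/checks/my-service"
	index := "0"
	var lastSum [32]byte
	for {
		// Block on the index as usual...
		resp, err := http.Get(base + "?index=" + index + "&wait=5m")
		if err != nil {
			fmt.Println("query failed:", err)
			return
		}
		body, _ := ioutil.ReadAll(resp.Body)
		resp.Body.Close()
		index = resp.Header.Get("X-Consul-Index")

		// ...but only fire when the payload itself changed.
		sum := sha256.Sum256(body)
		if sum == lastSum {
			continue // index moved, data didn't: no trigger
		}
		lastSum = sum
		fmt.Println("data actually changed, triggering") // e.g. re-render a template
	}
}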
Might be related as well: https://github.com/hashicorp/consul/issues/3712
Closing this as fixed by https://github.com/hashicorp/consul/pull/3642, which looks to address the root cause of this.
We've upgraded to Consul 1.0.2, but we're still seeing this issue with the Nomad health check.
Almost every 2 minutes the agent syncs the Nomad Client HTTP Check again.
Running Nomad 0.7.1:
Jan 29 00:28:45 box consul[110606]: 2018/01/29 00:28:45 [INFO] agent: Synced check "413645baec3b17ac7b399423d71b16c9fe5b206e"
Jan 29 00:30:17 box consul[110606]: 2018/01/29 00:30:17 [INFO] agent: Synced check "413645baec3b17ac7b399423d71b16c9fe5b206e"
Jan 29 00:32:28 box consul[110606]: 2018/01/29 00:32:28 [INFO] agent: Synced check "413645baec3b17ac7b399423d71b16c9fe5b206e"
Jan 29 00:34:40 box consul[110606]: 2018/01/29 00:34:40 [INFO] agent: Synced check "413645baec3b17ac7b399423d71b16c9fe5b206e"
Jan 29 00:37:02 box consul[110606]: 2018/01/29 00:37:02 [INFO] agent: Synced check "413645baec3b17ac7b399423d71b16c9fe5b206e"
Jan 29 00:39:13 box consul[110606]: 2018/01/29 00:39:13 [INFO] agent: Synced check "413645baec3b17ac7b399423d71b16c9fe5b206e"
After a lot of debugging I found the cause (of my issue, at least) and submitted a PR.
cc: @slackpad