Consul 0.4.1, ~400 nodes.
AWS had a hardware failure a couple of days ago and one of the nodes went sideways. It's marked as failed:
darron@host:~$ consul members | grep i-fbb98c17
i-fbb98c17 10.187.61.162:8301 failed client 0.4.1 2
But it's still showing up in the service catalog because the health checks haven't failed:

[screenshot: the UI shows the bunk service on i-fbb98c17 as green/passing]
The "bunk" service is defined as:
{
  "service": {
    "name": "bunk",
    "port": 80,
    "tags": [
      "bunk",
      "az:us-east-1b"
    ],
    "check": {
      "interval": "2s",
      "script": "/bin/nc -z localhost 80"
    }
  }
}
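Since the check is just a script run on an interval, you can reproduce what the agent would observe by running it by hand (a quick sketch, relying on Consul's documented exit-code mapping for script checks: 0 = passing, 1 = warning, anything else = critical):

# Run the check script manually and inspect its exit code -
# this is all the agent looks at when it runs the check.
/bin/nc -z localhost 80; echo $?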
But that check can't have run in more than 24 hours - there's almost 24 hours of this in the logs:
Jan 26 14:43:35 i-05e8deef consul: serf: attempting reconnect to i-fbb98c17 10.187.61.162:8301
Jan 26 14:43:43 i-ad4cca43 consul: serf: attempting reconnect to i-fbb98c17 10.187.61.162:8301
Jan 26 14:43:56 i-ad840a41 consul: serf: attempting reconnect to i-fbb98c17 10.187.61.162:8301
Jan 26 14:44:03 i-002837e1 consul: serf: attempting reconnect to i-fbb98c17 10.187.61.162:8301
Jan 26 14:44:15 i-e7b50d0b consul: serf: attempting reconnect to i-fbb98c17 10.187.61.162:8301
Is it expected that a service check that can't have run for more than 24 hours on a dead node still shows green?
Hey @darron, this actually seems correct. If the node has gone offline, it is likely not running the checks any more, and definitely not submitting results to the catalog. Since the check is a script + interval type, the only thing that marks it as failed is a bad return code observed when the agent runs the script and submits the result to the catalog. I can see how this could be a little confusing when looking at the UI.
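If the agent were still alive, you could confirm what it last reported for this check via its local API; a minimal sketch (the jq filter is just illustrative):

# Ask the local agent for its view of its registered checks,
# then pull out the status of the bunk service check.
curl -s http://127.0.0.1:8500/v1/agent/checks | jq '.["service:bunk"].Status'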
As a side note, the node in the output you've provided should actually be excluded from any queries for the bunk service with passing filters enabled. The reason is that serfHealth is failing, and it is attached directly to the node. If any node-level health check goes into a failed state, all services from that node are excluded from the typical DNS or API (/v1/health/service/bunk?passing) interfaces.
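For example, something like this should exclude the node on both interfaces (assuming the default DNS port of 8600):

# Health API with the passing filter - nodes with any failing check are dropped.
curl -s "http://127.0.0.1:8500/v1/health/service/bunk?passing" | jq -c '.[]'

# DNS interface - failing nodes are filtered out of the answer as well.
dig @127.0.0.1 -p 8600 bunk.service.consul +short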
OK - it does show up as critical using that API:
darron@host:~$ curl -s "http://127.0.0.1:8500/v1/health/service/bunk" | jq -c '.[]'
{"Node":{"Node":"i-fbb98c17","Address":"10.187.61.162"},"Service":{"ID":"bunk","Service":"bunk","Tags":["bunk","az:us-east-1b"],"Port":80},"Checks":[{"Node":"i-fbb98c17","CheckID":"service:bunk","Name":"Service 'bunk' check","Status":"passing","Notes":"","Output":"","ServiceID":"bunk","ServiceName":"bunk"},{"Node":"i-fbb98c17","CheckID":"serfHealth","Name":"Serf Health Status","Status":"critical","Notes":"","Output":"Agent not live or unreachable","ServiceID":"","ServiceName":""}]}
And it doesn't show up when you add ?passing - thanks for the tip.
I hadn't expected that this URL would still show failing items:
curl -s "http://127.0.0.1:8500/v1/catalog/service/bunk" | jq -c '.[]'
But it does make sense now that I understand it better.
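As a rough client-side equivalent of the server-side passing filter (assuming a jq version with all/2), something like this drops entries with any non-passing check:

# Keep only entries where every check (including serfHealth) is passing.
curl -s "http://127.0.0.1:8500/v1/health/service/bunk" \
  | jq -c '.[] | select(all(.Checks[]; .Status == "passing"))'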
Having it still appear green in the UI threw me a bit. When we have removed other nodes, they have all left the cluster properly, but this one stuck around because of how it degraded.
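For what it's worth, if you're sure a node is gone for good, it looks like consul force-leave can push it out of the failed state instead of waiting for it to be reaped (node name taken from the output above):

# Transition the dead node from "failed" to "left" so it stops lingering.
consul force-leave i-fbb98c17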
Thanks for the explanation!
Hey guys, I'm running into the same issue, but in my case I don't have health checks enabled. Shouldn't the failed node's services be removed when the node's agent itself isn't alive?
+1