Consul left with many failed nodes (former agents) over time in a cloud environment

Created on 16 Dec 2014 · 29 comments · Source: hashicorp/consul

I'm experiencing a situation where, over time, as boxes running agents are rebuilt and not always gracefully deregistered, I end up with dozens of failed nodes with 0 services. If I deregister them via the UI or the API, the nodes still eventually come back. I've found that force-leave works, but I have to issue it manually for each failed node. Having many failed nodes makes things messy, and I'd like to figure out how to keep this clean, since the failed state isn't offering any value here.
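The manual per-node `force-leave` step the reporter describes can be automated against the agent HTTP API: list members, pick out the ones in the Serf "failed" state, and force-leave each. This is a hedged sketch, not an official tool; the endpoint paths (`/v1/agent/members`, `/v1/agent/force-leave/:node`) and the Serf status code 4 for "failed" are taken from the Consul docs, while the address and script structure are assumptions.

```python
# Sketch: automate `consul force-leave` for every failed member via the
# local agent's HTTP API. Assumes a default agent at 127.0.0.1:8500.
import json
import urllib.request

CONSUL = "http://127.0.0.1:8500"  # assumption: local agent, default port
SERF_FAILED = 4                   # Serf member status code for "failed"

def failed_nodes(members):
    """Return names of members whose Serf status is 'failed' (4)."""
    return [m["Name"] for m in members if m.get("Status") == SERF_FAILED]

def reap_failed():
    """Force-leave every currently failed member (needs a live agent)."""
    with urllib.request.urlopen(f"{CONSUL}/v1/agent/members") as resp:
        members = json.load(resp)
    for name in failed_nodes(members):
        req = urllib.request.Request(
            f"{CONSUL}/v1/agent/force-leave/{name}", method="PUT")
        urllib.request.urlopen(req)

# reap_failed() would typically run from cron; note (per the discussion
# below) force-leave only moves nodes to "left" -- they still linger in
# `consul members` until reaped.
```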

Labels: bug


All 29 comments

@owaaa The nodes should automatically be reaped after 72 hours (not yet configurable, but soon). Otherwise, the best route is to issue a graceful leave (`consul leave`) before destroying the nodes, so that they can be reaped immediately. They are kept around that long because, without a graceful leave, Consul cannot distinguish between a temporary failure, an agent crash, a network partition, etc.

I had this same issue, I also tried issuing a consul force-leave but the nodes were still lingering.

@c4milo Were the nodes dead when you did a "force-leave"? Do they show up in "consul members" at all?

They were dead yes, they showed up as failed members.

@c4milo Hmm force-leave should push them into the "left" state. Nodes are not reaped until they are in failed for 72h or in the left state. Did "force-leave" not cause them to go to "left"?

Nop

Another interesting fact: the WAN pool didn't have the failed nodes.

@armon I just ran into this again, here is a video: https://asciinema.org/a/bd1apr97vc45f4syahet3dxig. Should I open a separate issue for this?

@c4milo Everything looks fine in that video, the nodes are in the left state. "force-leave" just moves a node from the "failed" -> "left" state. They are not removed from the members list for 24 or 72h.

I see. I'm going to need more sleep. Thanks @armon.

+1 for making the reap time configurable

+1 for making reap time configurable

Another +1 for making the reap time configurable

-1 if it makes the cluster unstable or it is prone to people having more issues than usual.

+1 for making reap time configurable

+1 for making reap time configurable

+1 for the reap time configurable pls!

+1 for the reap time configurable

+1 for configurable reap time

:+1:

:+1:

:+1:

+1 for the reap time configurable

While implementing #1935 which makes this configurable, and reviewing it with @sean- we realized that lowering this too much can be fairly dangerous for the case of Consul servers. If there's a partition that isolates a server and this is set low, the server could get kicked prematurely and would need to be re-joined or restarted in order to work again, whereas with the current default setting you'd have to have 72 hours elapse before that's a problem.

Are people mostly worried about clutter in consul members, or is there some other problem people are hoping will be fixed by making this configurable?

I think at the very least if we make it configurable we should set the minimum relatively high, ~8 hours or more.
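For readers arriving later: #1935 was merged, and in subsequent Consul releases the reap window became configurable via the agent options `reconnect_timeout` (LAN) and `reconnect_timeout_wan`, with the minimum enforced at 8 hours as suggested above. The option names are from my recollection of the released docs, so verify against your Consul version. A minimal agent-config sketch:

```json
{
  "reconnect_timeout": "8h",
  "reconnect_timeout_wan": "8h"
}
```

Lowering this carries the risk @slackpad describes: a server isolated by a partition for longer than the timeout gets kicked and must be re-joined or restarted.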

@slackpad For me, the problem this causes is that the /v1/catalog/service/:service endpoint still shows services from failed nodes. It would make more sense to me if, once a node has failed, its services were removed.

@lucaswxp usually clients use the https://www.consul.io/docs/agent/http/health.html#health_service endpoint to find healthy instances (there's a ?passing parameter that will filter to only healthy ones).
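To illustrate what the `?passing` parameter does, here is a client-side sketch of the same filter. The entry shape (`Node`/`Service`/`Checks`) follows the documented /v1/health/service/:service response; the sample data is invented for illustration.

```python
# Client-side equivalent of ?passing on /v1/health/service/<name>:
# keep only entries whose health checks are all in the "passing" state.

def passing_only(entries):
    """Keep health-endpoint entries whose checks are all 'passing'."""
    return [e for e in entries
            if all(c["Status"] == "passing" for c in e["Checks"])]

# Invented example payload: web-2's serfHealth check is critical because
# the node is failed, so the ?passing filter would drop it.
entries = [
    {"Node": {"Node": "web-1"}, "Service": {"ID": "api"},
     "Checks": [{"Name": "serfHealth", "Status": "passing"}]},
    {"Node": {"Node": "web-2"}, "Service": {"ID": "api"},
     "Checks": [{"Name": "serfHealth", "Status": "critical"}]},
]
healthy = passing_only(entries)  # only the web-1 entry remains
```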

@slackpad using the health API is not always convenient (the payloads are larger, we are not interested in node/check data, only the service entries), we prefer the /v1/catalog/service/:service API. The problem is that this returns service entries on failed nodes. Is there a way to filter this response to only return services running on known active (not failed) nodes? What is the rationale behind returning service entries for nodes known to be in a failed state?

Agreed with @MitchFierro. If the /v1/catalog/service/:service endpoint could also take a passing=true parameter, that would be preferable in the interest of a smaller response payload.
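Until the catalog endpoint grows such a parameter, the filtering has to happen client-side: cross-reference the catalog response with the set of failed nodes obtained separately (e.g. from /v1/agent/members, as sketched earlier in the thread). A hedged sketch with invented data; the `Node` field name follows the catalog API docs:

```python
# Workaround sketch: drop catalog service entries hosted on failed nodes.
# `failed_node_names` would come from a members/health query elsewhere.

def drop_failed(catalog_entries, failed_node_names):
    """Remove /v1/catalog/service/:service entries on failed nodes."""
    failed = set(failed_node_names)
    return [e for e in catalog_entries if e["Node"] not in failed]

# Invented example: web-2 is known-failed, so its entry is excluded.
catalog = [
    {"Node": "web-1", "ServiceID": "api", "ServicePort": 8080},
    {"Node": "web-2", "ServiceID": "api", "ServicePort": 8080},
]
active = drop_failed(catalog, ["web-2"])  # only the web-1 entry remains
```

The trade-off versus the health endpoint is an extra request for the failed-node set in exchange for the smaller per-entry catalog payload.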

