Consul left with many failed nodes (former agents) over time in a cloud environment

Created on 16 Dec 2014 · 29 comments · Source: hashicorp/consul

I'm experiencing a situation where, over time, as boxes running agents are rebuilt and not always gracefully deregistered, I end up with dozens of failed nodes with 0 services. If I deregister them via the UI or the API, the nodes still eventually come back. I've found that force-leave works, but I have to issue it manually for each failed node. Having many failed nodes makes things messy, and I'd like to figure out how to keep this clean, since the failed state isn't offering any value here.
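The manual per-node `force-leave` step the reporter describes can be automated against the agent HTTP API: list members, pick out the ones in the Serf "failed" state, and force-leave each. This is a hedged sketch, not an official tool; the endpoint paths (`/v1/agent/members`, `/v1/agent/force-leave/:node`) and the Serf status code 4 for "failed" are taken from the Consul docs, while the address and script structure are assumptions.

```python
# Sketch: automate `consul force-leave` for every failed member via the
# local agent's HTTP API. Assumes a default agent at 127.0.0.1:8500.
import json
import urllib.request

CONSUL = "http://127.0.0.1:8500"  # assumption: local agent, default port
SERF_FAILED = 4                   # Serf member status code for "failed"

def failed_nodes(members):
    """Return names of members whose Serf status is 'failed' (4)."""
    return [m["Name"] for m in members if m.get("Status") == SERF_FAILED]

def reap_failed():
    """Force-leave every currently failed member (needs a live agent)."""
    with urllib.request.urlopen(f"{CONSUL}/v1/agent/members") as resp:
        members = json.load(resp)
    for name in failed_nodes(members):
        req = urllib.request.Request(
            f"{CONSUL}/v1/agent/force-leave/{name}", method="PUT")
        urllib.request.urlopen(req)

# reap_failed() would typically run from cron; note (per the discussion
# below) force-leave only moves nodes to "left" -- they still linger in
# `consul members` until reaped.
```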

Labels: bug


All 29 comments

@owaaa The nodes should automatically be reaped after 72 hours (not yet configurable, but soon). Otherwise, the best route is to issue a graceful leave (`consul leave`) before destroying the nodes, so that they can be reaped immediately. They are kept around that long because, without a graceful leave, Consul cannot distinguish between a temporary failure, an agent crash, a network partition, etc.

I had this same issue, I also tried issuing a consul force-leave but the nodes were still lingering.

@c4milo Were the nodes dead when you did a "force-leave"? Do they show up in "consul members" at all?

They were dead yes, they showed up as failed members.

@c4milo Hmm force-leave should push them into the "left" state. Nodes are not reaped until they are in failed for 72h or in the left state. Did "force-leave" not cause them to go to "left"?

Nop

Another interesting fact: the WAN pool didn't have the failed nodes.

@armon I just ran into this again, here is a video: https://asciinema.org/a/bd1apr97vc45f4syahet3dxig. Should I open a separate issue for this?

@c4milo Everything looks fine in that video, the nodes are in the left state. "force-leave" just moves a node from the "failed" -> "left" state. They are not removed from the members list for 24 or 72h.

I see. I'm going to need more sleep. Thanks @armon.

+1 for making the reap time configurable

+1 for making reap time configurable

Another +1 for making the reap time configurable

-1 if it makes the cluster unstable or it is prone to people having more issues than usual.

+1 for making reap time configurable

+1 for making reap time configurable

+1 for the reap time configurable pls!

+1 for the reap time configurable

+1 for configurable reap time

:+1:

:+1:

:+1:

+1 for the reap time configurable

While implementing #1935 which makes this configurable, and reviewing it with @sean- we realized that lowering this too much can be fairly dangerous for the case of Consul servers. If there's a partition that isolates a server and this is set low, the server could get kicked prematurely and would need to be re-joined or restarted in order to work again, whereas with the current default setting you'd have to have 72 hours elapse before that's a problem.

Are people mostly worried about clutter in consul members, or is there some other problem people are hoping will be fixed by making this configurable?

I think at the very least if we make it configurable we should set the minimum relatively high, ~8 hours or more.
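For readers arriving later: #1935 was merged, and in subsequent Consul releases the reap window became configurable via the agent options `reconnect_timeout` (LAN) and `reconnect_timeout_wan`, with the minimum enforced at 8 hours as suggested above. The option names are from my recollection of the released docs, so verify against your Consul version. A minimal agent-config sketch:

```json
{
  "reconnect_timeout": "8h",
  "reconnect_timeout_wan": "8h"
}
```

Lowering this carries the risk @slackpad describes: a server isolated by a partition for longer than the timeout gets kicked and must be re-joined or restarted.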

@slackpad For me, the problem this causes is that the /v1/catalog/service/:service endpoint still shows services from failed nodes. It would make more sense to me if, once a node has failed, its services were removed.

@lucaswxp usually clients use the https://www.consul.io/docs/agent/http/health.html#health_service endpoint to find healthy instances (there's a ?passing parameter that will filter to only healthy ones).
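To illustrate what the `?passing` parameter does, here is a client-side sketch of the same filter. The entry shape (`Node`/`Service`/`Checks`) follows the documented /v1/health/service/:service response; the sample data is invented for illustration.

```python
# Client-side equivalent of ?passing on /v1/health/service/<name>:
# keep only entries whose health checks are all in the "passing" state.

def passing_only(entries):
    """Keep health-endpoint entries whose checks are all 'passing'."""
    return [e for e in entries
            if all(c["Status"] == "passing" for c in e["Checks"])]

# Invented example payload: web-2's serfHealth check is critical because
# the node is failed, so the ?passing filter would drop it.
entries = [
    {"Node": {"Node": "web-1"}, "Service": {"ID": "api"},
     "Checks": [{"Name": "serfHealth", "Status": "passing"}]},
    {"Node": {"Node": "web-2"}, "Service": {"ID": "api"},
     "Checks": [{"Name": "serfHealth", "Status": "critical"}]},
]
healthy = passing_only(entries)  # only the web-1 entry remains
```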

@slackpad using the health API is not always convenient (the payloads are larger, we are not interested in node/check data, only the service entries), we prefer the /v1/catalog/service/:service API. The problem is that this returns service entries on failed nodes. Is there a way to filter this response to only return services running on known active (not failed) nodes? What is the rationale behind returning service entries for nodes known to be in a failed state?

Agreed with @MitchFierro. If the /v1/catalog/service/:service endpoint could also take a passing=true parameter, that would be preferable in the interest of a smaller response payload.
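Until the catalog endpoint grows such a parameter, the filtering has to happen client-side: cross-reference the catalog response with the set of failed nodes obtained separately (e.g. from /v1/agent/members, as sketched earlier in the thread). A hedged sketch with invented data; the `Node` field name follows the catalog API docs:

```python
# Workaround sketch: drop catalog service entries hosted on failed nodes.
# `failed_node_names` would come from a members/health query elsewhere.

def drop_failed(catalog_entries, failed_node_names):
    """Remove /v1/catalog/service/:service entries on failed nodes."""
    failed = set(failed_node_names)
    return [e for e in catalog_entries if e["Node"] not in failed]

# Invented example: web-2 is known-failed, so its entry is excluded.
catalog = [
    {"Node": "web-1", "ServiceID": "api", "ServicePort": 8080},
    {"Node": "web-2", "ServiceID": "api", "ServicePort": 8080},
]
active = drop_failed(catalog, ["web-2"])  # only the web-1 entry remains
```

The trade-off versus the health endpoint is an extra request for the failed-node set in exchange for the smaller per-entry catalog payload.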

