consul version for both Client and Server
Client: >=0.7
Server: >=0.7
Amazon Linux/EL 6
Since the default reap time for failed nodes is 72 hours, when running Consul on AWS spot instances or with a really active autoscaling group, you can end up with lots of "failed" nodes in your cluster. These are just client nodes, not servers, so the new autopilot feature doesn't help here.
This is a really common problem, since instance termination is quite fast and most of the time leave_on_terminate doesn't work as expected.
It would be nice to be able to configure this reap time, e.g. to change it to something like 1 hour, for this type of use case.
Just create an autoscaling group of spot instances, which tend to scale up and die frequently. Since the default reap time for failed nodes is 72 hours, after a day or so you should see a lot of Consul clients in the failed state, either in the UI or through consul members.
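For context, the main related knobs that come up in this thread are leave_on_terminate and reconnect_timeout (discussed further down). A minimal sketch of those existing options, assuming Consul's HCL config format and the 8h lower bound on reconnect_timeout noted later in the thread:

```hcl
# Existing options only -- not the per-agent reap time requested in this issue.

# Try a graceful leave on SIGTERM; as noted above, this often doesn't
# complete before a spot instance is torn down.
leave_on_terminate = true

# How long other members keep a failed node around before reaping it.
# Consul enforces a lower bound of 8h (the default is 72h).
reconnect_timeout = "8h"
```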
+1
Would it be possible to configure reap times for servers and ~clients~ agents separately?
@cornfeedhobo I think in this design you set it on the agent, so you can have a mix of reap times depending on what's running on there / what the role is.
@slackpad hmmm, so servers would keep track of a per-agent reap time? Not sure I follow ...
That's right. The agents would advertise their reap time as part of their serf tags most likely, and the servers would use that to decide to kick them out after they have failed. Servers could stay at 72 hours and things like spot instances for batch jobs could set themselves to 5 minutes. You'd configure it on each agent based on what you need for that agent.
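Sketched as a hypothetical config (the reap_time key is made up purely for illustration and does not exist in Consul):

```hcl
# Hypothetical "reap_time" option -- not a real Consul setting. Each agent
# would advertise its own value via serf tags, and servers would honor it
# when reaping that node after it fails.

# server.hcl -- stable servers keep the conservative default
reap_time = "72h"

# spot-agent.hcl -- short-lived batch/spot instances
reap_time = "5m"
```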
@slackpad Yesssss. I want that. Thanks for the explanation!
+1
+1

+1
+1
+1
+1
+1
+1
+1
@slackpad just to confirm this for my understanding: as of now Consul doesn't have any config property for client nodes to be reaped at all? All the properties and options I have seen so far are for server nodes only, like Consul Autopilot, reconnect_timeout, and reconnect_timeout_wan. Thank you!
Can someone confirm this?
consul force-leave -prune {NODE}
Add -prune when doing the force-leave so the node is removed from the member list completely.
@casper-gh what you wanted to confirm is roughly correct.
- reconnect_timeout is configurable on both clients and servers.
- reconnect_timeout_wan is only useful on servers, as these are the only agents in a DC connected to the WAN.

What Consul is missing is the ability for an individual node to advertise a customized reconnect/reap/pruning interval that should only be applied to the removal of its own node. That statement is confusing, so an example would be better.
So if you have 3 servers and 2 clients running on pretty stable infrastructure, plus several more clients running on spot instances in a cloud provider: those spot instances might get killed off abruptly and new ones come back to take their place. In this scenario we would want the spot instances to be pruned/reaped more aggressively so that we don't end up tracking them for long periods of time.
Currently if you have a reconnect_timeout set to 8 hours (the minimum Consul will allow with a default of 72h) then 8 hours after one of the spot instances is killed the 3 stable servers and 2 stable clients will stop tracking the removed node. During that 8 hours, the node will still be registered in the catalog and any associated services and health checks will remain at their last known state. The one thing that will change is that the serf health check for the node will be marked as critical. Many of Consul's APIs (DNS and the HTTP API) have the ability to automatically filter out unhealthy service instances in order to prevent abrupt halting of an agent from causing bad behavior with applications relying on Consul for service discovery and traffic routing.
The problem in this scenario is that if an agent is being run in a more ephemeral manner, then the 8h minimum on the reconnect timeout could leave those registrations hanging around a lot longer than desired. One naive solution could be to just get rid of the minimum restriction. That, however, would have bad consequences if, for example, the reconnect_timeout were set to 60s and one of the servers happened to die or get partitioned from the network's perspective. The server would be removed from the members listing, causing it to be removed from Raft, which could in turn change various calculations within autopilot/Raft regarding quorum checks. Basically, a 60s timeout for servers could produce very bad consequences.
So the only potential solution I can think of to better support ephemeral agents would be to have each agent advertise a timeout via gossip/serf tags. In that scenario the spot instances could advertise a timeout of 60s, so that 60 seconds after a spot instance is removed the other agents would prune/reap it from their members listings.
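A sketch of what that could look like in config, using a hypothetical advertised_reconnect_timeout key that does not exist in Consul today:

```hcl
# server.hcl / stable-client.hcl -- keep the existing behavior
# (real option, minimum 8h, default 72h).
reconnect_timeout = "72h"

# spot-agent.hcl -- advertise a much shorter timeout via serf tags that
# applies only to the removal of this node after it fails.
advertised_reconnect_timeout = "60s"   # hypothetical -- not a real option
```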
Hopefully that helps to clarify things a little.