consul version for both Client and Server
Client: >=0.7
Server: >=0.7
Amazon Linux/EL 6
Since the default reap time for failed nodes is 72 hours, when running Consul on AWS spot instances or with a really active autoscaling group, you can end up with lots of "failed" nodes in your cluster. These are just client nodes, not servers, so the new autopilot feature doesn't help here.
This is a really common problem, since instance termination is quite fast and most of the time leave_on_terminate doesn't work as expected.
It would be nice to be able to configure this reap time, e.g. to change it to something like 1 hour, for this type of use case.
Just create an autoscaling group of spot instances, which tend to scale up and die frequently. Since the default reap time for failed nodes is 72 hours, after a day or so you should see a lot of Consul clients in the failed state, either in the UI or through consul members.
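For context, the main related knobs that come up in this thread are leave_on_terminate and reconnect_timeout (discussed further down). A minimal sketch of those existing options, assuming Consul's HCL config format and the 8h lower bound on reconnect_timeout noted later in the thread:

```hcl
# Existing options only -- not the per-agent reap time requested in this issue.

# Try a graceful leave on SIGTERM; as noted above, this often doesn't
# complete before a spot instance is torn down.
leave_on_terminate = true

# How long other members keep a failed node around before reaping it.
# Consul enforces a lower bound of 8h (the default is 72h).
reconnect_timeout = "8h"
```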
+1
Would it be possible to configure reap times for servers and ~clients~ agents separately?
@cornfeedhobo I think in this design you set it on the agent, so you can have a mix of reap times depending on what's running on there / what the role is.
@slackpad hmmm, so servers would keep track of a per-agent reap time? Not sure I follow ...
That's right. The agents would advertise their reap time as part of their serf tags most likely, and the servers would use that to decide to kick them out after they have failed. Servers could stay at 72 hours and things like spot instances for batch jobs could set themselves to 5 minutes. You'd configure it on each agent based on what you need for that agent.
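Sketched as a hypothetical config (the reap_time key is made up purely for illustration and does not exist in Consul):

```hcl
# Hypothetical "reap_time" option -- not a real Consul setting. Each agent
# would advertise its own value via serf tags, and servers would honor it
# when reaping that node after it fails.

# server.hcl -- stable servers keep the conservative default
reap_time = "72h"

# spot-agent.hcl -- short-lived batch/spot instances
reap_time = "5m"
```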
@slackpad Yesssss. I want that. Thanks for the explanation!
+1
+1

+1
+1
+1
+1
+1
+1
+1
@slackpad just to confirm this for my understanding: as of now Consul doesn't have any config property for client nodes to be reaped at all? All the properties and options I have seen so far are for server nodes only, like Consul Autopilot, reconnect_timeout, and reconnect_timeout_wan. Thank you!
Can someone confirm this?
consul force-leave -prune {NODE}
Add -prune when doing the force-leave so the node is removed from the member list completely.
@casper-gh what you wanted to confirm is roughly correct.
- reconnect_timeout is configurable on both clients and servers.
- reconnect_timeout_wan is only useful on servers, as these are the only agents in a DC connected to the WAN.

What Consul is missing is the ability for an individual node to advertise a customized reconnect/reap/pruning interval that should only be applied to the removal of its own node. That statement is confusing, so an example would be better.
So if you have 3 servers and 2 clients running on pretty stable infrastructure, plus several more clients running on spot instances in a cloud provider: those spot instances might get killed off abruptly and new ones come back to take their place. In this scenario we would want the spot instances to be pruned/reaped more aggressively so that we don't end up tracking them for long periods of time.
Currently if you have a reconnect_timeout set to 8 hours (the minimum Consul will allow with a default of 72h) then 8 hours after one of the spot instances is killed the 3 stable servers and 2 stable clients will stop tracking the removed node. During that 8 hours, the node will still be registered in the catalog and any associated services and health checks will remain at their last known state. The one thing that will change is that the serf health check for the node will be marked as critical. Many of Consul's APIs (DNS and the HTTP API) have the ability to automatically filter out unhealthy service instances in order to prevent abrupt halting of an agent from causing bad behavior with applications relying on Consul for service discovery and traffic routing.
The problem in this scenario is that if an agent is being run in a more ephemeral manner, then the 8h minimum on the reconnect timeout could leave those registrations hanging around a lot longer than desired. One naive solution could be to just get rid of the minimum restriction. That, however, would have bad consequences if, for example, the reconnect_timeout were set to 60s and one of the servers happened to die or get partitioned from the network's perspective. The server would be removed from the members listing, causing it to be removed from Raft, which could in turn change various calculations within autopilot/Raft regarding quorum checks. Basically, a 60s timeout for servers could produce very bad consequences.
So the only potential solution I can think of to better support ephemeral agents would be to have each agent advertise a timeout via gossip/serf tags. In that scenario the spot instances could advertise a timeout of 60s, so that 60 seconds after a spot instance is removed the other agents would prune/reap it from their members listings.
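A sketch of what that could look like in config, using a hypothetical advertised_reconnect_timeout key that does not exist in Consul today:

```hcl
# server.hcl / stable-client.hcl -- keep the existing behavior
# (real option, minimum 8h, default 72h).
reconnect_timeout = "72h"

# spot-agent.hcl -- advertise a much shorter timeout via serf tags that
# applies only to the removal of this node after it fails.
advertised_reconnect_timeout = "60s"   # hypothetical -- not a real option
```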
Hopefully that helps to clarify things a little.