Assuming that Nomad is configured with a check block when registering with Consul, so that Consul health-checks the containers.
In this scenario, if Consul reports that one of the containers is not healthy, will Nomad restart/reschedule those containers?
@vrenjith Not yet, but this is on our roadmap.
Any updates on where in the roadmap this is?
Any news?
Hey, no update on this quite yet. We are refactoring the way we do Consul registrations in 0.5.X. This will make it easier to add new features like this.
Hi,
any news? I'm asking since 0.6.0 is coming, I guess.
Related to #164.
It would be great to add an option to restart some of the containers but leave others in a failing state, and to add a timeout before restarting: for example, during some long-running operations a container should not accept client connections but should stay active and be killed only after a deadline.
Here is a workaround.
I have an autoscaled spot fleet in AWS with Nomad agents, and this feature was essential for me.
Our main application is written in Java, and in some cases the JVM does not fail by itself; it just stops responding to the health check on its HTTP port.
What I've done:
```hcl
template {
  data = <<EOH
last_restart: {{ key_or_default (printf "apps/backend-rtb/backend-rtb-task/%s" (env "attr.unique.hostname")) "no_signal" }}
EOH
  destination = "local/nomad_task_status"
  change_mode = "restart"
}
```

Consul check for the main app:
```hcl
service {
  tags = ["backend-rtb"]
  port = "backend"

  check {
    type     = "http"
    port     = "backend"
    path     = "/system/ping"
    interval = "2s"
    timeout  = "1s"
  }
}
```
With Python and python-consul that part was quite simple, and any custom restart logic is possible here.
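For reference, here is a minimal sketch of what such a watcher can look like. This is not the original poster's script: the service name, KV prefix, thresholds, and the assumption that the Consul node name matches `attr.unique.hostname` are all illustrative, and it presumes a local Consul agent plus the python-consul package.

```python
import time

import consul  # python-consul

SERVICE = "backend-rtb"                          # service name from the job above
KV_PREFIX = "apps/backend-rtb/backend-rtb-task"  # KV path watched by the template stanza
GRACE_PERIOD = 30             # seconds to wait after startup before acting on checks
CHECK_INTERVAL = 5            # seconds between polls of the Consul health API
FAILURES_BEFORE_RESTART = 3   # consecutive critical results before triggering a restart

c = consul.Consul()           # assumes a local Consul agent on 127.0.0.1:8500
failures = {}

time.sleep(GRACE_PERIOD)
while True:
    # All health checks registered for the service, across all nodes.
    _, checks = c.health.checks(SERVICE)
    for check in checks:
        node = check["Node"]  # assumed to match attr.unique.hostname on that client
        if check["Status"] == "critical":
            failures[node] = failures.get(node, 0) + 1
        else:
            failures[node] = 0

        if failures[node] >= FAILURES_BEFORE_RESTART:
            # Writing a new value to the key the template watches makes Nomad
            # re-render local/nomad_task_status and, because change_mode = "restart",
            # restart the task on that node.
            c.kv.put("%s/%s" % (KV_PREFIX, node), str(time.time()))
            failures[node] = 0

    time.sleep(CHECK_INTERVAL)
```

The grace period, failure threshold, and polling interval here are exactly the kind of knobs a native implementation would presumably expose.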
+1
Any news on the road map for this feature?
From @samart:
It would be nice if a failing check restarted a task or rescheduled it elsewhere.
Consul provides a service-discovery health check, but when a service is unresponsive we'd like to restart it. Our Mesos clusters do this with Marathon keep-alive health checks, and it works well for keeping applications responsive.
We should be able to specify at least:
- gracePeriod to wait before health checking starts, after the task has started
- number of check failures before a restart/reschedule
- interval between checks
- timeout on each check attempt
Including the current Nomad restart options (max restart attempts, mode, etc.) would be nice as well.
@dadgar Is there a way to manually resolve/restart an unhealthy allocation?
I currently have one allocation marked unhealthy while it is perfectly responsive (as Consul also shows). When I run plan to roll out an update, however, it only wants to update one of them and ignores the other: `Task Group: "web" (1 create/destroy update, 1 ignore)`.
How can I get Nomad to re-evaluate the allocation status?
@tino Currently there is no way to restart a particular allocation. Further, I think the plan is only showing that because you likely have `count = 2` and `max_parallel = 1`; it will do one at a time but will replace all of them.
This feature is critical! It would also be great to have restart limits for the whole cluster, to prevent the situation where a service is overloaded and can't handle all requests, but a mass restart would cause even more problems, so you need to restart services one by one.