Nomad v0.4.0
When a rolling update is in place, Nomad only applies the stagger delay between the evaluations that stop existing allocations, and doesn't check whether the containers replacing the stopped ones actually started, which can lead to situations like this:
Evaluations
ID Priority Triggered By Status Placement Failures
cc22d922-637e-6502-0a27-70e7d45f0e59 50 rolling-update complete false
1728faa0-e0de-c3d0-75ca-87e98312c525 50 rolling-update complete false
d89f4563-6573-3083-8872-f28205ea3209 50 job-register complete false
Allocations
ID Eval ID Node ID Task Group Desired Status
06b7b662-7765-40de-47d5-913a9d218090 cc22d922-637e-6502-0a27-70e7d45f0e59 93c61971-69c4-30df-99af-21540ea6a909 admin run pending
374da636-08a5-81eb-b8ad-9527b6c2e514 cc22d922-637e-6502-0a27-70e7d45f0e59 93c61971-69c4-30df-99af-21540ea6a909 admin run pending
97a4c2ea-703d-7700-0e76-8112639b4cd3 1728faa0-e0de-c3d0-75ca-87e98312c525 45372a28-46f6-c0a4-01a1-1f5af00e056c admin run pending
702a570c-7ccd-69fb-1d85-559c084d0781 1728faa0-e0de-c3d0-75ca-87e98312c525 93c61971-69c4-30df-99af-21540ea6a909 admin run pending
92787c01-07f5-31db-e49a-3422ff249dc7 d89f4563-6573-3083-8872-f28205ea3209 45372a28-46f6-c0a4-01a1-1f5af00e056c admin run pending
c0c70990-4b69-adc4-4901-60f904e0b056 d89f4563-6573-3083-8872-f28205ea3209 45372a28-46f6-c0a4-01a1-1f5af00e056c admin run pending
5ea9aa76-a363-37a6-c7b7-9b9794919057 9f14f3f2-5719-2d06-e46c-b7875b7bb4ec 93c61971-69c4-30df-99af-21540ea6a909 admin stop complete
161b1b7b-22c0-6b57-85c9-e6021f2fdef3 9f14f3f2-5719-2d06-e46c-b7875b7bb4ec 45372a28-46f6-c0a4-01a1-1f5af00e056c admin stop complete
0ac7c080-4877-a42e-41ea-ae2f9d85bd82 bd2f1c88-422c-e4b1-fc85-bd0279ac034e 45372a28-46f6-c0a4-01a1-1f5af00e056c admin stop complete
b2f60278-fd21-f8ab-4716-3a9975cb9f5e bd2f1c88-422c-e4b1-fc85-bd0279ac034e 93c61971-69c4-30df-99af-21540ea6a909 admin stop complete
5103e096-9eeb-706b-4b08-d1f73fafe927 758af4a6-1e44-2abf-b830-7d4ddee06237 45372a28-46f6-c0a4-01a1-1f5af00e056c admin stop complete
e98f0f97-6d96-e569-74d5-8fc687eb8a83 758af4a6-1e44-2abf-b830-7d4ddee06237 93c61971-69c4-30df-99af-21540ea6a909 admin stop complete
In this case the stagger time was short, which led to service unavailability: Nomad stopped all containers while the new ones hadn't started yet (here they are pending because the nodes are fetching images from a remote registry).
While a longer stagger time would work around the issue, this is still a potential cause of downtime, as Nomad will eventually stop all containers regardless of whether the new ones started in their place.
I wonder if it would make sense to have an extra flag that waits for the replacement containers to start and aborts the evaluation if they don't start within the stagger interval. (It is better to have the old image running on a slightly smaller number of instances after a failed evaluation than to end up with no instances running at all.)
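For illustration only, the behaviour proposed above can be approximated today with an external wrapper around the job update: after submitting the job, poll its allocations and bail out if the replacements don't reach running within a stagger-sized window. A minimal sketch, assuming a local Nomad agent on 127.0.0.1:4646, the Python requests library, and a hypothetical job name my-service (all placeholders):

```python
import time

import requests  # assumes the requests library is installed

NOMAD = "http://127.0.0.1:4646"  # placeholder: local Nomad agent
JOB_ID = "my-service"            # placeholder: job being rolled
STAGGER = 30                     # seconds; mirrors the stagger interval

def replacements_running(job_id):
    """True once every allocation that is desired to run is actually running."""
    allocs = requests.get(f"{NOMAD}/v1/job/{job_id}/allocations").json()
    return all(a["ClientStatus"] == "running"
               for a in allocs if a["DesiredStatus"] == "run")

deadline = time.time() + STAGGER
while not replacements_running(JOB_ID):
    if time.time() > deadline:
        # Replacements didn't start within the stagger window: stop here rather
        # than letting the next batch of old containers be stopped.
        raise SystemExit("replacements not running in time; aborting rollout")
    time.sleep(2)
print("replacements are running; safe to continue")
```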
We will most likely solve this by waiting for the service to be healthy in Consul before moving on
I think that waiting for healthy is the most correct solution to this, as there is a combination of both Docker registry pull time and service startup time that needs to be waited on before the next service is restarted in order to avoid downtime. That being said, I feel like it would not hurt to also implement the more raw wait for the service to transition to running in the meantime, and for the case where people are not running Consul.
Agree that Consul service status would be the best solution, and I also think that there should be a "raw wait" at least for tasks that do not expose services.
I see a few issues with relying only on Consul for health checks. In a sense there are two phases in health checking a task: an initial check to determine that the task has started on the node, and after that continuous health checks to verify that the task is still running healthy.
If the initial check passes, then Nomad should continue the rolling update; a failed initial check should probably stop the update?
The initial health check should be done by Nomad, and it might be a really simple TCP, HTTP or script check. The continuous checks could be done by Consul.
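To make that split concrete (this is an illustration, not how Nomad implements checks): the "initial" check can be as crude as retrying a TCP connect until a startup deadline passes, after which ongoing health is left to the continuous checks. A minimal sketch with hypothetical host/port arguments:

```python
import socket
import time

def initial_tcp_check(host, port, startup_timeout=60):
    """Retry a TCP connect until it succeeds or the startup deadline passes.

    A pass only means the task has started listening; verifying that it stays
    healthy is left to the continuous (e.g. Consul) checks."""
    deadline = time.time() + startup_timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(1)
    return False
```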
+1 for making sure the task passes its health check before continuing on.
We ended up writing a push script that would deploy our task in a new job, wait for a successful Consul health check, and then update the main job for an upgrade. This wrapper would feel unnecessary if the rolling upgrade process worked a little differently. Without it we run the risk of rolling out an unhealthy task to 100% of our cluster, which isn't one we're willing to take.
Hey all,
Just wanted to update that this is slated for 0.6.0
Any estimates on when to expect v0.6.0? Or at least 0.5.0?
v0.6.0 is a bit far out to estimate but 0.5.0 will be out very soon!
@nathanpalmer would you like to share your script? It sounds like a nice workaround for the missing wait for a successful health check.
@venth The whole script is a bit too involved and tied to our setup to post here (it spans several files and utilities). However, the basics look like this.
1) Our jobs are set up using a prefix and a group, arranged in a blue/green deployment. For example, api-green and api-blue.
2) The job.hcl file is scripted using consul-template and environment variables to determine which group we're deploying at a given time (using SERVICE_NAME and SERVICE_GROUP).
{{$name := (env "SERVICE_NAME")}}{{$group := (env "SERVICE_GROUP")}}job "{{$name}}-{{$group}}" {
group "default" {
count = {{key (print "service/api/jobs/" $name "-" $group "/count")}}
meta {
revision = "{{key (print "service/api/jobs/" $name "-" $group "/revision")}}"
}
task "api" {
...
}
}
}
3) Since that job file is set up so we can deploy the app to any named group, we deploy first to a migration group called api-migrate.
4) Wait for the allocation to complete
We query /v1/allocation/ and look for essentially this (a rough sketch of steps 4 and 5 appears after the list):
allocation["ClientStatus"] == "failed" ||
(allocation["DesiredStatus"] == "run" && allocation["ClientStatus"] == "running") ||
(allocation["DesiredStatus"] == "stop" && allocation["ClientStatus"] == "complete")
5) Wait for the health check to pass
We query /v1/health/service/#{service} looking for a status of passing
6) We tear down the api-migrate job, and if the health check was successful we update the main group we're deploying (api-green, for example).
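For reference, steps 4 and 5 boil down to two polling loops: one against Nomad's allocation list using the predicate above, and one against Consul's service health endpoint. A minimal sketch, assuming local Nomad and Consul agents, the Python requests library, the job allocations listing endpoint rather than per-allocation lookups, and the placeholder names api-migrate/api:

```python
import time

import requests  # assumes the requests library is installed

NOMAD = "http://127.0.0.1:4646"   # placeholder: local Nomad agent
CONSUL = "http://127.0.0.1:8500"  # placeholder: local Consul agent

def allocation_settled(alloc):
    # The same predicate as in step 4.
    return (alloc["ClientStatus"] == "failed"
            or (alloc["DesiredStatus"] == "run" and alloc["ClientStatus"] == "running")
            or (alloc["DesiredStatus"] == "stop" and alloc["ClientStatus"] == "complete"))

def wait_for_allocations(job_id, timeout=300):
    """Wait until all of the job's allocations settle; True only if none failed."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        allocs = requests.get(f"{NOMAD}/v1/job/{job_id}/allocations").json()
        if allocs and all(allocation_settled(a) for a in allocs):
            return not any(a["ClientStatus"] == "failed" for a in allocs)
        time.sleep(5)
    return False

def wait_for_passing_checks(service, timeout=300):
    """Wait until every Consul check registered for the service is passing."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        entries = requests.get(f"{CONSUL}/v1/health/service/{service}").json()
        checks = [c for e in entries for c in e["Checks"]]
        if checks and all(c["Status"] == "passing" for c in checks):
            return True
        time.sleep(5)
    return False

# Hypothetical names for the migration deploy described above.
if wait_for_allocations("api-migrate") and wait_for_passing_checks("api"):
    print("migration job healthy; safe to update the main group")
else:
    raise SystemExit("migration job unhealthy; aborting")
```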
@nathanpalmer Thanks for the explanation and knowledge sharing ;)
Hey, this has been addressed by deployments in 0.6.0.