Hi everyone!
We are using nomad + fabiolb/traefik as reverse proxy/load balancer in a microservices architecture.
We are trying to achieve zero-downtime deployments with rolling upgrades.
How can we detect unhealthy targets in synchronization with the reverse proxy, so that the application can perform a graceful shutdown without errors?
Simple use case.
We submit to nomad an update of a certain docker container.
Nomad through docker sends the SIGTERM signal to start the graceful shutdown of the application.
The application marks itself unhealthy and waits for established connections to terminate.
How can we ensure that the load balancer / reverse proxy identifies the unhealthy state before the application ends gracefully?
We understand that the application may finish shutting down before the load balancer notices the unhealthy state, causing new requests to be routed to that application.
We are looking for an infrastructure-level solution that keeps this logic out of the applications themselves.
Regards.
@dadgar @nicholasjackson @anubhavmishra @marcosnils
Hey @arrodriguez, thanks for your question. It looks like you are using Fabio, and I am assuming you are also using Consul alongside it. If that is the case, you can use the service stanza in your job file to define the service and its associated health checks. We support script, http, and tcp health checks. I would recommend creating a simple endpoint and using that to check the health of the service. Fabio will respect the Consul health status and stop routing traffic to instances whose checks are failing.
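For reference, a minimal service stanza with an HTTP check might look like the sketch below (the service name, port label, path, and timings are all placeholders, not values from this thread):

```hcl
service {
  name = "my-app"          # placeholder service name
  port = "http"            # port label defined in the task's network config

  check {
    type     = "http"
    path     = "/health"   # the simple endpoint suggested above
    interval = "5s"
    timeout  = "2s"
  }
}
```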
Also, make sure that your application shuts down gracefully when it gets a SIGTERM from Nomad.
You can also set kill_timeout and respond to SIGINT (or any other signal you want) by rejecting new connections, then use the kill_timeout window to let the application finish handling existing connections and exit gracefully.
I hope this helps. Sorry for the delayed response. Let me know if you have any other questions.
Thank you for your response. And yes, we are using consul.
We are actually already doing everything you suggest.
The specific question is: once Nomad starts a rolling upgrade, does it de-register the Consul health checks for each job instance before actually sending the SIGTERM to the application? If not, zero-downtime deployments wouldn't be possible, because Fabio would still send new requests to the application.
The ideal workflow would be as follows:
1 - Rolling upgrade is required
2 - Select candidate job instance to update
3 - De-register current job instance health checks (this will make any consul-based LB to stop sending new requests)
4 - Pull new docker image (this can be done before actually stopping the job service)
5 - Send SIGTERM/SIGKILL to running job instance
6 - Deploy / start the new version of the app
7 - Enable health-checks again.
We haven't looked at the code to confirm what Nomad actually does, but this is our initial idea of how it should work in order to accomplish zero-downtime deployments (ZDD).
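On the Nomad side, the pacing of steps 2 through 7 is controlled by the update stanza. A sketch with illustrative values (not taken from this thread):

```hcl
update {
  max_parallel     = 1      # replace one instance at a time
  min_healthy_time = "30s"  # new allocation must pass its checks for this long
  healthy_deadline = "5m"   # mark the deployment failed after this
  auto_revert      = true   # roll back if the new version never becomes healthy
}
```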
Hey @arrodriguez,
That is how it works: Nomad de-registers the service and its health checks from Consul before sending the kill signal to the task.
Hope that helps!
Thanks a lot @dadgar!
Is that documented somewhere? Just wondering if we missed some part of the documentation.
@marcosnils Very delayed but here you go: https://github.com/hashicorp/nomad/pull/7083