We operate more than 3,500 containers on a large OpenShift cluster. A lot of applications have the same problem with the current termination process. To achieve zero downtime during rolling updates, pod restarts, and node evacuations for maintenance, an application has to do the following: on SIGTERM, keep accepting requests for a while (since traffic is still being routed to the pod), and finish all in-flight requests before exiting.
At SBB we implemented this behaviour for Spring Boot 1 & 2 with this extension library:
https://github.com/SchweizerischeBundesbahnen/springboot-graceful-shutdown
But this solution only works for Java apps. Every other language/web server has to reimplement the same thing again and again. We talked to a lot of people/companies that use OpenShift/Kubernetes, and all of them struggle with this issue. Thus, I would like to propose a solution where the container platform handles termination a bit differently.
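The pattern the library implements can be sketched in plain Java as follows. This is a hypothetical, minimal sketch, not the library's actual API: the class and method names are made up, and the drain delay is illustrative. The idea is to fail the readiness/health probe first, wait for the routing layer to notice, and only then shut the server down.

```java
// Hypothetical sketch of the graceful-shutdown pattern (not the library's API).
public class GracefulShutdownSketch {
    // What the health/readiness endpoint reports.
    private volatile boolean ready = true;

    /** HTTP status the health endpoint would return. */
    public int healthStatus() {
        return ready ? 200 : 503;
    }

    /** Called on SIGTERM: fail readiness, wait, then let the server drain. */
    public void onSigterm(long drainMillis) throws InterruptedException {
        ready = false;             // readiness probe now fails -> LB stops routing to us
        Thread.sleep(drainMillis); // give the endpoint update time to propagate
        // ...here the real server would stop accepting new connections and
        // wait for in-flight requests to finish before exiting.
    }

    public static void main(String[] args) throws InterruptedException {
        GracefulShutdownSketch app = new GracefulShutdownSketch();
        System.out.println(app.healthStatus()); // healthy while serving
        app.onSigterm(100);
        System.out.println(app.healthStatus()); // failing during drain
    }
}
```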
Introduce a new pod lifecycle state, something like "TerminationPreparation". In this state, the pod would be removed from services and routes (load balancers) before SIGTERM is sent, so no new traffic reaches it while it finishes its in-flight requests.
This would massively improve the availability of our applications during any form of container termination. Developers would no longer need to take care of that manually.
As discussed in advance, FYI: @sreber84, @eberlec, @saturnism, @smarterclayton, @knobunc.
Just so I understand, using preStop isn't an option? I.e., a preStop hook that runs a wait_for_done.sh script won't solve your issue?
I agree that the wait 10s in preStop is ugly, but your wait script can control how long you have before termination, even if your process doesn't cooperate. E.g., if you have a 60s timeout on requests, you should be able to set the termination grace period to 120s and set preStop to wait 60s plus a 10s buffer, then exit. That leaves 50s for graceful shutdown before SIGKILL gets sent.
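Concretely, that timing suggestion would look something like this in a pod spec (a sketch: the numbers are the example values from above, and the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp            # placeholder
spec:
  terminationGracePeriodSeconds: 120   # total budget before SIGKILL
  containers:
  - name: app
    image: myapp:latest  # placeholder
    lifecycle:
      preStop:
        exec:
          # 60s request timeout + 10s buffer; SIGTERM is only sent after
          # this hook returns, leaving ~50s for graceful shutdown.
          command: ["sh", "-c", "sleep 70"]
```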
Yes, this should work, but this way every app/dev has to add it themselves.
At least the first part (stop sending new traffic to an app before sending the SIGTERM) seems like a good idea at the platform level. In a "classic environment" one would remove an app from a load balancer before even thinking about stopping it. I think it would help a lot of people/apps if this were changed globally. I heard that quite a lot of people struggle with the same problem.
I agree on the second part (finishing active requests before quitting). This is the app's responsibility and can also be done in the way you described.
@openshift/sig-pod @openshift/sig-networking
This would involve upstream as we are talking about core kube components here. I agree this is an issue and I do hear about it in the community.
In fact, we talked about a proposal in the sig-node meeting that handles this problem on the pod bring-up side vs. the tear-down side.
https://docs.google.com/document/d/1VFZbc_IqPf_Msd-jul7LKTmGjvQ5qRldYOFV0lGqxf8/edit
I do think that an additional pod state would be difficult to get accepted upstream.
It seems that this could be accomplished if the Pod were removed from the Endpoints when the deletionTimestamp was set, indicating the Pod is terminating. Then the pod could set whatever terminationGracePeriodSeconds it requires as a timeout for draining connections. If the drain completes early, the process can simply exit.
I'm unsure whether Pods are removed from Endpoints once the deletionTimestamp is set, or only when the Pod is actually deleted.
Maybe I'm not grasping the nuance.
A deleted pod is considered not ready. There's a very old issue for this that is similar:
https://github.com/kubernetes/kubernetes/issues/13364
https://github.com/kubernetes/kubernetes/issues/20473
Agree this is something that needs to get some real attention.
Thanks for the feedback. As far as I am concerned, a new pod state is not mandatory. Anything that helps to improve the situation is welcome :) I agree that this has to be fixed in Kubernetes first, and then OpenShift just needs to add the HAProxy part.
The current behaviour seems to contradict the OpenShift documentation. On https://docs.openshift.com/container-platform/3.7/dev_guide/deployments/advanced_deployment_strategies.html we can read:
On shutdown, OpenShift Container Platform will send a TERM signal to the processes in the container. Application code, on receiving SIGTERM, should stop accepting new connections. This will ensure that load balancers route traffic to other active instances. The application code should then wait until all open connections are closed (or gracefully terminate individual connections at the next opportunity) before exiting.
However, OpenShift continues to send requests to the pod for some seconds after sending it SIGTERM, and these requests will fail if we stop accepting connections.
The current behaviour is quite weird and unexpected:
"Dear pod, please die. Also please handle these requests"
The service should remove the pod from the LB before signalling TERM.
The problem with the recommended process ("application code [...] should stop accepting new connections") is that in Java environments this is part of the application server, not the application code.
@ReToCode created a workaround for Tomcat (Embedded/SpringBoot), as Tomcat doesn't support graceful shutdown at all.
Undertow supports it, but responds with HTTP status 503 during the shutdown process.
I adopted that logic for SpringBoot with Tomcat.
But this only works for routes (load balancers) without SSL passthrough. It doesn't work for services.
I fully agree this should be handled by OpenShift (Kubernetes). Before pods are evicted they should be removed from services and routes (load balancers).
A couple of other project teams in my customer's company (operating a large OpenShift cluster) are struggling with the same problem.
@jmencak this seems familiar :)
The problem is that there is no tight coupling between the pieces of the system. The router only learns that the backing pods are gone when the endpoints update. But the router can't reload immediately (reloads are rate-limited, and even when a reload can start immediately, it can take anywhere from a few seconds to a minute depending on the number of routes and the speed of the box).
With HAProxy 1.8 we can make some dynamic changes to the running router, so we won't need a reload for many changes and responsiveness will be greatly improved.
But for now, you need to make sure there is some delay between when termination starts and when the pod exits. You can either add a SIGTERM handler to the process, or register a preStop hook that sleeps for a little while.
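The SIGTERM-handler variant can be sketched with a JVM shutdown hook (a hypothetical illustration; the 200 ms delay stands in for a real drain window of tens of seconds):

```java
// Sketch: delay process exit after SIGTERM so the router has time to remove
// the pod from load balancing before the server actually stops.
public class DelayedExit {
    public static void main(String[] args) {
        // SIGTERM triggers JVM shutdown, which runs registered shutdown hooks.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            System.out.println("terminating: draining before exit");
            try {
                Thread.sleep(200); // a real pod would wait tens of seconds here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        System.out.println("serving traffic");
        // the process keeps serving until SIGTERM arrives and the hook runs
    }
}
```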