We operate more than 3,500 containers on a large OpenShift cluster. A lot of applications have the same problem with the current termination process. To achieve zero downtime during rolling updates, pod restarts, and node evacuations for maintenance, an application has to do the following: on SIGTERM, keep accepting requests for a while (since traffic is still being routed to the pod), and finish all in-flight requests before exiting.
At SBB we implemented this behaviour for Spring Boot 1 & 2 with this extension library:
https://github.com/SchweizerischeBundesbahnen/springboot-graceful-shutdown
But this solution only works for Java apps. Every other language/web server has to reimplement the same thing again and again. We talked to a lot of people/companies that use OpenShift/Kubernetes, and all of them struggle with this issue. Thus, I would like to propose a solution where the container platform handles termination a bit differently.
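The pattern the library implements can be sketched in plain Java as follows. This is a hypothetical, minimal sketch, not the library's actual API: the class and method names are made up, and the drain delay is illustrative. The idea is to fail the readiness/health probe first, wait for the routing layer to notice, and only then shut the server down.

```java
// Hypothetical sketch of the graceful-shutdown pattern (not the library's API).
public class GracefulShutdownSketch {
    // What the health/readiness endpoint reports.
    private volatile boolean ready = true;

    /** HTTP status the health endpoint would return. */
    public int healthStatus() {
        return ready ? 200 : 503;
    }

    /** Called on SIGTERM: fail readiness, wait, then let the server drain. */
    public void onSigterm(long drainMillis) throws InterruptedException {
        ready = false;             // readiness probe now fails -> LB stops routing to us
        Thread.sleep(drainMillis); // give the endpoint update time to propagate
        // ...here the real server would stop accepting new connections and
        // wait for in-flight requests to finish before exiting.
    }

    public static void main(String[] args) throws InterruptedException {
        GracefulShutdownSketch app = new GracefulShutdownSketch();
        System.out.println(app.healthStatus()); // healthy while serving
        app.onSigterm(100);
        System.out.println(app.healthStatus()); // failing during drain
    }
}
```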
Introduce a new pod lifecycle state, something like "TerminationPreparation". In this state, the pod would be removed from services and routes (load balancers) before SIGTERM is sent, so no new traffic reaches it while it finishes its in-flight requests.
This would massively improve the availability of our applications during any form of container termination. Developers would no longer need to take care of that manually.
As discussed in advance, FYI: @sreber84, @eberlec, @saturnism, @smarterclayton, @knobunc.
Just so I understand, using preStop isn't an option? I.e., a preStop hook that runs a wait_for_done.sh script won't solve your issue?
I agree that the wait 10s in preStop is ugly, but your wait script can control how long you have before termination, even if your process doesn't cooperate. E.g., if you have a 60s timeout on requests, you should be able to set the termination grace period to 120s and set preStop to wait 60s plus a 10s buffer, then exit. That leaves 50s for graceful shutdown before SIGKILL gets sent.
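Concretely, that timing suggestion would look something like this in a pod spec (a sketch: the numbers are the example values from above, and the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp            # placeholder
spec:
  terminationGracePeriodSeconds: 120   # total budget before SIGKILL
  containers:
  - name: app
    image: myapp:latest  # placeholder
    lifecycle:
      preStop:
        exec:
          # 60s request timeout + 10s buffer; SIGTERM is only sent after
          # this hook returns, leaving ~50s for graceful shutdown.
          command: ["sh", "-c", "sleep 70"]
```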
Yes, this should work, but this way every app/dev has to add it themselves.
At least the first part (stop sending new traffic to an app before sending the SIGTERM) seems like a good idea at the platform level. In a "classic environment" one would remove an app from a load balancer before even thinking about stopping it. I think it would help a lot of people/apps if this were changed globally. I heard that quite a lot of people struggle with the same problem.
I agree on the second part (finishing active requests before quitting). This is the app's responsibility and can also be done in the way you described.
@openshift/sig-pod @openshift/sig-networking
This would involve upstream as we are talking about core kube components here. I agree this is an issue and I do hear about it in the community.
In fact, we talked about a proposal in the sig-node meeting that handles this problem on the pod bring-up side vs. the tear-down side.
https://docs.google.com/document/d/1VFZbc_IqPf_Msd-jul7LKTmGjvQ5qRldYOFV0lGqxf8/edit
I do think that an additional pod state would be difficult to get accepted upstream.
It seems that this could be accomplished if the Pod were removed from the Endpoints when the deletionTimestamp was set, indicating the Pod is terminating. Then the pod could set whatever terminationGracePeriodSeconds it requires as a timeout for draining connections. If the drain completes early, the process can simply exit.
I'm unsure whether Pods are removed from Endpoints once the deletionTimestamp is set, or only when the Pod is actually deleted.
Maybe I'm not grasping the nuance.
A deleted pod is considered not ready. There's a very old issue for this that is similar:
https://github.com/kubernetes/kubernetes/issues/13364
https://github.com/kubernetes/kubernetes/issues/20473
Agree this is something that needs to get some real attention.
Thanks for the feedback. As far as I am concerned, a new pod state is not mandatory. Anything that helps to improve the situation is welcome :) I agree that this has to be fixed in Kubernetes first, and then OpenShift just needs to add the HAProxy part.
The current behaviour seems to contradict the OpenShift documentation. On https://docs.openshift.com/container-platform/3.7/dev_guide/deployments/advanced_deployment_strategies.html we can read:
On shutdown, OpenShift Container Platform will send a TERM signal to the processes in the container. Application code, on receiving SIGTERM, should stop accepting new connections. This will ensure that load balancers route traffic to other active instances. The application code should then wait until all open connections are closed (or gracefully terminate individual connections at the next opportunity) before exiting.
However, OpenShift continues to send requests to the pod for some seconds after sending it SIGTERM, and these requests will fail if we stop accepting connections.
The current behaviour is quite weird and unexpected:
"Dear pod, please die. Also please handle these requests"
The service should remove the pod from the LB before signalling TERM.
The problem with the recommended process ("application code [...] should stop accepting new connections") is that in Java environments this is part of the application server, not the application code.
@ReToCode created a workaround for Tomcat (Embedded/SpringBoot), as Tomcat doesn't support graceful shutdown at all.
Undertow supports it, but responds with HTTP status 503 during the shutdown process.
I adopted that logic for SpringBoot with Tomcat.
But this only works for routes (load balancers) without SSL passthrough. It doesn't work for services.
I fully agree this should be handled by OpenShift (Kubernetes). Before pods are evicted they should be removed from services and routes (load balancers).
A couple of other project teams in my customer's company (operating a large OpenShift cluster) are struggling with the same problem.
@jmencak this seems familiar :)
The problem is that there is no tight coupling between the pieces of the system. The router only learns that the backing pods are gone when the endpoints update. But the router can't reload immediately (reloads are rate-limited, and even when a reload can start immediately, it can take anywhere from a few seconds to a minute depending on the number of routes and the speed of the box).
With HAProxy 1.8 we can make some dynamic changes to the running router, so we won't need a reload for many changes and responsiveness will be greatly improved.
But for now, you need to make sure there is some delay between when termination starts and when the pod exits. You can either add a SIGTERM handler to the process, or register a preStop hook that sleeps for a little while.
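The SIGTERM-handler variant can be sketched with a JVM shutdown hook (a hypothetical illustration; the 200 ms delay stands in for a real drain window of tens of seconds):

```java
// Sketch: delay process exit after SIGTERM so the router has time to remove
// the pod from load balancing before the server actually stops.
public class DelayedExit {
    public static void main(String[] args) {
        // SIGTERM triggers JVM shutdown, which runs registered shutdown hooks.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            System.out.println("terminating: draining before exit");
            try {
                Thread.sleep(200); // a real pod would wait tens of seconds here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        System.out.println("serving traffic");
        // the process keeps serving until SIGTERM arrives and the hook runs
    }
}
```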