Disclaimer:
When a pod has a preStop hook, it stops more slowly than its Linkerd sidecar. This defeats the purpose of the preStop hook, since the pod can no longer communicate with other services once Linkerd has exited.
- kubectl rollout restart is called; Kubernetes starts a new pod B and tries to terminate the previous pod A.
- Pod A runs its preStop hook and waits for N seconds before exiting, in order to serve the existing requests and the new requests that still arrive from the ALB for an unknown reason.
- The problem also makes it impossible to use the "Slow start duration" setting in the ALB.
# a new pod started and Kubernetes marks the old pod as Terminating
# Linkerd immediately says in its logs:
linkerd-proxy INFO [ 919.021686s] linkerd2_proxy::signal received SIGTERM, starting shutdown
# After that the application cannot connect to monitoring and to other services anymore:
E, [2019-11-21T14:09:25.634678 #2719] ERROR -- ddtrace: [ddtrace] (/usr/local/bundle/gems/ddtrace-0.22.0/lib/ddtrace/transport.rb:215:in `log_error_once') Failed to open TCP connection to 10.5.5.8:8126 (Connection refused - connect(2) for "10.5.5.8" port 8126)
# And at the end, after N seconds, the app exits as it was asked. But by then it no longer makes sense.
# ALB returns many 5xx errors to clients
If there were a way to inject a custom preStop hook into the Linkerd container with a simple sleep command inside (the container probably does not even have the command?..), it might solve the problem. In that case I would ask Linkerd to wait for a long enough period, and Kubernetes would eventually ask it to quit, according [to the documentation](https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods):
- If one of the Pod's containers has defined a preStop hook, it is invoked inside of the container. If the preStop hook is still running after the grace period expires, step 2 is then invoked with a small (2 second) extended grace period.
- The container is sent the TERM signal. Note that not all containers in the Pod will receive the TERM signal at the same time and may each require a preStop hook if the order in which they shut down matters.
I cannot find how to actually solve the problem. As a workaround, some tuning around liveness/readiness probes and health check thresholds can be performed. And the "Deregistration delay" can be set to something close to zero to fail faster. But 5xx errors will still be there.
It could be a flag for linkerd inject --manual, for example --wait-before-exit 30, that would prevent linkerd-proxy from exiting for N seconds.
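For illustration, here is a rough sketch of what such a flag could emit into the pod spec. This is hypothetical: the flag, the annotation name, and the generated hook are only the proposal above, not an existing Linkerd feature, and the proxy image may not ship /bin/sleep at all.

# Hypothetical output of `linkerd inject --manual --wait-before-exit 30`
metadata:
  annotations:
    config.linkerd.io/wait-before-exit: "30"   # hypothetical annotation name
spec:
  containers:
  - name: linkerd-proxy
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sleep", "30"]        # keeps the proxy alive while the app drains
  terminationGracePeriodSeconds: 60            # must exceed the sleep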
Hmmm... I also found this, which I guess is a similar ticket for Istio: https://github.com/istio/istio/issues/7136
Unfortunately the pre-stop solution wouldn't fix this issue. The proxy is receiving the SIGTERM from k8s and doing the correct thing. We'll need a configuration option to make the proxy ignore SIGTERM and continue operating until SIGKILL comes after the wait period.
Adding an annotation that does that for specific workloads having a tough time with this seems reasonable. Probably a good first proxy PR even! I've changed the title to keep folks from getting confused, but feel free to change it to something else (or close and open a new issue) if I've missed some of the details.
Actually, I made it work today experimentally with the following set of configurations that I injected manually:
# application container
livenessProbe:
  failureThreshold: 3
  initialDelaySeconds: 140
  periodSeconds: 5
  successThreshold: 1
  timeoutSeconds: 3
readinessProbe:
  failureThreshold: 5
  initialDelaySeconds: 120
  periodSeconds: 5
  successThreshold: 1
  timeoutSeconds: 3
lifecycle:
  preStop:
    exec:
      command:
      - /bin/bash
      - -c
      - sleep 130 && while pkill -QUIT -f "myapp"; do sleep 1; done

# linkerd-proxy container
lifecycle:
  preStop:
    exec:
      command:
      - /bin/bash
      - -c
      - sleep 133 # a slightly longer waiting period than the application's

terminationGracePeriodSeconds: 160 # for the entire pod
---
# ALB annotations:
...
alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=10 # deregistration_delay must be less than or equal to the preStop delay, whether Linkerd supports it natively or it is injected manually
# alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=1 # must be ZERO until Linkerd supports preStop hooks
alb.ingress.kubernetes.io/healthy-threshold-count: '2' # must be small and fast; K8s has already performed the readiness probe
alb.ingress.kubernetes.io/unhealthy-threshold-count: '5' # must be bigger than the K8s livenessProbe (threshold * period) but not too big. K8s removes the pod and the ALB Ingress Controller notifies the target group that the pod's instance must be deregistered. If the Ingress Controller gets stuck or dies, the ALB must do it on its own.
alb.ingress.kubernetes.io/healthcheck-interval-seconds: '5' # must be small and fast but take the K8s livenessProbe settings into account
alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '3' # similar to the internal K8s setting
And the logs in this setup:
app-web-web-749d6b87d5-7nr2p app-web-web method=GET path=/ttest format=*/*
app-web-web-749d6b87d5-7nr2p app-web-web method=GET path=/ttest format=*/*
# ALB deregistered the instance
app-web-web-749d6b87d5-7nr2p app-web-web method=GET path=/health/ user_agent=kube-probe/1.12
app-web-web-749d6b87d5-7nr2p app-web-web method=GET path=/health/ user_agent=kube-probe/1.12
...
# SLEEP is over and it is finally exiting
app-web-web-749d6b87d5-7nr2p app-web-web I, [2019-11-22T14:21:57.566261 #2717] INFO -- : reaped #<Process::Status: pid 2719 exit 0> worker=0
app-web-web-749d6b87d5-7nr2p app-web-web I, [2019-11-22T14:21:57.566722 #2717] INFO -- : reaped #<Process::Status: pid 2722 exit 0> worker=1
app-web-web-749d6b87d5-7nr2p app-web-web I, [2019-11-22T14:21:57.567112 #2717] INFO -- : reaped #<Process::Status: pid 2725 exit 0> worker=2
app-web-web-749d6b87d5-7nr2p app-web-web I, [2019-11-22T14:21:57.567206 #2717] INFO -- : master complete
app-web-web-749d6b87d5-7nr2p app-web-web I died and this is fine. Now Linkerd it's your turn to die! :-)
app-web-web-749d6b87d5-7nr2p linkerd-proxy ERR! [ 974.284298s] linkerd2_proxy::app::errors unexpected error: error trying to connect: Connection refused (os error 111) (address: 127.0.0.1:8080)
app-web-web-749d6b87d5-7nr2p linkerd-proxy ERR! [ 976.502988s] linkerd2_proxy::app::errors unexpected error: error trying to connect: Connection refused (os error 111) (address: 127.0.0.1:8080)
app-web-web-749d6b87d5-7nr2p linkerd-proxy ERR! [ 979.284328s] linkerd2_proxy::app::errors unexpected error: error trying to connect: Connection refused (os error 111) (address: 127.0.0.1:8080)
app-web-web-749d6b87d5-7nr2p linkerd-proxy ERR! [ 981.502903s] linkerd2_proxy::app::errors unexpected error: error trying to connect: Connection refused (os error 111) (address: 127.0.0.1:8080)
# the L5d's sleep period is over
app-web-web-749d6b87d5-7nr2p linkerd-proxy INFO [ 983.473174s] linkerd2_proxy::signal received SIGTERM, starting shutdown
Woah, cool! There's something here that I'm not understanding, I'll have to take a look at it some more =)
Configure linkerd-proxy to ignore SIGTERM on a per-workload basis
The title you have set shows one of the possible solutions for the issue. A good solution, but not the only one. There can be others:
- A preStop hook, like /prepare_for_termination or just sleep/30, where 30 can be changed to any value (preStop supports http calls, not only exec); a sketch of the http variant follows below.
- A preStop hook that checks a file on the file system and exits only once the file has been created, as suggested in the example.
- A plain sleep, like I did, and it works.
- Adding lifecycle.type: Sidecar to the proxy's container definition (this won't work for old clusters for years).

I would prefer the variant that I suggested initially in the title because it is the simplest option and does not require code changes within the proxy, only within the proxy injector and, maybe, in the Helm chart.
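A minimal sketch of the http variant mentioned in the first item, assuming the proxy (or the application) exposed such a blocking handler; the path, behavior, and port are assumptions, not an existing endpoint:

# Hypothetical httpGet preStop: the handler would block (or keep the pod in a
# draining state) for the requested number of seconds before returning.
lifecycle:
  preStop:
    httpGet:
      path: /sleep/30   # hypothetical path; 30 can be changed to any value
      port: 4191        # assumed to be the proxy's admin port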
To generalize, the title can look like: "Exit Linkerd proxy after the main container exited (graceful shutdown)"
@kivagant-ba I think you're right, I'm still just surprised that it works. I need to take a look and understand what I'm missing in the story.
@grampelberg
During pod termination, the following occurs:
- the pod is set to Terminating
- preStop is executed per container

If we ignore linkerd for a moment, what usually happens is that the container receives a SIGTERM while its IP is still in the endpoint (remember, the ep is updated asynchronously); there's a possibility that the container has completed its in-flight requests and exited before the ep is updated, and because the IP is still in the ep, requests will be sent to an unresponsive IP, resulting in a timeout or 502.
If we introduce a preStop sleep with the following example:
# application container
lifecycle:
  preStop:
    exec:
      command:
      - /bin/bash
      - -c
      - sleep 20

# linkerd-proxy container
lifecycle:
  preStop:
    exec:
      command:
      - /bin/bash
      - -c
      - sleep 40

terminationGracePeriodSeconds: 160 # for the entire pod
The pod is still set to Terminating and the IP is still removed from the ep, but the container does not receive a SIGTERM for x period of time.
What I would like to see, is the ability to configure a preStop for the linkerd proxy, so that I can achieve the following:
- the pod is set to Terminating
- preStop is executed in my app and in the linkerd proxy

@grampelberg, I'd like to try to help with implementing this when you decide which way to go. It is not urgent for us yet, but I believe it will be a desired feature for users.
@KIVagant that's awesome! I like your suggestion the most, as it doesn't require proxy changes; we just need to make sure it works. --wait-before-exit seems like a perfectly fine command line flag and annotation for the injector. If we limit this to just modifying the proxy container itself, it might even be a small patch!
Ok, sounds like a good next challenge to me to commit something here. When I have time, I'll make a fix asap.
That's awesome! Thank you!
Is there a possibility this will be rolled out as a patch to 2.6?
We would need this to move linkerd into production, as we have quite high RPS and don't want to cause issues for users if we don't need to.
@tobad357 , from my side I can say that tomorrow I'll address the PR review comments. About the release date I cannot add anything but we will also move to prod only with the solution on board.
@tobad357 it wouldn't be a patch release of 2.6, but you can pick up the next edge after @kivagant-ba lands his PR for it.
I'm not sure how this is related to your high RPS concern though?
With high RPS the amount of 5xx errors is huge without this feature.
@KIVagant but only for workloads that need to drain, right?
It should be like that, but my tests showed me a different picture. Mostly because, when the AWS CNI is used for Kubernetes, the load balancer sends traffic directly to the pod IP, regardless of the internal Kubernetes Service abstraction. This means that even after the pod has gone into the Terminating state, a lot of new requests are still able to reach the pod (its IP is directly routable in the VPC). Plus, the ALB has an interesting lag and can continue sending new requests to a target even after it got the command to remove it.
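For context, this direct-to-pod routing is what you get when the ALB registers pod IPs as targets rather than NodePorts, e.g. with the aws-alb-ingress-controller annotation below (whether that is the exact setup described above is an assumption):

alb.ingress.kubernetes.io/target-type: ip   # ALB targets pod IPs directly, bypassing the Service/kube-proxy layer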
(Ignore my GitHub-account-schizophrenia)
@KIVagant IIRC, this seems like a general problem with the ALB, with or without Linkerd. See https://github.com/kubernetes-sigs/aws-alb-ingress-controller/issues/814 and https://github.com/kubernetes-sigs/aws-alb-ingress-controller/issues/1065. The second one is interesting, where it was mentioned that the pre-stop hook didn't help.
Yes, I saw them.
such as adding a preStop hook to sleep for some seconds
"for _some_ seconds" probably is not enough. In my tests before applying the configuration, I had to add ~2 minutes delay before pod finally died and synthetic tests showed zero 500 errors. So it's a complex issue that includes ALB Ingress Controller, ALB itself and Linkerd tuning. In addition to the set of options I also enabled a small draining period for ALB.
That clears things up for me! My TLDR is that this shouldn't block for the majority of deployments, but if you're on AWS and using ALB ingress it is a huge win.
@KIVagant but only for workloads that need to drain, right?
Our app needs a small pause before SIGTERM is sent, as it doesn't shut down nicely, which is why we want to drain and then terminate. We are trying to minimize any rejected or closed connections.
@grampelberg Just for my understanding: when linkerd gets a SIGTERM, does it kill active connections, or does it first shut down the listener and then wait a bit for active connections to close?
@tobad357 it shuts down the listener and then waits for the connections to close.
@grampelberg but it will still reject new connections if the pod is still an endpoint for the service, correct? So those connections will get an error unless they retry?
While checking on retries, do you know whether, when linkerd does a retry, it tries to route to a different pod or retries the same pod?
but it will still reject new connections if the pod is still an endpoint for the service, correct?
By definition, it will no longer be an endpoint of the service.
While checking on retries, do you know whether, when linkerd does a retry, it tries to route to a different pod or retries the same pod?
Different pod.
But won't there be a short while during which Linkerd has received the SIGTERM and shut down the listener, but the Service hasn't been updated yet to remove the Endpoint?
Won't k8s send new connection attempts to the pod while that is happening?
@tobad357 no, k8s removes the pod from the endpoints and then sends the SIGTERM. There is the possibility that your iptables is overloaded and the propagation doesn't happen immediately, however.
@grampelberg my understanding from reading and testing is that the ep is updated simultaneously with the SIGTERM being sent.
https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods
TIL!
One thing the proxy could do is monitor the pod's endpoint and only stop listening once the pod ip has been removed? Theoretically that should allow the proxy to continue accepting incoming connections until the load balancer has caught up, then it can start draining existing connections like it currently does?
Although care would need to be taken so it doesn't disrupt readiness probes
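A very rough approximation of that idea, expressed as a preStop hook instead of proxy-internal logic. This is only a sketch: it assumes curl is available in the container, that POD_IP and SVC_NAME are injected as environment variables via the downward API, and that the service account may read Endpoints; since the handling of terminating pods in Endpoints varies between Kubernetes versions, a real version would also need a timeout guard.

# Hypothetical preStop: block until this pod's IP disappears from the
# Service's Endpoints, so the listener stays open until the LB catches up.
lifecycle:
  preStop:
    exec:
      command:
      - /bin/bash
      - -c
      - |
        TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
        NS=$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace)
        while curl -s --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
                   -H "Authorization: Bearer ${TOKEN}" \
                   "https://kubernetes.default.svc/api/v1/namespaces/${NS}/endpoints/${SVC_NAME}" \
              | grep -q "\"${POD_IP}\""; do
          sleep 1
        done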
The correct long term solution to this problem continues to be sidecar containers.