Linkerd2: Configure linkerd-proxy to ignore SIGTERM on a per-workload basis

Created on 21 Nov 2019 · 31 comments · Source: linkerd/linkerd2

Feature Request

What problem are you trying to solve?

When a pod has a preStop hook, it stops more slowly than its Linkerd sidecar. This defeats the purpose of the preStop hook, since the pod can no longer communicate with other services once Linkerd has exited.

Use case example

  1. An AWS ALB has a target group that sends traffic to pod A.
  2. kubectl rollout restart is called; Kubernetes starts a new pod B and begins terminating the previous pod A.
  3. The ALB continues sending some traffic even after the instance status changes from Healthy to Draining; there's not much we can do about that. There is also a dedicated "Deregistration delay" option, designed to keep existing connections open for a while to make sure all requests are processed.
  4. An application container inside pod A has a preStop hook and waits N seconds before exiting, in order to serve the existing requests and the new requests that still arrive from the ALB.
  5. Linkerd exits immediately and drops all connections.

The problem also makes it impossible to use the "Slow start duration" setting in the ALB.

Logs example

# a new pod started and Kubernetes marks the old pod as Terminating
# Linkerd immediately says in its logs: 
linkerd-proxy INFO [   919.021686s] linkerd2_proxy::signal received SIGTERM, starting shutdown

# After that the application cannot connect to monitoring and to other services anymore:
E, [2019-11-21T14:09:25.634678 #2719] ERROR -- ddtrace: [ddtrace] (/usr/local/bundle/gems/ddtrace-0.22.0/lib/ddtrace/transport.rb:215:in `log_error_once') Failed to open TCP connection to 10.5.5.8:8126 (Connection refused - connect(2) for "10.5.5.8" port 8126)

# And at the end, after N seconds, the app exits as it was asked. But it no longer makes sense.
# ALB returns many 5xx errors to clients

How should the problem be solved?

If there were a way to inject a custom preStop hook into the Linkerd container with a simple sleep command inside (the container probably does not have the command?..), it might solve the problem. In this case I would ask Linkerd to wait for a long enough period, and Kubernetes would eventually ask it to quit according [to the documentation](https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods):

  1. If one of the Pod's containers has defined a preStop hook, it is invoked inside of the container. If the preStop hook is still running after the grace period expires, step 2 is then invoked with a small (2 second) extended grace period.
  2. The container is sent the TERM signal. Note that not all containers in the Pod will receive the TERM signal at the same time and may each require a preStop hook if the order in which they shut down matters.

Any alternatives you've considered?

I cannot find a way to actually solve the problem. As a workaround, some tuning of liveness/readiness probes and health check thresholds can be performed, and the "Deregistration delay" can be set close to zero to fail faster. But 5xx errors will still be there.

How would users interact with this feature?

It could be a flag for linkerd inject --manual, for example --wait-before-exit 30, that would prevent linkerd-proxy from exiting for N seconds.
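
For illustration only, a hypothetical invocation with such a flag could look like this (the flag does not exist today; the name and semantics are the ones proposed above):

    linkerd inject --manual --wait-before-exit 30 deployment.yaml | kubectl apply -f -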

area/cli area/inject area/proxy help wanted

Most helpful comment

The correct long term solution to this problem continues to be sidecar containers.

All 31 comments

Hmmm... also I found this:

I guess this is a similar ticket for Istio: https://github.com/istio/istio/issues/7136

Unfortunately the pre-stop solution wouldn't fix this issue. The proxy is receiving the SIGTERM from k8s and doing the correct thing. We'll need a configuration option to make the proxy ignore SIGTERM and continue operating until SIGKILL comes after the wait period.

Adding an annotation that does that for specific workloads having a tough time with this seems reasonable. Probably a good first proxy PR even! I've changed the title to keep folks from getting confused, but feel free to change it to something else (or close and open a new issue) if I've missed some of the details.

Actually, I made it work today, experimentally, with the following set of configurations that I injected manually:

        # application container
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 140
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 3
        readinessProbe:
          failureThreshold: 5
          initialDelaySeconds: 120
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 3
        lifecycle:
          preStop:
            exec:
              command:
                - /bin/bash
                - -c
                - sleep 130 && while pkill -QUIT -f "myapp"; do sleep 1; done
        # linkerd-proxy container 
        lifecycle:
          preStop:
            exec:
              command:
                - /bin/bash
                - -c
                - sleep 133 # a little bit bigger period of waiting
    terminationGracePeriodSeconds: 160 # for entire pod

---
# ALB annotations:
...
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=10 # deregistration_delay must be less than or equal to the preStop delay, if Linkerd supports it or it is injected manually
    # alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=1 # must be ZERO until Linkerd supports preStop hooks
    alb.ingress.kubernetes.io/healthy-threshold-count: '2' # must be small and fast; K8s has already performed the readiness probe
    alb.ingress.kubernetes.io/unhealthy-threshold-count: '5' # must be bigger than the K8s livenessProbe (threshold * period) but not too big. K8s removes the pod and the ALB Ingress Controller notifies the target group that the pod's instance must be deregistered. If the Ingress Controller is stuck or dead, the ALB must do it on its own.
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: '5' # must be small and fast, but take the K8s livenessProbe settings into account
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '3' # similar to the internal K8s setting

And the logs in this setup:

app-web-web-749d6b87d5-7nr2p app-web-web method=GET path=/ttest format=*/*
app-web-web-749d6b87d5-7nr2p app-web-web method=GET path=/ttest format=*/*
# ALB deregistered the instance
app-web-web-749d6b87d5-7nr2p app-web-web method=GET path=/health/ user_agent=kube-probe/1.12
app-web-web-749d6b87d5-7nr2p app-web-web method=GET path=/health/ user_agent=kube-probe/1.12
...
# SLEEP is over and it is finally exiting
app-web-web-749d6b87d5-7nr2p app-web-web I, [2019-11-22T14:21:57.566261 #2717]  INFO -- : reaped #<Process::Status: pid 2719 exit 0> worker=0
app-web-web-749d6b87d5-7nr2p app-web-web I, [2019-11-22T14:21:57.566722 #2717]  INFO -- : reaped #<Process::Status: pid 2722 exit 0> worker=1
app-web-web-749d6b87d5-7nr2p app-web-web I, [2019-11-22T14:21:57.567112 #2717]  INFO -- : reaped #<Process::Status: pid 2725 exit 0> worker=2
app-web-web-749d6b87d5-7nr2p app-web-web I, [2019-11-22T14:21:57.567206 #2717]  INFO -- : master complete
app-web-web-749d6b87d5-7nr2p app-web-web I died and this is fine. Now Linkerd it's your turn to die! :-)
app-web-web-749d6b87d5-7nr2p linkerd-proxy ERR! [   974.284298s] linkerd2_proxy::app::errors unexpected error: error trying to connect: Connection refused (os error 111) (address: 127.0.0.1:8080)
app-web-web-749d6b87d5-7nr2p linkerd-proxy ERR! [   976.502988s] linkerd2_proxy::app::errors unexpected error: error trying to connect: Connection refused (os error 111) (address: 127.0.0.1:8080)
app-web-web-749d6b87d5-7nr2p linkerd-proxy ERR! [   979.284328s] linkerd2_proxy::app::errors unexpected error: error trying to connect: Connection refused (os error 111) (address: 127.0.0.1:8080)
app-web-web-749d6b87d5-7nr2p linkerd-proxy ERR! [   981.502903s] linkerd2_proxy::app::errors unexpected error: error trying to connect: Connection refused (os error 111) (address: 127.0.0.1:8080)
# the L5d's sleep period is over
app-web-web-749d6b87d5-7nr2p linkerd-proxy INFO [   983.473174s] linkerd2_proxy::signal received SIGTERM, starting shutdown

Woah, cool! There's something here that I'm not understanding, I'll have to take a look at it some more =)

Configure linkerd-proxy to ignore SIGTERM on a per-workload basis

The title you have set describes one of the possible solutions for the issue. A good solution, but not the only one. There are others:

  1. an HTTP API endpoint in the proxy that can be called in preStop, like /prepare_for_termination or just sleep/30 where 30 can be changed to any value (preStop supports HTTP calls, not only exec); see the sketch after this list.
  2. a solution without preStop, where the proxy checks a file on the file system and exits only once the file has been created, as suggested in the example.
  3. just the ability to inject a preStop into Linkerd with an arbitrary script, where the user can call sleep like I did; this works.
  4. a variant where Linkerd knows how to check whether its "main" container is still alive and exits only after the application has exited.
  5. an option based on the enhancement that adds lifecycle.type: Sidecar to the proxy's container definition (this won't work for old clusters for years).
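
To make option 1 concrete, here is a minimal sketch of what an HTTP-based preStop hook could look like. The /prepare_for_termination path is the hypothetical endpoint proposed above and does not exist in the proxy; 4191 is the proxy's admin port:

        # linkerd-proxy container (hypothetical)
        lifecycle:
          preStop:
            httpGet:
              path: /prepare_for_termination # proposed endpoint, not implemented
              port: 4191 # the proxy's admin port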

I would prefer the variant that I suggested initially in the title, because it is the simplest option and does not require code changes in the proxy itself, only in the proxy injector and, maybe, in the Helm chart.

To generalize, the title could be: "Exit Linkerd proxy after the main container exited (graceful shutdown)"

@kivagant-ba I think you're right, I'm still just surprised that it works. I need to take a look and understand what I'm missing in the story.

@grampelberg

During pod termination, the following occurs:

  • Pod status changes to Terminating

    • Asynchronously, the pod IP is removed from the endpoint (this may take a few seconds)

  • preStop is executed per container
  • SIGTERM is sent to each container

If we ignore Linkerd for a moment, what usually happens is that the container receives a SIGTERM while its IP is still in the endpoint (remember, the ep is updated asynchronously). There's a possibility that the container has completed in-flight requests and exited before the ep is updated; because the IP is still in the ep, requests will be sent to an unresponsive IP, resulting in a timeout or 502.

If we introduce a preStop sleep with the following example:

        # application container
        lifecycle:
          preStop:
            exec:
              command:
                - /bin/bash
                - -c
                - sleep 20

        # linkerd-proxy container 
        lifecycle:
          preStop:
            exec:
              command:
                - /bin/bash
                - -c
                - sleep 40
    terminationGracePeriodSeconds: 160 # for entire pod

The pod is still set to Terminating and the IP is still removed from the ep, but the containers do not receive a SIGTERM for the configured period of time.

What I would like to see, is the ability to configure a preStop for the linkerd proxy, so that I can achieve the following:

  • Pod status changes to Terminating

    • pod IP removed from ep (async)

  • preStop executed in my app and linkerd proxy
  • my app receives a sigterm after 20 seconds, which should be enough time for the ip to be removed from the ep
  • my app begins connection drain. The linkerd proxy is still up so my app is able to send a response to the client
  • my app terminates
  • linkerd proxy receives sigterm after 40 seconds
  • linkerd proxy terminates

@grampelberg , I'd like to try to help with implementing this once you decide which way to go. It is not urgent for us yet, but I believe it will be a desired feature for users.

@KIVagant that's awesome! I like your suggestion the most as it doesn't require proxy changes; we just need to make sure it works. --wait-before-exit seems like a perfectly fine command line flag and annotation for the injector. If we limit this to just modifying the proxy container itself, it seems like it might even be a small patch!
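
As a sketch only, the injector side could surface such a flag as a workload annotation along these lines (the annotation name below is illustrative and does not exist yet):

    # hypothetical annotation the injector would translate into a delay before
    # the proxy exits (e.g. via a preStop sleep on the proxy container)
    metadata:
      annotations:
        config.linkerd.io/wait-before-exit-seconds: "30"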

Ok, sounds like a good next challenge for me to commit something here. When I have time, I'll make a fix ASAP.

That's awesome! Thank you!

Is there a possibility this will be rolled out as a patch to 2.6?
We would need this to move Linkerd into production, as we have quite high RPS and don't want to cause issues for users if we don't have to.

@tobad357 , from my side I can say that tomorrow I'll address the PR review comments. About the release date I cannot say anything, but we too will only move to prod with the solution on board.

@tobad357 it wouldn't be a patch release of 2.6, but you can pick up the next edge after @kivagant-ba lands his PR for it.

I'm not sure how this is related to your high RPS concern though?

With high RPS the amount of 5xx errors is huge without this feature.

@KIVagant but only for workloads that need to drain, right?

It should be like that, but my tests showed a different picture, mostly because when the AWS CNI is used for Kubernetes, the load balancer sends traffic directly to the pod IP, bypassing the internal Kubernetes Service abstraction. This means that even after the pod has entered the Terminating state, a lot of new requests are still able to reach the pod (its IP is directly routable in the VPC). Plus, the ALB has some interesting lag and can continue sending new requests to a target even after it got the command to remove it.

(Ignore my GitHub-account-schizophrenia)

@KIVagant IIRC, this seems like a general problem with the ALB, with or without Linkerd. See https://github.com/kubernetes-sigs/aws-alb-ingress-controller/issues/814 and https://github.com/kubernetes-sigs/aws-alb-ingress-controller/issues/1065. The second one is interesting, where it was mentioned that the pre-stop hook didn't help.

Yes, I saw them.

such as adding a preStop hook to sleep for some seconds

"for _some_ seconds" probably is not enough. In my tests before applying the configuration, I had to add ~2 minutes delay before pod finally died and synthetic tests showed zero 500 errors. So it's a complex issue that includes ALB Ingress Controller, ALB itself and Linkerd tuning. In addition to the set of options I also enabled a small draining period for ALB.

That clears things up for me! My TL;DR is that this shouldn't be a blocker for the majority of deployments, but if you're on AWS and using the ALB ingress, it is a huge win.

@KIVagant but only for workloads that need to drain, right?

Our app needs a small pause before SIGTERM is sent, as it doesn't shut down nicely, which is why we want to drain and then terminate. We are trying to minimize any rejected or closed connections.

@grampelberg Just for my understanding: when Linkerd gets SIGTERM, does it kill active connections, or does it first shut down the listener and then wait a bit for active connections to close?

@tobad357 it shuts down the listener and then waits for the connections to close.

@grampelberg but it will still reject new connections if the pod is still an endpoint for the service, correct? So those connections will get an error unless they retry?

While we're on the subject of retries, do you know whether, when Linkerd does a retry, it routes to a different pod or retries the same pod?

but it will still reject new connections correct if the pod is still an endpoint for the service?

By definition, it will no longer be an endpoint of the service.

While we're on the subject of retries, do you know whether, when Linkerd does a retry, it routes to a different pod or retries the same pod?

Different pod.

But won't there be a short while where Linkerd has received SIGTERM and shut down the listener, but the Service hasn't been updated yet to remove the Endpoint?
Won't k8s send new connection attempts to the pod while that is happening?

@tobad357 no, k8s removes the pod from the endpoints and then sends the SIGTERM. There is the possibility, however, that your iptables rules are overloaded and the propagation doesn't happen immediately.

@grampelberg my understanding from reading and testing is that the ep is updated simultaneously with the SIGTERM being sent.

https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods

TIL!

One thing the proxy could do is monitor the pod's endpoint and only stop listening once the pod IP has been removed. Theoretically that should allow the proxy to continue accepting incoming connections until the load balancer has caught up; then it can start draining existing connections as it currently does.

Although care would need to be taken so that it doesn't disrupt readiness probes.
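
As a rough sketch, the same idea could be approximated today from the outside with a preStop hook instead of a proxy change; this assumes kubectl is available in the container and that SVC_NAME and POD_IP are provided via the environment (both are assumptions, not something the current proxy image supports):

    # hypothetical preStop: block until this pod's IP disappears from the
    # Service's Endpoints, so the proxy keeps listening while the LB catches up
    while kubectl get endpoints "$SVC_NAME" \
        -o jsonpath='{.subsets[*].addresses[*].ip}' | grep -qw "$POD_IP"; do
      sleep 1
    done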

The correct long term solution to this problem continues to be sidecar containers.
