Linkerd2: Do not add proxy into prometheus pod during install

Created on 3 Mar 2018 · 7Comments · Source: linkerd/linkerd2

This issue is related to #478 and #485. As of 0.3.1, the conduit pods with a proxy sidecar undergoes restarts because containers in controller, prometheus and web pods are unable establish connections between each other during proxy startup. The best course of action to handle this failure is for apps within containers to retry on network failure. Retries can be easily implemented in the tap, destination and telemetry services since we have access to the code. However, We can't add retry logic to the kubectl proxy since we can't modify the source code for that application.

One solution is to stop injecting the proxy into the prometheus pod. This may require a change in the InjectYAML in inject.go. We would have to establish that a specific deployment, replicaset, etc has some criteria that notifies the install script that it should not inject the proxy into a deployment resource.

help wanted prioritP0

Source

dadjeibaah

❤1

Most helpful comment

I don't think it's problematic to continue to inject the prometheus pod. The restart that happens in that pod is from the kubectcl container, not the prometheus container. We're running a kubectl proxy alongside prometheus so that prometheus can securely scrape the kubernetes api. The kubectl proxy process makes a network call on startup, which fails if the conduit proxy process has not yet started, causing the kubectl container to exit. Here's a description from the kubectl describe command of one of these restarts:

  kubectl:
    ...
    State:          Running
      Started:      Tue, 27 Feb 2018 17:12:33 -0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 27 Feb 2018 17:12:31 -0800
      Finished:     Tue, 27 Feb 2018 17:12:32 -0800
    Ready:          True
    Restart Count:  1

You can see that it happens almost immediately, and as far as I can tell the restart has no bearing the ability of prometheus to successfully collect metrics. Prometheus does the right thing and retries if the scrape target is unavailable. So aside from the unsightly "1 restart" in the output of the kubectl command, I don't think conduit functionality is affected. Maybe the outcome of #485 could help improve this further, but I think it's ok for now.

klingerf on 3 Mar 2018

👍2

All 7 comments

As of 0.3.0, the conduit pods with a proxy sidecar undergoes restarts because containers in controller

Note: 0.3.1, not 0.3.0. 0.3.0 is fine in this respect.

I think it is fine to stop injecting the proxy into the prometheus pod for 0.3.1.

In the future (0.4.0; maybe later is OK) we need to find a solution that lets us inject the proxy into Prometheus. We should probably contribute such a fix upstream to whoever maintains the Prometheus docker image, so that when somebody tries to inject conduit into their own instances of Prometheus, it works. I guess that should be a separate follow-up issue.

briansmith on 3 Mar 2018

We should probably contribute such a fix upstream to whoever maintains the Prometheus docker image

I agree. We could try to shoot for 0.4.0 but my guess is that it might be later.

As a possible solution, we could use a label in a the install YAML file that indicates whether a PodSpec should contain a proxy sidecar. We could this label to the prometheus section in our install template and have the inject code read the label and ignore injecting the sidecar into prometheus.

dadjeibaah on 3 Mar 2018

We could this label to the prometheus section in our install template and have the inject code read the label and ignore injecting the sidecar into prometheus.

I think that something like that would be a great feature to add to conduit inject but it probably requires design work that will require more than 1 day to sort out. I think we should just make a minimal change to the inject.go code to skip the pod "prometheus" in the conduit-namespace, siimlar to how to change the controller URL when the pod is "controller" in the conduit-namespace.

briansmith on 3 Mar 2018

  kubectl:
    ...
    State:          Running
      Started:      Tue, 27 Feb 2018 17:12:33 -0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 27 Feb 2018 17:12:31 -0800
      Finished:     Tue, 27 Feb 2018 17:12:32 -0800
    Ready:          True
    Restart Count:  1

klingerf on 3 Mar 2018

👍2

If I understand correctly, #509 will resolve this! 🎉 🌮

olix0r on 4 Mar 2018

Awesome! once the PR get merge I can test on GKE, unless it has already been tested.

dadjeibaah on 4 Mar 2018

@klingerf wrote:

I don't think it's problematic to continue to inject the prometheus pod. The restart that happens in that pod is from the kubectcl container, not the prometheus container. We're running a kubectl proxy alongside prometheus so that prometheus can securely scrape the kubernetes api. [...] Prometheus does the right thing and retries if the scrape target is unavailable. So aside from the unsightly "1 restart" in the output of the kubectl command, I don't think conduit functionality is affected. Maybe the outcome of #485 could help improve this further, but I think it's ok for now.

i'm closing this based on @klingerf's analysis above.

Also, PR #509 removes the kubectl container completely. it turns out it wasn't even being used anyway! So even the aesthetic problem with the logs should be fixed.

briansmith on 5 Mar 2018

Was this page helpful?

0 / 5 - 0 ratings

Related issues

linkerd dashboard does not work on Windows Subsystem for Linux

skalinets · 3Comments

Typo in check error message

adleong · 4Comments

Add support for rate limiting

steve-fraser · 4Comments

Proxy's logs have timestamps in an unreadable format different than what's typically used by Kubernetes

briansmith · 4Comments

Document service discovery and load balancing

briansmith · 4Comments