Linkerd2: Do not add proxy into prometheus pod during install

Created on 3 Mar 2018  路  7Comments  路  Source: linkerd/linkerd2

This issue is related to #478 and #485. As of 0.3.1, the conduit pods with a proxy sidecar undergoes restarts because containers in controller, prometheus and web pods are unable establish connections between each other during proxy startup. The best course of action to handle this failure is for apps within containers to retry on network failure. Retries can be easily implemented in the tap, destination and telemetry services since we have access to the code. However, We can't add retry logic to the kubectl proxy since we can't modify the source code for that application.

One solution is to stop injecting the proxy into the prometheus pod. This may require a change in the InjectYAML in inject.go. We would have to establish that a specific deployment, replicaset, etc has some criteria that notifies the install script that it should not inject the proxy into a deployment resource.

help wanted prioritP0

Most helpful comment

I don't think it's problematic to continue to inject the prometheus pod. The restart that happens in that pod is from the kubectcl container, not the prometheus container. We're running a kubectl proxy alongside prometheus so that prometheus can securely scrape the kubernetes api. The kubectl proxy process makes a network call on startup, which fails if the conduit proxy process has not yet started, causing the kubectl container to exit. Here's a description from the kubectl describe command of one of these restarts:

  kubectl:
    ...
    State:          Running
      Started:      Tue, 27 Feb 2018 17:12:33 -0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 27 Feb 2018 17:12:31 -0800
      Finished:     Tue, 27 Feb 2018 17:12:32 -0800
    Ready:          True
    Restart Count:  1

You can see that it happens almost immediately, and as far as I can tell the restart has no bearing the ability of prometheus to successfully collect metrics. Prometheus does the right thing and retries if the scrape target is unavailable. So aside from the unsightly "1 restart" in the output of the kubectl command, I don't think conduit functionality is affected. Maybe the outcome of #485 could help improve this further, but I think it's ok for now.

All 7 comments

As of 0.3.0, the conduit pods with a proxy sidecar undergoes restarts because containers in controller

Note: 0.3.1, not 0.3.0. 0.3.0 is fine in this respect.

I think it is fine to stop injecting the proxy into the prometheus pod for 0.3.1.

In the future (0.4.0; maybe later is OK) we need to find a solution that lets us inject the proxy into Prometheus. We should probably contribute such a fix upstream to whoever maintains the Prometheus docker image, so that when somebody tries to inject conduit into their own instances of Prometheus, it works. I guess that should be a separate follow-up issue.

We should probably contribute such a fix upstream to whoever maintains the Prometheus docker image

I agree. We could try to shoot for 0.4.0 but my guess is that it might be later.

As a possible solution, we could use a label in a the install YAML file that indicates whether a PodSpec should contain a proxy sidecar. We could this label to the prometheus section in our install template and have the inject code read the label and ignore injecting the sidecar into prometheus.

We could this label to the prometheus section in our install template and have the inject code read the label and ignore injecting the sidecar into prometheus.

I think that something like that would be a great feature to add to conduit inject but it probably requires design work that will require more than 1 day to sort out. I think we should just make a minimal change to the inject.go code to skip the pod "prometheus" in the conduit-namespace, siimlar to how to change the controller URL when the pod is "controller" in the conduit-namespace.

I don't think it's problematic to continue to inject the prometheus pod. The restart that happens in that pod is from the kubectcl container, not the prometheus container. We're running a kubectl proxy alongside prometheus so that prometheus can securely scrape the kubernetes api. The kubectl proxy process makes a network call on startup, which fails if the conduit proxy process has not yet started, causing the kubectl container to exit. Here's a description from the kubectl describe command of one of these restarts:

  kubectl:
    ...
    State:          Running
      Started:      Tue, 27 Feb 2018 17:12:33 -0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 27 Feb 2018 17:12:31 -0800
      Finished:     Tue, 27 Feb 2018 17:12:32 -0800
    Ready:          True
    Restart Count:  1

You can see that it happens almost immediately, and as far as I can tell the restart has no bearing the ability of prometheus to successfully collect metrics. Prometheus does the right thing and retries if the scrape target is unavailable. So aside from the unsightly "1 restart" in the output of the kubectl command, I don't think conduit functionality is affected. Maybe the outcome of #485 could help improve this further, but I think it's ok for now.

If I understand correctly, #509 will resolve this! 馃帀 馃尞

Awesome! once the PR get merge I can test on GKE, unless it has already been tested.

@klingerf wrote:

I don't think it's problematic to continue to inject the prometheus pod. The restart that happens in that pod is from the kubectcl container, not the prometheus container. We're running a kubectl proxy alongside prometheus so that prometheus can securely scrape the kubernetes api. [...] Prometheus does the right thing and retries if the scrape target is unavailable. So aside from the unsightly "1 restart" in the output of the kubectl command, I don't think conduit functionality is affected. Maybe the outcome of #485 could help improve this further, but I think it's ok for now.

i'm closing this based on @klingerf's analysis above.

Also, PR #509 removes the kubectl container completely. it turns out it wasn't even being used anyway! So even the aesthetic problem with the logs should be fixed.

Was this page helpful?
0 / 5 - 0 ratings