Linkerd2: Controller does not start cleanly on GKE

Created on 28 Feb 2018 · 37 Comments · Source: linkerd/linkerd2

This appears to have changed since the v0.3.0 release.

When I install Conduit on GKE, the pods in the conduit namespace restart multiple times before stabilizing and entering the Running state. I'd expect them to not restart at all. For example:

$ kubectl -n conduit get po
NAME                          READY     STATUS    RESTARTS   AGE
controller-5b5c6c4846-6nxb2   6/6       Running   3          2m
prometheus-598fc79646-zl2dw   3/3       Running   0          2m
web-85799d759c-vz2bv          2/2       Running   0          2m

It's hard to track down which of the containers is causing the pod to restart, but I see this in the proxy-api container's logs:

$ kubectl -n conduit logs controller-5b5c6c4846-6nxb2 proxy-api
time="2018-02-28T00:36:17Z" level=info msg="running conduit version git-9ffe8b79"
time="2018-02-28T00:36:17Z" level=info msg="serving scrapable metrics on :9996"
time="2018-02-28T00:36:17Z" level=info msg="starting gRPC server on :8086"
time="2018-02-28T00:36:27Z" level=error msg="Report: rpc error: code = Unavailable desc = all SubConns are in TransientFailure"
time="2018-02-28T00:36:28Z" level=error msg="Report: rpc error: code = Unavailable desc = all SubConns are in TransientFailure"
time="2018-02-28T00:36:28Z" level=error msg="Report: rpc error: code = Unavailable desc = all SubConns are in TransientFailure"
...
time="2018-02-28T00:36:57Z" level=error msg="Report: rpc error: code = Unknown desc = ResponseCtx is required"

I also see this in the conduit-proxy container's logs:

$ kubectl -n conduit logs controller-5b5c6c4846-6nxb2 conduit-proxy
INFO conduit_proxy using controller at HostAndPort { host: DnsName("localhost"), port: 8086 }
INFO conduit_proxy routing on V4(127.0.0.1:4140)
INFO conduit_proxy proxying on V4(0.0.0.0:4143) to None
INFO conduit_proxy::transport::connect "controller-client", DNS resolved DnsName("localhost") to 127.0.0.1
ERR! conduit_proxy::map_err turning service error into 500: Inner(Upstream(Inner(Inner(Error { kind: Inner(Error { kind: Proto(INTERNAL_ERROR) }) }))))
WARN conduit_proxy::control::telemetry "controller-client", controller error: Grpc(Status { code: Unavailable })
ERR! conduit_proxy::map_err turning service error into 500: Inner(Upstream(Inner(Inner(Error { kind: Inner(Error { kind: Proto(INTERNAL_ERROR) }) }))))
ERR! conduit_proxy::map_err turning service error into 500: Inner(Upstream(Inner(Inner(Error { kind: Inner(Error { kind: Proto(INTERNAL_ERROR) }) }))))
WARN conduit_proxy::control::telemetry "controller-client", controller error: Grpc(Status { code: Unavailable })
ERR! conduit_proxy::map_err turning service error into 500: Inner(Upstream(Inner(Inner(Error { kind: Inner(Error { kind: Proto(INTERNAL_ERROR) }) }))))
ERR! conduit_proxy::map_err turning service error into 500: Inner(Upstream(Inner(Inner(Error { kind: Inner(Error { kind: Proto(INTERNAL_ERROR) }) }))))
WARN conduit_proxy::control::telemetry "controller-client", controller error: Grpc(Status { code: Unavailable })
ERR! conduit_proxy::map_err turning service error into 500: Inner(Upstream(Inner(Inner(Error { kind: Inner(Error { kind: Proto(INTERNAL_ERROR) }) }))))
WARN conduit_proxy::control::telemetry "controller-client", controller error: Grpc(Status { code: Unknown })

If I had to guess, these errors are likely the result of the Go processes trying to route traffic before the proxy has initialized, a behavior that changed in #365.

Here's the version I'm testing against:

$ ./bin/conduit version
Client version: git-9ffe8b79
Server version: git-9ffe8b79
area/controller priority/P0

All 37 comments

I've confirmed that the controller starts cleanly if I roll back to 7365c56, but adding f66ec64 produces this issue.

As I understand it, the controller doesn't actually initiate any traffic on its own, whereas the proxy does spontaneously generate traffic (telemetry reports). Depending on startup order, the proxy could start sending reports before the telemetry service has started, which could explain these errors in the proxy-api service.

My question is: why does this result in pods getting restarted? I would hope that these errors would just result in those telemetry reports being lost.

Does the controller eventually get into a working state? How long does it take the controller (e.g. in seconds) to get into a working state and is that significantly longer than previously?

If the controller takes significantly longer to get into a working state, or if it doesn't get into a working state at all, then that is a significant problem. However, if the issue is just that the logs are messy, that's annoying, but I'm not sure it's a serious issue. To some extent I would expect that, based on the thinking in https://medium.com/@kelseyhightower/12-fractured-apps-1080c73d481c.

From anecdotal testing, it takes roughly a minute's worth of restarts for the controller to get into a working state.

From anecdotal testing, it takes roughly a minute's worth of restarts for the controller to get into a working state.

=> P0 then.

/cc @deebo91

It would be interesting to see what the inner messages reveal with @hawkw's changes to logging. I can take a look at this when I get the chance.

@deebo91 I hope the improved log messages are helpful --- if you still need help deciphering the proxy logs, let me know.

More info: I've also seen the prometheus pod restart on occasion:

$ kubectl -n kltest4 get po
NAME                          READY     STATUS    RESTARTS   AGE
controller-759585d77b-kwhxm   6/6       Running   2          35m
prometheus-6d7c65c84f-x8ncg   3/3       Running   1          35m
web-bbc49cc79-w5svt           2/2       Running   0          35m

Looking at the kubectl describe output, it looks like the kubectl container is the one that exits and causes the pod to restart:

$ kubectl -n kltest4 describe po prometheus-6d7c65c84f-x8ncg 
...
  kubectl:
    ...
    State:          Running
      Started:      Tue, 27 Feb 2018 17:12:33 -0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 27 Feb 2018 17:12:31 -0800
      Finished:     Tue, 27 Feb 2018 17:12:32 -0800
    Ready:          True
    Restart Count:  1

Looking at the kubectl container's log, I see:

$ kubectl -n kltest4 logs prometheus-6d7c65c84f-x8ncg kubectl --previous
The connection to the server 10.51.240.1:443 was refused - did you specify the right host or port?

This makes me think that there's a race between starting the kubectl and conduit-proxy containers in that pod. If kubectl starts after the iptables rules have been adjusted but before conduit-proxy is available, it will exit and restart.
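
For illustration, a sidecar-dependent container could sidestep this race by waiting until the proxy accepts connections before doing any network work. A minimal sketch, assuming the proxy's outbound listener at 127.0.0.1:4140 from the logs above (this is a hypothetical helper; nothing like it ships with Conduit):

package main

import (
	"fmt"
	"net"
	"time"
)

// waitForProxy polls the proxy's outbound listener until a TCP dial
// succeeds or the deadline passes. Hypothetical sketch only; the
// address is taken from the conduit-proxy logs above.
func waitForProxy(addr string, deadline time.Duration) error {
	stop := time.Now().Add(deadline)
	for time.Now().Before(stop) {
		conn, err := net.DialTimeout("tcp", addr, time.Second)
		if err == nil {
			conn.Close()
			return nil // proxy is accepting connections
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("proxy at %s not ready after %s", addr, deadline)
}

func main() {
	// 4140 is the outbound port shown in the conduit-proxy logs above.
	if err := waitForProxy("127.0.0.1:4140", 30*time.Second); err != nil {
		panic(err)
	}
	// Only now start making network calls through the mesh.
}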

I think I see a very similar thing happening with the telemetry service:

$ kubectl -n alex logs -p controller-3424828038-9f8s6 telemetry
time="2018-02-28T01:37:28Z" level=info msg="running conduit version git-1e611c21"
time="2018-02-28T01:37:58Z" level=fatal msg="[pods watcher] timed out"

I think this is the telemetry service attempting to start a watch on the Kubernetes API before the proxy is fully initialized or something along those lines...

@adleong Ah, yep, same thing in the deployments I've been testing. This controller had 2 restarts:

controller-759585d77b-kwhxm   6/6       Running   2          41m

The tap service and the telemetry service both restarted after 30 seconds of trying to reach the kubernetes API:

$ kubectl -n kltest4 describe po controller-759585d77b-kwhxm
...
  tap:
    ...
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 27 Feb 2018 17:12:36 -0800
      Finished:     Tue, 27 Feb 2018 17:13:06 -0800
    ...
  telemetry:
    ...
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 27 Feb 2018 17:12:37 -0800
      Finished:     Tue, 27 Feb 2018 17:13:07 -0800

That 30-second timeout is applied here: https://github.com/runconduit/conduit/blob/master/controller/k8s/watcher.go#L59

We can try lengthening that timeout to give the proxy containers a chance to start, though 30 seconds already seems pretty generous....
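
For context, here is a hedged sketch of the shape of such a deadline-bounded wait (illustrative only, not the actual watcher.go source; the synced predicate is a stand-in). If the background ListAndWatch goroutine has already exited, the predicate never becomes true, so the wait always burns the full timeout before failing:

// waitForSync is an illustrative stand-in for the linked watcher.go
// logic, not the real code: poll until the watch reports it has
// synced, or give up after the timeout.
func waitForSync(resourceType string, timeout time.Duration, synced func() bool) error {
    deadline := time.After(timeout) // 30 seconds in the linked code
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            if synced() {
                return nil
            }
        case <-deadline:
            // matches the "[pods watcher] timed out" fatal seen above
            return fmt.Errorf("[%s watcher] timed out", resourceType)
        }
    }
}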

The proxy shouldn't be taking 30 seconds to start, at least not regularly and repeatably. If it does that's a bug.

I think the issue is that the controller attempts to establish the connection prior to the proxy starting, and then that connection hangs until the controller gives up and exits. If the controller waited a few seconds before establishing the connection, however, I bet it would work.

The proxy can't expect that every (any) proxied service will wait before it starts networking, so it would be ideal to find a solution that doesn't require modifying the controller Go code.

@briansmith mentioned that my changes in #483, increasing the proxy's default private connect timeout to 10 seconds, may help with this issue. @deebo91, if you have a chance to try out that branch, please let me know your results!

@hawkw I will test your branch out and keep you posted!

It seems like we could possibly fix this by improving the retry logic in the controller. However, as @briansmith points out, there's a more fundamental problem where container startup order is not deterministic and applications with the Conduit proxy sidecar may attempt to make network calls before the proxy container is started.

I am skeptical that the proxy's connect timeout has anything to do with this, since we wrap all services with Reconnect (or should be).

It seems like we could possibly fix this by improving the retry logic in the controller.

I suggest:

  • We improve the retry logic for outbound connections in the controller like Alex suggests, to make the controller a "well-behaving application."
  • We stop injecting the proxy into the Prometheus pod, and then file a new issue about making the Prometheus pod a "well-behaving application" to allow re-enabling the proxy in it.
  • We file an issue to figure out what to do when an application isn't "well-behaving," where "well-behaving" basically means "follows the guidelines in https://medium.com/@kelseyhightower/12-fractured-apps-1080c73d481c".

In particular, I suspect that fully supporting applications that aren't "well-behaving" is going to require a pretty big change to Conduit, and such a change almost definitely isn't appropriate for 0.3.1 and may not even be possible for 0.4.0.

I am skeptical that the proxy's connect timeout has anything to do with this, since we wrap all services with Reconnect (or should be).

The timeout is extraordinarily low (20ms, IIRC) and in a reconnect loop. Earlier in the issue it was reported that the proxy seemed to be taking >30 seconds to start up, and I thought (think) it could be in a tight reconnect loop here.

@briansmith I see! As discussed in #483, I'm in favor of removing the connect timeout entirely if we believe it's causing problems.

What is the current plan to address this issue (not generally, but in the specific case of the conduit controller)?

The plan was to make each controller component that establishes a watch with k8s retry its connections on failure. So that would be destination, tap, and telemetry.

@dadjeibaah Does anything need to be done for the prometheus pod?

I believe we have to do that for the Prometheus pod as well, since it has to talk to the controller for metrics.

@klingerf and I looked at Go code that might be contributing to this connect issue. We observed the following in pods.go:

func (p *PodIndex) Run() error {
    go p.reflector.ListAndWatch(p.stopCh)
    return newWatcher(p.reflector, podResource).run()
}

The go p.reflector.ListAndWatch(p.stopCh) line runs ListAndWatch in a separate goroutine, and its returned error is never handled. We then continue on to run newWatcher(...), which does the actual 30-second wait and then terminates with a timeout error. This Run code is used in endpoints.go, replicasets.go, etc.; pretty much everything that interacts with the k8s watch API.

The change @klingerf proposed (correct me if I have it wrong) is that we wrap the ListAndWatch method in a go func that captures the error. If an error occurs, we retry after a brief delay. Once we reach a maximum number of retries, we can safely stop the server. This should stop the k8s-API-dependent containers from restarting, which is what eventually forces the pods to restart.
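
A minimal sketch of that bounded-retry proposal, with hypothetical names and shutdown strategy (not the change that was ultimately merged):

// Hypothetical sketch of the bounded-retry proposal; not merged code.
func runWithRetries(listAndWatch func(stopCh <-chan struct{}) error, stopCh <-chan struct{}, maxRetries int) {
    go func() {
        for attempt := 1; attempt <= maxRetries; attempt++ {
            err := listAndWatch(stopCh)
            if err == nil {
                return // watch ran to completion (stopCh closed)
            }
            log.Errorf("watch attempt %d/%d failed: %s", attempt, maxRetries, err)
            time.Sleep(1 * time.Second) // brief delay before retrying
        }
        // Out of retries: stop the server rather than hang forever.
        log.Fatal("watch failed after retries; shutting down")
    }()
}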

@dadjeibaah Thanks for the writeup! I think you've accurately captured the issue, which is that the goroutine exits immediately with an error when it can't talk to kubeapi, and then the watcher code waits 30 seconds for the reflector to initialize, which of course can't happen since the goroutine is no longer running in the background. To fix, we should retry on error inside the goroutine. Something like:

func (p *PodIndex) Run() error {
    go func() {
        for {
            err := p.reflector.ListAndWatch(p.stopCh)
            if err != nil {
                log.Errorf("Run error: %s", err)
            }
            time.Sleep(1 * time.Second) // brief delay before retrying
        }
    }()
    return newWatcher(p.reflector, podResource).run()
}

And since we use newWatcher in a few places, all of which will require this retry pattern, it might make sense to generalize that as a function in watcher.go, but not required.

@klingerf I took a stab at generalizing retry logic for the watcher. This is what I came up with:

type watchInitializer func(stopCh <-chan struct{}) error

type watcher struct {
    resource     resourceToWatch
    resourceType string
    timeout      time.Duration
    watchInit    watchInitializer
    stopCh       <-chan struct{}
}

func (w *watcher) run() error {
        ...
    if w.watchInit != nil {
        go func() {
            for {
                err := w.watchInit(w.stopCh)
                if err != nil {
                    log.Errorf("Error establishing watch in [%s watcher]: %s", w.resourceType, err)
                }
                time.Sleep(1 * time.Second) // brief delay before retrying
            }
        }()
    }
      ...

This code is in the watcher.go file. Basically, instead of running the ListAndWatch logic in each resource-specific file (endpoints.go, replicasets.go, etc.), we can pass the ListAndWatch function into watcher.go and perform the retry logic there. We can set it up so that the watchInit field holds a reference to ListAndWatch. I think this is what you were referring to when you mentioned passing functions into functions, but I may be wrong.
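
As a usage sketch, pods.go would then just construct a watcher and let run() own the retry. The field values below are guesses based on the snippets above, not actual code:

// Illustrative only; field values are guesses, not the actual pods.go.
func (p *PodIndex) Run() error {
    w := &watcher{
        resource:     podResource,
        resourceType: "pods",
        timeout:      30 * time.Second,
        watchInit:    p.reflector.ListAndWatch, // retried inside run()
        stopCh:       p.stopCh,
    }
    return w.run()
}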

@dadjeibaah Looks great! That's exactly what I meant.

@klingerf, I would like to get your thoughts on how to approach adding retry logic in the endpoints.go package. The Run method in endpoints.go has a different function signature than the run methods of pods.go etc. The difference is that in endpoints.go, the i.informer.Run(stopCh) call does not return an error, unlike the Run methods in pods.go.

It seems that i.informer.Run(stopCh) panics instead of returning an error. One approach I took to capture the panic is this:

// The named return value lets the deferred func overwrite err
// when a panic is recovered.
watch := func(stopCh <-chan struct{}) (err error) {
    defer func() {
        if x := recover(); x != nil {
            err = fmt.Errorf("failed endpoint [%s] watcher", endpointResource)
        }
    }()
    i.informer.Run(stopCh)
    return
}

Basically, I am wrapping i.informer.Run(stopCh) in another function that satisfies the new function signature and returns an error. I include the defer func() to capture the panic and then return the error. However, that seems not to do the trick.

@dadjeibaah Ah, just saw this, and I commented with a very similar suggestion on the PR. You're right that Run unfortunately panics instead of returning an error. Your approach of wrapping it in a func that does return an error fitting the method signature is what I recommended too.

I just started to test conduit and got the same error as mentioned:
1520338511.612761 ERR! conduit_proxy::map_err turning service error into 500: Inner(Upstream(Closed))

However, I encountered it a bit differently. In my setup, I have the controller running, and I use Helm charts to do my whole deployment. I add Conduit to the deployment.yaml for the pods I want included in the service mesh.
I get the issue when I do helm install, but it does not happen every time; it's likely a timing issue. I only see the error when issuing helm install, and I do not (or at least not yet) see it when scaling my pods or restarting the pods that have the issue.

Also, it seems like the pods (even with the error) are added to the service mesh, because Conduit is load balancing over them, including the faulty pod. However, the messages never reach the application container. I can see all pods on the Conduit dashboard.

Conduit with debug:
1520338507.632551 DBUG conduit_proxy::outbound building outbound Http1 client to LocalSvc(FullyQualifiedAuthority(pod.namespace.svc:8080))
1520338507.634008 DBUG conduit_proxy::telemetry::sensor::transport client connection open
1520338511.612761 ERR! conduit_proxy::map_err turning service error into 500: Inner(Upstream(Closed))

I know the plan is to include Helm support in the next release, but do you think my problem could be related to the timing issue mentioned above?

@capathida interesting! I don't think we should assume that your issue is the same as this one. Would you mind opening a new issue describing how you encountered this so that we can try to reproduce the behavior? Thanks!

I filed #522 to track an additional issue with controller containers not starting cleanly. I think we should ship a separate fix for that issue.

Created issue https://github.com/runconduit/conduit/issues/529, as agreed, for the service error 500 on the sidecar.
