Autoscaler: cluster-autoscaler: crashes when k8s API is updated

Created on 25 Nov 2019  ·  20 Comments  ·  Source: kubernetes/autoscaler

We are using AWS EKS, and when AWS periodically updates the EKS service, we see cluster-autoscaler crash. For example, last week the service was updated from v1.13.11 to v1.13.12, and this caused the pod to crash. Here's the last state of the pod:

    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Tue, 19 Nov 2019 02:03:57 +0100
      Finished:     Tue, 19 Nov 2019 02:04:27 +0100

There's nothing really interesting in the logs at this time, just this:

    I1119 01:03:57.820185       1 main.go:333] Cluster Autoscaler 1.13.1
    F1119 01:04:27.821536       1 main.go:355] Failed to get nodes from apiserver: Get https://172.20.0.1:443/api/v1/nodes: dial tcp 172.20.0.1:443: i/o timeout

The metrics-server also crashed at the same time, so perhaps this is an issue in one of the Go dependencies?

lifecycle/rotten

All 20 comments

Version: k8s.gcr.io/cluster-autoscaler:v1.13.1

Hi @max-rocket-internet,

This exit is intentional: if we cannot reach the API server while we are not yet running (during initialization), we exit.
https://github.com/kubernetes/autoscaler/blob/3413247e67d746acd7ff0ea945c0a5bb51a40e16/cluster-autoscaler/main.go#L390
Why do you see it as a problem? The kubelet (if CA is deployed as a static pod) or the deployment controller (otherwise) will restart CA on a regular basis anyway.

> Why do you see it as a problem?

It's definitely a problem. It's an error, check the reason and exit code. Cluster updates are happening every month or so and nothing else crashes in this process. We have monitoring and alerts for these events.

The CA should recover from this without exiting with non-zero status IMO 🙂

> The kubelet or deployment controller will be restarting CA on regular basis anyway.

Why? We don't see any restarts of the pod outside of crashes and updates?

> > Why do you see it as a problem?
>
> It's definitely a problem. It's an error, check the reason and exit code. Cluster updates are happening every month or so and nothing else crashes in this process. We have monitoring and alerts for these events.
>
> The CA should recover from this without exiting with non-zero status IMO

I agree it would be cleaner. I was just pointing out that a crash is not the end of the world, as the CA pod will be restarted after the crash anyway.
And it would not work without access to the API server anyway.

Actually, crashing on a fatal error in main() like we do (e.g. on a lost leader election token) is common for other k8s controllers.
E.g. here: https://github.com/kubernetes/kubernetes/blob/46a29a0cc30c0e601febd93a5851fcce615c2964/cmd/cloud-controller-manager/app/controllermanager.go#L118
I assume it does not manifest as a crash to you because controller-manager runs on the master and is restarted together with the API server on upgrade.
Are you running CA on the master or on standard cluster nodes?

Also, are you running a single k8s master? A regional setup with multiple masters would also help, as your CA would not lose connectivity to the control plane.

> > The kubelet or deployment controller will be restarting CA on regular basis anyway.
>
> Why? We don't see any restarts of the pod outside of crashes and updates?

I meant restart after crash :)

> I agree it would be cleaner.

Cool 😃

> I was just pointing that crash is not end of the world as the CA pod will be restarted after crash anyway.

Agreed. We just have a low tolerance for misbehaving containers.

> And it would not work without access to API server anyway.

Yes but it could perhaps retry in a loop for a while before exiting with error?

> because controller-manager is running on master and is restarted together with API server on upgrade.
> Also are you running single k8s master? Regional setup with multiple masters

AWS EKS. It's a service. No masters we can see.

> Are you running CA on master or on standard cluster nodes?

On the standard cluster nodes

> Actually crashing on fatal error in main() like we do is common for other k8s controllers.

OK, but we have many other apps in our cluster that use the k8s API and do not crash at this time 🙂 e.g. ingress-controllers, datadog, kube-proxy, external-dns, node-problem-detector, aws-vpc-cni, prometheus, k8s-event-logger, etc.

> Yes but it could perhaps retry in a loop for a while before exiting with error?

Makes sense. Happy to accept a PR :)

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Would love to see this fixed as well. This behaviour triggers our alert system during rolling updates of our cluster.

> this behaviour triggers our alert system during rolling updates of our cluster.

That's exactly our problem also.

/remove-lifecycle rotten

/remove-lifecycle stale

Hmm. An EKS rolling upgrade terminates the master. The load balancer times out in-flight requests that have not finished. In some edge cases, it's possible that the master node is not removed from the load balancer and remains as a dead backend. My teammate is working on making upgrades smoother.

Would like to see this fixed. Also seeing issues with EKS.

/remove-lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
