Autoscaler: cluster-autoscaler: crashes when k8s API is updated

Created on 25 Nov 2019 · 20 Comments · Source: kubernetes/autoscaler

We are using AWS EKS, and when AWS periodically updates the EKS service, we see cluster-autoscaler crash. For example, last week the service was updated from v1.13.11 to v1.13.12, and this caused the pod to crash. Here's the last state of the pod:

    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Tue, 19 Nov 2019 02:03:57 +0100
      Finished:     Tue, 19 Nov 2019 02:04:27 +0100

There's nothing really interesting in the logs at this time, just this:

    I1119 01:03:57.820185       1 main.go:333] Cluster Autoscaler 1.13.1
    F1119 01:04:27.821536       1 main.go:355] Failed to get nodes from apiserver: Get https://172.20.0.1:443/api/v1/nodes: dial tcp 172.20.0.1:443: i/o timeout

The metrics-server also crashed at the same time, so perhaps this is an issue in one of the Go dependencies?

lifecycle/rotten

max-rocket-internet

Most helpful comment

Hmm. An EKS rolling upgrade terminates the master. The load balancer times out in-flight requests that have not finished. In some edge cases, it's possible that the master node is not removed from the load balancer and a dead backend remains. My teammate is working on making upgrades smoother.

Jeffwan on 5 Aug 2020

👍3

All 20 comments

Version: k8s.gcr.io/cluster-autoscaler:v1.13.1

max-rocket-internet on 25 Nov 2019

Hi @max-rocket-internet,

This is an intentional exit: if we cannot reach the API server during initialization, before CA is running, we exit.
https://github.com/kubernetes/autoscaler/blob/3413247e67d746acd7ff0ea945c0a5bb51a40e16/cluster-autoscaler/main.go#L390
Why do you see it as a problem? The kubelet (if CA is deployed as a static pod) or the deployment controller (otherwise) will be restarting CA on a regular basis anyway.

losipiuk on 25 Nov 2019

> Why do you see it as a problem?

It's definitely a problem. It's an error, check the reason and exit code. Cluster updates are happening every month or so and nothing else crashes in this process. We have monitoring and alerts for these events.

The CA should recover from this without exiting with non-zero status IMO 🙂

> The kubelet or deployment controller will be restarting CA on a regular basis anyway.

Why? We don't see any restarts of the pod outside of crashes and updates?

max-rocket-internet on 25 Nov 2019

> > Why do you see it as a problem?

> It's definitely a problem. It's an error, check the reason and exit code. Cluster updates are happening every month or so and nothing else crashes in this process. We have monitoring and alerts for these events.

> The CA should recover from this without exiting with non-zero status IMO

I agree it would be cleaner. I was just pointing out that a crash is not the end of the world, as the CA pod will be restarted after a crash anyway.
And it would not work without access to the API server anyway.

Actually, crashing on a fatal error in main() like we do (e.g. on lost leader election token) is common for other k8s controllers.
E.g. here: https://github.com/kubernetes/kubernetes/blob/46a29a0cc30c0e601febd93a5851fcce615c2964/cmd/cloud-controller-manager/app/controllermanager.go#L118
I assume it does not manifest as a crash for you because controller-manager runs on the master and is restarted together with the API server on upgrade.
Are you running CA on the master or on standard cluster nodes?

Also, are you running a single k8s master? A regional setup with multiple masters would also help, as your CA would not lose connectivity to the control plane.

> > The kubelet or deployment controller will be restarting CA on a regular basis anyway.

> Why? We don't see any restarts of the pod outside of crashes and updates?

I meant restart after crash :)

losipiuk on 25 Nov 2019

> I agree it would be cleaner.

Cool 😃

> I was just pointing out that a crash is not the end of the world, as the CA pod will be restarted after a crash anyway.

Agreed. We just have a low tolerance for misbehaving containers.

> And it would not work without access to the API server anyway.

Yes, but perhaps it could retry in a loop for a while before exiting with an error?

> because controller-manager is running on master and is restarted together with API server on upgrade.
> Also are you running single k8s master? Regional setup with multiple masters

We're on AWS EKS. It's a managed service, so there are no masters we can see.

> Are you running CA on master or on standard cluster nodes?

On the standard cluster nodes

> Actually, crashing on a fatal error in main() like we do is common for other k8s controllers.

OK, but our cluster runs many other apps that use the k8s API and do not crash at this time 🙂 e.g. ingress-controllers, datadog, kube-proxy, external-dns, node-problem-detector, aws-vpc-cni, prometheus, k8s-event-logger, etc.

max-rocket-internet on 25 Nov 2019

> Yes, but perhaps it could retry in a loop for a while before exiting with an error?

Makes sense. Happy to accept a PR :)

losipiuk on 28 Nov 2019

❤1

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 26 Feb 2020

/remove-lifecycle stale

max-rocket-internet on 27 Feb 2020

/lifecycle stale

fejta-bot on 27 May 2020

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot on 26 Jun 2020

Would love to see this fixed as well; this behaviour triggers our alert system during rolling updates of our cluster.

ltagliamonte-dd on 1 Jul 2020

> this behaviour triggers our alert system during rolling updates of our cluster.

That's exactly our problem also.

/remove-lifecycle rotten

max-rocket-internet on 6 Jul 2020

/remove-lifecycle stale

max-rocket-internet on 6 Jul 2020

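Hmm. An EKS rolling upgrade terminates the master. The load balancer times out in-flight requests that have not finished. In some edge cases, it's possible that the master node is not removed from the load balancer and a dead backend remains. My teammate is working on making upgrades smoother.

Jeffwan on 5 Aug 2020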

👍3

/lifecycle stale

fejta-bot on 3 Nov 2020

Would like to see this fixed. Also seeing issues with EKS.

gillbee on 4 Nov 2020

/remove-lifecycle stale

max-rocket-internet on 5 Nov 2020

/lifecycle stale

fejta-bot on 3 Feb 2021

/lifecycle rotten

fejta-bot on 5 Mar 2021
