Autoscaler: Workload controllers may time out possibly due to long timeout on VPA webhook

Created on 21 Jan 2020 · 6 Comments · Source: kubernetes/autoscaler

I'm investigating an issue where workload controllers like the replicaset controller and daemonset controller time out when attempting to create pods. The issue seems to be accompanied by log messages from kube-apiserver indicating that requests to the VPA mutating webhook are timing out. (I haven't yet determined why the VPA webhook times out in the first place.) Deleting the VPA webhook and then either waiting for the controller manager to try creating the pods again or killing the leading controller manager resolves the issue.

I wonder if the 30 second timeout on the VPA webhook is too long for the controller manager? If the workload controller request has a timeout <= 30s, it stands to reason that the controller request could time out before the apiserver ignores the VPA timeout and continues to create the Pod.

kind/feature vertical-pod-autoscaler

dharmab

👍9

Most helpful comment

@bskiba My team has hit this issue three times this weekend. Our workaround is to restart vpa-admission-controller when it happens, and we still haven't determined exactly _why_ it times out, but reducing the timeout would certainly help as a mitigation.

I'm also curious why the webhook is hardcoded in vpa-admission-controller and not a separate object in the manifest YAML. We had a similar issue with Open Policy Agent, but were able to mitigate it by changing one field in the YAML instead of going upstream.

dharmab on 8 Mar 2020

👍3

All 6 comments

Sorry for the long response time, I missed this issue :(

Is it currently possible to change the timeout for admission controllers? I agree 30 seconds is too long, but last time I looked it was not configurable.

bskiba on 3 Mar 2020

I believe the AC webhook is created by vpa-admission-controller after launch; it doesn't seem to be in the example manifest.

dharmab on 3 Mar 2020

Yes, that's true, the vpa-admission-controller self-registers after startup. What I mean is: do you know if the admissionregistration Kubernetes API allows specifying a custom timeout when registering a webhook? (30 seconds is the default.)

bskiba on 4 Mar 2020

Yes, you can (and should) specify a shorter timeout: https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#configure-admission-webhooks-on-the-fly

Note: Default timeout for a webhook call is 10 seconds for webhooks created using admissionregistration.k8s.io/v1, and 30 seconds for webhooks created using admissionregistration.k8s.io/v1beta1. Starting in Kubernetes 1.14 you can set the timeout, and it is encouraged to use a small timeout for webhooks. If the webhook call times out, the request is handled according to the webhook’s failure policy.
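As a rough sketch of what setting a shorter timeout looks like, the Go program below prints a partial `MutatingWebhookConfiguration` with an explicit `timeoutSeconds`. The structs are local stand-ins, not the real client-go types, and the names `vpa-webhook-config` and `vpa.k8s.io` plus the 5 second value are assumptions for illustration. Required fields such as `clientConfig`, `rules`, `sideEffects` and `admissionReviewVersions` are left out, so the output is not a complete manifest.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// webhook mirrors a few fields of an admissionregistration.k8s.io/v1 webhook entry.
type webhook struct {
	Name           string `json:"name"`
	FailurePolicy  string `json:"failurePolicy"`
	TimeoutSeconds int32  `json:"timeoutSeconds"`
}

// webhookConfig mirrors the top level of a MutatingWebhookConfiguration.
type webhookConfig struct {
	APIVersion string            `json:"apiVersion"`
	Kind       string            `json:"kind"`
	Metadata   map[string]string `json:"metadata"`
	Webhooks   []webhook         `json:"webhooks"`
}

// newConfig builds the partial configuration, rejecting timeouts outside
// the 1 to 30 second range that the API accepts.
func newConfig(timeoutSeconds int32) (webhookConfig, error) {
	if timeoutSeconds < 1 || timeoutSeconds > 30 {
		return webhookConfig{}, fmt.Errorf("timeoutSeconds must be between 1 and 30, got %d", timeoutSeconds)
	}
	return webhookConfig{
		APIVersion: "admissionregistration.k8s.io/v1",
		Kind:       "MutatingWebhookConfiguration",
		Metadata:   map[string]string{"name": "vpa-webhook-config"}, // assumed name
		Webhooks: []webhook{{
			Name:           "vpa.k8s.io", // assumed name
			FailurePolicy:  "Ignore",     // on timeout, let the request proceed
			TimeoutSeconds: timeoutSeconds,
		}},
	}, nil
}

func main() {
	cfg, err := newConfig(5) // illustrative value, well under the 30s default
	if err != nil {
		panic(err)
	}
	out, err := json.MarshalIndent(cfg, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Note that while vpa-admission-controller registers the webhook itself, a manually applied object like this would presumably be overwritten on its next startup.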

dharmab on 6 Mar 2020

dharmab on 8 Mar 2020

👍3

This can be fixed by https://github.com/kubernetes/autoscaler/pull/2949, which allows operators to create and manage the webhook manually instead of using the hardcoded hook.

dharmab on 26 Mar 2020
