I'm investigating an issue where workload controllers like the replicaset controller and daemonset controller time out when attempting to create pods. The issue seems to be accompanied by log messages from kube-apiserver indicating that requests to the VPA mutating webhook are timing out. (I haven't yet determined why the VPA webhook times out in the first place.) Deleting the VPA webhook and then either waiting for the controller manager to try creating the pods again or killing the leading controller manager resolves the issue.
I wonder if the 30 second timeout on the VPA webhook is too long for the controller manager? If the workload controller request has a timeout <= 30s, it stands to reason that the controller request could time out before the apiserver ignores the VPA timeout and continues to create the Pod.
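To make that reasoning concrete, here is a minimal sketch of the timing argument. It assumes the webhook hangs for its full timeout and that the failure policy lets the apiserver continue afterwards. The function names and the example timeout values are hypothetical, not taken from the controller manager or apiserver source.

```python
def earliest_apiserver_response(webhook_timeout_s: float, remaining_work_s: float = 0.0) -> float:
    """Earliest time the apiserver can answer when the webhook call hangs.

    Assumption: the apiserver waits out the whole webhook timeout, then
    ignores the failure and spends `remaining_work_s` finishing the request.
    """
    return webhook_timeout_s + remaining_work_s


def client_times_out_first(client_timeout_s: float, webhook_timeout_s: float) -> bool:
    """True if the client gives up before the apiserver can possibly respond."""
    return client_timeout_s <= earliest_apiserver_response(webhook_timeout_s)


if __name__ == "__main__":
    # Hypothetical numbers: a 30s client timeout against 30s and 5s webhook timeouts.
    print(client_times_out_first(30, 30))
    print(client_times_out_first(30, 5))
```

If this model is right, any client timeout at or below the webhook timeout can never see a successful Pod creation while the webhook is hanging, which would match the symptoms described.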
Sorry for the long response time, I missed this issue :(
Is it currently possible to change the timeout for admission controllers? I agree 30 seconds is too long, but last time I looked it was not configurable.
I believe the AC webhook is created by vpa-admission-controller after launch; it doesn't seem to be in the example manifest.
Yes, that's true, the vpa-admission-controller self-registers after startup. What I meant was: do you know if the admissionregistration Kubernetes API allows specifying a custom timeout when registering a webhook? (30 seconds is the default.)
Yes, you can (and should) specify a shorter timeout: https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#configure-admission-webhooks-on-the-fly
Note: Default timeout for a webhook call is 10 seconds for webhooks created using admissionregistration.k8s.io/v1, and 30 seconds for webhooks created using admissionregistration.k8s.io/v1beta1. Starting in Kubernetes 1.14 you can set the timeout, and it is encouraged to use a small timeout for webhooks. If the webhook call times out, the request is handled according to the webhook's failure policy.
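For illustration, a webhook registration with an explicit short timeout might look roughly like the sketch below. It builds an admissionregistration.k8s.io/v1 `MutatingWebhookConfiguration` as a plain Python dict and sets `timeoutSeconds`, which the API restricts to 1 through 30. The object name, webhook name, service reference, and the choice of a 5 second timeout are all hypothetical placeholders, not what vpa-admission-controller actually registers.

```python
import json


def build_webhook_config(timeout_seconds: int) -> dict:
    """Return a MutatingWebhookConfiguration manifest with an explicit timeout."""
    # The API only accepts timeoutSeconds values between 1 and 30.
    if not 1 <= timeout_seconds <= 30:
        raise ValueError("timeoutSeconds must be between 1 and 30")
    return {
        "apiVersion": "admissionregistration.k8s.io/v1",
        "kind": "MutatingWebhookConfiguration",
        # Hypothetical object name, for illustration only.
        "metadata": {"name": "example-vpa-webhook-config"},
        "webhooks": [
            {
                # Hypothetical webhook name and service reference.
                "name": "vpa.example.com",
                "clientConfig": {
                    "service": {"namespace": "kube-system", "name": "example-vpa-webhook"}
                },
                "rules": [
                    {
                        "operations": ["CREATE"],
                        "apiGroups": [""],
                        "apiVersions": ["v1"],
                        "resources": ["pods"],
                    }
                ],
                # Ignore: a timed-out call does not block the Pod creation.
                "failurePolicy": "Ignore",
                "sideEffects": "None",
                "admissionReviewVersions": ["v1"],
                "timeoutSeconds": timeout_seconds,
            }
        ],
    }


if __name__ == "__main__":
    # Print the manifest as JSON, which kubectl also accepts.
    print(json.dumps(build_webhook_config(5), indent=2))
```

The only field that matters for this issue is `timeoutSeconds`; everything else is there so the manifest has the fields that v1 requires (`sideEffects` and `admissionReviewVersions`).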
@bskiba My team has hit this issue three times this weekend. Our workaround is to restart vpa-admission-controller when it happens, and we still haven't determined exactly _why_ it times out, but reducing the timeout would certainly help as a mitigation.
I'm also curious why the webhook is hardcoded in vpa-admission-controller and not a separate object in the manifest YAML. We had a similar issue with Open Policy Agent, but were able to mitigate it by changing one field in the YAML instead of going upstream.
This can be fixed by https://github.com/kubernetes/autoscaler/pull/2949, which allows operators to create and manage the webhook manually instead of using the hardcoded hook.