Autoscaler: vertical-pod-autoscaler recommender stopped working for some targets

Created on 3 Jul 2019 · 8 Comments · Source: kubernetes/autoscaler

vpa-recommender stopped working for some targetRefs between vertical-pod-autoscaler-0.5.0 and vertical-pod-autoscaler-0.5.1.

Specifically, none of the StatefulSets created by prometheus-operator get any recommendation.

TargetRef:

targetRef:
  apiVersion: apps/v1
  kind: StatefulSet
  name: prometheus-kube-system

gets the following vpa-recommender v0.5.1 messages in .status.conditions:

- lastTransitionTime: "2019-07-03T11:30:29Z"
  message: 'Error checking if target is a top level controller: Unhandled targetRef
    monitoring.coreos.com/v1 / Prometheus / kube-system, last error the server could
    not find the requested resource (get prometheuses kube-system)'
  status: "True"
  type: ConfigUnsupported
- lastTransitionTime: "2019-07-03T11:30:35Z"
  message: No pods match this VPA object
  reason: NoPodsMatched
  status: "False"
  type: RecommendationProvided

(it worked as expected in v0.5.0).
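For anyone checking their own VPA objects for the same symptom, here is a minimal Python sketch. It assumes the `.status.conditions` list has already been loaded into plain dictionaries (for example from JSON output of the VPA object); the helper name `blocking_conditions` is hypothetical and not part of VPA.

```python
def blocking_conditions(conditions):
    """Return (type, message) pairs for conditions that indicate a problem.

    Assumption for this sketch: ConfigUnsupported=True and
    RecommendationProvided=False are the two states worth flagging,
    matching the status reported in this issue.
    """
    problems = []
    for cond in conditions:
        ctype, status = cond.get("type"), cond.get("status")
        if ctype == "ConfigUnsupported" and status == "True":
            problems.append((ctype, cond.get("message", "")))
        elif ctype == "RecommendationProvided" and status == "False":
            problems.append((ctype, cond.get("message", "")))
    return problems


# Conditions taken from the report above (first message shortened).
conditions = [
    {"type": "ConfigUnsupported", "status": "True",
     "message": "Error checking if target is a top level controller"},
    {"type": "RecommendationProvided", "status": "False",
     "reason": "NoPodsMatched", "message": "No pods match this VPA object"},
]

for ctype, message in blocking_conditions(conditions):
    print(f"{ctype}: {message}")
```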

As an experiment, I tried pointing the targetRef at the top-level controller's custom resource:

targetRef:
  apiVersion: monitoring.coreos.com/v1
  kind: Prometheus
  name: kube-system

which made vpa-recommender produce this error:

I0703 12:03:42.380087       1 request.go:485] Throttling request took 197.788691ms, request: GET:https://100.64.0.1:443/apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheuses/kube-system/scale
E0703 12:03:42.380999       1 cluster_feeder.go:463] Cannot get target selector from VPA's targetRef. Reason: Unhandled targetRef monitoring.coreos.com/v1 / Prometheus / kube-system, last error the server could not find the requested resource (get prometheuses kube-system)
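The throttled request in the log targets the scale subresource of the Prometheus object. As a rough illustration of how such a path is put together under the standard Kubernetes REST layout, here is a small Python sketch; the helper name `scale_path` is made up for this example and is not part of VPA or any client library.

```python
def scale_path(api_version, plural, namespace, name):
    """Build the API path of the scale subresource for a namespaced object."""
    # Core-group resources (apiVersion "v1") live under /api,
    # grouped resources (apiVersion "group/version") live under /apis.
    prefix = "/apis" if "/" in api_version else "/api"
    return f"{prefix}/{api_version}/namespaces/{namespace}/{plural}/{name}/scale"


# Values taken from the request URL in the log line above.
print(scale_path("monitoring.coreos.com/v1", "prometheuses", "monitoring", "kube-system"))
```

If the custom resource definition does not enable a scale subresource, the API server has nothing to serve at this path, which would be consistent with the "could not find the requested resource" error.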

As far as I can tell, the targetRef validation logic added in #1910 does not play well with 3rd-party controllers that manage 1st-party controller objects via ownerReference.

kind/bug vertical-pod-autoscaler


realdimas

👍4

All 8 comments

Thanks for reporting!
I have some follow-up questions.

Is the top-level controller (monitoring.coreos.com/v1/Prometheus) creating multiple StatefulSets? If yes, does it make sense to scale them together? Are they similar enough that it makes sense to give them one common recommendation?

Does the monitoring.coreos.com/v1/Prometheus resource implement the scale subresource?

bskiba on 3 Jul 2019

> Is the top-level controller (monitoring.coreos.com/v1/Prometheus) creating multiple StatefulSets?

Prometheus-operator currently backs each Prometheus custom resource with its own individual StatefulSet.

> If yes, does it make sense to scale them together? Are they similar enough that it makes sense to give them one common recommendation?

It makes no sense to blend VPA recommendations across different Prometheus objects (StatefulSets): they are likely to have vastly different CPU and memory usage profiles.

> Does the monitoring.coreos.com/v1/Prometheus resource implement the scale subresource?

It does not.

realdimas on 3 Jul 2019

Thanks!

So there is a 1:1 relation between a Prometheus custom resource and a StatefulSet.

Looks like there is a bug in the way we validate the controller chain, then. AFAIU we should go up to the topmost controller that is a well-known controller and/or implements the scale subresource. This means we should identify the StatefulSet as the right controller to scale on. However, we first try to read the topmost controller (the Prometheus object), and the VPA doesn't have permission to get it.
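The intended behaviour described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the VPA implementation: objects are plain dictionaries, the set `WELL_KNOWN_KINDS`, the `has_scale` flag, the `Unreadable` exception and the `lookup` helper are all hypothetical, and a parent that cannot be read is simply treated as the end of the chain.

```python
# Assumed list of built-in controller kinds; the real list may differ.
WELL_KNOWN_KINDS = {"Deployment", "StatefulSet", "DaemonSet", "ReplicaSet",
                    "ReplicationController", "Job", "CronJob"}


class Unreadable(Exception):
    """Raised by the hypothetical lookup when a parent cannot be fetched."""


def is_scalable(obj):
    # Well-known controller and/or something that implements scale.
    return obj["kind"] in WELL_KNOWN_KINDS or obj.get("has_scale", False)


def find_scalable_controller(obj, get_owner):
    """Walk up the owner chain and return the topmost scalable ancestor."""
    best = obj if is_scalable(obj) else None
    current, seen = obj, set()
    while current.get("owner") is not None:
        key = current["owner"]
        if key in seen:  # guard against ownership cycles
            break
        seen.add(key)
        try:
            parent = get_owner(key)
        except Unreadable:
            break  # cannot read the parent: keep the best match so far
        if is_scalable(parent):
            best = parent
        current = parent
    return best


objects = {
    ("Prometheus", "kube-system"): {"kind": "Prometheus", "name": "kube-system",
                                    "has_scale": False},
    ("Deployment", "web"): {"kind": "Deployment", "name": "web"},
}


def lookup(key):
    if key not in objects:
        raise Unreadable(key)
    return objects[key]


sts = {"kind": "StatefulSet", "name": "prometheus-kube-system",
       "owner": ("Prometheus", "kube-system")}
rs = {"kind": "ReplicaSet", "name": "web-abc", "owner": ("Deployment", "web")}

print(find_scalable_controller(sts, lookup)["kind"])
print(find_scalable_controller(rs, lookup)["kind"])
```

In this sketch the StatefulSet owned by a Prometheus object resolves to itself, while a ReplicaSet owned by a Deployment resolves to the Deployment.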

To verify that this is actually what's happening, can you modify the VPA RBAC so that vpa-recommender is allowed to get Prometheus resources, and then try to get recommendations with the VPA object pointing at the StatefulSet (the setup that worked for you on 0.5.0)?

bskiba on 3 Jul 2019

I had already allowed all verbs on Prometheus and prometheus/scale to get to the two error scenarios described in the first post.
Without the necessary RBAC permissions, vpa-recommender was erroring out in different ways.

The proposal sounds good to me, if I may add one nit to it:

If unable to get the parent controller, log RBAC-related errors at the debug log level. That way info-level log messages would be less noisy.
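A minimal Python sketch of that nit, using the standard `logging` module. The `Forbidden` exception, the `fetch_parent` helper and the logger name are hypothetical stand-ins; the actual recommender is not written this way.

```python
import logging

logger = logging.getLogger("vpa-sketch")


class Forbidden(Exception):
    """Hypothetical error for a lookup rejected by RBAC."""


def fetch_parent(get_owner, key):
    """Return the parent object, or None if RBAC prevents reading it."""
    try:
        return get_owner(key)
    except Forbidden as err:
        # Logged at DEBUG so that info-level output stays quiet.
        logger.debug("cannot read owner %s: %s", key, err)
        return None


def denied(key):
    raise Forbidden("get not allowed")


print(fetch_parent(denied, ("Prometheus", "kube-system")))
```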

Thank you for picking this issue up so promptly!

realdimas on 3 Jul 2019

Not sure I follow, but that's probably because I am a noob in this area.

Why does a targetRef pointing to a StatefulSet not work?
Why do we need to go up the chain to find a valid controller?

For VPA, why does the scale subresource matter?

krmayankk on 28 Aug 2019

👍3

There is rarely harm in asking questions and these are very valid ones. Answers below:

A targetRef pointing to a StatefulSet works in a regular situation. It doesn't work here because there is an owner controller and we have a bug in going up the controller chain.
VPA goes up the chain to avoid a situation where you have one VPA defined on a Deployment and another on a ReplicaSet that backs the same Deployment: those two would race each other to scale your pods.
VPA uses the scale subresource to fetch the pods controlled by a given resource. This information is used to:
a) determine which metrics to use for calculating recommendations for a certain VPA object
b) know which pods are candidates for eviction by the updater due to stale resource requests
Ideally we would have a separate subresource (without the need to actually support the scale operation), but we opted for using what was already there (the scale subresource is used heavily by Horizontal Pod Autoscaler).
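To make the pod-lookup step concrete, here is a small Python sketch of matching pods against a label selector in its string form, as reported by a scale subresource. It handles only `key=value` terms, and the pod names and labels are invented for illustration; it is not the code VPA uses.

```python
def parse_selector(selector):
    """Parse 'k1=v1,k2=v2' into a dict (equality terms only)."""
    pairs = {}
    for term in selector.split(","):
        term = term.strip()
        if not term:
            continue
        key, sep, value = term.partition("=")
        # Reject anything that is not a plain key=value term (e.g. !=, ==, in).
        if not sep or key.endswith("!") or value.startswith("="):
            raise ValueError(f"unsupported selector term: {term!r}")
        pairs[key.strip()] = value.strip()
    return pairs


def pods_for_selector(selector, pods):
    """Return the names of pods whose labels satisfy every selector term."""
    wanted = parse_selector(selector)
    return [pod["name"] for pod in pods
            if all(pod["labels"].get(k) == v for k, v in wanted.items())]


# Hypothetical pods and labels, for illustration only.
pods = [
    {"name": "prometheus-kube-system-0",
     "labels": {"app": "prometheus", "prometheus": "kube-system"}},
    {"name": "grafana-0", "labels": {"app": "grafana"}},
]

print(pods_for_selector("app=prometheus,prometheus=kube-system", pods))
```

The matched pods are the ones whose metrics would feed the recommendation and which would be candidates for eviction by the updater.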

bskiba on 20 Sep 2019

👍2

Any chance this will be resolved soon?

We have our own custom CRD (metacontroller-based) maintaining many StatefulSets per namespace and we have the same error.

fproulx-dfuse on 31 Oct 2019

👍3

We also use metacontroller with a custom CRD that controls both the Deployment and the VPA resource, and it doesn't appear to work because of this issue.