vpa-recommender stopped working for some targetRefs between vertical-pod-autoscaler-0.5.0 and vertical-pod-autoscaler-0.5.1.
Specifically, none of the prometheus-operator created StatefulSets get any recommendation.
This targetRef:
targetRef:
  apiVersion: apps/v1
  kind: StatefulSet
  name: prometheus-kube-system
gets the following vpa-recommender v0.5.1 messages in .status.conditions:
- lastTransitionTime: "2019-07-03T11:30:29Z"
  message: 'Error checking if target is a top level controller: Unhandled targetRef
    monitoring.coreos.com/v1 / Prometheus / kube-system, last error the server could
    not find the requested resource (get prometheuses kube-system)'
  status: "True"
  type: ConfigUnsupported
- lastTransitionTime: "2019-07-03T11:30:35Z"
  message: No pods match this VPA object
  reason: NoPodsMatched
  status: "False"
  type: RecommendationProvided
(it worked as expected in v0.5.0).
As an experiment, I tried pointing the targetRef at the top-level controller's custom resource:
targetRef:
  apiVersion: monitoring.coreos.com/v1
  kind: Prometheus
  name: kube-system
which made vpa-recommender produce this error:
I0703 12:03:42.380087 1 request.go:485] Throttling request took 197.788691ms, request: GET:https://100.64.0.1:443/apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheuses/kube-system/scale
E0703 12:03:42.380999 1 cluster_feeder.go:463] Cannot get target selector from VPA's targetRef. Reason: Unhandled targetRef monitoring.coreos.com/v1 / Prometheus / kube-system, last error the server could not find the requested resource (get prometheuses kube-system)
As far as I can tell, the targetRef validation logic added in #1910 does not play well with third-party controllers that manage first-party controller objects via ownerReference.
Thanks for reporting!
I have some follow-up questions.
Is the top-level controller (monitoring.coreos.com/v1/Prometheus) creating multiple StatefulSets?
Prometheus-operator currently backs each Prometheus CRD object with a corresponding individual StatefulSet.
If yes, does it make sense to scale them together? Are they similar enough so that it makes sense to give them one common recommendation?
It makes no sense to blend VPA recommendations across different Prometheus (StatefulSet) objects: they are likely to have vastly different CPU and memory usage profiles.
Does the monitoring.coreos.com/v1/Prometheus resource implement the scale subresource?
It does not.
Thanks!
So there is a 1:1 relation between a Prometheus custom resource and a StatefulSet.
Looks like there is a bug in the way we validate the controller chain, then. AFAIU we should go up to the topmost controller that is a well-known controller and/or implements the scale subresource. This means that we should identify the StatefulSet as the right controller to scale on. However, we first try to read the topmost controller (the Prometheus object), and the VPA doesn't have permission to get it.
To verify that this is actually what's happening, can you modify the VPA RBAC so that vpa-recommender is allowed to get the Prometheus resources, and then try to get recommendations with the VPA object pointing at a StatefulSet (the setup that you have working on 0.5.0)?
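The controller-chain rule described in this comment can be sketched roughly as follows. This is a minimal, self-contained illustration in Python, not the VPA implementation: the `Controller` class, the `WELL_KNOWN_KINDS` set, the `readable` flag (standing in for missing RBAC or a resource the API server cannot find), and the choice to stop at the first owner that is unreadable or not scalable are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Kinds treated as "well-known" in this sketch (an assumption, not VPA's actual list).
WELL_KNOWN_KINDS = {
    "Deployment", "StatefulSet", "DaemonSet",
    "ReplicaSet", "ReplicationController", "Job", "CronJob",
}


@dataclass
class Controller:
    """Hypothetical in-memory stand-in for a Kubernetes controller object."""
    kind: str
    name: str
    owner: Optional["Controller"] = None  # controller ownerReference, if any
    has_scale: bool = False               # implements the scale subresource
    readable: bool = True                 # False simulates a failed GET on this object


def is_scalable(c: Controller) -> bool:
    """A controller qualifies if it is well-known or exposes a scale subresource."""
    return c.kind in WELL_KNOWN_KINDS or c.has_scale


def find_topmost_scalable(start: Controller) -> Optional[Controller]:
    """Walk up the owner chain and return the highest qualifying controller.

    The walk stops at the first owner that cannot be read or does not
    qualify, instead of failing the whole lookup.
    """
    best = start if is_scalable(start) else None
    current = start
    while current.owner is not None:
        current = current.owner
        if not current.readable or not is_scalable(current):
            break  # keep the best controller found so far
        best = current
    return best


if __name__ == "__main__":
    prometheus = Controller("Prometheus", "kube-system", readable=False)
    sts = Controller("StatefulSet", "prometheus-kube-system", owner=prometheus)
    print(find_topmost_scalable(sts).kind)
```

Under these assumptions, a StatefulSet owned by a Prometheus object that has no scale subresource resolves to the StatefulSet itself, which is the behavior the comment argues for.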
I had already allowed all verbs on Prometheus and prometheus/scale in order to reach the two error scenarios described in the first post.
Without those RBAC permissions, vpa-recommender was erroring out in different ways.
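For readers who want to reproduce that setup, here is a hedged sketch of the kind of RBAC rule being described, built as a Python dict and printed as JSON (which `kubectl apply -f` accepts). The role name `vpa-prometheus-reader` is hypothetical, and the resource names `prometheuses` and `prometheuses/scale` are inferred from the request paths in the log lines above. A ClusterRoleBinding to the service account that vpa-recommender runs under would also be needed and is not shown.

```python
import json


def prometheus_access_role(name: str = "vpa-prometheus-reader") -> dict:
    """Build a ClusterRole manifest granting access to Prometheus objects.

    The role name is a placeholder; "all verbs" mirrors what the reporter
    says they allowed, not a recommended least-privilege setup.
    """
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": [
            {
                "apiGroups": ["monitoring.coreos.com"],
                "resources": ["prometheuses", "prometheuses/scale"],
                "verbs": ["*"],
            }
        ],
    }


if __name__ == "__main__":
    # Emit JSON that can be saved to a file and applied to the cluster.
    print(json.dumps(prometheus_access_role(), indent=2))
```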
The proposal sounds good to me, if I may add one nit to it:
debug log level. This way info-level log messages would be less noisy.
Thank you for picking this issue up so promptly!
Not sure I follow, but that's probably because I am a noob in this area.
For VPA, why does the scale subresource matter?
There is rarely harm in asking questions and these are very valid ones. Answers below:
Any chance this will be resolved soon?
We have our own custom CRD (metacontroller-based) maintaining many StatefulSets per namespace, and we get the same error.
We also use metacontroller with a custom CRD that controls both the Deployment and the VPA resource, and it doesn't appear to work because of this issue.