Autoscaler: [BUG] VPA Recommender InitFromHistoryProvider not working and logs filled with "Error adding metric sample" warnings

Created on 30 Jul 2020 · 11Comments · Source: kubernetes/autoscaler

I set up all the necessary components of VPA (vpa-release-0.8) and used Prometheus as the history provider. Everything works fine except that when the Recommender first starts up and starts to initialize history samples from Prometheus, it will throw warnings for ALL containers like this (omits some unnecessary logs)

I0729 22:57:49.688362       1 cluster_feeder.go:206] Initializing VPA from history provider
...
I0729 23:01:54.173479       1 cluster_feeder.go:212] Adding pod {syslog disk-metrics-1596000000-h8np6} with labels map[controller_uid:547478fe-cb3d-4bd0-9fc2-b4c5f250fec7 hiya_billto_owner:coretech hiya_operations_contact:akunszt job_name:disk-metrics-1596000000]
I0729 23:01:54.173518       1 cluster_feeder.go:218] Adding 1 samples for container {{syslog disk-metrics-1596000000-h8np6} disk-metrics}
W0729 23:01:54.173530       1 cluster_feeder.go:224] Error adding metric sample for container {{syslog disk-metrics-1596000000-h8np6} disk-metrics}: KeyError: {{syslog disk-metrics-1596000000-h8np6} disk-metrics}

I0729 23:01:54.173551       1 cluster_feeder.go:212] Adding pod {platform mirrormaker-operator-785896587-w5xzq} with labels map[app:mirrormaker-operator failure_domain_beta_kubernetes_io_region:us-west-2 failure_domain_beta_kubernetes_io_zone:us-west-2c hiya_billto_owner:coretech hiya_operations_contact:coretech pod_template_hash:785896587]
I0729 23:01:54.173597       1 cluster_feeder.go:218] Adding 46080 samples for container {{platform mirrormaker-operator-785896587-w5xzq} mirrormaker-operator}
W0729 23:01:54.173631       1 cluster_feeder.go:224] Error adding metric sample for container {{platform mirrormaker-operator-785896587-w5xzq} mirrormaker-operator}: KeyError: {{platform mirrormaker-operator-785896587-w5xzq} mirrormaker-operator}
W0729 23:01:54.173645       1 cluster_feeder.go:224] Error adding metric sample for container {{platform mirrormaker-operator-785896587-w5xzq} mirrormaker-operator}: KeyError: {{platform mirrormaker-operator-785896587-w5xzq} mirrormaker-operator}
W0729 23:01:54.173665       1 cluster_feeder.go:224] Error adding metric sample for container {{platform mirrormaker-operator-785896587-w5xzq} mirrormaker-operator}: KeyError: {{platform mirrormaker-operator-785896587-w5xzq} mirrormaker-operator}
... (My notes: Error repeats 46080 times in total)
...

These warning messages are repeated for every pod/container, but it will not show up for rest of the sampling cycles after the Prometheus initialization part is done.

From the logs it clearly shows that the Recommender is able to find all the Pods, Labels, Containers, and even data samples (otherwise it wouldn't show things like Adding 46080 samples in the log ), which means my Prometheus configuration is correct.

After further digging into the code (both release-0.8 and master), I find that this behavior appears to be a bug in the code.

More specifically (I will use the master branch for illustration), in the https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/pkg/recommender/input/cluster_feeder.go#L204 function, it will loop through the ClusterHistory, and add usage samples to the clusterFeeder.clusterState.

However, the problem is at this part https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/pkg/recommender/input/cluster_feeder.go#L212, where pods are added to the clusterFeeder.clusterState for the first time. Within that function, Pods are added to the clusterState if not exists, but in the newPod() function, pod is just initialized to

func newPod(id PodID) *PodState {
    return &PodState{
        ID:         id,
        Containers: make(map[string]*ContainerState),
    }
}

, which means pod.Containers is an EMPTY map at this point.

Then further down the road in the InitFromHistoryProvider() function, it is trying to add history samples to the clusterFeeder.clusterState,

            for _, sample := range sampleList {
                if err := feeder.clusterState.AddSample(
                    &model.ContainerUsageSampleWithKey{
                        ContainerUsageSample: sample,
                        Container:            containerID,
                    }); err != nil {
                    klog.Warningf("Error adding metric sample for container %v: %v", containerID, err)
                }
            }

, but within the AddSample function, it will only add samples if both pods and container already exist in the clusterFeeder.clusterState. But as mentioned above, at this time Containers is just an empty map, so it will throw an error from that function, failing to add ANY history data samples at all. This correlates to the warning logs I got in the Recommender.

So am I wrong at all of these reasonings? If not, this basically means currently the Prometheus history provider is not working at all as no history data sample will be added. But I am also skeptical because why I didn't see anyone raise this question before.

Can someone please check and follow up? @bskiba

help wanted lifecyclrotten vertical-pod-atuoscaleprometheus vertical-pod-autoscaler

Source

lli-hiya

👍1

Most helpful comment

I think this might mean the metrics get discarded. I think the issue is the prometheus history is fetched before pods are listed for the first time.

bskiba on 20 Aug 2020

👍4

All 11 comments

:+1: I've seen these spammy logs too, typically these get logged at the startup, but everythings works as per normal. Just wait a bit and kubectl describe vpa X and see the recommendations.

povilasv on 30 Jul 2020

@povilasv Exactly, the spammy logs were generated when loading Prometheus history data (only once), so that's why it only appeared at the start up time.

lli-hiya on 30 Jul 2020

@povilasv Are you sure this is only spammy logs or metrics that get discarded?

I have the same issue with the following config:

        - --prometheus-cadvisor-job-name=kubelet
        - --pod-namespace-label=namespace
        - --pod-name-label=pod
        - --pod-label-prefix=""
        - --container-name-label=container
        - --container-namespace-label=namespace
        - --container-pod-name-label=pod
        - --history-length=4h
        - --memory-saver

estahn on 19 Aug 2020

I think this might mean the metrics get discarded. I think the issue is the prometheus history is fetched before pods are listed for the first time.

bskiba on 20 Aug 2020

👍4

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 18 Nov 2020

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot on 18 Dec 2020

I'm getting the same issue, EKS 1.18 - I've checked the queries that VPA runs against Prometheus and they should all return data. Also the VPAs are not being populated with recommendations and the status is "false"

chris-vest on 23 Dec 2020

If the VPAs are not populated, that is a different issue, after vpa-recommender is working or some time you should see recommendations for running pods, even if fetching history failed.

If you observe no recommendations after 15 minutes, please file a new issue with the details.

bskiba on 12 Jan 2021

I'm getting similar issues, seems like the problem is that prometheus is pulling in metrics too broadly. It would be nice if you could add additional label selectors to narrow down the search space.

Something like: --prometheus-extra-labels=foo=bar

tom-hadlaw-hs on 12 Jan 2021

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

fejta-bot on 11 Feb 2021

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.