Datadog-agent: [cluster-agent] metrics don't appear in external.metrics.k8s.io

Created on 30 Oct 2018 · 15 Comments · Source: DataDog/datadog-agent

Describe what happened:
I have set up the cluster agent using the stable/datadog Helm chart. When I query the external metrics endpoint, I get an empty list of resources.

$ kubectl get --raw /apis/external.metrics.k8s.io/v1beta1
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"external.metrics.k8s.io/v1beta1","resources":[]}

And the HPA is stuck at the <unknown> value.

$ kubectl get hpa 
NAME          REFERENCE                TARGETS              MINPODS   MAXPODS   REPLICAS   AGE
statsd-demo   Deployment/statsd-demo   <unknown>/10 (avg)   1         10        1          50m

Output of status:

root@ddog-cluster-agent-84486db86-qbwrw:/# datadog-cluster-agent status
Getting the status from the agent.
==============================
Datadog Cluster Agent (v1.0.0)
==============================

  Status date: 2018-10-30 18:28:19.562472 UTC
  Pid: 1
  Check Runners: 4
  Log Level: WARNING

  Paths
  =====
    Config File: /etc/datadog-agent/datadog-cluster.yaml
    conf.d: /etc/datadog-agent/conf.d

  Clocks
  ======
    System UTC time: 2018-10-30 18:28:19.562472 UTC

  Hostnames
  =========
    ec2-hostname: ip-1xx-1xx-2xx-190.us-west-2.compute.internal
    hostname: i-0c1580d88cbec55c0
    instance-id: i-0c1580d88cbec55c0
    socket-fqdn: ddog-cluster-agent-84486db86-qbwrw
    socket-hostname: ddog-cluster-agent-84486db86-qbwrw
    hostname provider: aws
    unused hostname providers:
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname

  Leader Election
  ===============
    Leader Election Status:  Failing
    Error: entity not found


  Custom Metrics Server
  =====================
    ConfigMap name: default/datadog-custom-metrics

    External Metrics
    ----------------
      Total: 0
      Valid: 0


=========
Collector
=========

  Running Checks
  ==============

    kubernetes_apiserver
    --------------------
        Instance ID: kubernetes_apiserver [WARNING]
        Total Runs: 15
        Metric Samples: 0, Total: 0
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s

        Warning: [Leader Election not enabled. Not running Kubernetes API Server check or collecting Kubernetes Events.]



=========
Forwarder
=========

  CheckRunsV1: 14
  Dropped: 0
  DroppedOnInput: 0
  Events: 0
  HostMetadata: 0
  IntakeV1: 1
  Metadata: 0
  Requeued: 0
  Retried: 0
  RetryQueueSize: 0
  Series: 0
  ServiceChecks: 0
  SketchSeries: 0
  Success: 29
  TimeseriesV1: 14

  API Keys status
  ===============
    API key ending with xxxxx on endpoint https://app.datadoghq.com: API Key valid

Describe what you expected:
The metrics should appear at the API endpoint, and the HPA should read its current value from them.

Steps to reproduce the issue:
Install the cluster agent, either with the Helm chart or with the manifests.

Additional environment details (Operating System, Cloud provider, etc):

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T18:02:47Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.3-eks", GitCommit:"58c199a59046dbf0a13a387d3491a39213be53df", GitTreeState:"clean", BuildDate:"2018-09-21T21:00:04Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Running platform version of EKS: eks.2

values.yaml used for helm install:

daemonset:
  useHostNetwork: true
  useHostPort: true
datadog:
  env:
    - name: DD_USE_DOGSTATSD
      value: "true"
    - name: DD_DOGSTATSD_PORT
      value: "8125"
    - name: DD_DOGSTATSD_NON_LOCAL_TRAFFIC
      value: "true"
  apiKey: "********************************"
  appKey: "****************************************"
clusterAgent:
  enabled: true
  token: "*****************************************"
  metricsProvider:
    enabled: true

HPA specification:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: statsd-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: statsd-demo
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metricName: demoInGo.request.count_total
      metricSelector:
        matchLabels:
          appname: statsd-demo
      targetAverageValue: 10

All 15 comments

Hey @bhavin192,

Thanks for opening this issue.
From the logs I can see the following:

Warning: [Leader Election not enabled. Not running Kubernetes API Server check or collecting Kubernetes Events.]

This is likely the issue because while the Cluster Agent can run in HA with several replicas, the leader election needs to be enabled for it to query data from Datadog.

I am currently improving the docs on the helm chart to highlight this as you can see here:
https://github.com/helm/charts/blob/2dc2d42e0f470d0ea4803d386ec2b18fdf537314/stable/datadog/README.md#enabling-the-datadog-cluster-agent.

Could you run the chart again with leader election enabled (datadog.leaderElection=true)?

Best,

Hey @CharlyF, thanks for a quick reply :)
After enabling leaderElection, that warning was resolved. I can now see [ok] there, but the metrics are still missing from the endpoint.

Okay, sorry to hear!
Could you set the log level to debug (datadog.logLevel=DEBUG), let the cluster agent run for a few minutes, and send a flare from within the cluster agent's pod?
You can use agent flare. This will open a ticket on our end with a bundle of the cluster agent's logs and more details for us to investigate.

Best,

I have uploaded the flare, should I share the case ID here?

Hey @bhavin192,

I was able to find the ticket, thanks!
Setting the log level to Trace really helped. The issue here is quite intricate, but I was able to find it.

When the Cluster Agent started it was not yet leader so it did not process the Autoscaler. We can see in the logs that the AutoscalersController received the event and tried to process it accordingly:

2018-10-31 05:43:11 UTC | DEBUG | (hpa_controller.go:329 in addAutoscaler) | Adding autoscaler default/statsd-demo
Update received for the default/statsd-demo, without a relevant change to the configuration
[...]
2018-10-31 05:43:11 UTC | TRACE | (hpa_controller.go:259 in processNext) | Processing default/statsd-demo
2018-10-31 05:43:11 UTC | TRACE | (hpa_controller.go:290 in syncAutoscalers) | Only the leader needs to sync the Autoscalers
2018-10-31 05:43:11 UTC | TRACE | (hpa_controller.go:274 in handleErr) | Faithfully dropping key default/statsd-demo

Now, this means that everything is working as expected. At this point, the leader should be running the process, but in this case I think there was a dangling state, so the Cluster Agent became leader a little later:

2018-10-31 05:44:57 UTC | DEBUG | (leaderelection.go:154 in EnsureLeaderElectionRuns) | Currently Leader: true. Leader identity: "ddog-cluster-agent-fdddbc496-rxwrt"

I think the leadership was owned by a replica from a previous deployment, namely: ddog-cluster-agent-fdddbc496-c4jxf

Reapplying the HPA manifest will solve this, but on my end I am going to see how this can be better processed.
I also want to improve how we process events from the resync of the informer as it could have helped us here.
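
To illustrate the sequence seen in the logs above, here is a minimal Python sketch of a controller that drops a key while it is not the leader. It is a toy model, not the Cluster Agent's Go code: the class, its methods, and the resync step are hypothetical names chosen for illustration.

```python
from collections import deque


class AutoscalerController:
    """Toy model: only the leader syncs autoscalers; others drop the key."""

    def __init__(self):
        self.is_leader = False
        self.queue = deque()    # keys waiting to be processed
        self.known = set()      # keys seen through add/update events
        self.processed = set()  # keys the leader has actually synced

    def add_autoscaler(self, key):
        self.known.add(key)
        self.queue.append(key)

    def process_next(self):
        if not self.queue:
            return False
        key = self.queue.popleft()
        if self.is_leader:
            self.processed.add(key)
        # A non-leader drops the key here without requeueing it.
        return True

    def resync(self):
        """Hypothetical improvement: re-enqueue every key never synced."""
        for key in sorted(self.known - self.processed):
            self.queue.append(key)


ctrl = AutoscalerController()
ctrl.add_autoscaler("default/statsd-demo")
ctrl.process_next()      # not the leader yet: the key is dropped
ctrl.is_leader = True    # leadership is acquired a little later
ctrl.process_next()      # queue is empty, so nothing is synced
stuck = "default/statsd-demo" not in ctrl.processed

ctrl.resync()            # same effect as re-applying the HPA manifest
ctrl.process_next()
recovered = "default/statsd-demo" in ctrl.processed
print(stuck, recovered)
```

In this model the autoscaler stays unprocessed until a new event (or a resync) puts its key back on the queue, which matches why re-applying the manifest unblocks it.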

Let me know if that helps!

@CharlyF that was hard to find :) and yes, re-creating HPA solved the issue.

Awesome!
I have added some work in our backlog to further investigate that behaviour.
Thank you very much for having reached out.

@CharlyF I'm running into this issue, and even re-creating the HPA doesn't resolve it. The relevant details of the Datadog cluster agent have been uploaded through the flare for case #180714, if that helps.

Hey @Chili-Man - Can you share the HPA manifest here ?
I suspect that it is lacking a metricSelector (or labels).
I made a PR yesterday to fix that: https://github.com/DataDog/datadog-agent/pull/2666
We are discussing a release now. Thanks for your feedback

To clarify, it was initially lacking the metricSelector, but even after fixing that, it wasn't working.

@CharlyF , here's the HPA manifest we are using:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: delphiusapp-worker-all
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: delphiusapp-worker-allqueues
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: External
      external:
        metricName: development.delphiusapp_sidekiq.queues.default.enqueued
        metricSelector:
           matchLabels:
              environment: development
        targetValue: 1

Describing the HPA below:

Name:                                                                        delphiusapp-worker-all
Namespace:                                                                   data-core
Labels:                                                                      <none>
Annotations:                                                                 kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"autoscaling/v2beta1","kind":"HorizontalPodAutoscaler","metadata":{"annotations":{},"name":"delphiusapp-worker-all","namespace":"data-cor...
CreationTimestamp:                                                           Wed, 14 Nov 2018 23:44:03 -0800
Reference:                                                                   Deployment/delphiusapp-worker-allqueues
Metrics:                                                                     ( current / target )
  "development.delphiusapp_sidekiq.queues.default.enqueued" (target value):  0 / 1
Min replicas:                                                                1
Max replicas:                                                                3
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  the last scale time was sufficiently old as to warrant a new scale
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from external metric development.delphiusapp_sidekiq.queues.default.enqueued(&LabelSelector{MatchLabels:map[string]string{environment: development,},MatchExpressions:[],})
  ScalingLimited  True    TooFewReplicas    the desired replica count is more than the maximum replica count
Events:
  Type     Reason                        Age                 From                       Message
  ----     ------                        ----                ----                       -------
  Normal   SuccessfulRescale             45m                 horizontal-pod-autoscaler  New size: 1; reason: All metrics below target
  Warning  FailedComputeMetricsReplicas  12m (x823 over 9h)  horizontal-pod-autoscaler  failed to get external metric development.delphiusapp_sidekiq.queues.default.enqueued: unable to get external metric data-core/development.delphiusapp_sidekiq.queues.default.enqueued/&LabelSelector{MatchLabels:map[string]string{environment: development,},MatchExpressions:[],}: no metrics returned from external metrics API
  Warning  FailedGetExternalMetric       2m (x837 over 9h)   horizontal-pod-autoscaler  unable to get external metric data-core/development.delphiusapp_sidekiq.queues.default.enqueued/&LabelSelector{MatchLabels:map[string]string{environment: development,},MatchExpressions:[],}: no metrics returned from external metrics API

So apparently, overnight, it was eventually able to get the external metric and did scale a few times. I'm not sure why it suddenly started working.

@Chili-Man thank you for sharing!

tl;dr: To answer your question, as long as there is a config without labels, 1.0.0 will not support it. As soon as the HPA manifest is updated to have labels, it will be handled properly.
The fix was merged earlier today, and I sincerely apologize for the trouble. We are starting QA on the release that will include it.

To note: with the fix in the new version, there will be an error message in the logs; however, we still do not support autoscaling on metrics without labels.

Could you try using datadog/cluster-agent-dev:charlyf-hpa-labels to confirm that it does not process "bad configs"?

More details:
The Cluster Agent runs a leader election process in order to process the autoscalers using informers. When processing the autoscalers, we extract the metric name and the labels, and we query Datadog to get the timestamp/value.
Then, we store the results in a ConfigMap so that other cluster agents (if running several replicas) can access the values to serve to Kubernetes. This also helps reduce the number of calls to Datadog.
When a cluster agent is not the leader, it only reads values from this ConfigMap when asked (by Kubernetes).
Lastly, in order to avoid keeping deleted/outdated autoscaler configs in the ConfigMap, there is a Garbage Collection process that runs every 5 minutes. It lists values from the cache of the informer and compares them with the content of the ConfigMap used to store the processed ones.

Hence, if a "bad" config is ever created or updated while the cluster agent is not the leader, it is processed during the GC and the agent crashes, because we were trying to access a nil pointer (the labels, which are missing). If the cluster agent is the leader when a bad config is made, it crashes almost immediately as it tries to digest the config (and access the missing labels).
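
The garbage collection and the missing-labels guard described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Cluster Agent's implementation: the function names, the dict-based informer cache, and the key layout of the stored values are all hypothetical.

```python
def extract_metric(hpa_metric):
    """Return (name, labels) for an External metric, or None without labels."""
    external = hpa_metric.get("external") or {}
    selector = external.get("metricSelector") or {}
    labels = selector.get("matchLabels")
    if not labels:
        # Guard: a config without labels is skipped instead of dereferenced.
        return None
    return external.get("metricName"), dict(labels)


def garbage_collect(informer_cache, stored):
    """Keep only stored entries whose autoscaler still exists in the cache."""
    live = set()
    for hpa_key, metrics in informer_cache.items():
        for metric in metrics:
            extracted = extract_metric(metric)
            if extracted is None:
                continue  # "bad" config: ignore rather than crash
            name, labels = extracted
            live.add((hpa_key, name, tuple(sorted(labels.items()))))
    return {key: value for key, value in stored.items() if key in live}


# Stand-in for the informer cache: autoscaler key -> list of metric specs.
cache = {
    "default/statsd-demo": [{"type": "External", "external": {
        "metricName": "demoInGo.request.count_total",
        "metricSelector": {"matchLabels": {"appname": "statsd-demo"}}}}],
    "default/bad": [{"type": "External", "external": {
        "metricName": "some.metric"}}],  # no metricSelector at all
}
# Stand-in for the ConfigMap content: one live entry and one stale entry.
stored = {
    ("default/statsd-demo", "demoInGo.request.count_total",
     (("appname", "statsd-demo"),)): 12.0,
    ("default/deleted", "old.metric", (("a", "b"),)): 3.0,
}
kept = garbage_collect(cache, stored)
print(sorted(key[0] for key in kept))
```

The stale entry for the deleted autoscaler is removed, and the config without labels is skipped rather than causing a failure, which is the behaviour the fix aims for.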

@CharlyF Awesome, thanks for the help and insights!

Anytime! Feel free to let us know if you have feedback or questions.

Hi,

I'm putting this here, as this seems like the most suitable place, instead of opening a new ticket.

I'm trying to set up HPAs using DD as the external provider.

Case 1

  - type: External                                                              
    external:                                                                   
      metricName: rabbitmq.queue.messages_ready                                 
      metricSelector:                                                           
        matchLabels:                                                            
          a: b                                 
      targetAverageValue: 10

This works as expected.

Case 2

  - type: External                                                              
    external:                                                                   
      metricName: rabbitmq.queue.messages_ready                                 
      metricSelector:                                                           
        matchExpressions:         
        - {key: a, operator: In, values: [b]}
      targetAverageValue: 10

It fails, even though Cases 1 and 2 are equivalent.

Case 3

  - type: External                                                              
    external:                                                                   
      metricName: rabbitmq.queue.messages_ready                                 
      metricSelector:            
        matchLabels:                                                            
          a: b                                                                                
        matchExpressions:         
        - {key: rabbitmq_queue, operator: In, values: [d, e]}
      targetAverageValue: 10

It fails. The corresponding message is:

Warning  FailedComputeMetricsReplicas  7m (x13 over 13m)  horizontal-pod-autoscaler  failed to get rabbitmq.queue.messages_ready external metric: unable to get external metric monitoring/rabbitmq.queue.messages_ready/&LabelSelector{MatchLabels:map[string]string{a: b,},MatchExpressions:[{rabbitmq_queue In [d, e]}],}: no metrics returned from external metrics API

A similar message to the above is what's reported for Case 2.

Running:

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/monitoring/rabbitmq.queue.messages_ready"

the (truncated) output in Case 3 is:

{"kind":"ExternalMetricValueList","apiVersion":"external.metrics.k8s.io/v1beta1","metadata":{"selfLink":"/apis/external.metrics.k8s.io/v1beta1/namespaces/monitoring/rabbitmq.queue.messages_ready"},"items":[{"metricName":"rabbitmq.queue.messages_ready","metricLabels":{"a":"b"},"timestamp":"2019-05-02T16:44:53Z","value":"0"},...]}

I'd expect that matchLabels and matchExpressions are handled similarly. Any thoughts?

hi @pchristos
Currently only matchLabels is supported. Could you open a new issue to track support for matchExpressions?
Thanks and regards
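
As a rough illustration of why Cases 1 and 2 are equivalent while Case 3 is not, the Python sketch below flattens a selector into plain matchLabels when that is possible. It is an assumption-laden example, not something the Cluster Agent does: the function name is hypothetical, and only single-value In expressions are treated as convertible.

```python
def to_match_labels(selector):
    """Flatten a LabelSelector dict into matchLabels, or None if impossible."""
    labels = dict(selector.get("matchLabels") or {})
    for expr in selector.get("matchExpressions") or []:
        values = expr.get("values") or []
        # Only "key In [single value]" has an exact matchLabels equivalent.
        if expr.get("operator") != "In" or len(values) != 1:
            return None
        key = expr["key"]
        if labels.get(key, values[0]) != values[0]:
            return None  # contradicts an existing label requirement
        labels[key] = values[0]
    return labels


case1 = {"matchLabels": {"a": "b"}}
case2 = {"matchExpressions": [
    {"key": "a", "operator": "In", "values": ["b"]}]}
case3 = {"matchLabels": {"a": "b"}, "matchExpressions": [
    {"key": "rabbitmq_queue", "operator": "In", "values": ["d", "e"]}]}

print(to_match_labels(case1))
print(to_match_labels(case2))
print(to_match_labels(case3))
```

Cases 1 and 2 reduce to the same label map, whereas Case 3 selects on two possible values for rabbitmq_queue and so cannot be expressed with matchLabels alone.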
