Datadog-agent: [cluster-agent] wrong values when HPA contains multiple metrics

Created on 5 Nov 2018 · 5 Comments · Source: DataDog/datadog-agent

Output of the info page (if this is a bug)

root@ddog-cluster-agent-fdddbc496-f8zzx:/# datadog-cluster-agent status
Getting the status from the agent.
==============================
Datadog Cluster Agent (v1.0.0)
==============================

  Status date: 2018-11-05 13:54:22.644670 UTC
  Pid: 1
  Check Runners: 4
  Log Level: TRACE

  Paths
  =====
    Config File: /etc/datadog-agent/datadog-cluster.yaml
    conf.d: /etc/datadog-agent/conf.d

  Clocks
  ======
    System UTC time: 2018-11-05 13:54:22.644670 UTC

  Hostnames
  =========
    ec2-hostname: ip-192-168-223-190.us-west-2.compute.internal
    hostname: i-0c1580d88cbec55c0
    instance-id: i-0c1580d88cbec55c0
    socket-fqdn: ddog-cluster-agent-fdddbc496-f8zzx
    socket-hostname: ddog-cluster-agent-fdddbc496-f8zzx
    hostname provider: aws
    unused hostname providers:
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname

  Leader Election
  ===============
    Leader Election Status:  Running
    Leader Name is: ddog-cluster-agent-fdddbc496-f8zzx
    Last Acquisition of the lease: Mon, 05 Nov 2018 12:11:35 UTC
    Renewed leadership: Mon, 05 Nov 2018 13:54:18 UTC
    Number of leader transitions: 3 transitions

  Custom Metrics Server
  =====================
    ConfigMap name: default/datadog-custom-metrics

    External Metrics
    ----------------
      Total: 2
      Valid: 2


=========
Collector
=========

  Running Checks
  ==============

    kubernetes_apiserver
    --------------------
        Instance ID: kubernetes_apiserver [OK]
        Total Runs: 418
        Metric Samples: 0, Total: 0
        Events: 0, Total: 0
        Service Checks: 3, Total: 1,233
        Average Execution Time : 25ms



=========
Forwarder
=========

  CheckRunsV1: 417
  Dropped: 0
  DroppedOnInput: 0
  Events: 0
  HostMetadata: 0
  IntakeV1: 1
  Metadata: 0
  Requeued: 0
  Retried: 0
  RetryQueueSize: 0
  Series: 0
  ServiceChecks: 0
  SketchSeries: 0
  Success: 835
  TimeseriesV1: 417

  API Keys status
  ===============
    API key ending with xxxxx on endpoint https://app.datadoghq.com: API Key valid

Describe what happened:
With the following HPA definition, the values the HPA receives differ from the values shown on the Datadog dashboard.
hpa.yaml

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: statsd-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: statsd-demo
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metricName: demoInGo.request.duration.new5
      metricSelector:
        matchLabels:
          appname: statsd-demo
      targetValue: 500
  - type: External
    external:
      metricName: demoInGo.request.duration.new3
      metricSelector:
        matchLabels:
          appname: statsd-demo
      targetValue: 300

The metric values come from a small demo application that emits a constant value. Values from the dashboard:

 demoInGo.request.duration.new5: 99.2
 demoInGo.request.duration.new3: 56.3

And the values HPA receives:

$ kubectl get hpa
NAME          REFERENCE                TARGETS            MINPODS   MAXPODS   REPLICAS   AGE
statsd-demo   Deployment/statsd-demo   155/500, 155/300   1         10        1          49s

$ kubectl describe hpa statsd-demo
Name:                                               statsd-demo
Namespace:                                          default
Labels:                                             <none>
Annotations:                                        <none>
CreationTimestamp:                                  Mon, 05 Nov 2018 17:41:36 +0530
Reference:                                          Deployment/statsd-demo
Metrics:                                            ( current / target )
  "demoInGo.request.duration.new5" (target value):  155 / 500
  "demoInGo.request.duration.new3" (target value):  155 / 300
Min replicas:                                       1
Max replicas:                                       10
Deployment pods:                                    1 current / 1 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    the last scale time was sufficiently old as to warrant a new scale
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from external metric demoInGo.request.duration.new5(&LabelSelector{MatchLabels:map[string]string{appname: statsd-demo,},MatchExpressions:[],})
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
Events:           <none>

Describe what you expected:
The values of both metrics should be the same as, or close to, the values on the Datadog dashboard.

Steps to reproduce the issue:

  • Create HPA as above
  • Inspect the HPA with kubectl get

Additional environment details (Operating System, Cloud provider, etc):
cluster-agent TRACE logs.
multiple-metrics-hpa.log

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T18:02:47Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.3-eks", GitCommit:"58c199a59046dbf0a13a387d3491a39213be53df", GitTreeState:"clean", BuildDate:"2018-09-21T21:00:04Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Running platform version of EKS: eks.2

values.yaml used for helm install:

daemonset:
  useHostNetwork: true
  useHostPort: true
datadog:
  leaderElection: true
  env:
    - name: DD_USE_DOGSTATSD
      value: "true"
    - name: DD_DOGSTATSD_PORT
      value: "8125"
    - name: DD_DOGSTATSD_NON_LOCAL_TRAFFIC
      value: "true"
  apiKey: "********************************"
  appKey: "****************************************"
clusterAgent:
  enabled: true
  token: "*****************************************"
  metricsProvider:
    enabled: true

All 5 comments

Hey @bhavin192,

Thank you for opening this!
It is not clear to me if the issue is in the Cluster Agent or in the calculation on the HorizontalPodAutoscaler controller side.
However, I was able to reproduce it.
As we see from your log:

2018-11-05 12:12:36 UTC | TRACE | (provider.go:143 in GetExternalMetric) | External metrics returned: []external_metrics.ExternalMetricValue{external_metrics.ExternalMetricValue{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, MetricName:"demoInGo.request.duration.new3", MetricLabels:map[string]string{"appname":"statsd-demo"}, Timestamp:v1.Time{Time:time.Time{wall:0xbef02acd112708fc, ext:160590852923, loc:(*time.Location)(0x28f4c00)}}, WindowSeconds:(*int64)(nil),
 Value:resource.Quantity{i:resource.int64Amount{value:56, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"", Format:"DecimalSI"}}, external_metrics.ExternalMetricValue{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, MetricName:"demoInGo.request.duration.new5", MetricLabels:map[string]string{"appname":"statsd-demo"}, Timestamp:v1.Time{Time:time.Time{wall:0xbef02acd11270f87, ext:160590854574, loc:(*time.Location)(0x28f4c00)}}, WindowSeconds:(*int64)(nil),
 Value:resource.Quantity{i:resource.int64Amount{value:99, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"", Format:"DecimalSI"}}}

It indicates that the values are correctly computed.
From my investigation, the issue is that when Kubernetes requests the value of one metric, it receives both and, as you can see here, the autoscaler then sums the values (56 + 99 = 155, which matches the HPA output above).
That is what I'm investigating.
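
The arithmetic can be checked with a small Python sketch. It assumes, as described above, that the autoscaler adds up every item it receives for a single metric query; the function name `hpa_observed_value` and the dictionary layout are hypothetical, and only the metric names and values come from the TRACE log.

```python
# Items returned for a single metric query, with values taken from the TRACE log.
returned_items = [
    {"metricName": "demoInGo.request.duration.new3", "value": 56},
    {"metricName": "demoInGo.request.duration.new5", "value": 99},
]


def hpa_observed_value(items):
    """Sum the values of all items in one response (assumed autoscaler behavior)."""
    return sum(item["value"] for item in items)


# Compare this with the "current" column reported by `kubectl get hpa`.
print(hpa_observed_value(returned_items))
```

If the response contained only the requested metric, the same function would return that metric's own value.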

It appears that the configmap holds the expected values.

In the meantime, could you confirm by sharing:

  • The output of kubectl describe cm datadog-custom-metrics
  • The output of kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/demoInGo.request.duration.new3" | jq
    as well as kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/demoInGo.request.duration.new5" | jq ?
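
To query both metrics consistently, the raw paths can be generated rather than typed by hand. This is a minimal sketch: the helper `external_metric_path` is hypothetical, and the optional `labelSelector` query parameter is an assumption about how the external metrics API accepts a selector, so verify it against your cluster before relying on it.

```python
from urllib.parse import quote, urlencode


def external_metric_path(metric_name, namespace="default", match_labels=None):
    """Build a path suitable for `kubectl get --raw` (hypothetical helper)."""
    path = (
        "/apis/external.metrics.k8s.io/v1beta1/namespaces/"
        f"{namespace}/{quote(metric_name, safe='')}"
    )
    if match_labels:
        # Assumption: the selector is passed as a `labelSelector` query parameter.
        selector = ",".join(f"{k}={v}" for k, v in sorted(match_labels.items()))
        path += "?" + urlencode({"labelSelector": selector})
    return path


for name in ("demoInGo.request.duration.new3", "demoInGo.request.duration.new5"):
    print(external_metric_path(name))
    print(external_metric_path(name, match_labels={"appname": "statsd-demo"}))
```

Each printed path can be passed to kubectl get --raw "<path>" | jq.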

I'll keep you posted on our findings.

@bhavin192 I found the issue. It's indeed a bug on our end.
This happens when the metrics have the same scope, that is, identical label selectors.
Could you try changing the label selector of one of the metrics?

  - type: External
    external:
      metricName: demoInGo.request.duration.new5
      metricSelector:
        matchLabels:
          appname: statsd-demo
      targetValue: 500
  - type: External
    external:
      metricName: demoInGo.request.duration.new3
      metricSelector:
        matchLabels:
          otherKey: value
      targetValue: 300
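
Why a different selector sidesteps the problem can be illustrated with a short Python sketch. This is not the Cluster Agent's actual code: it assumes, purely for illustration, a lookup that matches stored metrics on their labels alone, and every function and variable name here is hypothetical.

```python
def lookup_by_labels_only(store, name, selector):
    # Hypothetical buggy lookup: the metric name is ignored.
    return [m for m in store if m["labels"] == selector]


def lookup_by_name_and_labels(store, name, selector):
    # Lookup that also requires the metric name to match.
    return [m for m in store if m["name"] == name and m["labels"] == selector]


def total(metrics):
    return sum(m["value"] for m in metrics)


NEW5 = "demoInGo.request.duration.new5"
NEW3 = "demoInGo.request.duration.new3"

# Original HPA: both metrics share the selector appname=statsd-demo.
same_scope = [
    {"name": NEW5, "labels": {"appname": "statsd-demo"}, "value": 99},
    {"name": NEW3, "labels": {"appname": "statsd-demo"}, "value": 56},
]
# Workaround HPA: the second metric uses a different selector.
different_scope = [
    {"name": NEW5, "labels": {"appname": "statsd-demo"}, "value": 99},
    {"name": NEW3, "labels": {"otherKey": "value"}, "value": 56},
]

selector = {"appname": "statsd-demo"}
print(total(lookup_by_labels_only(same_scope, NEW5, selector)))       # both items match
print(total(lookup_by_labels_only(different_scope, NEW5, selector)))  # only new5 matches
print(total(lookup_by_name_and_labels(same_scope, NEW5, selector)))   # only new5 matches
```

With identical selectors the labels-only lookup returns both metrics, so their values get added together; with distinct selectors, or with a lookup that also checks the name, each query returns a single item.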

I'm working on a fix now and will schedule a bug fix release. Thank you very much for bringing this to our attention.

@CharlyF hey, setting a different scope makes it show the correct values.

Thanks for confirming - I'll work on the bugfix release ASAP.

This fix is in cluster-agent:1.1.0, which was released earlier this month.
I am going to close this issue as it is fixed, but feel free to reach out to us if you have questions or feedback!

Best,
.C
