Datadog-agent: Intermittent availability of external metrics

Created on 17 Jun 2019 · 3 Comments · Source: DataDog/datadog-agent

Getting the status from the agent.
==============================
Datadog Cluster Agent (v1.0.0)
==============================

  Status date: 2019-06-17 20:53:45.078727 UTC
  Pid: 1
  Check Runners: 4
  Log Level: INFO

  Paths
  =====
    Config File: /etc/datadog-agent/datadog-cluster.yaml
    conf.d: /etc/datadog-agent/conf.d

  Clocks
  ======
    System UTC time: 2019-06-17 20:53:45.078727 UTC

  Hostnames
  =========
    host_aliases: [REDACTED.REDACTED]
    hostname: REDACTED.c.REDACTED.internal
    socket-fqdn: datadog-cluster-agent-644896fcbf-rzbkh
    socket-hostname: datadog-cluster-agent-644896fcbf-rzbkh
    hostname provider: gce
    unused hostname providers:
      configuration/environment: hostname is empty

  Leader Election
  ===============
    Leader Election Status:  Running
    Leader Name is: datadog-cluster-agent-644896fcbf-rzbkh
    Last Acquisition of the lease: Fri, 14 Jun 2019 00:08:54 UTC
    Renewed leadership: Mon, 17 Jun 2019 20:53:40 UTC
    Number of leader transitions: 2 transitions

  Custom Metrics Server
  =====================
    ConfigMap name: datadog/datadog-custom-metrics

    External Metrics
    ----------------
      Total: 1
      Valid: 1


=========
Collector
=========

  Running Checks
  ==============

    kubernetes_apiserver
    --------------------
        Instance ID: kubernetes_apiserver [OK]
        Total Runs: 22,266
        Metric Samples: 0, Total: 0
        Events: 0, Total: 0
        Service Checks: 4, Total: 89,040
        Average Execution Time : 21ms



=========
Forwarder
=========

  CheckRunsV1: 22,265
  Dropped: 0
  DroppedOnInput: 0
  Events: 0
  HostMetadata: 0
  IntakeV1: 1
  Metadata: 0
  Requeued: 0
  Retried: 0
  RetryQueueSize: 0
  Series: 0
  ServiceChecks: 0
  SketchSeries: 0
  Success: 44,531
  TimeseriesV1: 22,265

  API Keys status
  ===============
    API key ending with ac197 on endpoint https://app.datadoghq.com: API Key valid
Getting the status from the agent.
==============================
Datadog Cluster Agent (v1.0.0)
==============================

  Status date: 2019-06-17 20:54:15.779667 UTC
  Pid: 1
  Check Runners: 4
  Log Level: INFO

  Paths
  =====
    Config File: /etc/datadog-agent/datadog-cluster.yaml
    conf.d: /etc/datadog-agent/conf.d

  Clocks
  ======
    System UTC time: 2019-06-17 20:54:15.779667 UTC

  Hostnames
  =========
    host_aliases: [REDACTED.REDACTED]
    hostname: REDACTED.c.REDACTED.internal
    socket-fqdn: datadog-cluster-agent-644896fcbf-rzbkh
    socket-hostname: datadog-cluster-agent-644896fcbf-rzbkh
    hostname provider: gce
    unused hostname providers:
      configuration/environment: hostname is empty

  Leader Election
  ===============
    Leader Election Status:  Running
    Leader Name is: datadog-cluster-agent-644896fcbf-rzbkh
    Last Acquisition of the lease: Fri, 14 Jun 2019 00:08:54 UTC
    Renewed leadership: Mon, 17 Jun 2019 20:54:10 UTC
    Number of leader transitions: 2 transitions

  Custom Metrics Server
  =====================
    ConfigMap name: datadog/datadog-custom-metrics

    External Metrics
    ----------------
      Total: 1
      Valid: 0


=========
Collector
=========

  Running Checks
  ==============

    kubernetes_apiserver
    --------------------
        Instance ID: kubernetes_apiserver [OK]
        Total Runs: 22,268
        Metric Samples: 0, Total: 0
        Events: 0, Total: 0
        Service Checks: 4, Total: 89,048
        Average Execution Time : 21ms



=========
Forwarder
=========

  CheckRunsV1: 22,267
  Dropped: 0
  DroppedOnInput: 0
  Events: 0
  HostMetadata: 0
  IntakeV1: 1
  Metadata: 0
  Requeued: 0
  Retried: 0
  RetryQueueSize: 0
  Series: 0
  ServiceChecks: 0
  SketchSeries: 0
  Success: 44,535
  TimeseriesV1: 22,267

  API Keys status
  ===============
    API key ending with ac197 on endpoint https://app.datadoghq.com: API Key valid
 ~ $ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"external.metrics.k8s.io/v1beta1","resources":[]}
 ~ $ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"external.metrics.k8s.io/v1beta1","resources":[{"name":"gcp.pubsub.subscription.num_undelivered_messages","singularName":"","namespaced":true,"kind":"ExternalMetricValueList","verbs":["get"]}]}
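The two responses above differ only in whether `resources` is empty. A minimal Python sketch for checking that programmatically, assuming the JSON has already been fetched (for example with the `kubectl get --raw` command shown above). The function name `advertised_metrics` is hypothetical, not part of any Datadog or Kubernetes tooling:

```python
import json
from typing import List


def advertised_metrics(api_resource_list_json: str) -> List[str]:
    """Return the metric names listed in an external.metrics.k8s.io APIResourceList."""
    doc = json.loads(api_resource_list_json)
    # An empty "resources" list means no external metric is currently served.
    return [resource["name"] for resource in doc.get("resources", [])]


# The two responses captured in this issue.
empty = (
    '{"kind":"APIResourceList","apiVersion":"v1",'
    '"groupVersion":"external.metrics.k8s.io/v1beta1","resources":[]}'
)
listed = (
    '{"kind":"APIResourceList","apiVersion":"v1",'
    '"groupVersion":"external.metrics.k8s.io/v1beta1","resources":['
    '{"name":"gcp.pubsub.subscription.num_undelivered_messages",'
    '"singularName":"","namespaced":true,'
    '"kind":"ExternalMetricValueList","verbs":["get"]}]}'
)

print(advertised_metrics(empty))
print(advertised_metrics(listed))
```

Running a check like this on a timer would give a rough measure of how often the metric is actually advertised.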

The value and valid fields both change occasionally. (There's not a lot of traffic in our dev environment, so the 0 value isn't an indicator of anything being wrong.)

+ eval 'kubectl --namespace datadog get configmap datadog-custom-metrics -ojson | jq -C "${JQ_CMD}"'
++ kubectl --namespace datadog get configmap datadog-custom-metrics -ojson
++ jq -C '.data = (.data | map_values(fromjson))'
{
  "apiVersion": "v1",
  "data": {
    "external_metric-platform-[REDACTED]-v1-gcp.pubsub.subscription.num_undelivered_messages": {
      "metricName": "gcp.pubsub.subscription.num_undelivered_messages",
      "labels": {
        "project_id": "[REDACTED]",
        "subscription_id": "[REDACTED]"
      },
      "ts": 1560805763,
      "hpa": {
        "name": "[REDACTED]-v1",
        "namespace": "platform",
        "uid": "adee8919-8eed-11e9-a60b-42010a800128"
      },
      "value": 0,
      "valid": false
    }
  },
  "kind": "ConfigMap",
  "metadata": {
    "annotations": {
      "custom-metrics.datadoghq.com/last-updated": "2019-06-17T21:09:23Z"
    },
    "creationTimestamp": "2019-06-13T23:55:10Z",
    "name": "datadog-custom-metrics",
    "namespace": "datadog",
    "resourceVersion": "74849935",
    "selfLink": "/api/v1/namespaces/datadog/configmaps/datadog-custom-metrics",
    "uid": "ad830eb8-8e36-11e9-a60b-42010a800128"
  }
}
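The same inspection can be sketched in Python. Each value under `data` in the raw ConfigMap is itself a JSON string, which is why the `jq` filter above applies `fromjson`. The helper name `invalid_external_metrics` and the sample key are hypothetical, and the sample is a trimmed stand-in for real `kubectl ... -ojson` output:

```python
import json
from typing import Dict, List


def invalid_external_metrics(configmap_json: str) -> List[Dict]:
    """Decode each JSON-string value under .data and return entries whose valid flag is false."""
    configmap = json.loads(configmap_json)
    data = configmap.get("data") or {}
    # Mirror jq's map_values(fromjson): every value is a JSON document stored as a string.
    decoded = {key: json.loads(raw) for key, raw in data.items()}
    return [
        {"key": key, "metricName": entry.get("metricName"), "ts": entry.get("ts")}
        for key, entry in decoded.items()
        if not entry.get("valid", False)
    ]


# Hypothetical sample shaped like the raw ConfigMap; the key is a placeholder.
sample = json.dumps({
    "data": {
        "external_metric-platform-example-v1-gcp.pubsub.subscription.num_undelivered_messages": json.dumps({
            "metricName": "gcp.pubsub.subscription.num_undelivered_messages",
            "ts": 1560805763,
            "value": 0,
            "valid": False,
        })
    }
})

print(invalid_external_metrics(sample))
```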

Describe what happened:
The external metric is only intermittently valid; most of the time it is invalid.

 ~ $ kubectl describe hpa [REDACTED]-v1
Name:                                                                         [REDACTED]-v1
Namespace:                                                                    platform
Labels:                                                                       app=[REDACTED]
                                                                              chart=[REDACTED]-1.5.0
                                                                              heritage=Tiller
                                                                              release=[REDACTED]-v1
Annotations:                                                                  <none>
CreationTimestamp:                                                            Fri, 14 Jun 2019 16:45:08 -0500
Reference:                                                                    Deployment/[REDACTED]-v1
Metrics:                                                                      ( current / target )
  "gcp.pubsub.subscription.num_undelivered_messages" (target average value):  1 / 20
Min replicas:                                                                 1
Max replicas:                                                                 5
Deployment pods:                                                              1 current / 1 desired
Conditions:
  Type            Status  Reason                   Message
  ----            ------  ------                   -------
  AbleToScale     True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive   False   FailedGetExternalMetric  the HPA was unable to compute the replica count: unable to get external metric platform/gcp.pubsub.subscription.num_undelivered_messages/&LabelSelector{MatchLabels:map[string]string{project_id: [REDACTED],subscription_id: [REDACTED],},MatchExpressions:[],}: no metrics returned from external metrics API
  ScalingLimited  False   DesiredWithinRange       the desired count is within the acceptable range
Events:
  Type     Reason                        Age                      From                       Message
  ----     ------                        ----                     ----                       -------
  Warning  FailedComputeMetricsReplicas  35m (x14177 over 2d23h)  horizontal-pod-autoscaler  failed to get gcp.pubsub.subscription.num_undelivered_messages external metric: unable to get external metric platform/gcp.pubsub.subscription.num_undelivered_messages/&LabelSelector{MatchLabels:map[string]string{project_id: [REDACTED],subscription_id: [REDACTED],},MatchExpressions:[],}: no metrics returned from external metrics API
  Warning  FailedGetExternalMetric       30s (x14296 over 2d23h)  horizontal-pod-autoscaler  unable to get external metric platform/gcp.pubsub.subscription.num_undelivered_messages/&LabelSelector{MatchLabels:map[string]string{project_id: [REDACTED],subscription_id: [REDACTED],},MatchExpressions:[],}: no metrics returned from external metrics API

Describe what you expected:
The external metric should be valid and available to the cluster most of the time.

Steps to reproduce the issue:
Created an autoscaler using a GCP Pub/Sub subscription metric:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: service-v1
  labels:
    app: SERVICE_NAME
spec:
  minReplicas: 1
  maxReplicas: 100
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: SERVICE_NAME
  metrics:
  - type: External
    external:
      metricName: gcp.pubsub.subscription.num_undelivered_messages
      metricSelector:
        matchLabels:
          project_id: "[PROJECT_ID]"
          subscription_id: "[SUBSCRIPTION_ID]"
      targetAverageValue: 20

Additional environment details (Operating System, Cloud provider, etc):

Cloud Provider: Google
Master version: 1.11.8-gke.6
Node version: 1.10.2-gke.1
OS: Container-Optimized OS (cos) (current)

All 3 comments

Hey @agosto-calvinbehling, thanks for opening the issue here. In this case, metrics coming from the GCP integration are slightly delayed. It may be worth applying the environment variables DD_EXTERNAL_METRICS_PROVIDER_MAX_AGE and DD_EXTERNAL_METRICS_PROVIDER_BUCKET_SIZE to account for that. There was a similar issue that was addressed here as well if you would like a reference point: https://github.com/DataDog/datadog-agent/pull/3005#issuecomment-462532444.

Otherwise, if adjusting the external metrics provider settings doesn't seem to help, could you open up a ticket with our support team so we can get a flare and start to investigate there? Also, a quick side note: I noticed you're running the first version of the cluster agent. We've made a significant number of improvements since then, so I'd recommend upgrading as well!

Thanks. I'll look into this.

This link should be helpful for anyone else that finds this issue:

https://docs.datadoghq.com/integrations/faq/cloud-metric-delay/#gcp
For GCP, Datadog runs the default crawler every 5 minutes.
GCP emits metrics with 1-minute granularity. Therefore, expect metric delays of ~7-8 minutes.

I used a 10-minute window (600 seconds) to be conservative.

helm upgrade --install \
        --set clusterAgent.env[0].name=DD_EXTERNAL_METRICS_PROVIDER_MAX_AGE \
        --set-string clusterAgent.env[0].value=600 \
        --set clusterAgent.env[1].name=DD_EXTERNAL_METRICS_PROVIDER_BUCKET_SIZE \
        --set-string clusterAgent.env[1].value=600 \
        ...
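
The reasoning behind the 600-second value can be illustrated with a small model. This is a simplified sketch, not the Cluster Agent's actual code: it assumes a data point counts as valid only while its age is at most the configured max age, and it assumes a default of 120 seconds for `DD_EXTERNAL_METRICS_PROVIDER_MAX_AGE` (check the default for your Cluster Agent version). `is_fresh` is a hypothetical helper:

```python
def is_fresh(point_age_seconds: int, max_age_seconds: int) -> bool:
    """Simplified validity rule: a point is usable only if it is no older than max_age."""
    return point_age_seconds <= max_age_seconds


ASSUMED_DEFAULT_MAX_AGE = 120  # assumption; verify against your agent version
CONFIGURED_MAX_AGE = 600       # value set via DD_EXTERNAL_METRICS_PROVIDER_MAX_AGE above

# Upper end of the ~7-8 minute GCP delay quoted from the Datadog docs.
gcp_delay_seconds = 8 * 60

print(is_fresh(gcp_delay_seconds, ASSUMED_DEFAULT_MAX_AGE))  # False: 480 s > 120 s
print(is_fresh(gcp_delay_seconds, CONFIGURED_MAX_AGE))       # True: 480 s <= 600 s
```

Under these assumptions, a delay of up to 8 minutes would exceed a short max age most of the time, which would fit the mostly-invalid status reported in this issue, while a 10-minute window leaves some margin.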