Getting the status from the agent.
==============================
Datadog Cluster Agent (v1.0.0)
==============================
Status date: 2019-06-17 20:53:45.078727 UTC
Pid: 1
Check Runners: 4
Log Level: INFO
Paths
=====
Config File: /etc/datadog-agent/datadog-cluster.yaml
conf.d: /etc/datadog-agent/conf.d
Clocks
======
System UTC time: 2019-06-17 20:53:45.078727 UTC
Hostnames
=========
host_aliases: [REDACTED.REDACTED]
hostname: REDACTED.c.REDACTED.internal
socket-fqdn: datadog-cluster-agent-644896fcbf-rzbkh
socket-hostname: datadog-cluster-agent-644896fcbf-rzbkh
hostname provider: gce
unused hostname providers:
configuration/environment: hostname is empty
Leader Election
===============
Leader Election Status: Running
Leader Name is: datadog-cluster-agent-644896fcbf-rzbkh
Last Acquisition of the lease: Fri, 14 Jun 2019 00:08:54 UTC
Renewed leadership: Mon, 17 Jun 2019 20:53:40 UTC
Number of leader transitions: 2 transitions
Custom Metrics Server
=====================
ConfigMap name: datadog/datadog-custom-metrics
External Metrics
----------------
Total: 1
Valid: 1
=========
Collector
=========
Running Checks
==============
kubernetes_apiserver
--------------------
Instance ID: kubernetes_apiserver [OK]
Total Runs: 22,266
Metric Samples: 0, Total: 0
Events: 0, Total: 0
Service Checks: 4, Total: 89,040
Average Execution Time : 21ms
=========
Forwarder
=========
CheckRunsV1: 22,265
Dropped: 0
DroppedOnInput: 0
Events: 0
HostMetadata: 0
IntakeV1: 1
Metadata: 0
Requeued: 0
Retried: 0
RetryQueueSize: 0
Series: 0
ServiceChecks: 0
SketchSeries: 0
Success: 44,531
TimeseriesV1: 22,265
API Keys status
===============
API key ending with ac197 on endpoint https://app.datadoghq.com: API Key valid
Getting the status from the agent.
==============================
Datadog Cluster Agent (v1.0.0)
==============================
Status date: 2019-06-17 20:54:15.779667 UTC
Pid: 1
Check Runners: 4
Log Level: INFO
Paths
=====
Config File: /etc/datadog-agent/datadog-cluster.yaml
conf.d: /etc/datadog-agent/conf.d
Clocks
======
System UTC time: 2019-06-17 20:54:15.779667 UTC
Hostnames
=========
host_aliases: [REDACTED.REDACTED]
hostname: REDACTED.c.REDACTED.internal
socket-fqdn: datadog-cluster-agent-644896fcbf-rzbkh
socket-hostname: datadog-cluster-agent-644896fcbf-rzbkh
hostname provider: gce
unused hostname providers:
configuration/environment: hostname is empty
Leader Election
===============
Leader Election Status: Running
Leader Name is: datadog-cluster-agent-644896fcbf-rzbkh
Last Acquisition of the lease: Fri, 14 Jun 2019 00:08:54 UTC
Renewed leadership: Mon, 17 Jun 2019 20:54:10 UTC
Number of leader transitions: 2 transitions
Custom Metrics Server
=====================
ConfigMap name: datadog/datadog-custom-metrics
External Metrics
----------------
Total: 1
Valid: 0
=========
Collector
=========
Running Checks
==============
kubernetes_apiserver
--------------------
Instance ID: kubernetes_apiserver [OK]
Total Runs: 22,268
Metric Samples: 0, Total: 0
Events: 0, Total: 0
Service Checks: 4, Total: 89,048
Average Execution Time : 21ms
=========
Forwarder
=========
CheckRunsV1: 22,267
Dropped: 0
DroppedOnInput: 0
Events: 0
HostMetadata: 0
IntakeV1: 1
Metadata: 0
Requeued: 0
Retried: 0
RetryQueueSize: 0
Series: 0
ServiceChecks: 0
SketchSeries: 0
Success: 44,535
TimeseriesV1: 22,267
API Keys status
===============
API key ending with ac197 on endpoint https://app.datadoghq.com: API Key valid
~ $ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"external.metrics.k8s.io/v1beta1","resources":[]}
~ $ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"external.metrics.k8s.io/v1beta1","resources":[{"name":"gcp.pubsub.subscription.num_undelivered_messages","singularName":"","namespaced":true,"kind":"ExternalMetricValueList","verbs":["get"]}]}
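The two responses above differ only in whether the metric appears in the resources list. A minimal sketch for watching that flip over time follows. The helper names lists_metric and watch_metric and the 10-second interval are hypothetical, and the match is a plain text search that assumes the compact JSON shown above (a jq filter would be more robust).

```shell
#!/usr/bin/env bash
METRIC="gcp.pubsub.subscription.num_undelivered_messages"

# Hypothetical helper: succeeds if the APIResourceList JSON on stdin
# contains a resource with the given name (fixed-string match, compact JSON).
lists_metric() {
  grep -qF "\"name\":\"$1\""
}

# Hypothetical polling loop; needs kubectl access to the cluster.
# Prints one timestamped line per poll until interrupted.
watch_metric() {
  while true; do
    if kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | lists_metric "$METRIC"; then
      echo "$(date -u +%H:%M:%S) listed"
    else
      echo "$(date -u +%H:%M:%S) missing"
    fi
    sleep 10
  done
}
```

Calling watch_metric from a shell with cluster access gives a rough timeline of when the metric is exposed and when it drops out.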
The value and valid fields both change occasionally. (There is not much traffic in our dev environment, so the value of 0 does not indicate a problem.)
+ eval 'kubectl --namespace datadog get configmap datadog-custom-metrics -ojson | jq -C "${JQ_CMD}"'
++ kubectl --namespace datadog get configmap datadog-custom-metrics -ojson
++ jq -C '.data = (.data | map_values(fromjson))'
{
"apiVersion": "v1",
"data": {
"external_metric-platform-[REDACTED]-v1-gcp.pubsub.subscription.num_undelivered_messages": {
"metricName": "gcp.pubsub.subscription.num_undelivered_messages",
"labels": {
"project_id": "[REDACTED]",
"subscription_id": "[REDACTED]"
},
"ts": 1560805763,
"hpa": {
"name": "[REDACTED]-v1",
"namespace": "platform",
"uid": "adee8919-8eed-11e9-a60b-42010a800128"
},
"value": 0,
"valid": false
}
},
"kind": "ConfigMap",
"metadata": {
"annotations": {
"custom-metrics.datadoghq.com/last-updated": "2019-06-17T21:09:23Z"
},
"creationTimestamp": "2019-06-13T23:55:10Z",
"name": "datadog-custom-metrics",
"namespace": "datadog",
"resourceVersion": "74849935",
"selfLink": "/api/v1/namespaces/datadog/configmaps/datadog-custom-metrics",
"uid": "ad830eb8-8e36-11e9-a60b-42010a800128"
}
}
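The ts field in the entry above is a Unix timestamp. A small sketch for converting it to UTC, so it can be compared with the last-updated annotation, is shown here. The function name epoch_to_utc is hypothetical, and the -d "@..." form assumes GNU date (BSD/macOS date uses -r instead).

```shell
# Hypothetical helper: print a Unix timestamp as an ISO 8601 UTC string.
# Assumes GNU date; on BSD/macOS use: date -u -r "$1" +%Y-%m-%dT%H:%M:%SZ
epoch_to_utc() {
  date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ
}

# The "ts" value from the ConfigMap entry above.
epoch_to_utc 1560805763
```

For this entry the result is 2019-06-17T21:09:23Z, the same instant as the custom-metrics.datadoghq.com/last-updated annotation.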
Describe what happened:
The external metric is only intermittently valid; most of the time it is invalid.
~ $ kubectl describe hpa [REDACTED]-v1
Name: [REDACTED]-v1
Namespace: platform
Labels: app=[REDACTED]
chart=[REDACTED]-1.5.0
heritage=Tiller
release=[REDACTED]-v1
Annotations: <none>
CreationTimestamp: Fri, 14 Jun 2019 16:45:08 -0500
Reference: Deployment/[REDACTED]-v1
Metrics: ( current / target )
"gcp.pubsub.subscription.num_undelivered_messages" (target average value): 1 / 20
Min replicas: 1
Max replicas: 5
Deployment pods: 1 current / 1 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale
ScalingActive False FailedGetExternalMetric the HPA was unable to compute the replica count: unable to get external metric platform/gcp.pubsub.subscription.num_undelivered_messages/&LabelSelector{MatchLabels:map[string]string{project_id: [REDACTED],subscription_id: [REDACTED],},MatchExpressions:[],}: no metrics returned from external metrics API
ScalingLimited False DesiredWithinRange the desired count is within the acceptable range
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedComputeMetricsReplicas 35m (x14177 over 2d23h) horizontal-pod-autoscaler failed to get gcp.pubsub.subscription.num_undelivered_messages external metric: unable to get external metric platform/gcp.pubsub.subscription.num_undelivered_messages/&LabelSelector{MatchLabels:map[string]string{project_id: [REDACTED],subscription_id: [REDACTED],},MatchExpressions:[],}: no metrics returned from external metrics API
Warning FailedGetExternalMetric 30s (x14296 over 2d23h) horizontal-pod-autoscaler unable to get external metric platform/gcp.pubsub.subscription.num_undelivered_messages/&LabelSelector{MatchLabels:map[string]string{project_id: [REDACTED],subscription_id: [REDACTED],},MatchExpressions:[],}: no metrics returned from external metrics API
Describe what you expected:
The external metric should be valid and available to the cluster most of the time.
Steps to reproduce the issue:
Created an autoscaler using a GCP Pub/Sub subscription metric:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: service-v1
  labels:
    app: SERVICE_NAME
spec:
  minReplicas: 1
  maxReplicas: 100
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: SERVICE_NAME
  metrics:
  - type: External
    external:
      metricName: gcp.pubsub.subscription.num_undelivered_messages
      metricSelector:
        matchLabels:
          project_id: "[PROJECT_ID]"
          subscription_id: "[SUBSCRIPTION_ID]"
      targetAverageValue: 20
Additional environment details (Operating System, Cloud provider, etc):
Cloud Provider: Google
Master version: 1.11.8-gke.6
Node version: 1.10.2-gke.1
OS: Container-Optimized OS (cos) (current)
Hey @agosto-calvinbehling, thanks for opening the issue here. In this case, metrics coming from the GCP integration are slightly delayed. It may be worth applying the environment variables DD_EXTERNAL_METRICS_PROVIDER_MAX_AGE and DD_EXTERNAL_METRICS_PROVIDER_BUCKET_SIZE to account for that. There was a similar issue that was addressed here as well if you would like a reference point: https://github.com/DataDog/datadog-agent/pull/3005#issuecomment-462532444.
Otherwise, if adjusting the external metrics provider settings doesn't seem to help, could you open a ticket with our support team so we can get a flare and start to investigate there? Also, a quick side note: I noticed you're running the first version of the cluster agent. We've made a significant number of improvements since then, so I'd recommend upgrading as well!
Thanks. I'll look into this.
This link should be helpful for anyone else that finds this issue:
https://docs.datadoghq.com/integrations/faq/cloud-metric-delay/#gcp
For GCP, Datadog runs the default crawler every 5 minutes.
GCP emits metrics with 1-minute granularity. Therefore, expect metric delays of ~7-8 minutes.
I used a 10-minute window to be conservative:
helm upgrade --install \
--set clusterAgent.env[0].name=DD_EXTERNAL_METRICS_PROVIDER_MAX_AGE \
--set-string clusterAgent.env[0].value=600 \
--set clusterAgent.env[1].name=DD_EXTERNAL_METRICS_PROVIDER_BUCKET_SIZE \
--set-string clusterAgent.env[1].value=600 \
...
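The 600 passed to both settings is the 10-minute window expressed in seconds. A minimal sketch of that arithmetic follows; the variable names are hypothetical, and the 2-minute headroom is an assumption chosen so that the upper delay estimate from the FAQ quoted above adds up to the 10-minute window.

```shell
# Upper end of the "~7-8 minutes" delay estimate quoted from the FAQ.
expected_delay_min=8
# Assumption: 2 extra minutes of headroom, giving the 10-minute window.
headroom_min=2

# Convert minutes to seconds for the two environment variables.
window_sec=$(( (expected_delay_min + headroom_min) * 60 ))
echo "window: ${window_sec}s"   # prints "window: 600s"
```

A different headroom changes only headroom_min; the resulting window_sec is what would go into DD_EXTERNAL_METRICS_PROVIDER_MAX_AGE and DD_EXTERNAL_METRICS_PROVIDER_BUCKET_SIZE.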