Describe what happened:
I have set up the Cluster Agent using the stable/datadog Helm chart. When I query the external metrics endpoint, I get an empty list of resources.
$ kubectl get --raw /apis/external.metrics.k8s.io/v1beta1
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"external.metrics.k8s.io/v1beta1","resources":[]}
And the HPA is stuck at the <unknown> value.
$ kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
statsd-demo Deployment/statsd-demo <unknown>/10 (avg) 1 10 1 50m
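As a quick way to check the discovery response shown above, here is a minimal Python sketch that parses an APIResourceList document and lists the registered resource names. The function name `registered_external_metrics` is made up for illustration; the sample string is the response from this report.

```python
import json
from typing import List


def registered_external_metrics(discovery_json: str) -> List[str]:
    """Return the resource names listed in an APIResourceList document."""
    doc = json.loads(discovery_json)
    # "resources" is empty when no external metric has been registered yet.
    return [resource.get("name", "") for resource in doc.get("resources", [])]


# Response returned by: kubectl get --raw /apis/external.metrics.k8s.io/v1beta1
sample = (
    '{"kind":"APIResourceList","apiVersion":"v1",'
    '"groupVersion":"external.metrics.k8s.io/v1beta1","resources":[]}'
)

names = registered_external_metrics(sample)
print(names)  # prints []
```

An empty list here means the metrics provider is reachable but is not serving any external metric, which matches the "Total: 0" shown in the agent status.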
Output of status:
root@ddog-cluster-agent-84486db86-qbwrw:/# datadog-cluster-agent status
Getting the status from the agent.
==============================
Datadog Cluster Agent (v1.0.0)
==============================
Status date: 2018-10-30 18:28:19.562472 UTC
Pid: 1
Check Runners: 4
Log Level: WARNING
Paths
=====
Config File: /etc/datadog-agent/datadog-cluster.yaml
conf.d: /etc/datadog-agent/conf.d
Clocks
======
System UTC time: 2018-10-30 18:28:19.562472 UTC
Hostnames
=========
ec2-hostname: ip-1xx-1xx-2xx-190.us-west-2.compute.internal
hostname: i-0c1580d88cbec55c0
instance-id: i-0c1580d88cbec55c0
socket-fqdn: ddog-cluster-agent-84486db86-qbwrw
socket-hostname: ddog-cluster-agent-84486db86-qbwrw
hostname provider: aws
unused hostname providers:
configuration/environment: hostname is empty
gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname
Leader Election
===============
Leader Election Status: Failing
Error: entity not found
Custom Metrics Server
=====================
ConfigMap name: default/datadog-custom-metrics
External Metrics
----------------
Total: 0
Valid: 0
=========
Collector
=========
Running Checks
==============
kubernetes_apiserver
--------------------
Instance ID: kubernetes_apiserver [WARNING]
Total Runs: 15
Metric Samples: 0, Total: 0
Events: 0, Total: 0
Service Checks: 0, Total: 0
Average Execution Time : 0s
Warning: [Leader Election not enabled. Not running Kubernetes API Server check or collecting Kubernetes Events.]
=========
Forwarder
=========
CheckRunsV1: 14
Dropped: 0
DroppedOnInput: 0
Events: 0
HostMetadata: 0
IntakeV1: 1
Metadata: 0
Requeued: 0
Retried: 0
RetryQueueSize: 0
Series: 0
ServiceChecks: 0
SketchSeries: 0
Success: 29
TimeseriesV1: 14
API Keys status
===============
API key ending with xxxxx on endpoint https://app.datadoghq.com: API Key valid
Describe what you expected:
The metrics should appear at the API endpoint, and the HPA should pick up its value from there.
Steps to reproduce the issue:
Install the Cluster Agent using either the Helm chart or the manifests.
Additional environment details (Operating System, Cloud provider, etc):
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T18:02:47Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.3-eks", GitCommit:"58c199a59046dbf0a13a387d3491a39213be53df", GitTreeState:"clean", BuildDate:"2018-09-21T21:00:04Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Running platform version of EKS: eks.2
values.yaml used for helm install:
daemonset:
  useHostNetwork: true
  useHostPort: true
datadog:
  env:
    - name: DD_USE_DOGSTATSD
      value: "true"
    - name: DD_DOGSTATSD_PORT
      value: "8125"
    - name: DD_DOGSTATSD_NON_LOCAL_TRAFFIC
      value: "true"
  apiKey: "********************************"
  appKey: "****************************************"
clusterAgent:
  enabled: true
  token: "*****************************************"
  metricsProvider:
    enabled: true
HPA specification:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: statsd-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: statsd-demo
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metricName: demoInGo.request.count_total
      metricSelector:
        matchLabels:
          appname: statsd-demo
      targetAverageValue: 10
Hey @bhavin192,
Thanks for opening this issue.
From the logs I can see the following:
Warning: [Leader Election not enabled. Not running Kubernetes API Server check or collecting Kubernetes Events.]
This is likely the issue because while the Cluster Agent can run in HA with several replicas, the leader election needs to be enabled for it to query data from Datadog.
I am currently improving the docs on the helm chart to highlight this as you can see here:
https://github.com/helm/charts/blob/2dc2d42e0f470d0ea4803d386ec2b18fdf537314/stable/datadog/README.md#enabling-the-datadog-cluster-agent.
Could you run the chart again with leader election enabled (datadog.leaderElection=true)?
Best,
Hey @CharlyF, thanks for a quick reply :)
After enabling leaderElection, that warning was resolved. I can now see [ok] there, but the metrics are still not available at the endpoint.
Okay, sorry to hear!
Could you set the logs to debug (datadog.logLevel=DEBUG), let the cluster agent run for a few minutes and send a flare from within the cluster agent's pod?
You can use agent flare. This will open a ticket on our end with a bundle of the Cluster Agent's logs and more details for us to investigate.
Best,
I have uploaded the flare, should I share the case ID here?
Hey @bhavin192,
I was able to find the ticket, thanks!
Setting the log level to Trace really helped. The issue here is quite intricate, but I was able to find it.
When the Cluster Agent started, it was not yet the leader, so it did not process the Autoscaler. We can see in the logs that the AutoscalersController received the event and tried to process it accordingly:
2018-10-31 05:43:11 UTC | DEBUG | (hpa_controller.go:329 in addAutoscaler) | Adding autoscaler default/statsd-demo
Update received for the default/statsd-demo, without a relevant change to the configuration
[...]
2018-10-31 05:43:11 UTC | TRACE | (hpa_controller.go:259 in processNext) | Processing default/statsd-demo
2018-10-31 05:43:11 UTC | TRACE | (hpa_controller.go:290 in syncAutoscalers) | Only the leader needs to sync the Autoscalers
2018-10-31 05:43:11 UTC | TRACE | (hpa_controller.go:274 in handleErr) | Faithfully dropping key default/statsd-demo
Now, this means that everything is working as expected. At this point, the leader should be running the process, but in this case I think there was a dangling state, so the Cluster Agent only became leader a little later:
2018-10-31 05:44:57 UTC | DEBUG | (leaderelection.go:154 in EnsureLeaderElectionRuns) | Currently Leader: true. Leader identity: "ddog-cluster-agent-fdddbc496-rxwrt"
I think the leadership was still held by a replica from a previous deployment, namely ddog-cluster-agent-fdddbc496-c4jxf.
Reapplying the HPA manifest will solve this, but on my end I am going to see how this can be better processed.
I also want to improve how we process events from the resync of the informer as it could have helped us here.
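To illustrate the sequence seen in the logs, here is a toy Python model of a leader-gated work queue. It is only a sketch of the behaviour described above, not the Cluster Agent's actual code; the class and method names are hypothetical.

```python
from collections import deque


class AutoscalerController:
    """Toy model of a leader-gated work queue (hypothetical, not the agent's code)."""

    def __init__(self):
        self.is_leader = False
        self.queue = deque()
        self.synced = set()

    def on_event(self, key):
        # An add/update event for an autoscaler enqueues its key.
        self.queue.append(key)

    def process_next(self):
        if not self.queue:
            return
        key = self.queue.popleft()
        if not self.is_leader:
            # Mirrors "Only the leader needs to sync the Autoscalers":
            # the key is dropped and nothing re-enqueues it later.
            return
        self.synced.add(key)


ctrl = AutoscalerController()
ctrl.on_event("default/statsd-demo")  # HPA created while not yet leader
ctrl.process_next()                   # key is dropped
ctrl.is_leader = True                 # leadership acquired a little later
ctrl.process_next()                   # queue is empty, nothing to sync
stuck = "default/statsd-demo" not in ctrl.synced

ctrl.on_event("default/statsd-demo")  # re-applying the HPA emits a new event
ctrl.process_next()
recovered = "default/statsd-demo" in ctrl.synced
print(stuck, recovered)  # prints True True
```

In this model the autoscaler stays unprocessed until a new event arrives, which is why reapplying the manifest helps.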
Let me know if that helps!
@CharlyF that was hard to find :) and yes, re-creating HPA solved the issue.
Awesome!
I have added some work in our backlog to further investigate that behaviour.
Thank you very much for having reached out.
@CharlyF I'm running into this issue, and even re-creating the HPA doesn't resolve it. The relevant details of the Datadog Cluster Agent have been uploaded through the flare for case #180714, if that helps.
Hey @Chili-Man, can you share the HPA manifest here?
I suspect that it is lacking a metricSelector (or labels).
I made a PR yesterday to fix that: https://github.com/DataDog/datadog-agent/pull/2666
We are discussing a release now. Thanks for your feedback
To clarify, it was initially lacking the metricSelector, but even after fixing that, it wasn't working.
@CharlyF, here's the HPA manifest we are using:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: delphiusapp-worker-all
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: delphiusapp-worker-allqueues
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: External
    external:
      metricName: development.delphiusapp_sidekiq.queues.default.enqueued
      metricSelector:
        matchLabels:
          environment: development
      targetValue: 1
Describing the HPA below:
Name: delphiusapp-worker-all
Namespace: data-core
Labels: <none>
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"autoscaling/v2beta1","kind":"HorizontalPodAutoscaler","metadata":{"annotations":{},"name":"delphiusapp-worker-all","namespace":"data-cor...
CreationTimestamp: Wed, 14 Nov 2018 23:44:03 -0800
Reference: Deployment/delphiusapp-worker-allqueues
Metrics: ( current / target )
"development.delphiusapp_sidekiq.queues.default.enqueued" (target value): 0 / 1
Min replicas: 1
Max replicas: 3
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale the last scale time was sufficiently old as to warrant a new scale
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from external metric development.delphiusapp_sidekiq.queues.default.enqueued(&LabelSelector{MatchLabels:map[string]string{environment: development,},MatchExpressions:[],})
ScalingLimited True TooFewReplicas the desired replica count is more than the maximum replica count
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulRescale 45m horizontal-pod-autoscaler New size: 1; reason: All metrics below target
Warning FailedComputeMetricsReplicas 12m (x823 over 9h) horizontal-pod-autoscaler failed to get external metric development.delphiusapp_sidekiq.queues.default.enqueued: unable to get external metric data-core/development.delphiusapp_sidekiq.queues.default.enqueued/&LabelSelector{MatchLabels:map[string]string{environment: development,},MatchExpressions:[],}: no metrics returned from external metrics API
Warning FailedGetExternalMetric 2m (x837 over 9h) horizontal-pod-autoscaler unable to get external metric data-core/development.delphiusapp_sidekiq.queues.default.enqueued/&LabelSelector{MatchLabels:map[string]string{environment: development,},MatchExpressions:[],}: no metrics returned from external metrics API
So apparently, overnight it was eventually able to get the external metric and scaled a few times. I'm not sure why it suddenly started working.
@Chili-Man thank you for sharing!
tl;dr: To answer your question, as long as there is a config without labels, 1.0.0 will not support it. As soon as the HPA manifest is updated to have labels, it will be handled properly.
The fix was merged earlier today, and I sincerely apologize for the trouble. We are starting QA on the release that will include it.
To note: with the fix, the new version will log an error message, but we still do not support autoscaling on metrics without labels.
Could you try using datadog/cluster-agent-dev:charlyf-hpa-labels to confirm that it does not process "bad configs"?
More details:
The Cluster Agent runs a leader election process in order to process the autoscalers using informers. When processing the autoscalers, we extract the metric name and the labels and we query Datadog to get the timestamp/value.
Then, we store the results in a ConfigMap so that other Cluster Agents (if running several replicas) can access the values to serve to Kubernetes. It also helps reduce the number of calls to Datadog.
When a Cluster Agent is not the leader, it only reads values from this ConfigMap when asked (by Kubernetes).
Lastly, to avoid keeping deleted or outdated autoscaler configs in the ConfigMap, a garbage collection process runs every 5 minutes. It lists values from the informer's cache and compares them with the content of the ConfigMap used to store the processed ones.
Hence, if a "bad" config is ever created or updated while the Cluster Agent is not the leader, it will be processed during the GC and crash, because the code tries to access a nil pointer (the labels, which are missing). If the Cluster Agent is the leader when a bad config is made, it will crash almost immediately as it tries to digest the config (and access the missing labels).
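The failure mode described above can be sketched in Python. This is an illustration under stated assumptions, not the actual fix from the PR: the function name `external_metrics` is hypothetical, and the HPA is represented as a plain dict following the autoscaling/v2beta1 layout used in this thread. The idea is to report a config without labels as an error instead of dereferencing the missing labels.

```python
from typing import Dict, List, Tuple


def external_metrics(hpa: dict) -> Tuple[List[Tuple[str, Dict[str, str]]], List[str]]:
    """Split an HPA's external metrics into usable (name, labels) pairs and errors."""
    valid, errors = [], []
    for metric in hpa.get("spec", {}).get("metrics", []):
        if metric.get("type") != "External":
            continue
        ext = metric.get("external") or {}
        name = ext.get("metricName", "")
        selector = ext.get("metricSelector") or {}
        labels = selector.get("matchLabels")
        if not labels:
            # A "bad" config: record an error rather than touching missing labels.
            errors.append(f"{name}: no matchLabels, skipping")
            continue
        valid.append((name, dict(labels)))
    return valid, errors


good = {"spec": {"metrics": [{"type": "External", "external": {
    "metricName": "demoInGo.request.count_total",
    "metricSelector": {"matchLabels": {"appname": "statsd-demo"}},
    "targetAverageValue": 10}}]}}
bad = {"spec": {"metrics": [{"type": "External", "external": {
    "metricName": "demoInGo.request.count_total",
    "targetAverageValue": 10}}]}}

good_valid, good_errors = external_metrics(good)
bad_valid, bad_errors = external_metrics(bad)
print(good_valid, bad_errors)
```

The same guard would apply both when the leader digests a new config and when the periodic garbage collection walks the cached autoscalers.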
@CharlyF Awesome, thanks for the help and insights!
Anytime! Feel free to let us know if you have feedback or questions.
Hi,
I'm putting this here, as this seems like the most suitable place, instead of opening a new ticket.
I'm trying to set up HPAs using DD as the external provider.
Case 1:
- type: External
  external:
    metricName: rabbitmq.queue.messages_ready
    metricSelector:
      matchLabels:
        a: b
    targetAverageValue: 10
This works as expected.
Case 2:
- type: External
  external:
    metricName: rabbitmq.queue.messages_ready
    metricSelector:
      matchExpressions:
      - {key: a, operator: In, values: [b]}
    targetAverageValue: 10
It fails, even though Cases 1 and 2 are equivalent.
Case 3:
- type: External
  external:
    metricName: rabbitmq.queue.messages_ready
    metricSelector:
      matchLabels:
        a: b
      matchExpressions:
      - {key: rabbitmq_queue, operator: In, values: [d, e]}
    targetAverageValue: 10
It fails. The corresponding message is:
Warning FailedComputeMetricsReplicas 7m (x13 over 13m) horizontal-pod-autoscaler failed to get rabbitmq.queue.messages_ready external metric: unable to get external metric monitoring/rabbitmq.queue.messages_ready/&LabelSelector{MatchLabels:map[string]string{a: b,},MatchExpressions:[{rabbitmq_queue In [d, e]}],}: no metrics returned from external metrics API
A similar message to the above is what's reported for Case 2.
Running:
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/monitoring/rabbitmq.queue.messages_ready"
the (truncated) output in Case 3 is:
{"kind":"ExternalMetricValueList","apiVersion":"external.metrics.k8s.io/v1beta1","metadata":{"selfLink":"/apis/external.metrics.k8s.io/v1beta1/namespaces/monitoring/rabbitmq.queue.messages_ready"},"items":[{"metricName":"rabbitmq.queue.messages_ready","metricLabels":{"a":"b"},"timestamp":"2019-05-02T16:44:53Z","value":"0"},...]}
I'd expect that matchLabels and matchExpressions are handled similarly. Any thoughts?
Hi @pchristos,
Currently only matchLabels is supported. Could you open a new issue to track support for matchExpressions?
Thanks and regards
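Until matchExpressions is supported, a selector can sometimes be rewritten by hand. In Kubernetes label selector semantics, an `In` expression with a single value is equivalent to a matchLabels entry, so Case 2 can be expressed as Case 1, while Case 3 cannot. The Python sketch below shows that conversion; `to_match_labels` is a hypothetical helper, not part of the Cluster Agent.

```python
def to_match_labels(selector: dict) -> dict:
    """Rewrite a label selector as plain matchLabels, or raise if that is impossible."""
    labels = dict(selector.get("matchLabels") or {})
    for expr in selector.get("matchExpressions") or []:
        values = expr.get("values") or []
        # Only a single-valued "In" has an exact matchLabels equivalent.
        if expr.get("operator") != "In" or len(values) != 1:
            raise ValueError(f"cannot express {expr} as matchLabels")
        key = expr["key"]
        if labels.get(key, values[0]) != values[0]:
            raise ValueError(f"conflicting requirements for label {key}")
        labels[key] = values[0]
    return labels


case2 = {"matchExpressions": [{"key": "a", "operator": "In", "values": ["b"]}]}
case3 = {
    "matchLabels": {"a": "b"},
    "matchExpressions": [
        {"key": "rabbitmq_queue", "operator": "In", "values": ["d", "e"]}
    ],
}

case2_labels = to_match_labels(case2)
try:
    to_match_labels(case3)
    case3_convertible = True
except ValueError:
    case3_convertible = False

print(case2_labels, case3_convertible)  # prints {'a': 'b'} False
```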