Output of the info page (if this is a bug)
Getting the status from the agent.
==============================
Datadog Cluster Agent (v1.1.0)
==============================
Status date: 2018-12-10 17:50:27.539631 UTC
Pid: 1
Check Runners: 4
Log Level: info
Paths
=====
Config File: /etc/datadog-agent/datadog-cluster.yaml
conf.d: /etc/datadog-agent/conf.d
Clocks
======
System UTC time: 2018-12-10 17:50:27.539631 UTC
Hostnames
=========
ec2-hostname: ip-172-29-155-132.ad.data.activision.com
hostname: i-04a44a8ee90554219
instance-id: i-04a44a8ee90554219
socket-fqdn: datadog-cluster-agent-6f8dc64ccd-6skj5
socket-hostname: datadog-cluster-agent-6f8dc64ccd-6skj5
hostname provider: aws
unused hostname providers:
configuration/environment: hostname is empty
gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname
Leader Election
===============
Leader Election Status: Running
Leader Name is: datadog-cluster-agent-65cbf9844d-w4bm9
Last Acquisition of the lease: Mon, 10 Dec 2018 17:49:08 UTC
Renewed leadership: Mon, 10 Dec 2018 17:49:38 UTC
Number of leader transitions: 9 transitions
Custom Metrics Server
=====================
ConfigMap name: kube-system/datadog-custom-metrics
External Metrics
----------------
Total: 1
Valid: 1
hpa:
- name: nginxext
- namespace: default
- uid: 64e03168-fc99-11e8-9eea-020490deafb0
labels:
- kube_container_name: nginx
metricName: nginx.net.request_per_s
ts: 1.544464144e+09
valid: true
value: 1
=========
Collector
=========
Running Checks
==============
kubernetes_apiserver
--------------------
Instance ID: kubernetes_apiserver [OK]
Total Runs: 5
Metric Samples: Last Run: 0, Total: 0
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 100ms
=========
Forwarder
=========
Transactions
============
CheckRunsV1: 4
Dropped: 0
DroppedOnInput: 0
Events: 0
HostMetadata: 0
IntakeV1: 1
Metadata: 0
Requeued: 0
Retried: 0
RetryQueueSize: 0
Series: 0
ServiceChecks: 0
SketchSeries: 0
Success: 9
TimeseriesV1: 4
API Keys status
===============
API key ending with 1ff8a on endpoint https://app.datadoghq.com: API Key valid
Describe what happened:
The APIService cannot query the datadog-custom-metrics-server. You can see the error here:
mercury-core $ kubectl describe APIService v1beta1.external.metrics.k8s.io
Name: v1beta1.external.metrics.k8s.io
Namespace:
Labels: <none>
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"apiregistration.k8s.io/v1beta1","kind":"APIService","metadata":{"annotations":{},"name":"v1beta1.external.metrics.k8s.io"},...
API Version: apiregistration.k8s.io/v1
Kind: APIService
Metadata:
Creation Timestamp: 2018-12-10T17:29:04Z
Resource Version: 6186546
Self Link: /apis/apiregistration.k8s.io/v1/apiservices/v1beta1.external.metrics.k8s.io
UID: 1733cbb1-fca1-11e8-94e3-069324fbe3cc
Spec:
Ca Bundle: <nil>
Group: external.metrics.k8s.io
Group Priority Minimum: 100
Insecure Skip TLS Verify: true
Service:
Name: datadog-custom-metrics-server
Namespace: kube-system
Version: v1beta1
Version Priority: 100
Status:
Conditions:
Last Transition Time: 2018-12-10T17:29:04Z
Message: no response from https://172.29.155.94:443: Get https://172.29.155.94:443: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Reason: FailedDiscoveryCheck
Status: False
Type: Available
Events: <none>
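The condition to watch in the output above is `Available`. A minimal sketch for checking it programmatically, assuming the JSON form of the same object (for example from `kubectl get apiservice v1beta1.external.metrics.k8s.io -o json`) has already been loaded into a dict. The helper name `availability` and the sample dict are illustrative and only mirror the fields shown in this thread:

```python
from typing import Optional, Tuple


def availability(apiservice: dict) -> Tuple[bool, Optional[str], Optional[str]]:
    """Return (is_available, reason, message) from an APIService object.

    Looks for the condition whose type is "Available" under status.conditions.
    Returns (False, None, None) when no such condition is present.
    """
    for cond in apiservice.get("status", {}).get("conditions", []):
        if cond.get("type") == "Available":
            return cond.get("status") == "True", cond.get("reason"), cond.get("message")
    return False, None, None


# Hand-written sample that mirrors the failing status shown in this thread.
failing = {
    "status": {
        "conditions": [
            {
                "type": "Available",
                "status": "False",
                "reason": "FailedDiscoveryCheck",
                "message": "no response from https://172.29.155.94:443",
            }
        ]
    }
}

ok, reason, message = availability(failing)
print(ok, reason)  # False FailedDiscoveryCheck
```

A `FailedDiscoveryCheck` reason with a timeout message, as here, suggests the API server never got an answer from the backing service at all.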
I was hoping it was a networking issue, but it seems that the cluster-agent responds badly to the request as well:
root@datadog-cluster-agent-6f8dc64ccd-6skj5:/# curl -vk https://localhost:443 && echo
* Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: none
CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Request CERT (13):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Certificate (11):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
* subject: CN=localhost@1544464170
* start date: Dec 10 17:49:30 2018 GMT
* expire date: Dec 10 17:49:30 2019 GMT
* issuer: CN=localhost-ca@1544464169
* SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55d899e53b00)
> GET / HTTP/2
> Host: localhost
> User-Agent: curl/7.62.0
> Accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
< HTTP/2 403
< content-type: application/json
< x-content-type-options: nosniff
< content-length: 46
< date: Mon, 10 Dec 2018 17:56:56 GMT
<
* Connection #0 to host localhost left intact
: no kind is registered for the type v1.Status
I'm not really sure what that last line means. Is it trying to query for Status objects? kubectl api-resources reveals there aren't any objects with that name, only ComponentStatus. I suppose it could also be an RBAC issue on my end?
The resulting test HPA looks like this (because the APIService is broken):
NAME       REFERENCE          TARGETS             MINPODS   MAXPODS   REPLICAS   AGE
nginxext   Deployment/nginx   <unknown>/9 (avg)   1         3         1          1h
Describe what you expected:
I expected the APIService to correctly communicate with the cluster-agent.
Steps to reproduce the issue:
I followed the great guides you have written, and the related YAML is here:
apiVersion: v1
data:
  event.tokenKey: "0"
kind: ConfigMap
metadata:
  name: datadogtoken
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: datadog-cluster-agent
rules:
- apiGroups:
  - ""
  resources:
  - services
  - events
  - endpoints
  - pods
  - nodes
  - componentstatuses
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - "autoscaling"
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  resourceNames:
  - datadogtoken # Kubernetes event collection state
  - datadog-leader-election # Leader election token
  verbs:
  - get
  - update
- apiGroups: # To create the leader election token
  - ""
  resources:
  - configmaps
  verbs:
  - create
  - get
  - update
- nonResourceURLs:
  - "/version"
  - "/healthz"
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: datadog-cluster-agent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: datadog-cluster-agent
subjects:
- kind: ServiceAccount
  name: datadog-cluster-agent
  namespace: kube-system
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: datadog-cluster-agent
  namespace: kube-system
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: datadog-cluster-agent
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        app: datadog-cluster-agent
      name: datadog-agent
      annotations:
        ad.datadoghq.com/datadog-cluster-agent.check_names: '["prometheus"]'
        ad.datadoghq.com/datadog-cluster-agent.init_configs: '[{}]'
        ad.datadoghq.com/datadog-cluster-agent.instances: '[{"prometheus_url": "http://%%host%%:5000/metrics","namespace": "datadog.cluster_agent","metrics": ["go_goroutines","go_memstats_*","process_*","api_requests","datadog_requests","external_metrics"]}]'
    spec:
      serviceAccountName: datadog-cluster-agent
      containers:
      - image: datadog/cluster-agent:latest
        imagePullPolicy: Always
        ports:
        - containerPort: 5005
        - containerPort: 443
        name: datadog-cluster-agent
        env:
        - name: DD_API_KEY
          valueFrom:
            secretKeyRef:
              name: datadog-secrets
              key: api-key
        - name: DD_APP_KEY
          valueFrom:
            secretKeyRef:
              name: datadog-secrets
              key: app-key
        - name: DD_COLLECT_KUBERNETES_EVENTS
          value: "true"
        - name: DD_LEADER_ELECTION
          value: "true"
        - name: DD_EXTERNAL_METRICS_PROVIDER_ENABLED
          value: 'true'
        - name: DD_CLUSTER_AGENT_AUTH_TOKEN
          valueFrom:
            secretKeyRef:
              name: datadog-secrets
              key: cluster-token
---
apiVersion: v1
kind: Service
metadata:
  name: datadog-cluster-agent
  labels:
    app: datadog-cluster-agent
spec:
  ports:
  - port: 5005 # Has to be the same as the one exposed in the DCA. Default is 5005.
    protocol: TCP
  selector:
    app: datadog-cluster-agent
---
# HPA stuff
kind: Service
apiVersion: v1
metadata:
  name: datadog-custom-metrics-server
  namespace: kube-system
spec:
  selector:
    app: datadog-cluster-agent
  ports:
  - protocol: TCP
    port: 443
    targetPort: 443
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:auth-delegator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:auth-delegator
subjects:
- kind: ServiceAccount
  name: datadog-cluster-agent
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: datadog-cluster-agent
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: extension-apiserver-authentication-reader
subjects:
- kind: ServiceAccount
  name: datadog-cluster-agent
  namespace: kube-system
---
apiVersion: apiregistration.k8s.io/v1beta1
kind: APIService
metadata:
  name: v1beta1.external.metrics.k8s.io
spec:
  insecureSkipTLSVerify: true
  group: external.metrics.k8s.io
  groupPriorityMinimum: 100
  versionPriority: 100
  service:
    name: datadog-custom-metrics-server
    namespace: kube-system
  version: v1beta1
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: external-metrics-reader
rules:
- apiGroups:
  - "external.metrics.k8s.io"
  resources:
  - "*"
  verbs:
  - list
  - get
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: external-metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-metrics-reader
subjects:
- kind: ServiceAccount
  name: horizontal-pod-autoscaler
  namespace: kube-system
Additional environment details (Operating System, Cloud provider, etc):
I'm running on AWS EKS. Here are my versions:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.3", GitCommit:"435f92c719f279a3a67808c80521ea17d5715c66", GitTreeState:"clean", BuildDate:"2018-11-27T01:14:37Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.11-eks", GitCommit:"6bf27214b7e3e1e47dce27dcbd73ee1b27adadd0", GitTreeState:"clean", BuildDate:"2018-12-04T13:33:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Hey @george-miller - Thanks for opening this up, and apologies for the headache.
This is strange indeed, I tested it last week and it was working correctly.
Thanks for sharing these outputs, I'll try to reproduce on my end and keep you posted.
Best,
.C
Thanks for the quick reply, it's not really a huge headache because everything else still works great (event collection etc.), just the custom metrics stuff doesn't work. Happy to give you more info on my setup or whatever you need.
Appreciate you making this cool tool! Your docs made it super easy.
@george-miller, thanks for the feedback!
It looks like the API Server is not able to reach the cluster agent. I faced a similar issue in the past on EKS, and this solved it:
https://github.com/kubernetes-incubator/metrics-server/issues/45#issuecomment-421345121
I thought that it had been fixed upstream, but I could be wrong. Did you just create the cluster?
I'll keep digging in the meantime.
Best,
.C
@CharlyF
It works!
$ kubectl describe APIService v1beta1.external.metrics.k8s.io
...
Status:
Conditions:
Last Transition Time: 2018-12-10T19:27:27Z
Message: all checks passed
Reason: Passed
Status: True
Type: Available
You're the best! Thank you so much! Slightly disappointed that I didn't realize this; in hindsight it makes a whole lot of sense though.
For anyone else looking at this, the EKS CloudFormation template only allows the masters (the control plane) to communicate with the worker nodes on ports 1024-65535. You have to either allow all traffic or add another rule for port 443.
To answer your question: no, I created my cluster a couple of months ago. We ported the CloudFormation to Terraform, so maybe AWS has since updated their CloudFormation to include this; not sure.
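The port-range point above is easy to miss, so here is a minimal sketch of it. It assumes ingress rules are modelled as inclusive `(from_port, to_port)` pairs; the function name and the rule lists are illustrative, not taken from the AWS template itself:

```python
from typing import Iterable, Tuple

PortRange = Tuple[int, int]  # inclusive (from_port, to_port)


def port_allowed(port: int, ingress_ranges: Iterable[PortRange]) -> bool:
    """Return True if any inclusive range covers the given port."""
    return any(low <= port <= high for low, high in ingress_ranges)


# Control plane -> worker range described in this thread.
template_rules = [(1024, 65535)]
# Same range plus a dedicated rule for port 443.
fixed_rules = template_rules + [(443, 443)]

print(port_allowed(443, template_rules))   # False
print(port_allowed(5005, template_rules))  # True
print(port_allowed(443, fixed_rules))      # True
```

Port 5005 falls inside the default range, which fits with everything except the custom metrics endpoint on 443 working before the change.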
That's fantastic, very happy that it helped.
I figured I'd ask because of the CVE from last week, I could have missed a cluster template update.
I'm going to close this but feel free to reach out should you have questions or issues.
Best,
.C
Thanks for the clear guidelines. I'm also using Terraform to provision the workers. I reference the upstream AWS CFN stack, but pass in references to security groups I create in Terraform, based on the official recommended security groups (which still only allow 1024-65535).
Adding a single rule to allow ingress on port 443 fixed it for me as well.
@george-miller / @CharlyF - how come metrics-server worked without having to open any additional ports and the datadog cluster agent metrics API didn't?
Both are Services of type ClusterIP listening on port 443:
NAME                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
datadog-cluster-agent               ClusterIP   172.20.99.230    <none>        5005/TCP   60m
datadog-cluster-agent-metrics-api   ClusterIP   172.20.56.224    <none>        443/TCP    60m
metrics-server                      ClusterIP   172.20.111.100   <none>        443/TCP    29h
EDIT: For me, I did not have port 443 open from control plane to workers and metrics-server was working... but after deploying datadog using the helm chart, it didn't work until I opened the security group between the control plane and the workers ... strange... I don't know where to start looking to understand this :(
I'm having the same issue. I followed the instructions here and here.
After creating the services, the apiservice and applying rbac, this is what I get when I run kubectl get apiservices:
v1beta1.external.metrics.k8s.io datadog/datadog-cluster-agent-metrics-api False (FailedDiscoveryCheck) 15m
When running kubectl describe apiservices v1beta1.external.metrics.k8s.io I get:
Conditions:
Last Transition Time: 2021-06-08T09:05:03Z
Message: failing or missing response from https://10.10.75.34:443/apis/external.metrics.k8s.io/v1beta1: Get "https://10.10.75.34:8443/apis/external.metrics.k8s.io/v1beta1": dial tcp 10.10.75.34:443: connect: connection refused
I made sure that the control plane security group has 8443 access to the data plane security group. I also made sure to expose 443 in the deployment manifest.
What could be the issue?
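One thing that may help narrow this down: the earlier report in this thread failed with a timeout, which is what a blocked security group typically looks like, while this one fails with `connection refused`, which usually means the packet arrived but nothing was listening on that port. A minimal sketch of a probe that tells the two apart, assuming it is run from somewhere with a network path to the pod (for example a worker node). `probe` is a hypothetical helper, and the demo only uses throwaway loopback sockets:

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connect and classify the result.

    "open"    -> something accepted the connection
    "refused" -> host reachable, but nothing is listening on the port
    "timeout" -> no answer at all (often a firewall or security group)
    "error"   -> any other socket error
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"


# Loopback demo: a listening socket reports "open"; once it is closed,
# the same port is normally reported as "refused".
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

while_listening = probe("127.0.0.1", demo_port)
listener.close()
after_close = probe("127.0.0.1", demo_port)
print(while_listening, after_close)
```

Against the address from the error above, that would be `probe("10.10.75.34", 443)`. A `refused` result would point at the port the container actually listens on rather than at the security group; declaring a `containerPort` in the manifest does not by itself make the process listen there.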
Whoever comes across this issue, and has followed the instructions and opened the relevant security groups (read above in this issue to learn more), but still couldn't get it to work - this is how I solved it.
It turns out that the datadog-cluster-agent service account was missing permissions.
Internal Server Error: "/test": subjectaccessreviews.authorization.k8s.io is forbidden: User "system:serviceaccount:datadog:datadog-cluster-agent" cannot create resource "subjectaccessreviews" in API group "authorization.k8s.io" at the cluster scope
It was also missing permissions to read configmaps.
So I added the following to the ClusterRole that is bound to this service account:
- verbs:
    - '*'
  apiGroups:
    - authorization.k8s.io
  resources:
    - '*'
- apiGroups:
    - ''
  resources:
    - services
    - endpoints
    - pods
    - nodes
    - namespaces
    - componentstatuses
    - configmaps
Note: I'm working on EKS, K8S version of v1.19.6-eks-49a6c0.
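To see why the error above appears and what the added rule changes, here is a simplified sketch of how RBAC rules are matched. It is an illustration only: it ignores `resourceNames` and `nonResourceURLs`, the helper names are made up, and `narrow_rule` is an untested assumption (based on the verb and resource named in the error message) that a tighter rule than `'*'` might be enough.

```python
def rule_allows(rule: dict, verb: str, api_group: str, resource: str) -> bool:
    """Simplified RBAC match: each field must list the value or the '*' wildcard."""
    def matches(values, wanted):
        return "*" in values or wanted in values

    return (
        matches(rule.get("verbs", []), verb)
        and matches(rule.get("apiGroups", []), api_group)
        and matches(rule.get("resources", []), resource)
    )


def allowed(rules: list, verb: str, api_group: str, resource: str) -> bool:
    """A request is allowed if any single rule allows it."""
    return any(rule_allows(r, verb, api_group, resource) for r in rules)


# Illustrative starting point: read-only access to a few core resources.
base_rules = [
    {"apiGroups": [""], "resources": ["services", "endpoints", "pods", "nodes"],
     "verbs": ["get", "list", "watch"]},
]

# The rule added in the comment above.
wide_rule = {"apiGroups": ["authorization.k8s.io"], "resources": ["*"], "verbs": ["*"]}

# Hypothetical tighter alternative, not tested in this thread.
narrow_rule = {"apiGroups": ["authorization.k8s.io"],
               "resources": ["subjectaccessreviews"], "verbs": ["create"]}

request = ("create", "authorization.k8s.io", "subjectaccessreviews")
print(allowed(base_rules, *request))                  # False
print(allowed(base_rules + [wide_rule], *request))    # True
print(allowed(base_rules + [narrow_rule], *request))  # True
```

The manifests at the top of this thread cover the same permission by binding the service account to `system:auth-delegator`, so it may be worth checking that this binding exists and points at the namespace the cluster agent actually runs in.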