Datadog-agent: [cluster-agent] APIService cannot connect to the cluster agent

Created on 10 Dec 2018 · 10 Comments · Source: DataDog/datadog-agent

Output of the info page (if this is a bug)

Getting the status from the agent.
==============================
Datadog Cluster Agent (v1.1.0)
==============================

  Status date: 2018-12-10 17:50:27.539631 UTC
  Pid: 1
  Check Runners: 4
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog-cluster.yaml
    conf.d: /etc/datadog-agent/conf.d

  Clocks
  ======
    System UTC time: 2018-12-10 17:50:27.539631 UTC

  Hostnames
  =========
    ec2-hostname: ip-172-29-155-132.ad.data.activision.com
    hostname: i-04a44a8ee90554219
    instance-id: i-04a44a8ee90554219
    socket-fqdn: datadog-cluster-agent-6f8dc64ccd-6skj5
    socket-hostname: datadog-cluster-agent-6f8dc64ccd-6skj5
    hostname provider: aws
    unused hostname providers:
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname

  Leader Election
  ===============
    Leader Election Status:  Running
    Leader Name is: datadog-cluster-agent-65cbf9844d-w4bm9
    Last Acquisition of the lease: Mon, 10 Dec 2018 17:49:08 UTC
    Renewed leadership: Mon, 10 Dec 2018 17:49:38 UTC
    Number of leader transitions: 9 transitions

  Custom Metrics Server
  =====================
    ConfigMap name: kube-system/datadog-custom-metrics

    External Metrics
    ----------------
      Total: 1
      Valid: 1

      hpa:
      - name: nginxext
      - namespace: default
      - uid: 64e03168-fc99-11e8-9eea-020490deafb0
      labels:
      - kube_container_name: nginx
      metricName: nginx.net.request_per_s
      ts: 1.544464144e+09
      valid: true
      value: 1


=========
Collector
=========

  Running Checks
  ==============

    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [OK]
      Total Runs: 5
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 100ms

=========
Forwarder
=========

  Transactions
  ============
    CheckRunsV1: 4
    Dropped: 0
    DroppedOnInput: 0
    Events: 0
    HostMetadata: 0
    IntakeV1: 1
    Metadata: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Series: 0
    ServiceChecks: 0
    SketchSeries: 0
    Success: 9
    TimeseriesV1: 4

  API Keys status
  ===============
    API key ending with 1ff8a on endpoint https://app.datadoghq.com: API Key valid

Describe what happened:
The APIService cannot query the datadog-custom-metrics-server. You can see the error here:

mercury-core $ kubectl describe APIService v1beta1.external.metrics.k8s.io 
Name:         v1beta1.external.metrics.k8s.io
Namespace:    
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"apiregistration.k8s.io/v1beta1","kind":"APIService","metadata":{"annotations":{},"name":"v1beta1.external.metrics.k8s.io"},...
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2018-12-10T17:29:04Z
  Resource Version:    6186546
  Self Link:           /apis/apiregistration.k8s.io/v1/apiservices/v1beta1.external.metrics.k8s.io
  UID:                 1733cbb1-fca1-11e8-94e3-069324fbe3cc
Spec:
  Ca Bundle:                 <nil>
  Group:                     external.metrics.k8s.io
  Group Priority Minimum:    100
  Insecure Skip TLS Verify:  true
  Service:
    Name:            datadog-custom-metrics-server
    Namespace:       kube-system
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2018-12-10T17:29:04Z
    Message:               no response from https://172.29.155.94:443: Get https://172.29.155.94:443: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
    Reason:                FailedDiscoveryCheck
    Status:                False
    Type:                  Available
Events:                    <none>

I was hoping it was a networking issue, but it seems that the cluster-agent responds badly to the request as well:

root@datadog-cluster-agent-6f8dc64ccd-6skj5:/# curl -vk  https://localhost:443 && echo
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Request CERT (13):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Certificate (11):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=localhost@1544464170
*  start date: Dec 10 17:49:30 2018 GMT
*  expire date: Dec 10 17:49:30 2019 GMT
*  issuer: CN=localhost-ca@1544464169
*  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55d899e53b00)
> GET / HTTP/2
> Host: localhost
> User-Agent: curl/7.62.0
> Accept: */*
> 
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
< HTTP/2 403 
< content-type: application/json
< x-content-type-options: nosniff
< content-length: 46
< date: Mon, 10 Dec 2018 17:56:56 GMT
< 
* Connection #0 to host localhost left intact
: no kind is registered for the type v1.Status

I'm not really sure what that last line means. Is it trying to query for Status objects? kubectl api-resources reveals there aren't any objects with that name, only ComponentStatus. I suppose it could also be an RBAC issue on my end?

The resulting test HPA looks like this (because the APIService is broken):

NAME       REFERENCE          TARGETS             MINPODS   MAXPODS   REPLICAS   AGE
nginxext   Deployment/nginx   <unknown>/9 (avg)   1         3         1          1h

Describe what you expected:
I expected the APIService to correctly communicate with the cluster-agent.

Steps to reproduce the issue:
I followed the great guides you have written, and the related YAML is here:

apiVersion: v1
data:
  event.tokenKey: "0"
kind: ConfigMap
metadata:
  name: datadogtoken
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: datadog-cluster-agent
rules:
- apiGroups:
  - ""
  resources:
  - services
  - events
  - endpoints
  - pods
  - nodes
  - componentstatuses
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - "autoscaling"
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  resourceNames:
  - datadogtoken             # Kubernetes event collection state
  - datadog-leader-election  # Leader election token
  verbs:
  - get
  - update
- apiGroups:  # To create the leader election token
  - ""
  resources:
  - configmaps
  verbs:
  - create
  - get
  - update
- nonResourceURLs:
  - "/version"
  - "/healthz"
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: datadog-cluster-agent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: datadog-cluster-agent
subjects:
- kind: ServiceAccount
  name: datadog-cluster-agent
  namespace: kube-system
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: datadog-cluster-agent
  namespace: kube-system
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: datadog-cluster-agent
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        app: datadog-cluster-agent
      name: datadog-agent
      annotations:
        ad.datadoghq.com/datadog-cluster-agent.check_names: '["prometheus"]'
        ad.datadoghq.com/datadog-cluster-agent.init_configs: '[{}]'
        ad.datadoghq.com/datadog-cluster-agent.instances: '[{"prometheus_url": "http://%%host%%:5000/metrics","namespace": "datadog.cluster_agent","metrics": ["go_goroutines","go_memstats_*","process_*","api_requests","datadog_requests","external_metrics"]}]'
    spec:
      serviceAccountName: datadog-cluster-agent
      containers:
      - image: datadog/cluster-agent:latest
        imagePullPolicy: Always
        ports:
        - containerPort: 5005
        - containerPort: 443
        name: datadog-cluster-agent
        env:
          - name: DD_API_KEY
            valueFrom:
              secretKeyRef:
                name: datadog-secrets
                key: api-key
          - name: DD_APP_KEY
            valueFrom:
              secretKeyRef:
                name: datadog-secrets
                key: app-key
          - name: DD_COLLECT_KUBERNETES_EVENTS
            value: "true"
          - name: DD_LEADER_ELECTION
            value: "true"
          - name: DD_EXTERNAL_METRICS_PROVIDER_ENABLED
            value: 'true'
          - name: DD_CLUSTER_AGENT_AUTH_TOKEN
            valueFrom:
              secretKeyRef:
                name: datadog-secrets
                key: cluster-token
---
apiVersion: v1
kind: Service
metadata:
  name: datadog-cluster-agent
  labels:
    app: datadog-cluster-agent
spec:
  ports:
  - port: 5005 # Has to be the same as the one exposed in the DCA. Default is 5005.
    protocol: TCP
  selector:
    app: datadog-cluster-agent
---





# HPA stuff

kind: Service
apiVersion: v1
metadata:
  name: datadog-custom-metrics-server
  namespace: kube-system
spec:
  selector:
    app: datadog-cluster-agent
  ports:
  - protocol: TCP
    port: 443
    targetPort: 443
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:auth-delegator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:auth-delegator
subjects:
- kind: ServiceAccount
  name: datadog-cluster-agent
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: datadog-cluster-agent
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: extension-apiserver-authentication-reader
subjects:
- kind: ServiceAccount
  name: datadog-cluster-agent
  namespace: kube-system
---
apiVersion: apiregistration.k8s.io/v1beta1
kind: APIService
metadata:
  name: v1beta1.external.metrics.k8s.io
spec:
  insecureSkipTLSVerify: true
  group: external.metrics.k8s.io
  groupPriorityMinimum: 100
  versionPriority: 100
  service:
    name: datadog-custom-metrics-server
    namespace: kube-system
  version: v1beta1
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: external-metrics-reader
rules:
- apiGroups:
  - "external.metrics.k8s.io"
  resources:
  - "*"
  verbs:
  - list
  - get
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: external-metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-metrics-reader
subjects:
- kind: ServiceAccount
  name: horizontal-pod-autoscaler
  namespace: kube-system

Additional environment details (Operating System, Cloud provider, etc):

I'm running on AWS EKS; here are my versions:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.3", GitCommit:"435f92c719f279a3a67808c80521ea17d5715c66", GitTreeState:"clean", BuildDate:"2018-11-27T01:14:37Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.11-eks", GitCommit:"6bf27214b7e3e1e47dce27dcbd73ee1b27adadd0", GitTreeState:"clean", BuildDate:"2018-12-04T13:33:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

All 10 comments

Hey @george-miller - Thanks for opening this up, and apologies for the headache.
This is strange indeed, I tested it last week and it was working correctly.
Thanks for sharing these outputs, I'll try to reproduce on my end and keep you posted.

Best,
.C

Thanks for the quick reply. It's not really a huge headache because everything else still works great (event collection etc.); only the custom metrics stuff doesn't work. Happy to give you more info on my setup or whatever you need.

Appreciate you making this cool tool! Your docs made it super easy.

@george-miller, thanks for the feedback!
It looks like the API Server is not able to reach the cluster agent. I faced a similar issue in the past on EKS, and this solved it:
https://github.com/kubernetes-incubator/metrics-server/issues/45#issuecomment-421345121

I thought it had been fixed upstream, but I could be wrong. Did you just create the cluster?
I'll keep digging in the meantime.

Best,
.C

@CharlyF
It works!

$ kubectl describe  APIService v1beta1.external.metrics.k8s.io
...
Status:
  Conditions:
    Last Transition Time:  2018-12-10T19:27:27Z
    Message:               all checks passed
    Reason:                Passed
    Status:                True
    Type:                  Available

You're the best! Thank you so much! Slightly disappointed that I didn't realize this, in hindsight it makes a whole lot of sense though.

For anyone else looking at this, the EKS CloudFormation template only allows the masters to communicate with the worker nodes on ports 1024-65535. You have to either allow all traffic or add another rule for port 443.

To answer your question: no, I created my cluster a couple of months ago. We ported the CloudFormation to Terraform, so maybe AWS has since updated their CloudFormation to include this; not sure.
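
For reference, the extra port 443 rule described above could be added with the AWS CLI roughly as follows. This is only a sketch: both security group IDs are placeholders, and the helper name sg_rule_cmd is made up for illustration. It prints the command instead of running it, so it can be reviewed first.

```shell
# Placeholder IDs (assumptions): substitute the worker-node and
# control-plane security groups of your own cluster.
WORKER_SG="${WORKER_SG:-sg-0123456789abcdef0}"
CONTROL_PLANE_SG="${CONTROL_PLANE_SG:-sg-0fedcba9876543210}"

# Hypothetical helper: print the AWS CLI call that lets the control plane
# reach the worker nodes on one TCP port (443 if no port is given).
# Arguments: worker security group, control-plane security group, port.
sg_rule_cmd() {
  printf 'aws ec2 authorize-security-group-ingress --group-id %s --protocol tcp --port %s --source-group %s\n' \
    "$1" "${3:-443}" "$2"
}

# Print the command for review; nothing is changed in AWS at this point.
sg_rule_cmd "$WORKER_SG" "$CONTROL_PLANE_SG" 443
```

Once the printed command looks right for your account, it can be pasted into a shell that has AWS credentials configured. If the security groups are managed in Terraform or CloudFormation, the same rule is better added there so it is not lost on the next apply.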

That's fantastic, very happy that it helped.
I figured I'd ask because of the CVE from last week, I could have missed a cluster template update.

I'm going to close this but feel free to reach out should you have questions or issues.

Best,
.C

Thanks for the clear guidelines. I'm also using Terraform to provision the workers. I do reference the upstream AWS CFN stack, but pass in references to security groups I create in Terraform based on the official recommended security groups (which still only allow 1024-65535).

Adding a single rule to allow ingress on port 443 fixed it for me as well.

@george-miller / @CharlyF - how come metrics-server worked without having to open any additional ports and the Datadog cluster agent metrics API didn't?

Aren't both Services of type ClusterIP listening on port 443?

NAME                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)         AGE
datadog-cluster-agent               ClusterIP   172.20.99.230    <none>        5005/TCP        60m
datadog-cluster-agent-metrics-api   ClusterIP   172.20.56.224    <none>        443/TCP         60m
metrics-server                      ClusterIP   172.20.111.100   <none>        443/TCP         29h

EDIT: In my case, port 443 was not open from the control plane to the workers, and metrics-server was working. After deploying Datadog using the Helm chart, though, its metrics API didn't work until I opened the security group between the control plane and the workers. Strange... I don't know where to start looking to understand this :(
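
One thing that might be worth checking here: the error at the top of this thread shows the API server dialing an address on port 443 directly, so the port that has to pass the security group is probably the port the pod listens on (the Service's targetPort), not the Service port shown by kubectl get svc. The sketch below assumes the default 1024-65535 rule is the only one in place; the helper names are made up for illustration.

```shell
# Port range the worker security group accepts from the control plane,
# per the thread above (assumption: no other ingress rule exists).
ALLOWED_MIN=1024
ALLOWED_MAX=65535

# Hypothetical helper: succeed if a numeric port is inside the allowed range.
port_allowed() {
  [ "$1" -ge "$ALLOWED_MIN" ] && [ "$1" -le "$ALLOWED_MAX" ]
}

# Hypothetical helper: print a verdict for one port.
check_port() {
  if port_allowed "$1"; then
    echo "port $1: allowed"
  else
    echo "port $1: blocked"
  fi
}

# Feed it the targetPort of each Service, for example the output of:
#   kubectl -n <namespace> get svc metrics-server -o jsonpath='{.spec.ports[0].targetPort}'
# (targetPort may be a named port; in that case look up the containerPort).
check_port 443
check_port 8443
```

If metrics-server turns out to listen on a high target port in your cluster while the cluster agent listens on 443, that would explain why only the latter needed a new rule.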

I'm having the same issue. I followed the instructions here and here.

After creating the services and the apiservice and applying the RBAC, this is what I get when I run kubectl get apiservices:

v1beta1.external.metrics.k8s.io        datadog/datadog-cluster-agent-metrics-api   False (FailedDiscoveryCheck)   15m

When running kubectl describe apiservices v1beta1.external.metrics.k8s.io I get:

Conditions:
    Last Transition Time:  2021-06-08T09:05:03Z
    Message:               failing or missing response from https://10.10.75.34:443/apis/external.metrics.k8s.io/v1beta1: Get "https://10.10.75.34:8443/apis/external.metrics.k8s.io/v1beta1": dial tcp 10.10.75.34:443: connect: connection refused

I made sure that the control plane security group has 8443 access to the data plane security group. I also made sure to expose 443 in the deployment manifest.

What could be the issue?

For whoever comes across this issue, has followed the instructions and opened the relevant security groups (read above in this issue to learn more), but still can't get it to work: this is how I solved it.

It turns out that the datadog-cluster-agent service account was missing permissions.

Internal Server Error: "/test": subjectaccessreviews.authorization.k8s.io is forbidden: User "system:serviceaccount:datadog:datadog-cluster-agent" cannot create resource "subjectaccessreviews" in API group "authorization.k8s.io" at the cluster scope

It was also missing permissions to read configmaps.

So I added the following to the ClusterRole that is bound to this service account:

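  # Lets the cluster agent create SubjectAccessReviews.
  - verbs:
      - '*'
    apiGroups:
      - authorization.k8s.io
    resources:
      - '*'
  # Read access to core resources, including configmaps.
  # The verbs for this rule were not shown in the original snippet;
  # read-only verbs are assumed here.
  - verbs:
      - get
      - list
      - watch
    apiGroups:
      - ''
    resources:
      - services
      - endpoints
      - pods
      - nodes
      - namespaces
      - componentstatuses
      - configmaps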

Note: I'm working on EKS, Kubernetes version v1.19.6-eks-49a6c0.
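
To confirm whether the service account has these permissions before and after changing the ClusterRole, something like the following could be used. It is a sketch: the service account name is taken from the error message above, and the helper can_i_cmd is made up for illustration. It only prints the kubectl auth can-i commands so they can be run against the right cluster context.

```shell
# Service account from the error message above; adjust the namespace and
# name if your deployment differs.
SA="system:serviceaccount:datadog:datadog-cluster-agent"

# Hypothetical helper: print a `kubectl auth can-i` check for one
# verb/resource pair, impersonating the given user.
can_i_cmd() {
  printf 'kubectl auth can-i %s %s --as=%s\n' "$1" "$2" "$3"
}

# The two permissions reported as missing in this comment.
can_i_cmd create subjectaccessreviews.authorization.k8s.io "$SA"
can_i_cmd get configmaps "$SA"
```

Each printed command, when run, should answer yes or no. Running them requires that your own user is allowed to impersonate service accounts.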
