Datadog-agent: One Does not Simply... Configure datadog for JMX autodiscovery in Kubernetes

Created on 26 Mar 2018 · 7 Comments · Source: DataDog/datadog-agent

Our goal is to get Datadog to automatically monitor the usual JMX metrics for the Java containers in our cluster. This is pretty much the standard JMX integration, moved to k8s. It shouldn't be all that difficult: just monitor a given port for JMX on any container that has it. But we can't get it working.

Reading https://docs.datadoghq.com/agent/autodiscovery/#template-source-kubernetes-pod-annotations, we understand that there are two ways to configure things:

(1) Using auto-conf. This would require us to use a special container name for all our containers. Since we run about 10 different applications, we can't use the same image for all of them, so this doesn't work.

(2) Using Kubernetes pod annotations. On the surface, this might work. The problem is that the JMX configuration annotations are quite long. The Apache example is already a bit much to stuff into a JSON annotation value, but even a basic JMX configuration is really long (see below). Even if it works, it would require us to duplicate the JMX configuration into an annotation on every k8s deployment we have, which is horrible.

Using annotations is reasonable, but it would be much better to use them to tag a container as needing a check, not to include the entire configuration.

Is it possible to use an annotation like this:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    ad.datadoghq.com/jmx.check_names: '["jmx"]'
    ad.datadoghq.com/jmx.init_configs: '[{}]'

Can it be combined with an autoconf file like the one below (basically the standard JMX configuration file) to check JMX on containers, without the need to stash a 100-line JSON file into the value of an annotation?

  conf:
    - include:
        type: ThreadPool
        attribute:
          maxThreads:
            alias: tomcat.threads.max
            metric_type: gauge
          currentThreadCount:
            alias: tomcat.threads.count
            metric_type: gauge
          currentThreadsBusy:
            alias: tomcat.threads.busy
            metric_type: gauge
    - include:
        domain: java.lang
        type: MemoryPool
        attribute:
          Usage.used:
            alias: jvm.memory_pool.used
            metric_type: gauge
          Usage.max:
            alias: jvm.memory_pool.max
            metric_type: gauge
          Usage.init:
            alias: jvm.memory_pool.init
            metric_type: gauge
          Usage.committed:
            alias: jvm.memory_pool.committed
            metric_type: gauge
    - include:
        domain: java.lang
        type: GarbageCollector
        attribute:
          CollectionCount:
            alias: jvm.gc.count
            metric_type: gauge
          CollectionTime:
            alias: jvm.gc.time
            metric_type: gauge
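
One way to avoid hand-copying the JSON into every manifest might be to generate the annotation values from a single shared definition, for example in a templating or deploy script. The following is only a minimal Python sketch of that idea. The helper name `build_jmx_annotations` is hypothetical, the `jmx_url` form is borrowed from the working example posted later in this thread, and placing `conf` inside `init_configs` is an assumption that has not been verified here.

```python
import json

# Shared JMX filter list, written once. The single entry mirrors the
# ThreadPool filter from the configuration above (shortened).
SHARED_JMX_CONF = [
    {
        "include": {
            "type": "ThreadPool",
            "attribute": {
                "maxThreads": {
                    "alias": "tomcat.threads.max",
                    "metric_type": "gauge",
                },
            },
        }
    }
]


def build_jmx_annotations(container_name, jmx_port, conf):
    """Return the three autodiscovery annotations for one container.

    Hypothetical helper. Putting `conf` under init_configs is an
    assumption, not something confirmed in this thread.
    """
    prefix = f"ad.datadoghq.com/{container_name}"
    instance = {"jmx_url": f"service:jmx:rmi://%%host%%:{jmx_port}/"}
    return {
        f"{prefix}.check_names": json.dumps(["jmx"]),
        f"{prefix}.init_configs": json.dumps([{"conf": conf}]),
        f"{prefix}.instances": json.dumps([instance]),
    }


if __name__ == "__main__":
    # Print the annotations in a form that can be pasted into a manifest.
    annotations = build_jmx_annotations("dashboard", 7199, SHARED_JMX_CONF)
    for key, value in annotations.items():
        print(f"{key}: '{value}'")
```

This does not remove the long annotation value from the rendered manifest; it only keeps a single source of truth for it, so each deployment needs just a container name and a port.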

dcowden

👍2

All 7 comments

Did you succeed in monitoring JMX in k8s? I'm also trying to do this.

guizmaii on 11 Jun 2018

Hi, @guizmaii

No, unfortunately I didn't. I ended up tired and frustrated; it just doesn't work well. I should also note that Datadog's lack of response on this issue was super frustrating.

The path we ended up taking was somewhat unusual. We instrumented all of our applications using Prometheus, and then used Datadog's Prometheus integration to pull the metrics in. It was still a struggle to get Datadog to read the k8s API, but once we had that going, this pretty much just worked.

This provided several other benefits too:

We can instrument all of our applications the same way. JMX is Java-only, but we also support Node.js and Python projects.
We have a future path to replace Datadog with Prometheus. DD is not cheap, and this way we don't have to touch all our apps if we want to switch.
Pull is better. Push monitoring is super frustrating when it doesn't work. We've found that exposing metrics as an API and then pulling them is much easier to scale and manage. The apps don't have to know anything about monitoring or where it lives. At the end of the day, it's really valuable that you can always get metrics with a web browser by visiting http://app:monitoringport/metrics
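
To make the pull model concrete, here is a minimal sketch of an application exposing gauges at a `/metrics` path in the Prometheus text exposition format. It uses only the Python standard library as a stand-in for the Prometheus client libraries mentioned in this comment; the metric names, the `METRICS` dictionary, and the `serve` helper are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process gauge values; a real app would update these.
METRICS = {"app_requests_in_flight": 3, "app_worker_threads": 8}


def render_metrics(metrics):
    """Render gauges in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Accept both /metrics and /metrics/ and reject everything else.
        if self.path.rstrip("/") != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Block and serve the metrics endpoint on the given port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` and opening http://localhost:8080/metrics in a browser should show the rendered text, which is the property described above: the app only publishes numbers, and whatever collects them is configured elsewhere.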

After running the datadog agent in-cluster, and then using the java prometheus client to expose metrics, you add annotations to your manifest like this:

spec:
  replicas: {{ .Values.replicaCount }}
  template:
    metadata:
      annotations:
        ad.datadoghq.com/ptplace-bff.check_names: '["prometheus"]'
        ad.datadoghq.com/ptplace-bff.init_configs: '[{}]'
        ad.datadoghq.com/ptplace-bff.instances: '[ { "prometheus_url": "http://%%host%%:8080/metrics/", "namespace": "ptplace-bff", "metrics": [ "*" ] } ]'

dcowden on 11 Jun 2018

I succeeded in instrumenting my JVMs via JMX in k8s.

Here's the config I have for my containers:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dashboard
  labels:
    app: dashboard
spec:
  serviceName: dashboard
  replicas: 3
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: dashboard
  template:
    metadata:
      annotations:
        # Annotations should have this format: `ad.datadoghq.com/<container_name>.check_names`
        ad.datadoghq.com/dashboard.check_names: '["jmx"]'
        ad.datadoghq.com/dashboard.init_configs: '[{}]'
        ad.datadoghq.com/dashboard.instances: '[{"jmx_url": "service:jmx:rmi://%%host%%:7199/"}]'
      labels:
        app: dashboard
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: dashboard  # this is the `container_name` you should use in the annotations.
..............
          ports:
            - containerPort: 7199
              name: jmx_port

And here's the values.yml file I use to deploy the Datadog chart in k8s:

# Copied from here: https://github.com/kubernetes/charts/blob/master/stable/datadog/values.yaml

# Default values for datadog.
image:
  # This chart is compatible with different images, please choose one
  repository: datadog/agent               # Agent6
  # repository: datadog/dogstatsd         # Standalone DogStatsD6
  # repository: datadog/docker-dd-agent   # Agent5
  tag: 6.2.1-jmx # Use 6.2.1-jmx to enable jmx fetch collection
  pullPolicy: IfNotPresent

# NB! Normally you need to keep Datadog DaemonSet enabled!
# The exceptional case could be a situation when you need to run
# single DataDog pod per every namespace, but you do not need to
# re-create a DaemonSet for every non-default namespace install.
# Note, that StatsD and DogStatsD work over UDP, so you may not
# get guaranteed delivery of the metrics in Datadog-per-namespace setup!
daemonset:
  enabled: true
  ## Bind ports on the hostNetwork. Useful for CNI networking where hostPort might
  ## not be supported. The ports will need to be available on all hosts. Can be
  ## used for custom metrics instead of a service endpoint.
  ## WARNING: Make sure that hosts using this are properly firewalled otherwise
  ## metrics and traces will be accepted from any host able to connect to this host.
  # useHostNetwork: true

  ## Sets the hostPort to the same value as the container port. Can be used
  ## for sending custom metrics. The ports will need to be available on all
  ## hosts.
  ## WARNING: Make sure that hosts using this are properly firewalled otherwise
  ## metrics and traces will be accepted from any host able to connect to this host.
  useHostPort: true

  ## Annotations to add to the DaemonSet's Pods
  # podAnnotations:
  #   scheduler.alpha.kubernetes.io/tolerations: '[{"key": "example", "value": "foo"}]'

  ## Allow the DaemonSet to schedule on tainted nodes (requires Kubernetes >= 1.6)
  # tolerations: []

  ## Allow the DaemonSet to schedule on selected nodes
  # Ref: https://kubernetes.io/docs/user-guide/node-selection/
  # nodeSelector: {}

  ## Allow the DaemonSet to schedule using affinity rules
  # Ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
  # affinity: {}

  ## Allow the DaemonSet to perform a rolling update on helm update
  ## ref: https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/
  # updateStrategy: RollingUpdate

# Apart from DaemonSet, deploy Datadog agent pods and related service for
# applications that want to send custom metrics. Provides DogStatsD service.
#
# HINT: If you want to use datadog.collectEvents, keep deployment.replicas set to 1.
deployment:
  enabled: false
  replicas: 1
  # Affinity for pod assignment
  # Ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
  affinity: {}
  # Tolerations for pod assignment
  # Ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
  tolerations: []

## deploy the kube-state-metrics deployment
## ref: https://github.com/kubernetes/charts/tree/master/stable/kube-state-metrics
##
kubeStateMetrics:
  enabled: true

datadog:
  ## You'll need to set this to your Datadog API key before the agent will run.
  ## ref: https://app.datadoghq.com/account/settings#agent/kubernetes
  ##
  apiKey: 'TOCHANGE'

  ## dd-agent container name
  ##
  name: dd-agent

  ## Set logging verbosity.
  ## ref: https://github.com/DataDog/docker-dd-agent#environment-variables
  ## Note: For Agent6 (image `datadog/agent`) the valid log levels are
  ## trace, debug, info, warn, error, critical, and off
  ##
  logLevel: WARNING

  ## Un-comment this to make each node accept non-local statsd traffic.
  ## ref: https://github.com/DataDog/docker-dd-agent#environment-variables
  ##
  # nonLocalTraffic: true

  ## Set host tags.
  ## ref: https://github.com/DataDog/docker-dd-agent#environment-variables
  ##
  # tags:

  ## Enables event collection from the kubernetes API
  ## ref: https://github.com/DataDog/docker-dd-agent#environment-variables
  ##
  collectEvents: true

  ## Un-comment this to enable APM and tracing, on ports 7777 and 8126
  ## ref: https://github.com/DataDog/docker-dd-agent#tracing-from-the-host
  ##
  apmEnabled: true

  ## The dd-agent supports many environment variables
  ## ref: https://github.com/DataDog/docker-dd-agent#environment-variables
  ##
  env:
    - name: DD_PROCESS_AGENT_ENABLED # https://docs.datadoghq.com/guides/process/
      value: "true"
    - name: DD_LOGS_ENABLED # https://app.datadoghq.com/logs/onboarding/container
      value: "true"
    - name: DD_LEADER_ELECTION # https://github.com/DataDog/datadog-agent/tree/master/Dockerfiles/agent#kubernetes-integration
      value: "true"
    - name: DD_COLLECT_KUBERNETES_EVENTS # https://github.com/DataDog/datadog-agent/tree/master/Dockerfiles/agent#event-collection
      value: "true"
    - name: SD_JMX_ENABLE # https://docs.datadoghq.com/agent/faq/docker-jmx/
      value: "true"

  ## The dd-agent supports detailed process and container monitoring and
  ## requires control over the volume and volumeMounts for the daemonset
  ## or deployment.
  ## ref: https://docs.datadoghq.com/guides/process/
  ##
  volumes:
    - hostPath:
        path: /etc/passwd
      name: passwd
  volumeMounts:
    - name: passwd
      mountPath: /etc/passwd
      readOnly: true

  ## Enable leader election mechanism for event collection
  ##
  leaderElection: true

  ## Set the lease time for leader election
  ##
  # leaderLeaseDuration: 600

  ## Provide additional service definitions
  ## Each key will become a file in /conf.d/auto_conf
  ## ref: https://github.com/DataDog/docker-dd-agent#configuration-files
  ##
  # autoconf:
  #   kubernetes_state.yaml: |-
  #     docker_images:
  #       - kube-state-metrics
  #     init_config:
  #     instances:
  #       - kube_state_url: http://%%host%%:%%port%%/metrics

  ## Provide additional service definitions
  ## Each key will become a file in /conf.d
  ## ref: https://github.com/DataDog/docker-dd-agent#configuration-files
  ##
  confd:
  #   redisdb.yaml: |-
  #     init_config:
  #     instances:
  #       - host: "name"
  #         port: "6379"
  # https://app.datadoghq.com/logs/onboarding/container
  # https://github.com/DataDog/datadog-agent/tree/master/Dockerfiles/agent#configuration-file-example
    logs.yaml: |-
      init_config:
      instances:
        [{}]
      logs:
        - type: docker
          service: myapp
          source: myapp-logs

  ## Provide additional service checks
  ## Each key will become a file in /checks.d
  ## ref: https://github.com/DataDog/docker-dd-agent#configuration-files
  ##
  # checksd:
  #   service.py: |-

  ## datadog-agent resource requests and limits
  ## Make sure to keep requests and limits equal to keep the pods in the Guaranteed QoS class
  ## Ref: http://kubernetes.io/docs/user-guide/compute-resources/
  ##
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 200m
      memory: 256Mi

rbac:
  ## If true, create & use RBAC resources
  create: false

  ## Ignored if rbac.create is true
  serviceAccountName: default

tolerations: []

kube-state-metrics:
  rbac:
    create: false

    ## Ignored if rbac.create is true
    serviceAccountName: default

All credit goes to someone called C8n on the Datadog Slack.

If you have more questions, that Slack is useful!

guizmaii on 12 Jun 2018

👍4

@guizmaii thanks for posting a working solution! My ship has already sailed, but I'm going to close this issue because you've solved it.

dcowden on 12 Jun 2018

🎉1

On AWS ECS I was able to get this equivalent thing to work:

"dockerLabels": {
    "com.datadoghq.ad.check_names": "[\"jmx\"]",
    "com.datadoghq.ad.instances": "[ {\"host\": \"localhost\", \"port\":\"9000\"}]",
    "com.datadoghq.ad.init_configs": "[{}]"
}

but not this:

"dockerLabels": {
    "com.datadoghq.ad.check_names": "[\"jmx\"]",
    "com.datadoghq.ad.instances": "[ {\"host\": \"%%host%%\", \"port\":\"9000\"}]",
    "com.datadoghq.ad.init_configs": "[{}]"
}

nor variants with jmx_url
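
The two label sets above differ only in the host value, so quoting is probably not the culprit, but with JSON nested inside JSON strings it is cheap to rule out. Below is a minimal Python sketch of such a check. The helper `check_ad_labels` is hypothetical, and the rule that the three lists should have equal length is an assumption about how the labels are matched up, not something confirmed in this thread.

```python
import json

AD_PREFIX = "com.datadoghq.ad."


def check_ad_labels(labels):
    """Return a list of problems found in autodiscovery label values."""
    problems = []
    parsed = {}
    for field in ("check_names", "init_configs", "instances"):
        raw = labels.get(AD_PREFIX + field)
        if raw is None:
            problems.append(f"missing label {AD_PREFIX}{field}")
            continue
        try:
            value = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append(f"{field}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(value, list):
            problems.append(f"{field}: expected a JSON list")
            continue
        parsed[field] = value
    # Assumption: the three lists are parallel, one entry per check.
    lengths = {len(v) for v in parsed.values()}
    if len(parsed) == 3 and len(lengths) > 1:
        problems.append("check_names, init_configs and instances differ in length")
    return problems


if __name__ == "__main__":
    labels = {
        "com.datadoghq.ad.check_names": "[\"jmx\"]",
        "com.datadoghq.ad.instances": "[ {\"host\": \"%%host%%\", \"port\":\"9000\"}]",
        "com.datadoghq.ad.init_configs": "[{}]",
    }
    print(check_ad_labels(labels))
```

An empty result only says the labels are well-formed; it says nothing about whether `%%host%%` is resolved to a reachable address on ECS.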

DaveWHarvey on 25 Oct 2018

I'm really curious whether the configuration parameter SD_JMX_ENABLE is actually needed. Searching the code in the repo turns up no trace of it other than some legacy tests. I suspect that JMX is enabled by default via SD if you run the agent image with the -jmx suffix.