Datadog-agent: Agent v6.5.2 broken logs from docker

Created on 28 Sep 2018  ·  21 Comments  ·  Source: DataDog/datadog-agent

Output of the info page (if this is a bug)

==============
Agent (v6.5.2)
==============

  Status date: 2018-09-28 09:34:53.577571 UTC
  Pid: 808
  Python Version: 2.7.15
  Logs: 
  Check Runners: 4
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: -346µs
    System UTC time: 2018-09-28 09:34:53.577571 UTC

  Host Info
  =========
    bootTime: 2018-09-27 19:59:30.000000 UTC
    kernelVersion: 3.10.0-693.2.2.el7.x86_64
    os: linux
    platform: centos
    platformFamily: rhel
    platformVersion: 7.4.1708
    procs: 148
    uptime: 11s

  Hostnames
  =========
    hostname: k***s.com
    socket-fqdn: h***2.hostwindsdns.com.
    socket-hostname: h***2.hostwindsdns.com
    hostname provider: configuration

=========
Collector
=========

  Running Checks
  ==============

    cpu
    ---
        Instance ID: cpu [OK]
        Total Runs: 3,261
        Metric Samples: 6, Total: 19,560
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s


    disk (1.3.0)
    ------------
        Instance ID: disk:e5dffb8bef24336f [OK]
        Total Runs: 3,261
        Metric Samples: 58, Total: 175,170
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 40ms


    file_handle
    -----------
        Instance ID: file_handle [OK]
        Total Runs: 3,260
        Metric Samples: 5, Total: 16,300
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s


    io
    --
        Instance ID: io [OK]
        Total Runs: 3,261
        Metric Samples: 26, Total: 84,768
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 3ms


    load
    ----
        Instance ID: load [OK]
        Total Runs: 3,260
        Metric Samples: 6, Total: 19,560
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s


    memory
    ------
        Instance ID: memory [OK]
        Total Runs: 3,261
        Metric Samples: 17, Total: 55,437
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s


    network (1.6.1)
    ---------------
        Instance ID: network:2a218184ebe03606 [OK]
        Total Runs: 3,261
        Metric Samples: 32, Total: 104,340
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 1ms


    ntp
    ---
        Instance ID: ntp:b4579e02d1981c12 [OK]
        Total Runs: 3,261
        Metric Samples: 1, Total: 3,261
        Events: 0, Total: 0
        Service Checks: 1, Total: 3,261
        Average Execution Time : 31ms


    uptime
    ------
        Instance ID: uptime [OK]
        Total Runs: 3,261
        Metric Samples: 1, Total: 3,261
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  CheckRunsV1: 3,260
  Dropped: 0
  DroppedOnInput: 0
  Errors: 81
  Events: 0
  HostMetadata: 0
  IntakeV1: 249
  Metadata: 0
  Requeued: 87
  Retried: 82
  RetryQueueSize: 0
  Series: 0
  ServiceChecks: 0
  SketchSeries: 0
  Success: 6,769
  TimeseriesV1: 3,260

  API Keys status
  ===============
    API key ending in 24fb5 for endpoint https://app.datadoghq.com: API Key valid

==========
Logs Agent
==========

  custom
  ------
    Type: docker
    Name: front-container
    Status: Pending

=========
DogStatsD
=========

  Checks Metric Sample: 533,936
  Event: 1
  Events Flushed: 1
  Number Of Flushes: 3,260
  Series Flushed: 420,505
  Service Check: 29,401
  Service Checks Flushed: 32,652


Describe what happened:
The Logs Agent doesn't collect logs; the source status stays at Pending.

Describe what you expected:
Agent (v6.4.2) works as expected (installed on same server 4/9/2018 13:08)

  ------
    Type: docker
    Name: front-container
    Status: OK
    Inputs: 99b00c7d6467b686ce83333dfb86e5297cd20cd1810b99e0ac32dd218cadade1 

Steps to reproduce the issue:
Agent (v6.4.2), Docker version 18.06.1-ce, build e68fc7a - working
Agent (v6.5.2), Docker version 18.06.1-ce, build e68fc7a - not working

/etc/datadog-agent/conf.d/custom.yaml

logs:
  - type: docker
    name: front-container
    source: nginx
    service: docker

datadog.yaml

dd_url: https://app.datadoghq.com

api_key: a***5

hostname: k***s.com

tags:
  - role:shop-front

logs_enabled: true

Additional environment details (Operating System, Cloud provider, etc):
Centos 7, Docker version 18.06.1-ce
Digital Ocean/Hostwindsdns - same problem

Most helpful comment

Also seeing a similar problem with v6.5.2. We are using Docker labels on our app containers to configure the Datadog container agent on hosts running Docker version 18.06.1-ce. Seems like Datadog is having a problem with parsing container labels, which didn't change when we upgraded Datadog

[ AGENT ] 2018-09-27 22:54:20 UTC | INFO | (file.go:70 in Collect) | File: searching for configuration files at: /etc/datadog-agent/conf.d
[ AGENT ] 2018-09-27 22:54:20 UTC | INFO | (tailer.go:86 in Start) | Start tailing container: e***1
[ AGENT ] 2018-09-27 22:54:20 UTC | INFO | (file.go:70 in Collect) | File: searching for configuration files at: /opt/datadog-agent/bin/agent/dist/conf.d
[ AGENT ] 2018-09-27 22:54:20 UTC | WARN | (file.go:74 in Collect) | Skipping, open /opt/datadog-agent/bin/agent/dist/conf.d: no such file or directory
[ AGENT ] 2018-09-27 22:54:20 UTC | INFO | (file.go:70 in Collect) | File: searching for configuration files at:
[ AGENT ] 2018-09-27 22:54:20 UTC | WARN | (file.go:74 in Collect) | Skipping, open : no such file or directory
[ AGENT ] 2018-09-27 22:54:20 UTC | ERROR | (docker.go:126 in parseDockerLabels) | Can't parse template for container c***9: missing instances key
[ AGENT ] 2018-09-27 22:54:20 UTC | ERROR | (docker.go:126 in parseDockerLabels) | Can't parse template for container d***f: missing instances key
[ AGENT ] 2018-09-27 22:54:20 UTC | ERROR | (docker.go:126 in parseDockerLabels) | Can't parse template for container 8***5: missing instances key
[ AGENT ] 2018-09-27 22:54:20 UTC | WARN | (check.go:250 in Configure) | could not get a check instance with the new api: __init__() takes at least 4 arguments (4 given)
[ AGENT ] 2018-09-27 22:54:20 UTC | WARN | (check.go:251 in Configure) | trying to instantiate the check with the old api, passing agentConfig to the constructor
[ AGENT ] 2018-09-27 22:54:20 UTC | WARN | (check.go:276 in Configure) | passing `agentConfig` to the constructor is deprecated, please use the `get_config` function from the 'datadog_agent' package (disk).
[ AGENT ] 2018-09-27 22:54:20 UTC | WARN | (check.go:250 in Configure) | could not get a check instance with the new api: __init__() takes at least 4 arguments (4 given)
[ AGENT ] 2018-09-27 22:54:20 UTC | WARN | (check.go:251 in Configure) | trying to instantiate the check with the old api, passing agentConfig to the constructor
[ AGENT ] 2018-09-27 22:54:20 UTC | WARN | (check.go:276 in Configure) | passing `agentConfig` to the constructor is deprecated, please use the `get_config` function from the 'datadog_agent' package (network).

2018/10/05 additional info:

We are running our agent containers with the following environment variables:

DD_API_KEY=8***d
DD_DOGSTATSD_NON_LOCAL_TRAFFIC=true //to collect statsd from containers on host
DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=true
DD_LOGS_ENABLED=true
SD_BACKEND=docker

Example of a container service with Docker labels:

com.datadoghq.ad.check_names=["nginx"]
com.datadoghq.ad.init_configs=[{}]
com.datadoghq.ad.logs=[{"source": "nginx", "service": "my-service"}]

All 21 comments

I am experiencing same issue with 6.5.x series (6.5.2, 6.5.1, 6.5.0x).
I am on AWS, running Kubernetes 1.8.5 on debian OS.

I had to roll back to 6.4.x

Same issue guys with 6.5.2 agent version.
Centos 7, Docker version 18.06.1-ce
Digital Ocean/Hostwindsdns

Hello everyone,

Thanks for reporting this issue. We will definitely have a look, replicate and fix this behaviour.

There is, however, a workaround until this is fixed.
Until Agent version 6.5, Kubernetes setups had to use configuration files to filter containers by name or image.

As it is now possible to use the Autodiscovery feature with the agent, you can do the same configuration directly in container labels or pod annotations.

Examples: https://docs.datadoghq.com/logs/log_collection/docker/?tab=nginxdockerfile#examples

This means that you now have the ability to easily:

  • Collect all logs with the DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL environment variable.
  • Override the service and source value thanks to labels or pod annotations
  • Choose to collect only specific logs by removing the DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL variable and setting the labels or pod annotations on the container that should be collected.
  • Include or exclude containers thanks to the DD_AC_INCLUDE and DD_AC_EXCLUDE variables (example).

That said, previous configuration should still work so this definitely needs to be fixed. I just wanted to share the new behaviour which we believe is much more dynamic and flexible.

Sorry for the trouble caused and once again thanks for reporting it.

OK, adding labels to the Docker image for Autodiscovery fixed my problem.

Hello @undiabler, @btsuhako and @johanvereshchaga.

We have identified a potential issue which might explain the behaviour you observed.
Is the Datadog Agent running on the host?

If yes, would you mind adding the following lines to datadog.yaml and let us know if after restarting the agent it then works fine:

listeners:
  - name: docker

config_providers:
  - name: docker
    polling: true

Why do we need this?

As explained in the previous post, log collection was merged into the Autodiscovery feature of the Agent, which means that Autodiscovery now needs to be enabled.
The above lines enable the Autodiscovery feature in the agent.

This is not necessary when running the containerised version of the agent as it is enabled automatically.
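
To check a host install for the lines above before restarting the agent, something like the following could be used. It is only a sketch: `docker_autodiscovery_enabled` is a hypothetical helper, not part of the agent, and it assumes datadog.yaml has already been loaded into a dict by whatever YAML parser you use. Requiring `polling: true` simply follows the snippet in this comment.

```python
def docker_autodiscovery_enabled(config):
    """Return True if a parsed datadog.yaml has the docker listener and provider.

    Hypothetical helper: `config` is assumed to be the file content as a dict.
    """
    listeners = config.get("listeners") or []
    providers = config.get("config_providers") or []
    has_listener = any(item.get("name") == "docker" for item in listeners)
    # Assumption taken from the snippet above: the provider has polling enabled.
    has_provider = any(
        item.get("name") == "docker" and item.get("polling") is True
        for item in providers
    )
    return has_listener and has_provider


# The datadog.yaml from the original report, as a dict (secrets elided as posted).
reported = {
    "dd_url": "https://app.datadoghq.com",
    "api_key": "a***5",
    "hostname": "k***s.com",
    "tags": ["role:shop-front"],
    "logs_enabled": True,
}

# The same config with the suggested lines added.
patched = dict(
    reported,
    listeners=[{"name": "docker"}],
    config_providers=[{"name": "docker", "polling": True}],
)

print(docker_autodiscovery_enabled(reported))
print(docker_autodiscovery_enabled(patched))
```

The first call reports the configuration as posted in the issue; the second reports it with the two suggested blocks added.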

@NBParis thanks for the tip, but we're running the containerized version of the agent, with no YAML overrides for the config, just environment variables as noted in my post above.

@btsuhako Thanks for the feedback.

I believe we have found your issue.

Could you change your labels from

com.datadoghq.ad.check_names=["nginx"]
com.datadoghq.ad.init_configs=[{}]
com.datadoghq.ad.logs=[{"source": "nginx", "service": "my-service"}]

To:

com.datadoghq.ad.logs=[{"source": "nginx", "service": "my-service"}]

And let us know if that works?

Why?

Currently the agent looks at the configuration on a container as a whole and tries to parse it.
As you had check_names and init_configs, it was also looking for instances, which was missing.
This led to the whole configuration failing to parse.
Example of configuration for Nginx check.

What will change?

In the future, we will try to split the parsing of the log configuration and the metric one.
This will ensure that even if one of them is not correct, the other is still correctly taken into account.

We will see if the status of the agent can be updated to better reflect the issues on label parsing.

Side question: was the documentation unclear about the fact that the logs label can be used independently?
Do you have suggestions on how it could be improved to make the process clearer?
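
The rule described above can be sketched as a small pre-deployment check. This is an illustration only: `check_ad_labels` is a hypothetical helper that mirrors the behaviour described in this comment (a check template needs check_names, init_configs and instances together, while the logs label can stand alone); it is not the agent's actual parser.

```python
import json

AD_PREFIX = "com.datadoghq.ad."
CHECK_KEYS = ("check_names", "init_configs", "instances")


def check_ad_labels(labels):
    """Return a list of problems found in a container's Autodiscovery labels.

    Hypothetical helper: mirrors the rule described in this thread,
    not the agent's own parsing code.
    """
    problems = []
    present = [key for key in CHECK_KEYS if AD_PREFIX + key in labels]
    # A check template needs all three labels; a partial set is what
    # produced "missing instances key" in the agent logs.
    if present and len(present) != len(CHECK_KEYS):
        missing = [key for key in CHECK_KEYS if key not in present]
        problems.append("incomplete check template, missing: " + ", ".join(missing))
    # Every Autodiscovery label value is expected to be valid JSON.
    for name, value in labels.items():
        if name.startswith(AD_PREFIX):
            try:
                json.loads(value)
            except ValueError:
                problems.append("invalid JSON in " + name)
    return problems


# The three labels from the report: check labels without instances.
broken = {
    "com.datadoghq.ad.check_names": '["nginx"]',
    "com.datadoghq.ad.init_configs": "[{}]",
    "com.datadoghq.ad.logs": '[{"source": "nginx", "service": "my-service"}]',
}

# The suggested fix: the logs label on its own.
logs_only = {
    "com.datadoghq.ad.logs": '[{"source": "nginx", "service": "my-service"}]',
}

print(check_ad_labels(broken))
print(check_ad_labels(logs_only))
```

Run against the labels from the report, the helper flags the missing instances label; the logs-only variant comes back with no problems.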

@NBParis I'll try this and report back on whether it works. It would be helpful to update the documentation to match your recommendation, though:

https://docs.datadoghq.com/logs/log_collection/docker/?tab=containerinstallation#examples currently shows 4 Docker labels to configure Autodiscovery

Same problem for me.

  • Agent (v6.5.2): no Docker log
  • Agent (v6.4.2): Docker logs are back

Hello @prodis,

Are you running the agent on the host or as a container?
Or do you have some labels or pod annotations set on your containers to configure log collection?

@NBParis Running as a container on Amazon ECS.

My Dockerfile for the agent:

FROM datadog/agent:latest
ADD conf.d/log.yaml /etc/datadog-agent/conf.d/log.yaml
ADD datadog.yaml /etc/datadog-agent/datadog.yaml

log.yaml

logs:
  - type: docker
    source: docker
    service: tee2
    tags:
      - "env:tee2"
      - "component:all"

datadog.yaml

log_level: info
apm_config:
  env: tee2
  apm_non_local_traffic: true

Environment variables:

      "environment": [
        {
          "name": "DD_API_KEY",
          "value": <MY_KEY>
        },
        {
          "name": "DD_APM_ENABLED",
          "value": "true"
        },
        {
          "name": "DD_LOGS_ENABLED",
          "value": "true"
        },
        {
          "name": "LOG_LEVEL",
          "value": "INFO"
        },
        {
          "name": "NON_LOCAL_TRAFFIC",
          "value": "true"
        },
        {
          "name": "SD_BACKEND",
          "value": "docker"
        }
      ],

From agent status output:

...

==========
Logs Agent
==========

  log
  ---
    Type: docker
    Status: Pending

...

When I use Agent (v6.4.2):

Dockerfile:

FROM datadog/agent:6.4.2
ADD conf.d/log.yaml /etc/datadog-agent/conf.d/log.yaml
ADD datadog.yaml /etc/datadog-agent/datadog.yaml

Docker logs are back:

...

==========
Logs Agent
==========

  log
  ---
    Type: docker
    Status: OK
    Inputs: f6920bff0b06d950b5f196b89581ec083eeeb7734f35d3a952dd3883356e0b70 1761805c620ba3df1efb2f5f141d2a1047e4971a5580557ba1dc8c8738155f1a 432550cd3e40d024a53b00b2b0791df084444e5869dac89addbe01574e898593 e3f014bc62462e3dfa147a5962e922aaac6a6276bbef2ffb922f982cd9a1ef3d b107ef42890e9b3beec7645aa4f01e1975115b45ebc939ab675dacedc8e1822c

...

Thanks a lot @prodis for all those details.

We will have a look and come back to you with our findings.

@btsuhako were you able to get ECS log collection working with any version of datadog-agent >=6.5.2?

I just started from scratch with 6.6.0 and was pulling my hair out on why I couldn't get any logs shipped to Datadog until I found the notes in this issue. I downgraded to 6.4.2 and :boom: I suddenly had logs flowing to Datadog!

@NBParis - I'm not clear if this is a bug in the datadog-agent or just a (shared) misunderstanding of the current documentation.

Also, I see this in the 6.6.0 release notes:

Fix bug that occurs when checks labels/annotation are misconfigured
and would prevent the logs of the container to be tailed

Is that bug fix related to this issue? Thanks!

@jalessio we're successfully running the 6.7.0 agent on AWS Linux 2 hosts. Logging and APM work as expected.

Datadog Agent task definition -> https://gist.github.com/btsuhako/097a2e0d7932cca588cfcdcdf36dbb88

Sample ECS service task definition -> https://gist.github.com/btsuhako/33c1d3d6a2bbee52afa4cf92d3df1f6b

We build our Docker images without any labels, and apply the needed ones at runtime with the task definition. Note that we use only 1 label for our NodeJS application, and 4 labels for the nginx reverse proxy sidecar.

From @NBParis https://github.com/DataDog/datadog-agent/issues/2383#issuecomment-428104773, seems like you can use 1 label (com.datadoghq.ad.logs) or all 4 (com.datadoghq.ad.instances, com.datadoghq.ad.check_names, com.datadoghq.ad.init_configs, com.datadoghq.ad.logs), but anything else in between may not function properly.

@btsuhako many thanks for the quick reply! I'll try this out today.

@nic-lan - I'm pretty sure that your datadog-agent needs to mount Docker socket from the host node. Afaik containers log to stdout -> Docker daemon, which the Datadog Agent consumes to get logs. Make sure that your manifest has everything in https://docs.datadoghq.com/agent/kubernetes/daemonset_setup/, especially the VolumeMounts

ohh sorry it looks like the issue is solved... there was probably a wrong conf in the nginx image.
thank you for the help !!

Hello @jalessio ,

So it is indeed partially linked.
The situation we had was that a badly formatted annotation or label for metrics or logs broke the entire collection of data for that container.
Now, logs and metrics annotations can fail independently without blocking collection of the other data type.

The next agent version 6.8 should solve all the issues raised in this thread.

And what about not using annotations? I use the option to get logs from all containers, and that only works on versions <=6.4.2.

Hello there,

Agent 6.8 has been released and should fix all the issues raised in this thread.
For clarity, I'm going to close this thread, as the original issues have been addressed.

Feel free to open new issues or support ticket to [email protected] if you face any problems with log collection in your containerised environment (or any other problems).

As a reminder the recommended setup for container log collection is the following:

Collect all logs from all containers

Collecting logs requires access to the Docker socket.
Setting the following two environment variables is enough to collect all logs from all containers:

  • DD_LOGS_ENABLED=true
  • DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=true

We no longer recommend using YAML files for container log collection. Default source and service values are set according to the container image name. Those values can be overridden with container labels or pod annotations as described below.

Collect logs only from a specific subset of container(s)

This is handled by pod annotations or container labels through Autodiscovery.
Log collection must still be enabled with DD_LOGS_ENABLED=true, but DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL should not be set.

Then, for each container whose logs you want, set the pod annotation or container label as follows:

  • container labels: com.datadoghq.ad.logs=[{"source": "<SOURCE>", "service": "<SERVICE>"}]
    Requires the following listeners in your datadog.yaml file (should be there by default):

    listeners:
      - name: docker

    config_providers:
      - name: docker
        polling: true

  • pod annotations: ad.datadoghq.com/<identifier>.logs: '[{"source":"<SOURCE>","service":"<SERVICE>"}]'
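
Since both values are JSON embedded in a string, it can be safer to generate them than to type them by hand. The following is a minimal sketch under that assumption; `logs_label` and `logs_annotation` are hypothetical helper names, and only the label key, the annotation key pattern, and the source/service fields come from this thread.

```python
import json


def logs_label(source, service):
    """Build the container label for log collection as a (key, value) pair."""
    value = json.dumps([{"source": source, "service": service}])
    return "com.datadoghq.ad.logs", value


def logs_annotation(identifier, source, service):
    """Build the pod annotation for log collection as a (key, value) pair.

    `identifier` fills the <identifier> placeholder in the annotation key.
    """
    key = "ad.datadoghq.com/{}.logs".format(identifier)
    value = json.dumps([{"source": source, "service": service}])
    return key, value


label_key, label_value = logs_label("nginx", "my-service")
print("{}={}".format(label_key, label_value))

annotation_key, annotation_value = logs_annotation("web", "nginx", "my-service")
print("{}: '{}'".format(annotation_key, annotation_value))
```

Here "web" is just an example identifier. Generating the value with json.dumps guarantees it parses as JSON, which avoids quoting mistakes such as an unclosed placeholder.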

Exclude some containers from log collection

If the agent is configured to collect logs from all containers but logs from some containers should not be collected, the DD_AC_EXCLUDE environment variable can be used.
Examples available here.
