Datadog-agent: Should be collecting logs, ECS, and Docker data - Logs Agent is not running, permanent failure in dockerutil, decoding task metadata failed

Created on 15 Mar 2018 · 11 comments · Source: DataDog/datadog-agent

Output of the info page (if this is a bug)

Getting the status from the agent.

==============
Agent (v6.0.3)
==============

  Status date: 2018-03-15 17:14:23.501699 UTC
  Pid: 8828
  Python Version: 2.7.13
  Logs:
  Check Runners: 4
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: 0.002778145 s
    System UTC time: 2018-03-15 17:14:23.501699 UTC

  Host Info
  =========
    bootTime: 2018-03-15 16:20:25.000000 UTC
    kernelVersion: 4.9.81-35.56.amzn1.x86_64
    os: linux
    platform: amazon
    platformFamily: rhel
    platformVersion: 2017.09
    procs: 211
    uptime: 87
    virtualizationRole: guest
    virtualizationSystem: xen

  Hostnames
  =========
    ec2-hostname: ip-172-31-15-66.us-west-2.compute.internal
    hostname: i-0691f3728b71dd647
    instance-id: i-0691f3728b71dd647
    socket-fqdn: ip-172-31-15-66.us-west-2.compute.internal.
    socket-hostname: ip-172-31-15-66

=========
Collector
=========

  Running Checks
  ==============
    cpu
    ---
      Total Runs: 210
      Metrics: 6, Total Metrics: 1254
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    disk
    ----
      Total Runs: 210
      Metrics: 52, Total Metrics: 11298
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    docker
    ------
      Total Runs: 210
      Metrics: 0, Total Metrics: 0
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0
      Error: UNKNOWN ERROR
      No traceback
      Warning: Error initialising check: [permanent failure in dockerutil: retry number exceeded]


    file_handle
    -----------
      Total Runs: 210
      Metrics: 1, Total Metrics: 210
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    io
    --
      Total Runs: 210
      Metrics: 91, Total Metrics: 21370
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    load
    ----
      Total Runs: 210
      Metrics: 6, Total Metrics: 1260
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    memory
    ------
      Total Runs: 210
      Metrics: 14, Total Metrics: 2940
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    network
    -------
      Total Runs: 210
      Metrics: 26, Total Metrics: 6366
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

    ntp
    ---
      Total Runs: 210
      Metrics: 1, Total Metrics: 199
      Events: 0, Total Events: 0
      Service Checks: 1, Total Service Checks: 210

    uptime
    ------
      Total Runs: 210
      Metrics: 1, Total Metrics: 210
      Events: 0, Total Events: 0
      Service Checks: 0, Total Service Checks: 0

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  CheckRunsV1: 210
  IntakeV1: 17
  RetryQueueSize: 0
  Success: 437
  TimeseriesV1: 210

  API Keys status
  ===============
    https://6-0-3-app.agent.datadoghq.com,*************************3d0bf: API Key valid

==========
Logs Agent
==========

  Logs Agent is not running

=========
DogStatsD
=========

  Checks Metric Sample: 49096
  Event: 1
  Events Flushed: 1
  Number Of Flushes: 210
  Series Flushed: 45898
  Service Check: 2520
  Service Checks Flushed: 2718
  Dogstatsd Metric Sample: 2039

Describe what happened:

==> /var/log/datadog/agent.log <==
2018-03-15 17:15:07 UTC | WARN | (checkbase.go:60 in Warnf) | Error initialising check: [permanent failure in dockerutil: retry number exceeded]
2018-03-15 17:15:07 UTC | ERROR | (runner.go:276 in work) | Error running check docker: permanent failure in dockerutil: retry number exceeded
2018-03-15 17:15:07 UTC | WARN | (datadog_agent.go:135 in LogMessage) | (disk.py:104) | Unable to get disk metrics for /var/lib/docker/devicemapper/mnt/ba004f4db542edfe90eade24759318016564726655cd022b478e18425df78148: [Errno 13] Permission denied: '/var/lib/docker/devicemapper/mnt/ba004f4db542edfe90eade24759318016564726655cd022b478e18425df78148'
2018-03-15 17:15:07 UTC | WARN | (datadog_agent.go:135 in LogMessage) | (disk.py:104) | Unable to get disk metrics for /var/run/docker/netns/default: [Errno 13] Permission denied: '/var/run/docker/netns/default'
2018-03-15 17:15:07 UTC | WARN | (datadog_agent.go:135 in LogMessage) | (disk.py:104) | Unable to get disk metrics for /var/lib/docker/containers/4361404cc1b8fc34d9acf5c4b1e44edf857c125c6139986ddad48aebaa9f4b52/shm: [Errno 13] Permission denied: '/var/lib/docker/containers/4361404cc1b8fc34d9acf5c4b1e44edf857c125c6139986ddad48aebaa9f4b52/shm'
2018-03-15 17:15:22 UTC | WARN | (checkbase.go:60 in Warnf) | Error initialising check: [permanent failure in dockerutil: retry number exceeded]
2018-03-15 17:15:22 UTC | ERROR | (runner.go:276 in work) | Error running check docker: permanent failure in dockerutil: retry number exceeded
2018-03-15 17:15:22 UTC | WARN | (datadog_agent.go:135 in LogMessage) | (disk.py:104) | Unable to get disk metrics for /var/lib/docker/devicemapper/mnt/ba004f4db542edfe90eade24759318016564726655cd022b478e18425df78148: [Errno 13] Permission denied: '/var/lib/docker/devicemapper/mnt/ba004f4db542edfe90eade24759318016564726655cd022b478e18425df78148'
2018-03-15 17:15:22 UTC | WARN | (datadog_agent.go:135 in LogMessage) | (disk.py:104) | Unable to get disk metrics for /var/run/docker/netns/default: [Errno 13] Permission denied: '/var/run/docker/netns/default'
2018-03-15 17:15:22 UTC | WARN | (datadog_agent.go:135 in LogMessage) | (disk.py:104) | Unable to get disk metrics for /var/lib/docker/containers/4361404cc1b8fc34d9acf5c4b1e44edf857c125c6139986ddad48aebaa9f4b52/shm: [Errno 13] Permission denied: '/var/lib/docker/containers/4361404cc1b8fc34d9acf5c4b1e44edf857c125c6139986ddad48aebaa9f4b52/shm'
==> /var/log/datadog/process-agent.log <==
2018-03-15 17:12:01 ERROR (common.go:64) - unable to get the container list from ecs
2018-03-15 17:12:01 ERROR (container.go:91) - failed to get container list from ecs - json: cannot unmarshal string into Go value of type ecs.TaskMetadata
2018-03-15 17:12:11 ERROR (common.go:42) - decoding task metadata failed - json: cannot unmarshal string into Go value of type ecs.TaskMetadata
2018-03-15 17:12:11 ERROR (common.go:50) - unable to retrieve task metadata
2018-03-15 17:12:11 ERROR (common.go:64) - unable to get the container list from ecs
2018-03-15 17:12:11 ERROR (container.go:91) - failed to get container list from ecs - json: cannot unmarshal string into Go value of type ecs.TaskMetadata
2018-03-15 17:12:21 ERROR (common.go:42) - decoding task metadata failed - json: cannot unmarshal string into Go value of type ecs.TaskMetadata
2018-03-15 17:12:21 ERROR (common.go:50) - unable to retrieve task metadata
2018-03-15 17:12:21 ERROR (common.go:64) - unable to get the container list from ecs
2018-03-15 17:12:21 ERROR (container.go:91) - failed to get container list from ecs - json: cannot unmarshal string into Go value of type ecs.TaskMetadata

==> /var/log/datadog/process-errors.log <==
2018-03-15 16:21:51 INFO (main_common.go:84) - pid '8684' written to pid file '/opt/datadog-agent/run/process-agent.pid'
2018-03-15 16:21:51 INFO (tagger.go:77) - starting the tagging system
2018-03-15 16:21:51 INFO (tagger.go:148) - ecs tag collector successfully started
2018-03-15 16:21:51 ERROR (common.go:42) - decoding task metadata failed - json: cannot unmarshal string into Go value of type ecs.TaskMetadata
2018-03-15 16:21:51 ERROR (common.go:50) - unable to retrieve task metadata
2018-03-15 16:21:51 ERROR (common.go:64) - unable to get the container list from ecs
2018-03-15 16:21:51 ERROR (container.go:91) - unable to connect to docker - temporary failure in dockerutil, will retry later: try delay not elapsed yet
2018-03-15 16:21:51 ERROR (container.go:91) - failed to get container list from ecs - json: cannot unmarshal string into Go value of type ecs.TaskMetadata

Describe what you expected:
Should be collecting ECS and Docker data correctly
Should be collecting logs from all docker containers

Steps to reproduce the issue:
Install the agent, configure as follows:

# datadog.yaml
dd_url: https://app.datadoghq.com
tags:
  - instance_id:UNDEFINED_INSTANCE_ID
  - environment:UNDEFINED_ENVIRONMENT
histogram_percentiles: ["0.90","0.95","0.99"]
forwarder_num_workers: 2
collect_ec2_tags: true
check_runners: 0
enable_gohai: true
use_dogstatsd: yes
dogstatsd_port: 8125
dogstatsd_non_local_traffic: no
logs_enabled: false
listeners:
  - name: auto
docker_labels_as_tags:
  com.amazonaws.ecs.cluster:                 cluster
  com.amazonaws.ecs.task-definition-family:  task_family
  com.amazonaws.ecs.task-definition-version: task_version
  environment:                               environment
  git-sha:                                   sha
process_config:
  enabled: "true"
apm_config:
  enabled: true

# datadog-ecs.yaml
## Provides autodetected defaults, for kubernetes environments,
## please see datadog.yaml.example for all supported options

# Autodiscovery
listeners:
  - name: ecs

config_providers:
  ## The ecs provider handles templates embedded in container labels, see
  ## https://docs.datadoghq.com/guides/autodiscovery/#template-source-docker-label-annotations
  - name: ecs
    polling: true

# conf.d/docker.d/conf.yaml
init_config:

instances:
  - 
    collect_events: false
    collect_container_size: true
    collect_images_stats: true
    collect_image_size: true
    collect_disk_stats: true
    collect_exit_codes: true
logs:
   - type: docker
     service: docker-alpha
     source: docker-alpha
     tags: alpha

Additional environment details (Operating System, Cloud provider, etc):

blaines

All 11 comments

Moving to support ticket.

maycmlee on 15 Mar 2018

I'm having this same issue @maycmlee

mandeepbal on 1 May 2018

@mandeepbal I'm sorry to hear that. Could you open a support ticket by emailing support@datadoghq.com? Thanks!

maycmlee on 1 May 2018

Hi May, this seems to be an issue with the open-source agent. Why should I have to open a support ticket?

mandeepbal on 1 May 2018

Hi @mandeepbal, if you open a ticket then we can take a closer look into your issue and specific setup to see where the problem is coming from. Thanks.

maycmlee on 1 May 2018

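Getting these errors:
process-agent[3185]: 2019-09-26 19:17:56 UTC | PROCESS | CRITICAL | (collector.go:91 in runCheck) | Unable to run check 'container': permanent failure in detector: No collector available
Is there a fix for this issue?

toli-belo on 26 Sep 2019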

Hi @toli-belo

Could you open a new issue with your specific issue please?
You can also contact our support team and open a ticket at support@datadoghq.com so they can look into your issue further.

Thanks!

Simwar on 27 Sep 2019

getting these errors:
process-agent[3185]: 2019-09-26 19:17:56 UTC | PROCESS | CRITICAL | (collector.go:91 in runCheck) | Unable to run check 'container': permanent failure in detector: No collector available
Is there a fix for this issue?

@toli-belo did you find the solution for this? I have the same error in logs.

Thanks

rechinu007 on 29 Sep 2019

I have the same error in my log. Is there any update?
Oct 1 19:52:39 ip-172-31-31-162 process-agent[1154]: 2019-10-01 19:52:39 CEST | PROCESS | CRITICAL | (collector.go:91 in runCheck) | Unable to run check 'container': permanent failure in detector: No collector available

ghost on 1 Oct 2019

I have the same issue in my log files.

2019-10-09 11:17:51 BST | PROCESS | CRITICAL | (collector.go:91 in runCheck) | Unable to run check 'container': permanent failure in detector: No collector available

jeremyquinton on 9 Oct 2019

The same.

grep -c "PROCESS | CRITICAL | (collector.go:91 in runCheck) | Unable to run check 'container': permanent failure in detector: No collector available" /var/log/messages
7877

CentOS 7, datadog-agent-6.14.1-1.x86_64