Datadog-agent: Unable to run check 'container': temporary failure in detector, will retry later: No collector detected

Created on 2 Oct 2019 · 13 Comments · Source: DataDog/datadog-agent

Output of the info page (if this is a bug)

process-agent [CRITICAL] UTC | PROCESS | CRITICAL | (collector.go:91 in runCheck) | Unable to run check 'container': temporary failure in detector, will retry later: No collector detected

Describe what happened:
Upgraded to datadog-agent-6.14 from 6.13

Describe what you expected:
No new errors. Instead I now see this critical error showing up.

Steps to reproduce the issue:
Not entirely sure other than updating the package. Happy to help figure it out given some direction.

Additional environment details (Operating System, Cloud provider, etc):
CentOS 7 x86_64 on AWS m5.large

Agent Status:

Getting the status from the agent. 

=============== 
Agent (v6.14.1) 
=============== 

  Status date: 2019-10-02 00:58:03.057030 UTC 
  Agent start: 2019-10-02 00:54:39.416667 UTC 
  Pid: 1127 
  Go Version: go1.12.9 
  Python Version: 2.7.16 
  Check Runners: 4 
  Log Level: info 

  Paths 
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: 565µs
    System UTC time: 2019-10-02 00:58:03.057030 UTC

  Host Info
  =========
    bootTime: 2019-10-02 00:54:28.000000 UTC
    kernelVersion: 3.10.0-1062.1.1.el7.x86_64
    os: linux
    platform: centos
    platformFamily: rhel
    platformVersion: 7.7.1908
    procs: 153
    uptime: 14s

  Hostnames
  =========
    ec2-hostname: ip-172-31-31-46.ec2.internal
    hostname: prod-db2.airfordable.amz
    instance-id: i-dcbcfe4f
    socket-fqdn: prod-db2.airfordable.amz.
    socket-hostname: prod-db2.airfordable.amz 
    host tags:
      af.environment:production
    hostname provider: fqdn
    unused hostname providers:
      aws: not retrieving hostname from AWS: the host is not an ECS instance, and other providers already retrieve non-default hostnames
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname

=========
Collector
=========



  Running Checks
  ==============

    cpu
    ---
      Instance ID: cpu [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default
      Total Runs: 8
      Metric Samples: Last Run: 6, Total: 42
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s


    disk (2.5.0) 
    ------------ 
      Instance ID: disk:e5dffb8bef24336f [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/disk.d/conf.yaml.default
      Total Runs: 8
      Metric Samples: Last Run: 60, Total: 472
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 45ms


    file_handle
    -----------
      Instance ID: file_handle [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/file_handle.d/conf.yaml.default
      Total Runs: 8
      Metric Samples: Last Run: 5, Total: 40
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s


    io
    --
      Instance ID: io [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/io.d/conf.yaml.default
      Total Runs: 8
      Metric Samples: Last Run: 26, Total: 190
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s


    load
    ----
      Instance ID: load [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/load.d/conf.yaml.default
      Total Runs: 8
      Metric Samples: Last Run: 6, Total: 48
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s


    memory
    ------
      Instance ID: memory [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/memory.d/conf.yaml.default
      Total Runs: 8
      Metric Samples: Last Run: 17, Total: 136
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s


    mongo (1.11.0)
    --------------
      Instance ID: mongo:353e102defc4ca96 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/mongo.d/conf.yaml
      Total Runs: 9
      Metric Samples: Last Run: 951, Total: 7,608
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 9
      Average Execution Time : 145ms

    network (1.11.4)
    ----------------
      Instance ID: network:e0204ad63d43c949 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/network.d/conf.yaml.default
      Total Runs: 8
      Metric Samples: Last Run: 26, Total: 208
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 1ms


    ntp
    ---
      Instance ID: ntp:d884b5186b651429 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/ntp.d/conf.yaml.default
      Total Runs: 8
      Metric Samples: Last Run: 1, Total: 8
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 8
      Average Execution Time : 10ms


    process (1.10.0)
    ----------------
      Instance ID: process:mongod:a9bdade959619a48 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/process.d/conf.yaml
      Total Runs: 8
      Metric Samples: Last Run: 17, Total: 134
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 8
      Average Execution Time : 1ms

      Instance ID: process:sshd:b35e1dd1044820ad [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/process.d/conf.yaml
      Total Runs: 8
      Metric Samples: Last Run: 17, Total: 134
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 8
      Average Execution Time : 2ms


    uptime
    ------
      Instance ID: uptime [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/uptime.d/conf.yaml.default
      Total Runs: 9
      Metric Samples: Last Run: 1, Total: 9
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  Transactions
  ============
    CheckRunsV1: 8
    Dropped: 0
    DroppedOnInput: 0
    Events: 0
    HostMetadata: 0
    IntakeV1: 2
    Metadata: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Series: 0
    ServiceChecks: 0
    SketchSeries: 0
    Success: 18
    TimeseriesV1: 8

  API Keys status
  ===============
    API key ending with bf83e: API Key valid

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - bf83e

==========
Logs Agent
==========

  Logs Agent is not running

=========
Aggregator
=========
  Checks Metric Sample: 9,210
  Dogstatsd Metric Sample: 572
  Event: 1
  Events Flushed: 1
  Number Of Flushes: 8
  Series Flushed: 7,707
  Service Check: 124
  Service Checks Flushed: 125

=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 571
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Bytes: 36,777
  Udp Packet Reading Errors: 0
  Udp Packets: 572
  Uds Bytes: 0
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 0


stieg

👍4

All 13 comments

And after a while, the message seems to morph into this:

process-agent [CRITICAL] UTC | PROCESS | CRITICAL | (collector.go:91 in runCheck) | Unable to run check 'container': permanent failure in detector: No collector available

stieg on 2 Oct 2019

2019-10-02 14:17:56 UTC | PROCESS | CRITICAL | (collector.go:91 in runCheck) | Unable to run check 'container': permanent failure in detector: No collector available

I am getting these constantly as well.

This is on Ubuntu 18.04.3 LTS

datadog-agent/unknown,now 1:6.14.1-1 amd64 [installed]

tparvu on 2 Oct 2019

We're seeing this too. Raised a support case where suggestions have been made to alter defaults e.g. setting process_config enabled: "disabled" and container_collect_all: false (we don't even have the log collection feature enabled.)

Regardless, we shouldn't need to change the defaults; they should be sane.

mattmonkey83 on 7 Oct 2019

Raised a support case where suggestions have been made to alter defaults e.g. setting process_config enabled: "disabled" and container_collect_all: false (we don't even have the log collection feature enabled.)

Did you try this, @mattmonkey83? If so, can you confirm whether this is a viable workaround?

stieg on 7 Oct 2019

Can confirm that the following stopped the errors for me:

process_config:
  enabled: 'disabled'

I can also confirm that setting the value to 'true' stops the error. It seems the default ("false") has some kind of bug. Fun.
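
For reference, here is a minimal sketch of the setting discussed above, as it would appear in /etc/datadog-agent/datadog.yaml. The comments describing each value reflect my understanding of the Agent 6 configuration template and are not taken from this thread; the idea that the "false" mode logs the error because no container runtime is present on the host is an assumption, not something confirmed here.

```yaml
# /etc/datadog-agent/datadog.yaml (Agent 6)
process_config:
  # "true"     : collect processes and containers (reported above to stop the error)
  # "false"    : the default; collect containers only (the mode in which the error
  #              in this thread appears, presumably because no container runtime
  #              is found on the host -- an assumption)
  # "disabled" : do not start the process-agent at all (reported above to stop the error)
  enabled: "disabled"
```

The Agent has to be restarted for the change to take effect. If anyone on the host relies on process data, "true" is probably the safer of the two workarounds, since "disabled" turns the process-agent off entirely.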

stieg on 8 Oct 2019

👍5

Sorry, I didn't get a chance to confirm, but that's good to know. I just had the following reply on the support case, however:

You're right. That is an issue and is currently being addressed by engineering, and looks like the fix should go out in our next Agent version.

mattmonkey83 on 8 Oct 2019

I can confirm that changing enabled: 'false' to enabled: 'disabled' in /etc/datadog-agent/datadog.yaml fixed the problem for me.

timvisher on 8 Oct 2019

For what it's worth, I'm still seeing this issue with the latest nightly build:

# datadog-agent status
[...]
Agent (v6.15.0-devel+git.36.4e6cb31)

# tail -n2 /var/log/syslog
Oct 31 06:29:06 redacted process-agent[8057]: PROCESS | CRITICAL | (collector.go:91 in runCheck) | Unable to run check 'container': temporary failure in detector, will retry later: No collector detected
Oct 31 06:29:16 redacted process-agent[8057]: PROCESS | CRITICAL | (collector.go:91 in runCheck) | Unable to run check 'container': temporary failure in detector, will retry later: No collector detected

I'm not convinced disabling process_config is the right option for us as I suspect we have some teams using this feature.

rene00 on 31 Oct 2019

👍1

I can confirm the same issue on RHEL 8:

datadog-agent status
Getting the status from the agent.

===============
Agent (v6.15.1)
===============

  Status date: 2019-12-10 11:25:36.810597 UTC
  Agent start: 2019-12-10 11:16:00.134127 UTC
  Pid: 18144
  Go Version: go1.12.9
  Python Version: 2.7.17
  Check Runners: 4
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: -944µs
    System UTC time: 2019-12-10 11:25:36.810597 UTC

  Host Info
  =========
    bootTime: 2019-11-26 18:16:20.000000 UTC
    kernelVersion: 4.18.0-80.4.2.el8_0.x86_64
    os: linux
    platform: redhat
    platformFamily: rhel
    platformVersion: 8.1
    procs: 247
    uptime: 328h59m41s
    virtualizationRole: host
    virtualizationSystem: kvm

One thing that I find very weird is that it reports using Python 2.7, which is not even installed on the system; the default Python is python3 (3.6).

ssbarnea on 10 Dec 2019

Hi @ssbarnea,
I pinged the folks working on the process agent for an update.

The agent brings its own embedded Python and does not rely on what's installed on the system.

arbll on 10 Dec 2019

Still seeing the issue. Any update after three months?

pwp333 on 11 Dec 2019

It seems that the same happens with:

     platformFamily: debian
     platformVersion: 18.04

Or, to rephrase: process-agent is broken on 100% of the platforms I have deployed the Datadog agent on, and these are among the most popular Linux distros, not some weird ones.

ssbarnea on 14 Dec 2019

👋 Really sorry for the delayed response here and for the inconvenience this may have caused. A fix for this will be available in the 6.17 release.

It is worth mentioning this is a benign log entry and it does not affect process data collection whatsoever. We do acknowledge the message (and log level) is extremely misleading and we're removing it.