Datadog-agent: Possible race condition for killed Docker containers

Created on 26 Mar 2018 · 11 Comments · Source: DataDog/datadog-agent

I'm running datadog-agent 6.1.0 on Kubernetes 1.8.9 (running on GKE).

I deleted a handful of pods in my cluster and for each corresponding container I saw two logs from datadog-agent:

[ AGENT ] 2018-03-26 09:31:44 UTC | ERROR | (docker_main.go:118 in fetchForDockerID) | Failed to inspect container cfc65ca71e3fc63124d582317f54d07e51d9f4ba2ad6a4593ed7e79362146c45 - Error: No such container: cfc65ca71e3fc63124d582317f54d07e51d9f4ba2ad6a4593ed7e79362146c45
[ AGENT ] 2018-03-26 09:31:44 UTC | WARN | (tagger.go:245 in Tag) | error collecting from docker: Error: No such container: cfc65ca71e3fc63124d582317f54d07e51d9f4ba2ad6a4593ed7e79362146c45

These logs come from https://github.com/DataDog/datadog-agent/blob/e84f7ad7829543acb2f9ac1fe6a4d1a53d3426bc/pkg/collector/corechecks/containers/docker.go#L218 and https://github.com/DataDog/datadog-agent/blob/c98beb40bbaee155152cf43e63ebe794a918625b/pkg/tagger/tagger.go#L245.

This doesn't look like an error condition to me (or even something that should be warned about). Indeed, in the second case the code appears to be trying to handle this condition explicitly in the preceding clause. (This looks somewhat related to #1345, so perhaps that just isn't working as expected yet.)



benbc

👍5

Most helpful comment

Same with 6.13.

mbelang on 11 Sep 2019

👍3

All 11 comments

We're seeing similar behavior on containers that were cleaned up by docker-gc using datadog-agent 6.1.2 on kube.

jlwynkoop on 9 May 2018

👍1

We use datadog-agent 6.5.2. The problem still exists. It looks like we are missing certain Docker events because of this.

vvbogdanov87 on 15 Oct 2018

This is also affecting us. We have an automated system that deletes old containers. We're getting spammed with these "errors" in our logs.

irlevesque on 26 Nov 2018

Also seeing this problem. Besides, our agents are getting restarted and I'm not sure if this is the reason (the probe just fails with 12 unhealthy components after a minute).

eduardohl on 13 Dec 2018

All,

Apologies for the delay on this issue. This log can occur in a few scenarios, especially if the containers churn. As you can see, we readjusted the logging in #2485 so that we do not log misleading errors that are actually info/debug.

We will shortly be releasing agent version 6.8.1, which will include this fix.

Thank you very much for your patience.
Best,
.C