Datadog-agent: Possible race condition for killed Docker containers

Created on 26 Mar 2018 · 11 Comments · Source: DataDog/datadog-agent

I'm running datadog-agent 6.1.0 on Kubernetes 1.8.9 (running on GKE).

I deleted a handful of pods in my cluster and for each corresponding container I saw two logs from datadog-agent:

[ AGENT ] 2018-03-26 09:31:44 UTC | ERROR | (docker_main.go:118 in fetchForDockerID) | Failed to inspect container cfc65ca71e3fc63124d582317f54d07e51d9f4ba2ad6a4593ed7e79362146c45 - Error: No such container: cfc65ca71e3fc63124d582317f54d07e51d9f4ba2ad6a4593ed7e79362146c45
[ AGENT ] 2018-03-26 09:31:44 UTC | WARN | (tagger.go:245 in Tag) | error collecting from docker: Error: No such container: cfc65ca71e3fc63124d582317f54d07e51d9f4ba2ad6a4593ed7e79362146c45

These logs come from https://github.com/DataDog/datadog-agent/blob/e84f7ad7829543acb2f9ac1fe6a4d1a53d3426bc/pkg/collector/corechecks/containers/docker.go#L218 and https://github.com/DataDog/datadog-agent/blob/c98beb40bbaee155152cf43e63ebe794a918625b/pkg/tagger/tagger.go#L245.

This doesn't look like an error condition to me (or even something that should be warned about). Indeed it appears in the second case that the code is trying to handle this condition explicitly in the clause before. (This looks like it's somewhat related to #1345, so perhaps that just isn't working as expected yet.)

All 11 comments

We're seeing similar behavior on containers that were cleaned up by docker-gc using datadog-agent 6.1.2 on kube.

We use datadog-agent 6.5.2, and the problem still exists. It looks like we are missing certain Docker events because of this.

This is also affecting us. We have an automated system that deletes old containers. We're getting spammed with these "errors" in our logs.

Also seeing this problem. In addition, our agents are getting restarted, and I'm not sure whether this is the reason (the probe just fails with 12 unhealthy components after a minute).

All,

Apologies for the delay on this issue. This log can occur in a few scenarios, especially when containers churn. As you can see, we readjusted the logging in #2485 so that we no longer log misleading errors for conditions that are really info/debug.

We will be releasing agent version 6.8.1 shortly, which will include this fix.

Thank you very much for your patience.
Best,
.C

Closing as this fix was released. Feel free to reach out if you are still having issues.

I still have Agent 6.10 flooding the logs with these exact same error messages at the same log levels.

I get the same error message, and it seems that the stats metrics API doesn't work. I am not sure whether the error is what causes the problem.

Same with 6.13.

v6.14.0 does the same

Confirming that it happens in 6.10.1.
