Datadog-agent: Documentation for logging of short-lived containers is inaccurate

Created on 22 Feb 2019 · 34 Comments · Source: DataDog/datadog-agent

If a Docker container runs for less than 10 seconds (by default), auto-discovery won't (consistently) detect it, and the shorter the run, the more likely it is to be missed.

The documentation says:

To ensure their logs are properly collected, Autodiscovery detects short lived containers as soon as they are started.

... which is not true. The Datadog agent polls for new containers every 10 seconds, not as soon as they start.
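
As a rough illustration of why shorter runs are missed more often: if the Agent only notices a container when a poll happens while it is still running, and container start times are spread evenly relative to the poll ticks, the chance of detection is roughly the container lifetime divided by the poll interval. The sketch below encodes that simplified model only; `detection_chance_percent` is a hypothetical helper, not Agent code.

```shell
# detection_chance_percent LIFETIME_MS POLL_INTERVAL_MS
# Simplified model (an assumption, not the Agent's actual logic): a container
# is detected only if a poll tick falls inside its lifetime.
detection_chance_percent() {
    lifetime_ms=$1
    interval_ms=$2
    if [ "$lifetime_ms" -ge "$interval_ms" ]; then
        # Runs at least one full interval, so a poll always lands inside it.
        echo 100
    else
        # Integer percentage, rounded down.
        echo $(( lifetime_ms * 100 / interval_ms ))
    fi
}

detection_chance_percent 3000 10000    # prints 30
detection_chance_percent 200 10000     # prints 2
detection_chance_percent 12000 10000   # prints 100
```

Under this model, a 3-second container with the default 10-second interval is picked up only about a third of the time, which matches the inconsistent behaviour described above.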

The impact of this is that it takes something somewhat annoying and makes it much harder to debug, since the documentation implies that short-lived container logging should work just fine.

Suggested fix:

  • Change to say something about polling for new containers every 10 seconds by default
  • Mention DD_AD_CONFIG_POLL_INTERVAL
  • Show a workaround (change yourcommand --etc to sh -c 'sleep $n; exec yourcommand --etc', where $n is whatever DD_AD_CONFIG_POLL_INTERVAL is set to)
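
The workaround in the last bullet can be sketched as a small wrapper function. This is a minimal sketch, not an official Datadog recommendation: `run_after_delay` and `START_DELAY_SECONDS` are hypothetical names, and the 10-second default simply mirrors the default poll interval mentioned in this issue.

```shell
# Hypothetical wrapper: wait long enough for the Agent's next autodiscovery
# poll, then replace this shell with the real command.
# START_DELAY_SECONDS is an assumed variable name; set it to match the
# Agent's DD_AD_CONFIG_POLL_INTERVAL (10 seconds by default per this issue).
run_after_delay() {
    delay="${START_DELAY_SECONDS:-10}"
    sleep "$delay"
    # exec makes the wrapped command the main process, so its exit code
    # and any signals sent to the container reach it directly.
    exec "$@"
}

# Example container command (commented out so sourcing this file is harmless):
# run_after_delay yourcommand --etc
```

The cost of this approach is that every run is slowed down by the full delay, which is why several commenters below ask for a fix in the Agent instead.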

This impacts anything using Docker auto-discovery, including Kubernetes and ECS.

See #2749

All 34 comments

(For reference, if I had found #2749 when we first ran into this issue, it would have saved me several days' worth of debugging)

I'd like to add that we've run into this as well, but it seems a little worse than just the documentation being inconsistent with the behavior.

As we've adopted Datadog, one of the primary logging use cases was developers looking for their crashed application logs and the polling behavior all but guarantees those logs are not available in Datadog despite being available on the underlying kubernetes node. This has led to teams needing to rely on kubectl logs and having access to the clusters themselves in order to debug their crashed applications. If we had known short lived or crashlooping container logs were not going to be reliably available, it likely would have been a blocker to adoption. Our previous logging tools did reliably provide short lived container logs.

Just a vote that this warrants updating the agent functionality rather than just clarifying in the docs that polling leads to missed logs.

I agree with the above. CloudWatch logging was perfectly consistent and much easier to debug. Datadog logs are enough of an improvement that we might still have switched, even with the workaround we have now (pause before running anything), but it would be even better if we didn't need a workaround for every Docker container we run.

The main point of this issue is that correctly documenting this is the bare minimum though, and should be pretty fast compared to actually fixing the problem.

Hey everyone,

Thanks for reaching out, and for the thorough investigation.

That's an issue we're actively working on and hope to solve in 6.12 (the window for getting code merged for 6.11 is closing at the end of the week, and we're not satisfied with the state of our solution to push it by then). We'll share progress here, and will propose a custom build with this feature if it's ready before 6.12 and you're interested in beta-testing it.

cc @NBParis and @l0k0ms for the documentation part.

Any progress that you can share with us? This just bit our team today as we're trying to start using DataDog.

Hello @bjornstar

The 6.12 does include this feature and short lived container log collection should work fine in this version.

Regarding the documentation part, there was internal discussion about documenting a workaround while the new version of the Agent was not that far away and we decided to wait for the 6.12.

Regarding the 6.12 release timeline, we are currently starting the test process and hopefully it should be released early June (depending on the results of the QA).
While we test it, there are some RC (release candidate) versions that you can find in the official Datadog Agent GitHub repository.
Feel free to use the 6.12 rc version in your staging environment and let us know if it fixes your problems.

Thanks.

I can confirm after testing with the image tagged 6.12.0-rc.3 that this issue was resolved for us. Even short lived Job containers' logs show up in Datadog properly.

Thanks @echoboomer for helping with the test and the confirmation that it now works as expected.

I'll make sure to update this thread when the agent 6.12 is officially released.

Hello everyone,

Agent 6.12 is now officially released and the documentation has been updated to reflect the support for short lived containers.

I'll go ahead and close this issue but feel free to reach out to [email protected] if you have any questions.

It looks like we're still not seeing logs from short lived containers after upgrading to 6.12. This worked when I tested using an rc version of the image. Any thoughts on where to start diagnosing?

Hello @echoboomer,

Would you mind sending your configuration to [email protected] so we can better investigate?
A flare and the result of docker inspect of the monitored container would be perfect.

Thanks a lot.

We have faced the same problems for initContainers as well.
We do use initContainers for various purposes.

  • Create a shared volume between nginx and php-fpm.
  • Run migrations.
  • Run pre-run-application hooks to be sure, some things can be done.

All of these actions are very short, but on the other hand, things like _running migrations_ HAVE TO be kept in logs (not because of any RFC, but for various debugging and troubleshooting purposes).

Hello @limakzi ,

Thanks for sharing this and sorry to hear that you are still facing this issue.

Would you mind sending us a flare of the Datadog Agent to [email protected] so we can have a look at the Agent configuration and investigate more in depth?

I went ahead and opened a ticket last week and I'm currently sending flares off to Datadog support along with some docker inspect output. Hopefully this will be ironed out there!

I ran into this issue as well today, could not get short-lived container logs whatsoever.

Turns out I had typo'ed the environment variable which forces use of the /var/log/pods directory even when the docker socket is mounted. The environment variable is:
DD_LOGS_CONFIG_K8S_CONTAINER_USE_FILE

However I was using:
DD_LOGS_CONFIG_K8S_CONTAINER_USER_FILE
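
A typo like this is easy to miss because an unknown variable name is simply never read. A small check over the container's environment can catch it. The sketch below is only illustrative: `check_use_file_var` is a hypothetical helper, and it assumes that any other variable sharing the `DD_LOGS_CONFIG_K8S_CONTAINER_` prefix is a misspelling, which may produce false positives.

```shell
# check_use_file_var: reads NAME=value lines on stdin (for example the output
# of `env` captured inside the Agent container) and reports whether the
# expected variable is present. Returns 0 only if the exact name was found.
check_use_file_var() {
    expected="DD_LOGS_CONFIG_K8S_CONTAINER_USE_FILE"
    rc=1
    while IFS='=' read -r name value; do
        case "$name" in
            "$expected")
                echo "ok: $name=$value"
                rc=0
                ;;
            DD_LOGS_CONFIG_K8S_CONTAINER_*)
                # Assumption: same prefix but different name means a typo.
                echo "suspicious: $name looks like a misspelling of $expected"
                ;;
        esac
    done
    return "$rc"
}

# Usage: env | check_use_file_var
```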

@ajp-lqx The problem occurs with the Helm chart as well.
I could not see this variable in stable/datadog.
Maybe this is the root cause, @NBParis?

@limakzi You can pass the vars in via the chart's values file, that's what we're doing:

datadog:
  env:
  - name: DD_LOGS_CONFIG_K8S_CONTAINER_USE_FILE
    value: "true"

Also, we updated our image to 6.13.0 at Datadog's request and we're now getting logs from short lived containers without an issue, provided you're tagging things properly.

I see no documentation for DD_LOGS_CONFIG_K8S_CONTAINER_USE_FILE.

The documentation found here is a little lacking on details: "For Agent v6.12+, short lived container logs (stopped or crashed) are automatically collected when using the K8s file log collection method (through /var/log/pods). This also includes the collection init container logs."

@nitrag if you look 3 paragraphs up from that section you'll see the docu for that env. I agree that it could be made more clear that this is mandatory for short-lived container logs.

I'm confused about what the change in 6.12 was if the agent is still just polling every 10 seconds. Is the short-lived container thing fixed or do we still need workarounds (like very short polling times or forcing tasks to run for at least 10 seconds)?

Hello @brendanlong ,

Indeed there was some misunderstanding: in 6.12+ the improvement was made to support short lived containers in Kubernetes, not in pure Docker environments.

We should have made this clearer.

That said, the great news is that we found the issue that was preventing us from doing the same in pure Docker environments, so in Agent 6.14 (current target is this week) it should work without any extra setup on your end.

There will still be a polling process (every 1 second instead of every 10 seconds) to make sure we collect the logs with the right configuration (in case you added log configuration in container labels), but we found out how to query the logs on the Docker socket for containers that are stopped.
This means that logs from containers that started and stopped in the past second will still be properly collected.

Sorry for the confusion about 6.12; we look forward to your feedback on 6.14.

@NBParis Actually, I fear that polling every second is still not sufficient.
For example:

  • Most of my containers are migration init-containers.
  • The lifetime of these containers is way less than one second (mostly around 0.2 to 0.3 seconds, not including pull time).

That said, since you have only shortened the gap between polls, the problem will definitely still persist.

Hey @limakzi,

Sorry if I was not clear in my previous message, but regardless of the polling frequency, we will get the logs from containers that were alive during the last second.

Here is a description of what is happening:

Long-lived containers

  • Second 0: container starts => Agent receives the event from the Docker event bus and knows that it will need to start monitoring this container
  • Second 1: Agent collects the configuration from the container's labels (you might have none), starts collecting logs from second 0, and keeps collecting logs
  • Second X: container stops => Agent receives the event and stops collecting from this container once everything has been sent

Short lived containers

  • Second X: container A starts => Agent receives the event from the Docker event bus and knows that it will need to start monitoring this container
  • Second X+0.5: container A dies => Agent receives the event from the Docker event bus and knows that it will need to stop collecting data from this container from now on
  • Second X+1: Agent looks for the configuration of all containers (including the stopped one) => Agent collects logs from container A from its start and knows that no more logs will need to be collected from now on.

So this means that even if your container lasts for less than a second, its logs are collected. The only condition is that the container is not removed (which is usually the case, as Docker does not remove stopped or crashed containers by itself).
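
The two timelines above can be condensed into a toy model. This is only a sketch of the behaviour as described in this comment, not the Agent's implementation: `logs_collected` and `describe` are hypothetical helpers, polls are assumed to happen at exact multiples of the interval, and "removed" means the container was deleted before the next poll.

```shell
# logs_collected START_MS LIFETIME_MS POLL_MS MODE REMOVED
# MODE is "running-only" (a poll must see the container running) or
# "include-stopped" (a poll also reads stopped containers, as described
# for 6.14). REMOVED is "yes" or "no". Returns 0 if logs would be collected.
logs_collected() {
    start_ms=$1; lifetime_ms=$2; poll_ms=$3; mode=$4; removed=$5
    stop_ms=$(( start_ms + lifetime_ms ))
    # First poll tick strictly after the container start.
    next_poll_ms=$(( (start_ms / poll_ms + 1) * poll_ms ))
    case "$mode" in
        running-only)
            [ "$next_poll_ms" -lt "$stop_ms" ] ;;
        include-stopped)
            [ "$next_poll_ms" -lt "$stop_ms" ] || [ "$removed" = "no" ] ;;
        *)
            return 2 ;;
    esac
}

describe() {
    if logs_collected "$@"; then echo collected; else echo missed; fi
}

# Container A from the example: starts at 0 ms, dies at 500 ms, 1 s polling.
describe 0 500 1000 running-only no       # prints "missed"
describe 0 500 1000 include-stopped no    # prints "collected"
describe 0 500 1000 include-stopped yes   # prints "missed"
```

The last case corresponds to the condition stated above: a container that is removed before the next poll leaves nothing for the Agent to read.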

Let me know if that clarifies the situation

@NBParis do you know which agent release will have short lived container logs covered?
I tested 6.14.0 and it's still losing all logs from init containers and job containers

So it should definitely be covered with 6.14 so that's strange.
Would you mind sending an email to [email protected] with the following information:

  • An Agent flare
  • The result of docker logs for the given init or job containers

This will be a great help so we can investigate why this is still happening. In our tests, a container that lived for only a couple of milliseconds still had its logs collected.

@NBParis sure, I will provide everything you need to reproduce the issue to [email protected]

I am getting the same issue too. I ran a job on Kubernetes which failed, and I am not able to see the logs for the failed job in Datadog, while kubectl logs pod-name gives me the logs and the trace. I will also create a ticket and send a flare with relevant information.

@Krishna1408 did you get a solution to this? We are seeing the same problem.

Pods that we run as Jobs (cronjobs or one-offs) are not getting their logs picked up when they run for a very short period of time (~3s). On the other hand, if I run the "job" as a pod (which "fails" after 3s and is repeatedly restarted by kubernetes), logs show up for all of these runs in Datadog.

Hi @awesometown
Are you familiar with this part of the documentation? https://docs.datadoghq.com/agent/kubernetes/daemonset_setup/?tab=k8sfile#short-lived-containers There's also a note just above that applies in case you mount the docker socket as well. If that's not helpful, I encourage you to reach out to support with an agent flare, so we can look at your agent configuration. We did some work on that topic recently and this is expected to work.

Hi @awesometown

My issue was also solved using the solution @hkaj mentioned in the comment above.

I've been through that part of the documentation and think we are setting things up appropriately, but we are using the Helm chart to install everything so it's possible we have missed something. Based on the volumes for the agent, it appears that both k8s file and the docker socket have been mounted.

I've opened a support ticket with a flare so will see if that helps get to the bottom of things.

Support pointed me at the DD_LOGS_CONFIG_K8S_CONTAINER_USE_FILE environment variable which I now see is mentioned as "mandatory" for short-lived containers further back in this thread. Setting that flag seems to have resulted in logs being consistently collected.

I would say that this is still not clear from the documentation though. The docs around that flag say to set it _"If you do want to collect logs from /var/log/pods even if the Docker socket is mounted"_ and there is a section above that that gives some reasons why I'd want to use log file collection over docker socket collection, but short-lived containers are not mentioned there.

On the other hand, the "Short Lived Containers" section says under "Docker Socket" that _"Since Agent v6.14+, the Agent collects logs for all containers (running or stopped) ... short lived containers logs that have started and stopped in the past second are still collected as long as they are not removed"_, which makes it sound like collection via the socket should "just work".

@awesometown, I confirm that your finding is applicable to Kubernetes using agent 7.+. The only way to collect short-lived container logs is to set the DD_LOGS_CONFIG_K8S_CONTAINER_USE_FILE variable to true.

It took me some time to figure out how to inject DD_LOGS_CONFIG_K8S_CONTAINER_USE_FILE in a helm chart.

The obvious location for passing environment variables in the helm chart

datadog:
  env:
  - name: DD_LOGS_CONFIG_K8S_CONTAINER_USE_FILE
    value: "true"

configures initContainers.

The right configuration for containers is

agents:
  containers:
    agent:
      env:
      - name: DD_LOGS_CONFIG_K8S_CONTAINER_USE_FILE
        value: "true"
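
One way to apply the override above without editing the main values file is to write it to a separate file and pass that file to Helm. The sketch below assumes the `agents.containers.agent.env` path reported in this comment is correct for your chart version, which has not been verified here; `write_use_file_values`, the file name, and the release and chart placeholders are all hypothetical.

```shell
# write_use_file_values PATH
# Writes a Helm values override that sets the variable on the agent container.
# The key path is taken from the comment above and is not verified against
# any particular chart version.
write_use_file_values() {
    cat > "$1" <<'EOF'
agents:
  containers:
    agent:
      env:
        - name: DD_LOGS_CONFIG_K8S_CONTAINER_USE_FILE
          value: "true"
EOF
}

# Usage (placeholders in angle brackets must be replaced):
# write_use_file_values use-file-values.yaml
# helm upgrade <release> <chart> -f use-file-values.yaml
```

After the upgrade, inspecting the environment of the running agent container is a quick way to confirm the variable landed on the right container rather than on the init containers.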

For folks using helm, it appears that setting 'logs_config.k8s_container_use_file: true' is a potential solution (but I have not yet verified it).

Has anybody in the community attempted this? I'm also attaching the reference documentation from Datadog.

Reference: https://docs.datadoghq.com/agent/kubernetes/log/?tab=daemonset
