Datadog-agent: No logs from short-lived pod from Kubernetes cronjob

Created on 7 Dec 2018  路  14Comments  路  Source: DataDog/datadog-agent

I have a cronjob and a deployment both using the same image. Logs arrive from the deployment's pods, but not from the cronjob's pods.

I have datadog deployed into the cluster using the helm chart (chart version 1.9.2, agent version 6.5.2).

My configuration looks as follows:

confd:
  myservice.yaml: |-
    init_config:
    logs:
     - type: docker
       image: myimage
       service: myservice
       source: mysource

which as I understand it should grab everything from all pods using that image.

My suspicion was that the cronjob pods were starting, logging their "nothing to do" message, and going away before the agent noticed them. But I've made the sleep for 30s first, but still with the same result. I have tried replacing the job command with echo test1 && sleep 30 && ... to run for a few minutes, and still nothing arrives in the log explorer. I see the expected output if I kubectl logs the pods themselves.

I poked into the logs of the agent (having set the log level to DEBUG). Grepping for the docker container ID from a cronjob pod gets the following:

$ kubectl -n system logs datadog-datadog-kfwhp | grep 946ca067941a9f660a74cacc233968708b4666c257727453a4f18c4fb6226702
[ AGENT ] 2018-12-07 11:35:23 UTC | DEBUG | (network.go:58 in findDockerNetworks) | Container 946ca067941a9f660a74cacc233968708b4666c257727453a4f18c4fb6226702 uses the network namespace of container:53f4640aa42adc16aa9afcc86f274a74d5ffe486b506ed383043e978e0a23d4e
[ AGENT ] 2018-12-07 11:35:23 UTC | DEBUG | (tagger.go:246 in Tag) | cache miss for kubelet, collecting tags for docker://946ca067941a9f660a74cacc233968708b4666c257727453a4f18c4fb6226702
[ AGENT ] 2018-12-07 11:35:23 UTC | DEBUG | (tagger.go:246 in Tag) | cache miss for kube-metadata-collector, collecting tags for docker://946ca067941a9f660a74cacc233968708b4666c257727453a4f18c4fb6226702
[ AGENT ] 2018-12-07 11:35:23 UTC | DEBUG | (tagger.go:250 in Tag) | entity docker://946ca067941a9f660a74cacc233968708b4666c257727453a4f18c4fb6226702 not found in kube-metadata-collector, skipping: "docker://946ca067941a9f660a74cacc233968708b4666c257727453a4f18c4fb6226702" not found
[ AGENT ] 2018-12-07 11:35:28 UTC | DEBUG | (autoconfig.go:676 in processNewService) | Unable to fetch templates from the cache: AD id docker://946ca067941a9f660a74cacc233968708b4666c257727453a4f18c4fb6226702 not found in cache
[ AGENT ] 2018-12-07 11:35:28 UTC | INFO | (scheduler.go:63 in Schedule) | Received a new service: docker://946ca067941a9f660a74cacc233968708b4666c257727453a4f18c4fb6226702
[ AGENT ] 2018-12-07 11:40:28 UTC | INFO | (scheduler.go:90 in Unschedule) | New service to remove: entity: docker://946ca067941a9f660a74cacc233968708b4666c257727453a4f18c4fb6226702
[ AGENT ] 2018-12-07 11:42:53 UTC | DEBUG | (tagger.go:246 in Tag) | cache miss for kube-metadata-collector, collecting tags for docker://946ca067941a9f660a74cacc233968708b4666c257727453a4f18c4fb6226702
[ AGENT ] 2018-12-07 11:42:53 UTC | DEBUG | (tagger.go:250 in Tag) | entity docker://946ca067941a9f660a74cacc233968708b4666c257727453a4f18c4fb6226702 not found in kube-metadata-collector, skipping: "docker://946ca067941a9f660a74cacc233968708b4666c257727453a4f18c4fb6226702" not found
[ AGENT ] 2018-12-07 11:42:53 UTC | DEBUG | (tagger.go:246 in Tag) | cache miss for docker, collecting tags for docker://946ca067941a9f660a74cacc233968708b4666c257727453a4f18c4fb6226702
[ AGENT ] 2018-12-07 11:42:53 UTC | DEBUG | (tagger.go:246 in Tag) | cache miss for kubelet, collecting tags for docker://946ca067941a9f660a74cacc233968708b4666c257727453a4f18c4fb6226702

Should I be concerned by the

entity docker://... not found in kube-metadata-collector, skipping: "docker://..." not found

line? Might that explain why it is not doing anything with the logs from this container? These logs at least suggest to me that the agent is noticing the pods (that is, I think my "are the pod going away before they are noticed" hypothesis is wrong).

What else is worth investigating, and what else might be causing the agent to ignore or drop the log output from my cronjob's pods?

componenlogs

Most helpful comment

Check out this issue https://github.com/DataDog/datadog-agent/issues/3077 - we were finally able to resolve this by updating to agent version 6.13.0 and making sure our tags were correct.

All 14 comments

Hey @plumdog, thanks for opening this !
I believe we do not recommend to use the image names in the conf files to base the log tailing - Rather, it's better to use Docker labels or Kubernetes annotations.
I think this section of the doc is relevant: https://docs.datadoghq.com/logs/log_collection/docker/?tab=dockerfile#short-lived-containers
As well as the section above it on configuring the autodiscovery based off of labels.

Let me know if that helps.
Best,
.C

@CharlyF OK, great - I think switching to use auto discovery has done it :tada:

The crucial bits of the docs (https://docs.datadoghq.com/agent/autodiscovery/?tab=kubernetes) for me were:

ad.datadoghq.com/<container identifier>.logs: '[<LOG_CONFIG>]'

and

For Annotations, Autodiscovery identifies containers by name, NOT image (as it does for auto-conf files and key-value stores). That is, it looks to match <container identifier> to .spec.containers[0].name, not .spec.containers[0].image.

So in my case, I had to faff about with names to make them match between the annotation and my containers (which were different between the deployment and the cronjob).

But having the autodiscovery annotations does feel like a much more sensible way of configuring this kind of thing anyway.

Hmmm - I may have been mistaken. It would seem the short-lived-ness is still relevant.

A cronjob that looks like:

echo test1 && sleep 20 && echo test2 && sleep 20 && echo test3

reliably does log the output to datadog (now using autodiscovery). The DD agent has logs that look like:

[ AGENT ] 2018-12-10 14:00:31 UTC | DEBUG | (kubelet.go:198 in createService) | No ports found for pod mypod
[ AGENT ] 2018-12-10 14:00:32 UTC | DEBUG | (datadog_agent.go:153 in LogMessage) | (prometheus.py:375) | No corresponding usage found for metric kubernetes.filesystem.usage_pct and container mypod_myns_somehash_0, skipping usage_pct for now.

However, a cronjob that looks like:

echo test1 && sleep 1 && echo test2 && sleep 1 && echo test3

reliably does _not_ long the output to datadog. No similar logs about "No ports found for pod mypod" or "No corresponding usage" appear in the agent's logs.

But the bit of the docs you linked says:

add this in your datadog.yaml file (it is automatically added for the containerized version)

so I would expect it to be enabled and so the "short lived containers are detected as soon as they are started to ensure their logs are properly collected" should apply for me, but it appears not to be. Might I be missing something? Can I check if the

listeners:
  - name: docker
config_providers:
  - name: docker
    polling: true

bit of config is applied for me (grepping the logs for config_providers or polling or listeners finds nothing)?


Edit: this page (https://docs.datadoghq.com/logs/faq/log-collection-troubleshooting-guide/#configuration-issues-1) seems to think it should be name: kubelet instead. At any rate, running cat /etc/datadog/agent/datadog.yaml in the agent's container finds

# Autodiscovery for Kubernetes
listeners:
  - name: kubelet
config_providers:
  - name: kubelet
    polling: true

so I think that's all as expected. So I'm still none-the-wiser.

Edit2: so my horrible hackaround that seems to work is to run the command I actually want to run, then do a sleep 60. Something like:

command: ['sh', '-c', '/app/mycommand --opt thing; sleep 60']

Which is yucky, and means I lose the exit code of the actual command, but the logs are captured as I need. Though what is actually happening and what the actual fix is, I'm still clueless.

@plumdog thanks for reporting. We are having a look and will reach back.
I agree that adding a sleep 60 is not a solution. Out of curiosity, how much time do your short lived tasks last for?

@tmichelet The cronjob in question takes < 1s when it decides there is nothing to do. It is a: "do a DB query, process each matched thing" type of cronjob, so if the query finds nothing, it does nothing very quickly and logs a "Found 0 things to process" message. Of course, I still want to see that log message because knowing "it ran and did nothing" is different to not knowing if it ran at all.

Though really it's that

echo test1 && sleep 20 && echo test2 && sleep 20 && echo test3

logged everything and that

echo test1 && sleep 1 && echo test2 && sleep 1 && echo test3

logged nothing that I find most surprising.

Hey @plumdog, apologies for the delay.
So indeed we have some limitations at the moment due to the "polling" approach when collecting logs.
Discussing with the team, it seems like below 20 seconds of lifespan, we can't guarantee a consistent collection of the logs at the moment.
We are working towards supporting use cases like yours though, our first step was to have independent polling frequencies for our different config providers (https://github.com/DataDog/datadog-agent/pull/2555).
We are also testing a file based tailing for the Kubernetes logs internally and are hoping to release it soon. The lifespan of the file storing the logs would be configurable as well as you can see here https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#job-termination-and-cleanup.
Nevertheless, we are still met with the challenge of the metadata collection - As we need to match the container/pod id in the log path and from the kubelet api / docker socket.
Lastly, we are looking into collecting logs from stopped containers.

Apologies for the misunderstanding here, feel free to open a ticket with our support team so we can keep you posted as we make progress on the featureset.

Best,
.C

OK, understood. For all my workaround is grotty, it is reliable, and probably fine at the scale I'm operating at.

And I completely do not envy you having to solve that problem.

Many thanks for taking the time to investigate.

Glad to hear - Thanks for reaching out again. I linked up this issue internally so we can track it down.
I am going to close this for now, feel free to reach out to us should you have any questions/issues.

Best,
.C

Hey @CharlyF , we are really interested in any solution you might provide for short live containers such as crashing ones or really short jobs.

We decreased the pollingInterval through the DD_AD_CONFIG_POLL_INTERVAL env variable in our Kubernetes cluster (with a 6.8.3 agent) but at the scale at which we operate, this is obviously not enough. Any news or potential solutions for these short lived containers?

In the meantime, do you see any reason why not to play with DD_AD_CONFIG_POLL_INTERVAL?

Best,
Th茅o

What is the conclusion. Should we change DD_AD_CONFIG_POLL_INTERVAL? What's the sintax?

Which is yucky, and means I lose the exit code of the actual command, but the logs are captured as I need. Though what is actually happening and what the actual fix is, I'm still clueless.

While Datadog works on this issue, our team mitigate this issue like this聽:

  1. Leaving the DD_AD_CONFIG_POLL_INTERVAL to its original value
  2. Decorate the original command like this (inspired by @plumdog proposal)
command:
- sh
- -c 
- sleep 30; <ORIGINAL_COMMAND>; status=$?; sleep 30; exit $status;

Time padding before the command to let the agent be aware about new pod to watch
Time padding after the command to let the agent some time to grab the logs
This also keep tracks of the original's command exit code
(Hey! Datadog team, feel free to correct me if I鈥檓 wrong on the explanation below)

I just ran into this issue as well for very short lived pods. @CharlyF can you give us an update on the development status?

Check out this issue https://github.com/DataDog/datadog-agent/issues/3077 - we were finally able to resolve this by updating to agent version 6.13.0 and making sure our tags were correct.

Ah very interesting, thanks I will give this a try.

Was this page helpful?
0 / 5 - 0 ratings