Nomad: When running the docker plugin with short-lived jobs, the stdout/stderr log files will occasionally be empty

Created on 12 Jan 2020 · 5 comments · Source: hashicorp/nomad

Nomad version

0.10.0-dev

Operating system and Environment details

Debian 10 on AWS

Issue

On very short runs, the dockerLogger fails to attach to docker's std streams, resulting in empty stdout/stderr logs.

We noticed in our tests that, when spinning up jobs that are supposed to perform very short operations and stop, we occasionally get empty stdout/stderr log files.

This behavior does not reproduce 100% of the time. As far as we can tell, the randomness stems from the nextBackoff function in https://github.com/hashicorp/nomad/blob/master/drivers/docker/docklog/docker_logger.go#L70 .
The dockerLogger retries attaching to the docker streams every backoff seconds.
Backoff is calculated as: backoff * 1.15 * (1.0 + rand.Float64())
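
To make the race concrete, here is a minimal sketch of that retry schedule (the initial backoff value and the attempt count are assumptions for illustration, not Nomad's actual constants):

```go
package main

import (
	"fmt"
	"math/rand"
)

// nextBackoff applies the jitter formula quoted above, so each retry
// waits roughly 1.15x to 2.3x longer than the previous one.
func nextBackoff(backoff float64) float64 {
	return backoff * 1.15 * (1.0 + rand.Float64())
}

func main() {
	backoff := 0.5 // assumed initial backoff in seconds
	elapsed := 0.0

	for attempt := 1; attempt <= 5; attempt++ {
		elapsed += backoff
		fmt.Printf("attach attempt %d at ~t=%.2fs (waited %.2fs)\n",
			attempt, elapsed, backoff)
		backoff = nextBackoff(backoff)
	}
	// A container that runs for only a few hundred milliseconds can exit
	// before an attach attempt fires, so its streams are never attached
	// and the stdout/stderr files stay empty.
}
```

Because of the random jitter, the attach times differ from run to run, which matches the intermittent reproduction we see.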

This may also explain the closed issue https://github.com/hashicorp/nomad/issues/2457.

Even though a short-lived script doesn't seem like a logical use case for Nomad, we successfully reproduced the issue with a python script that has a syntax error; in that case stderr will be empty and there is no way to tell why the job failed.

Reproduction steps

Create a docker image with a python script that has no requirements and a very short execution time, such as:

```python
print("hello world")
```

or one with a syntax error:

```python
print "hello world"
```

Labels: theme/driver/docker · theme/logging · type/bug


All 5 comments

Thanks for reporting this, @Manicqin!

Even though a short-lived script doesn't seem like a logical use case for Nomad

Lots of users use Nomad for batch workloads that can be short-lived, so this is definitely something we care about. We'll investigate and see if we can come up with a solution.

I'm experiencing the same with raw_exec jobs.
In my case the script I'm running exits fast due to a missing license, and the logs are empty. I tried adding a sleep period as a step in the script, but for some reason it didn't solve the issue.

I think I figured out the reason for my issue.
Looks like garbage collection kicks in immediately after the job completes, causing my API request for the logs to fail.

I checked the logs and found

May 2 08:28:46 osboxes nomad[32344]: 2020-05-02T08:28:46.877-0400 [WARN] client.gc: garbage collection skipped because no terminal allocations: reason="inode usage of 78 is over gc threshold of 70"

Solved by increasing the threshold:
https://www.nomadproject.io/docs/configuration/client/#gc_inode_usage_threshold
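
For illustration, here is a minimal sketch of the kind of log request that starts failing once the allocation is garbage collected, using the /v1/client/fs/logs endpoint; the address, allocation ID, and task name below are placeholders:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Placeholders; in practice these come from `nomad job status`
	// and `nomad alloc status`.
	allocID := "00000000-0000-0000-0000-000000000000"
	taskName := "mytask"

	url := fmt.Sprintf(
		"http://127.0.0.1:4646/v1/client/fs/logs/%s?task=%s&type=stderr&origin=start&plain=true",
		allocID, taskName)

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Once the allocation has been garbage collected, this request
	// returns an error response instead of the log contents.
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("status=%s\n%s\n", resp.Status, body)
}
```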

@hmarko75 My experience was almost the same. In my case the OOM killer was the cause, rather than the gc threshold. I learned this way that Nomad allocates a default amount of resources to a job (rather than unlimited).

Just a 👍 for seeing this on 0.10.5/0.11.3 as well
