Nomad version: 0.10.0-dev
Operating system: Debian 10 on AWS
On very short runs, the dockerLogger fails to attach to docker's std streams, resulting in empty stdout/stderr logs.
We noticed in our tests that when spinning up jobs that perform very short operations and then stop, we occasionally get empty stdout/stderr log files.
This behavior does not reproduce 100% of the time; as far as we can tell, the randomness stems from the nextBackoff function in https://github.com/hashicorp/nomad/blob/master/drivers/docker/docklog/docker_logger.go#L70 .
The dockerLogger will try to attach to the docker streams every `backoff` seconds.
The next backoff is calculated as `backoff * 1.15 * (1.0 + rand.Float64())`.
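To make the timing concrete, here is a minimal sketch of how that jittered backoff grows; the initial backoff of one second, the retry count, and the absence of a cap are assumptions for illustration, not the actual values in docker_logger.go:

```go
package main

import (
	"fmt"
	"math/rand"
)

// nextBackoff mirrors the jittered growth quoted above: each retry
// multiplies the current backoff by 1.15 and by a random factor in
// [1.0, 2.0). The starting value and loop bound below are assumptions
// for illustration; see docker_logger.go for the real implementation.
func nextBackoff(backoff float64) float64 {
	return backoff * 1.15 * (1.0 + rand.Float64())
}

func main() {
	backoff := 1.0 // assumed initial backoff, in seconds
	total := 0.0
	for attempt := 1; attempt <= 5; attempt++ {
		total += backoff
		fmt.Printf("attempt %d: wait %.2fs (cumulative %.2fs)\n", attempt, backoff, total)
		backoff = nextBackoff(backoff)
	}
	// A container that finishes in a fraction of a second can exit before
	// any attach attempt lands, which matches the empty stdout/stderr
	// files described above.
}
```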
This also explains the closed issue: https://github.com/hashicorp/nomad/issues/2457
Even though a short-lived script doesn't seem like a logical use case for Nomad, we successfully reproduced the issue with a Python script that has a syntax error; in that case stderr is empty and there is no way to tell why the job failed.
Create a docker image with a Python script that has no requirements and a very short execution time, such as:

```python
print("hello world")
```

or one with a syntax error:

```python
print "hello world"
```
Thanks for reporting this, @Manicqin!
> Even though a short-lived script doesn't seem like a logical use case for Nomad
Lots of users use Nomad for batch workloads that can be short-lived, so this is definitely something we care about. We'll investigate and see if we can come up with a solution.
I'm experiencing the same with raw_exec jobs.
In my case the script I'm running exits quickly due to a missing license, and the logs are empty. I tried adding a sleep period as a step in the script, but for some reason it didn't solve the issue.
I think I figured out the reason for my issue.
Looks like garbage collection kicks in immediately after the job completes, causing my API request for the logs to fail.
I checked the logs and found:

```
May 2 08:28:46 osboxes nomad[32344]: 2020-05-02T08:28:46.877-0400 [WARN] client.gc: garbage collection skipped because no terminal allocations: reason="inode usage of 78 is over gc threshold of 70"
```
Solved by increasing the threshold via the client's `gc_inode_usage_threshold` option:
https://www.nomadproject.io/docs/configuration/client/#gc_inode_usage_threshold
@hmarko75 My experience was almost the same. In my case the OOM killer was responsible, rather than the gc threshold. I learned this way that Nomad allocates a default amount of resources to a job (rather than unlimited).
Just a 👍 for seeing this on 0.10.5/0.11.3 as well.