Nomad: Bring back the reason a task was killed in the logs

Created on 26 Jun 2019 · 7 comments · Source: hashicorp/nomad

Nomad version

Nomad v0.9.3 (c5e8b66c3789e4e7f9a83b4e188e9a937eea43ce)

Issue

In our case it is very important to know when a task has been killed by the OOM killer.
Nomad has a metric that indicates a task was OOM-killed, but it currently doesn't work as expected:

  • it is incremented per allocation, not per task (yes, you can aggregate it by task, but see the next point)
  • it disappears quickly, so it is often impossible to tell after the fact that a task was killed
  • it has no initial value, which is a problem for Prometheus (https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics); a counter pre-initialization sketch follows this list
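
On that last point, a counter that is created lazily only appears in Prometheus after the first OOM kill. Below is a minimal sketch in Go, using the official Prometheus client; the metric name, port, and task list are assumptions on the exporter side, not anything Nomad exposes.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// oomKills counts OOM kills per task. Creating the label combinations up
// front makes each series appear with value 0, so Prometheus can detect the
// first increment ("avoid missing metrics").
var oomKills = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "nomad_task_oom_kills_total", // hypothetical exporter-side name
		Help: "Number of times a task was killed by the OOM killer.",
	},
	[]string{"task"},
)

func main() {
	prometheus.MustRegister(oomKills)

	// Hypothetical task names; in practice these would be discovered from Nomad.
	for _, task := range []string{"ebook-similarities-service-worker"} {
		oomKills.WithLabelValues(task) // creates the series at 0 without incrementing
	}

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9102", nil))
}
```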

For now we have to parse the Nomad logs, find lines like this:
2019/04/08 06:03:16.608873 [INFO] client: task "ebook-similarities-service-worker" for alloc "c5d0e905-946b-d847-412f-4d727da98ab2" failed: Wait returned exit code 137, signal 0, and error OOM Killed
and increment a counter in a Prometheus exporter.
This works well, but in v0.9.x I can't find this message in the log file anymore.
Could you please bring it back?
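
For reference, here is a minimal sketch of that workaround in Go, matching the pre-0.9 log line quoted above. The log path, regexp, and counting logic are ours, not part of Nomad.

```go
package main

import (
	"bufio"
	"log"
	"os"
	"regexp"
)

// oomLine matches the old client log line for an OOM-killed task and captures
// the task name. The pattern is illustrative, not something Nomad guarantees.
var oomLine = regexp.MustCompile(`task "([^"]+)" for alloc "[^"]+" failed: .*OOM Killed`)

func main() {
	f, err := os.Open("/var/log/nomad/nomad.log") // path is an assumption
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	counts := map[string]int{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if m := oomLine.FindStringSubmatch(scanner.Text()); m != nil {
			counts[m[1]]++ // m[1] is the task name
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	for task, n := range counts {
		log.Printf("task %q was OOM killed %d time(s)", task, n)
	}
}
```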

Reproduction steps

  • run a Docker task that gets killed by the OOM killer
  • the log files contain no indication of why the task was killed
Labels: stage/needs-discussion, theme/client, theme/metrics, type/enhancement

All 7 comments

The weird part is that the UI does show the message at the task level, though.

Is there any news? :)

Is there any news or a workaround?

We also relied on parsing Nomad logs for "OOM Killed", but that no longer works with newer releases. We are currently using v0.10.5.

It looks like this issue was addressed several years ago here:
https://github.com/hashicorp/nomad/issues/2203

and was moved to the docker plugin here:

https://github.com/hashicorp/nomad/blob/20f8227c0a9fa5bfff9414edbcafef0ee455c9a3/drivers/docker/handle.go#L204-L206

you get a werr (presumably the wait error) for it, but no log message.
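
For anyone who ends up here looking for where the OOM information lives: the Docker API reports it on the exited container. Here is a minimal sketch using the plain Docker Go SDK, not Nomad's driver code; the container, task, and alloc names are placeholders.

```go
package main

import (
	"context"
	"log"

	"github.com/docker/docker/client"
)

// logIfOOMKilled inspects an exited container and logs whether the kernel
// OOM killer terminated it. This only illustrates where the OOMKilled flag
// lives in the Docker API.
func logIfOOMKilled(ctx context.Context, cli *client.Client, containerID, taskName, allocID string) error {
	inspect, err := cli.ContainerInspect(ctx, containerID)
	if err != nil {
		return err
	}
	if inspect.State != nil && inspect.State.OOMKilled {
		log.Printf("[INFO] client: task %q for alloc %q failed: exit code %d, OOM Killed",
			taskName, allocID, inspect.State.ExitCode)
	}
	return nil
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The container ID would come from the driver/scheduler in a real setup.
	if err := logIfOOMKilled(context.Background(), cli, "your-container-id", "example-task", "example-alloc"); err != nil {
		log.Printf("[WARN] inspect failed: %v", err)
	}
}
```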

On a larger note, it seems to me that the allocation status should be logged somewhere. If the allocation gets GC'd before you have a chance to investigate, you've lost vital information and context. These statuses don't go to telemetry either, so unless you're actively scraping the API periodically, you will lose them.
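
A minimal sketch of that periodic scraping, using plain HTTP against the Nomad API at the default address; the polling interval and the subset of fields decoded are assumptions.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// allocStub decodes only a few fields from GET /v1/allocations; the full
// response contains much more.
type allocStub struct {
	ID           string
	ClientStatus string
	TaskStates   map[string]struct {
		State  string
		Failed bool
	}
}

func main() {
	for {
		resp, err := http.Get("http://127.0.0.1:4646/v1/allocations")
		if err != nil {
			log.Printf("[WARN] list allocations: %v", err)
			time.Sleep(30 * time.Second)
			continue
		}
		var allocs []allocStub
		if err := json.NewDecoder(resp.Body).Decode(&allocs); err != nil {
			log.Printf("[WARN] decode allocations: %v", err)
		}
		resp.Body.Close()

		// Record failed task states before the allocation is garbage-collected.
		for _, a := range allocs {
			for task, ts := range a.TaskStates {
				if ts.Failed {
					log.Printf("alloc %s task %s: state=%s failed=%v (client_status=%s)",
						a.ID, task, ts.State, ts.Failed, a.ClientStatus)
				}
			}
		}
		time.Sleep(30 * time.Second)
	}
}
```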

As noted, this is closed by #2203.

@tgross it's still broken, since the reason doesn't show up in the logs; not sure why you closed the issue.

> not sure why you closed the issue.

We're doing some cleanup of stale issues. It looked like the feature request had been resolved by #2203 (I don't see a bug here, not sure why it was labelled as such). I can reopen and mark for discussion in the roadmap.
