Nomad: Bring back the reason a task was killed in the logs

Created on 26 Jun 2019 · 7 comments · Source: hashicorp/nomad

Nomad version

Nomad v0.9.3 (c5e8b66c3789e4e7f9a83b4e188e9a937eea43ce)

Issue

In our case it is very important to know when a task has been killed by the OOM killer.
Nomad has a metric that indicates a task was OOM-killed, but it currently doesn't work as expected:

  • it is incremented per allocation, not per task (yes, you can aggregate it by task, but see the next point)
  • it disappears quickly, so it is often impossible to tell after the fact that a task was killed
  • it has no initial value, which is a problem for Prometheus (https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics); a counter pre-initialization sketch follows this list
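
On that last point, a counter that is created lazily only appears in Prometheus after the first OOM kill. Below is a minimal sketch in Go, using the official Prometheus client; the metric name, port, and task list are assumptions on the exporter side, not anything Nomad exposes.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// oomKills counts OOM kills per task. Creating the label combinations up
// front makes each series appear with value 0, so Prometheus can detect the
// first increment ("avoid missing metrics").
var oomKills = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "nomad_task_oom_kills_total", // hypothetical exporter-side name
		Help: "Number of times a task was killed by the OOM killer.",
	},
	[]string{"task"},
)

func main() {
	prometheus.MustRegister(oomKills)

	// Hypothetical task names; in practice these would be discovered from Nomad.
	for _, task := range []string{"ebook-similarities-service-worker"} {
		oomKills.WithLabelValues(task) // creates the series at 0 without incrementing
	}

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9102", nil))
}
```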

For now we have to parse the Nomad logs, find lines like this:
2019/04/08 06:03:16.608873 [INFO] client: task "ebook-similarities-service-worker" for alloc "c5d0e905-946b-d847-412f-4d727da98ab2" failed: Wait returned exit code 137, signal 0, and error OOM Killed
and increment a counter in a Prometheus exporter.
This works well, but in v0.9.x I can't find this message in the log file anymore.
Could you please bring it back?
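
For reference, here is a minimal sketch of that workaround in Go, matching the pre-0.9 log line quoted above. The log path, regexp, and counting logic are ours, not part of Nomad.

```go
package main

import (
	"bufio"
	"log"
	"os"
	"regexp"
)

// oomLine matches the old client log line for an OOM-killed task and captures
// the task name. The pattern is illustrative, not something Nomad guarantees.
var oomLine = regexp.MustCompile(`task "([^"]+)" for alloc "[^"]+" failed: .*OOM Killed`)

func main() {
	f, err := os.Open("/var/log/nomad/nomad.log") // path is an assumption
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	counts := map[string]int{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if m := oomLine.FindStringSubmatch(scanner.Text()); m != nil {
			counts[m[1]]++ // m[1] is the task name
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	for task, n := range counts {
		log.Printf("task %q was OOM killed %d time(s)", task, n)
	}
}
```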

Reproduction steps

  • run a Docker task that gets killed by the OOM killer
  • the log files contain no indication of why the task was killed
Labels: stage/needs-discussion, theme/client, theme/metrics, type/enhancement

All 7 comments

The weird part is that the UI does show the message at the task level, though.

Is there any news? :)

Is there any news or a workaround?

We also relied on parsing Nomad logs for "OOM Killed", but that no longer works with newer releases. We are currently using v0.10.5.

It looks like this issue was addressed several years ago here:
https://github.com/hashicorp/nomad/issues/2203

and was moved to the docker plugin here:

https://github.com/hashicorp/nomad/blob/20f8227c0a9fa5bfff9414edbcafef0ee455c9a3/drivers/docker/handle.go#L204-L206

you get a werr (presumably the wait error) for it, but no log message.
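
For anyone who ends up here looking for where the OOM information lives: the Docker API reports it on the exited container. Here is a minimal sketch using the plain Docker Go SDK, not Nomad's driver code; the container, task, and alloc names are placeholders.

```go
package main

import (
	"context"
	"log"

	"github.com/docker/docker/client"
)

// logIfOOMKilled inspects an exited container and logs whether the kernel
// OOM killer terminated it. This only illustrates where the OOMKilled flag
// lives in the Docker API.
func logIfOOMKilled(ctx context.Context, cli *client.Client, containerID, taskName, allocID string) error {
	inspect, err := cli.ContainerInspect(ctx, containerID)
	if err != nil {
		return err
	}
	if inspect.State != nil && inspect.State.OOMKilled {
		log.Printf("[INFO] client: task %q for alloc %q failed: exit code %d, OOM Killed",
			taskName, allocID, inspect.State.ExitCode)
	}
	return nil
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The container ID would come from the driver/scheduler in a real setup.
	if err := logIfOOMKilled(context.Background(), cli, "your-container-id", "example-task", "example-alloc"); err != nil {
		log.Printf("[WARN] inspect failed: %v", err)
	}
}
```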

On a larger note, it seems to me that the allocation status should be logged somewhere. If the allocation gets GC'd before you have a chance to investigate, you've lost vital information and context. These statuses don't go to telemetry either, so unless you're actively scraping the API periodically, you will lose them.
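
A minimal sketch of that periodic scraping, using plain HTTP against the Nomad API at the default address; the polling interval and the subset of fields decoded are assumptions.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// allocStub decodes only a few fields from GET /v1/allocations; the full
// response contains much more.
type allocStub struct {
	ID           string
	ClientStatus string
	TaskStates   map[string]struct {
		State  string
		Failed bool
	}
}

func main() {
	for {
		resp, err := http.Get("http://127.0.0.1:4646/v1/allocations")
		if err != nil {
			log.Printf("[WARN] list allocations: %v", err)
			time.Sleep(30 * time.Second)
			continue
		}
		var allocs []allocStub
		if err := json.NewDecoder(resp.Body).Decode(&allocs); err != nil {
			log.Printf("[WARN] decode allocations: %v", err)
		}
		resp.Body.Close()

		// Record failed task states before the allocation is garbage-collected.
		for _, a := range allocs {
			for task, ts := range a.TaskStates {
				if ts.Failed {
					log.Printf("alloc %s task %s: state=%s failed=%v (client_status=%s)",
						a.ID, task, ts.State, ts.Failed, a.ClientStatus)
				}
			}
		}
		time.Sleep(30 * time.Second)
	}
}
```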

As noted, this is closed by #2203.

@tgross it's still broken, since the reason doesn't show up in the logs; not sure why you closed the issue.

> not sure why you closed the issue.

We're doing some cleanup of stale issues. It looked like the feature request had been resolved by #2203 (I don't see a bug here, not sure why it was labelled as such). I can reopen and mark for discussion in the roadmap.
