Nomad: Nomad doesn't observe OOM killed or SIGKILL processes

Created on 12 Mar 2019 · 12 comments · Source: hashicorp/nomad

This is somewhat related to #5363 and #5395, but I want to create a new issue to steer the discussion more precisely toward what I've been able to reproduce.

I've been able to repro this in:

  • Nomad v0.8.7
  • Nomad v0.9.0-beta2
  • Nomad v0.9.0-beta3

Example Nomad job spec for reproduction (the spec itself isn't the cause):

```hcl
job "hello" {
  type        = "service"
  datacenters = ["dc1"]

  task "hello" {
    driver = "raw_exec"

    config {
      command = "hello-signals"
    }
  }
}
```

The test binary I'm using is written in Go and its source is:

```go
package main

import (
    "flag"
    "fmt"
    "os"
    "os/signal"
)

var ignore bool

func init() {
    flag.BoolVar(&ignore, "ignore", false, "ignore signals")
}

func main() {
    flag.Parse()

    done := make(chan struct{}, 1)

    c := make(chan os.Signal, 1)
    signal.Notify(c)

    go func() {
        for {
            s := <-c
            fmt.Printf("caught signal: %d (%s)\n", s, s.String())
            if !ignore {
                done <- struct{}{}
            } else {
                fmt.Println("-ignore flag passed, ignoring signals")
            }
        }
    }()

    fmt.Println("Hello World!")

    fmt.Println("waiting for signal...")
    <-done
}
```

Steps to repro:

  • Run Nomad in dev mode: `./nomad agent -dev`
  • Deploy the job: `./nomad run hello.nomad`
  • Kill the Nomad executor or the `hello-signals` process with `kill -KILL $(pidof hello-signals)` or `pkill -KILL -f 'nomad executor'`.

Observe that the Nomad UI, `./nomad status`, and `./nomad alloc status` all still report the Job/Task as "Running".

Tested with both the `raw_exec` and `exec` drivers. I had a quick cursory glance at the codebase looking for evidence of `waitpid()` being called and I can see `os/exec.Cmd.Wait()` being called, but I'm not too familiar with the code structure to dive deeper right now.

Labels: theme/client, type/bug

Most helpful comment

This is a critical issue for a scheduler. Can it be treated with higher priority ? Thanks.

All 12 comments

Thanks for providing a great reproduction case. We're looking into this.

Hey @prologic,

Thanks for the test case, this bug is pretty interesting.

While running on macOS, on killing the binary I see the behaviour we expect (though there is still a bug, (1) below, with executor shutdown):
[screenshot omitted]

However, when running on Ubuntu 18.04 I found two distinct failure modes.

1) On macOS and Linux: when a Nomad executor is SIGKILL'd, the Nomad client detects that the executor connection has died, then restarts the task as expected. While this is almost the correct behaviour, it leaks the process that the executor had originally started, and that process will never be terminated.

2) On Linux (but not macOS): when a task process is SIGKILL'd, the executor supervisor does not always seem to detect this (I saw this behaviour ~twice in a hundred or so attempts), and the client continues to believe that the process is running successfully.

I'll take a further look into these this week and try to land a fix before the Nomad 0.9 RC1.

@dantoml could you confirm this is the same issue as #5363?

Hey @dantoml, 2) is precisely what we have observed. Thanks for confirming! We are unable to really utilise macOS in any useful way as we're primarily using the `exec` driver. But in general I also notice the leaked executor and logmon processes.

@tzahimizrahi Very likely the same; but to be precise, I believe #5363 is the symptom.

This is a critical issue for a scheduler. Can it be treated with higher priority ? Thanks.

@prologic I'm curious as to see more of your nomad setup if you're seeing this consistently. Could you possibly share your client configuration and OS? (I'm trying to get a more reproducible failure case running for the undetected child termination now)


What precisely do you want a repro of? I thought my repro case above was pretty clear. The config/OS I used above was `nomad agent -dev` on CentOS 7.5-1805.

I can see there is a PR up to address part of this problem (but not fully) #5419

@prologic OS config was what I was looking for, thanks :) - I've been failing to repro completely undetected loss of the user process again, which is quite unfortunate, and was hoping there might be a pointer there, but CentOS behaved as expected in my test.

I'll spin up a test harness and see if I can automate it.

@prologic Thanks for your patience on this one; it does sound like a very serious bug and we are prioritizing it. However, I feel like there are some confounding issues that may mask what's going on.

First, how often do you notice nomad not detecting a process being killed out-of-band and how reproducible is it in your environment? Would you be able to demonstrate the case in a Vagrant setup?

I have attempted to reproduce the issue in https://github.com/notnoop/nomad-repro-5408 . In my testing, Nomad was always able to detect that the job was killed when I sent SIGKILL to either the Nomad executor or `hello-signals`, though the job status may be marked as "running" while the alloc is due to be restarted. With a different restart/reschedule policy, the job status may transition to the pending/failed state. In the UI, the allocation client status appears under the job details or alloc page.

The above behavior of job status is a confusing UX problem, where job status doesn't obviously convey the concrete status of its allocations, and we should address it - but want to ensure I'm not missing other more substantial problems.

Did you find cases where a task process (e.g. `hello-signals` here) is killed, yet `nomad alloc status` emits no "Terminated" event and no log line like the following appears (in Nomad 0.8)?

    2019/03/20 21:37:26.947950 [INFO] client: task "hello" for alloc "1b0ba82f-6bc3-f056-043a-93c37b4432f5" failed: Wait returned exit code 137, signal 9, and error <nil>

Would love to get the client logs for this case; and if you can contribute a script or reproduce the issue in that repo with sample nomad client logs, that will be great!

@notnoop this is not a proper response; I'll try to find the time next week to go over the repro again and share more of what I find.

Closing this issue as it's stale now and I believe that the confusing UX is what's at play here. Please reopen with any info you have to the contrary and we will follow up.

That being said, we agree that #5363 and #5395 need to be addressed, and sorry that the UX is getting in the way of distilling/reproducing it.
