Nomad: Kill Allocations when client is disconnected from servers

Created on 12 Jan 2017 · 14 Comments · Source: hashicorp/nomad

Nomad servers replace the allocations running on a node when the client misses heartbeats. The client may merely be partitioned from the servers, or it may actually be dead; a disconnected client does not mean its allocations have stopped running. This can be a problem when certain applications need only a fixed number of shards running.

Nomad will solve the above problem by allowing users to configure a duration at the task group level after which the client kills that group's allocations once it has been disconnected from the servers. In cases where the client itself is dead, drivers like exec or raw_exec, which use an executor to supervise processes, will kill the process and exit.
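
A sketch of what such a task-group-level setting might look like; the attribute name, placement, and values are illustrative assumptions, not a committed design:

job "shards" {
  group "shard" {
    # Hypothetical attribute: kill this group's allocations once the client
    # has been disconnected from the servers for longer than this duration.
    stop_after_client_disconnect = "5m"

    task "worker" {
      driver = "raw_exec"

      config {
        command = "/usr/local/bin/shard-worker"
      }
    }
  }
}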

Labels: theme/client, type/enhancement


All 14 comments

Maybe #2184 and #2176 were filed because of this problem?

@diptanu A related question:
To my understanding, the Nomad client currently restarts all tasks when it reconnects to the server after a connection loss, even if it was only a short network problem.
Is that by design? It seems more logical to just keep the tasks running after reconnecting.

Workaround till 0.6 release:

Script in crontab

#!/usr/bin/env sh
# Point the CLI at the cluster; fill in the address for your environment.
export NOMAD_ADDR=...

# Re-submit every job file so Nomad restores anything that was killed.
for file in ~/nomad/*; do
  nomad run "$file"
done

@drscre Not quite. It depends on how long the connection loss is. The clients heartbeat to the servers every 15-45 seconds, depending on the size of the cluster. If a heartbeat is missed, the server marks that node as down and will replace its allocations. When the node comes back, it will detect that it shouldn't be running those allocations and kill them.

If you lose and regain the connection within a heartbeat, nothing will be restarted.
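
If short network blips are the main worry, one knob on the server side is the heartbeat grace period; a minimal sketch of a server agent config, assuming the server stanza's heartbeat_grace attribute and an illustrative value:

# Nomad server agent configuration (HCL). heartbeat_grace extends how long a
# missed heartbeat is tolerated before a node is marked down and its
# allocations are rescheduled; the value below is only an example.
server {
  enabled         = true
  heartbeat_grace = "30s"
}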

I asked something similar but it was refused. Bad judgement IMHO.
Issue #2069
I had to work around this bug myself using my own raw_exec script.
Having containers running in the cluster without any control over them is a serious bug.
Imagine a cluster of 100 machines with containers running and no way for the admin to terminate them.

IMHO, whenever a lost connection, a killed agent, or any other Nomad issue prevents the admin from terminating tasks remotely, the tasks on the nodes must kill themselves immediately.

On a 100-machine cluster, these issues will happen on a daily basis.

@dadgar
Replacing allocations on just a single missed heartbeat looks very conservative. The client doesn't seem to be resilient to connection problems at all. The connection can disappear, or just lag for a very short period, and by bad luck coincide with the heartbeat.

It would be great to have a configurable timeout for cases where the exact number of running tasks is not important and the network is overloaded or laggy.

Also, it seems to me that it's better for the Nomad server to kill allocations not at the time the node is lost, but when the node comes back, or after a configurable timeout.
I can clearly see the benefit in the single-node case :-). Nomad won't even have to kill the tasks when the node comes back quickly, because it will see that the tasks are already running.

This would be really helpful for running virtual machines (QEMU) in Nomad. Usually the virtual machine disk image is on shared storage, and we want to keep exactly one instance of a particular virtual machine running in the whole cluster; otherwise two instances of a VM writing to the same disk image will cause data corruption. For now we have to use the exec driver with "consul lock ... qemu-kvm ..." to work around this problem.
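
For reference, the workaround described above might look roughly like this as a job spec; the lock prefix, paths, and QEMU arguments are made-up placeholders:

job "my-vm" {
  group "vm" {
    task "qemu" {
      driver = "exec"

      config {
        command = "/usr/bin/consul"
        # "consul lock" holds a Consul lock while the child process runs, so
        # at most one qemu-kvm instance writes to the shared disk image.
        args = [
          "lock", "locks/my-vm",
          "qemu-kvm", "-drive", "file=/shared/my-vm.qcow2,format=qcow2",
        ]
      }
    }
  }
}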

Is there any setting for the Nomad client (or will there be in the future) that allows the client to kill all allocations after it has been disconnected from the servers for a certain amount of time?
I know it will kill all containers when it reconnects, but it would be nice if the client could kill everything and do the cleanup without waiting for a reconnection.

@jfvubiquity there currently isn't. This would be the issue to watch for that feature.

I think there may be two types of workloads: "at least N copies of an instance" and "exactly N copies of an instance". Being able to kill allocations would help with the second use case.

I think it would be helpful for Nomad to provide a semaphore-like resource declaration to solve this problem. For example,

job "example" {
  group "example" {
    task "server" {
      resources {
        cpu    = 100
        memory = 256

        semaphore {
            name = "xxx"
            slots = 3
            consume = 1
            lease_timeout = "60s"
        }

        network {
          mbits = 100
          port "http" {}
          port "ssh" {
            static = 22
          }
        }
      }
    }
  }
}

So Nomad can use Raft to implement a semaphore; each instance of the task will consume one slot. When the client is offline for 60s, it loses the semaphore and kills the allocation, and Nomad will be able to create a new instance on another node. Another way is to integrate the Consul lock interface into the job specification.
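
Until something like that exists in Nomad itself, Consul's lock command can already provide semaphore semantics from inside a task; a rough sketch, where the key prefix, slot count, and wrapped command are assumptions:

#!/usr/bin/env sh
# "consul lock -n 3" acquires one of 3 semaphore slots under the given prefix
# and releases it when the wrapped process exits or the Consul session is
# lost (for example, when the node stays disconnected past the session TTL).
exec consul lock -n 3 semaphores/xxx /usr/local/bin/run-server.sh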

@edwardbadboy your idea is just great. Currently I need to implement locks manually via a MongoDB database (going to replace it with Consul locks now), but Nomad locking might solve a lot of my problems and reduce system complexity.

I can see the benefit of the proposed changes in this thread and I would like to have them too. That said, I'd like to be able to completely opt out of such timeout/cleanup behavior.

From a failure-handling PoV, given such features, say for whatever reason the Nomad client and server lose contact for a certain period of time, I'd like Nomad to NOT wipe out my infrastructure services, and to offer me the chance to recover from such a networking/Nomad outage without fighting all the other infra fires at the same time.

This is one of the things I tested and like about Nomad in my destructive testing cases.

On a side note, DC/OS and k8s have similar behavior of NOT wiping out existing runtimes in such cases, at least in the versions we tested and run in our infra. A couple of times, DC/OS not wiping out all existing runtimes upon complete master node failure gave us the relief of only fighting the DC/OS fire while existing services running on the broken DC/OS cluster remained unaffected.

Appending some notes to this issue for CSI support:

The unpublishing workflow for CSI assumes the cooperation of a healthy client that can report that allocs are terminal. On the client side we'll help reconcile this by having a client that has been out of contact with the server for too long mark its allocs as terminal, which will call NodeUnstageVolume and NodeUnpublishVolume. (aka "heart yeet")

@langmartin while we're working through the design for this, we should consider the cases of https://github.com/hashicorp/nomad/issues/6212 and https://github.com/hashicorp/nomad/issues/7607 as well.
