Nomad: Clients try to rerun old allocations after machine reboot

Created on 6 Oct 2016 · 3 comments · Source: hashicorp/nomad

Nomad version

Nomad v0.3.2

Operating system and Environment details

Ubuntu 14.04

Issue

The Nomad client puts some of its temporary files in /tmp. When alloc_dir is not erased on reboot, the client tries to restart all allocations that previously ran on this node and only terminates them after receiving a command from the server (note the Started, Started, Killed event sequence in the allocation status). Since the socket in /tmp is gone after the reboot, the client also produces an error log (see below).

In our setup we worked around this by putting alloc_dir in /tmp, so that the client does not try to rerun previously allocated jobs after a reboot.
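
A minimal sketch of the client stanza for that workaround (the exact path is an assumption; any directory the OS clears on reboot would do):

client {
  enabled = true
  servers = ["10.250.18.27"]
  # Assumed path: keeping the allocation directory under /tmp means it is
  # wiped on reboot, so there are no stale allocations to re-attach to.
  alloc_dir = "/tmp/nomad/alloc"
}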

Reproduction steps

  • Create a Nomad cluster with 2 clients
  • Submit a job on one of the clients (a hypothetical job spec is sketched below)
  • Reboot the client that runs the job
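
For completeness, a hypothetical job spec roughly matching the resources and port shown in the alloc-status output below; the driver, artifact URL, and command are placeholders, not the actual job:

job "infra-cluster-broccoli" {
  datacenters = ["dc1"]
  type        = "service"

  group "server" {
    task "server" {
      driver = "raw_exec"  # assumption; exec and docker are also enabled in this setup

      artifact {
        # Placeholder URL; the real job downloads its artifact from elsewhere
        source = "https://example.com/broccoli-server.tar.gz"
      }

      config {
        command = "broccoli-server"  # placeholder command
      }

      resources {
        cpu    = 500   # MHz
        memory = 1024  # MB
        disk   = 300   # MB
        network {
          mbits = 10
          port "http" {
            static = 9000
          }
        }
      }
    }
  }
}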

Nomad job status

ID          = infra-cluster-broccoli
Name        = infra-cluster-broccoli
Type        = service
Priority    = 50
Datacenters = dc1
Status      = running
Periodic    = false

==> Evaluations
ID        Priority  Triggered By  Status
194b03f7  50        node-update   complete
36b8d462  50        node-update   complete
319c009c  50        node-update   complete

==> Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status
6d9c70aa  36b8d462  3df25888  server      run      running
c130ac62  319c009c  449c9ba8  server      stop     complete

Nomad alloc-status

ID            = c130ac62
Eval ID       = 319c009c
Name          = infra-cluster-broccoli.server[0]
Node ID       = 449c9ba8
Job ID        = infra-cluster-broccoli
Client Status = complete

==> Task Resources
Task: "server"
CPU  Memory MB  Disk MB  IOPS  Addresses
500  1024       300      0     http: 10.250.18.29:9000

==> Task "server" is "dead"
Recent Events:
Time                    Type                   Description
06/10/16 14:12:30 CEST  Killed                 Task successfully killed
06/10/16 14:12:23 CEST  Started                Task started by client
04/10/16 16:59:45 CEST  Started                Task started by client
04/10/16 16:59:44 CEST  Downloading Artifacts  Client is downloading artifacts
04/10/16 16:59:44 CEST  Received               Task received by client

Nomad Client configuration

log_level = "INFO"
datacenter = "dc1"
data_dir = "/var/lib/nomad"
bind_addr = "0.0.0.0"
advertise {
  http = "10.250.18.28:4646"
  rpc = "10.250.18.28:4647"
  serf = "10.250.18.28:4648"
}
client {
  enabled = true
  servers = ["10.250.18.27"]
  options {
    "driver.raw_exec.enable" = "1"
    "driver.exec.enable" = "1"
    "driver.docker.enable" = "1"
  }
}

Nomad Client logs (if appropriate)

==> Caught signal: terminated
    2016/10/06 14:10:43 [INFO] agent: requesting shutdown
    2016/10/06 14:10:43 [INFO] client: shutting down
    2016/10/06 14:10:43 [INFO] agent: shutdown complete
    Loaded configuration from /etc/nomad.d/client/config.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: true
             Log Level: INFO
                Region: global (DC: dc1)
                Server: false

==> Nomad agent started! Log data will stream in below:

    2016/10/06 14:12:19 [INFO] client: using state directory /var/lib/nomad/client
    2016/10/06 14:12:19 [INFO] client: using alloc directory /var/lib/nomad/alloc
    2016/10/06 14:12:19 [INFO] fingerprint.cgroups: cgroups are available
    2016/10/06 14:12:23 [WARN] fingerprint.env_gce: Could not read value for attribute "machine-type"
    2016/10/06 14:12:23 [WARN] fingerprint.network: Unable to parse Speed in output of '/sbin/ethtool eth0'
    2016/10/06 14:12:23 [WARN] fingerprint.network: Unable to read link speed from /sys/class/net/eth0/speed
    2016/10/06 14:12:23 [WARN] client: port not specified, using default port
    2016/10/06 14:12:23 [INFO] client: setting server address list: [10.250.18.27:4647]
    2016/10/06 14:12:23 [ERR] driver.raw_exec: error connecting to plugin so destroying plugin pid and user pid
    2016/10/06 14:12:23 [ERR] driver.raw_exec: error destroying plugin and userpid: 2 error(s) occurred:

* os: process already finished
* os: process already finished
    2016/10/06 14:12:23 [ERR] client: failed to open handle to task 'server' for alloc 'c130ac62-268f-3ae8-3aac-95d315f37b99': error connecting to plugin: error creating rpc client for executor plugin: Reattachment process not found

All 3 comments

Hey @Gerrrr,

This is actually expected behavior. The client attempts to re-attach to anything that was already running on the node. This is useful if, for example, you stop the Nomad client to do an in-place upgrade, start it again, and have it reconnect to all of the processes it was managing.

In your case, there is nothing to connect to anymore because the tasks are dead, so it is just cleaning up.

Let me know if that made sense!

Hi @dadgar,

Thanks for the explanation; it makes sense to me for in-place upgrades or when you simply restart the Nomad client.

In our case the problem was that after the VM reboot, the Nomad client started allocations that had already been rescheduled elsewhere, so we ended up with the same jobs running multiple times. The rebooted client then immediately sent SIGKILL to the jobs it had just started.

However, SIGKILL cannot be caught, so the jobs cannot do any cleanup. Actually, they should not have been started in the first place, should they? We worked around it by putting alloc_dir in /tmp so that the client does not try to run previously allocated jobs after a reboot.

Ah, thanks for the clarification. Will re-open.
