Nomad v0.5.1 on 3 servers and 3 clients
[root@centos-7-srv1 ~]# nomad server-members
Name Address Port Status Leader Protocol Build Datacenter Region
centos-7-srv1.us-east-1 192.168.33.10 4648 alive false 2 0.5.1 molecule us-east-1
centos-7-srv2.us-east-1 192.168.33.11 4648 alive true 2 0.5.1 molecule us-east-1
centos-7-srv3.us-east-1 192.168.33.12 4648 alive false 2 0.5.1 molecule us-east-1
[root@centos-7-srv1 ~]# nomad node-status
ID DC Name Class Drain Status
b973a2ab molecule centos-7-cl1 <none> false ready
ea1c52a4 molecule centos-7-cl3 <none> false ready
7a734562 molecule centos-7-cl2 <none> false ready
All 6 nodes running
Linux centos-7-srv1 3.10.0-327.36.3.el7.x86_64 #1 SMP Mon Oct 24 16:09:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
The signal to shut down a container isn't fired.
Create a Nomad job with count set to 2.
Issue the commands nomad plan phil.nomad and nomad run phil.nomad.
You will see the 2 containers correctly start up on the appropriate clients.
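For reference, a minimal sketch of the commands, assuming phil.nomad sits in the current directory on a server node; the watch on the clients is reconstructed from the "Every 2.0s" headers further down:

# on the server
nomad plan phil.nomad
nomad run phil.nomad

# on each client, to observe the containers and the tail of the Nomad log
watch -n 2 'hostname ; echo ; docker ps ; echo ; tail -n 5 /var/log/nomad/nomad.log'

The job status after the initial run: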
ID = phil
Name = phil
Type = service
Priority = 50
Datacenters = molecule
Status = running
Periodic = false
Summary
Task Group Queued Starting Running Failed Complete Lost
cache 0 0 2 0 20 0
Allocations
ID Eval ID Node ID Task Group Desired Status Created At
57bc8460 e1123b6b ea1c52a4 cache run running 12/15/16 20:19:11 UTC
e698cf48 e1123b6b 7a734562 cache run running 12/15/16 20:18:16 UTC
Change the count from 2 to 0, run nomad plan phil.nomad and nomad run phil.nomad, and you will see the containers continue to stay up for ~5 extra minutes. You will also see a job status like this:
ID = phil
Name = phil
Type = service
Priority = 50
Datacenters = molecule
Status = dead
Periodic = false
Summary
Task Group Queued Starting Running Failed Complete Lost
cache 0 0 1 0 21 0
Allocations
ID Eval ID Node ID Task Group Desired Status Created At
57bc8460 e1123b6b ea1c52a4 cache stop complete 12/15/16 20:19:11 UTC
e698cf48 9b746ea4 7a734562 cache stop running 12/15/16 20:18:16 UTC
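To be explicit about the scale-down step, this is roughly what I run (the sed edit is only illustrative; I actually change count by hand in the job file):

sed -i 's/count = 2/count = 0/' phil.nomad   # illustrative; I edit the file manually
nomad plan phil.nomad
nomad run phil.nomad

# on the client that still shows the allocation as running:
docker ps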
[root@centos-7-srv1 ~]# cat /var/log/nomad/nomad.log
Loaded configuration from /etc/nomad/server.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:
Atlas: <disabled>
Client: false
Log Level: INFO
Region: us-east-1 (DC: molecule)
Server: true
Version: 0.5.1
==> Nomad agent started! Log data will stream in below:
2016/12/15 14:23:45 [INFO] serf: EventMemberJoin: centos-7-srv1.us-east-1 192.168.33.10
2016/12/15 14:23:45.635033 [INFO] nomad: starting 1 scheduling worker(s) for [system service batch _core]
2016/12/15 14:23:45 [INFO] raft: Node at 192.168.33.10:4647 [Follower] entering Follower state (Leader: "")
2016/12/15 14:23:45.636033 [INFO] nomad: adding server centos-7-srv1.us-east-1 (Addr: 192.168.33.10:4647) (DC: molecule)
2016/12/15 14:23:45 [INFO] serf: EventMemberJoin: centos-7-srv2.us-east-1 192.168.33.11
2016/12/15 14:23:45.772685 [INFO] nomad: adding server centos-7-srv2.us-east-1 (Addr: 192.168.33.11:4647) (DC: molecule)
2016/12/15 14:23:45.772737 [INFO] nomad: Attempting bootstrap with nodes: [192.168.33.10:4647 192.168.33.11:4647]
2016/12/15 14:23:45 [INFO] serf: EventMemberJoin: centos-7-srv3.us-east-1 192.168.33.12
2016/12/15 14:23:45.776479 [INFO] server.consul: successfully contacted 2 Nomad Servers
2016/12/15 14:23:45.776601 [INFO] nomad: adding server centos-7-srv3.us-east-1 (Addr: 192.168.33.12:4647) (DC: molecule)
2016/12/15 14:23:47 [WARN] raft: Rejecting vote request from 192.168.33.12:4647 since we have a leader: 192.168.33.11:4647
2016/12/15 14:23:48 [WARN] raft: Rejecting vote request from 192.168.33.12:4647 since we have a leader: 192.168.33.11:4647
2016/12/15 14:23:49 [WARN] raft: Heartbeat timeout from "192.168.33.11:4647" reached, starting election
2016/12/15 14:23:49 [INFO] raft: Node at 192.168.33.10:4647 [Candidate] entering Candidate state
2016/12/15 14:23:49 [INFO] raft: Node at 192.168.33.10:4647 [Follower] entering Follower state (Leader: "")
Every 2.0s: hostname ; echo ; docker ps ; echo ; tail -n 5 /var/log/nomad/nomad.log Thu Dec 15 20:25:48 2016
centos-7-cl1
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
2016/12/15 20:14:27.187577 [INFO] client: node registration complete
2016/12/15 20:15:32.435209 [INFO] driver.docker: created container 7798b58ca79d0a55d757076f36c62a104e18b128b9e34a336862887c45951a2d
2016/12/15 20:15:33.127100 [INFO] driver.docker: started container 7798b58ca79d0a55d757076f36c62a104e18b128b9e34a336862887c45951a2d
2016/12/15 20:15:33.130685 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
2016/12/15 20:16:42.892537 [INFO] driver.docker: stopped container 7798b58ca79d0a55d757076f36c62a104e18b128b9e34a336862887c45951a2d
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Every 2.0s: hostname ; echo ; docker ps ; echo ; tail -n 5 /var/log/nomad/nomad.log Thu Dec 15 20:25:47 2016
centos-7-cl2
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
aa9adce9803d redis:3.2 "docker-entrypoint.sh" 7 minutes ago Up 7 minutes 10.0.2.15:40067->6379/tcp, 10.0.2.15:40067->6379/udp redis-e698cf48-b708-2212-883f-b2022380b1c5
2016/12/15 20:17:27.832225 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
2016/12/15 20:18:06.733780 [INFO] driver.docker: stopped container e71f1ed9dba1f56536b298443fe4e16ac11a7690ee2466024554a28a4ed6eb07
2016/12/15 20:18:45.839302 [INFO] driver.docker: created container aa9adce9803d80759a2397ded741bd6d5e740056b1437272307418c97a93712d
2016/12/15 20:18:46.085315 [INFO] driver.docker: started container aa9adce9803d80759a2397ded741bd6d5e740056b1437272307418c97a93712d
2016/12/15 20:18:46.087891 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Every 2.0s: hostname ; echo ; docker ps ; echo ; tail -n 5 /var/log/nomad/nomad.log Thu Dec 15 20:25:47 2016
centos-7-cl3
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
2016/12/15 20:19:03.445758 [INFO] driver.docker: stopped container d1efd8e4600faa790eaef39c18cac521650f7b8418b4cc04352d6a024d8f8a43
2016/12/15 20:20:08.928978 [INFO] driver.docker: created container 2de7461e2d82d830bfd548bb392fa240ac5054c704419a66ec000709efca8349
2016/12/15 20:20:09.117096 [INFO] driver.docker: started container 2de7461e2d82d830bfd548bb392fa240ac5054c704419a66ec000709efca8349
2016/12/15 20:20:09.119012 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
2016/12/15 20:24:59.953326 [INFO] driver.docker: stopped container 2de7461e2d82d830bfd548bb392fa240ac5054c704419a66ec000709efca8349
I can still hit the container that should be stopped.
[root@centos-7-cl2 ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
aa9adce9803d redis:3.2 "docker-entrypoint.sh" 8 minutes ago Up 8 minutes 10.0.2.15:40067->6379/tcp, 10.0.2.15:40067->6379/udp redis-e698cf48-b708-2212-883f-b2022380b1c5
[root@centos-7-cl2 ~]# telnet 10.0.2.15 40067
Trying 10.0.2.15...
Connected to 10.0.2.15.
Escape character is '^]'.
ping
+PONG
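(The same liveness check can be done with redis-cli, assuming it is installed on the client host; it should likewise answer PONG here even though the allocation is marked stop:

redis-cli -h 10.0.2.15 -p 40067 ping
)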
When this happens, the container will hang around indefinitely. The signal(s) will be fired during the next nomad run phil.nomad, assuming I make a change to the job file.
centos-7-cl2
=============>
2016/12/15 20:18:45.839302 [INFO] driver.docker: created container aa9adce9803d80759a2397ded741bd6d5e740056b1437272307418c97a93712d
2016/12/15 20:18:46.085315 [INFO] driver.docker: started container aa9adce9803d80759a2397ded741bd6d5e740056b1437272307418c97a93712d
2016/12/15 20:18:46.087891 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
2016/12/15 20:30:17.234377 [INFO] driver.docker: stopped container aa9adce9803d80759a2397ded741bd6d5e740056b1437272307418c97a93712d
<=============
2016/12/15 20:31:01.851016 [INFO] driver.docker: created container b13507cd35da28130c4156122097d57338cd4f7c57db355c2257d86747802b77
This is not exactly ideal.
count=2
job "phil" {
datacenters = ["molecule"]
region = "us-east-1"
type = "service"
update {
stagger = "10s"
max_parallel = 1
}
group "cache" {
count = 2
restart {
attempts = 10
interval = "5m"
delay = "25s"
mode = "delay"
}
ephemeral_disk {
size = 300
}
task "redis" {
driver = "docker"
config {
image = "redis:3.2"
port_map {
db = 6379
}
}
resources {
network {
mbits = 10
port "db" {}
}
}
service {
name = "global-redis-check"
tags = ["global", "cache"]
port = "db"
check {
name = "alive"
type = "tcp"
interval = "10s"
timeout = "2s"
}
}
}
}
}
count=0
job "phil" {
datacenters = ["molecule"]
region = "us-east-1"
type = "service"
update {
stagger = "10s"
max_parallel = 1
}
group "cache" {
count = 0
restart {
attempts = 10
interval = "5m"
delay = "25s"
mode = "delay"
}
ephemeral_disk {
size = 300
}
task "redis" {
driver = "docker"
config {
image = "redis:3.2"
port_map {
db = 6379
}
}
resources {
network {
mbits = 10
port "db" {}
}
}
service {
name = "global-redis-check"
tags = ["global", "cache"]
port = "db"
check {
name = "alive"
type = "tcp"
interval = "10s"
timeout = "2s"
}
}
}
}
}
Can you run nomad alloc-status e698cf48 and paste the full logs of the client running that alloc?
How reliably does this happen?
I can force this to happen within ~2 minutes by changing the count and running the job, waiting for the job output, running the job, "rinse and repeat". This particular change was count=2 to count=1 on my job from above.
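Roughly, the loop I use to force it looks like this (a sketch only; the sed edit stands in for my manual edit of count, and the sleep is just long enough for the evaluation to settle):

for c in 1 2 1 2; do
    sed -i "s/count = [0-9]*/count = $c/" phil.nomad
    nomad plan phil.nomad
    nomad run phil.nomad
    sleep 30
    nomad status phil    # look for allocations with Desired=stop but Status=running
done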
My alloc-status log was cleared for some reason, so nomad alloc-status e698cf48 returned:
[root@centos-7-srv1 ~]# nomad alloc-status e698cf48
No allocation(s) with prefix or id "e698cf48" found
However, I just reproduced the issue again and here's the output
Every 2.0s: hostname ; echo ; docker ps ; echo ; tail -n 5 /var/log/nomad/nomad.log Thu Dec 15 21:39:59 2016
centos-7-cl1
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
33bd282be38b redis:3.2 "docker-entrypoint.sh" About a minute ago Up About a minute 10.0.2.15:22996->6379/tcp, 10.0.2.15:22996->6379/udp redis-75a2dba7-ddd4-1977-3008-864986fcc2bc
2016/12/15 21:37:29.609524 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
2016/12/15 21:38:12.663605 [INFO] driver.docker: stopped container 7cf5f1d87baf197a46f3cfc36186c95bbc967c0eb3080f02015f05bd2826de3e
2016/12/15 21:38:25.976447 [INFO] driver.docker: created container 33bd282be38bc217a867e5415d9d5002a1d3486a138d362240937c521fbd9f34
2016/12/15 21:38:26.108368 [INFO] driver.docker: started container 33bd282be38bc217a867e5415d9d5002a1d3486a138d362240937c521fbd9f34
2016/12/15 21:38:26.109462 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Every 2.0s: hostname ; echo ; docker ps ; echo ; tail -n 5 /var/log/nomad/nomad.log Thu Dec 15 21:40:01 2016
centos-7-cl2
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
2016/12/15 20:42:15.417791 [INFO] client: using alloc directory /opt/nomad/alloc
2016/12/15 20:42:15.421197 [INFO] fingerprint.cgroups: cgroups are available
2016/12/15 20:42:15.423134 [INFO] fingerprint.consul: consul agent is available
2016/12/15 20:42:18.457273 [INFO] client: Node ID "7a734562-617a-a4c0-cefe-c798cd13fc27"
2016/12/15 20:42:18.479463 [INFO] client: node registration complete
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Every 2.0s: hostname ; echo ; docker ps ; echo ; tail -n 5 /var/log/nomad/nomad.log Thu Dec 15 21:40:00 2016
centos-7-cl3
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e84548972440 redis:3.2 "docker-entrypoint.sh" About a minute ago Up About a minute 10.0.2.15:32857->6379/tcp, 10.0.2.15:32857->6379/udp redis-540e7bbf-8754-2592-870c-836249b1b30a
2016/12/15 21:37:28.788605 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
2016/12/15 21:38:19.763671 [INFO] driver.docker: stopped container 5e20b09acb1efc40a8156a0dd7ef0546684f9c799eae1b0df7cf9d29eefdd43c
2016/12/15 21:38:26.000837 [INFO] driver.docker: created container e8454897244036676264d18c0fe46446290d01158ba35c7fcc2fa4e7da92f10b
2016/12/15 21:38:26.162356 [INFO] driver.docker: started container e8454897244036676264d18c0fe46446290d01158ba35c7fcc2fa4e7da92f10b
2016/12/15 21:38:26.165919 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
[root@centos-7-srv1 ~]# nomad run phil.nomad
==> Monitoring evaluation "ea3acdee"
Evaluation triggered by job "phil"
Allocation "75a2dba7" modified: node "b973a2ab", group "cache"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "ea3acdee" finished with status "complete"
[root@centos-7-srv1 ~]# nomad alloc-status 75a2dba7
ID = 75a2dba7
Eval ID = ea3acdee
Name = phil.cache[0]
Node ID = b973a2ab
Job ID = phil
Client Status = running
Client Description = <none>
Created At = 12/15/16 21:38:25 UTC
Task "redis" is "running"
Task Resources
CPU Memory Disk IOPS Addresses
2/100 MHz 992 KiB/10 MiB 0 B 0 db: 10.0.2.15:22996
Recent Events:
Time Type Description
12/15/16 21:38:26 UTC Started Task started by client
12/15/16 21:38:25 UTC Received Task received by client
[root@centos-7-srv1 ~]# nomad alloc-status 540e7bbf
ID = 540e7bbf
Eval ID = b382ea91
Name = phil.cache[1]
Node ID = ea1c52a4
Job ID = phil
Client Status = running
Client Description = <none>
Created At = 12/15/16 21:38:25 UTC
Task "redis" is "running"
Task Resources
CPU Memory Disk IOPS Addresses
2/100 MHz 992 KiB/10 MiB 0 B 0 db: 10.0.2.15:32857
Recent Events:
Time Type Description
12/15/16 21:38:26 UTC Started Task started by client
12/15/16 21:38:25 UTC Received Task received by client
[root@centos-7-srv1 ~]# nomad status phil
ID = phil
Name = phil
Type = service
Priority = 50
Datacenters = molecule
Status = running
Periodic = false
Summary
Task Group Queued Starting Running Failed Complete Lost
cache 0 0 2 0 28 0
Allocations
ID Eval ID Node ID Task Group Desired Status Created At
540e7bbf b382ea91 ea1c52a4 cache stop running 12/15/16 21:38:25 UTC
75a2dba7 ea3acdee b973a2ab cache run running 12/15/16 21:38:25 UTC
[root@centos-7-srv1 ~]# nomad alloc-status 540e7bbf
ID = 540e7bbf
Eval ID = b382ea91
Name = phil.cache[1]
Node ID = ea1c52a4
Job ID = phil
Client Status = complete
Client Description = <none>
Created At = 12/15/16 21:38:25 UTC
Task "redis" is "dead"
Task Resources
CPU Memory Disk IOPS Addresses
100 MHz 10 MiB 0 B 0 db: 10.0.2.15:32857
Recent Events:
Time Type Description
12/15/16 21:43:46 UTC Killed Task successfully killed
12/15/16 21:43:46 UTC Killing Sent interrupt. Waiting 5s before force killing
12/15/16 21:38:26 UTC Started Task started by client
12/15/16 21:38:25 UTC Received Task received by client
I should note the version of Docker I am using:
[root@centos-7-cl1 ~]# docker version
Client:
Version: 1.11.2
API version: 1.23
Go version: go1.5.4
Git commit: b9f10c9
Built: Wed Jun 1 21:23:11 2016
OS/Arch: linux/amd64
Server:
Version: 1.11.2
API version: 1.23
Go version: go1.5.4
Git commit: b9f10c9
Built: Wed Jun 1 21:23:11 2016
OS/Arch: linux/amd64
Can you paste the full client logs of node ea1c52a4? I can't really follow that style of logs. Also, are you running the Nomad clients in containers?
I am running Nomad on the host itself via systemd. If you would like to schedule a screenshare or something like that, I'd be happy to oblige.
==> Starting Nomad agent...
==> Nomad agent configuration:
Atlas: <disabled>
Client: true
Log Level: INFO
Region: us-east-1 (DC: molecule)
Server: false
Version: 0.5.1
==> Nomad agent started! Log data will stream in below:
2016/12/15 20:42:15.500090 [INFO] client: using state directory /opt/nomad/client
2016/12/15 20:42:15.500170 [INFO] client: using alloc directory /opt/nomad/alloc
2016/12/15 20:42:15.502557 [INFO] fingerprint.cgroups: cgroups are available
2016/12/15 20:42:15.504121 [INFO] fingerprint.consul: consul agent is available
2016/12/15 20:42:18.544095 [INFO] client: Node ID "ea1c52a4-785f-d00a-911b-65e5e79fba02"
2016/12/15 20:42:18.564452 [INFO] client: node registration complete
2016/12/15 21:37:28.225018 [INFO] driver.docker: created container 5e20b09acb1efc40a8156a0dd7ef0546684f9c799eae1b0df7cf9d29eefdd43c
2016/12/15 21:37:28.758178 [INFO] driver.docker: started container 5e20b09acb1efc40a8156a0dd7ef0546684f9c799eae1b0df7cf9d29eefdd43c
2016/12/15 21:37:28.788605 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
2016/12/15 21:38:19.763671 [INFO] driver.docker: stopped container 5e20b09acb1efc40a8156a0dd7ef0546684f9c799eae1b0df7cf9d29eefdd43c
2016/12/15 21:38:26.000837 [INFO] driver.docker: created container e8454897244036676264d18c0fe46446290d01158ba35c7fcc2fa4e7da92f10b
2016/12/15 21:38:26.162356 [INFO] driver.docker: started container e8454897244036676264d18c0fe46446290d01158ba35c7fcc2fa4e7da92f10b
2016/12/15 21:38:26.165919 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
2016/12/15 21:43:46.819799 [INFO] driver.docker: stopped container e8454897244036676264d18c0fe46446290d01158ba35c7fcc2fa4e7da92f10b
@pgporada Appreciate the offer. Hopefully we can get to the root of it async. If possible, could you reproduce while running at log level DEBUG on both the server and client and paste the results?
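For reference, a sketch of how to bump the log level, assuming your agents load /etc/nomad/server.hcl (and a similar client config) as the logs above suggest:

# either set it in the agent config and restart the systemd unit:
#   log_level = "DEBUG"
systemctl restart nomad

# or pass it on the command line when starting the agent:
nomad agent -config /etc/nomad/server.hcl -log-level=DEBUG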
I will get you that by tomorrow. We're about to kick off a company Christmas party. Thank you for the help so far.
Sorry for the delay. Please see the enclosed files:
- consul.log with standard logging
- nomad.log with debug logging
- httpd.nomad, the job I've been planning/running for this round of experiments
- server_status.txt, output from the server I've been running the jobs from. You will see the problem here.
- server_status_allocations.txt, which shows the job allocations. The problem can be seen here as well.
- troubleshooting_script, the commands I ran to gather this data
Specifically, why did it take 5+ minutes to kill a container?
+/- Job: "httpd"
Datacenters {
Datacenters: "molecule"
}
+/- Task Group: "cache" (6 destroy)
+/- Count: "6" => "0" (forces destroy)
Task: "httpd"
Scheduler dry-run:
- All tasks successfully allocated.
---
Mon Dec 19 14:45:28 UTC 2016
==> Monitoring evaluation "6b81a051"
Evaluation triggered by job "httpd"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "6b81a051" finished with status "complete"
---
ID = 05364f3b
Eval ID = 3ca39c14
Name = httpd.cache[3]
Node ID = a42fd16a
Job ID = httpd
Client Status = complete
Client Description = <none>
Created At = 12/19/16 14:45:08 UTC
Task "httpd" is "dead"
Task Resources
CPU Memory Disk IOPS Addresses
100 MHz 10 MiB 0 B 0 http: 10.0.2.15:50777
Recent Events:
Time Type Description
12/19/16 14:50:40 UTC Killed Task successfully killed
12/19/16 14:50:39 UTC Killing Sent interrupt. Waiting 5s before force killing
12/19/16 14:45:08 UTC Started Task started by client
12/19/16 14:45:08 UTC Received Task received by client
We're hitting the same issue. My guess is that the timeout at docker.go line 89 is somehow being hit, or maybe JobGCInterval or NodeGCInterval, which are also set to 5 minutes.
Thanks for the logs and report. Will get a fix!
Just noted that issue #2133 reports a similar symptom for the raw_exec driver.
@bluen I think they are unrelated. The Docker driver has an HTTP client with an explicit 5-minute timeout. We use this because occasionally the Docker engine stops being responsive and we don't want it to hang the Nomad client.