Nomad v0.6.3
Debian
I have a number of Nomad client boxes that were shut down. Nomad still shows those nodes in the "ready" state when I run nomad node-status, yet when I list the allocations on one of those nodes, they show as lost with the message alloc is lost since its node is down. Clearly Nomad knows the node is down, but it is not registering it as such. It is also trying to place new allocations on those nodes, which causes all sorts of problems (I have put them in drain state so that things can keep working).
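For reference, the drain on those boxes was enabled with roughly the following (using the short node ID from the node-status output below):
# Stop new allocations from being placed on the affected node
nomad node-drain -enable d6988f10
# Confirm the drain flag is now set
nomad node-status d6988f10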
Note that this is very similar to #3072. However, in that case we tracked the issue down to the node data being too large, which does not seem to be the cause here.
Here are the details:
When I do a nomad node-status I get:
error fetching node stats (HINT: ensure Client.Advertise.HTTP is set): Get https://internal-cluster-0df.stag.awse.traitify.com:4646/v1/client/stats?region=global: dial tcp 10.1.200.206:4646: i/o timeout
ID = d6988f10
Name = internal-cluster-0df.stag.awse
Class = <none>
DC = awse
Drain = true
Status = ready
Drivers = exec
Allocated Resources
CPU Memory Disk IOPS
0/2500 MHz 0 B/3.7 GiB 0 B/16 GiB 0/0
Allocation Resource Utilization
CPU Memory
0/2500 MHz 0 B/3.7 GiB
error fetching node stats (HINT: ensure Client.Advertise.HTTP is set): actual resource usage not present
Allocations
ID Node ID Task Group Version Desired Status Created At
4b585ab6 d6988f10 app 24 stop lost 10/25/17 14:48:28 EDT
Notice it shows a status of ready, but under allocations it shows lost. When I do an alloc-status on that allocation I get:
ID = 4b585ab6
Eval ID = a52b271b
Name = gambit-stag.app[1]
Node ID = d6988f10
Job ID = gambit-stag
Job Version = 24
Client Status = lost
Client Description = <none>
Desired Status = stop
Desired Description = alloc is lost since its node is down
Created At = 10/25/17 14:48:28 EDT
Deployment ID = 3b1dc798
Deployment Health = healthy
Couldn't retrieve stats (HINT: ensure Client.Advertise.HTTP is set): Get https://internal-cluster-0df.stag.awse.traitify.com:4646/v1/client/allocation/4b585ab6-f984-c164-a852-2fd037b719c9/stats?region=global: dial tcp 10.1.200.206:4646: i/o timeout
...
Notice that it tells us "alloc is lost since its node is down".
Thanks!
@dansteen Could you provide all your server logs? Would you mind doing curl -XPUT http://127.0.0.1:4646/v1/system/gc before collecting the logs?
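Something along these lines should do it (collecting the logs via journalctl is just an example; grab them however you normally do):
# Force a garbage collection on a server
curl -XPUT http://127.0.0.1:4646/v1/system/gc
# Then collect the server logs, e.g. if nomad runs under systemd:
journalctl -u nomad > nomad-server.log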
Hi @dadgar thanks for the response!
I ran the GC and gathered some logs. We have 3 Nomad servers in our cluster, so I have included logs from each one.
One other thing that I noticed is that we seem to have an extra server listed in nomad operator raft list-peers:
Node ID Address State Voter
nomad-0ea.global 10.1.99.123:4647 10.1.99.123:4647 follower true
nomad-02f.global 10.1.200.13:4647 10.1.200.13:4647 follower true
nomad-0dc.global 10.1.200.16:4647 10.1.200.16:4647 leader true
(unknown) 10.1.200.164:4647 10.1.200.164:4647 follower true
compared to the output of nomad server-members
Name Address Port Status Leader Protocol Build Datacenter Region
nomad-02f.global 10.1.200.13 4648 alive false 2 0.6.3 awse global
nomad-0dc.global 10.1.200.16 4648 alive true 2 0.6.3 awse global
nomad-0ea.global 10.1.99.123 4648 alive false 2 0.6.3 awse global
I don't know where that came from, or what box it is since it doesn't seem to actually exist at this point. Could that be related to the problem?
@dansteen Ah, I should have asked you to raise the log level to debug first. Would you mind doing that, letting it run like that for ~20 minutes, running the GC again, and then sending those logs? You can bump the log level in the config and just SIGHUP the servers.
I would also run nomad operator raft remove-peer -peer-address=10.1.200.164:4647 to remove that down server from the raft peer set.
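Roughly the following (pidof is just one way to find the agent process; adjust to your setup):
# In each server's config file, bump the log level:
#   log_level = "DEBUG"
# Reload the running servers without a restart:
kill -HUP $(pidof nomad)
# Then, from a healthy server, drop the stale raft peer:
nomad operator raft remove-peer -peer-address=10.1.200.164:4647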
@dadgar sorry about that, I should have known. Anyway, here are the fancy new logs:
nomad3.log
nomad2.log
nomad1.log
@dansteen Can you show me the metric nomad.heartbeat.active and the output of curl 127.0.0.1:4646/v1/nodes?
You can get the heartbeat metric either from your metrics sink or, if you aren't exporting metrics, by sending SIGUSR1 to the leader node, which will dump the metrics to STDERR.
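For example:
# Node list as the servers currently see it
curl -s 127.0.0.1:4646/v1/nodes
# On the leader, dump the in-memory telemetry (including nomad.heartbeat.active) to STDERR
kill -USR1 $(pidof nomad)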
Unfortunately (or fortunately?) it looks like nomad finally figured out that those nodes were down. Here is the output of curling the nodes:
Here are graphs of the nomad.heartbeat.active metric. I hadn't set disable_hostname = true yet, so the metric is split across my three boxes. That's annoying, but it does bring to light that my active nomad node seems to be changing back and forth fairly frequently. Is that normal?

Interestingly, it looks like nomad finally figured out that a whole bunch of boxes were down (according to the graphs above, that may have happened around 20:30, if I'm reading them correctly). Either way, here is the output of nomad node-status from late on Friday:
ID DC Name Class Drain Status
624984b6 awse internal-cluster-0ba.stag.awse <none> false ready
7afd5aec awse internal-cluster-035.stag.awse <none> false ready
1171c72f awse woofound-test-kitchen <none> false initializing
add73f90 awse woofound-test-kitchen <none> false initializing
2e753b69 awse woofound-test-kitchen <none> false ready
9cf10e88 awse woofound-test-kitchen <none> false ready
a424bb84 awse woofound-test-kitchen <none> false initializing
5dd8cfc9 awse woofound-test-kitchen <none> false ready
567a8ee7 awse api-cluster-test-kitchen <none> false ready
d6988f10 awse internal-cluster-0df.stag.awse <none> true ready
2f908c13 awse internal-cluster-034.stag.awse <none> true ready
e964e768 awse utility-debian-jessie-ec2 <none> false ready
eeb98acf awse images-test-kitchen <none> false ready
16913d8d awse admin-interface-test-kitchen <none> false ready
310718d4 awse deck-validator-test-kitchen <none> false initializing
f838c291 awse billing-test-kitchen <none> false ready
8fdd3a35 awse woofound-test-kitchen <none> false ready
58034cd5 awse api-test-kitchen <none> false ready
643f7262 awse orgs-test-kitchen <none> false ready
b7aa8f0d awse fileserve-test-kitchen <none> false initializing
f5e333f4 awse api-cluster-test-kitchen <none> false ready
27c5a00b awse utility-debian-jessie-ec2 <none> false ready
a1e121e1 awse images-test-kitchen <none> false down
0dab9782 awse deck-validator-test-kitchen <none> false ready
c27dfdd0 awse admin-interface-test-kitchen <none> false ready
d66284a6 awse billing-test-kitchen <none> false ready
9e52273b awse api-test-kitchen <none> false ready
1bc9eee6 awse orgs-test-kitchen <none> false ready
f14a9929 awse woofound-test-kitchen <none> false ready
05072c00 awse woofound-008.stag.awse <none> false ready
a3eaf6ef awse api-cluster-046.stag.awse <none> false ready
df0e8cf3 awse api-cluster-051.stag.awse <none> false ready
a778eec8 awse api-cluster-087.stag.awse <none> false ready
31ff8691 awse api-cluster-0a4.stag.awse <none> false ready
9c7726f2 awse api-051.stag.awse <none> false ready
1fe66cb1 awse api-047.stag.awse <none> false ready
392d0b8f awse deck-validator-085.stag.awse <none> false ready
1007da5a awse deck-validator-0d1.stag.awse <none> false ready
27aa5e74 awse woofound-0e5.stag.awse <none> false ready
356871a4 awse fileserve-09a.stag.awse <none> false ready
3fc8b7eb awse fileserve-044.stag.awse <none> false ready
0c01f3c6 awse core-0fa.stag.awse <none> false ready
7e7d2d78 awse orgs-0fc.stag.awse <none> false ready
5cd6aca6 awse billing-087.stag.awse <none> false ready
48a53867 awse images-02d.stag.awse <none> false ready
8b9c4d87 awse orgs-0c8.stag.awse <none> false ready
c1c19ed2 awse images-0d1.stag.awse <none> false ready
5ad2de90 awse billing-0f2.stag.awse <none> false ready
421e8917 awse utility-02f.stag.awse <none> false ready
d295309f awse utility-06c.stag.awse <none> false ready
0c913dd4 awse admin-interface-005.stag.awse <none> false ready
49d1ec09 awse admin-interface-055.stag.awse <none> false ready
46ece8ff awse utility-004.stag.awse <none> false ready
ec253e5b awse core-0af.stag.awse <none> false ready
And here is the output just now ~24 hours later:
ID DC Name Class Drain Status
624984b6 awse internal-cluster-0ba.stag.awse <none> false ready
7afd5aec awse internal-cluster-035.stag.awse <none> false ready
1171c72f awse woofound-test-kitchen <none> false down
add73f90 awse woofound-test-kitchen <none> false down
2e753b69 awse woofound-test-kitchen <none> false down
9cf10e88 awse woofound-test-kitchen <none> false down
a424bb84 awse woofound-test-kitchen <none> false down
5dd8cfc9 awse woofound-test-kitchen <none> false ready
567a8ee7 awse api-cluster-test-kitchen <none> false down
d6988f10 awse internal-cluster-0df.stag.awse <none> true down
2f908c13 awse internal-cluster-034.stag.awse <none> true down
e964e768 awse utility-debian-jessie-ec2 <none> false down
eeb98acf awse images-test-kitchen <none> false down
16913d8d awse admin-interface-test-kitchen <none> false down
310718d4 awse deck-validator-test-kitchen <none> false down
f838c291 awse billing-test-kitchen <none> false down
8fdd3a35 awse woofound-test-kitchen <none> false down
58034cd5 awse api-test-kitchen <none> false down
643f7262 awse orgs-test-kitchen <none> false down
b7aa8f0d awse fileserve-test-kitchen <none> false down
f5e333f4 awse api-cluster-test-kitchen <none> false down
27c5a00b awse utility-debian-jessie-ec2 <none> false down
0dab9782 awse deck-validator-test-kitchen <none> false down
c27dfdd0 awse admin-interface-test-kitchen <none> false down
d66284a6 awse billing-test-kitchen <none> false down
9e52273b awse api-test-kitchen <none> false down
1bc9eee6 awse orgs-test-kitchen <none> false down
f14a9929 awse woofound-test-kitchen <none> false down
05072c00 awse woofound-008.stag.awse <none> false ready
a3eaf6ef awse api-cluster-046.stag.awse <none> false ready
df0e8cf3 awse api-cluster-051.stag.awse <none> false ready
a778eec8 awse api-cluster-087.stag.awse <none> false ready
31ff8691 awse api-cluster-0a4.stag.awse <none> false ready
9c7726f2 awse api-051.stag.awse <none> false ready
1fe66cb1 awse api-047.stag.awse <none> false ready
392d0b8f awse deck-validator-085.stag.awse <none> false ready
1007da5a awse deck-validator-0d1.stag.awse <none> false ready
27aa5e74 awse woofound-0e5.stag.awse <none> false ready
356871a4 awse fileserve-09a.stag.awse <none> false ready
3fc8b7eb awse fileserve-044.stag.awse <none> false ready
0c01f3c6 awse core-0fa.stag.awse <none> false ready
7e7d2d78 awse orgs-0fc.stag.awse <none> false ready
5cd6aca6 awse billing-087.stag.awse <none> false ready
48a53867 awse images-02d.stag.awse <none> false ready
8b9c4d87 awse orgs-0c8.stag.awse <none> false ready
c1c19ed2 awse images-0d1.stag.awse <none> false ready
5ad2de90 awse billing-0f2.stag.awse <none> false ready
421e8917 awse utility-02f.stag.awse <none> false ready
d295309f awse utility-06c.stag.awse <none> false ready
0c913dd4 awse admin-interface-005.stag.awse <none> false ready
49d1ec09 awse admin-interface-055.stag.awse <none> false ready
46ece8ff awse utility-004.stag.awse <none> false ready
ec253e5b awse core-0af.stag.awse <none> false ready
All of the boxes that show as down in the more recent output have been down for days, far more than 24 hours, even the ones that showed up as initializing or ready in the older output.
@dansteen Ah! Your leadership flapping explains this issue. You are losing leadership roughly every minute:
7772:Oct 27 17:28:29 nomad-02f nomad[24645]: 2017/10/27 17:28:29.077011 [INFO] nomad: cluster leadership lost
8131:Oct 27 17:29:34 nomad-02f nomad[24645]: 2017/10/27 17:29:34.796362 [INFO] nomad: cluster leadership lost
8483:Oct 27 17:31:08 nomad-02f nomad[24645]: 2017/10/27 17:31:08.749761 [INFO] nomad: cluster leadership lost
8795:Oct 27 17:32:12 nomad-02f nomad[24645]: 2017/10/27 17:32:12.135644 [INFO] nomad: cluster leadership lost
9783:Oct 27 17:36:50 nomad-02f nomad[24645]: 2017/10/27 17:36:50.811047 [INFO] nomad: cluster leadership lost
10149:Oct 27 17:37:55 nomad-02f nomad[24645]: 2017/10/27 17:37:55.877377 [INFO] nomad: cluster leadership lost
Only the leader marks a node as down based on its heartbeat, and when there is a leadership transition, a 5-minute grace period is applied before any node is marked as down. This is done to ensure that all the clients in the cluster learn about the new leader so they don't get marked as down needlessly.
Since you never had a stable leader for the 5-minute grace period plus the heartbeat timeout, those nodes didn't get marked as down. Based on the metrics you showed, as soon as a leader held on long enough, the nodes were marked as down.
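You can see the flapping yourself by grepping the logs you uploaded for the leadership messages, e.g.:
# Count and list the leadership transitions in each server log
grep -c 'cluster leadership' nomad*.log
grep 'cluster leadership lost' nomad*.log | tail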
Looking at your logs, you are seeing contact times of around 750ms between servers. You need to reduce that significantly: https://www.nomadproject.io/guides/cluster/requirements.html#network-topology
I am going to close this issue now that we know what is happening. I will be looking into potentially changing the grace period based on cluster size since 5m may be a bit too conservative on smaller clusters.
Nice @dadgar! I can see how that would be an issue. However, I'm not sure we are quite out of the woods yet. Where in the logs do you see that we have >750ms? From what I can see we are sub 1ms:
64 bytes from nomad-0dc.stag.awse.node.awse.consul (10.1.200.16): icmp_seq=1 ttl=64 time=0.503 ms
64 bytes from nomad-0dc.stag.awse.node.awse.consul (10.1.200.16): icmp_seq=2 ttl=64 time=0.490 ms
64 bytes from nomad-0dc.stag.awse.node.awse.consul (10.1.200.16): icmp_seq=3 ttl=64 time=0.443 ms
64 bytes from nomad-0dc.stag.awse.node.awse.consul (10.1.200.16): icmp_seq=4 ttl=64 time=0.499 ms
64 bytes from nomad-0dc.stag.awse.node.awse.consul (10.1.200.16): icmp_seq=5 ttl=64 time=0.502 ms
64 bytes from nomad-0dc.stag.awse.node.awse.consul (10.1.200.16): icmp_seq=6 ttl=64 time=0.456 ms
64 bytes from nomad-0dc.stag.awse.node.awse.consul (10.1.200.16): icmp_seq=7 ttl=64 time=0.503 ms
64 bytes from nomad-0dc.stag.awse.node.awse.consul (10.1.200.16): icmp_seq=8 ttl=64 time=0.431 ms
As a note, these servers are all in different Availability Zones within the same AWS region.
@dansteen Hmm, AZs usually have good network latency between them, so this may have been an ephemeral issue. You can see the results of grepping your logs for contact.*in here: https://gist.github.com/dadgar/623840e44ed09c741264a7bbad391310
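That gist is just the output of something along the lines of:
# Raft contact times between the servers, pulled from the uploaded server logs
grep -E 'contact.*in' nomad*.log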
You should monitor the metric nomad.raft.leader.lastContact. For more telemetry details check out: https://www.nomadproject.io/docs/agent/telemetry.html
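If you aren't already exporting metrics, the telemetry stanza in the agent config looks roughly like this (the statsd address below is only an example; disable_hostname keeps the metric names from being split per host):
#   telemetry {
#     statsd_address   = "10.0.0.1:8125"
#     disable_hostname = true
#   }
# Reload the agents after the config change:
kill -HUP $(pidof nomad)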
Thanks @dadgar, here is a graph of nomad.raft.leader.lastContact:

I'm not sure I believe that we really had 15 seconds of latency between those boxes (our graphs go back 30 days, and it seems to have been an issue that whole time), but it certainly illustrates your point and matches the timeframe when the issues stopped. If we see this again I'll start debugging from there. Thanks for all your help with this.