Nomad v0.4.1-dev
vagrant linux host, 3 windows guests
I have a vagrant multi-machine configuration with 3 identical servers (see Vagrantfile). When I do
"vagrant halt", the servers go down one by one (from srv-3 to srv-1). I have the "leave_on_interrupt" flag enabled, so during this operation the nomad agents leave the cluster.
Then I do "vagrant up", which brings the nomad servers back (from srv-1 to srv-3) with the following command lines respectively:
nomad agent -bind 192.168.33.10 -data-dir data -config config -server -bootstrap-expect=2 -rejoin
nomad agent -bind 192.168.33.20 -data-dir data -config config -server -bootstrap-expect=2 -rejoin -retry-join=192.168.33.10:4648
nomad agent -bind 192.168.33.30 -data-dir data -config config -server -bootstrap-expect=2 -rejoin -retry-join=192.168.33.10:4648 -retry-join=192.168.33.20:4648
*I have a script that resolves the available nomad agents in consul and starts nomad with the corresponding -retry-join flags.
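Roughly, the script does something like the following sketch (the Consul service name "nomad", the default Consul HTTP port, and the BIND_ADDR variable are illustrative stand-ins, not my exact setup):
#!/bin/sh
# Ask the local Consul agent which nodes currently register the "nomad" service
PEERS=$(curl -s http://127.0.0.1:8500/v1/catalog/service/nomad \
  | grep -o '"Address":"[0-9.]*"' | cut -d'"' -f4 | sort -u)
# Build one -retry-join flag per discovered server (serf port 4648)
RETRY_JOIN=""
for ip in $PEERS; do
  RETRY_JOIN="$RETRY_JOIN -retry-join=${ip}:4648"
done
# BIND_ADDR is this machine's own address (192.168.33.10 / .20 / .30)
exec nomad agent -bind "$BIND_ADDR" -data-dir data -config config \
  -server -bootstrap-expect=2 -rejoin $RETRY_JOIN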
After startup the nomad cluster is degraded. No matter how long I wait, I see a cluster with 3 nomad servers but no leader.
nomad server-members
Name Address Port Status Leader Protocol Build Datacenter Region
SRV1.global 192.168.33.10 4648 alive false 2 0.4.1dev dc1 global
SRV2.global 192.168.33.20 4648 alive false 2 0.4.1dev dc1 global
SRV3.global 192.168.33.30 4648 alive false 2 0.4.1dev dc1 global
The setup is as described above. I can provide a vagrant environment where the problem reproduces 100% of the time.
Attached logs and configuration
Vagrantfile
srv1.zip
srv2.zip
srv3.zip
I have seen similar issues with my clusters. I usually have to manually edit the peers.json file on each server member and add entries for the other server members. For example, in my three node cluster, I'll have to modify peers.json on each server so it looks something like this:
[
"10.XX.XX.33:4647",
"10.XX.XX.34:4647",
"10.XX.XX.35:4647"
]
After I've done this I can restart my nomad service and they'll elect a new leader. I'm not sure what HashiCorp's plans are in regards to this issue but it certainly would be nice if this could be automated.
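Concretely, the sequence I follow is roughly this (the data dir path /var/nomad/data and the service name are from my setup and may differ on yours):
# Stop nomad on all three servers first
sudo service nomad stop
# On each server, write the full peer list into the raft directory
cat > /var/nomad/data/server/raft/peers.json <<'EOF'
[
  "10.XX.XX.33:4647",
  "10.XX.XX.34:4647",
  "10.XX.XX.35:4647"
]
EOF
# Then start nomad back up on all servers and let them elect a leader
sudo service nomad start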
I'm experiencing the same issue. I'm using Ubuntu 14.04 server with nomad running as a service. 3 nodes can't elect a leader after a restart; same result with 1 server node and with 5. A sudo service nomad restart is needed to elect a new leader. I hope this issue gets fixed.
Hi @capone212, just a note on your configuration. You're using -bootstrap-expect=2, however this should be set to the number of servers in the cluster. Since you have 3 servers, you should use -bootstrap-expect=3.
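For example, your first command line above would then be:
nomad agent -bind 192.168.33.10 -data-dir data -config config -server -bootstrap-expect=3 -rejoin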
I have removed the leave_on_interrupt flag, and this problem does not reproduce anymore.
I had the same issue using consul discovery for bootstrapping.
Same.
Any ideas on this old issue? When I reboot one of my nomad server nodes, they flap forever in a "no leader" -> "leader" -> "no leader" cycle.
Possibly relevant logs --> https://gist.github.com/jcomeaux/992d3fea6a98d08af88ea0c309956fdd
Basically, I have nodes A, B, and C. I reboot "A" (172.16.8.6 in the gisted logs), and when it comes back up, leadership oscillates back and forth between B and C. The logs I've gisted are from the B node.
Some things to note:
1) When I restart a node, it gets a new hostname... and the process that does that doesn't clean up the previous route53 A record (something I must fix)
2) I have both leave_on_interrupt and leave_on_terminate set to true (relevant config lines below). I replace my entire cluster in a rolling update, regularly, and these settings seem to give the best result... unless I do the unthinkable and restart a single node :smile:
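For reference, those two settings are just top-level options in my agent config; roughly:
leave_on_interrupt = true
leave_on_terminate = true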
nomad server-members output from "B" node:
Name Address Port Status Leader Protocol Build Datacenter Region
ss-handy-horse.liveoak.us.int.global 172.16.8.6 4648 alive false 2 0.7.0 shared-services global
ss-living-sloth.liveoak.us.int.global 172.16.6.180 4648 alive true 2 0.7.0 shared-services global
ss-tender-cobra.liveoak.us.int.global 172.16.8.6 4648 left false 2 0.7.0 shared-services global
ss-valued-slug.liveoak.us.int.global 172.16.5.124 4648 alive false 2 0.7.0 shared-services global
nomad server-members output from "A" node: (the one i restarted)
Name Address Port Status Leader Protocol Build Datacenter Region
ss-handy-horse.liveoak.us.int.global 172.16.8.6 4648 alive false 2 0.7.0 shared-services global
ss-living-sloth.liveoak.us.int.global 172.16.6.180 4648 alive false 2 0.7.0 shared-services global
ss-valued-slug.liveoak.us.int.global 172.16.5.124 4648 alive false 2 0.7.0 shared-services global
...also, the copy above was from one of the times when no leader was elected...the cluster oscillates between a "no leader" and "leader" state every few minutes
I also noticed that the nomad operator raft remove-peer command doesn't work:
[root@ss-valued-slug ~]# nomad server-members
Name Address Port Status Leader Protocol Build Datacenter Region
ss-handy-horse.liveoak.us.int.global 172.16.8.6 4648 left false 2 0.7.0 shared-services global
ss-living-sloth.liveoak.us.int.global 172.16.6.180 4648 alive true 2 0.7.0 shared-services global
ss-tender-cobra.liveoak.us.int.global 172.16.8.6 4648 left false 2 0.7.0 shared-services global
ss-valued-slug.liveoak.us.int.global 172.16.5.124 4648 alive false 2 0.7.0 shared-services global
[root@ss-valued-slug ~]# nomad operator raft remove-peer -peer-address 172.16.8.6:4648
Failed to remove raft peer: Unexpected response code: 500 (rpc error: address "172.16.8.6:4648" was not found in the Raft configuration)
[root@ss-valued-slug ~]#
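In case it helps anyone else: the exact addresses that raft is tracking (which is what remove-peer has to match) can be checked first, assuming there is currently a leader to answer:
nomad operator raft list-peers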
I'm experiencing the same issue. Editing peers.json didn't work, and calling nomad operator raft list-peers fails with Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader).
I'm seeing this (or at least something very similar) on Nomad 0.7 as well.
On server-use1-1-10-2-218-211.global, it reports this:
server-use1-0-10-2-218-111.global 10.2.218.111 4648 alive false 2 0.7.0 us-east-1 global
server-use1-1-10-2-218-211.global 10.2.218.211 4648 alive true 2 0.7.0 us-east-1 global
server-use1-2-10-2-219-111.global 10.2.219.111 4648 alive false 2 0.7.0 us-east-1 global
The other two servers don't believe the election has been won
server-use1-0-10-2-218-111.global 10.2.218.111 4648 alive false 2 0.7.0 us-east-1 global
server-use1-1-10-2-218-211.global 10.2.218.211 4648 alive false 2 0.7.0 us-east-1 global
server-use1-2-10-2-219-111.global 10.2.219.111 4648 alive false 2 0.7.0 us-east-1 global
and are printing things like
Feb 20 04:37:25 ip-10-2-218-79 nomad[22687]: 2018/02/20 04:37:25.559414 [ERR] http: Request /v1/agent/health?type=server, error: {"server":{"ok":false,"message":"No cluster leader"}}
Feb 20 04:37:25 ip-10-2-218-79 nomad[22687]: 2018/02/20 04:37:25.635965 [ERR] http: Request /v1/status/leader, error: No cluster leader
Feb 20 04:37:29 ip-10-2-218-79 nomad[22687]: 2018/02/20 04:37:29.905626 [ERR] http: Request /v1/status/leader, error: No cluster leader
Feb 20 04:37:29 ip-10-2-218-79 nomad[22687]: 2018/02/20 04:37:29.939090 [ERR] http: Request /v1/status/leader, error: No cluster leader
Feb 20 04:37:30 ip-10-2-218-79 nomad[22687]: 2018/02/20 04:37:30.678490 [ERR] http: Request /v1/status/leader, error: No cluster leader
I have run into this too. In my situation, I should be able to follow the manual recovery steps outlined in https://www.nomadproject.io/guides/operations/outage.html#manual-recovery-using-peers-json. I can do this repeatedly, using the raft v2 spec (simpler), with the correct IPs, and while the 3 servers all find each other and start talking, they all error out with the following:
...
[INFO] server.nomad: successfully contacted 2 Nomad Servers
[WARN] raft: not part of stable configuration, aborting election
[ERR] worker: failed to dequeue evaluation: No cluster leader
[ERR] http: Request /v1/agent/health?type=server, error: {"server":{"ok":false,"message":"No cluster leader"}}
and then that repeats
With early versions of nomad, I was able to recover with peers.json and the system was very reliable. More concerning, the error messages coming from nomad do not explain why it is failing to elect a leader.
While the Consul docs say that the "not part of stable configuration" error is related to incorrect entries in peers.json, the peers.json file looks fine to me, so I am not sure why nomad is refusing to start.
EDIT: at least in my case, I had the port incorrect; entries for raft v2 should be IP:PORT, where PORT is the server RPC port, 4647.
@ketzacoatl I put in a docs PR (https://github.com/hashicorp/nomad/pull/5251) to update the port in the sample Raft V3 configuration and make it a little clearer in the notes at the bottom.
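For anyone who lands here later: with Raft protocol v3 the peers.json entries are objects rather than plain "IP:PORT" strings, along these lines (the node IDs below are placeholders, and the address still uses the server RPC port 4647):
[
  {
    "id": "11111111-2222-3333-4444-555555555555",
    "address": "10.1.0.1:4647",
    "non_voter": false
  },
  {
    "id": "22222222-3333-4444-5555-666666666666",
    "address": "10.1.0.2:4647",
    "non_voter": false
  }
]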
Hey there
Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.
Thanks!
This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem :+1: