Nomad v0.4.1-dev
vagrant linux host, 3 windows guests
I have a vagrant multi-machine configuration with 3 identical servers (see Vagrantfile). When I do
"vagrant halt", the servers go down one by one (from srv-3 to srv-1). I have the "leave_on_interrupt" flag enabled, so during this operation the nomad agents leave the cluster.
Then I do "vagrant up", which brings the nomad servers back (from srv-1 to srv-3) with the following command lines respectively:
nomad agent -bind 192.168.33.10 -data-dir data -config config -server -bootstrap-expect=2 -rejoin
nomad agent -bind 192.168.33.20 -data-dir data -config config -server -bootstrap-expect=2 -rejoin -retry-join=192.168.33.10:4648
nomad agent -bind 192.168.33.30 -data-dir data -config config -server -bootstrap-expect=2 -rejoin -retry-join=192.168.33.10:4648 -retry-join=192.168.33.20:4648
*I have a script that resolves the available nomad agents in consul and starts nomad with the corresponding -retry-join flags.
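Roughly, the script does something like the following sketch (the Consul service name "nomad", the default Consul HTTP port, and the BIND_ADDR variable are illustrative stand-ins, not my exact setup):
#!/bin/sh
# Ask the local Consul agent which nodes currently register the "nomad" service
PEERS=$(curl -s http://127.0.0.1:8500/v1/catalog/service/nomad \
  | grep -o '"Address":"[0-9.]*"' | cut -d'"' -f4 | sort -u)
# Build one -retry-join flag per discovered server (serf port 4648)
RETRY_JOIN=""
for ip in $PEERS; do
  RETRY_JOIN="$RETRY_JOIN -retry-join=${ip}:4648"
done
# BIND_ADDR is this machine's own address (192.168.33.10 / .20 / .30)
exec nomad agent -bind "$BIND_ADDR" -data-dir data -config config \
  -server -bootstrap-expect=2 -rejoin $RETRY_JOIN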
After startup the nomad cluster is degraded. No matter how long I wait, I see a cluster with 3 nomad servers but no leader.
nomad server-members
Name Address Port Status Leader Protocol Build Datacenter Region
SRV1.global 192.168.33.10 4648 alive false 2 0.4.1dev dc1 global
SRV2.global 192.168.33.20 4648 alive false 2 0.4.1dev dc1 global
SRV3.global 192.168.33.30 4648 alive false 2 0.4.1dev dc1 global
The setup is as described above. I can provide a vagrant environment where the problem reproduces 100% of the time.
Attached logs and configuration
Vagrantfile
srv1.zip
srv2.zip
srv3.zip
I have seen similar issues with my clusters. I usually have to manually edit the peers.json file on each server member and add entries for the other server members. For example, in my three node cluster, I'll have to modify peers.json on each server so it looks something like this:
[
"10.XX.XX.33:4647",
"10.XX.XX.34:4647",
"10.XX.XX.35:4647"
]
After I've done this I can restart my nomad service and they'll elect a new leader. I'm not sure what HashiCorp's plans are in regards to this issue but it certainly would be nice if this could be automated.
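Concretely, the sequence I follow is roughly this (the data dir path /var/nomad/data and the service name are from my setup and may differ on yours):
# Stop nomad on all three servers first
sudo service nomad stop
# On each server, write the full peer list into the raft directory
cat > /var/nomad/data/server/raft/peers.json <<'EOF'
[
  "10.XX.XX.33:4647",
  "10.XX.XX.34:4647",
  "10.XX.XX.35:4647"
]
EOF
# Then start nomad back up on all servers and let them elect a leader
sudo service nomad start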
I'm experiencing the same issue. I'm using Ubuntu 14.04 server with nomad running as a service. 3 nodes can't elect a leader after a restart; same result with 1 server node and with 5. A sudo service nomad restart is needed to elect a new leader. I hope this issue gets fixed.
Hi @capone212, just a note on your configuration. You're using -bootstrap-expect=2, however this should be set to the number of servers in the cluster. Since you have 3 servers, you should use -bootstrap-expect=3.
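For example, your first command line above would then be:
nomad agent -bind 192.168.33.10 -data-dir data -config config -server -bootstrap-expect=3 -rejoin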
I have removed the leave_on_interrupt flag, and this problem does not reproduce anymore.
I had the same issue using consul discovery for bootstrapping.
Same.
Any ideas on this old issue? When I reboot one of my nomad server nodes, they flap forever in a "no leader" -> "leader" -> "no leader" cycle.
Possibly relevant logs --> https://gist.github.com/jcomeaux/992d3fea6a98d08af88ea0c309956fdd
Basically, I have nodes A, B, and C. I reboot "A" (172.16.8.6 in the gisted logs), and when it comes back up, leadership oscillates back and forth between B and C. The logs I've gisted are from the B node.
Some things to note:
1) When I restart a node, it gets a new hostname... and the process that does that doesn't clean up the previous route53 A record (something I must fix)
2) I have both leave_on_interrupt and leave_on_terminate set to true (relevant config lines below). I replace my entire cluster in a rolling update, regularly, and these settings seem to give the best result... unless I do the unthinkable and restart a single node :smile:
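For reference, those two settings are just top-level options in my agent config; roughly:
leave_on_interrupt = true
leave_on_terminate = true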
nomad server-members output from "B" node:
Name Address Port Status Leader Protocol Build Datacenter Region
ss-handy-horse.liveoak.us.int.global 172.16.8.6 4648 alive false 2 0.7.0 shared-services global
ss-living-sloth.liveoak.us.int.global 172.16.6.180 4648 alive true 2 0.7.0 shared-services global
ss-tender-cobra.liveoak.us.int.global 172.16.8.6 4648 left false 2 0.7.0 shared-services global
ss-valued-slug.liveoak.us.int.global 172.16.5.124 4648 alive false 2 0.7.0 shared-services global
nomad server-members output from "A" node: (the one i restarted)
Name Address Port Status Leader Protocol Build Datacenter Region
ss-handy-horse.liveoak.us.int.global 172.16.8.6 4648 alive false 2 0.7.0 shared-services global
ss-living-sloth.liveoak.us.int.global 172.16.6.180 4648 alive false 2 0.7.0 shared-services global
ss-valued-slug.liveoak.us.int.global 172.16.5.124 4648 alive false 2 0.7.0 shared-services global
...also, the copy above was from one of the times when no leader was elected...the cluster oscillates between a "no leader" and "leader" state every few minutes
I also noticed that the nomad operator raft remove-peer command doesn't work:
[root@ss-valued-slug ~]# nomad server-members
Name Address Port Status Leader Protocol Build Datacenter Region
ss-handy-horse.liveoak.us.int.global 172.16.8.6 4648 left false 2 0.7.0 shared-services global
ss-living-sloth.liveoak.us.int.global 172.16.6.180 4648 alive true 2 0.7.0 shared-services global
ss-tender-cobra.liveoak.us.int.global 172.16.8.6 4648 left false 2 0.7.0 shared-services global
ss-valued-slug.liveoak.us.int.global 172.16.5.124 4648 alive false 2 0.7.0 shared-services global
[root@ss-valued-slug ~]# nomad operator raft remove-peer -peer-address 172.16.8.6:4648
Failed to remove raft peer: Unexpected response code: 500 (rpc error: address "172.16.8.6:4648" was not found in the Raft configuration)
[root@ss-valued-slug ~]#
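In case it helps anyone else: the exact addresses that raft is tracking (which is what remove-peer has to match) can be checked first, assuming there is currently a leader to answer:
nomad operator raft list-peers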
I'm experiencing the same issue. Editing peers.json didn't work, and calling nomad operator raft list-peers fails with Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader).
I'm seeing this (or at least something very similar) on Nomad 0.7 as well.
On server-use1-1-10-2-218-211.global, it reports this:
server-use1-0-10-2-218-111.global 10.2.218.111 4648 alive false 2 0.7.0 us-east-1 global
server-use1-1-10-2-218-211.global 10.2.218.211 4648 alive true 2 0.7.0 us-east-1 global
server-use1-2-10-2-219-111.global 10.2.219.111 4648 alive false 2 0.7.0 us-east-1 global
The other two servers don't believe the election has been won
server-use1-0-10-2-218-111.global 10.2.218.111 4648 alive false 2 0.7.0 us-east-1 global
server-use1-1-10-2-218-211.global 10.2.218.211 4648 alive false 2 0.7.0 us-east-1 global
server-use1-2-10-2-219-111.global 10.2.219.111 4648 alive false 2 0.7.0 us-east-1 global
and are printing things like
Feb 20 04:37:25 ip-10-2-218-79 nomad[22687]: 2018/02/20 04:37:25.559414 [ERR] http: Request /v1/agent/health?type=server, error: {"server":{"ok":false,"message":"No cluster leader"}}
Feb 20 04:37:25 ip-10-2-218-79 nomad[22687]: 2018/02/20 04:37:25.635965 [ERR] http: Request /v1/status/leader, error: No cluster leader
Feb 20 04:37:29 ip-10-2-218-79 nomad[22687]: 2018/02/20 04:37:29.905626 [ERR] http: Request /v1/status/leader, error: No cluster leader
Feb 20 04:37:29 ip-10-2-218-79 nomad[22687]: 2018/02/20 04:37:29.939090 [ERR] http: Request /v1/status/leader, error: No cluster leader
Feb 20 04:37:30 ip-10-2-218-79 nomad[22687]: 2018/02/20 04:37:30.678490 [ERR] http: Request /v1/status/leader, error: No cluster leader
I have run into this too. In my situation, I should be able to follow the manual recovery steps outlined in https://www.nomadproject.io/guides/operations/outage.html#manual-recovery-using-peers-json. I can do this repeatedly, using the raft v2 spec (simpler), with the correct IPs, and while the 3 servers all find each other and start talking, they all error out with the following:
...
[INFO] server.nomad: successfully contacted 2 Nomad Servers
[WARN] raft: not part of stable configuration, aborting election
[ERR] worker: failed to dequeue evaluation: No cluster leader
[ERR] http: Request /v1/agent/health?type=server, error: {"server":{"ok":false,"message":"No cluster leader"}}
and then that repeats
With early versions of nomad, I was able to recover with peers.json and the system was very reliable. More concerning, the error messages coming from nomad do not explain why it is failing to elect a leader.
While the Consul docs say that the "not part of stable configuration" error is related to incorrect entries in peers.json, the peers.json file looks fine to me, so I am not sure why nomad is refusing to start.
EDIT: at least in my case, I had the port incorrect; entries for raft v2 should be IP:PORT, where PORT is the server RPC port, 4647.
@ketzacoatl I put in a docs PR (https://github.com/hashicorp/nomad/pull/5251) to update the port in the sample Raft V3 configuration and make it a little clearer in the notes at the bottom.
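For anyone who lands here later: with Raft protocol v3 the peers.json entries are objects rather than plain "IP:PORT" strings, along these lines (the node IDs below are placeholders, and the address still uses the server RPC port 4647):
[
  {
    "id": "11111111-2222-3333-4444-555555555555",
    "address": "10.1.0.1:4647",
    "non_voter": false
  },
  {
    "id": "22222222-3333-4444-5555-666666666666",
    "address": "10.1.0.2:4647",
    "non_voter": false
  }
]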
Hey there
Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.
Thanks!
This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem :+1: