Nomad v0.7.1 (0b295d399d00199cfab4621566babd25987ba06e)
I originally ran the cluster with the Nomad server+client on 3 Ubuntu machines (in Docker containers using this image), which worked perfectly. Next, I installed Nomad on two Windows servers using Chocolatey, and neither is able to successfully join the cluster:
2018/03/28 13:47:07.736646 [ERR] client: registration failure: No cluster leader
2018/03/28 13:47:12.462925 [INFO] server.nomad: successfully contacted 2 Nomad Servers
2018/03/28 13:47:16.795184 [ERR] worker: failed to dequeue evaluation: No cluster leader
2018/03/28 13:47:18.564270 [ERR] http: Request /v1/agent/health?type=server, error: {"server":{"ok":false,"message":"No cluster leader"}}
It complains about "no cluster leader", but the logs from the Ubuntu servers seem to indicate a communication issue with the Windows Nomad agents:
2018/03/28 13:41:30 [ERR] memberlist: Push/Pull with dev-sb-05.global failed: dial tcp 128.1.38.66:4648: i/o timeout
2018/03/28 13:41:31 [INFO] serf: EventMemberUpdate: dev-sb-05.global
Netstat on dev-sb-05 shows that it is listening on port 4648, and nothing in its own logs indicates a communication issue on that port, so how can I figure out what's happening? (Note: the servers are all on the same network.)
If I stop the Nomad service on Windows and bring up the other Ubuntu server, then everything starts working again as expected. The configurations are exactly the same (with the exception of the bind_addr), so it's really difficult to tell why Nomad on Windows is having issues. I even tried disabling the Nomad client on Windows (leaving just the server), but the Nomad server still cannot participate in the cluster successfully.
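For reference, here is roughly what the agent configuration looks like on every node (only bind_addr differs per host; the paths and IPs below are illustrative rather than my exact values):

data_dir  = "/opt/nomad/data"      # C:\nomad\data on the Windows machines
bind_addr = "128.1.38.66"          # this host's address; the only per-host difference

server {
  enabled          = true
  bootstrap_expect = 3
}

client {
  enabled = true
  servers = ["128.1.6.243:4647"]   # one of the existing Ubuntu servers
}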
The more I look at the logs from one of the Ubuntu servers, the more it looks like a raft issue with the Windows server, but how would I remedy this issue?
2018/03/28 14:18:57 [INFO] raft: Node at 128.1.6.243:4647 [Follower] entering Follower state (Leader: "")
2018/03/28 14:18:58 [ERR] raft: Failed to make RequestVote RPC to {Voter 128.1.38.66:4647 128.1.38.66:4647}: dial tcp 128.1.38.66:4647: i/o timeout
2018/03/28 14:19:01 [ERR] raft: Failed to make RequestVote RPC to {Voter 128.1.38.66:4647 128.1.38.66:4647}: dial tcp 128.1.38.66:4647: i/o timeout
2018/03/28 14:19:01.897730 [ERR] client.consul: error discovering nomad servers: 2 error(s) occurred:
hi @bis-sb, the raft layer is doing RPCs over the network; nothing about that is Windows-specific. Do you perhaps have default firewall rules on your Windows servers?
Here is a list of ports used by Nomad https://www.nomadproject.io/guides/cluster/requirements.html#ports-used. You might want to make sure you don't have any rules blocking traffic on those ports.
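By default those are 4646 (HTTP/UI), 4647 (RPC) and 4648 (Serf gossip); if you've overridden them it would be via the ports block in the agent config, e.g.:

ports {
  http = 4646   # API and UI
  rpc  = 4647   # server RPC (the 4647 dials failing in your raft logs)
  serf = 4648   # gossip (the 4648 dials failing in your memberlist logs)
}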
@preetapan Whoops, that probably should've been the first thing I checked!! As soon as I added a firewall rule for ports 4646, 4647, and 4648 it started working as expected. Thank you!
@bis-sb Also, you alluded to using two Nomad servers. In production we recommend a minimum of three for failure tolerance - https://www.nomadproject.io/docs/internals/consensus.html#deployment-table
Thanks for your help; maybe you could clarify a few more things for me.
I've got my cluster running with Nomad on 5 servers: 2 Windows and 3 Ubuntu.
I've opened the requisite ports, and it looks like the server components are working correctly (they are able to elect a leader), but the clients are in constant flux (in the Nomad UI the clients are constantly switching between "ready" and "initializing").
The Ubuntu logs seem to indicate that the issue happens when getting a heartbeat from the Windows machines, and indeed disabling the client {} section in the Windows Nomad services makes everything stabilize, i.e. the Nomad clients on the Windows machines show as "down" and the ones on the Ubuntu servers show as "ready".
Is there additional configuration I have to do in Windows to make the clients work correctly?
I noticed that for one of the Windows machines (my PC) in particular, Nomad wants me to set the network_interface property in the client configuration, but setting it doesn't resolve the issue on that server (see the sketch below the logs). The other Windows server made no such complaint and was still unable to successfully register its client. That other server, as well as the Ubuntu servers in the cluster, keeps logging the following:
2018/03/28 17:55:28.539817 [INFO] client: node registration complete
2018/03/28 17:55:33.898131 [WARN] client: heartbeat missed (request took 9.006ms). Heartbeat TTL was 10.947363891s and heartbeated after 5.3583142s
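For context, the client stanza I tried on the machine that complained looks roughly like this (the interface name is just an example value picked from the adapters that ipconfig lists, not necessarily the correct one):

client {
  enabled           = true
  network_interface = "Ethernet0"   # example interface name; illustrative only
}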
I'm unsure what's preventing Nomad on Windows from completing the heartbeat. Also, if I check the client statuses in the Nomad UI, it always shows at least one of the clients as "initializing" and the others as "ready".
Also, I want to test a Nomad job that runs a service on my Windows machines, but when I run nomad plan test.nomad it reports:
- WARNING: Failed to place all allocations.
Task Group "test" (failed to place 1 allocation):
* Constraint "${attr.kernel.name} = windows" filtered 2 nodes
* Constraint "missing drivers" filtered 1 nodes
Does Nomad need a client running on a Windows machine in order for this to succeed, or is a server sufficient? What does the "filtered x nodes" message mean?
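For context, test.nomad is essentially a trimmed-down Docker service with a constraint like this (illustrative sketch; the actual task configuration is omitted):

job "test" {
  datacenters = ["dc1"]              # illustrative datacenter name

  constraint {
    attribute = "${attr.kernel.name}"
    value     = "windows"
  }

  group "test" {
    task "test" {
      driver = "docker"
      # docker image and the rest of the task config omitted
    }
  }
}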
I've been able to resolve the first issue; essentially it came down to Vault being incorrectly configured. I noticed that nomad on Windows was getting an error in its log on startup: [ERR] vault: failed to validate self token/role and not retrying: failed to lookup Vault periodic token: Error making API request.
Checking the Vault log revealed a similar issue: [ERROR] http/handleRequestForwarding: error forwarding request: error=error during forwarding RPC request
Upon re-checking the Vault HA documentation, I noticed I was configuring api_addr incorrectly in my Vault instances. After updating api_addr to point at the load balancer address, the Nomad clients on my Windows machines were able to successfully register themselves. What's a little confusing is that the Nomad clients on Ubuntu seemed unaffected by this.
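For anyone who hits the same thing, the change was essentially this one line in each Vault instance's configuration (the load balancer address below is illustrative):

# previously this was misconfigured; pointing it at the load balancer fixed the request forwarding errors above
api_addr = "http://vault-lb.example.local:8200"   # illustrative load balancer address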
I'm still confused about how to interpret the output from nomad plan <job>, since it's still failing (or at least giving me this warning):
- WARNING: Failed to place all allocations.
Task Group "test" (failed to place 1 allocation):
* Constraint "missing drivers" filtered 2 nodes
* Constraint "${attr.kernel.name} = windows" filtered 3 nodes
Does it mean that there are no nodes which meet the constraints (i.e. having a "windows" kernel)? What does "missing drivers" mean? (Note: I have Docker installed and running successfully on both Windows machines.)
This output indicates that your three Ubuntu machines are being filtered out because they don't meet the "${attr.kernel.name} = windows" constraint. Based on the Docker issue you filed (#4080), that would explain your two Windows boxes being filtered out first by the "missing drivers" constraint.
Yes, it's making sense now; your help is much appreciated! Now that I've gotten the driver issue sorted out, it's starting to accept my jobs and I can begin the real work of setting up my services :)