0.5.6
After a failed upgrade from Nomad 0.5.4 to 0.5.6 on some of our hosts, Nomad was broken on those nodes (it doesn't work). So we decided to clean up the Nomad client state dir (we simply removed it from the file system) and relaunch the Nomad agent. But it can't join the working cluster, due to the following errors in the log:
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.30.5:4647: rpc error: rpc error: node secret ID does not match. Not registering node.
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.30.2:4647: rpc error: rpc error: node secret ID does not match. Not registering node.
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.30.6:4647: rpc error: rpc error: node secret ID does not match. Not registering node.
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.30.1:4647: rpc error: node secret ID does not match. Not registering node.
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.31.220:4647: rpc error: rpc error: node secret ID does not match. Not registering node.
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.30.4:4647: rpc error: failed to get conn: dial tcp 192.168.30.4:4647: getsockopt: connection refused
Apr 12 00:14:52 monitor1 nomad[3226]: client: registration failure: 7 error(s) occurred:#012#012* RPC failed to server 192.168.30.3:4647: rpc error: failed to get conn: dial tcp 192.168.30.3:4647: getsockopt: connection refused#012* RPC failed to server 192.168.30.5:4647: rpc error: rpc error: node secret ID does not match. Not registering node.#012* RPC failed to server 192.168.30.2:4647: rpc error: rpc error: node secret ID does not match. Not registering node.#012* RPC failed to server 192.168.30.6:4647: rpc error: rpc error: node secret ID does not match. Not registering node.#012* RPC failed to server 192.168.30.1:4647: rpc error: node secret ID does not match. Not registering node.#012* RPC failed to server 192.168.31.220:4647: rpc error: rpc error: node secret ID does not match. Not registering node.#012* RPC failed to server 192.168.30.4:4647: rpc error: failed to get conn: dial tcp 192.168.30.4:4647: getsockopt: connection refused
and the node is marked as down, with no chance of returning to the ready state:
root@social:/home/ruslan# nomad node-status
ID        DC    Name             Class  Drain  Status
00000000 test vol-h-docker-02 ceph false ready
a50ce082 test server6 ceph false ready
a3e6b08b test monitor1 ceph false down
439a2f5a test graphite ceph false ready
41b521c8 test vol-h-docker-01 ceph false ready
ec475f0a test social ceph false ready
1e6111fb test server2 ceph false ready
As I understand it, due to GH-2277 nodes now have persistent IDs, but secret IDs are not persistent, because they can be cleared by removing the Nomad agent state dir (as in our case). So the Nomad servers think a rogue node is trying to register (they remember the persistent node ID) and reject it. Nomad has no command that forces the servers to forget about down nodes, which would give such a node a chance to re-register (it seems we just have to wait until Nomad garbage-collects the down nodes, but that takes time). What can we do in this situation?
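To make the failure mode concrete, here is a sketch of the server-side check being described (this is hypothetical illustrative code, not Nomad's actual implementation): once a node ID is known, a registration is accepted only if the presented secret ID matches the remembered one.

```python
# Hypothetical sketch of the server-side check -- not Nomad's real code.
known_nodes = {}  # node_id -> secret_id, as remembered by the servers

def register_node(node_id, secret_id):
    """Accept a registration only if the node is new or its secret matches."""
    stored = known_nodes.get(node_id)
    if stored is not None and stored != secret_id:
        # The persistent node ID is recognized, but the freshly generated
        # secret (the state dir was wiped) does not match the remembered one.
        return "rpc error: node secret ID does not match. Not registering node."
    known_nodes[node_id] = secret_id
    return "ok"

# First registration succeeds. After wiping the state dir, the agent keeps
# the same host-derived node ID but generates a new secret, so it is rejected.
print(register_node("a3e6b08b", "secret-old"))  # ok
print(register_node("a3e6b08b", "secret-new"))  # rejected
```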
As a workaround, we found that the new Nomad agent client config option no_host_uuid does the trick.
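For reference, that option lives in the client stanza of the agent configuration; when set, the client generates a random node ID instead of deriving it from the host, so a rebuilt client registers as a brand-new node (a sketch; check the docs for your Nomad version to confirm availability and the default value):

```hcl
client {
  enabled = true

  # Use a random node ID instead of one derived from the host's hardware
  # UUID, so a client with a wiped state dir registers as a new node.
  no_host_uuid = true
}
```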
@tantra35 Yeah that is a bit tricky. What you can do is stop the node and wait for nomad to detect it as dead (30 seconds) and then issue a GC which will clear knowledge of that node from the servers. You can do that as follows:
$ curl -XPUT http://127.0.0.1:4646/v1/system/gc
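Putting the steps together, the recovery sequence might look like this (an illustrative outline; the service manager, state dir path, and server address are examples, so adjust them to your environment):

```shell
# On the broken client: stop the agent (and, as in this case, clear its state dir).
systemctl stop nomad            # or however the agent is supervised
rm -rf /var/lib/nomad/client    # example path; use your configured state_dir

# Wait ~30 seconds for the servers to mark the node as dead, then trigger a
# garbage collection against any server so it forgets the node.
curl -XPUT http://127.0.0.1:4646/v1/system/gc

# Restart the agent; it can now register with a fresh node/secret ID pair.
systemctl start nomad
```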