On the consul server:
2018/03/20 12:53:56 [INFO] serf: EventMemberJoin: consul-relay01-test2-pa4.central.criteo.preprod 10.224.45.123
2018/03/20 12:53:56 [INFO] consul: member 'consul-relay01-test2-pa4.central.criteo.preprod' joined, marking health alive
2018/03/20 12:53:56 [WARN] consul.fsm: EnsureRegistration failed: failed inserting node: node ID "4d1dac9f-8977-4258-4aca-fa254c9f48da" for node "consul-relay01-test2-pa4.central.criteo.preprod" aliases existing node "consul-relay01-pa4.central.criteo.preprod"
2018/03/20 12:54:00 [WARN] consul.fsm: EnsureRegistration failed: failed inserting node: node ID "4d1dac9f-8977-4258-4aca-fa254c9f48da" for node "consul-relay01-test2-pa4.central.criteo.preprod" aliases existing node "consul-relay01-pa4.central.criteo.preprod"
Output of consul members:
consul-relay01-pa4.central.criteo.preprod 10.224.45.123:8301 failed client 1.0.6 2 pa4 <default>
consul-relay01-test2-pa4.central.criteo.preprod 10.224.45.123:8301 alive client 1.0.6 2 pa4 <default>
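The duplicate can be spotted mechanically: both entries claim the same Serf address. A small sketch, with the thread's output inlined since it assumes no live agent (against a real cluster you would pipe `consul members` directly):

```shell
# Simulated `consul members` output from this thread (name, address, status, type).
MEMBERS='consul-relay01-pa4.central.criteo.preprod 10.224.45.123:8301 failed client
consul-relay01-test2-pa4.central.criteo.preprod 10.224.45.123:8301 alive client'

# Print any address claimed by more than one node name.
DUP=$(echo "$MEMBERS" | awk '{print $2}' | sort | uniq -d)
echo "$DUP"
```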
consul version for both Client and Server:
Client: 1.0.6
Server: 1.0.6 (with some patches)
consul info for both Client and Server:
Client:
agent:
check_monitors = 0
check_ttls = 0
checks = 3
services = 4
build:
prerelease =
revision =
version = 1.0.6
consul:
known_servers = 3
server = false
runtime:
arch = amd64
cpu_count = 32
goroutines = 46
max_procs = 2
os = linux
version = go1.9.4
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 3715
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1133006
members = 453
query_queue = 0
query_time = 1929
Server:
agent:
check_monitors = 0
check_ttls = 0
checks = 2
services = 3
build:
prerelease = criteo5
revision =
version = 1.0.6
consul:
bootstrap = false
known_datacenters = 2
leader = true
leader_addr = 10.224.47.92:8300
server = true
raft:
applied_index = 83407159
commit_index = 83407159
fsm_pending = 0
last_contact = 0
last_log_index = 83407159
last_log_term = 98
last_snapshot_index = 83403466
last_snapshot_term = 98
latest_configuration = [{Suffrage:Voter ID:4fd4772d-e3cd-ebd6-731f-c7e6431ce284 Address:10.224.46.86:8300} {Suffrage:Voter ID:1bfb896b-ee04-520c-10e0-8b382dc0c832 Address:10.224.47.92:8300} {Suffrage:Voter ID:f5e7d35e-66b4-a4b1-b85b-5747af533b58 Address:10.224.47.83:8300}]
latest_configuration_index = 77270117
num_peers = 2
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Leader
term = 98
runtime:
arch = amd64
cpu_count = 32
goroutines = 2592
max_procs = 31
os = linux
version = go1.10
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 3715
failed = 2
health_score = 0
intent_queue = 0
left = 1
member_time = 1133006
members = 456
query_queue = 0
query_time = 1929
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 6511
members = 6
query_queue = 0
query_time = 1
Operating system: CentOS 7.3
There are many existing (closed) tickets about allowing Consul servers to change IP address; most of them were resolved thanks to Raft protocol version 3 (and possibly the node ID).
@hashicorp: what is the reason for restricting changes to node names now that there is a node ID?
Is it simply because the implementation would be complicated (updating existing services/checks, etc.), or is it a defensive measure to avoid clashes?
What is the use-case for having a client agent's name change without its ID changing? There may well be one, but it's worth understanding why it's needed before considering the change, which at least has some subtleties to think through.
The main value of that on _servers_ is that they have persistent state and participate in Raft, where identity and state both matter for correctness. My guess is that we didn't extend renaming to work for agents simply because it's not very clear why you would need to rename a _client_ agent (e.g. change its hostname) without also just letting it get a new ID (i.e. wiping its persistent state).
I could be wrong, but I don't think there is any problem if a client agent leaves and comes back with a different name AND ID but the same IP, right?
Personally I don't have a strong use-case, but I can report that we had a "mini-incident" due to this bug. We also experienced the blocked-Consul-servers behaviour that @kamaradclimber mentioned.
We usually don't rename nodes, but we ended up hitting this issue due to a race condition in our provisioning pipeline (the consul process was started before the hostname was properly set).
It would be nice for Consul to handle this event gracefully.
On our cluster, the Consul node name is the FQDN of the machine. Some of our users change their domain name, which leads to an attempt to change the Consul node name.
As a side note, we don't touch the node ID and let Consul generate it using its deterministic method.
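For contrast with the deterministic host-derived ID, the name/ID pairing can also be pinned explicitly at startup. A sketch only: `-node`, `-node-id`, and `-data-dir` are real agent flags, but the values below are just the name and ID that appear in this thread's log lines, and this requires a local Consul binary.

```shell
# Sketch: start the agent with an explicit node ID so the name/ID pairing
# is under the operator's control rather than host-derived.
consul agent \
  -node "consul-relay01-test2-pa4.central.criteo.preprod" \
  -node-id "4d1dac9f-8977-4258-4aca-fa254c9f48da" \
  -data-dir /var/lib/consul
```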
For my on-premise solution, I have a cron job which names the machine based on its IP address and the Proxmox VMID.
I am currently on v1.0.7.
If a machine is offline for a few days, it gets a new IP and the name change goes through smoothly.
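A naming scheme like that could be sketched roughly as follows; the IP and VMID values here are invented placeholders, and a real cron job would read them from `hostname -I` and the Proxmox API instead:

```shell
# Hypothetical cron job body: derive a node name from the IP and Proxmox VMID.
IP="10.0.0.42"      # placeholder; normally from `hostname -I | awk '{print $1}'`
VMID="101"          # placeholder; normally from the Proxmox guest agent / API
NODE_NAME="vm${VMID}-$(echo "$IP" | tr '.' '-')"
echo "$NODE_NAME"   # → vm101-10-0-0-42
```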
Recently I changed the naming scheme a little bit.
A VM which had been off for a few months (still running Consul 0.9) came online with the old naming scheme.
After updating the Consul agent and the cron files, I had two entries in my consul members output, one with the old name and one with the new name.
I just let it be, and by the next day the old name was gone from the list.
@shantanugadgil
Thank you for sharing your approach and experience!
Unfortunately, we have hit this issue several times, causing various production problems, and fixing it requires manual intervention on our side, which is painful. (In some cases we could not fix it without waiting a few hours.)
Since the node now carries an ID, I think its name could be changed without too much trouble.
@banks, would you have feedback on this issue?
@pierresouchay we might want to include #3983 in our next Consul build to check whether it improves our use case.
We had to revert #3983 as it caused problems in testing, and we discovered it's a breaking change which we can't include in the current release cycle.
We still think this is close and will add some extra details about what we need to do to get it into 1.3.
@banks Ok, I'll give you more details about our incident as well
\o/
Finally!!! 👍👍👍
Is this fixed in 1.3.0? I keep getting errors similar to this all over my stack:
Node name bastion-03f55798ed842fc0e is reserved by node a76399da-e1b0-c1ed-d426-e7ac892ef6c2 with name bastion-03f55798ed842fc0e
Sometimes it goes away with a service restart; sometimes it doesn't.
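When a restart doesn't clear it, the stale entry can usually be removed by deregistering the old node from the catalog. A sketch that only builds the request payload (the node and datacenter names are taken from this thread's `consul members` output; the actual `curl` call is left commented out since it needs a live agent):

```shell
# Build the /v1/catalog/deregister payload for the stale node entry.
NODE="consul-relay01-pa4.central.criteo.preprod"
DC="pa4"
PAYLOAD=$(printf '{"Datacenter": "%s", "Node": "%s"}' "$DC" "$NODE")
echo "$PAYLOAD"

# Against a live agent you would then run:
#   curl -sX PUT -d "$PAYLOAD" http://127.0.0.1:8500/v1/catalog/deregister
```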