Consul: Conflicts appear when changing node_name on agents

Created on 20 Mar 2018 · 14 comments · Source: hashicorp/consul

Description of the Issue (and unexpected/desired result)

  • When changing the node_name of an agent, we observe conflicts in the Consul server logs.
  • It is also possible (not reproduced yet, but observed twice in our production) that Consul servers end up blocked when several agents have changed their names.

Reproduction steps

  • change the node_name setting in a Consul agent's configuration
  • restart the Consul agent
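To illustrate the steps above, here is a minimal agent configuration sketch; the file contents are illustrative assumptions (data_dir, retry_join address), not values taken from the report:

```json
{
  "node_name": "consul-relay01-test2-pa4.central.criteo.preprod",
  "datacenter": "pa4",
  "data_dir": "/opt/consul",
  "retry_join": ["10.224.47.92"]
}
```

Changing only `node_name` while keeping the same `data_dir` (and therefore the same persisted node ID) and restarting the agent is what triggers the conflict shown in the server logs.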

On the consul server:

2018/03/20 12:53:56 [INFO] serf: EventMemberJoin: consul-relay01-test2-pa4.central.criteo.preprod 10.224.45.123
2018/03/20 12:53:56 [INFO] consul: member 'consul-relay01-test2-pa4.central.criteo.preprod' joined, marking health alive
2018/03/20 12:53:56 [WARN] consul.fsm: EnsureRegistration failed: failed inserting node: node ID "4d1dac9f-8977-4258-4aca-fa254c9f48da" for node "consul-relay01-test2-pa4.central.criteo.preprod" aliases existing node "consul-relay01-pa4.central.criteo.preprod"
2018/03/20 12:54:00 [WARN] consul.fsm: EnsureRegistration failed: failed inserting node: node ID "4d1dac9f-8977-4258-4aca-fa254c9f48da" for node "consul-relay01-test2-pa4.central.criteo.preprod" aliases existing node "consul-relay01-pa4.central.criteo.preprod"

Output of consul members:

consul-relay01-pa4.central.criteo.preprod                10.224.45.123:8301  failed  client  1.0.6  2         pa4  <default>
consul-relay01-test2-pa4.central.criteo.preprod          10.224.45.123:8301  alive   client  1.0.6  2         pa4  <default>
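To confirm that both catalog entries share the same node ID, the catalog can be inspected. This is a generic Consul CLI/API sketch, not commands from the report:

```shell
# List registered nodes with their IDs and addresses (Consul >= 1.0)
consul catalog nodes -detailed

# Equivalent HTTP API call against a local agent
curl -s http://127.0.0.1:8500/v1/catalog/nodes
```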

consul version for both Client and Server

Client: 1.0.6
Server: 1.0.6 (with some patches)

consul info for both Client and Server

Client:

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 3
    services = 4
build:
    prerelease = 
    revision = 
    version = 1.0.6
consul:
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 32
    goroutines = 46
    max_procs = 2
    os = linux
    version = go1.9.4
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 3715
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1133006
    members = 453
    query_queue = 0
    query_time = 1929

Server:

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 2
    services = 3
build:
    prerelease = criteo5
    revision = 
    version = 1.0.6
consul:
    bootstrap = false
    known_datacenters = 2
    leader = true
    leader_addr = 10.224.47.92:8300
    server = true
raft:
    applied_index = 83407159
    commit_index = 83407159
    fsm_pending = 0
    last_contact = 0
    last_log_index = 83407159
    last_log_term = 98
    last_snapshot_index = 83403466
    last_snapshot_term = 98
    latest_configuration = [{Suffrage:Voter ID:4fd4772d-e3cd-ebd6-731f-c7e6431ce284 Address:10.224.46.86:8300} {Suffrage:Voter ID:1bfb896b-ee04-520c-10e0-8b382dc0c832 Address:10.224.47.92:8300} {Suffrage:Voter ID:f5e7d35e-66b4-a4b1-b85b-5747af533b58 Address:10.224.47.83:8300}]
    latest_configuration_index = 77270117
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 98
runtime:
    arch = amd64
    cpu_count = 32
    goroutines = 2592
    max_procs = 31
    os = linux
    version = go1.10
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 3715
    failed = 2
    health_score = 0
    intent_queue = 0
    left = 1
    member_time = 1133006
    members = 456
    query_queue = 0
    query_time = 1929
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 6511
    members = 6
    query_queue = 0
    query_time = 1

Operating system and Environment details

centos7.3

Labels: type/bug, waiting-pr-merge

All 14 comments

There are many existing (closed) tickets about allowing Consul servers to change IP address; most of them were resolved by using raft protocol version 3 (and possibly the node ID).

@hashicorp: what is the reason for restricting changes to node names, since there is now a node ID?

Is it simply because the implementation was complicated (updating existing services/checks...), or to be defensive and avoid clashes?

What is the use case for changing a client agent's name without changing its ID? There may well be one, but it's worth understanding why it's needed before considering the change, which at least has some subtleties to think through.

The main value of that on _servers_ is that they have persistent state and participate in raft, where identity and state both matter for correctness. My guess is we didn't extend renaming to work for agents simply because it's not very clear why you would need to rename a _client_ agent (e.g. change its hostname) without also just letting it get a new ID (i.e. wiping its persistent state).

I could be wrong, but I don't think there is any problem if a client agent leaves and comes back with a different name AND ID but the same IP, right?

Personally I don't have a strong use case, but I can report that we had a "mini-incident" due to this bug. We also experienced the blocked-Consul-servers behaviour that @kamaradclimber mentioned.

We usually don't rename nodes, but we ended up hitting this issue due to a race condition in our provisioning pipeline (the Consul process was started before the hostname was properly rendered).

It would be nice to have Consul handle this event gracefully.

On our cluster, the Consul node name is the FQDN of the machine. Some of our users change their domain name, which leads to an attempt to change the Consul node name.

As a side note, we don't touch the node ID and let Consul generate it using its deterministic method.

For my on premise solution, I have a cron job which names the machine based on its ip address and the Proxmox VMID.

I am currently on v1.0.7.

If a machine is offline for a few days, it gets a new IP and the name change goes through smoothly.

Recently I changed the naming scheme a little bit.

A VM which had been off for a few months (Consul 0.9) came online with the old naming scheme.

After updating the Consul agent and updating the cron files, I had two entries in my consul members output, one with old name and one with the new name.

I just let it be and the next day, the old name was gone from the list.

@shantanugadgil
Thank you for sharing your approach and experience!
Unfortunately, we hit the issue several times, causing various production problems, and fixing it requires manual intervention on our side, which is painful. (We even had cases where we could not fix it without waiting a few hours.)
Since the node now carries an ID, I think its name could be changed without too much trouble.

@banks, would you have feedback on that issue?
@pierresouchay we might want to include #3983 in our next consul build to check the improvement for our use case

We had to revert #3983 as it caused problems in testing, and we discovered it's a breaking change which we can't include in the current release cycle.

We still think this is close and will add some extra details about what we need to do to get this into 1.3.

@banks Ok, I'll give you more details about our incident as well

\o/

Finally!!! 👍👍👍

Is this fixed in 1.3.0? Because I keep getting errors similar to this all over my stack:

Node name bastion-03f55798ed842fc0e is reserved by node a76399da-e1b0-c1ed-d426-e7ac892ef6c2 with name bastion-03f55798ed842fc0e

Sometimes it goes away with a service restart, sometimes it doesn't.
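Until the rename is handled gracefully, a stale registration like the one in the error above can be cleaned up by hand. A hedged sketch using Consul's catalog deregister endpoint; the datacenter and node name here are placeholders, not values from this report:

```shell
# Remove the old node registration from the catalog
curl -s -X PUT http://127.0.0.1:8500/v1/catalog/deregister \
  -d '{"Datacenter": "pa4", "Node": "consul-relay01-pa4.central.criteo.preprod"}'
```

If the old entry is still listed in `consul members`, `consul force-leave <old-name>` can also help move it out of the failed state.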
