Consul: unable to join Consul 1.7.x cluster due to other members having conflicting node IDs

Created on 5 Mar 2020 · 18 comments · Source: hashicorp/consul

Overview of the Issue

Since 1.7.0 we have noticed issues with members not being able to join because other clients in the DC have "conflicting" node IDs. Example error below from a test environment in AWS; note that the client unable to join is a completely different node than the one with the conflicting node ID:

Mar 05 07:34:56 ip-172-22-36-4 consul[17772]:     2020/03/05 07:34:56 [WARN] agent: (LAN) couldn't join: 0 Err: 3 errors occurred:
Mar 05 07:34:56 ip-172-22-36-4 consul[17772]:         * Failed to join 172.22.38.239: Member 'ip-172-22-37-78.eu-west-1.compute.test.internal' has conflicting node ID 'b216bc09-937a-41d6-b681-9401413f2d9b' with member 'ip-172-22-37-78.eu-west-1.test.internal'
Mar 05 07:34:56 ip-172-22-36-4 consul[17772]:         * Failed to join 172.22.36.51: Member 'ip-172-22-37-78.eu-west-1.compute.test.internal' has conflicting node ID 'b216bc09-937a-41d6-b681-9401413f2d9b' with member 'ip-172-22-37-78.eu-west-1.test.internal'
Mar 05 07:34:56 ip-172-22-36-4 consul[17772]:         * Failed to join 172.22.37.206: Member 'ip-172-22-37-78.eu-west-1.compute.test.internal' has conflicting node ID 'b216bc09-937a-41d6-b681-9401413f2d9b' with member 'ip-172-22-37-78.eu-west-1.test.internal'

In this test environment I took one of the clients, changed its node name, and restarted the Consul service (leaving the cluster and rejoining with a new name). On 1.6.4 this works and does not block other servers from joining.
Below is the `consul members` output from a server in the cluster, showing the two old node names with status "left":

[root@ip-172-22-37-158 ~]# consul members
Node                                             Address             Status  Type    Build  Protocol  DC       Segment
ip-172-22-36-208.eu-west-1.compute.internal      172.22.36.208:8301  alive   server  1.6.4  2         sandbox  <all>
ip-172-22-37-158.eu-west-1.compute.internal      172.22.37.158:8301  alive   server  1.6.4  2         sandbox  <all>
ip-172-22-38-132.eu-west-1.compute.internal      172.22.38.132:8301  alive   server  1.6.4  2         sandbox  <all>
ip-172-22-36-4.eu-west-1.compute.internal        172.22.36.4:8301    alive   client  1.6.4  2         sandbox  <default>
ip-172-22-37-78.eu-west-1.compute.internal       172.22.37.78:8301   alive   client  1.6.4  2         sandbox  <default>
ip-172-22-37-78.eu-west-1.compute.test.internal  172.22.37.78:8301   left    client  1.6.4  2         sandbox  <default>
ip-172-22-37-78.eu-west-1.test.internal          172.22.37.78:8301   left    client  1.6.4  2         sandbox  <default>

On Consul 1.7.x the status for those clients is _also_ "left", BUT as shown in the first log output, this blocks other clients from joining the cluster.
I think "left" clients should not cause duplicate IDs, and should definitely not block other clients from joining the cluster.

Reproduction Steps

Steps to reproduce this issue:

  1. Create a cluster with 2 client nodes and 3 server nodes, all nodes on 1.7.x.
  2. Change `node_name` in the Consul `config.json` file on one of the client nodes.
  3. Restart Consul and rejoin with the new node name (you will probably already see it fail with a duplicate-ID error).
  4. The node should have left properly under its old node name.
  5. Restart Consul on the 2nd client, or try to join the cluster with a new client node.
  6. See it fail because the first member has conflicting node IDs.
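For illustration, the steps above can be sketched as a shell session on the renamed client. This is a sketch under assumptions, not a verified script: the config path `/etc/consul.d/config.json`, the systemd unit name `consul`, and the node names are all placeholders, and the server address is elided.

```shell
# 2. Stop the agent and change node_name in the config (old/new names are placeholders).
systemctl stop consul
sed -i 's/"node_name": "client-old"/"node_name": "client-new"/' /etc/consul.d/config.json

# 3. Restart and rejoin under the new name.
systemctl start consul
consul join <server-address>

# 5./6. On the other client (or a fresh node), restarting/joining now fails with
#       "Member '...' has conflicting node ID '...' with member '...'".
systemctl restart consul
```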

Consul info for both Client and Server


Client info

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 4
    services = 4
build:
    prerelease = 
    revision = 95fb95bf
    version = 1.7.0
consul:
    acl = disabled
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 53
    max_procs = 2
    os = linux
    version = go1.12.16
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 8
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 4
    member_time = 552
    members = 9
    query_queue = 0
    query_time = 1


Server info

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 4
    services = 4
build:
    prerelease = 
    revision = 95fb95bf
    version = 1.7.0
consul:
    acl = disabled
    bootstrap = false
    known_datacenters = 1
    leader = true
    leader_addr = 172.22.38.162:8300
    server = true
raft:
    applied_index = 4193
    commit_index = 4193
    fsm_pending = 0
    last_contact = 0
    last_log_index = 4193
    last_log_term = 11
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:889c7894-a360-9e48-6be4-304ac6cba83c Address:172.22.38.162:8300} {Suffrage:Voter ID:7f05c5b7-c8e9-65fd-0139-595c0a5fc94c Address:172.22.36.26:8300} {Suffrage:Voter ID:700913b8-dd52-54ae-a1ef-ba43d7346c71 Address:172.22.37.68:8300}]
    latest_configuration_index = 0
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 11
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 88
    max_procs = 2
    os = linux
    version = go1.12.16
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 9
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 553
    members = 5
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 24
    members = 3
    query_queue = 0
    query_time = 1

Operating system and Environment details

Distributor ID: Debian
Description: Debian GNU/Linux 9.11 (stretch)
Release: 9.11
Codename: stretch

nodes in AWS

Labels: needs-investigation, theme/internals, type/bug, type/umbrella ☂️

All 18 comments

Hey @TomRitserveldt ,

Thank you so much for bringing this up to us. It sounds like the problem you are seeing is that once a node enters a "left" state, no other nodes can join the cluster (regardless of ID). Is that correct?

Based on the reproduction steps you've given, it looks like once the node is in the left state you are trying to add a new node with the same ID. This will always fail because of the TombstoneTimeout in the Serf library: a node has to wait 24 hours after entering the left state before it is reaped. Once the node is reaped, all of its data is gone from the cluster and its ID, IP, etc. can be reassigned. If you'd like to bypass this timeout, I would recommend looking into the `consul force-leave` command.
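For reference, a force-leave against one of the stale "left" members from the original report would look roughly like this. The node name is taken from the `consul members` output above; the `-prune` flag (used later in this thread) additionally removes the member from the list entirely:

```shell
# Ask the cluster to move the stale member to "left" immediately,
# rather than waiting out failure detection.
consul force-leave ip-172-22-37-78.eu-west-1.test.internal

# -prune also removes the member from the members list right away.
# (Note: several commenters report this was still not enough on 1.7.x.)
consul force-leave -prune ip-172-22-37-78.eu-west-1.test.internal
```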

Please let me know if the issue is the first item I mentioned and we can continue digging in.

Same issue. force-leave has no effect, and the conflicting nodes no longer show up in `consul members` on any of the other nodes.

Is there a way to set the tombstone timeout to something like 1 second to prevent this?

Restarting the leader manually helps, by the way, even though the error says the conflicts are on some other node.

@s-christoff yes, any other unrelated nodes are unable to join the cluster because of these conflicting IDs, regardless of their own ID.

As @aep said as well, the force-leave command does not fix this. The only fix for now is to restart a Consul server/leader every time this error occurs.

EDIT: We also do not have this issue at all when running Consul servers of any version below 1.7.x, so we feel some behaviour regarding left nodes must have changed there, even though we see nothing in the changelog that would indicate it.

Same here. The conflict error occurs when I change my server's hostname (which is used as the node name). force-leave has no effect, and `consul force-leave -prune <node>` can remove the node from the members list; however, the conflict error seems to continue.

I'm experiencing exactly the same issue: even after the TombstoneTimeout has passed and the left/failed member(s) are no longer in the node list, they still cause "has conflicting node ID" errors, and new agents fail to join the cluster. Like others mentioned, the error only goes away after restarting the Consul server leader.

I'm running consul 1.7.1

We also have this exact same issue, except with Consul 1.7.2: renaming a node and restarting it now causes all agents to be unable to rejoin, and restarting the leader does not solve the issue either.

Thank you for reporting! This is something we will look into for the next release!

Spinning up a new server node and having it join the cluster also fails:

# consul join srv-002.xxx.consul.yyy
Error joining address 'srv-002.xxx.consul.yyy': Unexpected response code: 500 (1 error occurred:
    * Failed to join 95.xxx.xxx.100: Member 'srv-002.zzz.consul.yyy' has conflicting node ID '518ab4e0-d07a-509d-21bc-cdd94ff6c212' with member 'srv-002-zzz'

So far I have tried doing a `consul force-leave -prune` for both conflicting names (e.g. srv-002-zzz and srv-002.zzz.consul.yyy) repeatedly, until the Consul leader claims there is no such member. Even after waiting another 10 minutes, attempting to join an agent (or server) to the cluster results in the same error.

Hoping the next release can happen soon...

Restarting the Consul server was needed in my case before any node could join or even rejoin. If I stopped the Consul service on any client that was already part of the cluster, it was unable to rejoin afterwards, getting the same error message about conflicting node IDs.

Removing everything in the data dir and starting Consul with `-disable-host-node-id` or `-node-id=..` on the new client didn't have any effect. As soon as the server was restarted, they joined the cluster.

Build: 1.7.2

@KLIM8D did you restart ALL your Consul servers, or do you only have one?

We are experiencing the same issue, and no nodes can join the cluster because nodes that aren't even in the cluster have conflicting IDs.

So, for example: staging-docker-swarm-001 conflicts with docker-swarm-001, even though neither is in the cluster, neither appears under `consul members`, and both have been removed with `consul force-leave -prune staging-docker-swarm-001` / `consul force-leave -prune docker-swarm-001` multiple times (Consul clearly states `No node found with name ...`). Because of that, staging-worker-abcde can't join the cluster, and everything is broken.

Servers can't rejoin the cluster after restarting Consul, so we will have to rebuild Vault from scratch, as its data is in Consul and the cluster is borked.

@hashicorp-support @i0rek can you please put some urgency on this

Yep, this bug leads to complete data loss with consul

------ Warning, do not upgrade to consul 1.7.x ------

This is a disaster, how did this make it through QA?

@quinndiggityaxiom Yes, I have only one Consul server. I'm not sure whether all servers have to be restarted, or just one, if you have more than one Consul server.

Thank you to everyone who has been so patient in reporting this. We're actively working on this and are looking to get a fix out soon. We're tracking this for 1.7, and have seen it in 1.5 too.

Seems like #7445 and #7692 are also related.

We'll be using this issue to track all conflicting NodeID issues.

Any workaround on older versions? I'm facing the same error in 0.8.4 with Raft v2. I'm not able to join a new node by any of:

  1. data dir cleanup
  2. leave
  3. force-leave
  4. copying data from a running node, then starting
  5. disabling the node ID

Hi guys, any progress on this issue? We are facing the same on 1.7.2.

@princepathria97 assuming the node whose node-id you want to reuse is no longer alive, you should be able to boot the new node after 72h; this is how long memberlist will hold onto the node-id. If you are still experiencing problems, please open a separate issue with a reference to this one.

@rustamgk yes, good news. We merged a fix and released it in Consul v1.7.3: https://github.com/hashicorp/consul/pull/7747.

Is there any guidance on whether our clients should upgrade to 1.7.3 (but servers remain on 1.7.2)? Or are we required to upgrade our servers to effectively resolve this issue?

We're unable to upgrade our servers to 1.7.3 at the moment, so we're wondering if our clients could run 1.7.3 to get around this issue until we can get to upgrading the servers.
