If a server node changes its IP address, the gossip layer will handle the update properly, but the Raft peer set will not be updated. This causes replication errors and potentially an outage.
It can be triggered by restarting a Docker container running the Consul servers without doing a graceful leave.
The reproduction recipe is here: https://groups.google.com/d/msg/consul-tool/pWj3rHdgdqY/PMXCywgXo28J.
That uses the progrium/consul Docker container, which does not currently set leave_on_terminate=true by default. If the author accepts and fixes https://github.com/progrium/docker-consul/issues/34, that will no longer be the case.
We use short-lived Windows instances; most last no more than a day. We refer to our instances by a logical name (e.g., web_001 through web_33). As instances come and go, we re-use the logical names to fill gaps before adding more at the top end.
This means nodes will come and go with different IPs but the same node names, so it sounds like we'll be affected by this issue. As a workaround, should we inject some uniqueness into the node name we pick for the Consul agent (such as the AWS instance ID)?
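For what it's worth, the instance-ID workaround being asked about could be sketched like this. The helper function and naming scheme are hypothetical; `-node` is Consul's real agent flag and the metadata URL is the standard EC2 endpoint:

```shell
#!/bin/sh
# Sketch: derive a unique Consul node name by appending the EC2 instance
# ID to the logical name, so a replacement instance never reuses a name.

unique_node_name() {
  logical_name="$1"  # e.g. web_001
  instance_id="$2"   # e.g. i-0abc1234 (normally fetched from EC2 metadata)
  echo "${logical_name}-${instance_id}"
}

# On a real instance, fetch the ID and pass it to the agent:
#   instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
#   consul agent -node "$(unique_node_name web_001 "$instance_id")" -data-dir /var/lib/consul
```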
However, we'd prefer not to have to. We use logical names in the first place for two reasons:
@petemounce This will not affect your case. This is only when the server nodes themselves change IPs but not their node name. The clients can change IPs all day :)
What is the proper way to change the IP (or the advertise_addr config option) on a server? Is there one?
It cannot be done currently. You need to remove the server gracefully first and then re-add it. Consul can't handle the address-change case.
So, “consul leave” on the host, change the IP or advertise_addr, then restart? That seems to confuse the agents in the cluster, which continue to show the old IP and a state of “left”.
Assuming the node name is the same, they shouldn't be confused. The IP address should update on the clients. If the node name changes, they will be confused since it looks like a different node. But effectively yes, the node is leaving and then re-joining with new configuration.
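For reference, the relevant piece of the server's config for that procedure might look like this. This is a minimal fragment and the address is a placeholder, but advertise_addr, node_name, and leave_on_terminate are Consul's real configuration options:

```
{
  "node_name": "consul-000",
  "advertise_addr": "11.222.33.444",
  "leave_on_terminate": true
}
```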
That doesn't seem to be the case, unfortunately. I modified the config for node consul-000.us-east-1.aws.test.example.com to use a specific advertise_addr; leaving and re-joining with the new config is resulting in lots of
2015/02/24 00:05:10 [WARN] memberlist: Refuting a suspect message (from: consul-000.us-east-1.aws.test.example.com)
on the node I just modified, and messages like this on the other servers and agents in the cluster even minutes after reconfiguring:
2015/02/24 00:06:48 [INFO] serf: EventMemberJoin: consul-000.us-east-1.aws.test.example.com 11.222.33.444
2015/02/24 00:06:48 [INFO] consul: adding server consul-000.us-east-1.aws.test.example.com (Addr: 11.222.33.444:8300) (DC: us-east-1_aws_test)
2015/02/24 00:07:04 [INFO] serf: EventMemberFailed: consul-000.us-east-1.aws.test.example.com 11.222.33.444
2015/02/24 00:07:04 [INFO] consul: removing server consul-000.us-east-1.aws.test.example.com (Addr: 11.222.33.444:8300) (DC: us-east-1_aws_test)
@blalor Can you provide the DEBUG level logs from the machine and maybe one other machine? This looks slightly different than the issue of this ticket. The ticket is that the Raft peers cannot handle an IP update of a server, while this looks like a different issue (Join/Fail) not converging.
https://gist.github.com/blalor/60539004449c35fc079a
consul_debug.000 is for server node consul-000 which had its IP address changed from 10.130.0.248 to 11.222.33.444. consul_debug.001 is for server node consul-001 whose configuration was unchanged save for enabling debug logging.
@blalor It looks like consul-001 is unable to ping (directly or indirectly) consul-000:
[INFO] memberlist: Suspect consul-000.us-east-1.aws.test.example.com has failed, no acks received
This could mean there is some network issue preventing UDP packets between them, which is causing the flapping. Could you investigate possible network issues?
Not anymore; I’ve rebuilt that cluster. :-)
Assuming all servers have leave_on_terminate set, what are the clients supposed to do when the complete cluster is gone? Should they try to reconnect via the DNS name?
Then I'd have a workaround for this, at least.
I'm determined to introduce ip change support in Consul.
I've hacked the code to allow that and it seems to work. I'd like to agree with you on the design of the final solution so that, possibly, my pull request could be integrated with mainline Consul.
@armon Please let me know your comments and concerns.
The requirements:
OK, so here's the idea:
Please note that no reverse resolution (IP->node address) is required.
Correctness:
Obviously, there is a question whether such an approach is correct.
Assumptions:
Observations:
This is good enough for me, because that covers real-world scenarios I need to handle.
However, I believe that even in the case of rapid IP address changes the approach stays correct. The new case to consider is when messages reach a different destination than intended because serf data is not up to date. Still, because it is the message content that matters, not the sender, all invalid requests will be dropped (even now there must already be support for handling stray or delayed messages). There is a risk that some valid requests are dropped, but this affects only efficiency, not correctness.
Obviously, this is hardly a _proof_ of correctness. I do not intend to perform formal verification, though. Is it good enough for you?
I've looked over the web API and I think this change doesn't affect it. I hope I haven't broken anything.
FWIW, my workaround worked okay-ish, until a node hard-crashes and you need to replace it.
@jakubzytka's design sounds reasonable to me, but I'm wondering what happens if you end up with two nodes using the same node name.
Two nodes cannot have the same node name; that's a serf requirement.
Right now an error is logged from serf (and a cluster doesn't form, I guess) should such a thing happen:
2015/12/02 12:08:06 [ERR] memberlist: Conflicting address for blahblah. Mine: 192.168.9.3:8301 Theirs: 192.168.9.1:8301
2015/12/02 12:08:06 [ERR] serf: Node name conflicts with another node at 192.168.9.1:8301. Names must be unique! (Resolution enabled: false)
This would be far less of an issue in my implementation if I had a mechanism to kick out dead raft peers that serf thought were running again.
In my environment, changing running servers' IPs isn't the issue; it's that if a server node fails, there's a decent chance someone will not follow procedure and will re-launch it with the same name but a different IP address, without force-leaving the failed node first. Serf will think everything is OK and all nodes will show as alive, but there's an orphaned raft peer lying around.
Detecting the orphaned raft node is easy enough with a monitoring system, by comparing the number of raft peers with the number of Consul servers. When that alert triggers, it would normally be a simple matter of issuing a force-leave command for the failed node; however, the force-leave command currently requires the node being evicted to still exist in serf. If someone doesn't follow procedure and re-launches a failed node with the old name (and it gets assigned a different IP address by EC2), then the only option is to bring the entire cluster down to update the peers.json file.
If the force-leave command could be extended (or a new command added) to be able to kick out an orphaned raft node without having to shut everything down, this would become much less of an issue, for me at least.
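The monitoring check described above can be sketched roughly as follows. `/v1/status/peers` is Consul's real status endpoint and `consul members` is the real CLI command, but the parsing helpers and the comparison are illustrative; they read the raw output on stdin so they can be tested offline:

```shell
#!/bin/sh
# Count raft peers from the JSON array returned by
#   curl -s http://127.0.0.1:8500/v1/status/peers
# e.g. ["10.0.0.1:8300","10.0.0.2:8300"]
count_raft_peers() {
  grep -o ':8300' | wc -l
}

# Count live servers from the output of `consul members`
# (columns: Node, Address, Status, Type, ...).
count_alive_servers() {
  awk '$3 == "alive" && $4 == "server"' | wc -l
}

# Alert when there are more raft peers than live servers; once the orphan
# is identified, clean it up with `consul force-leave <node>`.
```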
@deltaroe You can work around your issue by scripting the startup instead of relying on a manual procedure. Check and persist the IP when starting Consul, and on every restart re-check that IP. If it changed, remove the old data and start the node with a new name. Or, alternatively, use node names that contain the IP. You'll never have the same node name for different IPs, and you will be able to remove stray peers with force-leave.
The problem (for me) is that both these approaches require quorum of nodes to be alive, and my solution works when there is no quorum.
See related discussion - https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/consul-tool/RqRZL-cnjFg/gcnd9i3IHQAJ
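The scripted-startup workaround suggested above might look something like this. The state file, data dir, and naming scheme are all hypothetical placeholders:

```shell
#!/bin/sh
# Sketch: persist the IP the node last started with; if it changed, wipe
# the data dir and start under a fresh, IP-derived node name so the stale
# raft peer can always be removed with force-leave.

STATE_FILE="${STATE_FILE:-/var/lib/consul-last-ip}"

# Embed the IP in the node name so a node that comes back with a new IP
# never collides with its old identity.
node_name_for() {
  echo "web-$(echo "$1" | tr '.' '-')"
}

# True (exit 0) when a previous IP was recorded and it differs from the
# current one, meaning the old data dir should be wiped before starting.
ip_changed() {
  [ -f "$STATE_FILE" ] && [ "$(cat "$STATE_FILE")" != "$1" ]
}

# Typical use before launching the agent:
#   if ip_changed "$MY_IP"; then rm -rf /var/lib/consul/*; fi
#   echo "$MY_IP" > "$STATE_FILE"
#   consul agent -node "$(node_name_for "$MY_IP")" -data-dir /var/lib/consul
```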
I don't suppose someone from Hashicorp could give @jakubzytka some feedback on the proposed design? This problem has just bitten us _badly_, and although I'm implementing workarounds, it'd be nice if this problem was solved in consul.
@mpalmer sorry you got bit by this.
We are currently working on some improvements for Raft's management of config changes but we want to be super careful we do this in the best way. We are currently leaning towards adding a cluster-wide GUID that comes from the memberlist layer and is used to track identity regardless of IP and node name, so we are working through the implications of that.
@mpalmer If you badly need some solution you can try my patched consul. It handles changing IP address of a node as long as node name stays the same.
The code is available at https://github.com/jakubzytka/consul/tree/ipChangeSupport
The branch is based on Consul v0.6 if I remember correctly, but I guess it should apply cleanly to the newest version.
We've been using it in "staging" for a few months without issues.
Hi Guys
I am running a containerized single-node Consul cluster with volumes attached. When I bring this cluster up for the first time, the leader is elected successfully, and I see the following entry added in peers.json:
["172.17.4.162:8300"] ==> this is correct as 172.17.4.162 is IP of my container.
Now, I remove this container and make sure that it exits gracefully, as I have set "leave_on_terminate": true in my configuration file. After exit, peers.json returns null. Up to this point, everything seems good. Now, when I restart the single-node cluster, the IP assigned to the new container is ["172.17.4.170:8300"]; this is added to peers.json successfully and is the only value existing there.
In spite of this, the Consul deployment fails. The new node somehow tries to connect to the previous IP, "172.17.4.162", that has already been deleted from peers.json. Here are the logs:
2016/07/06 20:10:50 [INFO] serf: EventMemberJoin: consul 172.17.4.170
2016/07/06 20:10:50 [INFO] serf: EventMemberJoin: consul.dc1 172.17.4.170
2016/07/06 20:10:50 [INFO] raft: Node at 172.17.4.170:8300 [Follower] entering Follower state
2016/07/06 20:10:50 [INFO] consul: adding LAN server consul (Addr: 172.17.4.170:8300) (DC: dc1)
2016/07/06 20:10:50 [INFO] consul: adding WAN server consul.dc1 (Addr: 172.17.4.170:8300) (DC: dc1)
2016/07/06 20:10:50 [ERR] agent: failed to sync remote state: No cluster leader
2016/07/06 20:10:52 [WARN] raft: Heartbeat timeout reached, starting election
2016/07/06 20:10:52 [INFO] raft: Node at 172.17.4.170:8300 [Candidate] entering Candidate state
2016/07/06 20:10:52 [INFO] raft: Election won. Tally: 1
2016/07/06 20:10:52 [INFO] raft: Node at 172.17.4.170:8300 [Leader] entering Leader state
2016/07/06 20:10:52 [INFO] consul: cluster leadership acquired
2016/07/06 20:10:52 [INFO] consul: New leader elected: consul
2016/07/06 20:10:52 [INFO] raft: Disabling EnableSingleNode (bootstrap)
2016/07/06 20:10:52 [INFO] raft: Added peer 172.17.4.162:8300, starting replication
2016/07/06 20:10:52 [INFO] raft: Removed peer 172.17.4.162:8300, stopping replication (Index: 18)
2016/07/06 20:10:52 [INFO] consul: member 'consul' joined, marking health alive
2016/07/06 20:10:53 [INFO] agent: Synced service 'consul'
2016/07/06 20:10:55 [ERR] raft: Failed to heartbeat to 172.17.4.162:8300: dial tcp 172.17.4.162:8300: getsockopt: no route to host
2016/07/06 20:10:55 [ERR] raft: Failed to AppendEntries to 172.17.4.162:8300: dial tcp 172.17.4.162:8300: getsockopt: no route to host
2016/07/06 20:10:58 [ERR] raft: Failed to heartbeat to 172.17.4.162:8300: dial tcp 172.17.4.162:8300: getsockopt: no route to host
Could anyone help me find the reason?
Any progress on this?
I use Consul in single-node mode. When the container restarts, the IP address changes and Consul cannot start because it remembers its previous IP address.
Is there a way to run Consul in single-node mode despite this?
My config (Docker compose):
```yaml
version: '2'
services:
  consul:
    image: consul:0.7.2
    ports:
      - "8500:8500"
      - "8600:8600/tcp"
      - "8600:8600/udp"
    # https://github.com/hashicorp/consul/issues/166#issuecomment-233711577
    command: agent -server -bootstrap -ui -client 0.0.0.0
```
Highly interested as well. When using Docker swarm mode, I get new IPs almost every time.
Closing this in favor of https://github.com/hashicorp/consul/issues/1580.