Consul: Simple solution for consul on docker swarm 1.12+

Created on 18 Sep 2017 · 12 comments · Source: hashicorp/consul

The issue with running consul on docker swarm with persistent data is that, if deployed using docker stack, the IPs will regularly change as individual containers are recreated. Rather than trying to fix raft, whose problems are plentiful and frequently documented, here is a simple solution: separate the real data (k/v, policies, ACLs) that people actually need and want to keep from the raft state (protocol, servers, etc.). Docker can persist the real data onto storage, and allow the other data, which will just be recreated on startup anyway, to be blown away.

At that point, recovery in the event of catastrophic failure becomes easy. Shut down the service, letting the raft server data get deleted. Restart the service; the servers find each other, elect a leader, and bring up a new cluster, and the real data, the data people actually care about, was persisted and reloaded into consul. If it weren't for the db storing outdated information, like old IP addresses, consul on swarm would be a piece of cake: scalable, simple, and quick to recover, as in seconds. Less time than it takes to even bring up peers.json in an editor or open a browser to look up the current recovery method.

The idea of hard-coding server connections into a file and forcing that on an application at startup is out of step with modern systems. Alternatively, even just a simple command-line parameter, which could be passed or put in the config file, that says: ignore anything you ever thought you knew about yourself or other servers, and start a fresh cluster. The garbage data would still be there, but at least it would be ignored and not cause innumerable issues.

All 12 comments

Hi @jfgibbins. Raft manages the set of servers in the quorum as part of the Raft data, so the two are deeply tied together. It would be difficult to keep the data but not the server info. You could probably cook up something with the Consul snapshot API if you wanted to always use completely fresh servers.
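For what it's worth, a minimal sketch of that snapshot route, assuming an agent reachable on localhost and a cluster that currently has a leader:

# Save the application-level state (KV, ACLs, etc.) to a snapshot file:
consul snapshot save backup.snap

# ...tear the old cluster down and bring up completely fresh servers...

# Once the new cluster has elected a leader, load the data back in:
consul snapshot restore backup.snap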

What we did for Consul 0.9.3, though, is use the Serf information and the fact that we have UUIDs for each node to re-map the IPs on the fly. If you are running Consul 0.9.3 and have configured -raft-protocol to 3 for all your servers, then even if the whole cluster restarts with new IPs, Raft can continue to work and will automatically fix up the Raft IPs for the quorum. Consul 1.0 will default the Raft protocol to 3 so it won't take any special configuration to get this.
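For concreteness, pinning the protocol looks roughly like this; the flag can go on the command line, or the equivalent key can go in a server's JSON config file (paths here are just examples):

# Command-line flag on each server:
consul agent -server -raft-protocol=3 -data-dir=/consul/data ...

# Or the equivalent entry in a config file such as /consul/config/server.json:
{
  "raft_protocol": 3
}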

You can find out more under https://github.com/hashicorp/consul/issues/1580. Hopefully this is as simple as possible for operators as there should be nothing to do :-)

@slackpad What you said would seem to work in theory, but in reality it doesn't. When consul is restarted with new IPs, it fails miserably. Log files continually report attempts to reach non-existent servers, and in fact the server doesn't even have enough sense to know that its own IP has changed, as it continues trying to connect to its old addresses as if it were a different server, regardless of whether you're manually setting the raft protocol to 3 or trusting that 0.9.3 actually does, despite the reporting always saying 2. I also don't understand how "real data", i.e. stored key/values, is in any way related to, or dependent upon, server parameters such as server addresses, ports, and protocols, or why the two couldn't be saved to two separate files. For example, you don't exactly store your Word and Excel data in your Windows registry. Without all the prior server info garbage, a new cluster comes up quickly and cleanly, and if the "real data" were kept separate, it could just go straight to work with persistent data intact.

What you said would seem to work in theory, but in reality it doesn't. When consul is restarted with new IPs, it fails miserably.

We've worked super hard to address this, so if you are still seeing issues with 0.9.3 and raft protocol 3, please post some log info and we will take a look. In our internal tests with Docker, and in Swarm-based tests that folks in the community ran to vet #1580, the servers were able to recover on their own, even when the whole cluster was restarted at once with all new IPs.

regardless of whether you're manually setting the raft protocol to 3 or trusting that 0.9.3 actually does, despite the reporting always saying 2

That sounds like you are looking at the protocol version in consul members - that's showing the internal RPC protocol (currently 2), not the Raft one. This command will show the Raft protocol version in use on each server - https://www.consul.io/docs/commands/operator/raft.html#list-peers.
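In other words, a quick sketch of the two commands:

# The Protocol column here is the internal RPC protocol (currently 2):
consul members

# This one shows the Raft protocol version in use on each server:
consul operator raft list-peers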

I also don't understand how "real data", i.e. stored key/values, is in any way related to, or dependent upon, server parameters such as server addresses, ports, and protocols, or why the two couldn't be saved to two separate files.

A consensus-based system like Raft treats the list of servers in the quorum as a first-class concept and stores modifications to it alongside the "real data" in the Raft log. Changes to the quorum go through the same change process as writing a KV entry, so it's safe and consistent. That's also why, when it loses quorum, it requires a human to sort things out via peers.json: it no longer has a majority of servers available to make changes. The peers.json recovery essentially forces a record onto the end of the log on each server. There's no safe way, from a consistency perspective, to have multiple servers automatically strip off their quorum configuration and use something new without some form of coordination.
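For reference, a peers.json in the Raft protocol 3 format looks roughly like this (IDs and addresses here are just examples); it is placed in the raft/ directory under each server's data dir before restarting:

[
  { "id": "74728941-6707-70fa-d813-3ed1001e6be6", "address": "10.0.0.5:8300", "non_voter": false },
  { "id": "773fe7dd-740c-e85d-b295-8eea062368f4", "address": "10.0.0.6:8300", "non_voter": false }
]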

@slackpad Well, do you have a working .yml file for docker stack deploy? It's docker, so it should be able to handle a docker stack deploy ..., a docker stack rm ..., and a deploy again, and come up successfully, consistently. I can take a look and make sure it's not something in mine, but as it stands, it fails. Manually changing a peers.json file is not a legitimate option for production-quality software; it is a poor hack to compensate for a flawed architecture. Nor would it work in a docker stack, where the IPs wouldn't be known until after a start is attempted. Here are some logs:

Node Address Status Type Build Protocol DC Segment
95cc375e28c4 10.0.0.5:8301 alive server 0.9.3 2 dc1
consul-seed 10.0.0.3:8301 alive server 0.9.3 2 dc1
d22eefacfd92 10.0.0.6:8301 alive server 0.9.3 2 dc1

Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (rpc error: No cluster leader)
It will report 3 on the occasions it does actually come up, which is usually only when it starts without all the old raft/server data.

2017/09/29 16:17:27 [INFO] raft: Added peer 10.0.0.6:8300, starting replication
2017/09/29 16:17:27 [INFO] raft: Added peer 10.0.0.5:8300, starting replication
2017/09/29 16:17:27 [INFO] consul: cluster leadership acquired
2017/09/29 16:17:27 [INFO] raft: pipelining replication to peer {Voter 10.0.0.5:8300 10.0.0.5:8300}
2017/09/29 16:17:27 [INFO] raft: pipelining replication to peer {Voter 10.0.0.6:8300 10.0.0.6:8300}
2017/09/29 16:17:27 [INFO] raft: Node at 10.0.0.3:8300 [Follower] entering Follower state (Leader: "")
2017/09/29 16:17:27 [INFO] raft: aborting pipeline replication to peer {Voter 10.0.0.6:8300 10.0.0.6:8300}
2017/09/29 16:17:27 [ERR] consul: failed to wait for barrier: leadership lost while committing log
2017/09/29 16:17:27 [INFO] consul: cluster leadership lost

2017/09/29 16:17:27 [WARN] Unable to get address for server id 10.0.0.7:8300, using fallback address 10.0.0.7:8300: Could not find address for server id 10.0.0.7:8300
2017/09/29 16:17:27 [WARN] Unable to get address for server id 10.0.0.8:8300, using fallback address 10.0.0.8:8300: Could not find address for server id 10.0.0.8:8300
Seriously? It can't find the address for an IP number, so it falls back to the IP number, and then can't find that either?

2017/09/29 16:17:27 [WARN] Unable to get address for server id 10.0.0.7:8300, using fallback address 10.0.0.7:8300: Could not find address for server id 10.0.0.7:8300

^ That's the problem - the server has been added in the old way (pre-Raft protocol version 3), so it is using the IP as its ID, which means our new fix can't map it. The fix in 0.9.3 matches the UUID for the server to the latest Serf info to update the IP address. It looks like you haven't updated those servers to use -raft-protocol=3. This link has some info on how to do a rolling update - https://www.consul.io/docs/upgrade-specific.html#raft-protocol-version-compatibility. If you are deploying a fresh cluster you should just be able to configure the right protocol from the start.

Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (rpc error: No cluster leader) It will report 3 on the occasions it does actually come up, which is usually only when it starts without all the old raft/server data.

If you do consul operator raft list-peers -stale it should fetch that when there's no leader available.

Well, unfortunately, it was brought up initially with raft 3.
consul operator raft list-peers -stale
Node ID Address State Voter RaftProtocol
consul-seed 10.0.0.3:8300 10.0.0.3:8300 follower true 3
d22eefacfd92 10.0.0.6:8300 10.0.0.6:8300 follower true 3
95cc375e28c4 10.0.0.5:8300 10.0.0.5:8300 follower true 3

Well, unfortunately, it was brought up initially with raft 3.

It doesn't make sense how it didn't map those IDs correctly - I don't think I know enough about how this cluster was configured and/or upgraded to help there. Can you start fresh with this cluster, blowing away the data directories on all the servers, and starting them all with -raft-protocol=3? The IDs in that output should be the UUIDs, not the IPs, and then I think things will start working correctly.
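Something like the following on each server, with the data-dir path and join target as assumptions:

# Stop the agent, wipe all stale Raft/Serf state, then restart with the
# Raft protocol pinned from the very first boot:
rm -rf /consul/data/*
consul agent -server -bootstrap-expect=3 -raft-protocol=3 \
    -data-dir=/consul/data -retry-join=consul-seed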

It doesn't make sense how it didn't map those IDs correctly - I don't think I know enough about how this cluster was configured and/or upgraded to help there.

One theory is that if there ever was a mix of servers in there that didn't support the new protocol, Consul will hold off on using the new ID scheme until all servers can support it.

Whether it's UUIDs, IPs, etc., consul does not seem to have the ability to start cleanly, so long as the software insists on hanging on to old/outdated information. None of that information is going to be persistent in a dynamic environment. Which brings us back to the first entry: data and server info are not related, and one should not cripple the other. I do understand about consensus of servers, but that is a completely different function than persistence of data. Data is kept up by using transaction logs, not by which raft it serfed in on. Think about how full-blown databases perform: you can move SQL data around from server to server all day long, and nothing in the data has to be changed or informed that a different server is handling it. And when a server is handling data, it doesn't care where the data originated.

2017/09/29 18:52:30 [WARN] Unable to get address for server id 74728941-6707-70fa-d813-3ed1001e6be6, using fallback address 10.0.0.5:8300: Could not find address for server id 74728941-6707-70fa-d813-3ed1001e6be6
2017/09/29 18:52:30 [WARN] Unable to get address for server id 773fe7dd-740c-e85d-b295-8eea062368f4, using fallback address 10.0.0.6:8300: Could not find address for server id 773fe7dd-740c-e85d-b295-8eea062368f4
2017/09/29 18:52:30 [INFO] consul: cluster leadership acquired
2017/09/29 18:52:30 [INFO] consul: New leader elected: consul-seed
2017/09/29 18:52:30 [INFO] raft: Node at 10.0.0.6:8300 [Follower] entering Follower state (Leader: "")
2017/09/29 18:52:30 [ERR] consul: failed to wait for barrier: leadership lost while committing log
2017/09/29 18:52:30 [INFO] consul: cluster leadership lost
2017/09/29 18:52:40 [WARN] raft: Heartbeat timeout from "10.0.0.6:8300" reached, starting election
2017/09/29 18:52:40 [INFO] raft: Node at 10.0.0.6:8300 [Candidate] entering Candidate state in term 11
2017/09/29 18:52:40 [WARN] Unable to get address for server id 773fe7dd-740c-e85d-b295-8eea062368f4, using fallback address 10.0.0.6:8300: Could not find address for server id 773fe7dd-740c-e85d-b295-8eea062368f4
2017/09/29 18:52:40 [WARN] Unable to get address for server id 74728941-6707-70fa-d813-3ed1001e6be6, using fallback address 10.0.0.5:8300: Could not find address for server id 74728941-6707-70fa-d813-3ed1001e6be6
2017/09/29 18:52:40 [INFO] raft: Duplicate RequestVote for same term: 11
2017/09/29 18:52:40 [WARN] raft: Duplicate RequestVote from candidate: 10.0.0.6:8300
2017/09/29 18:52:40 [INFO] raft: Duplicate RequestVote for same term: 11
2017/09/29 18:52:40 [WARN] raft: Duplicate RequestVote from candidate: 10.0.0.6:8300
2017/09/29 18:52:40 [INFO] raft: Election won. Tally: 2
2017/09/29 18:52:40 [INFO] raft: Node at 10.0.0.6:8300 [Leader] entering Leader state
2017/09/29 18:52:40 [INFO] raft: Added peer 74728941-6707-70fa-d813-3ed1001e6be6, starting replication
2017/09/29 18:52:40 [INFO] raft: Added peer 773fe7dd-740c-e85d-b295-8eea062368f4, starting replication
2017/09/29 18:52:40 [INFO] consul: cluster leadership acquired
2017/09/29 18:52:40 [WARN] Unable to get address for server id 74728941-6707-70fa-d813-3ed1001e6be6, using fallback address 10.0.0.5:8300: Could not find address for server id 74728941-6707-70fa-d813-3ed1001e6be6
2017/09/29 18:52:40 [INFO] consul: New leader elected: consul-seed
2017/09/29 18:52:40 [WARN] Unable to get address for server id 773fe7dd-740c-e85d-b295-8eea062368f4, using fallback address 10.0.0.6:8300: Could not find address for server id 773fe7dd-740c-e85d-b295-8eea062368f4
2017/09/29 18:52:40 [INFO] raft: Node at 10.0.0.6:8300 [Follower] entering Follower state (Leader: "")
2017/09/29 18:52:40 [ERR] consul: failed to wait for barrier: node is not the leader
2017/09/29 18:52:40 [INFO] consul: cluster leadership lost

As for configuration: one container is brought up with access to the persistent data. Then two more containers are started to provide load balancing and redundancy, and are connected to the initial container. Pretty basic and simple. The only thing that fails is election of a leader, because consul is constantly trying to use stale, static information in a dynamic infrastructure.
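For illustration, the shape of that setup as docker CLI commands (service names, network, and image tag are just placeholders):

# Seed server with a persistent volume for the real data:
docker service create --name consul-seed --network consul-net \
    --mount type=volume,source=consul-data,target=/consul/data \
    consul:0.9.3 agent -server -bootstrap-expect=3 -raft-protocol=3

# Two more ephemeral servers that join the seed:
docker service create --name consul-server --network consul-net --replicas 2 \
    consul:0.9.3 agent -server -raft-protocol=3 -retry-join=consul-seed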

As for configuration: one container is brought up with access to the persistent data. Then two more containers are started to provide load balancing and redundancy, and are connected to the initial container. Pretty basic and simple. The only thing that fails is election of a leader, because consul is constantly trying to use stale, static information in a dynamic infrastructure.

Consul isn't designed to be configured like that. If you have persistent state for all three, it will recover even if the IP addresses are dynamic. If you want to have two fresh servers, you could work around that by giving the two fresh ones the same node IDs as the old ones (basically, configure static node IDs for each server). Hope that helps!
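A sketch of that static node ID workaround; the UUIDs here are the ones from the logs above, and the flag placement and paths are just examples:

# Give each replacement server a fixed identity so Raft recognizes it as the
# same member even when its IP changes:
consul agent -server -raft-protocol=3 \
    -node-id=74728941-6707-70fa-d813-3ed1001e6be6 \
    -data-dir=/consul/data -retry-join=consul-seed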

That was kind of my point from the start: consul is not ready for, or capable of running on, docker swarm. And it's because consul insists on going old school with the idea of using static information. If the data weren't in the same file as the server info, then the data could be persisted, and the useless raft/serf info could and would just be created fresh whenever the docker service is recreated by docker stack. Consul starts great the first time, and that's because the information isn't there yet. Yes, setting node-id, or creating persistence and forcing a consul server to stick to a docker host, would work, but if you're going to do that, you might as well go back to old school and use static servers and IPs, and throw out the whole "cloud" concept; i.e., three consul servers means racking up 3 pizza boxes in your datacenter. I was hoping that the initial entry would trigger making consul a cloud-capable app, extending its usefulness and longevity going forward, not showing how to shove a square app into a cloud hole. I appreciate you trying to help, but I guess, as you said, consul isn't configured to work properly in that type of environment.
