Consul doesn't replicate the WAN member list across servers in the same datacenter. For example:
[root@prod-consul-xv-01 ~]# consul members | grep server
prod-consul-xv-01 10.1.11.237:8301 alive server 0.6.0 2 xv-prod
prod-consul-xv-02 10.1.66.242:8301 alive server 0.6.0 2 xv-prod
prod-consul-xv-03 10.1.83.251:8301 alive server 0.6.0 2 xv-prod
prod-consul-xv-04 10.1.43.229:8301 alive server 0.6.0 2 xv-prod
prod-consul-xv-05 10.1.83.250:8301 alive server 0.6.0 2 xv-prod
[root@prod-consul-xv-01 ~]# consul members -wan
Node Address Status Type Build Protocol DC
prod-consul-ca-01.ca-prod 10.5.6.230:8302 alive server 0.6.0 2 ca-prod
prod-consul-lc-01.lc-prod 10.2.34.249:8302 alive server 0.6.0 2 lc-prod
prod-consul-xa-01.xa-prod 10.16.1.253:8302 alive server 0.6.0 2 xa-prod
prod-consul-xf-03.xf-prod 10.33.5.244:8302 alive server 0.6.0 2 xf-prod
prod-consul-xv-01.xv-prod 10.1.11.237:8302 alive server 0.6.0 2 xv-prod
And on the second node:
[root@prod-consul-xv-02 ~]# consul members | grep server
prod-consul-xv-01 10.1.11.237:8301 alive server 0.6.0 2 xv-prod
prod-consul-xv-02 10.1.66.242:8301 alive server 0.6.0 2 xv-prod
prod-consul-xv-03 10.1.83.251:8301 alive server 0.6.0 2 xv-prod
prod-consul-xv-04 10.1.43.229:8301 alive server 0.6.0 2 xv-prod
prod-consul-xv-05 10.1.83.250:8301 alive server 0.6.0 2 xv-prod
[root@prod-consul-xv-02 ~]# consul members -wan
Node Address Status Type Build Protocol DC
prod-consul-xv-02.xv-prod 10.1.66.242:8302 alive server 0.6.0 2 xv-prod
This may or may not be related to #1471, since depending on which node the client connects to, you might or might not see other datacenters.
You must join all servers into the WAN pool; it is not automatic. On prod-consul-xv-02, run consul join -wan 10.1.11.237 (where 10.1.11.237 is prod-consul-xv-01, which is already a member of the WAN pool). See: https://www.consul.io/docs/agent/options.html#_join_wan - specifically:
By default, the agent won't -join-wan any nodes when it starts up.
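Concretely, joining the remaining xv-prod servers could be sketched like this (a sketch that only prints the commands rather than running them; hostnames are taken from the listing above):

```shell
# Sketch: each remaining server in xv-prod must be WAN-joined individually.
# This only prints the commands to run; 10.1.11.237 is prod-consul-xv-01,
# which is already a member of the WAN pool.
for host in prod-consul-xv-02 prod-consul-xv-03 prod-consul-xv-04 prod-consul-xv-05; do
  echo "ssh $host consul join -wan 10.1.11.237"
done
```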
Could this be a feature request then?
With 5 nodes in each of 5 datacenters, that's already 25 nodes to modify, with a significant chance of making a mistake.
Consul's purpose (or rather that of Serf and Raft, on which it relies) is to make sure that data is replicated within its cluster, so it makes sense to replicate DC information as well.
I've had this issue after rebooting nodes one at a time within a cluster. When they come back up, they don't have the WAN config, and it doesn't appear to replicate from the other nodes.
I believe that this is the root of the problem in #1471; I'd like joining one member of a server cluster to a member of another cluster to be persistent and to flow across the two clusters.
@slackpad after taking a look at https://www.consul.io/docs/guides/datacenters.html, particularly this fragment:
The join command is used with the -wan flag to indicate we are attempting to join a server in the WAN gossip pool. As with LAN gossip, you only need to join a single existing member, and the gossip protocol will be used to exchange information about all known members. For the initial setup, however, each server will only know about itself and must be added to the cluster.
It appears to me that this is in fact a bug, and this issue should be relabeled accordingly.
Not sure if this is helpful to anyone; however, I worked around this issue by adding the nodes of the opposing datacentre into the Consul configuration:
"retry_join_wan":[
"192.168.15.232",
"192.168.15.208",
"192.168.15.31"
],
After a reboot, each node is then able to rejoin the WAN.
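For context, that fragment sits inside the agent's server configuration file. A minimal sketch of such a config (the datacenter name and data_dir are illustrative assumptions; the retry_join_wan addresses are the ones quoted above) might look like:

```json
{
  "server": true,
  "datacenter": "dc1",
  "data_dir": "/var/consul",
  "retry_join_wan": [
    "192.168.15.232",
    "192.168.15.208",
    "192.168.15.31"
  ]
}
```

With retry_join_wan, the agent keeps retrying the WAN join on startup, which is what makes the membership survive reboots.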
Does this appear to be a bug because each Consul server must join the WAN pool individually and retry-join each of the remote Consul datacenters, or because you are expecting the WAN pool information to flood over the LAN gossip pool?
@csghuser that is the correct way to configure Consul for servers connected to a WAN. It's not necessary to configure this as a full mesh; however, there is no harm in having symmetry between all Consul servers participating in the WAN pool. The important part is that all members of the WAN eventually converge to create a consistent pool.
@sean- I consider this a bug, because when you connect two DCs together using the join command, you expect it to apply to the entire cluster; this is why @wwalker had connectivity problems.
Take a look, for example, at Riak's MDC setup (http://docs.basho.com/riak/kv/2.1.4/configuring/v3-multi-datacenter/quick-start/): you issue a single connection between the clusters, and it even obtains the IPs of other nodes to connect to. It doesn't even matter what happens to the node you issued the connection from; the clusters stay connected.
Consul has all the tools necessary to accomplish the same thing.
I suppose one can use the configuration file, and that should work, but it's essentially offloading the work to something else. If it's done by hand it's prone to mistakes; if it's done automatically you'll need service discovery for Consul itself, plus handling of special cases like not listing a node's own IP there.
Anyway, as it is right now the join command line is totally useless. If you have 5 datacenters with 5 nodes each, won't you need to issue 80 joins (5 * 4 * 4) to ensure that all nodes are connected to all the other datacenters, so you don't encounter @wwalker's issue, and then on top of that repeat this each time a node is replaced (20 joins for a new node)?
Edit: referenced wrong person
Apologies because the docs are a bit confusing with regards to the "you only need to join a single existing member" part. What it is trying to say is that the LAN and the WAN join both work such that you only have to join with one other existing member of the cluster in order to join the entire cluster (that's why you don't need to do 80 joins in the example above). What's not super clear is that there's no connection between the WAN and LAN clusters, even though there could be.
What we intend to add with the enhancement is automatic WAN joining based on the LAN. You'd have to do at least one WAN join with a server in each datacenter, but after that Consul would recognize that there are servers on the LAN that aren't present in the WAN and would auto-join them. This should make it much harder to get into a situation where you've only WAN-joined a subset of your servers, which is fairly easy to do today.
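Under that scheme, one WAN join per remote datacenter from any single server would be enough; sketched below as printed commands (a sketch only, with seed IPs taken from the WAN listing earlier in this thread):

```shell
# Sketch: with LAN-based WAN flooding, one join per remote datacenter from a
# single server would suffice; its LAN peers would then be auto-joined.
# This only prints the commands; seed IPs are from the WAN listing above.
for seed in 10.5.6.230 10.2.34.249 10.16.1.253 10.33.5.244; do
  echo "consul join -wan $seed"
done
```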
It's not really a bug, more of a missing feature :-)
Sounds good. That addition would help a lot.
Yep, agreed.
All you should need to do is perform the WAN join once, and then within both DCs the WAN config should propagate to all server nodes.
Currently this doesn't seem to happen, and when the servers reboot they seem to lose all knowledge of the WAN they were part of, unless you manually add it into the config file as above.
WAN join flooding made it into Consul 0.8:
WAN Join Flooding: A new routine was added that looks for Consul servers in the LAN and makes sure that they are joined into the WAN as well. This catches newly-added servers up onto the WAN as soon as they join the LAN, keeping them in sync automatically. [GH-2801]
Closed in #2801.