Hi -
TL;DR - what EXACT network settings are required for consul nodes to speak to one another and do the "ack" they require to remain in a healthy state?
I'm setting up my first Consul cluster on EC2 (VPC, Ubuntu 14.04, Consul v0.5.1 amd64). Everything worked great locally in a docker-compose setup, but in EC2 it didn't.
My cluster at this point consisted of two servers, serverA and serverB. After launching Consul on serverA, I would launch Consul on serverB and have it join serverA.
The logs on serverA looked like this:
2015/05/18 17:53:30 [INFO] consul: member 'serverB' joined, marking health alive
2015/05/18 17:53:32 [INFO] memberlist: Suspect serverB has failed, no acks received
2015/05/18 17:53:34 [INFO] memberlist: Suspect serverB has failed, no acks received
2015/05/18 17:53:36 [INFO] memberlist: Suspect serverB has failed, no acks received
2015/05/18 17:53:37 [INFO] memberlist: Marking serverB as failed, suspect timeout reached
2015/05/18 17:53:37 [INFO] serf: EventMemberFailed: serverB 10.0.2.95
2015/05/18 17:53:37 [INFO] consul: member 'serverB' failed, marking health critical
2015/05/18 17:53:38 [INFO] memberlist: Suspect serverB has failed, no acks received
2015/05/18 17:53:51 [INFO] serf: attempting reconnect to serverB 10.0.2.95:8301
2015/05/18 17:53:51 [INFO] serf: EventMemberJoin: serverB 10.0.2.95
2015/05/18 17:53:51 [INFO] consul: member 'serverB' joined, marking health alive
2015/05/18 17:53:54 [INFO] memberlist: Suspect serverB has failed, no acks received
2015/05/18 17:53:56 [INFO] memberlist: Suspect serverB has failed, no acks received
2015/05/18 17:53:58 [INFO] memberlist: Suspect serverB has failed, no acks received
2015/05/18 17:53:59 [INFO] memberlist: Marking serverB as failed, suspect timeout reached
2015/05/18 17:53:59 [INFO] serf: EventMemberFailed: serverB 10.0.2.95
The logs on serverB looked the same, just with s/serverB/serverA/g applied.
In the EC2 security group I had opened ingress and egress for TCP and UDP 8300-8600 and all ICMP. Still no luck; I was getting the same errors as above.
Finally I opened all egress traffic within the subnet (see the attached screenshot of the security-group rules), and Consul just started working.

I don't know which extra ports needed to be opened; as far as I can tell I followed the Consul docs, yet it still didn't work until I opened everything.
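For what it's worth, when a node flaps between alive and failed like this, it usually means TCP is reachable (the join succeeds) but UDP is not (memberlist's failure-detector acks travel over UDP). A quick sanity check is to probe the Serf LAN port over both protocols from the other node. Below is a stdlib-only Python sketch I'd use for that; the 10.0.2.95 address is serverB's from the logs above, and note the caveat in the docstring about UDP timeouts being inconclusive:

```python
import socket

def check_tcp(host, port, timeout=1.0):
    """True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_udp(host, port, timeout=1.0, payload=b"ping"):
    """Send a datagram and wait for any reply.

    A reply proves UDP is open in both directions. A timeout is only
    a hint: a firewall silently dropping the packet and a server that
    ignores an unexpected payload (as Serf will) look identical here.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(payload, (host, port))
            s.recvfrom(1024)
            return True
        except OSError:  # timeout or ICMP port-unreachable
            return False

if __name__ == "__main__":
    # 10.0.2.95 is serverB's address from the logs above.
    print("8301/tcp reachable:", check_tcp("10.0.2.95", 8301))
    print("8301/udp replied:  ", check_udp("10.0.2.95", 8301))
```

If the TCP check passes but you never see UDP replies or acks, the security group (or Docker port mapping) is dropping UDP.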
This brings me to my question:
What EXACT network settings are required for consul nodes to speak to one another and do the "ack" they require to remain in a healthy state?
Also, really loving consul. Thank you.
See http://www.consul.io/docs/agent/options.html, in particular the "Ports Used" section. Please re-open if that doesn't cover it.
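For anyone landing here: per that page, the ports Consul 0.5.x uses by default boil down to the following (the "no acks received" messages above are the UDP side of 8301):

```
8300  tcp        server RPC (server-to-server)
8301  tcp + udp  Serf LAN gossip (every agent)
8302  tcp + udp  Serf WAN gossip (servers)
8400  tcp        CLI RPC
8500  tcp        HTTP API
8600  tcp + udp  DNS
```

Double-check against the linked docs for your version; the key point is that 8301, 8302, and 8600 need UDP as well as TCP.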
Hey guys,
I had the same issue, and I had opened all the necessary ports (at least IMHO). :)
Two bare-metal servers, each running a Docker container with Consul 0.5.2:
ports:
- "8500:8500"
- "8300:8300"
- "8400:8400"
- "8301:8301/tcp"
- "8302:8302/tcp"
- "8301:8301/udp"
- "8302:8302/udp"
For me it was a matter of declaring the UDP ports explicitly. After adding the /udp mappings, the gossip acks finally went through... :)
TL;DR: Publish ports 8301 and 8302 explicitly for both protocols (TCP and UDP).
This is not a Consul issue; it's related to the way Docker publishes ports.
I encountered a similar problem: I could create a cluster of three Consul servers (DigitalOcean machines, Consul running as the gliderlabs/consul-server Docker image), and the nodes could see each other and elect a leader, but the cluster would fall apart right after the election:
2015/09/13 18:41:07 [INFO] consul: Attempting bootstrap with nodes: [<server-1-ip>:8300 <server-2-ip>:8300 <server-3-ip>:8300]
2015/09/13 18:41:07 [WARN] raft: Heartbeat timeout reached, starting election
2015/09/13 18:41:07 [INFO] raft: Node at <server-1-ip>:8300 [Candidate] entering Candidate state
2015/09/13 18:41:07 [INFO] raft: Election won. Tally: 2
2015/09/13 18:41:07 [INFO] raft: Node at <server-1-ip>:8300 [Leader] entering Leader state
2015/09/13 18:41:07 [INFO] consul: cluster leadership acquired
2015/09/13 18:41:07 [INFO] consul: New leader elected: consul-web4
2015/09/13 18:41:07 [INFO] raft: pipelining replication to peer <server-3-ip>:8300
2015/09/13 18:41:07 [INFO] consul: member 'consul-web4' joined, marking health alive
2015/09/13 18:41:08 [WARN] raft: Remote peer <server-2-ip>:8300 does not have local node <server-1-ip>:8300 as a peer
2015/09/13 18:41:08 [INFO] consul: member 'consul-web3' joined, marking health alive
2015/09/13 18:41:08 [INFO] raft: pipelining replication to peer <server-2-ip>:8300
2015/09/13 18:41:08 [INFO] agent: Synced service 'consul'
2015/09/13 18:41:09 [INFO] memberlist: Suspect consul-web2 has failed, no acks received
2015/09/13 18:41:11 [INFO] memberlist: Suspect consul-web2 has failed, no acks received
2015/09/13 18:41:12 [INFO] memberlist: Suspect consul-web3 has failed, no acks received
2015/09/13 18:41:13 [INFO] memberlist: Suspect consul-web2 has failed, no acks received
2015/09/13 18:41:14 [INFO] memberlist: Suspect consul-web3 has failed, no acks received
2015/09/13 18:41:14 [INFO] memberlist: Marking consul-web2 as failed, suspect timeout reached
2015/09/13 18:41:14 [INFO] serf: EventMemberLeave: consul-web2 <server-3-ip>
2015/09/13 18:41:14 [INFO] consul: removing server consul-web2 (Addr: <server-3-ip>:8300) (DC: dc1)
2015/09/13 18:41:14 [INFO] raft: Removed peer <server-3-ip>:8300, stopping replication (Index: 8)
2015/09/13 18:41:14 [INFO] consul: removed server 'consul-web2' as peer
2015/09/13 18:41:14 [INFO] consul: member 'consul-web2' left, deregistering
I had published the appropriate ports in docker-compose.yml:
ports:
- "8400:8400"
- "8500:8500"
- "8301:8301"
- "8302:8302"
- "8300:8300"
- "8600:8600"
...but this did not seem to work. Explicitly defining tcp/udp ports as @ChristianKniep suggested did the trick:
ports:
- "8300:8300"
- "8301:8301/tcp"
- "8301:8301/udp"
- "8302:8302/tcp"
- "8302:8302/udp"
- "8400:8400"
- "8500:8500"
- "8600:8600"
2015/09/13 18:45:22 [INFO] raft: Election won. Tally: 2
2015/09/13 18:45:22 [INFO] raft: Node at <server-1-ip>:8300 [Leader] entering Leader state
2015/09/13 18:45:22 [INFO] consul: cluster leadership acquired
2015/09/13 18:45:22 [INFO] consul: New leader elected: consul-web4
2015/09/13 18:45:22 [INFO] raft: pipelining replication to peer <server-1-ip>:8300
2015/09/13 18:45:22 [INFO] consul: member 'consul-web4' joined, marking health alive
2015/09/13 18:45:22 [INFO] raft: pipelining replication to peer <server-2-ip>:8300
2015/09/13 18:45:22 [INFO] consul: member 'consul-web3' joined, marking health alive
2015/09/13 18:45:22 [INFO] consul: member 'consul-web2' joined, marking health alive
2015/09/13 18:45:22 [INFO] agent: Synced service 'consul'
This is likely because, by default, Docker publishes a port over TCP only, so any port that needs both protocols has to be published twice, once per protocol.
From the Docker docs: "Additionally, all of these publishing rules will default to tcp. If you need udp, simply tack it on to the end such as -p 1234:1234/udp." (source)
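For anyone using plain `docker run` rather than Compose, the same fix applies to the `-p` flag. A sketch, using the gliderlabs/consul-server image mentioned above (other ports and agent flags elided):

```
docker run -p 8301:8301/tcp -p 8301:8301/udp \
           -p 8302:8302/tcp -p 8302:8302/udp \
           gliderlabs/consul-server
```

Without the explicit `/udp` mappings, only the TCP side of each port is reachable from outside the container.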
Related: #1465 and https://github.com/hashicorp/memberlist/pull/37.
@anroots I've tried explicitly adding the ports and it doesn't seem to make any difference whatsoever.
I suspect it has something to do with the security group, since I'm trying this on EC2 instances and only one of the instances keeps failing.
In my case the advertise address was wrong. I changed it and it worked.
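For context: the advertise address is what an agent tells its peers to contact it on, so if it resolves to an interface the other nodes can't route to (a common Docker pitfall, since the agent binds inside the container), you see exactly this flapping. It can be set explicitly on the command line; a sketch, with an illustrative address and data dir:

```
consul agent -server -data-dir=/tmp/consul -advertise=10.0.2.95
```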