Hi -
TL;DR - what EXACT network settings are required for consul nodes to speak to one another and do the "ack" they require to remain in a healthy state?
I'm setting up my first Consul cluster on EC2 (VPC, Ubuntu 14.04, Consul v0.5.1 amd64). Everything worked great locally in a docker-compose setup, but in EC2 it didn't.
My cluster at this point consisted of two servers, serverA and serverB. After launching Consul on serverA, I would launch Consul on serverB and have it join serverA.
The logs on serverA looked like this:
2015/05/18 17:53:30 [INFO] consul: member 'serverB' joined, marking health alive
2015/05/18 17:53:32 [INFO] memberlist: Suspect serverB has failed, no acks received
2015/05/18 17:53:34 [INFO] memberlist: Suspect serverB has failed, no acks received
2015/05/18 17:53:36 [INFO] memberlist: Suspect serverB has failed, no acks received
2015/05/18 17:53:37 [INFO] memberlist: Marking serverB as failed, suspect timeout reached
2015/05/18 17:53:37 [INFO] serf: EventMemberFailed: serverB 10.0.2.95
2015/05/18 17:53:37 [INFO] consul: member 'serverB' failed, marking health critical
2015/05/18 17:53:38 [INFO] memberlist: Suspect serverB has failed, no acks received
2015/05/18 17:53:51 [INFO] serf: attempting reconnect to serverB 10.0.2.95:8301
2015/05/18 17:53:51 [INFO] serf: EventMemberJoin: serverB 10.0.2.95
2015/05/18 17:53:51 [INFO] consul: member 'serverB' joined, marking health alive
2015/05/18 17:53:54 [INFO] memberlist: Suspect serverB has failed, no acks received
2015/05/18 17:53:56 [INFO] memberlist: Suspect serverB has failed, no acks received
2015/05/18 17:53:58 [INFO] memberlist: Suspect serverB has failed, no acks received
2015/05/18 17:53:59 [INFO] memberlist: Marking serverB as failed, suspect timeout reached
2015/05/18 17:53:59 [INFO] serf: EventMemberFailed: serverB 10.0.2.95
The logs on serverB looked the same, just with s/serverB/serverA/g applied.
In the EC2 security group I had opened ingress and egress for TCP and UDP 8300-8600 and all ICMP. Still no luck; I was getting the same errors as above.
Finally I opened all egress traffic within the subnet (see the attached screenshot of the security-group rules), and Consul just started working.

I don't know which extra ports needed to be opened; as far as I can tell I followed the Consul docs, yet it still didn't work until I opened everything.
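For what it's worth, when a node flaps between alive and failed like this, it usually means TCP is reachable (the join succeeds) but UDP is not (memberlist's failure-detector acks travel over UDP). A quick sanity check is to probe the Serf LAN port over both protocols from the other node. Below is a stdlib-only Python sketch I'd use for that; the 10.0.2.95 address is serverB's from the logs above, and note the caveat in the docstring about UDP timeouts being inconclusive:

```python
import socket

def check_tcp(host, port, timeout=1.0):
    """True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_udp(host, port, timeout=1.0, payload=b"ping"):
    """Send a datagram and wait for any reply.

    A reply proves UDP is open in both directions. A timeout is only
    a hint: a firewall silently dropping the packet and a server that
    ignores an unexpected payload (as Serf will) look identical here.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(payload, (host, port))
            s.recvfrom(1024)
            return True
        except OSError:  # timeout or ICMP port-unreachable
            return False

if __name__ == "__main__":
    # 10.0.2.95 is serverB's address from the logs above.
    print("8301/tcp reachable:", check_tcp("10.0.2.95", 8301))
    print("8301/udp replied:  ", check_udp("10.0.2.95", 8301))
```

If the TCP check passes but you never see UDP replies or acks, the security group (or Docker port mapping) is dropping UDP.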
This brings me to my question:
What EXACT network settings are required for consul nodes to speak to one another and do the "ack" they require to remain in a healthy state?
Also, really loving consul. Thank you.
See http://www.consul.io/docs/agent/options.html, in particular the "Ports Used" section. Please re-open if that doesn't cover it.
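For anyone landing here: per that page, the ports Consul 0.5.x uses by default boil down to the following (the "no acks received" messages above are the UDP side of 8301):

```
8300  tcp        server RPC (server-to-server)
8301  tcp + udp  Serf LAN gossip (every agent)
8302  tcp + udp  Serf WAN gossip (servers)
8400  tcp        CLI RPC
8500  tcp        HTTP API
8600  tcp + udp  DNS
```

Double-check against the linked docs for your version; the key point is that 8301, 8302, and 8600 need UDP as well as TCP.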
Hey guys,
I had the same issue, and I had opened all the necessary ports (at least IMHO). :)
Two bare-metal servers, each running a Docker container with Consul 0.5.2:
ports:
- "8500:8500"
- "8300:8300"
- "8400:8400"
- "8301:8301/tcp"
- "8302:8302/tcp"
- "8301:8301/udp"
- "8302:8302/udp"
For me it was a matter of declaring the UDP ports explicitly. After adding the /udp mappings, the gossip acks finally went through... :)
TL;DR: Publish ports 8301 and 8302 explicitly for both protocols (TCP and UDP).
This is not a Consul issue; it's related to the way Docker publishes ports.
I encountered a similar problem: I could create a cluster of three Consul servers (DigitalOcean machines, Consul running as the gliderlabs/consul-server Docker image), and the nodes could see each other and elect a leader, but the cluster would fall apart right after the election:
2015/09/13 18:41:07 [INFO] consul: Attempting bootstrap with nodes: [<server-1-ip>:8300 <server-2-ip>:8300 <server-3-ip>:8300]
2015/09/13 18:41:07 [WARN] raft: Heartbeat timeout reached, starting election
2015/09/13 18:41:07 [INFO] raft: Node at <server-1-ip>:8300 [Candidate] entering Candidate state
2015/09/13 18:41:07 [INFO] raft: Election won. Tally: 2
2015/09/13 18:41:07 [INFO] raft: Node at <server-1-ip>:8300 [Leader] entering Leader state
2015/09/13 18:41:07 [INFO] consul: cluster leadership acquired
2015/09/13 18:41:07 [INFO] consul: New leader elected: consul-web4
2015/09/13 18:41:07 [INFO] raft: pipelining replication to peer <server-3-ip>:8300
2015/09/13 18:41:07 [INFO] consul: member 'consul-web4' joined, marking health alive
2015/09/13 18:41:08 [WARN] raft: Remote peer <server-2-ip>:8300 does not have local node <server-1-ip>:8300 as a peer
2015/09/13 18:41:08 [INFO] consul: member 'consul-web3' joined, marking health alive
2015/09/13 18:41:08 [INFO] raft: pipelining replication to peer <server-2-ip>:8300
2015/09/13 18:41:08 [INFO] agent: Synced service 'consul'
2015/09/13 18:41:09 [INFO] memberlist: Suspect consul-web2 has failed, no acks received
2015/09/13 18:41:11 [INFO] memberlist: Suspect consul-web2 has failed, no acks received
2015/09/13 18:41:12 [INFO] memberlist: Suspect consul-web3 has failed, no acks received
2015/09/13 18:41:13 [INFO] memberlist: Suspect consul-web2 has failed, no acks received
2015/09/13 18:41:14 [INFO] memberlist: Suspect consul-web3 has failed, no acks received
2015/09/13 18:41:14 [INFO] memberlist: Marking consul-web2 as failed, suspect timeout reached
2015/09/13 18:41:14 [INFO] serf: EventMemberLeave: consul-web2 <server-3-ip>
2015/09/13 18:41:14 [INFO] consul: removing server consul-web2 (Addr: <server-3-ip>:8300) (DC: dc1)
2015/09/13 18:41:14 [INFO] raft: Removed peer <server-3-ip>:8300, stopping replication (Index: 8)
2015/09/13 18:41:14 [INFO] consul: removed server 'consul-web2' as peer
2015/09/13 18:41:14 [INFO] consul: member 'consul-web2' left, deregistering
I had published the appropriate ports in docker-compose.yml:
ports:
- "8400:8400"
- "8500:8500"
- "8301:8301"
- "8302:8302"
- "8300:8300"
- "8600:8600"
...but this did not seem to work. Explicitly defining tcp/udp ports as @ChristianKniep suggested did the trick:
ports:
- "8300:8300"
- "8301:8301/tcp"
- "8301:8301/udp"
- "8302:8302/tcp"
- "8302:8302/udp"
- "8400:8400"
- "8500:8500"
- "8600:8600"
2015/09/13 18:45:22 [INFO] raft: Election won. Tally: 2
2015/09/13 18:45:22 [INFO] raft: Node at <server-1-ip>:8300 [Leader] entering Leader state
2015/09/13 18:45:22 [INFO] consul: cluster leadership acquired
2015/09/13 18:45:22 [INFO] consul: New leader elected: consul-web4
2015/09/13 18:45:22 [INFO] raft: pipelining replication to peer <server-1-ip>:8300
2015/09/13 18:45:22 [INFO] consul: member 'consul-web4' joined, marking health alive
2015/09/13 18:45:22 [INFO] raft: pipelining replication to peer <server-2-ip>:8300
2015/09/13 18:45:22 [INFO] consul: member 'consul-web3' joined, marking health alive
2015/09/13 18:45:22 [INFO] consul: member 'consul-web2' joined, marking health alive
2015/09/13 18:45:22 [INFO] agent: Synced service 'consul'
This is likely because, by default, Docker publishes a port over TCP only, so any port that needs both protocols has to be published twice, once per protocol.
From the Docker docs: "Additionally, all of these publishing rules will default to tcp. If you need udp, simply tack it on to the end such as -p 1234:1234/udp." (source)
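For anyone using plain `docker run` rather than Compose, the same fix applies to the `-p` flag. A sketch, using the gliderlabs/consul-server image mentioned above (other ports and agent flags elided):

```
docker run -p 8301:8301/tcp -p 8301:8301/udp \
           -p 8302:8302/tcp -p 8302:8302/udp \
           gliderlabs/consul-server
```

Without the explicit `/udp` mappings, only the TCP side of each port is reachable from outside the container.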
Related: #1465 and https://github.com/hashicorp/memberlist/pull/37.
@anroots I've tried explicitly adding the ports and it doesn't seem to make any difference whatsoever.
I suspect it has something to do with the security group, since I'm trying this on EC2 instances and only one of the instances keeps failing.
In my case the advertise address was wrong. I changed it and it worked.
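For context: the advertise address is what an agent tells its peers to contact it on, so if it resolves to an interface the other nodes can't route to (a common Docker pitfall, since the agent binds inside the container), you see exactly this flapping. It can be set explicitly on the command line; a sketch, with an illustrative address and data dir:

```
consul agent -server -data-dir=/tmp/consul -advertise=10.0.2.95
```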