Consul client keeps failing and rejoining over and over again

Created on 20 Apr 2016 · 6 Comments · Source: hashicorp/consul

Hi,

I have 3 server nodes running the Consul binary directly, and 2 client nodes running in Docker containers.
Initially, the 2 client nodes join the cluster successfully.
After that, however, the servers report "memberlist: Push/Pull" errors and keep marking the client nodes as failed and then rejoining them to the cluster, over and over again.

logs from the server:

2016/04/20 12:20:58 [ERR] memberlist: Push/Pull with k8s2consul failed: dial tcp 172.17.0.5:8301: i/o timeout
2016/04/20 12:22:11 [INFO] memberlist: Suspect k8s2consul has failed, no acks received
2016/04/20 12:22:16 [INFO] memberlist: Marking k8s2consul as failed, suspect timeout reached
2016/04/20 12:22:16 [INFO] serf: EventMemberFailed: k8s2consul 172.17.0.5
2016/04/20 12:22:22 [INFO] serf: EventMemberJoin: k8s2consul 172.17.0.5
2016/04/20 12:25:38 [ERR] memberlist: Push/Pull with sdclient-pd failed: dial tcp 172.17.0.6:8301: i/o timeout
2016/04/20 12:25:46 [INFO] memberlist: Suspect sdclient-pd has failed, no acks received
2016/04/20 12:25:51 [INFO] memberlist: Marking sdclient-pd as failed, suspect timeout reached
2016/04/20 12:25:51 [INFO] serf: EventMemberFailed: sdclient-pd 172.17.0.6
2016/04/20 12:25:58 [INFO] serf: EventMemberJoin: sdclient-pd 172.17.0.6

logs from the client:

2016/04/20 03:19:59 [WARN] memberlist: Refuting a suspect message (from: server-252)
2016/04/20 03:20:22 [INFO] serf: EventMemberFailed: k8s2consul 172.17.0.5
2016/04/20 03:20:28 [INFO] serf: EventMemberJoin: k8s2consul 172.17.0.5
2016/04/20 03:20:46 [INFO] memberlist: Marking k8s2consul as failed, suspect timeout reached
2016/04/20 03:20:46 [INFO] serf: EventMemberFailed: k8s2consul 172.17.0.5
2016/04/20 03:20:48 [INFO] serf: EventMemberJoin: k8s2consul 172.17.0.5
2016/04/20 03:22:26 [WARN] memberlist: Refuting a suspect message (from: server-252)
2016/04/20 03:24:25 [INFO] memberlist: Marking k8s2consul as failed, suspect timeout reached
2016/04/20 03:24:25 [INFO] serf: EventMemberFailed: k8s2consul 172.17.0.5
2016/04/20 03:24:28 [INFO] serf: EventMemberJoin: k8s2consul 172.17.0.5
2016/04/20 03:25:43 [INFO] serf: EventMemberFailed: k8s2consul 172.17.0.5
2016/04/20 03:25:43 [INFO] serf: EventMemberJoin: k8s2consul 172.17.0.5
2016/04/20 03:25:48 [WARN] memberlist: Refuting a suspect message (from: sdclient-pd)
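
(Editor's sketch, not part of the original report.) To see how often each node is flapping, the serf events in logs like the ones above can be tallied per node. The regular expression below is an assumption based only on the line format shown in this thread, and `count_member_events` is a hypothetical helper name, not anything Consul ships.

```python
import re
from collections import Counter

# Assumed pattern, derived from the log lines quoted in this thread, e.g.
# "... [INFO] serf: EventMemberFailed: k8s2consul 172.17.0.5"
EVENT_RE = re.compile(r"serf: (EventMemberFailed|EventMemberJoin): (\S+)")


def count_member_events(log_text):
    """Return {node_name: Counter({event_name: count})} for serf events."""
    counts = {}
    for line in log_text.splitlines():
        match = EVENT_RE.search(line)
        if match:
            event, node = match.groups()
            counts.setdefault(node, Counter())[event] += 1
    return counts


if __name__ == "__main__":
    sample = """\
2016/04/20 12:22:16 [INFO] serf: EventMemberFailed: k8s2consul 172.17.0.5
2016/04/20 12:22:22 [INFO] serf: EventMemberJoin: k8s2consul 172.17.0.5
2016/04/20 12:25:51 [INFO] serf: EventMemberFailed: sdclient-pd 172.17.0.6
2016/04/20 12:25:58 [INFO] serf: EventMemberJoin: sdclient-pd 172.17.0.6
"""
    for node, events in count_member_events(sample).items():
        print(node, dict(events))
```

A node with matching, steadily growing failed/join counts is flapping rather than genuinely down.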

All 6 comments

I'm seeing something very similar on a larger infrastructure (~200 nodes, with 40-60 failed and over 100 suspect at any given time).

Sample of the logs:

2016/04/20 03:54:12 [INFO] memberlist: Marking foo-server-18 as failed, suspect timeout reached
2016/04/20 03:54:12 [INFO] serf: EventMemberFailed: foo-server-18 10.XXX.XXX.XXX
2016/04/20 03:54:13 [INFO] serf: EventMemberJoin: foo-server-18 10.XXX.XXX.XXX

I'm using 0.6.4 on the servers, and the clients are currently on 0.6.1.
Here is some metric data from the Consul telemetry. This is a fixed inventory, not dynamically scaling:
[Screenshot of Consul telemetry metrics, taken 2016-04-19 at 9:02:31 pm]

By the way, in my case the servers and clients are both on 0.6.3.

Hi @hehailong5 these issues are almost always caused by network configuration issues. You need port 8301 open for TCP and UDP between all nodes in a cluster (Consul requires them to be a fully connected mesh).
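
(Editor's sketch, not part of the original comment.) A quick way to test the TCP half of this requirement is to attempt a connection to port 8301 from each node to every other node. `tcp_port_open` is a hypothetical helper name. It only checks TCP: a successful result says nothing about UDP, which needs to be verified separately (for example by checking the firewall rules themselves).

```python
import socket
import sys


def tcp_port_open(host, port=8301, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Hosts to probe are passed on the command line; nothing runs without them.
    for host in sys.argv[1:]:
        state = "open" if tcp_port_open(host) else "NOT reachable"
        print(f"{host}:8301/tcp {state}")
```

Run it from a server against the client addresses in the logs (172.17.0.5 and 172.17.0.6 in the original report), and again in the other direction. An "i/o timeout" like the one in the server logs would show up here as "NOT reachable".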

For anyone who reached here after googling the error message:

I solved this by creating TCP/UDP "allow" rules from and to the same CIDR as that of the machines themselves,

i.e. allow all ports from/to the 192.168.x.x network for all the machines in the 192.168.x.x network.

HTH,
Shantanu

@shantanugadgil why do all ports need to be open? You should only need to open the ports that are necessary for Consul communication.

This was quite some time back, and the "push/pull" error had annoyed me quite a bit at the time, so it was a "WTH moment" decision.

But basically you are correct: you should open up only the required ports.
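
(Editor's sketch, not part of the original comment.) As a rough illustration of "only the required ports", the snippet below prints iptables allow rules limited to a source CIDR. Only port 8301 (TCP and UDP) is named in this thread; the other entries are Consul's default ports and are an assumption about your setup, as is the example CIDR `10.0.0.0/24`. `iptables_allow_rules` is a hypothetical helper name. Clients that need the HTTP API or DNS interface would need those ports as well.

```python
import ipaddress

# (protocol, port) pairs. 8301 comes from this thread; the rest are Consul
# defaults and are an assumption about the deployment.
CONSUL_PORTS = [
    ("tcp", 8300),  # server RPC
    ("tcp", 8301),  # Serf LAN gossip
    ("udp", 8301),  # Serf LAN gossip
    ("tcp", 8302),  # Serf WAN gossip (servers only)
    ("udp", 8302),  # Serf WAN gossip (servers only)
]


def iptables_allow_rules(cidr, ports=CONSUL_PORTS):
    """Build iptables commands accepting the given ports from one CIDR."""
    network = ipaddress.ip_network(cidr)  # raises ValueError if malformed
    return [
        f"iptables -A INPUT -p {proto} -s {network} --dport {port} -j ACCEPT"
        for proto, port in ports
    ]


if __name__ == "__main__":
    # 10.0.0.0/24 is a placeholder; substitute the cluster's own network.
    for rule in iptables_allow_rules("10.0.0.0/24"):
        print(rule)
```

The script only prints the commands, so they can be reviewed before anything is applied.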
