consul 0.9.2 - [ERR] memberlist: Failed fallback ping: write tcp 172.17.0.5:45890->a.b.c.d:8301: i/o timeout

Created on 24 Aug 2017 · 16 comments · Source: hashicorp/consul

Running consul docker image 0.9.2

consul version for both Client and Server

Client: 0.7.5 -> upgraded to 0.9.2
Server: 0.7.5 -> upgraded to 0.9.2

consul info for both Client and Server

Client:

# consul info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 7
    services = 18
build:
    prerelease =
    revision = 75ca2ca
    version = 0.9.2
consul:
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 84
    max_procs = 4
    os = linux
    version = go1.8.3
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 4625
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 6897
    members = 12
    query_queue = 0
    query_time = 2

Server:

# consul info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 5
    services = 13
build:
    prerelease =
    revision = 75ca2ca
    version = 0.9.2
consul:
    bootstrap = false
    known_datacenters = 8
    leader = false
    leader_addr = 192.168.10.237:8300
    server = true
raft:
    applied_index = 431457249
    commit_index = 431457249
    fsm_pending = 0
    last_contact = 27.164445ms
    last_log_index = 431457249
    last_log_term = 23227
    last_snapshot_index = 431453186
    last_snapshot_term = 23227
    latest_configuration = [{Suffrage:Voter ID:a.b.c.d1:8300 Address:a.b.c.d1:8300} {Suffrage:Voter ID:a.b.c.d2:8300 Address:a.b.c.d2:8300} {Suffrage:Voter ID:a.b.c.d3:8300 Address:a.b.c.d3:8300}]
    latest_configuration_index = 359053270
    num_peers = 2
    protocol_version = 2
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 23227
runtime:
    arch = amd64
    cpu_count = 8
    goroutines = 359
    max_procs = 8
    os = linux
    version = go1.8.3
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 4625
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 6897
    members = 12
    query_queue = 0
    query_time = 2
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 767
    members = 13
    query_queue = 0
    query_time = 1

Operating system and Environment details

Ubuntu 16.04.3 LTS, docker 17.05.0-ce

Description of the Issue (and unexpected/desired result)

After upgrading Consul to v0.9.2, we are seeing a lot of these messages in the logs, on every host, at random:

[ERR] memberlist: Failed fallback ping: write tcp 172.17.0.5:45890-> a.b.c.d:8301: i/o timeout

### Reproduction steps
Upgrade the Consul Docker image from 0.7.5 to v0.9.2; after that, log messages about the fallback ping appear at random.

I tried -log-level=TRACE, but it is impossible to predict on which host this will happen next. It is totally random.

As far as I can see, all Docker ports are open:

"Ports": {
    "8300/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "8300" } ],
    "8301/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "8301" } ],
    "8301/udp": [ { "HostIp": "0.0.0.0", "HostPort": "8301" } ],
    "8302/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "8302" } ],
    "8302/udp": [ { "HostIp": "0.0.0.0", "HostPort": "8302" } ],
    "8400/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "8400" } ],
    "8500/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "8500" } ],
    "8600/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "8600" } ],
    "8600/udp": [ { "HostIp": "0.0.0.0", "HostPort": "8600" } ]
},

On my test environment, I installed 2 Docker hosts with Consul 0.7.5, upgraded them to v0.8.5 and then to v0.9.0, and the fallback ping errors started. So I think this is caused by something in the 0.9.x line.

No firewall, no iptables, nothing that could block connection and cause timeout.
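To double-check that claim end to end, here is a small Python sketch (helper names are mine) that verifies both TCP and UDP reachability on a serf port. It demos against throwaway local listeners; in practice you would point the checks at the peer from the timeout message:

```python
import socket
import threading

def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_udp(host, port, timeout=2.0):
    """Send a datagram and wait for a reply. A timeout is inconclusive
    (UDP has no handshake), but a reply proves both directions work."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(b"ping", (host, port))
            s.recvfrom(64)
            return True
        except OSError:
            return False

# Demo against throwaway local listeners; in practice, aim the checks at the
# peer in the error message, e.g. check_tcp("a.b.c.d", 8301).
tcp_srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp_srv.bind(("127.0.0.1", 0))
tcp_srv.listen(1)
tcp_port = tcp_srv.getsockname()[1]
threading.Thread(target=tcp_srv.accept, daemon=True).start()

udp_srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp_srv.bind(("127.0.0.1", 0))
udp_port = udp_srv.getsockname()[1]

def udp_echo():
    data, addr = udp_srv.recvfrom(64)
    udp_srv.sendto(data, addr)

threading.Thread(target=udp_echo, daemon=True).start()

print("tcp reachable:", check_tcp("127.0.0.1", tcp_port))
print("udp reachable:", check_udp("127.0.0.1", udp_port))
```

Note that the common telnet/nc tests usually cover only the TCP side; the UDP check is the one that matters for these warnings.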


Edit:
Also, seeing a lot of this on 2 servers on the same subnet/network:

2017/08/24 13:12:04 [DEBUG] memberlist: Initiating push/pull sync with: a.b.c.d:8301
2017/08/24 13:12:10 [DEBUG] memberlist: Failed ping: SERVER (timeout reached)
2017/08/24 13:12:11 [DEBUG] memberlist: Failed ping: SERVER (timeout reached)

Labels: theme/operator-usability, type/enhancement


All 16 comments

The same is happening here. I'm on EC2 instances. No Docker.

Consul v0.9.0

It is a 3-server cluster only, no clients. Ports 8300-8500 are allowed for both UDP and TCP; 8600 is not.

2017/09/01 19:40:51 [WARN] memberlist: Was able to connect to [server] but other probes failed, network may be misconfigured
2017/09/01 19:40:52 [DEBUG] memberlist: Stream connection from=10.0.3.237:44932
2017/09/01 19:40:52 [DEBUG] memberlist: Failed ping: [server] (timeout reached)

The Consul cluster is alive and healthy. I just don't understand those logs.

For what it's worth, I rechecked the ACLs in AWS and the UDP ports were missing. The log is not that helpful, though. I remember that previous versions stated that UDP was not reachable and that it was falling back to TCP; the new ping message isn't very helpful. Perhaps that changed in 0.9.2.
Cheers!

I don't have any ACLs, and all ports are open, but I still get random timeout messages.

This morning we changed our infrastructure so that the Consul container uses the host network and CONSUL_ALLOW_PRIVILEGED_PORTS=1. We are still seeing a lot of the same log messages:

[ERR] memberlist: Failed fallback ping: write tcp 10.0.0.1:49826->10.0.0.5:8301: i/o timeout

I found the explanation and can see the point of it, and I would not want to disable it, but it is a little too excessive: the logs fill up for no obvious reason. https://github.com/hashicorp/consul/blob/v0.6.4/vendor/github.com/hashicorp/memberlist/state.go#L275-L299
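For anyone skimming that linked state.go: memberlist probes a peer over UDP first and only attempts a TCP "fallback" ping when the UDP probe fails. A TCP success after a UDP failure produces the "network may be misconfigured" warning; a TCP failure produces the "Failed fallback ping: ... i/o timeout" error. A rough Python sketch of that sequence (my own simplification, not the memberlist code; names and timeouts are made up):

```python
import socket
import threading

UDP_TIMEOUT = 0.5   # illustrative only; memberlist's probe timings differ
TCP_TIMEOUT = 3.0

def udp_ping(host, port, timeout=UDP_TIMEOUT):
    """Primary probe: send a datagram and wait for an ack."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(b"ping", (host, port))
            s.recvfrom(64)
            return True
        except OSError:
            return False

def tcp_fallback_ping(host, port, timeout=TCP_TIMEOUT):
    """Fallback probe: just try to open a TCP connection to the same port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def probe_node(host, port):
    if udp_ping(host, port):
        return "alive"
    # UDP failed; try TCP. Success here means TCP works while UDP does not,
    # which is exactly the "network may be misconfigured" situation.
    if tcp_fallback_ping(host, port):
        return "alive (WARN: UDP probes failing, network may be misconfigured)"
    # Both failed: this is where "Failed fallback ping: ... i/o timeout"
    # gets logged and the peer becomes suspect.
    return "suspect"

# Demo: a peer that answers TCP but not UDP, mimicking blocked UDP ports.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]
threading.Thread(target=srv.accept, daemon=True).start()

print(probe_node("127.0.0.1", port))
```

This also explains why a cluster can stay "alive and healthy" while the logs fill up: the TCP fallback keeps membership working even when the UDP path is broken.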

@slackpad, can you help?

Hmm that error message did get more generic after a refactoring. We should look at making these messages more specific and actionable (and less spammy).

That would be great @slackpad.

Also, can you do something about this type of message? We get them every couple of minutes, even on the latest version, 1.0.2:

 [ERR] yamux: keepalive failed: session shutdown

Hi @slackpad ,

Is there a chance of resolving this in the next release?
And of lowering the log level for the yamux "keepalive failed: session shutdown" message?

I'm asking because we have a lot of nodes, and these log messages are becoming too spammy.

Thanks

Seeing the same thing in 1.0.2. More specifically, the nodes having the issues are in different VPCs, but the VPCs are peered and I have verified they can reach each other bidirectionally on all of the required ports. The only thing I could think of is that since all of the nodes are in a private subnet with a NAT, that is somehow causing interference, but they have appropriate direct routes set up. Debug messages don't shed any additional light.

Hi,

Consul version 1.2.0, all on the same LAN; every few minutes the logs fill with:

2018/06/27 13:00:31 [ERR] memberlist: Failed fallback ping: read tcp 10.0.66.150:35168->10.0.66.192:8302: i/o timeout
2018/06/27 13:03:41 [ERR] memberlist: Failed fallback ping: read tcp 10.0.66.150:53268->10.0.66.192:8302: i/o timeout

I have been seeing the same for quite some time (and across versions) between my on-premise server and a cloud server.

_I have verified all ports back and forth using telnet, netcat, iperf3._

consul version 1.2.2

+1 consul version 1.2.2

Encountered the same issue with 1.2.2:

~~~
docker@consulserver:~$ docker exec -it 5cecf4554a0d consul members --http-addr=192.168.99.100:8500
Node          Address              Status  Type    Build  Protocol  DC        Segment
consulserver  192.168.99.100:8301  alive   server  1.2.2  2         labsetup  <all>
consulclient  192.168.99.101:8301  alive   client  1.2.2  2         labsetup  <default>

2018/09/08 11:14:05 [INFO] consul: member 'consulclient' joined, marking health alive
2018/09/08 11:14:15 [WARN] memberlist: Was able to connect to consulclient but other probes failed, network may be misconfigured
2018/09/08 11:14:22 [WARN] memberlist: Was able to connect to consulclient but other probes failed, network may be misconfigured
2018/09/08 11:14:29 [WARN] memberlist: Was able to connect to consulclient but other probes failed, network may be misconfigured
2018/09/08 11:14:35 [WARN] memberlist: Was able to connect to consulclient but other probes failed, network may be misconfigured

~~~

+1 consul version 1.4.0

The errors I was seeing have since gone away.
The issue was how the VPN was set up between the two endpoints.

Previously it was a software-based VPN (StrongSwan).
Once a site-to-site VPN was set up between the on-premise firewall and AWS, this error went away.


Hi, I'm hitting the same problem on Consul 1.5.1. Has anyone found a solution? Thanks.
Log:
memberlist: Was able to connect to FSKY_Client but other probes failed, network may be misconfigured

Seeing the same issues with 1.8.3 on AWS peered VPCs
