Running the Consul Docker image 0.9.2.

`consul version` for both Client and Server:
Client: 0.7.5 -> upgraded to 0.9.2
Server: 0.7.5 -> upgraded to 0.9.2

`consul info` for both Client and Server:
Client:
```
# consul info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 7
    services = 18
build:
    prerelease =
    revision = 75ca2ca
    version = 0.9.2
consul:
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 84
    max_procs = 4
    os = linux
    version = go1.8.3
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 4625
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 6897
    members = 12
    query_queue = 0
    query_time = 2
```
Server:
```
# consul info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 5
    services = 13
build:
    prerelease =
    revision = 75ca2ca
    version = 0.9.2
consul:
    bootstrap = false
    known_datacenters = 8
    leader = false
    leader_addr = 192.168.10.237:8300
    server = true
raft:
    applied_index = 431457249
    commit_index = 431457249
    fsm_pending = 0
    last_contact = 27.164445ms
    last_log_index = 431457249
    last_log_term = 23227
    last_snapshot_index = 431453186
    last_snapshot_term = 23227
    latest_configuration = [{Suffrage:Voter ID:a.b.c.d1:8300 Address:a.b.c.d1:8300} {Suffrage:Voter ID:a.b.c.d2:8300 Address:a.b.c.d2:8300} {Suffrage:Voter ID:a.b.c.d3:8300 Address:a.b.c.d3:8300}]
    latest_configuration_index = 359053270
    num_peers = 2
    protocol_version = 2
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 23227
runtime:
    arch = amd64
    cpu_count = 8
    goroutines = 359
    max_procs = 8
    os = linux
    version = go1.8.3
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 4625
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 6897
    members = 12
    query_queue = 0
    query_time = 2
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 767
    members = 13
    query_queue = 0
    query_time = 1
```
Ubuntu 16.04.3 LTS, Docker 17.05.0-ce.

After upgrading Consul to v0.9.2, we are seeing a lot of messages like this in the logs on every host, at random:
```
[ERR] memberlist: Failed fallback ping: write tcp 172.17.0.5:45890-> a.b.c.d:8301: i/o timeout
```
### Reproduction steps
Upgraded the Consul Docker image from 0.7.5 to v0.9.2; after that we randomly get log messages about the fallback ping.
Tried using -log-level=TRACE, but it is impossible to predict which host it will happen on next; it is totally random.
All Docker ports are open, as far as I can see:
```json
"Ports": {
    "8300/tcp": [
        {
            "HostIp": "0.0.0.0",
            "HostPort": "8300"
        }
    ],
    "8301/tcp": [
        {
            "HostIp": "0.0.0.0",
            "HostPort": "8301"
        }
    ],
    "8301/udp": [
        {
            "HostIp": "0.0.0.0",
            "HostPort": "8301"
        }
    ],
    "8302/tcp": [
        {
            "HostIp": "0.0.0.0",
            "HostPort": "8302"
        }
    ],
    "8302/udp": [
        {
            "HostIp": "0.0.0.0",
            "HostPort": "8302"
        }
    ],
    "8400/tcp": [
        {
            "HostIp": "0.0.0.0",
            "HostPort": "8400"
        }
    ],
    "8500/tcp": [
        {
            "HostIp": "0.0.0.0",
            "HostPort": "8500"
        }
    ],
    "8600/tcp": [
        {
            "HostIp": "0.0.0.0",
            "HostPort": "8600"
        }
    ],
    "8600/udp": [
        {
            "HostIp": "0.0.0.0",
            "HostPort": "8600"
        }
    ]
},
```
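The `Ports` mapping above comes from `docker inspect`; one way to sanity-check it is to verify programmatically that every gossip port is published for both protocols. A minimal sketch, where the JSON literal is a trimmed copy of the output above (8300, 8400, 8500, and 8600 omitted for brevity):

```python
import json

# Trimmed copy of the docker inspect "Ports" mapping shown above.
ports_json = """{
  "8301/tcp": [{"HostIp": "0.0.0.0", "HostPort": "8301"}],
  "8301/udp": [{"HostIp": "0.0.0.0", "HostPort": "8301"}],
  "8302/tcp": [{"HostIp": "0.0.0.0", "HostPort": "8302"}],
  "8302/udp": [{"HostIp": "0.0.0.0", "HostPort": "8302"}]
}"""

# memberlist's gossip probes use these port/protocol combinations.
needed = {"8301/tcp", "8301/udp", "8302/tcp", "8302/udp"}

# A port with an empty bindings list is exposed but not published.
published = {name for name, bindings in json.loads(ports_json).items() if bindings}
print(sorted(needed - published))  # an empty list means all gossip ports are bound
```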
On my test environment, I installed 2 Docker hosts with Consul 0.7.5, then upgraded to v0.8.5 and then to v0.9.0, and the fallback ping messages started. So I think this is caused by something in version 0.9.x.
There is no firewall, no iptables, nothing that could block the connection and cause a timeout.
Edit:
Also seeing a lot of this on 2 servers on the same subnet/network:
```
2017/08/24 13:12:04 [DEBUG] memberlist: Initiating push/pull sync with: a.b.c.d:8301
2017/08/24 13:12:10 [DEBUG] memberlist: Failed ping: SERVER (timeout reached)
2017/08/24 13:12:11 [DEBUG] memberlist: Failed ping: SERVER (timeout reached)
```
The same is happening here. I'm on EC2 instances, no Docker.
Consul v0.9.0.
It is a 3-server cluster only, no clients. Ports 8300-8500 are allowed for both UDP and TCP; 8600 is not.
```
2017/09/01 19:40:51 [WARN] memberlist: Was able to connect to [server] but other probes failed, network may be misconfigured
2017/09/01 19:40:52 [DEBUG] memberlist: Stream connection from=10.0.3.237:44932
2017/09/01 19:40:52 [DEBUG] memberlist: Failed ping: [server] (timeout reached)
```
The Consul cluster is alive and healthy; I just don't understand those logs.
For what it's worth, I rechecked the ACLs in AWS and the UDP ports were missing. The log is not that helpful, though. I remember that in previous versions it stated that UDP was not reachable and that it was falling back to TCP; now the ping message isn't very informative. Perhaps it changed in the new 0.9.2.
Cheers!
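The check described above can be sketched programmatically: Consul's gossip (Serf) traffic on 8301 and 8302 needs both TCP and UDP rules, and a rule set that only allows TCP reproduces the fallback-ping symptom. The rule sets below are hypothetical examples, not real AWS API output:

```python
# Gossip probes need these (protocol, port) pairs open in both directions.
REQUIRED = {("tcp", 8301), ("udp", 8301), ("tcp", 8302), ("udp", 8302)}

def missing_rules(allowed):
    """Return the required (protocol, port) pairs absent from the rule set."""
    return sorted(REQUIRED - set(allowed))

# UDP forgotten, as in the AWS ACLs described above.
tcp_only = {("tcp", 8301), ("tcp", 8302)}
print(missing_rules(tcp_only))  # -> [('udp', 8301), ('udp', 8302)]
```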
Don't have any ACLs, and all ports are open, but I still get random timeout messages.
We changed our infrastructure this morning so that the Consul container uses host networking with CONSUL_ALLOW_PRIVILEGED_PORTS=1, and we are seeing a lot of the same log messages:
```
[ERR] memberlist: Failed fallback ping: write tcp 10.0.0.1:49826->10.0.0.5:8301: i/o timeout
```
I found the explanation and can see the use of it, and I would not like to disable it, but it is a little too excessive; the logs are full of these lines for no obvious reason: https://github.com/hashicorp/consul/blob/v0.6.4/vendor/github.com/hashicorp/memberlist/state.go#L275-L299
@slackpad, can you help?
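For context, the linked memberlist code probes a peer over UDP first and only falls back to a TCP ping when no UDP ack arrives; the "Failed fallback ping" error means both legs timed out. A rough, illustrative Python sketch of that probe order (not the real memberlist wire protocol; the payload is invented):

```python
import socket

def probe(host, port, timeout=0.5):
    """Probe a peer in the spirit of memberlist: UDP first, then a TCP
    fallback. Returns 'udp', 'tcp', or 'unreachable'."""
    # First leg: fire a UDP datagram and wait briefly for any reply.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as u:
        u.settimeout(timeout)
        try:
            u.sendto(b"ping", (host, port))  # invented payload
            u.recvfrom(16)
            return "udp"
        except socket.timeout:
            pass  # no UDP ack -- this is where the fallback kicks in
        except OSError:
            pass  # e.g. ICMP port-unreachable surfaced as a socket error
    # Second leg: the TCP fallback. A timeout or refusal here is what
    # produces "[ERR] memberlist: Failed fallback ping: ... i/o timeout".
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "tcp"
    except OSError:
        return "unreachable"
```

Against a peer where UDP is filtered but TCP is open, this returns `'tcp'`, which is the situation behind the "Was able to connect ... but other probes failed" warning; if both legs fail it returns `'unreachable'`.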
Hmm, that error message did get more generic after a refactoring. We should look at making these messages more specific and actionable (and less spammy).
That would be great @slackpad.
And also, can you do something about this type of message? We get them every couple of minutes, even on the latest version, 1.0.2:
```
[ERR] yamux: keepalive failed: session shutdown
```
Hi @slackpad,
Is there a chance to resolve this in the next release?
And to lower the log level for the yamux "keepalive failed: session shutdown" message?
I'm asking because we have a lot of nodes and these log messages are becoming too spammy.
Thanks
Seeing the same thing in 1.0.2. More specifically, the nodes having the issues are in different VPCs, but the VPCs are peered and I have verified they can reach each other bi-directionally on all of the required ports. The only thing I can think of is that all of the nodes are in a private subnet behind a NAT, and that is somehow causing interference, but they have appropriate direct routes set up. Debug messages don't shed any additional light.
Hi,
Consul version 1.2.0; on the same LAN, every few minutes the logs fill up with:
```
2018/06/27 13:00:31 [ERR] memberlist: Failed fallback ping: read tcp 10.0.66.150:35168->10.0.66.192:8302: i/o timeout
2018/06/27 13:03:41 [ERR] memberlist: Failed fallback ping: read tcp 10.0.66.150:53268->10.0.66.192:8302: i/o timeout
```
I have been seeing the same for quite some time (and across several versions) between my on-premises server and a cloud server.
_I have verified all ports back and forth using telnet, netcat, iperf3._
consul version 1.2.2
+1 consul version 1.2.2
Encountered the same issue with 1.2.2:
```
docker@consulserver:~$ docker exec -it 5cecf4554a0d consul members --http-addr=192.168.99.100:8500
Node          Address              Status  Type    Build  Protocol  DC        Segment
consulserver  192.168.99.100:8301  alive   server  1.2.2  2         labsetup  <all>
consulclient  192.168.99.101:8301  alive   client  1.2.2  2         labsetup  <default>
2018/09/08 11:14:05 [INFO] consul: member 'consulclient' joined, marking health alive
2018/09/08 11:14:15 [WARN] memberlist: Was able to connect to consulclient but other probes failed, network may be misconfigured
2018/09/08 11:14:22 [WARN] memberlist: Was able to connect to consulclient but other probes failed, network may be misconfigured
2018/09/08 11:14:29 [WARN] memberlist: Was able to connect to consulclient but other probes failed, network may be misconfigured
2018/09/08 11:14:35 [WARN] memberlist: Was able to connect to consulclient but other probes failed, network may be misconfigured
```
+1 consul version 1.4.0
The errors I was seeing have since gone away.
The issue was how the VPN was set up between the two endpoints.
Previously it was a software-based VPN (StrongSwan).
Once a site-to-site VPN was set up between the on-premise firewall and AWS, the error went away.
Hi, I am hitting the same problem on Consul 1.5.1. Have you found a solution? Thanks.
Log:
```
memberlist: Was able to connect to FSKY_Client but other probes failed, network may be misconfigured
```
Seeing the same issues with 1.8.3 on AWS peered VPCs