consul version for both Client and Server

Client:
Consul v0.7.3-criteo1-criteo1 (f3d518bc+CHANGES)
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Server:
Consul v0.8.4
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
The custom criteo version is based on f3d518bc and only contains patches from #2474 and #2657.
consul info for both Client and Server

Client:
agent:
check_monitors = 0
check_ttls = 0
checks = 2
services = 3
build:
prerelease = criteo1
revision = 'f3d518b
version = 0.7.3
consul:
known_servers = 0
server = false
runtime:
arch = amd64
cpu_count = 24
goroutines = 30
max_procs = 2
os = windows
version = go1.7.4
serf_lan:
encrypted = false
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1
members = 1
query_queue = 0
query_time = 1
Server:
agent:
check_monitors = 0
check_ttls = 0
checks = 1
services = 3
build:
prerelease =
revision = f436077
version = 0.8.4
consul:
bootstrap = false
known_datacenters = 9
leader = false
leader_addr = 10.71.4.141:8300
server = true
raft:
applied_index = 213985112
commit_index = 213985112
fsm_pending = 0
last_contact = 12.873464ms
last_log_index = 213985113
last_log_term = 5511
last_snapshot_index = 213979994
last_snapshot_term = 5511
latest_configuration = [{Suffrage:Voter ID:10.71.4.143:8300 Address:10.71.4.143:8300} {Suffrage:Voter ID:10.71.4.142:8300 Address:10.71.4.142:8300} {Suffrage:Voter ID:10.71.4.141:8300 Address:10.71.4.141:8300}]
latest_configuration_index = 129059017
num_peers = 2
protocol_version = 2
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 5511
runtime:
arch = amd64
cpu_count = 24
goroutines = 8348
max_procs = 23
os = linux
version = go1.8.3
serf_lan:
coordinate_resets = 0
encrypted = false
event_queue = 0
event_time = 1872
failed = 0
health_score = 0
intent_queue = 0
left = 10
member_time = 1095554
members = 1872
query_queue = 0
query_time = 370
serf_wan:
coordinate_resets = 0
encrypted = false
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 66250
members = 27
query_queue = 0
query_time = 1
50% of the servers are Linux (mostly CentOS 7), 50% are Windows Server 2012 R2.
Consul agents already present in the serf cluster upgraded without issues.
New consul agents added during the upgrade cannot join the cluster.
Here are the logs from an agent (on Windows):
2017/07/03 14:35:46 [WARN] manager: No servers available
2017/07/03 14:35:46 [ERR] agent: failed to sync remote state: No known Consul servers
2017/07/03 14:35:50 [INFO] agent: (LAN) joining: [consul02-ty5.central.criteo.prod consul01-ty5.central.criteo.prod consul03-ty5.central.criteo.prod]
2017/07/03 14:35:50 [DEBUG] memberlist: TCP-first lookup failed for 'consul02-ty5.central.criteo.prod:8301', falling back to UDP: open /etc/resolv.conf: The system cannot find the path specified.
2017/07/03 14:35:50 [DEBUG] memberlist: Initiating push/pull sync with: 10.71.4.141:8301
2017/07/03 14:35:50 [DEBUG] memberlist: Failed to join 10.71.4.141: Node 'web-fbx019-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
2017/07/03 14:35:50 [DEBUG] memberlist: TCP-first lookup failed for 'consul01-ty5.central.criteo.prod:8301', falling back to UDP: open /etc/resolv.conf: The system cannot find the path specified.
2017/07/03 14:35:50 [DEBUG] memberlist: Initiating push/pull sync with: 10.71.4.142:8301
2017/07/03 14:35:50 [DEBUG] memberlist: Failed to join 10.71.4.142: Node 'hostw145-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
2017/07/03 14:35:50 [DEBUG] memberlist: TCP-first lookup failed for 'consul03-ty5.central.criteo.prod:8301', falling back to UDP: open /etc/resolv.conf: The system cannot find the path specified.
2017/07/03 14:35:50 [DEBUG] memberlist: Initiating push/pull sync with: 10.71.4.143:8301
2017/07/03 14:35:50 [DEBUG] memberlist: Failed to join 10.71.4.143: Node 'couchs01e23-ty5.storage.criteo.prod' protocol version (2) is incompatible: [1, 0]
2017/07/03 14:35:50 [INFO] agent: (LAN) joined: 0 Err: 3 error(s) occurred:
* Failed to join 10.71.4.141: Node 'web-fbx019-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
* Failed to join 10.71.4.142: Node 'hostw145-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
* Failed to join 10.71.4.143: Node 'couchs01e23-ty5.storage.criteo.prod' protocol version (2) is incompatible: [1, 0]
2017/07/03 14:35:50 [WARN] agent: Join failed: <nil>, retrying in 30s
2017/07/03 14:35:50 [WARN] manager: No servers available
This seems weird for many reasons:
Hi @kamaradclimber these errors are related to the Serf protocol:
* Failed to join 10.71.4.141: Node 'web-fbx019-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
* Failed to join 10.71.4.142: Node 'hostw145-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
* Failed to join 10.71.4.143: Node 'couchs01e23-ty5.storage.criteo.prod' protocol version (2) is incompatible: [1, 0]
Consul 0.7 dropped support for protocol version 1, which hasn't been used since Consul 0.3. What version of Consul is running on web-fbx019-ty5.ty5, hostw145-ty5.ty5, and couchs01e23-ty5?
According to consul members, those nodes are running 0.7.3-criteo1:
couchs01e23-ty5.storage.criteo.prod 10.211.136.240:8301 alive client 0.7.3criteo1 2 ty5
hostw145-ty5.ty5.ad.criteo.prod 10.211.161.238:8301 alive client 0.7.3criteo1 2 ty5
web-fbx019-ty5.ty5.ad.criteo.prod 10.211.165.91:8301 alive client 0.7.3criteo1 2 ty5
Here is a summary of versions in that consul cluster:
1692 0.7.3criteo1
70 0.8.4
110 0.8.5
We finally solved the issue by restarting a consul server that looked weird.
This server was detected because many agents in the serf cluster did not see it as part of the serf cluster.
For instance, some agents saw:
consul members|grep server
consul01-ty5.central.criteo.prod 10.71.4.142:8301 alive server 0.8.4 2 ty5
consul02-ty5.central.criteo.prod 10.71.4.141:8301 alive server 0.8.4 2 ty5
while others (including the faulty server, consul03-ty5.central.criteo.prod) saw:
consul01-ty5.central.criteo.prod 10.71.4.142:8301 alive server 0.8.4 2 ty5
consul02-ty5.central.criteo.prod 10.71.4.141:8301 alive server 0.8.4 2 ty5
consul03-ty5.central.criteo.prod 10.71.4.143:8301 alive server 0.8.4 2 ty5
Using code from https://github.com/criteo-forks/consul/commit/4e58f83a4882ab51e0a2a82edf3029ca28b65f1e during the incident, we could see:
Node 'mems12e06-ty5.storage.criteo.prod' put constraint on maxpmin to 1
Node 'mems12e06-ty5.storage.criteo.prod' put constraint on minpmax to 5
Node 'mems12e06-ty5.storage.criteo.prod' put constraint on maxdmin to 2
Node 'mems12e06-ty5.storage.criteo.prod' put constraint on mindmax to 4
Node 'mesos-slave035-ty5.central.criteo.prod' put constraint on minpmax to 0
Node 'mesos-slave035-ty5.central.criteo.prod' put constraint on mindmax to 0
End of iteration on remote nodes
Node 'web-rtb032-ty5.ty5.ad.criteo.prod' put constraint on maxpmin to 1
Node 'web-rtb032-ty5.ty5.ad.criteo.prod' put constraint on minpmax to 5
Node 'web-rtb032-ty5.ty5.ad.criteo.prod' put constraint on maxdmin to 2
Node 'web-rtb032-ty5.ty5.ad.criteo.prod' put constraint on mindmax to 4
Node 'mesos-slave035-ty5.central.criteo.prod' put constraint on minpmax to 0
Node 'mesos-slave035-ty5.central.criteo.prod' put constraint on mindmax to 0
End of iteration on remote nodes
Node 'web-fbx024-ty5.ty5.ad.criteo.prod' put constraint on maxpmin to 1
Node 'web-fbx024-ty5.ty5.ad.criteo.prod' put constraint on minpmax to 5
Node 'web-fbx024-ty5.ty5.ad.criteo.prod' put constraint on maxdmin to 2
Node 'web-fbx024-ty5.ty5.ad.criteo.prod' put constraint on mindmax to 4
Node 'mesos-slave035-ty5.central.criteo.prod' put constraint on minpmax to 0
Node 'mesos-slave035-ty5.central.criteo.prod' put constraint on mindmax to 0
End of iteration on remote nodes
2017/07/04 18:59:22 [INFO] agent: (LAN) joined: 0 Err: 3 error(s) occurred:
* Failed to join 10.71.4.141: Node 'mems12e06-ty5.storage.criteo.prod' protocol version (2) is incompatible: [1, 0]
* Failed to join 10.71.4.142: Node 'web-rtb032-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
* Failed to join 10.71.4.143: Node 'web-fbx024-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
2017/07/04 18:59:22 [WARN] agent: Join failed: <nil>, retrying in 30s
I think the weird part can be seen when minpmax is set to 0.
Anyway, the incident is closed on our side. I'd be happy to contribute to a discussion on how to detect such scenarios better.
Thanks for the follow-up note @kamaradclimber. I'm not sure how things got into that state, and I've never seen anything like this before. It's odd that the server was related to the issue while the weird version in the log is associated with mesos-slave035-ty5.central.criteo.prod. Am I missing something about mesos-slave035-ty5.central.criteo.prod?
We are facing a similar issue in our environment which we haven't been able to fix so far. At the moment we have 462 nodes running consul: 375 are on 0.9.x, 61 on 0.8.x, and 10 on 0.7.x. The Consul servers are all running 0.9.x.
Some of the messages we have seen so far (which look very similar to each other and may have the same root cause) are:
Error 1:
2017/11/27 17:33:54 [ERR] memberlist: Push/Pull with i-asdasdasd failed: Node 'i-asdasdsda' protocol version (2) is incompatible: [1, 0]
Error 2:
2017/11/27 17:33:42 [ERR] memberlist: Failed push/pull merge: Node 'i-asdasdsadd' protocol version (2) is incompatible: [1, 0] from=10.x.x.x:60109
We have been wondering where those [1, 0] values come from; as far as we know, the default protocol version spoken within the cluster is version 2, which is also set in the source code as the minimum version.
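For what it's worth, here is a simplified, self-contained sketch of the kind of range negotiation memberlist performs (node names and the exact structure are illustrative, not the real implementation): the cluster-wide acceptable range is the maximum of all advertised protocol minimums and the minimum of all advertised maximums, so a single member entry advertising an all-zero version vector collapses the range to something like [1, 0], which no node can satisfy.

```go
package main

import "fmt"

// member holds the protocol version range a node advertises:
// pmin/pmax are the min/max protocol versions it understands,
// pcur is the version it currently speaks.
type member struct {
	name             string
	pmin, pmax, pcur uint8
}

// compatibleRange mirrors, in simplified form, how memberlist derives
// the cluster-wide acceptable range: max of all minimums, min of all
// maximums.
func compatibleRange(members []member) (maxpmin, minpmax uint8) {
	minpmax = 255
	for _, m := range members {
		if m.pmin > maxpmin {
			maxpmin = m.pmin
		}
		if m.pmax < minpmax {
			minpmax = m.pmax
		}
	}
	return
}

func main() {
	healthy := []member{
		{"node-a", 1, 5, 2},
		{"node-b", 1, 5, 2},
	}
	lo, hi := compatibleRange(healthy)
	fmt.Printf("healthy cluster range: [%d, %d]\n", lo, hi) // [1, 5]

	// One corrupt entry advertising all-zero versions poisons the range:
	poisoned := append(healthy, member{"node-corrupt", 0, 0, 0})
	lo, hi = compatibleRange(poisoned)
	fmt.Printf("poisoned cluster range: [%d, %d]\n", lo, hi) // [1, 0]

	// Every node speaking protocol 2 now fails the check, producing
	// errors shaped like the ones in the logs above.
	for _, m := range poisoned {
		if m.pcur < lo || m.pcur > hi {
			fmt.Printf("Node '%s' protocol version (%d) is incompatible: [%d, %d]\n",
				m.name, m.pcur, lo, hi)
		}
	}
}
```

This would also explain why the errors reference seemingly random nodes: every healthy node's version fails the poisoned range check, not just the corrupt node's.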
Feel free to reach me if you guys need more information :)
Thanks @kamaradclimber. With the patch you provided, we were able to identify a node that was causing the issue. This is pretty much the output we got:
./consul agent -config-dir /etc/consul
==> Starting Consul agent...
==> Joining cluster...
Node 'i-000000000' put constraint on maxpmin to 1
Node 'i-000000000' put constraint on minpmax to 5
Node 'i-000000000' put constraint on maxdmin to 2
Node 'i-000000000' put constraint on mindmax to 5
Node 'i-555555555' put constraint on minpmax to 4
Node 'i-555555555' put constraint on mindmax to 4
End of iteration on remote nodes
==> 1 error(s) occurred:
* Failed to join 10.51.229.135: Node 'i-abcdfe' protocol version (0) is incompatible: [1, 4]
After forcing the node out of the cluster and wiping its data directory, we were able to re-join the failing nodes.
I work with @kamusin; we saw this again yesterday and used the patch above to debug. There was a node that allowed inbound traffic on 8301 but not outbound, so it somehow became part of the cluster at one point, but subsequently couldn't initiate connections to anything. Eventually it seemed like the server thought its maximum supported protocol version was 0, and nothing new could join the cluster.
We resolved it by stopping consul on the node and then doing a force-leave.
I just experienced this same bug on a Consul cluster with 6 servers and 800+ agents, all running Consul v0.9.3. This was during normal operation of the cluster, not an upgrade as described in the original post. But as in earlier comments, new nodes could not join the cluster, and existing nodes were logging the following error at high frequency:
[ERR] memberlist: Failed push/pull merge: Node 'i-xxxxxxxxxxxxxxx' protocol version (2) is incompatible: [1, 0]
Oddly, each instance of the above error referenced a random node, seemingly indicating that a large number of nodes were faulty, which was not actually the case.
Using @kamaradclimber's patch, I found what turned out to be a single culprit, despite the numerous errors. After forcing the bad node out of the cluster and wiping its data directory, it rejoined fine. Other nodes were then able to join without issue as well, and the protocol version error messages ceased.
Another difference from the original post is that in my case the culprit was an agent, not a server.
Tagging this as a bug (which will likely end up fixed in memberlist or Serf). It looks like there's some case where we can poison the version checking algorithm with some zero-valued node entries in the member list.
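One possible shape for such a fix, purely as a sketch (`validVsn` is a hypothetical helper, not the actual memberlist change): validate each member's advertised version vector and skip zeroed or internally inconsistent entries before they can constrain the cluster-wide range.

```go
package main

import "fmt"

// validVsn is a hypothetical guard. Memberlist encodes each node's
// versions as a 6-byte vector [pmin, pmax, pcur, dmin, dmax, dcur];
// an entry that is zeroed or internally inconsistent can only poison
// protocol negotiation, so it should be skipped rather than trusted.
func validVsn(vsn []uint8) bool {
	if len(vsn) != 6 {
		return false
	}
	pmin, pmax, pcur := vsn[0], vsn[1], vsn[2]
	if pmax == 0 || pmin > pmax || pcur < pmin || pcur > pmax {
		return false
	}
	dmin, dmax, dcur := vsn[3], vsn[4], vsn[5]
	if dmin > dmax || dcur < dmin || dcur > dmax {
		return false
	}
	return true
}

func main() {
	fmt.Println(validVsn([]uint8{1, 5, 2, 2, 5, 4})) // healthy entry: true
	fmt.Println(validVsn([]uint8{0, 0, 0, 0, 0, 0})) // zeroed entry: false
}
```

With a guard like this, a corrupt member entry would be ignored (and ideally logged by name) instead of silently collapsing the negotiated range for the whole cluster.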
This is also happening on version 0.9.3, using same version on server and clients.
Same error while migrating from Consul 1.0.6 with patches to 1.1.0 with patches (none of the patches are related to the serf protocol).
Would someone from HashiCorp have any idea regarding the cause of this issue?
@kamaradclimber I just started to dive into the memberlist code yesterday afternoon and am going to continue some more today. Unfortunately I don't have much to report yet.
Am I correct in thinking this only happens during upgrades? If so, how do you go about performing them? (Just kill Consul and restart with a newer version?)
@mkeeler Yes, it seems it happens only when upgrading clients.
It probably means that auto-negotiation does not work well.
Basically we:
At this point all those already migrated servers had been upgraded
Yesterday, we upgraded some of the remaining clients still on 1.0.6, but during the upgrade some of the agents could not join the cluster, with this message.
Our Cluster contains ONLY 1.0.6 and 1.1.0 clients (and only 1.1.0 servers)
After several hours of investigation, it seems the only fix is to restart all Consul servers sequentially; then the new agents can join the cluster properly.
@pierresouchay Are you doing Gossip Encryption? It shouldn't matter, just trying to rule things out.
Yes.
Note that in that case, some of the failing nodes see each other (but no servers in their member lists).
Related issue: https://github.com/hashicorp/consul/issues/4342
Update: I experienced this issue again recently on the same Consul cluster as previously reported, which has been running Consul v1.0.2 for a while now. As with the last time this occurred for us, it was _not_ during a Consul upgrade. However, we've noticed that the problem correlates to operations wherein a large number of agents are restarted, meaning those agents have to leave the cluster and re-join it. I gather the protocol negotiation is a normal part of restarting a cluster, and that if a corrupt node somewhere in the cluster is reporting a bad protocol version, that negotiation cannot succeed. The result in our case is that agents cannot rejoin the cluster after being restarted.
Again, the only method I'm aware of for finding and removing the corrupt node from the cluster is to run a patched version of Consul with https://github.com/criteo-forks/consul/commit/4e58f83a4882ab51e0a2a82edf3029ca28b65f1e included, and use the output from its own failed negotiation attempt to determine the bad node. Without that patch, the logs are not clear on which node is the corrupt one. Perhaps a good place to start in determining the root cause of this bug may be to add more precise logging to the memberlist and/or Serf code that can more clearly identify the culprit?
One more note about this most recent occurrence: Judging by the cluster logs, the corrupt node went bad on a Friday afternoon (coincident to an operation that restarted a large number of agents), but the problem wasn't discovered until a subsequent restart-triggering operation was performed the following Monday. I noticed that it took a long time for the cluster to recover after the corrupt node was forcibly removed: around 15 minutes, which is much longer than it took in previous occurrences. I don't know enough about Consul's internals to speculate as to what that could mean, but it seemed worth mentioning.
@ianwestcott Yes, we confirm this exact behavior
We often have this when restarting many agents at once. Sometimes a simple restart is enough, but sometimes the only reliable way to fix it is to restart all Consul servers sequentially.
Here is more context in a new incident we had:
Jan 21 08:36:44 consul03-ty5 consul[444]: 2019/01/21 08:36:44 [ERR] consul.rpc: multiplex conn accept failed: keepalive timeout from=10.232.22.21:38828
Jan 21 08:36:51 consul03-ty5 consul[444]: 2019/01/21 08:36:51 [INFO] serf: EventMemberUpdate: mesos-slave068-ty5.central.criteo.prod
Jan 21 08:36:57 consul03-ty5 consul[444]: 2019/01/21 08:36:57 [INFO] serf: EventMemberUpdate: mesos-slave048-ty5.central.criteo.prod
Jan 21 08:36:57 consul03-ty5 consul[444]: 2019/01/21 08:36:57 [INFO] serf: EventMemberUpdate: mesos-slave026-ty5.central.criteo.prod
Jan 21 08:36:58 consul03-ty5 consul[444]: 2019/01/21 08:36:58 [ERR] memberlist: Failed push/pull merge: Node 'mesos-slave026-ty5.central.criteo.prod' protocol version (0) is incompatible: [1, 5] from=10.211.13
Jan 21 08:37:15 consul03-ty5 consul[444]: 2019/01/21 08:37:15 [ERR] memberlist: Push/Pull with d8-c4-97-b5-1d-e8.inventory.criteo.prod failed: Node 'web-cat039-ty5.ty5.ad.criteo.prod' protocol version (2) is i
Jan 21 08:37:20 consul03-ty5 consul[444]: 2019/01/21 08:37:20 [ERR] memberlist: Failed push/pull merge: Node 'mems19e08-ty5.storage.criteo.prod' protocol version (2) is incompatible: [1, 0] from=10.211.130.2:3
Jan 21 08:37:24 consul03-ty5 consul[444]: 2019/01/21 08:37:24 [INFO] serf: EventMemberUpdate: mesos-slave110-ty5.central.criteo.prod
Jan 21 08:37:30 consul03-ty5 consul[444]: 2019/01/21 08:37:30 [INFO] serf: EventMemberUpdate: mesos-slave037-ty5.central.criteo.prod
Jan 21 08:37:58 consul03-ty5 consul[444]: 2019/01/21 08:37:58 [ERR] memberlist: Failed push/pull merge: Node 'web-rtb336-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0] from=10.211.130.2:3
Jan 21 08:38:48 consul03-ty5 consul[444]: 2019/01/21 08:38:48 [ERR] memberlist: Failed push/pull merge: Node 'hostw842-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0] from=10.232.14.26:421
Jan 21 08:38:51 consul03-ty5 consul[444]: 2019/01/21 08:38:51 [INFO] serf: EventMemberFailed: d8-c4-97-71-a3-99.inventory.criteo.prod 10.232.64.28
Jan 21 08:38:51 consul03-ty5 consul[444]: 2019/01/21 08:38:51 [INFO] serf: EventMemberJoin: d8-c4-97-71-a3-99.inventory.criteo.prod 10.232.64.28
Jan 21 08:38:51 consul03-ty5 consul[444]: 2019/01/21 08:38:51 [ERR] memberlist: Failed push/pull merge: Node 'mesos-slave049-ty5.central.criteo.prod' protocol version (2) is incompatible: [1, 0] from=10.211.13
We can clearly see that before this line:
Jan 21 08:36:58 consul03-ty5 consul[444]: 2019/01/21 08:36:58 [ERR] memberlist: Failed push/pull merge: Node 'mesos-slave026-ty5.central.criteo.prod' protocol version (0) is incompatible: [1, 5] from=10.211.13
no problem was seen, but immediately after it we get:
Jan 21 08:37:20 consul03-ty5 consul[444]: 2019/01/21 08:37:20 [ERR] memberlist: Failed push/pull merge: Node 'mems19e08-ty5.storage.criteo.prod' protocol version (2) is incompatible: [1, 0] from=10.211.130.2:3
and this message will keep repeating...
In order to fix it, I had to restart all the servers sequentially.
@mkeeler What this means is that a single agent DID change the range of acceptable protocols, and it then changed the acceptable range of protocols on the servers' side. Quite a good starting point for finding the root cause, what do you think? We are going to try to provide a PR if we find the error that causes this corruption.
Now that the PR has been merged into memberlist, we will need to re-vendor to pull in the changes.
@mkeeler PR in Consul: https://github.com/hashicorp/consul/pull/5313