consul version for both Client and Server

Client:
Consul v0.7.3-criteo1-criteo1 (f3d518bc+CHANGES)
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Server:
Consul v0.8.4
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
The custom criteo version is based on f3d518bc and only contains patches from #2474 and #2657.
consul info for both Client and Server

Client:
agent:
check_monitors = 0
check_ttls = 0
checks = 2
services = 3
build:
prerelease = criteo1
revision = 'f3d518b
version = 0.7.3
consul:
known_servers = 0
server = false
runtime:
arch = amd64
cpu_count = 24
goroutines = 30
max_procs = 2
os = windows
version = go1.7.4
serf_lan:
encrypted = false
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1
members = 1
query_queue = 0
query_time = 1
Server:
agent:
check_monitors = 0
check_ttls = 0
checks = 1
services = 3
build:
prerelease =
revision = f436077
version = 0.8.4
consul:
bootstrap = false
known_datacenters = 9
leader = false
leader_addr = 10.71.4.141:8300
server = true
raft:
applied_index = 213985112
commit_index = 213985112
fsm_pending = 0
last_contact = 12.873464ms
last_log_index = 213985113
last_log_term = 5511
last_snapshot_index = 213979994
last_snapshot_term = 5511
latest_configuration = [{Suffrage:Voter ID:10.71.4.143:8300 Address:10.71.4.143:8300} {Suffrage:Voter ID:10.71.4.142:8300 Address:10.71.4.142:8300} {Suffrage:Voter ID:10.71.4.141:8300 Address:10.71.4.141:8300}]
latest_configuration_index = 129059017
num_peers = 2
protocol_version = 2
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 5511
runtime:
arch = amd64
cpu_count = 24
goroutines = 8348
max_procs = 23
os = linux
version = go1.8.3
serf_lan:
coordinate_resets = 0
encrypted = false
event_queue = 0
event_time = 1872
failed = 0
health_score = 0
intent_queue = 0
left = 10
member_time = 1095554
members = 1872
query_queue = 0
query_time = 370
serf_wan:
coordinate_resets = 0
encrypted = false
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 66250
members = 27
query_queue = 0
query_time = 1
50% of the servers are Linux (mostly CentOS 7), 50% are Windows Server 2012 R2.
Consul agents already present in the serf cluster upgraded without issues.
New consul agents added during the upgrade cannot join the cluster.
Here are the logs from an agent (on Windows):
2017/07/03 14:35:46 [WARN] manager: No servers available
2017/07/03 14:35:46 [ERR] agent: failed to sync remote state: No known Consul servers
2017/07/03 14:35:50 [INFO] agent: (LAN) joining: [consul02-ty5.central.criteo.prod consul01-ty5.central.criteo.prod consul03-ty5.central.criteo.prod]
2017/07/03 14:35:50 [DEBUG] memberlist: TCP-first lookup failed for 'consul02-ty5.central.criteo.prod:8301', falling back to UDP: open /etc/resolv.conf: The system cannot find the path specified.
2017/07/03 14:35:50 [DEBUG] memberlist: Initiating push/pull sync with: 10.71.4.141:8301
2017/07/03 14:35:50 [DEBUG] memberlist: Failed to join 10.71.4.141: Node 'web-fbx019-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
2017/07/03 14:35:50 [DEBUG] memberlist: TCP-first lookup failed for 'consul01-ty5.central.criteo.prod:8301', falling back to UDP: open /etc/resolv.conf: The system cannot find the path specified.
2017/07/03 14:35:50 [DEBUG] memberlist: Initiating push/pull sync with: 10.71.4.142:8301
2017/07/03 14:35:50 [DEBUG] memberlist: Failed to join 10.71.4.142: Node 'hostw145-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
2017/07/03 14:35:50 [DEBUG] memberlist: TCP-first lookup failed for 'consul03-ty5.central.criteo.prod:8301', falling back to UDP: open /etc/resolv.conf: The system cannot find the path specified.
2017/07/03 14:35:50 [DEBUG] memberlist: Initiating push/pull sync with: 10.71.4.143:8301
2017/07/03 14:35:50 [DEBUG] memberlist: Failed to join 10.71.4.143: Node 'couchs01e23-ty5.storage.criteo.prod' protocol version (2) is incompatible: [1, 0]
2017/07/03 14:35:50 [INFO] agent: (LAN) joined: 0 Err: 3 error(s) occurred:
* Failed to join 10.71.4.141: Node 'web-fbx019-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
* Failed to join 10.71.4.142: Node 'hostw145-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
* Failed to join 10.71.4.143: Node 'couchs01e23-ty5.storage.criteo.prod' protocol version (2) is incompatible: [1, 0]
2017/07/03 14:35:50 [WARN] agent: Join failed: <nil>, retrying in 30s
2017/07/03 14:35:50 [WARN] manager: No servers available
This seems weird for many reasons:
Hi @kamaradclimber these errors are related to the Serf protocol:
* Failed to join 10.71.4.141: Node 'web-fbx019-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
* Failed to join 10.71.4.142: Node 'hostw145-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
* Failed to join 10.71.4.143: Node 'couchs01e23-ty5.storage.criteo.prod' protocol version (2) is incompatible: [1, 0]
Consul 0.7 dropped support for protocol version 1, which hasn't been used since Consul 0.3. What version of Consul is running on web-fbx019-ty5.ty5, hostw145-ty5.ty5, and couchs01e23-ty5?
According to consul members, those nodes are running 0.7.3-criteo1:
couchs01e23-ty5.storage.criteo.prod 10.211.136.240:8301 alive client 0.7.3criteo1 2 ty5
hostw145-ty5.ty5.ad.criteo.prod 10.211.161.238:8301 alive client 0.7.3criteo1 2 ty5
web-fbx019-ty5.ty5.ad.criteo.prod 10.211.165.91:8301 alive client 0.7.3criteo1 2 ty5
Here is a summary of versions in that consul cluster:
1692 0.7.3criteo1
70 0.8.4
110 0.8.5
We finally solved the issue by restarting a consul server that looked weird.
This server was detected because many agents in the serf cluster did not see it as part of the serf cluster.
For instance, some agents saw:
consul members|grep server
consul01-ty5.central.criteo.prod 10.71.4.142:8301 alive server 0.8.4 2 ty5
consul02-ty5.central.criteo.prod 10.71.4.141:8301 alive server 0.8.4 2 ty5
while others (including the faulty server, consul03-ty5.central.criteo.prod) saw:
consul01-ty5.central.criteo.prod 10.71.4.142:8301 alive server 0.8.4 2 ty5
consul02-ty5.central.criteo.prod 10.71.4.141:8301 alive server 0.8.4 2 ty5
consul03-ty5.central.criteo.prod 10.71.4.143:8301 alive server 0.8.4 2 ty5
Using code from https://github.com/criteo-forks/consul/commit/4e58f83a4882ab51e0a2a82edf3029ca28b65f1e during the incident, we could see:
Node 'mems12e06-ty5.storage.criteo.prod' put constraint on maxpmin to 1
Node 'mems12e06-ty5.storage.criteo.prod' put constraint on minpmax to 5
Node 'mems12e06-ty5.storage.criteo.prod' put constraint on maxdmin to 2
Node 'mems12e06-ty5.storage.criteo.prod' put constraint on mindmax to 4
Node 'mesos-slave035-ty5.central.criteo.prod' put constraint on minpmax to 0
Node 'mesos-slave035-ty5.central.criteo.prod' put constraint on mindmax to 0
End of iteration on remote nodes
Node 'web-rtb032-ty5.ty5.ad.criteo.prod' put constraint on maxpmin to 1
Node 'web-rtb032-ty5.ty5.ad.criteo.prod' put constraint on minpmax to 5
Node 'web-rtb032-ty5.ty5.ad.criteo.prod' put constraint on maxdmin to 2
Node 'web-rtb032-ty5.ty5.ad.criteo.prod' put constraint on mindmax to 4
Node 'mesos-slave035-ty5.central.criteo.prod' put constraint on minpmax to 0
Node 'mesos-slave035-ty5.central.criteo.prod' put constraint on mindmax to 0
End of iteration on remote nodes
Node 'web-fbx024-ty5.ty5.ad.criteo.prod' put constraint on maxpmin to 1
Node 'web-fbx024-ty5.ty5.ad.criteo.prod' put constraint on minpmax to 5
Node 'web-fbx024-ty5.ty5.ad.criteo.prod' put constraint on maxdmin to 2
Node 'web-fbx024-ty5.ty5.ad.criteo.prod' put constraint on mindmax to 4
Node 'mesos-slave035-ty5.central.criteo.prod' put constraint on minpmax to 0
Node 'mesos-slave035-ty5.central.criteo.prod' put constraint on mindmax to 0
End of iteration on remote nodes
2017/07/04 18:59:22 [INFO] agent: (LAN) joined: 0 Err: 3 error(s) occurred:
* Failed to join 10.71.4.141: Node 'mems12e06-ty5.storage.criteo.prod' protocol version (2) is incompatible: [1, 0]
* Failed to join 10.71.4.142: Node 'web-rtb032-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
* Failed to join 10.71.4.143: Node 'web-fbx024-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
2017/07/04 18:59:22 [WARN] agent: Join failed: <nil>, retrying in 30s
I think the weird part can be seen when minpmax is set to 0.
Anyway, the incident is closed on our side. I'd be happy to contribute to a discussion on how to detect such scenarios better.
Thanks for the follow-up note @kamaradclimber. I'm not sure how things got into that state, and I've never seen anything like this before. It's odd that the server was related to the issue while the weird version in the log is associated with mesos-slave035-ty5.central.criteo.prod. Am I missing something about mesos-slave035-ty5.central.criteo.prod?
We are facing a similar issue in our environment which we haven't been able to fix so far. At the moment we have 462 nodes running consul: 375 are on 0.9.x, 61 on 0.8.x, and 10 on 0.7.x. The Consul servers are all running 0.9.x.
Some of the messages we have seen so far (which look very similar to each other and may have the same root cause) are:
Error 1:
2017/11/27 17:33:54 [ERR] memberlist: Push/Pull with i-asdasdasd failed: Node 'i-asdasdsda' protocol version (2) is incompatible: [1, 0]
Error 2:
2017/11/27 17:33:42 [ERR] memberlist: Failed push/pull merge: Node 'i-asdasdsadd' protocol version (2) is incompatible: [1, 0] from=10.x.x.x:60109
We have been wondering where those [1, 0] values come from; as far as we know, the default protocol version spoken within the cluster is version 2, which is also set in the source code as the minimum version.
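For what it's worth, here is a simplified, self-contained sketch of the kind of range negotiation memberlist performs (node names and the exact structure are illustrative, not the real implementation): the cluster-wide acceptable range is the maximum of all advertised protocol minimums and the minimum of all advertised maximums, so a single member entry advertising an all-zero version vector collapses the range to something like [1, 0], which no node can satisfy.

```go
package main

import "fmt"

// member holds the protocol version range a node advertises:
// pmin/pmax are the min/max protocol versions it understands,
// pcur is the version it currently speaks.
type member struct {
	name             string
	pmin, pmax, pcur uint8
}

// compatibleRange mirrors, in simplified form, how memberlist derives
// the cluster-wide acceptable range: max of all minimums, min of all
// maximums.
func compatibleRange(members []member) (maxpmin, minpmax uint8) {
	minpmax = 255
	for _, m := range members {
		if m.pmin > maxpmin {
			maxpmin = m.pmin
		}
		if m.pmax < minpmax {
			minpmax = m.pmax
		}
	}
	return
}

func main() {
	healthy := []member{
		{"node-a", 1, 5, 2},
		{"node-b", 1, 5, 2},
	}
	lo, hi := compatibleRange(healthy)
	fmt.Printf("healthy cluster range: [%d, %d]\n", lo, hi) // [1, 5]

	// One corrupt entry advertising all-zero versions poisons the range:
	poisoned := append(healthy, member{"node-corrupt", 0, 0, 0})
	lo, hi = compatibleRange(poisoned)
	fmt.Printf("poisoned cluster range: [%d, %d]\n", lo, hi) // [1, 0]

	// Every node speaking protocol 2 now fails the check, producing
	// errors shaped like the ones in the logs above.
	for _, m := range poisoned {
		if m.pcur < lo || m.pcur > hi {
			fmt.Printf("Node '%s' protocol version (%d) is incompatible: [%d, %d]\n",
				m.name, m.pcur, lo, hi)
		}
	}
}
```

This would also explain why the errors reference seemingly random nodes: every healthy node's version fails the poisoned range check, not just the corrupt node's.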
Feel free to reach me if you guys need more information :)
Thanks @kamaradclimber. With the patch you provided, we were able to identify a node that was causing the issue. This is pretty much the output we got:
./consul agent -config-dir /etc/consul
==> Starting Consul agent...
==> Joining cluster...
Node 'i-000000000' put constraint on maxpmin to 1
Node 'i-000000000' put constraint on minpmax to 5
Node 'i-000000000' put constraint on maxdmin to 2
Node 'i-000000000' put constraint on mindmax to 5
Node 'i-555555555' put constraint on minpmax to 4
Node 'i-555555555' put constraint on mindmax to 4
End of iteration on remote nodes
==> 1 error(s) occurred:
* Failed to join 10.51.229.135: Node 'i-abcdfe' protocol version (0) is incompatible: [1, 4]
After forcing the node out of the cluster and wiping its data directory, we were able to re-join the failing nodes.
I work with @kamusin; we saw this again yesterday and used the patch above to debug. There was a node that allowed inbound traffic on 8301 but not outbound, so it somehow became part of the cluster at one point, but subsequently couldn't initiate connections to anything. Eventually it seemed like the server thought its maximum supported protocol version was 0, and nothing new could join the cluster.
We resolved it by stopping consul on the node and then doing a force-leave.
I just experienced this same bug on a Consul cluster with 6 servers and 800+ agents, all running Consul v0.9.3. This was during normal operation of the cluster, not an upgrade as described in the original post. But as in earlier comments, new nodes could not join the cluster, and existing nodes were logging the following error at high frequency:
[ERR] memberlist: Failed push/pull merge: Node 'i-xxxxxxxxxxxxxxx' protocol version (2) is incompatible: [1, 0]
Oddly, each instance of the above error referenced a random node, seemingly indicating that a large number of nodes were faulty, which was not actually the case.
Using @kamaradclimber's patch, I found what turned out to be a single culprit, despite the numerous errors. After forcing the bad node out of the cluster and wiping its data directory, it rejoined fine. Other nodes were then able to join without issue as well, and the protocol version error messages ceased.
Another difference from the original post is that in my case the culprit was an agent, not a server.
Tagging this as a bug (which will likely end up fixed in memberlist or Serf). It looks like there's some case where we can poison the version checking algorithm with some zero-valued node entries in the member list.
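One possible shape for such a fix, purely as a sketch (`validVsn` is a hypothetical helper, not the actual memberlist change): validate each member's advertised version vector and skip zeroed or internally inconsistent entries before they can constrain the cluster-wide range.

```go
package main

import "fmt"

// validVsn is a hypothetical guard. Memberlist encodes each node's
// versions as a 6-byte vector [pmin, pmax, pcur, dmin, dmax, dcur];
// an entry that is zeroed or internally inconsistent can only poison
// protocol negotiation, so it should be skipped rather than trusted.
func validVsn(vsn []uint8) bool {
	if len(vsn) != 6 {
		return false
	}
	pmin, pmax, pcur := vsn[0], vsn[1], vsn[2]
	if pmax == 0 || pmin > pmax || pcur < pmin || pcur > pmax {
		return false
	}
	dmin, dmax, dcur := vsn[3], vsn[4], vsn[5]
	if dmin > dmax || dcur < dmin || dcur > dmax {
		return false
	}
	return true
}

func main() {
	fmt.Println(validVsn([]uint8{1, 5, 2, 2, 5, 4})) // healthy entry: true
	fmt.Println(validVsn([]uint8{0, 0, 0, 0, 0, 0})) // zeroed entry: false
}
```

With a guard like this, a corrupt member entry would be ignored (and ideally logged by name) instead of silently collapsing the negotiated range for the whole cluster.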
This is also happening on version 0.9.3, using same version on server and clients.
Same error while migrating from Consul 1.0.6 with patches to 1.1.0 with patches (none of the patches are related to the serf protocol).
Would someone from HashiCorp have any idea regarding the cause of this issue?
@kamaradclimber I just started to dive into the memberlist code yesterday afternoon and am going to continue some more today. Unfortunately I don't have much to report yet.
Am I correct in thinking this only happens during upgrades? If so, how do you go about performing them? (Just kill Consul and restart with a newer version?)
@mkeeler Yes, it seems it happens only when upgrading clients.
It probably means that auto-negotiation does not work well.
Basically we:
At this point all those already migrated servers had been upgraded
Yesterday, we upgraded some of the remaining clients still on 1.0.6, but during the upgrade some of the agents could not join the cluster, with this message.
Our Cluster contains ONLY 1.0.6 and 1.1.0 clients (and only 1.1.0 servers)
After several hours of investigation, it seems the only fix is to restart all Consul servers sequentially; then the new agents can join the cluster properly.
@pierresouchay Are you doing Gossip Encryption? It shouldn't matter, just trying to rule things out.
Yes.
Note that in that case, some of the failing nodes see each other (but no servers in their member lists).
Related issue: https://github.com/hashicorp/consul/issues/4342
Update: I experienced this issue again recently on the same Consul cluster as previously reported, which has been running Consul v1.0.2 for a while now. As with the last time this occurred for us, it was _not_ during a Consul upgrade. However, we've noticed that the problem correlates to operations wherein a large number of agents are restarted, meaning those agents have to leave the cluster and re-join it. I gather the protocol negotiation is a normal part of restarting a cluster, and that if a corrupt node somewhere in the cluster is reporting a bad protocol version, that negotiation cannot succeed. The result in our case is that agents cannot rejoin the cluster after being restarted.
Again, the only method I'm aware of for finding and removing the corrupt node from the cluster is to run a patched version of Consul with https://github.com/criteo-forks/consul/commit/4e58f83a4882ab51e0a2a82edf3029ca28b65f1e included, and use the output from its own failed negotiation attempt to determine the bad node. Without that patch, the logs are not clear on which node is the corrupt one. Perhaps a good place to start in determining the root cause of this bug may be to add more precise logging to the memberlist and/or Serf code that can more clearly identify the culprit?
One more note about this most recent occurrence: Judging by the cluster logs, the corrupt node went bad on a Friday afternoon (coincident to an operation that restarted a large number of agents), but the problem wasn't discovered until a subsequent restart-triggering operation was performed the following Monday. I noticed that it took a long time for the cluster to recover after the corrupt node was forcibly removed: around 15 minutes, which is much longer than it took in previous occurrences. I don't know enough about Consul's internals to speculate as to what that could mean, but it seemed worth mentioning.
@ianwestcott Yes, we confirm this exact behavior
We often have this when restarting many agents at once. Sometimes a simple restart is enough, but sometimes the only reliable way to fix it is to restart all Consul servers sequentially.
Here is more context in a new incident we had:
Jan 21 08:36:44 consul03-ty5 consul[444]: 2019/01/21 08:36:44 [ERR] consul.rpc: multiplex conn accept failed: keepalive timeout from=10.232.22.21:38828
Jan 21 08:36:51 consul03-ty5 consul[444]: 2019/01/21 08:36:51 [INFO] serf: EventMemberUpdate: mesos-slave068-ty5.central.criteo.prod
Jan 21 08:36:57 consul03-ty5 consul[444]: 2019/01/21 08:36:57 [INFO] serf: EventMemberUpdate: mesos-slave048-ty5.central.criteo.prod
Jan 21 08:36:57 consul03-ty5 consul[444]: 2019/01/21 08:36:57 [INFO] serf: EventMemberUpdate: mesos-slave026-ty5.central.criteo.prod
Jan 21 08:36:58 consul03-ty5 consul[444]: 2019/01/21 08:36:58 [ERR] memberlist: Failed push/pull merge: Node 'mesos-slave026-ty5.central.criteo.prod' protocol version (0) is incompatible: [1, 5] from=10.211.13
Jan 21 08:37:15 consul03-ty5 consul[444]: 2019/01/21 08:37:15 [ERR] memberlist: Push/Pull with d8-c4-97-b5-1d-e8.inventory.criteo.prod failed: Node 'web-cat039-ty5.ty5.ad.criteo.prod' protocol version (2) is i
Jan 21 08:37:20 consul03-ty5 consul[444]: 2019/01/21 08:37:20 [ERR] memberlist: Failed push/pull merge: Node 'mems19e08-ty5.storage.criteo.prod' protocol version (2) is incompatible: [1, 0] from=10.211.130.2:3
Jan 21 08:37:24 consul03-ty5 consul[444]: 2019/01/21 08:37:24 [INFO] serf: EventMemberUpdate: mesos-slave110-ty5.central.criteo.prod
Jan 21 08:37:30 consul03-ty5 consul[444]: 2019/01/21 08:37:30 [INFO] serf: EventMemberUpdate: mesos-slave037-ty5.central.criteo.prod
Jan 21 08:37:58 consul03-ty5 consul[444]: 2019/01/21 08:37:58 [ERR] memberlist: Failed push/pull merge: Node 'web-rtb336-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0] from=10.211.130.2:3
Jan 21 08:38:48 consul03-ty5 consul[444]: 2019/01/21 08:38:48 [ERR] memberlist: Failed push/pull merge: Node 'hostw842-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0] from=10.232.14.26:421
Jan 21 08:38:51 consul03-ty5 consul[444]: 2019/01/21 08:38:51 [INFO] serf: EventMemberFailed: d8-c4-97-71-a3-99.inventory.criteo.prod 10.232.64.28
Jan 21 08:38:51 consul03-ty5 consul[444]: 2019/01/21 08:38:51 [INFO] serf: EventMemberJoin: d8-c4-97-71-a3-99.inventory.criteo.prod 10.232.64.28
Jan 21 08:38:51 consul03-ty5 consul[444]: 2019/01/21 08:38:51 [ERR] memberlist: Failed push/pull merge: Node 'mesos-slave049-ty5.central.criteo.prod' protocol version (2) is incompatible: [1, 0] from=10.211.13
We can clearly see that before this line:
Jan 21 08:36:58 consul03-ty5 consul[444]: 2019/01/21 08:36:58 [ERR] memberlist: Failed push/pull merge: Node 'mesos-slave026-ty5.central.criteo.prod' protocol version (0) is incompatible: [1, 5] from=10.211.13
no problem was seen, but immediately after it we get:
Jan 21 08:37:20 consul03-ty5 consul[444]: 2019/01/21 08:37:20 [ERR] memberlist: Failed push/pull merge: Node 'mems19e08-ty5.storage.criteo.prod' protocol version (2) is incompatible: [1, 0] from=10.211.130.2:3
and this message will keep repeating...
In order to fix it, I had to restart all the servers sequentially.
@mkeeler What this means is that a single agent DID change the range of acceptable protocols, and it then changed the acceptable range of protocols on the servers' side. Quite a good starting point for finding the root cause, what do you think? We are going to try to provide a PR if we find the error that causes this corruption.
Now that the PR has been merged into memberlist, we will need to re-vendor to pull in the changes.
@mkeeler PR in Consul: https://github.com/hashicorp/consul/pull/5313