Currently the `consul leave` command triggers a graceful leave and shutdown of the agent it is called on. Consul should provide a command and HTTP API endpoint for servers to leave the WAN pool without also leaving the LAN pool or shutting down the agent.
The main use-case would be to split up two WAN-joined datacenters.
~~The only way to do this without downtime currently is to:~~
~~1. Block cross-DC server communication.~~
~~2. Have a server in each DC call `consul force-leave <node-name>.<dc>` on all the servers in the other DC (once the servers in the other DC are marked as failed).~~
There is currently no workaround for this.
This is a duplicate of #3307 but this has more context so might be a better issue. There was an attempt to do this in #3414 that has some more context and some issues that we ran into.
Relatedly, I don't think `consul force-leave <node-name>.<dc>` actually works as documented currently. When I test it with a federated cluster, even when all nodes are up and healthy I get:
```
$ consul members -wan
Node            Address          Status  Type    Build     Protocol  DC   Segment
node-24507.dc1  127.0.0.1:24509  alive   server  1.7.0dev  2         dc1  <all>
node-24512.dc1  127.0.0.1:24514  alive   server  1.7.0dev  2         dc1  <all>
node-24532.dc2  127.0.0.1:24534  alive   server  1.7.0dev  2         dc2  <all>
node-24537.dc2  127.0.0.1:24539  alive   server  1.7.0dev  2         dc2  <all>
node-24542.dc2  127.0.0.1:24544  alive   server  1.7.0dev  2         dc2  <all>
node-24562.dc3  127.0.0.1:24564  alive   server  1.7.0dev  2         dc3  <all>
node-24567.dc3  127.0.0.1:24569  alive   server  1.7.0dev  2         dc3  <all>
node-24572.dc3  127.0.0.1:24574  alive   server  1.7.0dev  2         dc3  <all>
node-8500.dc1   127.0.0.1:8302   alive   server  1.7.0dev  2         dc1  <all>

$ consul force-leave -prune node-24562.dc3
Error force leaving: Unexpected response code: 500 (agent: No node found with name 'node-24562.dc3')
```
@banks there was a change made that broke that workaround. This was to fix an issue where force-leave called in the current DC without the DC suffix led to the force-left node not leaving the WAN pool.
The relevant code is here:
https://github.com/hashicorp/consul/blob/master/agent/consul/server.go#L1127
When you call `force-leave` with `<node-name>.dc2` from dc1, we naively append the local datacenter for the WAN pool removal, so the lookup is made with `<node-name>.dc2.dc1`, which matches no WAN member.
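A minimal sketch of the suffix-appending bug described above (the function names and the fixed variant are illustrative, not the actual code in `agent/consul/server.go`):

```go
package main

import (
	"fmt"
	"strings"
)

// wanNodeName mimics the naive qualification: the local datacenter is
// appended unconditionally, so a name that already carries a remote DC
// suffix ("node-1.dc2") becomes "node-1.dc2.dc1" and matches nothing
// in the WAN pool.
func wanNodeName(node, localDC string) string {
	return node + "." + localDC
}

// wanNodeNameFixed is a hypothetical fix for illustration: only append
// the local DC when the name is not already DC-qualified. (Simplified —
// real node names could themselves contain dots.)
func wanNodeNameFixed(node, localDC string) string {
	if strings.Contains(node, ".") {
		return node // already qualified, e.g. "node-1.dc2"
	}
	return node + "." + localDC
}

func main() {
	fmt.Println(wanNodeName("node-24562.dc2", "dc1"))      // node-24562.dc2.dc1 (wrong)
	fmt.Println(wanNodeNameFixed("node-24562.dc2", "dc1")) // node-24562.dc2
	fmt.Println(wanNodeNameFixed("node-8500", "dc1"))      // node-8500.dc1
}
```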
I updated the issue to state there's no workaround currently available.
Edit:
Just noticed that you tried to force-leave a node that's alive. That has never been possible: only failed nodes can be force-left, because a node that is still alive will refute the gossip messages marking it as failed or leaving.
There was another change made recently that is the reason the workaround now doesn't work: https://github.com/hashicorp/consul/commit/aed5cb76690aee5a77a15a1cf3992c517ab5fd17
In an attempt to improve the error message on force-leave (a great idea), we missed the case where the target could be a WAN node, so we need to fix that.
@banks @freddygv, it seems that at the moment there is no way to un-federate one cluster from another. Is the force-leave option going to be fixed soon?