Currently the `consul leave` command triggers a graceful leave and shutdown of the agent it is called on. Consul should provide a command and HTTP API endpoint for servers to leave the WAN pool without also leaving the LAN pool or shutting down the agent.
The main use-case would be to split up two WAN-joined datacenters.
~~The only way to do this without downtime currently is to:~~
~~1. Block cross-DC server communication.~~
~~2. Have a server in each DC call `consul force-leave <node-name>.<dc>` on all the servers in the other DC (once the servers in the other DC are marked as failed).~~
There is currently no workaround for this.
This is a duplicate of #3307 but this has more context so might be a better issue. There was an attempt to do this in #3414 that has some more context and some issues that we ran into.
Relatedly, I don't think `consul force-leave <node-name>.<dc>` actually works as documented currently. When I test it with a federated cluster, even when all nodes are up and healthy I get:
```
$ consul members -wan
Node            Address          Status  Type    Build     Protocol  DC   Segment
node-24507.dc1  127.0.0.1:24509  alive   server  1.7.0dev  2         dc1  <all>
node-24512.dc1  127.0.0.1:24514  alive   server  1.7.0dev  2         dc1  <all>
node-24532.dc2  127.0.0.1:24534  alive   server  1.7.0dev  2         dc2  <all>
node-24537.dc2  127.0.0.1:24539  alive   server  1.7.0dev  2         dc2  <all>
node-24542.dc2  127.0.0.1:24544  alive   server  1.7.0dev  2         dc2  <all>
node-24562.dc3  127.0.0.1:24564  alive   server  1.7.0dev  2         dc3  <all>
node-24567.dc3  127.0.0.1:24569  alive   server  1.7.0dev  2         dc3  <all>
node-24572.dc3  127.0.0.1:24574  alive   server  1.7.0dev  2         dc3  <all>
node-8500.dc1   127.0.0.1:8302   alive   server  1.7.0dev  2         dc1  <all>

$ consul force-leave -prune node-24562.dc3
Error force leaving: Unexpected response code: 500 (agent: No node found with name 'node-24562.dc3')
```
@banks there was a change made that broke that workaround. This was to fix an issue where force-leave called in the current DC without the DC suffix led to the force-left node not leaving the WAN pool.
The relevant code is here:
https://github.com/hashicorp/consul/blob/master/agent/consul/server.go#L1127
When you call `force-leave` with `<node-name>.dc2` from dc1, we naively append the local datacenter for the WAN pool removal, so the lookup is made with `<node-name>.dc2.dc1`, which matches no WAN member.
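A minimal sketch of the suffix-appending bug described above (the function names and the fixed variant are illustrative, not the actual code in `agent/consul/server.go`):

```go
package main

import (
	"fmt"
	"strings"
)

// wanNodeName mimics the naive qualification: the local datacenter is
// appended unconditionally, so a name that already carries a remote DC
// suffix ("node-1.dc2") becomes "node-1.dc2.dc1" and matches nothing
// in the WAN pool.
func wanNodeName(node, localDC string) string {
	return node + "." + localDC
}

// wanNodeNameFixed is a hypothetical fix for illustration: only append
// the local DC when the name is not already DC-qualified. (Simplified —
// real node names could themselves contain dots.)
func wanNodeNameFixed(node, localDC string) string {
	if strings.Contains(node, ".") {
		return node // already qualified, e.g. "node-1.dc2"
	}
	return node + "." + localDC
}

func main() {
	fmt.Println(wanNodeName("node-24562.dc2", "dc1"))      // node-24562.dc2.dc1 (wrong)
	fmt.Println(wanNodeNameFixed("node-24562.dc2", "dc1")) // node-24562.dc2
	fmt.Println(wanNodeNameFixed("node-8500", "dc1"))      // node-8500.dc1
}
```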
I updated the issue to state there's no workaround currently available.
Edit:
Just noticed that you tried to force-leave a node that's alive. That has never been possible: only failed nodes can be force-left, because a node that is still alive will refute the gossip messages marking it as failed or leaving.
There was another change made recently that is the reason the workaround now doesn't work: https://github.com/hashicorp/consul/commit/aed5cb76690aee5a77a15a1cf3992c517ab5fd17
In an attempt to improve the error message on force-leave (a great idea), we missed the case where the target could be a WAN node, so we need to fix that.
@banks @freddygv, it seems that at the moment there is no way to un-federate one cluster from another. Is the force-leave option going to be fixed soon?