Consul: Consul server does not gracefully leave upon receiving a SIGINT signal

Created on 21 Dec 2017 · 10 comments · Source: hashicorp/consul

consul version for both Client and Server

Server: v1.0.2

consul info for Server

Server:

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = b55059f
    version = 1.0.2
consul:
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = 192.168.50.6:8300
    server = true
raft:
    applied_index = 330
    commit_index = 330
    fsm_pending = 0
    last_contact = 526.503µs
    last_log_index = 331
    last_log_term = 6
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:672c0ffb-f796-07cf-b0a3-0d72b93b750b Address:192.168.50.6:8300} {Suffrage:Voter ID:200bf45e-2002-ed6d-2b55-8033a157ee0c Address:192.168.50.8:8300} {Suffrage:Voter ID:1dff25fe-f60c-1a4a-fa90-9b17200cfd21 Address:192.168.50.7:8300}]
    latest_configuration_index = 156
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 6
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 89
    max_procs = 1
    os = linux
    version = go1.9.2
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 5
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 27
    members = 3
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 21
    members = 3
    query_queue = 0
    query_time = 1

Operating system and Environment details

root@ubuntu-xenial:~# cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.3 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.3 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

Description of the Issue (and unexpected/desired result)

I'm setting up a new Consul server cluster with ACLs enabled. The agent processes are managed through systemd, with KillSignal set to SIGINT. I'm able to bootstrap the agents just fine and generate their acl_agent_tokens. However, I haven't been able to gracefully stop the agents with a SIGINT signal: consul leave -token=<blah> works, but systemctl stop consul does not. The agent is not gracefully stopped for some reason.

Reproduction steps

  1. Set up a Consul server cluster of 3 machines with a configuration similar to:

config.hcl

bootstrap_expect        = 3
node_name               = "consul-server1"
server                  = true
ui                      = false
datacenter              = "us"
acl_datacenter          = "us"
domain                  = "hooklift"
bind_addr               = "0.0.0.0"
advertise_addr          = "192.168.50.7"
advertise_addr_wan      = ""
data_dir                = "/var/lib/hooklift/consul-server"
log_level               = "debug"
key_file                = "/etc/hooklift/consul-server/config/consul.key"
cert_file               = "/etc/hooklift/consul-server/config/consul.crt"
ca_file                 = "/etc/hooklift/consul-server/config/ca.crt"
verify_server_hostname  = true
verify_incoming         = true
verify_outgoing         = true
raft_protocol           = 3
disable_remote_exec     = true
disable_update_check    = true
acl_default_policy      = "deny"
acl_master_token        = "blah"
acl_replication_token   = ""
acl_down_policy         = "extend-cache"
encrypt                 = "blah2"
encrypt_verify_incoming = true
encrypt_verify_outgoing = true

enable_agent_tls_for_checks = true

retry_join          = [
  "192.168.50.6",
  "192.168.50.7",
  "192.168.50.8",
]

addresses {
  https = "0.0.0.0"
}

ports {
  https = 8080
  http  = 8500
}

telemetry {
  statsite_address = "127.0.0.1:8125"
}

performance {
  raft_multiplier = 5
}

acl_agent_token = "blah3"

Systemd unit: consul-server.service

[Unit]
Description=%p service
After=network-online.target
Requires=network-online.target

[Service]
Type=notify
Restart=on-failure

User=consul
Group=consul

ExecStart=/usr/local/bin/consul agent -config-dir=/etc/hooklift/consul-server/config
ExecReload=/bin/kill -HUP $MAINPID

# Make sure consul gracefully leaves the cluster.
KillSignal=SIGINT

SyslogIdentifier=%p

[Install]
WantedBy=multi-user.target
  2. Run systemctl stop consul-server

Log Fragments

Dec 21 04:30:49 ubuntu-xenial systemd[1]: Stopping consul-server service...
Dec 21 04:30:49 ubuntu-xenial consul-server[9680]:     2017/12/21 04:30:49 [INFO] Caught signal:  interrupt
Dec 21 04:30:49 ubuntu-xenial consul-server[9680]:     2017/12/21 04:30:49 [INFO] Graceful shutdown disabled. Exiting
Dec 21 04:30:49 ubuntu-xenial consul-server[9680]:     2017/12/21 04:30:49 [INFO] agent: Requesting shutdown
Dec 21 04:30:49 ubuntu-xenial consul-server[9680]:     2017/12/21 04:30:49 [INFO] consul: shutting down server
Dec 21 04:30:49 ubuntu-xenial consul-server[9680]:     2017/12/21 04:30:49 [WARN] serf: Shutdown without a Leave
Dec 21 04:30:49 ubuntu-xenial consul-server[9680]:     2017/12/21 04:30:49 [WARN] serf: Shutdown without a Leave
Dec 21 04:30:49 ubuntu-xenial consul-server[9680]:     2017/12/21 04:30:49 [INFO] manager: shutting down
Dec 21 04:30:49 ubuntu-xenial consul-server[9680]:     2017/12/21 04:30:49 [INFO] raft: aborting pipeline replication to peer {Voter 200bf45e-2002-ed6d-2b55-8033a
Dec 21 04:30:49 ubuntu-xenial consul-server[9680]:     2017/12/21 04:30:49 [INFO] raft: aborting pipeline replication to peer {Voter 672c0ffb-f796-07cf-b0a3-0d72b
Dec 21 04:30:49 ubuntu-xenial consul-server[9680]:     2017/12/21 04:30:49 [INFO] agent: consul server down
Dec 21 04:30:49 ubuntu-xenial consul-server[9680]:     2017/12/21 04:30:49 [INFO] agent: shutdown complete
Dec 21 04:30:49 ubuntu-xenial consul-server[9680]:     2017/12/21 04:30:49 [INFO] agent: Stopping DNS server 127.0.0.1:8600 (tcp)
Dec 21 04:30:49 ubuntu-xenial consul-server[9680]:     2017/12/21 04:30:49 [INFO] agent: Stopping DNS server 127.0.0.1:8600 (udp)
Dec 21 04:30:49 ubuntu-xenial consul-server[9680]:     2017/12/21 04:30:49 [INFO] agent: Stopping HTTP server 127.0.0.1:8500 (tcp)
Dec 21 04:30:49 ubuntu-xenial consul-server[9680]:     2017/12/21 04:30:49 [INFO] agent: Stopping HTTPS server [::]:8080 (tcp)
Dec 21 04:30:49 ubuntu-xenial consul-server[9680]:     2017/12/21 04:30:49 [INFO] agent: Waiting for endpoints to shut down
Dec 21 04:30:49 ubuntu-xenial consul-server[9680]:     2017/12/21 04:30:49 [INFO] agent: Endpoints down
Dec 21 04:30:49 ubuntu-xenial consul-server[9680]:     2017/12/21 04:30:49 [INFO] Exit code: 1
Dec 21 04:30:49 ubuntu-xenial systemd[1]: consul-server.service: Main process exited, code=exited, status=1/FAILURE
Dec 21 04:30:49 ubuntu-xenial systemd[1]: Stopped consul-server service.
Dec 21 04:30:49 ubuntu-xenial systemd[1]: consul-server.service: Unit entered failed state.
Dec 21 04:30:49 ubuntu-xenial systemd[1]: consul-server.service: Failed with result 'exit-code'.

All 10 comments

Hey @c4milo, if you were on an older version of Consul, before 0.7, this used to work differently. I think you'll need to set https://www.consul.io/docs/agent/options.html#skip_leave_on_interrupt to false in order to get the behavior you are expecting.
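
For reference, that would be a one-line addition to the config.hcl shown above, something like:

# Restores the pre-0.7 behavior of leaving the cluster on SIGINT.
# As of 0.7 this option defaults to true on server agents, so a SIGINT
# shuts the server down without a graceful leave.
skip_leave_on_interrupt = false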

What's the recommended behavior in this case? I find the messaging in the documentation confusing. Here it says server agents should be allowed to gracefully leave the cluster to have a minimal impact on availability. However, skip_leave_on_interrupt defaults to true for server agents. Shouldn't it be false by default instead?

This one was a tough call. If you are really decommissioning a server then you want it to gracefully leave so it's no longer part of the quorum. If you are just stopping and starting it for some reason, then it's better not to have it leave so that there's not a spurious configuration change in the Raft log removing it and later adding it back, and so that the server can retain all of its state about the cluster and rejoin with minimal impact when it starts up again. Since we now have autopilot that will kick the dead server out in the first case where you remove it and then add a new one, we defaulted it to the safer mode for the second case and let the servers "fail" and then get cleaned up. Clients can always leave.

@slackpad thanks for getting back. It seems, then, that issuing a consul leave from a Terraform destroy provisioner will be the way to go for gracefully decommissioning a server, since systemd, or any other process supervisor, will be handling the other use case.
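
Something like this is what I have in mind; the resource name, connection details, and token variable below are just placeholders:

resource "aws_instance" "consul_server" {
  # ... instance and connection configuration elided ...

  # Runs only when the resource is destroyed, so the server leaves
  # the cluster gracefully before the machine is decommissioned.
  provisioner "remote-exec" {
    when   = "destroy"
    inline = ["consul leave -token=${var.consul_acl_token}"]
  }
}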

Along this same line, if graceful leave is turned off, the Consul agent exits with an exit code of 1. It seems counter-intuitive to exit with an error code when the user has asked it to shut down and it shut down just as it was supposed to (how it was configured). I don't want to set my alerts to ignore an exit code of 1, since I don't know if there are cases where Consul will exit with a 1 and there ACTUALLY was a fatal error.

While a graceful leave is generally not what you want on a server you are planning to bring back up again (e.g., OS patching), one thing that you almost certainly do want is a graceful leader election if the node being taken down is the current leader.

From my testing, it looks like sending SIGINT to the leader node does not notify the other servers, so leader election does not happen for a few seconds, resulting in a minor service disruption. Having SIGINT not leave the cluster, but first gracefully trigger a leader election (with the shutting-down server excluded from the election), would be the ideal situation. This would allow for graceful restarts of the leader.

I see what @nvx is saying and also what @slackpad is saying. What is the consensus? Is it better to just always do a graceful leave, even if it's just to immediately restart a few seconds or minutes later? I share the same sentiment as @kinghrothgar that we can't just ignore exit code 1, since it isn't always going to mean that Consul exited because of a SIGINT. I'm leaning toward just making it do the leave on exit, since we don't do restarts that often, so the "spurious configuration change" isn't a big deal.

I don't think doing a full graceful leave is the right option, because of the risk of bricking the cluster if all nodes leave at once. Ideally, some sort of "if I'm the current leader, trigger a leader re-election, then quit without gracefully leaving the cluster" would be the desired behaviour, I think.

Is there a command to trigger a re-election gracefully? We can detect which server is the current leader and then run a command automatically before restarting.

I think the consensus here is that running a command like systemctl stop consul should always perform a graceful leave. As an operator, you're not always 100% sure how long it will be before the server is started again, and Consul will gracefully handle the re-join, and leave, as it is designed to. So (if I properly understand the question in this thread) I would say you should always perform a graceful leave on stop. I'm going to close this for now, but please do comment and we can continue the discussion here or in a new issue as necessary.
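
As a sketch of that recommendation (not official guidance): keep KillSignal=SIGINT and set skip_leave_on_interrupt = false as mentioned above, or let systemd send its default SIGTERM and enable the documented leave_on_terminate option instead:

# Sketch: make systemd's default SIGTERM trigger a graceful leave.
# leave_on_terminate defaults to false on server agents as of 0.7,
# so it must be set explicitly for `systemctl stop` to cause a leave.
leave_on_terminate = true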
