I would expect that, if the consul leader issues consul leave, clients shouldn't receive RPC failures caused by consul not yet having elected a new leader. Instead, there seems to be a window of around 15s during which clients can receive failures until consul has elected a new leader. Is there a "more correct" way to have a consul leader leave the cluster?
Issue consul leave on the cluster leader.
consul version for both Client and Server
Client: v1.0.6
Server: v1.0.6
consul info for both Client and Server
Client:
agent:
check_monitors = 0
check_ttls = 0
checks = 5
services = 5
build:
prerelease =
revision = 9a494b5f
version = 1.0.6
consul:
known_servers = 3
server = false
runtime:
arch = amd64
cpu_count = 2
goroutines = 57
max_procs = 2
os = linux
version = go1.9.3
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 169
failed = 0
health_score = 0
intent_queue = 0
left = 6
member_time = 1774
members = 14
query_queue = 0
query_time = 1
Server:
agent:
check_monitors = 0
check_ttls = 0
checks = 3
services = 3
build:
prerelease =
revision = 9a494b5f
version = 1.0.6
consul:
bootstrap = false
known_datacenters = 1
leader = true
leader_addr = 10.2.210.100:8300
server = true
raft:
applied_index = 19809408
commit_index = 19809408
fsm_pending = 0
last_contact = 0
last_log_index = 19809408
last_log_term = 393
last_snapshot_index = 19802623
last_snapshot_term = 383
latest_configuration = [{Suffrage:Voter ID:9859a83e-e11e-aff2-6165-fb151922c4dc Address:10.2.210.100:8300} {Suffrage:Voter ID:d4a2a349-43f0-d5e4-521a-55d9ebd2a265 Address:10.2.210.201:8300} {Suffrage:Voter ID:f776da6d-3164-2b26-ca70-9fc72ce6d655 Address:10.2.211.100:8300}]
latest_configuration_index = 19808688
num_peers = 2
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Leader
term = 393
runtime:
arch = amd64
cpu_count = 4
goroutines = 191
max_procs = 4
os = linux
version = go1.9.3
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 169
failed = 0
health_score = 0
intent_queue = 0
left = 2
member_time = 1774
members = 10
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1
members = 1
query_queue = 0
query_time = 1
Ubuntu 16.04 in AWS.
Feb 19 01:50:29 ip-10-2-211-212 consul[3175]: 2018/02/19 01:50:29 [INFO] serf: EventMemberJoin: server-use1-0-10-2-210-100 10.2.210.100
Feb 19 01:50:29 ip-10-2-211-212 consul[3175]: 2018/02/19 01:50:29 [INFO] consul: adding server server-use1-0-10-2-210-100 (Addr: tcp/10.2.210.100:8300) (DC: us-east-1)
Feb 19 01:51:54 ip-10-2-211-212 consul[3175]: 2018/02/19 01:51:54 [INFO] serf: EventMemberFailed: server-use1-1-10-2-210-201 10.2.210.201
Feb 19 01:51:54 ip-10-2-211-212 consul[3175]: 2018/02/19 01:51:54 [INFO] consul: removing server server-use1-1-10-2-210-201 (Addr: tcp/10.2.210.201:8300) (DC: us-east-1)
Feb 19 01:51:54 ip-10-2-211-212 consul[3175]: 2018/02/19 01:51:54 [INFO] serf: EventMemberLeave (forced): server-use1-1-10-2-210-201 10.2.210.201
Feb 19 01:51:54 ip-10-2-211-212 consul[3175]: 2018/02/19 01:51:54 [INFO] consul: removing server server-use1-1-10-2-210-201 (Addr: tcp/10.2.210.201:8300) (DC: us-east-1)
Feb 19 01:55:16 ip-10-2-211-212 consul[3175]: 2018/02/19 01:55:16 [INFO] serf: EventMemberJoin: server-use1-1-10-2-210-201 10.2.210.201
Feb 19 01:55:16 ip-10-2-211-212 consul[3175]: 2018/02/19 01:55:16 [INFO] consul: adding server server-use1-1-10-2-210-201 (Addr: tcp/10.2.210.201:8300) (DC: us-east-1)
Feb 19 01:57:07 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:07 [INFO] serf: EventMemberLeave: server-use1-2-10-2-211-100 10.2.211.100
Feb 19 01:57:07 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:07 [INFO] consul: removing server server-use1-2-10-2-211-100 (Addr: tcp/10.2.211.100:8300) (DC: us-east-1)
Feb 19 01:57:12 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:12 [ERR] consul: "Health.ChecksInState" RPC failed to server 10.2.211.100:8300: rpc error making call: EOF
Feb 19 01:57:12 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:12 [ERR] http: Request GET /v1/health/state/any?consistent=&index=19808594, error: rpc error making call: EOF from=127.0.0.1:48420
Feb 19 01:57:12 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:12 [ERR] consul: "KVS.Get" RPC failed to server 10.2.210.100:8300: rpc error making call: rpc error making call: EOF
Feb 19 01:57:12 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:12 [ERR] http: Request GET /v1/kv/fabio/config?consistent=&index=19808480, error: rpc error making call: rpc error making call: EOF from=127.0.0.1:47800
Feb 19 01:57:12 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:12 [ERR] consul: "KVS.Get" RPC failed to server 10.2.210.100:8300: rpc error making call: rpc error making call: EOF
Feb 19 01:57:12 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:12 [ERR] http: Request GET /v1/kv/fabio/noroute.html?consistent=&index=19341163, error: rpc error making call: rpc error making call: EOF from=127.0.0.1:44304
Feb 19 01:57:20 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:20 [ERR] consul: "Health.ChecksInState" RPC failed to server 10.2.210.201:8300: rpc error making call: No cluster leader
Feb 19 01:57:20 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:20 [ERR] http: Request GET /v1/health/state/any?consistent=&index=19808594, error: rpc error making call: No cluster leader from=127.0.0.1:49212
Feb 19 01:57:21 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:21 [ERR] consul: "KVS.Get" RPC failed to server 10.2.210.201:8300: rpc error making call: No cluster leader
Feb 19 01:57:21 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:21 [ERR] http: Request GET /v1/kv/fabio/config?consistent=&index=19808480, error: rpc error making call: No cluster leader from=127.0.0.1:49653
Feb 19 01:57:21 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:21 [ERR] consul: "KVS.Get" RPC failed to server 10.2.210.201:8300: rpc error making call: No cluster leader
Feb 19 01:57:21 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:21 [ERR] http: Request GET /v1/kv/fabio/noroute.html?consistent=&index=19341163, error: rpc error making call: No cluster leader from=127.0.0.1:49652
Feb 19 01:57:23 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:23 [INFO] consul: New leader elected: server-use1-0-10-2-210-100
EDIT: I failed to note that server-use1-2-10-2-211-100 was the cluster leader at the beginning of the log snippet above. So, in the snippet, we see it leave the consul cluster, then the RPC errors, and then consul electing server-use1-0-10-2-210-100 as the new leader.
@holtwilkins We made some changes in https://github.com/hashicorp/consul/issues/3514 to make RPC handling during leader elections more robust. Config options under performance can be used to mitigate 500s during server leaves and leader elections.
Looking at your logs, it appears that the default values for leave_drain_time and rpc_hold_timeout may not be enough for your environment. I suggest bumping them both up: the first allows in-flight RPCs to the leader to finish when you issue consul leave, and the second allows internal RPCs to be retried during a leader election.
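For example, a quick sketch with illustrative values (the right numbers depend on your environment); both options live under the performance stanza of the agent configuration:

```hcl
performance {
  leave_drain_time = "10s"  # default 5s: how long a gracefully leaving server
                            # waits for in-flight RPCs to complete
  rpc_hold_timeout = "12s"  # default 7s: how long RPCs are held and retried
                            # during a leader election before returning an error
}
```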
Thanks @preetapan, I had missed these options being added in the 1.0 release notes. I'm curious why you're suggesting increasing those two values as opposed to lowering raft_multiplier down to 1? If I should increase those two timeouts, should rpc_hold_timeout always be x% longer than leave_drain_time, or something?
Just in general, it seems like these defaults could be tuned a bit, given that I'm running the defaults on rather beefy consul servers and getting outages during leader elections...
@holtwilkins yeah, the defaults are intentionally tuned for minimal, low-power setups, as noted in https://www.consul.io/docs/guides/performance.html#minimum-server-requirements and the specific link @preetapan gave before.
The link above explains why the defaults are chosen as they are; it's expected, and hopefully clearly documented, that "production" installations are likely to need to tune them.
Setting raft_multiplier to a value of 1 will configure Raft to its highest-performance mode, equivalent to the default timing of Consul prior to 0.7, and is recommended for production Consul servers.
Not sure if Preetha assumed you had already followed the docs for a production setup, but yeah, it seems you should try that first and then increase the timeouts if you still see issues in your environment.
Hope this helps :)
Yup @banks, I see the notes about "production" clusters now. I dropped raft_multiplier to 1 and it helped a lot, but I still got a 500 after a few leaves, so I've set leave_drain_time to 15s and rpc_hold_timeout to 20s. I'm still not clear, though, on the relationship between these two values and why their defaults are different from each other.
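For reference, my performance stanza now looks like this:

```hcl
performance {
  raft_multiplier  = 1
  leave_drain_time = "15s"
  rpc_hold_timeout = "20s"
}
```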
@banks @preetapan can you please expand on @holtwilkins's request? We are currently facing the same issue in our environment.
Hi @guidoiaquinti @holtwilkins, sorry for the delay; I missed the mention here.
but I'm still not clear on the relationship between these two values and why their defaults are different from each other
The simplest difference is that leave_drain_time is how long the _server_ waits during a graceful leave (the typical graceful-shutdown timeout, as with most HTTP/RPC servers), while rpc_hold_timeout is how long the client will wait while RPC retries are made. The same hold also applies on any server that forwards an RPC to the leader.
It's typical to set the hold timeout a little higher to account for the time taken by the initial and final round trips, scheduling delays, etc.
Those retry/leave-drain times need to be set somewhat higher than the slowest leader election might take in your environment to ensure you never see a 500. Of course there is a tradeoff: if they get too long, clients might give up on the response anyway rather than wait 20+ seconds for a success.
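As a rough illustration (the numbers here are hypothetical; measure your own worst-case election time first):

```hcl
# Suppose the slowest leader election you ever observe takes ~10s:
performance {
  leave_drain_time = "12s"  # somewhat above the worst case, so in-flight RPCs
                            # can finish before the leaving server shuts down
  rpc_hold_timeout = "15s"  # a little higher still, covering the extra round
                            # trips and scheduling delays mentioned above
}
```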
The raft_multiplier controls how quickly leader elections can happen. We recommend 1 for production, but it defaults to 5, per the docs, because many people first spin up Consul on very small VMs with limited network and CPU performance and complained that they saw leader instability.
There is a tradeoff between leader stability (false-positive failures caused by high CPU or network variability delaying heartbeats by a few hundred milliseconds) and how long it takes to detect a real leader failure. We expect production users to have well-provisioned machines for their load, so raft_multiplier = 1 should not cause too much "flapping" of leadership and allows quick detection of real leader failures.
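To make the scaling concrete, here's a sketch assuming the roughly 1-second base Raft heartbeat/election timings that the multiplier scales:

```hcl
performance {
  # raft_multiplier = 5  # default: ~5s heartbeat/election timeouts; gentler on
  #                      # small VMs, but slower to detect a real leader failure
  raft_multiplier = 1    # production: ~1s timeouts; fastest failure detection,
                         # at the cost of less tolerance for CPU/network jitter
}
```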
I realise I've not given a super clear answer on exactly what to set, and that's because this is all fairly subtle and variable stuff; it depends on your hardware, workload, topology, etc. I hope that helps build a mental model for how those parameters interact.
@holtwilkins: Closing this due to inactivity; later versions of Consul have also improved how the cluster behaves when a node gracefully leaves. Feel free to re-open this with more information should you come across this again!