I would expect that, if the consul leader issues consul leave, clients shouldn't receive RPC failures caused by consul not yet having elected a new leader. Instead, there seems to be a window of around 15s during which clients can receive failures until consul has elected a new leader. Is there a "more correct" way to have a consul leader leave the cluster?
Issue consul leave on the cluster leader.
consul version for both Client and Server
Client: v1.0.6
Server: v1.0.6
consul info for both Client and Server
Client:
agent:
check_monitors = 0
check_ttls = 0
checks = 5
services = 5
build:
prerelease =
revision = 9a494b5f
version = 1.0.6
consul:
known_servers = 3
server = false
runtime:
arch = amd64
cpu_count = 2
goroutines = 57
max_procs = 2
os = linux
version = go1.9.3
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 169
failed = 0
health_score = 0
intent_queue = 0
left = 6
member_time = 1774
members = 14
query_queue = 0
query_time = 1
Server:
agent:
check_monitors = 0
check_ttls = 0
checks = 3
services = 3
build:
prerelease =
revision = 9a494b5f
version = 1.0.6
consul:
bootstrap = false
known_datacenters = 1
leader = true
leader_addr = 10.2.210.100:8300
server = true
raft:
applied_index = 19809408
commit_index = 19809408
fsm_pending = 0
last_contact = 0
last_log_index = 19809408
last_log_term = 393
last_snapshot_index = 19802623
last_snapshot_term = 383
latest_configuration = [{Suffrage:Voter ID:9859a83e-e11e-aff2-6165-fb151922c4dc Address:10.2.210.100:8300} {Suffrage:Voter ID:d4a2a349-43f0-d5e4-521a-55d9ebd2a265 Address:10.2.210.201:8300} {Suffrage:Voter ID:f776da6d-3164-2b26-ca70-9fc72ce6d655 Address:10.2.211.100:8300}]
latest_configuration_index = 19808688
num_peers = 2
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Leader
term = 393
runtime:
arch = amd64
cpu_count = 4
goroutines = 191
max_procs = 4
os = linux
version = go1.9.3
serf_lan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 169
failed = 0
health_score = 0
intent_queue = 0
left = 2
member_time = 1774
members = 10
query_queue = 0
query_time = 1
serf_wan:
coordinate_resets = 0
encrypted = true
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 1
members = 1
query_queue = 0
query_time = 1
Ubuntu 16.04 in AWS.
Feb 19 01:50:29 ip-10-2-211-212 consul[3175]: 2018/02/19 01:50:29 [INFO] serf: EventMemberJoin: server-use1-0-10-2-210-100 10.2.210.100
Feb 19 01:50:29 ip-10-2-211-212 consul[3175]: 2018/02/19 01:50:29 [INFO] consul: adding server server-use1-0-10-2-210-100 (Addr: tcp/10.2.210.100:8300) (DC: us-east-1)
Feb 19 01:51:54 ip-10-2-211-212 consul[3175]: 2018/02/19 01:51:54 [INFO] serf: EventMemberFailed: server-use1-1-10-2-210-201 10.2.210.201
Feb 19 01:51:54 ip-10-2-211-212 consul[3175]: 2018/02/19 01:51:54 [INFO] consul: removing server server-use1-1-10-2-210-201 (Addr: tcp/10.2.210.201:8300) (DC: us-east-1)
Feb 19 01:51:54 ip-10-2-211-212 consul[3175]: 2018/02/19 01:51:54 [INFO] serf: EventMemberLeave (forced): server-use1-1-10-2-210-201 10.2.210.201
Feb 19 01:51:54 ip-10-2-211-212 consul[3175]: 2018/02/19 01:51:54 [INFO] consul: removing server server-use1-1-10-2-210-201 (Addr: tcp/10.2.210.201:8300) (DC: us-east-1)
Feb 19 01:55:16 ip-10-2-211-212 consul[3175]: 2018/02/19 01:55:16 [INFO] serf: EventMemberJoin: server-use1-1-10-2-210-201 10.2.210.201
Feb 19 01:55:16 ip-10-2-211-212 consul[3175]: 2018/02/19 01:55:16 [INFO] consul: adding server server-use1-1-10-2-210-201 (Addr: tcp/10.2.210.201:8300) (DC: us-east-1)
Feb 19 01:57:07 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:07 [INFO] serf: EventMemberLeave: server-use1-2-10-2-211-100 10.2.211.100
Feb 19 01:57:07 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:07 [INFO] consul: removing server server-use1-2-10-2-211-100 (Addr: tcp/10.2.211.100:8300) (DC: us-east-1)
Feb 19 01:57:12 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:12 [ERR] consul: "Health.ChecksInState" RPC failed to server 10.2.211.100:8300: rpc error making call: EOF
Feb 19 01:57:12 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:12 [ERR] http: Request GET /v1/health/state/any?consistent=&index=19808594, error: rpc error making call: EOF from=127.0.0.1:48420
Feb 19 01:57:12 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:12 [ERR] consul: "KVS.Get" RPC failed to server 10.2.210.100:8300: rpc error making call: rpc error making call: EOF
Feb 19 01:57:12 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:12 [ERR] http: Request GET /v1/kv/fabio/config?consistent=&index=19808480, error: rpc error making call: rpc error making call: EOF from=127.0.0.1:47800
Feb 19 01:57:12 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:12 [ERR] consul: "KVS.Get" RPC failed to server 10.2.210.100:8300: rpc error making call: rpc error making call: EOF
Feb 19 01:57:12 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:12 [ERR] http: Request GET /v1/kv/fabio/noroute.html?consistent=&index=19341163, error: rpc error making call: rpc error making call: EOF from=127.0.0.1:44304
Feb 19 01:57:20 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:20 [ERR] consul: "Health.ChecksInState" RPC failed to server 10.2.210.201:8300: rpc error making call: No cluster leader
Feb 19 01:57:20 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:20 [ERR] http: Request GET /v1/health/state/any?consistent=&index=19808594, error: rpc error making call: No cluster leader from=127.0.0.1:49212
Feb 19 01:57:21 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:21 [ERR] consul: "KVS.Get" RPC failed to server 10.2.210.201:8300: rpc error making call: No cluster leader
Feb 19 01:57:21 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:21 [ERR] http: Request GET /v1/kv/fabio/config?consistent=&index=19808480, error: rpc error making call: No cluster leader from=127.0.0.1:49653
Feb 19 01:57:21 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:21 [ERR] consul: "KVS.Get" RPC failed to server 10.2.210.201:8300: rpc error making call: No cluster leader
Feb 19 01:57:21 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:21 [ERR] http: Request GET /v1/kv/fabio/noroute.html?consistent=&index=19341163, error: rpc error making call: No cluster leader from=127.0.0.1:49652
Feb 19 01:57:23 ip-10-2-211-212 consul[3175]: 2018/02/19 01:57:23 [INFO] consul: New leader elected: server-use1-0-10-2-210-100
EDIT: I failed to note that server-use1-2-10-2-211-100 was the cluster leader at the beginning of the log snippet above. So, in the snippet, we see it leave the consul cluster, then the RPC errors, and then consul electing server-use1-0-10-2-210-100 as the new leader.
@holtwilkins We made some changes in https://github.com/hashicorp/consul/issues/3514 to make RPC handling during leader elections more robust. Config options under performance can be used to mitigate 500s during server leaves and leader elections.
Looking at your logs, it appears that the default values for leave_drain_time and rpc_hold_timeout may not be enough for your environment. I suggest bumping them both up: the first allows in-flight RPCs to the leader to finish when you issue consul leave, and the second allows internal RPCs to be retried during a leader election.
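For example, a quick sketch with illustrative values (the right numbers depend on your environment); both options live under the performance stanza of the agent configuration:

```hcl
performance {
  leave_drain_time = "10s"  # default 5s: how long a gracefully leaving server
                            # waits for in-flight RPCs to complete
  rpc_hold_timeout = "12s"  # default 7s: how long RPCs are held and retried
                            # during a leader election before returning an error
}
```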
Thanks @preetapan, I had missed these options being added in the 1.0 release notes. I'm curious why you're suggesting increasing those two values as opposed to lowering raft_multiplier down to 1? If I should increase those two timeouts, should rpc_hold_timeout always be x% longer than leave_drain_time, or something?
Just in general, it seems like these defaults could be tuned a bit, given that I'm running the defaults on rather beefy consul servers and getting outages during leader elections...
@holtwilkins yeah, the defaults are intentionally tuned for minimal, low-power setups, as noted in https://www.consul.io/docs/guides/performance.html#minimum-server-requirements and the specific link @preetapan gave before.
The link above explains why the defaults are chosen as they are; it's expected, and hopefully clearly documented, that "production" installations are likely to need to tune them.
Setting raft_multiplier to a value of 1 will configure Raft to its highest-performance mode, equivalent to the default timing of Consul prior to 0.7, and is recommended for production Consul servers.
Not sure if Preetha assumed you had already followed the docs for a production setup, but yeah, it seems you should try that first and then increase the timeouts if you still see issues in your environment.
Hope this helps :)
Yup @banks, I see the notes about "production" clusters now. I dropped raft_multiplier to 1 and it helped a lot, but I still got a 500 after a few leaves, so I've set leave_drain_time to 15s and rpc_hold_timeout to 20s. I'm still not clear, though, on the relationship between these two values and why their defaults are different from each other.
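For reference, my performance stanza now looks like this:

```hcl
performance {
  raft_multiplier  = 1
  leave_drain_time = "15s"
  rpc_hold_timeout = "20s"
}
```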
@banks @preetapan can you please expand on @holtwilkins's request? We are currently facing the same issue in our environment.
Hi @guidoiaquinti @holtwilkins, sorry for the delay; I missed the mention here.
but I'm still not clear on the relationship between these two values and why their defaults are different from each other
The simplest difference is that leave_drain_time is how long the _server_ waits during a graceful leave (the typical graceful-shutdown timeout, as with most HTTP/RPC servers), while rpc_hold_timeout is how long the client will wait while RPC retries are made. The same hold also applies on any server that forwards an RPC to the leader.
It's typical to set the hold timeout a little higher to account for the time taken by the initial and final round trips, scheduling delays, etc.
Those retry/leave-drain times need to be set somewhat higher than the slowest leader election might take in your environment to ensure you never see a 500. Of course there is a tradeoff: if they get too long, clients might give up on the response anyway rather than wait 20+ seconds for a success.
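As a rough illustration (the numbers here are hypothetical; measure your own worst-case election time first):

```hcl
# Suppose the slowest leader election you ever observe takes ~10s:
performance {
  leave_drain_time = "12s"  # somewhat above the worst case, so in-flight RPCs
                            # can finish before the leaving server shuts down
  rpc_hold_timeout = "15s"  # a little higher still, covering the extra round
                            # trips and scheduling delays mentioned above
}
```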
The raft_multiplier controls how quickly leader elections can happen. We recommend 1 for production, but it defaults to 5, per the docs, because many people first spin up Consul on very small VMs with limited network and CPU performance and complained that they saw leader instability.
There is a tradeoff between leader stability (false-positive failures caused by high CPU or network variability delaying heartbeats by a few hundred milliseconds) and how long it takes to detect a real leader failure. We expect production users to have well-provisioned machines for their load, so raft_multiplier = 1 should not cause too much "flapping" of leadership and allows quick detection of real leader failures.
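To make the scaling concrete, here's a sketch assuming the roughly 1-second base Raft heartbeat/election timings that the multiplier scales:

```hcl
performance {
  # raft_multiplier = 5  # default: ~5s heartbeat/election timeouts; gentler on
  #                      # small VMs, but slower to detect a real leader failure
  raft_multiplier = 1    # production: ~1s timeouts; fastest failure detection,
                         # at the cost of less tolerance for CPU/network jitter
}
```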
I realise I've not given a super clear answer on exactly what to set, and that's because this is all fairly subtle and variable stuff; it depends on your hardware, workload, topology, etc. I hope that helps build a mental model for how those parameters interact.
@holtwilkins: Closing this due to inactivity; later versions of Consul have also improved how the cluster behaves when a node gracefully leaves. Feel free to re-open this with more information should you come across this again!