vault HA problem in vault 0.7.3

Created on 18 Jul 2017 · 13 comments · Source: hashicorp/vault

vault 0.7.3
etcd 3.1.9
API v3

Vault cannot fail over leadership to the other standby nodes.

Error log
2017/07/18 03:28:39.935689 [ERROR] core: failed to acquire lock: error=etcdserver: requested lease not found

All 13 comments

How to reproduce this issue?

Three nodes.
Shut down the active node.
On the other two nodes, the Vault leader becomes none.
Start the previously active node again.
All three nodes still report the leader as none.

Like this:

[root@SGDLITVM0819 ~]# vault status -tls-skip-verify
Sealed: false
Key Shares: 1
Key Threshold: 1
Unseal Progress: 0
Unseal Nonce:
Version: 0.7.3
Cluster Name: vault-cluster-b7c6edd5
Cluster ID: 2c3ccf8c-f387-6bd4-8cb1-5b4f1551ad98

High-Availability Enabled: true
Mode: standby
Leader:

@shsjshentao

I tried it, but it works for me. I cannot really reproduce what you described.

Can you provide a step-by-step script? (How did you initialize Vault? How did you shut down the active node? What logging information should I expect to see?)

This problem happens to me often.
I reinstalled the three masters just now.
I found that the command "vault step-down" also does not work.
The other two nodes cannot take over leadership and log [ERROR] core: failed to acquire lock: error=etcdserver: requested lease not found.
I also found that etcd performance is low, maybe because of my HDD.

2017/07/19 07:31:14.425524 [ERROR] core: failed to acquire lock: error=etcdserver: requested lease not found
2017-07-19T07:31:15Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6458]: INFO Backend set to env
2017-07-19T07:31:15Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6458]: INFO Starting confd
2017-07-19T07:31:15Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6458]: INFO Backend nodes set to
2017-07-19T07:31:17Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6466]: INFO Backend set to env
2017-07-19T07:31:17Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6466]: INFO Starting confd
2017-07-19T07:31:17Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6466]: INFO Backend nodes set to
2017-07-19T07:31:19Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6474]: INFO Backend set to env
2017-07-19T07:31:19Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6474]: INFO Starting confd
2017-07-19T07:31:19Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6474]: INFO Backend nodes set to
2017-07-19T07:31:21Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6483]: INFO Backend set to env
2017-07-19T07:31:21Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6483]: INFO Starting confd
2017-07-19T07:31:21Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6483]: INFO Backend nodes set to
2017-07-19T07:31:23Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6491]: INFO Backend set to env
2017-07-19T07:31:23Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6491]: INFO Starting confd
2017-07-19T07:31:23Z SGDLITVM0819.hpeswlab.net /usr/bin/confd[6491]: INFO Backend nodes set to
2017/07/19 07:31:24.430583 [ERROR] core: failed to acquire lock: error=etcdserver: requested lease not found

2017-07-19 07:21:09.939587 I | raft: 2046c458caf5773f [logterm: 7, index: 15633, vote: 0] ignored MsgVote from d33e6e92eef6099a [logterm: 7, index: 15633] at term 7: lease is not expired (remaining ticks: 10)
2017-07-19 07:21:09.948538 I | raft: 2046c458caf5773f [term: 7] received a MsgApp message with higher term from d33e6e92eef6099a [term: 14]
2017-07-19 07:21:09.948585 I | raft: 2046c458caf5773f became follower at term 14
2017-07-19 07:21:13.337983 W | etcdserver: timed out waiting for read index response
2017-07-19 07:21:18.488810 W | wal: sync duration of 20.093414577s, expected less than 1s
2017-07-19 07:22:15.515277 I | mvcc: store.index: compact 3106
2017-07-19 07:22:15.518657 I | mvcc: finished scheduled compaction at 3106 (took 1.72943ms)
2017-07-19 07:24:49.255967 W | etcdserver: apply entries took too long [20.245638093s for 1 entries]
2017-07-19 07:24:49.256026 W | etcdserver: avoid queries with large range/delete range!
2017-07-19 07:25:17.555745 W | etcdserver: timed out waiting for read index response
2017-07-19 07:25:24.556360 W | etcdserver: timed out waiting for read index response
2017-07-19 07:25:30.152730 W | wal: sync duration of 20.17741517s, expected less than 1s
2017-07-19 07:25:30.176633 W | etcdserver: ignored out-of-date read index response (want [119 63 93 89 151 89 82 208], got [119 63 93 89 151 89 82 216])
2017-07-19 07:25:31.834635 W | wal: sync duration of 1.09778233s, expected less than 1s
2017-07-19 07:26:15.480884 W | etcdserver: timed out waiting for read index response
2017-07-19 07:26:22.481279 W | etcdserver: timed out waiting for read index response
2017-07-19 07:26:26.944909 W | wal: sync duration of 20.207759118s, expected less than 1s
2017-07-19 07:26:26.955119 W | etcdserver: ignored out-of-date read index response (want [119 63 93 89 151 89 83 160], got [119 63 93 89 151 89 83 184])
2017-07-19 07:26:26.957652 W | etcdserver: ignored out-of-date read index response (want [119 63 93 89 151 89 83 175], got [119 63 93 89 151 89 83 184])
2017-07-19 07:27:08.779148 W | wal: sync duration of 1.043000802s, expected less than 1s
2017-07-19 07:27:15.528865 I | mvcc: store.index: compact 3615
2017-07-19 07:27:15.531434 I | mvcc: finished scheduled compaction at 3615 (took 1.434634ms)
2017-07-19 07:28:35.413686 W | etcdserver: timed out waiting for read index response
2017-07-19 07:28:42.414061 W | etcdserver: timed out waiting for read index response
2017-07-19 07:28:47.959655 W | wal: sync duration of 20.182056724s, expected less than 1s
2017-07-19 07:31:42.777420 W | wal: sync duration of 1.042830191s, expected less than 1s
2017-07-19 07:32:15.547140 I | mvcc: store.index: compact 4148
2017-07-19 07:32:15.550025 I | mvcc: finished scheduled compaction at 4148 (took 2.002339ms)

2017-07-19 07:28:47.959655 W | wal: sync duration of 20.182056724s, expected less than 1s

Your etcd cluster is not in good shape; fsync took too long.

I believe it's because vault lost its etcd lease. Looking at the etcd client code (that vault uses), I don't see anything for handling that situation, e.g. code for acquiring a new lease.

Due to 64d412e, I believe that if you want vault + etcd to be healthy, you must ensure vault and etcd can always talk to each other within 15 seconds (it used to be 60 seconds). If that's not possible, whether because of slow fsync, network problems, or any other reason, you will need to restart vault so it acquires a new etcd lease.
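For reference, this is roughly how the lease underneath the etcd3 HA lock behaves, as far as I can tell. This is only a minimal sketch using the clientv3 concurrency package; the endpoint, TTL, and key prefix here are illustrative, not Vault's actual values:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/clientv3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session wraps an etcd lease that is kept alive in the background.
	// If keep-alives cannot reach etcd before the TTL runs out (slow fsync,
	// a network partition, ...), the lease expires on the server side.
	session, err := concurrency.NewSession(cli, concurrency.WithTTL(15))
	if err != nil {
		log.Fatal(err)
	}

	// The HA lock is a mutex keyed under a prefix and attached to the
	// session's lease, so losing the lease invalidates the lock.
	mutex := concurrency.NewMutex(session, "example-ha-lock/")
	if err := mutex.Lock(context.Background()); err != nil {
		log.Fatalf("failed to acquire lock: %v", err)
	}

	// Done() closes once the lease is gone; from that point on, reusing the
	// old session produces errors like "etcdserver: requested lease not found".
	<-session.Done()
	log.Println("session lease lost; a new session is required")
}
```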

I looked into this a bit more, because we've seen the same problem, where a standby vault will eventually end up printing out "core: failed to acquire lock: error=etcdserver: requested lease not found" over and over after the leader goes down. I don't know what caused it to get into that state (network problems? etcd problems? otherwise?) and to some degree, I don't care. I just want things to recover eventually :-)

I don't know much about etcd, but it seems to me that EtcdLock's Lock() function needs to check if the session is still valid (i.e. is the channel returned by Done() still open?). If the session isn't valid anymore--because the lease expired, I assume--then a new one needs to be made, instead of continuing to use the old session.

If that sounds reasonable, I can provide a patch.
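Roughly, the change I have in mind looks like the sketch below. The struct and its fields are simplified stand-ins for the lock in the etcd3 backend, not the actual code; the point is only the session check at the top of Lock():

```go
package etcd3

import (
	"context"
	"sync"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/clientv3/concurrency"
)

// EtcdLock is a simplified stand-in for the lock in the etcd3 backend.
type EtcdLock struct {
	lock    sync.Mutex
	client  *clientv3.Client
	session *concurrency.Session
	mu      *concurrency.Mutex
	prefix  string
}

// Lock blocks until the lock is acquired or stopCh is closed. Before reusing
// the cached session, check whether its lease is still alive; if it is not,
// build a fresh session rather than retrying forever with a dead lease.
func (c *EtcdLock) Lock(stopCh <-chan struct{}) (<-chan struct{}, error) {
	c.lock.Lock()
	defer c.lock.Unlock()

	select {
	case <-c.session.Done():
		// The old lease expired (this is what produces "etcdserver:
		// requested lease not found"). Recreate the session and the mutex
		// tied to it.
		session, err := concurrency.NewSession(c.client)
		if err != nil {
			return nil, err
		}
		c.session = session
		c.mu = concurrency.NewMutex(session, c.prefix)
	default:
		// Session still valid; keep using it.
	}

	// Cancel the lock attempt if the caller gives up.
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go func() {
		select {
		case <-stopCh:
			cancel()
		case <-ctx.Done():
		}
	}()

	if err := c.mu.Lock(ctx); err != nil {
		return nil, err
	}
	// Leadership is considered lost when the session's lease goes away.
	return c.session.Done(), nil
}
```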

My colleague @willmo wonders if there may be a better way to address this.

The question is around the Lock interface, specifically what to do if an error is returned from Lock(). The code in core.go, specifically acquireLock and runStandby (which calls acquireLock), assumes that all errors returned from Lock() will be transient; in other words, an error won't mean that the lock is permanently unusable. However, for etcd3, this is currently not the case. The lock returned by LockWith() can become invalid if its associated session/lease expires. Without knowing more about Consul and Zookeeper, it's not immediately clear if a similar problem might exist with those backends as well.

That leads us to ask: what should change here? Should core.go change so it calls LockWith() again if Lock() returns an error? Should the etcd3 backend change (which was my proposal above) and the Lock documentation improve to explicitly state the contract? Even if the current contract remains, for the sake of robustness, should we do both?

What do you think, @jefferai and @armon?
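For concreteness, here is a rough sketch of the first option: core.go building a fresh lock via LockWith() whenever Lock() returns an error. This is a simplification around Vault's physical.HABackend and physical.Lock interfaces, not the actual acquireLock/runStandby code, and the retry interval is just an illustrative value:

```go
package standby

import (
	"time"

	"github.com/hashicorp/vault/physical"
)

// acquireLockWithRetry treats a failed Lock() as potentially fatal for that
// lock object: instead of retrying the same lock (whose lease may be gone),
// it asks the backend for a brand new lock via LockWith() and tries again.
func acquireLockWithRetry(b physical.HABackend, key, value string, stopCh <-chan struct{}) (physical.Lock, <-chan struct{}, error) {
	for {
		lock, err := b.LockWith(key, value)
		if err != nil {
			return nil, nil, err
		}

		leaderLostCh, err := lock.Lock(stopCh)
		if err == nil {
			return lock, leaderLostCh, nil
		}

		// The error may mean this particular lock is permanently unusable
		// (e.g. its etcd lease no longer exists), so wait a bit and build a
		// fresh lock on the next iteration.
		select {
		case <-stopCh:
			return nil, nil, err
		case <-time.After(10 * time.Second):
		}
	}
}
```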

I am going to take a look.

@xiang90, thanks for taking a look.

Just wanted to note how to reproduce this:

  1. Have 2 (or more) vaults in HA configuration using etcd3 as a backend.
  2. Manually revoke one of the standby vault's leases from etcd3 (one way to do this is sketched below).
  3. Kill the vault leader.

After this, the standby that had its lease removed will start logging "requested lease not found" every 10 seconds and never recover.

In the real world, my guess for what took the place of step (2) in our case was temporary network problems.
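One hedged way to do step (2) programmatically, once you have identified the standby's lease ID yourself (etcd reports lease IDs in hex), is to revoke it with the clientv3 API. The lease ID below is a placeholder, not a real value:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// Placeholder: replace with the standby's actual lease ID once you have
// identified it.
const standbyLeaseID clientv3.LeaseID = 0x694d77aa9e6b8c32

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Revoking the lease puts the standby that owns it into the broken state:
	// once the leader dies, it logs "requested lease not found" forever.
	if _, err := cli.Revoke(context.Background(), standbyLeaseID); err != nil {
		log.Fatal(err)
	}
	log.Printf("revoked lease %x", standbyLeaseID)
}
```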

@xiang90, please take a look at my PR. Thanks.

@jefferai, can you please close this one, as it is fixed by @bhiggins's PR?

Sure!
