Vault Version: v0.9.5
Operating System/Architecture: Ubuntu 16.04.3
Vault Config File:
backend "consul" {
address = "127.0.0.1:8500"
path = "vault"
token = ""
}
listener "tcp" {
address = "1.1.1.1:8200"
tls_disable = 0
tls_cert_file = "/vault/server.crt"
tls_key_file = "/vault/server.key"
}
max_lease_ttl = "2880h"
Issue:
I'm running into an issue where Vault's shutdown is blocked in such a way that it stops serving secrets, but never gives up the lock (so the secondary node always stays inactive). This problem started after we upgraded to v0.9.5 from v0.7.3. It appears to be stuck in a loop trying to get the following keys from the storage backend:
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/cc5113c6c15b26c4c359407f35b49b0b33d8d6f7/?keys=&separator=%2F (1.026998ms) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/840bcfe07c0ce34bae584de5a5de6c1857c6d90a/?keys=&separator=%2F (1.029913ms) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/d85cf4528416f66409cc3b5d9433d58a39dc9cc4/?keys=&separator=%2F (956.545µs) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/888b51a27895d526857553f8738162661328d74d/?keys=&separator=%2F (975.086µs) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/dc067819860c1dd23f458b418fcd7aed0b1a697e/?keys=&separator=%2F (1.015359ms) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/9544625267777e034883cbd8a576de268d5242b0/?keys=&separator=%2F (962.588µs) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/61da133ade358ec06ae6e6f0152a57235574877f/?keys=&separator=%2F (1.139707ms) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/995823d5e90d6893865bcdaba8678b1fb0939b7a/?keys=&separator=%2F (1.003417ms) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/9dc76f1234d11481dcaf614885293c92c463ffe6/?keys=&separator=%2F (995.442µs) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/098e2b0af50e1dd46cdb3ed4ba03a0d869c93f7f/?keys=&separator=%2F (1.047521ms) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/3dd19860320685b7b34e99ac8cf8db8b2c83081c/?keys=&separator=%2F (975.041µs) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/9f357234883dccdad0caeeb78265589c24d35a12/?keys=&separator=%2F (1.040412ms) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/421cfdbc380512d04780ff506960cfd1743a3759/?keys=&separator=%2F (974.111µs) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/be4f6170632f168a81c599276cf93c48abd11beb/?keys=&separator=%2F (974.154µs) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/4d54cc468ba8b7089c51dee15dc5693209269c51/?keys=&separator=%2F (994.944µs) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/be85887a1bc57a393d903cf6cede8bfa8d083ffe/?keys=&separator=%2F (949.421µs) from=127.0.0.1:9954
Additionally, while Vault is active and serving secrets, we constantly see the same calls being made to the storage backend. We've traced all these back to a single parent token and the child tokens it has issued.
8561c9e01661bbcbcc2a038f1e7787246cadf40c
└── 61da133ade358ec06ae6e6f0152a57235574877f
    ├── 098e2b0af50e1dd46cdb3ed4ba03a0d869c93f7f
    ├── 3dd19860320685b7b34e99ac8cf8db8b2c83081c
    ├── 421cfdbc380512d04780ff506960cfd1743a3759
    ├── 4d54cc468ba8b7089c51dee15dc5693209269c51
    ├── 508c12f556d19093e9167d5dcacda376d10cc5d9
    ├── 560883d56509e907837d14d9c12f74921c7e6624
    ├── 6bacbce05311553b655a29141094fe352afa5427
    ├── 6c1ce5eec8d86db284c3aa8cf56c983614638f7c
    ├── 6e0249999410712a23d8c737a901624664f2fe94
    ├── 7b7b4283827160a365d6d9c8522103a5581b8736
    ├── 840bcfe07c0ce34bae584de5a5de6c1857c6d90a
    ├── 888b51a27895d526857553f8738162661328d74d
    ├── 9544625267777e034883cbd8a576de268d5242b0
    ├── 995823d5e90d6893865bcdaba8678b1fb0939b7a
    ├── 9dc76f1234d11481dcaf614885293c92c463ffe6
    ├── 9f357234883dccdad0caeeb78265589c24d35a12
    ├── ba619887b09ba9be87b8ab8d0a04172f6cda41c7
    ├── be4f6170632f168a81c599276cf93c48abd11beb
    ├── be85887a1bc57a393d903cf6cede8bfa8d083ffe
    ├── cc5113c6c15b26c4c359407f35b49b0b33d8d6f7
    ├── d85cf4528416f66409cc3b5d9433d58a39dc9cc4
    └── dc067819860c1dd23f458b418fcd7aed0b1a697e
The first parent and child have leases, but no lease is found for any of the remaining children. These leases were created on v0.7.3.
{
"request_id": "d07e234f-a561-ce71-ff39-e768f20ee921",
"lease_id": "",
"renewable": false,
"lease_duration": 0,
"data": {
"expire_time": "2017-10-29T18:29:31.208498887Z",
"id": "auth/aws-ec2/login/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
"issue_time": "2017-09-27T18:29:31.208493668Z",
"last_renewal": "2017-09-27T18:29:31.437838733Z",
"renewable": false,
"ttl": -11924360
},
"wrap_info": null,
"warnings": null,
"auth": null
}
{
"request_id": "6346938a-83c4-891f-f264-1f02d2463011",
"lease_id": "",
"renewable": false,
"lease_duration": 0,
"data": {
"expire_time": "2017-11-04T15:49:30.21318925Z",
"id": "auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f",
"issue_time": "2017-09-27T18:29:31.578314239Z",
"last_renewal": "2017-10-03T15:49:30.213189473Z",
"renewable": false,
"ttl": -11327396
},
"wrap_info": null,
"warnings": null,
"auth": null
}
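Output in this shape is what the sys/leases/lookup endpoint returns. For reference, a minimal sketch of pulling the same fields with the Go API client (assuming VAULT_ADDR and VAULT_TOKEN are exported and using one of the lease IDs above) might look like this:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/vault/api"
)

func main() {
	// Assumes VAULT_ADDR and VAULT_TOKEN are set in the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// POST sys/leases/lookup with a lease_id returns the expire_time,
	// issue_time, last_renewal, renewable, and ttl fields shown above.
	secret, err := client.Logical().Write("sys/leases/lookup", map[string]interface{}{
		"lease_id": "auth/aws-ec2/login/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("expire_time=%v ttl=%v\n", secret.Data["expire_time"], secret.Data["ttl"])
}
```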
Has anyone seen this before? I'm trying to figure out how it got into this state in the first place, and I'm not sure how to properly clean it up. I've already tried doing a tidy with no luck.
Thanks in advance for any assistance!
@SoMuchToGrok I am not sure what the problem could be offhand, so please don't take my words as definitive solutions. They are just things you might want to try, given the situation.
Assuming you have the backup of the storage, can you try bringing up v0.7.3 again on it and revoke the problematic token? This might clear its lease and clear out the leases of the child tokens as well.
Another possibility is to perform incremental upgrades through the versions between 0.7.3 and the latest, while keeping an eye on the upgrade guides (https://www.vaultproject.io/guides/upgrading/index.html). This might help figure out which version is failing to handle the upgrade properly.
Thanks for the response @vishalnayak
I have a backup of the storage, but I'm hesitant to restore it because it typically introduces new problems (it tries to revoke secrets that "old" Vault already revoked, and the "new" Vault has no knowledge of the secrets that "old" Vault issued, leaving around secrets that will never get cleaned up). We're using a variety of different secret backends, so this becomes a fairly cumbersome issue to deal with. If there's an approach that's reasonably easy that can mitigate these factors, I'd be okay trying this.
I read through the Vault upgrade guide before upgrading, but nothing appeared to be particularly relevant. Given that the lease in question here had a negative TTL many months before we upgraded to v0.9.5, I'm not certain it's 100% (emphasis on 100%) related to the upgrade. With that said, the new version is clearly enforcing some behavior differently than before.
I have yet to try revoking/force-revoking the problematic token on v0.9.5. It's something I plan to try, but I'd like to understand what's going on here before I start deleting anything. This problem is in my staging environment, so I'm trying to be extra careful as I only have 1 degree of separation from production.
It'd be helpful if you can get a trace log; additionally, when Vault is blocked, sending a SIGQUIT will produce a stack trace, which can help us figure out whether the token issue or something else is the cause of the hang -- it might not be, and might instead be due to some changes in internal locking logic that happened in 0.9.5.
As for the tokens, we've fixed a number of issues over time related to tokens not being cleaned up properly once they have expired (expiration still prevents them from being used; they just aren't cleaned up). It's possible this is something that can be fixed by running https://www.vaultproject.io/api/auth/token/index.html#tidy-tokens
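A minimal sketch of triggering that tidy from the Go API client, assuming VAULT_ADDR and VAULT_TOKEN are set (roughly equivalent to `vault write -f auth/token/tidy` on the CLI):

```go
package main

import (
	"log"

	"github.com/hashicorp/vault/api"
)

func main() {
	// Assumes VAULT_ADDR and VAULT_TOKEN are set in the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// POST /v1/auth/token/tidy kicks off cleanup of dangling token-store
	// entries (it does not revoke live tokens).
	if _, err := client.Logical().Write("auth/token/tidy", map[string]interface{}{}); err != nil {
		log.Fatal(err)
	}
	log.Println("token tidy triggered")
}
```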
@jefferai Thanks for the assistance. I tried hitting auth/token/tidy but it didn't clean up the tokens in question.
1) Here are the trace logs from the first start of Vault, to issuing a SIGQUIT (including the stack trace). Unfortunately I'm not seeing any of the tokens in question referenced in the logs, but hopefully this is helpful.
https://pastebin.com/raw/xDyGLFes
2) Here is the stacktrace we get when we issue a SIGQUIT after issuing a SIGINT.
https://gist.github.com/dmicanzerofox/238d7b557951c786af9b36d654dd288f
One initial comment from the earlier part of the logs: you have a lot of users / resources, across different backends (pki, aws, etc.) where the underlying data (user, etc.) appears to have been removed without Vault's action. Or, possibly, you reloaded Vault data from a backup state after it had already deleted these values. You'll need to use revoke-force (https://www.vaultproject.io/api/system/leases.html#revoke-force) to clean those up.
Took a look at the stacktrace. It's cut off, but based on the very last line of it, I actually think you're hitting a deadlock that is currently fixed in master and will be a part of 0.9.6, which should be coming this week.
Sorry about that - systemd most likely stopped capturing stdout/stderr too early.
Here is the full stack trace:
https://pastebin.com/raw/watS2FX5
(SIGQUIT after SIGINT)
I think this should be fixed in the just-released 0.9.6...please let us know!
Thanks @jefferai
I'll get back to this in a few days and will update with the results!
We're still experiencing this issue on v0.9.6. For more context, I've never manually deleted anything from the storage backend, and I've never restored from a backup (I have in other environments, but not this one).
This issue appears to be identical - https://github.com/hashicorp/vault/issues/4179
Can you get another SIGQUIT stacktrace from 0.9.6? It would be good to see if the changes made from 0.9.5 to 0.9.6 have shifted the root cause.
Stacktrace from 0.9.6
Been looking at the stacktrace. One question: When shutdown is blocked, are you still seeing events in the log? e.g. are those GETs happening while you're waiting for Vault to shut down? If so, how long do you wait before deciding Vault is stuck? Anything interesting in Vault's logs during that time? How long has Vault been up?
I ask because I don't yet see evidence that Vault is blocked, but I do see evidence that Vault is trying to remove large numbers of expired tokens. It may not be that Vault is blocked so much as that so many things are happening at once (reading in leases and trying to expire old ones) that Vault simply hasn't gotten around to servicing the request to shut down yet. I don't think Go's locks provide priority for write operations, so if a few hundred or thousand requests are outstanding and each request is a network event, it could take a while for the shutdown process to grab Vault's state lock successfully.
Another question -- you said it's the same issue as #4179, can you explain why you think so? Did you find that by removing that parent prefix it's instant, e.g. https://github.com/hashicorp/vault/issues/4179#issuecomment-375095112 ?
When the shutdown is blocked, I don't see anything particularly relevant in the Vault logs - but I do see the GETs from the consul monitor (letting the monitor cmd run for a few hours + some cut/uniq magic confirms that we see the same GETs over and over). I've waited at most ~4 days before killing it. Haven't come across anything interesting in the logs in general. Prior to running into this issue (pre-upgrade), Vault had been deployed for roughly 1 year.
I believe it's the same issue as https://github.com/hashicorp/vault/issues/4179 because I see the same loop active from the ~moment Vault turns on until I'm forced to SIGQUIT (it's not just during a shutdown/step-down). It also appears that the parent token relationship he describes is somewhat similar to what I'm seeing here. Haven't tried removing the parent prefix yet as I'd like to be 100% confident in the issue at hand before I start touching things (this is my staging environment and I'm trying my best to treat it as production). With that said, I would be comfortable removing it manually if you think that's the best course of action.
I'd be interested in knowing if https://github.com/hashicorp/vault/issues/4179 "blocks" a shutdown.
Good to know -- from above I didn't see it looping over the same IDs over and over, but if you are seeing that then it does look similar. I'd prefer you not remove it manually yet, as we're working on a patch and it would be good to see if you can try that and have it work for you.
Thanks @jefferai, sounds good.
Any chance you're up for building from master and/or a branch and testing?
Yup, can easily build from whatever.
Still trying to figure this out (since in the other thread you indicated that the changes we put in based on that OP's report did not help you).
I have a couple of theories, but they're not certain yet...and it may be that both are true, or neither.
One is that in your logs there are a ton of leases that are not able to be revoked (AWS users that have since been deleted, postgres users that pg refuses to drop because objects depend on them). Vault will attempt to revoke each of those leases 6 times (and if that still fails it'll stop for that session but try again later). I don't know how many leases these various tokens generated, but for each of them it might be trying to look up the associated token that was encoded in the lease. You could try using https://www.vaultproject.io/api/system/leases.html#revoke-force to clean those up (or, for postgres, change your revocation statements to force pg to drop the user). That will tell Vault to give up trying to actually perform revocation, and allow it to continue cleaning up the leases locally.
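A minimal sketch of driving that revoke-force endpoint from the Go API client; the prefix here is illustrative, and the same call is reachable from the CLI (as shown later in this thread with `vault revoke -force=true -prefix ...`):

```go
package main

import (
	"log"

	"github.com/hashicorp/vault/api"
)

func main() {
	// Assumes VAULT_ADDR and VAULT_TOKEN are set in the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// PUT /v1/sys/leases/revoke-force/<prefix>: Vault gives up on revoking the
	// credentials in the backend and just deletes its own lease entries, so
	// the expiration manager stops retrying them.
	prefix := "aws/creds/userprofile" // illustrative prefix
	if _, err := client.Logical().Write("sys/leases/revoke-force/"+prefix, map[string]interface{}{}); err != nil {
		log.Fatal(err)
	}
	log.Println("force-revoked leases under", prefix)
}
```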
The negative TTL is actually a potential indicator here. I forget how the logic has changed between 0.7.3 and now, but I believe it will try to revoke leases attached to a token first, before the token itself is revoked (which is different from that token being _usable_, which it should not be if the TTL is expired). So seeing such tokens might indeed mean that it's not cleaned up because of the attached leases.
The other possibility is that the hanging you're seeing trying to shut down is related to the expiration attempts, not the tokens. Basically, maybe there is a bad state if a shutdown is triggered while expirations are ongoing. One thing that would be very useful to understand here is if you see the same behavior on 0.9.4. In 0.9.5 we modified some of the sealing logic, and more in 0.9.6, and in general this was to _remove_ some potential races/deadlocks, but it's possible we introduced a new one at the same time. If you see the same behavior in 0.9.4 we'll know it's not related to these changes, which would be a useful data point.
Sounds good @jefferai, appreciate the additional information.
I'll give this a try on 0.9.4 and will report back. Unfortunately might not be able to get around to it for a few days, but will post here as soon as it's done.
Just an update - I will be getting around to this in a few days (ETA Wednesday/Thursday). It's been a ridiculous past few weeks for me :)
Just tested it; appreciate your patience. Unfortunately, still experiencing the same issue on v0.9.4.
Caveat with 0.9.4 - I had to compile the binary against go v1.10.1 because of the PKI/DNS SAN issue in go v1.10.
OK, so that's actually good in a way -- it means that the changes in shutdown logic are not at fault. So I think we're back at "Vault keeps trying to revoke things it can't actually revoke". Can you work on either fixing your revocation sql or using revoke-force as appropriate?
Sounds good - I will start by auditing my postgresql revoke statements.
Is there a recommended approach for "detecting" these issues? Or are logs the quickest/most accurate way to discover these revocation failures?
Logs are the best way at the moment. I think they are raised to error status though so should be easy to find.
Also, it is possible to get Vault logs in a json format, if that makes things easier in your infra.
Alright, so I've audited my revocation statements - they look correct. I did have an issue at one point with postgres/aws where users were deleted outside of Vault, but that was resolved months ago. From what I can tell, each of the leases in question (with a negative TTL) are a result of a 3rd party deleting the user.
I went ahead and ran a force-revoke on some of the leases, but it doesn't seem like Vault is actually deleting anything. I'm seeing these negative TTLs on PKI, postgres, and AWS. Same behavior on all of them when trying to force-revoke.
See output below:
# Lookup lease with negative TTL
vault write sys/leases/lookup lease_id=aws/creds/userprofile/6bae8cb3-20c3-d0c6-0545-1ecc5c5f092f
Key Value
--- -----
expire_time 2017-10-26T11:58:09.810401171Z
id aws/creds/userprofile/6bae8cb3-20c3-d0c6-0545-1ecc5c5f092f
issue_time 2017-09-24T11:58:09.810396474Z
last_renewal 2017-09-24T11:58:10.149875338Z
renewable false
ttl -15650481
# Try running a regular revoke
vault revoke -prefix aws/creds/userprofile/6bae8cb3-20c3-d0c6-0545-1ecc5c5f092f
WARNING! The "vault revoke" command is deprecated. Please use "vault lease
revoke" instead. This command will be removed in Vault 0.11 (or later).
Success! Revoked any leases with prefix: aws/creds/userprofile/6bae8cb3-20c3-d0c6-0545-1ecc5c5f092f
# Lookup again, still exists (as expected)
vault write sys/leases/lookup lease_id=aws/creds/userprofile/6bae8cb3-20c3-d0c6-0545-1ecc5c5f092f
Key Value
--- -----
expire_time 2017-10-26T11:58:09.810401171Z
id aws/creds/userprofile/6bae8cb3-20c3-d0c6-0545-1ecc5c5f092f
issue_time 2017-09-24T11:58:09.810396474Z
last_renewal 2017-09-24T11:58:10.149875338Z
renewable false
ttl -15650505
# Run a force revoke on the same prefix
vault revoke -force=true -prefix aws/creds/userprofile/6bae8cb3-20c3-d0c6-0545-1ecc5c5f092f
WARNING! The "vault revoke" command is deprecated. Please use "vault lease
revoke" instead. This command will be removed in Vault 0.11 (or later).
Warning! Force-removing leases can cause Vault to become out of sync with
secret engines!
Success! Force revoked any leases with prefix: aws/creds/userprofile/6bae8cb3-20c3-d0c6-0545-1ecc5c5f092f
# Lookup again, still exists (not expected)
vault write sys/leases/lookup lease_id=aws/creds/userprofile/6bae8cb3-20c3-d0c6-0545-1ecc5c5f092f
Key Value
--- -----
expire_time 2017-10-26T11:58:09.810401171Z
id aws/creds/userprofile/6bae8cb3-20c3-d0c6-0545-1ecc5c5f092f
issue_time 2017-09-24T11:58:09.810396474Z
last_renewal 2017-09-24T11:58:10.149875338Z
renewable false
ttl -15650524
The problem is that, because it's a prefix command, we ensure that the path ends with /. I think we could add the ability to delete a specific value via these commands. Sadly not for 0.10.1, but I could have a branch for you to test pretty soon.
Sounds good Jeff - will be available to test whenever it's ready. Much appreciated.
Ready -- see https://github.com/hashicorp/vault/pull/4450 or the issue-4143 branch
I was able to successfully run the force-revoke(s) on the leases in question - I've validated that they no longer exist.
Unfortunately I'm still seeing the same sys/token/parent lookup loop. I don't mind taking the same course of action as https://github.com/hashicorp/vault/issues/4179 did (manually deleting entries in the consul backend), but unsure if there is a better approach.
Huh.
It seems you're pretty flexible with testing and data. I'd love to try to figure out what's happening here. We just got another report -- #4462 -- that claims that the problem started after upgrading to 0.10, but didn't say which version they came from. Did you start seeing that parent request loop after an upgrade? If so, from what version to what?
If you're up for it, it would be good to see if we can figure out a particular version where this behavior is seen and maybe even do a git bisect.
One other thing -- @calvn thought he found the issue with the parent request loop, but, it requires not only being > 0.10.0 (which if you're running the issue-4143 branch you are), but then running https://www.vaultproject.io/api/auth/token/index.html#tidy-tokens after upgrading. Can you give that a shot?
@calvn already had me pull down a specific branch a few weeks ago - I ran auth/token/tidy and the issue was still present. For posterity, I just re-ran it again (on https://github.com/hashicorp/vault/pull/4450) and confirmed that I'm still seeing the loop.
I first noticed this issue after upgrading from v0.7.3 to v0.9.5.
I'm up for testing different versions. I'd first need to audit the changelog to determine how far back I can actually go. I know I can't go all the way back to v0.7.3 because the identity mount didn't exist and I've already run into issues trying to restore to that version. I'd assume there could be other incompatibilities as well (with the underlying data store). My approach would be to audit the changes, determine the oldest version that I can reasonably run, and start there. Let me know if you have any feedback on that testing path.
With that said, I won't be able to run any comprehensive version tests until next Wednesday (May 2nd) at the earliest. My staging environment is currently "frozen" for [unrelated] testing.
We think what is happening is that after 0.10.0 and a tidy, the revocation _loop_ itself is fixed, but you still have this huge token hierarchy that needs to be revoked. We found a place where we're not being super efficient at revocation time; fixing it can halve the number of GET requests that are made (see #4465), but it will only make things better. If you can check whether you're actually still seeing a real loop after the 0.10.0/tidy, that would be good to know.
I'm only seeing about ~20 unique entries in the loop, and the loop has been active for ~1 month since running the tidy on the branch @calvn had me try (back on March 27th). Based on the data available to me, it's pointing to it being a legitimate loop (unless it could legitimately take over a month - FWIW this Vault server has 4 cores).
I think I've got an idea of where the loop might be occurring, which is in the DFS logic in revokeTreeSalted. Constructing the tree depends on ts.view.List(ctx, path) to get the children, which get appended to the tree, but if ts.revokeSalted(ctx, id) fails to remove the storage prefix entry (or any of the child entries of a parent - in this case the child storage prefix entries for 61da133ade358ec06ae6e6f0152a57235574877f), then when the traversal gets back to the parent, the List() call will once again append the children. So it will loop infinitely every time it reaches a parent whose child storage entry prefixes were not completely removed.
This also likely explains why the issue started happening after the 0.7.3 -> 0.9.5 upgrade, since the revokeTreeSalted logic changed.
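A toy illustration of that failure mode (this is not Vault's actual revokeTreeSalted code; the IDs and data structures are made up): if per-token cleanup never removes the child entries, re-listing the parent keeps finding the same children.

```go
package main

import "fmt"

type node struct{ parent, id string }

// Toy parent index: parent ID -> child IDs whose index entries still exist.
var index = map[string][]string{
	"61da": {"098e", "3dd1"},
}

// cleanupFails simulates per-token revocation failing, so a child's entry
// under its parent is never deleted - the condition described above.
const cleanupFails = true

func revoke(n node) {
	if cleanupFails {
		return // entry under index[n.parent] is left behind
	}
	// Successful cleanup would remove the child from its parent's index.
	kept := index[n.parent][:0]
	for _, c := range index[n.parent] {
		if c != n.id {
			kept = append(kept, c)
		}
	}
	index[n.parent] = kept
}

func main() {
	stack := []node{{"", "61da"}}
	for iter := 0; len(stack) > 0; iter++ {
		if iter >= 8 {
			fmt.Println("... and so on, forever") // demo guard only; the real traversal has no cap
			return
		}
		n := stack[0]
		children := index[n.id] // the List() call is repeated on every visit to this node
		if len(children) == 0 {
			revoke(n)
			stack = stack[1:]
			continue
		}
		fmt.Printf("iter %d: %s still lists %v, descending again\n", iter, n.id, children)
		kids := make([]node, 0, len(children))
		for _, c := range children {
			kids = append(kids, node{parent: n.id, id: c})
		}
		stack = append(kids, stack...) // visit the children first, then come back to the parent
	}
}
```

Flipping cleanupFails to false makes the traversal terminate, which matches the theory that a failed revokeSalted (leaving the sys/token/parent entries in place) is what keeps the real loop alive.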
That seems very likely (good catch!) and brings up two questions: 1) why are the entries not being removed; 2) why are we re-listing once we're back at the parent? (Apologies if I'm way off base on the second question and we're not, my phone is freezing trying to load the file on GH.)
Hello, I've been working on this with @SoMuchToGrok. This function was one of the initial places we looked, because it showed up in the stack trace. But not knowing anything about the internals, it didn't yield much for us:
Mar 09 20:49:32 v-stag-1-113 vault[3224]: goroutine 28439 [select]:
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /goroot/src/time/sleep.go:170 +0x44
Mar 09 20:49:32 v-stag-1-113 vault[3224]: created by time.goFunc
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vault/expiration.go:995 +0x3f
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vault.(*ExpirationManager).updatePending.func1()
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vault/expiration.go:1027 +0x1d4
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vault.(*ExpirationManager).expireID(0xc42032c1e0, 0xc423af3f00, 0x3b)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vault/expiration.go:473 +0x121
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vault.(*ExpirationManager).Revoke(0xc42032c1e0, 0xc423af3f00, 0x3b, 0x0, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vault/expiration.go:494 +0x367
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vault.(*ExpirationManager).revokeCommon(0xc42032c1e0, 0xc423af3f00, 0x3b, 0x0, 0x0, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vault/expiration.go:1048 +0x83
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vault.(*ExpirationManager).revokeEntry(0xc42032c1e0, 0xc428a36bd0, 0x3b, 0xc428a36bd0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vault/token_store.go:1197 +0x15e
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vault.(*TokenStore).RevokeTree(0xc420759180, 0x34b7960, 0xc42059c1c0, 0xc42767c0c0, 0x24, 0x0, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vault/token_store.go:1214 +0x15e
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vault.(*TokenStore).revokeTreeSalted(0xc420759180, 0x34b7960, 0xc42059c1c0, 0xc42767c1b0, 0x28, 0x0, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vault/barrier_view.go:54 +0x10c
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vault.(*BarrierView).List(0xc420476960, 0x34b7960, 0xc42059c1c0, 0xc42773db90, 0x30, 0x2346916, 0x1, 0xc42773db90, 0x30, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vault/barrier_aes_gcm.go:711 +0x210
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vault.(*AESGCMBarrier).List(0xc42004c780, 0x34b7960, 0xc42059c1c0, 0xc42c348d00, 0x3a, 0x0, 0x0, 0x0, 0x0, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/physical/cache.go:159 +0x62
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/physical.(*Cache).List(0xc420074190, 0x34b7960, 0xc42059c1c0, 0xc42c348d00, 0x3a, 0x0, 0x0, 0x0, 0x0, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vault/sealunwrapper.go:150 +0x62
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vault.(*sealUnwrapper).List(0xc4205d8300, 0x34b7960, 0xc42059c1c0, 0xc42c348d00, 0x3a, 0x0, 0x0, 0x0, 0x0, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/physical/consul/consul.go:453 +0x28f
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/physical/consul.(*ConsulBackend).List(0xc4205a4c30, 0x34b7960, 0xc42059c1c0, 0xc42c348d00, 0x3a, 0x0, 0x0, 0x0, 0x0, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vendor/github.com/hashicorp/consul/api/kv.go:143 +0x13a
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vendor/github.com/hashicorp/consul/api.(*KV).Keys(0xc4201a6aa8, 0xc42c348d40, 0x40, 0x2346916, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vendor/github.com/hashicorp/consul/api/kv.go:165 +0x275
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vendor/github.com/hashicorp/consul/api.(*KV).getInternal(0xc4201a6aa8, 0xc42c348d40, 0x40, 0xc428d049c0, 0x0, 0x429809, 0xc400000008, 0xc429664f80, 0xc42025d680)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vendor/github.com/hashicorp/consul/api/api.go:673 +0xb8
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vendor/github.com/hashicorp/consul/api.(*Client).doRequest(0xc42026e480, 0xc42020c240, 0x2350a71, 0x9, 0xc428ebaba0, 0xc42020c240)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /goroot/src/net/http/client.go:602 +0x28d
Mar 09 20:49:32 v-stag-1-113 vault[3224]: net/http.(*Client).Do(0xc42058dbf0, 0xc420bb8300, 0x3547f40, 0x0, 0xc42957e518)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /goroot/src/net/http/client.go:173 +0xfd
Mar 09 20:49:32 v-stag-1-113 vault[3224]: net/http.(*Client).send(0xc42058dbf0, 0xc420bb8300, 0x0, 0x0, 0x0, 0xc42000dd98, 0x0, 0x1, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /goroot/src/net/http/client.go:249 +0x1a9
Mar 09 20:49:32 v-stag-1-113 vault[3224]: net/http.send(0xc420bb8300, 0x34a27e0, 0xc42032c2d0, 0x0, 0x0, 0x0, 0xc42000dd98, 0x7f58d6e7cd90, 0xc425033528, 0x1)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /goroot/src/net/http/transport.go:413 +0x992
Mar 09 20:49:32 v-stag-1-113 vault[3224]: net/http.(*Transport).RoundTrip(0xc42032c2d0, 0xc420bb8300, 0xc42032c2d0, 0x0, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /goroot/src/net/http/transport.go:1970 +0x608
Mar 09 20:49:32 v-stag-1-113 vault[3224]: net/http.(*persistConn).roundTrip(0xc42a0aca20, 0xc428d04b40, 0x0, 0x0, 0x0)
Can you once again paste the trace logs related to revocations (the pastebin URLs either expired or were removed)? Anything around "failed to revoke entry: {{err}}" should help determine what's causing the revocation to fail before the actual secondary index gets removed.
Additionally, would you be able to test on a build based off https://github.com/hashicorp/vault/pull/4465? Using Jeff's changes and issuing revoke-force to remove invalid leases should get the children whose leases cannot be revoked out of the revocation error (unless there is an issue with the revocation of the child token itself, or some other unknown issue). Additionally, there were unnecessary List operations introduced in 0.10.0 that would be reduced with the changes in the branch.
@jefferai with regards to the loop, Brian brought up a good point in that an error in ts.revokeSalted(ctx, id) would cause the DFS traversal to exit immediately, so maybe the loop is not occurring in there. However, I was tracing the calls manually and I see that there might be a loop happening in the interactions between the token store and the expiration manager. There might be checks and breaks in these methods that actually prevent a loop from happening (such as the usage of entry.NumUses in revokeSalted), but I am not completely sure at this point:
m.revokeEntry -> ts.RevokeTree -> ts.revokeTreeSalted -> ts.revokeSalted -> m.RevokeByToken -> m.revokeCommon -> m.revokeEntry
From my testing it looks like revokeTreeSalted can cycle, e.g. if a child token was somehow the parent of one of the parents. Creating the token entries through the public method prohibited the cycle, but crafting them manually should illustrate it. (Also, I don't think we actually have a cycle :p I just noticed it while doing this investigation)
```
+/*
+2018/03/12 12:15:51 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/6c1ce5eec8d86db284c3aa8cf56c983614638f7c/?keys=&separator=%2F (1.040345ms) from=127.0.0.1:39994
```
@jefferai thanks for pointing me here.
We updated from 0.9.6 to 0.10.0.
I noticed the issue because Vault is running on the Consul server instance and we were using a t2 instance, so after the loop had run for more than 24 hours we ran out of CPU credits and I noticed the behavior.
We never restored the backend from a backup.
But we did have a problem updating to 0.9.0 while using the database (MongoDB) mount - timeouts and really strange things.
I'm not sure if it will help at all, but attached is an analysis of the tokens that show up in the step-down loop.
@burdandrei did you come from an older version prior to 0.9.6? If so how far back?
We finally took the step and tried to revoke the leases connected to the looping tokens:
```
VAULT_CLIENT_TIMEOUT=600 vault revoke -prefix auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
# Error revoking leases with prefix auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f: Put https://x.x.x.x:8200/v1/sys/leases/revoke-prefix/auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
```
```
VAULT_CLIENT_TIMEOUT=600 vault revoke -force=true -prefix auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
```
Both commands froze until timeout was reached.
@dmicanzerofox unfortunately that means that now your state will be different than the last log you sent. We found something odd in it and are still trying to figure out what could have caused it.
@dmicanzerofox I noticed you edited the post where you uploaded bad.token.txt; can you re-post the old version of it? We found the oddity in that file that is no longer present in the newer one.
I have a copy.
bad.tokens.txt
@jefferai this cluster had every version from 0.5, or maybe even 0.4 :)
@calvn yes! sorry, there was a "bug" in the original, I was misattributing the Parent reference.
For the original txt file, I was checking to see if the token id was present in the parent entry. The following is an example from the first txt file.
{
"ID": "8561c9e01661bbcbcc2a038f1e7787246cadf40c",
"Primary": "vault/sys/token/id/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
"Lease": "vault/sys/expire/id/auth/aws-ec2/login/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
"SecondaryIndex": "",
"ParentIndex": "vault/sys/token/parent/8561c9e01661bbcbcc2a038f1e7787246cadf40c/61da133ade358ec06ae6e6f0152a57235574877f",
"Secret": {
"request_id": "4a1dd037-d0c1-08be-2c14-93e8b19604ce",
"lease_id": "",
"lease_duration": 0,
"renewable": false,
"data": {
"expire_time": "2017-10-29T18:29:31.208498887Z",
"id": "auth/aws-ec2/login/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
"issue_time": "2017-09-27T18:29:31.208493668Z",
"last_renewal": "2017-09-27T18:29:31.437838733Z",
"renewable": false,
"ttl": -15557402
},
"warnings": null
}
}
After I posted, I tried to fix the script so that the root nodes wouldn't have any value there, instead of showing a random one of their children :( sorry
{
"ID": "8561c9e01661bbcbcc2a038f1e7787246cadf40c",
"Primary": "vault/sys/token/id/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
"Lease": "vault/sys/expire/id/auth/aws-ec2/login/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
"SecondaryIndex": "",
"ParentIndex": "",
"Secret": {
"request_id": "28987c46-dd71-a4dd-85dc-3b2560665764",
"lease_id": "",
"lease_duration": 0,
"renewable": false,
"data": {
"expire_time": "2017-10-29T18:29:31.208498887Z",
"id": "auth/aws-ec2/login/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
"issue_time": "2017-09-27T18:29:31.208493668Z",
"last_renewal": "2017-09-27T18:29:31.437838733Z",
"renewable": false,
"ttl": -15796333
},
"warnings": null
}
}
Ahh, so this was a mistake in the script that built these objects, and not in Vault itself. That eliminates the theory we had.
Yes! Sorry. This is top priority for us, and we can pretty much get you any data from the vault system datastore that you think might be helpful.
Thank you
@dmicanzerofox were you able to test with a branch based off #4465?
@burdandrei what version of Vault were you previously on?
@calvn
0.9.6
...
0.6.2
=)
@calvn - @dmicanzerofox and myself will be able to test that branch tomorrow (05/02). Will update with the results.
@calvn
We deployed #4465 and tried to revoke one of the expired leases that are looping, but were unable to do so. Something very interesting is that on our previous version the revoke commands hung until the timeout was reached, while the commands below exited immediately:
root@v-stag-4-194:/home/dmican# vault write sys/leases/lookup lease_id=auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
Key Value
--- -----
expire_time 2017-11-04T15:49:30.21318925Z
id auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
issue_time 2017-09-27T18:29:31.578314239Z
last_renewal 2017-10-03T15:49:30.213189473Z
renewable false
ttl -15455338
root@v-stag-4-194:/home/dmican# VAULT_CLIENT_TIMEOUT=600 vault revoke -prefix auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
WARNING! The "vault revoke" command is deprecated. Please use "vault lease
revoke" instead. This command will be removed in Vault 0.11 (or later).
Success! Revoked any leases with prefix: auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
root@v-stag-4-194:/home/dmican# vault write sys/leases/lookup lease_id=auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
Key Value
--- -----
expire_time 2017-11-04T15:49:30.21318925Z
id auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
issue_time 2017-09-27T18:29:31.578314239Z
last_renewal 2017-10-03T15:49:30.213189473Z
renewable false
ttl -15455363
root@v-stag-4-194:/home/dmican# VAULT_CLIENT_TIMEOUT=600 vault revoke -force=true -prefix auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
WARNING! The "vault revoke" command is deprecated. Please use "vault lease
revoke" instead. This command will be removed in Vault 0.11 (or later).
Warning! Force-removing leases can cause Vault to become out of sync with
secret engines!
Success! Force revoked any leases with prefix: auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
root@v-stag-4-194:/home/dmican# vault write sys/leases/lookup lease_id=auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
Key Value
--- -----
expire_time 2017-11-04T15:49:30.21318925Z
id auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
issue_time 2017-09-27T18:29:31.578314239Z
last_renewal 2017-10-03T15:49:30.213189473Z
renewable false
ttl -15455420
After some more investigation, we were able to identify and reproduce the issue. We are actively working on this and will let you know once we have a fix for you to test on.
@dmicanzerofox @burdandrei can you do a build off #4512 and give that a try? We did some internal refactoring on revocation mechanics which should address the blocking calls and infinite looping issues.
@calvn sure will, thanks!
IT WORKED FOR US!!!!! Leases were revoked almost immediately and we were able to force a stepdown! @calvn++++++ @burdandrei !!!!!
Awesome, glad to hear!
@dmicanzerofox @burdandrei thanks for all of the patience you've shown (and @SoMuchToGrok of course!) with us around this -- it wasn't easy to figure out the cause, and it wasn't easy to fix it either. On the other hand, we think the new revocation mechanics we put in place are better in many ways so it should be a win all around.
Absolutely! Glad to help out. I appreciate that the team was so receptive and never gave up! @dmicanzerofox and myself are definitely grateful for all the effort to get this resolved :)