Vault Version: v0.9.5
Operating System/Architecture: Ubuntu 16.04.3
Vault Config File:
backend "consul" {
address = "127.0.0.1:8500"
path = "vault"
token = ""
}
listener "tcp" {
address = "1.1.1.1:8200"
tls_disable = 0
tls_cert_file = "/vault/server.crt"
tls_key_file = "/vault/server.key"
}
max_lease_ttl = "2880h"
Issue:
I'm running into an issue where Vault's shutdown is blocked in such a way that it stops serving secrets, but never gives up the lock (so the secondary node always stays inactive). This problem started after we upgraded to v0.9.5 from v0.7.3. It appears to be stuck in a loop trying to get the following keys from the storage backend:
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/cc5113c6c15b26c4c359407f35b49b0b33d8d6f7/?keys=&separator=%2F (1.026998ms) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/840bcfe07c0ce34bae584de5a5de6c1857c6d90a/?keys=&separator=%2F (1.029913ms) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/d85cf4528416f66409cc3b5d9433d58a39dc9cc4/?keys=&separator=%2F (956.545µs) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/888b51a27895d526857553f8738162661328d74d/?keys=&separator=%2F (975.086µs) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/dc067819860c1dd23f458b418fcd7aed0b1a697e/?keys=&separator=%2F (1.015359ms) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/9544625267777e034883cbd8a576de268d5242b0/?keys=&separator=%2F (962.588µs) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/61da133ade358ec06ae6e6f0152a57235574877f/?keys=&separator=%2F (1.139707ms) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/995823d5e90d6893865bcdaba8678b1fb0939b7a/?keys=&separator=%2F (1.003417ms) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/9dc76f1234d11481dcaf614885293c92c463ffe6/?keys=&separator=%2F (995.442µs) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/098e2b0af50e1dd46cdb3ed4ba03a0d869c93f7f/?keys=&separator=%2F (1.047521ms) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/3dd19860320685b7b34e99ac8cf8db8b2c83081c/?keys=&separator=%2F (975.041µs) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/9f357234883dccdad0caeeb78265589c24d35a12/?keys=&separator=%2F (1.040412ms) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/421cfdbc380512d04780ff506960cfd1743a3759/?keys=&separator=%2F (974.111µs) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/be4f6170632f168a81c599276cf93c48abd11beb/?keys=&separator=%2F (974.154µs) from=127.0.0.1:9954
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/4d54cc468ba8b7089c51dee15dc5693209269c51/?keys=&separator=%2F (994.944µs) from=127.0.0.1:4262
2018/03/16 18:16:16 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/be85887a1bc57a393d903cf6cede8bfa8d083ffe/?keys=&separator=%2F (949.421µs) from=127.0.0.1:9954
Additionally, while Vault is active and serving secrets, we constantly see the same calls being made to the storage backend. We've traced all these back to a single parent token and the child tokens it has issued.
8561c9e01661bbcbcc2a038f1e7787246cadf40c
└── 61da133ade358ec06ae6e6f0152a57235574877f
    ├── 098e2b0af50e1dd46cdb3ed4ba03a0d869c93f7f
    ├── 3dd19860320685b7b34e99ac8cf8db8b2c83081c
    ├── 421cfdbc380512d04780ff506960cfd1743a3759
    ├── 4d54cc468ba8b7089c51dee15dc5693209269c51
    ├── 508c12f556d19093e9167d5dcacda376d10cc5d9
    ├── 560883d56509e907837d14d9c12f74921c7e6624
    ├── 6bacbce05311553b655a29141094fe352afa5427
    ├── 6c1ce5eec8d86db284c3aa8cf56c983614638f7c
    ├── 6e0249999410712a23d8c737a901624664f2fe94
    ├── 7b7b4283827160a365d6d9c8522103a5581b8736
    ├── 840bcfe07c0ce34bae584de5a5de6c1857c6d90a
    ├── 888b51a27895d526857553f8738162661328d74d
    ├── 9544625267777e034883cbd8a576de268d5242b0
    ├── 995823d5e90d6893865bcdaba8678b1fb0939b7a
    ├── 9dc76f1234d11481dcaf614885293c92c463ffe6
    ├── 9f357234883dccdad0caeeb78265589c24d35a12
    ├── ba619887b09ba9be87b8ab8d0a04172f6cda41c7
    ├── be4f6170632f168a81c599276cf93c48abd11beb
    ├── be85887a1bc57a393d903cf6cede8bfa8d083ffe
    ├── cc5113c6c15b26c4c359407f35b49b0b33d8d6f7
    ├── d85cf4528416f66409cc3b5d9433d58a39dc9cc4
    └── dc067819860c1dd23f458b418fcd7aed0b1a697e
The first parent and child have leases, but no lease is found for any of the remaining children. These leases were created on v0.7.3.
{
"request_id": "d07e234f-a561-ce71-ff39-e768f20ee921",
"lease_id": "",
"renewable": false,
"lease_duration": 0,
"data": {
"expire_time": "2017-10-29T18:29:31.208498887Z",
"id": "auth/aws-ec2/login/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
"issue_time": "2017-09-27T18:29:31.208493668Z",
"last_renewal": "2017-09-27T18:29:31.437838733Z",
"renewable": false,
"ttl": -11924360
},
"wrap_info": null,
"warnings": null,
"auth": null
}
{
"request_id": "6346938a-83c4-891f-f264-1f02d2463011",
"lease_id": "",
"renewable": false,
"lease_duration": 0,
"data": {
"expire_time": "2017-11-04T15:49:30.21318925Z",
"id": "auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f",
"issue_time": "2017-09-27T18:29:31.578314239Z",
"last_renewal": "2017-10-03T15:49:30.213189473Z",
"renewable": false,
"ttl": -11327396
},
"wrap_info": null,
"warnings": null,
"auth": null
}
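Output in this shape is what the sys/leases/lookup endpoint returns. For reference, a minimal sketch of pulling the same fields with the Go API client (assuming VAULT_ADDR and VAULT_TOKEN are exported and using one of the lease IDs above) might look like this:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/vault/api"
)

func main() {
	// Assumes VAULT_ADDR and VAULT_TOKEN are set in the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// POST sys/leases/lookup with a lease_id returns the expire_time,
	// issue_time, last_renewal, renewable, and ttl fields shown above.
	secret, err := client.Logical().Write("sys/leases/lookup", map[string]interface{}{
		"lease_id": "auth/aws-ec2/login/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("expire_time=%v ttl=%v\n", secret.Data["expire_time"], secret.Data["ttl"])
}
```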
Has anyone seen this before? I'm trying to figure out how it got into this state in the first place, and I'm not sure how to properly clean it up. I've already tried doing a tidy with no luck.
Thanks in advance for any assistance!
@SoMuchToGrok I am not sure what the problem could be offhand, so please don't take my words as definitive solutions. They are just things you might want to try, given the situation.
Assuming you have the backup of the storage, can you try bringing up v0.7.3 again on it and revoke the problematic token? This might clear its lease and clear out the leases of the child tokens as well.
Another possibility is to perform incremental upgrades through the versions between 0.7.3 and the latest, while keeping an eye on the upgrade guides (https://www.vaultproject.io/guides/upgrading/index.html). This might help figure out which version is failing to handle the upgrade properly.
Thanks for the response @vishalnayak
I have a backup of the storage, but I'm hesitant to restore it because it typically introduces new problems (it tries to revoke secrets that "old" Vault already revoked, and the "new" Vault has no knowledge of the secrets that "old" Vault issued, leaving around secrets that will never get cleaned up). We're using a variety of different secret backends, so this becomes a fairly cumbersome issue to deal with. If there's an approach that's reasonably easy that can mitigate these factors, I'd be okay trying this.
I read through the Vault upgrade guide before upgrading, but nothing appeared to be particularly relevant. Given that the lease in question here had a negative TTL many months before we upgraded to v0.9.5, I'm not certain it's 100% (emphasis on 100%) related to the upgrade. With that said, the new version is clearly enforcing some behavior differently than before.
I have yet to try revoking/force-revoking the problematic token on v0.9.5. It's something I plan to try, but I'd like to understand what's going on here before I start deleting anything. This problem is in my staging environment, so I'm trying to be extra careful as I only have 1 degree of separation from production.
It'd be helpful if you can get a trace log; additionally, when Vault is blocked, sending a SIGQUIT will produce a stack trace, which can help us figure out whether the token issue or something else is the cause of the hang -- it might not be, and might instead be due to some changes in internal locking logic that happened in 0.9.5.
As for the tokens, we've fixed a number of issues over time related to tokens not being cleaned up properly once they have expired (expiration still prevents them from being used; they just aren't cleaned up). It's possible this is something that can be fixed by running https://www.vaultproject.io/api/auth/token/index.html#tidy-tokens
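A minimal sketch of triggering that tidy from the Go API client, assuming VAULT_ADDR and VAULT_TOKEN are set (roughly equivalent to `vault write -f auth/token/tidy` on the CLI):

```go
package main

import (
	"log"

	"github.com/hashicorp/vault/api"
)

func main() {
	// Assumes VAULT_ADDR and VAULT_TOKEN are set in the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// POST /v1/auth/token/tidy kicks off cleanup of dangling token-store
	// entries (it does not revoke live tokens).
	if _, err := client.Logical().Write("auth/token/tidy", map[string]interface{}{}); err != nil {
		log.Fatal(err)
	}
	log.Println("token tidy triggered")
}
```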
@jefferai Thanks for the assistance. I tried hitting auth/token/tidy but it didn't clean up the tokens in question.
1) Here are the trace logs from the first start of Vault, to issuing a SIGQUIT (including the stack trace). Unfortunately I'm not seeing any of the tokens in question referenced in the logs, but hopefully this is helpful.
https://pastebin.com/raw/xDyGLFes
2) Here is the stacktrace we get when we issue a SIGQUIT after issuing a SIGINT.
https://gist.github.com/dmicanzerofox/238d7b557951c786af9b36d654dd288f
One initial comment from the earlier part of the logs: you have a lot of users / resources, across different backends (pki, aws, etc.) where the underlying data (user, etc.) appears to have been removed without Vault's action. Or, possibly, you reloaded Vault data from a backup state after it had already deleted these values. You'll need to use revoke-force (https://www.vaultproject.io/api/system/leases.html#revoke-force) to clean those up.
Took a look at the stacktrace. It's cut off, but based on the very last line of it, I actually think you're hitting a deadlock that is currently fixed in master and will be a part of 0.9.6, which should be coming this week.
Sorry about that - systemd most likely stopped capturing stdout/stderr too early.
Here is the full stack trace:
https://pastebin.com/raw/watS2FX5
(SIGQUIT after SIGINT)
I think this should be fixed in the just-released 0.9.6...please let us know!
Thanks @jefferai
I'll get back to this in a few days and will update with the results!
We're still experiencing this issue on v0.9.6. For more context, I've never manually deleted anything from the storage backend, and I've never restored from a backup (I have in other environments, but not this one).
This issue appears to be identical - https://github.com/hashicorp/vault/issues/4179
Can you get another SIGQUIT stacktrace from 0.9.6? It would be good to see if the changes made from 0.9.5 to 0.9.6 have shifted the root cause.
Stacktrace from 0.9.6
Been looking at the stacktrace. One question: When shutdown is blocked, are you still seeing events in the log? e.g. are those GETs happening while you're waiting for Vault to shut down? If so, how long do you wait before deciding Vault is stuck? Anything interesting in Vault's logs during that time? How long has Vault been up?
I ask because I don't yet see evidence that Vault is blocked, but I do see evidence that Vault is trying to remove large numbers of expired tokens. It may not be that Vault is blocked so much as that so many things are happening at once (reading in leases and trying to expire old ones) that Vault simply hasn't gotten around to servicing the request to shut down yet. I don't think Go's locks provide priority for write operations, so if a few hundred or thousand requests are outstanding and each request is a network event, it could take a while for the shutdown process to grab Vault's state lock successfully.
Another question -- you said it's the same issue as #4179, can you explain why you think so? Did you find that by removing that parent prefix it's instant, e.g. https://github.com/hashicorp/vault/issues/4179#issuecomment-375095112 ?
When the shutdown is blocked, I don't see anything particularly relevant in the Vault logs - but I do see the GETs from the consul monitor (letting the monitor cmd run for a few hours + some cut/uniq magic confirms that we see the same GETs over and over). I've waited at most ~4 days before killing it. Haven't come across anything interesting in the logs in general. Prior to running into this issue (pre-upgrade), Vault had been deployed for roughly 1 year.
I believe it's the same issue as https://github.com/hashicorp/vault/issues/4179 because I see the same loop active from the ~moment Vault turns on until I'm forced to SIGQUIT (it's not just during a shutdown/step-down). It also appears that the parent token relationship he describes is somewhat similar to what I'm seeing here. Haven't tried removing the parent prefix yet as I'd like to be 100% confident in the issue at hand before I start touching things (this is my staging environment and I'm trying my best to treat it as production). With that said, I would be comfortable removing it manually if you think that's the best course of action.
I'd be interested in knowing if https://github.com/hashicorp/vault/issues/4179 "blocks" a shutdown.
Good to know -- from above I didn't see it looping over the same IDs over and over, but if you are seeing that then it does look similar. I'd prefer you not remove it manually yet, as we're working on a patch and it would be good to see if you can try that and have it work for you.
Thanks @jefferai, sounds good.
Any chance you're up for building from master and/or a branch and testing?
Yup, can easily build from whatever.
Still trying to figure this out (since in the other thread you indicated that the changes we put in based on that OP's report did not help you).
I have a couple of theories, but they're not certain yet...and it may be that both are true, or neither.
One is that in your logs there are a ton of leases that are not able to be revoked (AWS users that have since been deleted, postgres users that pg refuses to drop because objects depend on them). Vault will attempt to revoke each of those leases 6 times (and if that still fails it'll stop for that session but try again later). I don't know how many leases these various tokens generated, but for each of them it might be trying to look up the associated token that was encoded in the lease. You could try using https://www.vaultproject.io/api/system/leases.html#revoke-force to clean those up (or, for postgres, change your revocation statements to force pg to drop the user). That will tell Vault to give up trying to actually perform revocation, and allow it to continue cleaning up the leases locally.
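A minimal sketch of driving that revoke-force endpoint from the Go API client; the prefix here is illustrative, and the same call is reachable from the CLI (as shown later in this thread with `vault revoke -force=true -prefix ...`):

```go
package main

import (
	"log"

	"github.com/hashicorp/vault/api"
)

func main() {
	// Assumes VAULT_ADDR and VAULT_TOKEN are set in the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// PUT /v1/sys/leases/revoke-force/<prefix>: Vault gives up on revoking the
	// credentials in the backend and just deletes its own lease entries, so
	// the expiration manager stops retrying them.
	prefix := "aws/creds/userprofile" // illustrative prefix
	if _, err := client.Logical().Write("sys/leases/revoke-force/"+prefix, map[string]interface{}{}); err != nil {
		log.Fatal(err)
	}
	log.Println("force-revoked leases under", prefix)
}
```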
The negative TTL is actually a potential indicator here. I forget how the logic has changed between 0.7.3 and now, but I believe it will try to revoke leases attached to a token first, before the token itself is revoked (which is different from that token being _usable_, which it should not be if the TTL is expired). So seeing such tokens might indeed mean that it's not cleaned up because of the attached leases.
The other possibility is that the hanging you're seeing trying to shut down is related to the expiration attempts, not the tokens. Basically, maybe there is a bad state if a shutdown is triggered while expirations are ongoing. One thing that would be very useful to understand here is if you see the same behavior on 0.9.4. In 0.9.5 we modified some of the sealing logic, and more in 0.9.6, and in general this was to _remove_ some potential races/deadlocks, but it's possible we introduced a new one at the same time. If you see the same behavior in 0.9.4 we'll know it's not related to these changes, which would be a useful data point.
Sounds good @jefferai, appreciate the additional information.
I'll give this a try on 0.9.4 and will report back. Unfortunately might not be able to get around to it for a few days, but will post here as soon as it's done.
Just an update - I will be getting around to this in a few days (ETA Wednesday/Thursday). It's been a ridiculous past few weeks for me :)
Just tested it; appreciate your patience. Unfortunately, still experiencing the same issue on v0.9.4.
Caveat with 0.9.4 - I had to compile the binary against go v1.10.1 because of the PKI/DNS SAN issue in go v1.10.
OK, so that's actually good in a way -- it means that the changes in shutdown logic are not at fault. So I think we're back at "Vault keeps trying to revoke things it can't actually revoke". Can you work on either fixing your revocation sql or using revoke-force as appropriate?
Sounds good - I will start by auditing my postgresql revoke statements.
Is there a recommended approach for "detecting" these issues? Or are logs the quickest/most accurate way to discover these revocation failures?
Logs are the best way at the moment. I think they are raised to error status though so should be easy to find.
Also, it is possible to get Vault logs in a json format, if that makes things easier in your infra.
Alright, so I've audited my revocation statements - they look correct. I did have an issue at one point with postgres/aws where users were deleted outside of Vault, but that was resolved months ago. From what I can tell, each of the leases in question (with a negative TTL) are a result of a 3rd party deleting the user.
I went ahead and ran a force-revoke on some of the leases, but it doesn't seem like Vault is actually deleting anything. I'm seeing these negative TTLs on PKI, postgres, and AWS. Same behavior on all of them when trying to force-revoke.
See output below:
# Lookup lease with negative TTL
vault write sys/leases/lookup lease_id=aws/creds/userprofile/6bae8cb3-20c3-d0c6-0545-1ecc5c5f092f
Key Value
--- -----
expire_time 2017-10-26T11:58:09.810401171Z
id aws/creds/userprofile/6bae8cb3-20c3-d0c6-0545-1ecc5c5f092f
issue_time 2017-09-24T11:58:09.810396474Z
last_renewal 2017-09-24T11:58:10.149875338Z
renewable false
ttl -15650481
# Try running a regular revoke
vault revoke -prefix aws/creds/userprofile/6bae8cb3-20c3-d0c6-0545-1ecc5c5f092f
WARNING! The "vault revoke" command is deprecated. Please use "vault lease
revoke" instead. This command will be removed in Vault 0.11 (or later).
Success! Revoked any leases with prefix: aws/creds/userprofile/6bae8cb3-20c3-d0c6-0545-1ecc5c5f092f
# Lookup again, still exists (as expected)
vault write sys/leases/lookup lease_id=aws/creds/userprofile/6bae8cb3-20c3-d0c6-0545-1ecc5c5f092f
Key Value
--- -----
expire_time 2017-10-26T11:58:09.810401171Z
id aws/creds/userprofile/6bae8cb3-20c3-d0c6-0545-1ecc5c5f092f
issue_time 2017-09-24T11:58:09.810396474Z
last_renewal 2017-09-24T11:58:10.149875338Z
renewable false
ttl -15650505
# Run a force revoke on the same prefix
vault revoke -force=true -prefix aws/creds/userprofile/6bae8cb3-20c3-d0c6-0545-1ecc5c5f092f
WARNING! The "vault revoke" command is deprecated. Please use "vault lease
revoke" instead. This command will be removed in Vault 0.11 (or later).
Warning! Force-removing leases can cause Vault to become out of sync with
secret engines!
Success! Force revoked any leases with prefix: aws/creds/userprofile/6bae8cb3-20c3-d0c6-0545-1ecc5c5f092f
# Lookup again, still exists (not expected)
vault write sys/leases/lookup lease_id=aws/creds/userprofile/6bae8cb3-20c3-d0c6-0545-1ecc5c5f092f
Key Value
--- -----
expire_time 2017-10-26T11:58:09.810401171Z
id aws/creds/userprofile/6bae8cb3-20c3-d0c6-0545-1ecc5c5f092f
issue_time 2017-09-24T11:58:09.810396474Z
last_renewal 2017-09-24T11:58:10.149875338Z
renewable false
ttl -15650524
The problem is that, because it's a prefix command, we ensure that the path ends with /. I think we could add the ability to delete a specific value via these commands. Sadly not for 0.10.1, but I could have a branch for you to test pretty soon.
Sounds good Jeff - will be available to test whenever it's ready. Much appreciated.
Ready -- see https://github.com/hashicorp/vault/pull/4450 or the issue-4143 branch
I was able to successfully run the force-revoke(s) on the leases in question - I've validated that they no longer exist.
Unfortunately I'm still seeing the same sys/token/parent lookup loop. I don't mind taking the same course of action as https://github.com/hashicorp/vault/issues/4179 did (manually deleting entries in the consul backend), but unsure if there is a better approach.
Huh.
It seems you're pretty flexible with testing and data. I'd love to try to figure out what's happening here. We just got another report -- #4462 -- that claims that the problem started after upgrading to 0.10, but didn't say which version they came from. Did you start seeing that parent request loop after an upgrade? If so, from what version to what?
If you're up for it, it would be good to see if we can figure out a particular version where this behavior is seen and maybe even do a git bisect.
One other thing -- @calvn thought he found the issue with the parent request loop, but, it requires not only being > 0.10.0 (which if you're running the issue-4143 branch you are), but then running https://www.vaultproject.io/api/auth/token/index.html#tidy-tokens after upgrading. Can you give that a shot?
@calvn already had me pull down a specific branch a few weeks ago - I ran auth/token/tidy and the issue was still present. For posterity, I just re-ran it again (on https://github.com/hashicorp/vault/pull/4450) and confirmed that I'm still seeing the loop.
I first noticed this issue after upgrading from v0.7.3 to v0.9.5.
I'm up for testing different versions. I'd first need to audit the changelog to determine how far back I can actually go. I know I can't go all the way back to v0.7.3 because the identity mount didn't exist and I've already run into issues trying to restore to that version. I'd assume there could be other incompatibilities as well (with the underlying data store). My approach would be to audit the changes, determine the oldest version that I can reasonably run, and start there. Let me know if you have any feedback on that testing path.
With that said, I won't be able to run any comprehensive version tests until next Wednesday (May 2nd) at the earliest. My staging environment is currently "frozen" for [unrelated] testing.
We think what is happening is that after 0.10.0 and a tidy, the revocation _loop_ itself is fixed, but you still have this huge token hierarchy that needs to be revoked. We found a place where we're not being super efficient at revocation time; fixing it can halve the number of GET requests that are made (see #4465), but it will only make things better. If you can check whether you're actually still seeing a real loop after the 0.10.0/tidy, that would be good to know.
I'm only seeing about ~20 unique entries in the loop, and the loop has been active for ~1 month since running the tidy on the branch @calvn had me try (back on March 27th). Based on the data available to me, it's pointing to it being a legitimate loop (unless it could legitimately take over a month - FWIW this Vault server has 4 cores).
I think I've got an idea of where the loop might be occurring, which is in the DFS logic in revokeTreeSalted. Constructing the tree depends on ts.view.List(ctx, path) to get the children, which get appended to the tree, but if ts.revokeSalted(ctx, id) fails to remove the storage prefix entry (or any of the child entries of a parent - in this case the child storage prefix entries for 61da133ade358ec06ae6e6f0152a57235574877f), then when the traversal gets back to the parent, the List() call will once again append the children. So it will loop infinitely every time it reaches a parent whose child storage entry prefixes were not completely removed.
This also likely explains why the issue started happening after the 0.7.3 -> 0.9.5 upgrade, since the revokeTreeSalted logic changed.
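A toy illustration of that failure mode (this is not Vault's actual revokeTreeSalted code; the IDs and data structures are made up): if per-token cleanup never removes the child entries, re-listing the parent keeps finding the same children.

```go
package main

import "fmt"

type node struct{ parent, id string }

// Toy parent index: parent ID -> child IDs whose index entries still exist.
var index = map[string][]string{
	"61da": {"098e", "3dd1"},
}

// cleanupFails simulates per-token revocation failing, so a child's entry
// under its parent is never deleted - the condition described above.
const cleanupFails = true

func revoke(n node) {
	if cleanupFails {
		return // entry under index[n.parent] is left behind
	}
	// Successful cleanup would remove the child from its parent's index.
	kept := index[n.parent][:0]
	for _, c := range index[n.parent] {
		if c != n.id {
			kept = append(kept, c)
		}
	}
	index[n.parent] = kept
}

func main() {
	stack := []node{{"", "61da"}}
	for iter := 0; len(stack) > 0; iter++ {
		if iter >= 8 {
			fmt.Println("... and so on, forever") // demo guard only; the real traversal has no cap
			return
		}
		n := stack[0]
		children := index[n.id] // the List() call is repeated on every visit to this node
		if len(children) == 0 {
			revoke(n)
			stack = stack[1:]
			continue
		}
		fmt.Printf("iter %d: %s still lists %v, descending again\n", iter, n.id, children)
		kids := make([]node, 0, len(children))
		for _, c := range children {
			kids = append(kids, node{parent: n.id, id: c})
		}
		stack = append(kids, stack...) // visit the children first, then come back to the parent
	}
}
```

Flipping cleanupFails to false makes the traversal terminate, which matches the theory that a failed revokeSalted (leaving the sys/token/parent entries in place) is what keeps the real loop alive.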
That seems very likely (good catch!) and brings up two questions: 1) why are the entries not being removed; 2) why are we re-listing once we're back at the parent? (Apologies if I'm way off base on the second question and we're not, my phone is freezing trying to load the file on GH.)
Hello, I've been working on this with @SoMuchToGrok. This function was one of the initial places we looked, because it showed up in the stack trace. But not knowing anything about the internals, it didn't yield much for us:
Mar 09 20:49:32 v-stag-1-113 vault[3224]: goroutine 28439 [select]:
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /goroot/src/time/sleep.go:170 +0x44
Mar 09 20:49:32 v-stag-1-113 vault[3224]: created by time.goFunc
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vault/expiration.go:995 +0x3f
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vault.(*ExpirationManager).updatePending.func1()
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vault/expiration.go:1027 +0x1d4
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vault.(*ExpirationManager).expireID(0xc42032c1e0, 0xc423af3f00, 0x3b)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vault/expiration.go:473 +0x121
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vault.(*ExpirationManager).Revoke(0xc42032c1e0, 0xc423af3f00, 0x3b, 0x0, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vault/expiration.go:494 +0x367
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vault.(*ExpirationManager).revokeCommon(0xc42032c1e0, 0xc423af3f00, 0x3b, 0x0, 0x0, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vault/expiration.go:1048 +0x83
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vault.(*ExpirationManager).revokeEntry(0xc42032c1e0, 0xc428a36bd0, 0x3b, 0xc428a36bd0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vault/token_store.go:1197 +0x15e
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vault.(*TokenStore).RevokeTree(0xc420759180, 0x34b7960, 0xc42059c1c0, 0xc42767c0c0, 0x24, 0x0, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vault/token_store.go:1214 +0x15e
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vault.(*TokenStore).revokeTreeSalted(0xc420759180, 0x34b7960, 0xc42059c1c0, 0xc42767c1b0, 0x28, 0x0, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vault/barrier_view.go:54 +0x10c
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vault.(*BarrierView).List(0xc420476960, 0x34b7960, 0xc42059c1c0, 0xc42773db90, 0x30, 0x2346916, 0x1, 0xc42773db90, 0x30, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vault/barrier_aes_gcm.go:711 +0x210
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vault.(*AESGCMBarrier).List(0xc42004c780, 0x34b7960, 0xc42059c1c0, 0xc42c348d00, 0x3a, 0x0, 0x0, 0x0, 0x0, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/physical/cache.go:159 +0x62
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/physical.(*Cache).List(0xc420074190, 0x34b7960, 0xc42059c1c0, 0xc42c348d00, 0x3a, 0x0, 0x0, 0x0, 0x0, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vault/sealunwrapper.go:150 +0x62
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vault.(*sealUnwrapper).List(0xc4205d8300, 0x34b7960, 0xc42059c1c0, 0xc42c348d00, 0x3a, 0x0, 0x0, 0x0, 0x0, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/physical/consul/consul.go:453 +0x28f
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/physical/consul.(*ConsulBackend).List(0xc4205a4c30, 0x34b7960, 0xc42059c1c0, 0xc42c348d00, 0x3a, 0x0, 0x0, 0x0, 0x0, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vendor/github.com/hashicorp/consul/api/kv.go:143 +0x13a
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vendor/github.com/hashicorp/consul/api.(*KV).Keys(0xc4201a6aa8, 0xc42c348d40, 0x40, 0x2346916, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vendor/github.com/hashicorp/consul/api/kv.go:165 +0x275
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vendor/github.com/hashicorp/consul/api.(*KV).getInternal(0xc4201a6aa8, 0xc42c348d40, 0x40, 0xc428d049c0, 0x0, 0x429809, 0xc400000008, 0xc429664f80, 0xc42025d680)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /gopath/src/github.com/hashicorp/vault/vendor/github.com/hashicorp/consul/api/api.go:673 +0xb8
Mar 09 20:49:32 v-stag-1-113 vault[3224]: github.com/hashicorp/vault/vendor/github.com/hashicorp/consul/api.(*Client).doRequest(0xc42026e480, 0xc42020c240, 0x2350a71, 0x9, 0xc428ebaba0, 0xc42020c240)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /goroot/src/net/http/client.go:602 +0x28d
Mar 09 20:49:32 v-stag-1-113 vault[3224]: net/http.(*Client).Do(0xc42058dbf0, 0xc420bb8300, 0x3547f40, 0x0, 0xc42957e518)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /goroot/src/net/http/client.go:173 +0xfd
Mar 09 20:49:32 v-stag-1-113 vault[3224]: net/http.(*Client).send(0xc42058dbf0, 0xc420bb8300, 0x0, 0x0, 0x0, 0xc42000dd98, 0x0, 0x1, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /goroot/src/net/http/client.go:249 +0x1a9
Mar 09 20:49:32 v-stag-1-113 vault[3224]: net/http.send(0xc420bb8300, 0x34a27e0, 0xc42032c2d0, 0x0, 0x0, 0x0, 0xc42000dd98, 0x7f58d6e7cd90, 0xc425033528, 0x1)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /goroot/src/net/http/transport.go:413 +0x992
Mar 09 20:49:32 v-stag-1-113 vault[3224]: net/http.(*Transport).RoundTrip(0xc42032c2d0, 0xc420bb8300, 0xc42032c2d0, 0x0, 0x0)
Mar 09 20:49:32 v-stag-1-113 vault[3224]: /goroot/src/net/http/transport.go:1970 +0x608
Mar 09 20:49:32 v-stag-1-113 vault[3224]: net/http.(*persistConn).roundTrip(0xc42a0aca20, 0xc428d04b40, 0x0, 0x0, 0x0)
Can you once again paste the trace logs related to revocations (the pastebin URLs either expired or were removed)? Anything around "failed to revoke entry: {{err}}" should help determine what's causing the revocation to fail before the actual secondary index gets removed.
Additionally, would you be able to test on a build based off https://github.com/hashicorp/vault/pull/4465? Using Jeff's changes and issuing revoke-force to remove invalid leases should get the children whose leases cannot be revoked out of the revocation error (unless there is an issue with the revocation of the child token itself, or some other unknown issue). Additionally, there were unnecessary List operations introduced in 0.10.0 that would be reduced with the changes in the branch.
@jefferai with regards to the loop, Brian brought up a good point in that an error in ts.revokeSalted(ctx, id) would cause the DFS traversal to exit immediately, so maybe the loop is not occurring in there. However, I was tracing the calls manually and I see that there might be a loop happening in the interactions between the token store and the expiration manager. There might be checks and breaks in these methods that actually prevent a loop from happening (such as the usage of entry.NumUses in revokeSalted), but I am not completely sure at this point:
m.revokeEntry -> ts.RevokeTree -> ts.revokeTreeSalted -> ts.revokeSalted -> m.RevokeByToken -> m.revokeCommon -> m.revokeEntry
From my testing it looks like revokeTreeSalted can cycle, e.g. if a child token was somehow the parent of one of the parents. Creating the token entries through the public method prohibited the cycle, but crafting them manually should illustrate it. (Also, I don't think we actually have a cycle :p I just noticed it while doing this investigation)
```
+/*
+2018/03/12 12:15:51 [DEBUG] http: Request GET /v1/kv/vault/sys/token/parent/6c1ce5eec8d86db284c3aa8cf56c983614638f7c/?keys=&separator=%2F (1.040345ms) from=127.0.0.1:39994
```
@jefferai thanks for pointing me here.
We updated from 0.9.6 to 0.10.0.
I noticed the issue because Vault is running on the Consul server instance and we were using a t2 instance, so after the loop had run for more than 24 hours we ran out of CPU credits and I noticed the behavior.
We never restored the backend from a backup.
But we did have a problem updating to 0.9.0 while using the database (MongoDB) mount - timeouts and really strange things.
I'm not sure if it will help at all, but attached is an analysis of the tokens that show up in the step-down loop.
@burdandrei did you come from an older version prior to 0.9.6? If so how far back?
We finally took the step and tried to revoke the leases connected to the looping tokens:
```
VAULT_CLIENT_TIMEOUT=600 vault revoke -prefix auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
# Error revoking leases with prefix auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f: Put https://x.x.x.x:8200/v1/sys/leases/revoke-prefix/auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
```
```
VAULT_CLIENT_TIMEOUT=600 vault revoke -force=true -prefix auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
```
Both commands froze until timeout was reached.
@dmicanzerofox unfortunately that means that now your state will be different than the last log you sent. We found something odd in it and are still trying to figure out what could have caused it.
@dmicanzerofox I noticed you edited the post where you uploaded bad.token.txt; can you re-post the old version of it? We found the oddity in that file that is no longer present in the newer one.
I have a copy.
bad.tokens.txt
@jefferai this cluster had every version from 0.5, or maybe even 0.4 :)
@calvn yes! sorry, there was a "bug" in the original, I was misattributing the Parent reference.
For the original txt file, I was checking to see if the token id was present in the parent entry. The following is an example from the first txt file.
{
"ID": "8561c9e01661bbcbcc2a038f1e7787246cadf40c",
"Primary": "vault/sys/token/id/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
"Lease": "vault/sys/expire/id/auth/aws-ec2/login/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
"SecondaryIndex": "",
"ParentIndex": "vault/sys/token/parent/8561c9e01661bbcbcc2a038f1e7787246cadf40c/61da133ade358ec06ae6e6f0152a57235574877f",
"Secret": {
"request_id": "4a1dd037-d0c1-08be-2c14-93e8b19604ce",
"lease_id": "",
"lease_duration": 0,
"renewable": false,
"data": {
"expire_time": "2017-10-29T18:29:31.208498887Z",
"id": "auth/aws-ec2/login/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
"issue_time": "2017-09-27T18:29:31.208493668Z",
"last_renewal": "2017-09-27T18:29:31.437838733Z",
"renewable": false,
"ttl": -15557402
},
"warnings": null
}
}
After I posted, I tried to fix the script so that the root nodes wouldn't have any value there, instead of showing a random one of their children :( sorry
{
"ID": "8561c9e01661bbcbcc2a038f1e7787246cadf40c",
"Primary": "vault/sys/token/id/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
"Lease": "vault/sys/expire/id/auth/aws-ec2/login/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
"SecondaryIndex": "",
"ParentIndex": "",
"Secret": {
"request_id": "28987c46-dd71-a4dd-85dc-3b2560665764",
"lease_id": "",
"lease_duration": 0,
"renewable": false,
"data": {
"expire_time": "2017-10-29T18:29:31.208498887Z",
"id": "auth/aws-ec2/login/8561c9e01661bbcbcc2a038f1e7787246cadf40c",
"issue_time": "2017-09-27T18:29:31.208493668Z",
"last_renewal": "2017-09-27T18:29:31.437838733Z",
"renewable": false,
"ttl": -15796333
},
"warnings": null
}
}
Ahh, so this was a mistake in the script that built these objects, and not in Vault itself. That eliminates the theory we had.
Yes! Sorry. This is top priority for us, and we can pretty much get you any data from the vault system datastore that you think might be helpful.
Thank you
@dmicanzerofox were you able to test with a branch based off #4465?
@burdandrei what version of Vault were you previously on?
@calvn
0.9.6
...
0.6.2
=)
@calvn - @dmicanzerofox and myself will be able to test that branch tomorrow (05/02). Will update with the results.
@calvn
We deployed #4465 and tried to revoke one of the expired leases that are looping, but were unable to do so. Something very interesting is that on our previous version the revoke commands hung until the timeout was reached, while the commands below exited immediately:
root@v-stag-4-194:/home/dmican# vault write sys/leases/lookup lease_id=auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
Key Value
--- -----
expire_time 2017-11-04T15:49:30.21318925Z
id auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
issue_time 2017-09-27T18:29:31.578314239Z
last_renewal 2017-10-03T15:49:30.213189473Z
renewable false
ttl -15455338
root@v-stag-4-194:/home/dmican# VAULT_CLIENT_TIMEOUT=600 vault revoke -prefix auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
WARNING! The "vault revoke" command is deprecated. Please use "vault lease
revoke" instead. This command will be removed in Vault 0.11 (or later).
Success! Revoked any leases with prefix: auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
root@v-stag-4-194:/home/dmican# vault write sys/leases/lookup lease_id=auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
Key Value
--- -----
expire_time 2017-11-04T15:49:30.21318925Z
id auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
issue_time 2017-09-27T18:29:31.578314239Z
last_renewal 2017-10-03T15:49:30.213189473Z
renewable false
ttl -15455363
root@v-stag-4-194:/home/dmican# VAULT_CLIENT_TIMEOUT=600 vault revoke -force=true -prefix auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
WARNING! The "vault revoke" command is deprecated. Please use "vault lease
revoke" instead. This command will be removed in Vault 0.11 (or later).
Warning! Force-removing leases can cause Vault to become out of sync with
secret engines!
Success! Force revoked any leases with prefix: auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
root@v-stag-4-194:/home/dmican# vault write sys/leases/lookup lease_id=auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
Key Value
--- -----
expire_time 2017-11-04T15:49:30.21318925Z
id auth/token/create/61da133ade358ec06ae6e6f0152a57235574877f
issue_time 2017-09-27T18:29:31.578314239Z
last_renewal 2017-10-03T15:49:30.213189473Z
renewable false
ttl -15455420
After some more investigation, we were able to identify and reproduce the issue. We are actively working on this and will let you know once we have a fix for you to test on.
@dmicanzerofox @burdandrei can you do a build off #4512 and give that a try? We did some internal refactoring on revocation mechanics which should address the blocking calls and infinite looping issues.
@calvn sure will, thanks!
IT WORKED FOR US!!!!! Leases were revoked almost immediately and we were able to force a stepdown! @calvn++++++ @burdandrei !!!!!
Awesome, glad to hear!
@dmicanzerofox @burdandrei thanks for all of the patience you've shown (and @SoMuchToGrok of course!) with us around this -- it wasn't easy to figure out the cause, and it wasn't easy to fix it either. On the other hand, we think the new revocation mechanics we put in place are better in many ways so it should be a win all around.
Absolutely! Glad to help out. I appreciate that the team was so receptive and never gave up! @dmicanzerofox and myself are definitely grateful for all the effort to get this resolved :)