Vault: identitystore entity leak causes major performance issues

Created on 17 Apr 2020 · 11 comments · Source: hashicorp/vault

Describe the bug
The identity store's entities only grow and never shrink, causing long failover times (more than 10 minutes) due to entity loading and a storagepacker bottleneck during authentication. It also causes high memory usage (spiking to 110 GB when we list entities).

We currently have over 5.7 million entities and growing:

vault read -format=json sys/internal/counters/entities 
{
  "request_id": "c332e801-8d1d-1d7a-3ab2-9540b9bf280a",
  "lease_id": "",
  "lease_duration": 0,
  "renewable": false,
  "data": {
    "counters": {
      "entities": {
        "total": 5708878
      }
    }
  },
  "warnings": null
}

This means each of the 256 storagepacker buckets holds around 22.3k entities. Each time a login occurs, CreateOrFetchEntity() is called, which reads a storagepacker bucket, decompresses it, unmarshals the protobuf, modifies it, marshals it, compresses it, and writes the bucket back. As a result we see Vault fall over at only about 500 GCP auth requests per minute.
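
To make the write amplification concrete, here is a rough, hypothetical sketch of that read-modify-write cycle. It is not Vault's actual storagepacker code (which uses protobuf); JSON and gzip are used only to keep the sketch self-contained.

// Hypothetical illustration of the per-login read-modify-write cycle on one
// storagepacker bucket; not Vault's real implementation.
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
	"io"
)

// bucket stands in for a storagepacker bucket: one storage entry that packs
// many entities together.
type bucket struct {
	Items map[string]string `json:"items"` // entity ID -> serialized entity
}

// addEntity shows why every new login is O(bucket size): the whole bucket is
// decompressed, decoded, modified, re-encoded, and re-compressed even though
// only one entity changed.
func addEntity(stored []byte, id, entity string) ([]byte, error) {
	b := bucket{Items: map[string]string{}}
	if len(stored) > 0 {
		zr, err := gzip.NewReader(bytes.NewReader(stored))
		if err != nil {
			return nil, err
		}
		raw, err := io.ReadAll(zr)
		if err != nil {
			return nil, err
		}
		if err := json.Unmarshal(raw, &b); err != nil {
			return nil, err
		}
	}
	b.Items[id] = entity // the only real change

	raw, err := json.Marshal(b)
	if err != nil {
		return nil, err
	}
	var out bytes.Buffer
	zw := gzip.NewWriter(&out)
	if _, err := zw.Write(raw); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return out.Bytes(), nil // the entire bucket is written back to storage
}

func main() {
	data, _ := addEntity(nil, "entity-1", "alias:gce-instance-123")
	fmt.Println("bucket size after one entity:", len(data), "bytes")
}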

This also caused a Consul storage migration to fail due to the large key sizes of the storagepacker buckets.

To Reproduce
Authenticate millions of times with unique identities.
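
For illustration only, a hypothetical load pattern that reproduces this with the official Go client (github.com/hashicorp/vault/api); the auth/gcp mount path, role name, and per-instance JWTs are placeholders.

// Hypothetical sketch: every login presenting a previously unseen identity
// (here, a JWT signed for a brand-new GCE instance) makes Vault create a new
// entity and alias, so the identity store only ever grows.
package main

import (
	"log"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	client, err := vault.NewClient(vault.DefaultConfig()) // VAULT_ADDR from env
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder: one signed instance-identity JWT per unique GCE instance.
	perInstanceJWTs := []string{ /* ... */ }

	for _, jwt := range perInstanceJWTs {
		_, err := client.Logical().Write("auth/gcp/login", map[string]interface{}{
			"role": "my-gce-role", // placeholder role name
			"jwt":  jwt,
		})
		if err != nil {
			log.Println("login failed:", err)
		}
	}
}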

Expected behavior
I expect stale entities to be cleaned up so the identity store stays at a reasonable size.

Environment:

  • Vault Server Version (retrieve with vault status):
$ vault status
Key                    Value
---                    -----
Version                1.4.0+prem
  • Vault CLI Version (retrieve with vault version):
$ vault version
Vault v1.1.1 (cgo)
  • Server Operating System/Architecture:
    Linux, AMD64

Additional context
We've been fighting these perf issues for over a year.

It seems the only places Vault deletes dangling entities from the storagepacker are for entities and group entities whose namespace is nil:
https://github.com/hashicorp/vault/blob/rel-1.4.0/vault/identity_store_util.go#L112
https://github.com/hashicorp/vault/blob/rel-1.4.0/vault/identity_store_util.go#L280

Labels: bug, core/identity

All 11 comments

Can we safely delete these entities? Preferably by deleting the bucket keys directly.

cc @briankassouf

Is there any broader discussion happening on managing identities more sustainably?

For our use cases it would work nicely to have the detailed identity info (gcp instance id, etc) stored inside of a batch token, and have the batch tokens be truly stateless. I'm not sure if that would be compatible with the broader identities api though since we don't use that. Generally we try to use batch tokens wherever possible for better performance.

I think this got automatically closed because of the GCP auth plugin reference. Can it be reopened? I am unable to reopen it.

After cleaning up the stale entities, our instance is now allocating about 0.5% of the memory it did at peak, and our failovers now take seconds instead of more than 10 minutes. Identity store operations are fast again too, so we expect GCE auth to also be performant.

Thank you for your work in addressing the causes and improving metrics around this @briankassouf, @pcman312, @tyrannosaurus-becks, and all others involved.

For future and ongoing remediation, maybe it makes sense to add a vacuum/tidy background process or an API to purge stale entities? Please let me know your thoughts; I might be able to assist.

Yes, this is something we want to do as part of a larger effort, though I can't speak as to timeline.

I'm going to close this issue because I don't think there's anything left to be done here, and the tidy work will be tracked on our internal roadmap.

Wow, I think this might be the root cause of performance/memory issues we've been having with Vault with GCP Auth + GCS Storage backend (we have multiple batch compute workloads with spot instances hitting Vault with 1000s of new nodes a day).

Is there any more guidance on how to deal with large numbers of entities, or on the performance implications of the alias configuration?
It's also not clear to me that upgrading to the versions of Vault with updated entity mapping logic would fix performance issues since it sounds like Vault will still be looking up all this identity information.

@duplo83 Do you have any tips for cleanup/how you identified stale entities to be deleted? Or did you just go ahead and delete most of them and let them get regenerated?

It almost certainly is from the situation that you described @sidewinder12s. If you are on a recent enough version of Vault, you can obtain entity count metrics. You can also configure the GCP auth backend now to not generate so many unique identities.
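
For context, and hedged since the thread doesn't name the exact settings: to my knowledge the relevant knobs are gce_alias/iam_alias on auth/gcp/config, and setting them to role_id stops each instance from minting its own entity. A minimal sketch with the Go client, assuming the method is mounted at auth/gcp:

// Sketch only: collapse per-instance aliases into one alias per role so new
// logins reuse an existing entity. gce_alias/iam_alias and the "role_id"
// value are the settings I believe the GCP auth plugin exposes; verify them
// against the plugin version you run.
package main

import (
	"log"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	client, err := vault.NewClient(vault.DefaultConfig()) // VAULT_ADDR/VAULT_TOKEN from env
	if err != nil {
		log.Fatal(err)
	}
	_, err = client.Logical().Write("auth/gcp/config", map[string]interface{}{
		"gce_alias": "role_id",
		"iam_alias": "role_id",
	})
	if err != nil {
		log.Fatal(err)
	}
}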

We had a script run through and delete the old identities. There are a few considerations though (a rough sketch of the approach follows the list below). I could share the script if it's okay with @briankassouf. We modified it from what Brian gave us.

  • You have to dump all of the entities first. You'll probably need to increase your request timeouts because it may take a few minutes. You will also need sufficient free memory (ours increased by 10s of gigabytes IIRC).
  • You can then walk through the entities and delete the ones that are older than a TTL (this requires an additional lookup per entity).
  • You'll need to start slow because deleting is intensive on the identity store. It will speed up as you work through them. It'll probably take days from start to finish if it is anything like what we had to do.
  • GCS is a relatively high latency backend so it may slow things down more.
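
As mentioned above, here is a rough sketch of the approach, not the script we actually ran. It assumes the official Go client (github.com/hashicorp/vault/api), VAULT_ADDR and VAULT_TOKEN in the environment, and a hypothetical 90-day cutoff; the last_update_time field and the pacing are the parts to verify against your Vault version.

// Rough, hypothetical sketch of a stale-entity cleanup pass.
package main

import (
	"fmt"
	"log"
	"time"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	// The full entity listing can take minutes and a lot of memory.
	client.SetClientTimeout(10 * time.Minute)

	list, err := client.Logical().List("identity/entity/id")
	if err != nil || list == nil {
		log.Fatal("listing entities failed: ", err)
	}
	cutoff := time.Now().Add(-90 * 24 * time.Hour) // hypothetical staleness TTL

	for _, k := range list.Data["keys"].([]interface{}) {
		id := k.(string)

		// Each entity needs an additional read to learn how old it is.
		ent, err := client.Logical().Read("identity/entity/id/" + id)
		if err != nil || ent == nil {
			continue
		}
		updated, err := time.Parse(time.RFC3339, ent.Data["last_update_time"].(string))
		if err != nil || updated.After(cutoff) {
			continue
		}

		if _, err := client.Logical().Delete("identity/entity/id/" + id); err != nil {
			log.Printf("delete %s: %v", id, err)
			continue
		}
		fmt.Println("deleted", id)

		// Start slow: each delete rewrites a whole storagepacker bucket.
		time.Sleep(100 * time.Millisecond)
	}
}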

I am now seeing huge latency spikes with GCS writes (apparently because writes to a single object are limited to once per second) for a few storagepacker buckets. What's weird is that this is impacting Vault login/availability for a few users, but those users are using batch tokens, which I thought didn't touch the storage backend.

Might you have any idea of what would be trying to write the storage packer entries multiple times a second days after I stopped deleting entities? I don't see a ton of information in commits, issues or discussion pages about how storage packer behaves.

Batch token creation still creates an identity in the identity store (which uses the storage packer buckets).
If GCS has an object write rate limitation of once per second, then it probably won't be compatible with your rate of login since there are only 256 storage packer buckets.
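
As a rough upper bound, and assuming new entities were spread perfectly evenly across buckets: 256 buckets × 1 object update per second ≈ 256 entity-creating logins per second as an absolute ceiling, and effectively about 1 per second when one workload's new entities keep landing in the same bucket.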

Yup, I think I was seeing the high rate of change for metadata keys that is called out in the Docs.
I was basically slamming 1 packer bucket during login of one of our batch compute services. I fairly rapidly upgraded from 1.3 to 1.5 and I think it just got lost in all the cleanup I had been doing.

Specifically, GCS has an update limit of once per second for a single object.

Thanks for your help/responses.
