Vault: when Vault has many expired MySQL leases to revoke, port 8200 stops responding

Created on 18 Dec 2018 · 20 comments · Source: hashicorp/vault

Describe the bug
When Vault has a lot of expired leases to revoke for MySQL, port 8200 stops responding to everything, even /v1/sys/health. After another standby Vault picks up the active role, it tries to revoke those leases again and logs many lines like "failed to revoke entry: resp: (*logical.Response)(nil) err: Error 1396: Operation DROP USER failed for 'v-kubernetes-dev-config-6juuHkPM'@'%'", because the MySQL user was already dropped by the previous active Vault. The now-active Vault may then also stop responding on port 8200, and the loop repeats.

To Reproduce
Steps to reproduce the behavior:

  1. Create a lot of MySQL credentials (in my case only 15k MySQL credentials); a rough sketch is shown after this list
  2. Before those credentials expire, shut down all Vault instances (we found this problem when using Vault in a dev environment after it had been running for a few weeks; shutting Vault down is just the easiest way to reproduce the problem)
  3. After all MySQL credentials have expired, start all Vault instances again
  4. Around 1 to 2 minutes later, Vault stops responding on port 8200
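
A minimal sketch of step 1, assuming the vault CLI, a valid token in the environment, and a database secrets engine role named DB_ROLE (placeholder):

# Request many short-lived MySQL credentials so their leases pile up.
export VAULT_ADDR=https://vault.services:8200
for i in $(seq 1 15000); do
  vault read database/creds/DB_ROLE > /dev/null
done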

Expected behavior
Vault should be able to revoke those credentials and keep serving requests (not only MySQL credential creation/revocation, but also login and normal secret operations).

Environment:

  • Vault Server Version (retrieve with vault status): Tested on 0.11.1 & 1.0.1
  • Vault CLI Version (retrieve with vault version): Only used via the HTTP API
  • Server Operating System/Architecture: Kubernetes 1.10 on Debian 9.6

Vault server configuration file(s):

{"api_addr":"https://vault.services:8200","disable_clustering":true,"listener":{"tcp":{"address":"0.0.0.0:8200","tls_cert_file":"/vault/tls/server.crt","tls_key_file":"/vault/tls/server.key"}},"storage":{"mysql":{"address":"MYSQL","ha_enabled":"true","password":"PASSWORD","username":"USER"}},"telemetry":{"statsd_address":"localhost:9125"},"ui":true}

Additional context
The MySQL credentials are created on another MySQL server, not on Vault's storage backend.

Labels: bug, core/lease, secret/database, version/1.0.x, waiting-for-response

Most helpful comment

I just had a similar error (although with a PostgreSQL backend).

The vault instance stayed locked for a whole weekend, and when trying to unseal it, it tried to revoke many leases, in parallel, and reached the maximum parallel db connections of the postgres instance.

I solved it by specifying a low value for max_parallel (I set it as low as "2" without any issues so far), and everything just worked fine.

Hope this helps anybody running into this issue.

All 20 comments

Vault logs from when this is happening would be helpful. Generally revocation is not synchronous upon Vault loading; each lease is loaded, and revocation happens asynchronously (but is kicked off immediately) if the lease has expired.

Below is the log from the first active Vault after step 3. There were so many similar lines in the middle (just with different IDs) that I removed them.

==> Vault server configuration:

             Api Address: https://vault.services:8200
                     Cgo: disabled
              Listener 1: tcp (addr: "0.0.0.0:8200", max_request_duration: "1m30s", max_request_size: "33554432", tls: "enabled")
               Log Level: debug
                   Mlock: supported: true, enabled: true
                 Storage: mysql (HA available)
                 Version: Vault v1.0.0
             Version Sha: c19cef14891751a23eaa9b41fd456d1f99e7e856

==> Vault server started! Log data will stream in below:

2018-12-19T04:33:06.141Z [DEBUG] storage.cache: creating LRU cache: size=0
2018-12-19 04:33:06.730016 I | [ERR] Error flushing to statsd! Err: write udp [::1]:32965->[::1]:9125: write: connection refused
2018-12-19T04:33:06.960Z [DEBUG] core: cannot unseal, not enough keys: keys=1 threshold=3 nonce=5b793d66-8346-8611-824d-c790d9d6cc71
2018-12-19T04:33:06.988Z [DEBUG] core: cannot unseal, not enough keys: keys=2 threshold=3 nonce=5b793d66-8346-8611-824d-c790d9d6cc71
2018-12-19T04:33:07.013Z [INFO]  core: vault is unsealed
2018-12-19T04:33:07.013Z [INFO]  core: entering standby mode
2018-12-19T04:33:07.048Z [INFO]  core: acquired lock, enabling active operation
2018-12-19T04:33:07.050Z [DEBUG] core: generating cluster private key
2018-12-19T04:33:07.082Z [DEBUG] core: generating local cluster certificate
2018-12-19T04:33:07.105Z [INFO]  core: post-unseal setup starting
2018-12-19T04:33:07.105Z [DEBUG] core: clearing forwarding clients
2018-12-19T04:33:07.105Z [DEBUG] core: done clearing forwarding clients
2018-12-19T04:33:07.105Z [INFO]  core: loaded wrapping token key
2018-12-19T04:33:07.105Z [INFO]  core: successfully setup plugin catalog: plugin-directory=
2018-12-19T04:33:07.106Z [INFO]  core: successfully mounted backend: type=kv path=secret/
2018-12-19T04:33:07.106Z [INFO]  core: successfully mounted backend: type=system path=sys/
2018-12-19T04:33:07.106Z [INFO]  core: successfully mounted backend: type=identity path=identity/
2018-12-19T04:33:07.106Z [INFO]  core: successfully mounted backend: type=database path=database/
2018-12-19T04:33:07.106Z [INFO]  core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2018-12-19T04:33:07.109Z [INFO]  core: successfully enabled credential backend: type=token path=token/
2018-12-19T04:33:07.109Z [INFO]  core: successfully enabled credential backend: type=gcp path=gcp/
2018-12-19T04:33:07.109Z [INFO]  core: successfully enabled credential backend: type=kubernetes path=kubernetes/
2018-12-19T04:33:07.109Z [INFO]  core: restoring leases
2018-12-19T04:33:07.109Z [DEBUG] expiration: collecting leases
2018-12-19T04:33:07.109Z [INFO]  rollback: starting rollback manager
2018-12-19T04:33:07.109Z [DEBUG] identity: loading entities
2018-12-19T04:33:07.110Z [DEBUG] identity: entities collected: num_existing=4
2018-12-19T04:33:07.110Z [DEBUG] identity: entities loading: progress=0
2018-12-19T04:33:07.113Z [INFO]  identity: entities restored
2018-12-19T04:33:07.113Z [DEBUG] identity: identity loading groups
2018-12-19T04:33:07.113Z [DEBUG] identity: groups collected: num_existing=0
2018-12-19T04:33:07.113Z [INFO]  identity: groups restored
2018-12-19T04:33:07.114Z [INFO]  core: clustering disabled, not starting listeners
2018-12-19T04:33:07.114Z [INFO]  core: post-unseal setup complete
2018-12-19T04:33:07.242Z [DEBUG] expiration: leases collected: num_existing=33346
2018-12-19T04:33:07.318Z [DEBUG] expiration: leases loading: progress=500
...
...
...
...
2018-12-19T04:33:40.296Z [DEBUG] expiration: leases loading: progress=33000
2018-12-19T04:33:40.312Z [INFO]  expiration: revoked lease: lease_id=database/creds/DB_ROLE/1jU6jhvh2FY2XHt7hRvK6PYv
...
...
2018-12-19T04:34:37.846Z [INFO]  expiration: revoked lease: lease_id=database/creds/DB_ROLE/2xhhqWSxOyyD2OTu0MiV9jJ8
2018-12-19T04:34:37.849Z [ERROR] expiration: failed to revoke lease: lease_id=database/creds/DB_ROLE/2w2gPF12h9uSOoGjGFvE1PwP error="failed to revoke entry: resp: (*logical.Response)(nil) err: context deadline exceeded"
2018-12-19T04:34:37.896Z [INFO]  expiration: revoked lease: lease_id=database/creds/DB_ROLE/2xuvXeo4noWuxZdLz0P0Gsdu
2018-12-19T04:34:37.899Z [ERROR] expiration: failed to revoke lease: lease_id=database/creds/DB_ROLE/30Pe1t5w4PWrW9Rbke5uND3O error="failed to revoke entry: resp: (*logical.Response)(nil) err: context deadline exceeded"
2018-12-19T04:34:37.899Z [ERROR] expiration: failed to revoke lease: lease_id=database/creds/DB_ROLE/2xuMObYvCJPBTrN7E6DXQ2ME error="failed to revoke entry: resp: (*logical.Response)(nil) err: context deadline exceeded"
2018-12-19T04:34:37.899Z [ERROR] expiration: failed to revoke lease: lease_id=database/creds/DB_ROLE/2zUphsnkNXVgHBI34nTmgrdk error="failed to revoke entry: resp: (*logical.Response)(nil) err: context deadline exceeded"
...
...
2018-12-19T04:34:44.469Z [INFO]  expiration: revoked lease: lease_id=database/creds/DB_ROLE/4D544LxbHBMJ32HTq9BGktLk
2018-12-19T04:34:44.472Z [INFO]  expiration: revoked lease: lease_id=database/creds/DB_ROLE/3lXnWSFXH1YZBdVGsZPCiYJ1
2018-12-19T04:34:44.486Z [ERROR] expiration: failed to revoke lease: lease_id=database/creds/DB_ROLE/1Kj8yp1P2gyrBJTzTQKQyDGx error="failed to revoke entry: resp: (*logical.Response)(nil) err: Error 1396: Operation DROP USER failed for 'v-kubernetes-K8S_ROLE-1VbxgLpV'@'%'"
2018-12-19T04:34:44.525Z [INFO]  expiration: revoked lease: lease_id=database/creds/DB_ROLE/5JJWV3gyaAFidl6pAydkvgiy
2018-12-19T04:34:44.536Z [ERROR] expiration: failed to revoke lease: lease_id=database/creds/DB_ROLE/2lg6ll63dRhed717TVoBEyAr error="failed to revoke entry: resp: (*logical.Response)(nil) err: Error 1396: Operation DROP USER failed for 'v-kubernetes-K8S_ROLE-6XnGsAFk'@'%'"
2018-12-19T04:34:44.576Z [INFO]  expiration: revoked lease: lease_id=database/creds/DB_ROLE/7i4b4fMQXfx3FhJI3qOlGCXh
2018-12-19T04:34:44.598Z [INFO]  expiration: revoked lease: lease_id=database/creds/DB_ROLE/7NOxSoJzvD5Bu300HN6jl4ch
2018-12-19T04:34:44.614Z [ERROR] expiration: failed to revoke lease: lease_id=database/creds/DB_ROLE/1chmA3UK24GD54Y86Yht0BIO error="failed to revoke entry: resp: (*logical.Response)(nil) err: Error 1396: Operation DROP USER failed for 'v-kubernetes-K8S_ROLE-1hcYJAyN'@'%'"
2018-12-19T04:34:44.638Z [ERROR] expiration: failed to revoke lease: lease_id=database/creds/DB_ROLE/2pqVoZIYW2L6UMu8f8DJPn3B error="failed to revoke entry: resp: (*logical.Response)(nil) err: Error 1396: Operation DROP USER failed for 'v-kubernetes-K8S_ROLE-42HQ8ElL'@'%'"
2018-12-19T04:34:44.678Z [ERROR] expiration: failed to revoke lease: lease_id=database/creds/DB_ROLE/2QcCbOSFFWTv7A5Grk8C0tv5 error="failed to revoke entry: resp: (*logical.Response)(nil) err: context deadline exceeded"
...
...
2018-12-19T04:35:08.745Z [ERROR] expiration: failed to revoke lease: lease_id=database/creds/DB_ROLE/3BtM4Zccn9dnZoyHAVdVVYUW error="failed to revoke entry: resp: (*logical.Response)(nil) err: context deadline exceeded"
2018-12-19T04:35:08.745Z [ERROR] expiration: failed to revoke lease: lease_id=database/creds/DB_ROLE/2fuOzrJUGCRnSMg8q9lJbGwT error="failed to revoke entry: resp: (*logical.Response)(nil) err: context deadline exceeded"
2018-12-19T04:35:08.825Z [INFO]  core: pre-seal teardown starting
2018-12-19T04:35:08.825Z [DEBUG] core: clustering disabled, not stopping listeners
2018-12-19T04:35:08.825Z [DEBUG] expiration: stop triggered
2018-12-19T04:35:08.825Z [ERROR] expiration: shutting down, not attempting further revocation of lease: lease_id=database/creds/DB_ROLE/ZeVcMJAU1Eo469HyZiiNbBKO
2018-12-19T04:35:08.825Z [ERROR] expiration: shutting down, not attempting further revocation of lease: lease_id=database/creds/DB_ROLE/1UegkpXBN4lmiMAvJ0juxizC
2018-12-19T04:35:08.829Z [DEBUG] expiration: finished stopping
2018-12-19T04:35:08.829Z [INFO]  rollback: stopping rollback manager
2018-12-19T04:35:08.829Z [INFO]  core: pre-seal teardown complete
2018-12-19T04:35:08.833Z [ERROR] expiration: shutting down, not attempting further revocation of lease: lease_id=database/creds/DB_ROLE/gvE0FQG9KGdv9Bb9HspE01V0
2018-12-19T04:35:08.838Z [ERROR] expiration: shutting down, not attempting further revocation of lease: lease_id=database/creds/DB_ROLE/3Vir43GBO2febFfIYjULc11n
...
...
2018-12-19T04:35:13.457Z [ERROR] expiration: failed to revoke lease: lease_id=database/creds/DB_ROLE/O8AfzLrmWM8BOBb6fbFLMjk9 error="failed to revoke entry: resp: (*logical.Response)(nil) err: error verifying connection: context canceled"
...
...

I don't see anything in the logs indicating that Vault shouldn't be ready for connections. In fact this line:

2018-12-19T04:33:07.114Z [INFO] core: post-unseal setup complete

suggests that it is. When you say that port 8200 has no response, what do you mean? What kind of request are you sending? Does vault status work?

If I had to guess, I'd guess that the failing expirations are causing client requests that need to read or write storage to time out because the MySQL server is swamped. I think the first thing you need to do is figure out why all of those revocations are failing. Maybe your revoke SQL is incorrect, or the configured user doesn't have the right permissions.
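
(For illustration, a hedged example of what a MySQL role with explicit revocation statements can look like; the actual role definition was not shared in this issue, and DB_ROLE and my-mysql are placeholders. Using DROP USER IF EXISTS, supported since MySQL 5.7, also avoids Error 1396 when the user is already gone.)

vault write database/roles/DB_ROLE \
    db_name=my-mysql \
    creation_statements="CREATE USER '{{name}}'@'%' IDENTIFIED BY '{{password}}'; GRANT SELECT ON *.* TO '{{name}}'@'%';" \
    revocation_statements="DROP USER IF EXISTS '{{name}}'@'%';" \
    default_ttl=1h \
    max_ttl=24h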

Port 8200 does not respond to any request, e.g. vault login, getting a DB credential, and even the health check endpoint (/v1/sys/health). Normally the health check endpoint should respond regardless of whether the node is active, standby, or sealed.
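
For example, a quick unauthenticated check against the address from the config above (the endpoint normally answers with 200 for active, 429 for standby, and 503 for sealed nodes; -k only if the TLS certificate is not in the local trust store):

curl -sk -o /dev/null -w '%{http_code}\n' https://vault.services:8200/v1/sys/health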

I am sure the revoke SQL is correct and the configured user has enough permissions, because I already tried repeatedly requesting new DB credentials with a TTL of 1 minute, and Vault dropped those DB users correctly.

So I really have no idea why it stops responding and tries to revoke DB users that have already been dropped...

It looks like your Vault is in the process of shutting down, which is why all those revoke requests are failing and why core: pre-seal teardown starting is logged. Are there any logs indicating why you lost leadership? Perhaps some sort of connectivity issue with the database?

I tested it again. Port 8200 gave no response (even /v1/sys/health) before the core: pre-seal teardown starting log line appeared. At the same time I tailed the Vault backend MySQL: no unusual log entries, and it was still able to serve queries.

After another Vault picked up the active role, it didn't die this time. After a while it logged many [ERROR] expiration: maximum revoke attempts reached: lease_id=database/creds/DB_ROLE/... entries, and the functional DB still had 17k users that were never removed (for this test I created 30k users). There were also no error logs on the functional DB, and Vault showed normal messages with no further revoke-related logs.


Remark:
There are 2 different DBs involved:

  1. The Vault backend MySQL
  2. Another MySQL whose users are managed by Vault; let's call it the functional DB

I just had a similar error (although with a PostgreSQL backend).

The vault instance stayed locked for a whole weekend, and when trying to unseal it, it tried to revoke many leases, in parallel, and reached the maximum parallel db connections of the postgres instance.

I solved it by specifying a low value for max_parallel (I set it as low as "2" without any issues so far), and everything just worked fine.

Hope this helps anybody running into this issue.

@toadjaune where did you set the max_parallel parameter? I have the exact same issue but with an MSSQL backend, and changing that parameter does not seem to have any effect.

I am facing a similar issue and I updated max_parallel from 32 to 4; will keep this under observation and update.

EDIT:

seems to work immediately.

@krish7919 Where did you set this? What does the rest of the config look like?

I use a postgres backend, documented here.
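
(For reference, max_parallel is set in the storage stanza. With the mysql storage backend from the original report, the stanza would look like this fragment, with only the max_parallel key added and the rest of the config file unchanged; the postgresql storage backend accepts the same option.)

"storage": {
  "mysql": {
    "address": "MYSQL",
    "ha_enabled": "true",
    "password": "PASSWORD",
    "username": "USER",
    "max_parallel": "2"
  }
}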

Thanks. I am using mssql. max_parallel had no effect.

Hi @alantang888!

Is this still a thing for you? Have you tried to set the max_parallel option to a lower value?

Cheers,
Michel

Hi @michelvocks, I don't know. I don't work on that anymore.

@alantang888 Thanks for the quick reply. I will close this issue for now since we were unable to reproduce it. Don't hesitate to open a new issue if you face this issue again.

Hi, I am working at the same company @alantang888 used to work for, and we still have this issue. We were using max_parallel 45; I tried changing it to 10, but it didn't help.
The /v1/sys/health endpoint is responsive, but when I try to run a command like vault read database/creds/readonly-my-database-name-here
it just hangs there and times out after 1 minute. It works sometimes, but in the current state it's very easy to reproduce.

In the log I can see this:

2020-02-20T10:39:31.511Z [DEBUG] rollback: attempting rollback: path=database/
2020-02-20T10:39:31.511Z [DEBUG] rollback: attempting rollback: path=sys/
2020-02-20T10:39:31.511Z [DEBUG] rollback: attempting rollback: path=auth/gcp/
2020-02-20T10:39:31.511Z [DEBUG] rollback: attempting rollback: path=auth/token/
2020-02-20T10:40:06.843Z [INFO]  expiration: revoked lease: lease_id=auth/kubernetes/login/hd56edb15ea1cdd167a48c816edde597b53e605c57578cdfee1b01bb721545392
2020-02-20T10:40:19.747Z [ERROR] expiration: failed to revoke lease: lease_id=database/creds/redacted/8Kn9tApGs5WEj6tT396iKyGD error="failed to revoke entry: resp: (*logical.Response)(nil) err: context deadline exceeded"
2020-02-20T10:40:19.748Z [ERROR] expiration: failed to revoke lease: lease_id=database/creds/redacted/4iouazb76Eu1N4TsADkJ1UAu error="failed to revoke entry: resp: (*logical.Response)(nil) err: error verifying connection: context deadline exceeded"
2020-02-20T10:40:19.759Z [ERROR] expiration: failed to revoke lease: lease_id=database/creds/redacted/6pTapcMUCkVSHdSJ1XqS4w13 error="failed to revoke entry: resp: (*logical.Response)(nil) err: context deadline exceeded"
2020-02-20T10:40:19.774Z [ERROR] expiration: failed to revoke lease: lease_id=database/creds/redacted/1lnOfXElZE7nPrq6xmOprE4h error="failed to revoke entry: resp: (*logical.Response)(nil) err: context deadline exceeded"
2020-02-20T10:40:19.774Z [ERROR] expiration: failed to revoke lease: lease_id=database/creds/redacted/4QpSUInnSdUc4blMRq9SGufH error="failed to revoke entry: resp: (*logical.Response)(nil) err: context deadline exceeded"
2020-02-20T10:40:19.818Z [INFO]  expiration: revoked lease: lease_id=database/creds/redacted/srCf1RXnHMUthgO9gxV2dyV6
2020-02-20T10:40:19.819Z [INFO]  expiration: revoked lease: lease_id=database/creds/redacted/5ah6Ywxpg360Wk1P4vQVMvsa
2020-02-20T10:40:20.048Z [ERROR] expiration: failed to revoke lease: lease_id=database/creds/redacted/5LVQb9wCLKMBT8aMnGp5ZDOn error="failed to revoke entry: resp: (*logical.Response)(nil) err: error verifying connection: context deadline exceeded"
2020-02-20T10:40:20.048Z [ERROR] expiration: failed to revoke lease: lease_id=database/creds/redacted/7Hed5jyVGzhX6dCfwzSJmMR5 error="failed to revoke entry: resp: (*logical.Response)(nil) err: error verifying connection: context deadline exceeded"
2020-02-20T10:40:31.511Z [DEBUG] rollback: attempting rollback: path=auth/kubernetes/
2020-02-20T10:40:31.511Z [DEBUG] rollback: attempting rollback: path=secret/
2020-02-20T10:40:31.511Z [DEBUG] rollback: attempting rollback: path=auth/token/
2020-02-20T10:40:31.511Z [DEBUG] rollback: attempting rollback: path=database/
2020-02-20T10:40:31.511Z [DEBUG] rollback: attempting rollback: path=sys/
2020-02-20T10:40:31.511Z [DEBUG] rollback: attempting rollback: path=cubbyhole/
2020-02-20T10:40:31.511Z [DEBUG] rollback: attempting rollback: path=identity/
2020-02-20T10:40:31.511Z [DEBUG] rollback: attempting rollback: path=auth/gcp/
2020-02-20T10:40:37.717Z [INFO]  expiration: revoked lease: lease_id=auth/kubernetes/login/h50bc0be297caec72361d2a1c4d7b4b9296cf84703cc54eda4fa0b8b2744e917d
2020-02-20T10:41:02.084Z [INFO]  expiration: revoked lease: lease_id=auth/kubernetes/login/hc20c50630b0f975067c3d9338649124ecc52d70d131e7206a27d4ea202c32b90

We are still at Vault 1.0.1, BTW.

Hi @dataviruset, I remember this may be related to #6058. I also added a cron job to clean up those DB credentials as a temporary workaround. If that has since been removed, you can find the script at https://gist.github.com/alantang888/2be13e178ad0e606de91caa821579649
I hope it still helps, and good luck.
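
(A rough sketch of what such a periodic cleanup can look like, not the linked gist; DB_ROLE is a placeholder, and cron needs VAULT_TOKEN or a token helper with permission to revoke under that prefix.)

# Hourly cron entry: revoke every outstanding lease under the role's prefix.
0 * * * * VAULT_ADDR=https://vault.services:8200 vault lease revoke -prefix database/creds/DB_ROLE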

Thanks @alantang888 for the update. I tried running your script for revoking old leases, but nothing was revoked because it seems we don't have any old/stale leases. I checked and we have around 130 active leases; maybe that's causing bad performance for some reason.
I enabled query logging on our MySQL backend to try to see what Vault is doing. Running a grep on "database", I can't see that many queries related to the database secrets engine, though.
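
(In case it helps others, one way to do this, assuming SUPER privileges on the backend MySQL; the log path is a placeholder.)

# Run against the MySQL storage-backend server:
mysql -u root -p -e "SET GLOBAL general_log_file='/var/log/mysql/general.log'; SET GLOBAL general_log='ON';"
# Then, on that host, look for statements touching the database secrets engine's storage:
grep -i database /var/log/mysql/general.log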

I just hit this again today with the new max_parallel setting mentioned above. It had worked when we had ~10k leases before. Today, Vault was tracking ~67k leases and failed during a restart with an OOM. I will keep digging, but the Vault HA story is far from done, it seems.

EDIT:
Seems similar to #6058 mentioned above, but with the Postgres secrets engine.

I have opened https://github.com/hashicorp/vault/issues/9018 as this one is closed.
