Describe the bug
The Vault process leaves a large number of sockets in the CLOSE_WAIT state. A socket in CLOSE_WAIT is supposed to be released by the owning process via a close() system call. For some reason, Vault does not release these sockets and eventually reaches the max open file limit, leading to a service failure.
To Reproduce
Steps to reproduce the behavior:
systemctl start vault
Expected behavior
The Vault process should release any socket left in the CLOSE_WAIT state.
Environment:
Vault Server Version (retrieve with vault status): 0.11.3
Vault CLI Version (retrieve with vault version):
Vault server configuration file(s):
cluster_name = "dc1"
max_lease_ttl = "768h"
default_lease_ttl = "768h"
listener "tcp" {
address = "foo.bar.foobar.io:8200"
tls_cert_file = "/etc/vault.d/tls/vault.crt"
tls_key_file = "/etc/vault.d/tls/vault.key"
tls_min_version = "tls12"
tls_cipher_suites = "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA"
tls_prefer_server_cipher_suites = "false"
tls_disable = "false"
}
api_addr = "http://foo.bar.foobar.io:8200"
storage "dynamodb" {
ha_enabled = "true"
region = "eu-west-1"
endpoint = ""
table = "vault-data"
session_token = "Hashicorp Vault"
max_parallel = "128"
read_capacity = "5"
write_capacity = "5"
}
Additional context
[root@foo ~]# systemctl status vault -l
● vault.service - Vault secret management tool
Loaded: loaded (/usr/lib/systemd/system/vault.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2018-11-20 15:37:34 UTC; 1 weeks 1 days ago
Main PID: 25708 (vault)
CGroup: /system.slice/vault.service
└─25708 /usr/local/bin/vault server -config=/etc/vault.d/vault_main.hcl -log-level=info
Nov 29 15:15:26 foo.bar.foobar.io vault[25708]: 2018-11-29T15:15:26.175Z [INFO ] http: Accept error: accept tcp 172.18.32.152:8200: accept4: too many open files; retrying in 1s
Nov 29 15:15:27 foo.bar.foobar.io vault[25708]: 2018-11-29T15:15:27.175Z [INFO ] http: Accept error: accept tcp 172.18.32.152:8200: accept4: too many open files; retrying in 1s
Nov 29 15:15:28 foo.bar.foobar.io vault[25708]: 2018-11-29T15:15:28.175Z [INFO ] http: Accept error: accept tcp 172.18.32.152:8200: accept4: too many open files; retrying in 1s
Nov 29 15:15:29 foo.bar.foobar.io vault[25708]: 2018-11-29T15:15:29.176Z [INFO ] http: Accept error: accept tcp 172.18.32.152:8200: accept4: too many open files; retrying in 1s
Nov 29 15:15:30 foo.bar.foobar.io vault[25708]: 2018-11-29T15:15:30.176Z [INFO ] http: Accept error: accept tcp 172.18.32.152:8200: accept4: too many open files; retrying in 1s
Nov 29 15:15:31 foo.bar.foobar.io vault[25708]: 2018-11-29T15:15:31.176Z [INFO ] http: Accept error: accept tcp 172.18.32.152:8200: accept4: too many open files; retrying in 1s
Nov 29 15:15:32 foo.bar.foobar.io vault[25708]: 2018-11-29T15:15:32.177Z [INFO ] http: Accept error: accept tcp 172.18.32.152:8200: accept4: too many open files; retrying in 1s
Nov 29 15:15:33 foo.bar.foobar.io vault[25708]: 2018-11-29T15:15:33.177Z [INFO ] http: Accept error: accept tcp 172.18.32.152:8200: accept4: too many open files; retrying in 1s
Nov 29 15:15:34 foo.bar.foobar.io vault[25708]: 2018-11-29T15:15:34.177Z [INFO ] http: Accept error: accept tcp 172.18.32.152:8200: accept4: too many open files; retrying in 1s
Nov 29 15:15:35 foo.bar.foobar.io vault[25708]: 2018-11-29T15:15:35.177Z [INFO ] http: Accept error: accept tcp 172.18.32.152:8200: accept4: too many open files; retrying in 1s
[root@foo ~]# netstat -antplu | grep 8200
tcp 129 0 172.18.32.152:8200 0.0.0.0:* LISTEN 25708/vault
tcp 0 0 172.18.32.152:8200 172.18.32.161:58322 SYN_RECV -
tcp 0 0 172.18.32.152:8200 172.18.32.161:58334 SYN_RECV -
tcp 0 0 172.18.32.152:8200 172.18.32.89:14386 SYN_RECV -
tcp 0 0 172.18.32.152:8200 172.18.32.161:58344 SYN_RECV -
tcp 0 0 172.18.32.152:8200 172.18.32.152:52634 SYN_RECV -
tcp 0 0 172.18.32.152:8200 172.18.33.40:38230 SYN_RECV -
tcp 0 0 172.18.32.152:8200 172.18.33.40:38218 SYN_RECV -
tcp 0 0 172.18.32.152:8200 172.18.32.161:58324 SYN_RECV -
tcp 0 0 172.18.32.152:8200 172.18.32.89:14366 SYN_RECV -
tcp 0 0 172.18.32.152:8200 172.18.32.89:14384 SYN_RECV -
tcp 0 0 172.18.32.152:8200 172.18.32.152:52638 SYN_RECV -
tcp 0 0 172.18.32.152:8200 172.18.33.40:38238 SYN_RECV -
tcp 0 0 172.18.32.152:8200 172.18.32.161:58356 SYN_RECV -
tcp 0 0 172.18.32.152:8200 172.18.33.40:38208 SYN_RECV -
tcp 0 0 172.18.32.152:8200 172.18.32.89:14398 SYN_RECV -
tcp 0 0 172.18.32.152:8200 172.18.33.40:38254 SYN_RECV -
tcp 0 0 172.18.32.152:8200 172.18.32.89:14408 SYN_RECV -
tcp 0 0 172.18.32.152:8200 172.18.32.161:58372 SYN_RECV -
tcp 0 0 172.18.32.152:8200 172.18.33.40:38244 SYN_RECV -
tcp 0 0 172.18.32.152:8200 172.18.32.89:14368 SYN_RECV -
tcp 0 0 172.18.32.152:8200 172.18.32.161:58352 SYN_RECV -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37580 CLOSE_WAIT -
tcp 0 246 172.18.32.152:52638 172.18.32.152:8200 FIN_WAIT1 -
tcp 149 0 172.18.32.152:8200 172.18.32.161:57878 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37416 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.89:13932 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.161:57708 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37534 CLOSE_WAIT -
tcp 148 0 172.18.32.152:8200 172.18.32.161:15181 ESTABLISHED -
tcp 149 0 172.18.32.152:8200 172.18.32.89:13994 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37512 CLOSE_WAIT -
tcp 148 0 172.18.32.152:8200 172.18.33.40:29859 ESTABLISHED -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37440 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37336 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.89:13838 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.89:13984 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37542 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37346 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37392 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37504 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37348 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.89:14004 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.161:57844 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37434 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37386 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37530 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.161:57750 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37540 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37478 CLOSE_WAIT -
tcp 301 0 172.18.32.152:8200 172.18.33.40:37428 CLOSE_WAIT -
tcp 148 0 172.18.32.152:8200 172.18.32.89:39293 ESTABLISHED -
tcp 149 0 172.18.32.152:8200 172.18.32.161:57934 CLOSE_WAIT -
tcp 246 0 172.18.32.152:8200 172.18.32.152:44684 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37522 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.89:13964 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.89:13850 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37394 CLOSE_WAIT -
tcp 0 246 172.18.32.152:52634 172.18.32.152:8200 FIN_WAIT1 -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37458 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37374 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37474 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.161:57716 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37588 CLOSE_WAIT -
tcp 246 0 172.18.32.152:8200 172.18.32.152:44688 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37486 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.161:57868 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37418 CLOSE_WAIT -
tcp 246 0 172.18.32.152:8200 172.18.32.152:44690 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37480 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37578 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.89:13802 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.89:14022 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.89:13960 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.89:13818 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37372 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37500 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.161:57906 CLOSE_WAIT -
tcp 246 0 172.18.32.152:8200 172.18.32.152:44668 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37490 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37568 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37370 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37576 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37412 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.89:13922 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37524 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.161:57886 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37538 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37586 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.89:13828 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37574 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37596 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.161:57728 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37496 CLOSE_WAIT -
tcp 246 0 172.18.32.152:8200 172.18.32.152:44694 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37464 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37264 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37368 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37508 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37278 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37572 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37528 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.161:57860 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37430 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37454 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37532 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37514 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37248 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.161:57922 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37268 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37450 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37436 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37390 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37362 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37502 CLOSE_WAIT -
tcp 0 246 172.18.32.152:52632 172.18.32.152:8200 FIN_WAIT1 -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37582 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37598 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37484 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.89:13902 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37452 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.161:57830 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37462 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.32.161:57810 CLOSE_WAIT -
tcp 149 0 172.18.32.152:8200 172.18.33.40:37408 CLOSE_WAIT -
Regarding setting the Vault address parameter in the listener "tcp" stanza to an FQDN instead of an IP address: does this make a difference?
Also, can you comment on the IP addresses on the other end? Are they random clients, do they correspond to other Vault servers, etc.?
Can you take a look at Vault server logs/client logs and see if you're seeing errors? e.g. are you seeing any panics?
Sorry, I misled you, my mistake. The problem is not related to a Vault parameter in the configuration file; I just described what I did when I hit the error.
I dug deeper, and the problem looks to be related to the DynamoDB storage backend.
As you can see in the Vault configuration, we are using the DynamoDB storage backend with the following parameters in the DynamoDB stanza:
read_capacity = "5"
write_capacity = "5"
On the AWS side, the DynamoDB table's read/write capacity was configured as follows:
Provisioned read capacity units: 5 (Auto Scaling Disabled)
Provisioned write capacity units: 5 (Auto Scaling Disabled)
Basically, each Vault node's read/write capacity was set equal to the DynamoDB provisioned read/write capacity. But since we have 3 nodes, each configured with a read/write capacity of 5, we actually exceed the DynamoDB table's capacity.
And here is what was visible in the /var/log/messages system log file, since Vault logs there as a systemd unit:
2018-11-26T16:54:55.871050+00:00 foo.bar.foobar.io vault: 2018-11-26T16:54:55.870Z [INFO ] core: security barrier not initialized
2018-11-28T17:05:41.567852+00:00 foo.bar.foobar.io vault: 2018-11-28T17:05:41.564Z [INFO ] core: vault is unsealed
2018-11-28T17:05:41.568139+00:00 foo.bar.foobar.io vault: 2018-11-28T17:05:41.566Z [INFO ] core: entering standby mode
2018-11-28T17:25:57.139296+00:00 foo.bar.foobar.io vault: 2018-11-28T17:25:57.136Z [ERROR] core: barrier init check failed: error="failed to check for initialization: ProvisionedThroughputExceededException: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API.
2018-11-28T17:25:57.139578+00:00 foo.bar.foobar.io vault: status code: 400, request id: FO1EVFHQQ9TO2NHBVS083F8G03VV4KQNSO5AEMVJF66Q9ASUAAJG"
2018-11-28T17:25:57.139794+00:00 foo.bar.foobar.io vault: 2018-11-28T17:25:57.137Z [ERROR] core: error checking health: error="failed to check for initialization: ProvisionedThroughputExceededException: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API.
2018-11-28T17:25:57.140005+00:00 foo.bar.foobar.io vault: status code: 400, request id: FO1EVFHQQ9TO2NHBVS083F8G03VV4KQNSO5AEMVJF66Q9ASUAAJG"
2018-11-28T17:26:06.524964+00:00 foo.bar.foobar.io vault: 2018-11-28T17:26:06.524Z [ERROR] core: barrier init check failed: error="failed to check for initialization: ProvisionedThroughputExceededException: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API.
2018-11-28T17:26:06.525299+00:00 foo.bar.foobar.io vault: status code: 400, request id: TP3TO5DJA3DAUTCOGIGICP4C3RVV4KQNSO5AEMVJF66Q9ASUAAJG"
2018-11-28T17:26:06.525530+00:00 foo.bar.foobar.io vault: 2018-11-28T17:26:06.524Z [ERROR] core: error checking health: error="failed to check for initialization: ProvisionedThroughputExceededException: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API.
2018-11-28T17:26:06.525756+00:00 foo.bar.foobar.io vault: status code: 400, request id: TP3TO5DJA3DAUTCOGIGICP4C3RVV4KQNSO5AEMVJF66Q9ASUAAJG"
2018-11-28T17:26:16.526527+00:00 foo.bar.foobar.io vault: 2018-11-28T17:26:16.526Z [ERROR] core: barrier init check failed: error="failed to check for initialization: ProvisionedThroughputExceededException: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API.
2018-11-28T17:26:16.526795+00:00 foo.bar.foobar.io vault: status code: 400, request id: 9SRQ1V6P0U2MA4O796AH9Q4FOFVV4KQNSO5AEMVJF66Q9ASUAAJG"
So to solve the issue I simply tripled the DynamoDB table's read/write capacity to make it consistent with the number of Vault nodes:
Provisioned read capacity units: 15 (Auto Scaling Disabled)
Provisioned write capacity units: 15 (Auto Scaling Disabled)
Basically, the DynamoDB read/write capacity must satisfy the following formulas:
AWS DynamoDB table provisioned read capacity = (Vault read_capacity * number of Vault nodes)
AWS DynamoDB table provisioned write capacity = (Vault write_capacity * number of Vault nodes)
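Worked out with the values from this issue: 3 nodes * read_capacity 5 = 15 provisioned read capacity units, and 3 nodes * write_capacity 5 = 15 provisioned write capacity units, which matches the updated table settings above.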
This solves the issue I faced, but the underlying problem remains: the Vault DynamoDB storage backend should handle this situation differently and correctly release its resources when such an exception occurs. It looks like Vault does not release the sockets because it is not handling this throttling error correctly.
So I would propose to:
- Enhance the DynamoDB storage backend documentation so that the read/write capacity settings take the number of nodes into account
- Update the DynamoDB storage backend code to better handle capacity-limit errors, so that if a Vault node still exceeds the read/write capacity after retries, it releases its resources correctly (a rough sketch follows below)
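To illustrate the second point, here is a rough, hypothetical sketch (not Vault's actual backend code) of how a DynamoDB call could be retried with backoff when AWS reports ProvisionedThroughputExceededException, and how the error would be surfaced after the last attempt instead of leaving the request hanging. The helper name retryOnThrottle and the retry parameters are illustrative only; it assumes the aws-sdk-go v1 awserr and dynamodb packages.

package dynamodbretry

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// retryOnThrottle retries op when DynamoDB reports that provisioned
// throughput was exceeded, doubling the wait between attempts. Any other
// error, or exhaustion of the retries, is returned to the caller so the
// request can finish and its connection can be closed.
func retryOnThrottle(op func() error, maxRetries int) error {
	backoff := 100 * time.Millisecond
	for attempt := 0; ; attempt++ {
		err := op()
		if err == nil {
			return nil
		}
		aerr, ok := err.(awserr.Error)
		throttled := ok && aerr.Code() == dynamodb.ErrCodeProvisionedThroughputExceededException
		if !throttled || attempt >= maxRetries {
			return fmt.Errorf("dynamodb operation failed after %d attempt(s): %v", attempt+1, err)
		}
		time.Sleep(backoff)
		backoff *= 2
	}
}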
The Dynamo code uses the AWS SDK, so this is _probably_ a bug at that layer, though I'm unsure.
I was not able to reproduce this, and I believe that in any event it may be resolved with OS-specific tuning (like below), if it is not addressed in the DynamoDB portion of Vault. Other suspects here that differed from my own testing (which was outside AWS) are the load balancer and the OS (I did not use RHEL).
# /etc/sysctl.conf
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
sysctl -p
Is this issue still applicable?