Vault: panic: interface conversion: interface {} is nil, not time.Time

Created on 22 Jul 2020 · 9Comments · Source: hashicorp/vault

Describe the bug
After upgrading from 1.4.3 to 1.5.0 Vault is crashing from a panic within the prometheus client. The server was running for about 16 hours before the panic and crash occured,

panic: interface conversion: interface {} is nil, not time.Time
goroutine 466266 [running]:
github.com/hashicorp/vault/vendor/github.com/armon/go-metrics/prometheus.(*PrometheusSink).Collect.func3(0x2edbd20, 0xc001e975d0, 0x34b2f40, 0xc001b01800, 0x1)
#011/go/src/github.com/hashicorp/vault/vendor/github.com/armon/go-metrics/prometheus/prometheus.go:96 +0x1d4
sync.(*Map).Range(0xc0009401b0, 0xc00211ce90)
#011/usr/local/go/src/sync/map.go:333 +0x1c5
github.com/hashicorp/vault/vendor/github.com/armon/go-metrics/prometheus.(*PrometheusSink).Collect(0xc000940160, 0xc0016234a0)
#011/go/src/github.com/hashicorp/vault/vendor/github.com/armon/go-metrics/prometheus/prometheus.go:94 +0x1d3
github.com/hashicorp/vault/vendor/github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
#011/go/src/github.com/hashicorp/vault/vendor/github.com/prometheus/client_golang/prometheus/registry.go:443 +0x19d
created by github.com/hashicorp/vault/vendor/github.com/prometheus/client_golang/prometheus.(*Registry).Gather
#011/go/src/github.com/hashicorp/vault/vendor/github.com/prometheus/client_golang/prometheus/registry.go:535 +0xe36

To Reproduce
Steps to reproduce the behavior:

Run a Vault server on 1.5.0 for a while
Wait for panic and crash

Expected behavior
Random crashes and panics do not occur.

Environment:

Vault Server Version (retrieve with vault status): 1.5.0
Vault CLI Version (retrieve with vault version): 1.5.0
Server Operating System/Architecture: Ubuntu 16.04

Vault server configuration file(s):

ui                = true
log_level         = "warn"
log_format        = "standard"
default_lease_ttl = "768h"
max_lease_ttl     = "86400h"
pid_file          = "/var/run/vault/vault.pid"

api_addr     = "https://<redacted>:8200"
cluster_addr = "https://<redacted>:8201"

listener "tcp" {
  address                            = "<redacted>:8200"
  cluster_address                    = "<redacted>:8201"
  tls_disable                        = "false"
  tls_cert_file                      = "<redacted>"
  tls_key_file                       = "<redacted>"
  tls_client_ca_file                 = "<redacted>"
  tls_require_and_verify_client_cert = "true"
  tls_disable_client_certs           = "false"
}

storage "consul" {
  address          = "127.0.0.1:8501"
  path             = "vault/"
  check_timeout    = "10s"
  consistency_mode = "default"
  service          = "vault"
  service_tags     = "internal,vault"
  token            = "<redacted>"
  scheme           = "https"
  tls_ca_file      = "<redacted>"
  tls_cert_file    = "<redacted>"
  tls_key_file     = "<redacted>"
  tls_skip_verify  = "false"
}

telemetry {
  disable_hostname          = true
  prometheus_retention_time = "4h"
}

Additional context
When the server was restarted (having lost leadership lock) it logged this message many times:

[ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing remote error: tls: internal error""

bug cormetric

Source

alexdulin

👍1

Most helpful comment

I don't believe this is actually related to 1.5. I made a change to how we collect Prometheus metrics in 1.4 which introduced a race in this code, leading to the panic in rare conditions.

ncabatoff on 22 Jul 2020

👎2 👍2

All 9 comments

I don't believe this is actually related to 1.5. I made a change to how we collect Prometheus metrics in 1.4 which introduced a race in this code, leading to the panic in rare conditions.

ncabatoff on 22 Jul 2020

👎2 👍2

Nothing changed in our configuration of Vault, nor our environment, besides upgrading the version. And this is not an error we have ever seen before. What would explain the timing on that?

alexdulin on 22 Jul 2020

We have had this happen 3 times so far today. Never had it happen once on any prior version, including 1.4 through 1.4.3

alexdulin on 23 Jul 2020

The line in question is panicking due to a change in github.com/armon/go-metrics which was included in the 1.5.0 release.

https://github.com/armon/go-metrics/commit/38377b616bac08306ad0903d5fb436a5330d2973#diff-e0b40f84f51c7bbce23007045f0d409cR75

Before, pulling a value out of the map that was not there would have produced the zero value of time.Time, but after switching to a sync.Map, the zero value produced is the zero value for interface{}, which is nil rather than time.Time{}. Trying to convert nil to a time.Time panics.

There might be a new race, but what seems more likely for this error is that this code path was hit before without panicking. Doing math against the zero value for time.Time may have yielded incorrect behavior (it's 0001-01-01), but it would not have panicked, since there's no type conversion necessary. The current code would, and does.