Describe the bug
After upgrading from 1.4.3 to 1.5.0 Vault is crashing from a panic within the prometheus client. The server was running for about 16 hours before the panic and crash occured,
panic: interface conversion: interface {} is nil, not time.Time
goroutine 466266 [running]:
github.com/hashicorp/vault/vendor/github.com/armon/go-metrics/prometheus.(*PrometheusSink).Collect.func3(0x2edbd20, 0xc001e975d0, 0x34b2f40, 0xc001b01800, 0x1)
#011/go/src/github.com/hashicorp/vault/vendor/github.com/armon/go-metrics/prometheus/prometheus.go:96 +0x1d4
sync.(*Map).Range(0xc0009401b0, 0xc00211ce90)
#011/usr/local/go/src/sync/map.go:333 +0x1c5
github.com/hashicorp/vault/vendor/github.com/armon/go-metrics/prometheus.(*PrometheusSink).Collect(0xc000940160, 0xc0016234a0)
#011/go/src/github.com/hashicorp/vault/vendor/github.com/armon/go-metrics/prometheus/prometheus.go:94 +0x1d3
github.com/hashicorp/vault/vendor/github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
#011/go/src/github.com/hashicorp/vault/vendor/github.com/prometheus/client_golang/prometheus/registry.go:443 +0x19d
created by github.com/hashicorp/vault/vendor/github.com/prometheus/client_golang/prometheus.(*Registry).Gather
#011/go/src/github.com/hashicorp/vault/vendor/github.com/prometheus/client_golang/prometheus/registry.go:535 +0xe36
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Random crashes and panics do not occur.
Environment:
vault status): 1.5.0vault version): 1.5.0Vault server configuration file(s):
ui = true
log_level = "warn"
log_format = "standard"
default_lease_ttl = "768h"
max_lease_ttl = "86400h"
pid_file = "/var/run/vault/vault.pid"
api_addr = "https://<redacted>:8200"
cluster_addr = "https://<redacted>:8201"
listener "tcp" {
address = "<redacted>:8200"
cluster_address = "<redacted>:8201"
tls_disable = "false"
tls_cert_file = "<redacted>"
tls_key_file = "<redacted>"
tls_client_ca_file = "<redacted>"
tls_require_and_verify_client_cert = "true"
tls_disable_client_certs = "false"
}
storage "consul" {
address = "127.0.0.1:8501"
path = "vault/"
check_timeout = "10s"
consistency_mode = "default"
service = "vault"
service_tags = "internal,vault"
token = "<redacted>"
scheme = "https"
tls_ca_file = "<redacted>"
tls_cert_file = "<redacted>"
tls_key_file = "<redacted>"
tls_skip_verify = "false"
}
telemetry {
disable_hostname = true
prometheus_retention_time = "4h"
}
Additional context
When the server was restarted (having lost leadership lock) it logged this message many times:
[ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing remote error: tls: internal error""
I don't believe this is actually related to 1.5. I made a change to how we collect Prometheus metrics in 1.4 which introduced a race in this code, leading to the panic in rare conditions.
Nothing changed in our configuration of Vault, nor our environment, besides upgrading the version. And this is not an error we have ever seen before. What would explain the timing on that?
We have had this happen 3 times so far today. Never had it happen once on any prior version, including 1.4 through 1.4.3
The line in question is panicking due to a change in github.com/armon/go-metrics which was included in the 1.5.0 release.
https://github.com/armon/go-metrics/commit/38377b616bac08306ad0903d5fb436a5330d2973#diff-e0b40f84f51c7bbce23007045f0d409cR75
Before, pulling a value out of the map that was not there would have produced the zero value of time.Time, but after switching to a sync.Map, the zero value produced is the zero value for interface{}, which is nil rather than time.Time{}. Trying to convert nil to a time.Time panics.
There might be a new race, but what seems more likely for this error is that this code path was hit before without panicking. Doing math against the zero value for time.Time may have yielded incorrect behavior (it's 0001-01-01), but it would not have panicked, since there's no type conversion necessary. The current code would, and does.
Same here. Since upgrading to 1.5.0, Vault keeps crashing after a random amount of time. Sometimes runs 8h, sometimes only 15min.
My mistake, I thought that go-metrics change went into 1.4.
In my configuration, vault crash after a few minutes running. I saw the same issue here.
Our organization rolled back to 1.4.3 and the problem disappeared.
The fix for this has been merged and Vault deps have been updated. This should be released in 1.5.1, very soon.
Most helpful comment
I don't believe this is actually related to 1.5. I made a change to how we collect Prometheus metrics in 1.4 which introduced a race in this code, leading to the panic in rare conditions.