Nomad: 0.6.2
OS: Linux
Vault: 0.7.3
At some point during our upgrade from Nomad 0.5.6 to Nomad 0.6.2 (correlation, !causation) we got an alert about high CPU usage on our Vault leader. We've since forced different leaders to step down several times, and the high CPU load follows the leader, even after a restart.
Vault logs are identical to the ones we were seeing when CPU usage was 10x-20x lower.
My weak strace-fu and gdb-fu did not point to anything obvious.
Next step: we'll try to run one Vault server with log-level=trace and force it to assume the leader role using a series of step-downs.
Do you have any ideas about other things we can check that could have such an effect on the Vault leader?
unsure
silent
nothing vault related
log-level=trace gave nothing beyond the periodic rollback loop:
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.679817 [TRACE] rollback: attempting rollback: path=auth/userpass/
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.680732 [TRACE] rollback: attempting rollback: path=secret/
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.680961 [TRACE] rollback: attempting rollback: path=sys/
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.681263 [TRACE] rollback: attempting rollback: path=rabbitmq/
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.681728 [TRACE] rollback: attempting rollback: path=ssh/
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.681960 [TRACE] rollback: attempting rollback: path=cassandra/
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.682315 [TRACE] rollback: attempting rollback: path=transit/
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.683374 [TRACE] rollback: attempting rollback: path=consul/
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.687285 [TRACE] rollback: attempting rollback: path=intermediate/
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.687544 [TRACE] rollback: attempting rollback: path=cubbyhole/
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.687842 [TRACE] rollback: attempting rollback: path=auth/token/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.678557 [TRACE] rollback: attempting rollback: path=auth/userpass/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.679247 [TRACE] rollback: attempting rollback: path=secret/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.679491 [TRACE] rollback: attempting rollback: path=sys/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.679714 [TRACE] rollback: attempting rollback: path=rabbitmq/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.679945 [TRACE] rollback: attempting rollback: path=ssh/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.680161 [TRACE] rollback: attempting rollback: path=cassandra/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.680359 [TRACE] rollback: attempting rollback: path=transit/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.680623 [TRACE] rollback: attempting rollback: path=consul/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.680818 [TRACE] rollback: attempting rollback: path=intermediate/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.681013 [TRACE] rollback: attempting rollback: path=cubbyhole/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.681276 [TRACE] rollback: attempting rollback: path=auth/token/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.677836 [TRACE] rollback: attempting rollback: path=auth/userpass/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.678542 [TRACE] rollback: attempting rollback: path=secret/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.678762 [TRACE] rollback: attempting rollback: path=sys/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.678990 [TRACE] rollback: attempting rollback: path=rabbitmq/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.679187 [TRACE] rollback: attempting rollback: path=ssh/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.679381 [TRACE] rollback: attempting rollback: path=cassandra/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.679563 [TRACE] rollback: attempting rollback: path=transit/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.679738 [TRACE] rollback: attempting rollback: path=consul/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.680021 [TRACE] rollback: attempting rollback: path=intermediate/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.680250 [TRACE] rollback: attempting rollback: path=cubbyhole/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.680431 [TRACE] rollback: attempting rollback: path=auth/token/
Also, as a result of several step-downs while trying to force the active role onto a node with TRACE logging, one of the old leaders is still hovering at 50-60% CPU even though it's not on active duty.
Old leader
PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20 0 53564 53000 34944 S 47.8 1.4 830:09.71 vault
New leader
PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20 0 52252 51876 35136 S 62.3 1.4 4:57.88 vault
Is this something we should discuss in the Vault repo instead?
@wuub I would move this to Vault. I just went through the change history of the files that touch Vault and nothing has really changed since 0.5.6. Do you use the template stanza with Vault?
I am going to close this but we can keep discussing.
I'll drop by Vault issues tomorrow.
And yes, we use the template stanza a lot, with and without values from Vault.
Hey @wuub,
I was following the Vault issue. It would be great if you could share what you can about the job causing this. We would like 0.6.3 to include any fix that comes from this, and we'd like to start investigating ASAP.
@dadgar I'll try to distill our jobspec to something that triggers this behavior later today.
@dadgar more investigation will have to wait until tomorrow, but this is the obfuscated jobspec:
job "jobname" {
  datacenters = ["dc1"]

  group "jobname" {
    count = 1

    task "jobname-app" {
      env {
        "ENVIRONMENT" = "prod"
      }

      vault {
        policies    = ["jobname"]
        change_mode = "noop"
        env         = false
      }

      template {
        data = <<EOH
HOSTS = [
{{ range service "prod.database" "any" }}"{{.Address}}",
{{ end }}
]
{{ with secret "secret/database/jobname" }}
{{.Data.role}} "{{.Data.password}}"
}
{{ end }}
EOH

        destination = "${NOMAD_SECRETS_DIR}/config.py"
        change_mode = "restart"
        splay       = "30s"
      }

      driver = "docker"

      config {
        image       = "image"
        dns_servers = ["172.17.0.1", "169.254.169.254"]

        port_map {
          "http" = 8000
        }
      }

      service {
        name = "jobname"
        tags = ["prod", "v1.0.0"]
        port = "http"

        check {
          name     = "${NOMAD_ALLOC_NAME}-ping"
          type     = "http"
          path     = "/ping"
          interval = "10s"
          timeout  = "2s"
        }

        check {
          name     = "${NOMAD_ALLOC_NAME}-healthz"
          type     = "http"
          path     = "/healthz"
          interval = "15s"
          timeout  = "5s"
        }
      }

      resources {
        cpu    = 200
        memory = 256

        network {
          port "http" {
          }
        }
      }
    }
  }
}
I don't know what might be causing the constant queries for the secret yet, but two things stand out in this file: change_mode = "noop" on the vault stanza, and mixing a secret lookup with a service lookup in the same template, which is also uncommon for us.
@wuub are those allocations restarting a lot because of the template re-rendering? Could we see the task events?
@dadgar does not look like it:
ID = b86034b9
Eval ID = 409192cf
Name = job.job[0]
Node ID = node
Job ID = job
Job Version = 3
Client Status = running
Client Description = <none>
Desired Status = run
Desired Description = <none>
Created At = 08/31/17 18:11:44 CEST
Task "job" is "running"
Task Resources
CPU Memory Disk IOPS Addresses
14/200 MHz 63 MiB/256 MiB 300 MiB 0 http: aaa.bbb.ccc.98:28795
Task Events:
Started At = 08/31/17 16:11:53 UTC
Finished At = N/A
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
08/31/17 18:11:53 CEST Started Task started by client
08/31/17 18:11:45 CEST Driver Downloading image repo/subrepo/image:v0.0.17
08/31/17 18:11:44 CEST Task Setup Building Task Directory
08/31/17 18:11:44 CEST Received Task received by client
EDIT: we also have several different alerting rules that detect flapping services (by restart metrics, flapping Consul checks, etc.), and none of them said anything about this job restarting, so I'm pretty confident it stays up once started.
@wuub Would it be possible to share the relevant audit logs, and to dump the telemetry by sending SIGUSR1 while Vault is under high load? These will help pin down what is happening.
We will try to track this down today.
I'm able to reliably reproduce this with an empty nomad -dev / consul -dev / vault -dev setup. I'll distill it even further and share reproduction steps here in 1-2h (sorry, meetings :().
@dadgar
reproduction repo is here: https://github.com/wuub/nomad-vault-repro
Removing the secret from the template stops the requests.
I set it up with tmux/tmuxp -> https://github.com/wuub/nomad-vault-repro/blob/master/tinycloud.yaml
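To double-check that the secret lookup alone is what drives the loop, a minimal consul-template template containing only that lookup should reproduce it outside Nomad. This is a sketch, not from the repro repo itself; the path is taken from the audit logs below, so adjust it to your setup:

```
{{ with secret "secret/db/prod/job" }}
{{ .Data.role }} "{{ .Data.password }}"
{{ end }}
```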
Consul v0.9.2
Nomad v0.6.2
Vault v0.8.1 ('8d76a41854608c547a233f2e6292ae5355154695')
Vault audit logs are full of entries like these (a LOT of bytes/sec):
{"time":"2017-09-01T12:01:43Z","type":"response","auth":{"client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","display_name":"token-42d8a614-d3db-095b-070c-981f88e74a96-job-app","policies":["default","job"],"metadata":{"AllocationID":"42d8a614-d3db-095b-070c-981f88e74a96","NodeID":"8731eca6-4660-ac0d-2816-cf3716ee7009","Task":"job-app"}},"request":{"id":"55d484ef-81b9-b28f-c888-abf647cd28c0","operation":"read","client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","client_token_accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","path":"secret/db/prod/job","data":null,"remote_address":"127.0.0.1","wrap_ttl":0,"headers":{}},"response":{"secret":{"lease_id":""},"data":{"password":"hmac-sha256:a9e4c7f90ce83f08c27708401218b29becfacd5c25c7baf7418e9810b972f896","role":"hmac-sha256:d794edb7048d2df98e6dfbe2ab0b143eb8e266260b1d4797405d589a02077bb4","ttl":"hmac-sha256:0ed6474b1267da43b04ddce4129d7edcaaaf91e9a99a6e5a6335f6561ecc5fc2"}},"error":""}
{"time":"2017-09-01T12:01:43Z","type":"request","auth":{"client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","display_name":"token-42d8a614-d3db-095b-070c-981f88e74a96-job-app","policies":["default","job"],"metadata":{"AllocationID":"42d8a614-d3db-095b-070c-981f88e74a96","NodeID":"8731eca6-4660-ac0d-2816-cf3716ee7009","Task":"job-app"}},"request":{"id":"80b895ba-30c1-3a97-eee6-a5af45c4afa5","operation":"read","client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","client_token_accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","path":"secret/db/prod/job","data":null,"remote_address":"127.0.0.1","wrap_ttl":0,"headers":{}},"error":""}
{"time":"2017-09-01T12:01:43Z","type":"response","auth":{"client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","display_name":"token-42d8a614-d3db-095b-070c-981f88e74a96-job-app","policies":["default","job"],"metadata":{"AllocationID":"42d8a614-d3db-095b-070c-981f88e74a96","NodeID":"8731eca6-4660-ac0d-2816-cf3716ee7009","Task":"job-app"}},"request":{"id":"80b895ba-30c1-3a97-eee6-a5af45c4afa5","operation":"read","client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","client_token_accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","path":"secret/db/prod/job","data":null,"remote_address":"127.0.0.1","wrap_ttl":0,"headers":{}},"response":{"secret":{"lease_id":""},"data":{"password":"hmac-sha256:a9e4c7f90ce83f08c27708401218b29becfacd5c25c7baf7418e9810b972f896","role":"hmac-sha256:d794edb7048d2df98e6dfbe2ab0b143eb8e266260b1d4797405d589a02077bb4","ttl":"hmac-sha256:0ed6474b1267da43b04ddce4129d7edcaaaf91e9a99a6e5a6335f6561ecc5fc2"}},"error":""}
{"time":"2017-09-01T12:01:43Z","type":"request","auth":{"client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","display_name":"token-42d8a614-d3db-095b-070c-981f88e74a96-job-app","policies":["default","job"],"metadata":{"AllocationID":"42d8a614-d3db-095b-070c-981f88e74a96","NodeID":"8731eca6-4660-ac0d-2816-cf3716ee7009","Task":"job-app"}},"request":{"id":"d286c240-22c3-c984-ca8b-80a85b4bc478","operation":"read","client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","client_token_accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","path":"secret/db/prod/job","data":null,"remote_address":"127.0.0.1","wrap_ttl":0,"headers":{}},"error":""}
{"time":"2017-09-01T12:01:43Z","type":"response","auth":{"client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","display_name":"token-42d8a614-d3db-095b-070c-981f88e74a96-job-app","policies":["default","job"],"metadata":{"AllocationID":"42d8a614-d3db-095b-070c-981f88e74a96","NodeID":"8731eca6-4660-ac0d-2816-cf3716ee7009","Task":"job-app"}},"request":{"id":"d286c240-22c3-c984-ca8b-80a85b4bc478","operation":"read","client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","client_token_accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","path":"secret/db/prod/job","data":null,"remote_address":"127.0.0.1","wrap_ttl":0,"headers":{}},"response":{"secret":{"lease_id":""},"data":{"password":"hmac-sha256:a9e4c7f90ce83f08c27708401218b29becfacd5c25c7baf7418e9810b972f896","role":"hmac-sha256:d794edb7048d2df98e6dfbe2ab0b143eb8e266260b1d4797405d589a02077bb4","ttl":"hmac-sha256:0ed6474b1267da43b04ddce4129d7edcaaaf91e9a99a6e5a6335f6561ecc5fc2"}},"error":""}
{"time":"2017-09-01T12:01:43Z","type":"request","auth":{"client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","display_name":"token-42d8a614-d3db-095b-070c-981f88e74a96-job-app","policies":["default","job"],"metadata":{"AllocationID":"42d8a614-d3db-095b-070c-981f88e74a96","NodeID":"8731eca6-4660-ac0d-2816-cf3716ee7009","Task":"job-app"}},"request":{"id":"abf787ef-6547-b18b-5a87-f5136660bb25","operation":"read","client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","client_token_accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","path":"secret/db/prod/job","data":null,"remote_address":"127.0.0.1","wrap_ttl":0,"headers":{}},"error":""}
{"time":"2017-09-01T12:01:43Z","type":"response","auth":{"client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","display_name":"token-42d8a614-d3db-095b-070c-981f88e74a96-job-app","policies":["default","job"],"metadata":{"AllocationID":"42d8a614-d3db-095b-070c-981f88e74a96","NodeID":"8731eca6-4660-ac0d-2816-cf3716ee7009","Task":"job-app"}},"request":{"id":"abf787ef-6547-b18b-5a87-f5136660bb25","operation":"read","client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","client_token_accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","path":"secret/db/prod/job","data":null,"remote_address":"127.0.0.1","wrap_ttl":0,"headers":{}},"response":{"secret":{"lease_id":""},"data":{"password":"hmac-sha256:a9e4c7f90ce83f08c27708401218b29becfacd5c25c7baf7418e9810b972f896","role":"hmac-sha256:d794edb7048d2df98e6dfbe2ab0b143eb8e266260b1d4797405d589a02077bb4","ttl":"hmac-sha256:0ed6474b1267da43b04ddce4129d7edcaaaf91e9a99a6e5a6335f6561ecc5fc2"}},"error":""}
{"time":"2017-09-01T12:01:43Z","type":"request","auth":{"client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","display_name":"token-42d8a614-d3db-095b-070c-981f88e74a96-job-app","policies":["default","job"],"metadata":{"AllocationID":"42d8a614-d3db-095b-070c-981f88e74a96","NodeID":"8731eca6-4660-ac0d-2816-cf3716ee7009","Task":"job-app"}},"request":{"id":"b5b78290-ebc3-c625-c668-5b4e5a4e0825","operation":"read","client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","client_token_accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","path":"secret/db/prod/job","data":null,"remote_address":"127.0.0.1","wrap_ttl":0,"headers":{}},"error":""}
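To quantify the flood, a rough sketch with standard tools works on the raw audit device output (assumes the audit log is JSON lines in a file named audit.log, one entry per line; the file name is an assumption, adjust to your audit device path):

```shell
# Count audit-log *request* entries per second. The "time" field has
# one-second resolution, so grouping by the raw field value yields a
# per-second request count; the busiest seconds come out on top.
grep '"type":"request"' audit.log \
  | grep -o '"time":"[^"]*"' \
  | sort | uniq -c | sort -rn | head
```

Adding a second `grep -o '"path":"[^"]*"'` stage instead of the time extraction shows which secret paths are being hammered.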

Not sure if this is related, but I'm having issues with 0.6.2 and the templates constantly renewing credentials from Vault and causing a SIGHUP. There's some logs and background here.
@wuub Thanks for the reproduction steps! I've reproduced it now.
@wuub Just an update, this is a bug in the latest consul-template which we vendor in. Can reproduce with Nomad out of the picture.
@adamlc would you mind running your template using standalone consul-template 0.19.0 and 0.19.1, and seeing if the issue only appears with 0.19.1?
Hey all this is fixed and vendored in.
@wuub We are going to do an 0.6.3 RC1 on Monday. Would appreciate if you could test for this! Thanks and hope you have a good weekend!
@dadgar sure thing!
Do you think it'll address the other issue I had as well (https://github.com/hashicorp/nomad/issues/3121), or should I patch it beforehand?
@dadgar I've just given 0.6.3-rc1 a spin and I'm still getting the issue with consul-template constantly renewing credentials and issuing lots of SIGHUPs.
@wuub are you still getting the same issue? I'm trying to determine if I've got a different issue or not!
@adamlc we're rolling out 0.6.3-rc1 right now. we'll see in an hour or so.
@dadgar @adamlc I can now confirm that 0.6.3-rc1 does in fact solve our original issue. Vault servers are back to sub-0.5% CPU usage.
Fair enough, clearly I've got some sort of other issue. I'll do some digging and open a ticket when I figure out the issue
@adamlc Want to post your template/job file? How sure are you that the data is not actually changing in Consul? Also could you share the DEBUG logs of the client.
Sorry I haven't had time to debug this yet. I'll give you a shout when I've got all the required information :)
@dadgar I've opened #3197 :)