Nomad: [unsure, still investigating] significant rise in vault leader CPU usage during and after upgrade to nomad 0.6.2

Created on 30 Aug 2017 · 26 comments · Source: hashicorp/nomad

Nomad version

0.6.2

Operating system and Environment details

Linux
Vault: 0.7.3

Issue

At some point during our upgrade from nomad 0.5.6 to nomad 0.6.2 (correlation, !causation) we got an alert about high CPU usage on our Vault leader. We've since forced different leaders to step down several times, and the high CPU load follows the leader even after a restart.

Vault logs are identical to the ones we were seeing when CPU usage was 10x-20x lower.

My weak strace-fu and gdb-fu did not point to anything obvious.

Next step: we'll try running one Vault server with log-level=trace and force it to assume the leader role using a series of step-downs.
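For reference, forcing a particular node to become active looks roughly like this (a sketch only; the addresses are placeholders, and we're on the pre-0.9 CLI where the command is vault step-down):

# Sketch: repeat the step-down until the node we want ends up active.
# Addresses are placeholders; on Vault 0.7.x the command is `vault step-down`
# (it became `vault operator step-down` in later releases).
export VAULT_ADDR=https://active-vault.example.internal:8200
vault step-down        # ask the current leader to give up the active role
sleep 5
VAULT_ADDR=https://vault-2.example.internal:8200 vault status | grep -i mode   # did vault-2 win the election?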

Do you have any ideas about other things we can check that could have such an effect on the Vault leader?

Reproduction steps

unsure

Nomad Server logs (if appropriate)

silent

Nomad Client logs (if appropriate)

nothing vault related

Job file (if appropriate)

All 26 comments

log-level=trace gave nothing:

Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.679817 [TRACE] rollback: attempting rollback: path=auth/userpass/
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.680732 [TRACE] rollback: attempting rollback: path=secret/
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.680961 [TRACE] rollback: attempting rollback: path=sys/
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.681263 [TRACE] rollback: attempting rollback: path=rabbitmq/
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.681728 [TRACE] rollback: attempting rollback: path=ssh/
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.681960 [TRACE] rollback: attempting rollback: path=cassandra/
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.682315 [TRACE] rollback: attempting rollback: path=transit/
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.683374 [TRACE] rollback: attempting rollback: path=consul/
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.687285 [TRACE] rollback: attempting rollback: path=intermediate/
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.687544 [TRACE] rollback: attempting rollback: path=cubbyhole/
Aug 30 10:25:50 vault-2 vault[2189]: 2017/08/30 10:25:50.687842 [TRACE] rollback: attempting rollback: path=auth/token/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.678557 [TRACE] rollback: attempting rollback: path=auth/userpass/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.679247 [TRACE] rollback: attempting rollback: path=secret/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.679491 [TRACE] rollback: attempting rollback: path=sys/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.679714 [TRACE] rollback: attempting rollback: path=rabbitmq/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.679945 [TRACE] rollback: attempting rollback: path=ssh/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.680161 [TRACE] rollback: attempting rollback: path=cassandra/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.680359 [TRACE] rollback: attempting rollback: path=transit/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.680623 [TRACE] rollback: attempting rollback: path=consul/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.680818 [TRACE] rollback: attempting rollback: path=intermediate/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.681013 [TRACE] rollback: attempting rollback: path=cubbyhole/
Aug 30 10:26:50 vault-2 vault[2189]: 2017/08/30 10:26:50.681276 [TRACE] rollback: attempting rollback: path=auth/token/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.677836 [TRACE] rollback: attempting rollback: path=auth/userpass/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.678542 [TRACE] rollback: attempting rollback: path=secret/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.678762 [TRACE] rollback: attempting rollback: path=sys/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.678990 [TRACE] rollback: attempting rollback: path=rabbitmq/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.679187 [TRACE] rollback: attempting rollback: path=ssh/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.679381 [TRACE] rollback: attempting rollback: path=cassandra/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.679563 [TRACE] rollback: attempting rollback: path=transit/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.679738 [TRACE] rollback: attempting rollback: path=consul/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.680021 [TRACE] rollback: attempting rollback: path=intermediate/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.680250 [TRACE] rollback: attempting rollback: path=cubbyhole/
Aug 30 10:27:50 vault-2 vault[2189]: 2017/08/30 10:27:50.680431 [TRACE] rollback: attempting rollback: path=auth/token/

Also, as a result of several step-downs while trying to force the active role onto a node with TRACE logging, one of the old leaders is still hovering at 50-60% CPU even though it's no longer on active duty.

Old leader

PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                                                                                                   
20   0   53564  53000  34944 S 47.8  1.4 830:09.71 vault 

New leader

PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                                                                                                   
20   0   52252  51876  35136 S 62.3  1.4   4:57.88 vault 

Is this something we should discuss in the Vault repo instead?

@wuub I would move this to Vault. I just went through the change history of the files that touch Vault and nothing has really changed since 0.5.6. Do you use the template stanza with Vault?

I am going to close this but we can keep discussing.

I'll drop by Vault issues tomorrow.

And yes, we use the template stanza a lot, both with and without values from Vault.

Hey @wuub,

I was following the Vault issue. It would be great if you could share what you can about the job causing this. We'd like 0.6.3 to include any fix that comes out of this and would like to start investigating ASAP.

@dadgar I'll try to distill our jobspec to something that triggers this behavior later today.

@dadgar more investigation will have to wait until tomorrow, but this is the obfuscated jobspec:

job "jobname" {
    datacenters = ["dc1"]

    group "jobname" {
        count = 1
        task "jobname-app" {
            env {
                "ENVIRONMENT" = "prod"
            }

            vault {
                policies = ["jobname"]
                change_mode = "noop"
                env = false
            }
            template{
                data = <<EOH
HOSTS = [
  {{ range service "prod.database" "any" }}"{{.Address}}",
  {{ end }}
]
{{ with secret "secret/database/jobname" }}
    {{.Data.role}} "{{.Data.password}}"
}
{{ end }}
EOH
                destination = "${NOMAD_SECRETS_DIR}/config.py"
                change_mode = "restart"
                splay = "30s"
            }

            driver = "docker"
            config {
                image = "image"
                dns_servers = ["172.17.0.1", "169.254.169.254"]
                port_map {
                    "http" = 8000
                }
            }

          service {
                name = "jobname"
                tags = ["prod", "v1.0.0"]
                port = "http"
                check {
                    name = "${NOMAD_ALLOC_NAME}-ping"
                    type = "http"
                    path = "/ping"
                    interval = "10s"
                    timeout = "2s"
                }
                check {
                    name = "${NOMAD_ALLOC_NAME}-healthz"
                    type = "http"
                    path = "/healthz"
                    interval = "15s"
                    timeout = "5s"
                }
            }

            resources {
                cpu    = 200
                memory = 256
                network{
                    port "http" {
                    }
                }
            }
        }
    }
}

I don't know yet what might be causing the constant queries for the secret, but one thing this file does differently is change_mode = "noop"; mixing a secret with a service lookup in the same template is also uncommon.
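One way to take Nomad out of the picture would be to render roughly the same template with a standalone consul-template against the same agents (just a sketch; file names are made up, and it assumes a Vault token in VAULT_TOKEN and local -dev agents):

# Sketch: render the template outside of Nomad and watch whether the secret
# gets re-read in a tight loop.
cat > repro.tpl <<'EOH'
HOSTS = [
  {{ range service "prod.database" "any" }}"{{ .Address }}",
  {{ end }}
]
{{ with secret "secret/database/jobname" }}{{ .Data.role }} "{{ .Data.password }}"{{ end }}
EOH

consul-template \
  -consul-addr=127.0.0.1:8500 \
  -vault-addr=http://127.0.0.1:8200 \
  -template "repro.tpl:rendered.out" \
  -log-level=debug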

@wuub are those allocations restarting a lot because of the template re-rendering? Could we see the task events?

@dadgar does not look like it:

ID                  = b86034b9
Eval ID             = 409192cf
Name                = job.job[0]
Node ID             = node
Job ID              = job
Job Version         = 3
Client Status       = running
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 08/31/17 18:11:44 CEST

Task "job" is "running"
Task Resources
CPU         Memory          Disk     IOPS  Addresses
14/200 MHz  63 MiB/256 MiB  300 MiB  0     http: aaa.bbb.ccc.98:28795

Task Events:
Started At     = 08/31/17 16:11:53 UTC
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                    Type        Description
08/31/17 18:11:53 CEST  Started     Task started by client
08/31/17 18:11:45 CEST  Driver      Downloading image repo/subrepo/image:v0.0.17
08/31/17 18:11:44 CEST  Task Setup  Building Task Directory
08/31/17 18:11:44 CEST  Received    Task received by client

EDIT: we also have several different alerting rules that detect flapping services (restart metrics, flapping Consul checks, etc.), and none of them said anything about this job restarting, so I'm pretty confident it stays up once started.

@wuub Would it be possible to share the relevant audit logs and dump the telemetry by sending SIGUSR1 while Vault is under high load? These will help pin down what is happening.
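Something along these lines should capture both (a sketch; this uses the pre-0.9 CLI syntax vault audit-enable, and the file path is just an example):

# Enable a file audit device so the requests hitting the leader are logged.
vault audit-enable file file_path=/var/log/vault_audit.log

# Vault dumps its in-memory telemetry to the server log on SIGUSR1;
# do this a couple of times while CPU is high.
kill -SIGUSR1 "$(pidof vault)"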

We will try to track this down today.

I'm able to reliably reproduce this with an empty nomad -dev / consul -dev / vault -dev setup. I'll distill it even further and share reproduction steps here in 1-2 hours (sorry, meetings :().

@dadgar
reproduction repo is here: https://github.com/wuub/nomad-vault-repro
Removing the secret from the template stops the requests.

I set it up with tmux/tmuxp -> https://github.com/wuub/nomad-vault-repro/blob/master/tinycloud.yaml

Consul v0.9.2
Nomad v0.6.2
Vault v0.8.1 ('8d76a41854608c547a233f2e6292ae5355154695')

Vault audit logs are full of entries like these (a LOT of bytes/sec):

{"time":"2017-09-01T12:01:43Z","type":"response","auth":{"client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","display_name":"token-42d8a614-d3db-095b-070c-981f88e74a96-job-app","policies":["default","job"],"metadata":{"AllocationID":"42d8a614-d3db-095b-070c-981f88e74a96","NodeID":"8731eca6-4660-ac0d-2816-cf3716ee7009","Task":"job-app"}},"request":{"id":"55d484ef-81b9-b28f-c888-abf647cd28c0","operation":"read","client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","client_token_accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","path":"secret/db/prod/job","data":null,"remote_address":"127.0.0.1","wrap_ttl":0,"headers":{}},"response":{"secret":{"lease_id":""},"data":{"password":"hmac-sha256:a9e4c7f90ce83f08c27708401218b29becfacd5c25c7baf7418e9810b972f896","role":"hmac-sha256:d794edb7048d2df98e6dfbe2ab0b143eb8e266260b1d4797405d589a02077bb4","ttl":"hmac-sha256:0ed6474b1267da43b04ddce4129d7edcaaaf91e9a99a6e5a6335f6561ecc5fc2"}},"error":""}
{"time":"2017-09-01T12:01:43Z","type":"request","auth":{"client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","display_name":"token-42d8a614-d3db-095b-070c-981f88e74a96-job-app","policies":["default","job"],"metadata":{"AllocationID":"42d8a614-d3db-095b-070c-981f88e74a96","NodeID":"8731eca6-4660-ac0d-2816-cf3716ee7009","Task":"job-app"}},"request":{"id":"80b895ba-30c1-3a97-eee6-a5af45c4afa5","operation":"read","client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","client_token_accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","path":"secret/db/prod/job","data":null,"remote_address":"127.0.0.1","wrap_ttl":0,"headers":{}},"error":""}
{"time":"2017-09-01T12:01:43Z","type":"response","auth":{"client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","display_name":"token-42d8a614-d3db-095b-070c-981f88e74a96-job-app","policies":["default","job"],"metadata":{"AllocationID":"42d8a614-d3db-095b-070c-981f88e74a96","NodeID":"8731eca6-4660-ac0d-2816-cf3716ee7009","Task":"job-app"}},"request":{"id":"80b895ba-30c1-3a97-eee6-a5af45c4afa5","operation":"read","client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","client_token_accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","path":"secret/db/prod/job","data":null,"remote_address":"127.0.0.1","wrap_ttl":0,"headers":{}},"response":{"secret":{"lease_id":""},"data":{"password":"hmac-sha256:a9e4c7f90ce83f08c27708401218b29becfacd5c25c7baf7418e9810b972f896","role":"hmac-sha256:d794edb7048d2df98e6dfbe2ab0b143eb8e266260b1d4797405d589a02077bb4","ttl":"hmac-sha256:0ed6474b1267da43b04ddce4129d7edcaaaf91e9a99a6e5a6335f6561ecc5fc2"}},"error":""}
{"time":"2017-09-01T12:01:43Z","type":"request","auth":{"client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","display_name":"token-42d8a614-d3db-095b-070c-981f88e74a96-job-app","policies":["default","job"],"metadata":{"AllocationID":"42d8a614-d3db-095b-070c-981f88e74a96","NodeID":"8731eca6-4660-ac0d-2816-cf3716ee7009","Task":"job-app"}},"request":{"id":"d286c240-22c3-c984-ca8b-80a85b4bc478","operation":"read","client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","client_token_accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","path":"secret/db/prod/job","data":null,"remote_address":"127.0.0.1","wrap_ttl":0,"headers":{}},"error":""}
{"time":"2017-09-01T12:01:43Z","type":"response","auth":{"client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","display_name":"token-42d8a614-d3db-095b-070c-981f88e74a96-job-app","policies":["default","job"],"metadata":{"AllocationID":"42d8a614-d3db-095b-070c-981f88e74a96","NodeID":"8731eca6-4660-ac0d-2816-cf3716ee7009","Task":"job-app"}},"request":{"id":"d286c240-22c3-c984-ca8b-80a85b4bc478","operation":"read","client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","client_token_accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","path":"secret/db/prod/job","data":null,"remote_address":"127.0.0.1","wrap_ttl":0,"headers":{}},"response":{"secret":{"lease_id":""},"data":{"password":"hmac-sha256:a9e4c7f90ce83f08c27708401218b29becfacd5c25c7baf7418e9810b972f896","role":"hmac-sha256:d794edb7048d2df98e6dfbe2ab0b143eb8e266260b1d4797405d589a02077bb4","ttl":"hmac-sha256:0ed6474b1267da43b04ddce4129d7edcaaaf91e9a99a6e5a6335f6561ecc5fc2"}},"error":""}
{"time":"2017-09-01T12:01:43Z","type":"request","auth":{"client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","display_name":"token-42d8a614-d3db-095b-070c-981f88e74a96-job-app","policies":["default","job"],"metadata":{"AllocationID":"42d8a614-d3db-095b-070c-981f88e74a96","NodeID":"8731eca6-4660-ac0d-2816-cf3716ee7009","Task":"job-app"}},"request":{"id":"abf787ef-6547-b18b-5a87-f5136660bb25","operation":"read","client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","client_token_accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","path":"secret/db/prod/job","data":null,"remote_address":"127.0.0.1","wrap_ttl":0,"headers":{}},"error":""}
{"time":"2017-09-01T12:01:43Z","type":"response","auth":{"client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","display_name":"token-42d8a614-d3db-095b-070c-981f88e74a96-job-app","policies":["default","job"],"metadata":{"AllocationID":"42d8a614-d3db-095b-070c-981f88e74a96","NodeID":"8731eca6-4660-ac0d-2816-cf3716ee7009","Task":"job-app"}},"request":{"id":"abf787ef-6547-b18b-5a87-f5136660bb25","operation":"read","client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","client_token_accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","path":"secret/db/prod/job","data":null,"remote_address":"127.0.0.1","wrap_ttl":0,"headers":{}},"response":{"secret":{"lease_id":""},"data":{"password":"hmac-sha256:a9e4c7f90ce83f08c27708401218b29becfacd5c25c7baf7418e9810b972f896","role":"hmac-sha256:d794edb7048d2df98e6dfbe2ab0b143eb8e266260b1d4797405d589a02077bb4","ttl":"hmac-sha256:0ed6474b1267da43b04ddce4129d7edcaaaf91e9a99a6e5a6335f6561ecc5fc2"}},"error":""}
{"time":"2017-09-01T12:01:43Z","type":"request","auth":{"client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","display_name":"token-42d8a614-d3db-095b-070c-981f88e74a96-job-app","policies":["default","job"],"metadata":{"AllocationID":"42d8a614-d3db-095b-070c-981f88e74a96","NodeID":"8731eca6-4660-ac0d-2816-cf3716ee7009","Task":"job-app"}},"request":{"id":"b5b78290-ebc3-c625-c668-5b4e5a4e0825","operation":"read","client_token":"hmac-sha256:f42267188586284e1e85d91f93de82cace186c572609061cd91477f00559ded7","client_token_accessor":"hmac-sha256:ebad279ca6180c8d190d1e2852b034b3e1ffc6529787c146f134ea6510be69c6","path":"secret/db/prod/job","data":null,"remote_address":"127.0.0.1","wrap_ttl":0,"headers":{}},"error":""}


Not sure if this is related, but I'm having issues with 0.6.2 where the templates constantly renew credentials from Vault and cause a SIGHUP. There are some logs and background here.

@wuub Thanks for the reproduction steps! Reproduced it now.

@wuub Just an update, this is a bug in the latest consul-template which we vendor in. Can reproduce with Nomad out of the picture.

@adamlc would you mind running your template using consul-template 0.19.0 and 0.19.1 and seeing if the issue only appears with 0.19.1?
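Something like this would let you compare the two (a sketch; the download URLs follow the usual releases.hashicorp.com naming, so double-check them for your platform):

# Fetch both releases and run the same template through each binary.
for v in 0.19.0 0.19.1; do
  curl -sLO "https://releases.hashicorp.com/consul-template/${v}/consul-template_${v}_linux_amd64.zip"
  unzip -o "consul-template_${v}_linux_amd64.zip" -d "ct-${v}"
  "./ct-${v}/consul-template" -version
done
# Then point each binary at the same -template/-vault-addr flags and watch
# whether only 0.19.1 keeps re-reading the secret.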

Hey all, this is fixed and vendored in.

@wuub We are going to do a 0.6.3 RC1 on Monday. We'd appreciate it if you could test for this! Thanks, and hope you have a good weekend!

@dadgar sure thing!

Do you think it'll address the other issue I had as well, https://github.com/hashicorp/nomad/issues/3121, or should I patch it beforehand?

@dadgar I've just given 0.6.3-rc1 a spin and I'm still getting the issue with consul-template constantly renewing credentials and issuing lots of SIGHUPs.

@wuub are you still getting the same issue? I'm trying to determine if I've got a different issue or not!

@adamlc we're rolling out 0.6.3-rc1 right now. We'll see in an hour or so.

@dadgar @adamlc I can now confirm that 0.6.3-rc1 does in fact solve our original issue. Vault servers are back to sub-0.5% CPU usage.

Fair enough, clearly I've got some sort of other issue. I'll do some digging and open a ticket when I figure it out.

@adamlc Want to post your template/job file? How sure are you that the data is not actually changing in Consul? Also, could you share the DEBUG logs of the client?

Sorry I haven't had time to debug this yet. I'll give you a shout when I've got all the required information :)

@dadgar I've opened #3197 :)
