Nomad v0.12.1-local+1
We patch and compile our own Nomad source, but the changes only ensure an existing Docker driver option is set to a permitted value; they should have no bearing on this issue.
CentOS 7.7.1908 x86_64
Nomad servers and clients have panicked within the Prometheus metric collection handling code with the error `panic: interface conversion: interface {} is nil, not time.Time`. See the traces in the logs below.
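For clarity on what the message means: this is the standard panic produced by an unchecked type assertion on a nil interface value. A minimal standalone Go sketch (hypothetical illustration only, not Nomad or go-metrics code) that produces the same message:

```go
package main

import (
	"sync"
	"time"
)

func main() {
	var updates sync.Map // hypothetical stand-in for a metrics bookkeeping map

	// Nothing was ever stored for this key, so Load returns (nil, false).
	last, _ := updates.Load("some-metric-key")

	// The unchecked assertion on the nil interface panics with exactly:
	//   panic: interface conversion: interface {} is nil, not time.Time
	_ = time.Since(last.(time.Time))
}
```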
The issue has been observed three times in the last week on two different hosts within a single cluster: one running in server mode only, and one running in client mode only. This cluster was upgraded to 0.12.1 only in the last two weeks. I have other clusters running the same code, some for longer, without issue. Metrics are being collected from all Nomad instances.
Unknown
N/A
Aug 22 16:31:37 exampleclient containerd[14069]: time="2020-08-22T16:31:37.855368243+01:00" level=info msg="shim reaped" id=20d053b9f002cf513bbd54495d8417e321898c1981b06f6fdc00b1719fb12531
Aug 22 16:31:37 exampleclient dockerd[31651]: time="2020-08-22T16:31:37.863859185+01:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Aug 22 16:31:37 exampleclient kernel: [1219465.168647] docker0: port 21(veth93f9883) entered disabled state
Aug 22 16:31:37 exampleclient kernel: [1219465.185372] docker0: port 21(veth93f9883) entered disabled state
Aug 22 16:31:37 exampleclient kernel: [1219465.185738] device veth93f9883 left promiscuous mode
Aug 22 16:31:37 exampleclient kernel: [1219465.185750] docker0: port 21(veth93f9883) entered disabled state
Aug 22 16:31:37 exampleclient dockerd[31651]: time="2020-08-22T16:31:37.941194246+01:00" level=warning msg="20d053b9f002cf513bbd54495d8417e321898c1981b06f6fdc00b1719fb12531 cleanup: failed to unmount IPC: umount /data/docker/containers/20d053b9f002cf513bbd54495d8417e321898c1981b06f6fdc00b1719fb12531/mounts/shm, flags: 0x2: no such file or directory"
Aug 22 16:31:37 exampleclient nomad[31644]: 2020-08-22T16:31:37.999+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=9b7c3399-8d2d-fb53-f28f-59edc6421384 task=file-access-logging-task error="container stopped"
Aug 22 16:31:37 exampleclient nomad[31644]: 2020-08-22T16:31:37.999+0100 [DEBUG] client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker path=/usr/bin/nomad pid=20771
Aug 22 16:31:37 exampleclient nomad[31644]: 2020-08-22T16:31:37.999+0100 [DEBUG] client.driver_mgr.docker.docker_logger: plugin exited: driver=docker
Aug 22 16:31:38 exampleclient nomad[31644]: 2020-08-22T16:31:38.016+0100 [DEBUG] client.driver_mgr.docker: image id reference count decremented: driver=docker image_id=sha256:redacted references=0
Aug 22 16:31:38 exampleclient nomad[31644]: 2020-08-22T16:31:38.016+0100 [INFO] client.alloc_runner.task_runner: not restarting task: alloc_id=9b7c3399-8d2d-fb53-f28f-59edc6421384 task=file-access-logging-task reason="Policy allows no restarts"
Aug 22 16:31:38 exampleclient nomad[31644]: panic: interface conversion: interface {} is nil, not time.Time
Aug 22 16:31:38 exampleclient nomad[31644]: goroutine 9110640 [running]:
Aug 22 16:31:38 exampleclient nomad[31644]: github.com/armon/go-metrics/prometheus.(*PrometheusSink).Collect.func3(0x2687b80, 0xc00768b610, 0x2c8fb00, 0xc000aa5ce0, 0x1)
Aug 22 16:31:38 exampleclient nomad[31644]: github.com/armon/[email protected]/prometheus/prometheus.go:96 +0x1d4
Aug 22 16:31:38 exampleclient nomad[31644]: sync.(*Map).Range(0xc00062c680, 0xc000477e90)
Aug 22 16:31:38 exampleclient nomad[31644]: sync/map.go:333 +0x1c5
Aug 22 16:31:38 exampleclient nomad[31644]: github.com/armon/go-metrics/prometheus.(*PrometheusSink).Collect(0xc00062c630, 0xc0010c0600)
Aug 22 16:31:38 exampleclient nomad[31644]: github.com/armon/[email protected]/prometheus/prometheus.go:94 +0x1d3
Aug 22 16:31:38 exampleclient nomad[31644]: github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
Aug 22 16:31:38 exampleclient nomad[31644]: github.com/prometheus/[email protected]/prometheus/registry.go:443 +0x19d
Aug 22 16:31:38 exampleclient nomad[31644]: created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
Aug 22 16:31:38 exampleclient nomad[31644]: github.com/prometheus/[email protected]/prometheus/registry.go:454 +0x599
Aug 22 16:31:38 exampleclient systemd[1]: nomad.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 22 16:31:39 exampleclient systemd[1]: Unit nomad.service entered failed state.
Aug 22 16:31:39 exampleclient systemd[1]: nomad.service failed.
This instance affected a client and is not correlated with the server logs below.
Aug 21 14:22:18 examplehost nomad[3573]: 2020-08-21T14:22:18.570+0100 [DEBUG] worker.batch_sched: reconciled current state with desired state: eval_id=391d751b-4c8e-4bb1-cdd3-a4dfcf90717d job_id=prod-a2-3b8ad65dc0044b54a395bda4455f4590 namespace=default results="Total changes: (place 1) (destructive 0) (inplace 0) (stop 0)
Aug 21 14:22:18 examplehost nomad[3573]: Desired Changes for "a2-group": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)"
Aug 21 14:22:18 examplehost nomad[3573]: 2020-08-21T14:22:18.586+0100 [DEBUG] worker: submitted plan for evaluation: eval_id=391d751b-4c8e-4bb1-cdd3-a4dfcf90717d
Aug 21 14:22:18 examplehost nomad[3573]: 2020-08-21T14:22:18.586+0100 [DEBUG] worker.batch_sched: setting eval status: eval_id=391d751b-4c8e-4bb1-cdd3-a4dfcf90717d job_id=prod-a2-3b8ad65dc0044b54a395bda4455f4590 namespace=default status=complete
Aug 21 14:22:18 examplehost nomad[3573]: 2020-08-21T14:22:18.599+0100 [DEBUG] worker: updated evaluation: eval="<Eval "391d751b-4c8e-4bb1-cdd3-a4dfcf90717d" JobID: "prod-a2-3b8ad65dc0044b54a395bda4455f4590" Namespace: "default">"
Aug 21 14:22:18 examplehost nomad[3573]: panic: interface conversion: interface {} is nil, not time.Time
Aug 21 14:22:18 examplehost nomad[3573]: goroutine 8976698 [running]:
Aug 21 14:22:18 examplehost nomad[3573]: github.com/armon/go-metrics/prometheus.(*PrometheusSink).Collect.func2(0x2687b80, 0xc0053575e0, 0x2d2a180, 0xc0039ad300, 0x1)
Aug 21 14:22:18 examplehost nomad[3573]: github.com/armon/[email protected]/prometheus/prometheus.go:86 +0x1d4
Aug 21 14:22:18 examplehost nomad[3573]: sync.(*Map).Range(0xc00065a398, 0xc0011b2e90)
Aug 21 14:22:18 examplehost nomad[3573]: sync/map.go:333 +0x1c5
Aug 21 14:22:18 examplehost nomad[3573]: github.com/armon/go-metrics/prometheus.(*PrometheusSink).Collect(0xc00065a370, 0xc004305380)
Aug 21 14:22:18 examplehost nomad[3573]: github.com/armon/[email protected]/prometheus/prometheus.go:84 +0x154
Aug 21 14:22:18 examplehost nomad[3573]: github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
Aug 21 14:22:18 examplehost nomad[3573]: github.com/prometheus/[email protected]/prometheus/registry.go:443 +0x19d
Aug 21 14:22:18 examplehost nomad[3573]: created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
Aug 21 14:22:18 examplehost nomad[3573]: github.com/prometheus/[email protected]/prometheus/registry.go:535 +0xe36
Aug 21 14:22:18 examplehost systemd[1]: nomad.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 21 14:22:18 examplehost systemd[1]: Unit nomad.service entered failed state.
Aug 21 14:22:18 examplehost systemd[1]: nomad.service failed.
This instance affected a server and is not correlated with the client logs above.
Hi @optiz0r, sorry to hear about this. We patched the go-metrics package to 0.3.4 in https://github.com/hashicorp/nomad/pull/8637 ~but that's not yet in a released version of Nomad~. Could you try applying this patch, https://github.com/hashicorp/nomad/pull/8637/commits/84681d70327ec0a55f9ea496c337317e5165d675, to see whether it corrects the issue in your build?
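For anyone else hitting this before they can upgrade: the crash is an unchecked type assertion while iterating a sync.Map during a Prometheus scrape, so a single nil entry takes down the whole agent. A rough sketch of the comma-ok guard pattern that avoids that class of panic (illustrative only; the names and structure here are hypothetical and do not mirror the actual go-metrics change, so upgrading to a release with the real fix is still the recommended route):

```go
package main

import (
	"sync"
	"time"
)

// sweepExpired is an illustration only: it shows a comma-ok guard that
// prevents the panic seen in the traces above. The names gauges/updates/
// expiration are hypothetical, not the real go-metrics internals.
func sweepExpired(gauges, updates *sync.Map, expiration time.Duration) {
	gauges.Range(func(k, v interface{}) bool {
		raw, ok := updates.Load(k)
		if !ok {
			return true // no timestamp recorded yet; skip instead of panicking
		}
		lastUpdate, ok := raw.(time.Time)
		if !ok {
			return true // nil or unexpected type; skip defensively
		}
		if expiration != 0 && time.Since(lastUpdate) > expiration {
			gauges.Delete(k)
			updates.Delete(k)
		}
		return true
	})
}

func main() {
	var gauges, updates sync.Map
	// A metric with no matching timestamp entry: an unchecked assertion on the
	// missing value would crash the process; the guard above just skips it.
	gauges.Store("metric-with-no-timestamp", 1.0)
	sweepExpired(&gauges, &updates, time.Minute)
}
```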
Actually I just checked and it looks like that's in 0.12.3 already (it's not showing up in the PR metadata).
Ok, I'll try to build 0.12.3 this week and upgrade the relevant dev clusters, with a view to upgrading the affected prod cluster shortly after.
Just to update: I have three clusters upgraded to 0.12.3 already. The cluster that originally showed this error is scheduled to be upgraded in ~10 days' time. I have not seen any recurrence of the issue since 22nd Aug though, so it's hard to know whether the upgrade is having, or will have, any effect here. I'm also not sure what specific action triggered the three crashes in the first place, so there are no reproduction steps to help validate.
No recurrence so far; closing this as resolved with the upgrade to 0.12.3. Thanks!