Nomad v0.12.1-local+1
We patch and compile our own Nomad source, but the changes only ensure an existing Docker driver option is set to a permitted value; they should have no bearing on this issue.
CentOS 7.7.1908 x86_64
Nomad servers and clients have panicked within the Prometheus metric collection handling code with the error `panic: interface conversion: interface {} is nil, not time.Time`. See the traces in the logs below.
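For clarity on what the message means: this is the standard panic produced by an unchecked type assertion on a nil interface value. A minimal standalone Go sketch (hypothetical illustration only, not Nomad or go-metrics code) that produces the same message:

```go
package main

import (
	"sync"
	"time"
)

func main() {
	var updates sync.Map // hypothetical stand-in for a metrics bookkeeping map

	// Nothing was ever stored for this key, so Load returns (nil, false).
	last, _ := updates.Load("some-metric-key")

	// The unchecked assertion on the nil interface panics with exactly:
	//   panic: interface conversion: interface {} is nil, not time.Time
	_ = time.Since(last.(time.Time))
}
```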
The issue has been observed three times in the last week on two different hosts within a single cluster: one running in server mode only, and one running in client mode only. This cluster was upgraded to 0.12.1 only in the last two weeks. I have other clusters running the same code, some for longer, without issue. Metrics are being collected from all Nomad instances.
Unknown
N/A
Aug 22 16:31:37 exampleclient containerd[14069]: time="2020-08-22T16:31:37.855368243+01:00" level=info msg="shim reaped" id=20d053b9f002cf513bbd54495d8417e321898c1981b06f6fdc00b1719fb12531
Aug 22 16:31:37 exampleclient dockerd[31651]: time="2020-08-22T16:31:37.863859185+01:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Aug 22 16:31:37 exampleclient kernel: [1219465.168647] docker0: port 21(veth93f9883) entered disabled state
Aug 22 16:31:37 exampleclient kernel: [1219465.185372] docker0: port 21(veth93f9883) entered disabled state
Aug 22 16:31:37 exampleclient kernel: [1219465.185738] device veth93f9883 left promiscuous mode
Aug 22 16:31:37 exampleclient kernel: [1219465.185750] docker0: port 21(veth93f9883) entered disabled state
Aug 22 16:31:37 exampleclient dockerd[31651]: time="2020-08-22T16:31:37.941194246+01:00" level=warning msg="20d053b9f002cf513bbd54495d8417e321898c1981b06f6fdc00b1719fb12531 cleanup: failed to unmount IPC: umount /data/docker/containers/20d053b9f002cf513bbd54495d8417e321898c1981b06f6fdc00b1719fb12531/mounts/shm, flags: 0x2: no such file or directory"
Aug 22 16:31:37 exampleclient nomad[31644]: 2020-08-22T16:31:37.999+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=9b7c3399-8d2d-fb53-f28f-59edc6421384 task=file-access-logging-task error="container stopped"
Aug 22 16:31:37 exampleclient nomad[31644]: 2020-08-22T16:31:37.999+0100 [DEBUG] client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker path=/usr/bin/nomad pid=20771
Aug 22 16:31:37 exampleclient nomad[31644]: 2020-08-22T16:31:37.999+0100 [DEBUG] client.driver_mgr.docker.docker_logger: plugin exited: driver=docker
Aug 22 16:31:38 exampleclient nomad[31644]: 2020-08-22T16:31:38.016+0100 [DEBUG] client.driver_mgr.docker: image id reference count decremented: driver=docker image_id=sha256:redacted references=0
Aug 22 16:31:38 exampleclient nomad[31644]: 2020-08-22T16:31:38.016+0100 [INFO] client.alloc_runner.task_runner: not restarting task: alloc_id=9b7c3399-8d2d-fb53-f28f-59edc6421384 task=file-access-logging-task reason="Policy allows no restarts"
Aug 22 16:31:38 exampleclient nomad[31644]: panic: interface conversion: interface {} is nil, not time.Time
Aug 22 16:31:38 exampleclient nomad[31644]: goroutine 9110640 [running]:
Aug 22 16:31:38 exampleclient nomad[31644]: github.com/armon/go-metrics/prometheus.(*PrometheusSink).Collect.func3(0x2687b80, 0xc00768b610, 0x2c8fb00, 0xc000aa5ce0, 0x1)
Aug 22 16:31:38 exampleclient nomad[31644]: github.com/armon/[email protected]/prometheus/prometheus.go:96 +0x1d4
Aug 22 16:31:38 exampleclient nomad[31644]: sync.(*Map).Range(0xc00062c680, 0xc000477e90)
Aug 22 16:31:38 exampleclient nomad[31644]: sync/map.go:333 +0x1c5
Aug 22 16:31:38 exampleclient nomad[31644]: github.com/armon/go-metrics/prometheus.(*PrometheusSink).Collect(0xc00062c630, 0xc0010c0600)
Aug 22 16:31:38 exampleclient nomad[31644]: github.com/armon/[email protected]/prometheus/prometheus.go:94 +0x1d3
Aug 22 16:31:38 exampleclient nomad[31644]: github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
Aug 22 16:31:38 exampleclient nomad[31644]: github.com/prometheus/[email protected]/prometheus/registry.go:443 +0x19d
Aug 22 16:31:38 exampleclient nomad[31644]: created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
Aug 22 16:31:38 exampleclient nomad[31644]: github.com/prometheus/[email protected]/prometheus/registry.go:454 +0x599
Aug 22 16:31:38 exampleclient systemd[1]: nomad.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 22 16:31:39 exampleclient systemd[1]: Unit nomad.service entered failed state.
Aug 22 16:31:39 exampleclient systemd[1]: nomad.service failed.
This instance affected a client and is not correlated with the server logs below.
Aug 21 14:22:18 examplehost nomad[3573]: 2020-08-21T14:22:18.570+0100 [DEBUG] worker.batch_sched: reconciled current state with desired state: eval_id=391d751b-4c8e-4bb1-cdd3-a4dfcf90717d job_id=prod-a2-3b8ad65dc0044b54a395bda4455f4590 namespace=default results="Total changes: (place 1) (destructive 0) (inplace 0) (stop 0)
Aug 21 14:22:18 examplehost nomad[3573]: Desired Changes for "a2-group": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)"
Aug 21 14:22:18 examplehost nomad[3573]: 2020-08-21T14:22:18.586+0100 [DEBUG] worker: submitted plan for evaluation: eval_id=391d751b-4c8e-4bb1-cdd3-a4dfcf90717d
Aug 21 14:22:18 examplehost nomad[3573]: 2020-08-21T14:22:18.586+0100 [DEBUG] worker.batch_sched: setting eval status: eval_id=391d751b-4c8e-4bb1-cdd3-a4dfcf90717d job_id=prod-a2-3b8ad65dc0044b54a395bda4455f4590 namespace=default status=complete
Aug 21 14:22:18 examplehost nomad[3573]: 2020-08-21T14:22:18.599+0100 [DEBUG] worker: updated evaluation: eval="<Eval "391d751b-4c8e-4bb1-cdd3-a4dfcf90717d" JobID: "prod-a2-3b8ad65dc0044b54a395bda4455f4590" Namespace: "default">"
Aug 21 14:22:18 examplehost nomad[3573]: panic: interface conversion: interface {} is nil, not time.Time
Aug 21 14:22:18 examplehost nomad[3573]: goroutine 8976698 [running]:
Aug 21 14:22:18 examplehost nomad[3573]: github.com/armon/go-metrics/prometheus.(*PrometheusSink).Collect.func2(0x2687b80, 0xc0053575e0, 0x2d2a180, 0xc0039ad300, 0x1)
Aug 21 14:22:18 examplehost nomad[3573]: github.com/armon/[email protected]/prometheus/prometheus.go:86 +0x1d4
Aug 21 14:22:18 examplehost nomad[3573]: sync.(*Map).Range(0xc00065a398, 0xc0011b2e90)
Aug 21 14:22:18 examplehost nomad[3573]: sync/map.go:333 +0x1c5
Aug 21 14:22:18 examplehost nomad[3573]: github.com/armon/go-metrics/prometheus.(*PrometheusSink).Collect(0xc00065a370, 0xc004305380)
Aug 21 14:22:18 examplehost nomad[3573]: github.com/armon/[email protected]/prometheus/prometheus.go:84 +0x154
Aug 21 14:22:18 examplehost nomad[3573]: github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
Aug 21 14:22:18 examplehost nomad[3573]: github.com/prometheus/[email protected]/prometheus/registry.go:443 +0x19d
Aug 21 14:22:18 examplehost nomad[3573]: created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
Aug 21 14:22:18 examplehost nomad[3573]: github.com/prometheus/[email protected]/prometheus/registry.go:535 +0xe36
Aug 21 14:22:18 examplehost systemd[1]: nomad.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 21 14:22:18 examplehost systemd[1]: Unit nomad.service entered failed state.
Aug 21 14:22:18 examplehost systemd[1]: nomad.service failed.
This instance affected a server and is not correlated with the client logs above.
Hi @optiz0r, sorry to hear about this. We patched the go-metrics package to 0.3.4 in https://github.com/hashicorp/nomad/pull/8637 ~but that's not yet in a released version of Nomad~. Could you try applying this patch, https://github.com/hashicorp/nomad/pull/8637/commits/84681d70327ec0a55f9ea496c337317e5165d675, to see whether it corrects the issue in your build?
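For anyone else hitting this before they can upgrade: the crash is an unchecked type assertion while iterating a sync.Map during a Prometheus scrape, so a single nil entry takes down the whole agent. A rough sketch of the comma-ok guard pattern that avoids that class of panic (illustrative only; the names and structure here are hypothetical and do not mirror the actual go-metrics change, so upgrading to a release with the real fix is still the recommended route):

```go
package main

import (
	"sync"
	"time"
)

// sweepExpired is an illustration only: it shows a comma-ok guard that
// prevents the panic seen in the traces above. The names gauges/updates/
// expiration are hypothetical, not the real go-metrics internals.
func sweepExpired(gauges, updates *sync.Map, expiration time.Duration) {
	gauges.Range(func(k, v interface{}) bool {
		raw, ok := updates.Load(k)
		if !ok {
			return true // no timestamp recorded yet; skip instead of panicking
		}
		lastUpdate, ok := raw.(time.Time)
		if !ok {
			return true // nil or unexpected type; skip defensively
		}
		if expiration != 0 && time.Since(lastUpdate) > expiration {
			gauges.Delete(k)
			updates.Delete(k)
		}
		return true
	})
}

func main() {
	var gauges, updates sync.Map
	// A metric with no matching timestamp entry: an unchecked assertion on the
	// missing value would crash the process; the guard above just skips it.
	gauges.Store("metric-with-no-timestamp", 1.0)
	sweepExpired(&gauges, &updates, time.Minute)
}
```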
Actually I just checked and it looks like that's in 0.12.3 already (it's not showing up in the PR metadata).
Ok, I'll try to build 0.12.3 this week and upgrade the relevant dev clusters, with a view to upgrading the affected prod cluster shortly after.
Just to update: I have three clusters upgraded to 0.12.3 already. The cluster that originally showed this error is scheduled to be upgraded in ~10 days' time. I have not seen any recurrence of the issue since 22nd Aug though, so it's hard to know whether the upgrade is having, or will have, any effect here. I'm also not sure what specific action triggered the three crashes in the first place, so there are no reproduction steps to help validate.
No recurrence so far; closing this as resolved with the upgrade to 0.12.3. Thanks!