Nomad: SIGSEGV on startup of nomad client since 0.5.3

Created on 31 Jan 2017 · 9 Comments · Source: hashicorp/nomad

Nomad version

Output from nomad version

Nomad v0.5.3

64bit, tried both the LXC and non-LXC versions.

Operating system and Environment details

Host: Ubuntu Xenial, running in a LXC container.
Docker for Jobs:

Client:
 Version:      1.13.0
 API version:  1.25
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Tue Jan 17 09:58:26 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.13.0
 API version:  1.25 (minimum version 1.12)
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Tue Jan 17 09:58:26 2017
 OS/Arch:      linux/amd64
 Experimental: false

Issue

After updating to 0.5.3, nomad agent crashes on startup if started in client mode.

The crash seems to differ depending on whether jobs are present on the node (I upgraded without draining and ended up with a useless cluster). I’ve attached a trace for each of the two cases.

Currently I’m unable to get one of my nodes back up. :( The others for some reason are working again.

Let me know if you need any more intel or if you have any hints on how to resolve this…

Reproduction steps

Start it.

Nomad Server logs (if appropriate)

n/a, server works fine.

Nomad Client logs (if appropriate)

No jobs, clean docker

c-2001:~# /usr/local/bin/nomad agent -config /etc/nomad/client.hcl
    Loaded configuration from /etc/nomad/client.hcl
==> Starting Nomad agent...
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x5ed497]

goroutine 70 [running]:
panic(0x10227c0, 0xc420012090)
    /opt/go/src/runtime/panic.go:500 +0x1a1
github.com/hashicorp/nomad/client/driver.(*CreatedResources).Copy(0x0, 0x1255710)
    /opt/gopath/src/github.com/hashicorp/nomad/client/driver/driver.go:108 +0x57
github.com/hashicorp/nomad/client.(*TaskRunner).SaveState(0xc42000f600, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:307 +0xc5
github.com/hashicorp/nomad/client.(*TaskRunner).setState(0xc42000f600, 0x119d299, 0x7, 0xc420189d40)
    /opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:338 +0x3c
github.com/hashicorp/nomad/client.(*TaskRunner).createDriver.func1(0x11b3444, 0x17, 0xc4201dcde0, 0x2, 0x2)
    /opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:380 +0x1f0
github.com/hashicorp/nomad/client/driver.(*DockerDriver).pullImage(0xc420454cd0, 0xc4201861c0, 0xc42039c640, 0xc42011d0b0, 0x1c, 0xc42011d0cd, 0x6, 0x2, 0x6)
    /opt/gopath/src/github.com/hashicorp/nomad/client/driver/docker.go:1007 +0x2bc
github.com/hashicorp/nomad/client/driver.(*DockerDriver).createImage(0xc420454cd0, 0xc4201861c0, 0xc42039c640, 0xc4203c6b60, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/client/driver/docker.go:972 +0x1af
github.com/hashicorp/nomad/client/driver.(*DockerDriver).Prestart(0xc420454cd0, 0xc4201dc080, 0xc420184820, 0x0, 0x0, 0xc4200ffe1c)
    /opt/gopath/src/github.com/hashicorp/nomad/client/driver/docker.go:383 +0xe2
github.com/hashicorp/nomad/client.(*TaskRunner).startTask(0xc42000f600, 0xc420142960, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:1160 +0x27d
github.com/hashicorp/nomad/client.(*TaskRunner).run(0xc42000f600)
    /opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:902 +0x38a
github.com/hashicorp/nomad/client.(*TaskRunner).Run(0xc42000f600)
    /opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:444 +0x6a1
created by github.com/hashicorp/nomad/client.(*AllocRunner).RestoreState
    /opt/gopath/src/github.com/hashicorp/nomad/client/alloc_runner.go:190 +0x891
c-2001:~#

Jobs present, still running inside Docker

Jan 31 11:16:53 c-2001 nomad-client[901]:     Loaded configuration from /etc/nomad/client.hcl
Jan 31 11:16:53 c-2001 nomad-client[901]: ==> Starting Nomad agent...
Jan 31 11:16:57 c-2001 nomad-client[901]: ==> Nomad agent configuration:
Jan 31 11:16:57 c-2001 nomad-client[901]:                  Atlas: <disabled>
Jan 31 11:16:57 c-2001 nomad-client[901]:                 Client: true
Jan 31 11:16:57 c-2001 nomad-client[901]:              Log Level: INFO
Jan 31 11:16:57 c-2001 nomad-client[901]:                 Region: global (DC: scaleup)
Jan 31 11:16:57 c-2001 nomad-client[901]:                 Server: false
Jan 31 11:16:57 c-2001 nomad-client[901]:                Version: 0.5.3
Jan 31 11:16:57 c-2001 nomad-client[901]: ==> Nomad agent started! Log data will stream in below:
Jan 31 11:16:57 c-2001 nomad-client[901]:     2017/01/31 11:16:53.707555 [INFO] client: using state directory /vrmd/nomad/client
Jan 31 11:16:57 c-2001 nomad-client[901]:     2017/01/31 11:16:53.707632 [INFO] client: using alloc directory /vrmd/nomad/alloc
Jan 31 11:16:57 c-2001 nomad-client[901]:     2017/01/31 11:16:53.708129 [INFO] fingerprint.cgroups: cgroups are available
Jan 31 11:16:57 c-2001 nomad-client[901]:     2017/01/31 11:16:53.712605 [INFO] fingerprint.consul: consul agent is available
Jan 31 11:16:57 c-2001 nomad-client[901]:     2017/01/31 11:16:57.737729 [INFO] driver.docker: re-attaching to docker process: 5e46cc5404ba8634d31b586be4bb7b03381be29f67c7987cf900a559fe3d0071
Jan 31 11:16:57 c-2001 nomad-client[901]:     2017/01/31 11:16:57.738667 [ERR] client: failed to open handle to task 'httpbin' for alloc '9c30dfb4-d9b3-6adb-148d-eea7b53eee9d': Failed to find container 5e46cc5404ba8634d31b586be4bb7b03381be29f67c7987cf900a559fe3d0071
Jan 31 11:16:57 c-2001 nomad-client[901]:     2017/01/31 11:16:57.740146 [INFO] driver.docker: re-attaching to docker process: 4407630c7f5a072387402dc64b323b0113ceac614ea8310e4c2ecf32ddcde64f
Jan 31 11:16:57 c-2001 nomad-client[901]:     2017/01/31 11:16:57.740957 [ERR] client: failed to open handle to task 'hashi-ui' for alloc 'c6440e03-e17e-64f6-704b-c656598c474e': Failed to find container 4407630c7f5a072387402dc64b323b0113ceac614ea8310e4c2ecf32ddcde64f
Jan 31 11:16:57 c-2001 nomad-client[901]:     2017/01/31 11:16:57.740990 [INFO] client: Node ID "1353a057-dcfb-def7-3385-87c75884b01e"
Jan 31 11:16:57 c-2001 nomad-client[901]: panic: runtime error: invalid memory address or nil pointer dereference
Jan 31 11:16:57 c-2001 nomad-client[901]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x5eb974]
Jan 31 11:16:57 c-2001 nomad-client[901]: goroutine 28 [running]:
Jan 31 11:16:57 c-2001 nomad-client[901]: panic(0x10001e0, 0xc420010080)
Jan 31 11:16:57 c-2001 nomad-client[901]: #011/opt/go/src/runtime/panic.go:500 +0x1a1
Jan 31 11:16:57 c-2001 nomad-client[901]: github.com/hashicorp/nomad/client/driver.(*CreatedResources).Merge(0x0, 0xc4200241d0)
Jan 31 11:16:57 c-2001 nomad-client[901]: #011/opt/gopath/src/github.com/hashicorp/nomad/client/driver/driver.go:127 +0xe4
Jan 31 11:16:57 c-2001 nomad-client[901]: github.com/hashicorp/nomad/client.(*TaskRunner).startTask(0xc42013d1e0, 0xc42040bda0, 0x0)
Jan 31 11:16:57 c-2001 nomad-client[901]: #011/opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:1164 +0x2e3
Jan 31 11:16:57 c-2001 nomad-client[901]: github.com/hashicorp/nomad/client.(*TaskRunner).run(0xc42013d1e0)
Jan 31 11:16:57 c-2001 nomad-client[901]: #011/opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:902 +0x38a
Jan 31 11:16:57 c-2001 nomad-client[901]: github.com/hashicorp/nomad/client.(*TaskRunner).Run(0xc42013d1e0)
Jan 31 11:16:57 c-2001 nomad-client[901]: #011/opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:444 +0x6a1
Jan 31 11:16:57 c-2001 nomad-client[901]: created by github.com/hashicorp/nomad/client.(*AllocRunner).RestoreState
Jan 31 11:16:57 c-2001 nomad-client[901]: #011/opt/gopath/src/github.com/hashicorp/nomad/client/alloc_runner.go:190 +0x891

theme/client type/bug

All 9 comments

We hit the same crash after upgrading Nomad to 0.5.3 without a node-drain:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x5eb647]

goroutine 335 [running]:
panic(0x10001e0, 0xc420012060)
        /opt/go/src/runtime/panic.go:500 +0x1a1
github.com/hashicorp/nomad/client/driver.(*CreatedResources).Copy(0x0, 0x122e5b0)
        /opt/gopath/src/github.com/hashicorp/nomad/client/driver/driver.go:108 +0x57
github.com/hashicorp/nomad/client.(*TaskRunner).SaveState(0xc42008b4a0, 0x0, 0x0)
        /opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:307 +0xc5
github.com/hashicorp/nomad/client.(*AllocRunner).saveTaskRunnerState(0xc4200be1e0, 0xc42008b4a0, 0x1, 0x1)
        /opt/gopath/src/github.com/hashicorp/nomad/client/alloc_runner.go:249 +0x35
github.com/hashicorp/nomad/client.(*AllocRunner).SaveState(0xc4200be1e0, 0xc420450150, 0xc420830c10)
        /opt/gopath/src/github.com/hashicorp/nomad/client/alloc_runner.go:215 +0xe9
github.com/hashicorp/nomad/client.(*Client).saveState(0xc4202cd040, 0x1188f94, 0x13)
        /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:604 +0x120
github.com/hashicorp/nomad/client.(*Client).runAllocs(0xc4202cd040, 0xc4202ecbc0)
        /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:1507 +0x797
github.com/hashicorp/nomad/client.(*Client).run(0xc4202cd040)
        /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:969 +0x119
created by github.com/hashicorp/nomad/client.NewClient
        /opt/gopath/src/github.com/hashicorp/nomad/client/client.go:298 +0xc7a
    Loaded configuration from /etc/nomad/nomad.json

After we did some cleanup (stopped the jobs that were placed on the upgraded node and ran node-drain on it), Nomad started working as expected.

Hey, sorry this happened 👎. To recover, can you delete the client's data_dir and bring the client back up?

We will make sure 0.5.4 allows an in-place upgrade path for those who would like to wait!

Alex, can I conclude that running nomad node-drain first and then upgrading would be safe?

Potentially not. The client keeps some state files in the data_dir that it tries to restore from. In 0.5.3 we introduced new fields in that state file, and it seems the upgrade isn't handling older files properly.

So I suggest you run nomad node-drain, then delete the data_dir and bring the client back up.
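
For readers following along, here is a minimal Go sketch of the failure mode described above and one common way to guard against it. Both traces show `(*CreatedResources).Copy(0x0, ...)` and `(*CreatedResources).Merge(0x0, ...)`, i.e. methods invoked on a nil receiver. The types `createdResources` and `taskState`, the `restore` helper, and the use of JSON for the state file are all assumptions made for illustration; this is not Nomad's actual code and not necessarily how 0.5.4 fixed it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// createdResources is a hypothetical stand-in for a field that a newer
// release adds to the persisted client state.
type createdResources struct {
	Resources map[string][]string
}

// Copy tolerates a nil receiver: it returns an empty value instead of
// dereferencing nil, which is what triggers the SIGSEGV in the traces.
func (r *createdResources) Copy() *createdResources {
	out := &createdResources{Resources: map[string][]string{}}
	if r == nil {
		return out
	}
	for k, v := range r.Resources {
		out.Resources[k] = append([]string(nil), v...)
	}
	return out
}

// taskState is a hypothetical persisted snapshot of a task runner.
type taskState struct {
	Task             string
	CreatedResources *createdResources
}

// restore decodes a snapshot and initializes any field that a state
// file written by an older version does not contain.
func restore(data []byte) (*taskState, error) {
	var s taskState
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	if s.CreatedResources == nil {
		s.CreatedResources = &createdResources{Resources: map[string][]string{}}
	}
	return &s, nil
}

func main() {
	// A snapshot written before the new field existed (assumed format).
	old := []byte(`{"Task":"redis"}`)
	s, err := restore(old)
	if err != nil {
		panic(err)
	}
	// Safe even though the old file never mentioned CreatedResources.
	fmt.Println(len(s.CreatedResources.Copy().Resources))
}
```

Either guard alone would avoid the panic in this sketch: initializing the field at restore time, or making the methods nil-safe. Doing both keeps later call sites from having to care which version wrote the file.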

Repro'd in like 30s using Nomad 0.5.2 and 0.5.3 binaries with the example.nomad Redis job.

Very embarrassed I let this slip in. Fix coming.

Hey @schmichael, any idea when 0.5.4 will be up at https://releases.hashicorp.com/nomad/ ?

EDIT: it's there now!

@schmichael we love you anyway!

@holtwilkins It would have been up sooner, but this was only the second time I've driven a release and I was pretty slow at it. Thanks for your patience!
