Nomad server crash after Raft quorum on 0.9.3, 0.10.1

Created on 2 Dec 2019 · 5 comments · Source: hashicorp/nomad

Nomad version

Nomad v0.10.2-rc1 (33d411d6a5bc72870ce504e75371d219e5c729bb)
Same issue on 0.9.3 and 0.10.1.

Operating system and Environment details

Ubuntu 16.04

Issue

Three-node Nomad server cluster that had been working fine for ages. It suddenly started crashing with SIGSEGV after completing the Raft election. A single node works fine, but as soon as another node joins the cluster, all nodes crash with the error messages below. Probably bad alloc state handling?

Reproduction steps

No idea.

Nomad Server logs (if appropriate)

Dec 02 15:35:42 consul03 nomad[5075]: ==> Loaded configuration from /etc/nomad.d/base.hcl, /etc/nomad.d/server.hcl
Dec 02 15:35:42 consul03 nomad[5075]: ==> Starting Nomad agent...
Dec 02 15:35:42 consul03 nomad[5075]: ==> Joining cluster...
Dec 02 15:35:42 consul03 nomad[5075]: ==> Join completed. Synced with 1 initial agents
Dec 02 15:35:42 consul03 nomad[5075]: ==> Nomad agent configuration:
Dec 02 15:35:42 consul03 nomad[5075]:        Advertise Addrs: HTTP: 10.10.10.33:4646; RPC: 10.10.10.33:4647; Serf: 10.10.10.33:4648
Dec 02 15:35:42 consul03 nomad[5075]:             Bind Addrs: HTTP: 10.10.10.33:4646; RPC: 10.10.10.33:4647; Serf: 10.10.10.33:4648
Dec 02 15:35:42 consul03 nomad[5075]:                 Client: false
Dec 02 15:35:42 consul03 nomad[5075]:              Log Level: INFO
Dec 02 15:35:42 consul03 nomad[5075]:                 Region: global (DC: dc1)
Dec 02 15:35:42 consul03 nomad[5075]:                 Server: true
Dec 02 15:35:42 consul03 nomad[5075]:                Version: 0.10.1
Dec 02 15:35:42 consul03 nomad[5075]: ==> Nomad agent started! Log data will stream in below:
Dec 02 15:35:42 consul03 nomad[5075]:     2019-12-02T15:35:42.528Z [WARN ] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/var/nomad/plugins
Dec 02 15:35:42 consul03 nomad[5075]:     2019-12-02T15:35:42.529Z [INFO ] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Dec 02 15:35:42 consul03 nomad[5075]:     2019-12-02T15:35:42.529Z [INFO ] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Dec 02 15:35:42 consul03 nomad[5075]:     2019-12-02T15:35:42.529Z [INFO ] agent: detected plugin: name=java type=driver plugin_version=0.1.0
Dec 02 15:35:42 consul03 nomad[5075]:     2019-12-02T15:35:42.529Z [INFO ] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Dec 02 15:35:42 consul03 nomad[5075]:     2019-12-02T15:35:42.529Z [INFO ] agent: detected plugin: name=rkt type=driver plugin_version=0.1.0
Dec 02 15:35:42 consul03 nomad[5075]:     2019-12-02T15:35:42.529Z [INFO ] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Dec 02 15:35:42 consul03 nomad[5075]:     2019-12-02T15:35:42.529Z [INFO ] agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
Dec 02 15:35:42 consul03 nomad[5075]:     2019-12-02T15:35:42.686Z [INFO ] nomad: raft: Restored from snapshot 9369-3188799-1575250626660
Dec 02 15:35:42 consul03 nomad[5075]:     2019-12-02T15:35:42.742Z [INFO ] nomad: raft: Initial configuration (index=1890378): [{Suffrage:Voter ID:10.10.10.18:4647 Address:10.10.10.18:4647} {Suffrage:Voter ID:10.10.10.20:4647 Address:10.10.10.20:4647} {Suffrage:Voter ID:10.10.10.33:4647 Address:10.10.10.33:4647}]
Dec 02 15:35:42 consul03 nomad[5075]:     2019-12-02T15:35:42.742Z [INFO ] nomad: serf: Ignoring previous leave in snapshot
Dec 02 15:35:42 consul03 nomad[5075]:     2019-12-02T15:35:42.743Z [INFO ] nomad: serf: EventMemberJoin: consul03.global 10.10.10.33
Dec 02 15:35:42 consul03 nomad[5075]:     2019-12-02T15:35:42.743Z [INFO ] nomad: starting scheduling worker(s): num_workers=2 schedulers="[batch system _core service]"
Dec 02 15:35:42 consul03 nomad[5075]:     2019-12-02T15:35:42.743Z [INFO ] nomad: raft: Node at 10.10.10.33:4647 [Follower] entering Follower state (Leader: "")
Dec 02 15:35:42 consul03 nomad[5075]:     2019-12-02T15:35:42.743Z [INFO ] nomad: serf: Attempting re-join to previously known node: consul02.global: 10.10.10.20:4648
Dec 02 15:35:42 consul03 nomad[5075]:     2019-12-02T15:35:42.744Z [INFO ] nomad: adding server: server="consul03.global (Addr: 10.10.10.33:4647) (DC: dc1)"
Dec 02 15:35:42 consul03 nomad[5075]:     2019-12-02T15:35:42.745Z [INFO ] nomad: serf: Attempting re-join to previously known node: consul01.global: 10.10.10.18:4648
Dec 02 15:35:42 consul03 nomad[5075]:     2019-12-02T15:35:42.746Z [WARN ] nomad: serf: Failed to re-join any previously known node
Dec 02 15:35:42 consul03 nomad[5075]:     2019-12-02T15:35:42.751Z [ERROR] nomad: error looking up Nomad servers in Consul: error="contacted 0 Nomad Servers: 2 error(s) occurred:
Dec 02 15:35:42 consul03 nomad[5075]: * Failed to join 10.10.10.20: dial tcp 10.10.10.20:4648: connect: connection refused
Dec 02 15:35:42 consul03 nomad[5075]: * Failed to join 10.10.10.18: dial tcp 10.10.10.18:4648: connect: connection refused"
Dec 02 15:35:43 consul03 nomad[5075]:     2019-12-02T15:35:43.655Z [INFO ] nomad: serf: EventMemberJoin: consul01.global 10.10.10.18
Dec 02 15:35:43 consul03 nomad[5075]:     2019-12-02T15:35:43.655Z [INFO ] nomad: adding server: server="consul01.global (Addr: 10.10.10.18:4647) (DC: dc1)"
Dec 02 15:35:43 consul03 nomad[5075]:     2019-12-02T15:35:43.661Z [INFO ] nomad: serf: EventMemberJoin: consul02.global 10.10.10.20
Dec 02 15:35:43 consul03 nomad[5075]:     2019-12-02T15:35:43.661Z [INFO ] nomad: adding server: server="consul02.global (Addr: 10.10.10.20:4647) (DC: dc1)"
Dec 02 15:35:43 consul03 nomad[5075]:     2019-12-02T15:35:43.811Z [WARN ] nomad: raft: Heartbeat timeout from "" reached, starting election
Dec 02 15:35:43 consul03 nomad[5075]:     2019-12-02T15:35:43.811Z [INFO ] nomad: raft: Node at 10.10.10.33:4647 [Candidate] entering Candidate state in term 10065
Dec 02 15:35:43 consul03 nomad[5075]:     2019-12-02T15:35:43.822Z [INFO ] nomad: raft: Election won. Tally: 2
Dec 02 15:35:43 consul03 nomad[5075]:     2019-12-02T15:35:43.822Z [INFO ] nomad: raft: Node at 10.10.10.33:4647 [Leader] entering Leader state
Dec 02 15:35:43 consul03 nomad[5075]:     2019-12-02T15:35:43.822Z [INFO ] nomad: raft: Added peer 10.10.10.18:4647, starting replication
Dec 02 15:35:43 consul03 nomad[5075]:     2019-12-02T15:35:43.822Z [INFO ] nomad: raft: Added peer 10.10.10.20:4647, starting replication
Dec 02 15:35:43 consul03 nomad[5075]:     2019-12-02T15:35:43.822Z [INFO ] nomad: cluster leadership acquired
Dec 02 15:35:43 consul03 nomad[5075]:     2019-12-02T15:35:43.823Z [INFO ] nomad: raft: pipelining replication to peer {Voter 10.10.10.20:4647 10.10.10.20:4647}
Dec 02 15:35:43 consul03 nomad[5075]:     2019-12-02T15:35:43.823Z [INFO ] nomad: raft: pipelining replication to peer {Voter 10.10.10.18:4647 10.10.10.18:4647}
Dec 02 15:35:45 consul03 nomad[5075]: panic: runtime error: invalid memory address or nil pointer dereference
Dec 02 15:35:45 consul03 nomad[5075]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0xd22e77]
Dec 02 15:35:45 consul03 nomad[5075]: goroutine 13 [running]:
Dec 02 15:35:45 consul03 nomad[5075]: github.com/hashicorp/nomad/nomad/structs.RemoveAllocs(0xc0014d2000, 0x10, 0x10, 0xc0017808a0, 0x1, 0x1, 0xc0014d2000, 0xe, 0x10)
Dec 02 15:35:45 consul03 nomad[5075]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/structs/funcs.go:61 +0x117
Dec 02 15:35:45 consul03 nomad[5075]: github.com/hashicorp/nomad/scheduler.(*EvalContext).ProposedAllocs(0xc00210c370, 0xc00001e6f0, 0x24, 0xc002782bf0, 0x1756817, 0x245ea00, 0xc00049da40, 0xc0004bac80)
Dec 02 15:35:45 consul03 nomad[5075]:         /opt/gopath/src/github.com/hashicorp/nomad/scheduler/context.go:128 +0x53e
Dec 02 15:35:45 consul03 nomad[5075]: github.com/hashicorp/nomad/scheduler.(*RankedNode).ProposedAllocs(0xc00049da40, 0x2ab6920, 0xc00210c370, 0x7, 0x3fe8a76ac49d1242, 0xc001a0c638, 0x0, 0x1)
Dec 02 15:35:45 consul03 nomad[5075]:         /opt/gopath/src/github.com/hashicorp/nomad/scheduler/rank.go:44 +0x63
Dec 02 15:35:45 consul03 nomad[5075]: github.com/hashicorp/nomad/scheduler.(*BinPackIterator).Next(0xc0025afc40, 0x40ed9a)
Dec 02 15:35:45 consul03 nomad[5075]:         /opt/gopath/src/github.com/hashicorp/nomad/scheduler/rank.go:186 +0x84
Dec 02 15:35:45 consul03 nomad[5075]: github.com/hashicorp/nomad/scheduler.(*ScoreNormalizationIterator).Next(0xc0019b6b80, 0x2a5fec0)
Dec 02 15:35:45 consul03 nomad[5075]: github.com/hashicorp/nomad/scheduler.(*SystemStack).Select(0xc00049d490, 0xc00167b260, 0x0, 0x24)
Dec 02 15:35:45 consul03 nomad[5075]:         /opt/gopath/src/github.com/hashicorp/nomad/scheduler/stack.go:279 +0x1b4
Dec 02 15:35:45 consul03 nomad[5075]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).computePlacements(0xc001d76d20, 0xc0004ec000, 0x24, 0x28, 0xc000d598e8, 0x1c
Dec 02 15:35:45 consul03 nomad[5075]:         /opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:286 +0xc46
Dec 02 15:35:45 consul03 nomad[5075]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).computeJobAllocs(0xc001d76d20, 0xc001250960, 0xc00049d490)
Dec 02 15:35:45 consul03 nomad[5075]:         /opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:264 +0x100a
Dec 02 15:35:45 consul03 nomad[5075]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).process(0xc001d76d20, 0x0, 0x2064c00, 0xc0025cfca0)
Dec 02 15:35:45 consul03 nomad[5075]:         /opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:130 +0x33f
Dec 02 15:35:45 consul03 nomad[5075]: github.com/hashicorp/nomad/scheduler.retryMax(0x5, 0xc002783910, 0xc002783920, 0xc, 0x0)
Dec 02 15:35:45 consul03 nomad[5075]:         /opt/gopath/src/github.com/hashicorp/nomad/scheduler/util.go:271 +0x5d
Dec 02 15:35:45 consul03 nomad[5075]: github.com/hashicorp/nomad/scheduler.(*SystemScheduler).Process(0xc001d76d20, 0xc00062d980, 0x2ac4180, 0xc0006192c0)
Dec 02 15:35:45 consul03 nomad[5075]:         /opt/gopath/src/github.com/hashicorp/nomad/scheduler/system_sched.go:76 +0x2ab
Dec 02 15:35:45 consul03 nomad[5075]: github.com/hashicorp/nomad/nomad.(*nomadFSM).reconcileQueuedAllocations(0xc00042e770, 0x30b855, 0x0, 0x0)
Dec 02 15:35:45 consul03 nomad[5075]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/fsm.go:1398 +0x463
Dec 02 15:35:45 consul03 nomad[5075]: github.com/hashicorp/nomad/nomad.(*nomadFSM).applyReconcileSummaries(0xc00042e770, 0xc00255efb1, 0x8, 0x8, 0x30b855, 0xc002783e72, 0
Dec 02 15:35:45 consul03 nomad[5075]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/fsm.go:795 +0x7e
Dec 02 15:35:45 consul03 nomad[5075]: github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply(0xc00042e770, 0xc001b6bec0, 0x4440640, 0xbf7169b465bed617)
Dec 02 15:35:45 consul03 nomad[5075]:         /opt/gopath/src/github.com/hashicorp/nomad/nomad/fsm.go:211 +0x6eb
Dec 02 15:35:45 consul03 nomad[5075]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc00249d270)
Dec 02 15:35:45 consul03 nomad[5075]:         /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/fsm.go:57 +0x150
Dec 02 15:35:45 consul03 nomad[5075]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*Raft).runFSM(0xc0000b8000)
Dec 02 15:35:45 consul03 nomad[5075]:         /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/fsm.go:120 +0x33b
Dec 02 15:35:45 consul03 nomad[5075]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc0000b8000, 0xc00087e130)
Dec 02 15:35:45 consul03 nomad[5075]:         /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:146 +0x53
Dec 02 15:35:45 consul03 nomad[5075]: created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft.(*raftState).goFunc
Dec 02 15:35:45 consul03 nomad[5075]:         /opt/gopath/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/raft/state.go:144 +0x66

Labels: theme/core, theme/crash, type/bug

All 5 comments

Quick hotfix to at least allow the cluster to come back up:

diff --git a/nomad/structs/funcs.go b/nomad/structs/funcs.go
index f79cb6c06..acb27565c 100644
--- a/nomad/structs/funcs.go
+++ b/nomad/structs/funcs.go
@@ -50,6 +50,7 @@ func warningsFormatter(es []error) string {
 // RemoveAllocs is used to remove any allocs with the given IDs
 // from the list of allocations
 func RemoveAllocs(alloc []*Allocation, remove []*Allocation) []*Allocation {
+       fmt.Printf("DBGCRASH alloc %+v remove %+v\n", alloc, remove)
        // Convert remove into a set
        removeSet := make(map[string]struct{})
        for _, remove := range remove {
@@ -58,6 +59,12 @@
        n := len(alloc)
        for i := 0; i < n; i++ {
+               if alloc[i] == nil {
+                       alloc[i], alloc[n-1] = alloc[n-1], nil
+                       i--
+                       n--
+                       continue
+               }
                if _, ok := removeSet[alloc[i].ID]; ok {
                        alloc[i], alloc[n-1] = alloc[n-1], nil
                        i--
                        n--
Hi @johngilden! Thanks for reporting this serious bug!

The fix is up in #6792 and you can download Linux or Windows amd64 binaries to test it.

Seems to work fine! Would like to get a build of this with UI enabled to put into production.

(Appreciate the quick turn-around on this issue btw)

Seems to work fine! Would like to get a build of this with UI enabled to put into production.

Thank you for the report and testing! We're hoping to get the official 0.10.2 release out tomorrow (Dec 4th).
