Nomad: Nomad server failed and fails to recover - panic on start

Created on 25 Apr 2018 · 12 comments · Source: hashicorp/nomad

Nomad version

Nomad v0.8.0 (751b42aaa60f1bc679e5062e16db7d15f8ff9654)

Operating system and Environment details

Issue

We had a development environment (thank goodness) go down all of a sudden. Not really sure what happened, but none of the servers are able to establish leadership. Attempting to restart Nomad shows the following panic:

    2018/04/25 04:24:47.759383 [INFO] nomad: cluster leadership acquired
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xa8 pc=0x10eaa49]

goroutine 146 [running]:
github.com/hashicorp/nomad/nomad/drainer.handleTaskGroup(0xc420b07d18, 0xc420b07600, 0xc42030c880, 0xc420b20700, 0x5, 0x8, 0x1, 0xc420b206c0, 0xc420b075f0, 0x4760e2)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:355 +0x779
github.com/hashicorp/nomad/nomad/drainer.handleJob(0xc420b07d18, 0xc4203a9380, 0xc4207d3dc0, 0x5, 0x8, 0x1, 0xc4203a9380, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:324 +0x575
github.com/hashicorp/nomad/nomad/drainer.(*drainingJobWatcher).watch(0xc420287e30)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:214 +0x4b2
created by github.com/hashicorp/nomad/nomad/drainer.NewDrainingJobWatcher
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:88 +0x1e0

Can't get a leader up, can't get things running, and the Nomad outage recovery procedure doesn't work either. Looks like there's bad state somewhere?

Reproduction steps

Not really sure how to reproduce this; we found the cluster in this state.

theme/core theme/crash type/bug

Most helpful comment

ip was reused within hours

All 12 comments

Just upgraded the binary to 0.8.1 and tried again:

Nomad Version

Nomad v0.8.1 (46aa11ba851f114f5faa0876839b45fbca0c47f0)

    2018/04/25 04:29:59.500372 [INFO] nomad: cluster leadership acquired
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xa8 pc=0x10eab99]

goroutine 146 [running]:
github.com/hashicorp/nomad/nomad/drainer.handleTaskGroup(0xc420e2dd18, 0xc420e2d600, 0xc42030a380, 0xc420d684e0, 0x3, 0x4, 0x1, 0xc420b16240, 0xc420e2d5f0, 0x4760e2)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:355 +0x779
github.com/hashicorp/nomad/nomad/drainer.handleJob(0xc420e2dd18, 0xc4203a1380, 0xc420d683c0, 0x3, 0x4, 0x1, 0xc4203a1380, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:324 +0x575
github.com/hashicorp/nomad/nomad/drainer.(*drainingJobWatcher).watch(0xc4202685b0)
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:214 +0x4b2
created by github.com/hashicorp/nomad/nomad/drainer.NewDrainingJobWatcher
    /opt/gopath/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:88 +0x1e0

I was able to get past this by building a binary that checks whether the node is nil. Once that server was up, all of the other servers running the 0.8.0 release version were able to join just fine. I then shut down the dev build, switched that server back to 0.8.0, and everything is happy again.

See PR #4208
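For context, the workaround is conceptually just a nil guard before the node is dereferenced. The snippet below is a standalone sketch of that pattern with stand-in types and a stand-in state lookup; it is not the actual code in nomad/drainer/watch_jobs.go or the diff in the PR.

    // Illustrative sketch only: Node, Allocation, and nodeByID are stand-ins
    // for Nomad's internal types and state-store lookup.
    package main

    import "fmt"

    type Node struct {
        ID    string
        Drain bool
    }

    type Allocation struct {
        ID     string
        NodeID string
    }

    // nodeByID stands in for the state-store lookup; it returns nil when the
    // node has already been deregistered (e.g. terminated by the autoscaler).
    func nodeByID(nodes map[string]*Node, id string) *Node {
        return nodes[id]
    }

    func handleAlloc(nodes map[string]*Node, alloc *Allocation) {
        node := nodeByID(nodes, alloc.NodeID)
        if node == nil {
            // Without this guard, dereferencing node below panics with a nil
            // pointer, which is what kept the servers from starting.
            fmt.Printf("alloc %s references missing node %s; skipping\n", alloc.ID, alloc.NodeID)
            return
        }
        fmt.Printf("alloc %s is on node %s (draining=%v)\n", alloc.ID, node.ID, node.Drain)
    }

    func main() {
        nodes := map[string]*Node{"n1": {ID: "n1", Drain: true}}
        handleAlloc(nodes, &Allocation{ID: "a1", NodeID: "n1"})
        handleAlloc(nodes, &Allocation{ID: "a2", NodeID: "gone"}) // node already gone
    }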

Thanks for the PR, and the detailed description. Would you mind also including your agent and job configuration? We would like to reproduce on our end as well.

Were you starting from a fresh cluster state?

We were not. This cluster was active and had tens of jobs running over 100 allocations. We did have an autoscaling agent here (replicator) which is responsible for scaling jobs and nodes in and out. I believe draining an allocation and terminating a node consecutively caused the bad state, though that’s a pretty normal operation that has worked hundreds of times over the last few months.


@chelseakomlo Looks like that deployment is in that bad state again. I need to fix it by doing what I did yesterday with the dev binary, but I should have logs to help out if needed.

This is where it all came crashing down; not sure if these logs are useful, though...

Apr 25 05:42:32 ip-10-2-9-105 nomad[13448]: worker: failed to dequeue evaluation: rpc error: EOF
Apr 25 05:42:32 ip-10-2-9-105 nomad[13448]: worker: failed to dequeue evaluation: rpc error: stream closed
Apr 25 05:42:33 ip-10-2-9-105 nomad[13448]: worker: failed to dequeue evaluation: rpc error: stream closed
Apr 25 05:42:33 ip-10-2-9-105 nomad[13448]: worker: failed to dequeue evaluation: rpc error: failed to get conn: dial tcp 10.2.19.112:4647: getsockopt: connection refused
Apr 25 05:42:33 ip-10-2-9-105 nomad[13448]: raft: Rejecting vote request from 10.2.11.3:4647 since we have a leader: 10.2.19.112:4647
Apr 25 05:42:34 ip-10-2-9-105 nomad[13448]: worker: failed to dequeue evaluation: rpc error: failed to get conn: dial tcp 10.2.19.112:4647: getsockopt: connection refused
Apr 25 05:42:35 ip-10-2-9-105 nomad[13448]: raft: Heartbeat timeout from "10.2.19.112:4647" reached, starting election

This server then retried elections unsuccessfully until I corrected it a few minutes ago.

A second server reported:

Apr 25 05:42:32 ip-10-2-11-3 nomad[13504]: worker: failed to dequeue evaluation: rpc error: EOF
Apr 25 05:42:32 ip-10-2-11-3 nomad[13504]: worker: failed to dequeue evaluation: rpc error: stream closed
Apr 25 05:42:33 ip-10-2-11-3 nomad[13504]: worker: failed to dequeue evaluation: rpc error: stream closed
Apr 25 05:42:33 ip-10-2-11-3 nomad[13504]: worker: failed to dequeue evaluation: rpc error: failed to get conn: rpc error: lead thread didn't get connection
Apr 25 05:42:33 ip-10-2-11-3 nomad[13504]: raft: Heartbeat timeout from "10.2.19.112:4647" reached, starting election
Apr 25 05:42:33 ip-10-2-11-3 nomad[13504]: raft: Node at 10.2.11.3:4647 [Candidate] entering Candidate state in term 1393
Apr 25 05:42:33 ip-10-2-11-3 nomad[13504]: raft: Failed to make RequestVote RPC to {Voter 10.2.19.112:4647 10.2.19.112:4647}: dial tcp 10.2.19.112:4647: getsockopt: connection refused
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Duplicate RequestVote for same term: 1393
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Election timeout reached, restarting election
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Node at 10.2.11.3:4647 [Candidate] entering Candidate state in term 1394
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Failed to make RequestVote RPC to {Voter 10.2.19.112:4647 10.2.19.112:4647}: dial tcp 10.2.19.112:4647: getsockopt: connection refused
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Election won. Tally: 2
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Node at 10.2.11.3:4647 [Leader] entering Leader state
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Added peer 10.2.9.105:4647, starting replication
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Added peer 10.2.19.112:4647, starting replication
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: nomad: cluster leadership acquired
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Failed to AppendEntries to {Voter 10.2.19.112:4647 10.2.19.112:4647}: dial tcp 10.2.19.112:4647: getsockopt: connection refused
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: pipelining replication to peer {Voter 10.2.9.105:4647 10.2.9.105:4647}
Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: raft: Failed to AppendEntries to {Voter 10.2.19.112:4647 10.2.19.112:4647}: dial tcp 10.2.19.112:4647: getsockopt: connection refused
Apr 25 05:42:35 ip-10-2-11-3 init: nomad main process (13504) terminated with status 2

The third and final server reported:

Apr 25 05:40:52 ip-10-2-19-112 nomad[19679]: nomad.heartbeat: node 'b39baaa0-897f-5c82-9342-fb4317fed09e' TTL expired
Apr 25 05:42:32 ip-10-2-19-112 init: nomad main process (19679) terminated with status 2

I believe 10.2.11.3 was the leader, as it had the most recent log line stating:

Apr 25 05:42:35 ip-10-2-11-3 nomad[13504]: nomad: cluster leadership acquired

vs

Apr 25 05:07:43 ip-10-2-19-112 nomad[19679]: nomad: cluster leadership acquired

and

Apr 25 04:19:05 ip-10-2-9-105 nomad[12831]: nomad: cluster leadership acquired

EDIT:
10.2.19.112 was the leader; it was the last server to report leadership acquisition before the crash. 10.2.11.3's last leadership acquisition was actually:

Apr 25 05:02:55 ip-10-2-11-3 nomad[13271]: nomad: cluster leadership acquired

Only after 10.2.19.112 crashed did 10.2.11.3 acquire leadership at Apr 25 05:42:35.

Just received this in a production cluster =(
Any better ideas than running a binary compiled from @dadgar's PR?

We're working on getting a release out with this fix in it soon (on the order of days).

One thing to add: one of the Nomad clients was, for some reason, added to the Raft peer list:

Node                           ID               Address          State     Voter  RaftProtocol
(unknown)                      10.1.3.92:4647   10.1.3.92:4647   follower  true   unknown
i-0b06d30a175f35dd2.us-east-1  10.1.7.103:4647  10.1.7.103:4647  follower  true   2
i-017fcda8499a8da74.us-east-1  10.1.1.65:4647   10.1.1.65:4647   leader    true   2
i-026ffbbc26159973d.us-east-1  10.1.9.90:4647   10.1.9.90:4647   follower  true   2
i-0a84f22df5e58d875.us-east-1  10.1.9.143:4647  10.1.9.143:4647  follower  true   2
i-0e1487563f2f252a7.us-east-1  10.1.3.58:4647   10.1.3.58:4647   follower  true   2

UPD:
It looks like an old server's IP was reused because of autoscaling, and that made Nomad a bit crazy.
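As an aside on cleaning up the stale "(unknown)" entry once a healthy leader is elected: assuming the Nomad Go API client's Raft operator endpoints (which back the nomad operator raft CLI commands) behave as documented, something along these lines should list the configuration shown above and drop the reused address. The 10.1.3.92:4647 address is taken from the table above; treat this as a sketch, not a vetted runbook.

    // Hedged sketch: assumes the Operator Raft endpoints in the Nomad Go API
    // client (RaftGetConfiguration / RaftRemovePeerByAddress).
    package main

    import (
        "fmt"
        "log"

        nomad "github.com/hashicorp/nomad/api"
    )

    func main() {
        client, err := nomad.NewClient(nomad.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }
        op := client.Operator()

        // Show the current Raft configuration (same data as the peer list above).
        cfg, err := op.RaftGetConfiguration(nil)
        if err != nil {
            log.Fatal(err)
        }
        for _, s := range cfg.Servers {
            fmt.Printf("%-30s %-16s voter=%v leader=%v\n", s.Node, s.Address, s.Voter, s.Leader)
        }

        // Remove the stale peer left behind by the reused IP (address taken
        // from the output above).
        if err := op.RaftRemovePeerByAddress("10.1.3.92:4647", nil); err != nil {
            log.Fatal(err)
        }
    }

The CLI equivalent should be nomad operator raft remove-peer with -peer-address=10.1.3.92:4647, run against a server once a leader is up.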

How old are we talking, @burdandrei? Terminated just hours or days ago? Is this something the garbage collector usually catches, but the IP was reused too soon?

ip was reused within hours
