Nomad v0.8.4 (dbee1d7d051619e90a809c23cf7e55750900742a)
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty
Nomad server panics with fatal error: concurrent map read and map write
It happened on 2 out of the 3 Nomad servers in the cluster
Restarting the Nomad servers causes them to panic the same way after a short time
Clients seem unaffected
go test -race -run 'TestEval*' ./nomad/ seems to report a large number of data races in general
I was draining a bunch of nodes at the time; not sure if it's related or not
https://gist.github.com/jippi/7b41d871cdbda6e24d202e09f0925438
diff --git a/lib/delay_heap.go b/lib/delay_heap.go
index 155583198..540593e76 100644
--- a/lib/delay_heap.go
+++ b/lib/delay_heap.go
@@ -3,11 +3,14 @@ package lib
import (
"container/heap"
"fmt"
+ "sync"
"time"
"github.com/hashicorp/nomad/nomad/structs"
)
+var delayHeapMutex = sync.RWMutex{}
+
// DelayHeap wraps a heap and gives operations other than Push/Pop.
// The inner heap is sorted by the time in the WaitUntil field of delayHeapNode
type DelayHeap struct {
@@ -91,13 +94,21 @@ func (p *DelayHeap) Push(dataNode HeapNode, next time.Time) error {
ID: dataNode.ID(),
Namespace: dataNode.Namespace(),
}
+
+ delayHeapMutex.RLock()
if _, ok := p.index[tuple]; ok {
+ delayHeapMutex.RUnlock()
return fmt.Errorf("node %q (%s) already exists", dataNode.ID(), dataNode.Namespace())
}
+ delayHeapMutex.RUnlock()
delayHeapNode := &delayHeapNode{dataNode, next, 0}
p.index[tuple] = delayHeapNode
+
+ delayHeapMutex.Lock()
heap.Push(&p.heap, delayHeapNode)
+ delayHeapMutex.Unlock()
+
return nil
}
@@ -120,6 +131,8 @@ func (p *DelayHeap) Peek() *delayHeapNode {
return nil
}
+ delayHeapMutex.RLock()
+ defer delayHeapMutex.RUnlock()
return p.heap[0]
}
@@ -128,7 +141,9 @@ func (p *DelayHeap) Contains(heapNode HeapNode) bool {
ID: heapNode.ID(),
Namespace: heapNode.Namespace(),
}
+ delayHeapMutex.RLock()
_, ok := p.index[tuple]
+ delayHeapMutex.RUnlock()
return ok
}
@@ -137,14 +152,17 @@ func (p *DelayHeap) Update(heapNode HeapNode, waitUntil time.Time) error {
ID: heapNode.ID(),
Namespace: heapNode.Namespace(),
}
+
+ delayHeapMutex.RLock()
if existingHeapNode, ok := p.index[tuple]; ok {
+ delayHeapMutex.RUnlock()
// Need to update the job as well because its spec can change.
existingHeapNode.Node = heapNode
existingHeapNode.WaitUntil = waitUntil
heap.Fix(&p.heap, existingHeapNode.index)
return nil
}
-
+	delayHeapMutex.RUnlock()
return fmt.Errorf("heap doesn't contain object with ID %q (%s)", heapNode.ID(), heapNode.Namespace())
}
@@ -153,6 +171,10 @@ func (p *DelayHeap) Remove(heapNode HeapNode) error {
ID: heapNode.ID(),
Namespace: heapNode.Namespace(),
}
+
+ delayHeapMutex.Lock()
+ defer delayHeapMutex.Unlock()
+
if node, ok := p.index[tuple]; ok {
heap.Remove(&p.heap, node.index)
delete(p.index, tuple)
We have the same issue; we still see the problems described in https://github.com/hashicorp/nomad/issues/4477, and had to restart the hung Nomad servers after they got stuck
Here are our logs:
nomad_race.logs.txt
But the patch provided above doesn't look right to me; IMHO DelayHeap itself should stay neutral about thread safety,
and a more correct patch would change the higher-level structures, looking something like this:
diff --git a/nomad/eval_broker.go b/nomad/eval_broker.go
index 54e32e7a8..59fde90dc 100644
--- a/nomad/eval_broker.go
+++ b/nomad/eval_broker.go
@@ -763,8 +763,8 @@ func (b *EvalBroker) runDelayedEvalsWatcher(ctx context.Context) {
return
case <-timerChannel:
// remove from the heap since we can enqueue it now
- b.delayHeap.Remove(&evalWrapper{eval})
b.l.Lock()
+ b.delayHeap.Remove(&evalWrapper{eval})
b.stats.TotalWaiting -= 1
b.enqueueLocked(eval, eval.Type)
b.l.Unlock()
@@ -777,11 +777,14 @@ func (b *EvalBroker) runDelayedEvalsWatcher(ctx context.Context) {
// nextDelayedEval returns the next delayed eval to launch and when it should be enqueued.
// This peeks at the heap to return the top. If the heap is empty, this returns nil and zero time.
func (b *EvalBroker) nextDelayedEval() (*structs.Evaluation, time.Time) {
+ b.l.RLock()
// If there is nothing wait for an update.
if b.delayHeap.Length() == 0 {
+ b.l.RUnlock()
return nil, time.Time{}
}
nextEval := b.delayHeap.Peek()
+ b.l.RUnlock()
if nextEval == nil {
return nil, time.Time{}
It would be very helpful if someone from HashiCorp could clarify this
My patch made the cluster stop crashing, so it did what I wanted it to :)
@preetapan can you or someone on the nomad team please triage this?
Hi,
We are also facing the same problem. Just as @jippi described, Nomad servers are throwing:
fatal error: concurrent map read and map write
and it seems to be related to draining a bunch of nodes (that's the only thing we've done).
Nomad version:
Nomad v0.8.4 (dbee1d7d051619e90a809c23cf7e55750900742a)
Operating system and Environment details
Distributor ID: Ubuntu
Description: Ubuntu 16.04.1 LTS
Release: 16.04
Codename: xenial
Sorry for the late response, been traveling this week.
This is on our radar and I'll have a PR up for it by early next week.
@preetapan thanks for the info. Any suggestions for a hotfix? A server restart or something similar?
We are unable to interact with our production cluster at the moment, and it's crucial for us to solve this problem somehow.
@pkrolikowski yes, restarting the server should work. Draining several nodes at once increases the chances of the race condition occurring again, so I would suggest avoiding that until I merge a fix to master. Sorry for the bug.