Nomad Server: fatal error: concurrent map read and map write

Created on 22 Aug 2018 · 7 comments · Source: hashicorp/nomad

Nomad version

Nomad v0.8.4 (dbee1d7d051619e90a809c23cf7e55750900742a)

Operating system and Environment details

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:    14.04
Codename:   trusty

Issue

Nomad server panics with fatal error: concurrent map read and map write

It happened on 2 out of the 3 Nomad servers in the cluster

Restarting the Nomad servers causes them to panic the same way after a short time

Clients seem unaffected

go test -race -run 'TestEval*' ./nomad/ seems to report a large number of data races in general
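
For context, the fatal error comes from the Go runtime's built-in detection of unsynchronized map access, not from Nomad-specific code. A minimal, standalone sketch (not Nomad code) that triggers the same crash:

package main

// One goroutine writes a plain map while another reads it with no
// synchronization; the Go runtime eventually detects this and aborts
// with "fatal error: concurrent map read and map write".
func main() {
	m := map[string]int{}
	go func() {
		for i := 0; ; i++ {
			m["key"] = i // unsynchronized write
		}
	}()
	for {
		_ = m["key"] // unsynchronized read
	}
}

Running such code under the race detector (go run -race) reports the race explicitly, which is the same class of problem the TestEval* run above flags.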

Reproduction steps

Was draining a bunch of nodes; not sure if it's related or not.

Nomad Server logs (if appropriate)

https://gist.github.com/jippi/7b41d871cdbda6e24d202e09f0925438

Hotfix

diff --git a/lib/delay_heap.go b/lib/delay_heap.go
index 155583198..540593e76 100644
--- a/lib/delay_heap.go
+++ b/lib/delay_heap.go
@@ -3,11 +3,14 @@ package lib
 import (
    "container/heap"
    "fmt"
+   "sync"
    "time"

    "github.com/hashicorp/nomad/nomad/structs"
 )

+var delayHeapMutex = sync.RWMutex{}
+
 // DelayHeap wraps a heap and gives operations other than Push/Pop.
 // The inner heap is sorted by the time in the WaitUntil field of delayHeapNode
 type DelayHeap struct {
@@ -91,13 +94,21 @@ func (p *DelayHeap) Push(dataNode HeapNode, next time.Time) error {
        ID:        dataNode.ID(),
        Namespace: dataNode.Namespace(),
    }
+
+   delayHeapMutex.RLock()
    if _, ok := p.index[tuple]; ok {
+       delayHeapMutex.RUnlock()
        return fmt.Errorf("node %q (%s) already exists", dataNode.ID(), dataNode.Namespace())
    }
+   delayHeapMutex.RUnlock()

    delayHeapNode := &delayHeapNode{dataNode, next, 0}
+
+   delayHeapMutex.Lock()
    p.index[tuple] = delayHeapNode
    heap.Push(&p.heap, delayHeapNode)
+   delayHeapMutex.Unlock()
+
    return nil
 }

@@ -120,6 +131,8 @@ func (p *DelayHeap) Peek() *delayHeapNode {
        return nil
    }

+   delayHeapMutex.RLock()
+   defer delayHeapMutex.RUnlock()
    return p.heap[0]
 }

@@ -128,7 +141,9 @@ func (p *DelayHeap) Contains(heapNode HeapNode) bool {
        ID:        heapNode.ID(),
        Namespace: heapNode.Namespace(),
    }
+   delayHeapMutex.RLock()
    _, ok := p.index[tuple]
+   delayHeapMutex.RUnlock()
    return ok
 }

@@ -137,14 +152,17 @@ func (p *DelayHeap) Update(heapNode HeapNode, waitUntil time.Time) error {
        ID:        heapNode.ID(),
        Namespace: heapNode.Namespace(),
    }
+
+   delayHeapMutex.Lock()
    if existingHeapNode, ok := p.index[tuple]; ok {
+       defer delayHeapMutex.Unlock()
        // Need to update the job as well because its spec can change.
        existingHeapNode.Node = heapNode
        existingHeapNode.WaitUntil = waitUntil
        heap.Fix(&p.heap, existingHeapNode.index)
        return nil
    }
-
+   delayHeapMutex.Unlock()
    return fmt.Errorf("heap doesn't contain object with ID %q (%s)", heapNode.ID(), heapNode.Namespace())
 }

@@ -153,6 +171,10 @@ func (p *DelayHeap) Remove(heapNode HeapNode) error {
        ID:        heapNode.ID(),
        Namespace: heapNode.Namespace(),
    }
+
+   delayHeapMutex.Lock()
+   defer delayHeapMutex.Unlock()
+
    if node, ok := p.index[tuple]; ok {
        heap.Remove(&p.heap, node.index)
        delete(p.index, tuple)
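
For comparison, if DelayHeap were to do its own locking, a per-instance mutex on the struct would be more idiomatic than a package-level variable, since a package-level lock serializes every DelayHeap in the process. A minimal sketch of that variant (hypothetical names, not the actual Nomad code):

package lib

import "sync"

// lockedDelayHeap is an illustrative stand-in, not the real DelayHeap:
// the mutex lives on the struct, so each instance is guarded
// independently rather than all instances sharing one global lock.
type lockedDelayHeap struct {
	mu    sync.RWMutex
	index map[string]int // simplified stand-in for the real index map
}

func (p *lockedDelayHeap) Contains(id string) bool {
	p.mu.RLock()
	defer p.mu.RUnlock()
	_, ok := p.index[id]
	return ok
}

func (p *lockedDelayHeap) Remove(id string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.index, id)
}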

theme/crash type/bug

Most helpful comment

@preetapan can you or someone on the nomad team please triage this?

All 7 comments

We have the same issue, but we also still have the issues described in https://github.com/hashicorp/nomad/issues/4477, and after the Nomad servers get stuck, a simple restart leaves them hung.

Here are our logs:
nomad_race.logs.txt

However, the patch you provided doesn't look right to me; IMHO, DelayHeap should remain thread-safety neutral.
A more correct patch would change the higher-level structures and look something like this:

diff --git a/nomad/eval_broker.go b/nomad/eval_broker.go
index 54e32e7a8..59fde90dc 100644
--- a/nomad/eval_broker.go
+++ b/nomad/eval_broker.go
@@ -763,8 +763,8 @@ func (b *EvalBroker) runDelayedEvalsWatcher(ctx context.Context) {
            return
        case <-timerChannel:
            // remove from the heap since we can enqueue it now
-           b.delayHeap.Remove(&evalWrapper{eval})
            b.l.Lock()
+           b.delayHeap.Remove(&evalWrapper{eval})
            b.stats.TotalWaiting -= 1
            b.enqueueLocked(eval, eval.Type)
            b.l.Unlock()
@@ -777,11 +777,14 @@ func (b *EvalBroker) runDelayedEvalsWatcher(ctx context.Context) {
 // nextDelayedEval returns the next delayed eval to launch and when it should be enqueued.
 // This peeks at the heap to return the top. If the heap is empty, this returns nil and zero time.
 func (b *EvalBroker) nextDelayedEval() (*structs.Evaluation, time.Time) {
+   b.l.RLock()
    // If there is nothing wait for an update.
    if b.delayHeap.Length() == 0 {
+       b.l.RUnlock()
        return nil, time.Time{}
    }
    nextEval := b.delayHeap.Peek()
+   b.l.RUnlock()

    if nextEval == nil {
        return nil, time.Time{} 
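
To illustrate the pattern this patch advocates (hypothetical names, not the actual Nomad types): the heap does no locking of its own, and its owner guards every access with its own sync.RWMutex, so there is a single lock and a single lock ordering.

package main

import "sync"

// plainHeap is an illustrative stand-in for DelayHeap: it is not safe
// for concurrent use and does no locking of its own.
type plainHeap struct {
	items []int
}

func (h *plainHeap) Push(v int) { h.items = append(h.items, v) }
func (h *plainHeap) Len() int   { return len(h.items) }
func (h *plainHeap) Peek() int  { return h.items[0] }

// broker is an illustrative stand-in for EvalBroker: it owns both the
// lock and the heap, and every method that touches the heap holds b.l,
// mirroring what the eval_broker.go patch above does.
type broker struct {
	l    sync.RWMutex
	heap plainHeap
}

func (b *broker) enqueue(v int) {
	b.l.Lock()
	defer b.l.Unlock()
	b.heap.Push(v)
}

func (b *broker) next() (int, bool) {
	b.l.RLock()
	defer b.l.RUnlock()
	if b.heap.Len() == 0 {
		return 0, false
	}
	return b.heap.Peek(), true
}

func main() {
	b := &broker{}
	b.enqueue(1)
	_, _ = b.next()
}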

It would be very helpful if someone from HashiCorp could clarify this point.

My patch made the cluster non-crashy, so it did what I wanted it to :)

@preetapan can you or someone on the nomad team please triage this?

Hi,
We are also facing the same problem. Just as @jippi described, Nomad servers are throwing:
fatal error: concurrent map read and map write
and it seems to be related to draining a bunch of nodes (that's the only thing we've done).

Nomad version:
Nomad v0.8.4 (dbee1d7d051619e90a809c23cf7e55750900742a)

Operating system and Environment details

Distributor ID: Ubuntu
Description:    Ubuntu 16.04.1 LTS
Release:    16.04
Codename:   xenial

Sorry for the late response, been traveling this week.

This is on our radar and I'll have a PR up for it by early next week.

@preetapan thanks for the info. Any suggestions for a hotfix? A server restart or something?
We are not able to interact with our production cluster at the moment, and it's crucial for us to solve this problem somehow.

@pkrolikowski yes, restarting the server should work. Draining several nodes at once increases the chances of the race condition occurring again, so I would suggest avoiding that until I merge a fix to master. Sorry for the bug.
