Nomad 0.8.1
Vault 0.10.0
Ubuntu 16.04.4 LTS
After restarting the Nomad agent process on the node where the task is placed, the agent immediately receives new dynamic secrets from Vault and re-renders the template, but it never notifies the task: no signal is sent and no restart is triggered, so the task keeps running with the outdated secret.
Apr 27 12:25:32 ip-172-25-5-101 nomad.sh[27775]: ==> Nomad agent started! Log data will stream in below:
Apr 27 12:25:32 ip-172-25-5-101 nomad.sh[27775]: 2018/04/27 12:25:32 [WARN] vault.read(secrets/aws/creds/service): the calling token display name/IAM policy name were truncated to fit into IAM username length limits
template {
  data = <<EOH
AWS_access_key_id="{{with secret "secrets/aws/creds/service"}}{{.Data.access_key}}{{end}}"
AWS_secret_access_key="{{with secret "secrets/aws/creds/service"}}{{.Data.secret_key}}{{end}}"
EOH
  destination = "secrets/awssecrets.env"
  env         = true
  change_mode = "restart"
}
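The report above ("will not send a signal or restart") covers signal mode as well. A signal-based variant of the stanza, shown purely for illustration (this is not the stanza from the job above, and SIGHUP is just an arbitrary example signal), hits the same problem: the file is re-rendered after the agent restart but the signal is never delivered.
template {
  data = <<EOH
AWS_access_key_id="{{with secret "secrets/aws/creds/service"}}{{.Data.access_key}}{{end}}"
AWS_secret_access_key="{{with secret "secrets/aws/creds/service"}}{{.Data.secret_key}}{{end}}"
EOH
  destination   = "secrets/awssecrets.env"
  change_mode   = "signal"
  change_signal = "SIGHUP"
}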
@schmichael We have the same issue with dynamic secrets when the nomad agent restarts during an upgrade. Of course we can do a node drain, but that is not very convenient, and it also doesn't cover cases where nomad crashes (for example due to an internal bug) and is restarted by systemd, upstart, etc.
We made a patch to solve this. It threads a context from the alloc runner's RestoreState down into the template manager, so that after a client restart the first-render events are passed to a new CallSignals helper, which applies each template's change_mode (signal or restart) instead of silently dropping the re-render:
client/alloc_runner.go | 10 +++++-
client/consul_template.go | 80 +++++++++++++++++++++++++++++++++++++++++++----
client/task_runner.go | 19 ++++++-----
3 files changed, 95 insertions(+), 14 deletions(-)
diff --git a/client/alloc_runner.go b/client/alloc_runner.go
index aeb284a3b..8ec141df4 100644
--- a/client/alloc_runner.go
+++ b/client/alloc_runner.go
@@ -29,6 +29,12 @@ var (
allocRunnerStateAllocDirKey = []byte("alloc-dir")
)
+type TemplateManagerLogger struct {
+ logger *log.Logger
+ allocID string
+ taskName string
+}
+
// AllocStateUpdater is used to update the status of an allocation
type AllocStateUpdater func(alloc *structs.Allocation)
@@ -351,8 +357,10 @@ func (r *AllocRunner) RestoreState() error {
r.logger.Printf("[ERR] client: failed to restore state for alloc %s task %q: %v", r.allocID, name, err)
mErr.Errors = append(mErr.Errors, err)
} else if !r.alloc.TerminalStatus() {
+ ctx := context.WithValue(context.Background(), "logger", &TemplateManagerLogger{r.logger, r.allocID, name})
+
// Only start if the alloc isn't in a terminal status.
- go tr.Run()
+ go tr.RunWithContext(ctx)
if upgrading {
if err := tr.SaveState(); err != nil {
diff --git a/client/consul_template.go b/client/consul_template.go
index 3e4869706..0a53c622a 100644
--- a/client/consul_template.go
+++ b/client/consul_template.go
@@ -8,6 +8,7 @@ import (
"sort"
"strconv"
"strings"
+ "context"
"sync"
"time"
@@ -131,7 +132,7 @@ func (c *TaskTemplateManagerConfig) Validate() error {
return nil
}
-func NewTaskTemplateManager(config *TaskTemplateManagerConfig) (*TaskTemplateManager, error) {
+func NewTaskTemplateManager(ctx context.Context, config *TaskTemplateManagerConfig) (*TaskTemplateManager, error) {
// Check pre-conditions
if err := config.Validate(); err != nil {
return nil, err
@@ -168,7 +169,7 @@ func NewTaskTemplateManager(config *TaskTemplateManagerConfig) (*TaskTemplateMan
tm.runner = runner
tm.lookup = lookup
- go tm.run()
+ go tm.run(ctx)
return tm, nil
}
@@ -191,7 +192,7 @@ func (tm *TaskTemplateManager) Stop() {
}
// run is the long lived loop that handles errors and templates being rendered
-func (tm *TaskTemplateManager) run() {
+func (tm *TaskTemplateManager) run(ctx context.Context) {
// Runner is nil if there is no templates
if tm.runner == nil {
// Unblock the start if there is nothing to do
@@ -203,7 +204,18 @@ func (tm *TaskTemplateManager) run() {
go tm.runner.Start()
// Block till all the templates have been rendered
- tm.handleFirstRender()
+ events := tm.handleFirstRender()
+ if ctx != nil {
+ if events != nil {
+ if v := ctx.Value("logger"); v != nil {
+ if logger, ok := v.(*TemplateManagerLogger); ok {
+ logger.logger.Printf("[INFO] Calling signals for alloc %s task %s: after restore due templates rerender", logger.allocID, logger.taskName)
+ }
+ }
+
+ tm.CallSignals(events)
+ }
+ }
// Detect if there was a shutdown.
select {
@@ -233,9 +245,10 @@ func (tm *TaskTemplateManager) run() {
}
// handleFirstRender blocks till all templates have been rendered
-func (tm *TaskTemplateManager) handleFirstRender() {
+func (tm *TaskTemplateManager) handleFirstRender() map[string]*manager.RenderEvent {
// missingDependencies is the set of missing dependencies.
var missingDependencies map[string]struct{}
+ var l_retevents map[string]*manager.RenderEvent
// eventTimer is used to trigger the firing of an event showing the missing
// dependencies.
@@ -253,7 +266,7 @@ WAIT:
for {
select {
case <-tm.shutdownCh:
- return
+ return nil
case err, ok := <-tm.runner.ErrCh:
if !ok {
continue
@@ -276,6 +289,8 @@ WAIT:
}
}
+ l_retevents = events
+
break WAIT
case <-tm.runner.RenderEventCh():
events := tm.runner.RenderEvents()
@@ -340,6 +355,59 @@ WAIT:
tm.config.Hooks.EmitEvent(consulTemplateSourceName, fmt.Sprintf("Missing: %s", missingStr))
}
}
+
+ return l_retevents
+}
+
+func (tm *TaskTemplateManager) CallSignals(events map[string]*manager.RenderEvent) {
+ for id, event := range events {
+ if !event.DidRender {
+ continue
+ }
+
+ // Lookup the template and determine what to do
+ tmpls, ok := tm.lookup[id]
+ if !ok {
+ tm.config.Hooks.Kill(consulTemplateSourceName, fmt.Sprintf("template runner returned unknown template id %q", id), true)
+ break
+ }
+
+ signals := make(map[string]struct{})
+ restart := false
+
+ for _, tmpl := range tmpls {
+ switch tmpl.ChangeMode {
+ case structs.TemplateChangeModeSignal:
+ signals[tmpl.ChangeSignal] = struct{}{}
+ case structs.TemplateChangeModeRestart:
+ restart = true
+ case structs.TemplateChangeModeNoop:
+ continue
+ }
+ }
+
+ if restart {
+ const failure = false
+ tm.config.Hooks.Restart(consulTemplateSourceName, "template with change_mode restart re-rendered", failure)
+ } else if len(signals) != 0 {
+ var mErr multierror.Error
+ for signal := range signals {
+ err := tm.config.Hooks.Signal(consulTemplateSourceName, "template re-rendered", tm.signals[signal])
+ if err != nil {
+ multierror.Append(&mErr, err)
+ }
+ }
+
+ if err := mErr.ErrorOrNil(); err != nil {
+ flat := make([]os.Signal, 0, len(signals))
+ for signal := range signals {
+ flat = append(flat, tm.signals[signal])
+ }
+
+ tm.config.Hooks.Kill(consulTemplateSourceName, fmt.Sprintf("Sending signals %v failed: %v", flat, err), true)
+ }
+ }
+ }
}
// handleTemplateRerenders is used to handle template render events after they
diff --git a/client/task_runner.go b/client/task_runner.go
index 4affba3e6..03a1aa2b8 100644
--- a/client/task_runner.go
+++ b/client/task_runner.go
@@ -13,6 +13,7 @@ import (
"strings"
"sync"
"time"
+ "context"
metrics "github.com/armon/go-metrics"
"github.com/boltdb/bolt"
@@ -575,7 +576,7 @@ func (r *TaskRunner) createDriver() (driver.Driver, error) {
}
// Run is a long running routine used to manage the task
-func (r *TaskRunner) Run() {
+func (r *TaskRunner) RunWithContext(ctx context.Context) {
defer close(r.waitCh)
r.logger.Printf("[DEBUG] client: starting task context for '%s' (alloc '%s')",
r.task.Name, r.alloc.ID)
@@ -622,7 +623,7 @@ func (r *TaskRunner) Run() {
}
// Start the run loop
- r.run()
+ r.run(ctx)
// Do any cleanup necessary
r.postrun()
@@ -630,6 +631,10 @@ func (r *TaskRunner) Run() {
return
}
+func (r *TaskRunner) Run() {
+ r.RunWithContext(nil)
+}
+
// validateTask validates the fields of the task and returns an error if the
// task is invalid.
func (r *TaskRunner) validateTask() error {
@@ -911,7 +916,7 @@ func (r *TaskRunner) updatedTokenHandler() {
// Create a new templateManager
var err error
- r.templateManager, err = NewTaskTemplateManager(&TaskTemplateManagerConfig{
+ r.templateManager, err = NewTaskTemplateManager(nil, &TaskTemplateManagerConfig{
Hooks: r,
Templates: r.task.Templates,
ClientConfig: r.config,
@@ -936,7 +941,7 @@ func (r *TaskRunner) updatedTokenHandler() {
// prestart handles life-cycle tasks that occur before the task has started.
// Since it's run asynchronously with the main Run() loop the alloc & task are
// passed in to avoid racing with updates.
-func (r *TaskRunner) prestart(alloc *structs.Allocation, task *structs.Task, resultCh chan bool) {
+func (r *TaskRunner) prestart(ctx context.Context, alloc *structs.Allocation, task *structs.Task, resultCh chan bool) {
if task.Vault != nil {
// Wait for the token
r.logger.Printf("[DEBUG] client: waiting for Vault token for task %v in alloc %q", task.Name, alloc.ID)
@@ -1027,7 +1032,7 @@ func (r *TaskRunner) prestart(alloc *structs.Allocation, task *structs.Task, res
// Build the template manager
if r.templateManager == nil {
var err error
- r.templateManager, err = NewTaskTemplateManager(&TaskTemplateManagerConfig{
+ r.templateManager, err = NewTaskTemplateManager(ctx, &TaskTemplateManagerConfig{
Hooks: r,
Templates: r.task.Templates,
ClientConfig: r.config,
@@ -1083,7 +1088,7 @@ func (r *TaskRunner) postrun() {
// run is the main run loop that handles starting the application, destroying
// it, restarts and signals.
-func (r *TaskRunner) run() {
+func (r *TaskRunner) run(ctx context.Context) {
// Predeclare things so we can jump to the RESTART
var stopCollection chan struct{}
var handleWaitCh chan *dstructs.WaitResult
@@ -1101,7 +1106,7 @@ func (r *TaskRunner) run() {
for {
// Do the prestart activities
prestartResultCh := make(chan bool, 1)
- go r.prestart(r.alloc, r.task, prestartResultCh)
+ go r.prestart(ctx, r.alloc, r.task, prestartResultCh)
WAIT:
for {
@schmichael please take into account that this issue is still present in nomad 0.9.3.
When we restart the nomad agent (on ubuntu we use systemctl restart nomad), the template is re-rendered:
2019-06-14T19:54:04.886+0300 [INFO ] client.consul: discovered following servers: servers=192.168.142.102:4647,192.168.142.101:4647,192.168.142.103:4647
2019-06-14T19:54:04.902+0300 [INFO ] client: node registration complete
2019/06/14 19:54:04.975911 [INFO] (runner) creating new runner (dry: false, once: false)
2019/06/14 19:54:04.976172 [INFO] (runner) creating watcher
2019/06/14 19:54:04.976323 [INFO] (runner) starting
2019/06/14 19:54:05.208804 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/b8aff9c0-b257-d633-1674-ec3020049ca6/vault_debug_task/secrets/consul_token.env"
but the allocation is not restarted
We got bitten by this same issue (described in #6638). Our application kept using the old database username/password after we restarted the Nomad client agent, and it failed to connect to the database once Vault expired the old database user. This is a really serious problem for Nomad jobs that use Vault, especially now that this wonderful 0.10 release is coming out and a lot of people will be manually restarting their Nomad agents soon...
Please consider fixing this as soon as possible. Thanks!
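For context, the kind of stanza involved here is a dynamic database credential template; the mount path, role name, and variable names below are illustrative placeholders rather than our actual job spec:
template {
  data = <<EOH
DB_USERNAME="{{with secret "database/creds/my-role"}}{{.Data.username}}{{end}}"
DB_PASSWORD="{{with secret "database/creds/my-role"}}{{.Data.password}}{{end}}"
EOH
  destination = "secrets/db.env"
  env         = true
  change_mode = "restart"
}
When the agent restarts, the file under destination gets the fresh username/password, but the task is never restarted, so the process keeps the stale credentials in its environment until Vault revokes them.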
We encountered the same issue. It took us a while (several nomad versions) to figure out what was going on, since our max TTL was set pretty high and we could never correlate it with a nomad agent restart. We have since tested it and can reliably reproduce the behaviour.
Nomad v0.10.2
Vault v1.3.1
Hi @schmichael @tgross, any chance this will be picked up in a coming release? This makes using dynamic vault credentials in nomad very difficult and unstable. We're running a few hacky workarounds for the moment, but we can't do that sustainably. Thanks!
Same here. My client ran into this and is now asking questions about the reliability of the stack and whether we should look for alternatives.
Still the case for version 0.11.0. I would also love to see this bug fixed.
@filip-vt What is the workaround you are referring to?
In my case, I updated a key (KV v1) in Vault, but the secrets files are not updated, and the job is therefore also not restarted. Even when I restart the task manually from the UI, it still does not pick up the new secret.
template {
  data          = "{{with secret \"secret/kv/certificate/domain\"}}{{.Data.privkey}}{{end}}"
  destination   = "secrets/cert.key"
  change_mode   = "signal"
  change_signal = "SIGHUP"
}
template {
  data          = "{{with secret \"secret/kv/certificate/domain\"}}{{.Data.fullchain}}{{end}}"
  destination   = "secrets/cert.crt"
  change_mode   = "signal"
  change_signal = "SIGHUP"
}
I'm seeing the same problem with consul kv values. I followed the example that schmichael gave in https://github.com/hashicorp/nomad/issues/5191, but my task is not getting restarted on a consul kv value change. After changing the kv value, a nomad plan doesn't show a difference.
@schmichael can you please verify that your example still works? I'm using nomad v0.11.2.
I verified that the template file itself is updated, but a task restart is not initiated, even though change_mode = "restart" is set. This seems to be a pretty egregious bug, as it's directly counter to the docs.
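For anyone trying to reproduce, a minimal stanza of the kind being tested here (the key path and destination are placeholders, not taken from #5191 or my job) looks roughly like this:
template {
  data        = "{{ key \"service/example/config\" }}"
  destination = "local/example.conf"
  change_mode = "restart"
}
The rendered file under local/ changes when the Consul key changes, but the task restart never fires.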
We are experiencing this issue as well: our dynamic secrets get updated in the alloc secrets file, but the container itself never picks up the new credentials.
Nomad v0.10.2
Vault v1.2.3
Consul v1.5.3
Fixed (finally) in https://github.com/hashicorp/nomad/pull/9636, which will ship in the upcoming Nomad 1.0.2.