Nomad: Script checks fail to update TTL

Created on 11 Dec 2019 · 5 Comments · Source: hashicorp/nomad

Nomad version

v0.10.2

Issue

Script checks fail to update in Consul.

From what I could investigate, this happens because the service name is not interpolated in the script check hook.

As a result, the check ID generated by the hash function in that hook differs from the one registered in Consul.

Reproduction steps

1) Create a service with a script check, using interpolation in the service name.
2) Run the job.

Nomad Client logs

Dec 11 07:46:25 w-315cfc94-0005 nomad: {"@level":"warn","@message":"updating check failed","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:25.325032Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-fcdfa8b4920e8e49f8361e7a82b9d772f5c90bfa\")","task":"my-task"}
Dec 11 07:46:25 w-315cfc94-0005 nomad: {"@level":"warn","@message":"updating check failed","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:25.704519Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-f9d3b979a85ba3c79f9e54bed95fbc6ba49a5827\")","task":"my-task"}
Dec 11 07:46:40 w-315cfc94-0005 nomad: {"@level":"debug","@message":"updating check still failing","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:40.287374Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-fcdfa8b4920e8e49f8361e7a82b9d772f5c90bfa\")","task":"my-task"}
Dec 11 07:46:40 w-315cfc94-0005 nomad: {"@level":"debug","@message":"updating check still failing","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:40.721120Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-f9d3b979a85ba3c79f9e54bed95fbc6ba49a5827\")","task":"my-task"}
Dec 11 07:46:55 w-315cfc94-0005 nomad: {"@level":"debug","@message":"updating check still failing","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:55.307906Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-fcdfa8b4920e8e49f8361e7a82b9d772f5c90bfa\")","task":"my-task"}
Dec 11 07:46:55 w-315cfc94-0005 nomad: {"@level":"debug","@message":"updating check still failing","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:55.655170Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-f9d3b979a85ba3c79f9e54bed95fbc6ba49a5827\")","task":"my-task"}

I made the following change to make it work in our environment.

diff --git a/client/allocrunner/taskrunner/script_check_hook.go b/client/allocrunner/taskrunner/script_check_hook.go
index b40e92301..4916ef76e 100644
--- a/client/allocrunner/taskrunner/script_check_hook.go
+++ b/client/allocrunner/taskrunner/script_check_hook.go
@@ -175,12 +175,15 @@ func (h *scriptCheckHook) Stop(ctx context.Context, req *interfaces.TaskStopRequ
 func (h *scriptCheckHook) newScriptChecks() map[string]*scriptCheck {
        scriptChecks := make(map[string]*scriptCheck)
        for _, service := range h.task.Services {
+               copyService := service.Copy()
+               copyService.Name = h.taskEnv.ReplaceEnv(copyService.Name)
+               copyService.PortLabel = h.taskEnv.ReplaceEnv(service.PortLabel)
                for _, check := range service.Checks {
                        if check.Type != structs.ServiceCheckScript {
                                continue
                        }
                        serviceID := agentconsul.MakeAllocServiceID(
-                               h.alloc.ID, h.task.Name, service)
+                               h.alloc.ID, h.task.Name, copyService)
                        sc := newScriptCheck(&scriptCheckConfig{
                                allocID:    h.alloc.ID,
                                taskName:   h.task.Name,
@@ -205,6 +208,9 @@ func (h *scriptCheckHook) newScriptChecks() map[string]*scriptCheck {
        // watches Consul for status changes.
        tg := h.alloc.Job.LookupTaskGroup(h.alloc.TaskGroup)
        for _, service := range tg.Services {
+               copyService := service.Copy()
+               copyService.Name = h.taskEnv.ReplaceEnv(copyService.Name)
+               copyService.PortLabel = h.taskEnv.ReplaceEnv(service.PortLabel)
                for _, check := range service.Checks {
                        if check.Type != structs.ServiceCheckScript {
                                continue
@@ -214,7 +220,7 @@ func (h *scriptCheckHook) newScriptChecks() map[string]*scriptCheck {
                        }
                        groupTaskName := "group-" + tg.Name
                        serviceID := agentconsul.MakeAllocServiceID(
-                               h.alloc.ID, groupTaskName, service)
+                               h.alloc.ID, h.task.Name, copyService)
                        sc := newScriptCheck(&scriptCheckConfig{
                                allocID:    h.alloc.ID,
                                taskName:   groupTaskName,

theme/discovery type/bug

Source

jorgemarey

All 5 comments

Hi @jorgemarey and thanks for reporting this! This may be related to what's going on in #6637 but we'll look into it.

tgross on 11 Dec 2019

Hi @tgross. I don't know if this is related. That issue occurs during validation of the job file, whereas this one happens when the agent (client) is running the allocation and trying to update the TTL in Consul.

jorgemarey on 12 Dec 2019

Hey @jorgemarey, just wanted to let you know I've started on the fix for this. Your patch has the right idea, but we need to move where we're doing the taskEnv interpolation to account for job updates. Once I've got that (and tests!) I'll ping you on the pull request as a heads up.

tgross on 7 Jan 2020

I've opened https://github.com/hashicorp/nomad/pull/6916 with the fix.

tgross on 8 Jan 2020

I'm still having the same issue as the original poster, running Nomad v0.10.5.