Nomad: Script checks fail to update TTL

Created on 11 Dec 2019 · 5 Comments · Source: hashicorp/nomad

Nomad version

v0.10.2

Issue

Script checks fail to update in Consul.

From what I could investigate, this is because the service name is not interpolated in the script check hook.

Because the service name is not interpolated in that hook, the check ID generated by the hash function differs from the one registered in Consul.

Reproduction steps

1) Create a service with a script check, using interpolation in the service name.
2) Run the job.

Nomad Client logs

Dec 11 07:46:25 w-315cfc94-0005 nomad: {"@level":"warn","@message":"updating check failed","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:25.325032Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-fcdfa8b4920e8e49f8361e7a82b9d772f5c90bfa\")","task":"my-task"}
Dec 11 07:46:25 w-315cfc94-0005 nomad: {"@level":"warn","@message":"updating check failed","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:25.704519Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-f9d3b979a85ba3c79f9e54bed95fbc6ba49a5827\")","task":"my-task"}
Dec 11 07:46:40 w-315cfc94-0005 nomad: {"@level":"debug","@message":"updating check still failing","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:40.287374Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-fcdfa8b4920e8e49f8361e7a82b9d772f5c90bfa\")","task":"my-task"}
Dec 11 07:46:40 w-315cfc94-0005 nomad: {"@level":"debug","@message":"updating check still failing","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:40.721120Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-f9d3b979a85ba3c79f9e54bed95fbc6ba49a5827\")","task":"my-task"}
Dec 11 07:46:55 w-315cfc94-0005 nomad: {"@level":"debug","@message":"updating check still failing","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:55.307906Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-fcdfa8b4920e8e49f8361e7a82b9d772f5c90bfa\")","task":"my-task"}
Dec 11 07:46:55 w-315cfc94-0005 nomad: {"@level":"debug","@message":"updating check still failing","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:55.655170Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-f9d3b979a85ba3c79f9e54bed95fbc6ba49a5827\")","task":"my-task"}

I made the following change to make it work in our environment.

diff --git a/client/allocrunner/taskrunner/script_check_hook.go b/client/allocrunner/taskrunner/script_check_hook.go
index b40e92301..4916ef76e 100644
--- a/client/allocrunner/taskrunner/script_check_hook.go
+++ b/client/allocrunner/taskrunner/script_check_hook.go
@@ -175,12 +175,15 @@ func (h *scriptCheckHook) Stop(ctx context.Context, req *interfaces.TaskStopRequ
 func (h *scriptCheckHook) newScriptChecks() map[string]*scriptCheck {
        scriptChecks := make(map[string]*scriptCheck)
        for _, service := range h.task.Services {
+               copyService := service.Copy()
+               copyService.Name = h.taskEnv.ReplaceEnv(copyService.Name)
+               copyService.PortLabel = h.taskEnv.ReplaceEnv(service.PortLabel)
                for _, check := range service.Checks {
                        if check.Type != structs.ServiceCheckScript {
                                continue
                        }
                        serviceID := agentconsul.MakeAllocServiceID(
-                               h.alloc.ID, h.task.Name, service)
+                               h.alloc.ID, h.task.Name, copyService)
                        sc := newScriptCheck(&scriptCheckConfig{
                                allocID:    h.alloc.ID,
                                taskName:   h.task.Name,
@@ -205,6 +208,9 @@ func (h *scriptCheckHook) newScriptChecks() map[string]*scriptCheck {
        // watches Consul for status changes.
        tg := h.alloc.Job.LookupTaskGroup(h.alloc.TaskGroup)
        for _, service := range tg.Services {
+               copyService := service.Copy()
+               copyService.Name = h.taskEnv.ReplaceEnv(copyService.Name)
+               copyService.PortLabel = h.taskEnv.ReplaceEnv(service.PortLabel)
                for _, check := range service.Checks {
                        if check.Type != structs.ServiceCheckScript {
                                continue
@@ -214,7 +220,7 @@ func (h *scriptCheckHook) newScriptChecks() map[string]*scriptCheck {
                        }
                        groupTaskName := "group-" + tg.Name
                        serviceID := agentconsul.MakeAllocServiceID(
-                               h.alloc.ID, groupTaskName, service)
+                               h.alloc.ID, h.task.Name, copyService)
                        sc := newScriptCheck(&scriptCheckConfig{
                                allocID:    h.alloc.ID,
                                taskName:   groupTaskName,
Labels: theme/discovery, type/bug

All 5 comments

Hi @jorgemarey and thanks for reporting this! This may be related to what's going on in #6637 but we'll look into it.

Hi @tgross. I'm not sure this is related. That issue occurs during validation of the job file, whereas this one happens when the client agent is running the allocation and trying to update the TTL in Consul.

Hey @jorgemarey, just wanted to let you know I've started on the fix for this. Your patch has the right idea, but we need to move where we're doing the taskEnv interpolation to account for job updates. Once I've got that (and tests!) I'll ping you on the pull request as a heads up.

I'm still having the same issue as the original poster, running Nomad v0.10.5.
