v0.10.2
Script checks fail to update in Consul.
From what I could investigate, this is because the service name is not interpolated in the script check hook.
Since the hook uses the uninterpolated service name, the check ID generated by the hash function differs from the one registered in Consul.
1) Create a service with a script check, using interpolation in the service name.
2) Run the job.
Dec 11 07:46:25 w-315cfc94-0005 nomad: {"@level":"warn","@message":"updating check failed","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:25.325032Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-fcdfa8b4920e8e49f8361e7a82b9d772f5c90bfa\")","task":"my-task"}
Dec 11 07:46:25 w-315cfc94-0005 nomad: {"@level":"warn","@message":"updating check failed","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:25.704519Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-f9d3b979a85ba3c79f9e54bed95fbc6ba49a5827\")","task":"my-task"}
Dec 11 07:46:40 w-315cfc94-0005 nomad: {"@level":"debug","@message":"updating check still failing","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:40.287374Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-fcdfa8b4920e8e49f8361e7a82b9d772f5c90bfa\")","task":"my-task"}
Dec 11 07:46:40 w-315cfc94-0005 nomad: {"@level":"debug","@message":"updating check still failing","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:40.721120Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-f9d3b979a85ba3c79f9e54bed95fbc6ba49a5827\")","task":"my-task"}
Dec 11 07:46:55 w-315cfc94-0005 nomad: {"@level":"debug","@message":"updating check still failing","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:55.307906Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-fcdfa8b4920e8e49f8361e7a82b9d772f5c90bfa\")","task":"my-task"}
Dec 11 07:46:55 w-315cfc94-0005 nomad: {"@level":"debug","@message":"updating check still failing","@module":"client.alloc_runner.task_runner.task_hook.script_checks","@timestamp":"2019-12-11T07:46:55.655170Z","alloc_id":"e3f32377-2307-9b88-2052-ddb8efa4013c","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-f9d3b979a85ba3c79f9e54bed95fbc6ba49a5827\")","task":"my-task"}
I made the following change to make it work in our environment.
diff --git a/client/allocrunner/taskrunner/script_check_hook.go b/client/allocrunner/taskrunner/script_check_hook.go
index b40e92301..4916ef76e 100644
--- a/client/allocrunner/taskrunner/script_check_hook.go
+++ b/client/allocrunner/taskrunner/script_check_hook.go
@@ -175,12 +175,15 @@ func (h *scriptCheckHook) Stop(ctx context.Context, req *interfaces.TaskStopRequ
 func (h *scriptCheckHook) newScriptChecks() map[string]*scriptCheck {
 	scriptChecks := make(map[string]*scriptCheck)
 	for _, service := range h.task.Services {
+		copyService := service.Copy()
+		copyService.Name = h.taskEnv.ReplaceEnv(copyService.Name)
+		copyService.PortLabel = h.taskEnv.ReplaceEnv(service.PortLabel)
 		for _, check := range service.Checks {
 			if check.Type != structs.ServiceCheckScript {
 				continue
 			}
 			serviceID := agentconsul.MakeAllocServiceID(
-				h.alloc.ID, h.task.Name, service)
+				h.alloc.ID, h.task.Name, copyService)
 			sc := newScriptCheck(&scriptCheckConfig{
 				allocID: h.alloc.ID,
 				taskName: h.task.Name,
@@ -205,6 +208,9 @@ func (h *scriptCheckHook) newScriptChecks() map[string]*scriptCheck {
 	// watches Consul for status changes.
 	tg := h.alloc.Job.LookupTaskGroup(h.alloc.TaskGroup)
 	for _, service := range tg.Services {
+		copyService := service.Copy()
+		copyService.Name = h.taskEnv.ReplaceEnv(copyService.Name)
+		copyService.PortLabel = h.taskEnv.ReplaceEnv(service.PortLabel)
 		for _, check := range service.Checks {
 			if check.Type != structs.ServiceCheckScript {
 				continue
@@ -214,7 +220,7 @@ func (h *scriptCheckHook) newScriptChecks() map[string]*scriptCheck {
 			}
 			groupTaskName := "group-" + tg.Name
 			serviceID := agentconsul.MakeAllocServiceID(
-				h.alloc.ID, groupTaskName, service)
+				h.alloc.ID, h.task.Name, copyService)
 			sc := newScriptCheck(&scriptCheckConfig{
 				allocID: h.alloc.ID,
 				taskName: groupTaskName,
Hi @jorgemarey and thanks for reporting this! This may be related to what's going on in #6637 but we'll look into it.
Hi @tgross. I don't know if this is related. That issue occurs during validation of the job file, whereas this one happens when the agent (client) is running the allocation and trying to update the TTL in Consul.
Hey @jorgemarey, just wanted to let you know I've started on the fix for this. Your patch has the right idea, but we need to move where we're doing the taskEnv interpolation to account for job updates. Once I've got that (and tests!) I'll ping you on the pull request as a heads up.
I've opened https://github.com/hashicorp/nomad/pull/6916 with the fix.
I'm still seeing the same issue as the original poster, running Nomad v0.10.5.