Nomad version: 0.9.1
Consul version: 1.5.0
Sometimes Nomad outputs the following errors when deploying a container:
{"@level":"error","@message":"update hook failed","@module":"client.alloc_runner.task_runner","@timestamp":"2019-05-29T14:55:58.076399Z","alloc_id":"b5b284fb-8980-d29e-65f7-9371f3b9d15f","error":"driver doesn't support script checks","name":"consul_services","task":"my-task"}
{"@level":"error","@message":"update hook failed","@module":"client.alloc_runner.task_runner","@timestamp":"2019-05-29T14:59:43.017706Z","alloc_id":"9efb7a79-55d8-2d72-9506-96b0bf7a7050","error":"unable to get address for service \"my-service\": cannot use address_mode=\"driver\": no driver network exists","name":"consul_services","task":"my-task"}
In the end it works and the service gets registered correctly.
This has happened several times with different jobs.
I can't reproduce the errors consistently; it seems like a concurrency problem.
The job is a simple Docker driver job with a service and a check. I saw in some issue that with address_mode "driver" the task should be in a custom network_mode, which is the case here.
This is the job file that was in use when the second error appeared (the script check error came from a different job).
"Tasks": [
{
"Name": "my-task",
"Driver": "docker",
"User": "",
"Config": {
"network_mode": "my-network",
"port_map": [
{
"api": 3000
}
],
"force_pull": true,
"image": "********",
},
"Env": null,
"Services": [
{
"Name": "${NOMAD_JOB_NAME}-${NOMAD_ALLOC_INDEX}",
"PortLabel": "api",
"AddressMode": "driver",
"Tags": null,
"CanaryTags": null,
"Checks": [
{
"Name": "${NOMAD_JOB_NAME}-${NOMAD_ALLOC_INDEX}-check-tcp-container",
"Type": "tcp",
"Command": "",
"Args": null,
"Path": "",
"Protocol": "",
"PortLabel": "",
"AddressMode": "",
"Interval": 10000000000,
"Timeout": 2000000000,
"InitialStatus": "",
"TLSSkipVerify": false,
"Method": "",
"Header": null,
"CheckRestart": {
"Limit": 3,
"Grace": 90000000000,
"IgnoreWarnings": false
},
"GRPCService": "",
"GRPCUseTLS": false
}
]
},
{
"Name": "${NOMAD_JOB_NAME}",
"PortLabel": "api",
"AddressMode": "driver",
"Tags": null,
"CanaryTags": null,
"Checks": [
{
"Name": "${NOMAD_TASK_NAME}-${NOMAD_ALLOC_INDEX}-check-tcp-service",
"Type": "tcp",
"Command": "",
"Args": null,
"Path": "",
"Protocol": "",
"PortLabel": "",
"AddressMode": "",
"Interval": 10000000000,
"Timeout": 2000000000,
"InitialStatus": "",
"TLSSkipVerify": false,
"Method": "",
"Header": null,
"CheckRestart": null,
"GRPCService": "",
"GRPCUseTLS": false
}
]
},
{
"Name": "${NOMAD_JOB_NAME}-external",
"PortLabel": "api",
"AddressMode": "host",
"Tags": [],
"CanaryTags": null,
"Checks": [
{
"Name": "${NOMAD_TASK_NAME}-${NOMAD_ALLOC_INDEX}-external",
"Type": "tcp",
"Command": "",
"Args": null,
"Path": "",
"Protocol": "",
"PortLabel": "",
"AddressMode": "",
"Interval": 10000000000,
"Timeout": 2000000000,
"InitialStatus": "",
"TLSSkipVerify": false,
"Method": "",
"Header": null,
"CheckRestart": null,
"GRPCService": "",
"GRPCUseTLS": false
}
]
}
],
"Templates": [
{
"SourcePath": "",
"DestPath": "secrets/file.env",
"EmbeddedTmpl": " ........ ",
"ChangeMode": "restart",
"ChangeSignal": "",
"Splay": 5000000000,
"Perms": "0644",
"LeftDelim": "{{",
"RightDelim": "}}",
"Envvars": true,
"VaultGrace": 15000000000
}
],
"Constraints": null,
"Affinities": null,
"Resources": {
"CPU": 300,
"MemoryMB": 300,
"DiskMB": 0,
"IOPS": 0,
"Networks": [
{
"Device": "",
"CIDR": "",
"IP": "",
"MBits": 10,
"ReservedPorts": null,
"DynamicPorts": [
{
"Label": "api",
"Value": 0
}
]
}
],
"Devices": null
},
"DispatchPayload": null,
"Meta": null,
"KillTimeout": 5000000000,
"LogConfig": {
"MaxFiles": 10,
"MaxFileSizeMB": 10
},
"Artifacts": null,
"Leader": false,
"ShutdownDelay": 0,
"KillSignal": ""
}
],
In the case of the script check failure, the script is rendered via a template.
May 29 14:59:43 w-a3d6cd97-0002 nomad: {"@level":"error","@message":"update hook failed","@module":"client.alloc_runner.task_runner","@timestamp":"2019-05-29T14:59:43.017706Z","alloc_id":"9efb7a79-55d8-2d72-9506-96b0bf7a7050","error":"unable to get address for service \"my-service\": cannot use address_mode=\"driver\": no driver network exists","name":"consul_services","task":"my-task"}
May 29 14:59:43 w-a3d6cd97-0002 nomad: 2019/05/29 14:59:43.017810 [INFO] (runner) creating new runner (dry: false, once: false)
May 29 14:59:43 w-a3d6cd97-0002 nomad: 2019/05/29 14:59:43.017969 [INFO] (runner) creating watcher
May 29 14:59:43 w-a3d6cd97-0002 nomad: 2019/05/29 14:59:43.018091 [INFO] (runner) starting
May 29 14:59:43 w-a3d6cd97-0002 nomad: 2019/05/29 14:59:43.039351 [INFO] (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/9efb7a79-55d8-2d72-9506-96b0bf7a7050/my-task/secrets/file.env"
May 29 14:59:44 w-a3d6cd97-0002 consul: 2019/05/29 14:59:44 [INFO] agent: Synced check "89eac5f576aa976e8b883d4ee1f7f0bbd5cc6bdc"
May 29 14:59:44 w-a3d6cd97-0002 nomad: {"@level":"info","@message":"created container","@module":"client.driver_mgr.docker","@timestamp":"2019-05-29T14:59:44.616571Z","container_id":"e613b13dccd874babb3950f8b1119445fde313eb908007233b7cc37c4e0b407e","driver":"docker"}
If you want more info, I can try to get some more.
Thanks.
Could this be related to #5770? Both are related to the update hook for service registration.
We are seeing this problem as well.
This does not appear to be related to #5770. (5770 has been fixed in 0.9.3, but we're still seeing this problem with Nomad 0.9.3.)
I might also be having the same or a similar problem with 0.9.3.
In my case, this issue occurs with a Docker container that has a script health check:
consul_services: driver doesn't support script checks
I see it in the allocation status via the CLI:
Recent Events:
Time                  Type              Description
2019-06-26T16:28:03Z  Started           Task started by client
2019-06-26T16:28:03Z  Task hook failed  consul_services: driver doesn't support script checks
2019-06-26T16:28:03Z  Task Setup        Building Task Directory
2019-06-26T16:28:03Z  Received          Task received by client
The health check actually shows up as passing in Consul:
[
{
"Node": "localhost.localdomain",
"CheckID": "_nomad-check-604bd5e908d5da2bfbf3013306fc5892d0e10865",
"Name": "api-php-fpm",
"Status": "passing",
"Notes": "",
"Output": "",
"ServiceID": "_nomad-task-2bcc1828-d438-f387-2b7b-90a8f52da268-api-php-fpm",
"ServiceName": "api-php-fpm",
"ServiceTags": [],
"Definition": {},
"CreateIndex": 8229,
"ModifyIndex": 8230
}
]
We're seeing the same issue with our custom task driver. It returns a proper *drivers.DriverNetwork, but we still get that failure.
In the end, the service is registered with the correct IP. The failure can be confusing, though, and can lead you to look for problems in the wrong places.
Unfortunately, the supplied IP:port is not shown in the nomad alloc status output, and Consul does not have a CLI command to show more detailed information about a service (like its IP), so we need to make an API call, which is cumbersome. I think that's a separate issue, though; it's discussed a bit in this post: https://groups.google.com/forum/#!topic/nomad-tool/5sR8MTGZFrM
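For reference, the API call in question looks roughly like the sketch below. It assumes a local Consul agent on 127.0.0.1:8500 and a service named my-service (both placeholders); it queries the catalog endpoint and prints the address and port each instance was registered with:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// Subset of the fields returned by Consul's /v1/catalog/service/<name> endpoint.
type catalogEntry struct {
	Address        string // node address
	ServiceID      string
	ServiceAddress string // address the service was registered with (may be empty)
	ServicePort    int
}

func main() {
	// Placeholder agent address and service name.
	resp, err := http.Get("http://127.0.0.1:8500/v1/catalog/service/my-service")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var entries []catalogEntry
	if err := json.NewDecoder(resp.Body).Decode(&entries); err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		addr := e.ServiceAddress
		if addr == "" {
			// Consul falls back to the node address when no service address is set.
			addr = e.Address
		}
		fmt.Printf("%s -> %s:%d\n", e.ServiceID, addr, e.ServicePort)
	}
}

The same information is reachable with curl against the HTTP API, but either way it is an extra step compared to seeing the address directly in nomad alloc status.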
Hi.
This issue occurs for me basically every time I start a job.
client.alloc_runner.task_runner: update hook failed: alloc_id=091c3564-da61-3ad9-d618-d2de767f5554 task=service name=consul_services error="unable to get address for service "my-service": cannot use address_mode="driver": no driver network exists"
But despite this being logged as an error, Nomad starts the task and registers it in Consul with the proper IP and port.
My configuration looks like this:
Nomad starts a Docker task with address_mode: driver.
Docker has only the default networks and starts the container on the "bridge" network, so I get a docker0 interface with an IP assigned from /etc/docker/daemon.conf (the bip setting) and vethxxxx interfaces bridged to it.
Then I have an overlay network on flannel, with a flannel.0 device that has its own IP; all inter-node communication is routed via the kernel and flannel. Docker has no idea that there is an overlay network.
The docs say that if a task cannot find a driver network it will fail; for me, though, it always starts.
This issue was very sporadic on 0.8 and earlier; with 0.9 I think it happens every time (tested on 0.9.0, 0.9.3, 0.9.4).
Regards,
Marcin.
Same issue here as @Garagoth, using IPv6 (on 0.9.4).
Registration also works, but I'm still getting the error.
Going to do some debugging to see if I can find out more.
@Garagoth, do you also have Vault in use in that job?
Found the culprit.
When Vault is used, the Vault hook's Prestart runs an Update on all runner hooks before Docker starts, so there is no task.DriverNetwork yet.
Would it work to not run tr.triggerUpdateHooks in updatedVaultToken when the hook is run for the first time?
I can make a PR if this solution looks good.
When Vault is used in the job, the Vault hook's Prestart runs before Docker, and tr.triggerUpdateHooks gets triggered in updatedVaultToken().
This hits a channel:
https://github.com/hashicorp/nomad/blob/6c4863c5f61b0cdfb609927f6f67e17d90454197/client/allocrunner/taskrunner/task_runner.go#L1191
which runs updateHooks():
https://github.com/hashicorp/nomad/blob/6c4863c5f61b0cdfb609927f6f67e17d90454197/client/allocrunner/taskrunner/task_runner.go#L649
which runs Update on all the hooks in tr.runnerHooks:
https://github.com/hashicorp/nomad/blob/6c4863c5f61b0cdfb609927f6f67e17d90454197/client/allocrunner/taskrunner/task_runner_hooks.go#L422-L427
which hits Update in the consul_services hook, which triggers UpdateTask and runs serviceRegs:
https://github.com/hashicorp/nomad/blob/6c4863c5f61b0cdfb609927f6f67e17d90454197/command/agent/consul/client.go#L952
which finally calls getAddress with an empty task.DriverNetwork and returns the error:
https://github.com/hashicorp/nomad/blob/6c4863c5f61b0cdfb609927f6f67e17d90454197/command/agent/consul/client.go#L709-L713
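For illustration only, here is a rough, self-contained sketch of the kind of guard proposed above, i.e. skipping the update fan-out the first time the Vault token is set. The names (taskRunner, tokenSet, triggerUpdateHooks) are simplified stand-ins, not the actual Nomad code, and the eventual fix may look different:

package main

import (
	"fmt"
	"sync"
)

// Simplified stand-in for the task runner state involved here.
type taskRunner struct {
	mu         sync.Mutex
	vaultToken string
	tokenSet   bool // has a Vault token been set before?
}

// Stand-in for the fan-out that ends up calling Update on every runner hook,
// including the consul_services hook.
func (tr *taskRunner) triggerUpdateHooks() {
	fmt.Println("running Update on all runner hooks")
}

func (tr *taskRunner) updatedVaultToken(token string) {
	tr.mu.Lock()
	defer tr.mu.Unlock()
	tr.vaultToken = token

	// On the very first token (prestart, before the driver has created a
	// network), skip the fan-out; later renewals still propagate normally.
	if !tr.tokenSet {
		tr.tokenSet = true
		return
	}
	tr.triggerUpdateHooks()
}

func main() {
	tr := &taskRunner{}
	tr.updatedVaultToken("token-1") // prestart: no hooks triggered yet
	tr.updatedVaultToken("token-2") // renewal: hooks triggered
}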
Yes, all jobs with Vault have this error.
Jobs without Vault seem to be fine. I did not notice this earlier.
M.
Made PR #6066 which fixes the issue for me
Hey there
Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.
Thanks!
I'm still experiencing this issue using v0.10.0 (25ee121d951939504376c70bf8d7950c1ddb6a82). I have not tried the #6066 PR, though.
I'm experiencing this also. Perhaps Nomad should have a way to assign an address directly itself?
Hey there
Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.
Thanks!
Since this popped back up in my inbox: still seeing it with 0.10.2.
I think I've just hit the same issue. I upgraded from 0.9.3 to 0.10.3 and some of my jobs are failing. I had raised a separate issue - #7177 - because I didn't initially spot this one.