Nomad v0.6.2
CoreOS
When a Nomad job has a task that uses the Docker driver with both an environment section (env { "KEY" = "value" }) and an environment template, the environment section is used to populate the Docker container environment but the environment template is ignored. The behaviour is intermittent, though: sometimes the template works as well.
When you run the Nomad job, you sometimes don't get the environment variables from the template.
The job file is similar to this:
job "test" {
datacenters = ["staging-sqsc"]
type = "service"
constraint {
attribute = "${meta.sqsc.environment}"
value = "staging"
}
constraint {
attribute = "${meta.sqsc.project}"
value = "sqsc"
}
update {
stagger = "10s"
max_parallel = 1
}
group "test" {
count = 1
task "test" {
driver = "docker"
config {
image = "alpine"
args = [ "/bin/sh", "-c", "date; env; echo; cat" ]
}
resources {
cpu = 100
memory = 512
network {
mbits = 10
}
}
env {
"ENV_FROM_ENV" = "ok"
}
template {
destination = "environment"
env = true
data = <<TEMPLATE
ENV_FROM_TEMPLATE={{ key "sample-consul-key" }}
TEMPLATE
}
}
}
}
(You need to set the Consul key sample-consul-key and have a Nomad node that matches the constraints; see the sketch below.)
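For reference, a minimal way to set up the prerequisite key and check the task output could look like this (a sketch only; it assumes local Consul and Nomad agents, and the key value, job file name, and alloc ID placeholder are arbitrary):
$ consul kv put sample-consul-key some-value
$ nomad run test.nomad
$ nomad logs <alloc-id> test   # the env output shows whether ENV_FROM_TEMPLATE reached the container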
Thanks for reporting this issue. Do you have more information about when the environment template is and isn't ignored? If you have logs that are relevant to this issue, feel free to add them here.
@mildred Could you please share DEBUG-level logs of the client where this occurs?
I have the following nomad client logs. This occurs even though the environment section of the job file is empty and all variables are set using the environment template instead.
nomad logs
2017/11/06 18:31:04.988262 [INFO] client: task "deploy" for alloc "bc4e33c0-aed2-bd19-1798-e06216f5876a" failed: Wait returned exit code 1, signal 0, and error Docker container exited with non-zero exit code: 1
2017/11/06 18:31:04.988317 [INFO] client: Restarting task "deploy" for alloc "bc4e33c0-aed2-bd19-1798-e06216f5876a" in 16.362625637s
2017/11/06 18:31:04.990046 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
2017/11/06 18:31:05.138308 [DEBUG] client: updated allocations at index 4602 (total 14) (pulled 0) (filtered 14)
2017/11/06 18:31:05.138530 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
2017/11/06 18:31:07.154924 [DEBUG] http: Request /v1/agent/servers (45.587µs)
2017/11/06 18:31:07.156368 [DEBUG] http: Request /v1/status/peers (170.845µs)
2017/11/06 18:31:08.768912 [ERR] http: Request /v1/node/172857b9-3b5f-bd0a-b09d-545ac3ceb6dc?index=6&wait=1m, error: node not found
2017/11/06 18:31:08.769380 [DEBUG] http: Request /v1/node/172857b9-3b5f-bd0a-b09d-545ac3ceb6dc?index=6&wait=1m (1m2.208570212s)
2017/11/06 18:31:09.107824 [INFO] client: task "register" for alloc "6decdf71-8e82-438b-4365-c85ec05ab800" completed successfully
2017/11/06 18:31:09.107876 [INFO] client: Restarting task "register" for alloc "6decdf71-8e82-438b-4365-c85ec05ab800" in 5.69028103s
2017/11/06 18:31:09.107836 [INFO] client: task "forward" for alloc "6decdf71-8e82-438b-4365-c85ec05ab800" failed: Wait returned exit code 255, signal 0, and error
2017/11/06 18:31:09.107968 [INFO] client: Restarting task "forward" for alloc "6decdf71-8e82-438b-4365-c85ec05ab800" in 5.69028103s
2017/11/06 18:31:09.110549 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
2017/11/06 18:31:09.112986 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
2017/11/06 18:31:09.337888 [DEBUG] client: updated allocations at index 4603 (total 14) (pulled 0) (filtered 14)
2017/11/06 18:31:09.337955 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
2017/11/06 18:31:11.742120 [INFO] client: task "forward" for alloc "597a9be9-af7e-b8ea-eb14-41b66d98ec92" failed: Wait returned exit code 255, signal 0, and error
2017/11/06 18:31:11.742181 [INFO] client: Restarting task "forward" for alloc "597a9be9-af7e-b8ea-eb14-41b66d98ec92" in 5.346224508s
2017/11/06 18:31:11.742517 [INFO] client: task "register" for alloc "597a9be9-af7e-b8ea-eb14-41b66d98ec92" completed successfully
2017/11/06 18:31:11.742567 [INFO] client: Restarting task "register" for alloc "597a9be9-af7e-b8ea-eb14-41b66d98ec92" in 5.346224508s
2017/11/06 18:31:11.745454 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
2017/11/06 18:31:11.747612 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
2017/11/06 18:31:11.938454 [DEBUG] client: updated allocations at index 4604 (total 14) (pulled 0) (filtered 14)
2017/11/06 18:31:11.938588 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
2017/11/06 18:31:12 [DEBUG] memberlist: TCP connection from=127.0.0.1:36034
2017/11/06 18:31:14.819411 [DEBUG] driver.raw_exec: started process with pid: 7231
2017/11/06 18:31:14.830486 [DEBUG] driver.raw_exec: started process with pid: 7275
2017/11/06 18:31:14.937910 [DEBUG] client: updated allocations at index 4606 (total 14) (pulled 0) (filtered 14)
2017/11/06 18:31:14.937994 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
2017/11/06 18:31:15.137896 [DEBUG] client: updated allocations at index 4607 (total 14) (pulled 0) (filtered 14)
2017/11/06 18:31:15.137968 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
2017/11/06 18:31:15.551060 [DEBUG] http: Request /v1/job/svc-qa-noenv-20171002180522-squarescale-machine-sqsc-demo/allocations?index=4588&wait=1m (1m0.939825811s)
2017/11/06 18:31:16.552631 [DEBUG] http: Request /v1/job/svc-qa-noenv-20171002180522-squarescale-machine-sqsc-demo/allocations?index=4588&wait=1m (673.628µs)
2017/11/06 18:31:16.943638 [DEBUG] http: Request /v1/job/svc-qa-noenv-20171002172154-squarescale-machine-hello/allocations?index=4588&wait=1m (1m1.648431515s)
2017/11/06 18:31:17.114875 [DEBUG] driver.raw_exec: started process with pid: 7451
2017/11/06 18:31:17.120173 [DEBUG] driver.raw_exec: started process with pid: 7455
2017/11/06 18:31:17.156304 [DEBUG] http: Request /v1/agent/servers (34.34µs)
2017/11/06 18:31:17.157408 [DEBUG] http: Request /v1/status/peers (121.718µs)
2017/11/06 18:31:17.338751 [DEBUG] client: updated allocations at index 4608 (total 14) (pulled 0) (filtered 14)
2017/11/06 18:31:17.338891 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
2017/11/06 18:31:17.554116 [DEBUG] http: Request /v1/job/svc-qa-noenv-20171002180522-squarescale-machine-sqsc-demo/allocations?index=4607&wait=1m (566.876µs)
2017/11/06 18:31:17.945171 [DEBUG] http: Request /v1/job/svc-qa-noenv-20171002172154-squarescale-machine-hello/allocations?index=4588&wait=1m (601.115µs)
2017/11/06 18:31:21.382134 [DEBUG] driver.docker: Setting default logging options to syslog and unix:///tmp/plugin253912072
2017/11/06 18:31:21.382179 [DEBUG] driver.docker: Using config for logging: {Type:syslog ConfigRaw:[] Config:map[syslog-address:unix:///tmp/plugin253912072]}
2017/11/06 18:31:21.382192 [DEBUG] driver.docker: using 536870912 bytes memory for deploy
2017/11/06 18:31:21.382200 [DEBUG] driver.docker: using 100 cpu shares for deploy
2017/11/06 18:31:21.382237 [DEBUG] driver.docker: binding directories []string{"/home/mildred/Work/SquareScale/squarescale-infrastructure/dev/nomad-data/alloc/bc4e33c0-aed2-bd19-1798-e06216f5876a/alloc:/alloc", "/home/mildred/Work/SquareScale/squarescale-infrastructure/dev/nomad-data/alloc/bc4e33c0-aed2-bd19-1798-e06216f5876a/deploy/local:/local", "/home/mildred/Work/SquareScale/squarescale-infrastructure/dev/nomad-data/alloc/bc4e33c0-aed2-bd19-1798-e06216f5876a/deploy/secrets:/secrets", "/home/mildred/Work/SquareScale/squarescale-infrastructure/dev/squarescale-deploy/terraform:/terraform"} for deploy
2017/11/06 18:31:21.382261 [DEBUG] driver.docker: networking mode not specified; defaulting to bridge
2017/11/06 18:31:21.382288 [DEBUG] driver.docker: setting container name to: deploy-bc4e33c0-aed2-bd19-1798-e06216f5876a
2017/11/06 18:31:21.978340 [INFO] driver.docker: created container 8ce48f04f15f9c4c8b9015d6f0fc757f05ac1e21229d1422f0e54e6882004c6a
2017/11/06 18:31:22.502623 [INFO] driver.docker: started container 8ce48f04f15f9c4c8b9015d6f0fc757f05ac1e21229d1422f0e54e6882004c6a
2017/11/06 18:31:22.543870 [WARN] client: error fetching stats of task deploy: stats collection hasn't started yet
2017/11/06 18:31:22 [DEBUG] memberlist: TCP connection from=127.0.0.1:36054
2017/11/06 18:31:22.737883 [DEBUG] client: updated allocations at index 4609 (total 14) (pulled 0) (filtered 14)
2017/11/06 18:31:22.737986 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
2017/11/06 18:31:23.090949 [INFO] client: task "deploy" for alloc "bc4e33c0-aed2-bd19-1798-e06216f5876a" failed: Wait returned exit code 1, signal 0, and error Docker container exited with non-zero exit code: 1
The task named deploy in alloc bc4e33c0 is the Nomad job with the missing environment variables. At the moment it keeps restarting because it is missing the AWS_REGION environment variable, so the Docker container exits with status 1.
nomad inspect deploy shows:
json job (nomad inspect)
{
  "Job": {
    "AllAtOnce": false,
    "Constraints": [
      {
        "LTarget": "${meta.sqsc.environment}",
        "Operand": "=",
        "RTarget": "dev"
      },
      {
        "LTarget": "${meta.sqsc.project}",
        "Operand": "=",
        "RTarget": "sqsc"
      }
    ],
    "CreateIndex": 42,
    "Datacenters": [
      "dev-sqsc"
    ],
    "ID": "deploy",
    "JobModifyIndex": 42,
    "Meta": null,
    "ModifyIndex": 92,
    "Name": "deploy",
    "ParameterizedJob": null,
    "ParentID": "",
    "Payload": null,
    "Periodic": null,
    "Priority": 50,
    "Region": "global",
    "Stable": true,
    "Status": "running",
    "StatusDescription": "",
    "Stop": false,
    "SubmitTime": 1509728475312274239,
    "TaskGroups": [
      {
        "Constraints": null,
        "Count": 1,
        "EphemeralDisk": {
          "Migrate": false,
          "SizeMB": 300,
          "Sticky": false
        },
        "Meta": null,
        "Name": "deploy",
        "RestartPolicy": {
          "Attempts": 2,
          "Delay": 15000000000,
          "Interval": 60000000000,
          "Mode": "delay"
        },
        "Tasks": [
          {
            "Artifacts": null,
            "Config": {
              "force_pull": false,
              "image": "095348363195.dkr.ecr.eu-west-1.amazonaws.com/squarescale-deploy:dev-latest",
              "volumes": [
                "/home/mildred/Work/SquareScale/squarescale-infrastructure/dev/squarescale-deploy/terraform:/terraform"
              ],
              "dns_servers": [
                "172.17.0.1"
              ]
            },
            "Constraints": null,
            "DispatchPayload": null,
            "Driver": "docker",
            "Env": null,
            "KillTimeout": 5000000000,
            "Leader": false,
            "LogConfig": {
              "MaxFileSizeMB": 10,
              "MaxFiles": 10
            },
            "Meta": null,
            "Name": "deploy",
            "Resources": {
              "CPU": 100,
              "DiskMB": 0,
              "IOPS": 0,
              "MemoryMB": 512,
              "Networks": [
                {
                  "CIDR": "",
                  "Device": "",
                  "DynamicPorts": null,
                  "IP": "",
                  "MBits": 10,
                  "ReservedPorts": null
                }
              ]
            },
            "Services": null,
            "ShutdownDelay": 0,
            "Templates": [
              {
                "ChangeMode": "restart",
                "ChangeSignal": "",
                "DestPath": "environment",
                "EmbeddedTmpl": "\n\nSQSC_ENVIRONMENT=\"staging\"\nSQSC_TFSTATE_BUCKET=\"sqsc-tfstate-staging\"\n\n\nNOMAD_VERSION=\"{{ key \"squarescale/dev/nomad_version\" }}\"\nAWS_REGION=\"{{ key \"squarescale/dev/aws/region\" }}\"\nAWS_ACCESS_KEY_ID=\"{{ key \"squarescale/dev/aws/access_key_id\" }}\"\nAWS_SECRET_ACCESS_KEY=\"{{ key \"squarescale/dev/aws/secret_access_key\" }}\"\nSQSC_TERRAFORM_SQS_INPUT_URL=\"{{ key \"squarescale/dev/deploy_sqs\" }}\"\n",
                "Envvars": true,
                "LeftDelim": "{{",
                "Perms": "0644",
                "RightDelim": "}}",
                "SourcePath": "",
                "Splay": 5000000000,
                "VaultGrace": 300000000000
              },
              {
                "ChangeMode": "restart",
                "ChangeSignal": "",
                "DestPath": "restart",
                "EmbeddedTmpl": " {{keyOrDefault (printf \"restart/%s\" (env \"NOMAD_JOB_NAME\")) \"\"}}\n {{keyOrDefault (printf \"restart/%s/%s\" (env \"NOMAD_JOB_NAME\") (env \"NOMAD_TASK_NAME\")) \"\"}}\n {{keyOrDefault (printf \"restart/%s/%s/%s\" (env \"NOMAD_JOB_NAME\") (env \"NOMAD_TASK_NAME\") (env \"NOMAD_ALLOC_INDEX\")) \"\"}}\n {{keyOrDefault (printf \"restart/%s\" (env \"NOMAD_ALLOC_NAME\")) \"\"}}\n",
                "Envvars": false,
                "LeftDelim": "{{",
                "Perms": "0644",
                "RightDelim": "}}",
                "SourcePath": "",
                "Splay": 5000000000,
                "VaultGrace": 300000000000
              }
            ],
            "User": "",
            "Vault": null
          }
        ],
        "Update": {
          "AutoRevert": false,
          "Canary": 0,
          "HealthCheck": "checks",
          "HealthyDeadline": 300000000000,
          "MaxParallel": 1,
          "MinHealthyTime": 10000000000,
          "Stagger": 10000000000
        }
      }
    ],
    "Type": "service",
    "Update": {
      "AutoRevert": false,
      "Canary": 0,
      "HealthCheck": "",
      "HealthyDeadline": 0,
      "MaxParallel": 1,
      "MinHealthyTime": 0,
      "Stagger": 10000000000
    },
    "VaultToken": "",
    "Version": 0
  }
}
The Consul keys are available:
$ bin/consul-0.8.3 kv get squarescale/dev/aws/region
eu-west-1
$ bin/consul-0.8.3 kv get squarescale/dev/nomad_version
0.6.2
$ bin/consul-0.8.3 kv get squarescale/dev/aws/access_key_id
xxxxxxxxxxxxxxxxxxxxx
$ bin/consul-0.8.3 kv get squarescale/dev/aws/secret_access_key
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
$ bin/consul-0.8.3 kv get squarescale/dev/deploy_sqs
https://sqs.eu-west-1.amazonaws.com/xxxxxxxxxxxxxxxxxx/terraform-deploy-shanti
The environment template file is being written to disk correctly:
$ nomad fs bc4e33c0 deploy/environment
SQSC_ENVIRONMENT="staging"
SQSC_TFSTATE_BUCKET="sqsc-tfstate-staging"
NOMAD_VERSION="0.6.2"
AWS_REGION="eu-west-1"
AWS_ACCESS_KEY_ID="xxxxxxxxxxx"
AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
SQSC_TERRAFORM_SQS_INPUT_URL="https://sqs.eu-west-1.amazonaws.com/xxxxxx/terraform-deploy-shanti"
deploy/executor.out shows only repeated lines like this one:
2017/11/06 18:45:03.820637 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin198853221
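To confirm the variables really are missing inside the container (and not just unset by the entrypoint), the container environment can also be checked directly on the Docker host, assuming CLI access to the client node (the container ID is the one from the driver.docker log lines above):
$ docker inspect --format '{{ .Config.Env }}' 8ce48f04f15f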
I just ran into the same problem on 0.7.
I have a job with an env stanza and two templates: one for values from Consul and one for secrets.
I can see the secrets in the rendered file, but not in the environment variables.
@dadgar I have the reproduction steps:
The problem occurs when there are two templates that have env = true.
After one of them is re-rendered, Nomad restarts the task with the environment variables from only that template.
Here is the job definition:
job "envtest" {
datacenters = ["staging-us-east-1"]
region = "staging-us-east-1"
type = "service"
update {
stagger = "10s"
max_parallel = 1
}
group "envtest" {
count = 1
task "envtest" {
driver = "exec"
constraint {
attribute = "${attr.kernel.name}"
value = "linux"
}
config {
command = "sleep"
args = [
"600"
]
}
template {
data = <<EOH
# Lines starting with a # are ignored
# Empty lines are also ignored
VLAD="FROG"
EOH
destination = "local/0.env"
env = true
}
template {
data = <<EOH
test1="{{key "test1"}}"
EOH
destination = "local/1.env"
env = true
}
template {
data = <<EOH
test2="{{key "test2"}}"
EOH
destination = "local/2.env"
env = true
}
resources {
cpu = 20
memory = 10
network {
mbits = 10
}
}
}
}
}
Now when you start the job, the task receives both the test1 and test2 environment variables.
If you then update the test1 key in Consul, the restarted task receives only the test1 variable; update test2 and it receives only test2.
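For example, a minimal command-line sequence to trigger the re-render could look like this (a sketch; it assumes local Consul and Nomad agents, and the key values and job file name are arbitrary):
$ consul kv put test1 one
$ consul kv put test2 two
$ nomad run envtest.nomad       # first start: both test1 and test2 are in the environment
$ consul kv put test1 one-bis   # re-renders local/1.env and restarts the task;
                                # after the restart, only test1 is present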
If I understand this right, this is happening because of this: https://github.com/hashicorp/nomad/blob/master/client/consul_template.go#L397
The question is what the right solution is: restrict jobs to no more than one template that injects environment variables, or build the logic to merge them?
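For illustration, the merge approach could look roughly like this (only a sketch, not the actual Nomad code; the package name, the readEnvFile helper, and the list of rendered files are hypothetical stand-ins for however the task runner collects its env templates):

package envmerge

import (
    "bufio"
    "os"
    "strings"
)

// readEnvFile parses one rendered KEY=value template file (hypothetical helper).
func readEnvFile(path string) (map[string]string, error) {
    vars := map[string]string{}

    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        line := strings.TrimSpace(scanner.Text())
        // skip blank lines and comments, as the template docs describe
        if line == "" || strings.HasPrefix(line, "#") {
            continue
        }
        if parts := strings.SplitN(line, "=", 2); len(parts) == 2 {
            vars[parts[0]] = strings.Trim(parts[1], `"`)
        }
    }
    return vars, scanner.Err()
}

// mergeEnvTemplates merges every rendered env template instead of keeping
// only the variables from the last re-rendered one; later files win on
// conflicting keys.
func mergeEnvTemplates(paths []string) (map[string]string, error) {
    merged := map[string]string{}
    for _, path := range paths {
        vars, err := readEnvFile(path)
        if err != nil {
            return nil, err
        }
        for k, v := range vars {
            merged[k] = v
        }
    }
    return merged, nil
}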
See the JSON Nomad job in https://github.com/hashicorp/nomad/issues/3498#issuecomment-342228986: it has two templates, but only one of them is an environment template. Still, it doesn't always work.
Regarding the two-environment-templates issue, I don't see a problem with having Nomad restrict jobs to a single environment template. But if it is not difficult, implementing a merge mechanism might be a better solution (no edge case to check when accepting jobs).
this is even more fun, and awful!
@mildred I was never able to reproduce with either of the job files you gave. I was able to reproduce with the description @burdandrei gave.
I have attached a Linux AMD64 build that has the fix. I would love if you could test and tell me if I have fixed your issue too.
Thanks for super fast fix @dadgar !
I noticed the problem occurred mostly when I changed the Consul key restart/deploy, which triggers a change in the restart template (which is not an environment template). The job then restarts with missing environment variables. Stopping the job completely (nomad stop) and restarting it (nomad run) reran the job correctly in all cases I observed.
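In other words, touching the key behind the non-environment template is enough to trigger the broken restart, for example (a sketch; the value written is arbitrary):
$ consul kv put restart/deploy 1    # re-renders the restart template and restarts the deploy task;
                                    # the restarted container is then missing its template-provided env vars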
I'll check this version and report back if I still have issues.
@dadgar it seems to work well. The cases that used to trigger this bug no longer do. Thank you.
@burdandrei @mildred Thank you for testing and for the super fast feedback!