Nomad: environment template is not always working

Created on 3 Nov 2017 · 13 comments · Source: hashicorp/nomad

Nomad version

Nomad v0.6.2

Operating system and Environment details

CoreOS

Issue

When a Nomad job with a task that uses the docker driver has both an environment section (env { "KEY" = "value" }) and an environment template, the environment section is used to populate the Docker container environment but the environment template is ignored. Sometimes, however, the template does work.

Reproduction steps

Run the Nomad job below; sometimes you don't get the environment variables from the template.

Job file (if appropriate)

The job file is similar to this:

job "test" {
  datacenters = ["staging-sqsc"]

  type = "service"

  constraint {
    attribute = "${meta.sqsc.environment}"
    value     = "staging"
  }

  constraint {
    attribute = "${meta.sqsc.project}"
    value     = "sqsc"
  }

  update {
    stagger = "10s"
    max_parallel = 1
  }

  group "test" {

    count = 1

    task "test" {
      driver = "docker"

      config {
        image = "alpine"
        args = [ "/bin/sh", "-c", "date; env; echo; cat" ]
      }

      resources {
        cpu = 100
        memory = 512
        network {
          mbits = 10
        }
      }

      env {
        "ENV_FROM_ENV" = "ok"
      }

      template {
        destination = "environment"
        env = true
        data = <<TEMPLATE
ENV_FROM_TEMPLATE={{ key "sample-consul-key" }}
TEMPLATE
      }

    }
  }
}

(You need to set the consul key sample-consul-key and have a Nomad node that matches the constraints.)

Labels: theme/template, type/bug

All 13 comments

Thanks for reporting this issue. Do you have more information about when the environment template is and isn't ignored? If you have logs that are relevant to this issue, feel free to add them here.

@mildred Could you please share DEBUG level logs of the client where this occurs.

I have the following Nomad client logs. This occurs even though the environment section of the job file is empty and all variables are set using the environment template instead.

nomad logs

    2017/11/06 18:31:04.988262 [INFO] client: task "deploy" for alloc "bc4e33c0-aed2-bd19-1798-e06216f5876a" failed: Wait returned exit code 1, signal 0, and error Docker container exited with non-zero exit code: 1
    2017/11/06 18:31:04.988317 [INFO] client: Restarting task "deploy" for alloc "bc4e33c0-aed2-bd19-1798-e06216f5876a" in 16.362625637s
    2017/11/06 18:31:04.990046 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
    2017/11/06 18:31:05.138308 [DEBUG] client: updated allocations at index 4602 (total 14) (pulled 0) (filtered 14)
    2017/11/06 18:31:05.138530 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
    2017/11/06 18:31:07.154924 [DEBUG] http: Request /v1/agent/servers (45.587µs)
    2017/11/06 18:31:07.156368 [DEBUG] http: Request /v1/status/peers (170.845µs)
    2017/11/06 18:31:08.768912 [ERR] http: Request /v1/node/172857b9-3b5f-bd0a-b09d-545ac3ceb6dc?index=6&wait=1m, error: node not found
    2017/11/06 18:31:08.769380 [DEBUG] http: Request /v1/node/172857b9-3b5f-bd0a-b09d-545ac3ceb6dc?index=6&wait=1m (1m2.208570212s)
    2017/11/06 18:31:09.107824 [INFO] client: task "register" for alloc "6decdf71-8e82-438b-4365-c85ec05ab800" completed successfully
    2017/11/06 18:31:09.107876 [INFO] client: Restarting task "register" for alloc "6decdf71-8e82-438b-4365-c85ec05ab800" in 5.69028103s
    2017/11/06 18:31:09.107836 [INFO] client: task "forward" for alloc "6decdf71-8e82-438b-4365-c85ec05ab800" failed: Wait returned exit code 255, signal 0, and error 
    2017/11/06 18:31:09.107968 [INFO] client: Restarting task "forward" for alloc "6decdf71-8e82-438b-4365-c85ec05ab800" in 5.69028103s
    2017/11/06 18:31:09.110549 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
    2017/11/06 18:31:09.112986 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
    2017/11/06 18:31:09.337888 [DEBUG] client: updated allocations at index 4603 (total 14) (pulled 0) (filtered 14)
    2017/11/06 18:31:09.337955 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
    2017/11/06 18:31:11.742120 [INFO] client: task "forward" for alloc "597a9be9-af7e-b8ea-eb14-41b66d98ec92" failed: Wait returned exit code 255, signal 0, and error 
    2017/11/06 18:31:11.742181 [INFO] client: Restarting task "forward" for alloc "597a9be9-af7e-b8ea-eb14-41b66d98ec92" in 5.346224508s
    2017/11/06 18:31:11.742517 [INFO] client: task "register" for alloc "597a9be9-af7e-b8ea-eb14-41b66d98ec92" completed successfully
    2017/11/06 18:31:11.742567 [INFO] client: Restarting task "register" for alloc "597a9be9-af7e-b8ea-eb14-41b66d98ec92" in 5.346224508s
    2017/11/06 18:31:11.745454 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
    2017/11/06 18:31:11.747612 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
    2017/11/06 18:31:11.938454 [DEBUG] client: updated allocations at index 4604 (total 14) (pulled 0) (filtered 14)
    2017/11/06 18:31:11.938588 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
    2017/11/06 18:31:12 [DEBUG] memberlist: TCP connection from=127.0.0.1:36034
    2017/11/06 18:31:14.819411 [DEBUG] driver.raw_exec: started process with pid: 7231
    2017/11/06 18:31:14.830486 [DEBUG] driver.raw_exec: started process with pid: 7275
    2017/11/06 18:31:14.937910 [DEBUG] client: updated allocations at index 4606 (total 14) (pulled 0) (filtered 14)
    2017/11/06 18:31:14.937994 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
    2017/11/06 18:31:15.137896 [DEBUG] client: updated allocations at index 4607 (total 14) (pulled 0) (filtered 14)
    2017/11/06 18:31:15.137968 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
    2017/11/06 18:31:15.551060 [DEBUG] http: Request /v1/job/svc-qa-noenv-20171002180522-squarescale-machine-sqsc-demo/allocations?index=4588&wait=1m (1m0.939825811s)
    2017/11/06 18:31:16.552631 [DEBUG] http: Request /v1/job/svc-qa-noenv-20171002180522-squarescale-machine-sqsc-demo/allocations?index=4588&wait=1m (673.628µs)
    2017/11/06 18:31:16.943638 [DEBUG] http: Request /v1/job/svc-qa-noenv-20171002172154-squarescale-machine-hello/allocations?index=4588&wait=1m (1m1.648431515s)
    2017/11/06 18:31:17.114875 [DEBUG] driver.raw_exec: started process with pid: 7451
    2017/11/06 18:31:17.120173 [DEBUG] driver.raw_exec: started process with pid: 7455
    2017/11/06 18:31:17.156304 [DEBUG] http: Request /v1/agent/servers (34.34µs)
    2017/11/06 18:31:17.157408 [DEBUG] http: Request /v1/status/peers (121.718µs)
    2017/11/06 18:31:17.338751 [DEBUG] client: updated allocations at index 4608 (total 14) (pulled 0) (filtered 14)
    2017/11/06 18:31:17.338891 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
    2017/11/06 18:31:17.554116 [DEBUG] http: Request /v1/job/svc-qa-noenv-20171002180522-squarescale-machine-sqsc-demo/allocations?index=4607&wait=1m (566.876µs)
    2017/11/06 18:31:17.945171 [DEBUG] http: Request /v1/job/svc-qa-noenv-20171002172154-squarescale-machine-hello/allocations?index=4588&wait=1m (601.115µs)
    2017/11/06 18:31:21.382134 [DEBUG] driver.docker: Setting default logging options to syslog and unix:///tmp/plugin253912072
    2017/11/06 18:31:21.382179 [DEBUG] driver.docker: Using config for logging: {Type:syslog ConfigRaw:[] Config:map[syslog-address:unix:///tmp/plugin253912072]}
    2017/11/06 18:31:21.382192 [DEBUG] driver.docker: using 536870912 bytes memory for deploy
    2017/11/06 18:31:21.382200 [DEBUG] driver.docker: using 100 cpu shares for deploy
    2017/11/06 18:31:21.382237 [DEBUG] driver.docker: binding directories []string{"/home/mildred/Work/SquareScale/squarescale-infrastructure/dev/nomad-data/alloc/bc4e33c0-aed2-bd19-1798-e06216f5876a/alloc:/alloc", "/home/mildred/Work/SquareScale/squarescale-infrastructure/dev/nomad-data/alloc/bc4e33c0-aed2-bd19-1798-e06216f5876a/deploy/local:/local", "/home/mildred/Work/SquareScale/squarescale-infrastructure/dev/nomad-data/alloc/bc4e33c0-aed2-bd19-1798-e06216f5876a/deploy/secrets:/secrets", "/home/mildred/Work/SquareScale/squarescale-infrastructure/dev/squarescale-deploy/terraform:/terraform"} for deploy
    2017/11/06 18:31:21.382261 [DEBUG] driver.docker: networking mode not specified; defaulting to bridge
    2017/11/06 18:31:21.382288 [DEBUG] driver.docker: setting container name to: deploy-bc4e33c0-aed2-bd19-1798-e06216f5876a
    2017/11/06 18:31:21.978340 [INFO] driver.docker: created container 8ce48f04f15f9c4c8b9015d6f0fc757f05ac1e21229d1422f0e54e6882004c6a
    2017/11/06 18:31:22.502623 [INFO] driver.docker: started container 8ce48f04f15f9c4c8b9015d6f0fc757f05ac1e21229d1422f0e54e6882004c6a
    2017/11/06 18:31:22.543870 [WARN] client: error fetching stats of task deploy: stats collection hasn't started yet
    2017/11/06 18:31:22 [DEBUG] memberlist: TCP connection from=127.0.0.1:36054
    2017/11/06 18:31:22.737883 [DEBUG] client: updated allocations at index 4609 (total 14) (pulled 0) (filtered 14)
    2017/11/06 18:31:22.737986 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
    2017/11/06 18:31:23.090949 [INFO] client: task "deploy" for alloc "bc4e33c0-aed2-bd19-1798-e06216f5876a" failed: Wait returned exit code 1, signal 0, and error Docker container exited with non-zero exit code: 1


 

The task named deploy with alloc bc4e33c0 is the Nomad job that has missing environment variables. At the moment it keeps restarting because it is missing the AWS_REGION environment variable, and the Docker container exits with status 1.

nomad inspect deploy shows:

json job (nomad inspect)
{
"Job": {
"AllAtOnce": false,
"Constraints": [
{
"LTarget": "${meta.sqsc.environment}",
"Operand": "=",
"RTarget": "dev"
},
{
"LTarget": "${meta.sqsc.project}",
"Operand": "=",
"RTarget": "sqsc"
}
],
"CreateIndex": 42,
"Datacenters": [
"dev-sqsc"
],
"ID": "deploy",
"JobModifyIndex": 42,
"Meta": null,
"ModifyIndex": 92,
"Name": "deploy",
"ParameterizedJob": null,
"ParentID": "",
"Payload": null,
"Periodic": null,
"Priority": 50,
"Region": "global",
"Stable": true,
"Status": "running",
"StatusDescription": "",
"Stop": false,
"SubmitTime": 1509728475312274239,
"TaskGroups": [
{
"Constraints": null,
"Count": 1,
"EphemeralDisk": {
"Migrate": false,
"SizeMB": 300,
"Sticky": false
},
"Meta": null,
"Name": "deploy",
"RestartPolicy": {
"Attempts": 2,
"Delay": 15000000000,
"Interval": 60000000000,
"Mode": "delay"
},
"Tasks": [
{
"Artifacts": null,
"Config": {
"force_pull": false,
"image": "095348363195.dkr.ecr.eu-west-1.amazonaws.com/squarescale-deploy:dev-latest",
"volumes": [
"/home/mildred/Work/SquareScale/squarescale-infrastructure/dev/squarescale-deploy/terraform:/terraform"
],
"dns_servers": [
"172.17.0.1"
]
},
"Constraints": null,
"DispatchPayload": null,
"Driver": "docker",
"Env": null,
"KillTimeout": 5000000000,
"Leader": false,
"LogConfig": {
"MaxFileSizeMB": 10,
"MaxFiles": 10
},
"Meta": null,
"Name": "deploy",
"Resources": {
"CPU": 100,
"DiskMB": 0,
"IOPS": 0,
"MemoryMB": 512,
"Networks": [
{
"CIDR": "",
"Device": "",
"DynamicPorts": null,
"IP": "",
"MBits": 10,
"ReservedPorts": null
}
]
},
"Services": null,
"ShutdownDelay": 0,
"Templates": [
{
"ChangeMode": "restart",
"ChangeSignal": "",
"DestPath": "environment",
"EmbeddedTmpl": "\n\nSQSC_ENVIRONMENT=\"staging\"\nSQSC_TFSTATE_BUCKET=\"sqsc-tfstate-staging\"\n\n\nNOMAD_VERSION=\"{{ key \"squarescale/dev/nomad_version\" }}\"\nAWS_REGION=\"{{ key \"squarescale/dev/aws/region\" }}\"\nAWS_ACCESS_KEY_ID=\"{{ key \"squarescale/dev/aws/access_key_id\" }}\"\nAWS_SECRET_ACCESS_KEY=\"{{ key \"squarescale/dev/aws/secret_access_key\" }}\"\nSQSC_TERRAFORM_SQS_INPUT_URL=\"{{ key \"squarescale/dev/deploy_sqs\" }}\"\n",
"Envvars": true,
"LeftDelim": "{{",
"Perms": "0644",
"RightDelim": "}}",
"SourcePath": "",
"Splay": 5000000000,
"VaultGrace": 300000000000
},
{
"ChangeMode": "restart",
"ChangeSignal": "",
"DestPath": "restart",
"EmbeddedTmpl": " {{keyOrDefault (printf \"restart/%s\" (env \"NOMAD_JOB_NAME\")) \"\"}}\n {{keyOrDefault (printf \"restart/%s/%s\" (env \"NOMAD_JOB_NAME\") (env \"NOMAD_TASK_NAME\")) \"\"}}\n {{keyOrDefault (printf \"restart/%s/%s/%s\" (env \"NOMAD_JOB_NAME\") (env \"NOMAD_TASK_NAME\") (env \"NOMAD_ALLOC_INDEX\")) \"\"}}\n {{keyOrDefault (printf \"restart/%s\" (env \"NOMAD_ALLOC_NAME\")) \"\"}}\n",
"Envvars": false,
"LeftDelim": "{{",
"Perms": "0644",
"RightDelim": "}}",
"SourcePath": "",
"Splay": 5000000000,
"VaultGrace": 300000000000
}
],
"User": "",
"Vault": null
}
],
"Update": {
"AutoRevert": false,
"Canary": 0,
"HealthCheck": "checks",
"HealthyDeadline": 300000000000,
"MaxParallel": 1,
"MinHealthyTime": 10000000000,
"Stagger": 10000000000
}
}
],
"Type": "service",
"Update": {
"AutoRevert": false,
"Canary": 0,
"HealthCheck": "",
"HealthyDeadline": 0,
"MaxParallel": 1,
"MinHealthyTime": 0,
"Stagger": 10000000000
},
"VaultToken": "",
"Version": 0
}
}

 

consul keys are available:

$ bin/consul-0.8.3 kv get squarescale/dev/aws/region
eu-west-1
$ bin/consul-0.8.3 kv get squarescale/dev/nomad_version
0.6.2
$ bin/consul-0.8.3 kv get squarescale/dev/aws/access_key_id
xxxxxxxxxxxxxxxxxxxxx
$ bin/consul-0.8.3 kv get squarescale/dev/aws/secret_access_key
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
$ bin/consul-0.8.3 kv get squarescale/dev/deploy_sqs
https://sqs.eu-west-1.amazonaws.com/xxxxxxxxxxxxxxxxxx/terraform-deploy-shanti

The environment template file is being written to disk correctly:

$ nomad fs bc4e33c0 deploy/environment


SQSC_ENVIRONMENT="staging"
SQSC_TFSTATE_BUCKET="sqsc-tfstate-staging"


NOMAD_VERSION="0.6.2"
AWS_REGION="eu-west-1"
AWS_ACCESS_KEY_ID="xxxxxxxxxxx"
AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
SQSC_TERRAFORM_SQS_INPUT_URL="https://sqs.eu-west-1.amazonaws.com/xxxxxx/terraform-deploy-shanti"

deploy/executor.out shows only repeated lines like this one:

2017/11/06 18:45:03.820637 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin198853221

I just ran into the same problem on 0.7.
I have a job with an env stanza and two templates, one for things from Consul and one for secrets.
I can see the secrets in the rendered file, but not in the environment variables.

@dadgar I got the reproduction steps:
the problem occurs if there are two templates that have env = true.
After one of them gets re-rendered, Nomad restarts the task with the environment variables from only that template.

Here is the job definition:

job "envtest" {
  datacenters = ["staging-us-east-1"]
  region      = "staging-us-east-1"
  type        = "service"

  update {
    stagger      = "10s"
    max_parallel = 1
  }

  group "envtest" {
    count = 1

    task "envtest" {
      driver = "exec"

      constraint {
        attribute = "${attr.kernel.name}"
        value     = "linux"
      }


      config {
        command = "sleep"
        args    = [
          "600"
        ]
      }

      template {
        data = <<EOH
# Lines starting with a # are ignored

# Empty lines are also ignored
VLAD="FROG"
EOH

        destination = "local/0.env"
        env         = true
      }

      template {
        data = <<EOH
test1="{{key "test1"}}"
EOH

        destination = "local/1.env"
        env         = true
      }

      template {
        data = <<EOH
test2="{{key "test2"}}"
EOH

        destination = "local/2.env"
        env         = true
      }

      resources {
        cpu    = 20
        memory = 10

        network {
          mbits = 10
        }
      }
    }
  }
}

Now when you start the job you receive both the test1 and test2 environment variables.
If you then update the test1 key in Consul you'll receive only the test1 env var; update test2 and you'll receive only that one.

If I understand this correctly, this is happening because of this: https://github.com/hashicorp/nomad/blob/master/client/consul_template.go#L397

The question is what the right solution is: restrict jobs to no more than one template that injects environment variables, or build the logic to merge them?
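As a rough, standalone sketch of what the merge approach could look like (this is not Nomad's actual implementation; the helper names and the simplistic KEY=value parsing are illustrative only), the idea is to re-read every rendered env-template file and merge the resulting maps whenever any one of them changes, instead of rebuilding the task environment from only the template that fired:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// parseEnvFile reads KEY=value lines from a rendered template file,
// skipping blank lines and #-comments, and strips surrounding quotes.
func parseEnvFile(path string) (map[string]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	vars := map[string]string{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if k, v, ok := strings.Cut(line, "="); ok {
			vars[strings.TrimSpace(k)] = strings.Trim(strings.TrimSpace(v), `"`)
		}
	}
	return vars, scanner.Err()
}

// mergeEnvTemplates merges every rendered env-template file so that a
// re-render of one template cannot drop the variables produced by the
// others. Later files win on key collisions.
func mergeEnvTemplates(paths []string) (map[string]string, error) {
	merged := map[string]string{}
	for _, p := range paths {
		vars, err := parseEnvFile(p)
		if err != nil {
			return nil, err
		}
		for k, v := range vars {
			merged[k] = v
		}
	}
	return merged, nil
}

func main() {
	// The paths correspond to the template destinations in the job file above.
	env, err := mergeEnvTemplates([]string{"local/0.env", "local/1.env", "local/2.env"})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for k, v := range env {
		fmt.Printf("%s=%s\n", k, v)
	}
}

With something like this, updating only the test1 key in Consul would re-render local/1.env, but the rebuilt environment would still contain VLAD and test2 because all three files are merged again.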

See the JSON Nomad job in https://github.com/hashicorp/nomad/issues/3498#issuecomment-342228986: it has two templates, but only one of them is an environment template. Still, it doesn't always work.

About the two-environment-templates issue, I don't see a problem with having Nomad restrict jobs to a single environment template. But if it is not difficult, implementing a merge mechanism might be a better solution (no edge case to check when accepting jobs).

this is even more fun, and awful!

@mildred I was never able to reproduce this with either of the job files you gave. I was able to reproduce it with the description @burdandrei gave.

I have attached a Linux AMD64 build that has the fix. I would love it if you could test it and tell me whether it fixes your issue too.

nomad.zip

Thanks for super fast fix @dadgar !

I noticed the problem occurred mostly when I changed a Consul key in restart/deploy, which triggered a change in the restart template (not an environment template). The job then restarted with missing environment variables. Stopping the job completely (nomad stop) and restarting it (nomad run) reran the job correctly in all cases I observed.

I'll check this version and report back if I still have issues.

@dadgar it seems to work well. The cases where this bug happened don't seem to trigger the bug any more now. Thank you.

@burdandrei @mildred Thank you for testing and for the super fast feedback!
