Nomad: Elixir applications are not able to access files in the job task directory.

Created on 20 Apr 2018 · 6 comments · Source: hashicorp/nomad

Nomad versions

Nomad v0.8.1 (46aa11ba851f114f5faa0876839b45fbca0c47f0)
(also 0.8.0)
(worked fine in Nomad v0.7.1 (0b295d399d00199cfab4621566babd25987ba06e))

Operating system and Environment details

Debian 8

Issue

After upgrading to Nomad 0.8.0 (and then 0.8.1 as a test), my Elixir applications are no longer able to read files that are placed in the allocation directory during the "downloading artifacts" phase of the deployment (I get a "file not found" type of error).

These applications worked just fine under version 0.7.1.

The actual error I get is:

```
** (Conform.Schema.SchemaError) Schema at /local/org_service/releases/0.0.1/org_service.schema.exs doesn't exist!
    (conform) lib/conform/schema.ex:134: Conform.Schema.load!/1
    (conform) lib/conform.ex:95: Conform.process/1
    (elixir) lib/kernel/cli.ex:105: anonymous fn/3 in Kernel.CLI.exec_fun/2
```

but the file does exist:

```
-rw-r--r-- 1 root root 9193 Apr 20 13:15 /local/org_service/releases/0.0.1/org_service.schema.exs
```

To resolve this, I can manually run through and "touch" all the files that the application will need to read. Once I do that, the application is able to read the files and start up.
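The manual workaround can be sketched as a short shell snippet. This is a sketch under assumptions: the path mirrors this job's `local/org_service` layout but is recreated under `/tmp` here so the snippet is self-contained; on a real node you would run the `find`/`touch` inside the allocation's task directory instead.

```shell
# Stand-in for the task directory (hypothetical path, for the demo only).
mkdir -p /tmp/demo/local/org_service/releases/0.0.1
echo 'schema' > /tmp/demo/local/org_service/releases/0.0.1/org_service.schema.exs

# The workaround: touch every file the application will need to read.
find /tmp/demo/local/org_service -type f -exec touch {} +

# Each file now has a fresh access time (printed as seconds since the epoch).
stat -c '%X' /tmp/demo/local/org_service/releases/0.0.1/org_service.schema.exs
```

As the resolution later in the thread notes, it is the files' access time that matters, so `touch -a` (which updates only the access time) would work equally well.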

(note that I have also opened a support ticket for this issue - sorry about the duplicate)

Nomad Server logs (if appropriate)

There are no server error logs

Nomad Client logs (if appropriate)

There are no client error logs

Job file (if appropriate)

```hcl
job "org-service-stag" {
datacenters = ["awse"]
type = "service"
constraint {
attribute = "${meta.role}"
value = "api-cluster"
}
constraint {
attribute = "${meta.env}"
value = "stag"
}

# set our update policy

update {
max_parallel = 2
health_check = "checks"
min_healthy_time = "30s"
healthy_deadline = "3m"
auto_revert = false
#canary = 1
#stagger = "30s"
}

reschedule {
delay = "30s"
delay_function = "exponential"
max_delay = "5m"
unlimited = true
}

group "app" {
# set our restart policy
restart {
interval = "1m"
attempts = 2
delay = "15s"
mode = "fail"
}
count = 2

# needed for increased log file size
ephemeral_disk {
  size    = "2600"
}

task "org-service" {
  leader = true
  # grab our files
  artifact {
    source = "https://<url>/org-service-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.tar.gz"
    destination = "local/org_service"
  }
  artifact {
    # for development
    source = "https://<url>/org-service-config-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.conf.tmpl"
  }
  # turn it into the correct config
  template {
    source = "local/org-service-config-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.conf.tmpl"
    # the underscore in org_service below is intentional, since that's the technical name of the application
    destination = "local/org_service/releases/0.0.1/org_service.conf"
    change_mode = "restart"
    splay = "10m"
    vault_grace = "15m"
    perms = "664"
  }
  artifact {
    # for development
    source = "https://<url>/org-service-vm-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.args.tmpl"
  }
  # turn it into the correct config
  template {
    source = "local/org-service-vm-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.args.tmpl"
    destination = "local/org_service/vm.args"
    change_mode = "restart"
    splay = "10m"
    vault_grace = "15m"
    perms = "664"
  }
  # set our environment variables
  env {
    CHEF_ENV = "${meta.env}"
    APP_NAME = "org-service"
    LOCAL_HOSTNAME = "${node.unique.name}"
    # we need this so it doesn't try to write into the application
    RELEASE_MUTABLE_DIR="/local/run_dir"
    PORT = "${NOMAD_PORT_app}"
    ERL_CRASH_DUMP="/alloc/logs/erl_crash.dump"
    ERL_EPMD_PORT = "${NOMAD_PORT_epmd}"
    # when a new deploy runs, we haven't yet set the deploy_version to the new value, so we need to specify the GIT_HASH that we are
    # using for the job and templates
    GIT_HASH = "b1fdcac2c596935f49c29aba2e630b97f5f6e28f"
  }
  # grant access to secrets
  vault {
    policies = [ "app-stag-org-service" ]
    change_mode = "noop"
  }
  # run our app
  driver = "exec"
  config {
    command = "local/org_service/bin/org_service"
    args = [ "foreground" ]
  }
  resources {
    cpu    = 2000
    memory = 3000
    network {
      port "app" {}
      port "admin" {}
      port "epmd" {
        static = "11001"
      }
    }
  }

  logs {
    max_files     = 5
    max_file_size = 500
  }

  # add in service discovery
  service {
    name = "org-service"
    # for now we use both <context>__<data> and <data> formats
    tags = [
      "${node.unique.name}", "host__${node.unique.name}",
      "b1fdcac2c596935f49c29aba2e630b97f5f6e28f", "version__b1fdcac2c596935f49c29aba2e630b97f5f6e28f",
      "${meta.env}", "env__${meta.env}",
      "${meta.env}-api-cluster-prefix-/v1/organizations",
      "${meta.env}-api-cluster-prefix-/swagger-orgs.json",
      "consuldogConfig:org-service-http_check.yaml.tmpl:http_check",
      "consuldogConfig:org-service-process.yaml.tmpl:process"
    ]

    port = "app"

    check {
      name = "app"
      path     = "/v1/organizations/monitor/ping"
      initial_status = "critical"
      type     = "http"
      protocol = "http"
      port     = "app"
      interval = "10s"
      timeout  = "2s"
    }
  }

  # add in service discovery so we can find the admin port from consul
  service {
    name = "org-service-admin"
    tags = [
      "${node.unique.name}",
      "host__${node.unique.name}",
      "b1fdcac2c596935f49c29aba2e630b97f5f6e28f", "version__b1fdcac2c596935f49c29aba2e630b97f5f6e28f",
      "${meta.env}", "env__${meta.env}",
      "consuldogConfig:org-service-admin.yaml.tmpl:admin"
    ]

    port = "admin"

    check {
      name = "admin"
      initial_status = "critical"
      type     = "tcp"
      port     = "app"
      interval = "10s"
      timeout  = "2s"
    }
  }
}

task "log-shipper" {
  # grab our config file template
  artifact {
    # for development
    source = "https://<url>/org-service-remote-syslog2-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.yml.tmpl"
  }
  # turn it into the correct config
  template {
    source = "local/org-service-remote-syslog2-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.yml.tmpl"
    destination = "local/remote-syslog2.yml"
    change_mode = "noop"
    perms = "664"
  }
  # set our environment variables
  env {
    CHEF_ENV = "${meta.env}"
    APP_NAME = "org-service"
    LOCAL_HOSTNAME = "${node.unique.name}"
    LOG_TASK_NAME = "org-service"
    # when a new deploy runs, we haven't yet set the deploy_version to the new value, so we need to specify the GIT_HASH that we are
    # using for the job and templates
    GIT_HASH = "b1fdcac2c596935f49c29aba2e630b97f5f6e28f"
  }
  # grant access to secrets
  driver = "exec"
  config {
    command = "/usr/local/bin/remote_syslog"
    args = [ "-c", "/local/remote-syslog2.yml", "-D" ]
  }
  resources {
    cpu    = 100
    memory = 100
  }
}

}
}
```

Labels: stage/needs-investigation, type/bug


All 6 comments

Thanks for reporting this.

https://github.com/hashicorp/nomad/pull/4129/files looks like it could be related to this regression.

As a workaround, would you be able to make the file being downloaded world-readable? The changes we made in #4129 should preserve the original permissions.

Hi @preetapan

Thanks for the reply! Unfortunately, I don't think it's a permissions issue, as the files and directories are world-readable all the way up the chain:

```
-rw-r--r-- 1 root root 8635 Apr 16 12:32 local/org_service/releases/0.0.1/org_service.schema.exs
drwxr-xr-x 5 root root 4096 Apr 16 17:04 local/org_service/releases/0.0.1/
drwxr-xr-x 3 root root 4096 Apr 16 12:32 local/org_service/releases
drwxr-xr-x 6 root root 4096 Apr 16 12:32 local/org_service
drwxrwxrwx 4 nobody nogroup 4096 Apr 16 12:32 local
```

Also, when I "touch" the file (no permissions change at all), it becomes accessible to the service.

Thanks!

@dansteen Could you share the same permission breakdown when you were running it under 0.7.1? Also, as a test, could you run a batch version of that job (on 0.8.1) with the command set to `cat` and that file as the argument? I wonder if any application can read the file.

Here is a super-simple test application that will demonstrate this issue:
hello_world.tar.gz

If you can't use the binary, or you want to test in other box types, here is the code that generates that application bundle:
hello_world_repo.tar.gz

To build it, you will need the following packages (on debian):
```shell
apt-get install elixir erlang-dev erlang-parsetools erlang-eunit erlang-xmerl
```

Then run the following commands:

```shell
mix deps.get
MIX_ENV=prod mix do compile, release --env=prod
```

The binary bundle will then be located in `_build/prod/rel/hello_world/releases/0.1.0/hello_world.tar.gz`

And here is an associated job file that will demonstrate this issue:

```hcl
job "test" {
  datacenters = ["awse"]
  type        = "service"

  constraint {
    attribute = "${meta.role}"
    value     = "api-cluster"
  }

  constraint {
    attribute = "${meta.env}"
    value     = "load"
  }

  # set our update policy
  update {
    max_parallel     = 1
    health_check     = "checks"
    min_healthy_time = "30s"
    healthy_deadline = "3m"
    auto_revert      = false

    #canary           = 1
    #stagger          = "30s"
  }

  reschedule {
    delay          = "30s"
    delay_function = "exponential"
    max_delay      = "5m"
    unlimited      = true
  }

  group "test" {
    # set our restart policy
    restart {
      interval = "1m"
      attempts = 2
      delay    = "15s"
      mode     = "fail"
    }

    count = 1

    task "test" {
      leader = true

      # grab our files
      artifact {
        source      = "https://<url to archive>/hello_world.tar.gz"
        destination = "local/hello_world"
      }

      # set our environment variables
      env {
        CHEF_ENV       = "${meta.env}"
        APP_NAME       = "org-service"
        LOCAL_HOSTNAME = "${node.unique.name}"

        # we need this so it doesn't try to write into the application
        RELEASE_MUTABLE_DIR = "/local/run_dir"
        PORT                = "${NOMAD_PORT_app}"
        ERL_CRASH_DUMP      = "/alloc/logs/erl_crash.dump"
        ERL_EPMD_PORT       = "${NOMAD_PORT_epmd}"
      }

      # run our app
      driver = "exec"

      config {
        command = "local/hello_world/bin/hello_world"
        args    = ["foreground"]
      }

      resources {
        cpu    = 200
        memory = 300

        network {
          port "app" {}
          port "admin" {}

          port "epmd" {
            static = "11001"
          }
        }
      }
    }
  }
}
```

The errors show up in stderr:

```
2018-04-23 12:53:45 std_error
** (Conform.Schema.SchemaError) Schema at /local/hello_world/releases/0.1.0/hello_world.schema.exs doesn't exist!
    (conform) lib/conform/schema.ex:134: Conform.Schema.load!/1
    (conform) lib/conform.ex:95: Conform.process/1
    (elixir) lib/kernel/cli.ex:105: anonymous fn/3 in Kernel.CLI.exec_fun/2
```

A successful run will generate the following log line:

```
==> Generated sys.config in /tmp/hello_world_sample/_build/prod/rel/hello_world/var
```

Some interesting tests

  1. If you run `cp /local/hello_world/releases/0.1.0/hello_world.schema.exs /<any place at all>`, the next time the job restarts it will be able to find the file. (Notice that we don't change anything about the file at all; it's just a forced read via a copy.)

  2. This error happens when you run the command inside a folder that Nomad creates as part of an allocation. However, you don't have to run the command from within Nomad to get this error. If you just cd into the allocation directory and run the command manually (no chroot or anything), you get the same error. As long as the command is run from within the allocation directory, it has issues.

Thanks!

Hi @dadgar !

Here is the permissions breakdown under 0.7.1:

```
-rw-r--r-- 1 root root 9193 Apr 23 10:49 local/org_service/releases/0.0.1/org_service.schema.exs
drwxr-xr-x 5 root root 4096 Apr 23 10:49 local/org_service/releases/0.0.1/
drwxr-xr-x 3 root root 4096 Apr 23 10:49 local/org_service/releases/
drwxr-xr-x 6 root root 4096 Apr 23 10:49 local/org_service
drwxrwxrwx 4 nobody nogroup 4096 Apr 23 10:49 local
```

Here is the outcome of running `cat` as the command:

```
@moduledoc """
A schema is a keyword list which represents how to map, transform, and validate
...
<lots of stuff>
```

So it seems that the file is there and can be generally read.

Thanks!

@dansteen Thanks for the reproducer. We were able to track it down to an issue when the archive did not have access time set. This has been fixed and will be pulled into Nomad 0.8.2 which will be releasing shortly!
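The mechanism can be illustrated from the shell. This is a sketch, not Nomad's actual extraction code, and it assumes GNU `touch`/`stat` (as on Debian): a file whose access time was never set reads back as the Unix epoch, and the `touch` workaround reported above restores a sane value.

```shell
f=/tmp/atime_demo.txt
echo hi > "$f"

# Simulate an extracted file whose archive header carried no access time.
touch -a -d '@0' "$f"     # set atime to the Unix epoch
stat -c '%X' "$f"         # prints 0

# The workaround from earlier in the thread: a plain touch resets the atime.
touch -a "$f"
stat -c '%X' "$f"         # prints a current timestamp
```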

