Nomad v0.8.1 (46aa11ba851f114f5faa0876839b45fbca0c47f0)
(also 0.8.0)
(worked fine in Nomad v0.7.1 (0b295d399d00199cfab4621566babd25987ba06e))
Debian 8
After upgrading to Nomad 0.8.0 (and then 0.8.1 as a test), my Elixir applications are no longer able to read files placed in the allocation directory during the "downloading artifacts" phase of the deployment (I get a "file not found" type of error).
These applications worked just fine under version 0.7.1.
The actual error I get is:
```
** (Conform.Schema.SchemaError) Schema at /local/org_service/releases/0.0.1/org_service.schema.exs doesn't exist!
    (conform) lib/conform/schema.ex:134: Conform.Schema.load!/1
    (conform) lib/conform.ex:95: Conform.process/1
    (elixir) lib/kernel/cli.ex:105: anonymous fn/3 in Kernel.CLI.exec_fun/2
```
but the file does exist:
```
-rw-r--r-- 1 root root 9193 Apr 20 13:15 /local/org_service/releases/0.0.1/org_service.schema.exs
```
To resolve this, I can manually run through and "touch" all the files that the application will need to read. Once I do that, the application is able to read the files and start up.
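In case it helps anyone else hitting this: the manual workaround can be scripted. Just a sketch — the paths mirror my job file, and the sample file here is a stand-in so the snippet is self-contained:

```shell
#!/bin/sh
# Stand-in for the allocation task dir layout (paths are examples):
mkdir -p local/org_service/releases/0.0.1
echo "sample" > local/org_service/releases/0.0.1/org_service.schema.exs

# The workaround: touch every file the app will need to read.
find local/org_service -type f -exec touch {} +
```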
(note that I have also opened a support ticket for this issue - sorry about the duplicate)
There are no server error logs and no client error logs.
```
job "org-service-stag" {
  datacenters = ["awse"]
  type        = "service"

  constraint {
    attribute = "${meta.role}"
    value     = "api-cluster"
  }

  constraint {
    attribute = "${meta.env}"
    value     = "stag"
  }

  # set our update policy
  update {
    max_parallel     = 2
    health_check     = "checks"
    min_healthy_time = "30s"
    healthy_deadline = "3m"
    auto_revert      = false
    #canary  = 1
    #stagger = "30s"
  }

  reschedule {
    delay          = "30s"
    delay_function = "exponential"
    max_delay      = "5m"
    unlimited      = true
  }

  group "app" {
    # set our restart policy
    restart {
      interval = "1m"
      attempts = 2
      delay    = "15s"
      mode     = "fail"
    }

    count = 2

    # needed for increased log file size
    ephemeral_disk {
      size = "2600"
    }

    task "org-service" {
      leader = true

      # grab our files
      artifact {
        source      = "https://<url>/org-service-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.tar.gz"
        destination = "local/org_service"
      }

      artifact {
        # for development
        source = "https://<url>/org-service-config-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.conf.tmpl"
      }

      # turn it into the correct config
      template {
        source = "local/org-service-config-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.conf.tmpl"
        # the underscore in org_service below is intentional since that's the technical name of the application
        destination = "local/org_service/releases/0.0.1/org_service.conf"
        change_mode = "restart"
        splay       = "10m"
        vault_grace = "15m"
        perms       = "664"
      }

      artifact {
        # for development
        source = "https://<url>/org-service-vm-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.args.tmpl"
      }

      # turn it into the correct config
      template {
        source      = "local/org-service-vm-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.args.tmpl"
        destination = "local/org_service/vm.args"
        change_mode = "restart"
        splay       = "10m"
        vault_grace = "15m"
        perms       = "664"
      }

      # set our environment variables
      env {
        CHEF_ENV       = "${meta.env}"
        APP_NAME       = "org-service"
        LOCAL_HOSTNAME = "${node.unique.name}"
        # we need this so it doesn't try to write into the application
        RELEASE_MUTABLE_DIR = "/local/run_dir"
        PORT                = "${NOMAD_PORT_app}"
        ERL_CRASH_DUMP      = "/alloc/logs/erl_crash.dump"
        ERL_EPMD_PORT       = "${NOMAD_PORT_epmd}"
        # when a new deploy runs, we haven't yet set the deploy_version to the new value,
        # so we need to specify the GIT_HASH that we are using for the job and templates
        GIT_HASH = "b1fdcac2c596935f49c29aba2e630b97f5f6e28f"
      }

      # grant access to secrets
      vault {
        policies    = ["app-stag-org-service"]
        change_mode = "noop"
      }

      # run our app
      driver = "exec"

      config {
        command = "local/org_service/bin/org_service"
        args    = ["foreground"]
      }

      resources {
        cpu    = 2000
        memory = 3000

        network {
          port "app" {}
          port "admin" {}

          port "epmd" {
            static = "11001"
          }
        }
      }

      logs {
        max_files     = 5
        max_file_size = 500
      }

      # add in service discovery
      service {
        name = "org-service"
        # for now we use both <context>__<data> and <data> formats
        tags = [
          "${node.unique.name}", "host__${node.unique.name}",
          "b1fdcac2c596935f49c29aba2e630b97f5f6e28f", "version__b1fdcac2c596935f49c29aba2e630b97f5f6e28f",
          "${meta.env}", "env__${meta.env}",
          "${meta.env}-api-cluster-prefix-/v1/organizations",
          "${meta.env}-api-cluster-prefix-/swagger-orgs.json",
          "consuldogConfig:org-service-http_check.yaml.tmpl:http_check",
          "consuldogConfig:org-service-process.yaml.tmpl:process",
        ]
        port = "app"

        check {
          name           = "app"
          path           = "/v1/organizations/monitor/ping"
          initial_status = "critical"
          type           = "http"
          protocol       = "http"
          port           = "app"
          interval       = "10s"
          timeout        = "2s"
        }
      }

      # add in service discovery so we can find the admin port from consul
      service {
        name = "org-service-admin"
        tags = [
          "${node.unique.name}", "host__${node.unique.name}",
          "b1fdcac2c596935f49c29aba2e630b97f5f6e28f", "version__b1fdcac2c596935f49c29aba2e630b97f5f6e28f",
          "${meta.env}", "env__${meta.env}",
          "consuldogConfig:org-service-admin.yaml.tmpl:admin",
        ]
        port = "admin"

        check {
          name           = "admin"
          initial_status = "critical"
          type           = "tcp"
          port           = "app"
          interval       = "10s"
          timeout        = "2s"
        }
      }
    }

    task "log-shipper" {
      # grab our config file template
      artifact {
        # for development
        source = "https://<url>/org-service-remote-syslog2-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.yml.tmpl"
      }

      # turn it into the correct config
      template {
        source      = "local/org-service-remote-syslog2-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.yml.tmpl"
        destination = "local/remote-syslog2.yml"
        change_mode = "noop"
        perms       = "664"
      }

      # set our environment variables
      env {
        CHEF_ENV       = "${meta.env}"
        APP_NAME       = "org-service"
        LOCAL_HOSTNAME = "${node.unique.name}"
        LOG_TASK_NAME  = "org-service"
        # when a new deploy runs, we haven't yet set the deploy_version to the new value,
        # so we need to specify the GIT_HASH that we are using for the job and templates
        GIT_HASH = "b1fdcac2c596935f49c29aba2e630b97f5f6e28f"
      }

      driver = "exec"

      config {
        command = "/usr/local/bin/remote_syslog"
        args    = ["-c", "/local/remote-syslog2.yml", "-D"]
      }

      resources {
        cpu    = 100
        memory = 100
      }
    }
  }
}
```
Thanks for reporting this.
https://github.com/hashicorp/nomad/pull/4129/files looks like it could be related to this regression.
As a workaround, would you be able to make the files being downloaded world-readable? The changes we made in #4129 should preserve the original permissions.
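For example, something like this when building the artifact (directory and file names below are placeholders, not your actual release layout):

```shell
#!/bin/sh
# Stand-in release directory so the snippet is self-contained:
mkdir -p org_service/releases/0.0.1
echo "sample" > org_service/releases/0.0.1/org_service.schema.exs

# Make everything world-readable (and directories traversable) before archiving:
chmod -R a+rX org_service
tar -czf org-service.tar.gz org_service
```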
Hi @preetapan
Thanks for the reply! Unfortunately, I don't think it's a permissions issue, as all the files and directories are world-readable all the way up the chain:
```
-rw-r--r-- 1 root root    8635    Apr 16 12:32 local/org_service/releases/0.0.1/org_service.schema.exs
drwxr-xr-x 5 root root    4096    Apr 16 17:04 local/org_service/releases/0.0.1/
drwxr-xr-x 3 root root    4096    Apr 16 12:32 local/org_service/releases
drwxr-xr-x 6 root root    4096    Apr 16 12:32 local/org_service
drwxrwxrwx 4 nobody nogroup 4096  Apr 16 12:32 local
```
Also, when I do a "touch" on the file (no permissions change at all), it becomes accessible to the service.
Thanks!
@dansteen Could you share the same permission breakdown when running it under 0.7.1? Also, as a test, could you run a batch version of that job (on 0.8.1) with the command just being cat and the arg being that file? I wonder whether any application can read the file.
Here is a super-simple test application that will demonstrate this issue:
hello_world.tar.gz
If you can't use the binary, or you want to test in other box types, here is the code that generates that application bundle:
hello_world_repo.tar.gz
To build it, you will need the following packages (on Debian):

```
apt-get install elixir erlang-dev erlang-parsetools erlang-eunit erlang-xmerl
```

Then run the following commands:

```
mix deps.get
MIX_ENV=prod mix do compile, release --env=prod
```

The binary bundle will then be located in _build/prod/rel/hello_world/releases/0.1.0/hello_world.tar.gz
And here is an associated job file that will demonstrate this issue:
```
job "test" {
  datacenters = ["awse"]
  type        = "service"

  constraint {
    attribute = "${meta.role}"
    value     = "api-cluster"
  }

  constraint {
    attribute = "${meta.env}"
    value     = "load"
  }

  # set our update policy
  update {
    max_parallel     = 1
    health_check     = "checks"
    min_healthy_time = "30s"
    healthy_deadline = "3m"
    auto_revert      = false
    #canary  = 1
    #stagger = "30s"
  }

  reschedule {
    delay          = "30s"
    delay_function = "exponential"
    max_delay      = "5m"
    unlimited      = true
  }

  group "test" {
    # set our restart policy
    restart {
      interval = "1m"
      attempts = 2
      delay    = "15s"
      mode     = "fail"
    }

    count = 1

    task "test" {
      leader = true

      # grab our files
      artifact {
        source      = "https://<url to archive>/hello_world.tar.gz"
        destination = "local/hello_world"
      }

      # set our environment variables
      env {
        CHEF_ENV       = "${meta.env}"
        APP_NAME       = "org-service"
        LOCAL_HOSTNAME = "${node.unique.name}"
        # we need this so it doesn't try to write into the application
        RELEASE_MUTABLE_DIR = "/local/run_dir"
        PORT                = "${NOMAD_PORT_app}"
        ERL_CRASH_DUMP      = "/alloc/logs/erl_crash.dump"
        ERL_EPMD_PORT       = "${NOMAD_PORT_epmd}"
      }

      # run our app
      driver = "exec"

      config {
        command = "local/hello_world/bin/hello_world"
        args    = ["foreground"]
      }

      resources {
        cpu    = 200
        memory = 300

        network {
          port "app" {}
          port "admin" {}

          port "epmd" {
            static = "11001"
          }
        }
      }
    }
  }
}
```
The errors show up in stderr:
```
2018-04-23 12:53:45 std_error
** (Conform.Schema.SchemaError) Schema at /local/hello_world/releases/0.1.0/hello_world.schema.exs doesn't exist!
    (conform) lib/conform/schema.ex:134: Conform.Schema.load!/1
    (conform) lib/conform.ex:95: Conform.process/1
    (elixir) lib/kernel/cli.ex:105: anonymous fn/3 in Kernel.CLI.exec_fun/2
```
A successful run will generate the following log line:
```
==> Generated sys.config in /tmp/hello_world_sample/_build/prod/rel/hello_world/var
```
If you run cp /local/hello_world/releases/0.1.0/hello_world.schema.exs /<any place at all>, then the next time the job restarts it will be able to find the file. (Notice that we don't change anything about the file at all; it's just a forced read via a copy.)
This error happens when you run the command inside a folder that Nomad creates as part of an allocation. However, you don't have to be running the command from within Nomad to get this error. If you just cd into the allocation directory and run the command manually (no chroot or anything), you get the same error. As long as the command is run from within the allocation directory, it has issues.
Thanks!
Hi @dadgar !
Here is the permissions breakdown under 0.7.1:
```
-rw-r--r-- 1 root root    9193    Apr 23 10:49 local/org_service/releases/0.0.1/org_service.schema.exs
drwxr-xr-x 5 root root    4096    Apr 23 10:49 local/org_service/releases/0.0.1/
drwxr-xr-x 3 root root    4096    Apr 23 10:49 local/org_service/releases/
drwxr-xr-x 6 root root    4096    Apr 23 10:49 local/org_service
drwxrwxrwx 4 nobody nogroup 4096  Apr 23 10:49 local
```
Here is the outcome of running a cat as the command:
```
@moduledoc """
A schema is a keyword list which represents how to map, transform, and validate
...
<lots of stuff>
```
So it seems the file is there and is, in general, readable.
Thanks!
@dansteen Thanks for the reproducer. We were able to track it down to an issue when the archive did not have an access time set. This has been fixed and will be pulled into Nomad 0.8.2, which will be released shortly!
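For the curious, here is an illustration of what "no access time set" means (a Python sketch, not Nomad's actual Go extraction code): a GNU-format tar header records each file's mtime but carries no per-file atime (PAX archives can store one via extended headers), so an extractor that applies header times verbatim can end up writing a zero access time.

```python
import io
import tarfile

# Build a tiny GNU-format tar in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
    data = b"hello"
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(data)
    info.mtime = 1524500000  # mtime IS recorded in the header
    tar.addfile(info, io.BytesIO(data))

# Read it back and inspect the header metadata.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    member = tar.getmember("hello.txt")
    print("mtime:", member.mtime)                  # the stored modification time
    print("atime in headers:", "atime" in member.pax_headers)  # no access time stored
```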