I've been having a weird issue with an app: it runs fine when I run it with Docker's CLI but fails when I schedule it with Nomad. I created a simpler test case, stripping out the app-specific details, and pushed an image to the public Docker registry as c4milo/nomad-test:1.0.0.
docker pull c4milo/nomad-test:1.0.0
docker run --net=host -ti image_id uwsgi --env ENV=dev --die-on-term --master --http 9090 --workers 2 --threads 2 --need-app --callable app --thunder-lock --chdir /app --file app.py
job "test-job" {
datacenters = ["dc1"]
distinct_hosts = true
type = "service"
priority = 50
constraint {
attribute = "$attr.kernel.name"
value = "linux"
}
# Configure the job to do rolling updates
update {
# Stagger updates every 10 seconds
stagger = "10s"
# Update a single task at a time
max_parallel = 1
}
group "instances" {
count = 1
restart {
interval = "1m"
attempts = 2
delay = "15s"
on_success = true
mode = "delay"
}
# Define a task to run
task "test" {
# Use Docker to run the task.
driver = "docker"
config {
image = "c4milo/nomad-test:1.0.0"
server_address = "registry.docker.com:443"
network_mode = "host"
command = "uwsgi"
args = [
"--env ENV=dev",
"--die-on-term",
"--master",
"--http ${NOMAD_PORT_http}",
"--workers 1", "--threads 1",
"--need-app", "--callable app",
"--chdir /app",
"--file app.py"
]
}
resources {
cpu = 500 # 500 Mhz
memory = 512 # 256MB
network {
mbits = 10
port "http" {}
}
}
service {
port = "http"
check {
name = "alive"
type = "http"
path = "/"
interval = "10s"
timeout = "2s"
}
}
}
}
}
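For reference, a sketch of how the job gets submitted and checked; the file name app.nomad comes from the diff later in this thread, and the allocation ID is a placeholder:

nomad run app.nomad
nomad status test-job
nomad alloc-status <alloc-id>

The Dockerfile used to build the image follows.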
FROM centos:6

RUN yum install -y \
      epel-release \
      wget \
      rpm \
      unzip \
      centos-release-SCL \
      ca-certificates \
      gsl \
      blas-devel \
      lapack-devel \
      libxslt-devel

RUN yum install -y \
      python-pip \
      python-devel

RUN pip install --upgrade pip Flask
RUN pip install uwsgi

WORKDIR /app
COPY . /app
RUN python -m compileall /app

CMD ["python", "/app/app.py"]
$ nomad alloc-status c12205ab-f6c8-1113-7878-eae0f532cfe5
ID = c12205ab-f6c8-1113-7878-eae0f532cfe5
EvalID = 7ec6468e-aed9-34a0-41ed-254d234afc7e
Name = test-job.instances[0]
NodeID = 2beee4ee-1339-646d-fb85-537327e998f9
JobID = test-job
ClientStatus = running
NodesEvaluated = 3
NodesFiltered = 0
NodesExhausted = 0
AllocationTime = 59.62µs
CoalescedFailures = 0
==> Task "test" is "pending"
Recent Events:
Time Type Description
16:26:35 12/28/15 Terminated Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
16:26:35 12/28/15 Started <none>
16:26:20 12/28/15 Terminated Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
16:26:20 12/28/15 Started <none>
16:25:50 12/28/15 Terminated Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
16:25:50 12/28/15 Started <none>
16:25:35 12/28/15 Terminated Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
16:25:35 12/28/15 Started <none>
16:25:20 12/28/15 Terminated Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
16:25:20 12/28/15 Started <none>
==> Status
Allocation "c12205ab-f6c8-1113-7878-eae0f532cfe5" status "running" (0/3 nodes filtered)
* Score "62a553b3-b829-0897-c8d4-d14056b9366c.binpack" = 8.147441
* Score "f0fd7eaf-69d8-934e-d0ea-922585c92c97.binpack" = 1.841511
* Score "2beee4ee-1339-646d-fb85-537327e998f9.binpack" = 14.610137
Here is the repo with all the files: https://github.com/c4milo/nomad-test
@c4milo Were you able to see the logs from the container?
@diptanu no, it seems to be terminating right away. The Docker daemon logs don't show anything abnormal either. My next step in triaging the issue was going to be adding some extra logging to the Docker task driver, but I've already spent a good deal of time ruling out other possibilities, and I would like a second pair of eyes now :/
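(For anyone following along, the daemon-side check was presumably something along these lines; exact log locations vary by distro and init system:)

journalctl -u docker --since "15 minutes ago"   # systemd hosts
tail -n 200 /var/log/docker                     # older SysV-init hosts such as CentOS 6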
@c4milo I've usually triaged issues like this by comparing the JSON the Docker daemon receives when the container is run via the CLI versus via the cluster manager. Can you see the difference?
If you have the container IDs, sharing the JSON returned by docker inspect should tell us how the containers are getting set up in both cases.
I'm not sure I'm following. How am I supposed to run docker inspect on a container that terminates immediately after being scheduled with Nomad?
@c4milo This might be redundant information, but as long as the container is being created it is possible to inspect it. Just look it up with docker ps -a and run an inspect on the appropriate container. The fact that it terminates immediately does not prevent you from inspecting the JSON configuration.
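Concretely, capturing both configurations for a diff might look like this (container IDs are placeholders):

docker ps -a --filter "status=exited"              # find the container Nomad created
docker inspect <nomad-container-id> > nomad.json
docker inspect <cli-container-id> > cli.json
diff -u nomad.json cli.json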
It's not redundant at all. Thanks a lot for the additional clarification, @iverberk.
Here it goes; the one with "ExitCode": 1 is the run through Nomad:
diff --git a/nomad.json b/cli.json
index 4f8e4a6..7c3744d 100644
--- a/nomad.json
+++ b/cli.json
@@ -1,19 +1,27 @@
[
{
- "Id": "15cf649405971447c25ca85f8dbbd054f473bc8a17d0fe35178ff49ef14acc60",
- "Created": "2015-12-30T20:52:55.922662765Z",
+ "Id": "c345768419bfb041acc78d5b95760070b6282d33391d7c280ae6295c8c810555",
+ "Created": "2015-12-30T20:59:28.190230218Z",
"Path": "uwsgi",
"Args": [
- "--env ENV=dev",
+ "--env",
+ "ENV=dev",
"--die-on-term",
"--master",
- "--http 23933",
- "--workers 1",
- "--threads 1",
+ "--http",
+ "9090",
+ "--workers",
+ "2",
+ "--threads",
+ "2",
"--need-app",
- "--callable app",
- "--chdir /app",
- "--file app.py"
+ "--callable",
+ "app",
+ "--thunder-lock",
+ "--chdir",
+ "/app",
+ "--file",
+ "app.py"
],
"State": {
"Status": "exited",
@@ -23,17 +31,17 @@
"OOMKilled": false,
"Dead": false,
"Pid": 0,
- "ExitCode": 1,
+ "ExitCode": 0,
"Error": "",
- "StartedAt": "2015-12-30T20:52:55.977034702Z",
- "FinishedAt": "2015-12-30T20:52:55.981442507Z"
+ "StartedAt": "2015-12-30T20:59:28.266952875Z",
+ "FinishedAt": "2015-12-30T20:59:31.319496838Z"
},
"Image": "2ada554a624b0469e95fc98577271d83754c59ead37ea52888d70e11b4b03c02",
- "ResolvConfPath": "/var/lib/docker/containers/15cf649405971447c25ca85f8dbbd054f473bc8a17d0fe35178ff49ef14acc60/resolv.conf",
- "HostnamePath": "/var/lib/docker/containers/15cf649405971447c25ca85f8dbbd054f473bc8a17d0fe35178ff49ef14acc60/hostname",
- "HostsPath": "/var/lib/docker/containers/15cf649405971447c25ca85f8dbbd054f473bc8a17d0fe35178ff49ef14acc60/hosts",
- "LogPath": "/var/lib/docker/containers/15cf649405971447c25ca85f8dbbd054f473bc8a17d0fe35178ff49ef14acc60/15cf649405971447c25ca85f8dbbd054f473bc8a17d0fe35178ff49ef14acc60-json.log",
- "Name": "/test-3001caa8-7c81-a0c1-4d04-af3e5496be41",
+ "ResolvConfPath": "/var/lib/docker/containers/c345768419bfb041acc78d5b95760070b6282d33391d7c280ae6295c8c810555/resolv.conf",
+ "HostnamePath": "/var/lib/docker/containers/c345768419bfb041acc78d5b95760070b6282d33391d7c280ae6295c8c810555/hostname",
+ "HostsPath": "/var/lib/docker/containers/c345768419bfb041acc78d5b95760070b6282d33391d7c280ae6295c8c810555/hosts",
+ "LogPath": "/var/lib/docker/containers/c345768419bfb041acc78d5b95760070b6282d33391d7c280ae6295c8c810555/c345768419bfb041acc78d5b95760070b6282d33391d7c280ae6295c8c810555-json.log",
+ "Name": "/trusting_almeida",
"RestartCount": 0,
"Driver": "overlay",
"ExecDriver": "native-0.2",
@@ -42,47 +50,31 @@
"AppArmorProfile": "",
"ExecIDs": null,
"HostConfig": {
- "Binds": [
- "/var/lib/nomad/alloc/3001caa8-7c81-a0c1-4d04-af3e5496be41/alloc:/alloc:rw,z",
- "/var/lib/nomad/alloc/3001caa8-7c81-a0c1-4d04-af3e5496be41/test:/local:rw,Z"
- ],
+ "Binds": null,
"ContainerIDFile": "",
- "LxcConf": null,
- "Memory": 536870912,
+ "LxcConf": [],
+ "Memory": 0,
"MemoryReservation": 0,
- "MemorySwap": -1,
+ "MemorySwap": 0,
"KernelMemory": 0,
- "CpuShares": 500,
+ "CpuShares": 0,
"CpuPeriod": 0,
"CpusetCpus": "",
"CpusetMems": "",
"CpuQuota": 0,
"BlkioWeight": 0,
"OomKillDisable": false,
- "MemorySwappiness": null,
+ "MemorySwappiness": -1,
"Privileged": false,
- "PortBindings": {
- "23933/tcp": [
- {
- "HostIp": "10.213.42.22",
- "HostPort": "23933"
- }
- ],
- "23933/udp": [
- {
- "HostIp": "10.213.42.22",
- "HostPort": "23933"
- }
- ]
- },
+ "PortBindings": {},
"Links": null,
"PublishAllPorts": false,
- "Dns": null,
- "DnsOptions": null,
- "DnsSearch": null,
+ "Dns": [],
+ "DnsOptions": [],
+ "DnsSearch": [],
"ExtraHosts": null,
"VolumesFrom": null,
- "Devices": null,
+ "Devices": [],
"NetworkMode": "host",
"IpcMode": "",
"PidMode": "",
@@ -91,7 +83,7 @@
"CapDrop": null,
"GroupAdd": null,
"RestartPolicy": {
- "Name": "",
+ "Name": "no",
"MaximumRetryCount": 0
},
"SecurityOpt": null,
@@ -112,62 +104,47 @@
"Name": "overlay",
"Data": {
"LowerDir": "/var/lib/docker/overlay/2ada554a624b0469e95fc98577271d83754c59ead37ea52888d70e11b4b03c02/root",
- "MergedDir": "/var/lib/docker/overlay/15cf649405971447c25ca85f8dbbd054f473bc8a17d0fe35178ff49ef14acc60/merged",
- "UpperDir": "/var/lib/docker/overlay/15cf649405971447c25ca85f8dbbd054f473bc8a17d0fe35178ff49ef14acc60/upper",
- "WorkDir": "/var/lib/docker/overlay/15cf649405971447c25ca85f8dbbd054f473bc8a17d0fe35178ff49ef14acc60/work"
+ "MergedDir": "/var/lib/docker/overlay/c345768419bfb041acc78d5b95760070b6282d33391d7c280ae6295c8c810555/merged",
+ "UpperDir": "/var/lib/docker/overlay/c345768419bfb041acc78d5b95760070b6282d33391d7c280ae6295c8c810555/upper",
+ "WorkDir": "/var/lib/docker/overlay/c345768419bfb041acc78d5b95760070b6282d33391d7c280ae6295c8c810555/work"
}
},
- "Mounts": [
- {
- "Source": "/var/lib/nomad/alloc/3001caa8-7c81-a0c1-4d04-af3e5496be41/alloc",
- "Destination": "/alloc",
- "Mode": "rw,z",
- "RW": true
- },
- {
- "Source": "/var/lib/nomad/alloc/3001caa8-7c81-a0c1-4d04-af3e5496be41/test",
- "Destination": "/local",
- "Mode": "rw,Z",
- "RW": true
- }
- ],
+ "Mounts": [],
"Config": {
"Hostname": "per-nomad-worker07",
"Domainname": "",
"User": "",
- "AttachStdin": false,
- "AttachStdout": false,
- "AttachStderr": false,
- "ExposedPorts": {
- "23933/tcp": {},
- "23933/udp": {}
- },
- "Tty": false,
- "OpenStdin": false,
- "StdinOnce": false,
+ "AttachStdin": true,
+ "AttachStdout": true,
+ "AttachStderr": true,
+ "Tty": true,
+ "OpenStdin": true,
+ "StdinOnce": true,
"Env": [
- "NOMAD_CPU_LIMIT=500",
- "NOMAD_IP=10.213.42.22",
- "NOMAD_PORT_http=23933",
- "NOMAD_ALLOC_DIR=/alloc",
- "NOMAD_TASK_DIR=/local",
- "NOMAD_MEMORY_LIMIT=512",
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
],
"Cmd": [
"uwsgi",
- "--env ENV=dev",
+ "--env",
+ "ENV=dev",
"--die-on-term",
"--master",
- "--http 23933",
- "--workers 1",
- "--threads 1",
+ "--http",
+ "9090",
+ "--workers",
+ "2",
+ "--threads",
+ "2",
"--need-app",
- "--callable app",
- "--chdir /app",
- "--file app.py"
+ "--callable",
+ "app",
+ "--thunder-lock",
+ "--chdir",
+ "/app",
+ "--file",
+ "app.py"
],
- "Image": "c4milo/nomad-test:1.0.0",
+ "Image": "2ada554a624b",
"Volumes": null,
"WorkingDir": "/app",
"Entrypoint": null,
@@ -175,7 +152,8 @@
"Labels": {
"License": "GPLv2",
"Vendor": "CentOS"
- }
+ },
+ "StopSignal": "SIGTERM"
},
"NetworkSettings": {
"Bridge": "",
So, this change in my Nomad job definition seemed to make it work:
diff --git a/app.nomad b/app.nomad
index bd35830..a21e5e3 100644
--- a/app.nomad
+++ b/app.nomad
@@ -1,5 +1,5 @@
job "test-job" {
- datacenters = ["dc1"]
+ datacenters = ["iad2", "dc1"]
distinct_hosts = true
type = "service"
priority = 50
@@ -39,14 +39,14 @@ job "test-job" {
network_mode = "host"
command = "uwsgi"
args = [
- "--env ENV=dev",
+ "--env", "ENV=dev",
"--die-on-term",
"--master",
- "--http ${NOMAD_PORT_http}",
- "--workers 1", "--threads 1",
- "--need-app", "--callable app",
- "--chdir /app",
- "--file app.py"
+ "--http", "${NOMAD_PORT_http}",
+ "--workers", " ", "--threads", "1",
+ "--need-app", "--callable", "app",
+ "--chdir", "/app",
+ "--file", "app.py"
]
}
Since this was just an issue of splitting the command-line arguments, I am going to close it. If something else comes up we can reopen it.
Sure; however, I still think it could be more user friendly and less prone to this kind of mistake.
Stumbled upon the same issue, where the container exits and is cleaned up immediately. @diptanu gave the tip of checking the task's stdout and stderr logs, which worked very well in diagnosing the issue.
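Depending on the Nomad version, retrieving those logs looks roughly like this (allocation ID and task name are placeholders):

nomad logs <alloc-id> test            # task stdout; newer versions use `nomad alloc logs`
nomad logs -stderr <alloc-id> test    # task stderr
# or read the files directly from the client's allocation directory, e.g.
ls /var/lib/nomad/alloc/<alloc-id>/alloc/logs/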