I spent most of my morning trying to build a test nomad + consul cluster using vagrant.
I am finding that, as it stands, consul integration is very difficult to run on top of nomad. I am sure the issues outlined here will spawn child issues that are more specific, but I think having a general issue will help provide a place for discussion to improve the user experience before we break it down into specific tasks.
Here is a quick background of my investigations to narrow down the scope:
I built a docker image to run consul (f21global/consul on docker hub), with consul's HTTP interface exposed on localhost:8500. This is currently the nomad task config I am using:
# Define a job called consul
job "consul" {
  region      = "global"
  datacenters = ["dc1"]

  # Run this job on every node in the cluster
  type = "system"

  # Rolling updates should be sequential
  update {
    stagger      = "30s"
    max_parallel = 1
  }

  constraint {
    distinct_hosts = true
  }

  group "consul-server" {
    task "consul-server" {
      driver = "docker"

      config {
        image        = "f21global/consul"
        network_mode = "host"
        args         = ["agent", "-server", "-bootstrap-expect", "1", "-data-dir", "/tmp/consul"]
      }

      resources {
        cpu    = 500
        memory = 64

        network {
          # Request static ports for all of consul's interfaces
          port "consul_8300" { static = 8300 }
          port "consul_8301" { static = 8301 }
          port "consul_8302" { static = 8302 }
          port "consul_8400" { static = 8400 }
          port "consul_8500" { static = 8500 }
          port "consul_8600" { static = 8600 }
        }
      }
    }
  }

  group "consul-agent" {
    task "consul-agent" {
      driver = "docker"

      config {
        image        = "f21global/consul"
        network_mode = "host"
        args         = ["agent", "-data-dir", "/tmp/consul", "-node=agent-twi"]
      }

      resources {
        cpu    = 500
        memory = 64

        network {
          # Request static ports for all of consul's interfaces
          port "consul_8300" { static = 8300 }
          port "consul_8301" { static = 8301 }
          port "consul_8302" { static = 8302 }
          port "consul_8400" { static = 8400 }
          port "consul_8500" { static = 8500 }
          port "consul_8600" { static = 8600 }
        }
      }
    }
  }
}
Problems I ran into:
- distinct_hosts causes nomad to panic: #489.
- The consul servers need to know each other's addresses via start-join and start-join-wan. The only workaround I can see is to update the consul.nomad file with the ip address and then send it to nomad as an update.
- There is no way to make the consul-agent task not run on nodes that already have the consul-server task running.
@F21 The Nomad client keeps retrying the connection to the Consul agent. The moment it connects to the agent, it syncs all the service definitions of the tasks running on that node.
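For reference, the manual workaround described above (updating the consul.nomad file with the server's ip and resubmitting) can be sketched as follows; the address below is just a placeholder:

```hcl
task "consul-agent" {
  driver = "docker"

  config {
    image        = "f21global/consul"
    network_mode = "host"
    # 192.168.33.10 is a placeholder server address; edit it and
    # re-run `nomad run consul.nomad` whenever the server moves.
    args = ["agent", "-data-dir", "/tmp/consul",
            "-retry-join", "192.168.33.10"]
  }
}
```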
Also, I am wondering why the Consul servers need to be run as system jobs. It makes perfect sense to run the consul agents as system jobs, but we probably shouldn't need to use the distinct_hosts constraint when the job uses the system scheduler.
@diptanu I agree that the Consul servers probably won't need to use the system scheduler.
Having said that, my initial rationale was to use distinct_hosts as a way to prevent the consul agent from being scheduled onto clients where the consul server was running.
However, even after setting the consul server count to 1, I am still getting scheduling errors:
# Define a job called consul
job "consul" {
  region      = "global"
  datacenters = ["dc1"]

  # Run this job on every node in the cluster
  type = "system"

  # Rolling updates should be sequential
  update {
    stagger      = "30s"
    max_parallel = 1
  }

  constraint {
    distinct_hosts = true
  }

  group "consul-server" {
    count = 1

    task "consul-server" {
      driver = "docker"

      config {
        image        = "f21global/consul"
        network_mode = "host"
        args         = ["agent", "-server", "-bootstrap-expect", "1", "-data-dir", "/tmp/consul"]
      }

      resources {
        cpu    = 500
        memory = 64

        network {
          # Request static ports for all of consul's interfaces
          port "consul_8300" { static = 8300 }
          port "consul_8301" { static = 8301 }
          port "consul_8302" { static = 8302 }
          port "consul_8400" { static = 8400 }
          port "consul_8500" { static = 8500 }
          port "consul_8600" { static = 8600 }
        }
      }
    }
  }

  group "consul-agent" {
    task "consul-agent" {
      driver = "docker"

      config {
        image        = "f21global/consul"
        network_mode = "host"
        args         = ["agent", "-data-dir", "/tmp/consul", "-node=agent-twi"]
      }

      resources {
        cpu    = 500
        memory = 64

        network {
          # Request static ports for all of consul's interfaces
          port "consul_8300" { static = 8300 }
          port "consul_8301" { static = 8301 }
          port "consul_8302" { static = 8302 }
          port "consul_8400" { static = 8400 }
          port "consul_8500" { static = 8500 }
          port "consul_8600" { static = 8600 }
        }
      }
    }
  }
}
$ sudo nomad run -address=http://192.168.33.10:4646 consul.nomad
==> Monitoring evaluation "d659de31-4f88-95bc-8fe0-32098d1ce3f6"
Evaluation triggered by job "consul"
Scheduling error for group "consul-agent" (failed to find a node for placement)
Allocation "22b47eea-b799-e56b-3622-af7aa9c97a78" status "failed" (0/1 nodes filtered)
* Resources exhausted on 1 nodes
* Dimension "network: reserved port collision" exhausted on 1 nodes
Allocation "2447ee21-feb4-37eb-9ed5-001d00846d05" created: node "69278cbc-37c4-cf17-2de1-586c3589cfa9", group "consul-server"
Allocation "714b1913-acd5-2189-5ca3-cc150910cacb" created: node "c0e42790-44b2-a729-5a8f-1742fb503999", group "consul-server"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "d659de31-4f88-95bc-8fe0-32098d1ce3f6" finished with status "complete"
$ sudo nomad status -address=http://192.168.33.10:4646 consul
ID = consul
Name = consul
Type = system
Priority = 50
Datacenters = dc1
Status = <none>
==> Evaluations
ID                                    Priority  TriggeredBy   Status
d659de31-4f88-95bc-8fe0-32098d1ce3f6  50        job-register  complete
==> Allocations
ID                                    EvalID                                NodeID                                TaskGroup      Desired  Status
22b47eea-b799-e56b-3622-af7aa9c97a78  d659de31-4f88-95bc-8fe0-32098d1ce3f6  <none>                                consul-agent   failed   failed
2447ee21-feb4-37eb-9ed5-001d00846d05  d659de31-4f88-95bc-8fe0-32098d1ce3f6  69278cbc-37c4-cf17-2de1-586c3589cfa9  consul-server  run      dead
714b1913-acd5-2189-5ca3-cc150910cacb  d659de31-4f88-95bc-8fe0-32098d1ce3f6  c0e42790-44b2-a729-5a8f-1742fb503999  consul-server  run      dead
@F21 It looks like the agent and server are getting scheduled on the same machine, which is why you're getting the port collision. distinct_hosts at the job level just means that all the task groups will run on distinct machines, but a system job still runs on every single machine, and that's why the consul agent is getting scheduled alongside the consul server. We might need a way to exclude system jobs from running on machines with a certain label.
I'm glad I'm not the only one running into this problem. There really should be a recommended way in the docs to run consul, because having that service running is crucial for a production-ready cluster.
I tried splitting consul-server and consul-agent into separate service jobs (instead of system) to let them run independently, but that doesn't appear to be the right solution (or I missed something in the config).
The nomad node running consul-server keeps restarting the service:
2015/11/25 13:01:13 [ERR] client: failed to complete task 'consul-server' for alloc '85aa78be-e03b-33e0-6bc5-e6424b8eabdb': Wait returned exit code 1, signal 0, and error Docker container exited with non-zero exit code: 1
2015/11/25 13:01:13 [INFO] client: Restarting Task: consul-server
2015/11/25 13:01:29 [INFO] driver.docker: a container with the name consul-server-85aa78be-e03b-33e0-6bc5-e6424b8eabdb already exists; will attempt to purge and re-create
2015/11/25 13:01:32 [INFO] driver.docker: purged container consul-server-85aa78be-e03b-33e0-6bc5-e6424b8eabdb
2015/11/25 13:01:32 [INFO] driver.docker: created container dc3978f31f16a1ab2994679f33466ca2e21feaea4fe7f9975d2c3a3aa61c61ad
2015/11/25 13:01:32 [INFO] driver.docker: started container dc3978f31f16a1ab2994679f33466ca2e21feaea4fe7f9975d2c3a3aa61c61ad
2015/11/25 13:01:32 [ERR] client: failed to complete task 'consul-server' for alloc '85aa78be-e03b-33e0-6bc5-e6424b8eabdb': Wait returned exit code 1, signal 0, and error Docker container exited with non-zero exit code: 1
2015/11/25 13:01:32 [INFO] client: Restarting Task: consul-server
And the node running consul-agent restarts a couple of times and then gets stuck:
2015/11/25 13:05:14 [INFO] driver.docker: purged container consul-agent-519193f5-6744-f3e7-3fd2-cb4a06ec4cb6
2015/11/25 13:05:14 [INFO] driver.docker: created container 57347baa7465137dc6aea463ae939539043852a0058e9e642f2daf360128a77a
2015/11/25 13:05:14 [ERR] driver.docker: failed to start container 57347baa7465137dc6aea463ae939539043852a0058e9e642f2daf360128a77a: API error (500): Cannot start container 57347baa7465137dc6aea463ae939539043852a0058e9e642f2daf360128a77a: [8] System error: write /sys/fs/cgroup/devices/system.slice/docker-57347baa7465137dc6aea463ae939539043852a0058e9e642f2daf360128a77a.scope/cgroup.procs: no such device
2015/11/25 13:05:14 [ERR] client: failed to start task 'consul-agent' for alloc '519193f5-6744-f3e7-3fd2-cb4a06ec4cb6': Failed to start container 57347baa7465137dc6aea463ae939539043852a0058e9e642f2daf360128a77a: API error (500): Cannot start container 57347baa7465137dc6aea463ae939539043852a0058e9e642f2daf360128a77a: [8] System error: write /sys/fs/cgroup/devices/system.slice/docker-57347baa7465137dc6aea463ae939539043852a0058e9e642f2daf360128a77a.scope/cgroup.procs: no such device
2015/11/25 13:05:46 [ERR] http: Request /v1/allocation/consul-client, error: rpc error: alloc lookup failed: index error: UUID must be 36 characters
I'm still digging into it but just wanted to echo the need for a recommended way to run consul on nomad.
FWIW, I have avoided these issues by making the consul network/service the thing that is set up and completed first, and which nomad then uses. CM (configuration management) manages consul and nomad on all nodes, and init for each node works out the details of forming/joining a cluster with consensus and a leader. No chicken/egg issues here.
Has any progress been made to get consul running on nomad? While it's possible to run consul by itself alongside nomad, it poses problems (this is assuming we only have 1 datacenter and want to have 3 consul servers, with the rest of the nodes running the consul client).
@F21 You can definitely run Consul servers with Nomad. What is not possible today is to run _both_ the Consul servers (using the service scheduler) and clients (using the system scheduler) via Nomad. The reason is that the system scheduler currently schedules system jobs on all the machines in a Nomad cluster; there is currently no way for the system scheduler to exclude running the Consul clients on the machines where Nomad is running the Consul servers. And on the same machine, the client and server can't run simultaneously because of port collisions.
But if you just want to run the Consul servers on Nomad it's definitely possible and as you said, you could use Atlas's auto-join functionality to have the Consul servers find each other when they are dynamically scheduled by Nomad.
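A minimal sketch of what that could look like, assuming a 3-server quorum, the f21global/consul image from earlier, and Atlas auto-join (the Atlas organization name and token are placeholders; ports elided for brevity):

```hcl
# Sketch: consul servers as a service job; assumes a 3-node quorum
# and Atlas auto-join so the servers can find each other after placement.
job "consul-servers" {
  region      = "global"
  datacenters = ["dc1"]
  type        = "service" # service scheduler, not system

  group "consul-server" {
    count = 3

    # Keep the three servers on distinct machines
    constraint {
      distinct_hosts = true
    }

    task "consul-server" {
      driver = "docker"

      config {
        image        = "f21global/consul"
        network_mode = "host"
        args = ["agent", "-server", "-bootstrap-expect", "3",
                "-data-dir", "/tmp/consul",
                "-atlas", "myorg/infra", "-atlas-join"]
      }

      resources {
        cpu    = 500
        memory = 64
      }
    }
  }
}
```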
@diptanu What are the nomad team's plans to fully bring consul scheduling to nomad?
In terms of the service scheduler colliding with the system scheduler, maybe a key called service_key (or something more suitable) could be added to the task definitions. We could then set the service_key for both the system task and the service task to something like consul. Nomad would then give the service task priority when scheduling. If a node running a service task goes offline, nomad could evict the system task and replace it with a service task so that the count is maintained.
Or maybe the metadata feature could be improved so that it's exposed at the node level. For example, a task could add a piece of metadata setting service-type=consul-server, and the system scheduler could be constrained to not schedule on nodes where that metadata exists. However, a priority system would still be required so that the service tasks take precedence.
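If node-level metadata were exposed that way, the exclusion could be written as an ordinary constraint. Everything below is hypothetical, in particular the consul_server meta key, which Nomad does not provide today:

```hcl
group "consul-agent" {
  # Hypothetical: assumes nodes running the consul server expose
  # meta "consul_server" = "true" for the scheduler to match against.
  constraint {
    attribute = "${meta.consul_server}"
    operator  = "!="
    value     = "true"
  }

  # consul-agent task definition as before
}
```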
I've recently built a bunch of docker images to run HDFS, having this feature would also be very useful. For example, I want to run the namenodes on distinct nodes. I also want to use the system scheduler to schedule datanodes on all nodes in the cluster except for nodes where the namenodes are running.
Another possible way to deal with consul discovery without using Atlas would be to exploit the all_at_once feature when scheduling the tasks. Since the scheduler knows atomically which nodes the tasks will run on, it could pass an array (maybe as JSON) of the ip addresses of those nodes as an environment variable to the task. I think this would deal with the issue of consul servers discovering each other. However, we would also need some way to pass that information to the system task so that the consul clients can connect to the servers.
Has any progress been made to run both consul agents and servers on nomad?
@F21, from what I can tell, it's not impossible, it just takes work. With that said, I have had a _lot_ of success with consul as the primary core service that runs outside nomad, and I would recommend considering this route too.
@ketzacoatl That's what I am currently doing with a virtualized test cluster. However, if you have say 3 nodes running nomad servers and consul servers, how are you recovering if 1 of those nodes goes down or experiences a hardware failure?
I have my nomad servers and consul leaders running together on one auto-scaling group, 3 to 5 servers. If one node goes down, AWS follows the auto-scaling group setup and creates a replacement for the node(s) that are not present.
Ah, that makes sense. I am not using AWS but will probably be running on a set of dedicated servers and a public cloud provider without auto-scaling, so a machine going down will need manual intervention.
@ketzacoatl How are your ASG instances joining the cluster when started? That is, how do you "know" the other servers to join to?
@memelet, for the consul leaders themselves, I use some AWS hackery. Terraform creates the ASG and puts it in the smallest subnet possible (limiting the IP range). We have 2 AZs for failover on the ASG, so there are two subnets, and the list of "possible IPs" is computed (eg, those that the leaders might _actually_ have, but we don't know, because it's an ASG), and that list is used to create a DNS entry for "all leader IPs". Note that the list of possible IPs is _huge_ (~25) compared to the number of leader nodes (3 - 5).
Consul agents can then be pointed at that DNS record for the leaders, and configured with retry_join, so they will _eventually_ find one of the "right" leader IPs and get connected to the network. The time it takes the agents to find the leaders depends on the retry_interval.
The first goal of the leaders is to get consul up, then nomad. Nomad relies on consul in my setup, and I think it's goofy to make Consul rely on Nomad (at least in my setup, because I use consul as part of "distributed Configuration Management" for the cluster, and it forms the foundation for the whole shebang - you gotta pick either the egg, the chicken, or the farmer in your story). The Nomad leaders/servers come online and then publish their "service" in the Consul catalog. The service check is simple, and so long as the service is running, the server is listed in consul, and that lets the nomad servers find each other for their quorum.
Even with the DNS hackery, this has worked _very_ reliably, albeit the agent nodes _can be_ a little slow to find the leaders and join. I plan on addressing this with some code on Lambda that updates the DNS record as the nodes in the leader ASG come and go.
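Concretely, the agent side of that setup reduces to pointing retry_join at the aggregate DNS record; the hostname and interval below are illustrative:

```hcl
config {
  image        = "f21global/consul"
  network_mode = "host"
  # "leaders.internal.example" stands in for the DNS record that
  # covers all possible leader IPs across the ASG subnets.
  args = ["agent", "-data-dir", "/tmp/consul",
          "-retry-join", "leaders.internal.example",
          "-retry-interval", "30s"]
}
```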
Why not run Consul with nomad by default? Every node running nomad could be running consul as well.
Please combine them into one binary, so we can use nomad to manage a cluster more easily.
Running the two on the same hosts is trivial. The documentation for each app is clear in how to configure and run the software. There is a learning curve to understanding all the details you need to master in order to be effective.
@ketzacoatl wow, thank you for describing your AWS hackery! Didn't think that we could simply brute force to find a leader in some subnet :+1:
You could also use lambda to update a DNS record when nodes in the ASG change - see https://objectpartners.com/2015/07/07/aws-tricks-updating-route53-dns-for-autoscalinggroup-using-lambda/ for an example.
Hey I am going to close this since we recommend running Consul outside of Nomad.