I spent most of my morning trying to build a test nomad + consul cluster using vagrant.
I am finding that, as it stands, consul integration is very difficult to run on top of nomad. I am sure the issues outlined here will spawn child issues that are more specific, but I think having a general issue will help provide a place for discussion to improve the user experience before we break it down into specific tasks.
Here is a quick background of my investigations to narrow down the scope:
I built a docker image to run consul (f21global/consul on docker hub), with consul's HTTP interface exposed on localhost:8500. This is currently the nomad task config I am using:
# Define a job called consul
job "consul" {
  region      = "global"
  datacenters = ["dc1"]

  # Run this job on every node in the cluster
  type = "system"

  # Rolling updates should be sequential
  update {
    stagger      = "30s"
    max_parallel = 1
  }

  constraint {
    distinct_hosts = true
  }

  group "consul-server" {
    task "consul-server" {
      driver = "docker"

      config {
        image        = "f21global/consul"
        network_mode = "host"
        args         = ["agent", "-server", "-bootstrap-expect", "1", "-data-dir", "/tmp/consul"]
      }

      resources {
        cpu    = 500
        memory = 64

        network {
          # Request static ports for all of consul's interfaces
          port "consul_8300" { static = 8300 }
          port "consul_8301" { static = 8301 }
          port "consul_8302" { static = 8302 }
          port "consul_8400" { static = 8400 }
          port "consul_8500" { static = 8500 }
          port "consul_8600" { static = 8600 }
        }
      }
    }
  }

  group "consul-agent" {
    task "consul-agent" {
      driver = "docker"

      config {
        image        = "f21global/consul"
        network_mode = "host"
        args         = ["agent", "-data-dir", "/tmp/consul", "-node=agent-twi"]
      }

      resources {
        cpu    = 500
        memory = 64

        network {
          # Request static ports for all of consul's interfaces
          port "consul_8300" { static = 8300 }
          port "consul_8301" { static = 8301 }
          port "consul_8302" { static = 8302 }
          port "consul_8400" { static = 8400 }
          port "consul_8500" { static = 8500 }
          port "consul_8600" { static = 8600 }
        }
      }
    }
  }
}
Problems I ran into:
- distinct_hosts causes nomad to panic: #489.
- The consul servers need to know each other's addresses via start-join and start-join-wan. The only workaround I can see is to update the consul.nomad file with the ip address and then send it to nomad as an update.
- There is no way to make the consul-agent task not run on nodes that already have the consul-server task running.
@F21 The Nomad client keeps retrying the connection to the Consul agent. The moment it connects to the agent, it syncs all the service definitions of the tasks running on that node.
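For reference, the manual workaround described above (updating the consul.nomad file with the server's ip and resubmitting) can be sketched as follows; the address below is just a placeholder:

```hcl
task "consul-agent" {
  driver = "docker"

  config {
    image        = "f21global/consul"
    network_mode = "host"
    # 192.168.33.10 is a placeholder server address; edit it and
    # re-run `nomad run consul.nomad` whenever the server moves.
    args = ["agent", "-data-dir", "/tmp/consul",
            "-retry-join", "192.168.33.10"]
  }
}
```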
Also, I am wondering why the Consul servers need to be run as system jobs. It makes perfect sense to run the consul agents as system jobs, but we probably shouldn't need to use the distinct_hosts constraint when the job uses the system scheduler.
@diptanu I agree that the Consul servers probably won't need to use the system scheduler.
Having said that, my initial rationale was to use distinct_hosts as a way to prevent the consul agent from being scheduled onto clients where the consul server was running.
However, even after setting the consul server count to 1, I am still getting scheduling errors:
# Define a job called consul
job "consul" {
  region      = "global"
  datacenters = ["dc1"]

  # Run this job on every node in the cluster
  type = "system"

  # Rolling updates should be sequential
  update {
    stagger      = "30s"
    max_parallel = 1
  }

  constraint {
    distinct_hosts = true
  }

  group "consul-server" {
    count = 1

    task "consul-server" {
      driver = "docker"

      config {
        image        = "f21global/consul"
        network_mode = "host"
        args         = ["agent", "-server", "-bootstrap-expect", "1", "-data-dir", "/tmp/consul"]
      }

      resources {
        cpu    = 500
        memory = 64

        network {
          # Request static ports for all of consul's interfaces
          port "consul_8300" { static = 8300 }
          port "consul_8301" { static = 8301 }
          port "consul_8302" { static = 8302 }
          port "consul_8400" { static = 8400 }
          port "consul_8500" { static = 8500 }
          port "consul_8600" { static = 8600 }
        }
      }
    }
  }

  group "consul-agent" {
    task "consul-agent" {
      driver = "docker"

      config {
        image        = "f21global/consul"
        network_mode = "host"
        args         = ["agent", "-data-dir", "/tmp/consul", "-node=agent-twi"]
      }

      resources {
        cpu    = 500
        memory = 64

        network {
          # Request static ports for all of consul's interfaces
          port "consul_8300" { static = 8300 }
          port "consul_8301" { static = 8301 }
          port "consul_8302" { static = 8302 }
          port "consul_8400" { static = 8400 }
          port "consul_8500" { static = 8500 }
          port "consul_8600" { static = 8600 }
        }
      }
    }
  }
}
$ sudo nomad run -address=http://192.168.33.10:4646 consul.nomad
==> Monitoring evaluation "d659de31-4f88-95bc-8fe0-32098d1ce3f6"
Evaluation triggered by job "consul"
Scheduling error for group "consul-agent" (failed to find a node for placement)
Allocation "22b47eea-b799-e56b-3622-af7aa9c97a78" status "failed" (0/1 nodes filtered)
* Resources exhausted on 1 nodes
* Dimension "network: reserved port collision" exhausted on 1 nodes
Allocation "2447ee21-feb4-37eb-9ed5-001d00846d05" created: node "69278cbc-37c4-cf17-2de1-586c3589cfa9", group "consul-server"
Allocation "714b1913-acd5-2189-5ca3-cc150910cacb" created: node "c0e42790-44b2-a729-5a8f-1742fb503999", group "consul-server"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "d659de31-4f88-95bc-8fe0-32098d1ce3f6" finished with status "complete"
$ sudo nomad status -address=http://192.168.33.10:4646 consul
ID = consul
Name = consul
Type = system
Priority = 50
Datacenters = dc1
Status = <none>
==> Evaluations
ID                                    Priority  TriggeredBy   Status
d659de31-4f88-95bc-8fe0-32098d1ce3f6  50        job-register  complete
==> Allocations
ID                                    EvalID                                NodeID                                TaskGroup      Desired  Status
22b47eea-b799-e56b-3622-af7aa9c97a78  d659de31-4f88-95bc-8fe0-32098d1ce3f6  <none>                                consul-agent   failed   failed
2447ee21-feb4-37eb-9ed5-001d00846d05  d659de31-4f88-95bc-8fe0-32098d1ce3f6  69278cbc-37c4-cf17-2de1-586c3589cfa9  consul-server  run      dead
714b1913-acd5-2189-5ca3-cc150910cacb  d659de31-4f88-95bc-8fe0-32098d1ce3f6  c0e42790-44b2-a729-5a8f-1742fb503999  consul-server  run      dead
@F21 It looks like the agent and server are getting scheduled on the same machine, which is why you're getting the port collision. distinct_hosts at the job level just means that all the task groups will run on distinct machines, but a system job still runs on every single machine, and that's why the consul agent is getting scheduled alongside the consul server. We might need a way to exclude system jobs from running on machines with a certain label.
I'm glad I'm not the only one running into this problem. There really should be a recommended way in the docs to run consul, because having that service running is crucial for a production-ready cluster.
I tried splitting consul-server and consul-agent into separate service jobs (instead of system) to let them run independently, but that doesn't appear to be the right solution (or I missed something in the config).
The nomad node running consul-server keeps restarting the service:
2015/11/25 13:01:13 [ERR] client: failed to complete task 'consul-server' for alloc '85aa78be-e03b-33e0-6bc5-e6424b8eabdb': Wait returned exit code 1, signal 0, and error Docker container exited with non-zero exit code: 1
2015/11/25 13:01:13 [INFO] client: Restarting Task: consul-server
2015/11/25 13:01:29 [INFO] driver.docker: a container with the name consul-server-85aa78be-e03b-33e0-6bc5-e6424b8eabdb already exists; will attempt to purge and re-create
2015/11/25 13:01:32 [INFO] driver.docker: purged container consul-server-85aa78be-e03b-33e0-6bc5-e6424b8eabdb
2015/11/25 13:01:32 [INFO] driver.docker: created container dc3978f31f16a1ab2994679f33466ca2e21feaea4fe7f9975d2c3a3aa61c61ad
2015/11/25 13:01:32 [INFO] driver.docker: started container dc3978f31f16a1ab2994679f33466ca2e21feaea4fe7f9975d2c3a3aa61c61ad
2015/11/25 13:01:32 [ERR] client: failed to complete task 'consul-server' for alloc '85aa78be-e03b-33e0-6bc5-e6424b8eabdb': Wait returned exit code 1, signal 0, and error Docker container exited with non-zero exit code: 1
2015/11/25 13:01:32 [INFO] client: Restarting Task: consul-server
And the node running consul-agent restarts a couple of times and then gets stuck:
2015/11/25 13:05:14 [INFO] driver.docker: purged container consul-agent-519193f5-6744-f3e7-3fd2-cb4a06ec4cb6
2015/11/25 13:05:14 [INFO] driver.docker: created container 57347baa7465137dc6aea463ae939539043852a0058e9e642f2daf360128a77a
2015/11/25 13:05:14 [ERR] driver.docker: failed to start container 57347baa7465137dc6aea463ae939539043852a0058e9e642f2daf360128a77a: API error (500): Cannot start container 57347baa7465137dc6aea463ae939539043852a0058e9e642f2daf360128a77a: [8] System error: write /sys/fs/cgroup/devices/system.slice/docker-57347baa7465137dc6aea463ae939539043852a0058e9e642f2daf360128a77a.scope/cgroup.procs: no such device
2015/11/25 13:05:14 [ERR] client: failed to start task 'consul-agent' for alloc '519193f5-6744-f3e7-3fd2-cb4a06ec4cb6': Failed to start container 57347baa7465137dc6aea463ae939539043852a0058e9e642f2daf360128a77a: API error (500): Cannot start container 57347baa7465137dc6aea463ae939539043852a0058e9e642f2daf360128a77a: [8] System error: write /sys/fs/cgroup/devices/system.slice/docker-57347baa7465137dc6aea463ae939539043852a0058e9e642f2daf360128a77a.scope/cgroup.procs: no such device
2015/11/25 13:05:46 [ERR] http: Request /v1/allocation/consul-client, error: rpc error: alloc lookup failed: index error: UUID must be 36 characters
I'm still digging into it but just wanted to echo the need for a recommended way to run consul on nomad.
FWIW, I have avoided these issues by making the consul network/service the thing that is set up and completed first, and which nomad then uses. CM (configuration management) manages consul and nomad on all nodes, and init for each node works out the details of forming/joining a cluster with consensus and a leader. No chicken/egg issues here.
Has any progress been made to get consul running on nomad? While it's possible to run consul by itself alongside nomad, it poses problems (this is assuming we only have 1 datacenter and want to have 3 consul servers, with the rest of the nodes running the consul client).
@F21 You can definitely run Consul servers with Nomad. What is not possible today is to run _both_ the Consul servers (using the service scheduler) and clients (using the system scheduler) via Nomad. The reason is that the system scheduler currently schedules system jobs on all the machines in a Nomad cluster; there is currently no way for the system scheduler to exclude running the Consul clients on the machines where Nomad is running the Consul servers. And on the same machine, the client and server can't run simultaneously because of port collisions.
But if you just want to run the Consul servers on Nomad it's definitely possible and as you said, you could use Atlas's auto-join functionality to have the Consul servers find each other when they are dynamically scheduled by Nomad.
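A minimal sketch of what that could look like, assuming a 3-server quorum, the f21global/consul image from earlier, and Atlas auto-join (the Atlas organization name and token are placeholders; ports elided for brevity):

```hcl
# Sketch: consul servers as a service job; assumes a 3-node quorum
# and Atlas auto-join so the servers can find each other after placement.
job "consul-servers" {
  region      = "global"
  datacenters = ["dc1"]
  type        = "service" # service scheduler, not system

  group "consul-server" {
    count = 3

    # Keep the three servers on distinct machines
    constraint {
      distinct_hosts = true
    }

    task "consul-server" {
      driver = "docker"

      config {
        image        = "f21global/consul"
        network_mode = "host"
        args = ["agent", "-server", "-bootstrap-expect", "3",
                "-data-dir", "/tmp/consul",
                "-atlas", "myorg/infra", "-atlas-join"]
      }

      resources {
        cpu    = 500
        memory = 64
      }
    }
  }
}
```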
@diptanu What are the nomad team's plans to fully bring consul scheduling to nomad?
In terms of the service scheduler colliding with the system scheduler, maybe a key called service_key (or something more suitable) could be added to the task definitions. We could then set the service_key for both the system task and the service task to something like consul. Nomad would then give the service task priority when scheduling. If a node running a service task goes offline, nomad could evict the system task and replace it with a service task so that the count is maintained.
Or maybe the metadata feature could be improved so that it's exposed at the node level. For example, a task could add a piece of metadata setting service-type=consul-server, and the system scheduler could be constrained to not schedule on nodes where that metadata exists. However, a priority system would still be required so that the service tasks take precedence.
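If node-level metadata were exposed that way, the exclusion could be written as an ordinary constraint. Everything below is hypothetical, in particular the consul_server meta key, which Nomad does not provide today:

```hcl
group "consul-agent" {
  # Hypothetical: assumes nodes running the consul server expose
  # meta "consul_server" = "true" for the scheduler to match against.
  constraint {
    attribute = "${meta.consul_server}"
    operator  = "!="
    value     = "true"
  }

  # consul-agent task definition as before
}
```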
I've recently built a bunch of docker images to run HDFS, having this feature would also be very useful. For example, I want to run the namenodes on distinct nodes. I also want to use the system scheduler to schedule datanodes on all nodes in the cluster except for nodes where the namenodes are running.
Another possible way to deal with consul discovery without using Atlas would be to exploit the all_at_once feature when scheduling the tasks. Since the scheduler knows atomically which nodes the tasks will run on, it could pass an array (maybe as JSON) of the ip addresses of those nodes as an environment variable to the task. I think this would deal with the issue of consul servers discovering each other. However, we would also need some way to pass that information to the system task so that the consul clients can connect to the servers.
Has any progress been made to run both consul agents and servers on nomad?
@F21, from what I can tell, it's not impossible, it just takes work. With that said, I have had a _lot_ of success with consul as the primary core service that runs outside nomad, and I would recommend considering this route too.
@ketzacoatl That's what I am currently doing with a virtualized test cluster. However, if you have say 3 nodes running nomad servers and consul servers, how are you recovering if 1 of those nodes goes down or experiences a hardware failure?
I have my nomad servers and consul leaders running together on one auto-scaling group, 3 to 5 servers. If one node goes down, AWS follows the auto-scaling group setup and creates a replacement for the node(s) that are not present.
Ah, that makes sense. I am not using AWS but will probably be running on a set of dedicated servers and a public cloud provider without auto-scaling, so a machine going down will need manual intervention.
@ketzacoatl How are your ASG instances joining the cluster when started? That is, how do you "know" the other servers to join to?
@memelet, for the consul leaders themselves, I use some AWS hackery. Terraform creates the ASG and puts it in the smallest subnet possible (limiting the IP range). We have 2 AZs for failover on the ASG, so there are two subnets, and the list of "possible IPs" is computed (eg, those that the leaders might _actually_ have, but we don't know, because it's an ASG), and that list is used to create a DNS entry for "all leader IPs". Note that the list of possible IPs is _huge_ (~25) compared to the number of leader nodes (3 - 5).
Consul agents can then be pointed at that DNS record for the leaders, and configured with retry_join, so they will _eventually_ find one of the "right" leader IPs and get connected to the network. The time it takes the agents to find the leaders depends on the retry_interval.
The first goal of the leaders is to get consul up, then nomad. Nomad relies on consul in my setup, and I think it's goofy to make Consul rely on Nomad (at least in my setup, because I use consul as part of "distributed Configuration Management" for the cluster, and it forms the foundation for the whole shebang - you gotta pick either the egg, the chicken, or the farmer in your story). The Nomad leaders/servers come online and then publish their "service" in the Consul catalog. The service check is simple, and so long as the service is running, the server is listed in consul, and that lets the nomad servers find each other for their quorum.
Even with the DNS hackery, this has worked _very_ reliably, albeit the agent nodes _can be_ a little slow to find the leaders and join. I plan on addressing this with some code on Lambda that updates the DNS record as the nodes in the leader ASG come and go.
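Concretely, the agent side of that setup reduces to pointing retry_join at the aggregate DNS record; the hostname and interval below are illustrative:

```hcl
config {
  image        = "f21global/consul"
  network_mode = "host"
  # "leaders.internal.example" stands in for the DNS record that
  # covers all possible leader IPs across the ASG subnets.
  args = ["agent", "-data-dir", "/tmp/consul",
          "-retry-join", "leaders.internal.example",
          "-retry-interval", "30s"]
}
```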
Why not run Consul with nomad by default? Every node running nomad could be running consul as well.
Please combine them into one binary, so we can use nomad to manage a cluster more easily.
Running the two on the same hosts is trivial. The documentation for each app is clear in how to configure and run the software. There is a learning curve to understanding all the details you need to master in order to be effective.
@ketzacoatl wow, thank you for describing your AWS hackery! Didn't think that we could simply brute force to find a leader in some subnet :+1:
You could also use lambda to update a DNS record when nodes in the ASG change - see https://objectpartners.com/2015/07/07/aws-tricks-updating-route53-dns-for-autoscalinggroup-using-lambda/ for an example.
Hey I am going to close this since we recommend running Consul outside of Nomad.