Nomad: Every Nomad executor queries all Consul services and health checks every 5 seconds.

Created on 16 Nov 2016 · 5 Comments · Source: hashicorp/nomad

Nomad version

Nomad v0.4.2

Operating system and Environment details

$ uname -a
Linux ~ 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.5 LTS"

The cluster is running on AWS EC2-VPC.

Issue

Every 4-5 seconds, each Nomad executor process queries Consul's services and health-checks endpoints with no filtering, fetching the full list of all services and health checks known to the Consul agent.

This creates an O(n²) scaling problem: n executors each repeatedly fetch information about all n services.
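For context, the polling pattern boils down to something like the following. This is a minimal illustrative sketch using the Consul Go API, not Nomad's actual executor code; the 5-second interval is simply what we observe on the wire:

package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Talk to the local Consul agent (defaults to 127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	agent := client.Agent()

	// Each executor effectively runs a loop like this: every tick it fetches
	// every service and every check known to the agent, even though it only
	// cares about its own handful of registrations.
	for range time.Tick(5 * time.Second) {
		services, err := agent.Services() // GET /v1/agent/services (no filtering)
		if err != nil {
			log.Println("services:", err)
			continue
		}
		checks, err := agent.Checks() // GET /v1/agent/checks (no filtering)
		if err != nil {
			log.Println("checks:", err)
			continue
		}
		log.Printf("fetched %d services and %d checks", len(services), len(checks))
	}
}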

Here's an example TCP dump session, watching requests from a unique client port:

$ sudo tcpdump -i lo -AAn port 8500 | egrep -A7 '\.22065 > .* \[P'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 65535 bytes
22:25:41.465102 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 1209640838:1209640955, ack 3740529286, win 2431, options [nop,nop,TS val 414530103 ecr 414528937], length 117
..............E...U.@[email protected]!4H.........        .(;.....
..:7..5.GET /v1/agent/services HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


--
22:25:41.481674 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 117:232, ack 104037, win 2431, options [nop,nop,TS val 414530107 ecr 414530104], length 115
..............E...U3@.@..~..g...g.V1!4H.........        .(9.....
..:;..:8GET /v1/agent/checks HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


--
22:25:46.286527 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 232:349, ack 105112, win 2431, options [nop,nop,TS val 414531308 ecr 414530107], length 117
..............E...U5@[email protected]!4H..n......        .(;.....
..>...:;GET /v1/agent/services HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


--
22:25:46.295171 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 349:464, ack 209148, win 2431, options [nop,nop,TS val 414531310 ecr 414531308], length 115
..............E...U:@[email protected]!4H.....#...        .(9.....
..>...>.GET /v1/agent/checks HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


--
22:25:51.038255 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 464:581, ack 210223, win 2431, options [nop,nop,TS val 414532496 ecr 414531310], length 117
..............E...U<@[email protected]!4H..V..'...        .(;.....
..C...>.GET /v1/agent/services HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


--
22:25:51.052174 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 581:696, ack 314259, win 2431, options [nop,nop,TS val 414532500 ecr 414532496], length 115
..............E...UA@[email protected]!4H.........        .(9.....
..C...C.GET /v1/agent/checks HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


^C7695 packets captured
15406 packets received by filter
0 packets dropped by kernel

The client port maps back to an executor:

$ sudo lsof -i -n -P | grep :22065
nomad    12083   root   14u  IPv4 4942373      0t0  TCP 172.16.103.191:22065->172.16.103.191:8500 (ESTABLISHED)
consul   16230   root   44u  IPv6 4940746      0t0  TCP 172.16.103.191:8500->172.16.103.191:22065 (ESTABLISHED)

$ ps aux | grep 12083 | grep -v grep
root     12083  0.3  0.0 559344 17888 ?        S<sl 21:34   0:15 /usr/local/bin/nomad executor /opt/nomad/mnt/9fb55daf-6d92-3dc9-6cda-d128db32673d/event_handler/event_handler-executor.out

I'm running a job with Count: 800 across 4 Nomad client machines, and the services response is fairly large (~100 KB):

$ curl -v -s -w 'Total content length: %{size_download}\n' -o /dev/null -H 'Accept-Encoding: gzip' http://localhost:8500/v1/agent/services
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8500 (#0)
> GET /v1/agent/services HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8500
> Accept: */*
> Accept-Encoding: gzip
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Tue, 15 Nov 2016 22:30:50 GMT
< Transfer-Encoding: chunked
<
{ [data not shown]
* Connection #0 to host localhost left intact
Total content length: 103906

The health checks response is very small, because I've disabled health checks on this large job while running the test. (Our initial assumption was that health checks were responsible for the CPU load and lo interface pressure.)

$ curl -v -s -w 'Total content length: %{size_download}\n' -o /dev/null -H 'Accept-Encoding: gzip' http://localhost:8500/v1/agent/checks
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8500 (#0)
> GET /v1/agent/checks HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8500
> Accept: */*
> Accept-Encoding: gzip
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Tue, 15 Nov 2016 22:47:58 GMT
< Content-Length: 966
<
{ [data not shown]
* Connection #0 to host localhost left intact
Total content length: 966

Observations:

  • The Accept-Encoding header is ignored (no compression on the response).
  • 800 Count / 4 clients = 200 tasks on this client:

    $ nomad node-status -self | grep running | wc -l
    202
    
  • 200 executors polling every 5 seconds = 40 /v1/agent/services requests per second

  • 40 requests per second * ~100 KB = ~4 MB/s
  • ~4 MB/s * 8 bits ≈ 32 Mbit/s over localhost (see the back-of-envelope sketch below)
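The same arithmetic, spelled out as a tiny Go sketch (the constants are simply the numbers from this report, used as assumptions; they are not read from Nomad or Consul):

package main

import "fmt"

func main() {
	// Back-of-envelope estimate of loopback traffic caused by the polling
	// behaviour described above. All constants are assumptions taken from
	// this report.
	const (
		executorsPerClient = 200      // tasks per Nomad client
		pollIntervalSec    = 5.0      // observed /v1/agent/services polling interval
		servicesRespBytes  = 103906.0 // measured with curl above (uncompressed)
	)

	reqPerSec := executorsPerClient / pollIntervalSec
	bytesPerSec := reqPerSec * servicesRespBytes
	fmt.Printf("%.0f req/s, %.1f MB/s, %.1f Mbit/s over lo\n",
		reqPerSec, bytesPerSec/1e6, bytesPerSec*8/1e6)
	// Prints: 40 req/s, 4.2 MB/s, 33.2 Mbit/s over lo
}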

Librato metrics agrees:

[Screenshot: Librato metrics, captured 2016-11-15 at 2:35 PM]

  • The 800 tasks themselves are idle (each runs a bash script: while true ; do sleep 60 ; done)
  • Nomad executors and Consul consume a steady 25% CPU (on each core):

    Tasks: 442; 2 running
    Load average: 1.97 1.95 2.69
    Uptime: 19 days, 04:58:27
    

Reproduction steps

  • Run any job with a high count, e.g. 200 or more executors per Nomad client. (Try with and without TCP/HTTP health checks.)
  • Observe CPU utilization from Nomad executors and Consul.
  • It appears the CPU is being consumed by retransmitting the same cacheable data at a high frequency.
  • Adding health checks to the job spec makes the problem worse, because the periodic fetches request all health checks in addition to all services. In our tests, lo traffic increases from ~36 Mbit/s to ~54 Mbit/s when health checks are included.

Nomad Server logs (if appropriate)

N/A

Nomad Client logs (if appropriate)

N/A

Job file (if appropriate)

{
    "Job": {
        "Region": "global",
        "ID": "foobar",
        "Name": "foobar",
        "Type": "service",
        "Priority": 50,
        "AllAtOnce": false,
        "Datacenters": [
            "my-dc"
        ],
        "Constraints": null,
        "TaskGroups": [
            {
                "Name": "event_handler_1",
                "Count": 800,
                "Constraints": null,
                "Tasks": [
                    {
                        "Name": "event_handler",
                        "Driver": "exec",
                        "User": "",
                        "Config": {
                            "args": [
                                "-c",
                                "while true ; do sleep 60 ; done"
                            ],
                            "command": "/bin/bash"
                        },
                        "Constraints": null,
                        "Env": null,
                        "Services": [
                            {
                                "Id": "",
                                "Name": "foobar",
                                "Tags": [],
                                "PortLabel": "http",
                                "Checks": null
                            }
                        ],
                        "Resources": {
                            "CPU": 30,
                            "MemoryMB": 100,
                            "DiskMB": 100,
                            "IOPS": 0,
                            "Networks": [
                                {
                                    "Public": false,
                                    "CIDR": "",
                                    "ReservedPorts": null,
                                    "DynamicPorts": [
                                        {
                                            "Label": "http",
                                            "Value": 0
                                        }
                                    ],
                                    "IP": "",
                                    "MBits": 10
                                }
                            ]
                        },
                        "Meta": null,
                        "KillTimeout": 5000000000,
                        "LogConfig": {
                            "MaxFiles": 5,
                            "MaxFileSizeMB": 10
                        },
                        "Artifacts": null
                    }
                ],
                "RestartPolicy": {
                    "Interval": 60000000000,
                    "Attempts": 5,
                    "Delay": 5000000000,
                    "Mode": "delay"
                },
                "Meta": null
            }
        ],
        "Update": {
            "Stagger": 0,
            "MaxParallel": 0
        },
        "Periodic": null,
        "Meta": null
    }
}
Labels: theme/client, theme/performance, type/bug

All 5 comments

Definitely needs improving. _Might_ be related to #1968 - still investigating that one.

@schmichael Have you made progress on this? Pinging because we have Nomad clients in production pushing 167 Mbps over lo, with only 0.98 Mbps over eth0. In other words, this bug is generating roughly 160x more traffic than the tasks we're running in Nomad. The real cost here is CPU, since lo isn't I/O-bound.

@parasyte The good news is: yes! Much progress has been made! The bad news is that it won't land in time for 0.5.5 (which should be out any day now).

Definitely going to land in the next release though. Unless you use a lot of script checks, it should dramatically reduce Consul chatter. Since script checks have to heartbeat to Consul, there's no getting around that traffic, but at least you can tune how often those run yourself (and the traffic should always cost far less CPU than exec'ing a script anyway).

I'll try to post binaries here when the PR is up if anyone wants to test. If I get it merged soon enough after 0.5.5, you may just want to run it, since there's really no workaround for the current Consul implementation's excessively chatty nature.

(If you care about the implementation details: the existing implementation runs a periodic per-task reconcile loop against Consul. It's far more defensive in approach than is needed, and so I'm moving to a more straightforward talk-to-Consul-only-on-state-change approach.)
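To make that contrast concrete, here's a rough sketch of what an only-on-state-change sync could look like. This is an illustration with assumed type and field names, not the actual change: the idea is to diff the desired registrations against what was last pushed and call Consul only for the delta.

package consulsync

import "github.com/hashicorp/consul/api"

// Syncer pushes service registrations to Consul only when the desired
// state changes, instead of re-listing everything on a fixed interval.
// Illustrative sketch only; type and field names are assumptions.
type Syncer struct {
	agent      *api.Agent
	registered map[string]*api.AgentServiceRegistration // what we last told Consul
}

// SetDesired diffs the desired registrations against the last pushed set
// and issues register/deregister calls only for the difference.
func (s *Syncer) SetDesired(desired map[string]*api.AgentServiceRegistration) error {
	// Deregister services that are no longer desired.
	for id := range s.registered {
		if _, ok := desired[id]; !ok {
			if err := s.agent.ServiceDeregister(id); err != nil {
				return err
			}
			delete(s.registered, id)
		}
	}
	// Register services that are newly desired.
	for id, reg := range desired {
		if _, ok := s.registered[id]; !ok {
			if err := s.agent.ServiceRegister(reg); err != nil {
				return err
			}
			s.registered[id] = reg
		}
	}
	return nil
}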

Awesome!!! This sounds great, actually. I'm really looking forward to it. The description you provided matches my observations. Thanks for the update!

Fixed in master by #2467 -- will be in the 0.6.0 release. Actually demo'd this particular fix internally at HashiCorp! For 250 tasks with checks running on a single node, loopback traffic went from 60 Mbps in 0.5.6 to 60 Kbps.

CPU usage was significantly lower, but still not as low as we'd like. Some quick profiling revealed task stats collection as the next most likely CPU hog on idle systems.

If you have the time/ability/desire to test the Consul fix I've attached a binary here:

https://github.com/hashicorp/nomad/issues/2474#issuecomment-295502874
