Nomad v0.4.2
$ uname -a
Linux ~ 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.5 LTS"
The cluster is running on AWS EC2-VPC.
Every 4-5 seconds, each Nomad executor process queries Consul's services endpoint and health checks endpoint with no filtering, fetching the full list of all services and health checks known to the local Consul agent.
This creates a scaling problem: n executors each repeatedly fetch information about all n services, so the total work grows roughly quadratically with the number of tasks.
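For reference, the two unfiltered endpoints the executors hit correspond to these calls in the Consul Go API. This is a rough sketch of the polling pattern, not Nomad's actual code; the 5-second ticker is just an approximation of the observed interval.

package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent (defaults to 127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	agent := client.Agent()

	// Roughly what each executor does: poll the unfiltered agent
	// endpoints on a fixed interval (the interval is an assumption).
	for range time.Tick(5 * time.Second) {
		// GET /v1/agent/services -- every service on the agent.
		services, err := agent.Services()
		if err != nil {
			log.Println("services:", err)
			continue
		}
		// GET /v1/agent/checks -- every check on the agent.
		checks, err := agent.Checks()
		if err != nil {
			log.Println("checks:", err)
			continue
		}
		log.Printf("fetched %d services and %d checks", len(services), len(checks))
	}
}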
Here's an example tcpdump session, watching requests from a single client port:
$ sudo tcpdump -i lo -AAn port 8500 | egrep -A7 '\.22065 > .* \[P'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 65535 bytes
22:25:41.465102 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 1209640838:1209640955, ack 3740529286, win 2431, options [nop,nop,TS val 414530103 ecr 414528937], length 117
..............E...U.@[email protected]!4H......... .(;.....
..:7..5.GET /v1/agent/services HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip
--
22:25:41.481674 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 117:232, ack 104037, win 2431, options [nop,nop,TS val 414530107 ecr 414530104], length 115
..............E...U3@.@..~..g...g.V1!4H......... .(9.....
..:;..:8GET /v1/agent/checks HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip
--
22:25:46.286527 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 232:349, ack 105112, win 2431, options [nop,nop,TS val 414531308 ecr 414530107], length 117
..............E...U5@[email protected]!4H..n...... .(;.....
..>...:;GET /v1/agent/services HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip
--
22:25:46.295171 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 349:464, ack 209148, win 2431, options [nop,nop,TS val 414531310 ecr 414531308], length 115
..............E...U:@[email protected]!4H.....#... .(9.....
..>...>.GET /v1/agent/checks HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip
--
22:25:51.038255 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 464:581, ack 210223, win 2431, options [nop,nop,TS val 414532496 ecr 414531310], length 117
..............E...U<@[email protected]!4H..V..'... .(;.....
..C...>.GET /v1/agent/services HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip
--
22:25:51.052174 IP 172.16.103.191.22065 > 172.16.103.191.8500: Flags [P.], seq 581:696, ack 314259, win 2431, options [nop,nop,TS val 414532500 ecr 414532496], length 115
..............E...UA@[email protected]!4H......... .(9.....
..C...C.GET /v1/agent/checks HTTP/1.1
Host: 172.16.103.191:8500
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip
^C7695 packets captured
15406 packets received by filter
0 packets dropped by kernel
The client port maps back to an executor:
$ sudo lsof -i -n -P | grep :22065
nomad 12083 root 14u IPv4 4942373 0t0 TCP 172.16.103.191:22065->172.16.103.191:8500 (ESTABLISHED)
consul 16230 root 44u IPv6 4940746 0t0 TCP 172.16.103.191:8500->172.16.103.191:22065 (ESTABLISHED)
$ ps aux | grep 12083 | grep -v grep
root 12083 0.3 0.0 559344 17888 ? S<sl 21:34 0:15 /usr/local/bin/nomad executor /opt/nomad/mnt/9fb55daf-6d92-3dc9-6cda-d128db32673d/event_handler/event_handler-executor.out
I'm running a job with Count: 800 across 4 Nomad client machines, and the services response is fairly large:
$ curl -v -s -w 'Total content length: %{size_download}\n' -o /dev/null -H 'Accept-Encoding: gzip' http://localhost:8500/v1/agent/services
* Hostname was NOT found in DNS cache
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8500 (#0)
> GET /v1/agent/services HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8500
> Accept: */*
> Accept-Encoding: gzip
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Tue, 15 Nov 2016 22:30:50 GMT
< Transfer-Encoding: chunked
<
{ [data not shown]
* Connection #0 to host localhost left intact
Total content length: 103906
The health checks response is very small because I've disabled health checks on this large job while running the test. (Our initial assumption was that health checks were responsible for the CPU load and the pressure on the lo interface.)
$ curl -v -s -w 'Total content length: %{size_download}\n' -o /dev/null -H 'Accept-Encoding: gzip' http://localhost:8500/v1/agent/checks
* Hostname was NOT found in DNS cache
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8500 (#0)
> GET /v1/agent/checks HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8500
> Accept: */*
> Accept-Encoding: gzip
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Tue, 15 Nov 2016 22:47:58 GMT
< Content-Length: 966
<
{ [data not shown]
* Connection #0 to host localhost left intact
Total content length: 966
Observations:

- The Accept-Encoding header is ignored (no compression on the response).

- 800 Count / 4 clients = 200 allocations on localhost:

  $ nomad node-status -self | grep running | wc -l
  202

- 200 executors / 5 seconds = 40 requests per second.

- 40 requests per second * 100 KB = 4 MB/s; 4 MB/s * 8 bits = 32 Mbit/s over localhost. Librato metrics agree:

  [Librato chart of lo traffic]

- Nomad executors and Consul consume a steady 25% CPU (on each core):

  Tasks: 442; 2 running
  Load average: 1.97 1.95 2.69
  Uptime: 19 days, 04:58:27

- lo traffic increases from ~36 Mbit/s to ~54 Mbit/s when health checks are included.

Nomad Server logs: N/A

Nomad Client logs: N/A

Job file:
{
"Job": {
"Region": "global",
"ID": "foobar",
"Name": "foobar",
"Type": "service",
"Priority": 50,
"AllAtOnce": false,
"Datacenters": [
"my-dc"
],
"Constraints": null,
"TaskGroups": [
{
"Name": "event_handler_1",
"Count": 800,
"Constraints": null,
"Tasks": [
{
"Name": "event_handler",
"Driver": "exec",
"User": "",
"Config": {
"args": [
"-c",
"while true ; do sleep 60 ; done"
],
"command": "/bin/bash"
},
"Constraints": null,
"Env": null,
"Services": [
{
"Id": "",
"Name": "foobar",
"Tags": [],
"PortLabel": "http",
"Checks": null
}
],
"Resources": {
"CPU": 30,
"MemoryMB": 100,
"DiskMB": 100,
"IOPS": 0,
"Networks": [
{
"Public": false,
"CIDR": "",
"ReservedPorts": null,
"DynamicPorts": [
{
"Label": "http",
"Value": 0
}
],
"IP": "",
"MBits": 10
}
]
},
"Meta": null,
"KillTimeout": 5000000000,
"LogConfig": {
"MaxFiles": 5,
"MaxFileSizeMB": 10
},
"Artifacts": null
}
],
"RestartPolicy": {
"Interval": 60000000000,
"Attempts": 5,
"Delay": 5000000000,
"Mode": "delay"
},
"Meta": null
}
],
"Update": {
"Stagger": 0,
"MaxParallel": 0
},
"Periodic": null,
"Meta": null
}
}
Definitely needs improving. _Might_ be related to #1968 - still investigating that one.
@schmichael Have you made progress on this? Pinging because we have Nomad clients in production pushing 167 Mbps over lo but only 0.98 Mbps over eth0. In other words, this bug is generating roughly 160x more traffic than the tasks we're actually running in Nomad. The real cost here is CPU, since lo isn't I/O-bound.
@parasyte The good news is: yes! Much progress has been made! The bad news is that it won't land in time for 0.5.5 (which should be out any day now).
Definitely going to land in the next release though. Unless you use a lot of script checks, it should dramatically reduce Consul chatter. Since script checks have to heartbeat to Consul, there's no getting around that traffic, but at least you can tune how often those run yourself (and the traffic should always incur far less CPU than exec'ing a script anyway).
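(For context, a heartbeat to a Consul TTL-style check via the agent API looks roughly like the sketch below. Illustrative only; the check ID is made up and this is not Nomad's code.)

package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	// "my-script-check" is a hypothetical check ID already registered with a TTL;
	// another update must arrive before the TTL expires or the check goes critical.
	if err := client.Agent().UpdateTTL("my-script-check", "script exited 0", api.HealthPassing); err != nil {
		log.Println("ttl update:", err)
	}
}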
I'll try to post binaries here when the PR is up if anyone wants to test. If it gets merged soon enough after 0.5.5, you may just want to run it, since there's really no workaround for the current Consul implementation's excessively chatty nature.
(If you care about the implementation details: the existing implementation runs a periodic per-task reconcile loop against Consul. It's far more defensive in approach than is needed, and so I'm moving to a more straightforward talk-to-Consul-only-on-state-change approach.)
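(Very rough sketch of the difference below, with made-up types and bookkeeping; not the actual Nomad or Consul client code.)

package sketch

import (
	"time"

	"github.com/hashicorp/consul/api"
)

// TaskEvent is a hypothetical stand-in for a local task state change.
type TaskEvent struct {
	Started      bool
	ServiceID    string
	Registration *api.AgentServiceRegistration
}

// reconcileLoop mimics the old behaviour: re-fetch everything from Consul on a
// timer and re-register anything that looks missing.
func reconcileLoop(agent *api.Agent, desired map[string]*api.AgentServiceRegistration) {
	for range time.Tick(5 * time.Second) {
		actual, err := agent.Services() // full, unfiltered fetch every tick
		if err != nil {
			continue
		}
		for id, reg := range desired {
			if _, ok := actual[id]; !ok {
				_ = agent.ServiceRegister(reg)
			}
		}
	}
}

// syncOnChange mimics the new behaviour: only talk to Consul when a task's
// state actually changes.
func syncOnChange(agent *api.Agent, events <-chan TaskEvent) {
	for ev := range events {
		if ev.Started {
			_ = agent.ServiceRegister(ev.Registration)
		} else {
			_ = agent.ServiceDeregister(ev.ServiceID)
		}
	}
}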
Awesome!!! This sounds great, actually. I'm really looking forward to it. The description you provided matches my observations. Thanks for the update!
Fixed in master by #2467 -- will be in the 0.6.0 release. We actually demo'd this particular fix internally at HashiCorp! For 250 tasks with checks running on a single node, loopback traffic went from 60 Mbps in 0.5.6 to 60 Kbps.
CPU usage was significantly lower, but still not as low as we'd like. Some quick profiling revealed task stats collection as the next most likely CPU hog on idle systems.
If you have the time/ability/desire to test the Consul fix I've attached a binary here:
https://github.com/hashicorp/nomad/issues/2474#issuecomment-295502874