Consul: Node health flapping - EC2

Created on 1 Sep 2015 · 107 comments · Source: hashicorp/consul

We have a five-node Consul cluster handling roughly 30 nodes across 4 different AWS accounts in a shared VPC, spanning different availability zones. For the most part, everything works great. However, quite frequently, a random node will flap from healthy to critical. The flapping happens on completely random nodes, with no consistency whatsoever.

Every time a node "flaps", it causes consul-template, which populates our NGINX reverse-proxy config, to reload NGINX. This causes things like our Apache benchmark tests to fail.

We are looking to use Consul for production, but this issue has caused a lot of people to worry about consistency.

We also have all required TCP/UDP ports open between all the nodes.

We believe the issue is just a latency problem with Serf's probing. Is there a way to modify the Serf health-check interval to account for geographical latency?

Here's the log from one of the Consul servers:

    2015/09/01 17:46:13 [INFO] serf: EventMemberJoin: ip-10-170-15-71 10.170.15.71
    2015/09/01 17:46:13 [INFO] memberlist: Marking ip-10-190-71-44 as failed, suspect timeout reached
    2015/09/01 17:46:13 [INFO] serf: EventMemberFailed: ip-10-190-71-44 10.190.71.44
    2015/09/01 17:46:15 [INFO] memberlist: Marking ip-10-190-82-4 as failed, suspect timeout reached
    2015/09/01 17:46:15 [INFO] serf: EventMemberFailed: ip-10-190-82-4 10.190.82.4
    2015/09/01 17:46:16 [INFO] serf: EventMemberJoin: ip-10-190-82-4 10.190.82.4
    2015/09/01 17:46:26 [INFO] serf: EventMemberFailed: ip-10-170-15-71 10.170.15.71
    2015/09/01 17:46:32 [INFO] serf: EventMemberJoin: ip-10-190-71-44 10.190.71.44
    2015/09/01 17:47:05 [INFO] serf: EventMemberFailed: ip-10-170-138-228 10.170.138.228
    2015/09/01 17:47:19 [INFO] memberlist: Marking ip-10-170-155-168 as failed, suspect timeout reached
    2015/09/01 17:47:19 [INFO] serf: EventMemberFailed: ip-10-170-155-168 10.170.155.168
    2015/09/01 17:47:23 [INFO] serf: EventMemberJoin: ip-10-170-15-71 10.170.15.71
    2015/09/01 17:47:44 [INFO] memberlist: Marking ip-10-190-71-44 as failed, suspect timeout reached
    2015/09/01 17:47:44 [INFO] serf: EventMemberFailed: ip-10-190-71-44 10.190.71.44
    2015/09/01 17:47:45 [INFO] serf: EventMemberJoin: ip-10-170-155-168 10.170.155.168
    2015/09/01 17:47:45 [INFO] serf: EventMemberJoin: ip-10-190-71-44 10.190.71.44
    2015/09/01 17:47:49 [INFO] memberlist: Marking ip-10-170-76-170 as failed, suspect timeout reached
    2015/09/01 17:47:49 [INFO] serf: EventMemberFailed: ip-10-170-76-170 10.170.76.170
    2015/09/01 17:47:50 [INFO] serf: EventMemberJoin: ip-10-170-76-170 10.170.76.170
    2015/09/01 17:47:50 [INFO] serf: EventMemberJoin: ip-10-170-138-228 10.170.138.228
    2015/09/01 17:48:00 [INFO] memberlist: Marking ip-10-170-155-168 as failed, suspect timeout reached
    2015/09/01 17:48:00 [INFO] serf: EventMemberFailed: ip-10-170-155-168 10.170.155.168
    2015/09/01 17:48:02 [INFO] serf: EventMemberFailed: ip-10-185-23-211 10.185.23.211
    2015/09/01 17:48:16 [INFO] serf: EventMemberJoin: ip-10-185-23-211 10.185.23.211
    2015/09/01 17:48:32 [INFO] memberlist: Marking ip-10-170-15-71 as failed, suspect timeout reached
    2015/09/01 17:48:32 [INFO] serf: EventMemberFailed: ip-10-170-15-71 10.170.15.71
    2015/09/01 17:48:33 [INFO] serf: EventMemberJoin: ip-10-170-15-71 10.170.15.71
    2015/09/01 17:48:45 [INFO] serf: EventMemberFailed: ip-10-185-23-210 10.185.23.210
    2015/09/01 17:48:46 [INFO] serf: EventMemberJoin: ip-10-185-23-210 10.185.23.210
    2015/09/01 17:48:55 [INFO] serf: EventMemberJoin: ip-10-170-155-168 10.170.155.168
    2015/09/01 17:49:00 [INFO] memberlist: Marking ip-10-190-82-4 as failed, suspect timeout reached
    2015/09/01 17:49:00 [INFO] serf: EventMemberFailed: ip-10-190-82-4 10.190.82.4
    2015/09/01 17:49:20 [INFO] serf: EventMemberJoin: ip-10-190-82-4 10.190.82.4
    2015/09/01 17:49:32 [INFO] memberlist: Marking ip-10-190-82-4 as failed, suspect timeout reached
    2015/09/01 17:49:32 [INFO] serf: EventMemberFailed: ip-10-190-82-4 10.190.82.4
    2015/09/01 17:49:38 [INFO] serf: EventMemberFailed: ip-10-170-155-168 10.170.155.168
    2015/09/01 17:49:38 [INFO] serf: EventMemberJoin: ip-10-170-155-168 10.170.155.168
    2015/09/01 17:49:40 [INFO] serf: EventMemberJoin: ip-10-190-82-4 10.190.82.4
    2015/09/01 17:49:51 [INFO] memberlist: Marking ip-10-190-82-4 as failed, suspect timeout reached
    2015/09/01 17:49:51 [INFO] serf: EventMemberFailed: ip-10-190-82-4 10.190.82.4
    2015/09/01 17:49:52 [INFO] serf: EventMemberJoin: ip-10-190-82-4 10.190.82.4
    2015/09/01 17:49:56 [INFO] serf: EventMemberFailed: ip-10-185-15-217 10.185.15.217
    2015/09/01 17:49:56 [INFO] serf: EventMemberJoin: ip-10-185-15-217 10.185.15.217
    2015/09/01 17:50:04 [INFO] memberlist: Marking ip-10-190-13-188 as failed, suspect timeout reached
    2015/09/01 17:50:04 [INFO] serf: EventMemberFailed: ip-10-190-13-188 10.190.13.188
    2015/09/01 17:50:05 [INFO] serf: EventMemberJoin: ip-10-190-13-188 10.190.13.188
    2015/09/01 17:50:20 [INFO] serf: EventMemberFailed: ip-10-185-77-94 10.185.77.94
    2015/09/01 17:50:24 [INFO] memberlist: Marking ip-10-185-65-7 as failed, suspect timeout reached
    2015/09/01 17:50:24 [INFO] serf: EventMemberFailed: ip-10-185-65-7 10.185.65.7
    2015/09/01 17:50:31 [INFO] serf: EventMemberJoin: ip-10-185-77-94 10.185.77.94
    2015/09/01 17:50:47 [INFO] serf: EventMemberJoin: ip-10-185-65-7 10.185.65.7
    2015/09/01 17:51:01 [INFO] serf: EventMemberFailed: ip-10-170-15-71 10.170.15.71
    2015/09/01 17:51:02 [INFO] serf: EventMemberJoin: ip-10-170-15-71 10.170.15.71
    2015/09/01 17:51:09 [INFO] memberlist: Marking ip-10-190-82-4 as failed, suspect timeout reached
    2015/09/01 17:51:09 [INFO] serf: EventMemberFailed: ip-10-190-82-4 10.190.82.4
    2015/09/01 17:51:43 [INFO] serf: EventMemberJoin: ip-10-190-82-4 10.190.82.4
    2015/09/01 17:51:45 [INFO] memberlist: Marking ip-10-170-15-71 as failed, suspect timeout reached
    2015/09/01 17:51:45 [INFO] serf: EventMemberFailed: ip-10-170-15-71 10.170.15.71
    2015/09/01 17:51:45 [INFO] serf: EventMemberJoin: ip-10-170-15-71 10.170.15.71
    2015/09/01 17:52:22 [INFO] memberlist: Marking ip-10-190-82-4 as failed, suspect timeout reached
    2015/09/01 17:52:22 [INFO] serf: EventMemberFailed: ip-10-190-82-4 10.190.82.4
    2015/09/01 17:52:30 [INFO] serf: EventMemberJoin: ip-10-190-82-4 10.190.82.4

All 107 comments

Those parameters are not tunable via configuration, but it would be simple to patch and build Consul with a different set of values. Do you know the approximate round-trip times between your different AZs? Also, is there anything interesting in the logs on the node side during one of these flapping events?
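
For anyone wanting to try that patch route: the values in question live on hashicorp/memberlist's Config struct, which Consul builds from DefaultLANConfig(). Below is a minimal sketch of the kind of change a custom build would make, assuming you vendor memberlist; the numbers are illustrative, not recommendations.

    package main

    import (
        "fmt"
        "time"

        "github.com/hashicorp/memberlist"
    )

    func main() {
        conf := memberlist.DefaultLANConfig()
        // Stock LAN defaults are roughly ProbeInterval 1s, ProbeTimeout 500ms,
        // SuspicionMult 4. Relaxing them trades failure-detection speed for
        // stability on higher-latency or lossier networks.
        conf.ProbeTimeout = 1 * time.Second  // tolerate higher cross-AZ RTT
        conf.ProbeInterval = 2 * time.Second // probe less aggressively
        conf.SuspicionMult = 6               // give suspects longer to refute
        fmt.Printf("probe=%v timeout=%v suspicion=x%d\n",
            conf.ProbeInterval, conf.ProbeTimeout, conf.SuspicionMult)
    }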

@slackpad Sorry for the late response! Ping times between the three AZs in us-west-2 are roughly 1.3ms.

Nothing interesting is happening in the logs or being reported in Sysdig. Our nodes are pretty under-utilized right now; the highest traffic belongs to Consul and consul-template, with an average rate of about 7 KiB/s, spiking to 20 KiB/s roughly every 30 seconds. Our nodes report an average rate of 7 KiB/s overall, so yeah, pretty much all Consul traffic.

We have this problem too. At first it worked fine with 8 nodes, but now we have 12 nodes (and will add more), and we see node health flapping all the time. It happens randomly, with a random node, every few seconds. Errors like the ones below repeat with a random node each time:

    2015/09/11 13:31:59 [INFO] memberlist: Suspect pre-service-03 has failed, no acks received
    2015/09/11 13:31:57 [WARN] memberlist: Refuting a suspect message (from: prod-service-07)

FYI, we run Consul in Docker containers. The hosts run on EC2 in a VPC on the same subnet, and ping works fine at below 2ms without any loss.

We have this problem too, but we're running with a slightly larger cluster than those in this thread. We're running with approximately 160 nodes in our production cluster and I'm testing this problem in a cluster with a little over 60 healthy nodes. Both of them have this flapping issue.

I enabled metrics in our testing environment and have been analyzing them to try to find some kind of pattern. I first thought maybe the probes were regularly taking 400ms and that some just happened to take slightly longer, hit the 500ms mark, and failed. But the mean time for probing nodes is 5-25ms with just an occasional outlier, and the standard deviation has a max of 100ms in the metrics I've taken over the last day.

The curious part is that while the times are very low for probing a node, the sum indicates that 1-2 probes fail per minute: it hops between a little above 500ms and a little above 1000ms, which seems to indicate 1-2 failed probes, presumably because each failed probe contributes its full 500ms timeout to the sum.

I tried checking the Serf queues to see if they were backed up. The metrics that Serf reports for intents, events, and queries seem to be consistently zero. I have no idea if these queues are the same as the memberlist queue, though, and I don't know if they have anything to do with acks. A cursory look at the code seems to indicate that these queues shouldn't affect acks: https://github.com/hashicorp/memberlist/blob/master/net.go#L234

At this point, I'm confused about why this is happening. I know EC2's network isn't the best, but failures this frequent don't seem to happen anywhere else that I'm aware of. I have already checked all the security groups, and we're operating inside a VPC. I see traffic traveling over both TCP and UDP, so I know it's not a configuration issue at this point.

A correction to the above. There appeared to be a few nodes that had UDP blocked in our testing environment. I've confirmed we no longer have those and am gathering metrics again.

I think we still have this problem, though, as our production environment doesn't have those ports blocked and we still regularly get nodes failing.

So @jsternberg, just to clarify: you're still having the flapping problem, but you have solved the bad metrics issue you were having?

@djenriquez yep. I'm now attempting to get more data to try to find some kind of root cause. I'll keep running in this state and should be able to confirm what is happening after monitoring the metrics for a couple of days. I already see that one probe failed, but no node was declared dead.

Since this is such a common issue, it may be worth adding some additional logging for when probes fail or a test command for environment validation.

> Since this is such a common issue, it may be worth adding some additional logging for when probes fail or a test command for environment validation.

This is definitely a good idea and we've been talking about this internally as well. Given the randomized nature of the pings, and the fact that it will try indirect pings via other nodes, Serf/Consul can paper over a lot of different types of failures and configuration issues in ways that can be confusing. In Consul 0.6 we have added a TCP fallback ping which helps keep the cluster stable while providing some log messages about a possible misconfiguration ("node X was reachable via TCP but not UDP").
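
In that spirit, here is a rough sketch of what a standalone environment-validation check could look like, assuming Consul's default LAN gossip port of 8301; this is illustrative, not an actual Consul command:

    package main

    import (
        "fmt"
        "net"
        "os"
        "time"
    )

    // Usage: go run probe.go <host>
    // Checks TCP reachability of the Serf LAN port, then fires a UDP packet at it.
    func main() {
        addr := net.JoinHostPort(os.Args[1], "8301")

        // TCP: a completed handshake proves the port is reachable.
        if c, err := net.DialTimeout("tcp", addr, 2*time.Second); err != nil {
            fmt.Println("tcp 8301: FAIL:", err)
        } else {
            fmt.Println("tcp 8301: ok")
            c.Close()
        }

        // UDP is connectionless: a successful write only proves the packet left
        // this host, and an error (often a surfaced ICMP unreachable) is the only
        // negative signal available without a real ack protocol on top.
        c, err := net.Dial("udp", addr)
        if err == nil {
            _, err = c.Write([]byte{0})
            c.Close()
        }
        if err != nil {
            fmt.Println("udp 8301: FAIL:", err)
        } else {
            fmt.Println("udp 8301: packet sent (the path may still silently drop it)")
        }
    }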

I forgot to ask, @djenriquez are you running Consul inside Docker containers as well?

@slackpad Yes sir.

This is a known problem with Docker. Until it is fixed, please see my workaround here:

https://github.com/docker/docker/issues/8795#issuecomment-139553386

@winggundamth I do not believe that this is the same issue. I am actually very familiar with the conntrack fix and have run into that problem with Consul in the past. The UDP issue fixed by conntrack fails much more consistently than the flapping problem we are having here.

The flapping we are seeing is roughly 30-90 seconds of downtime every few hours; the nodes are up 90-95% of the time. But as you increase the number of nodes, the cluster will see failures more often, because the chance of at least one node being inside that 5-10% failure window increases.
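
To put numbers on that, a quick back-of-the-envelope sketch, treating nodes as independent and taking the 5% figure above as the assumption:

    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        p := 0.05 // assume each node spends ~5% of its time marked failed
        for _, n := range []int{5, 12, 30, 160} {
            // P(at least one of n nodes is currently down) = 1 - (1-p)^n
            pAny := 1 - math.Pow(1-p, float64(n))
            fmt.Printf("n=%3d  P(some node down) = %.0f%%\n", n, pAny*100)
        }
    }

At 30 nodes the chance that some node is currently marked down is already around 79%, which lines up with larger clusters appearing to flap constantly.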

@djenriquez So does that mean the workaround does not fix the problem for you?

@winggundamth correct, this does not fix the problem for us.

I retract my previous comments from this thread. It appears our core problem was something mentioned in a Google Groups mailing list message about this.

After resolving network errors in our testing environment, I looked at our staging environment, which was repeatedly failing. Luckily, the log messages mentioned a node that I know has been having trouble due to too much IO load. I'll have more data within the next couple of days, but I think this will probably fix our issue. I'll report back if I'm wrong and there is still an issue; otherwise assume that we're fine and have no issues.

The testing environment has been working perfectly with no node flapping. The PR I referenced above helped in figuring out which nodes were having problems and failing their UDP ping checks. I also made another fix, which I'll open an issue for, addressing a bug that caused "alive" metrics to get reported at invalid times.

@djenriquez I'm not sure if my issue is the same as yours, but I would suggest looking at the metrics and seeing if you can make heads or tails of them. They may point you to the problem.

Thanks for the update @jsternberg - could you link to that Google Groups thread here?

It was a response in the thread that led to this issue being created in the first place, which is how I found this issue number.

https://groups.google.com/d/msg/consul-tool/zyh8Kbifv6M/c1WWpknQ8H8J

It was one of the first responses, so I'm a little embarrassed that it turned out to be our underlying issue. With the gossip protocol it is certainly difficult to find who is causing a failure; as it turns out, I was always looking at the wrong nodes.

Awesome, glad to see it's working better for you @jsternberg. Unfortunately, in our case it's not a single node but a completely random node that will fail for a short period of time, including nodes in the same VPC as the Consul servers. We have all traffic set up to pass through the required Consul ports across all servers.

At first I had thought there was a VPN issue between our AZs, but that wouldn't explain why the nodes in the same AZ and VPC as the Consul servers would also flap periodically.

I haven't spent much time analyzing the data because the issue has been minor, more of an annoyance. I'll go ahead and start looking deeper into it.

@djenriquez to clarify what happened to us: it _was_ a random node that would fail. The reason it was so hard to find is that the node reported as failing was not the one actually causing the failure.

  1. Loaded node A sends out ping to node B
  2. Node B responds to ping with an ack
  3. Node A doesn't process the ack before the 500 ms timeout
  4. Node A _thinks_ the ack failed, even though it succeeded
  5. Node A tries fallback methods; somehow they fail too
  6. Node A sends out suspect message about Node B to the cluster
  7. Node C receives suspect message about Node B
  8. Node C hits suspect timeout and declares Node B dead

A single instance of this happening isn't too bad, but if it happens with _every_ ping you get a bunch of random suspect messages being sent to the cluster. Even if it's only sent to a fraction (5%) of them, you get 3 suspect messages a minute. Eventually, the suspect timeout gets hit before the node can refute the message and you end up with dead nodes.
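
For context on that suspect timeout: memberlist scales it with cluster size so that a suspect node has time to refute the rumor. A rough sketch of the calculation, paraphrased from the memberlist source of that era (the exact formula may differ by version):

    package main

    import (
        "fmt"
        "math"
        "time"
    )

    // suspicionTimeout mirrors how memberlist decides how long a node stays
    // "suspect" before being declared dead: SuspicionMult * ceil(log10(n+1)) *
    // ProbeInterval, so larger clusters allow more time for a refutation.
    func suspicionTimeout(suspicionMult, n int, probeInterval time.Duration) time.Duration {
        nodeScale := math.Ceil(math.Log10(float64(n + 1)))
        return time.Duration(suspicionMult) * time.Duration(nodeScale) * probeInterval
    }

    func main() {
        for _, n := range []int{10, 60, 160} {
            fmt.Printf("n=%3d  suspect timeout = %v\n", n, suspicionTimeout(4, n, time.Second))
        }
    }

With the stock LAN settings (multiplier 4, 1-second probe interval) that works out to roughly 8-12 seconds at the cluster sizes discussed here, so spurious suspect messages only have to outpace refutations for a few seconds before a node is declared dead.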

Unfortunately, I don't have enough evidence that this is exactly what happened, but removing the loaded node from our cluster seems to be making our cluster healthier. There could also be other reasons why the probe has failed.

Ah, @jsternberg, that seems logical. The other problem in my case, however, is that none of our nodes reach over 50% CPU utilization, with average utilization around ~15% for our entire infrastructure, and the heaviest traffic on our nodes belongs to Consul at a whopping 40 KiB/s average (sarcasm :stuck_out_tongue: ). These are all new nodes that we're looking to utilize soon.

My old cluster had 5 servers on t2.micro instances with around 50 agents connected, using it as DNS for auto-discovery, with TTLs of 5 seconds for services and nodes and with allow_stale turned on. All of these were running in Docker with --net=host. I had been seeing between 2-5 leader elections a day.

I relaunched all of the servers on new nodes yesterday, putting them on m3.mediums. This new cluster just had its first random leader election; it had been running for about 8 hours before the first event occurred.

Some stats on this m3.medium cluster:
CPU: 20% max - 15% average
Network I/O - 1-2MB/s average

Node: 10.81

consul_1 |     2015/09/18 22:15:14 [WARN] raft: Failed to contact 10.0.20.177:8300 in 500.153009ms
consul_1 |     2015/09/18 22:15:15 [WARN] raft: Failed to contact 10.0.20.177:8300 in 946.791541ms
consul_1 |     2015/09/18 22:15:15 [WARN] raft: Failed to contact 10.0.20.177:8300 in 1.480900366s
consul_1 |     2015/09/18 22:15:20 [WARN] raft: Rejecting vote from 10.0.20.177:8300 since we have a leader: 10.0.10.81:8300
consul_1 |     2015/09/18 22:15:20 [WARN] raft: Rejecting vote from 10.0.20.177:8300 since we have a leader: 10.0.10.81:8300
consul_1 |     2015/09/18 22:15:20 [WARN] raft: Rejecting vote from 10.0.20.177:8300 since we have a leader: 10.0.10.81:8300
consul_1 |     2015/09/18 22:15:20 [WARN] raft: Rejecting vote from 10.0.20.177:8300 since we have a leader: 10.0.10.81:8300
consul_1 |     2015/09/18 22:15:20 [ERR] raft: peer 10.0.20.177:8300 has newer term, stopping replication
consul_1 |     2015/09/18 22:15:20 [INFO] raft: Node at 10.0.10.81:8300 [Follower] entering Follower state
consul_1 |     2015/09/18 22:15:20 [INFO] raft: aborting pipeline replication to peer 10.0.20.177:8300
consul_1 |     2015/09/18 22:15:20 [INFO] consul: cluster leadership lost
consul_1 |     2015/09/18 22:15:20 [INFO] raft: aborting pipeline replication to peer 10.0.10.64:8300
consul_1 |     2015/09/18 22:15:20 [INFO] raft: aborting pipeline replication to peer 10.0.10.250:8300
consul_1 |     2015/09/18 22:15:20 [INFO] raft: aborting pipeline replication to peer 10.0.20.87:8300
consul_1 |     2015/09/18 22:15:21 [WARN] raft: Rejecting vote from 10.0.20.177:8300 since our last index is greater (23800, 23798)
consul_1 |     2015/09/18 22:15:22 [WARN] raft: Heartbeat timeout reached, starting election
consul_1 |     2015/09/18 22:15:22 [INFO] raft: Node at 10.0.10.81:8300 [Candidate] entering Candidate state
consul_1 |     2015/09/18 22:15:22 [INFO] raft: Election won. Tally: 3
consul_1 |     2015/09/18 22:15:22 [INFO] raft: Node at 10.0.10.81:8300 [Leader] entering Leader state
consul_1 |     2015/09/18 22:15:22 [INFO] consul: cluster leadership acquired
consul_1 |     2015/09/18 22:15:22 [INFO] consul: New leader elected: ip-10-0-10-81
consul_1 |     2015/09/18 22:15:22 [INFO] raft: pipelining replication to peer 10.0.10.250:8300
consul_1 |     2015/09/18 22:15:22 [WARN] raft: AppendEntries to 10.0.20.177:8300 rejected, sending older logs (next: 23799)
consul_1 |     2015/09/18 22:15:22 [INFO] raft: pipelining replication to peer 10.0.10.64:8300
consul_1 |     2015/09/18 22:15:22 [INFO] raft: pipelining replication to peer 10.0.20.87:8300
consul_1 |     2015/09/18 22:15:22 [INFO] raft: pipelining replication to peer 10.0.20.177:8300

Node: 20.71

consul_1 |     2015/09/18 22:15:15 [WARN] raft: Heartbeat timeout reached, starting election
consul_1 |     2015/09/18 22:15:15 [INFO] raft: Node at 10.0.20.177:8300 [Candidate] entering Candidate state
consul_1 |     2015/09/18 22:15:17 [WARN] raft: Election timeout reached, restarting election
consul_1 |     2015/09/18 22:15:17 [INFO] raft: Node at 10.0.20.177:8300 [Candidate] entering Candidate state
consul_1 |     2015/09/18 22:15:18 [WARN] raft: Election timeout reached, restarting election
consul_1 |     2015/09/18 22:15:18 [INFO] raft: Node at 10.0.20.177:8300 [Candidate] entering Candidate state
consul_1 |     2015/09/18 22:15:20 [WARN] raft: Election timeout reached, restarting election
consul_1 |     2015/09/18 22:15:20 [INFO] raft: Node at 10.0.20.177:8300 [Candidate] entering Candidate state
consul_1 |     2015/09/18 22:15:21 [ERR] raft-net: Failed to flush response: write tcp 10.0.10.81:55933: connection reset by peer
consul_1 |     2015/09/18 22:15:21 [WARN] raft: Election timeout reached, restarting election
consul_1 |     2015/09/18 22:15:21 [INFO] raft: Node at 10.0.20.177:8300 [Candidate] entering Candidate state
consul_1 |     2015/09/18 22:15:22 [ERR] http: Request /v1/health/service/XXX?index=23799&passing=1&wait=60000ms, error: No cluster leader
consul_1 |     2015/09/18 22:15:22 [ERR] http: Request /v1/health/service/XXX?index=23799&passing=1&wait=60000ms, error: No cluster leader
consul_1 |     2015/09/18 22:15:22 [ERR] http: Request /v1/health/service/XXX?index=23799&passing=1&wait=60000ms, error: No cluster leader
consul_1 |     2015/09/18 22:15:22 [ERR] agent: failed to sync remote state: No cluster leader
consul_1 |     2015/09/18 22:15:22 [ERR] http: Request /v1/health/service/XXX?index=23799&passing=1&wait=60000ms, error: No cluster leader
consul_1 |     2015/09/18 22:15:22 [ERR] http: Request /v1/health/service/XXX?index=23799&passing=1&wait=60000ms, error: No cluster leader
consul_1 |     2015/09/18 22:15:22 [INFO] raft: Node at 10.0.20.177:8300 [Follower] entering Follower state
consul_1 |     2015/09/18 22:15:22 [WARN] raft: Failed to get previous log: 23800 log not found (last: 23798)
consul_1 |     2015/09/18 22:15:22 [INFO] consul: New leader elected: ip-10-0-10-81
consul_1 |     2015/09/18 22:15:26 [INFO] serf: attempting reconnect to ip-10-0-30-87 10.0.30.87:8301

Any status updates or additional info we can provide to get this issue up and moving?

The flapping is starting to affect some of our services, since consul-template removes the routes from our NGINX config when the nodes are considered unhealthy. Eventually, when they flap back, all is fine, but this service interruption is very problematic for the nodes hosting important services.

Updates?

Hi @djenriquez - haven't forgotten about this but working through a backlog.

hey @djenriquez I'm not sure if you follow the mailing list (https://groups.google.com/forum/#!forum/consul-tool) but there are a number of threads on the topic.

I saw some flapping issues but they are now massively reduced through a few tweaks (if this helps):

  • Moved to m4.large instances for the Consul servers; the network performance, in my case, was much improved, even for a Consul cluster that spans AZs.
  • Allowed DNS, service, and node stale caches. This helps keep things up and running during leader flapping.
  • I was routing all DNS through Consul, but now I use dnsmasq to send only *.consul. requests through Consul.
  • I use --net host for the Consul Docker containers. Before I added that flag, I was running into issues with Docker dropping UDP packets, which I noticed quite a bit when I was running all DNS requests through Consul. Using --net host has helped quite a bit (though of course you lose all the iptables benefits when you do that).

I'm now seeing 1-2 leader elections a day on a small cluster (3 Consul servers, another 6-12 agents, and ~16 services).

feel free to ping me if you want to dive into details.

Thanks @slackpad.

Hi @skippy, yes, this issue actually was born in the Google group. As for the flapping nodes, I'm having a hard time justifying allocating an m4.large, let alone 5 of them, for Consul. Sysdig shows that Consul uses an average of 40 KiB/s of network traffic, with some spikes to 70 KiB/s, but that's it. This is in a cluster of ~30 nodes with ~40-50 services spread between them.

Now, maybe larger/stronger instances can increase stability, but why? Smaller instances, even t2.micros, should be more than capable of handling that kind of network traffic.

How does one set up service/node stale caches? Is this a feature in Consul that I enable? It's a setting I have not yet fiddled with; maybe it can help. I'm not using Consul DNS at all.

I had someone else just yesterday recommend --net=host. I've transitioned the server containers to use that option and am slowly transitioning the agents as well. It's too soon to report whether it's helping; I feel that it has, somewhat, but I have noticed some nodes with --net=host still flapping.

Once I transition all agents to use that option, I'll report back. Thank you!

On Oct 1, 2015, at 10:53 PM, DJ Enriquez [email protected] wrote:

Now, maybe larger/stronger instances can increase stability, but why? Smaller instances, even t2.micros should be more than capable of handling that kind of network traffic.

I run several clusters on t2.small instances. We have very few leader elections.

@blalor, are you using t2.small within the same AZ or spanning AZs?

@djenriquez great point about moving to m4. The challenge with AWS is that there are lots of other variables (how many AZs are being spanned, which region you are in, etc.) and then lots of guessing (are the t2 and m4 families on better internal network hardware?).

@djenriquez here is part of a sample json config that may help:

    {
      "dns_config": {
        "allow_stale": true,
        "node_ttl": "2s",
        "max_stale": "5s",
        "service_ttl": {
          "*": "5s",
          "api": "10s"
        }
      }
    }

You'll see some of these settings documented at https://www.consul.io/docs/agent/options.html#node_ttl. In short, allow_stale lets any server answer a DNS query from its local state instead of forwarding everything to the leader, max_stale bounds how stale that answer may be, and the TTLs let downstream resolvers cache results, which is what keeps lookups working through a brief leader election.

Awesome @skippy, I'll take a look at those configs.

As far as AWS regions go, we have one datacenter per region (us-west-2, us-east-1, eu-west-1), but we're having the flapping issues only in the west. Granted, our us-east-1 and eu-west-1 datacenters each house only 5 agents with 5 services, vs. us-west-2 with 30 agents and many, many more services.

In each region, we've split up our agents equally to establish HA as much as possible.

Update on our environment: last Friday, we moved away from our "one Consul cluster for everything" strategy and separated our production environment from dev/test/staging. During this effort, we also ran all of our dockerized Consul containers with --net=host. Ever since then, the flapping has stopped.

I'm not sure which of the two (splitting up the cluster or --net=host) contributed to resolving the flapping issue, or whether both did. We ultimately split that environment up 60/40, with 60% being dev/test/staging and 40% being prod.

When it was flapping, we did have some Consul agents running with --net=host before splitting up the environment, but not all agents were, and some of the agents with --net=host were still flapping during that time. Maybe every agent needs to run --net=host before the positive effects can take hold?

This then raises a concern about how much load Consul can take before the flapping begins. We had 20 instances, including the 5 Consul servers, when the flapping was occurring, and now have two environments of 15 and 10 nodes with no flapping. Does Consul have a load issue when the number of agents increases?

Just an update: since separating our cluster into smaller clusters, we have not seen nodes flapping.

We've yet to have any cluster reach 30+ nodes, but I'm still worried that the flapping will begin once we get to that point. Has any progress been made in discovering the cause of flapping nodes in large (30+ node) AWS clusters?

Try setting the MTU to 1500.
See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/network_mtu.html

I just did a test:
eu-west-1 (Ireland)
3 t2.micro servers, across all 3 AZs
200 t2.micro clients, distributed across AZs
It has been running for 45 minutes with no problems.

The leader is running at 50-70% CPU, so it might run into problems with CPU credits.
However, with such a large cluster, larger server instances are acceptable.

M4 might work because it has enhanced networking.

Also, don't forget your network ACLs besides your SGs.
(I don't know if the MTU really solves the problem: I had forgotten a network ACL and got flapping, which is how I found this issue. However, we had problems with AWS SES that setting the MTU to 1500 resolved, so I tried that first before discovering the ACL. I still need to do some testing with the default 9001 MTU, but it's Sunday and I only wanted to know if Consul can handle a normal cluster size with small nodes, so that will come later.) I have seen leadership re-election, though.

@sander-su I had previously tested m3.medium servers and they still had the same issue, just to a lesser extent.

How did you set GOMAXPROCS?
Running with t2.medium and GOMAXPROCS=2 seems to be pretty stable (no leader re-elections, no flapping), at least for the past few hours.

You can refer to my original post above. https://github.com/hashicorp/consul/issues/1212#issuecomment-141678943

Hi @sstarcher, I'm not sure why I missed your original post, but I did. Sounds like a solid idea. We aren't seeing any flapping at the moment because we fragmented our infrastructure such that the largest cluster we have right now is 15 nodes.

When we get back to the size where flapping happens (~20-25+ nodes), I'll try this out.

If anyone else is experiencing flapping and would like to give this a shot for the sake of resolving this issue, please do so.

I have a 3-node, multi-AZ VPC cluster running on t2.smalls. Logs during flapping look like this:

    2015/11/16 13:15:49 [WARN] raft: Failed to contact 10.0.1.5:8300 in 500.228729ms
    2015/11/16 13:15:50 [WARN] raft: Failed to contact 10.0.1.5:8300 in 943.215024ms
    2015/11/16 13:15:50 [WARN] raft: Failed to contact 10.0.1.5:8300 in 1.399755136s
    2015/11/16 13:15:54 [WARN] raft: Rejecting vote from 10.0.1.5:8300 since we have a leader: 10.0.0.5:8300
    2015/11/16 13:15:54 [WARN] raft: Rejecting vote from 10.0.1.5:8300 since we have a leader: 10.0.0.5:8300
    2015/11/16 13:15:55 [ERR] raft: peer 10.0.1.5:8300 has newer term, stopping replication
    2015/11/16 13:15:55 [INFO] raft: Node at 10.0.0.5:8300 [Follower] entering Follower state
    2015/11/16 13:15:55 [INFO] consul: cluster leadership lost
    2015/11/16 13:15:55 [INFO] raft: aborting pipeline replication to peer 10.0.0.6:8300
    2015/11/16 13:15:55 [ERR] http: Request /v1/catalog/nodes, error: No cluster leader
    2015/11/16 13:15:55 [ERR] http: Request /v1/catalog/services, error: No cluster leader
    2015/11/16 13:15:56 [INFO] consul: New leader elected: consul-3
    2015/11/16 13:16:01 [WARN] raft: Heartbeat timeout reached, starting election
    2015/11/16 13:16:01 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:16:03 [WARN] raft: Election timeout reached, restarting election
    2015/11/16 13:16:03 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:16:05 [WARN] raft: Election timeout reached, restarting election
    2015/11/16 13:16:05 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:16:06 [ERR] http: Request /v1/catalog/service/monitoring?dc=ec2&index=366986&wait=30000ms, error: No cluster leader
    2015/11/16 13:16:06 [WARN] raft: Election timeout reached, restarting election
    2015/11/16 13:16:06 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:16:06 [INFO] raft: Node at 10.0.0.5:8300 [Follower] entering Follower state
    2015/11/16 13:16:06 [WARN] raft: Rejecting vote from 10.0.1.5:8300 since we have a leader: 10.0.1.5:8300
    2015/11/16 13:16:06 [WARN] raft: Rejecting vote from 10.0.1.5:8300 since we have a leader: 10.0.1.5:8300
    2015/11/16 13:16:06 [INFO] consul: New leader elected: consul-3
    2015/11/16 13:16:31 [WARN] raft: Heartbeat timeout reached, starting election
    2015/11/16 13:16:31 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:16:31 [ERR] raft-net: Failed to flush response: write tcp 10.0.1.5:54881: connection reset by peer
    2015/11/16 13:16:33 [INFO] raft: Duplicate RequestVote for same term: 224
    2015/11/16 13:16:33 [INFO] raft: Duplicate RequestVote for same term: 224
    2015/11/16 13:16:33 [WARN] raft: Election timeout reached, restarting election
    2015/11/16 13:16:33 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:16:33 [INFO] raft: Election won. Tally: 2
    2015/11/16 13:16:33 [INFO] raft: Node at 10.0.0.5:8300 [Leader] entering Leader state
    2015/11/16 13:16:33 [INFO] consul: cluster leadership acquired
    2015/11/16 13:16:33 [INFO] consul: New leader elected: consul-1
    2015/11/16 13:16:33 [INFO] raft: pipelining replication to peer 10.0.1.5:8300
    2015/11/16 13:16:33 [INFO] raft: pipelining replication to peer 10.0.0.6:8300
    2015/11/16 13:16:34 [WARN] raft: Failed to contact 10.0.1.5:8300 in 500.185904ms
    2015/11/16 13:16:35 [WARN] raft: Failed to contact 10.0.1.5:8300 in 933.234141ms
    2015/11/16 13:16:35 [WARN] raft: Failed to contact 10.0.1.5:8300 in 1.3765368s
    2015/11/16 13:16:40 [WARN] raft: Rejecting vote from 10.0.1.5:8300 since we have a leader: 10.0.0.5:8300
    2015/11/16 13:16:40 [ERR] raft: peer 10.0.1.5:8300 has newer term, stopping replication
    2015/11/16 13:16:40 [INFO] raft: Node at 10.0.0.5:8300 [Follower] entering Follower state
    2015/11/16 13:16:40 [INFO] raft: aborting pipeline replication to peer 10.0.1.5:8300
    2015/11/16 13:16:40 [INFO] consul: cluster leadership lost
    2015/11/16 13:16:40 [INFO] raft: aborting pipeline replication to peer 10.0.0.6:8300
    2015/11/16 13:16:40 [ERR] http: Request /v1/catalog/nodes, error: No cluster leader
    2015/11/16 13:16:40 [ERR] http: Request /v1/catalog/services, error: No cluster leader
    2015/11/16 13:16:41 [WARN] raft: Rejecting vote from 10.0.1.5:8300 since we have a leader: 10.0.1.5:8300
    2015/11/16 13:16:41 [INFO] consul: New leader elected: consul-3
    2015/11/16 13:16:43 [WARN] raft: Rejecting vote from 10.0.1.5:8300 since we have a leader: 10.0.1.5:8300
    2015/11/16 13:16:50 [WARN] raft: Heartbeat timeout reached, starting election
    2015/11/16 13:16:50 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:16:51 [ERR] http: Request /v1/catalog/service/monitoring?dc=ec2&index=367725&wait=30000ms, error: No cluster leader
    2015/11/16 13:16:52 [ERR] raft-net: Failed to flush response: write tcp 10.0.1.5:54885: connection reset by peer
    2015/11/16 13:16:52 [WARN] raft: Election timeout reached, restarting election
    2015/11/16 13:16:52 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:16:52 [INFO] raft: Election won. Tally: 2
    2015/11/16 13:16:52 [INFO] raft: Node at 10.0.0.5:8300 [Leader] entering Leader state
    2015/11/16 13:16:52 [INFO] consul: cluster leadership acquired
    2015/11/16 13:16:52 [INFO] consul: New leader elected: consul-1
    2015/11/16 13:16:52 [INFO] raft: pipelining replication to peer 10.0.1.5:8300
    2015/11/16 13:16:52 [INFO] raft: pipelining replication to peer 10.0.0.6:8300
    2015/11/16 13:17:17 [WARN] raft: Failed to contact 10.0.1.5:8300 in 500.156606ms
    2015/11/16 13:17:17 [WARN] raft: Failed to contact 10.0.1.5:8300 in 984.637251ms
    2015/11/16 13:17:18 [WARN] raft: Failed to contact 10.0.1.5:8300 in 1.443053167s
    2015/11/16 13:17:23 [WARN] raft: Rejecting vote from 10.0.1.5:8300 since we have a leader: 10.0.0.5:8300
    2015/11/16 13:17:23 [ERR] raft: peer 10.0.1.5:8300 has newer term, stopping replication
    2015/11/16 13:17:23 [INFO] raft: Node at 10.0.0.5:8300 [Follower] entering Follower state
    2015/11/16 13:17:23 [INFO] raft: aborting pipeline replication to peer 10.0.1.5:8300
    2015/11/16 13:17:23 [INFO] consul: cluster leadership lost
    2015/11/16 13:17:23 [INFO] raft: aborting pipeline replication to peer 10.0.0.6:8300
    2015/11/16 13:17:25 [INFO] consul: New leader elected: consul-3
    2015/11/16 13:17:25 [WARN] raft: Rejecting vote from 10.0.1.5:8300 since we have a leader: 10.0.1.5:8300
    2015/11/16 13:17:41 [WARN] raft: Heartbeat timeout reached, starting election
    2015/11/16 13:17:41 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:17:43 [WARN] raft: Election timeout reached, restarting election
    2015/11/16 13:17:43 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:17:45 [WARN] raft: Election timeout reached, restarting election
    2015/11/16 13:17:45 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:17:46 [ERR] raft-net: Failed to flush response: write tcp 10.0.1.5:54889: broken pipe
    2015/11/16 13:17:47 [ERR] http: Request /v1/catalog/nodes, error: rpc error: No cluster leader
    2015/11/16 13:17:47 [ERR] http: Request /v1/catalog/services, error: No cluster leader
    2015/11/16 13:17:47 [WARN] raft: Election timeout reached, restarting election
    2015/11/16 13:17:47 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:17:47 [INFO] raft: Election won. Tally: 2
    2015/11/16 13:17:47 [INFO] raft: Node at 10.0.0.5:8300 [Leader] entering Leader state
    2015/11/16 13:17:47 [INFO] consul: cluster leadership acquired
    2015/11/16 13:17:47 [INFO] consul: New leader elected: consul-1
    2015/11/16 13:17:47 [INFO] raft: pipelining replication to peer 10.0.1.5:8300
    2015/11/16 13:17:47 [INFO] raft: pipelining replication to peer 10.0.0.6:8300
    2015/11/16 13:18:12 [WARN] raft: Failed to contact 10.0.1.5:8300 in 500.20398ms
    2015/11/16 13:18:13 [WARN] raft: Failed to contact 10.0.1.5:8300 in 967.058815ms
    2015/11/16 13:18:13 [WARN] raft: Failed to contact 10.0.1.5:8300 in 1.443708733s
    2015/11/16 13:18:18 [WARN] raft: Rejecting vote from 10.0.1.5:8300 since we have a leader: 10.0.0.5:8300
    2015/11/16 13:18:18 [WARN] raft: Rejecting vote from 10.0.1.5:8300 since we have a leader: 10.0.0.5:8300
    2015/11/16 13:18:18 [ERR] raft: peer 10.0.1.5:8300 has newer term, stopping replication
    2015/11/16 13:18:18 [INFO] raft: Node at 10.0.0.5:8300 [Follower] entering Follower state
    2015/11/16 13:18:18 [INFO] raft: aborting pipeline replication to peer 10.0.1.5:8300
    2015/11/16 13:18:18 [INFO] consul: cluster leadership lost
    2015/11/16 13:18:18 [INFO] raft: aborting pipeline replication to peer 10.0.0.6:8300
    2015/11/16 13:18:19 [INFO] consul: New leader elected: consul-3
    2015/11/16 13:18:20 [WARN] raft: Rejecting vote from 10.0.1.5:8300 since we have a leader: 10.0.1.5:8300
    2015/11/16 13:18:33 [WARN] raft: Heartbeat timeout reached, starting election
    2015/11/16 13:18:33 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:18:35 [WARN] raft: Election timeout reached, restarting election
    2015/11/16 13:18:35 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:18:37 [WARN] raft: Election timeout reached, restarting election
    2015/11/16 13:18:37 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:18:38 [WARN] raft: Election timeout reached, restarting election
    2015/11/16 13:18:38 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:18:40 [ERR] http: Request /v1/catalog/service/monitoring?dc=ec2&index=367725&wait=30000ms, error: No cluster leader
    2015/11/16 13:18:40 [WARN] raft: Election timeout reached, restarting election
    2015/11/16 13:18:40 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:18:40 [ERR] http: Request /v1/catalog/nodes, error: No cluster leader
    2015/11/16 13:18:40 [ERR] http: Request /v1/catalog/services, error: No cluster leader
    2015/11/16 13:18:42 [WARN] raft: Election timeout reached, restarting election
    2015/11/16 13:18:42 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:18:44 [WARN] raft: Election timeout reached, restarting election
    2015/11/16 13:18:44 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:18:45 [WARN] raft: Election timeout reached, restarting election
    2015/11/16 13:18:47 [ERR] raft: Failed to make RequestVote RPC to 10.0.1.5:8300: read tcp 10.0.1.5:8300: i/o timeout
    2015/11/16 13:18:47 [WARN] raft: Election timeout reached, restarting election
    2015/11/16 13:18:47 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:18:48 [WARN] raft: Election timeout reached, restarting election
    2015/11/16 13:18:48 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:18:50 [WARN] raft: Election timeout reached, restarting election
    2015/11/16 13:18:50 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:18:51 [WARN] raft: Election timeout reached, restarting election
    2015/11/16 13:18:51 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:18:52 [ERR] raft: Failed to make RequestVote RPC to 10.0.1.5:8300: read tcp 10.0.1.5:8300: i/o timeout
    2015/11/16 13:18:52 [INFO] raft: Election won. Tally: 2
    2015/11/16 13:18:52 [INFO] raft: Node at 10.0.0.5:8300 [Leader] entering Leader state
    2015/11/16 13:18:52 [INFO] consul: cluster leadership acquired
    2015/11/16 13:18:52 [INFO] consul: New leader elected: consul-1
    2015/11/16 13:18:52 [INFO] raft: pipelining replication to peer 10.0.1.5:8300
    2015/11/16 13:18:52 [INFO] raft: pipelining replication to peer 10.0.0.6:8300
    2015/11/16 13:18:54 [ERR] raft: Failed to make RequestVote RPC to 10.0.1.5:8300: read tcp 10.0.1.5:8300: i/o timeout
    2015/11/16 13:18:57 [WARN] raft: Failed to contact 10.0.1.5:8300 in 500.182041ms
    2015/11/16 13:18:58 [WARN] raft: Failed to contact 10.0.1.5:8300 in 952.073264ms
    2015/11/16 13:18:58 [WARN] raft: Failed to contact 10.0.1.5:8300 in 1.442323586s
    2015/11/16 13:19:00 [WARN] raft: Rejecting vote from 10.0.1.5:8300 since we have a leader: 10.0.0.5:8300
    2015/11/16 13:19:01 [WARN] raft: Rejecting vote from 10.0.1.5:8300 since we have a leader: 10.0.0.5:8300
    2015/11/16 13:19:01 [ERR] raft: peer 10.0.1.5:8300 has newer term, stopping replication
    2015/11/16 13:19:01 [INFO] raft: Node at 10.0.0.5:8300 [Follower] entering Follower state
    2015/11/16 13:19:01 [INFO] raft: aborting pipeline replication to peer 10.0.1.5:8300
    2015/11/16 13:19:01 [INFO] consul: cluster leadership lost
    2015/11/16 13:19:01 [INFO] raft: aborting pipeline replication to peer 10.0.0.6:8300
    2015/11/16 13:19:02 [INFO] consul: New leader elected: consul-3
    2015/11/16 13:19:16 [WARN] raft: Heartbeat timeout reached, starting election
    2015/11/16 13:19:16 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:19:16 [ERR] agent: failed to sync remote state: No cluster leader
    2015/11/16 13:19:17 [ERR] agent: failed to sync remote state: No cluster leader
    2015/11/16 13:19:18 [INFO] raft: Duplicate RequestVote for same term: 261
    2015/11/16 13:19:18 [WARN] raft: Election timeout reached, restarting election
    2015/11/16 13:19:18 [INFO] raft: Node at 10.0.0.5:8300 [Candidate] entering Candidate state
    2015/11/16 13:19:18 [INFO] consul: New leader elected: consul-2
    2015/11/16 13:19:19 [INFO] raft: Node at 10.0.0.5:8300 [Follower] entering Follower state
    2015/11/16 13:19:19 [WARN] raft: Failed to get previous log: 367742 log not found (last: 367741)
    2015/11/16 13:19:19 [INFO] consul: New leader elected: consul-2

I am going to try setting the MTU to 1500 and GOMAXPROCS=2, and to use caching for DNS as a safeguard. If none of those work, I will try moving up from t2.small.

Looks like you're doing some work to close out some issues, @slackpad. I've tried to replicate this flapping in 0.6.0 and, fortunately, have been unsuccessful. Looks like all the network optimizations in 0.6.0 did the trick.

Closing, and hoping this stays closed.

@djenriquez thanks much for the follow up report!

Not all clients in my cluster have been moved to 0.6.0, but my masters have, and they still lose each other.

@sstarcher we can re-open this issue if you'd like, or try to work the details of your setup in another one - please let me know.

@slackpad I'll have more time next week to dig into the consul consensus. After I dig into it I'll get back in touch.

Things seem to be pretty stable for the last few days since making the changes I mentioned. Will keep updated if something changes.

So I just woke up to over an hour of downtime due to Consul not being able to keep it together. This is getting very frustrating.

Logs from consul-2: https://gist.github.com/MrMMorris/8eefe65fb6b55dc9cb20
Logs from consul-1: https://gist.github.com/MrMMorris/9d9206f546852516022a

I'm starting to think about dynamically creating hosts files with consul-template on each node so they aren't affected by this. If anyone has other ideas, I would love to hear them...

@MrMMorris, have you tried upgrading to 0.6? There were enormous changes to the internals that, for me, solved the flapping issue.

I believe this specific github issue is tied to 0.5.2, if people are still having flapping issues with 0.6, I recommend opening a new issue specifying that. This will help the dev team isolate the problem.

Yeah, this is with 0.6.0. I was on 0.5.2 when I first commented. I will open a new issue, thanks.

@MrMMorris sorry about the downtime. Looking through these logs, it seems like there was a pretty bad communication failure between the Consul servers that disrupted the TCP traffic between them, not the usual node health checking. Do you have any network metrics you can look at during this time for packet loss or other signs of connectivity trouble?

@slackpad thanks for the quick response! I did notice this morning, when one of the servers went down, that I couldn't SSH in, which makes me think this may be an EC2 problem as well. I am getting some network and Consul metrics set up now; will report back.

@slackpad I have determined that it is a cron.daily task that is causing high disk IO. It might be logrotate; still looking into it.

(screenshot: disk I/O metrics, 2016-01-12 12:15 PM)

However, I am experiencing a different issue now. On the same server, I have been seeing the logs below since 9:30am UTC (which is not when cron.daily runs). Any idea why this is happening, and how to prevent it or have it recover automatically? Or does it require manual intervention with a peers.json edit?

    2016/01/12 17:16:53 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:16:53 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:16:54 [ERR] agent: coordinate update error: No cluster leader
    2016/01/12 17:16:55 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:16:55 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:16:56 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:16:56 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:16:58 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:16:58 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:00 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:00 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:01 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:01 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:03 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:03 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:04 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:04 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:05 [ERR] http: Request GET /v1/catalog/nodes, error: No cluster leader
    2016/01/12 17:17:05 [ERR] http: Request GET /v1/catalog/services, error: No cluster leader
    2016/01/12 17:17:05 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:05 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:06 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:06 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:07 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:07 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:08 [ERR] agent: failed to sync remote state: No cluster leader
    2016/01/12 17:17:09 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:09 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:11 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:11 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:12 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:12 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:14 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:14 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:16 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:16 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:17 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:17 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:18 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:18 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:19 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:19 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:20 [ERR] http: Request GET /v1/catalog/nodes, error: No cluster leader
    2016/01/12 17:17:20 [ERR] http: Request GET /v1/catalog/services, error: No cluster leader
    2016/01/12 17:17:21 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:21 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:22 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:22 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:23 [ERR] agent: coordinate update error: No cluster leader
    2016/01/12 17:17:24 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:24 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:24 [ERR] agent: failed to sync remote state: No cluster leader
    2016/01/12 17:17:26 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:26 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:28 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:28 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:29 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:29 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:31 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:31 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:33 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:33 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:34 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:34 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:35 [ERR] http: Request GET /v1/catalog/nodes, error: No cluster leader
    2016/01/12 17:17:35 [ERR] http: Request GET /v1/catalog/services, error: No cluster leader
    2016/01/12 17:17:35 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:35 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:37 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:37 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:39 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:39 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:40 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:40 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:42 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:42 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:44 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:44 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:45 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:45 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:46 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:46 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:46 [ERR] agent: coordinate update error: No cluster leader
    2016/01/12 17:17:48 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:48 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:49 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:49 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:50 [ERR] http: Request GET /v1/catalog/nodes, error: No cluster leader
    2016/01/12 17:17:50 [ERR] http: Request GET /v1/catalog/services, error: No cluster leader
    2016/01/12 17:17:51 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:51 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:51 [ERR] agent: failed to sync remote state: No cluster leader
    2016/01/12 17:17:53 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:53 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:54 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:54 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:55 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:55 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:56 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:56 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:17:58 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:17:58 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:00 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:00 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:02 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:02 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:03 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:03 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:04 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:04 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:05 [ERR] http: Request GET /v1/catalog/nodes, error: No cluster leader
    2016/01/12 17:18:05 [ERR] http: Request GET /v1/catalog/services, error: No cluster leader
    2016/01/12 17:18:06 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:06 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:07 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:07 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:09 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:09 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:10 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:10 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:12 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:12 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:14 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:14 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:15 [ERR] agent: coordinate update error: No cluster leader
    2016/01/12 17:18:16 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:16 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:17 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:17 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:18 [ERR] agent: failed to sync remote state: No cluster leader
    2016/01/12 17:18:18 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:18 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:20 [ERR] http: Request GET /v1/catalog/nodes, error: No cluster leader
    2016/01/12 17:18:20 [ERR] http: Request GET /v1/catalog/services, error: No cluster leader
    2016/01/12 17:18:20 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:20 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:21 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:21 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:23 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:23 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:24 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:24 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:26 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:26 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:28 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:28 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:29 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:29 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:31 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:31 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:32 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:32 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:34 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:34 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:35 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:35 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:35 [ERR] http: Request GET /v1/catalog/nodes, error: No cluster leader
    2016/01/12 17:18:35 [ERR] http: Request GET /v1/catalog/services, error: No cluster leader
    2016/01/12 17:18:36 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:36 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:37 [ERR] agent: failed to sync remote state: No cluster leader
    2016/01/12 17:18:37 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:37 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:39 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:39 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:39 [ERR] agent: coordinate update error: No cluster leader
    2016/01/12 17:18:40 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:40 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:42 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:42 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:44 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:44 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:46 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:46 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:47 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:47 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:48 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:48 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:49 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:49 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:50 [ERR] http: Request GET /v1/catalog/nodes, error: No cluster leader
    2016/01/12 17:18:50 [ERR] http: Request GET /v1/catalog/services, error: No cluster leader
    2016/01/12 17:18:51 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:51 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:53 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:53 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:55 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:55 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:56 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:56 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:58 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:58 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:18:59 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:18:59 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:19:01 [ERR] agent: failed to sync remote state: No cluster leader
    2016/01/12 17:19:01 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:19:01 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:19:02 [WARN] raft: Election timeout reached, restarting election
    2016/01/12 17:19:02 [INFO] raft: Node at 10.0.0.9:8300 [Candidate] entering Candidate state
    2016/01/12 17:19:02 [ERR] agent: coordinate update error: No cluster leader

@slackpad scratch that last issue. It was due to some port not being open, although I'm not sure which.

This is my SG for Consul Servers. What am I missing that would cause the previous election timeouts?

[screenshot: security group rules for the Consul servers (2016-01-13)]

@MrMMorris the server-to-server Raft traffic should be going over 8300/tcp with the default configuration.
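
For reference, a minimal sketch of opening the default Consul ports between servers with the aws CLI - the security group ID is a placeholder, 8300 (Raft/server RPC) is TCP-only, and the serf gossip ports 8301/8302 need both TCP and UDP:

    # Hypothetical security group; these rules let members of the group reach each other.
    SG=sg-0123456789abcdef0

    # Raft / server RPC (TCP only)
    aws ec2 authorize-security-group-ingress --group-id "$SG" \
      --protocol tcp --port 8300 --source-group "$SG"

    # Serf LAN and WAN gossip (TCP and UDP)
    for port in 8301 8302; do
      for proto in tcp udp; do
        aws ec2 authorize-security-group-ingress --group-id "$SG" \
          --protocol "$proto" --port "$port" --source-group "$SG"
      done
    done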

@slackpad After relaunching all of our consul servers on larger nodes we are still seeing significant problems. Would you like me to create a new ticket, or should I document it here? An overview is as follows:

  • AWS EC2 - m3.medium
  • 5 Servers
  • 100 Clients
  • Consul running in a docker container with --net=host - Docker 1.9.1
  • Ubuntu 14.04
  • DNS forwarding is setup on our DNS servers to forward requests for *.service.consul to our 5 servers
  • Vault is in use, but running on separate servers.

Config

{
  "data_dir": "/data",
  "client_addr": "0.0.0.0",
  "advertise_addr": "10.0.10.8",
  "leave_on_terminate": true,
  "encrypt": "XXXXXXXXXXXXXXXXXXXXX",
  "atlas_infrastructure":"XXXXXXXXXXXX",
  "atlas_token":"XXXXXXXX",
  "atlas_join": true,
  "ui_dir":"/ui",
  "server": true,
  "dns_config": {
    "allow_stale": true,
    "node_ttl": "5s",
    "service_ttl": {
      "*": "5s"
    }
  }
}

Our logs show the following over the past 12 hours. In that window only one server has had "Failed to contact" problems, and it's possible that node itself is having issues, but over the next 12 hours I suspect I'll see other servers have issues, as has consistently been the case. Over the past week I have moved the servers from t2.micros to t2.mediums and now to m3.mediums to rule out a performance problem. CPU utilization on the servers is around 15%.

    Time    message     instance-id     host  
January 20th 2016, 06:55:02.198     2016/01/20 11:55:02 [WARN] raft: Failed to contact 10.0.10.8:8300 in 522.178375ms   i-5dd966d4  ip-10-0-10-68
January 20th 2016, 06:55:02.198     2016/01/20 11:55:02 [WARN] raft: Failed to contact 10.0.20.199:8300 in 733.919544ms i-5dd966d4  ip-10-0-10-68
January 20th 2016, 06:44:52.657     2016/01/20 11:44:52 [WARN] raft: Failed to contact 10.0.20.199:8300 in 931.336012ms i-5dd966d4  ip-10-0-10-68
January 20th 2016, 06:44:52.225     2016/01/20 11:44:52 [WARN] raft: Failed to contact 10.0.20.199:8300 in 500.159745ms i-5dd966d4  ip-10-0-10-68
January 20th 2016, 05:43:38.524     2016/01/20 10:43:38 [WARN] raft: Failed to contact 10.0.20.199:8300 in 975.418272ms i-5dd966d4  ip-10-0-10-68
January 20th 2016, 05:43:38.049     2016/01/20 10:43:38 [WARN] raft: Failed to contact 10.0.20.199:8300 in 500.155204ms i-5dd966d4  ip-10-0-10-68
January 20th 2016, 02:42:36.630     2016/01/20 07:42:36 [WARN] raft: Failed to contact 10.0.20.156:8300 in 500.207691ms i-5dd966d4  ip-10-0-10-68
January 20th 2016, 01:55:52.514     2016/01/20 06:55:52 [WARN] raft: Failed to contact 10.0.20.156:8300 in 972.641195ms i-5dd966d4  ip-10-0-10-68
January 20th 2016, 01:55:52.042     2016/01/20 06:55:52 [WARN] raft: Failed to contact 10.0.20.156:8300 in 500.208781ms i-5dd966d4  ip-10-0-10-68
January 20th 2016, 01:49:57.234     2016/01/20 06:49:57 [WARN] raft: Failed to contact quorum of nodes, stepping down   i-5dd966d4  ip-10-0-10-68
January 20th 2016, 01:49:57.234     2016/01/20 06:49:57 [WARN] raft: Failed to contact 10.0.10.238:8300 in 542.080863ms i-5dd966d4  ip-10-0-10-68
January 20th 2016, 01:49:57.233     2016/01/20 06:49:57 [WARN] raft: Failed to contact 10.0.20.156:8300 in 542.209503ms i-5dd966d4  ip-10-0-10-68
January 20th 2016, 01:49:57.233     2016/01/20 06:49:57 [WARN] raft: Failed to contact 10.0.10.8:8300 in 542.034192ms   i-5dd966d4  ip-10-0-10-68
January 20th 2016, 00:56:53.560     2016/01/20 05:56:53 [WARN] raft: Failed to contact 10.0.20.199:8300 in 934.142244ms i-5dd966d4  ip-10-0-10-68
January 20th 2016, 00:56:53.153     2016/01/20 05:56:53 [WARN] raft: Failed to contact 10.0.20.199:8300 in 512.482722ms i-5dd966d4  ip-10-0-10-68
January 20th 2016, 00:34:26.854     2016/01/20 05:34:26 [WARN] raft: Failed to contact 10.0.20.199:8300 in 945.522732ms i-5dd966d4  ip-10-0-10-68
January 20th 2016, 00:34:26.409     2016/01/20 05:34:26 [WARN] raft: Failed to contact 10.0.20.199:8300 in 500.161658ms i-5dd966d4  ip-10-0-10-68
    Time    message     instance-id     host  
January 20th 2016, 06:45:37.449     2016/01/20 11:45:37 [INFO] consul: New leader elected: ip-10-0-10-68    i-cc9fee35  ip-10-0-30-16
January 20th 2016, 06:45:14.831     2016/01/20 11:45:14 [INFO] consul: New leader elected: ip-10-0-10-68    i-4cd182fd  ip-10-0-40-99
January 20th 2016, 06:44:56.261     2016/01/20 11:44:56 [INFO] consul: New leader elected: ip-10-0-10-68    i-145a08a2  ip-10-0-10-224
January 20th 2016, 06:44:56.260     2016/01/20 11:44:56 [INFO] consul: New leader elected: ip-10-0-10-68    i-1d625aab  ip-10-0-10-79
January 20th 2016, 06:44:56.139     2016/01/20 11:44:56 [INFO] consul: New leader elected: ip-10-0-10-68    i-2c579d9f  ip-10-0-20-155
January 20th 2016, 06:44:56.095     2016/01/20 11:44:56 [INFO] consul: New leader elected: ip-10-0-10-68    i-26b4cb95  ip-10-0-20-156
January 20th 2016, 06:44:56.088     2016/01/20 11:44:56 [INFO] consul: New leader elected: ip-10-0-10-68    i-0e2a12b8  ip-10-0-10-44
January 20th 2016, 06:44:56.086     2016/01/20 11:44:56 [INFO] consul: New leader elected: ip-10-0-10-68    i-3429e787  ip-10-0-20-194
January 20th 2016, 06:44:56.085     2016/01/20 11:44:56 [INFO] consul: New leader elected: ip-10-0-10-68    i-08ac58bb  ip-10-0-20-236
January 20th 2016, 06:44:56.075     2016/01/20 11:44:56 [INFO] consul: New leader elected: ip-10-0-10-68    i-3c350f8b  ip-10-0-10-240
January 20th 2016, 06:44:56.072     2016/01/20 11:44:56 [INFO] consul: New leader elected: ip-10-0-10-68    i-d342f760  ip-10-0-20-39
January 20th 2016, 06:44:56.063     2016/01/20 11:44:56 [INFO] consul: New leader elected: ip-10-0-10-68    i-c4eb2177  ip-10-0-20-250
January 20th 2016, 06:44:56.028     2016/01/20 11:44:56 [INFO] consul: New leader elected: ip-10-0-10-68    i-627d5bd2  ip-10-0-20-179
January 20th 2016, 06:44:56.024     2016/01/20 11:44:56 [INFO] consul: New leader elected: ip-10-0-10-68    i-30412d83  ip-10-0-20-114
January 20th 2016, 06:44:56.010     2016/01/20 11:44:56 [INFO] consul: New leader elected: ip-10-0-10-68    i-cee05a47  ip-10-0-10-193
January 20th 2016, 06:44:56.004     2016/01/20 11:44:56 [INFO] consul: New leader elected: ip-10-0-10-68    i-4718a2f1  ip-10-0-10-158
January 20th 2016, 06:44:55.995     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-35e024e2  ip-10-0-10-27
January 20th 2016, 06:44:55.990     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-34752782  ip-10-0-10-115
January 20th 2016, 06:44:55.990     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-5650cbaf  ip-10-0-30-143
January 20th 2016, 06:44:55.989     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-a0b70829  ip-10-0-10-8
January 20th 2016, 06:44:55.975     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-e686586f  ip-10-0-10-92
January 20th 2016, 06:44:55.972     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-a2f4172b  ip-10-0-10-131
January 20th 2016, 06:44:55.971     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-ac61bd25  ip-10-0-10-225
January 20th 2016, 06:44:55.965     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-18b844a8  ip-10-0-20-17
January 20th 2016, 06:44:55.964     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-3d34a38e  ip-10-0-20-160
January 20th 2016, 06:44:55.959     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-6e67aee7  ip-10-0-10-202
January 20th 2016, 06:44:55.956     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-bb443b08  ip-10-0-20-199
January 20th 2016, 06:44:55.955     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-0d0b21ba  ip-10-0-10-252
January 20th 2016, 06:44:55.944     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-ff532c4c  ip-10-0-20-47
January 20th 2016, 06:44:55.928     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-234c3390  ip-10-0-20-62
January 20th 2016, 06:44:55.927     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-a7e7582e  ip-10-0-10-238
January 20th 2016, 06:44:55.925     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-98c8432e  ip-10-0-10-89
January 20th 2016, 06:44:55.898     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-d793e92e  ip-10-0-90-123
January 20th 2016, 06:44:55.898     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-615e82e8  ip-10-0-10-145
January 20th 2016, 06:44:55.877     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-b673c605  ip-10-0-20-152
January 20th 2016, 06:44:55.874     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-2706fb97  ip-10-0-20-81
January 20th 2016, 06:44:55.874     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-ebf77858  ip-10-0-20-169
January 20th 2016, 06:44:55.859     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-bd364644  ip-10-0-90-31
January 20th 2016, 06:44:55.842     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-765990ff  ip-10-0-10-20
January 20th 2016, 06:44:55.742     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-e0347019  ip-10-0-90-232
January 20th 2016, 06:44:55.740     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-120cb6a4  ip-10-0-10-21
January 20th 2016, 06:44:55.722     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-fef7df4e  ip-10-0-20-133
January 20th 2016, 06:44:55.718     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-5be15ced  source-tunnel
January 20th 2016, 06:44:55.697     2016/01/20 11:44:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-5dd966d4  ip-10-0-10-68
January 20th 2016, 06:44:47.003     2016/01/20 11:44:47 [INFO] consul: New leader elected: ip-10-0-10-68    i-5f48a8d6  ip-10-0-10-86
January 20th 2016, 06:44:42.724     2016/01/20 11:44:42 [INFO] consul: New leader elected: ip-10-0-10-68    i-5c48a8d5  ip-10-0-10-85
January 20th 2016, 01:56:35.735     2016/01/20 06:56:35 [INFO] consul: New leader elected: ip-10-0-10-68    i-cc9fee35  ip-10-0-30-16
January 20th 2016, 01:56:13.519     2016/01/20 06:56:13 [INFO] consul: New leader elected: ip-10-0-10-68    i-4cd182fd  ip-10-0-40-99
January 20th 2016, 01:55:55.284     2016/01/20 06:55:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-08ac58bb  ip-10-0-20-236
January 20th 2016, 01:55:55.093     2016/01/20 06:55:55 [INFO] consul: New leader elected: ip-10-0-10-68    i-18b844a8  ip-10-0-20-17
January 20th 2016, 01:55:54.979     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-145a08a2  ip-10-0-10-224
January 20th 2016, 01:55:54.976     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-34752782  ip-10-0-10-115
January 20th 2016, 01:55:54.945     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-3c350f8b  ip-10-0-10-240
January 20th 2016, 01:55:54.940     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-fef7df4e  ip-10-0-20-133
January 20th 2016, 01:55:54.924     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-c4eb2177  ip-10-0-20-250
January 20th 2016, 01:55:54.921     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-bb443b08  ip-10-0-20-199
January 20th 2016, 01:55:54.913     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-4718a2f1  ip-10-0-10-158
January 20th 2016, 01:55:54.891     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-d793e92e  ip-10-0-90-123
January 20th 2016, 01:55:54.885     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-120cb6a4  ip-10-0-10-21
January 20th 2016, 01:55:54.883     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-627d5bd2  ip-10-0-20-179
January 20th 2016, 01:55:54.880     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-2c579d9f  ip-10-0-20-155
January 20th 2016, 01:55:54.863     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-b673c605  ip-10-0-20-152
January 20th 2016, 01:55:54.851     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-5be15ced  source-tunnel
January 20th 2016, 01:55:54.850     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-a0b70829  ip-10-0-10-8
January 20th 2016, 01:55:54.842     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-bd364644  ip-10-0-90-31
January 20th 2016, 01:55:54.817     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-1d625aab  ip-10-0-10-79
January 20th 2016, 01:55:54.812     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-26b4cb95  ip-10-0-20-156
January 20th 2016, 01:55:54.812     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-2706fb97  ip-10-0-20-81
January 20th 2016, 01:55:54.810     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-d342f760  ip-10-0-20-39
January 20th 2016, 01:55:54.810     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-3429e787  ip-10-0-20-194
January 20th 2016, 01:55:54.809     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-30412d83  ip-10-0-20-114
January 20th 2016, 01:55:54.800     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-a7e7582e  ip-10-0-10-238
January 20th 2016, 01:55:54.795     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-6e67aee7  ip-10-0-10-202
January 20th 2016, 01:55:54.795     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-0d0b21ba  ip-10-0-10-252
January 20th 2016, 01:55:54.791     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-ac61bd25  ip-10-0-10-225
January 20th 2016, 01:55:54.788     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-765990ff  ip-10-0-10-20
January 20th 2016, 01:55:54.780     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-98c8432e  ip-10-0-10-89
January 20th 2016, 01:55:54.779     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-ebf77858  ip-10-0-20-169
January 20th 2016, 01:55:54.763     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-e0347019  ip-10-0-90-232
January 20th 2016, 01:55:54.762     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-234c3390  ip-10-0-20-62
January 20th 2016, 01:55:54.761     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-ff532c4c  ip-10-0-20-47
January 20th 2016, 01:55:54.761     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-35e024e2  ip-10-0-10-27
January 20th 2016, 01:55:54.757     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-a2f4172b  ip-10-0-10-131
January 20th 2016, 01:55:54.756     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-e686586f  ip-10-0-10-92
January 20th 2016, 01:55:54.749     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-5650cbaf  ip-10-0-30-143
January 20th 2016, 01:55:54.747     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-3d34a38e  ip-10-0-20-160
January 20th 2016, 01:55:54.735     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-0e2a12b8  ip-10-0-10-44
January 20th 2016, 01:55:54.604     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-cee05a47  ip-10-0-10-193
January 20th 2016, 01:55:54.601     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-615e82e8  ip-10-0-10-145
January 20th 2016, 01:55:54.389     2016/01/20 06:55:54 [INFO] consul: New leader elected: ip-10-0-10-68    i-5dd966d4  ip-10-0-10-68
January 20th 2016, 01:55:45.964     2016/01/20 06:55:45 [INFO] consul: New leader elected: ip-10-0-10-68    i-5f48a8d6  ip-10-0-10-86
January 20th 2016, 01:55:41.768     2016/01/20 06:55:41 [INFO] consul: New leader elected: ip-10-0-10-68    i-5c48a8d5  ip-10-0-10-85
January 20th 2016, 01:50:39.901     2016/01/20 06:50:39 [INFO] consul: New leader elected: ip-10-0-10-68    i-cc9fee35  ip-10-0-30-16
January 20th 2016, 01:50:18.022     2016/01/20 06:50:18 [INFO] consul: New leader elected: ip-10-0-10-68    i-4cd182fd  ip-10-0-40-99
January 20th 2016, 01:49:59.259     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-6e67aee7  ip-10-0-10-202
January 20th 2016, 01:49:59.253     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-3d34a38e  ip-10-0-20-160
January 20th 2016, 01:49:59.247     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-765990ff  ip-10-0-10-20
January 20th 2016, 01:49:59.219     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-145a08a2  ip-10-0-10-224
January 20th 2016, 01:49:59.211     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-120cb6a4  ip-10-0-10-21
January 20th 2016, 01:49:59.209     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-a7e7582e  ip-10-0-10-238
January 20th 2016, 01:49:59.208     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-cee05a47  ip-10-0-10-193
January 20th 2016, 01:49:59.193     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-c4eb2177  ip-10-0-20-250
January 20th 2016, 01:49:59.186     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-ebf77858  ip-10-0-20-169
January 20th 2016, 01:49:59.186     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-0d0b21ba  ip-10-0-10-252
January 20th 2016, 01:49:59.173     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-b673c605  ip-10-0-20-152
January 20th 2016, 01:49:59.170     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-bb443b08  ip-10-0-20-199
January 20th 2016, 01:49:59.160     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-1d625aab  ip-10-0-10-79
January 20th 2016, 01:49:59.159     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-627d5bd2  ip-10-0-20-179
January 20th 2016, 01:49:59.150     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-234c3390  ip-10-0-20-62
January 20th 2016, 01:49:59.149     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-e0347019  ip-10-0-90-232
January 20th 2016, 01:49:59.146     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-5650cbaf  ip-10-0-30-143
January 20th 2016, 01:49:59.136     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-26b4cb95  ip-10-0-20-156
January 20th 2016, 01:49:59.130     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-bd364644  ip-10-0-90-31
January 20th 2016, 01:49:59.129     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-615e82e8  ip-10-0-10-145
January 20th 2016, 01:49:59.119     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-a0b70829  ip-10-0-10-8
January 20th 2016, 01:49:59.111     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-08ac58bb  ip-10-0-20-236
January 20th 2016, 01:49:59.109     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-0e2a12b8  ip-10-0-10-44
January 20th 2016, 01:49:59.096     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-3429e787  ip-10-0-20-194
January 20th 2016, 01:49:59.095     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-35e024e2  ip-10-0-10-27
January 20th 2016, 01:49:59.092     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-3c350f8b  ip-10-0-10-240
January 20th 2016, 01:49:59.088     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-4718a2f1  ip-10-0-10-158
January 20th 2016, 01:49:59.083     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-a2f4172b  ip-10-0-10-131
January 20th 2016, 01:49:59.083     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-ac61bd25  ip-10-0-10-225
January 20th 2016, 01:49:59.082     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-98c8432e  ip-10-0-10-89
January 20th 2016, 01:49:59.080     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-d342f760  ip-10-0-20-39
January 20th 2016, 01:49:59.078     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-34752782  ip-10-0-10-115
January 20th 2016, 01:49:59.068     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-fef7df4e  ip-10-0-20-133
January 20th 2016, 01:49:59.050     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-ff532c4c  ip-10-0-20-47
January 20th 2016, 01:49:59.035     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-18b844a8  ip-10-0-20-17
January 20th 2016, 01:49:59.011     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-d793e92e  ip-10-0-90-123
January 20th 2016, 01:49:59.008     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-2c579d9f  ip-10-0-20-155
January 20th 2016, 01:49:59.007     2016/01/20 06:49:59 [INFO] consul: New leader elected: ip-10-0-10-68    i-e686586f  ip-10-0-10-92
January 20th 2016, 01:49:58.985     2016/01/20 06:49:58 [INFO] consul: New leader elected: ip-10-0-10-68    i-2706fb97  ip-10-0-20-81
January 20th 2016, 01:49:58.919     2016/01/20 06:49:58 [INFO] consul: New leader elected: ip-10-0-10-68    i-5be15ced  source-tunnel
January 20th 2016, 01:49:58.918     2016/01/20 06:49:58 [INFO] consul: New leader elected: ip-10-0-10-68    i-30412d83  ip-10-0-20-114
January 20th 2016, 01:49:58.895     2016/01/20 06:49:58 [INFO] consul: New leader elected: ip-10-0-10-68    i-5dd966d4  ip-10-0-10-68
January 20th 2016, 01:49:50.106     2016/01/20 06:49:50 [INFO] consul: New leader elected: ip-10-0-10-68    i-5f48a8d6  ip-10-0-10-86
January 20th 2016, 01:49:45.877     2016/01/20 06:49:45 [INFO] consul: New leader elected: ip-10-0-10-68    i-5c48a8d5  ip-10-0-10-85

I'm going to update to 0.6.3 as that may fix my issue.

After 3 days I have had 2 separate instances of "Failed to contact" on 2 separate nodes out of the 5. I'm also seeing random CPU spikes from the normal 10-15% up to 35-40%.

All 5 servers are consul 0.6.3, but the clients are mixed versions from 0.5.2 to 0.6.3

Hi @sstarcher checking in on this one to see how things are going. If you are still seeing flaps I'd probably start looking at details on what your cluster load is like and/or other things happening on the server boxes. That would probably be best done on another issue, linked to this one, since there are several things mixed on here.

@slackpad I have documented several times what my server load is. The load on the server nodes is pretty much non-existent. The only role of those boxes is to be consul masters. We ship logs off the box and collect metrics, but that's it.

Sorry, I meant load in a Consul sense - are there any heavy patterns of Consul usage, like pulling huge subsets of the KV store, lots of blocking queries from complex consul-template instances running on many nodes, etc.?

@slackpad Thanks for the clarification. We don't currently use consul-template. We do use Vault, but only in a very minimal sense, and our Vault usage is newer, while the flapping issue has been around for a very long time. We use Consul for service discovery; the cache TTLs are all around 5 seconds, I believe, to eliminate excessive load.

I'm currently testing our consul servers on CoreOS to see if they exhibit the same flapping problem.

Interesting - that does sound pretty light, and nothing that should cause Consul to behave badly GC-wise, and 100 clients is a relatively small cluster. I'd probably go down the path of looking for network issues affecting your servers. We've seen some folks run into a Xen bug which may or may not apply to you; see https://github.com/hashicorp/consul/issues/1154#issuecomment-133117548. That one caused heavy packet loss and TCP retransmits. Running Docker with --net=host is the right way to go for Consul and should eliminate Docker-related network issues, though it would be interesting to see how the cluster performs on the same machines outside of Docker if that's possible.
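
If anyone wants to check for that retransmit symptom, one quick (Linux-only, hedged) approach is to compare the kernel's TCP counters before and after a flap, and to watch interface-level drops:

    # Snapshot TCP retransmission counters; run again after a flap and diff.
    netstat -s | grep -i retrans

    # Interface-level errors/drops (eth0 is an assumption; use your interface).
    ip -s link show eth0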

@slackpad After migrating the nodes to CoreOS if the flapping continues I will run everything outside of docker and report back. I do run with --net=host which is why I have kept running in docker, but it could still be related.

Sounds good - yeah I noticed above that you were using --net=host so I can't think of other Docker-related causes if you are doing that, but it would be an interesting data point.

@sstarcher - did running outside docker help? We are seeing infrequent 'random' flapping with a very small (<25 node) AWS EC2 cluster. We see nodes time out for <10 seconds and then reconnect, as well as occasional "no cluster leader" errors.

Unfortunately we're on AWS ECS, which doesn't support docker's --net=host, so we almost certainly have to migrate to direct EC2. Just wondering if that will help or whether there is something else at play. Thanks!

@mrwilby I am running on EC2 outside of docker and still get flapping. I actually had an outage just last night. Still have no idea what's going on.

With a ~40-node cluster in EC2, running CentOS 6 with Consul 0.6.3 and no containers, we're seeing tons of flapping. It seems to happen on all of our clients (the consul servers do not appear to be flapping, though). It's happening across all of our boxes in Development with basically no traffic. UDP and TCP are open on the Consul ports.

We migrated to consul running on the host (no docker) and still see flapping. However, I didn't grok the GOMAXPROCS comments until now. We will also see if that helps, but at this point I am not optimistic...

@mrwilby GOMAXPROCS changes haven't done anything for me. I was running docker with prometheus exporters on my consul servers, so I have stopped docker to see if that helps.

With the newer versions of Go, GOMAXPROCS should not make a difference since Go will by default use all the cores available.

Node health flapping (as opposed to spurious leader elections) is almost always caused by network issues or not having all of the ports open or reachable between certain pairs of hosts.

We are working on some tools to try to make this easier to debug, but if you look through your Consul logs for "failed, no acks received" you can see which node is declaring a node failed. Look for patterns where one node is consistently failing a bunch of other nodes (usually a networking problem on that host), or where pairs of nodes never seem to be able to ping each other (usually a firewall / network ACL kind of issue).
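
As a concrete sketch of that log mining (the log path is an assumption - point it at wherever your agents write their output): each agent logs a "memberlist: Suspect <node> has failed" line for every peer it accuses, so you can tally accusations per accused node like this:

    # On each host, count which peers this agent has declared failed.
    grep 'memberlist: Suspect' /var/log/consul.log \
      | sed -E 's/.*Suspect ([^ ]+) has failed.*/\1/' \
      | sort | uniq -c | sort -rn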

I suspect that for me the issue is more basic: the process seems to be restarting every few minutes. I can tell because consul monitor reports an inability to communicate with the consul process.
How does one debug this kind of behavior?

@romansky there's a debug log level that might help there - https://www.consul.io/docs/agent/options.html#_log_level.

@slackpad thanks, I already have debug set - does this take effect during monitor? I don't seem to see any messages with a [DEBUG] prefix. If the process is indeed crashing, will it produce any relevant logs?

@romansky yes, consul monitor can set the log level to debug, though if it looks like the process is crashing you might want to set that on the Consul agent itself and observe its output directly - the consul monitor command does an RPC call to the agent, so it might miss any final logs related to a panic, etc.
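
Both variants look roughly like this (the config path is an assumption):

    # Stream debug-level logs from a running agent via RPC:
    consul monitor -log-level=debug

    # Or run the agent in the foreground with debug logging, so a panic
    # or crash is visible directly on stderr:
    consul agent -config-dir=/etc/consul.d -log-level=debug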

@slackpad so if the config file is set with the following:

{
...
"log_level": "DEBUG",
"enable_debug": true,
...
}

Is this enough?

Just this part should be required:

{
...
"log_level": "debug",
...
}

The other option enables some additional debug endpoints, but those aren't required for logging.

@romansky your issue is probably worth splitting off into a new Github issue since it's not related to the flapping others are seeing here. That'll help keep the noise down since there are quite a few threads going on this one.

@slackpad rgr

I am also experiencing this exact problem, and the cluster setup is very similar to what others have been posting here. Our cluster currently has a mix of 0.5.2 and 0.6.x nodes.

We are running the servers on v0.6.3, on m3.mediums, in the same region but different AZs, as upstart services (not containers). consul rtt between the server nodes looks healthy. I've also run it against nodes that showed up as failed, but they looked good as well.

I suspect the network internals don't play well between the 0.5.2 and 0.6.x variants.

We are running all 0.6.3, no docker, same region, multi AZ and still see issues so I don't think 0.5.x + 0.6.x interop is the sole issue.

Hi @cleung2010 we did work and extensive testing to make sure a mix of 0.5.2 and 0.6.x would work well together, though the newly-added TCP fallback ping in 0.6.x won't apply when it's talking to a 0.5.2 node on the other end (and vice versa), so if you are experiencing heavy packet loss you won't get the benefit of that new feature.

We are working on some improvements for flapping in the upcoming 0.7 release at the Serf level, so we should have more details to share here soon.

Hi, just wanted to chip in and mention that I have similar flapping issues related to consul / docker (not running on AWS, however).

My setup:

docker exec -ti f9457379e54d /consul/consul version
Consul v0.6.3
Consul Protocol: 3 (Understands back to: 1)

Consul cluster of 3 running inside docker containers.
3 Separate VM's (vmware) on same Vlan (no firewalls at all)

First I make sure that I've run conntrack -F or ip -s -s neigh flush all
to flush the cache. I also rm any {raft,serf} files I might have in /var/tmp/consul

Cluster is started fresh using:

node1 : 10.7.7.75
node2 : 10.7.7.79
node3 : 10.7.7.83

(The above IP addresses are changed for this post; I've tried to be consistent.)

#node 1, node 2 and node 3 are all started with the same command
#the ip addresses are changed for each node.

/usr/bin/docker run -d --privileged --net=host -v /var/tmp/consul:/data \
    -p 10.7.7.75:8300:8300/tcp \
    -p 10.7.7.75:8300:8300/udp \
    -p 10.7.7.75:8301:8301/tcp \
    -p 10.7.7.75:8301:8301/udp \
    -p 10.7.7.75:8302:8302/tcp \
    -p 10.7.7.75:8302:8302/udp \
    -p 10.7.7.75:8400:8400/tcp \
    -p 10.7.7.75:8400:8400/udp \
    -p 10.7.7.75:8500:8500/tcp \
    -p 10.7.7.75:8500:8500/udp \
    -p 10.7.7.75:8600:8600/tcp \
    -p 10.7.7.75:8600:8600/udp \
    gliderlabs/consul-agent -server -bind 10.7.7.75 -bootstrap-expect 3 -ui -dc A -config-file /data/server.json
cat /var/tmp/consul/server.json
{
  "disable_update_check": true,
  "dns_config": {
    "allow_stale": true,
    "node_ttl": "2s",
    "max_stale": "5s",
    "service_ttl": {
      "*": "5s",
      "api": "10s"
    }
  }
}

then I run join from one of the nodes.

docker exec -ti $(docker ps -lq) /bin/consul join 10.7.7.75 10.7.7.79 10.7.7.83

So far so good.

#node1 logs
2016/04/02 22:16:23 [ERR] agent: failed to sync remote state: No cluster leader
2016/04/02 22:16:38 [ERR] agent: coordinate update error: No cluster leader
2016/04/02 22:16:41 [INFO] serf: EventMemberJoin: node3 10.7.7.83
2016/04/02 22:16:41 [INFO] consul: adding LAN server node3 (Addr: 10.7.7.83:8300) (DC: A)
2016/04/02 22:16:41 [INFO] serf: EventMemberJoin: node2 10.7.7.79
2016/04/02 22:16:41 [INFO] consul: adding LAN server node2 (Addr: 10.7.7.79:8300) (DC: A)
2016/04/02 22:16:41 [INFO] consul: New leader elected: node2
2016/04/02 22:16:43 [INFO] agent: Synced service 'consul'
2016/04/02 22:18:07 [INFO] agent.rpc: Accepted client: 127.0.0.1:56493
2016/04/02 22:18:12 [INFO] agent.rpc: Accepted client: 127.0.0.1:56494

#node2 logs (leader)

2016/04/02 22:16:41 [INFO] consul: adding LAN server node3 (Addr: 10.7.7.83:8300) (DC: A)
2016/04/02 22:16:41 [WARN] raft: Heartbeat timeout reached, starting election
2016/04/02 22:16:41 [INFO] raft: Node at 10.7.7.79:8300 [Candidate] entering Candidate state
2016/04/02 22:16:41 [INFO] raft: Election won. Tally: 2
2016/04/02 22:16:41 [INFO] raft: Node at 10.7.7.79:8300 [Leader] entering Leader state
2016/04/02 22:16:41 [INFO] consul: cluster leadership acquired
2016/04/02 22:16:41 [INFO] consul: New leader elected: node2
2016/04/02 22:16:41 [INFO] raft: pipelining replication to peer 10.7.7.83:8300
2016/04/02 22:16:41 [INFO] raft: pipelining replication to peer 10.7.7.75:8300
2016/04/02 22:16:41 [WARN] raft: Remote peer 10.7.7.75:8300 does not have local node 10.7.7.79:8300 as a peer
2016/04/02 22:16:41 [INFO] consul: member 'node2' joined, marking health alive
2016/04/02 22:16:41 [INFO] consul: member 'node1' joined, marking health alive
2016/04/02 22:16:41 [INFO] consul: member 'node3' joined, marking health alive
2016/04/02 22:16:43 [INFO] agent: Synced service 'consul'

#node3 logs
2016/04/02 22:16:41 [INFO] agent: (LAN) joining: [10.7.7.75 10.7.7.79 10.7.7.83]
2016/04/02 22:16:41 [INFO] serf: EventMemberJoin: node1 10.7.7.75
2016/04/02 22:16:41 [INFO] consul: adding LAN server node1 (Addr: 10.7.7.75:8300) (DC: A)
2016/04/02 22:16:41 [INFO] serf: EventMemberJoin: node2 10.7.7.79
2016/04/02 22:16:41 [INFO] consul: adding LAN server node2 (Addr: 10.7.7.79:8300) (DC: A)
2016/04/02 22:16:41 [INFO] consul: Attempting bootstrap with nodes: [10.7.7.83:8300 10.7.7.75:8300 10.7.7.79:8300]
2016/04/02 22:16:41 [INFO] agent: (LAN) joined: 3 Err: <nil>
2016/04/02 22:16:41 [INFO] consul: New leader elected: node2
2016/04/02 22:16:41 [INFO] agent: Synced service 'consul'
2016/04/02 22:16:57 [INFO] agent.rpc: Accepted client: 127.0.0.1:34524

Then I start up another container on for example node1.

[root@node1 ~]# docker run -d -e CONSUL_SERVER=10.7.7.75 -e DATA_CENTER=A myregistry:5000/my-application:latest

f9457379e54d1262b0a4ce8b583b3e029f91c5cd8ac44d6f3468345a52cd5508

This particular container has a consul agent inside of it, which starts up under supervisord as pid 1:

# From the supervisord conf (the environment variables are replaced with the -e values I provided in the docker run command shown above):

/consul/consul agent -dc=%(ENV_DATA_CENTER)s -retry-join=%(ENV_CONSUL_SERVER)s -config-dir=/consul/config -pid-file=/consul/consul.pid -data-dir=/consul/data

The logs from consul monitor on node1 say:

2016/04/02 22:25:50 [INFO] serf: EventMemberJoin: f9457379e54d 172.17.0.3
2016/04/02 22:27:05 [INFO] serf: EventMemberFailed: f9457379e54d 172.17.0.3
2016/04/02 22:27:19 [INFO] serf: EventMemberJoin: f9457379e54d 172.17.0.3
2016/04/02 22:27:30 [INFO] memberlist: Marking f9457379e54d as failed, suspect timeout reached
2016/04/02 22:27:30 [INFO] serf: EventMemberFailed: f9457379e54d 172.17.0.3
2016/04/02 22:27:49 [INFO] serf: EventMemberJoin: f9457379e54d 172.17.0.3
2016/04/02 22:28:01 [INFO] memberlist: Marking f9457379e54d as failed, suspect timeout reached
2016/04/02 22:28:01 [INFO] serf: EventMemberFailed: f9457379e54d 172.17.0.3
2016/04/02 22:28:04 [INFO] serf: attempting reconnect to f9457379e54d 172.17.0.3:8301
2016/04/02 22:28:04 [INFO] serf: EventMemberJoin: f9457379e54d 172.17.0.3
2016/04/02 22:30:01 [INFO] serf: EventMemberFailed: f9457379e54d 172.17.0.3
2016/04/02 22:30:19 [INFO] serf: EventMemberJoin: f9457379e54d 172.17.0.3

logs from the container f9457379e54d say:

(never mind the time stamp since I've created this ticket after the fact and scrolled through the logs and picked parts that were related)

2016/04/02 23:06:25 [WARN] memberlist: Refuting a suspect message (from: node3)
2016/04/02 23:07:05 [WARN] memberlist: Was able to reach node3 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
2016/04/02 23:07:08 [INFO] serf: EventMemberFailed: ca3877daf6c5 172.17.0.2
2016/04/02 23:07:19 [WARN] memberlist: Refuting a suspect message (from: f9457379e54d)
2016/04/02 23:07:20 [INFO] serf: attempting reconnect to ca3877daf6c5 172.17.0.2:8301
2016/04/02 23:07:20 [INFO] serf: EventMemberJoin: ca3877daf6c5 172.17.0.2
2016/04/02 23:07:47 [INFO] agent.rpc: Accepted client: 127.0.0.1:37013

brought up another container with the same run commands as the previous container.

(ca3877daf6c5) on node1

Same flapping issues - serf checks going green and orange.

logs from node 3

2016/04/02 23:11:11 [ERR] memberlist: Push/Pull with f9457379e54d failed: dial tcp 172.7.0.3:8301: getsockopt: no route to host
2016/04/02 23:11:15 [INFO] serf: EventMemberFailed: ca3877daf6c5 172.17.0.2
2016/04/02 23:11:21 [INFO] serf: EventMemberJoin: ca3877daf6c5 172.17.0.2
2016/04/02 23:11:41 [INFO] memberlist: Suspect ca3877daf6c5 has failed, no acks received

2016/04/02 23:11:42 [ERR] memberlist: Push/Pull with f9457379e54d failed: dial tcp 172.17.0.3:8301: getsockopt: no route to host  ### <-- tried --expose 8301 when starting the application containers, but no luck: same issue and same messages. Clearly a routing issue, though I can't understand why, since there are no firewalls or similar in place.

2016/04/02 23:11:46 [INFO] memberlist: Marking ca3877daf6c5 as failed, suspect timeout reached

details of the members in the cluster.

[root@node1 ~]# docker exec -ti ca3877daf6c5 /consul/consul members -detailed
Node                                          Address            Status  Tags
ca3877daf6c5                                  172.17.0.2:8301    alive   build=0.6.3:c933efde,dc=sta,role=node,vsn=2,vsn_max=3,vsn_min=1
f9457379e54d                                  172.17.0.3:8301    alive   build=0.6.3:c933efde,dc=sta,role=node,vsn=2,vsn_max=3,vsn_min=1
node1  10.7.7.75:8301  alive   build=0.6.3:c933efde,dc=sta,expect=3,port=8300,role=consul,vsn=2,vsn_max=3,vsn_min=1
node2  10.7.7.79:8301  alive   build=0.6.3:c933efde,dc=sta,expect=3,port=8300,role=consul,vsn=2,vsn_max=3,vsn_min=1
node3  10.7.7.83:8301  alive   build=0.6.3:c933efde,dc=sta,expect=3,port=8300,role=consul,vsn=2,vsn_max=3,vsn_min=1

I've tried all possible combinations - exposing port 8301/udp and tcp on the application containers I start up, running them with --net=host, --expose 8301/udp --expose 8301/tcp, etc.

I always end up with the same flapping issues.
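
Given the "able to reach node3 via TCP but not UDP" warning above, one more sanity check worth running is a raw bidirectional UDP test between a container and a remote node with netcat. Port 9999 here is an arbitrary spare port so it doesn't collide with the agent's bind on 8301, and listener syntax varies between netcat flavors (traditional netcat wants -l -p 9999):

    # On node3 (10.7.7.83), listen on a spare UDP port:
    nc -u -l 9999

    # Inside the application container on node1, send a datagram...
    echo ping | nc -u -w1 10.7.7.83 9999
    # ...then repeat in the opposite direction to confirm the return path.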

I think I read through the whole thread. Did I miss the solution? It says it's closed. What's the solution? I am having the exact same problem.

I have two VPCs connected via VPC Peering. In one VPC, there are four consul servers and all agents in both VPCs register to them. When I look at the Consul UI, I can see that random nodes are going orange and then back to green. There are ~70 servers in my infrastructure and they all have at least the consul agent running.

I have everything going to Logstash so I can mine logs quickly for patterns. I am running consul 0.6.4. I am using consul-template to update HAProxy.

The infrastructure is built by Cloudformation and all security groups are open to each other with respect to Consul and Consul-template.

The flapping is not bad most of the time, but there have been several instances where it was bad enough that there were no servers in rotation in HAProxy, though only for less than a minute. I don't know how to solve this issue.

Hi @pong-takepart sorry the paper trail isn't very clear on this one. We've got some changes teed up to go out in the next release of Consul to make this much better - here's the PR that pulled them in - https://github.com/hashicorp/consul/pull/2101.

@slackpad OSSUM!!! Thanks!!

@pong-takepart My issue turned out to be a low-memory problem caused by running monitoring containers. When something happened in the Consul cluster that resulted in an increase in logging, the monitoring container used more memory, which led to more issues with Consul due to low memory, and in turn more logging. I ended up removing the monitoring containers and the Consul cluster has been rock solid ever since.

@pong-takepart what size are your servers? Also, if you are not collecting the Consul metrics, I would recommend it.
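
In the 0.6-era config, pointing the agent at a statsd sink is a single top-level key; a minimal sketch, with a hypothetical config path and address:

    cat > /etc/consul.d/telemetry.json <<'EOF'
    {
      "statsd_addr": "127.0.0.1:8125"
    }
    EOF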

@slackpad Hello. Is there any tentative guesstimate about when you will cut a release that contains these changes? I am guessing a lot of folks will be interested to pick up these improvements and see if they address the flapping issues. Thanks for spending cycles investigating this!

Hi @mrwilby I don't have a firm date on the final release but hopefully a release candidate in the next several weeks. We are burning down a few more things before we cut a release, though this feature is fully integrated in master now.

Thanks @slackpad - We're trying out master now in one of our test environments. Fingers folded for finally fixing flapping faults!

@mrwilby excellent - would appreciate any feedback. The way these fixes should manifest is that the node experiencing non-real-time behavior (CPU exhaustion, network performance issues, dropped packets, etc.) should still get marked failed, but it shouldn't be able to falsely accuse other, healthy nodes of being failed and cause them to flap.

@slackpad - ok. Are there (or will there be) tuning parameters we can use to adjust the "non-real-time behavior" tolerances? We, and I'm sure others, would prefer to avoid having to provision costly/large cloud instance types just for a very small cluster (a few handfuls of nodes) simply because consul is over-aggressive in deciding whether a node has failed.

We are definitely planning to do that for Raft and the servers as part of this release. We didn't plan manual tunes at the Serf level, but this new algorithm should be much more forgiving in that regard, since it requires independent confirmations in order to quickly declare a failure. So depending on what's causing the NRT behavior and how bad it is, you may find that the degraded node itself isn't getting marked failed at all - especially if your load spikes are short-lived and erratic, this should perform much better. As we continue testing and get feedback we may consider some Serf tunes as well, but hopefully we won't need to.

we may consider some Serf tunes as well

Would be great to have access to Serf tunables at least to some extent.

@slackpad I am running a build from master (0.7.0dev @ 6af6baf) and unfortunately I am still experiencing leader elections several times a day. This is a 5-node cluster that is _completely_ idle except for Consul (which is just sitting there, not yet in use). The servers are relatively small (t2.small) spread across 3 AWS availability zones. I have tried using m3.medium and had the same result.

What data can I provide that will help?

@kingpong Coming from an AWS setup, I can tell you I always experienced leader elections when running t2 instance types. It was not until I moved to m3 and larger that the election issue went away.

I am running my 3-node EC2 cluster on t2.smalls and do not have the election issues.

@kingpong check out this thread: https://github.com/hashicorp/vault/issues/1585

It is related to Vault going down because of Consul flapping. TL;DR: as noted in this thread and elsewhere, you can't run Consul on very small instances.

In summary, I found two things:

  • consul is sensitive to network and disk IO, so if the underlying OS does anything to impact those, you'll see Consul flap. For example, if you are running docker and/or CoreOS you _will_ have issues with smaller instances (see https://github.com/coreos/bugs/issues/1424, https://github.com/coreos/bugs/issues/1081, and https://bugzilla.kernel.org/show_bug.cgi?id=65201); or docker logging buffer overflows (https://github.com/docker/docker/issues/22502), journald IO pegging, etc. I've found so many little issues that have nothing to do with consul, but they impact consul's internal health checks. I just run on larger instances, as those have enough spare capacity that weird stuff can happen and consul most likely won't be impacted.
  • consul had an issue where a single flapping node, in the right circumstances, could propagate errors to the rest of the consul cluster, causing otherwise healthy instances to register as unhealthy. I found this occurs (for me anyway) when an unhealthy node flaps quickly and continuously, such as triggered by any of the issues listed above, and it doesn't have time to stay marked as unhealthy or healthy. This is supposedly fixed on master.

Thanks for the additional info, @skippy.

tl;dr: I think GOMAXPROCS=2 was the culprit. Changing to 10 seems to have solved it.

These servers are completely unused outside of Consul, and Consul itself is literally doing nothing except gossiping with itself. Even a t2.small is ridiculously overpowered for that task. They are 97% idle all of the time, including during the times when these leader elections have occurred.

This smells like an application bug because I have been monitoring with vmstat, ping and tcpdump for the last 24 hours, and none of those tools indicate system load or link-layer network instability during the elections. I do see some TCP retransmits and connection resets, but other traffic (e.g. pings) between the hosts at literally the very same second is unaffected.

I have been using the Chef consul cookbook to deploy Consul. The cookbook automatically sets the consul process's GOMAXPROCS to the number of cores or 2, whichever is greater. So on my single core machine, that means GOMAXPROCS=2. On a whim, I set it to GOMAXPROCS=10 (picked out of thin air) across the cluster. It's been six hours without an election so far (which is a record by a margin of about 5 hours).

Tomorrow I will try removing GOMAXPROCS altogether.
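
For anyone wanting to repeat the experiment: GOMAXPROCS is just an environment variable read at process start, so the override is a one-liner wherever the agent is launched (the value and config path below are illustrative, not recommendations):

    # 10 was picked out of thin air, as noted above; adjust and observe.
    export GOMAXPROCS=10
    consul agent -config-dir=/etc/consul.d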

Just to provide my perspective again: I use an Ansible playbook to install and run consul, and the default GOMAXPROCS is also 2. Haven't had issues with it. ¯\_(ツ)_/¯

@MrMMorris are you running inside of docker? If not, that's likely why you can run on t2's by default. I noticed results similar to @kingpong's with t2 instance types not being fully utilized by default.

yep, no docker for consul here

@kingpong Did your gomaxprocs=10 change have any measurable effect over time?

We are running 0.7.x (with the latest flapping-related improvements) but still seeing a few flaps now and then (on c4.large in AWS). This is with a small cluster of ~30 nodes (incl. 3 consul servers).

@slackpad It would be very helpful if there were some way to diagnose the root cause of flaps - i.e., whether they are due to networking issues, CPU starvation, or something else. Right now we have consul debug logging turned on, but the logs are not particularly enlightening.

We also have the consul metrics piped into our stats backend, but again, don't see any real correlations with the flap times. We probably are not looking in the correct place inside the consul logs and/or stats.

IMO the collection of flapping-related issues is definitely something that HashiCorp should prioritize - so far we've yet to find any recipe that solves them. The only recommended solution is to massively overprovision your instance types, but when the agents themselves seem to contribute (not just the consul servers), that effectively means every instance type has to be over-provisioned, which is simply not viable just to run consul.

@mrwilby is consul eating up your CPU on the m4.large, or is the CPU not doing a lot of work? When we were on m3.mediums we had zero flapping until we pegged our CPUs at 80%+, and moving to c4.larges resolved that. We run a large cluster and had a lot going on, but we could likely go down to m3.mediums again.

With docker + non-t2 instance types, CPU should only be an issue if the entire server's CPU is pegged. Setting GOMAXPROCS should not be necessary.

@sstarcher Sorry, my mistake - we're actually running consul servers on c4.large (I edited my post to correct). The consul servers are listed as 100% idle from our metrics (of course, I am sure there are small bursts of activity which our metric granularity doesn't reflect). But anyway, massively over-provisioned for our very light workloads.

The 0.7.x changes appear to be a lot more stable for us than 0.6.x, but we do still see occasional flaps with no substantiating metrics to enable us to pin down why. Again, we are most likely not looking in the correct place, which is why I asked if there was any info about how to diagnose the root cause of flaps.

We have a small kafka cluster of 3 brokers. The last flap involved 2 of the brokers deciding that the 3rd had failed.

    2016/08/19 22:38:44 [INFO] serf: EventMemberFailed: 10.0.3.94 10.0.3.94
    2016/08/19 22:38:44 [INFO] consul: member '10.0.3.94' failed, marking health critical
    2016/08/19 22:38:44 [INFO] serf: EventMemberJoin: 10.0.3.94 10.0.3.94
    2016/08/19 22:38:44 [INFO] consul: member '10.0.3.94' joined, marking health alive

From looking at the stats & logs of the 3rd broker, it was >90% idle during this time. Our test environment was running kafka on the m4.large instance type, so it wasn't an under-provisioned t2-series instance.

We moved consul from docker to native EC2 a long time ago due to all the docker UDP issues and in the hope that it would resolve the flapping issues, but to no avail.

@mrwilby your case certainly sounds like a networking issue. I would recommend looking at the logs of the node that is the leader.

@mrwilby thanks for the feedback - fixing these flaps was a big area of focus for the 0.7 release and hopefully we can tune things before the release to get performance even better. It's not in master yet but we are also going to expose Raft timing controls that should let you run your servers on less powerful instances as well.
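(For those finding this later: the Raft timing control ended up shipping in 0.7 as the raft_multiplier setting under the performance block of the server config. A minimal sketch - the value shown is just the 0.7 default:

    {
      "performance": {
        "raft_multiplier": 5
      }
    }

Lower values mean faster leader failure detection at the cost of more sensitivity to CPU and network hiccups; setting it to 1 restores the more aggressive pre-0.7 timing.)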

> The 0.7.x changes appear to be a lot more stable for us than 0.6.x, but we still see occasional flaps with no supporting metrics to let us pin down why. Again, we are most likely not looking in the correct place, which is why I asked whether there is any info on how to diagnose the root cause of flaps.

Are you running 0.7.0-rc1, and is it on all nodes in this test cluster or just some of them? To get the full flap reduction benefits you'd want most or ideally all nodes to have the new logic.

You can see who actually detected the node as failed by looking for `Suspect <node name> has failed, no acks received` in Consul logs across the cluster. That's the node that probed via UDP directly, via UDP through a few peers, and via TCP directly, and never heard back, so it started the suspicion of the node. Looking at the flapping node is usually not helpful for determining who accused it, because it hears about the suspicion via gossip and so doesn't report where it originated. I agree this is hard to diagnose; we've got plans to make it better, but that will come after 0.7.
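Until that lands, a quick way to find the accuser is to scan every agent's log you've collected for that message. A throwaway sketch, assuming log file paths are passed as arguments (none of this is Consul API):

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    // Print every "no acks received" line from the given consul agent logs.
    // The node whose own log contains this message is the one that originated
    // the suspicion; flapping nodes only hear about it via gossip.
    func main() {
        for _, path := range os.Args[1:] {
            f, err := os.Open(path)
            if err != nil {
                fmt.Fprintln(os.Stderr, err)
                continue
            }
            sc := bufio.NewScanner(f)
            for sc.Scan() {
                if line := sc.Text(); strings.Contains(line, "no acks received") {
                    fmt.Printf("%s: %s\n", path, line)
                }
            }
            f.Close()
        }
    }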

When adding a new node and redistributing data without downtime, will performance degrade?

We're currently running Consul 0.6.4 on a cluster of ~100 nodes with both Windows and Linux boxes. We're seeing a high rate of failure events followed almost immediately by join events when there's a max-CPU situation on a few of the nodes (4 or 5). The number of failure events (EventMemberFailed) seen at a peer is around 200-250 per hour. The surprising thing is that the failures reported are not just for the nodes with maxed CPU, but span the entire cluster; the majority (a little over 50%) are for the maxed-CPU nodes, though.

Force-stopping the Consul agent on the slammed nodes makes all the failures go away. We therefore suspect that false alarms are being triggered by the slammed nodes, because they cannot send/receive/process UDP packets in time, causing the entire cluster to experience churn. We're still surprised that false alarms can progress to the point that many healthy nodes get reported as failed.

We see that 0.7.0 adds guards for similar/related scenarios. Before we put in the effort to upgrade and re-validate our deployments, it would be great if we could confirm that this is expected behavior with 0.6.4 when a few nodes are at max CPU, and that we haven't missed some trick to address it with 0.6.4. Thanks!

Hi @er-kiran you are correct - Consul 0.7's Lifeguard changes were specifically targeted at limiting the damage that a degraded node could do to the rest of the cluster. Previously, one degraded node could start making false accusations that could lead to churn around the cluster.

Unfortunately, 0.7 does not seem to fix this. We had a few incidents and were surprised to find that our staging instances were affecting production (even though staging was set up, via auth tokens, to only read Consul information, not write to it).

We upgraded everything to 0.7 but are still seeing flapping.

It would be nice to be able to tune timeouts, or at least to disallow certain instances from having a vote (e.g. via access tokens).
