I am trying to start three Consul servers in an ECS cluster that consists of three EC2 instances, each in a different availability zone within one AWS region.
The command I am using in my task definition:
agent,-server,-client=0.0.0.0,-bootstrap-expect=3,-datacenter=eu-west-1,-retry-join="provider=aws tag_name=aws:autoscaling:groupName tag_value=ecs_cluster",-log-level=TRACE
and I am getting the following log on all three instances (it is worth mentioning that the task definition has an IAM role with permission to call DescribeInstances):
==> Found address 'x.x.x.x' for interface 'eth0', setting bind option...
bootstrap_expect > 0: expecting 3 servers
==> Starting Consul agent...
==> Consul agent running!
Version: 'v1.0.2'
Node ID: 'b5062427-37cd-ffcd-8019-803ba5454c83'
Node name: 'ip-x-x-x-x'
Datacenter: 'eu-west-1' (Segment: '<all>')
Server: true (Bootstrap: false)
Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, DNS: 8600)
Cluster Addr: x.x.x.x (LAN: 8301, WAN: 8302)
Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false
==> Log data will now stream in as it occurs:
2018/01/12 09:44:59 [DEBUG] Using random ID "b5062427-37cd-ffcd-8019-803ba5454c83" as node ID
2018/01/12 09:44:59 [INFO] raft: Initial configuration (index=0): []
2018/01/12 09:44:59 [INFO] raft: Node at x.x.x.x:8300 [Follower] entering Follower state (Leader: "")
2018/01/12 09:44:59 [INFO] serf: EventMemberJoin: ip-x-x-x-x.eu-west-1 x.x.x.x
2018/01/12 09:44:59 [INFO] serf: EventMemberJoin: ip-x-x-x-x x.x.x.x
2018/01/12 09:44:59 [INFO] consul: Adding LAN server ip-x-x-x-x (Addr: tcp/x.x.x.x:8300) (DC: eu-west-1)
2018/01/12 09:44:59 [INFO] consul: Handled member-join event for server "ip-x-x-x-x.eu-west-1" in area "wan"
2018/01/12 09:44:59 [INFO] agent: Started DNS server 0.0.0.0:8600 (tcp)
2018/01/12 09:44:59 [INFO] agent: Started DNS server 0.0.0.0:8600 (udp)
2018/01/12 09:44:59 [INFO] agent: Started HTTP server on [::]:8500 (tcp)
2018/01/12 09:44:59 [INFO] agent: started state syncer
2018/01/12 09:44:59 [INFO] agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce os scaleway softlayer
2018/01/12 09:44:59 [INFO] agent: Joining LAN cluster...
2018/01/12 09:44:59 [ERR] agent: Join LAN: discover: provider=aws tag_name=aws:autoscaling:groupName tag_value=ecs_cluster: missing '='
2018/01/12 09:44:59 [WARN] agent: Join LAN failed: No servers to join, retrying in 30s
2018/01/12 09:45:06 [WARN] raft: no known peers, aborting election
2018/01/12 09:45:07 [ERR] agent: failed to sync remote state: No cluster leader
2018/01/12 09:45:29 [ERR] agent: Join LAN: discover: provider=aws tag_name=aws:autoscaling:groupName tag_value=ecs_cluster: missing '='
I have tried the following to make sure everything is fine:
/ # ps -ef
PID USER TIME COMMAND
1 root 0:00 {docker-entrypoi} /usr/bin/dumb-init /bin/sh /usr/local/bin/docker-entrypoint.sh agent -server -client=0.0.0.0 -bootstrap-expect=3 -datacenter=eu-west-1 -retry-join="provider=aws tag_name=aws:autoscaling:groupName tag_value=ecs_cluster"
6 consul 0:00 consul agent -data-dir=/consul/data -config-dir=/consul/config -bind=x.x.x.x -server -client=0.0.0.0 -bootstrap-expect=3 -datacenter=eu-west-1 -retry-join="provider=aws tag_name=aws:autoscaling:groupName tag_value=ecs_cluster"
25 root 0:00 sh
29 root 0:00 ps -ef
I also SSHed directly into the container instances and ran the following command (with an AWS key):
docker run --net=host -e 'CONSUL_BIND_INTERFACE=eth0' consul agent -server -client=0.0.0.0 -bootstrap-expect=3 -datacenter=eu-west-1 -retry-join="provider=aws tag_name=Name tag_value=ecs_cluster access_key_id=x secret_access_key=x" -log-level=TRACE
and I got slightly different output:
==> Found address 'x.x.x.x' for interface 'eth0', setting bind option...
bootstrap_expect > 0: expecting 3 servers
==> Starting Consul agent...
==> Consul agent running!
Version: 'v1.0.2'
Node ID: '86d75175-f8ac-badc-19d3-30e9942f09d0'
Node name: 'ip-x-x-x-x'
Datacenter: 'eu-west-1' (Segment: '<all>')
Server: true (Bootstrap: false)
Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, DNS: 8600)
Cluster Addr: x.x.x.x (LAN: 8301, WAN: 8302)
Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false
==> Log data will now stream in as it occurs:
2018/01/12 09:43:27 [DEBUG] Using random ID "86d75175-f8ac-badc-19d3-30e9942f09d0" as node ID
2018/01/12 09:43:27 [INFO] raft: Initial configuration (index=0): []
2018/01/12 09:43:27 [INFO] raft: Node at x.x.x.x:8300 [Follower] entering Follower state (Leader: "")
2018/01/12 09:43:27 [INFO] serf: EventMemberJoin: ip-x-x-x-x.eu-west-1 x.x.x.x
2018/01/12 09:43:27 [INFO] serf: EventMemberJoin: ip-x-x-x-x x.x.x.x
2018/01/12 09:43:27 [INFO] consul: Adding LAN server ip-x-x-x-x (Addr: tcp/x.x.x.x:8300) (DC: eu-west-1)
2018/01/12 09:43:27 [INFO] consul: Handled member-join event for server "ip-x-x-x-x.eu-west-1" in area "wan"
2018/01/12 09:43:27 [INFO] agent: Started DNS server 0.0.0.0:8600 (tcp)
2018/01/12 09:43:27 [INFO] agent: Started DNS server 0.0.0.0:8600 (udp)
2018/01/12 09:43:27 [INFO] agent: Started HTTP server on [::]:8500 (tcp)
2018/01/12 09:43:27 [INFO] agent: started state syncer
2018/01/12 09:43:27 [INFO] agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce os scaleway softlayer
2018/01/12 09:43:27 [INFO] agent: Joining LAN cluster...
2018/01/12 09:43:27 [DEBUG] discover: Using provider "aws"
2018/01/12 09:43:27 [INFO] discover-aws: Address type is not supported. Valid values are {private_v4,public_v4,public_v6}. Falling back to 'private_v4'
2018/01/12 09:43:27 [INFO] discover-aws: Region not provided. Looking up region in metadata...
2018/01/12 09:43:27 [INFO] discover-aws: Region is eu-west-1
2018/01/12 09:43:27 [DEBUG] discover-aws: Creating session...
2018/01/12 09:43:27 [INFO] discover-aws: Filter instances with =ecs_cluster
2018/01/12 09:43:28 [DEBUG] discover-aws: Found 0 reservations
2018/01/12 09:43:28 [DEBUG] discover-aws: Found ip addresses: []
2018/01/12 09:43:28 [INFO] agent: Discovered LAN servers:
2018/01/12 09:43:28 [WARN] agent: Join LAN failed: No servers to join, retrying in 30s
2018/01/12 09:43:34 [ERR] agent: failed to sync remote state: No cluster leader
2018/01/12 09:43:37 [WARN] raft: no known peers, aborting election
It is worth mentioning that, according to IAM, the AWS key used in the command has not been used.
I was able to get around this by running consul join x.x.x.x inside the containers.
Documenting some of the conversation from gitter for others that may see this issue...
In the first case, it appears to be an option-passing/parsing problem on AWS: consul/go-discover sees the entire retry-join value as one string and attempts to parse it all as the provider.
Additionally, the discovery string should use tag_key and tag_value, not tag_name and tag_value.
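To illustrate the quoting point above, here is a minimal Python sketch. It does not reproduce go-discover's parser; it only shows what argument the agent would receive in each case, assuming (as the ps -ef output above suggests) that ECS hands each command entry to the process without shell processing. The helper flag_value is hypothetical and exists only for this illustration.

```python
import shlex

# The flag as typed at a shell prompt (as in the docker run test above).
typed = '-retry-join="provider=aws tag_key=foo tag_value=bar"'

# A POSIX shell strips the double quotes before the program sees the argument.
via_shell = shlex.split(typed)[0]

# Assumption: an ECS command entry reaches the process as-is, quotes included.
via_ecs = typed


def flag_value(arg):
    """Return the text after the first '=' of a -flag=value argument."""
    return arg.split("=", 1)[1]


print(repr(flag_value(via_shell)))
print(repr(flag_value(via_ecs)))
```

In the second case the value still starts and ends with a literal double quote, so the discovery string no longer begins with a plain provider=aws pair.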
Closing as per above.
For the AWS issue, there is no need to use double quotes.
awesome, simple enough :)
For posterity:
When used in a command block in an AWS ECS task definition, it should be as follows (although it looks weird):
# task-definition.json
{
  "containerDefinitions": [
    {
      "command": [
        "agent",
        "-server",
        "-retry-join=provider=aws tag_key=foo tag_value=bar"
      ],
      ...
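If the task definition is generated rather than written by hand, the same shape can be produced programmatically. The following is only a sketch: retry_join_arg is a hypothetical helper, and foo/bar are the placeholder tag key and value from the snippet above. The point is that the whole flag is one array element with no embedded double quotes.

```python
import json


def retry_join_arg(provider, **settings):
    """Build a single -retry-join argument with no surrounding quotes."""
    pairs = [f"provider={provider}"]
    pairs += [f"{key}={value}" for key, value in settings.items()]
    return "-retry-join=" + " ".join(pairs)


# Each list element becomes exactly one argument of the container process.
command = ["agent", "-server", retry_join_arg("aws", tag_key="foo", tag_value="bar")]

# Only the fields shown in the thread; a real task definition needs more.
fragment = {"containerDefinitions": [{"command": command}]}
print(json.dumps(fragment, indent=2))
```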
Thanks @ebarault. Your comment saved me hours of debugging.