After more testing of PR #770, I found this:
I hit another changed-IP problem after restarting my Swarm EC2 cluster today.
The master still uses the old IPs of the swarm machines:
time="2015-03-18T18:23:54Z" level=error msg="Get https://54.69.29.90:2376/v1.15/info: dial tcp 54.69.29.90:2376: i/o timeout"
time="2015-03-18T18:23:54Z" level=error msg="Get https://54.69.230.35:2376/v1.15/info: dial tcp 54.69.230.35:2376: i/o timeout"
time="2015-03-18T18:23:54Z" level=error msg="Get https://54.69.255.39:2376/v1.15/info: dial tcp 54.69.255.39:2376: i/o timeout"
time="2015-03-18T18:23:54Z" level=error msg="Get https://52.10.167.59:2376/v1.15/info: dial tcp 52.10.167.59:2376: i/o timeout"
I analyzed the problem:
The swarm agent still joins with the old IP 52.10.167.59.
$ docker-machine ls
NAME               ACTIVE   DRIVER       STATE     URL                        SWARM
amazonec2-03                amazonec2    Stopped
dev                         virtualbox   Stopped
ec2-swarm-01                amazonec2    Running   tcp://54.149.27.239:2376   ec2-swarm-master
ec2-swarm-02                amazonec2    Running   tcp://52.10.108.31:2376    ec2-swarm-master
ec2-swarm-03       *        amazonec2    Running   tcp://54.148.5.178:2376    ec2-swarm-master
ec2-swarm-master            amazonec2    Running   tcp://52.11.98.189:2376    ec2-swarm-master (master)
$ $(docker-machine env ec2-swarm-master)
$ docker ps --no-trunc
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
13d27667155b3b1962b99b8d817c7a9865b47fe5b0d5d9c0af08735b26163efa swarm:latest "/swarm join --addr 52.10.167.59:2376 token://5a57a53a13470b1e680c6904ce5b34d1" 35 hours ago Up 11 minutes 2375/tcp swarm-agent
810f7ce04b6439c191470a2116197088ee2a3d2e5ed1cc7f4742aacef46317f9 swarm:latest "/swarm manage --tlsverify --tlscacert=/etc/docker/ca.pem --tlscert=/etc/docker/server.pem --tlskey=/etc/docker/server-key.pem -H tcp://0.0.0.0:3376 token://5a57a53a13470b1e680c6904ce5b34d1" 35 hours ago Up 11 minutes 2375/tcp, 0.0.0.0:3376->3376/tcp swarm-agent-master
$ docker-machine ip ec2-swarm-master
52.11.98.189
After the IP of a swarm machine changes, the implementation must reconfigure the swarm agent: remove the old container and start a new one that advertises the new address.
The only quick fix at the moment is to recreate the agent with this tiny script:
create-swarm-agent.sh
#!/bin/bash
# Recover the discovery token from the running agent's command line
# (the fourth element of its Cmd is the token:// URL).
TOKEN=$(docker inspect -f "{{ index .Config.Cmd 3 }}" swarm-agent)
# Ask the EC2 metadata service for the instance's current public IP.
IP=$(curl http://169.254.169.254/latest/meta-data/public-ipv4)
# Replace the stale agent with one that advertises the new IP.
docker stop swarm-agent
docker rm swarm-agent
docker run -d --name swarm-agent --restart=always swarm \
  join --addr ${IP}:2376 \
  ${TOKEN}
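Since the script queries the EC2 metadata service and talks to the local Docker daemon, it has to run on the node itself. Assuming docker-machine ssh shells out to the system ssh binary (so stdin is passed through), something along these lines should work from the workstation, with ec2-swarm-01 as an example node:
$ docker-machine ssh ec2-swarm-01 "bash -s" < create-swarm-agent.sh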
I think, longer term, we will have to support some kind of "sync" to the config store. I don't know whether the Docker Hub token discovery service would support modifying the cluster IPs, but I'm sure the KV backends would.
cc @aluzzardi @vieux @abronan How would you envision the workflow for this case (changing IPs in the swarm)?
@nathanleclaire Entries in the K/V store are deleted after TTL expiration (nodes are removed from the discovery). So if the IPs change, the store will reflect the state of the cluster correctly after a stop/restart (on EC2, for example). Still, you might see old entries listed for a while until their TTL expires (if you have 3 machines, expect 6 to be listed, though the old entries will be marked as unhealthy and cannot be used by Swarm).
As a workaround, if Machine is aware that an instance is restarting, it could delete the entry in the K/V store directly, so that machines with stale IPs are not listed after a restart.
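For example, here is a minimal sketch of that cleanup, assuming a Consul discovery backend reachable at $CONSUL_HOST and a cluster started with a discovery URL like consul://$CONSUL_HOST:8500/swarm. The exact key path under which Swarm registers nodes depends on the Swarm version, so list the prefix first; the node key below is hypothetical:
$ # list everything Swarm registered under the discovery prefix
$ curl -s "http://$CONSUL_HOST:8500/v1/kv/swarm?recurse"
$ # delete the entry that still advertises the old IP (hypothetical key path, taken from the listing above)
$ curl -s -X DELETE "http://$CONSUL_HOST:8500/v1/kv/swarm/docker/swarm/nodes/52.10.167.59:2376"
This only applies to the K/V backends; as far as I know the token:// discovery on Docker Hub doesn't expose such an endpoint.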
Here is my workaround after changing the IP address of a docker swarm node:
% docker-machine env docker-node
% docker-machine regenerate-certs docker-node
(I sometimes need to run this multiple times when an error occurs.)
% eval $(docker-machine env docker-node)
% export TOKEN=$(docker inspect -f "{{ index .Config.Cmd 3}}" swarm-agent)
% docker rm -f swarm-agent
% docker run -d --name=swarm-agent --restart=always swarm:latest join --advertise "${DOCKER_HOST##tcp://}" "${TOKEN}"
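To check that the agent re-registered with the new address, you can ask the discovery service which nodes it currently knows about (a sketch, reusing the $TOKEN value recovered above):
% docker run --rm swarm:latest list "${TOKEN}"
The new IP should show up once the agent's next heartbeat has been sent; stale entries linger until their TTL expires, as noted above.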