Machine: Machine state not preserved across reboots

Created on 13 Jul 2015 · 26 comments · Source: docker/machine

Created a couple of machines as:

NAME            ACTIVE   DRIVER       STATE     URL                         SWARM
lab                      virtualbox   Running   tcp://192.168.99.101:2376   
summit2015               virtualbox   Running   tcp://192.168.99.100:2376   

Restarted the host machine, and now the status is shown as:

NAME            ACTIVE   DRIVER       STATE     URL   SWARM
lab                      virtualbox   Stopped         
summit2015               virtualbox   Stopped         

State of the machine is not preserved across host reboot.

Also related to https://github.com/docker/swarm/issues/668.

kind/enhancement

Most helpful comment

I guess the host in the cloud will "never" go through power recycle?

All 26 comments

+1, same issue. Is there a workaround?

Would you expect them to be running @arun-gupta ? Generally I'd think it's pretty implicit in power-cycling your host machine that all of your VMs will be turned off as well.

If a host power cycles, then it cannot be relied upon to be part of the cluster in that case. That seems like a broken model. Or does that need to be configured somewhere?

I'm confused. If you power-cycle a laptop which has VMs, there is no way for them to continue running, so this behavior seems normal to me. If you power-cycle a laptop with references to cloud VMs, they will keep running just fine. I would not really expect users to have their laptop as a node in a production cluster.

I guess the host in the cloud will "never" go through power recycle?

It is likely that the host computer will have to be rebooted at some point, but if all that it contains are references to cloud VMs, those will still keep running. Machine does not do any magic related to stopping and starting of the hosts it manages on a host reboot.

I mean the host that really hosts the VM itself, not the host with just references to those VMs.

Hi,

I am not sure if this is the correct ticket, but it is referenced in https://github.com/docker/swarm/issues/668 .

As from the original ticket:

  • I have docker-machine with virtualbox: 1 swarm master and 2 agents
  • when I run docker-machine stop and then start, I can see them running in "docker-machine ls", but they appear to no longer be part of the cluster

It was mentioned that this is some TLS issue, but I don't know how to work around it.
Is there a fix for this?

Same problem here. I'm trying to preserve the swarm across machine restarts; I regenerated all the TLS certs, but no luck. I have to destroy and create new nodes every time I shut down the swarm.
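
For reference, the usual way to regenerate the certs is docker-machine regenerate-certs; a minimal sketch (the machine names are just the ones used elsewhere in this thread) would be:

$ docker-machine regenerate-certs swarm-master
$ docker-machine regenerate-certs swarm-node-01
$ docker-machine regenerate-certs swarm-node-02

As far as I can tell this only refreshes the client-to-daemon TLS material; it does not recreate the swarm containers, which still carry the old IP in their command lines.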

I think a docker-machine provision command (https://github.com/docker/machine/pull/2121) might help to re-bootstrap swarms after restarting the hosts they are located on.
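
Assuming that provision subcommand lands and behaves like the other per-machine commands, re-bootstrapping after a reboot might look roughly like this:

$ docker-machine start swarm-master
$ docker-machine provision swarm-master
$ docker-machine start swarm-node-01
$ docker-machine provision swarm-node-01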

I think this feature: https://github.com/docker/docker/pull/17364, propagated through Swarm, may be a better way to handle machine re-IPing than making someone manually invoke docker-machine provision to re-bootstrap after a machine restart.

Hm, I see. So the issue is around the TLS certificates becoming invalid because the IP addresses change?

@nathanleclaire Another issue is that docker-machine (because there was no better way until recently) hard-codes the dotted IP address of the host in the swarm container setups. I can't speak for all setups, but using the boot2docker.iso (1.7.1 - 1.9, and likely before) with --swarm on vSphere with DHCP IP assignment yields a broken swarm after a power cycle. I have found that one does not need to destroy and create new nodes as @zuBux reports here; however, one must manually remove the swarm containers (since their registered command lines have the hard-coded IPs within) and then manually run new swarm containers with the new IP assignment stated within. For large clusters, it is a real PITA.
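
A rough sketch of that manual fix, assuming Consul discovery and that the agent container is named swarm-agent (that name is an assumption; check docker ps -a on your setup):

$ docker-machine ssh swarm-node-01
swarm-node-01# docker ps -a
swarm-node-01# docker rm -f swarm-agent
swarm-node-01# docker run -d --name swarm-agent swarm join --addr=<new node ip>:2376 consul://<consul ip>:8500

The master needs the same treatment for its swarm manage container, carrying over the TLS certificate volume and port mappings from the original container.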

+1

I know there is a lot of talk about what the impact of this would be on production servers, but this is a huge pain. I am trying to maintain the same docker-compose for production and local OS X dev, and everything works, but if your laptop hits low power, or you reboot for any reason, having to destroy and rebuild the swarm and swarm nodes and then re-deploy makes the solution almost unusable.

+1

+1

A nicer way to re-attach swarm nodes would be useful as well. For example, because of #2267 someone on a Windows host never gets the out-of-the-box experience and has to run everything by hand.

Just to confirm with everyone running into this issue: are you all using the Hub discovery service (--swarm-discovery token://...)? I haven't seen such issues with Consul, but I have with the Hub discovery service. I don't know if the --cluster-advertise engine option works with Hub.
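
For comparison, creating a node with Consul discovery and --cluster-advertise set would look roughly like this (the addresses and machine name are placeholders):

$ docker-machine create -d virtualbox \
    --swarm --swarm-discovery consul://<consul ip>:8500 \
    --engine-opt cluster-store=consul://<consul ip>:8500 \
    --engine-opt cluster-advertise=eth1:2376 \
    swarm-node-01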

I had my issue with both the Hub discovery service and Consul.

Same here, using a Consul cluster as the discovery service.

I am using Consul too, and I didn't even have to power-cycle to get into a bad swarm state. Now I have to recreate the swarm cluster like the other commenters. Unlike the other comments, my docker-machine ls does still show the swarm cluster (is this then a different issue?). In my case, I created my swarm cluster following the instructions at https://htmlpreview.github.io/?https://github.com/javaee-samples/docker-java/blob/e07f668fc3dcd4cd62180eb8bb9b764c73b6abd1/readme.html while away from the office, slept my MacBook Pro, and the next day tried to use the swarm cluster in the office. Now the cluster is unusable, I can't restart the swarm-master, and I'll have to recreate it.

$ docker-machine ls
NAME             ACTIVE   DRIVER       STATE     URL                         SWARM
consul-machine   -        virtualbox   Running   tcp://192.168.99.102:2376   
default          -        virtualbox   Running   tcp://192.168.99.100:2376   
swarm-master     *        virtualbox   Running   tcp://192.168.99.103:2376   swarm-master (master)
swarm-node-01    -        virtualbox   Running   tcp://192.168.99.104:2376   swarm-master
swarm-node-02    -        virtualbox   Running   tcp://192.168.99.105:2376   swarm-master
$ docker-machine restart swarm-master
Too many retries waiting for SSH to be available.  Last error: Maximum number of retries (60) exceeded

I had to kill all my docker machines; all except the master restarted successfully, and even though the master shows as running, I am unable to get its env.

$ docker-machine kill swarm-master
$ docker-machine start swarm-master
Too many retries waiting for SSH to be available.  Last error: Maximum number of retries (60) exceeded
$ docker-machine env swarm-master
Error running connection boilerplate: Error getting driver URL: Something went wrong running an SSH command!
command : ip addr show dev eth1
err     : exit status 255
output  : 

Same behavior here. I'm forced to recreate the whole cluster each time I restart the master node. docker-machine --version 0.5.1

Same problem here. I'm using Consul and Digital Ocean on Debian 8.

I thought it was maybe because of a manual upgrade of the machines to 1.10RC2; see my method and symptoms here:
http://serverfault.com/questions/753041/how-to-manually-upgrade-docker-version-on-a-docker-machine/753161#753161

...but it sounds like it's maybe this general issue with restarting machines.

docker-machine ls shows the nodes as running and as members of the swarm, but docker info shows only the master node in the swarm.

I note also that docker-machine inspect swarm-1 shows it has swarm properties:

        "SwarmOptions": {
            "IsSwarm": true,
            "Address": "",
            "Discovery": "consul://<consul box>:8500",
            "Master": false,
            "Host": "tcp://0.0.0.0:3376",
            "Image": "swarm:latest",
            "Strategy": "spread",
            "Heartbeat": 0,
            "Overcommit": 0,
            "ArbitraryFlags": [],
            "Env": null
        },

Does anyone have a workaround for how to reattach nodes to the swarm?

To rejoin the swarm:

$ docker-machine ssh swarm-1
swarm-1# docker run -d swarm join --addr=<node ip>:2376 consul://<consul ip>:8500

docker info now shows both nodes are in the swarm

Is it maybe just a case of adding a restart policy to this command?
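
If so, it would presumably just mean adding --restart=always to the join command, something like:

$ docker-machine ssh swarm-1
swarm-1# docker run -d --restart=always swarm join --addr=<node ip>:2376 consul://<consul ip>:8500

Though the --addr value is still baked into the container's command line, so this would only help when the node keeps the same IP across reboots.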

Referring to the official tutorial here: https://docs.docker.com/engine/swarm/swarm-tutorial/ I'm not sure how this setup would react to restarting individual hosts being part of the cluster.

I'm not really interested in clustering/elasticity options, rather in a simple-to-maintain and robust solution for a static multi-host system.

The individual containers can be started automatically using Docker's restart policy. However, if the orchestration, and potentially the networking, is bound to the swarm, there is still a problem even after they restart.
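
For illustration, a container-level restart policy is just (the image and name here are placeholders):

$ docker run -d --restart=unless-stopped --name web nginx

That brings the container itself back up after a reboot, but it does nothing for the swarm membership or any networking that hangs off it.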

Can anybody point me to the right direction? Thanks.

+1
