Flynn: Proper way to perform maintenance

Created on 13 Jul 2016 · 15 Comments · Source: flynn/flynn

I am testing using Flynn to host a decent sized collection of Rails and PHP applications. I have been performing disaster recovery procedures with a 3 host cluster. So far it has not been encouraging.

I am starting my testing with what would happen if you needed to install updates on the hosts and perform rolling restarts on each of them. I have attempted to stop the flynn-host service and then restart the host; it mostly seems to rejoin properly. However, I don't see the scheduler moving services that are doubled up on the other hosts back to the recovered host, for example 2 postgres servers running on 1 host. After an appreciable amount of time I went to the host that had the 2 postgres running on it and stopped the flynn-host service. This failed to stop all the jobs that were running on the host. I then rebooted this host; it failed to join the cluster properly and my cluster was inoperable. I ran flynn-host fix --min-hosts 3 and received an unknown error about the controller instance. Here is the flynn-host collect-debug-info from that state: https://gist.github.com/anonymous/ad98e43f4a4a2ce57590926a28b65834

At this point I became frustrated and shut all 3 hosts down, thinking I would restore the pre-init snapshot and start fresh. I of course couldn't leave well enough alone so I restarted all 3 hosts at almost the same time and thought I would take one more look. Lo and behold they converged and the services were available eventually. Here is the post cluster reboot flynn-host collect-debug-info: https://gist.github.com/anonymous/8b550248b675b87f5039095953e9d18a

What improvements or documentation should be made so that a cluster doesn't break when performing simple maintenance tasks on the supposedly redundant hosts?

kind/question

All 15 comments

I think your points go in the same direction as my comment here #3104.

While Flynn is not officially labeled as production ready, overall stability and recoverability are very important for taking it into a production environment. If things go downhill, it's great to have your platform (Flynn) supporting you and trying its best to recover automatically.

Flynn does include some ability to repair itself if things go pear-shaped or a full power-off happens, as you discovered after rebooting the cluster. Unfortunately this work is not yet complete and isn't as aggressive as it could be; most importantly, it doesn't take action when less than the initial set of bootstrap peers is available. This isn't ideal, but it was merged with this restriction to prevent it from making a cluster that is having problems even worse. That said, the code has proven pretty reliable and we may look at lifting that restriction soon when we can come up with some better heuristics for when the cluster recovery/resurrection code should take action.

As for tasks not being rebalanced after a node has been replaced, this is a known issue and something we intend to address.

I'm not sure, but I don't think flynn-host actually terminates tasks when it's killed/stopped. I am not sure what the desired behavior would be here, as we do use this capability to support updates and the like. @lmars could probably elaborate on that front.

Thanks for testing this stuff, we are serious about making cluster recovery/resurrection robust.

@bbaptist

However, I don't see the scheduler moving services that are doubled up on the other hosts back to the recovered host, for example 2 postgres servers running on 1 host

Currently the scheduler will not kill a job unless it is explicitly told to do so. This is purely for stability: if something goes wrong when moving jobs around, then there is an operator around to actually debug the issue (i.e. someone there explicitly trying to move jobs around), rather than the scheduler trying to move things around at, say, 3am and borking a cluster.
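A rough sketch of what explicitly telling the scheduler to move a job looks like today; flynn-host ps appears elsewhere in this thread, but the per-job stop subcommand is an assumption on my part, so check flynn-host help for the exact name on your version:

    # find the doubled-up postgres jobs and which host they are running on
    flynn-host ps | grep postgres
    # assumed subcommand: stop one copy explicitly so the scheduler restarts it,
    # ideally landing on the recovered (empty) host
    sudo flynn-host stop <job-id>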

We do plan to change this in the future to rebalance (#2101), but we need better testing and understanding of the side effects, so manual tinkering and reports like this are really helpful :smile:.

After an appreciable amount of time I went to the host that had the 2 postgres running on it and stopped the flynn-host service. This failed to stop all the jobs that were running on the host.

flynn-host _should_ stop all jobs on shutdown (see here), what jobs remained after stopping flynn-host? And how did you stop it (SIGTERM / SIGKILL)?

@josephglanville

most importantly, it doesn't take action when less than the initial set of bootstrap peers is available. This isn't ideal, but it was merged with this restriction to prevent it from making a cluster that is having problems even worse.

So if I was testing a 6-node cluster, what actions would be taken when doing maintenance on each of those nodes? Meaning, is there a minimum number of hosts required to be able to perform maintenance?

As for tasks not being rebalanced after a node has been replaced this is a known issue and something we intend to address

This seems like an ideal situation for flynn-host fix to address, or perhaps a flynn-host rebalance.

@lmars

We do plan to change this in the future to rebalance (#2101), but we need better testing and understanding of the side effects, so manual tinkering and reports like this are really helpful

Flynn currently handles a migration to unbalanced states; what testing do you need for the migration back to a balanced state?

what jobs remained after stopping flynn-host?

I am pretty sure it was controller and discoverd but I am not positive. I can test again today and see if I can reproduce that state.

And how did you stop it (SIGTERM / SIGKILL)?

I did a service flynn-host stop. My thinking was that the Upstart job would have a graceful way of shutting down tasks. This would also be the method that a graceful shutdown or reboot would use.
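For reference, here is how I would check for leftover job processes after the stop; grepping for these process names is my own assumption about what shows up in ps on a Flynn host:

    sudo service flynn-host stop
    # check for any Flynn-related processes that survived the stop
    ps aux | grep -E 'flynn|discoverd' | grep -v grep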

@bbaptist

Flynn currently handles a migration to unbalanced states; what testing do you need for the migration back to a balanced state?

We will want some rigorous integration tests (e.g. drop a node, wait for rebalance, add a node, wait for rebalance, check jobs), but we currently don't have a flexible enough CI environment to do this easily (we have improvement plans #2970).
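A rough bash sketch of the kind of test described above; the host names, the fixed sleeps, and the way flynn-host ps output is grepped are illustrative assumptions, not the real harness:

    ssh host1 sudo service flynn-host stop    # drop a node
    sleep 300                                 # wait for jobs to move (or poll flynn-host ps)
    flynn-host ps | grep -c host1             # expect 0: nothing left on the dropped node
    ssh host1 sudo service flynn-host start   # add the node back
    sleep 300                                 # wait for the (future) rebalance
    flynn-host ps | grep -c host1             # expect > 0 once rebalancing is implemented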

The scheduler also doesn't currently check for job health, it just restarts any jobs it knows should be running. With a rebalance, it would need to ensure the new job is functioning correctly on the new node before killing the existing job so it doesn't make things worse. This will need some restructuring (that logic currently lives in the deploy worker).

what jobs remained after stopping flynn-host?

I am pretty sure it was controller and discoverd but I am not positive. I can test again today and see if I can reproduce that state.

OK what version of Flynn was this on? We made some fixes recently to stop the controller blocking on shutdown (#3014) so it is likely you did not have that fix and the controller lasted longer than flynn-host was given to wait for it to stop (and discoverd is killed last, so that would explain why that didn't die either).

@lmars

The scheduler also doesn't currently check for job health, it just restarts any jobs it knows should be running. With a rebalance, it would need to ensure the new job is functioning correctly on the new node before killing the existing job so it doesn't make things worse.

This area was my next pain point with Flynn. Currently we see a cluster-wide red/green for status. Ideally we would be able to drill into what the hosts see for each Flynn job.

OK what version of Flynn was this on?

The latest released version: v20160624.1. How would I go about testing the newest version?

How would I go about testing the newest version?

You can install and subscribe to the nightly version of Flynn by passing --channel nightly to the install script:

sudo bash /path/to/install-flynn --channel nightly

Testing v20160712.0

I did a service flynn-host stop on one node in the 3-node cluster. All the jobs were successfully stopped. The jobs recovered on the remaining hosts in the cluster. I then did a service flynn-host start, simulating a host restart. It rejoined the cluster and the jobs started on it successfully.
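For context, this is roughly the per-node cycle I am testing; the update step is just an example of host maintenance and the checks in between are my own, not a documented procedure:

    sudo service flynn-host stop                      # jobs should stop and move to the other hosts
    flynn-host ps                                     # from another host: confirm jobs relocated
    sudo apt-get update && sudo apt-get upgrade -y    # example host maintenance
    sudo reboot                                       # or just: sudo service flynn-host start
    flynn-host ps                                     # confirm the host rejoined and runs jobs again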

I then went to the host that was not the postgresql leader. I did a service flynn-host stop on that host. All jobs except discoverd stopped successfully. Here is the collect-debug-info gist in that state: https://gist.github.com/6490b2fcd2330b831eec4755c9e19ffa. The jobs all came up successfully on the 2 other hosts. Cluster is working properly. I then rebooted the host with a stuck job. Immediately after that host went down a flynn-host ps on the host that had been up the whole time showed no jobs running on that host. After a while the jobs all started showing up for that host again. Here is the collect-debug-info from that host during this time: https://gist.github.com/7ba79f4140b60e32073c4294b26d450f

The rebooted machine would not come back up. I reinstalled Flynn and joined it to the cluster. It joined and was working. I then moved on to shutting down the remaining Flynn server. I did a service flynn-host stop on that host. There were two jobs still running: discoverd and flynn-controller. I had to kill the two processes before I could get the collect-debug-info from this state: https://gist.github.com/0ed75c23d6d4327650e3a0a03d309049. I also notice that when flannel stops it doesn't remove the flannel.1 network or shut down the flynnbr0 interface. Since I killed the Flynn processes instead of rebooting, I just did a service flynn-host start. From there it rejoined.
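If the leftover flannel.1 and flynnbr0 interfaces need clearing before a restart, something like the following should work; these are standard iproute2 commands, but whether removing them by hand is safe here is an assumption on my part, not a documented cleanup step:

    ip -d link show flannel.1          # leftover vxlan device
    ip -d link show flynnbr0           # leftover bridge
    sudo ip link set flynnbr0 down
    sudo ip link delete flynnbr0
    sudo ip link delete flannel.1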

Ran sudo flynn-host fix --min-hosts 3
INFO[07-14|17:28:38] found expected hosts n=3
INFO[07-14|17:28:38] ensuring discoverd is running on all hosts
INFO[07-14|17:28:38] checking flannel
INFO[07-14|17:28:38] flannel looks good
INFO[07-14|17:28:38] waiting for discoverd to be available
INFO[07-14|17:28:38] checking for running controller API
INFO[07-14|17:28:38] found running controller API instances n=2
INFO[07-14|17:28:38] found controller instance, checking critical formations
INFO[07-14|17:28:38] checking status of sirenia databases
INFO[07-14|17:28:38] checking for database state db=postgres
INFO[07-14|17:28:38] checking sirenia cluster status fn=FixSirenia service=postgres
INFO[07-14|17:28:38] found running leader fn=FixSirenia service=postgres
INFO[07-14|17:28:38] found running instances fn=FixSirenia service=postgres count=3
INFO[07-14|17:28:38] getting sirenia status fn=FixSirenia service=postgres
INFO[07-14|17:28:38] cluster claims to be read-write fn=FixSirenia service=postgres
INFO[07-14|17:28:38] checking for database state db=mariadb
INFO[07-14|17:28:38] checking sirenia cluster status fn=FixSirenia service=mariadb
INFO[07-14|17:28:38] found running leader fn=FixSirenia service=mariadb
INFO[07-14|17:28:38] found running instances fn=FixSirenia service=mariadb count=3
INFO[07-14|17:28:38] getting sirenia status fn=FixSirenia service=mariadb
INFO[07-14|17:28:38] cluster claims to be read-write fn=FixSirenia service=mariadb
INFO[07-14|17:28:38] checking for database state db=mongodb
INFO[07-14|17:28:38] checking sirenia cluster status fn=FixSirenia service=mongodb
INFO[07-14|17:28:38] no running leader fn=FixSirenia service=mongodb
INFO[07-14|17:28:38] found running instances fn=FixSirenia service=mongodb count=3
INFO[07-14|17:28:38] getting sirenia status fn=FixSirenia service=mongodb
INFO[07-14|17:28:38] getting service metadata fn=FixSirenia service=mongodb
INFO[07-14|17:28:38] getting primary job info fn=FixSirenia service=mongodb job.id=ieflynn16071302.iexposure.com-2554aa98-0afa-4c37-8ead-5eddc8b50bd4
EROR[07-14|17:28:38] unable to get primary job info fn=FixSirenia service=mongodb
INFO[07-14|17:28:38] getting sync job info fn=FixSirenia service=mongodb job.id=ieflynn16071301.iexposure.com-f4ebf9e9-5f5d-41ef-ab82-c2ce47bdfd3b
INFO[07-14|17:28:38] terminating unassigned sirenia instances fn=FixSirenia service=mongodb
INFO[07-14|17:28:39] starting primary job fn=FixSirenia service=mongodb job.id=ieflynn16071301.iexposure.com-33feef81-212a-4db3-941f-cdd84b814cb4
INFO[07-14|17:28:41] starting sync job fn=FixSirenia service=mongodb job.id=ieflynn16071300.iexposure.com-00aef22e-cef9-468f-99ea-a28385cb996f
INFO[07-14|17:28:41] waiting for instance to start fn=FixSirenia service=mongodb job.id=ieflynn16071301.iexposure.com-33feef81-212a-4db3-941f-cdd84b814cb4
INFO[07-14|17:28:41] waiting for cluster to come up read-write fn=FixSirenia service=mongodb
17:33:43.220895 host.go:153: timeout waiting for expected status

Same result with subsequent flynn-host fix --min-hosts 3 runs. I was not using mongodb.

I attempted to reboot the last host that I killed the jobs on. It too locked up on boot and I am unable to get it to complete a boot. It halts during init with no errors. I am using a ZFS pool for flynn-default as recommended. Perhaps something to do with Flynn not unmounting the volumes?

I am not seeing a way to have a consistent cluster after host system maintenance with Flynn in its current state.

Ok, it looks like there are several things I may have to look at here.

The first is why discoverd didn't stop. Unfortunately collect-debug-info didn't pick up the log for that job, so I am going to have to try to replicate that manually unless you still have that host around and can get the log from /var/log/flynn/0453517f-f905-43b6-921e-2a29856a596c.log.

After you re-installed the host you didn't promote the replacement node to a member of the consensus cluster, so the fault tolerance of the cluster was impaired. Unfortunately the reason you probably didn't do this is on us: we have only just merged that functionality and haven't shipped docs for it. I am doing that right now, however.
I am curious what you mean by wouldn't come back up: was the host itself unable to be revived, or were you unable to get flynn-host to start?
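For anyone following along, promoting a replacement host back into the consensus cluster looks roughly like this; the promote functionality is referenced above and in the docs being written, but the exact invocation (and using flynn-host list to find the host) is an assumption, so check flynn-host help for the real syntax:

    flynn-host list                              # assumed: find the new host's ID/address
    sudo flynn-host promote <new-host-address>   # assumed syntax; see the production docs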

The reason for the small outage of flynn-host ps while you removed that node is that 10.10.142.92 was the Raft leader. Dropping it causes a Raft election to take place with 10.10.142.90 becoming the new leader. This blip is expected but shouldn't last very long.

I'm confused about the last set of logs though. Did you run collect-debug-info on a different host to the one you rebooted/re-installed/rebooted? The IP reported from ifconfig implies it was actually run on the last stable member of the cluster 10.10.142.91.

Did flynn-host fix error out on not being able to repair MongoDB? It should have continued even if it wasn't able to make progress with MongoDB, as it's a non-critical component for restoration. The fixer actually reported the cluster was fine (which seems to be what the logs indicate, too).

As for the reboot/hang issues: we have not observed this before. Flynn doesn't do much, if anything, fancy at that layer other than utilising ZFS.
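If a failed pool import/mount at boot is the culprit, standard ZFS commands can confirm it; the pool name flynn-default comes from this thread, but the diagnosis itself is only a guess:

    zpool status flynn-default                             # is the pool imported and healthy?
    zpool import                                           # lists pools that are exported / not yet imported
    sudo zpool import flynn-default                        # import by hand if it wasn't auto-imported
    zfs list -o name,mounted,mountpoint -r flynn-default   # check volume mount state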

So we need to:

  • Investigate issues with discoverd/controller not exiting on stopping flynn-host
  • Document promote/demote and general consensus cluster concerns
  • Investigate ZFS pool import at boot time, check if we use ZFS automount attributes

Thanks for your detailed report, I will start looking at this stuff shortly.

I am curious what you mean by wouldn't come back up: was the host itself unable to be revived, or were you unable to get flynn-host to start?

All three of the hosts were stuck at this point after I tried to reboot them all. I even left them overnight in the hope they would error out on something.

(screenshot: flynn-boot-fail)

Unfortunately collect-debug-info didn't pick up the log for that job, so I am going to have to try to replicate that manually unless you still have that host around and can get the log from /var/log/flynn/0453517f-f905-43b6-921e-2a29856a596c.log.

Here is that log file.
0453517f-f905-43b6-921e-2a29856a596c.log.txt

I'm confused about the last set of logs though. Did you run collect-debug-info on a different host to the one you rebooted/re-installed/rebooted? The IP reported from ifconfig implies it was actually run on the last stable member of the cluster 10.10.142.91.

I realized after I posted this that I should have been calling the hosts by name. That collect-debug-info was run after shutting down flynn-host on the last remaining server to be worked on, ie-flynn-16071301. When I did a service flynn-host stop on that host, the discoverd and flynn-controller jobs had to be killed; then I ran collect-debug-info on that host.

The discoverd issue has been found and fixed in #3467. Are there any issues remaining here?

I think the other issues I identified have been fixed.
Documentation on promote/demote can be found here: https://flynn.io/docs/production#replacing-hosts
