Origin: Rebuilding a Master

Created on 29 Jan 2016 · 16 comments · Source: openshift/origin

What's the best way to set OpenShift up to easily rebuild the master while keeping data and settings intact? We had a few masters blow up on us for reasons we're not sure of, and we're running into lots of problems getting our cluster back online.

Our current strategy was to run a separate etcd instance and image it, but when we destroyed the master and attempted to rebuild it using the ansible-install script, everything seems frozen. The web console lists the services and deployments, but they're not building or deploying, just hanging, and there's nothing in the logs for any of them either.

The only suspicious thing in our journalctl is:

`Jan 29 18:00:23 ip-172-31-41-10.ec2.internal origin-master[858]: E0129 18:00:23.961762 858 horizontal.go:69] Couldn't reconcile horizontal pod autoscalers: error listing nodes: the server has asked for the client to provide credentials (get horizontalPodAutoscalers)`

We're running on AWS for now but will eventually have our own metal, using this AMI: https://aws.amazon.com/marketplace/pp/B00O7WM7QW

(I'm also on IRC as <hbk> <bneese> - we connected the IRC channel to our slack. :))

kind/question lifecycle/rotten priority/P2

All 16 comments

Are you making configuration changes to the master config files (or certificates) after running the install the first time? openshift-ansible is supposed to be re-entrant, but if you change the config it will roll those changes back.

I _think_ the error you're describing sounds like the nodes are no longer available. Can you confirm that, after doing this, the nodes (`oc get nodes`) are still listed as schedulable and have a recent heartbeat?
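A quick sketch of that check (`<node-name>` is a placeholder for one of your nodes):

```shell
# Verify the nodes are still registered, Ready, and schedulable.
oc get nodes

# Inspect a node's conditions for a recent heartbeat timestamp.
oc describe node <node-name> | grep -A 5 Conditions
```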

I accidentally destroyed the entire cluster late last week and am working on rebuilding it, but I'm pretty sure it has something to do with Kubernetes using old internal IPs.

I'm using EC2, but I have the same elastic external IPs but different internal ones. For what it's worth, however, I'm using the BYO configuration and pointing it to the AWS instances (we're acquiring our own metal and I want to be in a place where we can deploy on it without much hassle).

OK, I rebuilt. I was banging my head against this all last night.

Even if I move etcd onto a separate volume and mount it directly on the etcd instance (as /var/lib/etcd), after running `oc delete node` on the original master, I still get:

`Feb 02 15:38:36 ip-172-31-58-77.ec2.internal origin-master[846]: E0202 15:38:36.328385 846 horizontal.go:69] Couldn't reconcile horizontal pod autoscalers: error listing nodes: the server has asked for the client to provide credentials (get horizontalPodAutoscalers)`

in journalctl.

This is with new master, new etcd - the only thing I carried over was the data volume, so there shouldn't be any issues with certificates?

According to oc get node my master/node is online and Ready. ip-172-31-58-77.ec2.internal is definitely the master.

Wondering if maybe this has something to do with the builder and deployer service accounts.

On a new master there'd be different certs, and the old DeploymentConfigs will keep the old ones around?

Any way to regenerate those?
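One possibility, sketched here as an assumption rather than a confirmed answer from this thread: the builder/deployer credentials live in token secrets, and deleting a token secret makes OpenShift mint a replacement signed with the current key. The project and secret names below are placeholders:

```shell
# List the token secrets for the service accounts in a project.
oc get secrets -n myproject | grep token

# Deleting a token secret causes a new one to be generated,
# signed with the master's current service-account key.
oc delete secret builder-token-abc12 -n myproject
```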

The old DeploymentConfigs also still reference the IP of the old OPENSHIFT_MASTER.

Interestingly, `oc delete`-ing the original DCs just hangs.

I thought this was a new etcd... there shouldn't be any old DCs, right?

It's a new instance, but we're pulling in the old /var/lib/etcd by mounting a volume created from a snapshot. Is there a better way to back up the etcd data store?

Basically we don't want to have to reconfigure everything if we need to launch a new cluster for whatever reason.

https://docs.openshift.org/latest/install_config/upgrades.html#preparing-for-a-manual-upgrade talks about backing up etcd, but it doesn't say anything about how to actually _restore_ that backup. So admittedly we're just kind of shooting in the dark here.
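For reference, a sketch of the etcd v2 backup/restore cycle the docs leave implicit, assuming a single-member etcd with its data in /var/lib/etcd (paths are assumptions):

```shell
# Backup: snapshot the data store into a dated directory.
etcdctl backup --data-dir /var/lib/etcd \
               --backup-dir /var/backups/etcd-$(date +%Y%m%d)

# Restore on the new host: copy the backup into place, then start
# etcd once with --force-new-cluster so it discards the old member
# identity and peer URLs before rejoining normal operation.
systemctl stop etcd
rm -rf /var/lib/etcd
cp -a /var/backups/etcd-20160202 /var/lib/etcd
etcd --data-dir /var/lib/etcd --force-new-cluster
```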

I think your IP has changed - which means the certificates the master uses
are wrong. If you're using a written out config, you'll need to generate
new certs for the new IP.

On Tue, Feb 2, 2016 at 11:32 AM, Brett Neese [email protected]
wrote:

https://docs.openshift.org/latest/install_config/upgrades.html#preparing-for-a-manual-upgrade
talks about backing up etcd, but it doesn't say anything about how to
actually _restore_ that backup. So admittedly we're just kind of shooting
in the dark here.


How does one do that?
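A sketch of one way under the origin 1.x CLI (the hostname comes from the log line earlier in the thread; the IP and the flag choice are assumptions, and `--overwrite=false` leaves any existing certs untouched):

```shell
# Regenerate master certificates to cover the new hostname/IP.
oadm ca create-master-certs \
  --hostnames=ip-172-31-58-77.ec2.internal,172.31.58.77 \
  --cert-dir=/etc/origin/master \
  --overwrite=false

# Restart the master so it picks up the new certificates.
systemctl restart origin-master
```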

I have the same issue after cloning a master and running the openshift-ansible installer to re-create all of the etcd, master, and node configuration files. It has been a lot of work to get here (it's very important for me to restore without losing the contents of my cluster).

Etcd is happy, and master-api seems to start up OK, as does the node, but none of the pods will load, everything is reported as 'scaling', and I see this error in the master-controllers log:

`May 9 13:57:26 vrdevosmaster001 atomic-openshift-master-controllers: E0509 13:57:26.276630 68738 horizontal.go:69] Couldn't reconcile horizontal pod autoscalers: error listing nodes: the server has asked for the client to provide credentials (get horizontalPodAutoscalers)`

Can anyone tell me where the master-controllers load their authentication details from? I don't have issues when using the system:admin credentials:

```
oc login -u system:admin -n default --config=/etc/origin/master/admin.kubeconfig
oc get sa

NAME         SECRETS   AGE
builder      3         149d
default      8         149d
deployer     2         149d
registry     2         118d
router       3         149d
svc-bamboo   2         115d
```

regards

Dave

Hi,
I found the issue causing the horizontalPodAutoscalers error and the other errors in the controller. As part of my re-install I removed ALL of the configuration files from /etc/origin/master and /etc/origin/node and let the ansible installer re-create them.

The existing service account tokens in etcd were signed with the previous keys in these files:

/etc/origin/master/serviceaccounts.public.key
/etc/origin/master/serviceaccounts.private.key

You need to restore these files from your previous installation; otherwise none of the service accounts you have defined can authenticate automatically (hence the request to provide credentials).
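A sketch of that restore, assuming the old files were saved under /backup (the backup path is an assumption):

```shell
# Put the original signing keys back so existing tokens validate.
cp /backup/serviceaccounts.public.key  /etc/origin/master/
cp /backup/serviceaccounts.private.key /etc/origin/master/

# Restart the master services so they reload the keys.
systemctl restart atomic-openshift-master-api
systemctl restart atomic-openshift-master-controllers
```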

regards

Dave

I wonder if anyone who's commented on this issue knows whether running `etcdctl backup` as well as `oc export all` is needed to create a full backup of Origin? I created https://github.com/openshift/openshift-docs/issues/3186 to see if anyone on the docs project knows more.

You should not need to export anything if you perform an etcd backup.
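Still, a belt-and-braces sketch combining both approaches mentioned above, in case either one alone proves insufficient (the backup paths are assumptions):

```shell
# etcd-level backup of the full data store.
etcdctl backup --data-dir /var/lib/etcd --backup-dir /var/backups/etcd

# API-level export of each project's objects as a fallback.
for p in $(oc get projects -o name | cut -d/ -f2); do
  oc export all -n "$p" > "/var/backups/export-$p.yaml"
done
```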

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close
