Openshift-ansible: 3.10 upgrade - verify cluster is healthy

Created on 12 Mar 2019 · 7 Comments · Source: openshift/openshift-ansible

Description

On an origin deployment with a single master and multiple nodes, the upgrade from 3.9 to 3.10 gets stuck at TASK [etcd : Verify cluster is healthy].
I've been upgrading my OKD test cluster, which started on version 3.7. The upgrades to 3.8 and 3.9 went fine, and I can run pods on the cluster.
But the upgrade to 3.10 fails no matter what I try.
I followed the official OKD how-to for this upgrade.
By the way, this how-to seems to contain an error: it asks to "Ensure the openshift_deployment_type parameter in your inventory file is set to openshift-enterprise", when I think it should be origin. Please correct me if I'm wrong.
etcd now seems to run in a container, whereas it was a separate install in previous versions. What did I miss?

Version

* Your ansible version per `ansible --version`

ansible 2.4.3.0

* The output of `git describe`

openshift-ansible-3.10.125-1-2-g7de8c9892

Steps To Reproduce

Upgrade to OKD 3.9
Follow the official how-to to upgrade to 3.10
Run the openshift_node_group.yml playbook
Run the upgrade playbook: `ansible-playbook -i </path/to/inventory/file> /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.yml`

Expected Results

OKD 3.10 should be up and running

$ oc get nodes
ose-master.infra.lesitedemoi.com   Ready     compute,master   1y        v1.10
ose-node1.infra.lesitedemoi.com    Ready     compute,infra    1y        v1.10
ose-node3.infra.lesitedemoi.com    Ready     compute          5h        v1.10

Observed Results

Ansible fails to complete

fatal: [ose-master.infra.lesitedemoi.com]: FAILED! => {"attempts": 30, "changed": true, "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file", "/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "--endpoints", "https://ose-master.infra.lesitedemoi.com:2379", "cluster-health"], "delta": "0:00:00.050175", "end": "2019-03-12 22:39:51.503322", "msg": "non-zero return code", "rc": 1, "start": "2019-03-12 22:39:51.453147", "stderr": "Error response from daemon: Container 98e4dd754ef1055370a9d64bdf42960d83313fb52d45841b1108455fdb202b1e is not running", "stderr_lines": ["Error response from daemon: Container 98e4dd754ef1055370a9d64bdf42960d83313fb52d45841b1108455fdb202b1e is not running"], "stdout": "", "stdout_lines": []}

Additional Information

*  OS: `CentOS Linux release 7.6.1810 (Core)`
* My inventory file:

# Create an OSEv3 group that contains the masters and nodes groups
[OSEv3:children]
masters
nodes
etcd

# Set variables common for all OSEv3 hosts
[OSEv3:vars]
# SSH user, this user should allow ssh based auth without requiring a password
ansible_ssh_user=admin

#NTP
openshift_clock_enabled=true

# Versions
openshift_release=3.10
openshift_image_tag=v3.10.0

# If ansible_ssh_user is not root, ansible_become must be set to true
ansible_become=true

openshift_deployment_type=origin

# uncomment the following to enable htpasswd authentication; defaults to DenyAllPasswordIdentityProvider
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]

# Disable memory checks
openshift_disable_check=memory_availability

# Disable automation of htpasswd file
openshift_master_manage_htpasswd=false

# host group for masters
[masters]
ose-master.infra.lesitedemoi.com

# host group for etcd
[etcd]
ose-master.infra.lesitedemoi.com openshift_public_hostname=ose3-master.cloudapps.lesitedemoi.com

# host group for nodes, includes region info
[nodes]
ose-master.infra.lesitedemoi.com openshift_node_group_name='node-config-master'
ose-node1.infra.lesitedemoi.com openshift_schedulable=True openshift_node_group_name='node-config-infra'
ose-node3.infra.lesitedemoi.com openshift_schedulable=True openshift_node_group_name='node-config-compute'

Source

mrik974

Most helpful comment

You're right. But we're talking about upgrading from 3.9 (which uses RPMs) to 3.10.
The 3.10 installer uses a 3.2.22 etcd image.
But the 3.9 installs had a bug in their repos and RPMs which resulted in installing etcd 3.3 instead of 3.2.22.
So I must downgrade etcd from 3.3 to 3.2.22 using RPMs and then retry the 3.10 upgrade, which will replace the RPM-based etcd with the image-based etcd.

mrik974 on 18 Mar 2019

All 7 comments

OK, I found the error, but I don't know if I fixed it properly.
I manually checked the etcd pod health using `/usr/local/bin/master-exec etcd etcd etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://localhost:2379 cluster-health`
It answered `Error response from daemon: Container 24f72c82ebaa6a77c1b127b79f38296a96a84126d73e4d05414ed08b0f737853 is not running`
Checking the logs of that container showed:
`etcdserver/membership: cluster cannot be downgraded (current version: 3.2.22 is lower than determined cluster version: 3.3)`
So it looks like the container is running etcd version 3.2.22, while the cluster data is already at version 3.3.
That's weird because the static pod configuration I found in the ansible files asks for a 3.3 image...
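
The log message suggests that etcd refuses to start when its own binary is older, at major.minor granularity, than the version already recorded for the cluster. A minimal shell sketch of that comparison follows. `minor_of` and `is_downgrade` are hypothetical helper names, not part of etcd or openshift-ansible, and the logic only approximates what etcd does internally. It relies on `sort -V` from GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch only: approximates the downgrade check implied by the etcd log message.

# Print the "major.minor" part of a version string (3.2.22 -> 3.2).
minor_of() {
    printf '%s\n' "$1" | cut -d. -f1,2
}

# Succeed (exit 0) if binary version $1 is older than cluster version $2,
# comparing major.minor only.
is_downgrade() {
    local bin cluster lowest
    bin=$(minor_of "$1")
    cluster=$(minor_of "$2")
    if [ "$bin" = "$cluster" ]; then
        return 1
    fi
    # sort -V orders version strings numerically per component (GNU coreutils).
    lowest=$(printf '%s\n%s\n' "$bin" "$cluster" | sort -V | head -n 1)
    [ "$lowest" = "$bin" ]
}
```

With the values from the log, `is_downgrade 3.2.22 3.3` succeeds, which matches the container refusing to start, while `is_downgrade 3.3.11 3.3` fails, which is consistent with a 3.3 image starting normally.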

I modified /etc/origin/node/pods/etcd.yaml to use the etcd 3.3 image, and etcd started correctly right away.
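
Before editing the static pod definition, it can help to confirm which image it currently references. The sketch below assumes the manifest uses ordinary `image:` keys, one per line, as Kubernetes pod specs usually do. `pod_images` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: print every container image referenced in a pod manifest.
# Assumes one "image: <ref>" entry per line, optionally written as "- image:".
pod_images() {
    sed -n 's/^[[:space:]-]*image:[[:space:]]*//p' "$1"
}

# Example (path taken from this thread):
#   pod_images /etc/origin/node/pods/etcd.yaml
```

Keys such as `imagePullPolicy:` are not matched, because the pattern requires the colon directly after `image`.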

Does anyone have an idea why this particular version was set up in the static pod configuration?

mrik974 on 13 Mar 2019

Re-running the Ansible playbook after manually modifying etcd.yaml does not work either. It looks like the playbook replaces this file with the wrong version on every run.
I don't understand what's going on here...

mrik974 on 13 Mar 2019

Etcd containers are meant to use version 3.2.22.

Considering your error message, I assume you were hit by a Red Hat mishap: they pushed etcd 3.3.11 to both their Docker registries and their RPM repositories. I've seen that issue a couple of times already. You'll first have to downgrade etcd from 3.3 to 3.2 for the upgrade to work. (OpenShift 3.9 works just fine with etcd 3.3.11; you only notice the issue when upgrading OpenShift to 3.10.)

FYI, the downgrade process is documented here:

https://access.redhat.com/solutions/3885101

faust64 on 18 Mar 2019

I don't have a subscription. I'll consider getting one. Modifying the pod configuration to pull and use etcd 3.3 works, and has had no impact so far on OpenShift's stability.
Thank you for your help.
I thought OpenShift 3.10 was using etcd 3.3; I was misled by this YAML file:
https://github.com/openshift/openshift-ansible/blob/release-3.10/roles/etcd/files/etcd.yaml

mrik974 on 18 Mar 2019

You're right, that file is pointing to some etcd:v3.3 on RH packages as well.
Nevertheless, I can confirm that even in 3.11, OKD still uses v3.2.22.

Now, if you're using the CentOS Extras repository, note that it still lists that etcd 3.3 package:

etcd-3.2.22-1.el7.x86_64 : A highly-available key value store for shared configuration
Repo        : extras
etcd-3.3.11-2.el7.centos.x86_64 : A highly-available key value store for shared configuration
Repo        : extras

Not sure why they did this. AFAIU, any OKD up to 3.9 would pull that 3.3.11 package, which later causes the Ansible upgrade to fail.
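
To pull just the etcd versions out of a package listing like the one shown in this comment, a small filter is enough. This sketch assumes lines of the form `etcd-<version>-<release>.<arch> : ...` as in that listing. `etcd_versions` is a hypothetical helper name, and the command that produced the listing is not stated in the thread.

```shell
#!/usr/bin/env bash
# Sketch only: read a package listing on stdin and print each etcd version found.
# Lines that do not start with "etcd-<digits>" (such as "Repo : extras") are skipped.
etcd_versions() {
    sed -n 's/^etcd-\([0-9][0-9.]*\)-.*/\1/p'
}

# Example: pipe a saved listing into the filter.
#   etcd_versions < listing.txt
```

If the output contains anything newer than 3.2.x, the host may pick up the newer package unless the package manager is told otherwise.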

In /etc/yum.conf, try adding: