On a single-master, multi-node install, the upgrade of an origin deployment from 3.9 to 3.10 is stuck at TASK [etcd : Verify cluster is healthy].
I've been upgrading my OKD test cluster, which started on version 3.7. The upgrades to 3.8 and 3.9 went fine, and I can run pods on it.
But the upgrade to 3.10 fails, no matter what I try.
I followed the official OKD how-to to do this upgrade.
By the way, this how-to seems to contain an error: it asks to "Ensure the openshift_deployment_type parameter in your inventory file is set to openshift-enterprise", when I think it should be origin. Please correct me if I'm wrong.
etcd seems to be dockerized now when it used to be a separate install in a previous version. What did I miss?
* Your ansible version per `ansible --version`
ansible 2.4.3.0
* The output of `git describe`
openshift-ansible-3.10.125-1-2-g7de8c9892
* Run the openshift_node_group.yml playbook
* Run the upgrade.yml playbook: `ansible-playbook -i </path/to/inventory/file> /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.yml`
Expected: OKD 3.10 should be up and running
$ oc get nodes
ose-master.infra.lesitedemoi.com Ready compute,master 1y v1.10
ose-node1.infra.lesitedemoi.com Ready compute,infra 1y v1.10
ose-node3.infra.lesitedemoi.com Ready compute 5h v1.10
Ansible fails to complete
fatal: [ose-master.infra.lesitedemoi.com]: FAILED! => {"attempts": 30, "changed": true, "cmd": ["/usr/local/bin/master-exec", "etcd", "etcd", "etcdctl", "--cert-file", "/etc/etcd/peer.crt", "--key-file", "/etc/etcd/peer.key", "--ca-file", "/etc/etcd/ca.crt", "--endpoints", "https://ose-master.infra.lesitedemoi.com:2379", "cluster-health"], "delta": "0:00:00.050175", "end": "2019-03-12 22:39:51.503322", "msg": "non-zero return code", "rc": 1, "start": "2019-03-12 22:39:51.453147", "stderr": "Error response from daemon: Container 98e4dd754ef1055370a9d64bdf42960d83313fb52d45841b1108455fdb202b1e is not running", "stderr_lines": ["Error response from daemon: Container 98e4dd754ef1055370a9d64bdf42960d83313fb52d45841b1108455fdb202b1e is not running"], "stdout": "", "stdout_lines": []}
* OS: `CentOS Linux release 7.6.1810 (Core)`
* My inventory file :
# Create an OSEv3 group that contains the masters and nodes groups
[OSEv3:children]
masters
nodes
etcd
# Set variables common for all OSEv3 hosts
[OSEv3:vars]
# SSH user, this user should allow ssh based auth without requiring a password
ansible_ssh_user=admin
#NTP
openshift_clock_enabled=true
# Versions
openshift_release=3.10
openshift_image_tag=v3.10.0
# If ansible_ssh_user is not root, ansible_become must be set to true
ansible_become=true
openshift_deployment_type=origin
# uncomment the following to enable htpasswd authentication; defaults to DenyAllPasswordIdentityProvider
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]
# Disable memory checks
openshift_disable_check=memory_availability
# Disable automation of htpasswd file
openshift_master_manage_htpasswd=false
# host group for masters
[masters]
ose-master.infra.lesitedemoi.com
# host group for etcd
[etcd]
ose-master.infra.lesitedemoi.com openshift_public_hostname=ose3-master.cloudapps.lesitedemoi.com
# host group for nodes, includes region info
[nodes]
ose-master.infra.lesitedemoi.com openshift_node_group_name='node-config-master'
ose-node1.infra.lesitedemoi.com openshift_schedulable=True openshift_node_group_name='node-config-infra'
ose-node3.infra.lesitedemoi.com openshift_schedulable=True openshift_node_group_name='node-config-compute'
OK I found the error. But I don't know if I fixed it properly...
I manually checked the etcd pod health using `/usr/local/bin/master-exec etcd etcd etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://localhost:2379 cluster-health`
It answered: `Error response from daemon: Container 24f72c82ebaa6a77c1b127b79f38296a96a84126d73e4d05414ed08b0f737853 is not running`
Checking the logs of that container showed
etcdserver/membership: cluster cannot be downgraded (current version: 3.2.22 is lower than determined cluster version: 3.3)
So it looks like the container is running etcd version 3.2.22 instead of 3.3.
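For anyone scanning their own etcd container logs for the same symptom, here is a minimal sketch that pulls the two version numbers out of that message. The function name `etcd_downgrade_versions` is made up for illustration, and it only matches the exact wording of the log line quoted above.

```shell
# Hypothetical helper: read etcd log lines on stdin and, for each
# "cluster cannot be downgraded" message, print
# "<binary version> <cluster version>" on one line.
etcd_downgrade_versions() {
  sed -n 's/.*cluster cannot be downgraded (current version: \([0-9.]*\) is lower than determined cluster version: \([0-9.]*\)).*/\1 \2/p'
}
```

It could be fed from something like `docker logs <container-id> 2>&1 | etcd_downgrade_versions`, where the container ID is the one reported in the "is not running" error.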
That's weird because the static pod configuration I found in the ansible files asks for a 3.3 image...
I modified the file at /etc/origin/node/pods/etcd.yaml to put the correct version and etcd started correctly right away.
Does anyone have an idea why this particular version was set in the static pod configuration?
Re-running the ansible playbook after manually modifying the etcd.yaml file brings the issue back. It looks like the playbook overwrites this file with the wrong version on every run.
I don't understand what's going on here...
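To see which image the static pod manifest is pinned to before and after each playbook run, something along these lines may help. This is only a sketch: `etcd_pod_image` is a hypothetical name, and it assumes the manifest declares each image on a line of the form `image: <ref>`.

```shell
# Hypothetical helper: print the image reference(s) found in a static pod
# manifest. Defaults to the etcd manifest path mentioned in this thread.
etcd_pod_image() {
  local manifest="${1:-/etc/origin/node/pods/etcd.yaml}"
  # Drop the optional list dash, the "image:" key and surrounding whitespace,
  # and print only the lines where that substitution happened.
  sed -n 's/^[[:space:]]*-\{0,1\}[[:space:]]*image:[[:space:]]*//p' "$manifest"
}
```

Calling `etcd_pod_image` with no argument reads `/etc/origin/node/pods/etcd.yaml`; comparing its output before and after a run shows whether the playbook rewrote the tag.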
Etcd containers are meant to use version 3.2.22.
Considering your error message, I assume that you suffered from a Red Hat mishap: they pushed etcd 3.3.11, both in their docker registries and RPMs. I've seen that issue a couple of times already. You'll first have to downgrade etcd from 3.3 to 3.2 for that upgrade to work (OpenShift 3.9 works just fine with etcd 3.3.11; you'd only notice the issue when upgrading OpenShift to 3.10).
FYI, the downgrading process is documented over there:
I don't have a subscription. I'll consider having one. Modifying the pod configuration to pull and use etcd 3.3 is working, and has no impact so far on Openshift's stability.
Thank you for your help.
I thought OpenShift 3.10 was using etcd 3.3; I was misled by this yaml file:
https://github.com/openshift/openshift-ansible/blob/release-3.10/roles/etcd/files/etcd.yaml
You're right, that file is pointing to some etcd:v3.3 on RH packages as well.
Nevertheless, I can confirm that even in 3.11, OKD still uses v3.2.22.
Now, if you're using the CentOS Extras repository, note that it still lists that 3.3 etcd:
etcd-3.2.22-1.el7.x86_64 : A highly-available key value store for shared configuration
Repo : extras
etcd-3.3.11-2.el7.centos.x86_64 : A highly-available key value store for shared configuration
Repo : extras
Not sure why they did this. AFAIU, any OKD up to 3.9 would pull that 3.3.11 package, leading Ansible to fail upgrading.
In /etc/yum.conf, try adding:
exclude=etcd-3.3*
Then re-deploy your cluster. (Or downgrade etcd, then try upgrading again)
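A minimal sketch of adding that exclude line without duplicating it on repeated runs. The function name `exclude_etcd_33` is hypothetical, and it assumes the last section of the file is `[main]` and that the file ends with a newline, as in a stock `/etc/yum.conf`.

```shell
# Hypothetical helper: make sure a yum config file excludes etcd 3.3 packages.
# Assumes the file's last section is [main] and the file ends with a newline.
exclude_etcd_33() {
  local conf="${1:-/etc/yum.conf}"
  if grep -q '^exclude=' "$conf"; then
    # An exclude line already exists: accept it if it covers etcd-3.3,
    # otherwise leave the file alone so nothing is overwritten.
    if grep -q '^exclude=.*etcd-3\.3' "$conf"; then
      return 0
    fi
    echo "existing exclude= line in $conf; add etcd-3.3* to it by hand" >&2
    return 1
  fi
  printf 'exclude=etcd-3.3*\n' >> "$conf"
}
```

Run as root, for example `exclude_etcd_33 /etc/yum.conf`. Afterwards `yum list etcd --showduplicates` should no longer offer the 3.3.11 build.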
OpenShift doesn't use etcd RPMs on masters in 3.10+; it uses etcd images.
You're right. But we're talking about upgrading from 3.9 (which uses RPMs) to 3.10.
The 3.10 installer uses a 3.2.22 etcd image.
But the 3.9 installs had a bug in their repos and RPMs which resulted in installing etcd 3.3 instead of 3.2.22.
I must downgrade etcd from 3.3 to 3.2.22 using RPMs and then retry the 3.10 upgrade which will replace RPM based etcd with image based etcd.