The « Drain Node for Kubelet upgrade » task runs oadm drain ..., but this command doesn't exist:
# oadm drain
Error: unknown command "drain" for "oc"
Run 'oc --help' for usage.
with this version:
# oc version
oc v3.6.0+c4dd4cf
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO
To fix it, I need to add adm, like this: oadm adm drain ...
diff --git a/playbooks/common/openshift-cluster/upgrades/upgrade_nodes.yml b/playbooks/common/openshift-cluster/upgrades/upgrade_nodes.yml
index c93a5d8..a21fb7f 100644
--- a/playbooks/common/openshift-cluster/upgrades/upgrade_nodes.yml
+++ b/playbooks/common/openshift-cluster/upgrades/upgrade_nodes.yml
@@ -26,7 +26,7 @@
- name: Drain Node for Kubelet upgrade
command: >
- {{ hostvars[groups.oo_first_master.0].openshift.common.admin_binary }} drain {{ openshift.node.nodename | lower }} --config={{ openshift.common.config_base }}/master/admin.kubeconfig --force --delete-local-data --ignore-daemonsets
+ {{ hostvars[groups.oo_first_master.0].openshift.common.client_binary }} adm drain {{ openshift.node.nodename | lower }} --config={{ openshift.common.config_base }}/master/admin.kubeconfig --force --delete-local-data --ignore-daemonsets
delegate_to: "{{ groups.oo_first_master.0 }}"
register: l_upgrade_nodes_drain_result
until: not l_upgrade_nodes_drain_result | failed
Is your oadm binary in a weird state? Can you compare the output of oc version with oadm version?
-bash-4.2# oc version
oc v3.6.0+c4dd4cf
kubernetes v1.6.1+5115d708d7
-bash-4.2# oadm version
oc v3.6.0+c4dd4cf
kubernetes v1.6.1+5115d708d7
-bash-4.2# ls -lha /var/usrlocal/bin/oadm
lrwxrwxrwx. 1 root root 24 May 30 2016 /var/usrlocal/bin/oadm -> /usr/local/bin/openshift
-bash-4.2# ls -lha /usr/local/bin/openshift
-rwxr-xr-x. 1 root root 180M Jul 31 21:59 /usr/local/bin/openshift
-bash-4.2# file /usr/local/bin/openshift
/usr/local/bin/openshift: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=8710a81ed939b75ccede41dff21a2c4b18521898, not stripped
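The manual checks above can be repeated on any host with a small script. This is only a sketch: the function name describe_binaries and the default list of names are made up for illustration, and it reports nothing beyond what ls -l already shows (symlink target or file size).

```python
import os


def describe_binaries(bin_dir, names=("oc", "oadm", "kubectl", "openshift")):
    """Map each name to ('symlink', target), ('file', size_in_bytes) or ('missing', None)."""
    report = {}
    for name in names:
        path = os.path.join(bin_dir, name)
        if os.path.islink(path):
            # Record where the link points without resolving it further.
            report[name] = ("symlink", os.readlink(path))
        elif os.path.isfile(path):
            report[name] = ("file", os.path.getsize(path))
        else:
            report[name] = ("missing", None)
    return report


if __name__ == "__main__":
    for name, (kind, detail) in describe_binaries("/usr/local/bin").items():
        print(name, kind, detail)
```

In this thread the sizes are the telling detail: the oc binary is about 180M and the real openshift binary about 295M.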
I got this too, it causes the origin v1.5.1 -> v3.6.0 upgrade to fail.
After I got the failure (which I think occurs after the oc client is updated), I noticed that oadm and oc link to the same /usr/local/openshift binary - hence the issue. So it's not clear whether this is an issue with the 1.5 install or a feature change that hasn't made it into the OpenShift installer. (Perhaps a better approach would be an alias for oadm?)
(this was on Fedora Atomic Host 26)
I noticed that oadm and oc link to the same /usr/local/openshift binary - hence the issue.
That shouldn't be a problem. openshift inspects $0 and acts accordingly.
@fabianofranz Think this may be a CLI bug where it's not properly picking up $0?
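For context, a multi-call binary picks its behavior from the name it was invoked as. Below is a minimal Python sketch of that idea. The mapping and the function name are hypothetical, and the real openshift binary is written in Go, so this is not its actual implementation.

```python
import os
import sys

# Hypothetical mapping from invoked name to command set (illustration only).
PERSONALITIES = {
    "openshift": "all-in-one",
    "oc": "client",
    "oadm": "admin",
    "kubectl": "kubectl",
}


def personality(argv0):
    """Pick a command set from the invoked name, the way $0 dispatch works."""
    # argv[0] keeps the symlink's name; it is not resolved to the link target.
    return PERSONALITIES.get(os.path.basename(argv0), "unknown")


if __name__ == "__main__":
    print(personality(sys.argv[0]))
```

A binary without this dispatch, such as a standalone oc, behaves as oc whatever name it is invoked by. That would be consistent with oadm version printing "oc v3.6.0" above.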
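If you get stuck in the upgrade at the drain step (it seems to retry 60 times), you should have enough time to overwrite the oadm alias with a little bash script, so you don't have to restart the upgrade script:
#!/bin/bash
# Forward every argument to "oc adm"; quoting "$@" preserves arguments containing spaces
exec oc adm "$@"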
Seems like no one tried an upgrade path before the release :) Otherwise the upgrade would have failed
BTW mine is on Atomic
Dug into this a bit, the problem is that we're copying the oc binary out of the container and placing it in /usr/local/bin/openshift
I was only able to reproduce this on atomic host, it doesn't seem to happen on a rhel host using a containerized install.
Reproduced on F26 atomic. @andrewklau's suggestion got us past the step.
@ashcrow Long term we need to fix why /usr/local/bin/openshift is actually the oc binary, but short term we can simply update all of the drain commands to be oc adm drain instead. I had started to look at https://github.com/sdodson/openshift-ansible/commit/2ea7c0d02d7bc10b3bb6313b13c3bbf37ca4a67c to understand where we ended up copying the wrong binary but I didn't get to the bottom of it.
If it's only happening on Atomic Host then I have a feeling it is likely in _sync_atomic. I'll take a look.
@sdodson I did a quick look and @giuseppe's commit you were looking at isn't part of 3.6 (looking at released-3.6 tag). It looks like the last time the sync binaries code was updated was on 2016-11-28. That being said, there is no _sync_atomic there. Will update as I continue looking.
For the heck of it I double checked the containers to rule out funny business there and they look fine:
lrwxrwxrwx. 1 root root 9 Aug 3 18:47 /usr/bin/oadm -> openshift
-rwxr-xr-x. 1 root root 180M Aug 3 14:25 /usr/bin/oc
-rwxr-xr-x. 1 root root 295M Aug 3 14:25 /usr/bin/openshift
More info ....
Installed 1.5/3.5 on Fedora AH.
# oc version
oc v1.5.0+031cbe4
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://192.168.124.11:8443
openshift v1.5.0+031cbe4
kubernetes v1.5.2+43a9be4
[root@localhost ~]# ls -lah /usr/local/bin/
total 243M
drwxr-xr-x. 2 root root 4.0K Aug 15 12:06 .
drwxr-xr-x. 11 root root 4.0K Aug 14 09:50 ..
lrwxrwxrwx. 1 root root 9 Aug 15 12:06 kubectl -> openshift
lrwxrwxrwx. 1 root root 9 Aug 15 12:06 oadm -> openshift
lrwxrwxrwx. 1 root root 9 Aug 15 12:06 oc -> openshift
-rwxr-xr-x. 1 root root 243M Apr 21 14:34 openshift
ANSIBLE_PYTHON_INTERPRETER="/usr/bin/python3" ansible-playbook -vvv playbooks/byo/openshift-cluster/upgrades/v3_6/upgrade.yml | tee log
"output": [
"a678a02aa788b60f4e5126292152367008f8362e7453156916867d3e03d1af47\n",
"Using temp dir: /tmp/tmp8aurs07y",
"Moved /tmp/tmp8aurs07y/openshift to /usr/local/bin/openshift.",
"Moved /tmp/tmp8aurs07y/oc to /usr/local/bin/oc."
drwxr-xr-x. 11 root root 4.0K Aug 14 09:50 ..
lrwxrwxrwx. 1 root root 9 Aug 15 12:06 kubectl -> openshift
lrwxrwxrwx. 1 root root 9 Aug 15 12:06 oadm -> openshift
lrwxrwxrwx. 1 root root 9 Aug 15 12:06 oc -> openshift
-rwxr-xr-x. 1 root root 295M Aug 1 02:29 openshift
lrwxrwxrwx. 1 root root 9 Aug 15 12:06 kubectl -> openshift
lrwxrwxrwx. 1 root root 9 Aug 15 12:06 oadm -> openshift
lrwxrwxrwx. 1 root root 9 Aug 15 12:06 oc -> openshift
-rwxr-xr-x. 1 root root 180M Aug 1 02:29 openshift
/cc @sdodson
Just running binary sync provides positive results. oc and openshift are both what they should be:
lrwxrwxrwx. 1 root root 9 Aug 15 16:28 kubectl -> openshift
lrwxrwxrwx. 1 root root 9 Aug 15 16:28 oadm -> openshift
-rwxr-xr-x. 1 root root 180M Aug 1 02:29 oc
-rwxr-xr-x. 1 root root 295M Aug 1 02:29 openshift
I believe we have been overthinking this, and the symlink checking isn't working as expected. After a number of tests (both in isolation and in 1.5->3.6 upgrades), the actual Python code is copying properly. The problem seems to be the shell equivalent of this:
mkdir test # location for files to test with
echo "one" > test/one # Make a file in the test location
echo "two" > two # Make a second file which we will use for copying
ln -s `pwd`/test/one test/two # Symlink test/two -> test/one
cp two test/two # Copy our second file to test/two (cp follows the symlink)
cat test/one # Oops, test/one was overwritten!
With the following patch I didn't encounter the error:
diff --git a/roles/openshift_cli/library/openshift_container_binary_sync.py b/roles/openshift_cli/library/openshift_container_binary_sync.py
index 57ac16602..8f83ef520 100644
--- a/roles/openshift_cli/library/openshift_container_binary_sync.py
+++ b/roles/openshift_cli/library/openshift_container_binary_sync.py
@@ -102,6 +102,11 @@ class BinarySyncer(object):
dest_path = os.path.join(self.bin_dir, binary_name)
incoming_checksum = self.module.run_command(['sha256sum', src_path])[1]
if not os.path.exists(dest_path) or self.module.run_command(['sha256sum', dest_path])[1] != incoming_checksum:
+
+ # See: https://github.com/openshift/openshift-ansible/issues/4965
+ if os.path.islink(dest_path):
+ os.unlink(dest_path)
+ self.output.append('Removed old symlink {} before copying binary.'.format(dest_path))
shutil.move(src_path, dest_path)
self.output.append("Moved %s to %s." % (src_path, dest_path))
self.changed = True
"output": [
"f59cad2092cedd7b2394aee56a5dc85ca364c5ccbc4bbd656aa5a92951f9747e\n",
"Using temp dir: /tmp/tmp6if9o7__",
"Moved /tmp/tmp6if9o7__/openshift to /usr/local/bin/openshift.",
"Removed old symlink /usr/local/bin/oc before copying binary.",
"Moved /tmp/tmp6if9o7__/oc to /usr/local/bin/oc."
]
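The same effect, and the effect of the patch, can be reproduced in isolation with Python. This is a sketch under stated assumptions: sync_binary and demo are hypothetical names, and shutil.copy2 stands in for the cp in the shell example. (shutil.move falls back to a copy when a plain rename is not possible, for example across filesystems; whether that is what happened on these hosts is an assumption, not something established in this thread.)

```python
import os
import shutil
import tempfile


def sync_binary(src_path, dest_path, unlink_symlink_first):
    """Copy src_path over dest_path, optionally removing a symlink at dest_path first."""
    if unlink_symlink_first and os.path.islink(dest_path):
        os.unlink(dest_path)
    # copy2 opens dest_path for writing, which follows a symlink
    # and overwrites the link's target instead of the link itself.
    shutil.copy2(src_path, dest_path)


def demo(unlink_symlink_first):
    """Return (contents of 'openshift', whether 'oc' is still a symlink)."""
    with tempfile.TemporaryDirectory() as tmp:
        bin_dir = os.path.join(tmp, "bin")
        os.mkdir(bin_dir)
        openshift = os.path.join(bin_dir, "openshift")
        oc = os.path.join(bin_dir, "oc")
        with open(openshift, "w") as f:
            f.write("openshift binary")
        os.symlink("openshift", oc)  # oc -> openshift, as on the 1.5 host

        new_oc = os.path.join(tmp, "oc")
        with open(new_oc, "w") as f:
            f.write("oc binary")

        sync_binary(new_oc, oc, unlink_symlink_first)
        with open(openshift) as f:
            return f.read(), os.path.islink(oc)


if __name__ == "__main__":
    print(demo(unlink_symlink_first=False))  # openshift gets clobbered
    print(demo(unlink_symlink_first=True))   # openshift survives, oc is a real file
```

Without the unlink, the file named openshift ends up holding the oc contents while oc remains a symlink, which matches the 180M openshift seen after the upgrade.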
Before I PR this I want to do a little more testing with a fresh 1.5->3.6 upgrade.
I did a fresh 1.5 install, upgraded it to 3.6 with my patch and the issue no longer popped up.