Openshift-ansible: Error: unknown command "drain" for "oc" in « Drain Node for Kubelet upgrade » action

Created on 1 Aug 2017 · 17 Comments · Source: openshift/openshift-ansible

The « Drain Node for Kubelet upgrade » action executes oadm drain ..., but this command does not exist:

# oadm  drain
Error: unknown command "drain" for "oc"
Run 'oc --help' for usage.

with this version:

# oc version
oc v3.6.0+c4dd4cf
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

To fix it, I need to add adm, like this: oadm adm drain ... (the patch below uses the client binary instead, i.e. oc adm drain ...):

diff --git a/playbooks/common/openshift-cluster/upgrades/upgrade_nodes.yml b/playbooks/common/openshift-cluster/upgrades/upgrade_nodes.yml
index c93a5d8..a21fb7f 100644
--- a/playbooks/common/openshift-cluster/upgrades/upgrade_nodes.yml
+++ b/playbooks/common/openshift-cluster/upgrades/upgrade_nodes.yml
@@ -26,7 +26,7 @@

   - name: Drain Node for Kubelet upgrade
     command: >
-      {{ hostvars[groups.oo_first_master.0].openshift.common.admin_binary }} drain {{ openshift.node.nodename | lower }} --config={{ openshift.common.config_base }}/master/admin.kubeconfig --force --delete-local-data --ignore-daemonsets
+      {{ hostvars[groups.oo_first_master.0].openshift.common.client_binary }} adm drain {{ openshift.node.nodename | lower }} --config={{ openshift.common.config_base }}/master/admin.kubeconfig --force --delete-local-data --ignore-daemonsets
     delegate_to: "{{ groups.oo_first_master.0 }}"
     register: l_upgrade_nodes_drain_result
     until: not l_upgrade_nodes_drain_result | failed

affects_3.6 kind/bug priority/P2

harobed

👍1

All 17 comments

Is your oadm binary in a weird state? Can you compare oc version with oadm version?

sdodson on 1 Aug 2017

-bash-4.2# oc version
oc v3.6.0+c4dd4cf
kubernetes v1.6.1+5115d708d7

-bash-4.2# oadm version
oc v3.6.0+c4dd4cf
kubernetes v1.6.1+5115d708d7

harobed on 2 Aug 2017

-bash-4.2# ls -lha /var/usrlocal/bin/oadm
lrwxrwxrwx. 1 root root 24 May 30  2016 /var/usrlocal/bin/oadm -> /usr/local/bin/openshift
-bash-4.2# ls -lha /usr/local/bin/openshift
-rwxr-xr-x. 1 root root 180M Jul 31 21:59 /usr/local/bin/openshift
-bash-4.2# file /usr/local/bin/openshift
/usr/local/bin/openshift: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=8710a81ed939b75ccede41dff21a2c4b18521898, not stripped

harobed on 2 Aug 2017

I got this too, it causes the origin v1.5.1 -> v3.6.0 upgrade to fail.

After I got the failure (which I think occurs after the oc client is updated), I noticed that oadm and oc link to the same /usr/local/openshift binary - hence the issue. So it's not clear if this is an issue with the 1.5 install, or a change in feature that's not made it through to openshift installer. (I think a better approach would be an alias for oadm, perhaps?)

(this was on Fedora Atomic Host 26)

edseymour on 4 Aug 2017

> I noticed that oadm and oc link to the same /usr/local/openshift binary - hence the issue.

That shouldn't be a problem. openshift inspects $0 and acts accordingly.

@fabianofranz Think this may be a CLI bug where it's not properly picking up $0?
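
To illustrate what "inspects $0" means here, the following is a hypothetical Python sketch of a multi-call binary next to a client-only one. It is not the actual openshift source (which is written in Go); the function names and the "admin:"/"client:" result strings are invented for illustration. Only the error text is taken from the report above.

```python
import os


def full_binary(argv):
    """Multi-call behaviour: choose a personality from the invoked name ($0)."""
    name = os.path.basename(argv[0])
    args = " ".join(argv[1:])
    if name == "oadm":
        # Invoked through the oadm symlink: treat the arguments as admin commands.
        return "admin: " + args
    if name == "oc":
        return "client: " + args
    return "server: " + args


def client_only_binary(argv):
    """Client-only behaviour: the invoked name is ignored, everything is 'oc'."""
    args = argv[1:]
    if args and args[0] == "adm":
        # Admin commands are only reachable through the explicit 'adm' subcommand.
        return "admin: " + " ".join(args[1:])
    if args and args[0] == "drain":
        return 'Error: unknown command "drain" for "oc"'
    return "client: " + " ".join(args)
```

Under these assumptions, calling the symlink oadm with "drain node1" succeeds when it points at the multi-call binary, but fails with the reported error when the file behind the symlink is really the client binary. That would also explain why adding adm gets past the step.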

sdodson on 4 Aug 2017

If you get stuck in the upgrade where it seems to retry 60 times, you should have enough time to overwrite the oadm alias with a little bash script so you don't have to restart the upgrade script:

#!/bin/bash
# Forward every argument, quoted so arguments containing spaces survive intact
oc adm "$@"

andrewklau on 11 Aug 2017

👍2

Seems like no one tried an upgrade path before the release :) Otherwise the upgrade would have failed.
BTW, mine is on Atomic.

dmytroleonenko on 11 Aug 2017

Dug into this a bit. The problem is that we're copying the oc binary out of the container and placing it in /usr/local/bin/openshift.

I was only able to reproduce this on Atomic Host; it doesn't seem to happen on a RHEL host using a containerized install.

sdodson on 11 Aug 2017

Reproduced on F26 atomic. @andrewklau's suggestion got us past the step.

djdevin on 14 Aug 2017

@ashcrow Long term we need to fix why /usr/local/bin/openshift is actually the oc binary, but short term we can simply update all of the drain commands to be oc adm drain instead. I had started to look at https://github.com/sdodson/openshift-ansible/commit/2ea7c0d02d7bc10b3bb6313b13c3bbf37ca4a67c to understand where we ended up copying the wrong binary but I didn't get to the bottom of it.

sdodson on 14 Aug 2017

If it's only happening on Atomic Host then I have a feeling it is likely in _sync_atomic. I'll take a look.

ashcrow on 15 Aug 2017

@sdodson I did a quick look and @giuseppe's commit you were looking at isn't part of 3.6 (looking at released-3.6 tag). It looks like the last time the sync binaries code was updated was on 2016-11-28. That being said, there is no _sync_atomic there. Will update as I continue looking.

ashcrow on 15 Aug 2017

For the heck of it I double checked the containers to rule out funny business there and they look fine:

lrwxrwxrwx. 1 root root    9 Aug  3 18:47 /usr/bin/oadm -> openshift
-rwxr-xr-x. 1 root root 180M Aug  3 14:25 /usr/bin/oc
-rwxr-xr-x. 1 root root 295M Aug  3 14:25 /usr/bin/openshift

ashcrow on 15 Aug 2017

More info ....

Base

Installed 1.5/3.5 on Fedora AH.

# oc version
oc v1.5.0+031cbe4
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://192.168.124.11:8443
openshift v1.5.0+031cbe4
kubernetes v1.5.2+43a9be4
[root@localhost ~]# ls -lah /usr/local/bin/
total 243M
drwxr-xr-x.  2 root root 4.0K Aug 15 12:06 .
drwxr-xr-x. 11 root root 4.0K Aug 14 09:50 ..
lrwxrwxrwx.  1 root root    9 Aug 15 12:06 kubectl -> openshift
lrwxrwxrwx.  1 root root    9 Aug 15 12:06 oadm -> openshift
lrwxrwxrwx.  1 root root    9 Aug 15 12:06 oc -> openshift
-rwxr-xr-x.  1 root root 243M Apr 21 14:34 openshift

Upgrade

Command

ANSIBLE_PYTHON_INTERPRETER="/usr/bin/python3" ansible-playbook -vvv playbooks/byo/openshift-cluster/upgrades/v3_6/upgrade.yml | tee log

Sync Output

    "output": [
        "a678a02aa788b60f4e5126292152367008f8362e7453156916867d3e03d1af47\n",
        "Using temp dir: /tmp/tmp8aurs07y",
        "Moved /tmp/tmp8aurs07y/openshift to /usr/local/bin/openshift.",
        "Moved /tmp/tmp8aurs07y/oc to /usr/local/bin/oc."

After the sync

drwxr-xr-x. 11 root root 4.0K Aug 14 09:50 ..
lrwxrwxrwx.  1 root root    9 Aug 15 12:06 kubectl -> openshift
lrwxrwxrwx.  1 root root    9 Aug 15 12:06 oadm -> openshift
lrwxrwxrwx.  1 root root    9 Aug 15 12:06 oc -> openshift
-rwxr-xr-x.  1 root root 295M Aug  1 02:29 openshift

Before drain attempt (notice openshift size)

lrwxrwxrwx.  1 root root    9 Aug 15 12:06 kubectl -> openshift
lrwxrwxrwx.  1 root root    9 Aug 15 12:06 oadm -> openshift
lrwxrwxrwx.  1 root root    9 Aug 15 12:06 oc -> openshift
-rwxr-xr-x.  1 root root 180M Aug  1 02:29 openshift
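
The listings above were compared by eye. As a rough aid, the check can be sketched in Python: for each expected name, report whether it is a symlink and how large the file it resolves to is. The function name and the default list of names are assumptions made for this example, not part of openshift-ansible.

```python
import os


def describe_bin_dir(bin_dir, names=("kubectl", "oadm", "oc", "openshift")):
    """Return {name: {"symlink_to": target or None, "size": bytes or None}}."""
    report = {}
    for name in names:
        path = os.path.join(bin_dir, name)
        if not os.path.lexists(path):
            # Neither a file nor a (possibly dangling) symlink: skip it.
            continue
        report[name] = {
            # readlink returns the raw link target, e.g. "openshift".
            "symlink_to": os.readlink(path) if os.path.islink(path) else None,
            # getsize follows symlinks, so this is the size of the real file.
            "size": os.path.getsize(path) if os.path.exists(path) else None,
        }
    return report
```

Running this on /usr/local/bin before and after the sync step would show the size behind openshift changing while oc is still only a symlink to it.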

ashcrow on 15 Aug 2017

/cc @sdodson

Just running binary sync provides positive results. oc and openshift are both what they should be:

lrwxrwxrwx.  1 root root    9 Aug 15 16:28 kubectl -> openshift
lrwxrwxrwx.  1 root root    9 Aug 15 16:28 oadm -> openshift
-rwxr-xr-x.  1 root root 180M Aug  1 02:29 oc
-rwxr-xr-x.  1 root root 295M Aug  1 02:29 openshift

I believe we have been overthinking this and the symlink checking isn't working as expected. After doing a number of tests (both in isolation and 1.5->3.6 upgrades), the actual python code is copying properly. The problem seems to be the shell equivalent of this:

mkdir test                         # Location for the files to test with
echo "one" > test/one              # Make a file in the test location
echo "two" > two                   # Make a second file, which we will use for copying
ln -s "$(pwd)/test/one" test/two   # Make test/two a symlink pointing at test/one
cp two test/two                    # Try to replace test/two with our second file
cat test/one                       # Oops, test/one was overwritten: cp wrote through the symlink
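
The same effect can be reproduced from Python. This is a minimal sketch, not the sync module's code: the file names are stand-ins, and shutil.copy is used to mirror the cp call above. (shutil.move, which the module calls, falls back to a copy when source and destination are on different filesystems.)

```python
import os
import shutil
import tempfile


def copy_over(src, dest, unlink_first):
    """Copy src to dest, optionally removing dest first when it is a symlink."""
    if unlink_first and os.path.islink(dest):
        os.unlink(dest)
    # shutil.copy opens dest for writing, which follows an existing symlink.
    shutil.copy(src, dest)


def demo(unlink_first):
    """Return (contents of 'one', whether 'two' is still a symlink)."""
    with tempfile.TemporaryDirectory() as root:
        one = os.path.join(root, "one")            # stands in for openshift
        two = os.path.join(root, "two")            # stands in for the oc symlink
        incoming = os.path.join(root, "incoming")  # stands in for the new oc
        with open(one, "w") as handle:
            handle.write("one\n")
        with open(incoming, "w") as handle:
            handle.write("two\n")
        os.symlink(one, two)
        copy_over(incoming, two, unlink_first)
        with open(one) as handle:
            return handle.read(), os.path.islink(two)
```

Here demo(False) shows the file behind the symlink being clobbered while the symlink stays in place, and demo(True) shows that removing the symlink first leaves the original file untouched.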

ashcrow on 15 Aug 2017

With the following patch I didn't encounter the error:

diff --git a/roles/openshift_cli/library/openshift_container_binary_sync.py b/roles/openshift_cli/library/openshift_container_binary_sync.py
index 57ac16602..8f83ef520 100644
--- a/roles/openshift_cli/library/openshift_container_binary_sync.py
+++ b/roles/openshift_cli/library/openshift_container_binary_sync.py
@@ -102,6 +102,11 @@ class BinarySyncer(object):
         dest_path = os.path.join(self.bin_dir, binary_name)
         incoming_checksum = self.module.run_command(['sha256sum', src_path])[1]
         if not os.path.exists(dest_path) or self.module.run_command(['sha256sum', dest_path])[1] != incoming_checksum:
+
+            # See: https://github.com/openshift/openshift-ansible/issues/4965
+            if os.path.islink(dest_path):
+                os.unlink(dest_path)
+                self.output.append('Removed old symlink {} before copying binary.'.format(dest_path))
             shutil.move(src_path, dest_path)
             self.output.append("Moved %s to %s." % (src_path, dest_path))
             self.changed = True

    "output": [
        "f59cad2092cedd7b2394aee56a5dc85ca364c5ccbc4bbd656aa5a92951f9747e\n",
        "Using temp dir: /tmp/tmp6if9o7__",
        "Moved /tmp/tmp6if9o7__/openshift to /usr/local/bin/openshift.",
        "Removed old symlink /usr/local/bin/oc before copying binary.",
        "Moved /tmp/tmp6if9o7__/oc to /usr/local/bin/oc."
    ]
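
For readers who want to try the patched logic outside Ansible, here is a standalone sketch of the same idea. It makes two assumptions that differ from the module: hashlib replaces the sha256sum subprocess call, and sync_binary is a free function invented for this example rather than a method of BinarySyncer.

```python
import hashlib
import os
import shutil


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sync_binary(src_path, dest_path):
    """Move src_path over dest_path when contents differ; return log messages."""
    output = []
    # os.path.exists follows symlinks, so a dangling symlink counts as missing.
    if os.path.exists(dest_path) and sha256_of(dest_path) == sha256_of(src_path):
        return output
    if os.path.islink(dest_path):
        # Drop the old symlink so the move cannot write through to its target.
        os.unlink(dest_path)
        output.append("Removed old symlink {} before copying binary.".format(dest_path))
    shutil.move(src_path, dest_path)
    output.append("Moved {} to {}.".format(src_path, dest_path))
    return output
```

When dest_path is a symlink to another binary, the call should return both messages, leave the other binary unchanged, and leave dest_path as a regular file.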

Before I PR this I want to do a little more testing with a fresh 1.5->3.6 upgrade.

ashcrow on 15 Aug 2017

I did a fresh 1.5 install, upgraded it to 3.6 with my patch and the issue no longer popped up.

ashcrow on 16 Aug 2017
