Cannot run "oc cluster up --metrics" successfully. It always fails.
$ ./oc version
oc v1.4.0-alpha.1+f189ede
kubernetes v1.4.0+776c994
features: Basic-Auth GSSAPI Kerberos SPNEGO
$ sudo ./oc cluster up --metrics
-- Checking OpenShift client ... OK
-- Checking Docker client ... OK
-- Checking Docker version ... OK
-- Checking for existing OpenShift container ... OK
-- Checking for openshift/origin:v1.4.0-alpha.1 image ... OK
-- Checking Docker daemon configuration ... OK
-- Checking for available ports ...
WARNING: Binding DNS on port 8053 instead of 53, which may not be resolvable from all clients.
-- Checking type of volume mount ...
Using nsenter mounter for OpenShift volumes
-- Creating host directories ... OK
-- Finding server IP ...
Using 192.168.1.2 as the server IP
-- Starting OpenShift container ...
Creating initial OpenShift configuration
Starting OpenShift using container 'origin'
Waiting for API server to start listening
OpenShift server started
-- Adding default OAuthClient redirect URIs ... OK
-- Installing registry ... OK
-- Installing router ... OK
-- Installing metrics ... FAIL
Error: cannot create metrics deployer pod
Details:
Last 10 lines of "origin" container log:
I1117 00:38:44.353250 12675 trace.go:61] Trace "Update
/api/v1/namespaces/openshift-infra/serviceaccounts/deployment-controller" (started 2016-11-17
00:38:43.74571921 +0000 UTC):
[22.617µs] [22.617µs] About to convert to expected version
[93.692µs] [71.075µs] Conversion done
[99.156µs] [5.464µs] About to store object in database
[607.415218ms] [607.316062ms] Object stored in database
[607.425586ms] [10.368µs] Self-link added
[607.484338ms] [58.752µs] END
I1117 00:38:44.353911 12675 trace.go:61] Trace "Delete
/api/v1/namespaces/openshift-infra/secrets/namespace-controller-token-fwytu" (started 2016-11-17
00:38:42.796986729 +0000 UTC):
[30.765µs] [30.765µs] About to delete object from database
[1.556883322s] [1.556852557s] END
Caused By:
Error: No API token found for service account "metrics-deployer", retry after the token is
automatically created and added to the service account
Installing origin metrics directly (rather than via cluster up) does give a successful install of OpenShift with metrics.
I originally did not have my user in the docker group, which is why I prefix the command with "sudo". I did also try adding my user to the docker group, but the same problem occurs, so I don't think that's related. Here's what I did:
$ sudo groupadd docker && sudo gpasswd -a ${USER} docker && sudo systemctl restart docker && newgrp docker
Also, I built "oc" from the current master branch, and I get the same problem.
@pweil- I don't believe this has anything to do with Origin Metrics directly. Setting up the service account is done as part of the cluster up command.
If @jmazzitelli installs origin metrics directly then it works for him.
And if I follow the steps outlined in what he is doing, then it works properly for me.
To be clear about the reproduction steps: I hit this failure doing nothing special other than downloading the oc binary, untarring it, and running it via sudo:
$ wget https://github.com/openshift/origin/releases/download/v1.4.0-alpha.1/openshift-origin-client-tools-v1.4.0-alpha.1.f189ede-linux-64bit.tar.gz
$ tar xvfz openshift-origin-client-tools-v1.4.0-alpha.1.f189ede-linux-64bit.tar.gz
$ cd openshift-origin-client-tools-v1.4.0-alpha.1+f189ede-linux-64bit/
$ sudo ./oc cluster up --metrics
For the record, I am on Fedora 23, with "uname -a" as follows:
$ uname -a
Linux mazztower 4.6.7-200.fc23.x86_64 #1 SMP Wed Aug 17 14:24:53 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
If we are synchronously creating a service account and then immediately creating a pod that uses it, that code needs to retry creating the pod when creation is forbidden because the service account's token hasn't been auto-generated yet.
@liggitt thanks, would changing it to a Job do the trick?
@soltysh could confirm, but I would expect so
I'm quickly going to see if this works. Not sure if you'd want a PR with this: the change makes the code keep retrying whenever the error it gets is this "retry after the token is ready" message. I suspect this will fix it (assuming this truly is a case where retrying helps; I'll see in a minute).
-	deployerPod := metricsDeployerPod(hostName, imagePrefix, imageVersion)
-	if _, err = kubeClient.Pods(infraNamespace).Create(deployerPod); err != nil {
-		return errors.NewError("cannot create metrics deployer pod").WithCause(err).WithDetails(h.OriginLog())
-	}
+	for keepTrying := true; keepTrying; {
+		deployerPod := metricsDeployerPod(hostName, imagePrefix, imageVersion)
+		if _, err = kubeClient.Pods(infraNamespace).Create(deployerPod); err != nil {
+			if !strings.Contains(err.Error(), "retry after the token") {
+				return errors.NewError("cannot create metrics deployer pod").WithCause(err).WithDetails(h.OriginLog())
+			}
+		} else {
+			keepTrying = false
+		}
+	}
That fix works. My "cluster up" command finished successfully and I do see this in the output:
-- Installing metrics ... OK
@jmazzitelli thanks for confirming that's the problem. I'd rather simply instantiate a job so that the job controller can do the retry for us.
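A minimal sketch of what a Job-based deployer could look like, so the job controller handles the retry instead of hand-rolled loop code. The names and image tag here are illustrative, not the actual contents of the eventual fix:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: metrics-deployer
  namespace: openshift-infra
spec:
  template:
    spec:
      serviceAccountName: metrics-deployer
      # OnFailure lets the job controller restart the pod until the
      # service-account token exists and the deployer succeeds.
      restartPolicy: OnFailure
      containers:
      - name: deployer
        image: openshift/origin-metrics-deployer:v1.4.0-alpha.1  # illustrative tag
```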
@csrwng sounds good to me. thanks for looking into this.
Submitted pull request #12174 for review
Closed via #12174