Operator-sdk: Testing operator sdk locally isn't reading from another namespace

Created on 11 Mar 2020 · 22 Comments · Source: operator-framework/operator-sdk

Type of question

I am having an issue running the operator-sdk local.

Question

I have an operator called cloud-ingress-operator. It is for OpenShift Dedicated (OSD) and installs additional ELBs and related resources.

When doing so, it reads (gets) 'machines' from the openshift-machine-api namespace as well as other cluster-scoped CRs, such as 'infrastructures.config.openshift.io'.

When I run this locally with the following command, with KUBECONFIG pointed to an admin kubeconfig file:

operator-sdk run --local --namespace=openshift-cloud-ingress-operator

It can read the CR 'infrastructures.config.openshift.io', but it can't read the 'machines' from the openshift-machine-api namespace. My list comes back blank.

However, when compiled, and run in a pod in the cluster, it does return the machines.

Environment

  • operator-sdk version:
$ operator-sdk version                                                                
operator-sdk version: "v0.15.1", commit: "e35ec7b722ba095e6438f63fafb9e7326870b486", go version: "go1.13.8 linux/amd64"

  • Kubernetes version information:
Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.0.0-master+$Format:%h$", GitCommit:"$Format:%H$", GitTreeState:"", BuildDate:"1970-01-01T00:00:00Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"b3bfb5a", GitTreeState:"clean", BuildDate:"2020-03-02T08:50:52Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}

  • Kubernetes cluster kind:

OCP 4.3

Additional context

The project in question is here:

https://github.com/openshift/cloud-ingress-operator

The reconciler loop in question is here (this is where the function gets called in the loop). The calls before this one succeed:

https://github.com/openshift/cloud-ingress-operator/blob/master/pkg/controller/apischeme/apischeme_controller.go#L171

That calls this function where it fails:

https://github.com/openshift/cloud-ingress-operator/blob/7183c10ec66eba79aa8e72ad0f0281e5ffa8325c/pkg/controller/utils/clusterinfo.go#L107

To test, this cr can be created:

https://github.com/openshift/cloud-ingress-operator/blob/7183c10ec66eba79aa8e72ad0f0281e5ffa8325c/deploy/crds/cloudingress.managed.openshift.io_v1alpha1_apischeme_cr.yaml

One thing to note: this WILL create AWS infrastructure. One way to prevent that is to delete all the permissions from this file when loading the credentials request:

https://github.com/openshift/cloud-ingress-operator/blob/7183c10ec66eba79aa8e72ad0f0281e5ffa8325c/deploy/cloud-ingress-operator-credentials.yml

That way the error can be seen without worrying about creating everything.

This is a lot. Happy to talk through slack!

kind/bug

All 22 comments

Interesting. I wanted to make sure that 'oc get machines' lists something when you run it from the command line. Does it, for all the target namespaces as well?

Hi @mwoodson,

See here that your project needs to be executed with WATCH_NAMESPACE set to "" (cluster-scoped).

Also, note that you are applying the role permissions in a different namespace from the one you are using to run the operator. See your role_binding: it will create the resources for cloud-ingress-operator and not in --namespace=openshift-cloud-ingress-operator, which you used to run it locally.

So, could you try the following steps:

1) Create and install all resources in the namespace cloud-ingress-operator, except the operator.yaml file.
2) Before executing the command operator-sdk run --local --namespace=cloud-ingress-operator, run in the terminal:
export WATCH_NAMESPACE=""
export OPERATOR_NAME="cloud-ingress-operator"
3) Then, in the same terminal (if you close it or use another one, the env vars will be lost), run the command to test it locally: operator-sdk run --local --namespace=cloud-ingress-operator

Note that deploy/role.yaml, deploy/role_binding.yaml, and deploy/service_account.yaml need to be in the same namespace used to run the operator, because WATCH_NAMESPACE needs to be an empty string.
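The steps above depend on the difference between a variable that is unset and one that is exported as an empty string. A minimal Go sketch of that distinction, using only the standard library, might look like the following. The helpers resolveWatchNamespace and describe are hypothetical names made up for this illustration; they are not operator-sdk functions.

```go
package main

import (
	"fmt"
	"os"
)

// watchNamespaceEnvVar is the variable exported in the steps above.
const watchNamespaceEnvVar = "WATCH_NAMESPACE"

// resolveWatchNamespace (hypothetical helper) returns the value of
// WATCH_NAMESPACE and whether it was set at all. The lookup function is
// injected so the logic can be exercised without touching the real
// environment.
func resolveWatchNamespace(lookup func(string) (string, bool)) (ns string, set bool) {
	return lookup(watchNamespaceEnvVar)
}

// describe (hypothetical helper) explains what a given result means:
// set but empty is treated as "all namespaces".
func describe(ns string, set bool) string {
	switch {
	case !set:
		return "WATCH_NAMESPACE is unset"
	case ns == "":
		return "watching all namespaces"
	default:
		return "watching only " + ns
	}
}

func main() {
	ns, set := resolveWatchNamespace(os.LookupEnv)
	fmt.Println(describe(ns, set))
}
```

Printing this once at startup is a cheap way to confirm which value the process actually received.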

Please, let us know if it solves your scenario.

@camilamacedo86 Thanks so much for taking time to help me with this issue! It's greatly appreciated!

I have a quick question.

Also, note that you are applying the role permissions in a different namespace from the one you are using to run the operator. See your role_binding: it will create the resources for cloud-ingress-operator and not in --namespace=openshift-cloud-ingress-operator, which you used to run it locally.

You said that the rolebinding "name: cloud-ingress-operator" is not in the openshift-cloud-ingress-operator namespace. But on line 9, there is a namespace defined.

Why would this not be sufficient? Is there something in the operator-sdk that assumes the name of the rolebinding is associated with the namespace?

Hi @mwoodson,

You said that the rolebinding "name: cloud-ingress-operator" is not in the openshift-cloud-ingress-operator namespace. But on line 9, there is a namespace defined.

That is right. All of them use the namespace openshift-cloud-ingress-operator. It was my mistake.

Why would this not be sufficient? Is there something in the operator-sdk that assumes the name of the rolebinding is associated with the namespace?

If the RBAC is not applied in the namespace where the operator will be "deployed", then it will not have the required permissions.

Let's troubleshoot it.

1) Could you log the value that will be set as the namespace, so we can check that "" is being passed to the manager?

See: before here

2) The implementation of the method that is returning an empty list appears to use the AWS client and not the client provided. See:

https://github.com/openshift/cloud-ingress-operator/blob/7183c10ec66eba79aa8e72ad0f0281e5ffa8325c/pkg/awsclient/ec2_helper.go#L240-L253

So, I understand that your problem is that the following line is returning empty. Am I right?

filter := []*ec2.Filter{{Name: aws.String("tag:Name"), Values: aws.StringSlice([]string{name})}}

Could you please confirm it? Also, could you log the value of "filter"?

3) Checking the above line further:

Note that it is ec2 and the import is "github.com/aws/aws-sdk-go/service/ec2". So, what is not working in your project when it runs locally is not the client provided by the SDK but this AWS client.

I think it may require some specific configuration that is available only in the cluster. Locally, operator-sdk run --local will use the configuration from the kubeconfig; however, when it is running inside the cluster, the pod will get the config from the cluster (cfg, err := config.GetConfig()).

So the problem may be that the AWS client implementation requires something that is not provided in the kubeconfig.

1) Could you log the value that will be set as the namespace, so we can check that "" is being passed to the manager?

See: before here

I ran this command:

operator-sdk run --local --namespace=openshift-cloud-ingress-operator

Here is the output from a log message. The Request.Namespace is "openshift-cloud-ingress-operator"

{"level":"info","ts":1584024077.6486912,"logger":"controller_apischeme","msg":"Reconciling APIScheme","Request.Namespace":"openshift-cloud-ingress-operator","Request.Name":"example-apischeme"}

2) The implementation of the method that is returning an empty list appears to use the AWS client and not the client provided. See:

https://github.com/openshift/cloud-ingress-operator/blob/7183c10ec66eba79aa8e72ad0f0281e5ffa8325c/pkg/awsclient/ec2_helper.go#L240-L253

So, I understand that your problem is that the following line is returning empty. Am I right?

filter := []*ec2.Filter{{Name: aws.String("tag:Name"), Values: aws.StringSlice([]string{name})}}
Could you please confirm it? Also, could you log the value of "filter"?

This is the function in the reconciler that fails:

https://github.com/openshift/cloud-ingress-operator/blob/master/pkg/controller/apischeme/apischeme_controller.go#L163

The calls above this line are working.

This is the section that is failing:

https://github.com/openshift/cloud-ingress-operator/blob/7183c10ec66eba79aa8e72ad0f0281e5ffa8325c/pkg/controller/utils/clusterinfo.go#L58-L64

This is the call that is returning an empty list:

err := kclient.List(context.TODO(), machineList, client.InNamespace("openshift-machine-api"), client.MatchingLabels{masterMachineLabel: "master"})

It's doing a basic "oc get machines -n openshift-machine-api". It comes back blank when run locally from the operator-sdk. But when I run it in the cluster, it comes back with a valid list.

Again, the kubeconfig works when I run that command locally.

Hi @mwoodson,

I ran this command:

operator-sdk run --local --namespace=openshift-cloud-ingress-operator

Here is the output from a log message. The Request.Namespace is "openshift-cloud-ingress-operator"

{"level":"info","ts":1584024077.6486912,"logger":"controller_apischeme","msg":"Reconciling APIScheme","Request.Namespace":"openshift-cloud-ingress-operator","Request.Name":"example-apischeme"}

That is the log from the controller (it shows the namespace of the request this controller is handling). I suggested printing the namespace variable to see what value is being passed to the manager.

See:

    // Create a new Cmd to provide shared dependencies and start components
    mgr, err := manager.New(cfg, manager.Options{
        Namespace:          namespace,
        MetricsBindAddress: fmt.Sprintf("%s:%d", metricsHost, metricsPort),
    })
    if err != nil {
        log.Error(err, "")
        os.Exit(1)
    }

In your case, the value of namespace needs to be "".

This is the call that is returning an empty list:

err := kclient.List(context.TODO(), machineList, client.InNamespace("openshift-machine-api"), client.MatchingLabels{masterMachineLabel: "master"})

It's doing a basic "oc get machines -n openshift-machine-api". It comes back blank when run locally from the operator-sdk. But when I run it in the cluster, it comes back with a valid list.

Note that it is also filtering by label, with the value "master":

err := kclient.List(context.TODO(), machineList, client.InNamespace("openshift-machine-api"), client.MatchingLabels{masterMachineLabel: "master"})

So, just to confirm:

When you run the command manually and also filter by that label with the value "master", are the machines returned?

I was trying to test this out on my OCP 4.3 instance. I'm seeing this when I run it locally with WATCH_NAMESPACE="":

{"level":"info","ts":1584025111.5984113,"logger":"controller_apischeme","msg":"Reconciling APIScheme","Request.Namespace":"openshift-cloud-ingress-operator","Request.Name":"example-apischeme"}
E0312 09:58:31.699161 19273 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 366 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1934b80, 0x2cf7cc0)
/home/jeffmc/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/home/jeffmc/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:48 +0x82
panic(0x1934b80, 0x2cf7cc0)
/usr/lib/golang/src/runtime/panic.go:679 +0x1b2
github.com/openshift/cloud-ingress-operator/pkg/controller/utils.GetClusterRegion(0x1f0ff20, 0xc0007293e0, 0xc00084a300, 0x0, 0x0, 0xc0006d1020)
/home/jeffmc/cloud-ingress-operator/pkg/controller/utils/clusterinfo.go:100 +0x61

@jmccormick2001 That's the error I'm seeing.

The error you are seeing:

E0312 09:58:31.699161 19273 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 366 [running]:

is happening at this line in the code:

https://github.com/openshift/cloud-ingress-operator/blob/7183c10ec66eba79aa8e72ad0f0281e5ffa8325c/pkg/controller/utils/clusterinfo.go#L75

&machineList.Items[0] isn't there because the machines can't be retrieved, so it throws that error.
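Whatever the reason the list comes back empty, indexing its first element without a length check will panic. A minimal sketch of a guard, assuming stand-in types (machine, machineList) rather than the real Machine API types, and a hypothetical helper firstMachine:

```go
package main

import (
	"errors"
	"fmt"
)

// machine and machineList are simplified stand-ins for the Machine API
// types used in the thread; only the fields needed here are modeled.
type machine struct{ Name string }
type machineList struct{ Items []machine }

// errNoMachines is returned instead of panicking on an empty list.
var errNoMachines = errors.New("no master machines found: check the watched namespace and the label selector")

// firstMachine (hypothetical helper) returns the first item of the list,
// or errNoMachines when the list is nil or empty.
func firstMachine(l *machineList) (*machine, error) {
	if l == nil || len(l.Items) == 0 {
		return nil, errNoMachines
	}
	return &l.Items[0], nil
}

func main() {
	// Empty list: an error is reported instead of a runtime panic.
	if _, err := firstMachine(&machineList{}); err != nil {
		fmt.Println("error:", err)
	}

	// Populated list: the first machine is returned.
	m, err := firstMachine(&machineList{Items: []machine{{Name: "master-0"}}})
	if err == nil {
		fmt.Println("first machine:", m.Name)
	}
}
```

With a guard like this, the reconciler would log a descriptive error and requeue rather than crash, which makes the empty list itself easier to spot.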

Hi @mwoodson,

The stack trace shows that the first issue (at least according to what was added here) is in:

github.com/openshift/cloud-ingress-operator/pkg/controller/utils.GetClusterRegion(0x1f0ff20, 0xc0007293e0, 0xc00084a300, 0x0, 0x0, 0xc0006d1020)
/home/jeffmc/cloud-ingress-operator/pkg/controller/utils/clusterinfo.go:100 +0x61

https://github.com/openshift/cloud-ingress-operator/blob/master/pkg/controller/utils/clusterinfo.go#L100

PS: In a Go panic trace, the frame listed right after panic(...) is where the panic was raised, which here is the last frame in the pasted excerpt.

Then, see:

https://github.com/openshift/cloud-ingress-operator/blob/master/pkg/controller/utils/clusterinfo.go#L100

return infra.Status.PlatformStatus.AWS.Region, nil

Which means that something above is nil.

Then, looking at it, we have:

func getInfrastructureObject(kclient client.Client) (*configv1.Infrastructure, error) {
    infra := &configv1.Infrastructure{}
    ns := types.NamespacedName{
        Namespace: "",
        Name:      "cluster",
    }
    err := kclient.Get(context.TODO(), ns, infra)
    if err != nil {
        return nil, err
    }
    return infra, nil
}

Could you check the trace and log the values of infra? Could you add a few logs to the above func?

IMHO, you need to check what is nil in infra (infra.Status.PlatformStatus.AWS.Region) and then try to understand how the value is set in this resource, to find out what is not working locally. IMHO, it still looks related to AWS.
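One way to find out which part of infra.Status.PlatformStatus.AWS.Region is nil is to walk the pointer chain and report the first nil link. The sketch below uses stand-in structs that mirror only the fields named in this thread, not the real configv1 types, and firstNilField is a hypothetical helper:

```go
package main

import "fmt"

// Stand-in types mirroring only the fields discussed above.
type awsStatus struct{ Region string }
type platformStatus struct{ AWS *awsStatus }
type infraStatus struct{ PlatformStatus *platformStatus }
type infrastructure struct{ Status infraStatus }

// firstNilField (hypothetical helper) returns the path of the first nil
// pointer on the way to Status.PlatformStatus.AWS.Region, or "" when the
// whole chain is populated. Cases are evaluated in order, so later cases
// never dereference a nil pointer.
func firstNilField(infra *infrastructure) string {
	switch {
	case infra == nil:
		return "infra"
	case infra.Status.PlatformStatus == nil:
		return "infra.Status.PlatformStatus"
	case infra.Status.PlatformStatus.AWS == nil:
		return "infra.Status.PlatformStatus.AWS"
	}
	return ""
}

func main() {
	// A cluster with a platform status but no AWS section.
	nonAWS := &infrastructure{Status: infraStatus{PlatformStatus: &platformStatus{}}}
	fmt.Println("first nil field:", firstNilField(nonAWS))
}
```

Logging the returned path next to the Get call would show at a glance whether the object is missing entirely or only lacks the AWS section.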

@camilamacedo86

https://github.com/openshift/cloud-ingress-operator/blob/master/pkg/controller/utils/clusterinfo.go#L100

I don't think this call is the issue, but I'm happy to be wrong. That code is called from the reconciler here:

https://github.com/openshift/cloud-ingress-operator/blob/master/pkg/controller/apischeme/apischeme_controller.go#L138-L144

And we log that info here in the reconciler:

https://github.com/openshift/cloud-ingress-operator/blob/master/pkg/controller/apischeme/apischeme_controller.go#L146

Here is the output of that log message:
{"level":"info","ts":1584026576.2520607,"logger":"controller_apischeme","msg":"Region: us-east-1, Owner tags: +map[kubernetes.io/cluster/mwoodson-mar11-8zcmj:owned]","Request.Namespace":"openshift-cloud-ingress-operator","Request.Name":"example-apischeme"}

As I have stated, we seem to be able to get cluster-scoped resources (like the infrastructure), but when I try to get the machines in openshift-machine-api, I can't access them.

jeff: infra.Status.PlatformStatus.AWS is nil

That's the nil value.

@jmccormick2001 curious, are you running this on an AWS cluster?

Nope, a local OCP cluster. I'll bet I was missing the point that this will only run on an AWS cluster.

Yeah. This is what we see on the AWS cluster:

$ oc get infrastructure cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2020-03-11T13:22:14Z"
  generation: 1
  name: cluster
  resourceVersion: "427"
  selfLink: /apis/config.openshift.io/v1/infrastructures/cluster
  uid: 3bbd4276-d798-4616-920f-991559e69f5b
spec:
  cloudConfig:
    name: ""
status:
  apiServerInternalURI: https://api-int.mwoodson-mar11.u5d7.s1.devshift.org:6443
  apiServerURL: https://api.mwoodson-mar11.u5d7.s1.devshift.org:6443
  etcdDiscoveryDomain: mwoodson-mar11.u5d7.s1.devshift.org
  infrastructureName: mwoodson-mar11-8zcmj
  platform: AWS
  platformStatus:
    aws:
      region: us-east-1
    type: AWS

Hi @mwoodson,

However, that is what you see when the operator is running in the cluster. I do not know how your project works, but you are using this AWS client, which looks closely related to the scenario.

Note that your project may be updating platformStatus.aws.region when it is running in the cluster and not locally. My guess would be that it uses some specific configuration that is obtained when the project is deployed in the cluster.

When we run the project locally with operator-sdk run --local, it is executed outside the cluster and the configuration used is whatever is available in the kubeconfig. However, when the operator is deployed in the cluster, the config is obtained from the cluster (from the pod at runtime).
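The difference between the two configuration sources can be made visible with a small diagnostic. The sketch below only names where the connection settings would come from and loads nothing. It assumes the lookup order documented for controller-runtime's config.GetConfig (--kubeconfig flag, KUBECONFIG variable, in-cluster config, then $HOME/.kube/config) and the KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT variables that Kubernetes injects into pods. The function configSource is a hypothetical helper, not part of any library.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// configSource (hypothetical helper) reports which source would be
// consulted first. getenv and exists are injected so the logic can be
// checked without a real cluster or kubeconfig file.
func configSource(flagPath string, getenv func(string) string, exists func(string) bool) string {
	if flagPath != "" {
		return "flag: " + flagPath
	}
	if p := getenv("KUBECONFIG"); p != "" {
		return "KUBECONFIG: " + p
	}
	// Inside a pod these two variables are set by Kubernetes.
	if getenv("KUBERNETES_SERVICE_HOST") != "" && getenv("KUBERNETES_SERVICE_PORT") != "" {
		return "in-cluster service account"
	}
	if home := getenv("HOME"); home != "" {
		p := filepath.Join(home, ".kube", "config")
		if exists(p) {
			return "default file: " + p
		}
	}
	return "none"
}

// fileExists reports whether a path can be stat'ed.
func fileExists(p string) bool {
	_, err := os.Stat(p)
	return err == nil
}

func main() {
	fmt.Println(configSource("", os.Getenv, fileExists))
}
```

Run locally with KUBECONFIG exported, this reports the kubeconfig path; inside a pod it would report the service account, which is a different identity with different permissions.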

So, could you add a log in the following code?

func getInfrastructureObject(kclient client.Client) (*configv1.Infrastructure, error) {
    infra := &configv1.Infrastructure{}
    ns := types.NamespacedName{
        Namespace: "",
        Name:      "cluster",
    }
    err := kclient.Get(context.TODO(), ns, infra)
    if err != nil {
        return nil, err
    }
    // TODO: add a log here to print the value of infra
    return infra, nil
}

Also, could you add one more if here, as follows?

// GetClusterRegion returns the installed cluster's AWS region
func GetClusterRegion(kclient client.Client) (string, error) {
    infra, err := getInfrastructureObject(kclient)
    if err != nil {
        return "", err
    } else if infra.Status.PlatformStatus == nil {
        return "", fmt.Errorf("Expected to have a PlatformStatus for Infrastructure/cluster, but it was nil")
    } else if infra.Status.PlatformStatus.AWS == nil {
        return "", fmt.Errorf("Expected to have AWS for Infrastructure/cluster, but it was nil")
    }
    return infra.Status.PlatformStatus.AWS.Region, nil
}

Then, build the project and run it locally as described before to check the result.

I think you will no longer hit the panic, at least not at this point. :-) However, if you still hit one, could you please add the full stack trace here, along with the logs added above, from the local run?

After playing with this today, I am seeing an error come back from even a simple client-go test program, "Unauthorized", when making a call to List the Machines in this namespace. More debugging to go.

Hi @jmccormick2001,

If you are adding third-party schemas (the OCP API), then you need to customize the default metrics implementation. See the doc. (I assume that you are checking the similar issues #2577 and #1858.) As a POC to check it, you can just comment out addMetrics; that may solve this issue.

Hi @mwoodson and @jmccormick2001,

I made a POC in order to check it, and I could identify what is happening. See: https://github.com/camilamacedo86/app-operator

Even though we export the environment variable WATCH_NAMESPACE by running export WATCH_NAMESPACE="", the value passed in the --namespace flag is what gets set in WATCH_NAMESPACE. Because of this, the manager receives that value and is then not able to manage other namespaces.

See:

Screenshot 2020-03-13 at 00 17 49
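A toy model can illustrate why a manager restricted to one namespace returns an empty list, and no error, for objects elsewhere. This is a deliberately simplified stand-in written for this thread, not controller-runtime's cache implementation. The toyCache type and its methods are hypothetical, and the model covers namespaced objects only.

```go
package main

import "fmt"

// object is a minimal namespaced resource.
type object struct{ Namespace, Name string }

// toyCache holds only objects from the watched namespace.
// An empty watch value means "all namespaces".
type toyCache struct {
	watch string
	items []object
}

// add stores the object only if it falls inside the watched scope.
func (c *toyCache) add(o object) {
	if c.watch == "" || o.Namespace == c.watch {
		c.items = append(c.items, o)
	}
}

// list returns the cached objects in ns. Asking for a namespace outside
// the watched scope yields an empty result rather than an error.
func (c *toyCache) list(ns string) []object {
	var out []object
	for _, o := range c.items {
		if o.Namespace == ns {
			out = append(out, o)
		}
	}
	return out
}

func main() {
	cluster := []object{
		{Namespace: "openshift-machine-api", Name: "master-0"},
		{Namespace: "openshift-cloud-ingress-operator", Name: "example-apischeme"},
	}
	for _, watch := range []string{"openshift-cloud-ingress-operator", ""} {
		c := &toyCache{watch: watch}
		for _, o := range cluster {
			c.add(o)
		}
		fmt.Printf("watch=%q machines=%d\n", watch, len(c.list("openshift-machine-api")))
	}
}
```

In this model the restricted cache finds zero machines while the unrestricted one finds one, which matches the symptom of a blank list with no error.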

The workaround is to pass the fixed value "" in the code.

@mwoodson, replace the code in the main.go file as follows:

    emptyNS := ""
    // Create a new Cmd to provide shared dependencies and start components
    mgr, err := manager.New(cfg, manager.Options{
        Namespace:          emptyNS,
        MetricsBindAddress: fmt.Sprintf("%s:%d", metricsHost, metricsPort),
    })
    if err != nil {
        log.Error(err, "")
        os.Exit(1)
    }

As I did in the POC: https://github.com/camilamacedo86/app-operator/blob/master/cmd/manager/main.go#L99

Then, you will see that it will work:

Screenshot 2020-03-13 at 00 21 26

Following the steps to do this check with the POC

It should be fixed with the PR https://github.com/operator-framework/operator-sdk/pull/2617

thanks @camilamacedo86 , #2617

Hi @mwoodson,

As you can check in the POC (https://github.com/camilamacedo86/app-operator), if you pass an empty namespace to the manager, it allows the operator to watch all namespaces. See: https://github.com/camilamacedo86/app-operator/blob/master/cmd/manager/main.go#L99

Note that the above POC proves that operator-sdk run --local is able to work with multi-namespace and cluster-scoped operators. The issue here is how to pass the correct value of WATCH_NAMESPACE, which in this case needs to be an empty string.

When we merge #2617, a new flag --watch-namespace will be added to let you pass the value of WATCH_NAMESPACE, as well as --operator-namespace for OPERATOR_NAMESPACE.

However, all the suggestions made in the comment https://github.com/operator-framework/operator-sdk/issues/2644#issuecomment-598381272 still stand. Also, note that the error added here shows that infra.Status.PlatformStatus.AWS == nil in your implementation, which is causing a panic. Your implementation with the AWS client may work differently locally and not populate infra.Status.PlatformStatus.AWS, as described above. It is very important to highlight that when the operator runs in the cluster it gets its configuration from inside the pod, and not from the kubeconfig as it does locally, which could also be the root cause of your issue, specifically regarding the AWS client.

So, I'd recommend that you make all the changes suggested so far in your code, build a new tag and version, and then test it. By doing that, you will get more information in the logs to help figure out what is wrong specifically with your project. If you still need help after that, we will need the logs produced after the suggested changes in order to check it.

So, please let us know if you could perform the checks and sort it out, or please provide the full logs from running the operator locally after all the changes suggested in https://github.com/operator-framework/operator-sdk/issues/2644#issuecomment-598381272 as well.

@camilamacedo86 I just deleted my last comment. That was on me, I hadn't properly saved. @jmccormick2001 synced and he showed me errors.

I have restarted it and it appears to be working as expected! Thanks so much, this is huge!

I am so happy that it is working for you now and that we could sort it out. It is terrific. 👍

This issue will be closed when we merge PR #2617. I hope that it makes this easier to solve in the future. Please feel free to open new issues with questions, problems, or feature requests, and to attend the community meeting. Your collaboration is very important to us.
