Kubeflow: All pods fail when the kf-serving controller manager is shut down

Created on 20 Nov 2019 · 4 comments · Source: kubeflow/kubeflow

/kind bug

What steps did you take and what happened:
I installed Kubeflow on Google Kubernetes Engine following the documentation, using the "deploy using CLI" steps. This succeeded and I was able to use Kubeflow. Afterwards I scaled the cluster down to 0 nodes for the weekend in order to reduce costs. After resizing the cluster back to 1, no pods could be created because of the inferenceservice.kfserving-webhook-server.pod-mutator webhook.

This webhook fires for every pod creation in any namespace that does not have a label with key control-plane, and invokes the kfserving-webhook-server-service. But the pod that serves as the endpoint for that service cannot itself be created, because its creation is also intercepted by the webhook.

After digging into the webhook, I found the selector to be:

namespaceSelector:
    matchExpressions:
    - key: control-plane
      operator: DoesNotExist
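
For reference, the effective selector can be checked directly on the cluster; a minimal command, assuming the configuration name that appears later in this thread:

kubectl get mutatingwebhookconfiguration inferenceservice.serving.kubeflow.org -o yaml

(MutatingWebhookConfiguration objects are cluster-scoped, so no namespace flag is needed.)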

And the pods that should back that endpoint have the following labels:

 template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: kfserving-install
        app.kubernetes.io/instance: kfserving-install-v0.7.0
        app.kubernetes.io/managed-by: kfctl
        app.kubernetes.io/name: kfserving-install
        app.kubernetes.io/part-of: kubeflow
        app.kubernetes.io/version: v0.7.0
        control-plane: kfserving-controller-manager
        controller-tools.k8s.io: "1.0"
        kustomize.component: kfserving

But these pods still match the namespace selector, because the control-plane label is on the pods, not on the namespace. Their creation is therefore intercepted by the mutating webhook, so an endpoint for kfserving-webhook-server-service is never created. As a result, the mutating webhook fails for all pod creations and the whole cluster stops functioning.
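
One possible workaround that follows from this (untested here, and not taken from this thread) would be to put a control-plane label on the kubeflow namespace itself, so that pods created in it no longer match the DoesNotExist selector:

kubectl label namespace kubeflow control-plane=kfserving-controller-manager

With that label in place, the kfserving-webhook-server pod could be admitted without passing through its own webhook.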

After removing the mutating webhook the cluster is up and running again. My question is: what would be the right approach to solve this issue?
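
Another direction, sketched here only as an assumption about how the webhook could be configured rather than how kfserving 0.7.0 actually ships it, would be to set the webhook's failurePolicy to Ignore, so that pod admission proceeds when the webhook endpoint is unreachable:

webhooks:
- name: inferenceservice.kfserving-webhook-server.pod-mutator
  # Ignore: admit the pod even if the webhook service cannot be reached
  failurePolicy: Ignore

The trade-off is that pods created while the controller is down would not be mutated by kfserving.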

What did you expect to happen:
Resizing the cluster from 0 back to 1 node starts the cluster without having to remove the webhook.

Environment:

  • Kubeflow version: 0.7.0
  • kfctl version: 0.7.0
  • Kubernetes platform: GKE
  • Kubernetes version: Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T23:41:55Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.8-gke.12", GitCommit:"188432a69210ca32cafded81b4dd1c063720cac0", GitTreeState:"clean", BuildDate:"2019-11-07T19:27:01Z", GoVersion:"go1.12.11b4", Compiler:"gc", Platform:"linux/amd64"}
  • OS (e.g. from /etc/os-release): macOS Mojave 10.14.5

All 4 comments

Issue Label Bot is not confident enough to auto-label this issue.
See dashboard for more details.

How do you remove the mutating webhook?

Duplicate of kubeflow/kfserving#568

You can delete the webhook like so:

kubectl -n kubeflow delete MutatingWebHookConfiguration inferenceservice.serving.kubeflow.org

Duplicate of kubeflow/kfserving#568
