1. What kops version are you running?
1.14.0
2. What Kubernetes version are you running?
1.15.4
3. What cloud provider are you using?
AWS
4. What commands did you run?
kops create -v 10 \
--name cluster1.somedomain.io \
--state "s3://cluster1.somedomain.io" \
-f cluster_config.yaml
kops create -v 10 secret sshpublickey admin \
--name cluster1.somedomain.io \
--state "s3://cluster1.somedomain.io" \
--config cluster_config.yaml \
-i cluster1.somedomain.io.pub
kops -v 10 update cluster cluster1.somedomain.io \
--state "s3://cluster1.somedomain.io" \
--yes
5. What is the simplest way to reproduce this issue?
Create a cluster and set spec.kubeScheduler.usePolicyConfigMap: true
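For reference, a minimal sketch of the relevant part of the cluster manifest (the cluster name is a placeholder, the apiVersion may differ between kops releases, and only the field that triggers the problem is shown):
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  name: cluster1.somedomain.io
spec:
  kubeScheduler:
    usePolicyConfigMap: true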
6. What happened after the commands executed?
NotReady state.
7. What did you expect to happen?
Ready state without manual intervention.
8. Please provide your cluster manifest.
9. Please run the commands with most verbose logging by adding the -v 10 flag.
10. Anything else do we need to know?
This is not a new issue. I've been working around it since K8s 1.13, maybe earlier.
There are two different issues causing the kube-scheduler pods to crash.
Issue 1: the system:kube-scheduler clusterrole doesn't grant access to configmaps.
Related log messages:
I1003 05:31:39.339583 1 server.go:161] Starting Kubernetes Scheduler version v1.15.4
couldn't get policy config map kube-system/scheduler-policy: configmaps "scheduler-policy" is forbidden: User "system:kube-scheduler" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
Manual fix: edit the system:kube-scheduler clusterrole and append the following:
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
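For what it's worth, here is a sketch of the same fix applied non-interactively with kubectl patch (assumes kubectl access to the cluster; note that bootstrap RBAC reconciliation may revert manual edits to system: roles, so this is a workaround rather than a durable fix):
kubectl patch clusterrole system:kube-scheduler --type=json \
  -p='[{"op": "add", "path": "/rules/-", "value": {"apiGroups": [""], "resources": ["configmaps"], "verbs": ["get", "list", "watch"]}}]'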
Issue 2: the scheduler-policy configmap contains invalid predicates.
Related log messages:
F1003 06:02:19.748546 1 plugins.go:240] Invalid configuration: Predicate type not found for CheckNodeMemoryPressure
...
F1003 06:17:53.461767 1 plugins.go:240] Invalid configuration: Predicate type not found for CheckNodeDiskPressure
...
F1003 06:22:57.394597 1 plugins.go:240] Invalid configuration: Predicate type not found for CheckNodeCondition
...
F1003 06:43:25.075534 1 plugins.go:240] Invalid configuration: Predicate type not found for NoVolumeNodeConflict
scheduler-policy configmap [original]
scheduler-policy configmap [working]
scheduler-policy configmap [tuned]
Manual fix: edit the scheduler-policy configmap and remove the troublesome predicates (CheckNodeMemoryPressure, CheckNodeDiskPressure, CheckNodeCondition, NoVolumeNodeConflict)
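For illustration only (this is not the kops-shipped configmap; the predicate and priority names below are just a plausible minimal example), a policy.cfg with the four predicates removed has this general shape:
apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-policy
  namespace: kube-system
data:
  policy.cfg: |
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "predicates": [
        {"name": "PodFitsResources"},
        {"name": "PodToleratesNodeTaints"},
        {"name": "MatchInterPodAffinity"}
      ],
      "priorities": [
        {"name": "LeastRequestedPriority", "weight": 1}
      ]
    }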
I suspect what's going on is that, since the TaintNodesByCondition feature gate has been enabled by default since K8s 1.12.0, the troublesome predicates were removed from the scheduler (src), but the configmap resource (src) doesn't reflect this change, so it won't work when used here.
It looks like issue 2 is a pretty straightforward fix, but I'm not sure how to handle patching the clusterrole in issue 1.
Related to #6579
This feature is DEPRECATED in the latest Kubernetes, so I do not see why it should be fixed / used anymore.
--policy-configmap string
DEPRECATED: name of the ConfigMap object that contains scheduler's policy configuration. It must exist in the system namespace before scheduler initialization if --use-legacy-policy-config=false. The config must be provided as the value of an element in 'Data' map with the key='policy.cfg'
--policy-configmap-namespace string     Default: "kube-system"
DEPRECATED: the namespace where policy ConfigMap is located. The kube-system namespace will be used if this is not provided or is empty.
https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/
@zetaab
The command which starts the scheduler on a kops-managed cluster:
Command:
/bin/sh
-c
mkfifo /tmp/pipe; (tee -a /var/log/kube-scheduler.log < /tmp/pipe & ) ; exec /usr/local/bin/kube-scheduler --kubeconfig=/var/lib/kube-scheduler/kubeconfig --leader-elect=true --policy-configmap=scheduler-policy --policy-configmap-namespace=kube-system --v=2 > /tmp/pipe 2>&1
It uses the --kubeconfig flag, which is deprecated as well. It looks like a config file should now be used instead of the flags.
# kube-scheduler --help
...
Misc flags:
--config string
The path to the configuration file. Flags override values in this file.
The --write-config-to flag can be used to produce a sample configuration file:
# kube-scheduler --kubeconfig=/var/lib/kube-scheduler/kubeconfig --leader-elect=true --policy-configmap=scheduler-policy --policy-configmap-namespace=kube-system --v=2 --write-config-to /scheduler_config.yaml
Which produces
algorithmSource:
  provider: DefaultProvider
apiVersion: kubescheduler.config.k8s.io/v1alpha1
bindTimeoutSeconds: 600
clientConnection:
  acceptContentTypes: ""
  burst: 100
  contentType: application/vnd.kubernetes.protobuf
  kubeconfig: /var/lib/kube-scheduler/kubeconfig
  qps: 50
disablePreemption: false
enableContentionProfiling: false
enableProfiling: false
failureDomains: kubernetes.io/hostname,failure-domain.beta.kubernetes.io/zone,failure-domain.beta.kubernetes.io/region
hardPodAffinitySymmetricWeight: 1
healthzBindAddress: 0.0.0.0:10251
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
  leaseDuration: 15s
  lockObjectName: kube-scheduler
  lockObjectNamespace: kube-system
  renewDeadline: 10s
  resourceLock: endpoints
  retryPeriod: 2s
metricsBindAddress: 0.0.0.0:10251
percentageOfNodesToScore: 0
schedulerName: default-scheduler
The KubeSchedulerConfiguration schema: the structure has a Plugins field, which contains a Score list. The plugins can be configured via the PluginConfig field of the KubeSchedulerConfiguration structure.
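A sketch of how those fields look in a config file (illustrative only; the plugin name is hypothetical and the exact apiVersion and available plugins depend on the Kubernetes release):
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /var/lib/kube-scheduler/kubeconfig
leaderElection:
  leaderElect: true
plugins:
  score:
    enabled:
    - name: ExampleScorePlugin   # hypothetical plugin name
pluginConfig:
- name: ExampleScorePlugin       # hypothetical plugin name
  args: {}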
Kops should use the configuration file instead of flags and allow it to be configured.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
@rtluckie shall this really be closed? The issues didn't go away; I just had to apply the same manual workaround on our clusters.
We should not close this issue. Existing functionality is broken :(
While this is indeed deprecated, if you need to use it to access scheduler settings that aren't yet exposed in kops, you can use this RBAC policy:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kube-scheduler-configmap
  namespace: kube-system
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["scheduler-policy"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: kube-scheduler-configmap
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kube-scheduler-configmap
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: system:kube-scheduler
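Assuming the Role and RoleBinding above are saved to a file (the filename here is arbitrary), they can be applied before enabling usePolicyConfigMap:
kubectl apply -f kube-scheduler-configmap-rbac.yaml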