Kops: Migrating to 1.8 with RBAC is incompatible

Created on 28 Dec 2017 · 31 comments · Source: kubernetes/kops

------------- BUG REPORT TEMPLATE --------------------

  1. What kops version are you running? The command kops version will display
    this information.
    Version 1.8.0 (git-4876009bd)
  2. What Kubernetes version are you running? kubectl version will print the
    version if a cluster is running or provide the Kubernetes version specified as
    a kops flag.
    v1.7.7
  3. What cloud provider are you using?
    aws
  4. What commands did you run? What is the simplest way to reproduce this issue?
    kops update cluster
  5. What happened after the commands executed?

  6. What did you expect to happen?
    Upgrade the cluster to v1.8.6

  7. Please provide your cluster manifest. Execute
    kops get --name my.example.com -oyaml to display your cluster manifest.
    You may want to remove your cluster name and other sensitive information.

  8. Please run the commands with most verbose logging by adding the -v 10 flag.
    Paste the logs into this report, or in a gist and provide the gist link here.

  9. Anything else we need to know?

  • We are trying to upgrade the cluster from v1.7.7 to v1.8.6 with RBAC turned on.
  • We used kops built from the master branch for the upgrade; kops version reports Version 1.8.0 (git-4876009bd). The apiserver then logs the following:
I1227 16:17:34.682684       7 rbac.go:116] RBAC DENY: user "kubelet" groups ["system:nodes" "system:authenticated"] cannot "create" resource "pods" in namespace "kube-system"
I1227 16:17:34.682827       7 wrap.go:42] POST /api/v1/namespaces/kube-system/pods: (352.225µs) 403 [[kubelet/v1.8.6 (linux/amd64) kubernetes/6260bb0] 127.0.0.1:32806]
I1227 16:17:34.683112       7 rbac.go:116] RBAC DENY: user "kubelet" groups ["system:nodes" "system:authenticated"] cannot "create" resource "events" in namespace "default"
I1227 16:17:34.683175       7 wrap.go:42] POST /api/v1/namespaces/default/events: (204.479µs) 403 [[kubelet/v1.8.6 (linux/amd64) kubernetes/6260bb0] 127.0.0.1:32806]
I1227 16:17:34.684278       7 rbac.go:116] RBAC DENY: user "kubelet" groups ["system:nodes" "system:authenticated"] cannot "create" resource "events" in namespace "default"
I1227 16:17:34.684381       7 wrap.go:42] POST /api/v1/namespaces/default/events: (272.221µs) 403 [[kubelet/v1.8.6 (linux/amd64) kubernetes/6260bb0] 127.0.0.1:32806]
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2017-12-26T20:42:03Z
  name: k8s.playground.REDACTED.io
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://k8s.playground.REDACTED.io/k8s.playground.REDACTED.io
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    authorizationRbacSuperUser: admin
    storageBackend: etcd3
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.8.6
  masterInternalName: api.internal.k8s.playground.REDACTED.io
  masterPublicName: api.k8s.playground.REDACTED.io
  networkCIDR: 172.20.0.0/16
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.32.0/19
    name: us-east-1a
    type: Public
    zone: us-east-1a
  - cidr: 172.20.64.0/19
    name: us-east-1b
    type: Public
    zone: us-east-1b
  - cidr: 172.20.96.0/19
    name: us-east-1c
    type: Public
    zone: us-east-1c
  topology:
    dns:
      type: Public
    masters: public
    nodes: public

We ran this YAML before migrating and it still didn't help:

kubectl get  clusterrolebinding system:node -o yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  creationTimestamp: 2017-12-26T20:53:27Z
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: system:node
  resourceVersion: "850"
  selfLink: /apis/rbac.authorization.k8s.io/v1beta1/clusterrolebindings/system%3Anode
  uid: d24fbe68-ea7e-11e7-a9e1-0201c744720e
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:nodes


All 31 comments

Do the authorization errors persist in the log after the API server has completed startup and /healthz returns a 200? Some denials during server startup are normal as the authorization cache fills.
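
For reference, a quick way to watch for that transition (a sketch assuming the insecure local port on 8080 is enabled on the master, as in the apiserver flags shown later in this thread):

# Poll the apiserver health endpoint from the master node until it
# returns 200; RBAC denials logged before that point can be startup noise.
curl -i http://127.0.0.1:8080/healthz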

It continues and the cluster is inoperable.

After upgrading, what does this show?

kubectl get clusterrolebinding system:node -o yaml
kubectl get clusterrole system:node -o yaml

@liggitt I manually ran the above yaml and it didn't help.

The api-server is unavailable after the upgrade, so all kubectl commands fail.

kubelet permissions should not affect api server availability. I'm not sure how to debug further if the api server is unreachable. Do you have more apiserver logs that might be illuminating? @chrislovecnm any ideas of what else might be at play here?

@naveensrinivasan was RBAC already configured and working when the Cluster was on v1.7.7, or did you change it in the spec as part of the upgrade?

@liggitt not sure about the addons behaviour, but if performing an upgrade from v1.7 then the necessary RoleBinding will already exist, so I suspect that isn't the issue.

@KashifSaadat RBAC was already configured and working when the cluster was v1.7.7

Here are the log files. https://gist.github.com/naveensrinivasan/80eb10aa3bd2259139b48a6a78100357

I don't know exactly when I grabbed them. These are from the master, and I grabbed all the logs:

  • api
  • controller
  • proxy
  • scheduler

I am hitting the same issue after the upgrade, following a different installation method, and I am sure the system:nodes group has the system:node role. Interestingly, it isn't just system:nodes; I see other groups, e.g. system:authenticated, affected as well.

RBAC DENY: user "system:kube-proxy" groups ["system:authenticated"] cannot "list" resource "services" cluster-wide

Following this, the API server never comes up and the kube control plane is down.

@naveensrinivasan what does apiserver /healthz show while the API server is crashlooping in that state? Do you have the full apiserver manifest used, including all flags?

I'm seeing this, which makes me suspect issues writing to etcd:

I1226 17:20:58.368013       8 trace.go:76] Trace[2144299595]: "Create /api/v1/namespaces" (started: 2017-12-26 17:20:53.848730671 +0000 UTC) (total time: 4.5192501s):
Trace[2144299595]: [4.284563321s] [4.284499666s] About to store object in database
Trace[2144299595]: [4.5192501s] [234.686779ms] END
I1226 17:20:58.368361       8 wrap.go:42] POST /api/v1/namespaces: (4.519639312s) 500

@MQasimSarfraz what is the output of a superuser in the system:masters group calling /healthz on the apiserver? RBAC denials could prevent other components from talking to the API server, but would not keep the API server from coming up. I suspect issues reading from and/or writing to etcd

@liggitt Where can I find that output? Also, the following is what I can find related to /healthz in the API server logs:

 /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:108 +0x1ca
logging error output: "[+]ping ok\n[+]etcd ok\n[+]poststarthook/generic-apiserver-start-informers ok\n[+]poststarthook/start-apiextensions-informers ok\n[+]poststarthook/start-apiextensions-controllers ok\n[-]poststarthook/bootstrap-controller failed: reason withheld\n[-]poststarthook/rbac/bootstrap-roles failed: reason withheld\n[-]poststarthook/ca-registration failed: reason withheld\n[+]poststarthook/start-kube-apiserver-informers ok\n[+]poststarthook/start-kube-aggregator-informers ok\n[+]poststarthook/apiservice-registration-controller ok\n[+]poststarthook/apiservice-status-available-controller ok\n[+]poststarthook/apiservice-openapi-controller ok\n[+]poststarthook/kube-apiserver-autoregistration ok\n[-]autoregister-completion failed: reason withheld\nhealthz check failed\n"
 [[kube-probe/1.8] 127.0.0.1:55014]

formatted better, that shows:

[+]ping ok
[+]etcd ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[-]poststarthook/bootstrap-controller failed: reason withheld
[-]poststarthook/rbac/bootstrap-roles failed: reason withheld
[-]poststarthook/ca-registration failed: reason withheld
[+]poststarthook/start-kube-apiserver-informers ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[-]autoregister-completion failed: reason withheld
healthz check failed

the details for the failed hooks are available at these URLs:

/healthz/poststarthook/bootstrap-controller
/healthz/poststarthook/rbac/bootstrap-roles
/healthz/poststarthook/ca-registration
/healthz/autoregister-completion

I can't find anything useful at those URLs:

[qasim.sarfraz@kube-master-03 ~]$ curl -i 127.0.0.1:8080/healthz/poststarthook/bootstrap-controller
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 02 Jan 2018 20:08:29 GMT
Content-Length: 36

internal server error: not finished
[qasim.sarfraz@kube-master-03 ~]$ curl -i 127.0.0.1:8080/healthz/poststarthook/rbac/bootstrap-roles
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 02 Jan 2018 20:08:42 GMT
Content-Length: 36

internal server error: not finished
[qasim.sarfraz@kube-master-03 ~]$ curl -i 127.0.0.1:8080/healthz/poststarthook/ca-registration
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 02 Jan 2018 20:08:51 GMT
Content-Length: 36

internal server error: not finished
[qasim.sarfraz@kube-master-03 ~]$ curl -i 127.0.0.1:8080/healthz/autoregister-completion
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 02 Jan 2018 20:09:02 GMT
Content-Length: 495

internal server error: missing APIService: [v1. v1.authentication.k8s.io v1.authorization.k8s.io v1.autoscaling v1.batch v1.networking.k8s.io v1.rbac.authorization.k8s.io v1.storage.k8s.io v1alpha1.admissionregistration.k8s.io v1beta1.apiextensions.k8s.io v1beta1.apps v1beta1.authentication.k8s.io v1beta1.authorization.k8s.io v1beta1.batch v1beta1.certificates.k8s.io v1beta1.extensions v1beta1.policy v1beta1.rbac.authorization.k8s.io v1beta1.storage.k8s.io v1beta2.apps v2beta1.autoscaling]

All of those point to etcd write errors/hangs. Did etcd setup change during the upgrade? What are the flags passed to the apiserver?

Ah, interesting. No, I haven't changed it, but let me check the etcd dumps. Also, the following are the flags passed to the apiserver:

    - --advertise-address=10.1.165.137
    - --etcd-servers=https://10.1.165.214:2379,https://10.1.165.66:2379,https://10.1.165.240:2379
    - --etcd-quorum-read=true
    - --etcd-cafile=/etc/ssl/etcd/ssl/ca.pem
    - --etcd-certfile=/etc/ssl/etcd/ssl/node-kube-master-03.example.com.pem
    - --etcd-keyfile=/etc/ssl/etcd/ssl/node-kube-master-03.example.com-key.pem
    - --insecure-bind-address=0.0.0.0
    - --apiserver-count=3
    - --admission-control=Initializers,NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,GenericAdmissionWebhook,ResourceQuota
    - --service-cluster-ip-range=10.234.0.0/18
    - --service-node-port-range=30000-32767
    - --client-ca-file=/etc/kubernetes/ssl/ca.pem
    - --profiling=false
    - --repair-malformed-updates=false
    - --kubelet-client-certificate=/etc/kubernetes/ssl/node-kube-master-03.example.com.pem
    - --kubelet-client-key=/etc/kubernetes/ssl/node-kube-master-03.example.com-key.pem
    - --service-account-lookup=true
    - --tls-cert-file=/etc/kubernetes/ssl/apiserver.pem
    - --tls-private-key-file=/etc/kubernetes/ssl/apiserver-key.pem
    - --proxy-client-cert-file=/etc/kubernetes/ssl/apiserver.pem
    - --proxy-client-key-file=/etc/kubernetes/ssl/apiserver-key.pem
    - --service-account-key-file=/etc/kubernetes/ssl/apiserver-key.pem
    - --secure-port=6443
    - --insecure-port=8080
    - --storage-backend=etcd3
    - --runtime-config=admissionregistration.k8s.io/v1alpha1
    - --v=2
    - --allow-privileged=true
    - --anonymous-auth=False
    - --authorization-mode=RBAC
    - --feature-gates=Initializers=true

I noticed that etcd is not set up for etcd3, by the way. Check, but I think you are still running etcd2.

You have

storageBackend: etcd3

But you are not setting the etcd version in the manifest as required.
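
If I recall the kops spec correctly, that is the version field on each etcd cluster; a sketch based on the manifest earlier in this issue (the exact version string is illustrative):

  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    name: main
    version: 3.0.17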

@liggitt thanks for the pointer; for me it was etcd. The etcd cluster was misbehaving for some reason, and everything is back to normal once I fixed it. I wonder why etcd was marked ok in the health check, and why there wasn't any logging for the etcd failure.

[+]etcd ok

Thanks again!
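
For anyone debugging the same symptom: querying etcd directly can surface problems that the apiserver's [+]etcd ok line misses. A sketch assuming etcd3 with TLS, reusing the certificate paths from the apiserver flags above (adjust endpoints and paths for your setup):

# Check each etcd member's health directly, bypassing the apiserver.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.1.165.214:2379,https://10.1.165.66:2379,https://10.1.165.240:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/node-kube-master-03.example.com.pem \
  --key=/etc/ssl/etcd/ssl/node-kube-master-03.example.com-key.pem \
  endpoint health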

I have it running as etcd3:

kubeAPIServer:
    authorizationRbacSuperUser: admin
    storageBackend: etcd3

@naveensrinivasan and is your etcd cluster an etcd3 cluster? What version is it running?

@liggitt It was running etcd2, and as part of the upgrade I had to change it to etcd3.

Did you migrate the etcd data from the etcd2 store to the etcd3 store? You cannot simply upgrade the etcd binary and switch to etcd3 mode. If you didn't do a migration, you should continue to run Kubernetes in etcd2 mode as long as you have v2 data (even against an etcd3 server).
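
For context, etcd documents an offline v2-to-v3 data migration. A rough sketch, assuming you can stop etcd on each member and that the data directory is /var/lib/etcd (both assumptions; take a backup first):

# Stop etcd, migrate the v2 keyspace into the v3 mvcc store, then restart.
systemctl stop etcd
ETCDCTL_API=3 etcdctl migrate --data-dir=/var/lib/etcd
systemctl start etcd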

Nope, I didn't migrate. I was trying to use etcd2 in kops for 1.8 and I was running into issues which made me change to etcd3.

@chrislovecnm Would kops upgrade to v1.8 without moving to etcd3?

You can continue to use etcd2 (or etcd3 in etcd2 mode) against 1.8 and 1.9

How do you use etcd2 mode with etcd3?

Run etcd3 binaries and start the kube apiserver with --storage-backend=etcd2

Kubernetes will continue to use the v2 API (which etcd3 still supports) and will have access to your old v2 data via it.
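
Concretely, that combination would look something like this in the apiserver flags (a sketch reusing the endpoints from the flags above; only the storage backend changes):

    - --etcd-servers=https://10.1.165.214:2379,https://10.1.165.66:2379,https://10.1.165.240:2379
    - --storage-backend=etcd2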

Thanks. I don't know if kops does this, or whether it is possible to do this in kops?

Yes, remove the etcd3 line in your manifest, or edit your cluster.
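
In the manifest shown earlier, that would mean something like this (a sketch; the storageBackend line is the one to drop so the apiserver falls back to etcd2 mode):

kops edit cluster k8s.playground.REDACTED.io

  kubeAPIServer:
    authorizationRbacSuperUser: admin
    # storageBackend: etcd3   <- remove this line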

I think the issue was that I was using kops from the master branch (or another version), which was messing up the whole migration. I pulled the release version of kops 1.8 and it is working. Thanks!

