Kops: Migrating to 1.8 with RBAC is incompatible

Created on 28 Dec 2017 · 31 comments · Source: kubernetes/kops

------------- BUG REPORT TEMPLATE --------------------

  1. What kops version are you running? The command kops version will display
    this information.
    Version 1.8.0 (git-4876009bd)
  2. What Kubernetes version are you running? kubectl version will print the
    version if a cluster is running or provide the Kubernetes version specified as
    a kops flag.
    v1.7.7
  3. What cloud provider are you using?
    aws
  4. What commands did you run? What is the simplest way to reproduce this issue?
    kops update cluster
  5. What happened after the commands executed?

  6. What did you expect to happen?
    Upgrade the cluster to v1.8.6

  7. Please provide your cluster manifest. Execute
    kops get --name my.example.com -oyaml to display your cluster manifest.
    You may want to remove your cluster name and other sensitive information.

  8. Please run the commands with most verbose logging by adding the -v 10 flag.
    Paste the logs into this report, or in a gist and provide the gist link here.

  9. Anything else we need to know?

  • We are trying to upgrade the cluster from v1.7.7 to v1.8.6 with RBAC turned on.
  • We used kops built from the master branch for the upgrade; kops version reports Version 1.8.0 (git-4876009bd). The apiserver then logs the following:
I1227 16:17:34.682684       7 rbac.go:116] RBAC DENY: user "kubelet" groups ["system:nodes" "system:authenticated"] cannot "create" resource "pods" in namespace "kube-system"
I1227 16:17:34.682827       7 wrap.go:42] POST /api/v1/namespaces/kube-system/pods: (352.225µs) 403 [[kubelet/v1.8.6 (linux/amd64) kubernetes/6260bb0] 127.0.0.1:32806]
I1227 16:17:34.683112       7 rbac.go:116] RBAC DENY: user "kubelet" groups ["system:nodes" "system:authenticated"] cannot "create" resource "events" in namespace "default"
I1227 16:17:34.683175       7 wrap.go:42] POST /api/v1/namespaces/default/events: (204.479µs) 403 [[kubelet/v1.8.6 (linux/amd64) kubernetes/6260bb0] 127.0.0.1:32806]
I1227 16:17:34.684278       7 rbac.go:116] RBAC DENY: user "kubelet" groups ["system:nodes" "system:authenticated"] cannot "create" resource "events" in namespace "default"
I1227 16:17:34.684381       7 wrap.go:42] POST /api/v1/namespaces/default/events: (272.221µs) 403 [[kubelet/v1.8.6 (linux/amd64) kubernetes/6260bb0] 127.0.0.1:32806]
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2017-12-26T20:42:03Z
  name: k8s.playground.REDACTED.io
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://k8s.playground.REDACTED.io/k8s.playground.REDACTED.io
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    authorizationRbacSuperUser: admin
    storageBackend: etcd3
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.8.6
  masterInternalName: api.internal.k8s.playground.REDACTED.io
  masterPublicName: api.k8s.playground.REDACTED.io
  networkCIDR: 172.20.0.0/16
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.32.0/19
    name: us-east-1a
    type: Public
    zone: us-east-1a
  - cidr: 172.20.64.0/19
    name: us-east-1b
    type: Public
    zone: us-east-1b
  - cidr: 172.20.96.0/19
    name: us-east-1c
    type: Public
    zone: us-east-1c
  topology:
    dns:
      type: Public
    masters: public
    nodes: public

We ran this YAML before migrating and it still didn't help:

kubectl get  clusterrolebinding system:node -o yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  creationTimestamp: 2017-12-26T20:53:27Z
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: system:node
  resourceVersion: "850"
  selfLink: /apis/rbac.authorization.k8s.io/v1beta1/clusterrolebindings/system%3Anode
  uid: d24fbe68-ea7e-11e7-a9e1-0201c744720e
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:nodes


All 31 comments

Do the authorization errors persist in the log after the API server has completed startup and /healthz returns a 200? Some denials during server startup are normal as the authorization cache fills.
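
For reference, a quick way to watch for that transition (a sketch assuming the insecure local port on 8080 is enabled on the master, as in the apiserver flags shown later in this thread):

# Poll the apiserver health endpoint from the master node until it
# returns 200; RBAC denials logged before that point can be startup noise.
curl -i http://127.0.0.1:8080/healthz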

It continues and the cluster is inoperable.

After upgrading, what does this show?

kubectl get clusterrolebinding system:node -o yaml
kubectl get clusterrole system:node -o yaml

@liggitt I manually ran the above yaml and it didn't help.

The api-server is unavailable after the upgrade, so all kubectl commands fail.

kubelet permissions should not affect api server availability. I'm not sure how to debug further if the api server is unreachable. Do you have more apiserver logs that might be illuminating? @chrislovecnm any ideas of what else might be at play here?

@naveensrinivasan was RBAC already configured and working when the Cluster was on v1.7.7, or did you change it in the spec as part of the upgrade?

@liggitt not sure about the addons behaviour, but if performing an upgrade from v1.7 then the necessary RoleBinding will already exist, so I suspect that isn't the issue.

@KashifSaadat RBAC was already configured and working when the cluster was v1.7.7

Here are the log files. https://gist.github.com/naveensrinivasan/80eb10aa3bd2259139b48a6a78100357

I don't know exactly when I grabbed them. These are from the master, and I grabbed all the logs:

  • api
  • controller
  • proxy
  • scheduler

I am hitting the same issue after the upgrade, following a different installation method, and I am sure the system:nodes group has the system:node role. Interestingly, it isn't just system:nodes; I see other groups, e.g. system:authenticated, affected as well.

RBAC DENY: user "system:kube-proxy" groups ["system:authenticated"] cannot "list" resource "services" cluster-wide

Following this, the API server never comes up and the kube control plane is down.

@naveensrinivasan what does apiserver /healthz show while the API server is crashlooping in that state? Do you have the full apiserver manifest used, including all flags?

I'm seeing this, which makes me suspect issues writing to etcd:

I1226 17:20:58.368013       8 trace.go:76] Trace[2144299595]: "Create /api/v1/namespaces" (started: 2017-12-26 17:20:53.848730671 +0000 UTC) (total time: 4.5192501s):
Trace[2144299595]: [4.284563321s] [4.284499666s] About to store object in database
Trace[2144299595]: [4.5192501s] [234.686779ms] END
I1226 17:20:58.368361       8 wrap.go:42] POST /api/v1/namespaces: (4.519639312s) 500

@MQasimSarfraz what is the output of a superuser in the system:masters group calling /healthz on the apiserver? RBAC denials could prevent other components from talking to the API server, but would not keep the API server from coming up. I suspect issues reading from and/or writing to etcd

@liggitt Where can I find that output? Also, the following is what I can find related to /healthz in the API server logs:

 /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:108 +0x1ca
logging error output: "[+]ping ok\n[+]etcd ok\n[+]poststarthook/generic-apiserver-start-informers ok\n[+]poststarthook/start-apiextensions-informers ok\n[+]poststarthook/start-apiextensions-controllers ok\n[-]poststarthook/bootstrap-controller failed: reason withheld\n[-]poststarthook/rbac/bootstrap-roles failed: reason withheld\n[-]poststarthook/ca-registration failed: reason withheld\n[+]poststarthook/start-kube-apiserver-informers ok\n[+]poststarthook/start-kube-aggregator-informers ok\n[+]poststarthook/apiservice-registration-controller ok\n[+]poststarthook/apiservice-status-available-controller ok\n[+]poststarthook/apiservice-openapi-controller ok\n[+]poststarthook/kube-apiserver-autoregistration ok\n[-]autoregister-completion failed: reason withheld\nhealthz check failed\n"
 [[kube-probe/1.8] 127.0.0.1:55014]

formatted better, that shows:

[+]ping ok
[+]etcd ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[-]poststarthook/bootstrap-controller failed: reason withheld
[-]poststarthook/rbac/bootstrap-roles failed: reason withheld
[-]poststarthook/ca-registration failed: reason withheld
[+]poststarthook/start-kube-apiserver-informers ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[-]autoregister-completion failed: reason withheld
healthz check failed

the details for the failed hooks are available at these URLs:

/healthz/poststarthook/bootstrap-controller
/healthz/poststarthook/rbac/bootstrap-roles
/healthz/poststarthook/ca-registration
/healthz/autoregister-completion

I can't find anything useful at those URLs:

[qasim.sarfraz@kube-master-03 ~]$ curl -i 127.0.0.1:8080/healthz/poststarthook/bootstrap-controller
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 02 Jan 2018 20:08:29 GMT
Content-Length: 36

internal server error: not finished
[qasim.sarfraz@kube-master-03 ~]$ curl -i 127.0.0.1:8080/healthz/poststarthook/rbac/bootstrap-roles
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 02 Jan 2018 20:08:42 GMT
Content-Length: 36

internal server error: not finished
[qasim.sarfraz@kube-master-03 ~]$ curl -i 127.0.0.1:8080/healthz/poststarthook/ca-registration
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 02 Jan 2018 20:08:51 GMT
Content-Length: 36

internal server error: not finished
[qasim.sarfraz@kube-master-03 ~]$ curl -i 127.0.0.1:8080/healthz/autoregister-completion
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 02 Jan 2018 20:09:02 GMT
Content-Length: 495

internal server error: missing APIService: [v1. v1.authentication.k8s.io v1.authorization.k8s.io v1.autoscaling v1.batch v1.networking.k8s.io v1.rbac.authorization.k8s.io v1.storage.k8s.io v1alpha1.admissionregistration.k8s.io v1beta1.apiextensions.k8s.io v1beta1.apps v1beta1.authentication.k8s.io v1beta1.authorization.k8s.io v1beta1.batch v1beta1.certificates.k8s.io v1beta1.extensions v1beta1.policy v1beta1.rbac.authorization.k8s.io v1beta1.storage.k8s.io v1beta2.apps v2beta1.autoscaling]

All of those point to etcd write errors/hangs. Did etcd setup change during the upgrade? What are the flags passed to the apiserver?

Ah, interesting. No, I haven't changed it, but let me check the etcd dumps. Also, the following are the flags passed to the apiserver:

    - --advertise-address=10.1.165.137
    - --etcd-servers=https://10.1.165.214:2379,https://10.1.165.66:2379,https://10.1.165.240:2379
    - --etcd-quorum-read=true
    - --etcd-cafile=/etc/ssl/etcd/ssl/ca.pem
    - --etcd-certfile=/etc/ssl/etcd/ssl/node-kube-master-03.example.com.pem
    - --etcd-keyfile=/etc/ssl/etcd/ssl/node-kube-master-03.example.com-key.pem
    - --insecure-bind-address=0.0.0.0
    - --apiserver-count=3
    - --admission-control=Initializers,NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,GenericAdmissionWebhook,ResourceQuota
    - --service-cluster-ip-range=10.234.0.0/18
    - --service-node-port-range=30000-32767
    - --client-ca-file=/etc/kubernetes/ssl/ca.pem
    - --profiling=false
    - --repair-malformed-updates=false
    - --kubelet-client-certificate=/etc/kubernetes/ssl/node-kube-master-03.example.com.pem
    - --kubelet-client-key=/etc/kubernetes/ssl/node-kube-master-03.example.com-key.pem
    - --service-account-lookup=true
    - --tls-cert-file=/etc/kubernetes/ssl/apiserver.pem
    - --tls-private-key-file=/etc/kubernetes/ssl/apiserver-key.pem
    - --proxy-client-cert-file=/etc/kubernetes/ssl/apiserver.pem
    - --proxy-client-key-file=/etc/kubernetes/ssl/apiserver-key.pem
    - --service-account-key-file=/etc/kubernetes/ssl/apiserver-key.pem
    - --secure-port=6443
    - --insecure-port=8080
    - --storage-backend=etcd3
    - --runtime-config=admissionregistration.k8s.io/v1alpha1
    - --v=2
    - --allow-privileged=true
    - --anonymous-auth=False
    - --authorization-mode=RBAC
    - --feature-gates=Initializers=true

I noticed that etcd is not set up for etcd3, by the way. Check, but I think you are still running etcd2.

You have

storageBackend: etcd3

But you are not setting the etcd version in the manifest as required.
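
If I recall the kops spec correctly, that is the version field on each etcd cluster; a sketch based on the manifest earlier in this issue (the exact version string is illustrative):

  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    name: main
    version: 3.0.17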

@liggitt thanks for the pointer; for me it was etcd. The etcd cluster was misbehaving for some reason, and everything is back to normal once I fixed it. I wonder why etcd was marked ok in the health check, and why there wasn't any logging for the etcd failure.

[+]etcd ok

Thanks again!
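
For anyone debugging the same symptom: querying etcd directly can surface problems that the apiserver's [+]etcd ok line misses. A sketch assuming etcd3 with TLS, reusing the certificate paths from the apiserver flags above (adjust endpoints and paths for your setup):

# Check each etcd member's health directly, bypassing the apiserver.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.1.165.214:2379,https://10.1.165.66:2379,https://10.1.165.240:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/node-kube-master-03.example.com.pem \
  --key=/etc/ssl/etcd/ssl/node-kube-master-03.example.com-key.pem \
  endpoint health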

I have it running as etcd3:

kubeAPIServer:
    authorizationRbacSuperUser: admin
    storageBackend: etcd3

@naveensrinivasan and is your etcd cluster an etcd3 cluster? What version is it running?

@liggitt It was running etcd2, and as part of the upgrade I had to change it to etcd3.

Did you migrate the etcd data from the etcd2 store to the etcd3 store? You cannot simply upgrade the etcd binary and switch to etcd3 mode. If you didn't do a migration, you should continue to run Kubernetes in etcd2 mode as long as you have v2 data (even against an etcd3 server).
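
For context, etcd documents an offline v2-to-v3 data migration. A rough sketch, assuming you can stop etcd on each member and that the data directory is /var/lib/etcd (both assumptions; take a backup first):

# Stop etcd, migrate the v2 keyspace into the v3 mvcc store, then restart.
systemctl stop etcd
ETCDCTL_API=3 etcdctl migrate --data-dir=/var/lib/etcd
systemctl start etcd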

Nope, I didn't migrate. I was trying to use etcd2 in kops for 1.8 and I was running into issues which made me change to etcd3.

@chrislovecnm Would kops upgrade to v1.8 without moving to etcd3?

You can continue to use etcd2 (or etcd3 in etcd2 mode) against 1.8 and 1.9

How do you use etcd2 mode with etcd3?

Run etcd3 binaries and start the kube apiserver with --storage-backend=etcd2

Kubernetes will continue to use the v2 API (which etcd3 still supports) and will have access to your old v2 data via it.
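
Concretely, that combination would look something like this in the apiserver flags (a sketch reusing the endpoints from the flags above; only the storage backend changes):

    - --etcd-servers=https://10.1.165.214:2379,https://10.1.165.66:2379,https://10.1.165.240:2379
    - --storage-backend=etcd2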

Thanks. I don't know if kops does this, or whether it is possible to do this in kops?

Yes, remove the etcd3 line in your manifest, or edit your cluster.
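
In the manifest shown earlier, that would mean something like this (a sketch; the storageBackend line is the one to drop so the apiserver falls back to etcd2 mode):

kops edit cluster k8s.playground.REDACTED.io

  kubeAPIServer:
    authorizationRbacSuperUser: admin
    # storageBackend: etcd3   <- remove this line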

I think the issue was that I was using kops from the master branch (or another version), which was messing up the whole migration. I pulled the release version of kops 1.8 and it is working. Thanks!

