Thanks for submitting an issue! Please fill in as much of the template below as
you can.
------------- BUG REPORT TEMPLATE --------------------
What kops version are you running? The command kops version will display this information.
What Kubernetes version are you running? kubectl version will print the version if a cluster is running, or provide the Kubernetes version specified as a kops flag.
What happened after the commands executed?
What did you expect to happen?
Upgrade the cluster to v1.8.6
Please provide your cluster manifest. Execute
kops get --name my.example.com -oyaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
Anything else we need to know?
Upgrading from v1.7.7 to v1.8.6 with RBAC turned on, using kops built from master to upgrade the kops version.
Version 1.8.0 (git-4876009bd)
I1227 16:17:34.682684 7 rbac.go:116] RBAC DENY: user "kubelet" groups ["system:nodes" "system:authenticated"] cannot "create" resource "pods" in namespace "kube-system"
I1227 16:17:34.682827 7 wrap.go:42] POST /api/v1/namespaces/kube-system/pods: (352.225µs) 403 [[kubelet/v1.8.6 (linux/amd64) kubernetes/6260bb0] 127.0.0.1:32806]
I1227 16:17:34.683112 7 rbac.go:116] RBAC DENY: user "kubelet" groups ["system:nodes" "system:authenticated"] cannot "create" resource "events" in namespace "default"
I1227 16:17:34.683175 7 wrap.go:42] POST /api/v1/namespaces/default/events: (204.479µs) 403 [[kubelet/v1.8.6 (linux/amd64) kubernetes/6260bb0] 127.0.0.1:32806]
I1227 16:17:34.684278 7 rbac.go:116] RBAC DENY: user "kubelet" groups ["system:nodes" "system:authenticated"] cannot "create" resource "events" in namespace "default"
I1227 16:17:34.684381 7 wrap.go:42] POST /api/v1/namespaces/default/events: (272.221µs) 403 [[kubelet/v1.8.6 (linux/amd64) kubernetes/6260bb0] 127.0.0.1:32806]
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2017-12-26T20:42:03Z
  name: k8s.playground.REDACTED.io
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://k8s.playground.REDACTED.io/k8s.playground.REDACTED.io
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1b
      name: b
    - instanceGroup: master-us-east-1c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    authorizationRbacSuperUser: admin
    storageBackend: etcd3
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.8.6
  masterInternalName: api.internal.k8s.playground.REDACTED.io
  masterPublicName: api.k8s.playground.REDACTED.io
  networkCIDR: 172.20.0.0/16
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.32.0/19
    name: us-east-1a
    type: Public
    zone: us-east-1a
  - cidr: 172.20.64.0/19
    name: us-east-1b
    type: Public
    zone: us-east-1b
  - cidr: 172.20.96.0/19
    name: us-east-1c
    type: Public
    zone: us-east-1c
  topology:
    dns:
      type: Public
    masters: public
    nodes: public
We did run this YAML before migrating, and it still didn't help.
kubectl get clusterrolebinding system:node -o yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  creationTimestamp: 2017-12-26T20:53:27Z
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: system:node
  resourceVersion: "850"
  selfLink: /apis/rbac.authorization.k8s.io/v1beta1/clusterrolebindings/system%3Anode
  uid: d24fbe68-ea7e-11e7-a9e1-0201c744720e
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:nodes
Do the authorization errors persist in the log after the api server has completed startup and /healthz returns a 200? Some denials during server startup are normal as the authorization cache fills.
It continues and the cluster is inoperable.
After upgrading, what does this show?
kubectl get clusterrolebinding system:node -o yaml
kubectl get clusterrole system:node -o yaml
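(An additional check, not from the original thread: the denial can be reproduced directly by impersonating the kubelet's identity with kubectl auth can-i. This is a sketch; the user and groups are taken from the RBAC DENY lines above, and the caller needs impersonation rights.)

# Check the exact permissions the kubelet was denied, using impersonation.
kubectl auth can-i create pods --namespace kube-system \
  --as kubelet --as-group system:nodes --as-group system:authenticated
kubectl auth can-i create events --namespace default \
  --as kubelet --as-group system:nodes --as-group system:authenticated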
I also see this: https://github.com/kubernetes/kops/blob/1ff42edfac77df99ffa617113e51dad209ae0ce8/upup/models/cloudup/resources/addons/rbac.addons.k8s.io/k8s-1.8.yaml
I'm not familiar with what kops does on upgrade with the add-on bindings.
@liggitt I manually ran the above yaml and it didn't help.
The api-server is unavailable after the upgrade, so all of the kubectl commands fail.
kubelet permissions should not affect api server availability. I'm not sure how to debug further if the api server is unreachable. Do you have more apiserver logs that might be illuminating? @chrislovecnm any ideas of what else might be at play here?
@naveensrinivasan was RBAC already configured and working when the Cluster was on v1.7.7, or did you change it in the spec as part of the upgrade?
@liggitt not sure about the addons behaviour, but if performing an upgrade from v1.7 the necessary RoleBinding will already exist, so I suspect that shouldn't be the issue.
@KashifSaadat RBAC was already configured and working when the cluster was v1.7.7
Here are the log files. https://gist.github.com/naveensrinivasan/80eb10aa3bd2259139b48a6a78100357
I don't know exactly when I grabbed them. These are from the master, and I grabbed all the logs.
I am hitting the same issue after the upgrade, following a different installation method, and I am sure the system:nodes group has the system:node role. Interestingly, it isn't just system:nodes; I see other groups, e.g. system:authenticated, affected as well.
RBAC DENY: user "system:kube-proxy" groups ["system:authenticated"] cannot "list" resource "services" cluster-wide
Following this, the API server never comes up and the kube control plane is down.
@naveensrinivasan what does apiserver /healthz show while the API server is crashlooping in that state? do you have the full apiserver manifest used, including all flags?
seeing this, which makes me suspect issues writing to etcd:
I1226 17:20:58.368013 8 trace.go:76] Trace[2144299595]: "Create /api/v1/namespaces" (started: 2017-12-26 17:20:53.848730671 +0000 UTC) (total time: 4.5192501s):
Trace[2144299595]: [4.284563321s] [4.284499666s] About to store object in database
Trace[2144299595]: [4.5192501s] [234.686779ms] END
I1226 17:20:58.368361 8 wrap.go:42] POST /api/v1/namespaces: (4.519639312s) 500
@MQasimSarfraz what is the output of a superuser in the system:masters group calling /healthz on the apiserver? RBAC denials could prevent other components from talking to the API server, but would not keep the API server from coming up. I suspect issues reading from and/or writing to etcd
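(For anyone debugging similarly: a minimal etcd health probe might look like the following. The endpoint is an assumption; kops-era etcd commonly listened on 127.0.0.1:4001 without TLS, so adjust to your deployment.)

# v2 API: overall cluster health.
etcdctl --endpoints http://127.0.0.1:4001 cluster-health

# v3 API: per-endpoint health.
ETCDCTL_API=3 etcdctl --endpoints http://127.0.0.1:4001 endpoint health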
@liggitt Where can I find that output? Also, the following is what I can find related to /healthz in the API server logs:
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:108 +0x1ca
logging error output: "[+]ping ok\n[+]etcd ok\n[+]poststarthook/generic-apiserver-start-informers ok\n[+]poststarthook/start-apiextensions-informers ok\n[+]poststarthook/start-apiextensions-controllers ok\n[-]poststarthook/bootstrap-controller failed: reason withheld\n[-]poststarthook/rbac/bootstrap-roles failed: reason withheld\n[-]poststarthook/ca-registration failed: reason withheld\n[+]poststarthook/start-kube-apiserver-informers ok\n[+]poststarthook/start-kube-aggregator-informers ok\n[+]poststarthook/apiservice-registration-controller ok\n[+]poststarthook/apiservice-status-available-controller ok\n[+]poststarthook/apiservice-openapi-controller ok\n[+]poststarthook/kube-apiserver-autoregistration ok\n[-]autoregister-completion failed: reason withheld\nhealthz check failed\n"
[[kube-probe/1.8] 127.0.0.1:55014]
formatted better, that shows:
[+]ping ok
[+]etcd ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[-]poststarthook/bootstrap-controller failed: reason withheld
[-]poststarthook/rbac/bootstrap-roles failed: reason withheld
[-]poststarthook/ca-registration failed: reason withheld
[+]poststarthook/start-kube-apiserver-informers ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[-]autoregister-completion failed: reason withheld
healthz check failed
the details for the failed hooks are available at these URLs:
/healthz/poststarthook/bootstrap-controller
/healthz/poststarthook/rbac/bootstrap-roles
/healthz/poststarthook/ca-registration
/healthz/autoregister-completion
Can't find anything useful from the URLs:
[qasim.sarfraz@kube-master-03 ~]$ curl -i 127.0.0.1:8080/healthz/poststarthook/bootstrap-controller
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 02 Jan 2018 20:08:29 GMT
Content-Length: 36
internal server error: not finished
[qasim.sarfraz@kube-master-03 ~]$ curl -i 127.0.0.1:8080/healthz/poststarthook/rbac/bootstrap-roles
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 02 Jan 2018 20:08:42 GMT
Content-Length: 36
internal server error: not finished
[qasim.sarfraz@kube-master-03 ~]$ curl -i 127.0.0.1:8080/healthz/poststarthook/ca-registration
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 02 Jan 2018 20:08:51 GMT
Content-Length: 36
internal server error: not finished
[qasim.sarfraz@kube-master-03 ~]$ curl -i 127.0.0.1:8080/healthz/autoregister-completion
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 02 Jan 2018 20:09:02 GMT
Content-Length: 495
internal server error: missing APIService: [v1. v1.authentication.k8s.io v1.authorization.k8s.io v1.autoscaling v1.batch v1.networking.k8s.io v1.rbac.authorization.k8s.io v1.storage.k8s.io v1alpha1.admissionregistration.k8s.io v1beta1.apiextensions.k8s.io v1beta1.apps v1beta1.authentication.k8s.io v1beta1.authorization.k8s.io v1beta1.batch v1beta1.certificates.k8s.io v1beta1.extensions v1beta1.policy v1beta1.rbac.authorization.k8s.io v1beta1.storage.k8s.io v1beta2.apps v2beta1.autoscaling]
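(An aside, not from the thread: since the insecure port is still answering, the APIService registration state behind that autoregister-completion failure can be inspected directly. A sketch, assuming the same local port as the curl calls above:)

# List APIService objects and their status; entries that never become
# Available explain the "missing APIService" error above.
curl -s 127.0.0.1:8080/apis/apiregistration.k8s.io/v1beta1/apiservices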
All of those point to etcd write errors/hangs. Did etcd setup change during the upgrade? What are the flags passed to the apiserver?
Ah, interesting. No, I haven't changed it, but let me check the etcd dumps. Also, the following are the flags passed to the apiserver:
- --advertise-address=10.1.165.137
- --etcd-servers=https://10.1.165.214:2379,https://10.1.165.66:2379,https://10.1.165.240:2379
- --etcd-quorum-read=true
- --etcd-cafile=/etc/ssl/etcd/ssl/ca.pem
- --etcd-certfile=/etc/ssl/etcd/ssl/node-kube-master-03.example.com.pem
- --etcd-keyfile=/etc/ssl/etcd/ssl/node-kube-master-03.example.com-key.pem
- --insecure-bind-address=0.0.0.0
- --apiserver-count=3
- --admission-control=Initializers,NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,GenericAdmissionWebhook,ResourceQuota
- --service-cluster-ip-range=10.234.0.0/18
- --service-node-port-range=30000-32767
- --client-ca-file=/etc/kubernetes/ssl/ca.pem
- --profiling=false
- --repair-malformed-updates=false
- --kubelet-client-certificate=/etc/kubernetes/ssl/node-kube-master-03.example.com.pem
- --kubelet-client-key=/etc/kubernetes/ssl/node-kube-master-03.example.com-key.pem
- --service-account-lookup=true
- --tls-cert-file=/etc/kubernetes/ssl/apiserver.pem
- --tls-private-key-file=/etc/kubernetes/ssl/apiserver-key.pem
- --proxy-client-cert-file=/etc/kubernetes/ssl/apiserver.pem
- --proxy-client-key-file=/etc/kubernetes/ssl/apiserver-key.pem
- --service-account-key-file=/etc/kubernetes/ssl/apiserver-key.pem
- --secure-port=6443
- --insecure-port=8080
- --storage-backend=etcd3
- --runtime-config=admissionregistration.k8s.io/v1alpha1
- --v=2
- --allow-privileged=true
- --anonymous-auth=False
- --authorization-mode=RBAC
- --feature-gates=Initializers=true
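(One way to test that theory is to hit each etcd member's /health endpoint with the same TLS material the apiserver is configured with. A sketch; the endpoints and cert paths are copied from the flags above, so adjust them to your own setup.)

# Probe each etcd member's /health endpoint using the apiserver's TLS files.
for ep in https://10.1.165.214:2379 https://10.1.165.66:2379 https://10.1.165.240:2379; do
  curl -s \
    --cacert /etc/ssl/etcd/ssl/ca.pem \
    --cert /etc/ssl/etcd/ssl/node-kube-master-03.example.com.pem \
    --key /etc/ssl/etcd/ssl/node-kube-master-03.example.com-key.pem \
    "$ep/health"
done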
I noticed that etcd is not set up for etcd3, by the way. Check, but I think you are still running etcd2.
You have
  storageBackend: etcd3
But you are not setting the etcd version in the manifest as required.
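(In a kops manifest that looks roughly like this; the version value is illustrative, so use the etcd3 release you actually intend to run.)

etcdClusters:
- etcdMembers:
  - instanceGroup: master-us-east-1a
    name: a
  # ...remaining members as in the manifest above
  name: main
  version: 3.0.17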
@liggitt thanks for the pointer; for me it was etcd. The etcd cluster was misbehaving for some reason, and everything is back to normal once I fixed it. I wonder why etcd was marked ok in the health check, and why there wasn't any logging for the etcd failure.
[+]etcd ok
Thanks again!
I have it running as etcd3
kubeAPIServer:
  authorizationRbacSuperUser: admin
  storageBackend: etcd3
@naveensrinivasan and is your etcd cluster an etcd3 cluster? What version is it running?
@liggitt It was running etcd2, and as part of the upgrade I had to change it to etcd3.
Did you migrate the etcd data from the etcd2 to etcd3 stores? You cannot simply upgrade the etcd binary and switch to etcd3 mode. If you didn't do a migration, you should continue to run kubernetes in etcd2 mode as long as you have v2 data (even against an etcd3 server)
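(For reference, the raw etcd step of such a migration is roughly the following. This is only a sketch, not the full Kubernetes-recommended procedure; etcd must be stopped first, and the data directory path is an assumption.)

# Back up before touching anything.
cp -a /var/lib/etcd /var/lib/etcd.bak
# Offline v2 -> v3 key migration (etcd 3.x ships this subcommand).
ETCDCTL_API=3 etcdctl migrate --data-dir /var/lib/etcd

The upstream Kubernetes etcd migration docs cover additional steps that this one command does not.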
Nope, I didn't migrate. I was trying to use etcd2 in kops for 1.8 and I was running into issues which made me change to etcd3.
@chrislovecnm Would kops upgrade to v1.8 without moving to etcd3?
You can continue to use etcd2 (or etcd3 in etcd2 mode) against 1.8 and 1.9
How do you use etcd3 in etcd2 mode?
Run etcd3 binaries and start the kube apiserver with --storage-backend=etcd2
Kubernetes will continue to use the v2 API (which etcd3 still supports) and will have access to your old v2 data via it.
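(Concretely, that combination is just a flag choice on the apiserver. The endpoint below is illustrative; keep your existing --etcd-servers value and leave all other flags unchanged.)

# etcd3 server binaries still serve the v2 API; pin the apiserver to it.
kube-apiserver \
  --storage-backend=etcd2 \
  --etcd-servers=http://127.0.0.1:4001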
Thanks. I don't know if kops does this, or whether it's possible to do this in kops.
Yes, remove the etcd3 line in your manifest. Or edit your cluster.
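(With kops that would be something like the following; the cluster name is taken from the manifest above, and a rolling update of the masters may also be needed.)

# Edit the cluster spec and delete the "storageBackend: etcd3" line, then apply.
kops edit cluster k8s.playground.REDACTED.io
kops update cluster k8s.playground.REDACTED.io --yes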
I think the issue was that I was using kops from the master branch, or another version, which messed up the whole migration. I pulled the release version of kops 1.8 and it is working. Thanks!