BUG REPORT
After upgrading a GKE cluster from 1.5.6 to 1.6.0, Prometheus stopped scraping the node /metrics endpoint due to a 401 Unauthorized error.
This is likely due to RBAC being enabled. To give Prometheus access to the node metrics I added the following ClusterRole and ClusterRoleBinding, and created a dedicated service account that is used by the pod (a sketch of that service account follows the binding below).
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""] # "" indicates the core API group
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
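For completeness, this is the shape of the dedicated service account the binding refers to, plus a sketch of how the pod picks it up so its token gets mounted (the rest of my Prometheus spec is omitted):
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
# referenced from the Prometheus pod spec so the token is mounted at
# /var/run/secrets/kubernetes.io/serviceaccount/token:
#   spec:
#     serviceAccountName: prometheus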
Although the mounted token is now the one for the _prometheus_ service account - verified at https://jwt.io/ - it can't get access to the node metrics (they're served by the kubelet, right?).
Executing the following command from within the container returns a 401 Unauthorized:
KUBE_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -sSk -H "Authorization: Bearer $KUBE_TOKEN" https://<node ip>:10250/metrics
Any tips on how to get to the bottom of this and figure out what's needed to make it work? I already discussed the issue with the Prometheus contributors in https://github.com/prometheus/prometheus/issues/2606, but since the curl doesn't work either it's probably not a Prometheus issue.
Kubernetes version
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.0", GitCommit:"fff5156092b56e6bd60fff75aad4dc9de6b6ef37", GitTreeState:"clean", BuildDate:"2017-03-28T16:36:33Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.0", GitCommit:"fff5156092b56e6bd60fff75aad4dc9de6b6ef37", GitTreeState:"clean", BuildDate:"2017-03-28T16:24:30Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Environment:
clusterIpv4Cidr: 10.248.0.0/14
createTime: '2016-11-14T19:26:49+00:00'
currentMasterVersion: 1.6.0
currentNodeCount: 14
currentNodeVersion: 1.6.0
endpoint: **REDACTED**
initialClusterVersion: 1.4.5
instanceGroupUrls:
- **REDACTED**
locations:
- europe-west1-c
loggingService: logging.googleapis.com
masterAuth:
  clientCertificate: **REDACTED**
  clientKey: **REDACTED**
  clusterCaCertificate: **REDACTED**
  password: **REDACTED**
  username: **REDACTED**
monitoringService: monitoring.googleapis.com
name: development-europe-west1-c
network: development
nodeConfig:
  diskSizeGb: 250
  imageType: COS
  machineType: n1-highmem-8
  oauthScopes:
  - https://www.googleapis.com/auth/compute
  - https://www.googleapis.com/auth/devstorage.read_only
  - https://www.googleapis.com/auth/service.management
  - https://www.googleapis.com/auth/servicecontrol
  - https://www.googleapis.com/auth/logging.write
  - https://www.googleapis.com/auth/monitoring
  serviceAccount: default
nodeIpv4CidrSize: 24
What happened:
With a ClusterRole configured I would expect to be able to scrape the /metrics endpoint on each node, but this fails with 401 Unauthorized.
What you expected to happen:
The service account token, combined with the appropriate ClusterRole, to give access to the /metrics endpoint.
How to reproduce it (as minimally and precisely as possible):
From the container in your deployment, run:
KUBE_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -sSk -H "Authorization: Bearer $KUBE_TOKEN" https://<node ip>:10250/metrics
Anything else we need to know:
This failed with the default service account as well, whereas I initially thought GKE would still be quite liberal with its access control settings.
Querying the same endpoint over _http_ to port _10255_ actually works. Any idea why there's a difference?
Could the cause be similar to https://github.com/coreos/coreos-kubernetes/issues/714 ?
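For reference, this is what I mean by the read-only port working; it serves the same metrics without any Authorization header at all (same <node ip> placeholder as above):
curl -sS http://<node ip>:10255/metrics | head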
ref: #11816
GKE doesn't enable service account token authentication to the kubelet
cc @mikedanese @cjcullen
the resources and subresources used to authorize access to the kubelet API are documented at https://kubernetes.io/docs/admin/kubelet-authentication-authorization/#kubelet-authorization
to allow all kubelet API requests, you'd need a role like the one kube-up uses:
https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/rbac/kubelet-api-admin-role.yaml
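Paraphrasing that addon (not a verbatim copy), an allow-everything kubelet API role looks roughly like this:
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: kubelet-api-admin
rules:
- apiGroups: [""]
  resources:
  - nodes/proxy
  - nodes/metrics
  - nodes/stats
  - nodes/log
  - nodes/spec
  verbs: ["*"]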
GKE doesn't enable service account token auth to the kubelet
I'm fairly certain we do...
In your ClusterRole I think
- nodes
should be
- nodes
- nodes/metrics
Your nonResourceURLs rule doesn't make sense here; nonResourceURLs only cover non-resource paths on the API server (such as the API server's own /metrics), while the kubelet's /metrics endpoint is authorized via the nodes/metrics subresource.
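Applying that suggestion, the first rule in the ClusterRole above would read (verbs unchanged):
- apiGroups: [""]
  resources:
  - nodes
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]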
Thanks, but even when creating the most permissive binding for my prometheus service account I get a 401 unauthorized when querying the kubelet /metrics endpoint with the service account token set as Bearer Token.
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: permissive-binding-prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
Running
curl -k --tlsv1.2 -H "Authorization: Bearer <service account token>" -v https://<node ip>:10250/metrics
returns
* About to connect() to **REDACTED** port 10250 (#0)
* Trying <node ip>...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Connected to **REDACTED** (**REDACTED**) port 10250 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* NSS: client certificate not found (nickname not specified)
* SSL connection using TLS_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
* subject: CN=**REDACTED**
* start date: Apr 12 17:23:15 2017 GMT
* expire date: Apr 12 17:23:15 2018 GMT
* common name: **REDACTED**
* issuer: CN=**REDACTED**
> GET /metrics HTTP/1.1
> User-Agent: curl/7.29.0
> Host: **REDACTED**:10250
> Accept: */*
> Authorization: Bearer **REDACTED**
>
< HTTP/1.1 401 Unauthorized
< Date: Thu, 13 Apr 2017 03:05:30 GMT
< Content-Length: 12
< Content-Type: text/plain; charset=utf-8
<
{ [data not shown]
100 12 100 12 0 0 59 0 --:--:-- --:--:-- --:--:-- 59
* Connection #0 to host **REDACTED** left intact
If GKE is using the GCE cluster up scripts, it isn't enabling service account token authentication:
https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/gci/configure-helper.sh#L699
to authenticate to the kubelet with API tokens, these steps would be needed (from https://kubernetes.io/docs/admin/kubelet-authentication-authorization/#kubelet-authentication):
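Summarized from those docs, a sketch of the kubelet flags involved (flag names are the 1.6-era kubelet flags; the paths are placeholders for wherever your distro keeps them):
# on each node, the kubelet would need roughly:
kubelet \
  --client-ca-file=/etc/srv/kubernetes/ca.crt \
  --authentication-token-webhook \
  --authorization-mode=Webhook \
  --kubeconfig=/var/lib/kubelet/kubeconfig
# --authentication-token-webhook verifies bearer tokens against the API
# server's TokenReview API; --authorization-mode=Webhook delegates the
# decision via SubjectAccessReview, which is where the nodes/metrics
# subresource checks come into play.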
a 401 indicates you are not authenticated. you are not even reaching the authorization stage.
By the way, when querying the API server I only see nodes, nodes/proxy and nodes/status as available (sub)resources.
No nodes/metrics, nodes/log, nodes/stats nor nodes/spec.
I'll inspect the kubelet startup params and escalate to Google.
Those virtual subresources are used by the kubelet to perform authorization checks when clients speak directly to the kubelet API, in order to allow granting access to parts of the kubelet's API; they won't show up as regular API resources in discovery.
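Roughly, per the linked docs, incoming kubelet requests map to these resource checks:
/stats/*     ->  nodes/stats
/metrics/*   ->  nodes/metrics
/logs/*      ->  nodes/log
/spec/*      ->  nodes/spec
all others   ->  nodes/proxy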
Ahhhh, ya that's not going to work. We don't plan on enabling token review API in GKE. You can either configure prometheus to pull metrics by hitting the apiserver proxy directly or you can create a client certificate using the certificates API for prometheus to use when contacting kubelets.
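A rough sketch of the client-certificate route (file names and the CN are placeholders; the signed cert must chain to a CA the kubelets trust for client auth, and the approval step may differ by version):
# 1. generate a key and CSR for prometheus
openssl genrsa -out prometheus.key 2048
openssl req -new -key prometheus.key -out prometheus.csr -subj "/CN=prometheus"
# 2. submit it via the certificates API (v1beta1 in 1.6)
cat <<EOF | kubectl create -f -
apiVersion: certificates.k8s.io/v1beta1
kind: CertificateSigningRequest
metadata:
  name: prometheus
spec:
  request: $(base64 prometheus.csr | tr -d '\n')
  usages:
  - digital signature
  - key encipherment
  - client auth
EOF
# 3. approve it and fetch the signed certificate
kubectl certificate approve prometheus
kubectl get csr prometheus -o jsonpath='{.status.certificate}' | base64 -d > prometheus.crt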
@mikedanese is there a particular reason to deviate from the default Kubernetes setup where the kubelet honors RBAC? Does it provide more security? Is it because Google takes care of the master?
@JorritSalverda the problem is the integration with google oauth. Access tokens need to have UserInfo and GroupInfo scopes in order for us to fill out the full kubernetes UserInfo object. These scopes say that google is allowed to give your email and group info out to people with this token. Generally the tokens that we see in GCP do not have this scope. It's possible that we could enable the token API for some but not all tokens.
I've described how I got this to work for Prometheus by proxying through the API server on GKE at https://github.com/prometheus/prometheus/issues/2606#issuecomment-294869099.
I'll close this ticket, because this provides a nice and future-proof way to get to those metrics. The only drawback is that it will put slightly more load on the API server.
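For anyone finding this later, a sketch of what such a proxied scrape config can look like, using the common node relabeling pattern (kubernetes.default.svc and the token/CA paths are the in-cluster defaults; the service account also needs get on the nodes/proxy subresource):
scrape_configs:
- job_name: 'kubernetes-nodes'
  scheme: https
  kubernetes_sd_configs:
  - role: node
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - target_label: __address__
    replacement: kubernetes.default.svc:443
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics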
Take a look at the following parameter in the kubelet exporter:
https://github.com/coreos/prometheus-operator/blob/master/helm/exporter-kubelets/values.yaml#L2
Hope it helps