Telegraf: Kubernetes input plugin not working (deprecated /stats/summary endpoint?)

Created on 31 Jan 2020 · 15 comments · Source: influxdata/telegraf

Relevant telegraf.conf:

[[inputs.kubernetes]]
      url = "https://kubernetes.default.svc"
      bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
      insecure_skip_verify = true

System info:

Ubuntu 18.04
k3s v1.17.2+k3s1
Telegraf image: telegraf:1.12.2

Steps to reproduce:

Configure the Kubernetes input plugin in a Telegraf container.

Expected behavior:

The plugin should collect the Kubernetes metrics.

Actual behavior:

The Telegraf plugin log shows that the Kubernetes API server returned a 403 Forbidden error code. After adding the following rules to the pod's RBAC service account:

rules:
  - nonResourceURLs: ["/stats", "/stats/*"]
    verbs: ["get", "list"]

the error becomes a 404. No metrics are collected.

Additional info:

The kube_inventory input plugin seems to be working just fine, but the kubernetes plugin cannot obtain any metrics, as described. Looking at the code, the kubernetes input plugin calls the /stats/summary Kubernetes API server endpoint.

The /stats/summary endpoint was planned to be deprecated (https://github.com/kubernetes/kubernetes/issues/68522), but it seems it has already been removed.
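
A quick way to check whether a given cluster still serves the endpoint is to ask the kubelet through the API-server proxy (a hedged sketch; NODE_NAME is a placeholder for a real node name):

# Query the kubelet's summary endpoint via the API-server proxy:
kubectl get --raw "/api/v1/nodes/NODE_NAME/proxy/stats/summary" | head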

Labels: area/k8s, documentation


All 15 comments

We should put together some documentation about what needs to be done to switch to the replacement, and any way we can smooth the transition. I could definitely use some help from the community on this.

I am assuming similar metrics can be captured with the prometheus input plugin. It would be good to gather a listing of the new metrics because switching over will likely change all metrics and break dashboards/alerts.

It also looks like it should be possible to use the --enable-cadvisor-endpoints flag to re-enable the endpoint; it would be good to describe how this can be set as well.
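
For anyone trying that route, a hedged sketch of where the flag would be set on a kubeadm-provisioned node (the file path and variable name are the common kubeadm convention, assumed here; the flag itself is taken from the comment above and its availability depends on the kubelet version):

# /etc/default/kubelet (assumed kubeadm convention)
KUBELET_EXTRA_ARGS="--enable-cadvisor-endpoints=true"
# then restart the kubelet: systemctl restart kubelet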

Hello @danielnelson, thank you for your reply. The cAdvisor endpoint support will be removed in Kubernetes 1.19 (https://github.com/kubernetes/kubernetes/issues/76660), so I would recommend using the --enable-cadvisor-endpoints flag only as a temporary fix. I think the way to go is to query the metrics-server API (https://github.com/kubernetes-sigs/metrics-server) through the standard Kubernetes API to obtain pod metrics.
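
For illustration, once metrics-server is deployed its data is served through the aggregated metrics.k8s.io group on the standard API server; a minimal sketch of querying it:

# Pod metrics via the aggregation layer (requires metrics-server):
kubectl top pods --all-namespaces
# or the raw API group it registers:
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods" | head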

@danielnelson for managed Kubernetes, I'm not sure you can ask for this flag to be added, so even as a temporary fix it won't work for many (most?) people.

@masual: so that would mean we need to deploy the metrics server first in order to use this plugin? Or should we use only the kube_inventory plugin?

I could make it work with the help of @rawkode:

As the endpoint, you need:

[[inputs.kubernetes]]
    url = "https://kubernetes.default.svc.cluster.local/api/v1/nodes/$NODE_NAME/proxy/"
    bearer_token = "/run/secrets/kubernetes.io/serviceaccount/token"
    insecure_skip_verify = true

be sure to have:

env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName

and as the ClusterRole (I use ClusterRole aggregation):

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: influx:stats:viewer
  labels:
    rbac.authorization.k8s.io/aggregate-view-telegraf-stats: "true"
rules:
  - apiGroups: [""]
    resources: ["nodes/proxy"]
    verbs: ["get", "watch", "list"]

Tested on k8s 1.17.0 on OVH K8S Managed Service
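
To sanity-check that endpoint from inside the pod before pointing Telegraf at it, something like the following should return the summary JSON (a hedged sketch reusing the token path and NODE_NAME variable from the config above):

TOKEN=$(cat /run/secrets/kubernetes.io/serviceaccount/token)
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://kubernetes.default.svc.cluster.local/api/v1/nodes/$NODE_NAME/proxy/stats/summary" | head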

... and available soon as a Helm chart for deploying Telegraf as a DaemonSet => https://github.com/influxdata/helm-charts/pull/16

I have the same problem. I followed these recommendations, but get the same error:
Error:
2020-04-03T08:38:00Z E! [inputs.kubernetes] Error in plugin: https://kubernetes/stats/summary returned HTTP status 404 Not Found

Is there any solution or another documentation to fix the problem?

I checked that I have the RBAC permissions configured; this is the output:

Name:         telegraf-cluster-reader
Labels:       rbac.authorization.k8s.io/aggregate-view-telegraf=true
              rbac.authorization.k8s.io/aggregate-view-telegraf-stats=true
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
              {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRole","metadata":{"annotations":{},"labels":{"rbac.authorization.k8s.io/aggreg...
PolicyRule:
  Resources          Non-Resource URLs  Resource Names  Verbs
  ---------          -----------------  --------------  -----
  deployments        []                 []              [get watch list]
  nodes/proxy        []                 []              [get watch list]
  nodes              []                 []              [get watch list]
  persistentvolumes  []                 []              [get watch list]
  pods               []                 []              [get watch list]
  statefulsets       []                 []              [get watch list]
                     [/stats/*]         []              [get]
                     [/stats]           []              [get]
                     [/stats/*]         []              [list]
                     [/stats]           []              [list]
                     [/stats/*]         []              [watch]
                     [/stats]           []              [watch]

I have this config applied in my YAML files:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: telegraf-reader
  namespace: default
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: telegraf-cluster-reader
  labels:
    rbac.authorization.k8s.io/aggregate-view-telegraf: "true"
    rbac.authorization.k8s.io/aggregate-view-telegraf-stats: "true"
rules:
  - nonResourceURLs: ["/stats", "/stats/*"]
    verbs: ["get", "watch", "list"]
  - apiGroups: [""]
    resources: ["persistentvolumes", "nodes", "pods", "deployments", "statefulsets", "nodes/proxy"]
    verbs: ["get", "watch", "list"]
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: telegraf-reader-role
aggregationRule:
  clusterRoleSelectors:
    - matchLabels:
        rbac.authorization.k8s.io/aggregate-view-telegraf-stats: "true"
    - matchLabels:
        rbac.authorization.k8s.io/aggregate-view-telegraf: "true"
    - matchLabels:
        rbac.authorization.k8s.io/aggregate-to-view: "true"
rules: [] # Rules are automatically filled in by the controller manager.
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: telegraf-reader-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: telegraf-reader-role
subjects:
  - kind: ServiceAccount
    name: telegraf-reader
    namespace: default

My pod uses this, plus the token via secrets applied in a ConfigMap; other plugins like kube_inventory work fine with this:

    spec:
      serviceAccountName: telegraf-reader
      containers:
        - env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName  

@jmorcar have a look at what we did for the telegraf-ds chart, as we got it working => https://github.com/influxdata/helm-charts/tree/master/charts/telegraf-ds

[[inputs.kubernetes]]
      url = "https://kubernetes.default.svc"

I think the plugin is expecting a URL to the node's API, not the API server's API. So the Telegraf container runs on every node, in a DaemonSet, configured with something like url = "https://$NODEIP:10250", with the environment variable coming from the downward API.
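
A minimal sketch of that wiring in the container spec (the variable name NODEIP is illustrative; other comments in this thread use HOSTIP):

env:
  - name: NODEIP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP  # node IP via the downward API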

I have checked just now with the node IP variable (here HOSTIP), captured via fieldPath: status.hostIP, but the answer is Forbidden:

# curl https://$HOSTIP:10250/stats/summary --header "Authorization: Bearer $TOKEN" --insecure
Forbidden (user=system:serviceaccount:default:telegraf-reader, verb=get, resource=nodes, subresource=stats)

Whereas if I use the previous command I posted, the query is permitted and returns data:

# curl https://kubernetes/stats/summary --header "Authorization: Bearer $TOKEN" --insecure
{
  "paths": [
    "/apis",
    "/apis/",
    "/apis/apiextensions.k8s.io",
    "/apis/apiextensions.k8s.io/v1",
    "/apis/apiextensions.k8s.io/v1beta1",
    "/healthz",
    "/healthz/etcd",
    "/healthz/log",
    "/healthz/ping",
    "/healthz/poststarthook/crd-informer-synced",
    "/healthz/poststarthook/generic-apiserver-start-informers",
    "/healthz/poststarthook/start-apiextensions-controllers",
    "/healthz/poststarthook/start-apiextensions-informers",
    "/livez",
    "/livez/etcd",
    "/livez/log",
    "/livez/ping",
    "/livez/poststarthook/crd-informer-synced",
    "/livez/poststarthook/generic-apiserver-start-informers",
    "/livez/poststarthook/start-apiextensions-controllers",
    "/livez/poststarthook/start-apiextensions-informers",
    "/metrics",
    "/openapi/v2",
    "/readyz",
    "/readyz/etcd",
    "/readyz/log",
    "/readyz/ping",
    "/readyz/poststarthook/crd-informer-synced",
    "/readyz/poststarthook/generic-apiserver-start-informers",
    "/readyz/poststarthook/start-apiextensions-controllers",
    "/readyz/poststarthook/start-apiextensions-informers",
    "/readyz/shutdown",
    "/version"
  ]
}

(Both queries are exec inside the container Telegraf and use the service account created in yaml definition)

To create the service account, telegraf-reader, I followed the guide posted for the kube_inventory plugin on GitHub. I checked that telegraf-reader has privileges to query resources like /api/v1/namespaces/default/pods...; for that I created the ClusterRole and role bindings.

Before that, every resource query was answered with Forbidden, but not anymore, so the URL should be the problem.

I checked "Kubernetes.default.svc" is same "kubernetes" short name, both are the ClusterIP for default to the Kubernetes cluster.

I will have to check the source code of the Telegraf kubernetes input plugin to find the exact query that returns the "404 Not Found".
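
For reference, the plugin simply appends /stats/summary to the configured url (as the issue description notes), which is why the API-server address fails while the node-proxy path works; a hedged comparison from inside the container, reusing $TOKEN from above:

# Against the API server itself: no /stats/summary handler (Telegraf logs a 404)
curl -sk -H "Authorization: Bearer $TOKEN" https://kubernetes.default.svc/stats/summary
# Through the API-server proxy to the kubelet: the real endpoint
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://kubernetes.default.svc/api/v1/nodes/$NODE_NAME/proxy/stats/summary" | head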

@jmorcar have a look at what we did for the telegraf-ds chart, as we got it working => https://github.com/influxdata/helm-charts/tree/master/charts/telegraf-ds

I can't find the ClusterRole or role binding definitions in the chart templates, so I think the deployment will hit the Forbidden error. I posted a suggestion to include this documentation in the charts, because a YAML definition referencing the service account is not sufficient if you haven't created the RBAC permissions first.

@jmorcar,

here is the role and rolebinding

The telegraf-ds chart works fine for me. Did you try it on your cluster?

Thanks! I have applied it now... and same problem:

2020-04-03T17:21:20Z E! [inputs.kubernetes] Error in plugin: https://kubernetes/stats/summary returned HTTP status 404 Not Found
2020-04-03T17:21:30Z E! [inputs.kubernetes] Error in plugin: https://kubernetes/stats/summary returned HTTP status 404 Not Found
2020-04-03T17:21:40Z E! [inputs.kubernetes] Error in plugin: https://kubernetes/stats/summary returned HTTP status 404 Not Found
2020-04-03T17:21:50Z E! [inputs.kubernetes] Error in plugin: https://kubernetes/stats/summary returned HTTP status 404 Not Found
2020-04-03T17:22:00Z E! [inputs.kubernetes] Error in plugin: https://kubernetes/stats/summary returned HTTP status 404 Not Found

@jmorcar if you are going through the Kubernetes API, you need the proxy endpoint.

It's usually best to go through the NODEIP from the downward API.

I see mentions of that above, but I couldn't work out what problem you had with that approach.

By any chance are you on GKE? They do block access to the Kubelet this way (last time I checked)

Thanks, everyone. I found the problem: I was using a Deployment definition instead of a DaemonSet. A related issue when you change to a DaemonSet, as @alanjcastonguay and @rawkode commented, is that you have to use NODEIP:10250, like this:

[[inputs.kubernetes]]
  url = "https://$HOSTIP:10250"
  bearer_token = "/run/secrets/kubernetes.io/serviceaccount/token"
  insecure_skip_verify = true

So I have switched my YAML to the official Helm chart, as @nsteinmetz recommended, because I would have had to change/add too many parameters in my own YAML. The official chart is OK: deploy it in the namespace you need and it collects all metrics.

Conclusion:
If you need to monitor a Kubernetes cluster, the best option is to deploy the official telegraf-ds Helm chart. It monitors per node from inside the cluster (deploying a Telegraf agent on each node via a DaemonSet) with a single deployment definition.

https://github.com/influxdata/helm-charts/tree/master/charts/telegraf-ds
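
For completeness, a minimal install sketch (the release name and namespace are placeholders; the chart repository URL is InfluxData's):

helm repo add influxdata https://helm.influxdata.com/
helm upgrade --install telegraf-ds influxdata/telegraf-ds --namespace monitoring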

Try creating a Service Account and ClusterRoleBinding for telegraf using the yaml configuration below. Mind the namespace.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: telegraf
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: metric-scanner-kubelet-api-admin
subjects:
- kind: ServiceAccount
  name: telegraf
  namespace: influxdb
roleRef:
  kind: ClusterRole
  name: system:kubelet-api-admin
  apiGroup: rbac.authorization.k8s.io 

I faced a similar issue; after applying this YAML, Telegraf was able to authenticate in the cluster and scrape the metrics.
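
A quick way to verify the binding took effect (hedged: the service account name and namespace are taken from the manifest above, and nodes/stats is the subresource the kubelet checks, per the Forbidden message earlier in the thread):

kubectl auth can-i get nodes/stats --as=system:serviceaccount:influxdb:telegraf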
