What did you do?
Using the Prometheus Operator I created a ServiceMonitor for etcd monitoring, together with a Service and an Endpoints object listing my etcd nodes (3 nodes). We use Tectonic (version 1.7.9) and etcd runs on the master nodes. Prometheus is on version 1.8.1.
What did you expect to see?
I expect to see etcd_server metrics in Prometheus
What did you see instead? Under which circumstances?
I cannot see those metrics in Prometheus
Environment
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.0", GitCommit:"6e937839ac04a38cac63e6a7a306c5d035fe7b0a", GitTreeState:"clean", BuildDate:"2017-09-28T22:57:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.9+coreos.0", GitCommit:"2ded8a1912d014561208d882cfcc12dfa5374f22", GitTreeState:"clean", BuildDate:"2017-10-24T13:07:42Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
  namespace: tectonic-system
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
  namespace: tectonic-system
subsets:
- addresses:
  - ip: 10.223.10.221
    nodeName: 10.223.10.221
  - ip: 10.223.10.212
    nodeName: 10.223.10.212
  - ip: 10.223.10.225
    nodeName: 10.223.10.225
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
spec:
  selector:
    matchLabels:
      k8s-app: etcd
  namespaceSelector:
    matchNames:
    - tectonic-system
  endpoints:
  - interval: 10s
    port: api
    scheme: https
    tlsConfig:
      caFile: /etc/ssl/etcd/ca.crt
      certFile: /etc/ssl/etcd/server.crt
      keyFile: /etc/ssl/etcd/server.key
      insecure_skip_verify: true
Note that customizing the Prometheus server shipped with Tectonic Monitoring is limited to the explicit configuration provided: https://coreos.com/tectonic/docs/latest/tectonic-prometheus-operator/user-guides/configuring-tectonic-monitoring.html
Note the compatibility guarantees: https://coreos.com/tectonic/docs/latest/tectonic-prometheus-operator/user-guides/update-and-compatibility-guarantees.html
As a side note, fully automated etcd monitoring with alerts and dashboarding is coming with the very next release :slightly_smiling_face: .
Hey @brancz,
thanks for your answer. And yeah, that could definitely be the culprit - though I don't know if my SSL is properly configured, as the SANs include the hostnames and not the IPs..
To circumvent the Tectonic limitations, can I create a new ConfigMap so that Prometheus reads from it and loads my etcd rules?
Ah, thanks for letting me know about the next release - when is it going to happen? Unfortunately I can't wait for too long, as etcd is very critical and I need to expose those metrics and generate alerts from them.
Cheers
Simone
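At the time of this thread, the Operator loaded recording/alerting rules from ConfigMaps matched by the Prometheus object's ruleSelector rather than from a PrometheusRule CRD. A minimal sketch of such a ConfigMap, assuming a Prometheus 2.x rule file and hypothetical names/labels (your ruleSelector must match whatever labels you choose, and file naming conventions vary by Operator version):

apiVersion: v1
kind: ConfigMap
metadata:
  name: etcd-rules             # hypothetical name
  namespace: tectonic-system
  labels:
    role: prometheus-rulefiles # must match the Prometheus object's ruleSelector
data:
  etcd.rules.yaml: |
    groups:
    - name: etcd
      rules:
      - alert: EtcdNoLeader
        expr: etcd_server_has_leader == 0
        for: 1m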
What we are going to do, and what you can do as well, is create a completely separate Prometheus server whose sole purpose is to monitor etcd. This Prometheus server runs on the master nodes, so it has network access to the etcd nodes; the worker nodes cannot reach etcd due to the firewall setup, but running Prometheus on the master nodes works around that. In the first iteration we're doing the same as you're attempting, using the same certificates as the apiserver, but in the next iteration we'll likely use the authZ available in etcd to allow Prometheus to access only the /metrics endpoint.
Thanks @brancz , but I don't see why I need to run another Prometheus server inside the k8s cluster when I have already the operator running over there - could you please clarify that?
Meanwhile I went ahead and modified my approach a bit - I created a new Prometheus object where I loaded my secrets and added a serviceMonitorSelector. I see a new secret with the job scraping rules in the Tectonic UI, but I still can't see that config loaded into Prometheus. I tried to add logLevel: Debug to my new Prometheus object to get more info, but I seem to get only info-level logs...
Thanks,
Simone
A second Prometheus object means a second Prometheus server. It's necessary because we don't want worker nodes to have network access to the etcd nodes, and the normal Prometheus instance should run on workers, as it can grow large and potentially needs to be the only thing running on a node. Basically, we did what you said you're now attempting. The secret(s) you mount into the Prometheus container have nothing to do with Prometheus itself, so it's normal that you're not seeing anything about them in the logs. The logLevel field only landed in the latest minor release, which is why it's not having any effect - the feature doesn't exist in the version you're using.
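For reference, a minimal sketch of such a dedicated Prometheus object pinned to the masters, assuming an Operator version whose Prometheus spec supports nodeSelector and tolerations (the label and taint keys below are the common upstream defaults and may differ per cluster):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: etcd-prometheus # hypothetical name
spec:
  replicas: 1
  # schedule onto masters so the firewall allows reaching etcd
  nodeSelector:
    node-role.kubernetes.io/master: ''
  tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
  serviceMonitorSelector:
    matchLabels:
      k8s-app: etcd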
Can you share the Pod yaml of the Prometheus server you see running?
Yep, I got what you mean about prometheus server right after I sent out the message :)
Ah, I forgot to mention that the cluster I am testing this on is not on AWS but on our private cloud.
This is the yaml file of Kind Prometheus:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: etcd-prometheus
  namespace: containerhosting
  labels:
    k8s-app: etcd
spec:
  logLevel: Debug
  secrets:
  - etcd-certs
  resources:
    requests:
      memory: 400Mi
  serviceMonitorSelector:
    matchLabels:
      k8s-app: etcd
and this is the yaml which defines the Service, Service Monitor and the Endpoints for etcd:
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
  namespace: containerhosting
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
  namespace: containerhosting
subsets:
- addresses:
  - ip: 10.223.10.221
    nodeName: 10.223.10.221
  - ip: 10.223.10.212
    nodeName: 10.223.10.212
  - ip: 10.223.10.225
    nodeName: 10.223.10.225
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: containerhosting
  labels:
    k8s-app: etcd
    prometheus: k8s
spec:
  selector:
    matchLabels:
      k8s-app: etcd
  endpoints:
  - interval: 10s
    port: api
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
      certFile: /etc/prometheus/secrets/etcd-certs/server.crt
      keyFile: /etc/prometheus/secrets/etcd-certs/server.key
      insecure_skip_verify: true
And this is the yaml of the prometheus pod that gets generated:
kind: Pod
apiVersion: v1
metadata:
  generateName: prometheus-etcd-prometheus-
  annotations:
    kubernetes.io/created-by: >
      {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"StatefulSet","namespace":"containerhosting","name":"prometheus-etcd-prometheus","uid":"20084a4e-fb9e-11e7-ba11-005056a22d34","apiVersion":"apps/v1beta1","resourceVersion":"14050451"}}
  selfLink: /api/v1/namespaces/containerhosting/pods/prometheus-etcd-prometheus-0
  resourceVersion: '14050507'
  name: prometheus-etcd-prometheus-0
  uid: fac21c61-fba2-11e7-a453-005056a2ad2d
  creationTimestamp: '2018-01-17T16:25:06Z'
  namespace: containerhosting
  ownerReferences:
  - apiVersion: apps/v1beta1
    kind: StatefulSet
    name: prometheus-etcd-prometheus
    uid: 20084a4e-fb9e-11e7-ba11-005056a22d34
    controller: true
    blockOwnerDeletion: true
  labels:
    app: prometheus
    controller-revision-hash: prometheus-etcd-prometheus-2723354034
    prometheus: etcd-prometheus
spec:
  restartPolicy: Always
  serviceAccountName: default
  subdomain: prometheus-operated
  schedulerName: default-scheduler
  terminationGracePeriodSeconds: 600
  nodeName: z9jo-37ko.concurasp.com
  securityContext:
    runAsUser: 1000
    runAsNonRoot: true
    fsGroup: 2000
  containers:
  - resources:
      requests:
        memory: 400Mi
    readinessProbe:
      httpGet:
        path: /-/ready
        port: web
        scheme: HTTP
      timeoutSeconds: 3
      periodSeconds: 5
      successThreshold: 1
      failureThreshold: 6
    terminationMessagePath: /dev/termination-log
    name: prometheus
    livenessProbe:
      httpGet:
        path: /-/healthy
        port: web
        scheme: HTTP
      initialDelaySeconds: 30
      timeoutSeconds: 3
      periodSeconds: 5
      successThreshold: 1
      failureThreshold: 10
    ports:
    - name: web
      containerPort: 9090
      protocol: TCP
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - name: config
      readOnly: true
      mountPath: /etc/prometheus/config
    - name: rules
      readOnly: true
      mountPath: /etc/prometheus/rules
    - name: prometheus-etcd-prometheus-db
      mountPath: /var/prometheus/data
    - name: secret-etcd-certs
      readOnly: true
      mountPath: /etc/prometheus/secrets/etcd-certs
    - name: default-token-4msmp
      readOnly: true
      mountPath: /var/run/secrets/kubernetes.io/serviceaccount
    terminationMessagePolicy: File
    image: 'quay.io/prometheus/prometheus:v2.0.0-rc.1'
    args:
    - '--config.file=/etc/prometheus/config/prometheus.yaml'
    - '--storage.tsdb.path=/var/prometheus/data'
    - '--storage.tsdb.retention=24h'
    - '--web.enable-lifecycle'
    - '--web.route-prefix=/'
  - name: prometheus-config-reloader
    image: 'quay.io/coreos/prometheus-config-reloader:v0.0.2'
    args:
    - '-reload-url=http://localhost:9090/-/reload'
    - '-config-volume-dir=/etc/prometheus/config'
    - '-rule-volume-dir=/etc/prometheus/rules'
    resources:
      limits:
        cpu: 10m
        memory: 50Mi
      requests:
        cpu: 10m
        memory: 50Mi
    volumeMounts:
    - name: config
      readOnly: true
      mountPath: /etc/prometheus/config
    - name: rules
      mountPath: /etc/prometheus/rules
    - name: default-token-4msmp
      readOnly: true
      mountPath: /var/run/secrets/kubernetes.io/serviceaccount
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    imagePullPolicy: IfNotPresent
  hostname: prometheus-etcd-prometheus-0
  serviceAccount: default
  volumes:
  - name: config
    secret:
      secretName: prometheus-etcd-prometheus
      defaultMode: 420
  - name: rules
    emptyDir:
      sizeLimit: '0'
  - name: secret-etcd-certs
    secret:
      secretName: etcd-certs
      defaultMode: 420
  - name: prometheus-etcd-prometheus-db
    emptyDir:
      sizeLimit: '0'
  - name: default-token-4msmp
    secret:
      secretName: default-token-4msmp
      defaultMode: 420
  dnsPolicy: ClusterFirst
status:
  phase: Running
  conditions:
  - type: Initialized
    status: 'True'
    lastProbeTime: null
    lastTransitionTime: '2018-01-17T16:25:06Z'
  - type: Ready
    status: 'True'
    lastProbeTime: null
    lastTransitionTime: '2018-01-17T16:25:11Z'
  - type: PodScheduled
    status: 'True'
    lastProbeTime: null
    lastTransitionTime: '2018-01-17T16:25:06Z'
  hostIP: 10.223.10.223
  podIP: 172.16.9.152
  startTime: '2018-01-17T16:25:06Z'
  containerStatuses:
  - name: prometheus
    state:
      running:
        startedAt: '2018-01-17T16:25:07Z'
    lastState: {}
    ready: true
    restartCount: 0
    image: 'quay.io/prometheus/prometheus:v2.0.0-rc.1'
    imageID: >-
      docker-pullable://quay.io/prometheus/prometheus@sha256:4728f61dd339fb3f1685feeed5b4a67a7cdb865894f64d8d72b9cb1b37e1ddce
    containerID: >-
      docker://0b343247cdc3657c420fac2ace9006665361a1172bc257aad0f92ab2d86d12cf
  - name: prometheus-config-reloader
    state:
      running:
        startedAt: '2018-01-17T16:25:08Z'
    lastState: {}
    ready: true
    restartCount: 0
    image: 'quay.io/coreos/prometheus-config-reloader:v0.0.2'
    imageID: >-
      docker-pullable://quay.io/coreos/prometheus-config-reloader@sha256:9ea705b2234f7deea22058a00ea46a303087828bde579b37255a590e8c0f4eca
    containerID: >-
      docker://4af87a16fc5fde96bd91a40fb9620be9e9ef8f70524843dc0ea76c10bd18807d
  qosClass: Burstable
And those are the logs of the prometheus pod above:
level=info ts=2018-01-17T16:25:07.640188634Z caller=main.go:216 msg="Starting prometheus" version="(version=2.0.0-rc.1, branch=HEAD, revision=5ab8834befbd92241a88976c790ace7543edcd59)"
level=info ts=2018-01-17T16:25:07.6402426Z caller=main.go:217 build_context="(go=go1.9.1, user=root@1f56dd8b6f7b, date=20171017-12:34:15)"
level=info ts=2018-01-17T16:25:07.640260065Z caller=main.go:218 host_details="(Linux 4.14.11-coreos #1 SMP Fri Jan 5 11:00:14 UTC 2018 x86_64 prometheus-etcd-prometheus-0 (none))"
level=info ts=2018-01-17T16:25:07.641303658Z caller=web.go:380 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-01-17T16:25:07.641339572Z caller=main.go:315 msg="Starting TSDB"
level=info ts=2018-01-17T16:25:07.641328581Z caller=targetmanager.go:68 component="target manager" msg="Starting target manager..."
level=info ts=2018-01-17T16:25:07.646666802Z caller=main.go:327 msg="TSDB started"
level=info ts=2018-01-17T16:25:07.64669899Z caller=main.go:394 msg="Loading configuration file" filename=/etc/prometheus/config/prometheus.yaml
level=info ts=2018-01-17T16:25:07.647848468Z caller=kubernetes.go:100 component="target manager" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-01-17T16:25:07.648406607Z caller=main.go:371 msg="Server is ready to receive requests."
level=info ts=2018-01-17T16:25:08.77870819Z caller=main.go:394 msg="Loading configuration file" filename=/etc/prometheus/config/prometheus.yaml
level=info ts=2018-01-17T16:25:08.780003161Z caller=kubernetes.go:100 component="target manager" discovery=k8s msg="Using pod service account via in-cluster config"
E0117 17:10:12.328726 7 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 172.16.9.152:42884->192.168.128.1:443: read: connection timed out
E0117 17:10:12.328742 7 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 172.16.9.152:42884->192.168.128.1:443: read: connection timed out
E0117 17:10:12.328751 7 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 172.16.9.152:42884->192.168.128.1:443: read: connection timed out
Those connection timeouts look very weird to me, and I'm seeing them for the first time...
Thanks for your help!
Ah, I forgot to mention that the cluster I am testing this on is not on AWS but on our private cloud.
That doesn't make a difference for us.
Could you try port-forwarding the prometheus server to look at the UI and see if you can see etcd targets at all?
kubectl -n <namespace> port-forward prometheus-etcd-prometheus-0 9090
There are a couple things I can think of that might be going wrong right now.
1) The pod is not on a master, so you need to specify the correct tolerations to do so.
2) The ServiceAccount doesn't have the correct RBAC roles to access the requires resources from the Kubernetes API.
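With the port-forward above running, the same information is also available from the HTTP API, which can be handy if the UI is inconvenient (jq is assumed to be installed):

# list scrape URL, health, and last scrape error per target
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {scrapeUrl, health, lastError}'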
The pod is not running on a master for sure.. what is the toleration you're talking about?
I did the port forwarding and I see a surprising thing:
etcd-k8s (0/3 up)
| Endpoint | State | Labels | Last Scrape | Error |
| -- | -- | -- | -- | -- |
| https://10.223.10.212:2379/metrics | DOWN | endpoint="api" instance="10.223.10.212:2379" namespace="containerhosting" service="etcd-k8s" | 6.652s ago | Get https://10.223.10.212:2379/metrics: x509: certificate is valid for 127.0.0.1, 192.168.128.15, 192.168.128.20, not 10.223.10.212 |
| https://10.223.10.221:2379/metrics | DOWN | endpoint="api" instance="10.223.10.221:2379" namespace="containerhosting" service="etcd-k8s" | 1.165s ago | Get https://10.223.10.221:2379/metrics: x509: certificate is valid for 127.0.0.1, 192.168.128.15, 192.168.128.20, not 10.223.10.221 |
| https://10.223.10.225:2379/metrics | DOWN | endpoint="api" instance="10.223.10.225:2379" namespace="containerhosting" service="etcd-k8s" | 1.531s ago | Get https://10.223.10.225:2379/metrics: x509: certificate is valid for 127.0.0.1, 192.168.128.15, 192.168.128.20, not 10.223.10.225 |
If I look at the certificate the SANs are:
X509v3 Subject Alternative Name:
DNS:yr25-49k4.concurasp.com, DNS:x3ow-djm8.concurasp.com, DNS:yr25-e974.concurasp.com, DNS:localhost, DNS:*.kube-etcd.kube-system.svc.cluster.local, DNS:kube-etcd-client.kube-system.svc.cluster.local, IP Address:127.0.0.1, IP Address:192.168.128.15, IP Address:192.168.128.20
Can I use those IPs instead?
Cheers,
Simone
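As an aside, one way to inspect which SANs an etcd serving certificate actually presents is openssl; a sketch, assuming the cert/key paths from the ServiceMonitor above and that openssl is available on the host:

echo | openssl s_client -connect 10.223.10.221:2379 \
  -CAfile /etc/ssl/etcd/ca.crt \
  -cert /etc/ssl/etcd/server.crt -key /etc/ssl/etcd/server.key 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'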
Ok I tried to put the 192.168.x.y addresses and now I get the following:
| Endpoint | State | Labels | Last Scrape | Error |
| -- | -- | -- | -- | -- |
| https://192.168.128.15:2379/metrics | DOWN | endpoint="api" instance="192.168.128.15:2379" namespace="containerhosting" service="etcd-k8s" | 12.205s ago | context deadline exceeded |
| https://192.168.128.20:2379/metrics | DOWN | endpoint="api" instance="192.168.128.20:2379" namespace="containerhosting" service="etcd-k8s" | 17.299s ago | context deadline exceeded |
I got it working!! I had to add the following in the ServiceMonitor definition:
tlsConfig:
  serverName: kube-etcd-client.kube-system.svc.cluster.local
and I used the node IPs (10.223.x.y) - all the targets are up and running.
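Putting the fix together, the working endpoint config looked roughly like this - a sketch; serverName just has to match one of the SANs in the certificate:

endpoints:
- interval: 10s
  port: api
  scheme: https
  tlsConfig:
    caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
    certFile: /etc/prometheus/secrets/etcd-certs/server.crt
    keyFile: /etc/prometheus/secrets/etcd-certs/server.key
    serverName: kube-etcd-client.kube-system.svc.cluster.local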
One question: where does Prometheus store its metrics? I restarted the pods and the old metrics are gone, so I think they are on a non-persistent volume on the pod itself.
Thanks a lot for your help,
Simone
Glad you got it working, what you did is exactly what I would have suggested, and is what we're doing as well. The tolerations we use are:
tolerations:
- effect: NoSchedule
  key: node-role.kubernetes.io/master
  operator: Exists
By default the TSDB data uses an emptyDir volume, but you can configure any storage means you want. See the docs on storage for that.
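A minimal sketch of persistent storage on a Prometheus object, assuming an Operator version that supports volumeClaimTemplate in spec.storage, with a hypothetical storage class named standard:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: etcd-prometheus
spec:
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard # hypothetical class name
        resources:
          requests:
            storage: 40Gi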
Hey @brancz, I have a question about the etcd metrics exposed. We had an issue (in our non-production environment, luckily) where the database compaction didn't work; therefore the etcd quota capacity was reached and the cluster became non-operational. I was wondering whether there is a metric which exposes the available quota, and anything that tells us about the last database compaction.
Thanks a lot,
Simone
@simox-83 There are, although currently part of the etcd_debugging namespace of metrics exposed. The debugging namespace of metrics is meant for currently unstable metrics, meaning these metrics can break in any way in any upcoming release. Nevertheless they still may be useful. Some you could have a look at:
- etcd_debugging_mvcc_db_total_size_in_bytes
- etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds
- etcd_debugging_mvcc_db_compaction_total_duration_milliseconds
- etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds
If none of those seem useful, it's probably a good idea to take this upstream and discuss what kind of metrics we could introduce to monitor this better.
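As an illustration, a Prometheus 2.x alerting rule on database size relative to the quota could look like the sketch below; 2 GiB is etcd's default backend quota, so adjust the threshold if you set --quota-backend-bytes, and remember the metric lives in the unstable etcd_debugging namespace:

groups:
- name: etcd-capacity
  rules:
  - alert: EtcdDatabaseNearQuota
    # 0.8 * 2 GiB (2147483648 bytes), etcd's default quota
    expr: etcd_debugging_mvcc_db_total_size_in_bytes > 0.8 * 2147483648
    for: 10m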
Any idea when that next release with fully automated etcd monitoring and alerting will be ready? I'm thinking about setting up an nginx reverse-proxy daemonset scheduled on the masters to get access to the metrics endpoint from the worker nodes. If this is coming very soon, it would save me some work :-)
@erkolson due to the security-sensitive nature, there won't be any way to unify this. You need to make sure your Prometheus server has access to scrape etcd. The discrepancy across setups is too big to unify.
@simox-83 @brancz
I followed the above steps to set up etcd monitoring on my bare-metal cluster created with kubespray. I had issues with service account permissions for the second Prometheus setup, so I granted them. But I still don't see any targets listed when I port-forward and visit the UI.
This is my prometheus yaml:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: etcd-prometheus
  namespace: monitoring
  labels:
    k8s-app: etcd
spec:
  logLevel: debug
  secrets:
  - etcd-certs
  resources:
    requests:
      memory: 400Mi
  serviceMonitorSelector:
    matchLabels:
      k8s-app: etcd
This is the yaml for the Service, ServiceMonitor and Endpoints:
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
  namespace: monitoring
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
  namespace: monitoring
subsets:
- addresses:
  - ip: 10.140.81.54
    nodeName: 10.140.81.54
  - ip: 10.140.81.39
    nodeName: 10.140.81.39
  - ip: 10.140.81.62
    nodeName: 10.140.81.62
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd
    prometheus: k8s
spec:
  selector:
    matchLabels:
      k8s-app: etcd
  endpoints:
  - interval: 10s
    port: api
    scheme: https
    tlsConfig:
      caFile: /etc/ssl/etcd/ssl/ca.pem
      certFile: /etc/ssl/etcd/ssl/admin-k8s-master.pem
      keyFile: /etc/ssl/etcd/ssl/admin-k8s-master-key.pem
      # use insecureSkipVerify only if you cannot use a Subject Alternative Name
      # insecureSkipVerify: true
      serverName: etcd.kube-system.svc.cluster.local
This looks like wrong indentation:
selector:
matchLabels:
k8s-app: etcd
We are in the process of finishing everything for the next release, which will include CRD validations - then these kinds of problems won't be able to occur anymore! :slightly_smiling_face:
@brancz I will try and fix that. When is the next release anticipated?
Thanks a lot!
I can't say an exact date, but I'm expecting within the next two weeks.
So, after successfully setting up etcd monitoring, I wanted to use my custom Prometheus to scrape the node-exporter metrics. After hours of troubleshooting (the nodes are not listed in the targets as they should be), I found out that if I run wget http://10.223.10.212:9100/metrics from my custom Prometheus pod, I get a nice permission denied. If I do the same from the official Tectonic Prometheus pod, I get the metrics with no problem (and indeed the nodes are in the targets over there). Any thoughts on this? Please note that I created a ServiceMonitor which points directly to the built-in Tectonic node-exporter service. Here is the yaml:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ch-node-exporter # node exporter for custom Prometheus ch-prom
  namespace: {{ .Values.namespace}}
  labels:
    k8s-app: {{ .Values.k8s_app}}
    prometheus: ch-prometheus
spec:
  selector:
    matchLabels:
      k8s-app: node-exporter
  endpoints:
  - interval: 30s
    port: http-metrics
and here is my custom Prometheus object:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: ch-prometheus
  namespace: {{ .Values.namespace}}
spec:
  version: {{ .Values.version}}
  replicas: 2
  secrets:
  - etcd-certs
  serviceAccountName: ch-monitoring
  serviceMonitorSelector:
    matchLabels:
      k8s-app: {{ .Values.k8s_app}}
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: tectonic-system
      pathPrefix: ''
      port: web
      scheme: ''
  ruleSelector:
    matchLabels:
      k8s-app: {{ .Values.k8s_app}}
      prometheus: ch-prometheus
  resources:
    requests:
      memory: {{ .Values.resources.requests.memory}}
      cpu: {{ .Values.resources.requests.cpu}}
    limits:
      memory: {{ .Values.resources.limits.memory}}
      cpu: {{ .Values.resources.limits.cpu}}
Thanks a lot
@brancz any news on this? I still get this from a prometheus pod:
/prometheus $ wget http://10.223.10.223:9100/metrics
Connecting to 10.223.10.223:9100 (10.223.10.223:9100)
wget: can't open 'metrics': Permission denied
I looked at the RBAC policies and gave the service account full permission - I set up a ClusterRole for this. I suspect it might be the native node-exporter that ships with Tectonic.. I hope I don't have to deploy a separate one.
Thanks,
Simone
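One thing worth ruling out here: BusyBox wget saves the response to a file named metrics in the current working directory, and the Prometheus container runs as a non-root user whose working directory is typically not writable, so "wget: can't open 'metrics': Permission denied" can be a local file-write error rather than an HTTP failure. Writing to stdout sidesteps that:

wget -O- http://10.223.10.223:9100/metrics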
What you are looking for is most likely this permission: https://github.com/coreos/prometheus-operator/blob/master/contrib/kube-prometheus/manifests/prometheus/prometheus-k8s-roles.yaml#L54
That's what the node-exporter checks for.
@brancz thanks for the answer. I found that nonResourceURLs entry last Friday and included it in my ClusterRole, but it didn't help. Here is my roles config:
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: ch-monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ch-monitoring
subjects:
- kind: ServiceAccount
  name: ch-monitoring
  namespace: containerhosting
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: ch-monitoring
rules:
- apiGroups: [""]
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- apiGroups: [""]
  resources:
  - nodes/metrics
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
I still can't see the nodes as targets.. is there any way I can get more logs out of the prometheus pods? That wget command just says permission denied; I actually want to check what happens when the pod tries to scrape the node-exporter targets.
Cheers,
Simone
I'm not exactly sure if the version you have supports this, but there is the logLevel field.
I am on 2.1.0 and it doesn't seem to support that field. Is there anything I could look into to troubleshoot this more?
What you can do is set pause: true in the Prometheus object, then you can modify the underlying StatefulSet to set the --log-level flag yourself.
It doesn't seem to work; as soon as I add the log level in the StatefulSet, it is immediately reverted.
sorry, that's paused: true in the prometheus object (https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#prometheusspec)
It doesn't seem to like the logLevel flag. I tried --log-level=debug, --loglevel=debug, and --logLevel=debug but the container crashes.
Excerpt from prometheus -h:
--log.level=info Only log messages with the given severity or above. One of: [debug, info, warn, error]
Sorry once again, it was --log.level.
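For reference, the whole workaround as commands - a sketch, with resource names taken from the etcd-prometheus example above:

# stop the Operator from reconciling the StatefulSet
kubectl -n containerhosting patch prometheus etcd-prometheus \
  --type merge -p '{"spec":{"paused":true}}'
# then add --log.level=debug to the prometheus container args
kubectl -n containerhosting edit statefulset prometheus-etcd-prometheus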
Thanks, with the --log.level flag it works :) I am looking at the logs; it seems Prometheus tries to scrape the node-exporter endpoints and I don't see any error. Here is an extract of the logs:
model.LabelSet{\"__meta_kubernetes_pod_host_ip\":\"10.223.11.17\", \"__meta_kubernetes_pod_node_name\":\"y58b-oggj.concurasp.com\", \"__meta_kubernetes_pod_ip\":\"10.223.11.17\", \"__meta_kubernetes_pod_container_port_number\":\"9100\", \"__meta_kubernetes_pod_container_port_protocol\":\"TCP\", \"__address__\":\"10.223.11.17:9100\", \"__meta_kubernetes_pod_name\":\"node-exporter-352rt\", \"__meta_kubernetes_pod_uid\":\"76102591-dc15-11e7-9fea-005056a23d76\", \"__meta_kubernetes_pod_annotation_kubernetes_io_created_by\":\"{\\\"kind\\\":\\\"SerializedReference\\\",\\\"apiVersion\\\":\\\"v1\\\",\\\"reference\\\":{\\\"kind\\\":\\\"DaemonSet\\\",\\\"namespace\\\":\\\"tectonic-system\\\",\\\"name\\\":\\\"node-exporter\\\",\\\"uid\\\":\\\"9b43a817-da8a-11e7-af0d-005056a2ad2d\\\",\\\"apiVersion\\\":\\\"extensions\\\",\\\"resourceVersion\\\":\\\"658933\\\"}}\\n\", \"__meta_kubernetes_pod_container_name\":\"node-exporter\", \"__meta_kubernetes_pod_container_port_name\":\"scrape\", \"__meta_kubernetes_pod_label_pod_template_generation\":\"2\", \"__meta_kubernetes_endpoint_port_name\":\"http-metrics\", \"__meta_kubernetes_endpoint_port_protocol\":\"TCP\", \"__meta_kubernetes_endpoint_ready\":\"true\", \"__meta_kubernetes_pod_ready\":\"true\", \"__meta_kubernetes_pod_label_controller_revision_hash\":\"2231131536\", \"__meta_kubernetes_pod_label_k8s_app\":\"node-exporter\"}}, Labels:model.LabelSet{\"__meta_kubernetes_service_name\":\"node-exporter\", \"__meta_kubernetes_service_label_k8s_app\":\"node-exporter\", \"__meta_kubernetes_namespace\":\"tectonic-system\", \"__meta_kubernetes_endpoints_name\":\"node-exporter\"}, Source:\"endpoints/tectonic-system/node-exporter\"}"
I am not sure if I need to deploy my own Endpoints in the containerhosting namespace, but I assume not, because the service account used can get endpoints, so it can use the Tectonic ones. I still see the wget call returning permission denied though.
@brancz I did another test. I used the token of the ch-monitoring service account and exposed one of the node-exporter pods (node-exporter-352rt) using kubectl proxy. I then ran curl 127.0.0.1:8001/metrics and got the metrics printed on screen. So this should exclude the roles associated with my account as the root cause of the issue. Now I really don't know what else it could be; I am still getting the permission denied when running this from the prometheus pod. How exactly does the prometheus pod get /metrics? Is there any other service account or anything else I should look into?
Thanks a lot,
Simone
Can you share the Pod manifest of one of the node-exporter pods? Something along the path of the request is denying it, and I'm not sure what it could be here (maybe some firewall issue?). But with a look at the node-exporter Pod I just want to make sure there is nothing on the node-exporter side that might be doing that.
@brancz here you are:
kind: Pod
apiVersion: v1
metadata:
  generateName: node-exporter-
  annotations:
    kubernetes.io/created-by: >
      {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"DaemonSet","namespace":"tectonic-system","name":"node-exporter","uid":"9b43a817-da8a-11e7-af0d-005056a2ad2d","apiVersion":"extensions","resourceVersion":"658933"}}
  selfLink: /api/v1/namespaces/tectonic-system/pods/node-exporter-352rt
  resourceVersion: '27431649'
  name: node-exporter-352rt
  uid: 76102591-dc15-11e7-9fea-005056a23d76
  creationTimestamp: '2017-12-08T12:43:59Z'
  namespace: tectonic-system
  ownerReferences:
  - apiVersion: extensions/v1beta1
    kind: DaemonSet
    name: node-exporter
    uid: 9b43a817-da8a-11e7-af0d-005056a2ad2d
    controller: true
    blockOwnerDeletion: true
  labels:
    controller-revision-hash: '2231131536'
    k8s-app: node-exporter
    pod-template-generation: '2'
spec:
  restartPolicy: Always
  serviceAccountName: default
  hostPID: true
  schedulerName: default-scheduler
  hostNetwork: true
  terminationGracePeriodSeconds: 30
  nodeName: y58b-oggj.concurasp.com
  securityContext: {}
  containers:
  - resources:
      limits:
        cpu: 200m
        memory: 50Mi
      requests:
        cpu: 100m
        memory: 30Mi
    terminationMessagePath: /dev/termination-log
    name: node-exporter
    ports:
    - name: scrape
      hostPort: 9100
      containerPort: 9100
      protocol: TCP
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - name: proc
      readOnly: true
      mountPath: /host/proc
    - name: sys
      readOnly: true
      mountPath: /host/sys
    - name: default-token-v5x6r
      readOnly: true
      mountPath: /var/run/secrets/kubernetes.io/serviceaccount
    terminationMessagePolicy: File
    image: 'quay.io/prometheus/node-exporter:v0.15.0'
    args:
    - '--path.procfs=/host/proc'
    - '--path.sysfs=/host/sys'
  serviceAccount: default
  volumes:
  - name: proc
    hostPath:
      path: /proc
  - name: sys
    hostPath:
      path: /sys
  - name: default-token-v5x6r
    secret:
      secretName: default-token-v5x6r
      defaultMode: 420
  dnsPolicy: ClusterFirst
  tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
  - key: node.alpha.kubernetes.io/notReady
    operator: Exists
    effect: NoExecute
  - key: node.alpha.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
status:
  phase: Running
  conditions:
  - type: Initialized
    status: 'True'
    lastProbeTime: null
    lastTransitionTime: '2017-12-08T12:43:59Z'
  - type: Ready
    status: 'False'
    lastProbeTime: null
    lastTransitionTime: '2018-01-17T17:28:02Z'
  - type: PodScheduled
    status: 'True'
    lastProbeTime: null
    lastTransitionTime: '2017-12-08T12:44:04Z'
  hostIP: 10.223.11.17
  podIP: 10.223.11.17
  startTime: '2017-12-08T12:43:59Z'
  containerStatuses:
  - name: node-exporter
    state:
      running:
        startedAt: '2018-01-17T17:28:00Z'
    lastState:
      terminated:
        exitCode: 255
        reason: Error
        startedAt: '2017-12-15T14:46:31Z'
        finishedAt: '2018-01-17T17:25:29Z'
        containerID: >-
          docker://b3a2e82039d1430a7b59b4d4ae020f0f233f1155751180040a91fb426b582868
    ready: true
    restartCount: 4
    image: 'quay.io/prometheus/node-exporter:v0.15.0'
    imageID: >-
      docker-pullable://quay.io/prometheus/node-exporter@sha256:0b1053a5416a3346aff000c86b268353728804bac7ffa1071ea3da9dde02af1d
    containerID: >-
      docker://6573d505b1b42b620268a40232a8776a4a9561c41ad50adf57c3817ebb85eac2
  qosClass: Burstable
I honestly don't think it's a firewall issue; the Tectonic Prometheus can scrape the metrics properly and they are all in the same network.. hope this helps.
Cheers,
Simone
@brancz actually, I am not sure this is an issue. I have the etcd metrics properly scraped and visible as targets in the Prometheus UI; however, if I run wget https://10.223.10.212:2379/metrics I get the exact message wget: can't open 'metrics': Permission denied. As etcd is properly working, I believe this is a misleading error. But at this point I am out of ideas and don't really know what else to check.
Thanks,
Simone
@brancz I got it working.. somehow :) One of the things I missed was the namespaceSelector in the ServiceMonitor definition. I will continue tomorrow and polish what I have done so I have a clear idea of what was broken.
Thanks for your help!
At the beginning I was puzzled by the tlsConfig field of the etcd-k8s ServiceMonitor; after several attempts I found the way.
It works:
tlsConfig:
  caFile: /etc/prometheus/secrets/etcd-certs/etcd-ca.crt
  certFile: /etc/prometheus/secrets/etcd-certs/etcd.crt
  insecureSkipVerify: true
  keyFile: /etc/prometheus/secrets/etcd-certs/etcd.key
According to the Monitoring external etcd document, I created an etcd secret in the monitoring namespace, then used a kubectl replace .... command to update the Prometheus CRD. If you can't be sure where your etcd CA file, cert, and key are stored, you can use the Kubernetes secret mechanism. You can run kubectl exec -ti -n monitoring prometheus-k8s-0 /bin/sh and then ls /etc/prometheus/secrets/etcd-certs/ to list all of the etcd cert files:
etcd-ca.crt etcd.crt etcd.key
You can then put these filenames into the matching fields of tlsConfig in the etcd-k8s ServiceMonitor.
Good luck!
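For completeness, the secret referenced above can be created from the certificate files like this - a sketch, assuming the files sit in the current directory:

kubectl -n monitoring create secret generic etcd-certs \
  --from-file=etcd-ca.crt \
  --from-file=etcd.crt \
  --from-file=etcd.key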
Thanks man,
I actually figured this out at some point, but forgot to give an update.
Cheers,
Simone
Would you like to add this to the docs and create a PR? :)
@simox-83 thank you, that was a trouble spot for me and you gave the solution and direction.
@brancz I committed a PR. ...
What we are going to do, and what you can do as well, is create a completely separate Prometheus server whose sole purpose is to monitor etcd. This Prometheus server runs on the master nodes, so it has network access to the etcd nodes; the worker nodes cannot reach etcd due to the firewall setup, but running Prometheus on the master nodes works around that. In the first iteration we're doing the same as you're attempting, using the same certificates as the apiserver, but in the next iteration we'll likely use the authZ available in etcd to allow Prometheus to access only the /metrics endpoint.
@brancz should the operator support launching a Prometheus instance on the master nodes just to scrape etcd metrics?
What about running an HTTP proxy on the master nodes that forwards traffic to the etcd metrics endpoint?
The Prometheus Operator already gives you all the tools you need to perform any of the configurations mentioned above; it's up to you and the configuration of your cluster to decide the best way to do this. Ideally I recommend having no additional indirection such as a proxy, as this often leads to issues.
Tectonic shipped etcd monitoring in 1.8.7, so I'm closing this.
I've set up the ServiceMonitor in the following way:
serviceMonitor:
  scheme: https
  insecureSkipVerify: true
  caFile: /etc/prometheus/secrets/etcd-client-cert/ca.crt
  certFile: /etc/prometheus/secrets/etcd-client-cert/etcd-client.pem
  keyFile: /etc/prometheus/secrets/etcd-client-cert/etcd-client-key.pem
however in the targets page I get error messages like this:
Get https://172.20.124.83:4001/metrics: x509: certificate is valid for 127.0.0.1, not 172.20.124.83
I've created the cluster with kops and I'm running k8s 1.11.5. I took the TLS files from the /srv/kubernetes folder on the master.
@mazzy89 I have the same issue using kops, please paste your config, that would help me a lot!
@unclebob2013 my config is the one you see above. For me it worked once I correctly copied the etcd client keys and created the etcd-client-cert secret under the Prometheus config.
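For anyone following along on kops, a sketch of creating that secret from the files copied off a master; the monitoring namespace here is hypothetical (use the namespace your Prometheus runs in) and the exact paths under /srv/kubernetes vary by kops version:

kubectl -n monitoring create secret generic etcd-client-cert \
  --from-file=ca.crt \
  --from-file=etcd-client.pem \
  --from-file=etcd-client-key.pem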