What did you do?
Using the Prometheus Operator I created a ServiceMonitor for etcd monitoring, together with a Service and an Endpoints object listing my etcd nodes (3 nodes). We use Tectonic (version 1.7.9) and etcd runs on the master nodes. Prometheus is on version 1.8.1.
What did you expect to see?
I expect to see etcd_server metrics in Prometheus
What did you see instead? Under which circumstances?
I cannot see those metrics in Prometheus
Environment
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.0", GitCommit:"6e937839ac04a38cac63e6a7a306c5d035fe7b0a", GitTreeState:"clean", BuildDate:"2017-09-28T22:57:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.9+coreos.0", GitCommit:"2ded8a1912d014561208d882cfcc12dfa5374f22", GitTreeState:"clean", BuildDate:"2017-10-24T13:07:42Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
  namespace: tectonic-system
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
  namespace: tectonic-system
subsets:
- addresses:
  - ip: 10.223.10.221
    nodeName: 10.223.10.221
  - ip: 10.223.10.212
    nodeName: 10.223.10.212
  - ip: 10.223.10.225
    nodeName: 10.223.10.225
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
spec:
  selector:
    matchLabels:
      k8s-app: etcd
  namespaceSelector:
    matchNames:
    - tectonic-system
  endpoints:
  - interval: 10s
    port: api
    scheme: https
    tlsConfig:
      caFile: /etc/ssl/etcd/ca.crt
      certFile: /etc/ssl/etcd/server.crt
      keyFile: /etc/ssl/etcd/server.key
      insecure_skip_verify: true
Note that customizing the Prometheus server shipped with Tectonic Monitoring is limited to the explicit configuration provided: https://coreos.com/tectonic/docs/latest/tectonic-prometheus-operator/user-guides/configuring-tectonic-monitoring.html
Note the compatibility guarantees: https://coreos.com/tectonic/docs/latest/tectonic-prometheus-operator/user-guides/update-and-compatibility-guarantees.html
As a side note, fully automated etcd monitoring with alerts and dashboarding is coming with the very next release :slightly_smiling_face: .
Hey @brancz,
thanks for your answer. And yeah, that could definitely be the culprit - though I don't know if my SSL is properly configured, as the SANs include the hostnames and not the IPs..
To circumvent the Tectonic limitations, can I create a new ConfigMap so that Prometheus reads from it and loads my etcd rules?
Ah, thanks for letting me know about the next release - when is it going to happen? Unfortunately I can't wait for too long, as etcd is very critical and I need to expose those metrics and generate alerts from them.
Cheers
Simone
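At the time of this thread, the Operator loaded recording/alerting rules from ConfigMaps matched by the Prometheus object's ruleSelector rather than from a PrometheusRule CRD. A minimal sketch of such a ConfigMap, assuming a Prometheus 2.x rule file and hypothetical names/labels (your ruleSelector must match whatever labels you choose, and file naming conventions vary by Operator version):

apiVersion: v1
kind: ConfigMap
metadata:
  name: etcd-rules             # hypothetical name
  namespace: tectonic-system
  labels:
    role: prometheus-rulefiles # must match the Prometheus object's ruleSelector
data:
  etcd.rules.yaml: |
    groups:
    - name: etcd
      rules:
      - alert: EtcdNoLeader
        expr: etcd_server_has_leader == 0
        for: 1m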
What we are going to do, and what you can do as well, is create a completely separate Prometheus server whose sole purpose is to monitor etcd. This Prometheus server runs on the master nodes, so it has network access to the etcd nodes; the worker nodes cannot reach etcd due to the firewall setup, but running Prometheus on the master nodes works around that. In the first iteration we're doing the same as you're attempting, using the same certificates as the apiserver, but in the next iteration we'll likely use the authZ available in etcd to allow Prometheus to access only the /metrics endpoint.
Thanks @brancz , but I don't see why I need to run another Prometheus server inside the k8s cluster when I have already the operator running over there - could you please clarify that?
Meanwhile I went ahead and modified my approach a bit - I created a new Prometheus object where I loaded my secrets and added a serviceMonitorSelector. I see a new secret with the job scraping rules in the Tectonic UI, but I still can't see that config loaded into Prometheus. I tried to add logLevel: Debug to my new Prometheus object to get more info, but I seem to get only info-level logs...
Thanks,
Simone
A second Prometheus object means a second Prometheus server. It's necessary because we don't want worker nodes to have network access to the etcd nodes, and the normal Prometheus instance should run on workers, as it can grow large and potentially needs to be the only thing running on a node. Basically, we did what you said you're now attempting. The secret(s) you mount into the Prometheus container have nothing to do with Prometheus itself, so it's normal that you're not seeing anything about them in the logs. The logLevel field only landed in the latest minor release, which is why it's not having any effect - the feature doesn't exist in the version you're using.
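For reference, a minimal sketch of such a dedicated Prometheus object pinned to the masters, assuming an Operator version whose Prometheus spec supports nodeSelector and tolerations (the label and taint keys below are the common upstream defaults and may differ per cluster):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: etcd-prometheus # hypothetical name
spec:
  replicas: 1
  # schedule onto masters so the firewall allows reaching etcd
  nodeSelector:
    node-role.kubernetes.io/master: ''
  tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
  serviceMonitorSelector:
    matchLabels:
      k8s-app: etcd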
Can you share the Pod yaml of the Prometheus server you see running?
Yep, I got what you mean about prometheus server right after I sent out the message :)
Ah, I forgot to mention that the cluster I am testing this on is not on AWS but on our private cloud.
This is the yaml file of Kind Prometheus:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: etcd-prometheus
  namespace: containerhosting
  labels:
    k8s-app: etcd
spec:
  logLevel: Debug
  secrets:
  - etcd-certs
  resources:
    requests:
      memory: 400Mi
  serviceMonitorSelector:
    matchLabels:
      k8s-app: etcd
and this is the yaml which defines the Service, Service Monitor and the Endpoints for etcd:
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
  namespace: containerhosting
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
  namespace: containerhosting
subsets:
- addresses:
  - ip: 10.223.10.221
    nodeName: 10.223.10.221
  - ip: 10.223.10.212
    nodeName: 10.223.10.212
  - ip: 10.223.10.225
    nodeName: 10.223.10.225
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: containerhosting
  labels:
    k8s-app: etcd
    prometheus: k8s
spec:
  selector:
    matchLabels:
      k8s-app: etcd
  endpoints:
  - interval: 10s
    port: api
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
      certFile: /etc/prometheus/secrets/etcd-certs/server.crt
      keyFile: /etc/prometheus/secrets/etcd-certs/server.key
      insecure_skip_verify: true
And this is the yaml of the prometheus pod that gets generated:
kind: Pod
apiVersion: v1
metadata:
  generateName: prometheus-etcd-prometheus-
  annotations:
    kubernetes.io/created-by: >
      {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"StatefulSet","namespace":"containerhosting","name":"prometheus-etcd-prometheus","uid":"20084a4e-fb9e-11e7-ba11-005056a22d34","apiVersion":"apps/v1beta1","resourceVersion":"14050451"}}
  selfLink: /api/v1/namespaces/containerhosting/pods/prometheus-etcd-prometheus-0
  resourceVersion: '14050507'
  name: prometheus-etcd-prometheus-0
  uid: fac21c61-fba2-11e7-a453-005056a2ad2d
  creationTimestamp: '2018-01-17T16:25:06Z'
  namespace: containerhosting
  ownerReferences:
  - apiVersion: apps/v1beta1
    kind: StatefulSet
    name: prometheus-etcd-prometheus
    uid: 20084a4e-fb9e-11e7-ba11-005056a22d34
    controller: true
    blockOwnerDeletion: true
  labels:
    app: prometheus
    controller-revision-hash: prometheus-etcd-prometheus-2723354034
    prometheus: etcd-prometheus
spec:
  restartPolicy: Always
  serviceAccountName: default
  subdomain: prometheus-operated
  schedulerName: default-scheduler
  terminationGracePeriodSeconds: 600
  nodeName: z9jo-37ko.concurasp.com
  securityContext:
    runAsUser: 1000
    runAsNonRoot: true
    fsGroup: 2000
  containers:
  - resources:
      requests:
        memory: 400Mi
    readinessProbe:
      httpGet:
        path: /-/ready
        port: web
        scheme: HTTP
      timeoutSeconds: 3
      periodSeconds: 5
      successThreshold: 1
      failureThreshold: 6
    terminationMessagePath: /dev/termination-log
    name: prometheus
    livenessProbe:
      httpGet:
        path: /-/healthy
        port: web
        scheme: HTTP
      initialDelaySeconds: 30
      timeoutSeconds: 3
      periodSeconds: 5
      successThreshold: 1
      failureThreshold: 10
    ports:
    - name: web
      containerPort: 9090
      protocol: TCP
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - name: config
      readOnly: true
      mountPath: /etc/prometheus/config
    - name: rules
      readOnly: true
      mountPath: /etc/prometheus/rules
    - name: prometheus-etcd-prometheus-db
      mountPath: /var/prometheus/data
    - name: secret-etcd-certs
      readOnly: true
      mountPath: /etc/prometheus/secrets/etcd-certs
    - name: default-token-4msmp
      readOnly: true
      mountPath: /var/run/secrets/kubernetes.io/serviceaccount
    terminationMessagePolicy: File
    image: 'quay.io/prometheus/prometheus:v2.0.0-rc.1'
    args:
    - '--config.file=/etc/prometheus/config/prometheus.yaml'
    - '--storage.tsdb.path=/var/prometheus/data'
    - '--storage.tsdb.retention=24h'
    - '--web.enable-lifecycle'
    - '--web.route-prefix=/'
  - name: prometheus-config-reloader
    image: 'quay.io/coreos/prometheus-config-reloader:v0.0.2'
    args:
    - '-reload-url=http://localhost:9090/-/reload'
    - '-config-volume-dir=/etc/prometheus/config'
    - '-rule-volume-dir=/etc/prometheus/rules'
    resources:
      limits:
        cpu: 10m
        memory: 50Mi
      requests:
        cpu: 10m
        memory: 50Mi
    volumeMounts:
    - name: config
      readOnly: true
      mountPath: /etc/prometheus/config
    - name: rules
      mountPath: /etc/prometheus/rules
    - name: default-token-4msmp
      readOnly: true
      mountPath: /var/run/secrets/kubernetes.io/serviceaccount
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    imagePullPolicy: IfNotPresent
  hostname: prometheus-etcd-prometheus-0
  serviceAccount: default
  volumes:
  - name: config
    secret:
      secretName: prometheus-etcd-prometheus
      defaultMode: 420
  - name: rules
    emptyDir:
      sizeLimit: '0'
  - name: secret-etcd-certs
    secret:
      secretName: etcd-certs
      defaultMode: 420
  - name: prometheus-etcd-prometheus-db
    emptyDir:
      sizeLimit: '0'
  - name: default-token-4msmp
    secret:
      secretName: default-token-4msmp
      defaultMode: 420
  dnsPolicy: ClusterFirst
status:
  phase: Running
  conditions:
  - type: Initialized
    status: 'True'
    lastProbeTime: null
    lastTransitionTime: '2018-01-17T16:25:06Z'
  - type: Ready
    status: 'True'
    lastProbeTime: null
    lastTransitionTime: '2018-01-17T16:25:11Z'
  - type: PodScheduled
    status: 'True'
    lastProbeTime: null
    lastTransitionTime: '2018-01-17T16:25:06Z'
  hostIP: 10.223.10.223
  podIP: 172.16.9.152
  startTime: '2018-01-17T16:25:06Z'
  containerStatuses:
  - name: prometheus
    state:
      running:
        startedAt: '2018-01-17T16:25:07Z'
    lastState: {}
    ready: true
    restartCount: 0
    image: 'quay.io/prometheus/prometheus:v2.0.0-rc.1'
    imageID: >-
      docker-pullable://quay.io/prometheus/prometheus@sha256:4728f61dd339fb3f1685feeed5b4a67a7cdb865894f64d8d72b9cb1b37e1ddce
    containerID: >-
      docker://0b343247cdc3657c420fac2ace9006665361a1172bc257aad0f92ab2d86d12cf
  - name: prometheus-config-reloader
    state:
      running:
        startedAt: '2018-01-17T16:25:08Z'
    lastState: {}
    ready: true
    restartCount: 0
    image: 'quay.io/coreos/prometheus-config-reloader:v0.0.2'
    imageID: >-
      docker-pullable://quay.io/coreos/prometheus-config-reloader@sha256:9ea705b2234f7deea22058a00ea46a303087828bde579b37255a590e8c0f4eca
    containerID: >-
      docker://4af87a16fc5fde96bd91a40fb9620be9e9ef8f70524843dc0ea76c10bd18807d
  qosClass: Burstable
And those are the logs of the prometheus pod above:
level=info ts=2018-01-17T16:25:07.640188634Z caller=main.go:216 msg="Starting prometheus" version="(version=2.0.0-rc.1, branch=HEAD, revision=5ab8834befbd92241a88976c790ace7543edcd59)"
level=info ts=2018-01-17T16:25:07.6402426Z caller=main.go:217 build_context="(go=go1.9.1, user=root@1f56dd8b6f7b, date=20171017-12:34:15)"
level=info ts=2018-01-17T16:25:07.640260065Z caller=main.go:218 host_details="(Linux 4.14.11-coreos #1 SMP Fri Jan 5 11:00:14 UTC 2018 x86_64 prometheus-etcd-prometheus-0 (none))"
level=info ts=2018-01-17T16:25:07.641303658Z caller=web.go:380 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-01-17T16:25:07.641339572Z caller=main.go:315 msg="Starting TSDB"
level=info ts=2018-01-17T16:25:07.641328581Z caller=targetmanager.go:68 component="target manager" msg="Starting target manager..."
level=info ts=2018-01-17T16:25:07.646666802Z caller=main.go:327 msg="TSDB started"
level=info ts=2018-01-17T16:25:07.64669899Z caller=main.go:394 msg="Loading configuration file" filename=/etc/prometheus/config/prometheus.yaml
level=info ts=2018-01-17T16:25:07.647848468Z caller=kubernetes.go:100 component="target manager" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-01-17T16:25:07.648406607Z caller=main.go:371 msg="Server is ready to receive requests."
level=info ts=2018-01-17T16:25:08.77870819Z caller=main.go:394 msg="Loading configuration file" filename=/etc/prometheus/config/prometheus.yaml
level=info ts=2018-01-17T16:25:08.780003161Z caller=kubernetes.go:100 component="target manager" discovery=k8s msg="Using pod service account via in-cluster config"
E0117 17:10:12.328726 7 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 172.16.9.152:42884->192.168.128.1:443: read: connection timed out
E0117 17:10:12.328742 7 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 172.16.9.152:42884->192.168.128.1:443: read: connection timed out
E0117 17:10:12.328751 7 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 172.16.9.152:42884->192.168.128.1:443: read: connection timed out
Those connection timeouts look very weird to me, and I'm seeing them for the first time...
Thanks for your help!
Ah, I forgot to mention that the cluster I am testing this on is not on AWS but on our private cloud.
That doesn't make a difference for us.
Could you try port-forwarding the prometheus server to look at the UI and see if you can see etcd targets at all?
kubectl -n <namespace> port-forward prometheus-etcd-prometheus-0 9090
There are a couple things I can think of that might be going wrong right now.
1) The pod is not on a master, so you need to specify the correct tolerations to do so.
2) The ServiceAccount doesn't have the correct RBAC roles to access the requires resources from the Kubernetes API.
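With the port-forward above running, the same information is also available from the HTTP API, which can be handy if the UI is inconvenient (jq is assumed to be installed):

# list scrape URL, health, and last scrape error per target
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {scrapeUrl, health, lastError}'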
The pod is not running on a master for sure.. what is the toleration you're talking about?
I did the port forwarding and I see a surprising thing:
etcd-k8s (0/3 up)
| Endpoint | State | Labels | Last Scrape | Error |
| -- | -- | -- | -- | -- |
| https://10.223.10.212:2379/metrics | DOWN | endpoint="api" instance="10.223.10.212:2379" namespace="containerhosting" service="etcd-k8s" | 6.652s ago | Get https://10.223.10.212:2379/metrics: x509: certificate is valid for 127.0.0.1, 192.168.128.15, 192.168.128.20, not 10.223.10.212 |
| https://10.223.10.221:2379/metrics | DOWN | endpoint="api" instance="10.223.10.221:2379" namespace="containerhosting" service="etcd-k8s" | 1.165s ago | Get https://10.223.10.221:2379/metrics: x509: certificate is valid for 127.0.0.1, 192.168.128.15, 192.168.128.20, not 10.223.10.221 |
| https://10.223.10.225:2379/metrics | DOWN | endpoint="api" instance="10.223.10.225:2379" namespace="containerhosting" service="etcd-k8s" | 1.531s ago | Get https://10.223.10.225:2379/metrics: x509: certificate is valid for 127.0.0.1, 192.168.128.15, 192.168.128.20, not 10.223.10.225 |
If I look at the certificate the SANs are:
X509v3 Subject Alternative Name:
DNS:yr25-49k4.concurasp.com, DNS:x3ow-djm8.concurasp.com, DNS:yr25-e974.concurasp.com, DNS:localhost, DNS:*.kube-etcd.kube-system.svc.cluster.local, DNS:kube-etcd-client.kube-system.svc.cluster.local, IP Address:127.0.0.1, IP Address:192.168.128.15, IP Address:192.168.128.20
Can I use those IPs instead?
Cheers,
Simone
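As an aside, one way to inspect which SANs an etcd serving certificate actually presents is openssl; a sketch, assuming the cert/key paths from the ServiceMonitor above and that openssl is available on the host:

echo | openssl s_client -connect 10.223.10.221:2379 \
  -CAfile /etc/ssl/etcd/ca.crt \
  -cert /etc/ssl/etcd/server.crt -key /etc/ssl/etcd/server.key 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'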
Ok I tried to put the 192.168.x.y addresses and now I get the following:
| Endpoint | State | Labels | Last Scrape | Error |
| -- | -- | -- | -- | -- |
| https://192.168.128.15:2379/metrics | DOWN | endpoint="api" instance="192.168.128.15:2379" namespace="containerhosting" service="etcd-k8s" | 12.205s ago | context deadline exceeded |
| https://192.168.128.20:2379/metrics | DOWN | endpoint="api" instance="192.168.128.20:2379" namespace="containerhosting" service="etcd-k8s" | 17.299s ago | context deadline exceeded |
I got it working!! I had to add the following in the ServiceMonitor definition:
tlsConfig:
  serverName: kube-etcd-client.kube-system.svc.cluster.local
and I used the node IPs (10.223.x.y) - all the targets are up and running.
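Putting the fix together, the working endpoint config looked roughly like this - a sketch; serverName just has to match one of the SANs in the certificate:

endpoints:
- interval: 10s
  port: api
  scheme: https
  tlsConfig:
    caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
    certFile: /etc/prometheus/secrets/etcd-certs/server.crt
    keyFile: /etc/prometheus/secrets/etcd-certs/server.key
    serverName: kube-etcd-client.kube-system.svc.cluster.local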
One question: where does Prometheus store its metrics? I restarted the pods and the old metrics are gone, so I think they are on a non-persistent volume on the pod itself.
Thanks a lot for your help,
Simone
Glad you got it working, what you did is exactly what I would have suggested, and is what we're doing as well. The tolerations we use are:
tolerations:
- effect: NoSchedule
  key: node-role.kubernetes.io/master
  operator: Exists
By default the TSDB data uses an emptyDir volume, but you can configure any storage means you want. See the docs on storage for that.
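A minimal sketch of persistent storage on a Prometheus object, assuming an Operator version that supports volumeClaimTemplate in spec.storage, with a hypothetical storage class named standard:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: etcd-prometheus
spec:
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard # hypothetical class name
        resources:
          requests:
            storage: 40Gi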
Hey @brancz, I have a question about the etcd metrics exposed. We had an issue (in our non-production environment, luckily) where the database compaction didn't work; therefore the etcd quota capacity was reached and the cluster became non-operational. I was wondering whether there is a metric which exposes the available quota, and anything that tells us about the last database compaction.
Thanks a lot,
Simone
@simox-83 There are, although currently part of the etcd_debugging namespace of metrics exposed. The debugging namespace of metrics is meant for currently unstable metrics, meaning these metrics can break in any way in any upcoming release. Nevertheless they still may be useful. Some you could have a look at:
- etcd_debugging_mvcc_db_total_size_in_bytes
- etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds
- etcd_debugging_mvcc_db_compaction_total_duration_milliseconds
- etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds
If none of those seem useful, it's probably a good idea to take this upstream and discuss what kind of metrics we could introduce to monitor this better.
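As an illustration, a Prometheus 2.x alerting rule on database size relative to the quota could look like the sketch below; 2 GiB is etcd's default backend quota, so adjust the threshold if you set --quota-backend-bytes, and remember the metric lives in the unstable etcd_debugging namespace:

groups:
- name: etcd-capacity
  rules:
  - alert: EtcdDatabaseNearQuota
    # 0.8 * 2 GiB (2147483648 bytes), etcd's default quota
    expr: etcd_debugging_mvcc_db_total_size_in_bytes > 0.8 * 2147483648
    for: 10m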
Any idea when that next release with fully automated etcd monitoring and alerting will be ready? I'm thinking about setting up an nginx reverse-proxy daemonset scheduled on the masters to get access to the metrics endpoint from the worker nodes. If this is coming very soon, it would save me some work :-)
@erkolson due to the security-sensitive nature, there won't be any way to unify this. You need to make sure your Prometheus server has access to scrape etcd. The discrepancy across setups is too big to unify.
@simox-83 @brancz
I followed the above steps to set up etcd monitoring on my bare-metal cluster created with kubespray. I had issues with service account permissions for the second Prometheus setup, so I granted them. But I still don't see any targets listed when I port-forward and visit the UI.
This is my prometheus yaml:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: etcd-prometheus
  namespace: monitoring
  labels:
    k8s-app: etcd
spec:
  logLevel: debug
  secrets:
  - etcd-certs
  resources:
    requests:
      memory: 400Mi
  serviceMonitorSelector:
    matchLabels:
      k8s-app: etcd
This is the yaml for the Service, ServiceMonitor and Endpoints:
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
  namespace: monitoring
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  labels:
    k8s-app: etcd
  namespace: monitoring
subsets:
- addresses:
  - ip: 10.140.81.54
    nodeName: 10.140.81.54
  - ip: 10.140.81.39
    nodeName: 10.140.81.39
  - ip: 10.140.81.62
    nodeName: 10.140.81.62
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd
    prometheus: k8s
spec:
  selector:
    matchLabels:
      k8s-app: etcd
  endpoints:
  - interval: 10s
    port: api
    scheme: https
    tlsConfig:
      caFile: /etc/ssl/etcd/ssl/ca.pem
      certFile: /etc/ssl/etcd/ssl/admin-k8s-master.pem
      keyFile: /etc/ssl/etcd/ssl/admin-k8s-master-key.pem
      # use insecureSkipVerify only if you cannot use a Subject Alternative Name
      # insecureSkipVerify: true
      serverName: etcd.kube-system.svc.cluster.local
This looks like wrong indentation:
selector:
matchLabels:
k8s-app: etcd
We are in the process of finishing everything for the next release, which will include CRD validations - then these kinds of problems won't be able to occur anymore! :slightly_smiling_face:
@brancz I will try and fix that. When is the next release anticipated?
Thanks a lot!
I can't say an exact date, but I'm expecting within the next two weeks.
So, after successfully setting up etcd monitoring, I wanted to use my custom Prometheus to scrape the node-exporter metrics. After hours of troubleshooting (the nodes are not listed in the targets as they should be), I found out that if I run wget http://10.223.10.212:9100/metrics from my custom Prometheus pod, I get a nice permission denied. If I do the same from the official Tectonic Prometheus pod, I get the metrics with no problem (and indeed the nodes are in the targets over there). Any thoughts on this? Please note that I created a ServiceMonitor which points directly to the built-in Tectonic node-exporter service. Here is the yaml:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ch-node-exporter # node exporter for custom Prometheus ch-prom
  namespace: {{ .Values.namespace}}
  labels:
    k8s-app: {{ .Values.k8s_app}}
    prometheus: ch-prometheus
spec:
  selector:
    matchLabels:
      k8s-app: node-exporter
  endpoints:
  - interval: 30s
    port: http-metrics
and here is my custom Prometheus object:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: ch-prometheus
  namespace: {{ .Values.namespace}}
spec:
  version: {{ .Values.version}}
  replicas: 2
  secrets:
  - etcd-certs
  serviceAccountName: ch-monitoring
  serviceMonitorSelector:
    matchLabels:
      k8s-app: {{ .Values.k8s_app}}
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: tectonic-system
      pathPrefix: ''
      port: web
      scheme: ''
  ruleSelector:
    matchLabels:
      k8s-app: {{ .Values.k8s_app}}
      prometheus: ch-prometheus
  resources:
    requests:
      memory: {{ .Values.resources.requests.memory}}
      cpu: {{ .Values.resources.requests.cpu}}
    limits:
      memory: {{ .Values.resources.limits.memory}}
      cpu: {{ .Values.resources.limits.cpu}}
Thanks a lot
@brancz any news on this? I still get this from a prometheus pod:
/prometheus $ wget http://10.223.10.223:9100/metrics
Connecting to 10.223.10.223:9100 (10.223.10.223:9100)
wget: can't open 'metrics': Permission denied
I looked at the RBAC policies and gave the service account full permission - I set up a ClusterRole for this. I suspect it might be the native node-exporter that ships with Tectonic.. I hope I don't have to deploy a separate one.
Thanks,
Simone
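One thing worth ruling out here: BusyBox wget saves the response to a file named metrics in the current working directory, and the Prometheus container runs as a non-root user whose working directory is typically not writable, so "wget: can't open 'metrics': Permission denied" can be a local file-write error rather than an HTTP failure. Writing to stdout sidesteps that:

wget -O- http://10.223.10.223:9100/metrics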
What you are looking for is most likely this permission: https://github.com/coreos/prometheus-operator/blob/master/contrib/kube-prometheus/manifests/prometheus/prometheus-k8s-roles.yaml#L54
That's what the node-exporter checks for.
@brancz thanks for the answer. I found that nonResourceURLs entry last Friday and included it in my ClusterRole, but it didn't help. Here is my roles config:
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: ch-monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ch-monitoring
subjects:
- kind: ServiceAccount
  name: ch-monitoring
  namespace: containerhosting
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: ch-monitoring
rules:
- apiGroups: [""]
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- apiGroups: [""]
  resources:
  - nodes/metrics
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
I still can't see the nodes as targets.. is there any way I can get more logs out of the prometheus pods? That wget command just says permission denied; I actually want to check what happens when the pod tries to scrape the node-exporter targets.
Cheers,
Simone
I'm not exactly sure if the version you have supports this, but there is the logLevel field.
I am on 2.1.0 and it doesn't seem to support that field. Is there anything I could look into to troubleshoot this more?
What you can do is set pause: true in the Prometheus object, then you can modify the underlying StatefulSet to set the --log-level flag yourself.
It doesn't seem to work; as soon as I add the log level in the StatefulSet, it is immediately reverted.
sorry, that's paused: true in the prometheus object (https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#prometheusspec)
It doesn't seem to like the logLevel flag. I tried --log-level=debug, --loglevel=debug, and --logLevel=debug but the container crashes.
Excerpt from prometheus -h:
--log.level=info Only log messages with the given severity or above. One of: [debug, info, warn, error]
Sorry once again, it was --log.level.
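For reference, the whole workaround as commands - a sketch, with resource names taken from the etcd-prometheus example above:

# stop the Operator from reconciling the StatefulSet
kubectl -n containerhosting patch prometheus etcd-prometheus \
  --type merge -p '{"spec":{"paused":true}}'
# then add --log.level=debug to the prometheus container args
kubectl -n containerhosting edit statefulset prometheus-etcd-prometheus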
Thanks, with the --log.level flag it works :) I am looking at the logs; it seems Prometheus tries to scrape the node-exporter endpoints and I don't see any error. Here is an extract of the logs:
model.LabelSet{\"__meta_kubernetes_pod_host_ip\":\"10.223.11.17\", \"__meta_kubernetes_pod_node_name\":\"y58b-oggj.concurasp.com\", \"__meta_kubernetes_pod_ip\":\"10.223.11.17\", \"__meta_kubernetes_pod_container_port_number\":\"9100\", \"__meta_kubernetes_pod_container_port_protocol\":\"TCP\", \"__address__\":\"10.223.11.17:9100\", \"__meta_kubernetes_pod_name\":\"node-exporter-352rt\", \"__meta_kubernetes_pod_uid\":\"76102591-dc15-11e7-9fea-005056a23d76\", \"__meta_kubernetes_pod_annotation_kubernetes_io_created_by\":\"{\\\"kind\\\":\\\"SerializedReference\\\",\\\"apiVersion\\\":\\\"v1\\\",\\\"reference\\\":{\\\"kind\\\":\\\"DaemonSet\\\",\\\"namespace\\\":\\\"tectonic-system\\\",\\\"name\\\":\\\"node-exporter\\\",\\\"uid\\\":\\\"9b43a817-da8a-11e7-af0d-005056a2ad2d\\\",\\\"apiVersion\\\":\\\"extensions\\\",\\\"resourceVersion\\\":\\\"658933\\\"}}\\n\", \"__meta_kubernetes_pod_container_name\":\"node-exporter\", \"__meta_kubernetes_pod_container_port_name\":\"scrape\", \"__meta_kubernetes_pod_label_pod_template_generation\":\"2\", \"__meta_kubernetes_endpoint_port_name\":\"http-metrics\", \"__meta_kubernetes_endpoint_port_protocol\":\"TCP\", \"__meta_kubernetes_endpoint_ready\":\"true\", \"__meta_kubernetes_pod_ready\":\"true\", \"__meta_kubernetes_pod_label_controller_revision_hash\":\"2231131536\", \"__meta_kubernetes_pod_label_k8s_app\":\"node-exporter\"}}, Labels:model.LabelSet{\"__meta_kubernetes_service_name\":\"node-exporter\", \"__meta_kubernetes_service_label_k8s_app\":\"node-exporter\", \"__meta_kubernetes_namespace\":\"tectonic-system\", \"__meta_kubernetes_endpoints_name\":\"node-exporter\"}, Source:\"endpoints/tectonic-system/node-exporter\"}"
I am not sure if I need to deploy my own Endpoints in the containerhosting namespace, but I assume not, because the service account used can get endpoints, so it can use the Tectonic ones. I still see the wget call returning permission denied though.
@brancz I did another test. I used the token of the ch-monitoring service account and exposed one of the node-exporter pods (node-exporter-352rt) using kubectl proxy. I then ran curl 127.0.0.1:8001/metrics and got the metrics printed on screen. So this should exclude the roles associated with my account as the root cause of the issue. Now I really don't know what else it could be; I am still getting the permission denied when running this from the prometheus pod. How exactly does the prometheus pod get /metrics? Is there any other service account or anything else I should look into?
Thanks a lot,
Simone
Can you share the Pod manifest of one of the node-exporter pods? Something along the path of the request is denying it, and I'm not sure what it could be here (maybe some firewall issue?). But with a look at the node-exporter Pod I just want to make sure there is nothing on the node-exporter side that might be doing that.
@brancz here you are:
kind: Pod
apiVersion: v1
metadata:
  generateName: node-exporter-
  annotations:
    kubernetes.io/created-by: >
      {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"DaemonSet","namespace":"tectonic-system","name":"node-exporter","uid":"9b43a817-da8a-11e7-af0d-005056a2ad2d","apiVersion":"extensions","resourceVersion":"658933"}}
  selfLink: /api/v1/namespaces/tectonic-system/pods/node-exporter-352rt
  resourceVersion: '27431649'
  name: node-exporter-352rt
  uid: 76102591-dc15-11e7-9fea-005056a23d76
  creationTimestamp: '2017-12-08T12:43:59Z'
  namespace: tectonic-system
  ownerReferences:
  - apiVersion: extensions/v1beta1
    kind: DaemonSet
    name: node-exporter
    uid: 9b43a817-da8a-11e7-af0d-005056a2ad2d
    controller: true
    blockOwnerDeletion: true
  labels:
    controller-revision-hash: '2231131536'
    k8s-app: node-exporter
    pod-template-generation: '2'
spec:
  restartPolicy: Always
  serviceAccountName: default
  hostPID: true
  schedulerName: default-scheduler
  hostNetwork: true
  terminationGracePeriodSeconds: 30
  nodeName: y58b-oggj.concurasp.com
  securityContext: {}
  containers:
  - resources:
      limits:
        cpu: 200m
        memory: 50Mi
      requests:
        cpu: 100m
        memory: 30Mi
    terminationMessagePath: /dev/termination-log
    name: node-exporter
    ports:
    - name: scrape
      hostPort: 9100
      containerPort: 9100
      protocol: TCP
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - name: proc
      readOnly: true
      mountPath: /host/proc
    - name: sys
      readOnly: true
      mountPath: /host/sys
    - name: default-token-v5x6r
      readOnly: true
      mountPath: /var/run/secrets/kubernetes.io/serviceaccount
    terminationMessagePolicy: File
    image: 'quay.io/prometheus/node-exporter:v0.15.0'
    args:
    - '--path.procfs=/host/proc'
    - '--path.sysfs=/host/sys'
  serviceAccount: default
  volumes:
  - name: proc
    hostPath:
      path: /proc
  - name: sys
    hostPath:
      path: /sys
  - name: default-token-v5x6r
    secret:
      secretName: default-token-v5x6r
      defaultMode: 420
  dnsPolicy: ClusterFirst
  tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
  - key: node.alpha.kubernetes.io/notReady
    operator: Exists
    effect: NoExecute
  - key: node.alpha.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
status:
  phase: Running
  conditions:
  - type: Initialized
    status: 'True'
    lastProbeTime: null
    lastTransitionTime: '2017-12-08T12:43:59Z'
  - type: Ready
    status: 'False'
    lastProbeTime: null
    lastTransitionTime: '2018-01-17T17:28:02Z'
  - type: PodScheduled
    status: 'True'
    lastProbeTime: null
    lastTransitionTime: '2017-12-08T12:44:04Z'
  hostIP: 10.223.11.17
  podIP: 10.223.11.17
  startTime: '2017-12-08T12:43:59Z'
  containerStatuses:
  - name: node-exporter
    state:
      running:
        startedAt: '2018-01-17T17:28:00Z'
    lastState:
      terminated:
        exitCode: 255
        reason: Error
        startedAt: '2017-12-15T14:46:31Z'
        finishedAt: '2018-01-17T17:25:29Z'
        containerID: >-
          docker://b3a2e82039d1430a7b59b4d4ae020f0f233f1155751180040a91fb426b582868
    ready: true
    restartCount: 4
    image: 'quay.io/prometheus/node-exporter:v0.15.0'
    imageID: >-
      docker-pullable://quay.io/prometheus/node-exporter@sha256:0b1053a5416a3346aff000c86b268353728804bac7ffa1071ea3da9dde02af1d
    containerID: >-
      docker://6573d505b1b42b620268a40232a8776a4a9561c41ad50adf57c3817ebb85eac2
  qosClass: Burstable
I honestly don't think it's a firewall issue; the Tectonic Prometheus can scrape the metrics properly and they are all in the same network.. hope this helps.
Cheers,
Simone
@brancz actually, I am not sure this is an issue. I have the etcd metrics properly scraped and visible as targets in the Prometheus UI; however, if I run wget https://10.223.10.212:2379/metrics I get the exact message wget: can't open 'metrics': Permission denied. As etcd is properly working, I believe this is a misleading error. But at this point I am out of ideas and don't really know what else to check.
Thanks,
Simone
@brancz I got it working.. somehow :) One of the things I missed was the namespaceSelector in the ServiceMonitor definition. I will continue tomorrow and polish what I have done so I have a clear idea of what was broken.
Thanks for your help!
At the beginning I was puzzled by the tlsConfig field of the etcd-k8s ServiceMonitor; after several attempts I found the way.
It works:
tlsConfig:
  caFile: /etc/prometheus/secrets/etcd-certs/etcd-ca.crt
  certFile: /etc/prometheus/secrets/etcd-certs/etcd.crt
  insecureSkipVerify: true
  keyFile: /etc/prometheus/secrets/etcd-certs/etcd.key
According to the Monitoring external etcd document, I created an etcd secret in the monitoring namespace, then used a kubectl replace .... command to update the Prometheus CRD. If you can't be sure where your etcd CA file, cert, and key are stored, you can use the Kubernetes secret mechanism. You can run kubectl exec -ti -n monitoring prometheus-k8s-0 /bin/sh and then ls /etc/prometheus/secrets/etcd-certs/ to list all of the etcd cert files:
etcd-ca.crt etcd.crt etcd.key
You can then put these filenames into the matching fields of tlsConfig in the etcd-k8s ServiceMonitor.
Good luck!
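For completeness, the secret referenced above can be created from the certificate files like this - a sketch, assuming the files sit in the current directory:

kubectl -n monitoring create secret generic etcd-certs \
  --from-file=etcd-ca.crt \
  --from-file=etcd.crt \
  --from-file=etcd.key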
Thanks man,
I actually figured this out at some point, but forgot to give an update.
Cheers,
Simone
Would you like to add this to the docs and create a PR? :)
@simox-83 thank you, that was a trouble spot for me and you gave the solution and direction.
@brancz I committed a PR. ...
What we are going to do, and what you can do as well, is create a completely separate Prometheus server whose sole purpose is to monitor etcd. This Prometheus server runs on the master nodes, so it has network access to the etcd nodes; the worker nodes cannot reach etcd due to the firewall setup, but running Prometheus on the master nodes works around that. In the first iteration we're doing the same as you're attempting, using the same certificates as the apiserver, but in the next iteration we'll likely use the authZ available in etcd to allow Prometheus to access only the /metrics endpoint.
@brancz should the operator support launching a Prometheus instance on the master nodes just to scrape etcd metrics?
What about running an HTTP proxy on the master nodes that forwards traffic to the etcd metrics endpoint?
The Prometheus Operator already gives you all the tools you need to perform any of the configurations mentioned above; it's up to you and the configuration of your cluster to decide the best way to do this. Ideally I recommend having no additional indirection such as a proxy, as this often leads to issues.
Tectonic shipped etcd monitoring in 1.8.7, so I'm closing this.
I've set up the ServiceMonitor in the following way:
serviceMonitor:
  scheme: https
  insecureSkipVerify: true
  caFile: /etc/prometheus/secrets/etcd-client-cert/ca.crt
  certFile: /etc/prometheus/secrets/etcd-client-cert/etcd-client.pem
  keyFile: /etc/prometheus/secrets/etcd-client-cert/etcd-client-key.pem
however in the targets page I get error messages like this:
Get https://172.20.124.83:4001/metrics: x509: certificate is valid for 127.0.0.1, not 172.20.124.83
I've created the cluster with kops and I'm running k8s 1.11.5. I took the TLS files from the /srv/kubernetes folder on the master.
@mazzy89 I have the same issue using kops, please paste your config, that would help me a lot!
@unclebob2013 my config is the one you see above. For me it worked once I correctly copied the etcd client keys and created the etcd-client-cert secret under the Prometheus config.
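For anyone following along on kops, a sketch of creating that secret from the files copied off a master; the monitoring namespace here is hypothetical (use the namespace your Prometheus runs in) and the exact paths under /srv/kubernetes vary by kops version:

kubectl -n monitoring create secret generic etcd-client-cert \
  --from-file=ca.crt \
  --from-file=etcd-client.pem \
  --from-file=etcd-client-key.pem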