Kops: Gathering metrics from etcd-manager

Created on 12 Jun 2019 · 14 Comments · Source: kubernetes/kops

1. Describe IN DETAIL the feature/behavior/change you would like to see.

Before upgrading, I used the following setup to get metrics from etcd:

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: prometheus-service-proxier
rules:
- apiGroups: [""]
  resources: ["services/proxy"]
  resourceNames: ["http:etcd-server-prometheus-discovery:etcd"]
  verbs: ["get"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: prometheus-service-proxier
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: monitoring
roleRef:
  kind: ClusterRole
  name: prometheus-service-proxier
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: Service
metadata:
  name: etcd-server-prometheus-discovery
  namespace: kube-system
  labels:
    k8s-app: etcd-server
spec:
  selector:
    k8s-app: etcd-server
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https
    port: 443
    targetPort: 443
    protocol: TCP
  - name: etcd
    port: 4001
    targetPort: 4001
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: etcd-server
  name: etcd-server
  namespace: monitoring
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: https
    scheme: https
    path: /api/v1/namespaces/kube-system/services/http:etcd-server-prometheus-discovery:etcd/proxy/metrics
    tlsConfig:
      caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      serverName: kubernetes
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: etcd-server

It's a little hacky, but much more secure than opening firewall ports for etcd, and simpler than deploying another Prometheus on the masters just to monitor etcd (as suggested in https://github.com/kubernetes/kops/issues/4975 and the links in that thread).
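For what it's worth, the proxied path can be sanity-checked directly with kubectl, using the same URL the ServiceMonitor scrapes (this assumes your kubeconfig user is allowed to `get` `services/proxy`):

```shell
# Fetch etcd metrics through the apiserver proxy, bypassing the firewall
kubectl get --raw /api/v1/namespaces/kube-system/services/http:etcd-server-prometheus-discovery:etcd/proxy/metrics
```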

After upgrading to 1.12, however, I tried to adapt the above solution to work with etcd-manager:

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: prometheus-service-proxier
rules:
- apiGroups: [""]
  resources: ["services/proxy"]
  resourceNames:
  - "https:etcd-manager-main-prometheus-discovery:etcd"
  - "https:etcd-manager-events-prometheus-discovery:etcd"
  verbs: ["get"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: prometheus-service-proxier
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: monitoring
roleRef:
  kind: ClusterRole
  name: prometheus-service-proxier
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: Service
metadata:
  name: etcd-manager-events-prometheus-discovery
  namespace: kube-system
  labels:
    k8s-app: etcd-manager-events
spec:
  selector:
    k8s-app: etcd-manager-events
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https
    port: 443
    targetPort: 443
    protocol: TCP
  - name: etcd
    port: 4001
    targetPort: 4001
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: etcd-manager-events
  name: etcd-manager-events
  namespace: monitoring
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: https
    scheme: https
    path: /api/v1/namespaces/kube-system/services/https:etcd-manager-events-prometheus-discovery:etcd/proxy/metrics
    tlsConfig:
      caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      serverName: kubernetes
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: etcd-manager-events
---
apiVersion: v1
kind: Service
metadata:
  name: etcd-manager-main-prometheus-discovery
  namespace: kube-system
  labels:
    k8s-app: etcd-manager-main
spec:
  selector:
    k8s-app: etcd-manager-main
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https
    port: 443
    targetPort: 443
    protocol: TCP
  - name: etcd
    port: 4001
    targetPort: 4001
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: etcd-manager-main
  name: etcd-manager-main
  namespace: monitoring
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: https
    scheme: https
    path: /api/v1/namespaces/kube-system/services/https:etcd-manager-main-prometheus-discovery:etcd/proxy/metrics
    tlsConfig:
      caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      serverName: kubernetes
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: etcd-manager-main

Unfortunately this doesn't work:

$ kubectl get --raw /api/v1/namespaces/kube-system/services/https:etcd-manager-main-prometheus-discovery:etcd/proxy/metrics
Error from server (ServiceUnavailable): the server is currently unable to handle the request

But the cluster has valid endpoints for that service:

$ kubectl describe service -n kube-system etcd-manager-main-prometheus-discovery                                                            
Name:              etcd-manager-main-prometheus-discovery
Namespace:         kube-system
Labels:            k8s-app=etcd-manager-main
Annotations:       kubectl.kubernetes.io/last-applied-configuration:
                     {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"k8s-app":"etcd-manager-main"},"name":"etcd-manager-main-promet...
Selector:          k8s-app=etcd-manager-main
Type:              ClusterIP
IP:                None
Port:              https  443/TCP
TargetPort:        443/TCP
Endpoints:         10.120.128.108:443,10.120.129.224:443,10.120.130.81:443
Port:              etcd  4001/TCP
TargetPort:        4001/TCP
Endpoints:         10.120.128.108:4001,10.120.129.224:4001,10.120.130.81:4001
Session Affinity:  None
Events:            <none>

And sshing into one of the hosts and running curl:

root@ip-10-120-129-224 ~# curl https://localhost:4001/metrics -k --cert /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt --key /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.key

returns valid metrics for etcd.

Do you have any suggestions on what I'm doing wrong? Is the problem that the apiserver proxy doesn't present a client cert to the metrics service?
Do you know a better way to gather etcd-manager metrics?

Best regards

Łukasz Tomaszkiewicz


All 14 comments

It looks like your ServiceMonitor is configured to use the https port (443) rather than the etcd port (4001) that you used in your curl command. Additionally, I'm not super familiar with ServiceMonitor, but your curl command specifies a client certificate and key, and I don't see those defined in the ServiceMonitor, so you may need to specify them somehow.

It's configured to use https on purpose, as I use the apiserver proxy feature to bypass the firewall. The cert and key for the ServiceMonitor are specified (in the tls section and the token file), but I'm afraid the apiserver proxy doesn't pass that information on to the target service and uses it only for proxy authentication and authorization.

So we probably need to figure out another way to get at the metrics. I've read some docs, and exposing metrics on a separate, HTTP-only port would probably solve the case; however, I haven't found a way to do that in kops :(

Hi,

With the upgrade to etcd-manager, etcd is only reachable from the masters, which means that unless Prometheus is running on the masters you cannot scrape it for metrics. There is a feature in etcd 3.3 that allows metrics to be exposed on a different port, and I have an issue open on etcd-manager to expose that: https://github.com/kopeio/etcd-manager/issues/139. That will not be available until at least 1.14, though, as that is when Kubernetes upgrades the recommended version of etcd.
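For context, the etcd 3.3 feature referred to above is the `--listen-metrics-urls` flag, which serves /metrics (and /health) over plain HTTP on a dedicated address, without requiring client TLS. A hypothetical invocation (the address and port are illustrative):

```shell
# etcd 3.3+: expose metrics on a side port that needs no client certs
etcd --listen-metrics-urls=http://0.0.0.0:8081 ...
```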

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

I have the same problem.
How do I get etcd metrics into Prometheus?

Bump. I'm also looking to get the metrics from the kops etcd-manager deployment into prometheus-operator.

It looks like https://github.com/kopeio/etcd-manager/issues/139 is the relevant feature, but I don't see any way of setting that variable in kops.

Another issue helped me to solve it: https://github.com/coreos/prometheus-operator/issues/2207#issuecomment-505122891

Thanks to @tkozma and @irizzant.

But I couldn't get the needed certs from the Pods mentioned there. I fetched them from the kops S3 state store instead:

aws s3 cp s3://${KOPS_STATE_STORE}/${KOPS_CLUSTER_NAME}/pki/issued/etcd-clients-ca/$ca_file_name /tmp/etcd_ca.pem
aws s3 cp s3://${KOPS_STATE_STORE}/${KOPS_CLUSTER_NAME}/pki/issued/etcd-clients-ca/$client_cert_file_name /tmp/client.crt
aws s3 cp s3://${KOPS_STATE_STORE}/${KOPS_CLUSTER_NAME}/pki/private/etcd-clients-ca/$client_key_file_name /tmp/client.key

The filenames contain some generated number. You can list the directory to get the names:

aws s3 ls s3://${KOPS_STATE_STORE}/${KOPS_CLUSTER_NAME}/pki/issued/etcd-clients-ca/

But that's just a workaround.
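For completeness, the approach in the linked prometheus-operator issue boils down to putting those fetched files in a Secret that the Prometheus CR mounts, and pointing a ServiceMonitor at the etcd port directly. A rough sketch, not verified: it assumes Prometheus can reach the masters on port 4001, and that the certs were stored in a Secret named `etcd-client-cert` (hypothetical name) listed under the Prometheus CR's `spec.secrets`, which prometheus-operator mounts at /etc/prometheus/secrets/<name>/:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-manager-main
  namespace: monitoring
  labels:
    k8s-app: etcd-manager-main
spec:
  endpoints:
  - port: etcd          # scrape etcd directly on 4001, not via the apiserver proxy
    scheme: https
    interval: 30s
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-client-cert/etcd_ca.pem
      certFile: /etc/prometheus/secrets/etcd-client-cert/client.crt
      keyFile: /etc/prometheus/secrets/etcd-client-cert/client.key
      insecureSkipVerify: true   # the etcd cert won't match the node IP
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: etcd-manager-main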

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

This has been fixed now. See https://kops.sigs.k8s.io/cluster_spec/#etcd-metrics
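For anyone landing here later: as I understand the linked docs, this is configured in the cluster spec by asking etcd-manager to expose a plain-HTTP metrics listener, roughly like the following (field names taken from the docs; verify against your kops version):

```yaml
# Sketch based on the linked cluster_spec docs; requires a kops release
# with etcd-manager metrics support.
spec:
  etcdClusters:
  - name: main
    manager:
      listenMetricsURLs:
      - http://localhost:8081
```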

/close

@olemarkus: Closing this issue.

In response to this:

This has been fixed now. See https://kops.sigs.k8s.io/cluster_spec/#etcd-metrics

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
