K3s: Metric-server unable to collect metrics

Created on 19 Dec 2019 · 10 comments · Source: k3s-io/k3s

Version:

k3s version v1.0.0 (18bd921c)

Describe the bug

To Reproduce

  • Start 3 VMs running Fedora Server 31, make sure they can reach each other
  • Open firewall ports
    firewall-cmd --add-port=6443/tcp
    firewall-cmd --add-port=6443/tcp --permanent
    firewall-cmd --add-port=10250/tcp
    firewall-cmd --add-port=10250/tcp --permanent
    firewall-cmd --add-port=8472/udp
    firewall-cmd --add-port=8472/udp --permanent
  • Install k3s without starting the service
    curl -sfL https://get.k3s.io | INSTALL_K3S_SKIP_START=true sh -
  • Add environment variables for the service
    cat >/etc/systemd/system/k3s.service.env <<EOF
    [Service]
    Environment="K3S_TOKEN=XXX"
    # only on node1
    Environment="K3S_CLUSTER_INIT=true"
    # only on node2 & node3
    Environment="K3S_URL=https://node1:6443"
    EOF
  • Start the k3s service on each node (see the sketch below)
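For completeness, a minimal sketch of that last step plus a quick sanity check, assuming the unit names the install script creates (k3s on servers, k3s-agent on agent-only nodes):

    # Start the service created by the install script on each node
    systemctl start k3s        # or k3s-agent on nodes installed in agent mode

    # Once all three nodes are up, verify from node1
    kubectl get nodes -o wide
    kubectl -n kube-system get pods -o wide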

Expected behavior

kubectl top node and kubectl top pod should work on all three nodes, and no errors should appear in the logs.

Actual behavior

When run on node1, kubectl top node shows metrics only for node1 and <unknown> for the other two nodes. When run on node2 or node3, it fails with "Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)".

Additional context

The following is spammed in the logs for node2 and node3:

Dec 19 14:57:57 node2.localdomain k3s[2607]: E1219 14:57:57.392837    2607 controller.go:114] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: Error trying to reach service: 'dial tcp 10.43.109.25:443: connect: no route to host', Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
Dec 19 14:57:57 node2.localdomain k3s[2607]: I1219 14:57:57.392915    2607 controller.go:127] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
Dec 19 14:57:57 node2.localdomain k3s[2607]: E1219 14:57:57.889042    2607 available_controller.go:416] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io": the object has been modified; please apply your changes to the latest version and try again
Dec 19 14:57:58 node2.localdomain k3s[2607]: I1219 14:57:58.896597    2607 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Dec 19 14:57:58 node2.localdomain k3s[2607]: E1219 14:57:58.935020    2607 available_controller.go:416] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io": the object has been modified; please apply your changes to the latest version and try again
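The "connect: no route to host" when node2/node3 dial the metrics-server ClusterIP (10.43.109.25) points at cross-node pod/service traffic being dropped rather than at metrics-server itself. On Fedora/CentOS with firewalld, a commonly suggested step beyond opening the ports listed above is to trust the pod and service CIDRs; a sketch using k3s' default CIDRs (adjust if you have customized them):

    # k3s defaults: pods 10.42.0.0/16, services 10.43.0.0/16
    firewall-cmd --permanent --zone=trusted --add-source=10.42.0.0/16
    firewall-cmd --permanent --zone=trusted --add-source=10.43.0.0/16
    firewall-cmd --reload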

All 10 comments

I have this problem, and fixed it by editing /var/lib/rancher/k3s/server/manifests/metrics-server/metrics-server-deployment.yaml

      nodeName: vmpkube001.agrotis.local # Restrict to run on my master
      containers:
      - name: metrics-server
        image: rancher/metrics-server:v0.3.6
        args:
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname

However, this config gets overwritten whenever the master restarts... still looking for how to fix it!

@luisbrandao I think the fix is to deploy the k3s master with --no-deploy metrics-server and then deploy metrics-server manually with helm or kubectl apply -f.
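If you go that route, a rough sketch of the two steps (hedged: on newer k3s releases the flag is --disable metrics-server, and the upstream components.yaml is just one possible source for the manifest):

    # Install the server without the bundled metrics-server
    curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --no-deploy metrics-server" sh -

    # Deploy metrics-server yourself, adding whatever args you need
    # (e.g. --kubelet-insecure-tls) to its Deployment
    kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml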

Anyways I am getting an issue like this with metrics-server and adding those args doesn't seem to help.

$ kubectl top node

Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
$ kubectl get apiservice | grep metrics

v1beta1.metrics.k8s.io                 kube-system/metrics-server   False (FailedDiscoveryCheck)   6m3s
$ kubectl logs pod/metrics-server-85774544d5-hv9sp -n kube-system

I0118 02:00:03.686157       1 serving.go:312] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0118 02:00:04.042009       1 secure_serving.go:116] Serving securely on 0.0.0.0:8443
I0118 02:00:43.525812       1 log.go:172] http: TLS handshake error from 10.42.0.0:59561: EOF
I0118 02:00:43.525840       1 log.go:172] http: TLS handshake error from 10.42.0.0:52305: EOF
I0118 02:00:43.525852       1 log.go:172] http: TLS handshake error from 10.42.0.0:34680: EOF

This doesn't seem to be an issue with k3s, more likely metrics-server or maybe both.
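When the APIService sits in FailedDiscoveryCheck like above, a few standard checks narrow down whether the problem is metrics-server itself or the path to it (resource names below are the stock ones; curlimages/curl is just a convenient throwaway image):

    # What condition/message did the apiserver record for the aggregated API?
    kubectl get apiservice v1beta1.metrics.k8s.io -o yaml

    # Does the Service have a backing endpoint?
    kubectl -n kube-system get endpoints metrics-server

    # Can a pod reach the service at all? Any HTTP response beats a timeout.
    kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl --command -- \
      curl -ks https://metrics-server.kube-system.svc/apis/metrics.k8s.io/v1beta1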

@davidnuzik I have this issue and I am on Ubuntu 18.04.3 with v1.17.0+k3s.1

Edit: When I deploy metrics-server to a node whose NIC has an MTU of 1500, it works. A few nodes in my cluster have NICs with an MTU of 9000 (connected via 10Gb SFP+). wtf?

@onedr0p are the nodes with 9k and 1.5k MTU on the same subnet?

@bradtopol yes 🤦‍♂️

I did a little bit of reading on jumbo frames / MTU sizes and saw there's really only a ~4% increase in efficiency for MTU 9000 over MTU 1500. I have changed all my networking back to a standard MTU of 1500; there are no issues and I am still getting great speeds over 10Gb fiber.

Anyways, sorry to hijack this issue!

justSayNoToJumboFrames

https://netcraftsmen.com/just-say-no-to-jumbo-frames/
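If you suspect an MTU mismatch like the one described above, it is easy to check: flannel's VXLAN device normally gets the NIC MTU minus 50 bytes of encapsulation overhead, and mismatched or unsupported jumbo frames tend to show up as silently dropped packets. A sketch (interface and host names are examples):

    # Compare the physical NIC's MTU with flannel's VXLAN interface
    ip link show eth0
    ip link show flannel.1

    # Test whether jumbo frames actually make it to the other node:
    # -M do sets the DF bit; 8972 = 9000 - 20 (IP header) - 8 (ICMP header)
    ping -M do -s 8972 -c 3 node2
    # and the same for a standard 1500-byte path (1472 = 1500 - 28)
    ping -M do -s 1472 -c 3 node2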

Hi, any update on this? I am facing the same issue with CentOS 7.

With --cluster-init on version v1.17.4+k3s1, I get a similar issue with metrics-server:

Apr 05 21:05:52 master-12 k3s[1640]: I0405 21:05:52.090207    1640 iptables.go:155] Adding iptables rule: -s 10.42.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
Apr 05 21:05:52 master-12 k3s[1640]: I0405 21:05:52.091926    1640 iptables.go:155] Adding iptables rule: ! -s 10.42.0.0/16 -d 10.42.1.0/24 -j RETURN
Apr 05 21:05:52 master-12 k3s[1640]: I0405 21:05:52.093615    1640 iptables.go:145] Some iptables rules are missing; deleting and recreating rules
Apr 05 21:05:52 master-12 k3s[1640]: I0405 21:05:52.093734    1640 iptables.go:167] Deleting iptables rule: -s 10.42.0.0/16 -j ACCEPT
Apr 05 21:05:52 master-12 k3s[1640]: I0405 21:05:52.095800    1640 iptables.go:155] Adding iptables rule: ! -s 10.42.0.0/16 -d 10.42.0.0/16 -j MASQUERADE --random-fully
Apr 05 21:05:52 master-12 k3s[1640]: I0405 21:05:52.097867    1640 iptables.go:167] Deleting iptables rule: -d 10.42.0.0/16 -j ACCEPT
Apr 05 21:05:52 master-12 k3s[1640]: I0405 21:05:52.098338    1640 iptables.go:155] Adding iptables rule: -s 10.42.0.0/16 -j ACCEPT
Apr 05 21:05:52 master-12 k3s[1640]: I0405 21:05:52.099295    1640 iptables.go:155] Adding iptables rule: -d 10.42.0.0/16 -j ACCEPT
Apr 05 21:05:52 master-12 k3s[1640]: E0405 21:05:52.640725    1640 available_controller.go:419] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.16
Apr 05 21:05:53 master-12 k3s[1640]: E0405 21:05:53.684812    1640 available_controller.go:419] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apir

I'm on k3s version v1.18.6+k3s1 (6f56fa1d). This also made kubectl really slow to respond if I pointed my kubeconfig's server: address at any controller node other than the first one I stood up. Running kubectl get nodes against the first controller node returned in milliseconds; pointed at controller node 2 or 3, it took ~35 seconds.

Once I updated the args for metrics-server to the values @luisbrandao suggested, kubectl get nodes returned results fast no matter which controller IP my kubeconfig pointed at, and my controller logs were no longer filled with all kinds of metrics-server errors, retries, timeouts, and warnings:

        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
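If you prefer not to edit the bundled manifest on disk, the same two flags can be added to the live Deployment with a JSON patch; a sketch, assuming the metrics-server container already declares an args list, and keeping in mind that k3s re-applies its bundled manifest when the server restarts, which undoes the change:

    kubectl -n kube-system patch deployment metrics-server --type=json -p='[
      {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"},
      {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname"}
    ]'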

I am also seeing this, on 1.19.2-k3s1 on a 3-node RPi4 cluster - all nodes are control nodes.

In the secondary control nodes' logs (but not in the primary's logs), there are lots of these:

Oct 04 21:36:53 pi3.example.com k3s[25106]: E1004 21:36:53.462130   25106 available_controller.go:437] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io": the object has been modified; please apply your changes to the latest version and try again
Oct 04 21:36:58 pi3.example.com k3s[25106]: E1004 21:36:58.508805   25106 available_controller.go:437] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.77.92:443/apis/metrics.k8s.io/v1beta1: Get "https://10.43.77.92:443/apis/metrics.k8s.io/v1beta1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Oct 04 21:36:59 pi3.example.com k3s[25106]: E1004 21:36:59.405102   25106 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: error trying to reach service: dial tcp 10.43.77.92:443: i/o timeout
Oct 04 21:36:59 pi3.example.com k3s[25106]: I1004 21:36:59.405183   25106 controller.go:129] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.

The metrics-server pod is apparently healthy and running on pi1. The service is apparently OK:

$ kubectl get services -n kube-system -o wide metrics-server
NAME             TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)   AGE   SELECTOR
metrics-server   ClusterIP   10.43.77.92   <none>        443/TCP   21h   k8s-app=metrics-server

Trying to get stats from the primary control node sort of works but it can't see its own stats:

pi@pi1:~ $ sudo kubectl -v5 top nodes
NAME                       CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
pi2.example.com        292m         7%     1597Mi          20%
pi3.example.com        253m         6%     1577Mi          20%
pi1.example.com        <unknown>                           <unknown>               <unknown>               <unknown>

For secondary control nodes, it doesn't work:

pi@pi2:~ $ sudo kubectl -v5 top nodes
I1004 19:17:20.880717 2694661 helpers.go:199] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "the server is currently unable to handle the request (get nodes.metrics.k8s.io)",
  "reason": "ServiceUnavailable",
  "details": {
    "group": "metrics.k8s.io",
    "kind": "nodes",
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "error trying to reach service: dial tcp 192.168.22.75:443: i/o timeout"
      }
    ]
  },
  "code": 503
}]
F1004 19:17:20.882321 2694661 helpers.go:114] Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
[there's a big stack dump here - let me know if you need it]

I assume @luisbrandao's workaround will work, but obviously this needs a proper fix.

I tried updating the deployment as per the workaround (edited it with kubectl edit -n kube-system deployment.apps/metrics-server -o yaml rather than amending the file on disk). The change applied OK, the pod was replaced, and metrics-server is now running on pi3. Alas, now pi1 and pi2 have those errors in the logs and pi3 doesn't.

I restarted the metrics-server deployment, and the pod moved again, to pi2. Now pi1 and pi3 have the errors and pi2 doesn't.

So obviously the node can only see the metrics server when it's local. Does this imply a problem with the network setup? Should the metrics-server service have an external IP so the other nodes can see it? I'm new to k8s and pretty stuck here.
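To test that hypothesis, try reaching the metrics-server ClusterIP from a node that does not host the pod (a sketch, using the ClusterIP from the service output above):

    # Any HTTP response (even 401/403) means the network path works;
    # a hang/timeout means cross-node overlay traffic is being dropped.
    curl -ks --max-time 5 https://10.43.77.92:443/apis/metrics.k8s.io/v1beta1 ; echo

    # Cross-node pod/service traffic rides flannel's VXLAN overlay (UDP 8472 by default),
    # so that port must be open between all nodes; also check for MTU or firewall rules
    # on the path that could drop the encapsulated packets.

The service does not need an external IP; ClusterIPs are expected to be reachable from every node via kube-proxy.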

There is a solution/workaround posted in https://github.com/rancher/k3s/issues/1968#issuecomment-680219642 - similar to @luisbrandao's above but with a few more steps. It hasn't worked for me (yet!) but has for at least one other person - hopefully other people who find this issue might find it useful.

This doesn't change the fact that it should really work out of the box in k3s.
