Version:
k3s version v1.0.0 (18bd921c)
Describe the bug
To Reproduce
firewall-cmd --add-port=6443/tcp
firewall-cmd --add-port=6443/tcp --permanent
firewall-cmd --add-port=10250/tcp
firewall-cmd --add-port=10250/tcp --permanent
firewall-cmd --add-port=8472/udp
firewall-cmd --add-port=8472/udp --permanent
curl -sfL https://get.k3s.io | INSTALL_K3S_SKIP_START=true sh -
cat >/etc/systemd/system/k3s.service.env <<EOF
[Service]
Environment="K3S_TOKEN=XXX"
# only on node1
Environment="K3S_CLUSTER_INIT=true"
# only on node2 & node3
Environment="K3S_URL=https://node1:6443"
EOF
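Since the installer was run with INSTALL_K3S_SKIP_START=true, the service still has to be started by hand once the env file is in place. A minimal sketch, assuming the default k3s systemd unit name from the installer:

```shell
# Pick up the new k3s.service.env, then start the service
# (start was skipped earlier because of INSTALL_K3S_SKIP_START=true)
systemctl daemon-reload
systemctl enable --now k3s
```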
Expected behavior
kubectl top node and kubectl top pod should work on all three nodes, and no errors should show in the logs.
Actual behavior
kubectl top node shows only node1 when run on node1, <unknown> for the other two nodes, and errors with "Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)." when run on node2 or node3.
Additional context
The following is spammed in the logs for node2 and node3:
Dec 19 14:57:57 node2.localdomain k3s[2607]: E1219 14:57:57.392837 2607 controller.go:114] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: Error trying to reach service: 'dial tcp 10.43.109.25:443: connect: no route to host', Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
Dec 19 14:57:57 node2.localdomain k3s[2607]: I1219 14:57:57.392915 2607 controller.go:127] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
Dec 19 14:57:57 node2.localdomain k3s[2607]: E1219 14:57:57.889042 2607 available_controller.go:416] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io": the object has been modified; please apply your changes to the latest version and try again
Dec 19 14:57:58 node2.localdomain k3s[2607]: I1219 14:57:58.896597 2607 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Dec 19 14:57:58 node2.localdomain k3s[2607]: E1219 14:57:58.935020 2607 available_controller.go:416] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io": the object has been modified; please apply your changes to the latest version and try again
I have this problem, and fixed it by editing /var/lib/rancher/k3s/server/manifests/metrics-server/metrics-server-deployment.yaml
nodeName: vmpkube001.agrotis.local # Restrict to run on my master
containers:
- name: metrics-server
image: rancher/metrics-server:v0.3.6
args:
- --kubelet-insecure-tls
- --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
However, this config will get overwritten when the master restarts... still looking for how to fix that!
@luisbrandao I think the fix is to deploy k3s master with the --no-deploy metrics-server and then manually deploy metrics-server with helm or kubectl apply -f
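A sketch of that approach (the --no-deploy flag and the INSTALL_K3S_EXEC installer variable are as of the k3s versions discussed in this thread; the manifest filename is a placeholder for whatever patched metrics-server deployment you maintain yourself):

```shell
# Install the server without the bundled metrics-server, so k3s
# stops rewriting the manifest on every restart
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --no-deploy metrics-server" sh -

# Then deploy your own copy with the extra kubelet args applied
kubectl apply -f metrics-server-patched.yaml   # hypothetical local manifest
```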
Anyway, I am getting an issue like this with metrics-server, and adding those args doesn't seem to help.
$ kubectl top node
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
$ kubectl get apiservice | grep metrics
v1beta1.metrics.k8s.io kube-system/metrics-server False (FailedDiscoveryCheck) 6m3s
$ kubectl logs pod/metrics-server-85774544d5-hv9sp -n kube-system
I0118 02:00:03.686157 1 serving.go:312] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0118 02:00:04.042009 1 secure_serving.go:116] Serving securely on 0.0.0.0:8443
I0118 02:00:43.525812 1 log.go:172] http: TLS handshake error from 10.42.0.0:59561: EOF
I0118 02:00:43.525840 1 log.go:172] http: TLS handshake error from 10.42.0.0:52305: EOF
I0118 02:00:43.525852 1 log.go:172] http: TLS handshake error from 10.42.0.0:34680: EOF
This doesn't seem to be an issue with k3s, more likely metrics-server or maybe both.
@davidnuzik I have this issue and I am on Ubuntu 18.04.3 with v1.17.0+k3s.1
Edit: When I deploy metrics-server to a node whose NIC has an MTU of 1500, it works. A few nodes in my cluster have NICs with an MTU of 9000 (connected via 10Gb SFP+). wtf?
@onedr0p are the nodes with 9k and 1.5k MTU on the same subnet?
@bradtopol yes
I did a little bit of reading on jumbo frames / MTU sizes and saw there's really only a ~4% increase in efficiency for MTU 9000 over MTU 1500. I have changed all my networking back to a standard MTU of 1500; there are no issues and I am still getting great speeds over 10Gb fiber.
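That ~4% figure is easy to sanity-check with a back-of-envelope calculation, assuming plain IPv4+TCP over Ethernet (40 bytes of L3/L4 headers per segment, plus 38 bytes of Ethernet framing, preamble, and inter-frame gap per frame on the wire):

```shell
# payload per frame = MTU - 40  (IPv4 + TCP headers, no options)
# bytes on the wire = MTU + 38  (Ethernet header/FCS + preamble + IFG)
for mtu in 1500 9000; do
  awk -v m="$mtu" 'BEGIN { printf "MTU %d: %.1f%% efficient\n", m, 100 * (m - 40) / (m + 38) }'
done
```

This prints roughly 94.9% for MTU 1500 versus 99.1% for MTU 9000, i.e. about a four-point difference.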
Anyways, sorry to hijack this issue!
Hi, any update on this? I am facing the same issue with CentOS 7.
With _--cluster-init_ on version _v1.17.4+k3s1_, I get a similar issue with metrics-server:
Apr 05 21:05:52 master-12 k3s[1640]: I0405 21:05:52.090207 1640 iptables.go:155] Adding iptables rule: -s 10.42.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
Apr 05 21:05:52 master-12 k3s[1640]: I0405 21:05:52.091926 1640 iptables.go:155] Adding iptables rule: ! -s 10.42.0.0/16 -d 10.42.1.0/24 -j RETURN
Apr 05 21:05:52 master-12 k3s[1640]: I0405 21:05:52.093615 1640 iptables.go:145] Some iptables rules are missing; deleting and recreating rules
Apr 05 21:05:52 master-12 k3s[1640]: I0405 21:05:52.093734 1640 iptables.go:167] Deleting iptables rule: -s 10.42.0.0/16 -j ACCEPT
Apr 05 21:05:52 master-12 k3s[1640]: I0405 21:05:52.095800 1640 iptables.go:155] Adding iptables rule: ! -s 10.42.0.0/16 -d 10.42.0.0/16 -j MASQUERADE --random-fully
Apr 05 21:05:52 master-12 k3s[1640]: I0405 21:05:52.097867 1640 iptables.go:167] Deleting iptables rule: -d 10.42.0.0/16 -j ACCEPT
Apr 05 21:05:52 master-12 k3s[1640]: I0405 21:05:52.098338 1640 iptables.go:155] Adding iptables rule: -s 10.42.0.0/16 -j ACCEPT
Apr 05 21:05:52 master-12 k3s[1640]: I0405 21:05:52.099295 1640 iptables.go:155] Adding iptables rule: -d 10.42.0.0/16 -j ACCEPT
Apr 05 21:05:52 master-12 k3s[1640]: E0405 21:05:52.640725 1640 available_controller.go:419] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.16
Apr 05 21:05:53 master-12 k3s[1640]: E0405 21:05:53.684812 1640 available_controller.go:419] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apir
I'm on k3s version v1.18.6+k3s1 (6f56fa1d). This also made kubectl really slow to respond if I pointed my kubeconfig's server: address at any controller node other than the first one I stood up. Running, say, kubectl get nodes against the first controller node was milliseconds fast; pointed at controller node 2 or 3, it would take ~35 seconds to return.
Once I updated the arg values for metrics-server to the ones @luisbrandao suggested, kubectl get nodes returned results fast no matter which controller IP my kubeconfig pointed at, and my controller logs were no longer filled with all kinds of metrics-server errors, retries, timeouts, and warnings.
- --kubelet-insecure-tls
- --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
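Putting the thread's workaround together, the relevant fragment of the metrics-server Deployment ends up looking like this (image tag taken from the manifest quoted earlier; note that the bundled manifest under /var/lib/rancher/k3s/server/manifests is rewritten on restart, so apply this via kubectl edit or a copy you manage yourself):

```yaml
containers:
  - name: metrics-server
    image: rancher/metrics-server:v0.3.6
    args:
      - --kubelet-insecure-tls
      - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
```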
I am also seeing this, on 1.19.2-k3s1 on a 3-node RPi4 cluster - all nodes are control nodes.
In the secondary control nodes' logs (but not in the primary's logs), there are lots of these:
Oct 04 21:36:53 pi3.example.com k3s[25106]: E1004 21:36:53.462130 25106 available_controller.go:437] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io": the object has been modified; please apply your changes to the latest version and try again
Oct 04 21:36:58 pi3.example.com k3s[25106]: E1004 21:36:58.508805 25106 available_controller.go:437] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.77.92:443/apis/metrics.k8s.io/v1beta1: Get "https://10.43.77.92:443/apis/metrics.k8s.io/v1beta1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Oct 04 21:36:59 pi3.example.com k3s[25106]: E1004 21:36:59.405102 25106 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: error trying to reach service: dial tcp 10.43.77.92:443: i/o timeout
Oct 04 21:36:59 pi3.example.com k3s[25106]: I1004 21:36:59.405183 25106 controller.go:129] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
The metrics-server pod is apparently healthy and running on pi1. The service is apparently OK:
$ kubectl get services -n kube-system -o wide metrics-server
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
metrics-server ClusterIP 10.43.77.92 <none> 443/TCP 21h k8s-app=metrics-server
Trying to get stats from the primary control node sort of works but it can't see its own stats:
pi@pi1:~ $ sudo kubectl -v5 top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
pi2.example.com 292m 7% 1597Mi 20%
pi3.example.com 253m 6% 1577Mi 20%
pi1.example.com <unknown> <unknown> <unknown> <unknown>
For secondary control nodes, it doesn't work:
pi@pi2:~ $ sudo kubectl -v5 top nodes
I1004 19:17:20.880717 2694661 helpers.go:199] server response object: [{
"metadata": {},
"status": "Failure",
"message": "the server is currently unable to handle the request (get nodes.metrics.k8s.io)",
"reason": "ServiceUnavailable",
"details": {
"group": "metrics.k8s.io",
"kind": "nodes",
"causes": [
{
"reason": "UnexpectedServerResponse",
"message": "error trying to reach service: dial tcp 192.168.22.75:443: i/o timeout"
}
]
},
"code": 503
}]
F1004 19:17:20.882321 2694661 helpers.go:114] Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
[there's a big stack dump here - let me know if you need it]
I assume @luisbrandao's workaround will work, but obviously this needs a proper fix.
I tried updating the deployment as per the workaround (edited it with kubectl edit -n kube-system deployment.apps/metrics-server -o yaml rather than amending the file on disk). The change applied OK, the pod was replaced, and the metrics-server pod is now running on pi3. Alas, now pi1 and pi2 have those errors in the logs and pi3 doesn't.
I restarted the metrics-server deployment, and the pod moved again, to pi2. Now pi1 and pi3 have the errors and pi2 doesn't.
So obviously the node can only see the metrics server when it's local. Does this imply a problem with the network setup? Should the metrics-server service have an external IP so the other nodes can see it? I'm new to k8s and pretty stuck here.
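For what it's worth, a ClusterIP that is only reachable from the node the pod runs on is the classic signature of broken inter-node pod networking rather than a metrics-server bug. Flannel's VXLAN traffic runs over UDP 8472, which is exactly one of the ports opened at the top of this issue. A hedged set of checks, assuming firewalld as in the reproduction steps and using the ClusterIP from the kubectl get services output above:

```shell
# On every node: confirm the flannel VXLAN port is open
firewall-cmd --list-ports | grep -q '8472/udp' && echo "8472/udp open" || echo "8472/udp NOT open"

# From each node: try the service ClusterIP directly; a timeout only
# from remote nodes points at the overlay network, not the pod
curl -k --max-time 5 https://10.43.77.92:443/healthz
```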
There is a solution/workaround posted in https://github.com/rancher/k3s/issues/1968#issuecomment-680219642 - similar to @luisbrandao's above but with a few more steps. It hasn't worked for me (yet!) but has for at least one other person - hopefully other people who find this issue might find it useful.
This doesn't change the fact that it should really work out of the box in k3s.