K3s: Help! k3s v0.7.0 agent fails to connect via the WAN IP when the server runs in Docker

Created on 9 Aug 2019  ·  17 comments  ·  Source: k3s-io/k3s

The agent connects to the k3s server via the server's host WAN IP, while the server runs in Docker.

  • server: in the machine ali-vm1

the server's host WAN IP:port is 47.98.xxx.xxx:7441
It runs with docker-compose; the container's IP:

[root@ali-vm1 v070-t]# dcp exec server bash
[root@k3-server /]# 
[root@k3-server /]# ip a |grep inet
    inet 127.0.0.1/8 scope host lo
    inet 2.3.1.2/24 brd 2.3.1.255 scope global eth0
  • agent: in the machine hw-vm1
[root@hw-vm1 v070]# dcp -f node.yml up
Recreating v070_node_1 ... done
Attaching to v070_node_1
node_1  | time="2019-08-09T14:00:52.000382372+08:00" level=info msg="Starting k3s agent v0.7.0 (61bdd852)"
node_1  | time="2019-08-09T14:00:54.104057009+08:00" level=info msg="Logging containerd to /var/lib/rancher/k3s/agent/containerd/containerd.log"
node_1  | time="2019-08-09T14:00:54.104348928+08:00" level=info msg="Running containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/k3s/agent/containerd"
node_1  | time="2019-08-09T14:00:54.104841131+08:00" level=info msg="Waiting for containerd startup: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory\""
node_1  | time="2019-08-09T14:00:55.107259392+08:00" level=info msg="module br_netfilter was already loaded"
node_1  | time="2019-08-09T14:00:55.107335659+08:00" level=info msg="module overlay was already loaded"
node_1  | time="2019-08-09T14:00:55.107344421+08:00" level=info msg="module nf_conntrack was already loaded"
node_1  | time="2019-08-09T14:00:55.236427480+08:00" level=info msg="Connecting to proxy" url="wss://2.3.1.2:6443/v1-k3s/connect"

It gets stuck here: the agent connects to the server using the container's IP (2.3.1.2), not the server's host WAN IP.

msg="Connecting to proxy" url="wss://2.3.1.2:6443/v1-k3s/connect"

Most helpful comment

@huapox Hi, I use an Aliyun VM and hit this problem too. Did you solve it by using Kilo in the end?

Are there any k3s agent CLI flags that can read the node's public IP and use it?

Thanks

All 17 comments

  • server's config:
[root@ali-vm1 v070-t]# cat docker-compose.yml 
version: '2'
services:
  server:
    image: reg.xx.com/k-spe/att-k3s:v070
    command: server --disable-agent --cluster-cidr=7.0.0.0/16 --service-cidr=6.7.8.0/23 --cluster-domain=t2.k3s --tls-san=47.98.xxx.xxx --kube-apiserver-arg log-file=/tmp/kubeapi.log --kube-apiserver-arg bind-address=0.0.0.0 --no-deploy=traefik --no-deploy=servicelb 
    #...
    privileged: true
    ports:
    - "7441:6443"
  • agent's config:
[root@hw-vm1 v070]# cat node.yml
version: '2'
services:
  node:
    image: reg.xx.com/k-spe/att-k3s:v070
    command: agent --kubelet-arg="address=0.0.0.0" 
    privileged: true
    network_mode: "host"
    environment:
    - K3S_URL=https://47.98.xxx.xxx:7441
    - K3S_CLUSTER_SECRET=somethingtotallyrandom

This works fine with k3s v0.4.0.

pkg/agent/tunnel/tunnel.go line 74:

    addresses := []string{config.ServerAddress}

    endpoint, _ := client.CoreV1().Endpoints("default").Get("kubernetes", metav1.GetOptions{})
    if endpoint != nil {
        addresses = getAddresses(endpoint)
  • I've found the code here. Could there be a flag to not use the addresses from k8s when not running in HA mode?

  • The address from k8s differs from the real one when the k3s server runs in Docker or behind NAT, and the k3s agent connects from outside the same LAN.

@erikwilson

We can probably add a flag to disable the load-balancer, but there are a couple concerns I have.

The load balancer should fail over to the original server url, so it should eventually connect if the endpoints are not routable.

The endpoints should be routable, though. I am guessing you are using those flags for a reason, but I suspect there is a larger configuration issue.

For rapid validation, I've just commented this out in pkg/agent/tunnel/tunnel.go, line 74:

    addresses := []string{config.ServerAddress}

    /*endpoint, _ := client.CoreV1().Endpoints("default").Get("kubernetes", metav1.GetOptions{})
    if endpoint != nil {
        addresses = getAddresses(endpoint)
        if onChange != nil {
            onChange(addresses)
        }
    }*/

The NAT cluster is working now:

[root@(⎈ |default:default) ~]$ kc get node
NAME      STATUS   ROLES    AGE     VERSION
ali-vm1   Ready    worker   7d22h   v1.14.4-k3s.1
hw-vm1    Ready    worker   77s     v1.14.5-k3s.1

[root@(⎈ |default:default) ~]$ kc get pod -A -o wide
NAMESPACE            NAME                                    READY   STATUS             RESTARTS   AGE     IP               NODE      NOMINATED NODE   READINESS GATES
cattle-system        cattle-cluster-agent-679b8c965d-sxl5r   1/1     Running            0          6d13h   7.0.0.17         ali-vm1   <none>           <none>
cattle-system        cattle-node-agent-9t7rn                 1/1     Running            0          87s     192.168.0.105    hw-vm1    <none>           <none>
cattle-system        cattle-node-agent-v46c2                 1/1     Running            0          6d13h   172.16.168.255   ali-vm1   <none>           <none>
  • Got these errors when the agent starts:
node_1  | I0810 09:17:52.061338       6 iptables.go:155] Adding iptables rule: ! -s 7.0.0.0/16 -d 7.0.1.0/24 -j RETURN
node_1  | I0810 09:17:52.063115       6 iptables.go:155] Adding iptables rule: ! -s 7.0.0.0/16 -d 7.0.0.0/16 -j MASQUERADE --random-fully
node_1  | time="2019-08-10T09:17:53.625946781+08:00" level=info msg="Tunnel endpoint watch event: [2.3.0.2:6443]"
node_1  | time="2019-08-10T09:17:53.625970645+08:00" level=info msg="Updating load balancer server addresses -> [2.3.0.2:6443 47.98.xxx.xxx:7442]"
node_1  | time="2019-08-10T09:17:53.626179200+08:00" level=info msg="Stopped tunnel to 127.0.0.1:22104"
node_1  | time="2019-08-10T09:17:53.626202055+08:00" level=info msg="Connecting to proxy" url="wss://2.3.0.2:6443/v1-k3s/connect"
node_1  | time="2019-08-10T09:17:53.626327829+08:00" level=info msg="Proxy done" err="context canceled" url="wss://127.0.0.1:22104/v1-k3s/connect"
node_1  | time="2019-08-10T09:20:00.945914731+08:00" level=error msg="Failed to connect to proxy" error="dial tcp 2.3.0.2:6443: connect: connection timed out"
node_1  | time="2019-08-10T09:20:00.946800772+08:00" level=error msg="Remotedialer proxy error" error="dial tcp 2.3.0.2:6443: connect: connection timed out"
node_1  | time="2019-08-10T09:20:05.946919710+08:00" level=info msg="Connecting to proxy" url="wss://2.3.0.2:6443/v1-k3s/connect"
node_1  | time="2019-08-10T09:22:13.288332093+08:00" level=error msg="Failed to connect to proxy" error="dial tcp 2.3.0.2:6443: connect: connection timed out"
node_1  | time="2019-08-10T09:22:13.288366468+08:00" level=error msg="Remotedialer proxy error" error="dial tcp 2.3.0.2:6443: connect: connection timed out"
node_1  | time="2019-08-10T09:22:18.288474629+08:00" level=info msg="Connecting to proxy" url="wss://2.3.0.2:6443/v1-k3s/connect"
node_1  | W0810 09:22:48.816723       6 info.go:52] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
node_1  | time="2019-08-10T09:24:25.512354666+08:00" level=error msg="Failed to connect to proxy" error="dial tcp 2.3.0.2:6443: connect: connection timed out"
node_1  | time="2019-08-10T09:24:25.512389515+08:00" level=error msg="Remotedialer proxy error" error="dial tcp 2.3.0.2:6443: connect: connection timed out"
node_1  | time="2019-08-10T09:24:30.512490123+08:00" level=info msg="Connecting to proxy" url="wss://2.3.0.2:6443/v1-k3s/connect"
node_1  | time="2019-08-10T09:26:37.736346722+08:00" level=error msg="Failed to connect to proxy" error="dial tcp 2.3.0.2:6443: connect: connection timed out"
node_1  | time="2019-08-10T09:26:37.736381292+08:00" level=error msg="Remotedialer proxy error" error="dial tcp 2.3.0.2:6443: connect: connection timed out"
node_1  | time="2019-08-10T09:26:42.736474014+08:00" level=info msg="Connecting to proxy" url="wss://2.3.0.2:6443/v1-k3s/connect"
  • My quick workaround, also in pkg/agent/tunnel/tunnel.go:
                    /*endpoint, ok := ev.Object.(*v1.Endpoints)
                    if !ok {
                        logrus.Errorf("Tunnel could not case event object to endpoint: %v", ev)
                        continue watching
                    }*/

                    //newAddresses := getAddresses(endpoint)
                    newAddresses := []string{config.ServerAddress}
  • Then I got this result; it seems to work fine now:
node_1  | I0810 10:15:33.218430       6 conntrack.go:52] Setting nf_conntrack_max to 131072
node_1  | I0810 10:15:33.218815       6 config.go:202] Starting service config controller
node_1  | I0810 10:15:33.218834       6 controller_utils.go:1027] Waiting for caches to sync for service config controller
node_1  | I0810 10:15:33.218843       6 config.go:102] Starting endpoints config controller
node_1  | I0810 10:15:33.218850       6 controller_utils.go:1027] Waiting for caches to sync for endpoints config controller
node_1  | I0810 10:15:33.295689       6 kuberuntime_manager.go:950] updating runtime config through cri with podcidr 7.0.1.0/24
node_1  | I0810 10:15:33.296182       6 kubelet_network.go:69] Setting Pod CIDR:  -> 7.0.1.0/24
node_1  | I0810 10:15:33.300242       6 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "k8s-ssl" (UniqueName: "kubernetes.io/host-path/d68fb52a-bb0c-11e9-9aa5-024202030002-k8s-ssl") pod "cattle-node-agent-9t7rn" (UID: "d68fb52a-bb0c-11e9-9aa5-024202030002") 
node_1  | I0810 10:15:33.300279       6 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "var-run" (UniqueName: "kubernetes.io/host-path/d68fb52a-bb0c-11e9-9aa5-024202030002-var-run") pod "cattle-node-agent-9t7rn" (UID: "d68fb52a-bb0c-11e9-9aa5-024202030002") 
node_1  | I0810 10:15:33.300306       6 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "run" (UniqueName: "kubernetes.io/host-path/d68fb52a-bb0c-11e9-9aa5-024202030002-run") pod "cattle-node-agent-9t7rn" (UID: "d68fb52a-bb0c-11e9-9aa5-024202030002") 
node_1  | I0810 10:15:33.300324       6 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "cattle-credentials" (UniqueName: "kubernetes.io/secret/d68fb52a-bb0c-11e9-9aa5-024202030002-cattle-credentials") pod "cattle-node-agent-9t7rn" (UID: "d68fb52a-bb0c-11e9-9aa5-024202030002") 
node_1  | I0810 10:15:33.300343       6 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "cattle-token-2f5xk" (UniqueName: "kubernetes.io/secret/d68fb52a-bb0c-11e9-9aa5-024202030002-cattle-token-2f5xk") pod "cattle-node-agent-9t7rn" (UID: "d68fb52a-bb0c-11e9-9aa5-024202030002") 
node_1  | I0810 10:15:33.300350       6 reconciler.go:154] Reconciler: start to sync state
node_1  | I0810 10:15:33.320760       6 controller_utils.go:1034] Caches are synced for endpoints config controller
node_1  | I0810 10:15:33.392095       6 kubelet_node_status.go:112] Node hw-vm1 was previously registered
node_1  | I0810 10:15:33.392114       6 kubelet_node_status.go:73] Successfully registered node hw-vm1
node_1  | I0810 10:15:33.418924       6 controller_utils.go:1034] Caches are synced for service config controller
node_1  | I0810 10:15:34.169997       6 kube.go:134] Node controller sync successful
node_1  | I0810 10:15:34.171148       6 vxlan.go:120] VXLAN config: VNI=1 Port=0 GBP=false DirectRouting=false
node_1  | I0810 10:15:34.174177       6 flannel.go:75] Wrote subnet file to /run/flannel/subnet.env
node_1  | I0810 10:15:34.174185       6 flannel.go:79] Running backend.
node_1  | I0810 10:15:34.174192       6 vxlan_network.go:60] watching for new subnet leases

Yah, the reverse tunnel uses that also, so not really a load-balancer issue. Sounds like your endpoints should be routable.

> Yah, the reverse tunnel uses that also, so not really a load-balancer issue. Sounds like your endpoints should be routable.

Thanks for the work and the reply.

  • Here are my suggestions on this topic:

Consider this scenario:
for convenience, you run the server in Docker (not network_mode: host), expose port 6443 outside the host machine, and use HA mode.

Could we have a static configuration of the HA master nodes, via a config file or config parameters?
This would lose the dynamic monitoring of the masters' node IPs, but those IPs don't change often.

Or any better ideas?

I am curious, what is the purpose of setting --kube-apiserver-arg bind-address=0.0.0.0 and --kubelet-arg="address=0.0.0.0"? What network devices are available?

> I am curious, what is the purpose of setting --kube-apiserver-arg bind-address=0.0.0.0 and --kubelet-arg="address=0.0.0.0"? What network devices are available?

My current architecture:

  • server without agent, with sqlite k/v storage.

Running in Docker without network_mode: host; just exposing the internal 6443 port.

  • agent
    running in docker with network_mode: host;
[root@hw-vm1 ~]# docker ps
CONTAINER ID        IMAGE                                                 COMMAND                  CREATED             STATUS              PORTS                                         NAMES
8f676fb76ea3        reg.xxx.com/k-xxx/att-k3s-prs   "/entry.sh agent -..."   2 minutes ago       Up 2 minutes                                                      v070_node_1
3326ae640eb6        rancher/rancher:v2.2.6                                "entrypoint.sh"          6 days ago          Up 6 days           0.0.0.0:8880->80/tcp, 0.0.0.0:8443->443/tcp   rancher
[root@hw-vm1 ~]# docker exec -it v070_node_1 bash
[root@hw-vm1 /]# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 10:15 ?        00:00:00 bash /entry.sh agent --kubelet-arg=address=0.0.0.0 --pause-image=registry.cn-hangzhou.aliyuncs.
rpc          5     1  0 10:15 ?        00:00:00 rpcbind -f
root         6     1  1 10:15 ?        00:00:02 k3s agent --kubelet-arg=address=0.0.0.0 --pause-image=registry.cn-hangzhou.aliyuncs.com/google_
root        16     6  0 10:15 ?        00:00:00 containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/cont
root        77    16  0 10:15 ?        00:00:00 containerd-shim -namespace k8s.io -workdir /var/lib/rancher/k3s/agent/containerd/io.containerd.
root        94    77  0 10:15 ?        00:00:00 /pause
root       125    16  0 10:15 ?        00:00:00 containerd-shim -namespace k8s.io -workdir /var/lib/rancher/k3s/agent/containerd/io.containerd.
root       141   125  0 10:15 ?        00:00:00 agent
root       320     0  0 10:17 ?        00:00:00 bash
root       361   320  0 10:17 ?        00:00:00 ps -ef
[root@hw-vm1 /]# pstree
bash-+-k3s-agent---containerd-+-containerd-shim---pause
     |                        `-containerd-shim---agent
     `-rpcbind
[root@hw-vm1 /]# k3s crictl ps
CONTAINER ID        IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID
432a39da4d6e8       ce6bb2c8f5c81       2 minutes ago       Running             agent               1                   d6aa890dc8212

I started using k3s at version v0.3.x; setting --kube-apiserver-arg bind-address=0.0.0.0 and --kubelet-arg="address=0.0.0.0" just carried over from previous experience.
I haven't yet tested the result of dropping these two flags on my current deployment architecture.

After the above change, there are just a few warnings about failing to read the machine-id:

node_1  | W0810 10:20:33.124208       6 info.go:52] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
node_1  | W0810 10:25:33.110251       6 info.go:52] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
node_1  | W0810 10:30:33.110257       6 info.go:52] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"

Awesome, looks like it is working! The kube-apiserver flag may have been needed at one point for metrics server to work, but is probably causing problems with the current configuration.

Hopefully that helps, might be worth checking out https://github.com/rancher/k3d also. If there is any more info I can give please let me know.

> Hopefully that helps, might be worth checking out https://github.com/rancher/k3d also. If there is any more info I can give please let me know.

I will, thx~

> Awesome, looks like it is working! The kube-apiserver flag may have been needed at one point for metrics server to work, but is probably causing problems with the current configuration.

Yes, truly it is. In this mode you can only run standalone pods on the LAN node, or you need to add routes to the Kubernetes cluster or the other nodes (as my hw-vm1 and ali-vm1 are both in VPC mode, the node's IP is the VM's LAN IP, not the WAN IP):

  • Any pod running on the hw-vm1 node fails when it needs to talk to the kubernetes service:
[root@(⎈ |default:kube-system) ~]$ kc logs -f metrics-server-f5896c776-xwgsc
I0810 04:01:54.980525       1 serving.go:273] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
W0810 04:01:56.352480       1 authentication.go:245] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
Error: Get https://6.7.8.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 6.7.8.1:443: connect: no route to host
panic: Get https://6.7.8.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 6.7.8.1:443: connect: no route to host

goroutine 1 [running]:
main.main()
    /go/src/github.com/kubernetes-incubator/metrics-server/cmd/metrics-server/metrics-server.go:39 +0x13b
  • The kubernetes svc in my current cluster:
[root@(⎈ |default:default) ~]$ kc describe svc kubernetes 
Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                   provider=kubernetes
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP:                6.7.8.1
Port:              https  443/TCP
TargetPort:        6443/TCP
Endpoints:         2.3.0.2:6443
Session Affinity:  None
Events:            <none>

My former cluster used k3s v0.4.0 (no HA feature):

[root@(⎈ |default:default) ~]$ kc describe svc kubernetes 
Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                   provider=kubernetes
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP:                6.7.8.1
Port:              https  443/TCP
TargetPort:        6445/TCP
Endpoints:         127.0.0.1:6445
Session Affinity:  None
Events:            <none>

We can see the Endpoints address changed:
1. the IP changed from the former localhost to the Docker container's IP;
2. the port changed from the former 6445 to 6443.

Updates:

  • Use Kilo to keep NAT nodes connected: squat/kilo#12
  • I changed the master to network_mode: "host" too

@huapox Hi, I use an Aliyun VM and hit this problem too. Did you solve it by using Kilo in the end?

Are there any k3s agent CLI flags that can read the node's public IP and use it?

Thanks
