K3s: cluster fails to start with embedded etcd

Created on 3 Sep 2020 · 7 Comments · Source: k3s-io/k3s

Environmental Info:
K3s Version:

k3s version v1.19.0+k3s-9ac113de (9ac113de)

Node(s) CPU architecture, OS, and Version:

Linux ip-172-31-33-134 5.4.0-1021-aws #21-Ubuntu SMP Fri Jul 24 09:42:29 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

single server node

Describe the bug:

A cluster installed from the current master commit does not start. The cause appears to be in embedded etcd and/or the snapshot/backup/restore code; this worked before that code was pushed.

Working commit id tested: 719ffbfb2742eb057fa1f2eefca08d9053bc9a39
Non-working commit id tested: 9ac113de4c79b5b30f90b97363fe608a77d97ac4

Steps To Reproduce:

Both of the following two install methods fail with the same error:

curl -sfL https://get.k3s.io | INSTALL_K3S_COMMIT=9ac113de4c79b5b30f90b97363fe608a77d97ac4 INSTALL_K3S_EXEC="--cluster-init" sh -
curl -sfL https://get.k3s.io | INSTALL_K3S_COMMIT=9ac113de4c79b5b30f90b97363fe608a77d97ac4 INSTALL_K3S_EXEC="--datastore-endpoint etcd --etcd-snapshot-retention 7 --etcd-snapshot-schedule-cron '*/5 * * * *'" sh -

Expected behavior:

Cluster comes up with embedded etcd

Actual behavior:

Hangs forever

Additional context / logs:

The same entries repeat in the logs (journalctl -eu k3s -f):

Sep 03 17:12:45 ip-172-31-33-134 k3s[2154]: WARNING: 2020/09/03 17:12:45 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 03 17:12:49 ip-172-31-33-134 k3s[2154]: WARNING: 2020/09/03 17:12:49 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 03 17:12:50 ip-172-31-33-134 k3s[2154]: {"level":"warn","ts":"2020-09-03T17:12:50.621Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Sep 03 17:12:50 ip-172-31-33-134 k3s[2154]: time="2020-09-03T17:12:50.621461714Z" level=info msg="Failed to test data store connection: context deadline exceeded"
Sep 03 17:12:55 ip-172-31-33-134 k3s[2154]: WARNING: 2020/09/03 17:12:55 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...

Changing the cron value with the --etcd-snapshot-schedule-cron flag does not fix this.
This behavior is NOT seen in rke2, which is odd because the code should be the same.
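The "connection refused" entries in the logs above mean that nothing is accepting connections on 127.0.0.1:2379, the etcd client port. A minimal sketch for checking this directly on the node follows; `port_open` is a hypothetical helper, and it assumes `bash` and the coreutils `timeout` command are installed.

```shell
#!/bin/sh
# port_open HOST PORT: succeed if a TCP connection opens within 2 seconds.
# Hypothetical helper; relies on bash's /dev/tcp and coreutils `timeout`.
port_open() {
  timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# 2379 is the etcd client port that appears in the log lines.
if port_open 127.0.0.1 2379; then
  echo "2379: listening"
else
  echo "2379: not listening (consistent with 'connection refused')"
fi
```

If the port is not listening, the embedded etcd never started, so the datastore connection test in K3s keeps timing out.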

kind/bug status/blocker

rancher-max

👍1

All 7 comments

Can confirm that this happens with the current HEAD (i.e. f72d39ad9cced43f61506f2a66e63031a0ee2072).

It used to work fine with 30f672b72a4b7e3a51a52aaf724dfc325d82728d.

pschmitt on 5 Sep 2020

Does nobody else have this problem? https://github.com/rancher/k3s/issues/2131

ElisaMeng on 9 Sep 2020

Validated in `v1.19.1-rc1+k3s1`

The cluster starts with just the --cluster-init flag, and embedded etcd is used by default: curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.19.1-rc1+k3s1 INSTALL_K3S_EXEC="--cluster-init" sh -
The same can be done through the config.yaml file.
The additional etcd options also work (etcd-disable-snapshots, etcd-snapshot-dir, etcd-snapshot-schedule-cron and etcd-snapshot-retention have all been validated).
Forcibly resetting an operational single-node cluster with --cluster-reset and --cluster-reset-restore-path has also been validated.
An issue similar to the original one is seen when attempting a reset after a multi-node cluster with embedded etcd loses quorum: https://github.com/rancher/k3s/issues/2227
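As a rough illustration of the config.yaml route, the sketch below writes a config file with the same options used elsewhere in this thread. It assumes that config keys mirror the CLI flags without the leading dashes, and it writes to a scratch path (`CONFIG_PATH` is a hypothetical variable); K3s reads /etc/rancher/k3s/config.yaml by default, which needs root to write.

```shell
#!/bin/sh
# Assumption: a scratch path for illustration. K3s reads
# /etc/rancher/k3s/config.yaml by default.
CONFIG_PATH="${CONFIG_PATH:-./config.yaml}"

cat > "$CONFIG_PATH" <<'EOF'
# Keys mirror the CLI flags with the leading dashes removed.
cluster-init: true
etcd-snapshot-schedule-cron: "*/5 * * * *"
etcd-snapshot-retention: 7
EOF

# Show what was written so it can be reviewed before copying into place.
cat "$CONFIG_PATH"
```

The cron expression is quoted because a YAML value may not start with an unquoted `*`.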

rancher-max on 10 Sep 2020

Was running into a similar issue when starting up the second node "Error while dialing dial tcp 127.0.0.1:2379", turned out that I had to open 2379 and 2380 ports in the firewall.

sbellan on 13 Oct 2020

👍2

> Was running into a similar issue when starting up the second node "Error while dialing dial tcp 127.0.0.1:2379", turned out that I had to open 2379 and 2380 ports in the firewall.

Opening 2379 and 2380 in my Security Groups worked for me, which makes sense because those are the official etcd ports: 2379 for client requests and 2380 for peer communication.
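For hosts that use a local firewall rather than cloud security groups, a sketch along these lines could generate the matching rules. It assumes `ufw` is the firewall in use, which nobody in this thread states; `etcd_ufw_rules` and the peer addresses are hypothetical. The function only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# etcd_ufw_rules PEER_IP...: print one ufw rule per peer that allows the
# etcd client (2379) and peer (2380) ports. Prints only; nothing is applied.
etcd_ufw_rules() {
  for peer in "$@"; do
    echo "ufw allow proto tcp from $peer to any port 2379:2380"
  done
}

# Hypothetical peer addresses; substitute the other server nodes' IPs.
etcd_ufw_rules 10.0.0.2 10.0.0.3
```

Restricting the source to the other server nodes keeps etcd unreachable from everywhere else. The security group equivalent is an inbound TCP rule for 2379-2380 from those same nodes.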

Steps:

Node 1:
```
# Install K3s Server
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.19.3+k3s1" sh -s - --cluster-init

# Grab the join token (the file is readable by root only)
sudo cat /var/lib/rancher/k3s/server/node-token
```

Node 2:
```
# Set the token to the value read from node 1
export K3S_TOKEN="<node_1_token>"

# Install K3s Server
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.19.3+k3s1" sh -s - --server https://<node_1_ip>:6443
```
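The join on node 2 depends on `K3S_TOKEN` actually holding the token from node 1. A small guard like the sketch below can catch an unset or empty value before the installer runs; `require_token` is a hypothetical helper, not part of K3s.

```shell
#!/bin/sh
# require_token: fail with a message if K3S_TOKEN is unset or empty.
require_token() {
  if [ -z "${K3S_TOKEN:-}" ]; then
    echo "K3S_TOKEN is empty; copy it from node 1's node-token file" >&2
    return 1
  fi
}

# Example guard: only proceed to the install step when the token is set.
if require_token; then
  echo "token present, safe to run the installer"
else
  echo "refusing to install without a token"
fi
```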

Examine:
ubuntu@usw1-k3s-control2:~$ sudo kubectl get nodes
NAME                STATUS   ROLES         AGE    VERSION
usw1-k3s-control1   Ready    etcd,master   10m    v1.19.3+k3s1
usw1-k3s-control2   Ready    etcd,master   114s   v1.19.3+k3s1

atsai1220 on 28 Oct 2020

Yes, we need to add the etcd ports to the docs @davidnuzik

brandond on 29 Oct 2020

Hi guys,

Note that I also had to add the IP/hostname to the TLS SAN, like this:

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.19.3+k3s2" sh -s - --cluster-init --tls-san 10.0.0.2 --node-ip 10.0.0.2 --node-external-ip 12.34.56.78

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.19.3+k3s2" sh -s - --server https://10.0.0.2:6443 --tls-san 10.0.0.3 --node-ip 10.0.0.3 --node-external-ip 12.34.56.79
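To confirm that an address really ended up in the server certificate, the SAN entries can be inspected with `openssl`. This is a minimal sketch; `cert_sans` is a hypothetical helper, and the address in the usage comment is the example IP from the commands above.

```shell
#!/bin/sh
# cert_sans: read a PEM certificate on stdin and print its SAN entries.
# Hypothetical helper; exits non-zero if the certificate has no SAN extension.
cert_sans() {
  openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
}

# Usage against a running server (needs network access to the node):
#   openssl s_client -connect 10.0.0.2:6443 </dev/null 2>/dev/null | cert_sans
```

If the node's IP or hostname is missing from the output, a joining node that connects to that address will fail TLS verification.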