Job [rke-network-plugin-deploy-job] deletion times out during restoring a snapshot process, however, this only happens to one particular snapshot and when I try to restore some other snapshot it works without any errors.
RKE version:
I'm using RKE v1.1.3
Docker version:
Docker version 19.03.15, build 99e3ed8919
Operating system and kernel:
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
ESXi Virtual Machines
cluster.yml file:
nodes:
addons_include: []
system_images:
etcd: rancher/coreos-etcd:v3.4.3-rancher1
alpine: rancher/rke-tools:v0.1.58
nginx_proxy: rancher/rke-tools:v0.1.58
cert_downloader: rancher/rke-tools:v0.1.58
kubernetes_services_sidecar: rancher/rke-tools:v0.1.58
kubedns: rancher/k8s-dns-kube-dns:1.15.2
dnsmasq: rancher/k8s-dns-dnsmasq-nanny:1.15.2
kubedns_sidecar: rancher/k8s-dns-sidecar:1.15.2
kubedns_autoscaler: rancher/cluster-proportional-autoscaler:1.7.1
coredns: rancher/coredns-coredns:1.6.9
coredns_autoscaler: rancher/cluster-proportional-autoscaler:1.7.1
nodelocal: rancher/k8s-dns-node-cache:1.15.7
kubernetes: rancher/hyperkube:v1.18.3-rancher2
flannel: rancher/coreos-flannel:v0.12.0
flannel_cni: rancher/flannel-cni:v0.3.0-rancher6
calico_node: rancher/calico-node:v3.13.4
calico_cni: rancher/calico-cni:v3.13.4
calico_controllers: rancher/calico-kube-controllers:v3.13.4
calico_ctl: rancher/calico-ctl:v3.13.4
calico_flexvol: rancher/calico-pod2daemon-flexvol:v3.13.4
canal_node: rancher/calico-node:v3.13.4
canal_cni: rancher/calico-cni:v3.13.4
canal_flannel: rancher/coreos-flannel:v0.12.0
canal_flexvol: rancher/calico-pod2daemon-flexvol:v3.13.4
weave_node: weaveworks/weave-kube:2.6.4
weave_cni: weaveworks/weave-npc:2.6.4
pod_infra_container: rancher/pause:3.1
ingress: rancher/nginx-ingress-controller:nginx-0.32.0-rancher1
ingress_backend: rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1
metrics_server: rancher/metrics-server:v0.3.6
windows_pod_infra_container: rancher/kubelet-pause:v0.1.4
ssh_key_path: /root/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
mode: rbac
options: {}
ignore_docker_version: null
kubernetes_version: ""
private_registries: []
ingress:
provider: ""
options: {}
node_selector: {}
extra_args: {}
dns_policy: ""
extra_envs: []
extra_volumes: []
extra_volume_mounts: []
update_strategy: null
cluster_name: ""
cloud_provider:
name: ""
prefix_path: ""
addon_job_timeout: 5000
bastion_host:
address: ""
port: ""
user: ""
ssh_key: ""
ssh_key_path: ""
ssh_cert: ""
ssh_cert_path: ""
monitoring:
provider: ""
options: {}
node_selector: {}
update_strategy: null
replicas: null
restore:
restore: false
snapshot_name: ""
dns: null
Steps to Reproduce:
the snapshot is locally stored on etcd node in /opt/rke/etcd-snapshots/
after executing the command rke etcd snapshot-restore --name <NAME> it gives a fatal error saying FATA[0346] [k8s] Job [rke-network-plugin-deploy-job] deletion timed out. Consider increasing addon_job_timeout value, however, this only happens with one particular snapshot and using another snapshot works fine.
Results:
INFO[0375] [worker] Successfully started [rke-log-linker] container on host [172.31.21.18]
INFO[0376] [worker] Successfully started [rke-log-linker] container on host [172.31.21.19]
INFO[0377] Removing container [rke-log-linker] on host [172.31.21.18], try #1
INFO[0377] [remove/rke-log-linker] Successfully removed container on host [172.31.21.18]
INFO[0377] Image [rancher/hyperkube:v1.18.3-rancher2] exists on host [172.31.21.18]
INFO[0378] Removing container [rke-log-linker] on host [172.31.21.19], try #1
INFO[0378] [remove/rke-log-linker] Successfully removed container on host [172.31.21.19]
INFO[0378] Image [rancher/hyperkube:v1.18.3-rancher2] exists on host [172.31.21.19]
INFO[0379] Starting container [kube-proxy] on host [172.31.21.18], try #1
INFO[0380] Starting container [kube-proxy] on host [172.31.21.19], try #1
INFO[0380] [worker] Successfully started [kube-proxy] container on host [172.31.21.18]
INFO[0380] [healthcheck] Start Healthcheck on service [kube-proxy] on host [172.31.21.18]
INFO[0381] [worker] Successfully started [kube-proxy] container on host [172.31.21.19]
INFO[0381] [healthcheck] Start Healthcheck on service [kube-proxy] on host [172.31.21.19]
INFO[0383] [healthcheck] service [kubelet] on host [172.31.21.20] is healthy
INFO[0383] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.20]
INFO[0386] Starting container [rke-log-linker] on host [172.31.21.20], try #1
INFO[0386] [healthcheck] service [kube-proxy] on host [172.31.21.18] is healthy
INFO[0386] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.18]
INFO[0387] [healthcheck] service [kube-proxy] on host [172.31.21.19] is healthy
INFO[0387] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.19]
INFO[0387] [worker] Successfully started [rke-log-linker] container on host [172.31.21.20]
INFO[0389] Starting container [rke-log-linker] on host [172.31.21.18], try #1
INFO[0390] Removing container [rke-log-linker] on host [172.31.21.20], try #1
INFO[0390] [remove/rke-log-linker] Successfully removed container on host [172.31.21.20]
INFO[0390] Image [rancher/hyperkube:v1.18.3-rancher2] exists on host [172.31.21.20]
INFO[0391] Starting container [rke-log-linker] on host [172.31.21.19], try #1
INFO[0391] Starting container [kube-proxy] on host [172.31.21.20], try #1
INFO[0392] [worker] Successfully started [rke-log-linker] container on host [172.31.21.18]
INFO[0393] [worker] Successfully started [kube-proxy] container on host [172.31.21.20]
INFO[0393] [healthcheck] Start Healthcheck on service [kube-proxy] on host [172.31.21.20]
INFO[0393] [worker] Successfully started [rke-log-linker] container on host [172.31.21.19]
INFO[0393] Removing container [rke-log-linker] on host [172.31.21.18], try #1
INFO[0394] [remove/rke-log-linker] Successfully removed container on host [172.31.21.18]
INFO[0395] Removing container [rke-log-linker] on host [172.31.21.19], try #1
INFO[0395] [remove/rke-log-linker] Successfully removed container on host [172.31.21.19]
INFO[0398] [healthcheck] service [kube-proxy] on host [172.31.21.20] is healthy
INFO[0398] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.20]
INFO[0400] Starting container [rke-log-linker] on host [172.31.21.20], try #1
INFO[0402] [worker] Successfully started [rke-log-linker] container on host [172.31.21.20]
INFO[0402] Removing container [rke-log-linker] on host [172.31.21.20], try #1
INFO[0403] [remove/rke-log-linker] Successfully removed container on host [172.31.21.20]
INFO[0403] [worker] Successfully started Worker Plane..
INFO[0403] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.20]
INFO[0403] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.18]
INFO[0403] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.19]
INFO[0406] Starting container [rke-log-cleaner] on host [172.31.21.20], try #1
INFO[0407] Starting container [rke-log-cleaner] on host [172.31.21.19], try #1
INFO[0407] Starting container [rke-log-cleaner] on host [172.31.21.18], try #1
INFO[0408] [cleanup] Successfully started [rke-log-cleaner] container on host [172.31.21.20]
INFO[0408] Removing container [rke-log-cleaner] on host [172.31.21.20], try #1
INFO[0409] [remove/rke-log-cleaner] Successfully removed container on host [172.31.21.20]
INFO[0409] [cleanup] Successfully started [rke-log-cleaner] container on host [172.31.21.18]
INFO[0409] [cleanup] Successfully started [rke-log-cleaner] container on host [172.31.21.19]
INFO[0410] Removing container [rke-log-cleaner] on host [172.31.21.18], try #1
INFO[0411] [remove/rke-log-cleaner] Successfully removed container on host [172.31.21.18]
INFO[0411] Removing container [rke-log-cleaner] on host [172.31.21.19], try #1
INFO[0411] [remove/rke-log-cleaner] Successfully removed container on host [172.31.21.19]
INFO[0411] [sync] Syncing nodes Labels and Taints
INFO[0411] [sync] Successfully synced nodes Labels and Taints
INFO[0411] [network] Setting up network plugin: canal
INFO[0411] [addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0412] [addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0412] [addons] Executing deploy job rke-network-plugin
FATA[0500] [k8s] Job [rke-network-plugin-deploy-job] deletion timed out. Consider increasing addon_job_timeout value
Apparently that snapshot has a lot of resources that take time or resources to load/process causing the timeout to be hit. In v1.2 line we already raised the timeout in https://github.com/rancher/rke/pull/2289, for now you can manually raise the timeout. If you want to know why this happens with this snapshot, I either need the snapshot or I can give you commands to get an idea why it might happen.
Apparently that snapshot has a lot of resources that take time or resources to load/process causing the timeout to be hit. In v1.2 line we already raised the timeout in #2289, for now you can manually raise the timeout. If you want to know why this happens with this snapshot, I either need the snapshot or I can give you commands to get an idea why it might happen.
Thank you for your guidance that totally makes sense actually. Also so sorry for responding so late.
I'd be delighted if you could give me those commands.
Please share the output of:
ls -la /opt/rke/etcd-snapshots/NAME (for the size of the snapshot)ls -la of the snapshot file itself inside the archive (if possible)docker exec etcd etcdctl get /registry --prefix=true --keys-only | grep -v ^$ | awk -F'/' '{ if ($3 ~ /cattle.io/) {h[$3"/"$4]++} else { h[$3]++ }} END { for(k in h) print h[k], k }' | sort -nPlease share the output of:
* `ls -la /opt/rke/etcd-snapshots/NAME` (for the size of the snapshot) * `ls -la` of the snapshot file itself inside the archive (if possible) * When restore is done using the snapshot, please run the following to get a resource count overview of what is in the snapshot: `docker exec etcd etcdctl get /registry --prefix=true --keys-only | grep -v ^$ | awk -F'/' '{ if ($3 ~ /cattle.io/) {h[$3"/"$4]++} else { h[$3]++ }} END { for(k in h) print h[k], k }' | sort -n` * Also the output of the commands from that involve endpoint, so endpoint health and endpoint status: https://rancher.com/docs/rancher/v2.x/en/troubleshooting/kubernetes-components/etcd/
My gratitude for sharing these commands, the outputs are as following:

https://172.31.21.19:2379 is healthy: successfully committed proposal: took = 153.381508ms
https://172.31.21.18:2379 is healthy: successfully committed proposal: took = 242.511842ms
Please include the full log of the snapshot-restore action with --debug included so we get debug logging.
Please include the full log of the snapshot-restore action with
--debugincluded so we get debug logging.
@superseb Thank you so much for your patience and guidance, here is the log file you requested along with my regards.
fail.log
The part where the etcd snapshot gets restored does not show any errors, what is the output when you run the etcdctl commands? And what are the logs from the etcd container? If it can't be restored successfully, RKE should log that and error out but I need to know what the state of etcd is.
The part where the etcd snapshot gets restored does not show any errors, what is the output when you run the etcdctl commands? And what are the logs from the
etcdcontainer? If it can't be restored successfully, RKE should log that and error out but I need to know what the state of etcd is.
Thank you again for your guidance and apologies in advance if I'm doing anything wrong I'm fairly new to this, I ran the command docker logs etcd and made a log file and as for the etcdctl command I ran etcdctl cluster-health, here's the results, again apologies if I didn't do it right.

etcd.log
These stats are for after the snapshot-restore action was timed out.
The screenshot does not show the exact commands that I posted earlier, it will not work if you don't use the exact command. Similar to the other 2 commands regarding endpoint, please use the exact commands from the comment earlier and from the troubleshooting page and share the error if the command does not work (including the command you used).
Based on the logs the storage seems to be not fast enough (or at least the log shows a long execute time for a lot of requests), what kind/type of disk are in the server?
The screenshot does not show the exact commands that I posted earlier, it will not work if you don't use the exact command. Similar to the other 2 commands regarding endpoint, please use the exact commands from the comment earlier and from the troubleshooting page and share the error if the command does not work (including the command you used).
Based on the logs the storage seems to be not fast enough (or at least the log shows a long execute time for a lot of requests), what kind/type of disk are in the server?
The result of executing docker exec etcd etcdctl get /registry --prefix=true --keys-only | grep -v ^$ | awk -F'/' '{ if ($3 ~ /cattle.io/) {h[$3"/"$4]++} else { h[$3]++ }} END { for(k in h) print h[k], k }' | sort -n command is as follows:
1 acme.cert-manager.io
1 management.cattle.io/listenconfigs
1 mutatingwebhookconfigurations
1 storageclasses
2 management.cattle.io/notifiers
2 management.cattle.io/podsecuritypolicytemplates
2 management.cattle.io/projectcatalogs
2 masterleases
2 networkpolicies
2 podsecuritypolicy
2 priorityclasses
2 ranges
2 validatingwebhookconfigurations
3 crd.projectcalico.org
4 cert-manager.io
4 jobs
4 management.cattle.io/cisbenchmarkversions
4 management.cattle.io/features
5 daemonsets
5 management.cattle.io/cisconfigs
5 statefulsets
6 management.cattle.io/catalogs
6 management.cattle.io/clustercatalogs
6 management.cattle.io/clusters
8 csinodes
8 minions
9 management.cattle.io/kontainerdrivers
10 leases
11 management.cattle.io/clusterregistrationtokens
12 ingress
12 management.cattle.io/authconfigs
12 persistentvolumeclaims
12 persistentvolumes
14 argoproj.io
14 management.cattle.io/nodes
15 management.cattle.io/nodedrivers
16 monitoring.coreos.com
17 management.cattle.io/dynamicschemas
17 management.cattle.io/globalroles
19 deployments
19 management.cattle.io/projectalertgroups
19 management.cattle.io/projects
21 management.cattle.io/userattributes
22 management.cattle.io/clusterroletemplatebindings
28 controllerrevisions
30 management.cattle.io/clusteralertgroups
32 management.cattle.io/etcdbackups
33 management.cattle.io/rkeaddons
34 management.cattle.io/projectalertrules
38 management.cattle.io/preferences
38 project.cattle.io/apps
39 management.cattle.io/roletemplates
40 apiregistration.k8s.io
41 management.cattle.io/rkek8sserviceoptions
42 endpointslices
46 management.cattle.io/users
48 management.cattle.io/globalrolebindings
48 management.cattle.io/tokens
50 pods
51 replicasets
59 configmaps
61 management.cattle.io/settings
66 management.cattle.io/projectroletemplatebindings
76 project.cattle.io/sourcecodeproviderconfigs
85 management.cattle.io/clusteralertrules
89 services
92 namespaces
108 project.cattle.io/apprevisions
111 management.cattle.io/clustermonitorgraphs
119 management.cattle.io/rkek8ssystemimages
133 project.cattle.io/pipelinesettings
137 apiextensions.k8s.io
142 serviceaccounts
170 events
178 roles
191 secrets
209 clusterroles
247 management.cattle.io/projectmonitorgraphs
524 management.cattle.io/catalogtemplates
526 clusterrolebindings
693 management.cattle.io/monitormetrics
4007 rolebindings
22404 management.cattle.io/catalogtemplateversions
As for endpoint status:

https://172.31.21.18:2379 is healthy: successfully committed proposal: took = 29.468433ms
https://172.31.21.19:2379 is healthy: successfully committed proposal: took = 35.606272ms
The type of the disk is VM Disk.
It seems the machines are not fast enough to make sure the pods come up in time before the timeout. Raising the timeout or running rke up again makes sure there is more time to wait for the pods to become ready. What Rancher version have you been running on this cluster?
Also, please post full specifications of the machines (CPU/memory/disk type and IOPS) to determine if the machines are adequate. See https://www.ibm.com/cloud/blog/using-fio-to-tell-whether-your-storage-is-fast-enough-for-etcd for more info.
The Rancher version installed on the cluster is V2.4.4.
Memory info:
MemTotal: 8169968 kB
MemFree: 5375820 kB
MemAvailable: 6970496 kB
Buffers: 115116 kB
Cached: 1380532 kB
SwapCached: 0 kB
Active: 1263600 kB
Inactive: 1284460 kB
Active(anon): 749072 kB
Inactive(anon): 16708 kB
Active(file): 514528 kB
Inactive(file): 1267752 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 184 kB
Writeback: 0 kB
AnonPages: 1052336 kB
Mapped: 539856 kB
Shmem: 17404 kB
Slab: 129160 kB
SReclaimable: 63676 kB
SUnreclaim: 65484 kB
KernelStack: 5376 kB
PageTables: 5084 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 4084984 kB
Committed_AS: 3034756 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
Percpu: 2320 kB
HardwareCorrupted: 0 kB
AnonHugePages: 587776 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 165824 kB
DirectMap2M: 5076992 kB
DirectMap1G: 5242880 k
CPU info:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
stepping : 2
microcode : 0x43
cpu MHz : 2596.992
cache size : 30720 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault invpcid_single pti ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid xsaveopt arat arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5193.98
clflush size : 64
cache_alignment : 64
address sizes : 43 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
stepping : 2
microcode : 0x43
cpu MHz : 2596.992
cache size : 30720 KB
physical id : 2
siblings : 1
core id : 0
cpu cores : 1
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault invpcid_single pti ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid xsaveopt arat arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5193.98
clflush size : 64
cache_alignment : 64
address sizes : 43 bits physical, 48 bits virtual
power management:
processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
stepping : 2
microcode : 0x43
cpu MHz : 2596.992
cache size : 30720 KB
physical id : 4
siblings : 1
core id : 0
cpu cores : 1
apicid : 4
initial apicid : 4
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault invpcid_single pti ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid xsaveopt arat arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5193.98
clflush size : 64
cache_alignment : 64
address sizes : 43 bits physical, 48 bits virtual
power management:
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
stepping : 2
microcode : 0x43
cpu MHz : 2596.992
cache size : 30720 KB
physical id : 6
siblings : 1
core id : 0
cpu cores : 1
apicid : 6
initial apicid : 6
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault invpcid_single pti ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid xsaveopt arat arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5193.98
clflush size : 64
cache_alignment : 64
address sizes : 43 bits physical, 48 bits virtual
power management:
Disk Type ( SAN ):
*-ide
description: IDE interface
product: 82371AB/EB/MB PIIX4 IDE
vendor: Intel Corporation
physical id: 7.1
bus info: pci@0000:00:07.1
logical name: scsi1
version: 01
width: 32 bits
clock: 33MHz
capabilities: ide isa_compat_mode pci_native_mode bus_master emulated
configuration: driver=ata_piix latency=64
resources: irq:0 ioport:1f0(size=8) ioport:3f6 ioport:170(size=8) ioport:376 ioport:1060(size=16)
*-cdrom
description: DVD-RAM writer
product: VMware IDE CDR00
vendor: NECVMWar
physical id: 0.0.0
bus info: scsi@1:0.0.0
logical name: /dev/cdrom
logical name: /dev/cdrw
logical name: /dev/dvd
logical name: /dev/sr0
version: 1.00
capabilities: removable audio cd-r cd-rw dvd dvd-r dvd-ram
configuration: ansiversion=5 status=ready
*-medium
physical id: 0
logical name: /dev/cdrom
capabilities: partitioned partitioned:dos
configuration: signature=1f4ff68f
*-sas
description: Serial Attached SCSI controller
product: PVSCSI SCSI Controller
vendor: VMware
physical id: 0
bus info: pci@0000:03:00.0
logical name: scsi0
version: 02
width: 64 bits
clock: 33MHz
capabilities: sas pciexpress msi pm msix bus_master cap_list rom
configuration: driver=vmw_pvscsi latency=0
resources: irq:18 ioport:4000(size=8) memory:fd5f8000-fd5fffff memory:fd500000-fd50ffff
*-disk
description: SCSI Disk
product: Virtual disk
vendor: VMware
physical id: 0.0.0
bus info: scsi@0:0.0.0
logical name: /dev/sda
version: 2.0
size: 100GiB (107GB)
capabilities: 7200rpm partitioned partitioned:dos
configuration: ansiversion=6 logicalsectorsize=512 sectorsize=512 signature=2b50c538
IOPS: 17.92
Using fio (fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest
):
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.12
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=89KiB/s][w=40 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=731: Mon Apr 26 14:20:01 2021
write: IOPS=26, BW=59.2KiB/s (60.6kB/s)(21.0MiB/380428msec); 0 zone resets
clat (usec): min=9, max=627, avg=44.74, stdev=17.55
lat (usec): min=9, max=629, avg=45.98, stdev=17.78
clat percentiles (usec):
| 1.00th=[ 23], 5.00th=[ 26], 10.00th=[ 29], 20.00th=[ 33],
| 30.00th=[ 36], 40.00th=[ 39], 50.00th=[ 43], 60.00th=[ 46],
| 70.00th=[ 50], 80.00th=[ 57], 90.00th=[ 64], 95.00th=[ 73],
| 99.00th=[ 91], 99.50th=[ 104], 99.90th=[ 139], 99.95th=[ 210],
| 99.99th=[ 553]
bw ( KiB/s): min= 4, max= 157, per=99.19%, avg=58.52, stdev=24.14, samples=760
iops : min= 2, max= 70, avg=26.27, stdev=10.75, samples=760
lat (usec) : 10=0.01%, 20=0.28%, 50=69.94%, 100=29.16%, 250=0.59%
lat (usec) : 500=0.01%, 750=0.02%
fsync/fdatasync/sync_file_range:
sync (usec): min=335, max=685022, avg=37874.18, stdev=38957.69
sync percentiles (usec):
| 1.00th=[ 469], 5.00th=[ 578], 10.00th=[ 734], 20.00th=[ 8225],
| 30.00th=[ 14091], 40.00th=[ 21890], 50.00th=[ 28967], 60.00th=[ 36963],
| 70.00th=[ 46400], 80.00th=[ 59507], 90.00th=[ 80217], 95.00th=[108528],
| 99.00th=[183501], 99.50th=[210764], 99.90th=[304088], 99.95th=[358613],
| 99.99th=[509608]
cpu : usr=0.12%, sys=0.90%, ctx=15717, majf=0, minf=12
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=59.2KiB/s (60.6kB/s), 59.2KiB/s-59.2KiB/s (60.6kB/s-60.6kB/s), io=21.0MiB (23.1MB), run=380428-380428msec
Disk stats (read/write):
sda: ios=0/27697, merge=0/12919, ticks=0/654356, in_queue=598268, util=92.83%
Yeah this doesn't look sufficient, what storage is attached to the VMware cluster which is used for the machine disks?
Thick Provision Lazy Zeroed

It's more about the type of disks, in this case SAS. Based on the fio output, the disks are not sufficient and there is not really a bug here. It seems etcd is just slow causing the cluster not to be ready with the default timeout. Raising this timeout can help or put the machine on a hypervisor with SSD disk which will have better access time/IOPS. Let me know if there is anything else you want to know.
Thank you @superseb for your guidance.