RKE: Hardcoded etcd snapshot image breaks airgapped (or custom system_images) installs

Created on 16 Jul 2019 · 5 Comments · Source: rancher/rke

RKE version:
rke version v0.2.5

Docker version: (docker version,docker info preferred)
Client:
Version: 1.13.1
API version: 1.26
Package version: docker-1.13.1-88.git07f3374.el7.centos.x86_64
Go version: go1.9.4
Git commit: 07f3374/1.13.1
Built: Fri Dec 7 16:13:51 2018
OS/Arch: linux/amd64

Server:
Version: 1.13.1
API version: 1.26 (minimum version 1.12)
Package version: docker-1.13.1-88.git07f3374.el7.centos.x86_64
Go version: go1.9.4
Git commit: 07f3374/1.13.1
Built: Fri Dec 7 16:13:51 2018
OS/Arch: linux/amd64
Experimental: false

Operating system and kernel: (cat /etc/os-release, uname -r preferred)
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

3.10.0-327.18.2.el7.x86_64

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
openstack create instance

cluster.yml file:
nodes:
  - address: 10.57.241.142
    internal_address: 192.168.99.62
    user: centos
    role: [controlplane,worker,etcd]
    ssh_key_path: /home/centos/.ssh/id_rsa
  - address: 10.57.241.143
    internal_address: 192.168.99.64
    user: centos
    role: [controlplane,worker,etcd]
    ssh_key_path: /home/centos/.ssh/id_rsa
  - address: 10.57.241.144
    internal_address: 192.168.99.63
    user: centos
    role: [controlplane,worker,etcd]
    ssh_key_path: /home/centos/.ssh/id_rsa

private_registries:
  - url: 10.57.241.204:5000
    is_default: true

Steps to Reproduce:
rke up --config ./rancher-cluster.yml

Results:
[centos@tpe-liberty-alex-2 HAnode1]$ rke up --config ./rancher-cluster.yml
INFO[0000] Initiating Kubernetes cluster
INFO[0000] [certificates] Generating admin certificates and kubeconfig
INFO[0000] Successfully Deployed state file at [./rancher-cluster.rkestate]
INFO[0000] Building Kubernetes cluster
INFO[0000] [dialer] Setup tunnel for host [10.57.241.144]
INFO[0000] [dialer] Setup tunnel for host [10.57.241.142]
INFO[0000] [dialer] Setup tunnel for host [10.57.241.143]
INFO[0010] [network] Deploying port listener containers
INFO[0011] [network] Successfully started [rke-etcd-port-listener] container on host [10.57.241.143]
INFO[0016] [network] Successfully started [rke-etcd-port-listener] container on host [10.57.241.144]
INFO[0017] [network] Successfully started [rke-cp-port-listener] container on host [10.57.241.143]
INFO[0018] [network] Successfully started [rke-cp-port-listener] container on host [10.57.241.142]
INFO[0018] [network] Successfully started [rke-cp-port-listener] container on host [10.57.241.144]
INFO[0019] [network] Successfully started [rke-worker-port-listener] container on host [10.57.241.143]
INFO[0021] [network] Successfully started [rke-worker-port-listener] container on host [10.57.241.144]
INFO[0021] [network] Successfully started [rke-worker-port-listener] container on host [10.57.241.142]
INFO[0021] [network] Port listener containers deployed successfully
INFO[0021] [network] Running etcd <-> etcd port checks
INFO[0022] [network] Successfully started [rke-port-checker] container on host [10.57.241.143]
INFO[0023] [network] Successfully started [rke-port-checker] container on host [10.57.241.142]
INFO[0024] [network] Successfully started [rke-port-checker] container on host [10.57.241.144]
INFO[0025] [network] Running control plane -> etcd port checks
INFO[0026] [network] Successfully started [rke-port-checker] container on host [10.57.241.143]
INFO[0029] [network] Successfully started [rke-port-checker] container on host [10.57.241.142]
INFO[0029] [network] Successfully started [rke-port-checker] container on host [10.57.241.144]
INFO[0031] [network] Running control plane -> worker port checks
INFO[0031] [network] Successfully started [rke-port-checker] container on host [10.57.241.143]
INFO[0034] [network] Successfully started [rke-port-checker] container on host [10.57.241.144]
INFO[0034] [network] Successfully started [rke-port-checker] container on host [10.57.241.142]
INFO[0036] [network] Running workers -> control plane port checks
INFO[0038] [network] Successfully started [rke-port-checker] container on host [10.57.241.143]
INFO[0039] [network] Successfully started [rke-port-checker] container on host [10.57.241.144]
INFO[0039] [network] Successfully started [rke-port-checker] container on host [10.57.241.142]
INFO[0059] [network] Checking KubeAPI port Control Plane hosts
INFO[0059] [network] Removing port listener containers
INFO[0059] [remove/rke-etcd-port-listener] Successfully removed container on host [10.57.241.143]
INFO[0060] [remove/rke-etcd-port-listener] Successfully removed container on host [10.57.241.142]
INFO[0061] [remove/rke-etcd-port-listener] Successfully removed container on host [10.57.241.144]
INFO[0062] [remove/rke-cp-port-listener] Successfully removed container on host [10.57.241.143]
INFO[0064] [remove/rke-cp-port-listener] Successfully removed container on host [10.57.241.142]
INFO[0065] [remove/rke-cp-port-listener] Successfully removed container on host [10.57.241.144]
INFO[0065] [remove/rke-worker-port-listener] Successfully removed container on host [10.57.241.143]
INFO[0067] [remove/rke-worker-port-listener] Successfully removed container on host [10.57.241.142]
INFO[0067] [remove/rke-worker-port-listener] Successfully removed container on host [10.57.241.144]
INFO[0067] [network] Port listener containers removed successfully
INFO[0067] [certificates] Deploying kubernetes certificates to Cluster nodes
INFO[0078] [reconcile] Rebuilding and updating local kube config
INFO[0078] Successfully Deployed local admin kubeconfig at [./kube_config_rancher-cluster.yml]
INFO[0078] Successfully Deployed local admin kubeconfig at [./kube_config_rancher-cluster.yml]
INFO[0078] Successfully Deployed local admin kubeconfig at [./kube_config_rancher-cluster.yml]
INFO[0078] [certificates] Successfully deployed kubernetes certificates to Cluster nodes
INFO[0078] [reconcile] Reconciling cluster state
INFO[0078] [reconcile] This is newly generated cluster
INFO[0078] Pre-pulling kubernetes images
INFO[0078] [pre-deploy] Pulling image [10.57.241.204:5000/rancher/hyperkube:v1.14.3-rancher1] on host [10.57.241.143]
INFO[0078] [pre-deploy] Pulling image [10.57.241.204:5000/rancher/hyperkube:v1.14.3-rancher1] on host [10.57.241.144]
INFO[0078] [pre-deploy] Pulling image [10.57.241.204:5000/rancher/hyperkube:v1.14.3-rancher1] on host [10.57.241.142]
INFO[0078] [pre-deploy] Successfully pulled image [10.57.241.204:5000/rancher/hyperkube:v1.14.3-rancher1] on host [10.57.241.144]
INFO[0078] [pre-deploy] Successfully pulled image [10.57.241.204:5000/rancher/hyperkube:v1.14.3-rancher1] on host [10.57.241.143]
INFO[0078] [pre-deploy] Successfully pulled image [10.57.241.204:5000/rancher/hyperkube:v1.14.3-rancher1] on host [10.57.241.142]
INFO[0078] Kubernetes images pulled successfully
INFO[0078] [etcd] Building up etcd plane..
INFO[0078] [etcd] Saving snapshot [etcd-rolling-snapshots] on host [10.57.241.142]
INFO[0078] [etcd] Pulling image [rancher/rke-tools:v0.1.34] on host [10.57.241.142]
INFO[0093] [etcd] Successfully pulled image [rancher/rke-tools:v0.1.34] on host [10.57.241.142]
FATA[0093] [etcd] Failed to bring up Etcd Plane: Failed to create [etcd-rolling-snapshots] container on host [10.57.241.142]: Error: No such image: rancher/rke-tools:v0.1.34

Labels: Done, kind/bug

Most helpful comment

The image used for RKE etcd snapshots was changed from a system_images entry to a hardcoded image and tag, so it fails in any situation where that exact image and tag is not available.

Until this is fixed, a workaround is to make the image available on the nodes manually and tag it with the expected image name and tag. Alternatively, use v0.2.4.

docker pull your_registry/rancher/rke-tools:v0.1.34
docker tag your_registry/rancher/rke-tools:v0.1.34 rancher/rke-tools:v0.1.34
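
As a rough sketch of the workaround above, the two commands can be generated for any registry and image. The function name retag_commands and the REGISTRY and IMAGE environment variables are illustrative only and are not part of rke or Docker; the default values are the ones from the original report in this issue.

```shell
#!/bin/sh
# Sketch: print the commands that make an image from a private registry
# available under the un-prefixed name that rke v0.2.5 asks Docker for.
# retag_commands, REGISTRY and IMAGE are hypothetical names for this example.

retag_commands() {
    registry="$1"   # e.g. the private_registries url from cluster.yml
    image="$2"      # e.g. the image named in the FATA log line
    printf 'docker pull %s/%s\n' "$registry" "$image"
    printf 'docker tag %s/%s %s\n' "$registry" "$image" "$image"
}

# Print the commands for review. Nothing is executed against Docker here.
retag_commands "${REGISTRY:-10.57.241.204:5000}" "${IMAGE:-rancher/rke-tools:v0.1.34}"
```

The printed commands have to run on every node with the etcd role. One way to do that, assuming the SSH user can run docker, is to pipe the output to a remote shell, for example `retag_commands 10.57.241.204:5000 rancher/rke-tools:v0.1.34 | ssh centos@10.57.241.142 sh`.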

The log line reporting that the image was pulled successfully when it was not is tracked in https://github.com/rancher/rke/issues/1010
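
Since the "Successfully pulled" line cannot be relied on in this case, one way to check whether the image really exists on a node is to ask Docker directly. This is a minimal sketch that assumes `docker image inspect` is available (Docker 1.13 or later). The helper names image_present and report_image, and the DOCKER override variable, are hypothetical and exist only for this example.

```shell
#!/bin/sh
# Sketch: report whether an image is present in the local Docker image store.
# DOCKER lets the docker binary be swapped out (an assumption of this sketch,
# handy for dry runs); it defaults to the real "docker" command.

image_present() {
    # "docker image inspect" exits non-zero when the image is not found locally.
    "${DOCKER:-docker}" image inspect "$1" >/dev/null 2>&1
}

report_image() {
    if image_present "$1"; then
        echo "present: $1"
    else
        echo "MISSING: $1"
    fi
}

# Usage on a node, with the image name from the FATA log line:
#   report_image rancher/rke-tools:v0.1.34
```

Running the check on each etcd node before `rke up` shows whether the retagging workaround has been applied there.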

All 5 comments

With version v0.2.5 I have just observed the same issue when running against a v1.13.5-rancher1 cluster (a step I perform before trying the update to v1.14.3-rancher1-1). The difference is that I customize the image list:

system_images:
  etcd: registry.example.com:5000/rancher/coreos-etcd:v3.2.24
  kubernetes: registry.example.com:5000/rancher/hyperkube:v1.13.5-rancher1
  alpine: registry.example.com:5000/rancher/rke-tools:v0.1.28
  nginx_proxy: registry.example.com:5000/rancher/rke-tools:v0.1.28
  cert_downloader: registry.example.com:5000/rancher/rke-tools:v0.1.28
  kubernetes_services_sidecar: registry.example.com:5000/rancher/rke-tools:v0.1.28
  kubedns: registry.example.com:5000/rancher/k8s-dns-kube-dns-amd64:1.15.0
  dnsmasq: registry.example.com:5000/rancher/k8s-dns-dnsmasq-nanny-amd64:1.15.0
  kubedns_sidecar: registry.example.com:5000/rancher/k8s-dns-sidecar-amd64:1.15.0
  kubedns_autoscaler: registry.example.com:5000/rancher/cluster-proportional-autoscaler-amd64:1.0.0
  flannel: registry.example.com:5000/rancher/coreos-flannel:v0.10.0
  flannel_cni: registry.example.com:5000/rancher/coreos-flannel-cni:v0.3.0
  calico_node: registry.example.com:5000/rancher/calico-node:v3.4.0
  calico_cni: registry.example.com:5000/rancher/calico-cni:v3.4.0
  calico_ctl: registry.example.com:5000/rancher/calico-ctl:v2.0.0
  canal_node: registry.example.com:5000/rancher/calico-node:v3.4.0
  canal_cni: registry.example.com:5000/rancher/calico-cni:v3.4.0
  canal_flannel: registry.example.com:5000/rancher/coreos-flannel:v0.10.0
  weave_node: registry.example.com:5000/rancher/weave-kube:2.5.0
  weave_cni: registry.example.com:5000/rancher/weave-npc:2.5.0
  pod_infra_container: registry.example.com:5000/rancher/pause-amd64:3.1
  ingress: registry.example.com:5000/rancher/nginx-ingress-controller:0.21.0-rancher3
  ingress_backend: registry.example.com:5000/rancher/nginx-ingress-controller-defaultbackend:1.4
  metrics_server: registry.example.com:5000/rancher/metrics-server-amd64:v0.3.1
  coredns: registry.example.com:5000/rancher/coredns:1.2.6
  codedns_autoscaler: registry.example.com:5000/rancher/cluster-proportional-autoscaler-amd64:1.0.0

And the result is pretty much the same:

INFO[0008] [etcd] Pulling image [rancher/rke-tools:v0.1.34] on host [cfdd9f3c.example.com] 
INFO[0009] [etcd] Successfully pulled image [rancher/rke-tools:v0.1.34] on host [cfdd9f3c.example.com] 
FATA[0009] [etcd] Failed to bring up Etcd Plane: Failed to create [etcd-rolling-snapshots] container on host [cfdd9f3c.example.com]: Error: No such image: rancher/rke-tools:v0.1.34 

Previous versions ran perfectly.

EDIT: ran again using the correct configuration and images for version v1.14.3-rancher1-1 (an upgrade) and I still see the problem.
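
The symptom in these logs is that the snapshot image is requested without the registry prefix that every other image carries. A small filter can surface such lines in a saved rke log. This is only a sketch: the function name unprefixed_pulls is made up for this example, and the pattern assumes the "Pulling image [...]" wording seen in the logs in this issue. The two sample lines are copied from the original report.

```shell
#!/bin/sh
# Sketch: list images that an rke log shows being pulled WITHOUT the expected
# private-registry prefix. Reads the log on stdin; unprefixed_pulls is a
# hypothetical helper name.

unprefixed_pulls() {
    registry="$1"
    # 1. Extract the bracketed image name from "Pulling image [...]" lines.
    # 2. Keep only names that do not start with "<registry>/".
    # 3. De-duplicate, since rke logs one line per host.
    sed -n 's/.*Pulling image \[\([^]]*\)].*/\1/p' |
        awk -v prefix="$registry/" 'index($0, prefix) != 1' |
        sort -u
}

# Demo with two lines from the report above.
unprefixed_pulls 10.57.241.204:5000 <<'EOF'
INFO[0078] [pre-deploy] Pulling image [10.57.241.204:5000/rancher/hyperkube:v1.14.3-rancher1] on host [10.57.241.143]
INFO[0078] [etcd] Pulling image [rancher/rke-tools:v0.1.34] on host [10.57.241.142]
EOF
# prints: rancher/rke-tools:v0.1.34
```

With a full log saved to a file, the same check would be `unprefixed_pulls registry.example.com:5000 < rke-up.log`, where rke-up.log is a placeholder file name.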

This looks to be caused by https://github.com/rancher/rke/commit/7531e02563054a72410d878fa6e29a10e957704b#diff-a1c0977dfcb80994021b9d53dfc88892

cc @superseb @kinarashah

Can be tested with rke v0.2.6-rc1

rke v0.2.6-rc1
Airgap
Private registry w/ auth

rke pulls the correct rke-tools image and the cluster is able to provision successfully.

cluster.yml

nodes:
  - address: 172.31.21.72
    user: ubuntu
    role: [controlplane,etcd,worker]

private_registries:
- url: registry:443
  user: username
  password: password
  is_default: true

./rke_linux-amd64 up --config cluster.yml --ssh-agent-auth

ubuntu@ip-172-31-26-130:~$ ./rke_linux-amd64 up --config cluster.yml --ssh-agent-auth
WARN[0000] This is not an officially supported version (v0.2.6-rc1) of RKE. Please download the latest official release at https://github.com/rancher/rke/releases/latest
INFO[0000] Initiating Kubernetes cluster
INFO[0000] [dialer] Setup tunnel for host [172.31.21.72]
INFO[0000] [state] Pulling image [registry:443/rancher/rke-tools:v0.1.34] on host [172.31.21.72]
INFO[0003] [state] Successfully pulled image [registry:443/rancher/rke-tools:v0.1.34] on host [172.31.21.72]
...
INFO[0038] Kubernetes images pulled successfully
INFO[0038] [etcd] Building up etcd plane..
INFO[0038] [etcd] Pulling image [registry:443/rancher/coreos-etcd:v3.3.10-rancher1] on host [172.31.21.72]
INFO[0040] [etcd] Successfully pulled image [registry:443/rancher/coreos-etcd:v3.3.10-rancher1] on host [172.31.21.72]
INFO[0040] [etcd] Successfully started [etcd] container on host [172.31.21.72]
INFO[0040] [etcd] Saving snapshot [etcd-rolling-snapshots] on host [172.31.21.72]
INFO[0040] [etcd] Successfully started [etcd-rolling-snapshots] container on host [172.31.21.72]
...
INFO[0100] Finished building Kubernetes cluster successfully

Was able to reproduce with rke 0.2.5 - same cluster.yml

INFO[0011] [etcd] Pulling image [rancher/rke-tools:v0.1.34] on host [172.31.21.72]
FATA[0026] [etcd] Failed to bring up Etcd Plane: Can't pull Docker image [rancher/rke-tools:v0.1.34] for host [172.31.21.72]: Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)