rke etcd restore fails when restoring from a local (on-prem) snapshot; it always looks for an S3 bucket

Created on 8 Aug 2019 · 10 Comments · Source: rancher/rke

RKE version:
rke version v0.2.2
(same for latest version v0.2.7)

Docker version: (docker version,docker info preferred)
docker version
Client:
Version: 18.09.7
API version: 1.39
Go version: go1.10.8
Git commit: 2d0083d
Built: Thu Jun 27 17:56:06 2019
OS/Arch: linux/amd64
Experimental: false

Server: Docker Engine - Community
Engine:
Version: 18.09.7
API version: 1.39 (minimum version 1.12)
Go version: go1.10.8
Git commit: 2d0083d
Built: Thu Jun 27 17:26:28 2019
OS/Arch: linux/amd64
Experimental: false

Operating system and kernel: (cat /etc/os-release, uname -r preferred)
cat /etc/os-release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.5 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.5"
PRETTY_NAME="Red Hat"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.5:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.5
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.5"

uname -r
3.10.0-862.el7.x86_64

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
Bare-metal

cluster.yml file:

Steps to Reproduce:

1) Install the cluster using the rke installer:
rke up --config=rke-config.yaml
2) Take a backup of etcd:
rke etcd snapshot-save --config=rke-config.yml --name=testing_etcd_backup
3) Restore the snapshot without an S3 bucket configured.

Results:
rke etcd snapshot-restore --config=rke-config.yml --name=testing_etcd_backup
INFO[0000] Restoring etcd snapshot testing_etcd_backup
INFO[0000] Successfully Deployed state file at [./rke-config.rkestate]
INFO[0000] [dialer] Setup tunnel for host [XXXX.local]
INFO[0000] [dialer] Setup tunnel for host [XXXX.local]
INFO[0000] [dialer] Setup tunnel for host [XXXX.local]
FATA[0006] failed to prepare backup: restoring S3 backups with no cluster level S3 configuration is not supported

It also fails for an auto-generated name like 2019-08-07T10:26:09Z_etcd.

I think the issue is with the IsLocalSnapshot function: for these snapshot names it always returns false.

https://github.com/rancher/rke/blob/master/cluster/etcd.go

func IsLocalSnapshot(name string) bool {
	// name is fmt.Sprintf("%s-%s%s-", cluster.Name, typeFlag, providerFlag)
	// typeFlag = "r": recurring
	// typeFlag = "m": manual
	//
	// providerFlag = "l" local
	// providerFlag = "s" s3
	re := regexp.MustCompile("^c-[a-z0-9].*?-.l-")
	return re.MatchString(name)
}

When I renamed the backup file to c-20190706-.l- it worked, so the regex seems to be the problem.
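The behavior described above can be checked in isolation. The following is a minimal sketch that copies only the regular expression quoted in the issue into a standalone program; the surrounding program and the list of sample names are illustrative and are not taken from the rke source.

```go
package main

import (
	"fmt"
	"regexp"
)

// localSnapshotRe is the pattern quoted in the issue. It only matches names
// shaped like "c-<id>-<typeFlag>l-...", not user-chosen snapshot names.
var localSnapshotRe = regexp.MustCompile("^c-[a-z0-9].*?-.l-")

// IsLocalSnapshot mirrors the function quoted in the issue.
func IsLocalSnapshot(name string) bool {
	return localSnapshotRe.MatchString(name)
}

func main() {
	// Sample names taken from the report.
	names := []string{
		"testing_etcd_backup",       // custom name: false
		"2019-08-07T10:26:09Z_etcd", // auto-generated name: false
		"c-20190706-.l-",            // renamed file: true
	}
	for _, n := range names {
		fmt.Printf("%q local=%v\n", n, IsLocalSnapshot(n))
	}
}
```

Any name that does not start with "c-" is classified as non-local, which is consistent with the restore falling through to the S3 path for snapshots created with rke etcd snapshot-save --name.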

internal

letscode-ss

Most helpful comment

I generated a new config and upgraded the cluster with rke 0.2.8, and I got the same error, "restoring S3 backups with no cluster level S3 configuration is not supported", when restoring again. With the etcd backup config enabled, rke tries to get the backup from S3.

backup_config:
      interval_hours: 6
      retention: 30

Without the backup config the restore works fine:

backup_config: null

Tested commands:

rke etcd snapshot-save --name test-snapshot-040919-4 --config /etc/rke/cluster.yml
rke etcd snapshot-restore --name test-snapshot-040919-4 --config /etc/rke/cluster.yml

mubn on 4 Sep 2019

All 10 comments

it used to work with v0.13

letscode-ss on 8 Aug 2019

Having the same issue. rke version v0.2.7

sqgzy on 19 Aug 2019

I have the same problem. v0.2.2

mubn on 20 Aug 2019

Works for me with version v0.2.8

mubn on 3 Sep 2019

Can you provide details of rke-config.yaml?

xiaoluhong on 3 Sep 2019

I was wrong: v0.2.8 doesn't fix the problem. The cause of my error was a custom hyperkube image, "sdevd/hyperkube:v1.14.1-rancher1-zfs". With the original image "rancher/hyperkube:v1.13.5-rancher1", etcd backups work fine, and the same is true with the image set to "rancher/hyperkube:v1.14.5-rancher1". This is probably working as designed; I should generate a new config with rke.

My cluster.yml

cluster_name: mycluster
nodes:
  - address: rke-node-1
    port: "22"
    internal_address: ""
    role:
      - worker
    hostname_override: ""
    user: rke
    docker_socket: /var/run/docker.sock
    ssh_key: ""
    ssh_key_path: /home/rke/.ssh/id_rsa
    ssh_cert: ""
    ssh_cert_path: ""
    labels:
      node-role.kubernetes.io/node:
      # change zone label for SSO Storage class:
      failure-domain.beta.kubernetes.io/zone: nova

  - address: rke-node-3
    port: "22"
    internal_address: ""
    role:
      - worker
    hostname_override: ""
    user: rke
    docker_socket: /var/run/docker.sock
    ssh_key: ""
    ssh_key_path: /home/rke/.ssh/id_rsa
    ssh_cert: ""
    ssh_cert_path: ""
    labels:
      node-role.kubernetes.io/node:
      # change zone label for SSO Storage class:
      failure-domain.beta.kubernetes.io/zone: nova

  - address: rke-node-2
    port: "22"
    internal_address: ""
    role:
      - worker
    hostname_override: ""
    user: rke
    docker_socket: /var/run/docker.sock
    ssh_key: ""
    ssh_key_path: /home/rke/.ssh/id_rsa
    ssh_cert: ""
    ssh_cert_path: ""
    labels:
      node-role.kubernetes.io/node:
      # change zone label for SSO Storage class:
      failure-domain.beta.kubernetes.io/zone: nova
  - address: rke-master-3
    port: "22"
    internal_address: ""
    role:
      - controlplane
      - etcd
    hostname_override: ""
    user: rke
    docker_socket: /var/run/docker.sock
    ssh_key: ""
    ssh_key_path: /home/rke/.ssh/id_rsa
    ssh_cert: ""
    ssh_cert_path: ""
    labels:
      node-role.kubernetes.io/master: true
  - address: rke-master-2
    port: "22"
    internal_address: ""
    role:
      - controlplane
      - etcd
    hostname_override: ""
    user: rke
    docker_socket: /var/run/docker.sock
    ssh_key: ""
    ssh_key_path: /home/rke/.ssh/id_rsa
    ssh_cert: ""
    ssh_cert_path: ""
    labels:
      node-role.kubernetes.io/master: true
  - address: rke-master-1
    port: "22"
    internal_address: ""
    role:
      - controlplane
      - etcd
    hostname_override: ""
    user: rke
    docker_socket: /var/run/docker.sock
    ssh_key: ""
    ssh_key_path: /home/rke/.ssh/id_rsa
    ssh_cert: ""
    ssh_cert_path: ""
    labels:
      node-role.kubernetes.io/master: true
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: "/etcdcluster"
    snapshot: null
    retention: ""
    creation: ""
    backup_config:
      enabled: true
      interval_hours: 6
      retention: 30
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
  kubelet:
    image: ""
    extra_args: {}
#      cloud-provider: "external"
    extra_binds: []
    extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
network:
  plugin: canal
  options: {}
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include: []
system_images:
  etcd: rancher/coreos-etcd:v3.2.24-rancher1
  alpine: rancher/rke-tools:v0.1.27
  nginx_proxy: rancher/rke-tools:v0.1.27
  cert_downloader: rancher/rke-tools:v0.1.27
  kubernetes_services_sidecar: rancher/rke-tools:v0.1.27
  kubedns: rancher/k8s-dns-kube-dns:1.15.0
  dnsmasq: rancher/k8s-dns-dnsmasq-nanny:1.15.0
  kubedns_sidecar: rancher/k8s-dns-sidecar:1.15.0
  kubedns_autoscaler: rancher/cluster-proportional-autoscaler:1.0.0
  coredns: coredns/coredns:1.2.6
  coredns_autoscaler: rancher/cluster-proportional-autoscaler:1.0.0
  kubernetes: sdevd/hyperkube:v1.14.1-rancher1-zfs
  flannel: rancher/coreos-flannel:v0.10.0-rancher1
  flannel_cni: rancher/flannel-cni:v0.3.0-rancher1
  calico_node: rancher/calico-node:v3.4.0
  calico_cni: rancher/calico-cni:v3.4.0
  calico_controllers: ""
  calico_ctl: rancher/calico-ctl:v2.0.0
  canal_node: rancher/calico-node:v3.4.0
  canal_cni: rancher/calico-cni:v3.4.0
  canal_flannel: rancher/coreos-flannel:v0.10.0
  weave_node: weaveworks/weave-kube:2.5.0
  weave_cni: weaveworks/weave-npc:2.5.0
  pod_infra_container: rancher/pause:3.1
  ingress: rancher/nginx-ingress-controller:0.21.0-rancher3
  ingress_backend: rancher/nginx-ingress-controller-defaultbackend:1.4-rancher1
  metrics_server: rancher/metrics-server:v0.3.1
ssh_key_path: /home/rke/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: false
kubernetes_version: ""
private_registries: []
ingress:
  provider: none
  options: {}
  node_selector: {}
  extra_args: {}
cloud_provider:
  name: openstack
  openstackCloudProvider:
    global:
      username: XXX
      password: XXX
      auth-url: XXX
      tenant-id: XXX
      tenant-name: XXX
      region: XXX
      domain-name: XXX
prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
monitoring:
  provider: ""
  options: {}
restore:
  restore: false
  snapshot_name: ""
dns: null

mubn on 3 Sep 2019

Available as of RKE v1.1.0-rc5

deniseschannon on 11 Feb 2020

I was able to reproduce this in rke v0.2.7 as originally reported. Sample YAML:

nodes:
  - address: x.x.x.x
    user: root
    role: [controlplane,worker,etcd]

services:
  etcd:
    backup_config:
      interval_hours: 6
      retention: 30
    snapshot: true
    creation: 6h
    retention: 24h

Steps:

$ rke up --config ./cluster.yml --ssh-agent-auth
$ rke etcd snapshot-save --name rke-snapshot-test-1 --config ./cluster.yml --ssh-agent-auth
$ rke etcd snapshot-restore --name rke-snapshot-test-1 --config ./cluster.yml --ssh-agent-auth
...
FATA[0008] failed to prepare backup: restoring S3 backups with no cluster level S3 configuration is not supported

With rke 1.1.0-rc5 I was able to restore the above backup without issues, because the backup source is now determined by the presence of the YAML keys under services.etcd.

Scenarios tested:

- with minimal backup_config (like the sample YAML above)
- without backup_config (defaults to local)
- with backup_config and s3backupconfig set correctly (save and restore to S3)
- with empty/null s3backupconfig (defaults to local)
- with correct s3backupconfig but a wrong backup name during restore
  - fails to download from S3, as expected
- without s3backupconfig.region (still backs up to the correct S3 bucket/folder)
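The rule described above (the backup source follows from which keys are present under services.etcd) could look roughly like the sketch below. This is not the rke implementation: the BackupConfig and S3BackupConfig structs are trimmed-down, hypothetical stand-ins, and snapshotSource is an illustrative helper name.

```go
package main

import "fmt"

// S3BackupConfig is a hypothetical, reduced stand-in for the S3 settings.
type S3BackupConfig struct {
	BucketName string
	Endpoint   string
}

// BackupConfig is a hypothetical, reduced stand-in for services.etcd.backup_config.
type BackupConfig struct {
	IntervalHours  int
	Retention      int
	S3BackupConfig *S3BackupConfig
}

// snapshotSource returns "s3" only when S3 settings are actually present.
// A missing backup_config, or a missing or empty s3backupconfig, means "local".
func snapshotSource(cfg *BackupConfig) string {
	if cfg == nil || cfg.S3BackupConfig == nil {
		return "local"
	}
	if *cfg.S3BackupConfig == (S3BackupConfig{}) {
		return "local"
	}
	return "s3"
}

func main() {
	minimal := &BackupConfig{IntervalHours: 6, Retention: 30}
	withS3 := &BackupConfig{
		IntervalHours:  6,
		Retention:      30,
		S3BackupConfig: &S3BackupConfig{BucketName: "example-bucket", Endpoint: "s3.example.com"},
	}
	fmt.Println(snapshotSource(nil))     // no backup_config
	fmt.Println(snapshotSource(minimal)) // minimal backup_config
	fmt.Println(snapshotSource(withS3))  // backup_config with S3 settings
}
```

Under this rule the snapshot name plays no part in choosing the source, so a custom name such as rke-snapshot-test-1 restores from local disk whenever no S3 settings are configured.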

izaac on 11 Feb 2020

I generated a new config and upgraded the cluster with rke 0.2.8, and I got the same error, "restoring S3 backups with no cluster level S3 configuration is not supported", when restoring again. With the etcd backup config enabled, rke tries to get the backup from S3.
backup_config:
      interval_hours: 6
      retention: 30
Without the backup config the restore works fine:
backup_config: null
Tested commands:
rke etcd snapshot-save --name test-snapshot-040919-4 --config /etc/rke/cluster.yml
rke etcd snapshot-restore --name test-snapshot-040919-4 --config /etc/rke/cluster.yml

Restore works with no issues. The steps are different, though:
1) Bring up Kubernetes with rke up.
2) When that is done, prune the cluster, but do not remove it:
docker rm -vf $(docker ps -aq)
docker rmi -f $(docker images -aq)
docker volume prune -f
3) You need only one snapshot, taken from the master etcd node; put it in /opt/rke/etcd-snapshots.
4) Copy the rancher-cluster.rkestate and kube_config_rancher-cluster.yml files of the cluster that you backed up to the folder from which you run the restore.
5) Run rke etcd snapshot-restore and it will redeploy the cluster.
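Step 4 is easy to get wrong, so a small pre-flight check may help. The sketch below derives the two companion file names from the cluster config path and reports which ones are missing. The naming rule is inferred from the file names seen in this thread (rke-config.yml gives rke-config.rkestate; rancher-cluster.yml gives kube_config_rancher-cluster.yml); companionFiles and missing are hypothetical helpers, not part of rke.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// companionFiles derives the state file and kubeconfig paths that are
// expected to sit next to the cluster config (naming assumed, see lead-in).
func companionFiles(configPath string) (stateFile, kubeConfig string) {
	dir := filepath.Dir(configPath)
	base := filepath.Base(configPath)
	stem := strings.TrimSuffix(base, filepath.Ext(base))
	return filepath.Join(dir, stem+".rkestate"), filepath.Join(dir, "kube_config_"+base)
}

// missing returns the subset of paths that cannot be stat'ed.
func missing(paths ...string) []string {
	var out []string
	for _, p := range paths {
		if _, err := os.Stat(p); err != nil {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	state, kube := companionFiles("rancher-cluster.yml")
	fmt.Println("expecting:", state, kube)
	for _, p := range missing(state, kube) {
		fmt.Println("missing:", p)
	}
}
```

The check only covers the files on the machine that runs rke; the snapshot itself still has to be in /opt/rke/etcd-snapshots on the etcd node, as in step 3.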