Rke: RKE restore etcd : Error: snapshot missing hash but --skip-hash-check=false

Created on 25 Jul 2019 · 16 Comments · Source: rancher/rke

RKE version:
v0.2.6

Docker version:

Client:
 Version:         1.13.1
 API version:     1.26
 Package version: docker-1.13.1-60.git9cb56fd.fc28.x86_64

Operating system and kernel
Fedora 28 (Atomic Host) 4.17.11-200.fc28.x86_64

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
OpenStack instance

cluster.yml file:

rancher-cluster.yml

nodes:
  - address: 10.57.241.146
    internal_address: 192.168.99.68
    user: fedora
    role: [controlplane,worker,etcd]
    ssh_key_path: /home/fedora/.ssh/id_rsa
  - address: 10.57.241.148
    internal_address: 192.168.99.70
    user: fedora
    role: [controlplane,worker,etcd]
    ssh_key_path: /home/fedora/.ssh/id_rsa
  - address: 10.57.241.149
    internal_address: 192.168.99.69
    user: fedora
    role: [controlplane,worker,etcd]
    ssh_key_path: /home/fedora/.ssh/id_rsa

private_registries:
  - url: 10.57.241.229:5000
    is_default: true

rancher-cluster-restore.yml (keeps only the destroyed and rebuilt node3)

nodes:
  - address: 10.57.241.149
    internal_address: 192.168.99.69
    user: fedora
    role: [controlplane,worker,etcd]
    ssh_key_path: /home/fedora/.ssh/id_rsa

private_registries:
  - url: 10.57.241.229:5000
    is_default: true

Steps to Reproduce:

  1. Create 3 nodes and set up Rancher HA.
  2. Run the command below to save a snapshot:
    rke etcd snapshot-save --name 20190725-093400 --config rancher-cluster.yml
  3. Destroy node3 (10.57.241.149) and rebuild it.
  4. Run the command below to restore the snapshot on node3:
    rke etcd snapshot-restore --name 20190725-093400.zip --config rancher-cluster-restore.yml

Results:

[root@tpe-liberty-alex-fedora-1 restote]# rke etcd snapshot-restore --name 20190725-093400.zip --config rancher-cluster-restore.yml
INFO[0000] Restoring etcd snapshot 20190725-093400.zip
INFO[0000] Successfully Deployed state file at [./rancher-cluster-restore.rkestate]
INFO[0000] [dialer] Setup tunnel for host [10.57.241.149]
WARN[0011] failed to stop etcd container on host [10.57.241.149]: Can't stop Docker container [etcd] for host [10.57.241.149]: Error response from daemon: No such container: etcd
INFO[0011] [etcd] starting backup server on host [10.57.241.149]
INFO[0019] [etcd] Successfully started [etcd-Serve-backup] container on host [10.57.241.149]
INFO[0035] [remove/etcd-Serve-backup] Successfully removed container on host [10.57.241.149]
INFO[0035] [etcd] Checking if all snapshots are identical
INFO[0044] [etcd] Successfully started [etcd-checksum-checker] container on host [10.57.241.149]
INFO[0044] Waiting for [etcd-checksum-checker] container to exit on host [10.57.241.149]
INFO[0050] [etcd] Checksum of etcd snapshot on host [10.57.241.149] is [f586f0c56e06b56df9f63a0ff17e54dd]
INFO[0050] Cleaning old kubernetes cluster
INFO[0050] [worker] Tearing down Worker Plane..
INFO[0050] [worker] Successfully tore down Worker Plane..
INFO[0050] [controlplane] Tearing down the Controller Plane..
INFO[0050] [controlplane] Successfully tore down Controller Plane..
INFO[0050] [etcd] Tearing down etcd plane..
INFO[0050] [etcd] Successfully tore down etcd plane..
INFO[0050] [hosts] Cleaning up host [10.57.241.149]
INFO[0050] [hosts] Cleaning up host [10.57.241.149]
INFO[0050] [hosts] Running cleaner container on host [10.57.241.149]
INFO[0072] [kube-cleaner] Successfully started [kube-cleaner] container on host [10.57.241.149]
INFO[0072] Waiting for [kube-cleaner] container to exit on host [10.57.241.149]
INFO[0075] [hosts] Removing cleaner container on host [10.57.241.149]
INFO[0076] [hosts] Removing dead container logs on host [10.57.241.149]
INFO[0089] [cleanup] Successfully started [rke-log-cleaner] container on host [10.57.241.149]
INFO[0091] [remove/rke-log-cleaner] Successfully removed container on host [10.57.241.149]
INFO[0091] [hosts] Successfully cleaned up host [10.57.241.149]
INFO[0091] [etcd] Restoring [20190725-093400.zip] snapshot on etcd host [10.57.241.149]
INFO[0094] [etcd] Successfully started [etcd-restore] container on host [10.57.241.149]
INFO[0094] Waiting for [etcd-restore] container to exit on host [10.57.241.149]
INFO[0094] Container [etcd-restore] is still running on host [10.57.241.149]
INFO[0095] Waiting for [etcd-restore] container to exit on host [10.57.241.149]
INFO[0095] Container [etcd-restore] is still running on host [10.57.241.149]
INFO[0096] Waiting for [etcd-restore] container to exit on host [10.57.241.149]
INFO[0096] Container [etcd-restore] is still running on host [10.57.241.149]
INFO[0097] Waiting for [etcd-restore] container to exit on host [10.57.241.149]
FATA[0100] [etcd] Failed to restore etcd snapshot: Failed to run etcd restore container, exit status is: 128, container logs: Error: snapshot missing hash but --skip-hash-check=false

Done

All 16 comments

Having the same issue.

The name of the snapshot is 20190725-093400, not including the extension.
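
As a minimal sketch of that advice, the helper below builds the restore command line with a trailing `.zip` dropped from the name. The function name `build_restore_args` is hypothetical and not part of RKE; only the `rke etcd snapshot-restore --name ... --config ...` invocation comes from the report above. It assumes the snapshot was saved under a name that does not itself end in `.zip`.

```python
def build_restore_args(snapshot_name, config_path):
    """Return the argv for `rke etcd snapshot-restore` (hypothetical helper).

    Assumes the snapshot was saved without ".zip" in its name, so a trailing
    ".zip" is only the archive extension and must not be passed to --name.
    """
    if snapshot_name.endswith(".zip"):
        snapshot_name = snapshot_name[: -len(".zip")]
    return [
        "rke", "etcd", "snapshot-restore",
        "--name", snapshot_name,
        "--config", config_path,
    ]


if __name__ == "__main__":
    # Prints the command from the report with the extension removed.
    print(" ".join(build_restore_args("20190725-093400.zip",
                                      "rancher-cluster-restore.yml")))
```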

I think it would be wise to have a clearer message here. Thanks to Google's indexing of GitHub I ended up straight at this issue, but that may not always be the case. Trapping the fact that the name ends with .zip should not be an issue (okay, there is an edge case where you really want to name your backup with .zip at the end, but who does that?).

This issue/PR has been automatically marked as stale because it has not had activity (commit/comment/label) for 60 days. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

Unstale

This issue/PR has been automatically marked as stale because it has not had activity (commit/comment/label) for 60 days. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

/reopen

@fredleger The thing that I can think of is to log a warning that the given name contains the extension, and that this is not needed. Possibly error out. Is that what you are looking for or do you have other concerns?
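
The warning-only option could look roughly like the sketch below. It is an illustration in Python, not RKE's actual code; the function name `warn_on_extension` and the logger name are made up, and the message text mirrors the warning that appears in the test logs later in this thread. The name is deliberately left unchanged, so a snapshot that really was saved with `.zip` in its name can still be used.

```python
import logging

logger = logging.getLogger("snapshot-name-check")  # hypothetical logger name


def warn_on_extension(snapshot_name, extension=".zip"):
    """Log a warning if the name ends with the archive extension.

    Returns True when a warning was logged. The name itself is not modified.
    """
    if snapshot_name.endswith(extension):
        logger.warning(
            "The snapshot name [%s] ends with the file extension (%s) which "
            "is not needed, the snapshot name should be provided without "
            "the extension",
            snapshot_name, extension,
        )
        return True
    return False
```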

Hi, @superseb
I had the same problem with v1.2.3.

I unzipped the zip archive and restored the data with the following command:
ETCDCTL_API=3 etcdctl snapshot restore xxxxxxxxx --data-dir="/var/lib/etcd"

Etcd data was successfully recovered through etcdctl with no hash check errors.
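
The manual workaround above can be sketched as two small steps: extract the archive, then build the `etcdctl` call quoted in the comment. The function names are hypothetical, and the layout of files inside the archive is not described in this thread, so the caller has to pick which extracted file is the snapshot.

```python
import os
import tempfile
import zipfile


def extract_snapshot(archive_path, dest_dir):
    """Extract every member of the snapshot archive into dest_dir.

    Returns the paths of the extracted regular files.
    """
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest_dir)
        names = [n for n in archive.namelist() if not n.endswith("/")]
    return [os.path.join(dest_dir, n) for n in names]


def etcdctl_restore_command(snapshot_file, data_dir="/var/lib/etcd"):
    """Build the environment and argv for the etcdctl call quoted above."""
    env = dict(os.environ, ETCDCTL_API="3")
    argv = ["etcdctl", "snapshot", "restore", snapshot_file,
            "--data-dir=" + data_dir]
    return env, argv


if __name__ == "__main__":
    # Demo with a throwaway archive; the member name "snapshot1" is made up.
    with tempfile.TemporaryDirectory() as tmp:
        archive_path = os.path.join(tmp, "snapshot1.zip")
        with zipfile.ZipFile(archive_path, "w") as archive:
            archive.writestr("snapshot1", b"not a real etcd snapshot")
        files = extract_snapshot(archive_path, os.path.join(tmp, "out"))
        env, argv = etcdctl_restore_command(files[0])
        print("ETCDCTL_API=" + env["ETCDCTL_API"], " ".join(argv))
```

On an etcd host, the returned pair could be passed to `subprocess.run(argv, env=env, check=True)`; the sketch only builds the command so that it can be inspected first.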

Yes, I was thinking that a simple warning is enough. But it is needed, because this was not so easy to debug.

Tested the fix on rke 1.2.6-rc2
Steps

  1. Created an RKE cluster

rke1.2.6 % ./rke up
RKE built the cluster successfully; the logs are below.
--- rke up logs -----
WARN[0000] This is not an officially supported version (v1.2.6-rc2) of RKE. Please download the latest official release at https://github.com/rancher/rke/releases
INFO[0000] Running RKE version: v1.2.6-rc2
INFO[0000] Initiating Kubernetes cluster
...
...
INFO[0139] [ingress] Setting up nginx ingress controller
INFO[0139] [addons] Saving ConfigMap for addon rke-ingress-controller to Kubernetes
INFO[0139] [addons] Successfully saved ConfigMap for addon rke-ingress-controller to Kubernetes
INFO[0139] [addons] Executing deploy job rke-ingress-controller
INFO[0150] [ingress] ingress controller nginx deployed successfully
INFO[0150] [addons] Setting up user addons
INFO[0150] [addons] no user addons defined
INFO[0150] Finished building Kubernetes cluster successfully

  1. Took a snapshot named snapshot1
    ./rke etcd snapshot-save --name snapshot1

--- logs for creating snapshot1 ------
WARN[0000] This is not an officially supported version (v1.2.6-rc2) of RKE. Please download the latest official release at https://github.com/rancher/rke/releases
INFO[0000] Running RKE version: v1.2.6-rc2
INFO[0000] Starting saving snapshot on etcd hosts
....
....
INFO[0012] Finished saving/uploading snapshot [snapshot1] on all etcd hosts

  1. Restored from snapshot1 but included the .zip extension in the name
    ./rke etcd snapshot-restore --name snapshot1.zip

---- log for the restore, which failed because the name included .zip --

rke1.2.6 % ./rke etcd snapshot-restore --name snapshot1.zip
WARN[0000] This is not an officially supported version (v1.2.6-rc2) of RKE. Please download the latest official release at https://github.com/rancher/rke/releases
INFO[0000] Running RKE version: v1.2.6-rc2
WARN[0000] The snapshot name [snapshot1.zip] ends with the file extension (.zip) which is not needed, the snapshot name should be provided without the extension
INFO[0000] Checking if state file is included in snapshot file for [snapshot1.zip]
....
....
INFO[0027] Waiting for [etcd-restore] container to exit on host [x.x.x.x]
INFO[0028] Removing container [etcd-restore] on host [x.x.x.x], try #1
FATA[0028] [etcd] Failed to restore etcd snapshot: Failed to run etcd restore container, exit status is: 1, container logs: {"level":"info","ts":1612205309.9264648,"caller":"snapshot/v3_snapshot.go:296","msg":"restoring snapshot","path":"/opt/rke/etcd-snapshots/snapshot1.zip","wal-dir":"/opt/rke/etcd-snapshots-restore/member/wal","data-dir":"/opt/rke/etcd-snapshots-restore/","snap-dir":"/opt/rke/etcd-snapshots-restore/member/snap"}
Error: snapshot missing hash but --skip-hash-check=false

  1. Tried the restore again without the .zip extension
    rke1.2.6 % ./rke etcd snapshot-restore --name snapshot1

The RKE cluster was restored successfully when the snapshot name didn't include the extension.
------------------ restore logs-------------

WARN[0000] This is not an officially supported version (v1.2.6-rc2) of RKE. Please download the latest official release at https://github.com/rancher/rke/releases
INFO[0000] Running RKE version: v1.2.6-rc2
INFO[0000] Checking if state file is included in snapshot file for [snapshot1]
....
....
INFO[0106] [addons] Saving ConfigMap for addon rke-ingress-controller to Kubernetes
INFO[0106] [addons] Successfully saved ConfigMap for addon rke-ingress-controller to Kubernetes
INFO[0106] [addons] Executing deploy job rke-ingress-controller
INFO[0106] [ingress] ingress controller nginx deployed successfully
INFO[0106] [addons] Setting up user addons
INFO[0107] [addons] no user addons defined
INFO[0107] Finished building Kubernetes cluster successfully
INFO[0107] Restarting network, ingress, and metrics pods
I0201 10:52:31.675518 9527 request.go:655] Throttling request took 1.014137765s, request: DELETE:https://x.x.x.x:6443/api/v1/namespaces/kube-system/pods/canal-2gfkb?timeout=30s
INFO[0108] Finished restoring snapshot [snapshot1] on all etcd hosts

Reopening the bug: if the snapshot name has an extension, we need to issue a warning and error out instead of continuing and failing at the end.

It would be even better if we errored out and printed the usage along with it. That way, users would be better guided.

We can't error out, because the user could have chosen to name the snapshot with a .zip extension (so snapshot.zip.zip as the snapshot file), and we don't want to completely block the user from using a snapshot in that case.

@superseb, so in this case the user needs to send ^C to terminate the process? And the fix will remain as it is now? Let me know and I can close this, since it works if I don't provide the .zip extension.

What about stripping .zip/.tar.gz extensions from the name and warning the end user about it?

IMHO the case where the user wants to name their snapshot .zip.zip is not one we need to cover.

Regards

We never enforced a naming convention in the beginning, so we can't enforce or change anything now. The main issue is that snapshots used not to be archives and then we switched, so we need to support both formats and can't enforce either one. Stripping the extension automatically, or other magic, is just going to be confusing because it happens behind the user's back. That's why I asked at the beginning whether a warning was enough.

@sadiapoddar If you think we need more guidance in this situation we can file another issue that covers fine tuning the whole process which will probably involve better checking for file existence (and possibly suggest or try multiple names based on input), but that scope is way bigger than this and needs to be designed.
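
A rough sketch of the "check for file existence and suggest names" idea follows. It is not RKE code: the function name `suggest_snapshot_names` is hypothetical, and the on-disk layout is an assumption (an archived snapshot named N stored as `N.zip`, an older non-archived one stored as a plain file `N`). The directory `/opt/rke/etcd-snapshots` used in the comment is the path shown in the restore log above.

```python
import os
import tempfile
import zipfile


def suggest_snapshot_names(snapshot_dir, name):
    """Return snapshot names worth trying for --name, most likely first.

    Assumed layout (not confirmed by RKE): an archived snapshot called N is
    stored as N.zip, and an older, non-archived one as a plain file N.
    """
    suggestions = []

    def add(candidate):
        if candidate not in suggestions:
            suggestions.append(candidate)

    as_given = os.path.join(snapshot_dir, name)
    # The snapshot was saved under exactly this name and archived afterwards.
    if os.path.isfile(as_given + ".zip"):
        add(name)
    if os.path.isfile(as_given):
        if name.endswith(".zip") and zipfile.is_zipfile(as_given):
            # The file is an archive, so ".zip" is the extension rather than
            # part of the snapshot name.
            add(name[: -len(".zip")])
        else:
            # A plain, non-archived snapshot stored under this exact name.
            add(name)
    return suggestions


if __name__ == "__main__":
    # Demo in a temporary directory standing in for /opt/rke/etcd-snapshots.
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(os.path.join(tmp, "snapshot1.zip"), "w") as z:
            z.writestr("snapshot1", b"placeholder")
        print(suggest_snapshot_names(tmp, "snapshot1.zip"))
```

Checking the file content with `zipfile.is_zipfile` rather than relying on the extension is what lets the sketch tell a snapshot deliberately named `snapshot.zip` (stored as `snapshot.zip.zip`) apart from an archive of `snapshot`.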

Tested the fix on rke 1.2.6-rc2 and rke 1.3.0-rc1

Test1. snapshot name didn't have any extension.

  1. Took a snapshot named snapshot1
    ./rke etcd snapshot-save --name snapshot1

--- logs for creating snapshot1 ------
WARN[0000] This is not an officially supported version (v1.2.6-rc2) of RKE. Please download the latest official release at https://github.com/rancher/rke/releases
INFO[0000] Running RKE version: v1.2.6-rc2
INFO[0000] Starting saving snapshot on etcd hosts
....
....
INFO[0012] Finished saving/uploading snapshot [snapshot1] on all etcd hosts

  1. Verified that during snapshot restore, if the snapshot name has an extension, we currently generate a warning for the user.

rke1.2.6 % ./rke etcd snapshot-restore --name snapshot1.zip
WARN[0000] This is not an officially supported version (v1.2.6-rc2) of RKE. Please download the latest official release at https://github.com/rancher/rke/releases
INFO[0000] Running RKE version: v1.2.6-rc2
WARN[0000] The snapshot name [snapshot1.zip] ends with the file extension (.zip) which is not needed, the snapshot name should be provided without the extension
INFO[0000] Checking if state file is included in snapshot file for [snapshot1.zip]
....
....
INFO[0028] Removing container [etcd-restore] on host [x.x.x.x], try #1
FATA[0028] [etcd] Failed to restore etcd snapshot: Failed to run etcd restore container, exit status is: 1, container logs: {"level":"info","ts":1612205309.9264648,"caller":"snapshot/v3_snapshot.go:296","msg":"restoring snapshot","path":"/opt/rke/etcd-snapshots/snapshot1.zip","wal-dir":"/opt/rke/etcd-snapshots-restore/member/wal","data-dir":"/opt/rke/etcd-snapshots-restore/","snap-dir":"/opt/rke/etcd-snapshots-restore/member/snap"}
Error: snapshot missing hash but --skip-hash-check=false

  1. Verified that the restore passes when the snapshot name is given exactly as it was when the snapshot was taken.

rke1.2.6 % ./rke etcd snapshot-restore --name snapshot1

The RKE cluster was restored successfully when the snapshot name didn't include the extension.
------------------ restore logs-------------

WARN[0000] This is not an officially supported version (v1.2.6-rc2) of RKE. Please download the latest official release at https://github.com/rancher/rke/releases
INFO[0000] Running RKE version: v1.2.6-rc2
INFO[0000] Checking if state file is included in snapshot file for [snapshot1]
....
....
INFO[0107] Finished building Kubernetes cluster successfully
INFO[0107] Restarting network, ingress, and metrics pods
I0201 10:52:31.675518 9527 request.go:655] Throttling request took 1.014137765s, request: DELETE:https://x.x.x.x:6443/api/v1/namespaces/kube-system/pods/canal-2gfkb?timeout=30s
INFO[0108] Finished restoring snapshot [snapshot1] on all etcd hosts

Test2: snapshot name given an extension when the snapshot was taken

Also tested taking a snapshot with the .zip extension added to the snapshot name:

./rke etcd snapshot-save --name snaphost1.zip

During restore, I used the name exactly as it was given when saving and saw a warning, but the snapshot was restored successfully.
./rke etcd snapshot-restore --name snaphost1.zip

----- logs ------
INFO[0133] [addons] Metrics Server deployed successfully
INFO[0133] [ingress] Setting up nginx ingress controller
INFO[0133] [addons] Saving ConfigMap for addon rke-ingress-controller to Kubernetes
INFO[0133] [addons] Successfully saved ConfigMap for addon rke-ingress-controller to Kubernetes
INFO[0133] [addons] Executing deploy job rke-ingress-controller
INFO[0133] [ingress] ingress controller nginx deployed successfully
INFO[0133] [addons] Setting up user addons
INFO[0133] Finished building Kubernetes cluster successfully
INFO[0133] Restarting network, ingress, and metrics pods
I0203 10:47:25.009762 29074 request.go:655] Throttling request took 1.024053481s, request: DELETE:https://x.x.x.x/api/v1/namespaces/kube-system/pods/calico-kube-controllers-7fbff695b4-ntj55?timeout=30s

Currently, the snapshot restore works if the user provides the snapshot name exactly as it was given when the snapshot was taken. If the user adds an extra extension to the snapshot name, the rke command generates a warning about the name and then fails.
