Hi
I'm looking for a backup/restore procedure in the docs and can't find one. Any help here?
(source) ...step two is to completely rework the state management such that you no longer need to persist state in k8s. This is a much larger goal. I've fooled around a lot with the persistence layer in k8s (this project is running on sqlite3) so I'm basically doing a lot of work of figuring what state I can throw away. The theory is that all desired state comes from your yaml files. Actual state is actual (what really exists). Everything else in k8s should be purely derived and not important.
(source) ...but right now HA is not available, you can only create a high resilient system by backing up and restoring the single master
It sounds like the ultimate goal is for the backup/restore process to be dead simple. At this point, is all state that can't be regenerated automatically stored in SQLite or is there important state elsewhere?
The video from the meetup yesterday answers my question, and it sounds like all important state is indeed stored in SQLite:
(source) ...it's just a single file you have to backup if you want to manage it and do some type of automated backup and restore
(source) ...you can get a highly-resilient setup, which means if the master goes down, you can restore it very quickly somewhere else because the state is stored in a single file
If that's the case, a cron job regularly running sqlite3 .backup should do the trick. Then restoring should be as simple as spinning up a new master from your original image and running sqlite3 .restore. I haven't tested this out myself, though.
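An untested sketch of what that cron job could run (the database path is an assumption based on a default k3s install, and `backup_k3s_db` is just an illustrative name):

```shell
#!/bin/sh
# Untested sketch: take a consistent online snapshot of the k3s SQLite
# state file. sqlite3's .backup dot command works even while k3s holds
# the database open, unlike a plain cp.
backup_k3s_db() {
  db="${1:-/var/lib/rancher/k3s/server/db/state.db}"   # assumed default path
  dest_dir="${2:-/var/backups/k3s}"
  mkdir -p "$dest_dir"
  sqlite3 "$db" ".backup '$dest_dir/state-$(date +%Y%m%d%H%M%S).db'"
}
```

A cron entry could then just call this function from a small script once an hour or so.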
That's great. But sqlite should switch to etcd in HA mode, and then etcd snapshots can be used for this.
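If that switch happens, backup would presumably become a thin wrapper around etcd's built-in `snapshot save` command. A hedged sketch (the endpoint is an assumption, and `snapshot_etcd` is just an illustrative name):

```shell
#!/bin/sh
# Hedged sketch: if k3s moves to etcd for HA, etcd ships its own
# snapshot facility, so backup would no longer involve sqlite3 at all.
snapshot_etcd() {
  dest="$1"
  # Endpoint is an assumption; adjust to the actual etcd listen address.
  ETCDCTL_API=3 etcdctl --endpoints="${2:-https://127.0.0.1:2379}" \
    snapshot save "$dest"
}
```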
FYI: https://github.com/rancher/k3s/issues/42#issuecomment-468343584
Since we use sqlite and maybe etcd in the future it might be nice to have a solution which doesn't depend on implementation. I was trying to find a backup solution that might just use shell scripts and kubectl, not sure of the feasibility of something like that tho.
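One untested way to sketch that datastore-agnostic idea: export every listable namespaced resource with plain kubectl. This only captures API objects, not certs or tokens, so it's not a full state backup; `export_all_resources` is just an illustrative name.

```shell
#!/bin/sh
# Untested sketch of a datastore-agnostic backup: dump API objects via
# kubectl, independent of whether the backing store is sqlite or etcd.
export_all_resources() {
  outdir="${1:-.}"
  kubectl api-resources --verbs=list --namespaced -o name |
  while read -r res; do
    # One YAML file per resource type, across all namespaces.
    kubectl get "$res" --all-namespaces -o yaml > "$outdir/backup-$res.yaml"
  done
}
```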
So to confirm, cluster state backups are essentially just backing up /var/lib/rancher/k3s/server/db/state.db?
I have file/database backups, but my database backup solution (KubeDB) requires that the database superuser credentials are identical for WAL restores, and those are autogenerated and stored in secrets. So I need some way of keeping those secrets around if I have to restore backups, and I'd rather just let them be autogenerated and leave them in-cluster than manage them myself.
Thanks.
This issue is a bit old now, I'm wondering if things changed. Can we get some pointers about the _current_ recommended method of backing up/restoring a k3s environment?
Thanks!
+1
I'd like to reset k3s without having to download the container images again.
So beyond the database, I'm not sure which directories should also be restored after a reinstall.
I have a single master+agent server.
With some investigation, it appears restoring the state.db file works (as long as the shm and wal files are deleted before restore).
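For reference, that restore procedure could be scripted roughly like this (untested beyond the file shuffling; paths assume a default install, k3s must be stopped first, and `restore_k3s_db` is just an illustrative name):

```shell
#!/bin/sh
# Untested restore sketch based on the observation above: the stale
# WAL/SHM journal files must go, or SQLite may replay old state over
# the restored database.
restore_k3s_db() {
  backup="$1"
  db_dir="${2:-/var/lib/rancher/k3s/server/db}"   # assumed default path
  rm -f "$db_dir/state.db-shm" "$db_dir/state.db-wal"
  cp "$backup" "$db_dir/state.db"
}
# Usage (run with the k3s service stopped):
#   systemctl stop k3s
#   restore_k3s_db /var/backups/k3s/state-20190101000000.db
#   systemctl start k3s
```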
I looked through containerd folders and note that the image blobs are here:
/var/lib/rancher/k3s/agent/containerd/io.containerd.content.v1.content
However, deleting the other folders inside /var/lib/rancher/k3s/agent/containerd
and then restarting the k3s service appears to cause issues. My pods then show:
MountVolume.SetUp failed for volume "cert-manager-cainjector-token-92bgb" : failed to sync secret cache: timed out waiting for the condition
followed by a Pulling image event. This is unfortunate as I'd like to use k3s in a low-bandwidth location and pulling images constantly is painful.
@ibuildthecloud do you have any recommendations or should I just be snapshotting the entire "/var/lib/rancher/k3s" folder?
In my mind I should only need to back up the blobs and the database to revert to a working state with minimal downloads, but it doesn't appear to work that way.
@brandonkal this isn't really a backup/restore issue, but it sounds like you want the airgap install option with pre-downloaded images.
At least there needs to be documentation on how to backup k3s cluster state. It could be really simple, it just needs to be documented!
What other files should be backed up aside from the db?
I assume the following are required as well:
/var/lib/rancher/k3s/server/cred
/var/lib/rancher/k3s/server/node-token
/var/lib/rancher/k3s/server/tls
/var/lib/rancher/k3s/server/token
Can anyone confirm?
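Pending confirmation, here's an untested sketch that bundles the database plus those candidate paths into a single tarball. Whether that list is complete is exactly the open question in this thread; `k3s_backup_tar` is just an illustrative name.

```shell
#!/bin/sh
# Untested sketch: archive the db plus the credential/cert paths listed
# above. If the list turns out to be incomplete, add to the tar line.
k3s_backup_tar() {
  out="$1"
  dir="${2:-/var/lib/rancher/k3s/server}"   # assumed default path
  tar -czf "$out" -C "$dir" \
    db/state.db cred node-token tls token
}
```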