Etcd: Improve etcd upgrade/downgrade policy and tests

Created on 9 Feb 2018  ·  5 comments  ·  Source: etcd-io/etcd

We don't have enough test coverage on upgrades (and none for downgrades). The only test case is an upgrade from the latest release to the master branch (https://github.com/coreos/etcd/blob/master/e2e/etcd_release_upgrade_test.go), where we stop and restart members with the new etcd version (master branch) in CI.

  • Clearly document compatibility between different versions
  • Terminate early (or warn) on unsafe upgrades/downgrades
  • Add more test cases (or document them)

    • What if a newer version of etcd joins an older-versioned cluster, and vice versa?

    • What if a newer version of etcd boots from a snapshot fetched from an older-versioned etcd cluster?

    • https://github.com/coreos/etcd/issues/6457
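
The "terminate early on unsafe upgrades" idea could look roughly like the following cluster-version gate. This is an illustrative sketch, not etcd's actual implementation; the function names `parseMajorMinor` and `safeToJoin` are hypothetical:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMajorMinor extracts the major and minor components of a
// version string such as "3.2.11".
func parseMajorMinor(v string) (major, minor int, err error) {
	parts := strings.SplitN(v, ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("malformed version %q", v)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, err
	}
	if minor, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// safeToJoin reports whether a member running memberVer may join a
// cluster whose cluster version is clusterVer: same major version and
// at most one minor version apart, in either direction.
func safeToJoin(memberVer, clusterVer string) bool {
	mMaj, mMin, err := parseMajorMinor(memberVer)
	if err != nil {
		return false
	}
	cMaj, cMin, err := parseMajorMinor(clusterVer)
	if err != nil {
		return false
	}
	if mMaj != cMaj {
		return false
	}
	diff := mMin - cMin
	return diff >= -1 && diff <= 1
}

func main() {
	// Newer member joining an older cluster: one minor ahead, allowed.
	fmt.Println(safeToJoin("3.2.0", "3.1.11"))
	// Two minor versions apart: refuse to start instead of risking corruption.
	fmt.Println(safeToJoin("3.0.17", "3.2.9"))
}
```

A server-side check like this would let an incompatible member fail fast at join time rather than surfacing subtle data issues later.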

ref. https://github.com/coreos/etcd/issues/7308

/cc @jpbetz @saranbalaji90

Labels: doc, feature, testing, stale


All 5 comments

I was able to upgrade my cluster by performing a rolling update. I initially had a 3-node cluster on version 3.0.17 and upgraded it to 3.1.11 by removing the old nodes one by one and adding a new node each time. It seems to be working, but I have yet to run e2e tests on this new cluster.
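
The rolling replacement described above works because a Raft majority is preserved at every step. A toy sketch of the quorum arithmetic (illustrative Go, not etcd code):

```go
package main

import "fmt"

// quorum returns the number of members required for a Raft majority.
func quorum(n int) int { return n/2 + 1 }

func main() {
	// Rolling replacement of a 3-member cluster: remove one old member,
	// then add its replacement, so the cluster size oscillates between
	// 2 and 3 and a majority stays reachable throughout.
	size := 3
	for step := 1; step <= 3; step++ {
		size--
		fmt.Printf("step %d: removed old member, size=%d, quorum=%d\n", step, size, quorum(size))
		size++
		fmt.Printf("step %d: added new member,  size=%d, quorum=%d\n", step, size, quorum(size))
	}
}
```

Note that while the cluster is temporarily at two members, quorum is also 2, so there is no fault tolerance during each replacement window; the remaining members must stay healthy until the new node has joined.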

Do we have a list of the things etcd performs when you drop in a new binary and start etcd again with it?
Also, I'm curious: when upgrading from 3.0 to 3.2, why do we recommend upgrading to 3.1 first and only then to 3.2? Is it because etcd changes some underlying schema that restricts the upgrade, or is it just that we haven't tested it yet?
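
On the one-minor-at-a-time recommendation: etcd's upgrade guides only support mixed clusters whose members are within one minor version of the cluster version, so a 3.0 cluster has to pass through 3.1 on its way to 3.2. A hypothetical helper (`upgradePath` is not a real etcd function) illustrating the stepwise path:

```go
package main

import "fmt"

// upgradePath returns the sequence of minor releases to step through,
// one at a time, when moving between two minors of the same major.
func upgradePath(major, fromMinor, toMinor int) []string {
	var path []string
	for m := fromMinor; m <= toMinor; m++ {
		path = append(path, fmt.Sprintf("%d.%d", major, m))
	}
	return path
}

func main() {
	// Moving a 3.0 cluster to 3.2 requires an intermediate stop at 3.1,
	// since mixed clusters only tolerate adjacent minor versions.
	fmt.Println(upgradePath(3, 0, 2)) // [3.0 3.1 3.2]
}
```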

I am working on the "etcd downgrade design" documentation. The basic idea is to add an "etcdctl downgrade --target-version" command to initiate the downgrade process, which enables the cluster to allow member replacement with the target version. More details can be found in the design doc. @gyuho, @xiang90, @jpbetz, and @jingyih have reviewed the design doc. I have also posted a topic on etcd-dev.

  • [ ] add “downgrade” API, to temporarily whitelist lower versions in cluster version check.
  • [ ] add basic tests.
  • [ ] add other “downgrade” APIs (such as status, cancel).
  • [ ] implement unknown WAL log entry handling code.
  • [ ] more tests.
  • [ ] add downgrade guides (including developer responsibilities).
  • [ ] commit final design doc to /docs.
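
The "temporarily whitelist lower versions" item above could behave roughly like this sketch. This is hypothetical Go, not the proposed implementation; the `Cluster` type and `DowngradeTarget` field are invented for illustration:

```go
package main

import "fmt"

// Cluster holds the agreed cluster version plus an optional downgrade
// target set by a hypothetical "etcdctl downgrade --target-version".
type Cluster struct {
	Version         string // e.g. "3.4"
	DowngradeTarget string // empty when no downgrade is in progress
}

// accepts reports whether a member at memberVer may join: normally the
// member must be at or above the cluster version, but while a downgrade
// is in progress the target version is temporarily whitelisted.
func (c Cluster) accepts(memberVer string) bool {
	// Lexical comparison is sufficient for these single-digit examples;
	// a real check would parse and compare semantic versions.
	if memberVer >= c.Version {
		return true
	}
	return c.DowngradeTarget != "" && memberVer == c.DowngradeTarget
}

func main() {
	c := Cluster{Version: "3.4"}
	fmt.Println(c.accepts("3.3")) // rejected: below cluster version

	c.DowngradeTarget = "3.3" // operator initiated a downgrade
	fmt.Println(c.accepts("3.3")) // now whitelisted for member replacement
}
```

The point of the gate is that the cluster-version check stays strict by default and is only relaxed for the single operator-approved target version, for the duration of the downgrade.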

@knisbet Kevin, would you please take a look at the design? We know you have a lot of valuable experience in this area; your input would be much appreciated.

cc @YoyinZyc

#11689: upgrading a cluster may cause data corruption.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
