Kong: Kong no-downtime upgrades

Created on 16 May 2016 · 10 Comments · Source: Kong/kong

Currently, executing no-downtime upgrades between different Kong versions is tricky and not ideal. This issue has been known internally for a while, and I thought it would be a good idea to write it down.

Problem

The problem is that a newer Kong version may execute schema migrations on a database that is still being used by a previous version of Kong. Since some of the changes may be breaking, the older Kong nodes may start throwing errors, because the new Kong version has effectively updated the database under the hood with changes that conflict with the previous code.

Current solution

The latest versions of Kong are able to keep processing existing requests even when the database is down. A potential upgrade path consists of making the datastore unavailable to the current Kong nodes; Kong will still process existing requests because the entities are cached in memory.

At this point the new version of Kong can be started, which will execute the migrations on the datastore. Once the new version of Kong is up and running and the migrations have been completed, the load balancer can be instructed to remove the old Kong nodes, and start processing requests on the new Kong nodes.

This implies there is a load balancer in front of the Kong nodes.

Problems with the current solution

The above solution is not always acceptable for a few reasons:

  • It assumes a load balancer with a round-robin load-balancing policy. A different policy may not spread requests across every node, which means that when the database is disconnected not every node will have the same cached entities in memory, and some of them will be unable to process requests.
  • This solution is not 100% no-downtime. Assuming every node has the same cached entities, requests will keep working for existing consumers, but Kong will effectively be down for new consumption. It's selective uptime: up for previous consumers, down for any new consumption that requires talking to the datastore.
  • Executing the migrations manually won't fix these problems.

All 10 comments

Very rough idea to get the conversation started (a rough sketch of the flow follows the lists below).

  • The node that receives the kong migration command (let's call it the coordinator) would use the same pubsub system used by the cache invalidation to tell all the nodes to make the DB invisible/disconnected.
  • The coordinator also makes its own DB invisible/disconnected everywhere except to the migration script.
  • The coordinator does the DB migration.
  • Then the coordinator sends a reload signal to one node at a time, and finally reloads itself. Nodes come back with the DB visible.

Problem: new users can't consume Kong during that process.

Possible solutions:

  • Broadcast ALL changes to the nodes that haven't reloaded yet and apply those changes to the cache.
  • Document this edge case and make the reload aggressive (for example, reload as many as 1/4 of the nodes at a time) to mitigate it.
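
A minimal sketch of how that coordinator flow could be orchestrated, assuming placeholder `publish`, `run_migrations`, and `reload_node` helpers stand in for Kong's cluster pub/sub, the migration run, and the per-node reload signal (none of these are real Kong APIs):

```python
# Hypothetical orchestration of the coordinator idea sketched above.
# publish / run_migrations / reload_node are stand-ins, not real Kong APIs.
import time

NODES = ["kong-1", "kong-2", "kong-3"]  # assumed inventory of Kong nodes


def publish(event):
    """Stand-in for the pub/sub channel Kong uses for cache invalidation."""
    print("broadcast:", event)


def run_migrations():
    """Stand-in for running `kong migrations up` from the coordinator."""
    print("running migrations")


def reload_node(node):
    """Stand-in for sending a reload signal to a single node."""
    print("reloading", node)


def coordinate_upgrade():
    # 1. Tell every node (coordinator included) to stop talking to the DB
    #    and serve purely from the in-memory cache.
    publish("db_disconnect")
    # 2. Run the schema migrations while the old nodes ignore the DB.
    run_migrations()
    # 3. Reload nodes one at a time; each comes back with the DB visible
    #    and the new schema in effect.
    for node in NODES:
        reload_node(node)
        time.sleep(1)  # crude pacing; a real rollout would health-check each node


coordinate_upgrade()
```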

Both the first and second issues come down to in-memory cache misses. The cache is only filled if the data has been requested at least once. So can we instruct the Kong nodes to cache everything, just prior to an upgrade and disconnecting from the database?
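
There is no built-in command for that, but a warm-up script is conceivable: since entities are cached on first lookup, one proxied request per configured API on each node would pull them into memory before the database is cut off. A rough sketch, assuming 2016-era Admin/proxy ports and the `/apis` endpoint (the addresses below are made up):

```python
# Hypothetical cache warm-up: send one request per configured API through each
# node's proxy port so the entities get looked up (and cached) before the DB
# is disconnected. Host names and ports below are assumptions.
import requests

ADMIN_URL = "http://kong-admin:8001"                          # assumed Admin API
NODE_PROXIES = ["http://kong-1:8000", "http://kong-2:8000"]   # assumed proxy ports

apis = requests.get(f"{ADMIN_URL}/apis").json()["data"]
for proxy in NODE_PROXIES:
    for api in apis:
        host = api.get("request_host")
        if not host:
            continue  # path-routed APIs would need request_path handling instead
        # One request per API so the node resolves and caches it (plus its plugins).
        requests.get(proxy, headers={"Host": host}, timeout=5)
```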

@SGrondin I don't think the same Kong nodes come back; you need new ones, upgraded as well. So it's not just reloading.

Also, the migration could be done in a different namespace within the same database. This would go without any cache misses (see the sketch after this list). So:

  • stop modifications through management api
  • export data from existing namespace
  • import in new namespace
  • migrate in new namespace
  • start new Kong instance(s) against new namespace
  • route traffic to new instances (using LB, DNS, or whatever the infrastructure in use offers)
  • starve and shutdown old nodes
  • re-enable management api
  • remove old namespace

The only data lost is request-based data (e.g. rate-limiting counters). Effectively you're building a new cluster.
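
A rough sketch of the namespace idea for the Postgres case, using a second schema as the "new namespace" (for Cassandra the equivalent would be a new keyspace). The schema names and copy strategy are illustrative assumptions, not an existing Kong workflow:

```python
# Illustrative only: copy Kong's tables from the current schema into a new one,
# then point the new Kong version (and its migrations) at the new schema.
import psycopg2  # assumed client library

conn = psycopg2.connect("dbname=kong user=kong")  # hypothetical DSN
conn.autocommit = True
cur = conn.cursor()

OLD, NEW = "public", "kong_next"  # made-up namespace names

cur.execute(f"CREATE SCHEMA IF NOT EXISTS {NEW}")

cur.execute("SELECT tablename FROM pg_tables WHERE schemaname = %s", (OLD,))
for (table,) in cur.fetchall():
    # Clone structure (indexes, defaults, check constraints), then the data.
    cur.execute(f'CREATE TABLE {NEW}."{table}" (LIKE {OLD}."{table}" INCLUDING ALL)')
    cur.execute(f'INSERT INTO {NEW}."{table}" SELECT * FROM {OLD}."{table}"')

# From here: run the new version's migrations against the new schema, start the
# new Kong instances pointed at it, then shift traffic and retire the old nodes.
```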

The only data lost is request-based data (e.g. rate-limiting counters). Effectively you're building a new cluster.

@Tieske What are the other things that would be lost? I am looking to solve this problem in a similar fashion but I'd like to know what all I'd be throwing away. If it's just rate-limiting counters, I think we can live with that. But if there is something else that I'm not thinking about that might bite us, I'd like to know.

Thanks!

Is this still being worked on?

When you design a solution, please take into account how it would work with container orchestrators like Kubernetes. In those clusters it is often cumbersome to start a single task before rolling out new software (you have to script that sequence yourself). E.g. vanilla Kubernetes is only just learning how to do this during deployments, and there are edge cases where you might still end up with two tasks running at the same time. I can easily do rolling upgrades on my Java apps because the migration framework (FlyWay) writes a lock to the database in order to coordinate the migration. Kong should do the same and not expect that there is only one task doing the migration.
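
For reference, the FlyWay-style coordination described above boils down to taking a database-level lock before migrating. A minimal sketch of that pattern with a Postgres advisory lock (psycopg2 and the lock key are assumptions; Kong does not do this today, which is the point of the request):

```python
# Sketch of FlyWay-style migration locking on Postgres: whichever task grabs
# the advisory lock runs the migrations; the others skip or wait.
import psycopg2  # assumed client library

MIGRATION_LOCK_KEY = 873465  # arbitrary application-chosen lock id


def run_migrations():
    """Placeholder for the actual migration step (e.g. `kong migrations up`)."""
    print("running migrations")


conn = psycopg2.connect("dbname=kong user=kong")  # hypothetical DSN
conn.autocommit = True
cur = conn.cursor()

cur.execute("SELECT pg_try_advisory_lock(%s)", (MIGRATION_LOCK_KEY,))
if cur.fetchone()[0]:
    try:
        run_migrations()
    finally:
        cur.execute("SELECT pg_advisory_unlock(%s)", (MIGRATION_LOCK_KEY,))
else:
    # Another task already holds the lock and is migrating; this task can
    # simply wait for the schema to reach the expected version and start up.
    pass
```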

I am facing the exact same situation at work that @486 described. The recent 0.11.x release seems to make no-downtime upgrades even more challenging, since it appears the caching mechanism has changed. At the very least, a Flyway locking approach to migrations would be great.

@JanekLehr @486 the locking solution would work with postgres easily, but not with Cassandra due to its eventual-consistency model. If it were that easy, it would have been long fixed by now 😄

@Tieske Right now we're using postgres so that would totally work for us 😉. I'm guessing you want to avoid separate solutions for both?

@Tieske I am a bloody rookie when it comes to Cassandra, but is there actually anything better than writing and reading a lock key with consistency level ALL ?
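
One commonly suggested alternative to a plain lock key at consistency ALL is a lightweight transaction (`INSERT ... IF NOT EXISTS`), which gives compare-and-set semantics via Paxos. A minimal sketch with the DataStax Python driver; the keyspace, lock table, and node id are made up:

```python
# Sketch of a Cassandra migration lock using a lightweight transaction.
# The lock table/keyspace are assumptions; in practice a TTL on the lock row
# would also guard against a crashed lock holder.
from cassandra.cluster import Cluster  # assumed DataStax driver

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("kong")  # assumed keyspace

# Assumed table: CREATE TABLE migration_lock (name text PRIMARY KEY, owner text);
result = session.execute(
    "INSERT INTO migration_lock (name, owner) VALUES (%s, %s) IF NOT EXISTS",
    ("schema_migrations", "node-a"),
)
acquired = result.one()[0]  # first column of an LWT response is the [applied] flag

if acquired:
    try:
        print("running migrations")  # placeholder for the actual migration step
    finally:
        session.execute(
            "DELETE FROM migration_lock WHERE name = %s IF EXISTS",
            ("schema_migrations",),
        )
else:
    print("another node holds the lock; waiting for it to finish")
```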

This is more or less solved with 1.0.0, or did you, @subnetmarco, have something else in mind?

Closing this, given that it's been solved (and @subnetmarco's thumbs up on the previous comment). Huzzah!
