Kong: Kong no-downtime upgrades

Created on 16 May 2016 · 10 Comments · Source: Kong/kong

Currently, executing no-downtime upgrades between different Kong versions is tricky and not ideal. This issue has been known internally for a while, and I thought it would be a good idea to write it down.

Problem

The problem is that a newer Kong version may execute schema migrations on a database that is still being used by a previous version of Kong. Since some of the changes may be breaking, the older Kong nodes may start throwing errors, because the new Kong version has effectively updated the database under the hood with changes that conflict with the previous code.

Current solution

The latest versions of Kong are able to keep processing existing requests even when the database is down. A potential upgrade path consists of making the datastore unavailable to the current Kong nodes; Kong will still process existing requests because the entities are cached in memory.

At this point the new version of Kong can be started, which will execute the migrations on the datastore. Once the new version of Kong is up and running and the migrations have been completed, the load balancer can be instructed to remove the old Kong nodes, and start processing requests on the new Kong nodes.

This implies there is a load balancer in front of the Kong nodes.

Problems with the current solution

The above solution is not always acceptable for a few reasons:

  • It assumes a load balancer with a round-robin load-balancing policy. A different policy may not spread requests across every node, which means that when the database is disconnected not every node will have the same cached entities in memory, and some of them will be unable to process requests.
  • This solution is not 100% no-downtime. Assuming every node has the same cached entities, requests will keep working for existing consumers, but Kong will effectively be down for new consumption. It's selective uptime: up for previous consumers, down for any new consumption that requires talking to the datastore.
  • Executing the migrations manually won't fix these problems.

All 10 comments

Very rough idea to get the conversation started (a rough sketch of the flow follows the lists below).

  • The node that receives the kong migration command (let's call it the coordinator) would use the same pubsub system used by the cache invalidation to tell all the nodes to make the DB invisible/disconnected.
  • The coordinator also makes its own DB invisible/disconnected everywhere except to the migration script.
  • The coordinator does the DB migration.
  • Then the coordinator sends a reload signal to one node at a time, and finally reloads itself. Nodes come back with the DB visible.

Problem: new users can't consume Kong during that process.

Possible solutions:

  • Broadcast ALL changes to the nodes that haven't reloaded yet and apply those changes to the cache.
  • Document this edge case and make the reload aggressive (for example, reload as many as 1/4 of the nodes at a time) to mitigate it.
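
A minimal sketch of how that coordinator flow could be orchestrated, assuming placeholder `publish`, `run_migrations`, and `reload_node` helpers stand in for Kong's cluster pub/sub, the migration run, and the per-node reload signal (none of these are real Kong APIs):

```python
# Hypothetical orchestration of the coordinator idea sketched above.
# publish / run_migrations / reload_node are stand-ins, not real Kong APIs.
import time

NODES = ["kong-1", "kong-2", "kong-3"]  # assumed inventory of Kong nodes


def publish(event):
    """Stand-in for the pub/sub channel Kong uses for cache invalidation."""
    print("broadcast:", event)


def run_migrations():
    """Stand-in for running `kong migrations up` from the coordinator."""
    print("running migrations")


def reload_node(node):
    """Stand-in for sending a reload signal to a single node."""
    print("reloading", node)


def coordinate_upgrade():
    # 1. Tell every node (coordinator included) to stop talking to the DB
    #    and serve purely from the in-memory cache.
    publish("db_disconnect")
    # 2. Run the schema migrations while the old nodes ignore the DB.
    run_migrations()
    # 3. Reload nodes one at a time; each comes back with the DB visible
    #    and the new schema in effect.
    for node in NODES:
        reload_node(node)
        time.sleep(1)  # crude pacing; a real rollout would health-check each node


coordinate_upgrade()
```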

Both the first and second issues come down to in-memory cache misses. The cache is only filled if the data has been requested at least once. So can we instruct the Kong nodes to cache everything, just prior to an upgrade and disconnecting from the database?
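
There is no built-in command for that, but a warm-up script is conceivable: since entities are cached on first lookup, one proxied request per configured API on each node would pull them into memory before the database is cut off. A rough sketch, assuming 2016-era Admin/proxy ports and the `/apis` endpoint (the addresses below are made up):

```python
# Hypothetical cache warm-up: send one request per configured API through each
# node's proxy port so the entities get looked up (and cached) before the DB
# is disconnected. Host names and ports below are assumptions.
import requests

ADMIN_URL = "http://kong-admin:8001"                          # assumed Admin API
NODE_PROXIES = ["http://kong-1:8000", "http://kong-2:8000"]   # assumed proxy ports

apis = requests.get(f"{ADMIN_URL}/apis").json()["data"]
for proxy in NODE_PROXIES:
    for api in apis:
        host = api.get("request_host")
        if not host:
            continue  # path-routed APIs would need request_path handling instead
        # One request per API so the node resolves and caches it (plus its plugins).
        requests.get(proxy, headers={"Host": host}, timeout=5)
```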

@SGrondin I don't think the same Kong nodes come back; you need new ones, upgraded as well. So it's not just reloading.

Also, the migration could be done in a different namespace within the same database. This would go without any cache misses (see the sketch after this list). So:

  • stop modifications through management api
  • export data from existing namespace
  • import in new namespace
  • migrate in new namespace
  • start new Kong instance(s) against new namespace
  • route traffic to new instances (using LB, DNS, or whatever the infrastructure in use offers)
  • starve and shutdown old nodes
  • re-enable management api
  • remove old namespace

The only data lost is request-based data (e.g. rate-limiting counters). Effectively you're building a new cluster.
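
A rough sketch of the namespace idea for the Postgres case, using a second schema as the "new namespace" (for Cassandra the equivalent would be a new keyspace). The schema names and copy strategy are illustrative assumptions, not an existing Kong workflow:

```python
# Illustrative only: copy Kong's tables from the current schema into a new one,
# then point the new Kong version (and its migrations) at the new schema.
import psycopg2  # assumed client library

conn = psycopg2.connect("dbname=kong user=kong")  # hypothetical DSN
conn.autocommit = True
cur = conn.cursor()

OLD, NEW = "public", "kong_next"  # made-up namespace names

cur.execute(f"CREATE SCHEMA IF NOT EXISTS {NEW}")

cur.execute("SELECT tablename FROM pg_tables WHERE schemaname = %s", (OLD,))
for (table,) in cur.fetchall():
    # Clone structure (indexes, defaults, check constraints), then the data.
    cur.execute(f'CREATE TABLE {NEW}."{table}" (LIKE {OLD}."{table}" INCLUDING ALL)')
    cur.execute(f'INSERT INTO {NEW}."{table}" SELECT * FROM {OLD}."{table}"')

# From here: run the new version's migrations against the new schema, start the
# new Kong instances pointed at it, then shift traffic and retire the old nodes.
```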

The only data lost is request-based data (e.g. rate-limiting counters). Effectively you're building a new cluster.

@Tieske What are the other things that would be lost? I am looking to solve this problem in a similar fashion but I'd like to know what all I'd be throwing away. If it's just rate-limiting counters, I think we can live with that. But if there is something else that I'm not thinking about that might bite us, I'd like to know.

Thanks!

Is this still being worked on?

When you design a solution, please take into account how it would work with container orchestrators like Kubernetes. In those clusters it is often cumbersome to start a single task before rolling out new software (you have to script that sequence yourself). E.g. vanilla Kubernetes is only just learning how to do this during deployments, and there are edge cases where you might still end up with two tasks running at the same time. I can easily do rolling upgrades on my Java apps because the migration framework (FlyWay) writes a lock to the database in order to coordinate the migration. Kong should do the same and not expect that there is only one task doing the migration.
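
For reference, the FlyWay-style coordination described above boils down to taking a database-level lock before migrating. A minimal sketch of that pattern with a Postgres advisory lock (psycopg2 and the lock key are assumptions; Kong does not do this today, which is the point of the request):

```python
# Sketch of FlyWay-style migration locking on Postgres: whichever task grabs
# the advisory lock runs the migrations; the others skip or wait.
import psycopg2  # assumed client library

MIGRATION_LOCK_KEY = 873465  # arbitrary application-chosen lock id


def run_migrations():
    """Placeholder for the actual migration step (e.g. `kong migrations up`)."""
    print("running migrations")


conn = psycopg2.connect("dbname=kong user=kong")  # hypothetical DSN
conn.autocommit = True
cur = conn.cursor()

cur.execute("SELECT pg_try_advisory_lock(%s)", (MIGRATION_LOCK_KEY,))
if cur.fetchone()[0]:
    try:
        run_migrations()
    finally:
        cur.execute("SELECT pg_advisory_unlock(%s)", (MIGRATION_LOCK_KEY,))
else:
    # Another task already holds the lock and is migrating; this task can
    # simply wait for the schema to reach the expected version and start up.
    pass
```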

I am facing the exact same situation at work that @486 described. The recent 0.11.x release seems to make no-downtime upgrades even more challenging, since it appears the caching mechanism has changed. At the very least, a Flyway locking approach to migrations would be great.

@JanekLehr @486 the locking solution would work with postgres easily, but not with Cassandra due to its eventual-consistency model. If it were that easy, it would have been long fixed by now 😄

@Tieske Right now we're using postgres so that would totally work for us 😉. I'm guessing you want to avoid separate solutions for both?

@Tieske I am a bloody rookie when it comes to Cassandra, but is there actually anything better than writing and reading a lock key with consistency level ALL ?
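
One commonly suggested alternative to a plain lock key at consistency ALL is a lightweight transaction (`INSERT ... IF NOT EXISTS`), which gives compare-and-set semantics via Paxos. A minimal sketch with the DataStax Python driver; the keyspace, lock table, and node id are made up:

```python
# Sketch of a Cassandra migration lock using a lightweight transaction.
# The lock table/keyspace are assumptions; in practice a TTL on the lock row
# would also guard against a crashed lock holder.
from cassandra.cluster import Cluster  # assumed DataStax driver

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("kong")  # assumed keyspace

# Assumed table: CREATE TABLE migration_lock (name text PRIMARY KEY, owner text);
result = session.execute(
    "INSERT INTO migration_lock (name, owner) VALUES (%s, %s) IF NOT EXISTS",
    ("schema_migrations", "node-a"),
)
acquired = result.one()[0]  # first column of an LWT response is the [applied] flag

if acquired:
    try:
        print("running migrations")  # placeholder for the actual migration step
    finally:
        session.execute(
            "DELETE FROM migration_lock WHERE name = %s IF EXISTS",
            ("schema_migrations",),
        )
else:
    print("another node holds the lock; waiting for it to finish")
```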

This is more or less solved with 1.0.0, or did you, @subnetmarco, have something else in mind?

Closing this, given that it's been solved (and @subnetmarco's thumbs up on the previous comment). Huzzah!
