Currently, executing no-downtime upgrades between different Kong versions is tricky and not ideal. This issue has been known internally for a while, and I thought it would be a good idea to write it down.
The problem is that a newer Kong version may execute schema migrations on a database that is still being used by a previous version of Kong. Since some of those changes may be breaking, the older Kong nodes may start throwing errors: the new version has effectively updated the database under the hood in ways that conflict with the older code.
The latest versions of Kong are able to process existing requests even when the database is down. A potential upgrade path consists of making the datastore unavailable to the current Kong nodes; Kong will keep processing the existing requests because the entities are cached in memory.
At this point the new version of Kong can be started, which will execute the migrations on the datastore. Once the new version of Kong is up and running and the migrations have been completed, the load balancer can be instructed to remove the old Kong nodes, and start processing requests on the new Kong nodes.
This assumes there is a load balancer in front of the Kong nodes.
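The sequence above can be sketched roughly as follows. Every helper name here is hypothetical; they stand in for whatever firewall, load balancer, and deployment tooling you actually use:

```python
# Hypothetical orchestration sketch of the cache-backed upgrade path
# described above. None of these steps are Kong commands; they are
# placeholders for your own infrastructure tooling.

def upgrade_sequence():
    """Return the ordered steps of the proposed no-downtime upgrade."""
    return [
        "block datastore access for the old Kong nodes",   # old nodes keep serving from in-memory cache
        "start the new Kong version",                      # new version runs its migrations on the datastore
        "wait for migrations to complete",
        "point the load balancer at the new Kong nodes",
        "drain and remove the old Kong nodes",
    ]

for i, step in enumerate(upgrade_sequence(), 1):
    print(f"{i}. {step}")
```

The ordering matters: the datastore must be cut off from the old nodes before the new version touches the schema, otherwise the old nodes can observe a half-migrated database.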
The above solution is not always acceptable for a few reasons:
Very rough idea to get the conversation started.
Problem: new users can't consume Kong during that process.
Possible solutions:
Both the first and the second issue stem from in-memory cache misses. The cache is only filled if the data has been requested at least once. So can we instruct the Kong nodes to cache everything, just prior to upgrading and disconnecting from the database?
@SGrondin I don't think the same Kong nodes come back; you need new nodes, upgraded as well. So it's not just a reload.
Also, the migration could be done in a different namespace within the same database. That would avoid cache misses entirely. So:
The only data lost is request-based data (e.g. rate-limiting counters). Effectively you're building a new cluster.
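The "separate namespace" idea is essentially a blue/green switch at the database level: the old cluster keeps reading the untouched old schema while the migrated copy is built alongside it, so no node ever sees a half-migrated database. A toy model of that invariant (real implementations would use a second Postgres schema or Cassandra keyspace plus a data copy step):

```python
# Toy model: migrate a *copy* of the old namespace, leaving the
# original intact for the nodes still running the old version.

def migrate_copy(old_ns, migrate):
    """Return a new namespace: a migrated copy of old_ns; old_ns is not modified."""
    return {key: migrate(value) for key, value in old_ns.items()}

# Hypothetical entity and hypothetical new field added by the migration.
kong_v0 = {"api:1": {"name": "users", "upstream_url": "http://u"}}
kong_v1 = migrate_copy(kong_v0, lambda e: {**e, "protocol": "http"})

# Old namespace untouched; new one carries the migrated shape.
assert kong_v0["api:1"] == {"name": "users", "upstream_url": "http://u"}
assert kong_v1["api:1"]["protocol"] == "http"
```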
@Tieske What are the other things that would be lost? I am looking to solve this problem in a similar fashion but I'd like to know what all I'd be throwing away. If it's just rate-limiting counters, I think we can live with that. But if there is something else that I'm not thinking about that might bite us, I'd like to know.
Thanks!
Is this still being worked on?
When you design a solution, please take into account how it would work with container orchestrators like Kubernetes. In those clusters it is often cumbersome to start a single task before rolling out new software (you have to script that sequence yourself). Vanilla Kubernetes, for example, is only just learning how to do this during deployments, and there are edge cases where you can still end up with two tasks running at the same time. I can easily do rolling upgrades on my Java apps because the migration framework (Flyway) writes a lock to the database in order to coordinate the migration. Kong should do the same and not assume that there is only one task performing the migration.
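The coordination property being asked for can be shown in a few lines. With Postgres this maps naturally onto an advisory lock (`pg_advisory_lock`); here a `threading.Lock` stands in so the sketch is self-contained and runnable:

```python
# Illustration of a Flyway-style migration lock: several identical
# tasks start concurrently, but only one of them actually runs the
# migrations; the others see the lock/flag and skip.

import threading

migrations_run = []                 # records which task performed the migration
migration_lock = threading.Lock()   # stand-in for a database-level lock
done = threading.Event()            # stand-in for "migrations already applied"

def task(task_id):
    with migration_lock:            # only one task holds the lock at a time
        if not done.is_set():       # first task in: run the migrations
            migrations_run.append(task_id)
            done.set()
        # later tasks: migrations already applied, nothing to do

threads = [threading.Thread(target=task, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert len(migrations_run) == 1     # exactly one task migrated, regardless of races
```

The point is that each task checks the "already migrated" state *inside* the lock, so starting N replicas at once is safe.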
I am facing the exact same situation at work that @486 described. The recent 0.11.x release seems to make no-downtime upgrades even more challenging, since it appears the caching mechanism has changed. At the very least, a Flyway-style locking approach to migrations would be great.
@JanekLehr @486 the locking solution would work easily with Postgres, but not with Cassandra due to its eventual-consistency model. If it were that easy, it would have been fixed long ago 😄
@Tieske Right now we're using postgres so that would totally work for us 😉. I'm guessing you want to avoid separate solutions for both?
@Tieske I am a bloody rookie when it comes to Cassandra, but is there actually anything better than writing and reading a lock key at consistency level ALL?
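For what it's worth, the usual Cassandra answer is a lightweight transaction (`INSERT ... IF NOT EXISTS`), which performs a Paxos-backed compare-and-set; a plain write followed by a read, even at consistency ALL, still leaves a race where two tasks both read "no lock" before either writes. A sketch of the compare-and-set semantics, with a dict standing in for the lock table:

```python
# Stand-in for a Cassandra lightweight transaction: claim the lock
# key only if no one holds it yet, atomically. dict.setdefault plays
# the role of INSERT ... IF NOT EXISTS here.

locks = {}

def try_acquire(table, key, owner):
    """Atomically claim key for owner; True only for the winner (or re-acquire by the same owner)."""
    return table.setdefault(key, owner) == owner

assert try_acquire(locks, "migrations", "task-a") is True
assert try_acquire(locks, "migrations", "task-b") is False  # task-a already holds it
```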
This is more or less solved with 1.0.0, or did you have something else in mind, @subnetmarco?
Closing this, given that it's been solved (and @subnetmarco's thumbs up on the previous comment). Huzzah!