Running kong migrations bootstrap on Kong 1.0.2 or 0.15.0 backed by a Cassandra cluster fails with the following error.
/usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:75: [Cassandra error] cluster_mutex callback threw an error: /usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:68: [Cassandra error] failed to run migration '001_14_to_15' teardown: ...l/share/lua/5.1/kong/db/migrations/core/001_14_to_15.lua:299: [Read failure] Operation failed - received 0 responses and 1 failures (UNKNOWN_COLUMN=[/172.31.87.88])
Full log below.
In this case it failed during the teardown of the core 001_14_to_15 migration, but it fails at other points too, for example during the teardown of the oauth2 001_14_to_15 migration.
With a Cassandra cluster of 1 node there is no issue; the failure only occurs with a cluster of more than 1 node. I have tried different Cassandra config options to no avail, e.g.:
cassandra_consistency: Used default value, LOCAL_QUORUM, QUORUM, ALL
cassandra_repl_strategy: NetworkTopologyStrategy & SimpleStrategy
cassandra_repl_factor: 1 or 2 with 3 nodes
cassandra_data_centers: OLTP:1 or OLTP:2 with 3 nodes
cassandra_lb_policy: All options
The issue appears to be a Cassandra schema consensus problem between the nodes and the client. The errors usually appear in the teardown steps, typically when reading or modifying something that was added by the up migration.
For example, in the oauth2 001_14_to_15 migration, the up migration adds the column redirect_uris to the table oauth2_credentials; the teardown then fails because it does a select on that column, and since the Cassandra schema is not yet in consensus across the nodes, the read fails.
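A minimal sketch of the migration shape being described (the CQL, the column type, and the teardown body here are illustrative assumptions, not copied from Kong's source; only the table/column names and the Kong 1.0 migration layout come from the report):

```lua
-- Hypothetical sketch of a Kong 1.0 Cassandra migration like oauth2's
-- 001_14_to_15.
return {
  cassandra = {
    up = [[
      ALTER TABLE oauth2_credentials ADD redirect_uris set<text>;
    ]],
    teardown = function(connector)
      -- Reads the column added by `up`. If schema agreement has not yet
      -- propagated to every node, a replica that still has the old
      -- schema rejects the read with UNKNOWN_COLUMN.
      local rows, err = connector:query([[
        SELECT redirect_uris FROM oauth2_credentials
      ]])
      if err then
        return nil, err
      end
      -- ... copy/migrate the data ...
      return true
    end,
  },
}
```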
Bootstrap mode runs both the up and the teardown migrations in the same call, and this is where it fails. When migrating from an old version, running a migrations up followed by a finish works fine.
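For reference, this is the two-step sequence that works when upgrading an existing installation (standard Kong 1.0 CLI subcommands, shown only to contrast with the single bootstrap call that fails):

```shell
# Step 1: run only the "up" parts of pending migrations; Kong waits
# for Cassandra schema consensus before returning.
kong migrations up

# Step 2: run the pending "teardown" parts in a separate invocation.
kong migrations finish
```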
There are 2 solutions here:

1. Add a wait_for_schema_consensus call between migrations in the DB:run_migrations function.
2. In the bootstrap function in the lua/5.1/kong/cmd/utils/migrations.lua file, split the migration run into two calls, as each stage runs a wait_for_schema_consensus when it finishes, e.g.

   assert(db:run_migrations(schema_state.new_migrations, { run_up = true }))
   assert(db:run_migrations(schema_state.new_migrations, { run_teardown = true }))

   instead of doing them both in the one call.
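A rough sketch of what solution 1 could look like inside DB:run_migrations (illustrative only: the loop shape and the run_migration helper are assumptions standing in for Kong's elided internals; wait_for_schema_consensus is the connector method referred to above):

```lua
-- Sketch of solution 1: wait for schema agreement after each
-- individual migration, so the next migration (or its teardown)
-- never reads a schema the other nodes have not seen yet.
for _, migration in ipairs(migrations) do
  local ok, err = run_migration(self, migration)  -- elided Kong internals
  if not ok then
    return nil, err
  end

  if self.strategy == "cassandra" then
    ok, err = self.connector:wait_for_schema_consensus()
    if not ok then
      return nil, err
    end
  end
end
```

The trade-off is visible in the sketch: the wait runs once per migration, which is what makes this variant slower but safe regardless of what each teardown reads.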
I have tested both solutions on both versions and they work. Solution 1 has the drawback of adding waits between each migration but will work in all cases, while solution 2 will only work if the up migrations don't need to wait for consensus themselves; if they did, a plain migrations up would fail too.
Also worth noting: a fix for something similar was done in 0.14.1. In 0.13.1, a migrations up failure occurs in the 2017-01-24-132600_upstream_timeouts_2 migration. It fails during the select * from apis query because the previous stage, 2017-01-24-132600_upstream_timeouts, has altered the apis table and added timeout fields. 0.14.1 fixed this by calling wait_for_schema_consensus directly in the upstream_timeouts_2 migration.
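The shape of that 0.14.1 fix, sketched against the pre-1.0 migration format (names and the function signature are approximations, not Kong's actual 0.14.x code; the real change lives in Kong's migration files):

```lua
-- Approximate sketch of the 0.14.1-style per-migration fix: wait for
-- consensus at the start of the migration that reads columns added by
-- the previous stage.
{
  name = "2017-01-24-132600_upstream_timeouts_2",
  up = function(db)
    -- Ensure the timeout fields added to the apis table by the
    -- previous migration are visible on every node before selecting.
    local ok, err = db:wait_for_schema_consensus()
    if err then
      return err
    end
    -- ... original select * from apis and per-row updates follow ...
  end,
}
```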
Thanks
root@api-gateway-1:~# kong migrations bootstrap -vv
2019/01/21 08:51:08 [verbose] Kong: 1.0.2
2019/01/21 08:51:08 [debug] ngx_lua: 10013
2019/01/21 08:51:08 [debug] nginx: 1013006
2019/01/21 08:51:08 [debug] Lua: LuaJIT 2.1.0-beta3
2019/01/21 08:51:08 [verbose] reading config file at /etc/kong/kong.conf
2019/01/21 08:51:08 [debug] reading environment variables
2019/01/21 08:51:08 [debug] admin_access_log = "logs/admin_access.log"
2019/01/21 08:51:08 [debug] admin_error_log = "logs/error.log"
2019/01/21 08:51:08 [debug] admin_listen = {"0.0.0.0:8001"}
2019/01/21 08:51:08 [debug] anonymous_reports = true
2019/01/21 08:51:08 [debug] cassandra_consistency = "LOCAL_QUORUM"
2019/01/21 08:51:08 [debug] cassandra_contact_points = {"oltp-cassandra-1.private.george1.do.eng.vixpulse.com","oltp-cassandra-2.private.george1.do.eng.vixpulse.com","oltp-cassandra-3.private.george1.do.eng.vixpulse.com"}
2019/01/21 08:51:08 [debug] cassandra_data_centers = {"OLTP:2"}
2019/01/21 08:51:08 [debug] cassandra_keyspace = "pulse_api_gateway"
2019/01/21 08:51:08 [debug] cassandra_lb_policy = "RequestRoundRobin"
2019/01/21 08:51:08 [debug] cassandra_password = "******"
2019/01/21 08:51:08 [debug] cassandra_port = 9042
2019/01/21 08:51:08 [debug] cassandra_repl_factor = 2
2019/01/21 08:51:08 [debug] cassandra_repl_strategy = "NetworkTopologyStrategy"
2019/01/21 08:51:08 [debug] cassandra_schema_consensus_timeout = 10000
2019/01/21 08:51:08 [debug] cassandra_ssl = false
2019/01/21 08:51:08 [debug] cassandra_ssl_verify = false
2019/01/21 08:51:08 [debug] cassandra_timeout = 5000
2019/01/21 08:51:08 [debug] cassandra_username = "pulse"
2019/01/21 08:51:08 [debug] client_body_buffer_size = "8k"
2019/01/21 08:51:08 [debug] client_max_body_size = "0"
2019/01/21 08:51:08 [debug] client_ssl = false
2019/01/21 08:51:08 [debug] database = "cassandra"
2019/01/21 08:51:08 [debug] db_cache_ttl = 0
2019/01/21 08:51:08 [debug] db_resurrect_ttl = 30
2019/01/21 08:51:08 [debug] db_update_frequency = 300
2019/01/21 08:51:08 [debug] db_update_propagation = 2
2019/01/21 08:51:08 [debug] dns_error_ttl = 1
2019/01/21 08:51:08 [debug] dns_hostsfile = "/etc/hosts"
2019/01/21 08:51:08 [debug] dns_no_sync = false
2019/01/21 08:51:08 [debug] dns_not_found_ttl = 30
2019/01/21 08:51:08 [debug] dns_order = {"LAST","SRV","A","CNAME"}
2019/01/21 08:51:08 [debug] dns_resolver = {}
2019/01/21 08:51:08 [debug] dns_stale_ttl = 4
2019/01/21 08:51:08 [debug] error_default_type = "text/plain"
2019/01/21 08:51:08 [debug] headers = {"server_tokens","latency_tokens"}
2019/01/21 08:51:08 [debug] log_level = "notice"
2019/01/21 08:51:08 [debug] lua_package_cpath = ""
2019/01/21 08:51:08 [debug] lua_package_path = "./?.lua;./?/init.lua;"
2019/01/21 08:51:08 [debug] lua_socket_pool_size = 30
2019/01/21 08:51:08 [debug] lua_ssl_verify_depth = 1
2019/01/21 08:51:08 [debug] mem_cache_size = "128m"
2019/01/21 08:51:08 [debug] nginx_admin_directives = {}
2019/01/21 08:51:08 [debug] nginx_daemon = "on"
2019/01/21 08:51:08 [debug] nginx_http_directives = {}
2019/01/21 08:51:08 [debug] nginx_optimizations = true
2019/01/21 08:51:08 [debug] nginx_proxy_directives = {}
2019/01/21 08:51:08 [debug] nginx_user = "nobody nobody"
2019/01/21 08:51:08 [debug] nginx_worker_processes = "auto"
2019/01/21 08:51:08 [debug] origins = {}
2019/01/21 08:51:08 [debug] pg_database = "kong"
2019/01/21 08:51:08 [debug] pg_host = "127.0.0.1"
2019/01/21 08:51:08 [debug] pg_port = 5432
2019/01/21 08:51:08 [debug] pg_ssl = false
2019/01/21 08:51:08 [debug] pg_ssl_verify = false
2019/01/21 08:51:08 [debug] pg_timeout = 5000
2019/01/21 08:51:08 [debug] pg_user = "kong"
2019/01/21 08:51:08 [debug] plugins = {"bundled"}
2019/01/21 08:51:08 [debug] prefix = "/usr/local/kong/"
2019/01/21 08:51:08 [debug] proxy_access_log = "logs/access.log"
2019/01/21 08:51:08 [debug] proxy_error_log = "logs/error.log"
2019/01/21 08:51:08 [debug] proxy_listen = {"0.0.0.0:8000","0.0.0.0:8443 ssl"}
2019/01/21 08:51:08 [debug] real_ip_header = "X-Forwarded-For"
2019/01/21 08:51:08 [debug] real_ip_recursive = "off"
2019/01/21 08:51:08 [debug] ssl_cipher_suite = "modern"
2019/01/21 08:51:08 [debug] ssl_ciphers = "ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256"
2019/01/21 08:51:08 [debug] stream_listen = {"off"}
2019/01/21 08:51:08 [debug] trusted_ips = {"172.31.0.0/24","172.31.1.0/24","172.31.2.0/24"}
2019/01/21 08:51:08 [debug] upstream_keepalive = 60
2019/01/21 08:51:08 [verbose] prefix in use: /usr/local/kong
2019/01/21 08:51:08 [warn] 12527#0: *2 [lua] cluster.lua:496: refresh(): [lua-cassandra] found contact point with '0.0.0.0' as rpc_address, using 'oltp-cassandra-1.private.george1.do.eng.vixpulse.com' to contact it instead. If this is incorrect you should avoid using '0.0.0.0' in rpc_address, context: ngx.timer
2019/01/21 08:51:08 [debug] loading subsystems migrations...
2019/01/21 08:51:08 [verbose] retrieving keyspace schema state...
2019/01/21 08:51:08 [verbose] schema state retrieved
2019/01/21 08:51:08 [info] bootstrapping database...
2019/01/21 08:51:08 [debug] creating 'pulse_api_gateway' keyspace if not existing...
2019/01/21 08:51:08 [debug] successfully created 'pulse_api_gateway' keyspace
2019/01/21 08:51:08 [debug] creating 'schema_meta' table if not existing...
2019/01/21 08:51:08 [debug] successfully created 'schema_meta' table
2019/01/21 08:51:08 [debug] creating 'locks' table if not existing...
2019/01/21 08:51:08 [debug] successfully created 'locks' table
2019/01/21 08:51:08 [verbose] waiting for Cassandra schema consensus (60000ms timeout)...
2019/01/21 08:51:09 [verbose] Cassandra schema consensus: reached
2019/01/21 08:51:09 [info] migrating core on keyspace 'pulse_api_gateway'...
2019/01/21 08:51:09 [debug] running migration: 000_base
2019/01/21 08:51:11 [info] core migrated up to: 000_base (executed)
2019/01/21 08:51:11 [debug] running migration: 001_14_to_15
Error:
/usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:75: [Cassandra error] cluster_mutex callback threw an error: /usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:68: [Cassandra error] failed to run migration '001_14_to_15' teardown: ...l/share/lua/5.1/kong/db/migrations/core/001_14_to_15.lua:299: [Read failure] Operation failed - received 0 responses and 1 failures (UNKNOWN_COLUMN=[/172.31.87.88])
stack traceback:
[C]: in function 'assert'
...l/share/lua/5.1/kong/db/migrations/core/001_14_to_15.lua:299: in function <...l/share/lua/5.1/kong/db/migrations/core/001_14_to_15.lua:264>
[C]: in function 'xpcall'
/usr/local/share/lua/5.1/kong/db/init.lua:570: in function 'run_migrations'
/usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:68: in function </usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:67>
[C]: in function 'xpcall'
/usr/local/share/lua/5.1/kong/db/init.lua:390: in function 'cluster_mutex'
/usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:67: in function 'bootstrap'
/usr/local/share/lua/5.1/kong/cmd/migrations.lua:118: in function 'cmd_exec'
/usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:87>
[C]: in function 'xpcall'
/usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:44>
/usr/local/bin/kong:7: in function 'file_gen'
init_worker_by_lua:48: in function <init_worker_by_lua:46>
[C]: in function 'xpcall'
init_worker_by_lua:55: in function <init_worker_by_lua:53>
stack traceback:
[C]: in function 'assert'
/usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:68: in function </usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:67>
[C]: in function 'xpcall'
/usr/local/share/lua/5.1/kong/db/init.lua:390: in function 'cluster_mutex'
/usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:67: in function 'bootstrap'
/usr/local/share/lua/5.1/kong/cmd/migrations.lua:118: in function 'cmd_exec'
/usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:87>
[C]: in function 'xpcall'
/usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:44>
/usr/local/bin/kong:7: in function 'file_gen'
init_worker_by_lua:48: in function <init_worker_by_lua:46>
[C]: in function 'xpcall'
init_worker_by_lua:55: in function <init_worker_by_lua:53>
stack traceback:
[C]: in function 'error'
/usr/local/share/lua/5.1/kong/cmd/utils/migrations.lua:75: in function 'bootstrap'
/usr/local/share/lua/5.1/kong/cmd/migrations.lua:118: in function 'cmd_exec'
/usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:87>
[C]: in function 'xpcall'
/usr/local/share/lua/5.1/kong/cmd/init.lua:87: in function </usr/local/share/lua/5.1/kong/cmd/init.lua:44>
/usr/local/bin/kong:7: in function 'file_gen'
init_worker_by_lua:48: in function <init_worker_by_lua:46>
[C]: in function 'xpcall'
init_worker_by_lua:55: in function <init_worker_by_lua:53>
Hi,
Thank you for the report; this is helpful. I can confirm that this problem exists. I will push a PR first thing in the morning. It is delicate to decide where to introduce the schema consensus check, since too many such checks can significantly slow down the (already slow) Cassandra migrations. But something will be done.
Thanks, I will look forward to the patch. Will it be released soon? This is holding us up at the moment: I can migrate existing systems to 1.0 but I can't create new environments without making one of the code changes above.
@gchristidis Would you be able to give the patch a try to confirm that it resolves your issue? Here is a direct link to it: https://patch-diff.githubusercontent.com/raw/Kong/kong/pull/4233.patch
We are planning on a 1.0.3 release next week. It'd be safer to have higher confidence in this patch if you could give it a try and confirm that it resolves the issue. Thanks!
@thibaultcha I applied the changes in the patch to an installed 1.0.2 Kong backed by a Cassandra cluster of 3 nodes with a NetworkTopologyStrategy and an RF of 2, and a bootstrap now works perfectly, if a little slower as expected.
Thanks for the effort; looking forward to the release.
@gchristidis Thanks for reporting and testing the patch! I just merged the patch and we are on track to release 1.0.3 in the next couple of days.