Kong 2.0.5 -> 2.1.4 migrations: C* issues still possible.

Created on 28 Sep 2020 · 81 Comments · Source: Kong/kong

Summary

Thought we were out of the woods here, but I just tested the DB migration from 2.0.5 to 2.1.4 against a clone of our prod keyspace (I decided dry runs of this migrations work are best done with real keyspace data too) and things went sideways again. I captured logs of each migrations execution, from the up through having to run finish multiple times before reaching a conclusion, detailed below:

First the Kong migrations up:

/ $ kong migrations up --db-timeout 120 --vv
2020/09/28 17:17:57 [verbose] Kong: 2.1.4
2020/09/28 17:17:57 [debug] ngx_lua: 10015
2020/09/28 17:17:57 [debug] nginx: 1015008
2020/09/28 17:17:57 [debug] Lua: LuaJIT 2.1.0-beta3
2020/09/28 17:17:57 [verbose] no config file found at /etc/kong/kong.conf
2020/09/28 17:17:57 [verbose] no config file found at /etc/kong.conf
2020/09/28 17:17:57 [verbose] no config file, skip loading
2020/09/28 17:17:57 [debug] reading environment variables
2020/09/28 17:17:57 [debug] KONG_PLUGINS ENV found with "kong-siteminder-auth,kong-kafka-log,stargate-waf-error-log,mtls,stargate-oidc-token-revoke,kong-tx-debugger,kong-plugin-oauth,zipkin,kong-error-log,kong-oidc-implicit-token,kong-response-size-limiting,request-transformer,kong-service-virtualization,kong-cluster-drain,kong-upstream-jwt,kong-splunk-log,kong-spec-expose,kong-path-based-routing,kong-oidc-multi-idp,correlation-id,oauth2,statsd,jwt,rate-limiting,acl,request-size-limiting,request-termination,cors"
2020/09/28 17:17:57 [debug] KONG_ADMIN_LISTEN ENV found with "0.0.0.0:8001 deferred reuseport"
2020/09/28 17:17:57 [debug] KONG_PROXY_ACCESS_LOG ENV found with "off"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_USERNAME ENV found with "kongdba"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_PASSWORD ENV found with "******"
2020/09/28 17:17:57 [debug] KONG_PROXY_LISTEN ENV found with "0.0.0.0:8000, 0.0.0.0:8443 ssl http2 deferred reuseport"
2020/09/28 17:17:57 [debug] KONG_DNS_NO_SYNC ENV found with "off"
2020/09/28 17:17:57 [debug] KONG_DB_UPDATE_PROPAGATION ENV found with "5"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_PORT ENV found with "9042"
2020/09/28 17:17:57 [debug] KONG_HEADERS ENV found with "latency_tokens"
2020/09/28 17:17:57 [debug] KONG_DNS_STALE_TTL ENV found with "4"
2020/09/28 17:17:57 [debug] KONG_WAF_DEBUG_LEVEL ENV found with "0"
2020/09/28 17:17:57 [debug] KONG_WAF_PARANOIA_LEVEL ENV found with "1"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_REFRESH_FREQUENCY ENV found with "0"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_CONTACT_POINTS ENV found with "server05503,server05505,server05507,server05502,server05504,server05506"
2020/09/28 17:17:57 [debug] KONG_DB_CACHE_WARMUP_ENTITIES ENV found with "services,plugins,consumers"
2020/09/28 17:17:57 [debug] KONG_NGINX_HTTP_SSL_PROTOCOLS ENV found with "TLSv1.2 TLSv1.3"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_LOCAL_DATACENTER ENV found with "DC1"
2020/09/28 17:17:57 [debug] KONG_UPSTREAM_KEEPALIVE_IDLE_TIMEOUT ENV found with "30"
2020/09/28 17:17:57 [debug] KONG_DB_CACHE_TTL ENV found with "0"
2020/09/28 17:17:57 [debug] KONG_PG_SSL ENV found with "off"
2020/09/28 17:17:57 [debug] KONG_WAF_REQUEST_NO_FILE_SIZE_LIMIT ENV found with "50000000"
2020/09/28 17:17:57 [debug] KONG_WAF_PCRE_MATCH_LIMIT_RECURSION ENV found with "10000"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_SCHEMA_CONSENSUS_TIMEOUT ENV found with "30000"
2020/09/28 17:17:57 [debug] KONG_LOG_LEVEL ENV found with "notice"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_TIMEOUT ENV found with "5000"
2020/09/28 17:17:57 [debug] KONG_NGINX_MAIN_WORKER_PROCESSES ENV found with "6"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_KEYSPACE ENV found with "kong_prod2"
2020/09/28 17:17:57 [debug] KONG_WAF ENV found with "off"
2020/09/28 17:17:57 [debug] KONG_ERROR_DEFAULT_TYPE ENV found with "text/plain"
2020/09/28 17:17:57 [debug] KONG_UPSTREAM_KEEPALIVE_POOL_SIZE ENV found with "400"
2020/09/28 17:17:57 [debug] KONG_WORKER_CONSISTENCY ENV found with "eventual"
2020/09/28 17:17:57 [debug] KONG_CLIENT_SSL ENV found with "off"
2020/09/28 17:17:57 [debug] KONG_TRUSTED_IPS ENV found with "0.0.0.0/0,::/0"
2020/09/28 17:17:57 [debug] KONG_SSL_CERT_KEY ENV found with "/usr/local/kong/ssl/kongprivatekey.key"
2020/09/28 17:17:57 [debug] KONG_MEM_CACHE_SIZE ENV found with "1024m"
2020/09/28 17:17:57 [debug] KONG_NGINX_PROXY_REAL_IP_HEADER ENV found with "X-Forwarded-For"
2020/09/28 17:17:57 [debug] KONG_DB_UPDATE_FREQUENCY ENV found with "5"
2020/09/28 17:17:57 [debug] KONG_DNS_ORDER ENV found with "LAST,SRV,A,CNAME"
2020/09/28 17:17:57 [debug] KONG_DNS_ERROR_TTL ENV found with "1"
2020/09/28 17:17:57 [debug] KONG_DATABASE ENV found with "cassandra"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_DATA_CENTERS ENV found with "DC1:3,DC2:3"
2020/09/28 17:17:57 [debug] KONG_WORKER_STATE_UPDATE_FREQUENCY ENV found with "5"
2020/09/28 17:17:57 [debug] KONG_LUA_SSL_VERIFY_DEPTH ENV found with "3"
2020/09/28 17:17:57 [debug] KONG_LUA_SOCKET_POOL_SIZE ENV found with "30"
2020/09/28 17:17:57 [debug] KONG_UPSTREAM_KEEPALIVE_MAX_REQUESTS ENV found with "50000"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_CONSISTENCY ENV found with "LOCAL_QUORUM"
2020/09/28 17:17:57 [debug] KONG_CLIENT_MAX_BODY_SIZE ENV found with "50m"
2020/09/28 17:17:57 [debug] KONG_ADMIN_ERROR_LOG ENV found with "/dev/stderr"
2020/09/28 17:17:57 [debug] KONG_DNS_NOT_FOUND_TTL ENV found with "30"
2020/09/28 17:17:57 [debug] KONG_PROXY_ERROR_LOG ENV found with "/dev/stderr"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_REPL_STRATEGY ENV found with "NetworkTopologyStrategy"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_SSL_VERIFY ENV found with "on"
2020/09/28 17:17:57 [debug] KONG_ADMIN_ACCESS_LOG ENV found with "off"
2020/09/28 17:17:57 [debug] KONG_DNS_HOSTSFILE ENV found with "/etc/hosts"
2020/09/28 17:17:57 [debug] KONG_WAF_REQUEST_FILE_SIZE_LIMIT ENV found with "50000000"
2020/09/28 17:17:57 [debug] KONG_SSL_CERT ENV found with "/usr/local/kong/ssl/kongcert.crt"
2020/09/28 17:17:57 [debug] KONG_NGINX_PROXY_REAL_IP_RECURSIVE ENV found with "on"
2020/09/28 17:17:57 [debug] KONG_SSL_CIPHER_SUITE ENV found with "intermediate"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_SSL ENV found with "on"
2020/09/28 17:17:57 [debug] KONG_ANONYMOUS_REPORTS ENV found with "off"
2020/09/28 17:17:57 [debug] KONG_WAF_MODE ENV found with "On"
2020/09/28 17:17:57 [debug] KONG_CLIENT_BODY_BUFFER_SIZE ENV found with "50m"
2020/09/28 17:17:57 [debug] KONG_WAF_PCRE_MATCH_LIMIT ENV found with "10000"
2020/09/28 17:17:57 [debug] KONG_LUA_SSL_TRUSTED_CERTIFICATE ENV found with "/usr/local/kong/ssl/kongcert.pem"
2020/09/28 17:17:57 [debug] KONG_CASSANDRA_LB_POLICY ENV found with "RequestDCAwareRoundRobin"
2020/09/28 17:17:57 [debug] KONG_WAF_AUDIT ENV found with "RelevantOnly"
2020/09/28 17:17:57 [debug] admin_access_log = "off"
2020/09/28 17:17:57 [debug] admin_error_log = "/dev/stderr"
2020/09/28 17:17:57 [debug] admin_listen = {"0.0.0.0:8001 deferred reuseport"}
2020/09/28 17:17:57 [debug] anonymous_reports = false
2020/09/28 17:17:57 [debug] cassandra_consistency = "LOCAL_QUORUM"
2020/09/28 17:17:57 [debug] cassandra_contact_points = {"apsrp05503","apsrp05505","apsrp05507","apsrp05502","apsrp05504","apsrp05506"}
2020/09/28 17:17:57 [debug] cassandra_data_centers = {"DC1:3","DC2:3"}
2020/09/28 17:17:57 [debug] cassandra_keyspace = "kong_prod2"
2020/09/28 17:17:57 [debug] cassandra_lb_policy = "RequestDCAwareRoundRobin"
2020/09/28 17:17:57 [debug] cassandra_local_datacenter = "DC1"
2020/09/28 17:17:57 [debug] cassandra_password = "******"
2020/09/28 17:17:57 [debug] cassandra_port = 9042
2020/09/28 17:17:57 [debug] cassandra_read_consistency = "LOCAL_QUORUM"
2020/09/28 17:17:57 [debug] cassandra_refresh_frequency = 0
2020/09/28 17:17:57 [debug] cassandra_repl_factor = 1
2020/09/28 17:17:57 [debug] cassandra_repl_strategy = "NetworkTopologyStrategy"
2020/09/28 17:17:57 [debug] cassandra_schema_consensus_timeout = 30000
2020/09/28 17:17:57 [debug] cassandra_ssl = true
2020/09/28 17:17:57 [debug] cassandra_ssl_verify = true
2020/09/28 17:17:57 [debug] cassandra_timeout = 5000
2020/09/28 17:17:57 [debug] cassandra_username = "kongdba"
2020/09/28 17:17:57 [debug] cassandra_write_consistency = "LOCAL_QUORUM"
2020/09/28 17:17:57 [debug] client_body_buffer_size = "50m"
2020/09/28 17:17:57 [debug] client_max_body_size = "50m"
2020/09/28 17:17:57 [debug] client_ssl = false
2020/09/28 17:17:57 [debug] cluster_control_plane = "127.0.0.1:8005"
2020/09/28 17:17:57 [debug] cluster_listen = {"0.0.0.0:8005"}
2020/09/28 17:17:57 [debug] cluster_mtls = "shared"
2020/09/28 17:17:57 [debug] database = "cassandra"
2020/09/28 17:17:57 [debug] db_cache_ttl = 0
2020/09/28 17:17:57 [debug] db_cache_warmup_entities = {"services","plugins","consumers"}
2020/09/28 17:17:57 [debug] db_resurrect_ttl = 30
2020/09/28 17:17:57 [debug] db_update_frequency = 5
2020/09/28 17:17:57 [debug] db_update_propagation = 5
2020/09/28 17:17:57 [debug] dns_error_ttl = 1
2020/09/28 17:17:57 [debug] dns_hostsfile = "/etc/hosts"
2020/09/28 17:17:57 [debug] dns_no_sync = false
2020/09/28 17:17:57 [debug] dns_not_found_ttl = 30
2020/09/28 17:17:57 [debug] dns_order = {"LAST","SRV","A","CNAME"}
2020/09/28 17:17:57 [debug] dns_resolver = {}
2020/09/28 17:17:57 [debug] dns_stale_ttl = 4
2020/09/28 17:17:57 [debug] error_default_type = "text/plain"
2020/09/28 17:17:57 [debug] go_plugins_dir = "off"
2020/09/28 17:17:57 [debug] go_pluginserver_exe = "/usr/local/bin/go-pluginserver"
2020/09/28 17:17:57 [debug] headers = {"latency_tokens"}
2020/09/28 17:17:57 [debug] host_ports = {}
2020/09/28 17:17:57 [debug] kic = false
2020/09/28 17:17:57 [debug] log_level = "notice"
2020/09/28 17:17:57 [debug] lua_package_cpath = ""
2020/09/28 17:17:57 [debug] lua_package_path = "./?.lua;./?/init.lua;"
2020/09/28 17:17:57 [debug] lua_socket_pool_size = 30
2020/09/28 17:17:57 [debug] lua_ssl_trusted_certificate = "/usr/local/kong/ssl/kongcert.pem"
2020/09/28 17:17:57 [debug] lua_ssl_verify_depth = 3
2020/09/28 17:17:57 [debug] mem_cache_size = "1024m"
2020/09/28 17:17:57 [debug] nginx_admin_directives = {}
2020/09/28 17:17:57 [debug] nginx_daemon = "on"
2020/09/28 17:17:57 [debug] nginx_events_directives = {{name="multi_accept",value="on"},{name="worker_connections",value="auto"}}
2020/09/28 17:17:57 [debug] nginx_events_multi_accept = "on"
2020/09/28 17:17:57 [debug] nginx_events_worker_connections = "auto"
2020/09/28 17:17:57 [debug] nginx_http_client_body_buffer_size = "50m"
2020/09/28 17:17:57 [debug] nginx_http_client_max_body_size = "50m"
2020/09/28 17:17:57 [debug] nginx_http_directives = {{name="client_max_body_size",value="50m"},{name="ssl_prefer_server_ciphers",value="off"},{name="client_body_buffer_size",value="50m"},{name="ssl_protocols",value="TLSv1.2 TLSv1.3"},{name="ssl_session_timeout",value="1d"},{name="ssl_session_tickets",value="on"}}
2020/09/28 17:17:57 [debug] nginx_http_ssl_prefer_server_ciphers = "off"
2020/09/28 17:17:57 [debug] nginx_http_ssl_protocols = "TLSv1.2 TLSv1.3"
2020/09/28 17:17:57 [debug] nginx_http_ssl_session_tickets = "on"
2020/09/28 17:17:57 [debug] nginx_http_ssl_session_timeout = "1d"
2020/09/28 17:17:57 [debug] nginx_http_status_directives = {}
2020/09/28 17:17:57 [debug] nginx_http_upstream_directives = {{name="keepalive_requests",value="100"},{name="keepalive_timeout",value="60s"},{name="keepalive",value="60"}}
2020/09/28 17:17:57 [debug] nginx_http_upstream_keepalive = "60"
2020/09/28 17:17:57 [debug] nginx_http_upstream_keepalive_requests = "100"
2020/09/28 17:17:57 [debug] nginx_http_upstream_keepalive_timeout = "60s"
2020/09/28 17:17:57 [debug] nginx_main_daemon = "on"
2020/09/28 17:17:57 [debug] nginx_main_directives = {{name="daemon",value="on"},{name="worker_rlimit_nofile",value="auto"},{name="worker_processes",value="6"}}
2020/09/28 17:17:57 [debug] nginx_main_worker_processes = "6"
2020/09/28 17:17:57 [debug] nginx_main_worker_rlimit_nofile = "auto"
2020/09/28 17:17:57 [debug] nginx_optimizations = true
2020/09/28 17:17:57 [debug] nginx_proxy_directives = {{name="real_ip_header",value="X-Forwarded-For"},{name="real_ip_recursive",value="on"}}
2020/09/28 17:17:57 [debug] nginx_proxy_real_ip_header = "X-Forwarded-For"
2020/09/28 17:17:57 [debug] nginx_proxy_real_ip_recursive = "on"
2020/09/28 17:17:57 [debug] nginx_sproxy_directives = {}
2020/09/28 17:17:57 [debug] nginx_status_directives = {}
2020/09/28 17:17:57 [debug] nginx_stream_directives = {{name="ssl_session_tickets",value="on"},{name="ssl_protocols",value="TLSv1.2 TLSv1.3"},{name="ssl_session_timeout",value="1d"},{name="ssl_prefer_server_ciphers",value="off"}}
2020/09/28 17:17:57 [debug] nginx_stream_ssl_prefer_server_ciphers = "off"
2020/09/28 17:17:57 [debug] nginx_stream_ssl_protocols = "TLSv1.2 TLSv1.3"
2020/09/28 17:17:57 [debug] nginx_stream_ssl_session_tickets = "on"
2020/09/28 17:17:57 [debug] nginx_stream_ssl_session_timeout = "1d"
2020/09/28 17:17:57 [debug] nginx_supstream_directives = {}
2020/09/28 17:17:57 [debug] nginx_upstream_directives = {{name="keepalive_requests",value="100"},{name="keepalive_timeout",value="60s"},{name="keepalive",value="60"}}
2020/09/28 17:17:57 [debug] nginx_upstream_keepalive = "60"
2020/09/28 17:17:57 [debug] nginx_upstream_keepalive_requests = "100"
2020/09/28 17:17:57 [debug] nginx_upstream_keepalive_timeout = "60s"
2020/09/28 17:17:57 [debug] nginx_worker_processes = "auto"
2020/09/28 17:17:57 [debug] pg_database = "kong"
2020/09/28 17:17:57 [debug] pg_host = "127.0.0.1"
2020/09/28 17:17:57 [debug] pg_max_concurrent_queries = 0
2020/09/28 17:17:57 [debug] pg_port = 5432
2020/09/28 17:17:57 [debug] pg_ro_ssl = false
2020/09/28 17:17:57 [debug] pg_ro_ssl_verify = false
2020/09/28 17:17:57 [debug] pg_semaphore_timeout = 60000
2020/09/28 17:17:57 [debug] pg_ssl = false
2020/09/28 17:17:57 [debug] pg_ssl_verify = false
2020/09/28 17:17:57 [debug] pg_timeout = 5000
2020/09/28 17:17:57 [debug] pg_user = "kong"
2020/09/28 17:17:57 [debug] plugins = {"kong-siteminder-auth","kong-kafka-log","stargate-waf-error-log","mtls","stargate-oidc-token-revoke","kong-tx-debugger","kong-plugin-oauth","zipkin","kong-error-log","kong-oidc-implicit-token","kong-response-size-limiting","request-transformer","kong-service-virtualization","kong-cluster-drain","kong-upstream-jwt","kong-splunk-log","kong-spec-expose","kong-path-based-routing","kong-oidc-multi-idp","correlation-id","oauth2","statsd","jwt","rate-limiting","acl","request-size-limiting","request-termination","cors"}
2020/09/28 17:17:57 [debug] port_maps = {}
2020/09/28 17:17:57 [debug] prefix = "/usr/local/kong/"
2020/09/28 17:17:57 [debug] proxy_access_log = "off"
2020/09/28 17:17:57 [debug] proxy_error_log = "/dev/stderr"
2020/09/28 17:17:57 [debug] proxy_listen = {"0.0.0.0:8000","0.0.0.0:8443 ssl http2 deferred reuseport"}
2020/09/28 17:17:57 [debug] real_ip_header = "X-Real-IP"
2020/09/28 17:17:57 [debug] real_ip_recursive = "off"
2020/09/28 17:17:57 [debug] role = "traditional"
2020/09/28 17:17:57 [debug] ssl_cert = "/usr/local/kong/ssl/kongcert.crt"
2020/09/28 17:17:57 [debug] ssl_cert_key = "/usr/local/kong/ssl/kongprivatekey.key"
2020/09/28 17:17:57 [debug] ssl_cipher_suite = "intermediate"
2020/09/28 17:17:57 [debug] ssl_ciphers = "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384"
2020/09/28 17:17:57 [debug] ssl_prefer_server_ciphers = "on"
2020/09/28 17:17:57 [debug] ssl_protocols = "TLSv1.1 TLSv1.2 TLSv1.3"
2020/09/28 17:17:57 [debug] ssl_session_tickets = "on"
2020/09/28 17:17:57 [debug] ssl_session_timeout = "1d"
2020/09/28 17:17:57 [debug] status_access_log = "off"
2020/09/28 17:17:57 [debug] status_error_log = "logs/status_error.log"
2020/09/28 17:17:57 [debug] status_listen = {"off"}
2020/09/28 17:17:57 [debug] stream_listen = {"off"}
2020/09/28 17:17:57 [debug] trusted_ips = {"0.0.0.0/0","::/0"}
2020/09/28 17:17:57 [debug] upstream_keepalive = 60
2020/09/28 17:17:57 [debug] upstream_keepalive_idle_timeout = 30
2020/09/28 17:17:57 [debug] upstream_keepalive_max_requests = 50000
2020/09/28 17:17:57 [debug] upstream_keepalive_pool_size = 400
2020/09/28 17:17:57 [debug] waf = "off"
2020/09/28 17:17:57 [debug] waf_audit = "RelevantOnly"
2020/09/28 17:17:57 [debug] waf_debug_level = "0"
2020/09/28 17:17:57 [debug] waf_mode = "On"
2020/09/28 17:17:57 [debug] waf_paranoia_level = "1"
2020/09/28 17:17:57 [debug] waf_pcre_match_limit = "10000"
2020/09/28 17:17:57 [debug] waf_pcre_match_limit_recursion = "10000"
2020/09/28 17:17:57 [debug] waf_request_file_size_limit = "50000000"
2020/09/28 17:17:57 [debug] waf_request_no_file_size_limit = "50000000"
2020/09/28 17:17:57 [debug] worker_consistency = "eventual"
2020/09/28 17:17:57 [debug] worker_state_update_frequency = 5
2020/09/28 17:17:57 [verbose] prefix in use: /usr/local/kong
2020/09/28 17:17:57 [debug] resolved Cassandra contact point 'server05503' to: 10.204.90.234
2020/09/28 17:17:57 [debug] resolved Cassandra contact point 'server05505' to: 10.204.90.235
2020/09/28 17:17:57 [debug] resolved Cassandra contact point 'server05507' to: 10.86.173.32
2020/09/28 17:17:57 [debug] resolved Cassandra contact point 'server05502' to: 10.106.184.179
2020/09/28 17:17:57 [debug] resolved Cassandra contact point 'server05504' to: 10.106.184.193
2020/09/28 17:17:57 [debug] resolved Cassandra contact point 'server05506' to: 10.87.52.252
2020/09/28 17:17:58 [debug] loading subsystems migrations...
2020/09/28 17:17:58 [verbose] retrieving keyspace schema state...
2020/09/28 17:17:58 [verbose] schema state retrieved
2020/09/28 17:17:58 [debug] loading subsystems migrations...
2020/09/28 17:17:58 [verbose] retrieving keyspace schema state...
2020/09/28 17:17:58 [verbose] schema state retrieved
2020/09/28 17:17:58 [debug] migrations to run:
         core: 009_200_to_210, 010_210_to_211, 011_212_to_213
          jwt: 003_200_to_210
          acl: 003_200_to_210, 004_212_to_213
rate-limiting: 004_200_to_210
       oauth2: 004_200_to_210, 005_210_to_211
2020/09/28 17:17:58 [info] migrating core on keyspace 'kong_prod2'...
2020/09/28 17:17:58 [debug] running migration: 009_200_to_210
2020/09/28 17:18:02 [info] core migrated up to: 009_200_to_210 (pending)
2020/09/28 17:18:02 [debug] running migration: 010_210_to_211
2020/09/28 17:18:02 [info] core migrated up to: 010_210_to_211 (pending)
2020/09/28 17:18:02 [debug] running migration: 011_212_to_213
2020/09/28 17:18:02 [info] core migrated up to: 011_212_to_213 (executed)
2020/09/28 17:18:02 [info] migrating jwt on keyspace 'kong_prod2'...
2020/09/28 17:18:02 [debug] running migration: 003_200_to_210
2020/09/28 17:18:02 [info] jwt migrated up to: 003_200_to_210 (pending)
2020/09/28 17:18:02 [info] migrating acl on keyspace 'kong_prod2'...
2020/09/28 17:18:02 [debug] running migration: 003_200_to_210
2020/09/28 17:18:02 [info] acl migrated up to: 003_200_to_210 (pending)
2020/09/28 17:18:02 [debug] running migration: 004_212_to_213
2020/09/28 17:18:02 [info] acl migrated up to: 004_212_to_213 (pending)
2020/09/28 17:18:02 [info] migrating rate-limiting on keyspace 'kong_prod2'...
2020/09/28 17:18:02 [debug] running migration: 004_200_to_210
2020/09/28 17:18:02 [info] rate-limiting migrated up to: 004_200_to_210 (executed)
2020/09/28 17:18:02 [info] migrating oauth2 on keyspace 'kong_prod2'...
2020/09/28 17:18:02 [debug] running migration: 004_200_to_210
2020/09/28 17:18:03 [info] oauth2 migrated up to: 004_200_to_210 (pending)
2020/09/28 17:18:03 [debug] running migration: 005_210_to_211
2020/09/28 17:18:03 [info] oauth2 migrated up to: 005_210_to_211 (pending)
2020/09/28 17:18:03 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:18:09 [verbose] Cassandra schema consensus: reached
2020/09/28 17:18:09 [info] 9 migrations processed
2020/09/28 17:18:09 [info] 2 executed
2020/09/28 17:18:09 [info] 7 pending
2020/09/28 17:18:09 [debug] loading subsystems migrations...
2020/09/28 17:18:09 [verbose] retrieving keyspace schema state...
2020/09/28 17:18:09 [verbose] schema state retrieved
2020/09/28 17:18:09 [info]
Database has pending migrations; run 'kong migrations finish' when ready

So far so good, right? Well, it's after the up that the finish starts to see problems (env variable dump removed from the verbose output for brevity):

/ $ kong migrations finish --db-timeout 120 --vv
2020/09/28 17:19:57 [debug] loading subsystems migrations...
2020/09/28 17:19:57 [verbose] retrieving keyspace schema state...
2020/09/28 17:19:57 [verbose] schema state retrieved
2020/09/28 17:19:57 [debug] loading subsystems migrations...
2020/09/28 17:19:57 [verbose] retrieving keyspace schema state...
2020/09/28 17:19:58 [verbose] schema state retrieved
2020/09/28 17:19:58 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/09/28 17:19:58 [info] migrating core on keyspace 'kong_prod2'...
2020/09/28 17:19:58 [debug] running migration: 009_200_to_210
2020/09/28 17:20:12 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:20:12 [verbose] Cassandra schema consensus: reached
2020/09/28 17:20:12 [info] core migrated up to: 009_200_to_210 (executed)
2020/09/28 17:20:12 [debug] running migration: 010_210_to_211
2020/09/28 17:20:12 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:20:12 [verbose] Cassandra schema consensus: reached
2020/09/28 17:20:12 [info] core migrated up to: 010_210_to_211 (executed)
2020/09/28 17:20:12 [info] migrating jwt on keyspace 'kong_prod2'...
2020/09/28 17:20:12 [debug] running migration: 003_200_to_210
2020/09/28 17:20:14 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:20:14 [verbose] Cassandra schema consensus: not reached
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

Second run of finish to try again:

2020/09/28 17:22:14 [debug] loading subsystems migrations...
2020/09/28 17:22:14 [verbose] retrieving keyspace schema state...
2020/09/28 17:22:14 [verbose] schema state retrieved
2020/09/28 17:22:15 [debug] loading subsystems migrations...
2020/09/28 17:22:15 [verbose] retrieving keyspace schema state...
2020/09/28 17:22:15 [verbose] schema state retrieved
2020/09/28 17:22:15 [debug] pending migrations to finish:
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/09/28 17:22:15 [info] migrating acl on keyspace 'kong_prod2'...
2020/09/28 17:22:15 [debug] running migration: 003_200_to_210
2020/09/28 17:22:28 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:22:28 [verbose] Cassandra schema consensus: not reached
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

Third Finish....:

2020/09/28 17:23:28 [verbose] schema state retrieved
2020/09/28 17:23:28 [debug] loading subsystems migrations...
2020/09/28 17:23:28 [verbose] retrieving keyspace schema state...
2020/09/28 17:23:29 [verbose] schema state retrieved
2020/09/28 17:23:29 [debug] pending migrations to finish:
   acl: 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/09/28 17:23:29 [info] migrating acl on keyspace 'kong_prod2'...
2020/09/28 17:23:29 [debug] running migration: 004_212_to_213
2020/09/28 17:23:29 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:23:29 [verbose] Cassandra schema consensus: not reached
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

4th Finish...... :

2020/09/28 17:24:15 [debug] loading subsystems migrations...
2020/09/28 17:24:15 [verbose] retrieving keyspace schema state...
2020/09/28 17:24:15 [verbose] schema state retrieved
2020/09/28 17:24:15 [debug] loading subsystems migrations...
2020/09/28 17:24:15 [verbose] retrieving keyspace schema state...
2020/09/28 17:24:16 [verbose] schema state retrieved
2020/09/28 17:24:16 [debug] pending migrations to finish:
oauth2: 004_200_to_210, 005_210_to_211
2020/09/28 17:24:16 [info] migrating oauth2 on keyspace 'kong_prod2'...
2020/09/28 17:24:16 [debug] running migration: 004_200_to_210
2020/09/28 17:24:18 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:24:18 [verbose] Cassandra schema consensus: reached
2020/09/28 17:24:18 [info] oauth2 migrated up to: 004_200_to_210 (executed)
2020/09/28 17:24:18 [debug] running migration: 005_210_to_211
2020/09/28 17:24:18 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:24:18 [verbose] Cassandra schema consensus: reached
2020/09/28 17:24:18 [info] oauth2 migrated up to: 005_210_to_211 (executed)
2020/09/28 17:24:18 [info] 2 migrations processed
2020/09/28 17:24:18 [info] 2 executed
2020/09/28 17:24:18 [debug] loading subsystems migrations...
2020/09/28 17:24:18 [verbose] retrieving keyspace schema state...
2020/09/28 17:24:18 [verbose] schema state retrieved
2020/09/28 17:24:18 [info] No pending migrations to finish

Lastly confirmed via list and finish again that there was nothing left to do:

2020/09/28 17:25:35 [debug] loading subsystems migrations...
2020/09/28 17:25:35 [verbose] retrieving keyspace schema state...
2020/09/28 17:25:35 [verbose] schema state retrieved
2020/09/28 17:25:35 [info] Executed migrations:
         core: 000_base, 003_100_to_110, 004_110_to_120, 005_120_to_130, 006_130_to_140, 007_140_to_150, 008_150_to_200, 009_200_to_210, 010_210_to_211, 011_212_to_213
          jwt: 000_base_jwt, 002_130_to_140, 003_200_to_210
          acl: 000_base_acl, 002_130_to_140, 003_200_to_210, 004_212_to_213
rate-limiting: 000_base_rate_limiting, 003_10_to_112, 004_200_to_210
       oauth2: 000_base_oauth2, 003_130_to_140, 004_200_to_210, 005_210_to_211

Looks like re-entrancy is definitely still a problem with the workspace tables; the number of new ws_id's likely correlates with how many times I re-executed the finish command trying to get through this upgrade, since finish failed on me a few times in a row and seemed to progress one table at a time with each failure:

image

I also find it interesting that I set the C* schema consensus timeout to be long, yet it bounces back instantly saying consensus was not reached. Ex:

2020/09/28 17:23:29 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:23:29 [verbose] Cassandra schema consensus: not reached

cc @bungle @locao @hishamhm

Edit:

Yeah I see some tables like oauth2_credentials using ws_id: 38321a07-1d68-4aa8-b01f-8d64d5bf665a
and jwt_secrets using ws_id: 8f1d01e1-0609-4bda-a021-a01e619d1590

Additional Details & Logs

  • Kong version 2.1.4

Most helpful comment

Rewrote my keyspace copy as an Ansible playbook that can send a keyspace from a source C* cluster node to a totally different C* cluster node (assuming the keyspace does not already exist; no checks or anything fancy here):

- hosts: src_cassandra_node
  become: true
  tasks:
  - name: Make data directory for the clone files
    file:
      path: /data/cassandra/clone
      state: directory
  - name: Move into the C* folder we intend to put all our files and Export keyspace schema
    command: bash -c "cd /data/cassandra/clone && cqlsh --ssl -e 'DESCRIBE KEYSPACE {{ KEYSPACE_SOURCE_NAME }}' > {{ KEYSPACE_SOURCE_NAME }}.txt"
  - name: Alter keyspace schema name for later import in other cluster
    command: bash -c "cd /data/cassandra/clone && sed -i 's/{{ KEYSPACE_SOURCE_NAME }}/{{ KEYSPACE_DESTINATION_NAME }}/g' {{ KEYSPACE_SOURCE_NAME }}.txt"
  - name: Export table data into logical files for import later
    command: bash -c "cd /data/cassandra/clone && cqlsh --ssl -e \"use {{ KEYSPACE_SOURCE_NAME }}; describe tables;\" |  tr \" \" \"\\n\" | sort | sed -r '/^\s*$/d' | xargs -I % cqlsh --ssl -e \"COPY {{ KEYSPACE_SOURCE_NAME }}.% TO '%.dat'\""
    register: export_table_data
  - debug: msg="{{ export_table_data.stdout_lines }}"
  - name: Make sure the local directory is empty
    local_action: file path=/data/cassandra/clone state=absent
  - name: Make temp local directory for Cassandra files on the controller
    local_action: command mkdir -p /data/cassandra/clone
  - shell: (cd /data/cassandra/clone; find . -maxdepth 1 -type f) | cut -d'/' -f2
    register: files_to_copy
  - fetch: src=/data/cassandra/clone/{{ item }} dest=/data/cassandra/clone/ flat=yes
    with_items: "{{ files_to_copy.stdout_lines }}"
  - name: Remove files and directory from remote host
    command: rm -rf /data/cassandra/clone

- hosts: dest_cassandra_node
  become: true
  tasks:
  - name: Remove files and directory from remote host
    command: rm -rf /data/cassandra/clone
  - name: Make data directory for the clone files
    file:
      path: /data/cassandra/clone
      state: directory
  - name: Move Cassandra files to destination C* node
    copy:
      src: /data/cassandra/clone/
      dest: /data/cassandra/clone
      owner: cassandra
      group: cassandra
      mode: 0755
  - name: Make sure the local ansible controller directory is emptied afterward
    local_action: file path=/data/cassandra/clone state=absent
  - name: Build new keyspace and schema
    command: bash -c "cd /data/cassandra/clone && cqlsh --ssl -f '{{ KEYSPACE_SOURCE_NAME }}.txt'"
  - name: Import data to new keyspace
    command: bash -c "cd /data/cassandra/clone && cqlsh --ssl -e \"use {{ KEYSPACE_DESTINATION_NAME }}; describe tables;\" |  tr \" \" \"\\n\" | sort | sed -r '/^\s*$/d' | xargs -I % cqlsh --ssl -e \"COPY {{ KEYSPACE_DESTINATION_NAME }}.% FROM '%.dat'\""
    register: import_table_data
  - debug: msg="{{ import_table_data.stdout_lines }}"  
  - name: Repair newly made clone keyspace for good measure
    command: bash -c "nodetool repair -full {{ KEYSPACE_DESTINATION_NAME }}"
  - name: Clean out temp keyspace files and data on remote C* destination
    command: rm -rf /data/cassandra/clone
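
(For anyone reusing this: I just pass the keyspace names in as extra vars, something along the lines of `ansible-playbook clone_keyspace.yml -e "KEYSPACE_SOURCE_NAME=kong_prod KEYSPACE_DESTINATION_NAME=kong_prod2"`, with `src_cassandra_node` and `dest_cassandra_node` groups defined in the inventory. The playbook filename and keyspace names here are just examples.)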

I will be resuming the new bungle PR testing here shortly tomorrow and into the weekend/Monday!

Much faster execution than my first-go shell script hack too.

All 81 comments

Thank you for the detailed report as always! We'll be looking into it.

Yeah, I think it would have worked fine if the finish had not died waiting for schema consensus and had continued processing. But it looks like that can't always be counted on for us, so it can't be assumed in code. I tried the same thing in another of our prod environments and the same errors happened there too. I also decided to check the C* nodes for any schema mismatches or cluster health discrepancies, but nothing shows up there either. This may point to C* issues as well, alongside Kong not being able to handle all situations (such as restarting a migrations finish because the first execution didn't work out).

root@server05502:/root
# nodetool describecluster
Cluster Information:
    Name: Stargate
    Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
    DynamicEndPointSnitch: enabled
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        12b6ca53-8e84-3a66-8793-7ce091b5dcca: [10.204.90.234, 10.86.173.32, 10.204.90.235, 10.106.184.179, 10.106.184.193, 10.87.52.252]

root@server05502:/root
# nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.204.90.234   113.89 MiB  256          100.0%            5be9bdf7-1337-42ac-8177-341044a4758b  RACK2
UN  10.86.173.32    117.4 MiB  256          100.0%            0f8d693c-4d99-453b-826e-6e5c68538037  RACK6
UN  10.204.90.235   335.54 MiB  256          100.0%            8e3192a7-e60c-41fb-8ca6-ce809440d982  RACK4
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.106.184.179  117.48 MiB  256          100.0%            33747998-70fd-42c4-87cb-c7c49a182c3e  RACK1
UN  10.87.52.252    104.64 MiB  256          100.0%            74a60b30-c83c-4aa8-b3ae-7ccb16fb817f  RACK5
UN  10.106.184.193  107.06 MiB  256          100.0%            61320d76-aa12-43f0-9684-b1413e57c90d  RACK3

So that wasn't any help. I wonder if there are any analytic health checks I can run on the cloned keyspace itself to check for weirdness or mismatches too.

Edit - Used another node in the other DC with the same commands above and it checked out fine too, so no smoking gun there as far as why C* could be the culprit in some of this. Regardless, I hope Kong makes the finish safer when it needs to execute multiple times, so that all the tables end up using, and staying in sync on, a single ws_id.

It certainly feels like some kind of eventual consistency issue. We now insert the workspace only after doing a read-before-write (AFAIK), but perhaps that read returns nil while the write is still being applied eventually, making us do the write again. And since the default ws id can be different each time, we cannot just do an update.

Is it not possible to add logic to the migrations process for the finish command such that, if a workspace table already has a default record in it, that record is used to continue processing? Or is the reservation that this could potentially break other enterprise use cases?

Or maybe that's exactly what you meant here?

We now insert the workspace only when after doing read-before-write (AFAIK), but perhaps that read will return nil when the write is done eventually

But I imagine independent executions of the finish command don't check whether a default ws was already generated; based on the behavior I have seen, each run probably just generates a fresh one and drops it in?

Maybe the fact that we do full replication and LOCAL_QUORUM reads/writes, and all our other Kong C* settings, plays a role in this problem, but we do so because it's a must to ensure the OAuth2.0 bearer token plugin functions correctly for us.

To debug the behavior, even if you can't reproduce this against your own localhost C* cluster:

...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures

I imagine you could set manual breakpoints in the code to reproduce the re-entrant injection of ws_id's when finish runs multiple times and stops at a certain portion of its execution, like it did in the situations above for us.

Or maybe that's exactly what you meant here?

Yes, that's what we try to do. During finish, we first try to read the default workspace value so we can reuse it, and if we get nothing, we create it. Looks like the SELECT is returning an empty response.
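
To make the race concrete, here is a rough sketch of that get-or-create flow (the table/column names, the exact CQL, and the `coordinator` handle are illustrative, not the actual migration helper). If the SELECT lands on a replica that has not yet seen a previous run's INSERT, the branch that mints a brand new ws_id runs again:

```lua
local cassandra = require "cassandra"
local utils     = require "kong.tools.utils"

-- Illustrative sketch only (not the actual migration code): read-before-write
-- of the default workspace as described above.
local function get_or_create_default_ws(coordinator)
  -- 1. read: is there already a default workspace?
  local rows, err = coordinator:execute(
    "SELECT id FROM workspaces WHERE name = 'default' ALLOW FILTERING")
  if err then
    return nil, err
  end

  if rows and rows[1] then
    return rows[1].id -- reuse the existing ws_id
  end

  -- 2. write: nothing visible yet, so mint a fresh ws_id. Under eventual
  -- consistency this branch can run even though a previous finish attempt
  -- already inserted a default workspace on another replica.
  local ws_id = utils.uuid()
  local _, ierr = coordinator:execute(
    "INSERT INTO workspaces(id, name) VALUES(?, 'default')",
    { cassandra.uuid(ws_id) })
  if ierr then
    return nil, ierr
  end

  return ws_id
end
```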

Maybe the case we do full replication and local quorum r/w and all our Kong C* settings play a roll in this problem but we do so because it's a must to ensure OAuth2.0 bearer token plugin functions correctly for us .

If anything, that should have brought _stronger_ consistency, right?

Puzzling.

I'm wondering if chasing after this isn't a more fruitful avenue:

Also find it interesting how I set the C* schema consensus timeouts long but it bounces back instantly saying consensus not reached? Ex:

2020/09/28 17:23:29 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:23:29 [verbose] Cassandra schema consensus: not reached

Hmm, if it were an eventual consistency problem, though, the time between my finish executions should have been enough. Pretty sure I could wait 10 minutes between each finish run and it would still result in new ws_id's being entered, meaning that first SELECT must always be failing somehow for us... odd.

And yeah, one obvious red flag to me is that we set the schema consensus timeout huge but it does not seem to be honored. Either it snaps back right away due to an underlying error (which is not printed and no details are given), or it just straight up won't wait for the given time and the wait is being derived from some other value, either a default or something hardcoded.

I believe that it fails here:
https://github.com/thibaultcha/lua-cassandra/blob/master/lib/resty/cassandra/cluster.lua#L755-L764

Which means either of these fail:
https://github.com/thibaultcha/lua-cassandra/blob/master/lib/resty/cassandra/cluster.lua#L726-L730

It is busy looping, but it seems like it fails quite quickly on either of those queries.

As a workaround, one thing we can do is try to reach schema consensus more than once. But as you said, even if you waited you could still get the issue. I am also thinking that Cassandra indexes could lie, or that some of the queries are not executed at the right consistency level. Nodes could also go down while migrating, but you got it 4 times in a row, so...

Can you execute those queries:

SELECT schema_version FROM system.local
-- AND
SELECT schema_version FROM system.peers

and share the data when this happens (or is about to happen)?

@jeremyjpj0916, could it be a tombstone related issue:
https://stackoverflow.com/a/37119894

Okay, it might be that we are talking about two different issues.

  1. duplicated default workspaces
  2. waiting for schema consensus fails

For 1: duplicated default workspaces:

We might want to ensure that these:
https://github.com/Kong/kong/blob/master/kong/db/migrations/operations/200_to_210.lua#L24
https://github.com/Kong/kong/blob/master/kong/db/migrations/operations/200_to_210.lua#L43
https://github.com/Kong/kong/blob/master/kong/db/migrations/operations/212_to_213.lua#L24
https://github.com/Kong/kong/blob/master/kong/db/migrations/operations/212_to_213.lua#L43

run at a stricter consistency level. At the moment we are not sure whether changing the consistency level fixes it, but it is a good guess.
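
As a sketch of what that could look like (assuming the lua-cassandra options table; the query and the `coordinator` handle are illustrative, and this is not a committed change), the helper's queries would pass an explicit consistency instead of relying on the connection default:

```lua
local cassandra = require "cassandra"

-- Illustrative only: run the default-workspace lookup at a stricter
-- consistency level than the connection default.
local rows, err = coordinator:execute(
  "SELECT id FROM workspaces WHERE name = 'default' ALLOW FILTERING",
  nil,
  { consistency = cassandra.consistencies.quorum })
```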

For 2: waiting for schema consensus failing:

It feels like it is failing here:
https://github.com/Kong/kong/blob/master/kong/db/init.lua#L577

And Cassandra implements it here:
https://github.com/Kong/kong/blob/master/kong/db/strategies/cassandra/connector.lua#L396-L415

For some reason it fails sometimes with:

[Read failure] Operation failed - received 0 responses and 1 failures

This error message is coming from Cassandra itself, and is forwarded from here:
https://github.com/thibaultcha/lua-cassandra/blob/master/lib/resty/cassandra/cluster.lua#L726-L730

At the moment we don't fully understand why this error happens (tombstones?), but it seems to happen in a rather non-fatal place. We could retry it, e.g. keep trying for the full max_schema_consensus_wait (120000ms timeout), and not fail on the first [Read failure].
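
A minimal sketch of that "retry for the whole window" idea (the `check_schema_consensus()` helper here is hypothetical and stands in for the schema_version comparison linked above; this is not the actual connector code):

```lua
-- Hypothetical sketch: keep retrying the consensus check for the full
-- timeout window instead of bailing on the first transient read failure.
local function wait_for_schema_consensus(coordinator, timeout_ms)
  local deadline = ngx.now() * 1000 + timeout_ms
  local last_err

  while ngx.now() * 1000 < deadline do
    local ok, err = check_schema_consensus(coordinator) -- hypothetical helper
    if ok then
      return true
    end

    -- transient errors such as:
    -- "[Read failure] Operation failed - received 0 responses and 1 failures"
    -- are remembered but do not abort the wait
    last_err = err
    ngx.sleep(0.5) -- back off briefly before asking the peers again
  end

  return nil, "failed to wait for schema consensus: " .. (last_err or "timeout")
end
```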

@jeremyjpj0916 there is now #6411; would it be too much to ask for you to try that and report back whether it has any effect in your environment? That branch does not have much more than that compared to the latest 2.1.4, as can be seen here: https://github.com/Kong/kong/compare/2.1.4...fix/cassandra-migrations

Just tried the two commands ad hoc on a given prod node where I had been testing these migrations on a cloned keyspace. Outside the context of how Kong uses the C* Lua client lib tibo wrote, I'm not sure if this helps; it just shows these commands do generally work in a vacuum:

Can you execute those queries:

SELECT schema_version FROM system.local
-- AND
SELECT schema_version FROM system.peers

root@server05502:/root
# cqlsh --ssl
Connected to Stargate at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
****@cqlsh> SELECT schema_version FROM system.local;

 schema_version
--------------------------------------
 12b6ca53-8e84-3a66-8793-7ce091b5dcca

(1 rows)
****@cqlsh> SELECT schema_version FROM system.peers;

 schema_version
--------------------------------------
 12b6ca53-8e84-3a66-8793-7ce091b5dcca
 12b6ca53-8e84-3a66-8793-7ce091b5dcca
 12b6ca53-8e84-3a66-8793-7ce091b5dcca
 12b6ca53-8e84-3a66-8793-7ce091b5dcca
 12b6ca53-8e84-3a66-8793-7ce091b5dcca

(5 rows)
****@cqlsh>

As for tombstones, I remember raising the tombstone failure threshold too, as well as making the GC policy more aggressive than the default:

ALTER TABLE kong_stage3.ratelimiting_metrics WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.locks WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.oauth2_tokens WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.locks WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.snis WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.plugins WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.targets WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.consumers WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.upstreams WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.schema_meta WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.response_ratelimiting_metrics WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.tags WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.acls WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.cluster_ca WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.cluster_events WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.jwt_secrets WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.apis WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.basicauth_credentials WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.keyauth_credentials WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.oauth2_credentials WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.certificates WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.services WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.oauth2_authorization_codes WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.routes WITH GC_GRACE_SECONDS = 3600;
ALTER TABLE kong_stage3.hmacauth_credentials WITH GC_GRACE_SECONDS = 3600;

Which is 1 hour vs. the default 10 days for GC grace. I have forgotten whether the default_time_to_live field impacts anything as well, now that the GC grace is set to expire before it, though. Are you familiar? I may need to google around on that if it's not optimal currently too.

Ex:

image

image

Then the fail threshold was set to:

# When executing a scan, within or across a partition, we need to keep the
# tombstones seen in memory so we can return them to the coordinator, which
# will use them to make sure other replicas also know about the deleted rows.
# With workloads that generate a lot of tombstones, this can cause performance
# problems and even exaust the server heap.
# (http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets)
# Adjust the thresholds here if you understand the dangers and want to
# scan more tombstones anyway.  These thresholds may also be adjusted at runtime
# using the StorageService mbean.
tombstone_warn_threshold: 10000
tombstone_failure_threshold: 500000

vs the default of 100,000

What would the migrations be querying each time that would cause that, though? Does the cluster_events table get hit each time? I know the oauth2_tokens table (probably our largest table in terms of records) probably does not get hit on every migrations finish command, especially after its migration shows as executed in the list. Then again, could the oauth2_tokens table reaching millions of tombstones within an hour be causing it? Or the cluster_events table, for that matter? Unsure... and I am not sure how I could ever tune it further than that, short of setting GC grace to 0, which is not recommended in a C* cluster because if a node goes down you start getting "ghost" records (or so I have heard), since proper consensus can't be achieved via tombstone info. Maybe on tables like oauth2_tokens or cluster_events that matters less, because the rows carry timestamp values that help Kong decide their validity when they show up in a query, so maybe 0 isn't so bad for some of the more intensely written-to tables.

@bungle More than happy to try your latest PR today. I likely won't have results until later this evening, in about 10 or so hours, due to some restrictions on when I can touch some of our hosts for this testing. I will get that PR's files dropped in so the modified Kong is ready to go by then, though. Fingers crossed it yields some success.

Appreciate the work you put in trying to resolve this scenario.

Also ran a few queries against our live C* db:

****@cqlsh:kong_prod> select count(*) from cluster_events;

 count
-------
 87271

(1 rows)

Warnings :
Aggregation query used without partition key

Read 100 live rows and 89643 tombstone cells for query SELECT * FROM kong_prod.cluster_events LIMIT 100 (see tombstone_warn_threshold)

Lotta tombstones here for sure.

As for querying the oauth2_tokens table from one of the nodes via cqlsh... uhh, that's probably not good:

****@cqlsh:kong_prod> select count(*) from oauth2_tokens;
OperationTimedOut: errors={'127.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.1

Need a longer timeout just to figure things out I suppose.

Honestly pondering whether not tombstoning the cluster_events and oauth2 token data at all is the play... since that kind of data should not really need strict consensus (a little mismatched data or a temporary ghost record is okay, right?), because Kong can see the timestamps and valid tokens, etc. cluster_events may be more important, but I am not sure... pretty sure most of those events are 1-to-1 with oauth token related invalidations.

Yep, a little select confirms most of those entries too:

*****@cqlsh:kong_prod> select * from cluster_events limit 10;

 channel       | at                              | node_id                              | id                                   | data                                                   | nbf
---------------+---------------------------------+--------------------------------------+--------------------------------------+--------------------------------------------------------+---------------------------------
 invalidations | 2020-10-01 19:44:10.123000+0000 | 3ec49684-1e0b-45ac-a13b-33135cc09164 | 6de48d36-18cd-41e9-b431-6f24f0a507f0 | oauth2_tokens:XXX | 2020-10-01 19:44:15.111000+0000
 invalidations | 2020-10-01 19:44:10.153000+0000 | 3ec49684-1e0b-45ac-a13b-33135cc09164 | c683d31f-c001-4c45-80ba-47b95780f3f4 | oauth2_tokens:XXX  | 2020-10-01 19:44:15.152000+0000
 invalidations | 2020-10-01 19:44:10.167000+0000 | 7dccf201-351a-402e-a33f-f1f291063f87 | 407b95f7-05eb-47f8-bc25-689c3065b9bf | oauth2_tokens:XXX  | 2020-10-01 19:44:15.167000+0000
 invalidations | 2020-10-01 19:44:10.223000+0000 | 21881b66-c99a-470c-ab76-76dcd7fbbc90 | 86180cd0-07ea-43a0-9189-bc63452a817d | oauth2_tokens:XXX  | 2020-10-01 19:44:15.222000+0000
 invalidations | 2020-10-01 19:44:10.230000+0000 | 7456814f-9464-4602-9a1e-7e2b55e4c4b2 | 1aa7f89d-a98c-4a3b-a62f-cbe8458d2b12 | oauth2_tokens:XXX  | 2020-10-01 19:44:15.229000+0000
 invalidations | 2020-10-01 19:44:10.256000+0000 | 21881b66-c99a-470c-ab76-76dcd7fbbc90 | 6b5c3f94-db6b-4103-95c1-95b4f1de2700 | oauth2_tokens:XXX  | 2020-10-01 19:44:15.244000+0000

..... continued ... 

Probably not totally related, but the large number of invalidation events due to oauth2 does start to clog up the cluster_events table in general:

I would think oauth2 client credentials grant tokens don't ever really need an "invalidation" event? They are generated and valid for a given TTL, unless you can revoke them early (is that even a thing in the oauth2 CC pattern? Does Kong support revoking client credentials bearer tokens? We personally have never used such a feature). I think nodes can query the tokens table and know what's valid/invalid based on the token generation timestamps in the table. Isn't it just a /token endpoint that produces valid tokens when the right client_id, client_secret, and grant type are provided for oauth2 CC?

I'm not super familiar either, but the nodetool cfstats details for that oauth2_tokens table look like this in one prod env:

        Table: oauth2_tokens
        SSTable count: 4
        Space used (live): 153678228
        Space used (total): 153678228
        Space used by snapshots (total): 0
        Off heap memory used (total): 2694448
        SSTable Compression Ratio: 0.5890186627210159
        Number of partitions (estimate): 1614912
        Memtable cell count: 5138
        Memtable data size: 1268839
        Memtable off heap memory used: 0
        Memtable switch count: 1288
        Local read count: 24577751
        Local read latency: 0.067 ms
        Local write count: 53152217
        Local write latency: 0.109 ms
        Pending flushes: 0
        Percent repaired: 70.18
        Bloom filter false positives: 1
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 2307288
        Bloom filter off heap memory used: 2307256
        Index summary off heap memory used: 363776
        Compression metadata off heap memory used: 23416
        Compacted partition minimum bytes: 43
        Compacted partition maximum bytes: 179
        Compacted partition mean bytes: 128
        Average live cells per slice (last five minutes): 1.0
        Maximum live cells per slice (last five minutes): 1
        Average tombstones per slice (last five minutes): 1.984081041968162
        Maximum tombstones per slice (last five minutes): 4
        Dropped Mutations: 2052

This is mostly just informational to get an idea of what its like running things like oauth2 plugin in a well used environment.

@jeremyjpj0916 good feedback as always. I will create cards for us to improve the oauth2 situation. We have two lines in the code:

kong.db.oauth2_authorization_codes:delete({ id = auth_code.id })

and

kong.db.oauth2_tokens:delete({ id = token.id })

There are couple of updates too. The invalidation event is for nodes to clear caches. But let's see if there is a better way.

@bungle

Files are all prepped; I will be cloning the keyspace again in a few hours. I noticed you refactored all the core migrations.

Would others, such as the plugin migrations from my initial migration logs, need a refactor too:

 jwt: 003_200_to_210
 acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211

To gain the benefits you are hoping will occur on the core migration executions?

Such as the connector:query() -> coordinator:execute() ?
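
Roughly the shape of change I am picturing there, sketched for illustration rather than copied from the refactored files (the "UPDATE ..." strings are just placeholders for whatever data fix-ups the real teardowns do):

-- before: each statement goes through the connector, which handles its own
-- connection per call
local teardown_before = function(connector)
  assert(connector:query([[ UPDATE ... ]]))
end

-- after: the caller hands teardown an already-connected coordinator, so every
-- statement (and any paging) reuses that single connection
local teardown_after = function(connector, coordinator)
  assert(coordinator:execute([[ UPDATE ... ]]))
end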

Asking this because in my first finish execution the core migrations seemed to go off without a problem, but it was the subsequent individual plugin migration finish calls that seemed to get stuck one by one:

For reference again:

2020/09/28 17:19:58 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/09/28 17:19:58 [info] migrating core on keyspace 'kong_prod2'...
2020/09/28 17:19:58 [debug] running migration: 009_200_to_210
2020/09/28 17:20:12 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:20:12 [verbose] Cassandra schema consensus: reached
2020/09/28 17:20:12 [info] core migrated up to: 009_200_to_210 (executed)
2020/09/28 17:20:12 [debug] running migration: 010_210_to_211
2020/09/28 17:20:12 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:20:12 [verbose] Cassandra schema consensus: reached
2020/09/28 17:20:12 [info] core migrated up to: 010_210_to_211 (executed)
2020/09/28 17:20:12 [info] migrating jwt on keyspace 'kong_prod2'...
2020/09/28 17:20:12 [debug] running migration: 003_200_to_210
2020/09/28 17:20:14 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/09/28 17:20:14 [verbose] Cassandra schema consensus: not reached
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

Also worth noting some odd behavior when cloning the keyspaces: I go through and run a script that does some commands, and the only table that throws an error is, oddly, the oauth2 creds table:

error: Unknown column redirect_uri during deserialization
-- StackTrace --
java.lang.RuntimeException: Unknown column redirect_uri during deserialization
    at org.apache.cassandra.db.SerializationHeader$Component.toHeader(SerializationHeader.java:321)
    at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:511)
    at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:385)
    at org.apache.cassandra.db.ColumnFamilyStore.loadNewSSTables(ColumnFamilyStore.java:771)
    at org.apache.cassandra.db.ColumnFamilyStore.loadNewSSTables(ColumnFamilyStore.java:709)
    at org.apache.cassandra.service.StorageService.loadNewSSTables(StorageService.java:5110)
    at sun.reflect.GeneratedMethodAccessor519.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275)
    at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
    at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
    at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
    at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
    at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
    at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
    at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1468)
    at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:76)
    at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1309)
    at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1401)
    at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:829)
    at sun.reflect.GeneratedMethodAccessor547.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357)
    at sun.rmi.transport.Transport$1.run(Transport.java:200)
    at sun.rmi.transport.Transport$1.run(Transport.java:197)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

So the oauth2 creds don't refresh over right away when running the below command (the error above is what happens when I execute it):

ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} oauth2_credentials"

So I follow up with a manual COPY of that table (shown further below).


The original shell script doing the keyspace cloning that I am referring to:

#!/bin/bash
cloneKeyspace()
{
    local username="${1}"
    local start_host="${2}"
    local hostnames="${3}"
    local keyspace_source="${4}"
    local keyspace_destination="${5}"

    if ssh ${username}@${start_host} "echo 2>&1"; then
      ssh ${username}@${start_host} "cqlsh --ssl -e 'DROP KEYSPACE ${keyspace_destination}'"
    fi

    sleep 10

    for hostname in $(echo $hostnames | sed "s/,/ /g")
    do
      if ssh ${username}@${hostname} "echo 2>&1"; then
        ssh ${username}@${hostname} "rm -rf /data/cassandra/data/${keyspace_destination}"
        ssh ${username}@${hostname} "nodetool clearsnapshot -t copy"
      fi
    done

    if ssh ${username}@${start_host} "echo 2>&1"; then
     ssh ${username}@${start_host} "cqlsh --ssl -e 'DESCRIBE KEYSPACE ${keyspace_source}' > ${keyspace_source}.txt"
     ssh ${username}@${start_host} "sed -i 's/${keyspace_source}/${keyspace_destination}/g' ${keyspace_source}.txt"
     ssh ${username}@${start_host} "cqlsh --ssl -f '${keyspace_source}.txt'"
     ssh ${username}@${start_host} "rm ${keyspace_source}.txt"      
     echo "Success on Keyspace Schema clone: "${start_host}
    else
      echo "Failed on Keyspace Schema clone: "${start_host}
    fi

    for hostname in $(echo $hostnames | sed "s/,/ /g")
    do
       if ssh ${username}@${hostname} "echo 2>&1"; then
        ssh ${username}@${hostname} "nodetool snapshot -t copy ${keyspace_source}"
        echo "Success on Cassandra snapshot: "${hostname} ${keyspace_source}

        sleep 20

        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/acls-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/acls-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/apis-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/apis-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/ca_certificates-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/ca_certificates-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/certificates-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/certificates-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/cluster_events-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/cluster_events-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/consumers-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/consumers-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/jwt_secrets-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/jwt_secrets-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/keyauth_credentials-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/keyauth_credentials-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/locks-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/locks-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/oauth2_authorization_codes-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/oauth2_authorization_codes-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/oauth2_credentials-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/oauth2_credentials-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/oauth2_tokens-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/oauth2_tokens-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/plugins-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/plugins-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/ratelimiting_metrics-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/ratelimiting_metrics-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/routes-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/routes-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/schema_meta-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/schema_meta-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/services-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/services-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/snis-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/snis-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/tags-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/tags-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/targets-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/targets-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/upstreams-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/upstreams-*/"
        ssh ${username}@${hostname} "mv /data/cassandra/data/${keyspace_source}/workspaces-*/snapshots/copy/* /data/cassandra/data/${keyspace_destination}/workspaces-*/"

        ssh ${username}@${hostname} "nodetool clearsnapshot -t copy"

        echo "Success on Cassandra mv of snapshot files to destination: "${keyspace_destination}

        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} ratelimiting_metrics"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} oauth2_tokens"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} locks"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} snis"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} plugins"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} targets"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} consumers"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} upstreams"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} schema_meta"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} tags"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} acls"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} cluster_events"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} workspaces"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} jwt_secrets"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} apis"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} keyauth_credentials"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} oauth2_credentials"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} certificates"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} services"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} ca_certificates"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} oauth2_authorization_codes"
        ssh ${username}@${hostname} "nodetool refresh ${keyspace_destination} routes"

        echo "Success on Cassandra nodetool refresh on destination keyspace: "${keyspace_destination}
       else
        echo "Failed on Cassandra snapshot: "${hostname}
       fi
    done

    if ssh ${username}@${start_host} "echo 2>&1"; then
     ssh ${username}@${start_host} "nodetool repair -full ${keyspace_destination}"
     echo "Success on Keyspace repair: "${start_host} ${keyspace_destination}
    else
     echo "Failed on Keyspace repair : "${start_host} ${keyspace_destination}
    fi

    echo "Script processing completed!"
}

cloneKeyspace "${1}" "${2}" "${3}" "${4}" "${5}"

So to get the creds to import and show up properly when I query the table afterward, I do:

***@cqlsh> COPY kong_src.oauth2_credentials TO 'oauth2_credentials.dat';
Using 1 child processes

Starting copy of kong_desto.oauth2_credentials with columns [id, client_id, client_secret, client_type, consumer_id, created_at, hash_secret, name, redirect_uris, tags].
Processed: 81 rows; Rate:      46 rows/s; Avg. rate:      23 rows/s
81 rows exported to 1 files in 3.563 seconds.
****@cqlsh> COPY kong_desto.oauth2_credentials FROM 'oauth2_credentials.dat';

Welp trying it out now as is:

First the migrations up (which so far has never given me grief):

kong migrations up --db-timeout 120 --vv

2020/10/02 02:39:50 [debug] loading subsystems migrations...
2020/10/02 02:39:50 [verbose] retrieving keyspace schema state...
2020/10/02 02:39:50 [verbose] schema state retrieved
2020/10/02 02:39:50 [debug] loading subsystems migrations...
2020/10/02 02:39:50 [verbose] retrieving keyspace schema state...
2020/10/02 02:39:50 [verbose] schema state retrieved
2020/10/02 02:39:50 [debug] migrations to run:
         core: 009_200_to_210, 010_210_to_211, 011_212_to_213
          jwt: 003_200_to_210
          acl: 003_200_to_210, 004_212_to_213
rate-limiting: 004_200_to_210
       oauth2: 004_200_to_210, 005_210_to_211
2020/10/02 02:39:51 [info] migrating core on keyspace 'kong_prod2'...
2020/10/02 02:39:51 [debug] running migration: 009_200_to_210
2020/10/02 02:39:53 [info] core migrated up to: 009_200_to_210 (pending)
2020/10/02 02:39:53 [debug] running migration: 010_210_to_211
2020/10/02 02:39:53 [info] core migrated up to: 010_210_to_211 (pending)
2020/10/02 02:39:53 [debug] running migration: 011_212_to_213
2020/10/02 02:39:53 [info] core migrated up to: 011_212_to_213 (executed)
2020/10/02 02:39:53 [info] migrating jwt on keyspace 'kong_prod2'...
2020/10/02 02:39:53 [debug] running migration: 003_200_to_210
2020/10/02 02:39:54 [info] jwt migrated up to: 003_200_to_210 (pending)
2020/10/02 02:39:54 [info] migrating acl on keyspace 'kong_prod2'...
2020/10/02 02:39:54 [debug] running migration: 003_200_to_210
2020/10/02 02:39:54 [info] acl migrated up to: 003_200_to_210 (pending)
2020/10/02 02:39:54 [debug] running migration: 004_212_to_213
2020/10/02 02:39:54 [info] acl migrated up to: 004_212_to_213 (pending)
2020/10/02 02:39:54 [info] migrating rate-limiting on keyspace 'kong_prod2'...
2020/10/02 02:39:54 [debug] running migration: 004_200_to_210
2020/10/02 02:39:54 [info] rate-limiting migrated up to: 004_200_to_210 (executed)
2020/10/02 02:39:54 [info] migrating oauth2 on keyspace 'kong_prod2'...
2020/10/02 02:39:54 [debug] running migration: 004_200_to_210
2020/10/02 02:39:56 [info] oauth2 migrated up to: 004_200_to_210 (pending)
2020/10/02 02:39:56 [debug] running migration: 005_210_to_211
2020/10/02 02:39:56 [info] oauth2 migrated up to: 005_210_to_211 (pending)
2020/10/02 02:39:56 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:39:59 [verbose] Cassandra schema consensus: reached in 3049ms
2020/10/02 02:39:59 [info] 9 migrations processed
2020/10/02 02:39:59 [info] 2 executed
2020/10/02 02:39:59 [info] 7 pending
2020/10/02 02:39:59 [debug] loading subsystems migrations...
2020/10/02 02:39:59 [verbose] retrieving keyspace schema state...
2020/10/02 02:39:59 [verbose] schema state retrieved
2020/10/02 02:39:59 [info]
Database has pending migrations; run 'kong migrations finish' when ready

Welp, migrations finish hit more problems again. The good news is the 120 seconds was respected it seems, but it just never agreed on anything there. Look at how fast those core migrations reached consensus too, hah (3ms); the other one just hangs :/ :

kong migrations finish --db-timeout 120 --vv

The environment variables are the same as in my original post, so I'm cutting those out for brevity.

2020/10/02 02:42:24 [debug] loading subsystems migrations...
2020/10/02 02:42:24 [verbose] retrieving keyspace schema state...
2020/10/02 02:42:24 [verbose] schema state retrieved
2020/10/02 02:42:24 [debug] loading subsystems migrations...
2020/10/02 02:42:24 [verbose] retrieving keyspace schema state...
2020/10/02 02:42:25 [verbose] schema state retrieved
2020/10/02 02:42:25 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/02 02:42:25 [info] migrating core on keyspace 'kong_prod2'...
2020/10/02 02:42:25 [debug] running migration: 009_200_to_210
2020/10/02 02:42:37 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:42:37 [verbose] Cassandra schema consensus: reached in 3ms
2020/10/02 02:42:37 [info] core migrated up to: 009_200_to_210 (executed)
2020/10/02 02:42:37 [debug] running migration: 010_210_to_211
2020/10/02 02:42:37 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:42:37 [verbose] Cassandra schema consensus: reached in 3ms
2020/10/02 02:42:37 [info] core migrated up to: 010_210_to_211 (executed)
2020/10/02 02:42:37 [info] migrating jwt on keyspace 'kong_prod2'...
2020/10/02 02:42:37 [debug] running migration: 003_200_to_210
2020/10/02 02:42:38 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:44:38 [verbose] Cassandra schema consensus: not reached in 120095ms
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

Going to execute kong migrations finish multiple times until finish "completes" and will check the keyspace.

Second finish; again it did wait the 120 seconds, but it seems it didn't successfully reach a consensus:

2020/10/02 02:47:53 [debug] loading subsystems migrations...
2020/10/02 02:47:53 [verbose] retrieving keyspace schema state...
2020/10/02 02:47:53 [verbose] schema state retrieved
2020/10/02 02:47:53 [debug] loading subsystems migrations...
2020/10/02 02:47:53 [verbose] retrieving keyspace schema state...
2020/10/02 02:47:53 [verbose] schema state retrieved
2020/10/02 02:47:53 [debug] pending migrations to finish:
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/02 02:47:53 [info] migrating acl on keyspace 'kong_prod2'...
2020/10/02 02:47:53 [debug] running migration: 003_200_to_210
2020/10/02 02:48:00 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:50:00 [verbose] Cassandra schema consensus: not reached in 120002ms
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

Third finish:

2020/10/02 02:51:31 [debug] loading subsystems migrations...
2020/10/02 02:51:31 [verbose] retrieving keyspace schema state...
2020/10/02 02:51:32 [verbose] schema state retrieved
2020/10/02 02:51:32 [debug] loading subsystems migrations...
2020/10/02 02:51:32 [verbose] retrieving keyspace schema state...
2020/10/02 02:51:32 [verbose] schema state retrieved
2020/10/02 02:51:32 [debug] pending migrations to finish:
   acl: 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/02 02:51:32 [info] migrating acl on keyspace 'kong_prod2'...
2020/10/02 02:51:32 [debug] running migration: 004_212_to_213
2020/10/02 02:51:32 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:53:32 [verbose] Cassandra schema consensus: not reached in 120024ms
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

Fourth finish, which oddly succeeded the rest of the way???

2020/10/02 02:54:41 [debug] loading subsystems migrations...
2020/10/02 02:54:41 [verbose] retrieving keyspace schema state...
2020/10/02 02:54:41 [verbose] schema state retrieved
2020/10/02 02:54:41 [debug] loading subsystems migrations...
2020/10/02 02:54:41 [verbose] retrieving keyspace schema state...
2020/10/02 02:54:41 [verbose] schema state retrieved
2020/10/02 02:54:41 [debug] pending migrations to finish:
oauth2: 004_200_to_210, 005_210_to_211
2020/10/02 02:54:41 [info] migrating oauth2 on keyspace 'kong_prod2'...
2020/10/02 02:54:41 [debug] running migration: 004_200_to_210
2020/10/02 02:54:43 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:54:43 [verbose] Cassandra schema consensus: reached in 3ms
2020/10/02 02:54:43 [info] oauth2 migrated up to: 004_200_to_210 (executed)
2020/10/02 02:54:43 [debug] running migration: 005_210_to_211
2020/10/02 02:54:44 [verbose] waiting for Cassandra schema consensus (120000ms timeout)...
2020/10/02 02:54:44 [verbose] Cassandra schema consensus: reached in 3ms
2020/10/02 02:54:44 [info] oauth2 migrated up to: 005_210_to_211 (executed)
2020/10/02 02:54:44 [info] 2 migrations processed
2020/10/02 02:54:44 [info] 2 executed
2020/10/02 02:54:44 [debug] loading subsystems migrations...
2020/10/02 02:54:44 [verbose] retrieving keyspace schema state...
2020/10/02 02:54:44 [verbose] schema state retrieved
2020/10/02 02:54:44 [info] No pending migrations to finish

And a migrations list command after:

2020/10/02 02:57:29 [debug] loading subsystems migrations...
2020/10/02 02:57:29 [verbose] retrieving keyspace schema state...
2020/10/02 02:57:30 [verbose] schema state retrieved
2020/10/02 02:57:30 [info] Executed migrations:
         core: 000_base, 003_100_to_110, 004_110_to_120, 005_120_to_130, 006_130_to_140, 007_140_to_150, 008_150_to_200, 009_200_to_210, 010_210_to_211, 011_212_to_213
          jwt: 000_base_jwt, 002_130_to_140, 003_200_to_210
          acl: 000_base_acl, 002_130_to_140, 003_200_to_210, 004_212_to_213
rate-limiting: 000_base_rate_limiting, 003_10_to_112, 004_200_to_210
       oauth2: 000_base_oauth2, 003_130_to_140, 004_200_to_210, 005_210_to_211

And a finish command again just to ensure it says no more migrations:

2020/10/02 02:58:12 [debug] loading subsystems migrations...
2020/10/02 02:58:12 [verbose] retrieving keyspace schema state...
2020/10/02 02:58:12 [verbose] schema state retrieved
2020/10/02 02:58:13 [debug] loading subsystems migrations...
2020/10/02 02:58:13 [verbose] retrieving keyspace schema state...
2020/10/02 02:58:13 [verbose] schema state retrieved
2020/10/02 02:58:13 [info] No pending migrations to finish

Time to check the keyspace and see if we got a bunch of workspaces injected in there. It is still concerning that I have to execute multiple times with failures, but ultimately, if it actually did the right changes, I am okay running it multiple times.

Well, some excitement here! It seems I am no longer getting the duplicated workspace ids in the table!!!

[cqlsh 5.0.1 | Cassandra 3.11.1 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
*****@cqlsh> use kong_prod2;
*****@cqlsh:kong_prod2> describe tables;

ratelimiting_metrics  response_ratelimiting_metrics  oauth2_credentials
oauth2_tokens         tags                           certificates
locks                 acls                           services
snis                  cluster_events                 ca_certificates
plugins               workspaces                     oauth2_authorization_codes
targets               jwt_secrets                    routes
consumers             apis                           hmacauth_credentials
upstreams             basicauth_credentials
schema_meta           keyauth_credentials

*****@cqlsh:kong_prod2> select * from workspaces;

 id                                   | comment | config | created_at                      | meta | name
--------------------------------------+---------+--------+---------------------------------+------+---------
 c709a0a2-687f-43c3-a46a-7bd437dc5c24 |    null |   null | 2020-10-02 02:42:24.000000+0000 | null | default

(1 rows)
kongdba@cqlsh:kong_prod2>

Let's go ahead and verify other tables are using this id all over the place.

ACLs, LGTM with respect to the ws id and how it's appended:

***@cqlsh:kong_prod2> select * from acls limit 5;

 id                                   | cache_key                                                                                                              | consumer_id                          | created_at                      | group                                | tags | ws_id
--------------------------------------+------------------------------------------------------------------------------------------------------------------------+--------------------------------------+---------------------------------+--------------------------------------+------+--------------------------------------
 187d0487-176b-454c-8a24-9364eafa13e3 | acls:34fe7bbd-7470-4b02-9886-2f0445900ec6:d4d07b18-35d7-402a-a65a-e0f77950de97::::c709a0a2-687f-43c3-a46a-7bd437dc5c24 | 34fe7bbd-7470-4b02-9886-2f0445900ec6 | 2019-12-16 08:01:59.000000+0000 | d4d07b18-35d7-402a-a65a-e0f77950de97 | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24
 753c13ec-e781-4034-9126-3170c10144e9 | acls:97b8b271-da5e-4aff-afca-bfa0c1dad70f:555bd712-a309-414e-9145-1b3af9a79559::::c709a0a2-687f-43c3-a46a-7bd437dc5c24 | 97b8b271-da5e-4aff-afca-bfa0c1dad70f | 2020-06-04 19:56:42.000000+0000 | 555bd712-a309-414e-9145-1b3af9a79559 | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24
 4e9a2da9-b453-4c21-899f-3768725bfb6c | acls:aee2dab7-755d-4900-80f6-7c795ae0c51e:a56539e7-b061-48b5-a86b-646aab4548e8::::c709a0a2-687f-43c3-a46a-7bd437dc5c24 | aee2dab7-755d-4900-80f6-7c795ae0c51e | 2020-03-27 20:17:44.000000+0000 | a56539e7-b061-48b5-a86b-646aab4548e8 | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24
 ef8a963f-5cbb-4065-9c54-51e85372c50a | acls:39af1af3-4720-46a6-ad3a-13607e835e3f:d7fd7390-d872-47aa-8eba-c1e436ccd08c::::c709a0a2-687f-43c3-a46a-7bd437dc5c24 | 39af1af3-4720-46a6-ad3a-13607e835e3f | 2020-01-07 11:14:18.000000+0000 | d7fd7390-d872-47aa-8eba-c1e436ccd08c | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24
 b75330ea-8adc-4124-8406-ed9968115cc3 | acls:191c101d-1394-41c2-8e2a-520cbfb0d505:e2b94d6e-17d3-4757-9022-441a4b28e534::::c709a0a2-687f-43c3-a46a-7bd437dc5c24 | 191c101d-1394-41c2-8e2a-520cbfb0d505 | 2020-01-13 11:44:06.000000+0000 | e2b94d6e-17d3-4757-9022-441a4b28e534 | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24

(5 rows)
kongdba@cqlsh:kong_prod2>

ROUTES lgtm:

*****@cqlsh:kong_prod2> select * from routes limit 1;

 partition | id                                   | created_at                      | destinations | headers | hosts | https_redirect_status_code | methods | name | path_handling | paths                    | preserve_host | protocols         | regex_priority | service_id                           | snis | sources | strip_path | tags | updated_at                      | ws_id
-----------+--------------------------------------+---------------------------------+--------------+---------+-------+----------------------------+---------+------+---------------+--------------------------+---------------+-------------------+----------------+--------------------------------------+------+---------+------------+------+---------------------------------+--------------------------------------
    routes | 001eeede-9d8c-44fa-8737-e3f07ff9c876 | 2020-01-13 11:44:04.000000+0000 |         null |    null |  null |                        426 |    null | null |            v1 | ['/some/path/v1'] |         False | {'http', 'https'} |              0 | 8afa2fa1-29f5-4337-9b92-5d01814916de | null |    null |       True | null | 2020-01-13 11:44:04.000000+0000 | c709a0a2-687f-43c3-a46a-7bd437dc5c24

(1 rows)
****@cqlsh:kong_prod2>

Services lgtm:

****@cqlsh:kong_prod2> select * from services limit 1;

 partition | id                                   | ca_certificates | client_certificate_id | connect_timeout | created_at                      | host           | name                                                              | path                                         | port | protocol | read_timeout | retries | tags | tls_verify | tls_verify_depth | updated_at                      | write_timeout | ws_id
-----------+--------------------------------------+-----------------+-----------------------+-----------------+---------------------------------+----------------+-------------------------------------------------------------------+----------------------------------------------+------+----------+--------------+---------+------+------------+------------------+---------------------------------+---------------+--------------------------------------
  services | 005a0644-06b4-4b2d-8eca-6dec316f0113 |            null |                  null |            2000 | 2019-05-23 21:21:04.000000+0000 | service.company.com | c709a0a2-687f-43c3-a46a-7bd437dc5c24:c360.prod.service.here.v5 | /api/my/prod/url/v5.0/search |  443 |    https |         9000 |       0 | null |       null |             null | 2019-05-23 21:21:04.000000+0000 |          9000 | c709a0a2-687f-43c3-a46a-7bd437dc5c24

(1 rows)
******@cqlsh:kong_prod2>

jwt_secrets, LGTM:

****@cqlsh:kong_prod2> select * from jwt_secrets limit 1;

 id                                   | algorithm | consumer_id                          | created_at                      | key                                                                   | rsa_public_key | secret                           | tags | ws_id
--------------------------------------+-----------+--------------------------------------+---------------------------------+-----------------------------------------------------------------------+----------------+----------------------------------+------+--------------------------------------
 b8fb26c4-ec3d-4d38-be5b-f1cc0e27aa78 |     HS256 | 623021d2-dc20-4402-be44-582607a8fcb3 | 2019-06-05 05:44:10.000000+0000 | c709a0a2-687f-43c3-a46a-7bd437dc5c24:XXXXXXX |           null | XXXXXXXX | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24

(1 rows)
*****@cqlsh:kong_prod2>

oauth2_credentials, LGTM:

****@cqlsh:kong_prod2> select * from oauth2_credentials limit 1;

 id                                   | client_id                                                             | client_secret                    | client_type | consumer_id                          | created_at                      | hash_secret | name        | redirect_uris        | tags | ws_id
--------------------------------------+-----------------------------------------------------------------------+----------------------------------+-------------+--------------------------------------+---------------------------------+-------------+-------------+----------------------+------+--------------------------------------
 cf2f541c-7864-4be8-a9c8-cc1b57d50399 | c709a0a2-687f-43c3-a46a-7bd437dc5c24:XXXXXX | XXXXXX |        null | XXXXXXX | 2018-08-23 04:49:23.966000+0000 |        null | OAuth_Creds | {'http://company.com'} | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24

(1 rows)
****@cqlsh:kong_prod2>

Consumers, LGTM:

*****@cqlsh:kong_prod2> select * from consumers limit 1;

 id                                   | created_at                      | custom_id | tags | username                                  | ws_id
--------------------------------------+---------------------------------+-----------+------+-------------------------------------------+--------------------------------------
 4a7a7568-589a-42f6-a488-639dfb1ec21a | 2020-05-12 21:11:13.000000+0000 |      null | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24:username1 | c709a0a2-687f-43c3-a46a-7bd437dc5c24

(1 rows)
*****@cqlsh:kong_prod2>

oauth2_tokens, no example records there sadly. Hmm, maybe they were already garbage collected during my migration and this detective work:

*****@cqlsh:kong_prod2> select * from oauth2_tokens limit 1;

 id | access_token | authenticated_userid | created_at | credential_id | expires_in | refresh_token | scope | service_id | token_type | ws_id
----+--------------+----------------------+------------+---------------+------------+---------------+-------+------------+------------+-------

(0 rows)
*****@cqlsh:kong_prod2>

Will likely run again and check this specific table by itself after executing this whole process over again, to ensure any existing tokens get the ws_id set properly ^.

cluster_events:

****@cqlsh:kong_prod2> select * from cluster_events limit 1;

 channel | at | node_id | id | data | nbf
---------+----+---------+----+------+-----

(0 rows)
****@cqlsh:kong_prod2>

Not modified with a ws_id in the table structure, huh???? Hmm.

And lastly a plugin example, looks good to me:

*****@cqlsh:kong_prod2> select * from plugins limit 1;

 id                                   | api_id | cache_key                                                                                | config                                                                                   | consumer_id | created_at                      | enabled | name | protocols | route_id                             | run_on | service_id | tags | ws_id
--------------------------------------+--------+------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------+-------------+---------------------------------+---------+------+-----------+--------------------------------------+--------+------------+------+--------------------------------------
 638b8470-fc50-45f7-964e-2a6155f32c72 |   null | plugins:acl:fb77c07a-4f9f-408f-969c-bdf447aae0ad::::c709a0a2-687f-43c3-a46a-7bd437dc5c24 | {"deny":null,"allow":["fb77c07a-4f9f-408f-969c-bdf447aae0ad"],"hide_groups_header":true} |        null | 2019-01-28 18:24:57.000000+0000 |    True |  acl |      null | fb77c07a-4f9f-408f-969c-bdf447aae0ad |  first |       null | null | c709a0a2-687f-43c3-a46a-7bd437dc5c24

(1 rows)
*****@cqlsh:kong_prod2>

Edit:

So oauth2_tokens is where I am unsure, because I just have no record data to confirm the migration would have worked for that table, likely because the tokens expired during that time I guess. If the token data somehow was not cloned over to my temp keyspace, I'll need to check that too. Also, cluster_events not having a ws_id seems odd to me, but I guess that's how the design is for now (I would think if Kong CPs ever have segregation then the ws_id has to be integrated with cluster_events as well, with my novice speculating thoughts).

So we still don't seem to have a fix for some of the finish commands not being able to read and reach a consensus (1 failure). Could it be related to some tables with thousands of records and paging again??? Like the runtime Kong C* paging that was fixed by that one contributor. Could the migration processes be using the wrong paging calls and hitting the failures due to some variable scope problem?
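
Something like this loop is the paging pattern I have in mind, just as a sketch of the question (based on lua-cassandra's paging support; I have not verified this is how the migration code actually pages):

-- sketch only: walk a big table in pages over the same coordinator instead of
-- pulling everything back in one read
for rows, err in coordinator:iterate([[ SELECT id FROM oauth2_tokens ]]) do
  if err then
    error("paging failed: " .. tostring(err))
  end
  for _, row in ipairs(rows) do
    -- per-row fix-up (e.g. stamping ws_id) would go here
  end
end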

But the good news is it seems there are now no dup workspaces AND the tables that needed to be modified got modified, and modified correctly, eh? So it seems we are getting positive results here.

I also have 2 environments like this, so I tested the migrations on one and am leaving the other untouched on the cloned keyspace for now; if you have any further tweaks or logging you want me to add to the files, I can do that for a second test.

@jeremyjpj0916,

So it really looks like when this happens, it stays like that until you try again. So now I am thinking: should we close the connection before we check the schema consensus? The error does not happen in the migrations themselves, but in code that is run between the migrations (which is less worrying). Good to know that the multiple-ws issue seems to be fixed.

@bungle right, having to execute the finish multiple times is sub-optimal in its current state, at least for our use case. Yeah, I think honoring the set schema consensus timeout was still a good change, but ultimately I will set that way lower than 120, because it seems not to be the problem with my cluster; the wait isn't helping any. Do you have a snippet to drop into one of these files that will close the connection and re-open it for checking schema consensus, as a test to confirm your suspicions? I can use that on our second prod environment that has a cloned keyspace ready for upgrading. The perfect fix for me would be one that runs all the way through without me having to execute the finish over and over.

Then after that I will probably do this clone-keyspace-and-upgrade thing like 10-20 times in a row to feel really good that Kong will be stable on the jump from 2.0.5 to 2.1.x (whatever version this ends up being, 2.1.5?).

Also, I am sure it is important that any alterations to the migration process, while making it work better for a multi-DC 6-node C* cluster with high replication and LOCAL_QUORUM reads/writes, still need to function okay against a single node, etc. As shown in these latest 2.1.x upgrades, it can sometimes be a challenge to make something that fits the best of both worlds. Just for stats, these environments have a few thousand service/route resources, around 5k plugins, and then plenty of oauth2 tokens and a number of ACLs. Not super large DBs in terms of data size, but decent at this point.

But hey at least the changes thus far made the finish command idempotent (@rsbrisci fancy big word, not mine). So the ws id alterations look to be correct!

@jeremyjpj0916, are there any errors logged in Cassandra when this happens? Or close to when it happens?

@bungle will check on this during the work day and get back to you on whether I see anything interesting in the C* logs. Probably should have been doing exactly that when performing these upgrades, but it slipped my mind.

Just gonna break down the logs I can see on the node I believe we would have hit for migrations.

First up:

From server05503, looking at the 10/1 date.

/var/log/cassandra/cassandra.log:

The only error on 10/1 came from the keyspace cloning error, from my shell script copying oauth2_credentials over (which I end up doing manually using COPY table commands, as described earlier):

ERROR [SSTableBatchOpen:1] 2020-10-01 22:29:35,338 CassandraDaemon.java:235 - Exception in thread Thread[SSTableBatchOpen:1,5,main]
java.lang.RuntimeException: Unknown column redirect_uri during deserialization
        at org.apache.cassandra.db.SerializationHeader$Component.toHeader(SerializationHeader.java:338) ~[apache-cassandra-3.11.8.jar:3.11.8]
        at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:522) ~[apache-cassandra-3.11.8.jar:3.11.8]
        at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:385) ~[apache-cassandra-3.11.8.jar:3.11.8]
        at org.apache.cassandra.io.sstable.format.SSTableReader$3.run(SSTableReader.java:570) ~[apache-cassandra-3.11.8.jar:3.11.8]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_232]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_232]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_232]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.8.jar:3.11.8]
        at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_232]

Next up:

/var/log/cassandra/system.log

ERROR [Reference-Reaper:1] 2020-10-01 20:36:03,866 Ref.java:224 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@69a76445) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@1838738857:/data/cassandra/data/kong_prod2/oauth2_credentials-feb3d290044e11ebafdba3cdf7a167d7/mc-1-big was not released before the reference was garbage collected

ERROR [SSTableBatchOpen:1] 2020-10-01 22:29:35,338 CassandraDaemon.java:235 - Exception in thread Thread[SSTableBatchOpen:1,5,main]
java.lang.RuntimeException: Unknown column redirect_uri during deserialization
        at org.apache.cassandra.db.SerializationHeader$Component.toHeader(SerializationHeader.java:338) ~[apache-cassandra-3.11.8.jar:3.11.8]
        at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:522) ~[apache-cassandra-3.11.8.jar:3.11.8]
        at org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:385) ~[apache-cassandra-3.11.8.jar:3.11.8]
        at org.apache.cassandra.io.sstable.format.SSTableReader$3.run(SSTableReader.java:570) ~[apache-cassandra-3.11.8.jar:3.11.8]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_232]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_232]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_232]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.8.jar:3.11.8]
        at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_232]

But both of those logs were prior to the kong migrations finish command that would have run. The second error statement would have been during the keyspace copy, similar to the one found in cassandra.log, and the first error was 2 hours before that and unrelated to the migrations effort. Unless the migrations hit a different node and I need to check each one of the nodes independently; it doesn't seem like I can tell from the debug prints which node was used to handle the migrations work (unless I removed those lines up above from the logs thinking they would not be relevant prior to posting).

Gonna check the other nodes too.

server05504 had this in /var/log/cassandra/system.log:

ERROR [MessagingService-Incoming-/10.204.90.234] 2020-10-01 22:30:41,625 CassandraDaemon.java:228 - Exception in thread Thread[MessagingService-Incoming-/10.204.90.234,5,main]
java.lang.RuntimeException: Unknown column kind during deserialization
        at org.apache.cassandra.db.Columns$Serializer.deserialize(Columns.java:452) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.SerializationHeader$Serializer.deserializeForMessaging(SerializationHeader.java:412) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.deserializeHeader(UnfilteredRowIteratorSerializer.java:195) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize30(PartitionUpdate.java:835) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize(PartitionUpdate.java:823) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:425) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:434) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.service.MigrationManager$MigrationsSerializer.deserialize(MigrationManager.java:669) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.service.MigrationManager$MigrationsSerializer.deserialize(MigrationManager.java:652) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:192) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:180) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94) ~[apache-cassandra-3.11.2.jar:3.11.2]
ERROR [MessagingService-Incoming-/10.204.90.234] 2020-10-01 22:30:42,569 CassandraDaemon.java:228 - Exception in thread Thread[MessagingService-Incoming-/10.204.90.234,5,main]
java.lang.RuntimeException: Unknown column kind during deserialization
        at org.apache.cassandra.db.Columns$Serializer.deserialize(Columns.java:452) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.SerializationHeader$Serializer.deserializeForMessaging(SerializationHeader.java:412) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.deserializeHeader(UnfilteredRowIteratorSerializer.java:195) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize30(PartitionUpdate.java:835) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize(PartitionUpdate.java:823) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:425) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:434) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.service.MigrationManager$MigrationsSerializer.deserialize(MigrationManager.java:669) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.service.MigrationManager$MigrationsSerializer.deserialize(MigrationManager.java:652) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:192) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:180) ~[apache-cassandra-3.11.2.jar:3.11.2]
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94) ~[apache-cassandra-3.11.2.jar:3.11.2]

But again, I believe that's not even during the migration executions, but likely during keyspace clone related stuff, because the timestamp aligns with the redirect_uri error from the other server above (granted, I don't remember the "kind" error showing up in output prior, only the redirect_uri one associated with oauth2 creds on the keyspace clone and data copy over)...

Checking the final node in this DC for local-quorum oddities. EDIT - nope, zero extra logs there...

Some extra coordinator settings in my cassandra.yaml from a given node:

# How long the coordinator should wait for read operations to complete
read_request_timeout_in_ms: 5000
# How long the coordinator should wait for seq or index scans to complete
range_request_timeout_in_ms: 10000
# How long the coordinator should wait for writes to complete
write_request_timeout_in_ms: 5000
# How long the coordinator should wait for counter writes to complete
counter_write_request_timeout_in_ms: 5000
# How long a coordinator should continue to retry a CAS operation
# that contends with other proposals for the same row
cas_contention_timeout_in_ms: 1000
# How long the coordinator should wait for truncates to complete
# (This can be much longer, because unless auto_snapshot is disabled
# we need to flush first so we can snapshot before removing the data.)
truncate_request_timeout_in_ms: 60000
# The default timeout for other, miscellaneous operations
request_timeout_in_ms: 10000

@bungle so the fix that prevents the duplicate workspaces is going into 2.2.0 now and not a 2.1.5 patch, correct? In that case I should probably wait and upgrade my prod nodes from 2.0.5 straight to 2.2.0 to avoid the ws migration errors I have been seeing (even if we can't get to the bottom of why finish does not execute smoothly for us and we have to run it multiple times).

@jeremyjpj0916 the removal of the migration helpers and the passing of the connection were considered breaking, so they could not go into 2.1.5 anymore. In my opinion that is overly cautious, but I guess it's better to be safe than sorry. Overly cautious in that it could only break custom plugins that use the undocumented helpers (from which we have already removed functions in the past without notice... I just saw that they were not used anywhere, so I removed them and instead passed the connection/coordinator to teardown). Signature before: (connector, undocumented_helpers); signature now: (connector, connection/coordinator).
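To illustrate the shape of that change, here is a minimal sketch (not the actual Kong migration code; the CQL and query are placeholders) of a Cassandra migration under the new teardown signature:

-- Sketch only: a migration module where teardown receives the
-- connection/coordinator from the framework instead of the old
-- undocumented helpers table.
return {
  cassandra = {
    up = [[
      -- placeholder DDL executed by `kong migrations up`
    ]],
    teardown = function(connector, coordinator)
      -- reuse the coordinator the framework already holds rather than
      -- grabbing a fresh connection through the connector
      local _, err = coordinator:execute(
        "SELECT id FROM workspaces WHERE name='default'")
      if err then
        return nil, err
      end
      return true
    end,
  },
}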

Perhaps instead of (connector, connection/coordinator) we should just pass a migration_executor. But that is a far bigger change and even more breaking.

These commits are the problematic ones for a patch release:
https://github.com/Kong/kong/pull/6411/commits/998d13efb14b11290abd411397ed051d2208e26b
https://github.com/Kong/kong/pull/6411/commits/ed9f373d94ea5ce98d250319bd390093812368ed

We could perhaps try to backport or modify the PR somehow so the workspace-prevention code makes it into a patch release. The wait for schema consensus is still something I am not sure is fixed by e.g. passing the coordinator. It somehow feels like the connection becomes unstable after running some migrations and then cannot be used anymore to call wait for schema consensus. Perhaps we should not use the connector there either, but just call the coordinator. But then, what if that coordinator becomes unstable too? I have tried to track down whether any of our code calls :setkeepalive or :close on the connection that is grabbed before we call wait_for_schema_consensus, but I could not find anything.

Restarting finish surely resets all the internal structures and grabs a new connection... but I still do not understand why the connection goes bad (can it become bad because of a network issue or something, or after running too many migrations?).

@bungle

Gotcha. I suppose I can also just roll with 2.1.5 plus the exact files you had me test from your pending PR, roll that up through prod, and then move on to 2.2.x at a later time; otherwise the jump keeps getting bigger and I would need to vet 2.2.x for problems, and I think 2.0.5 to 2.1.5 is a big enough jump as-is for prod.

Yeah, I was hoping the C* logs might reveal a problem, but so far no dice. If the Kong team gets any ideas for a debug print line, either in the Kong layer or in the underlying C* client lib, that would be insightful for an RCA, let me know and I will give that a test run too.

@jeremyjpj0916, the PR I did was not mergeable anywhere ;-(. So back to the drawing board; here is a fix that can go to 2.1.5 and to 2.2.0: https://github.com/Kong/kong/pull/6439

Were these ultimately the same changes, or would a retest be helpful to you here?

@jeremyjpj0916,

If you have time to try out #6439 and give us feedback, that would be great. I do expect the multiple-default-workspace issue to be solved (there are two fixes for that: 1. try to create it only once, 2. use proper consistency when creating and reading). The other issue is trickier. I found a couple of new places, but ultimately I don't think I found anything that could solve it. I didn't add a reconnect because it feels like a workaround that just hides a possible issue that needs to be fixed. The biggest change is that we never call connect_migrations in migrations anymore (only get_stored_connection) and let the framework handle the connections. As always, thanks for cooperating on this; all feedback is welcome. I had to rework the PR a bit, so it is also good to hear if something went wrong with the re-implementation (I tried to be careful).
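In rough terms, the consistency part of that fix looks something like this (a sketch only, not the PR code; it assumes the coordinator object the migrations already hold, and the CQL is simplified):

local cassandra = require "cassandra"

-- sketch: create the default workspace only once, with explicit consistency
local function ensure_default_ws(coordinator)
  -- read with SERIAL so we observe a row that a QUORUM write just made
  local rows, err = coordinator:execute(
    "SELECT id FROM workspaces WHERE name='default'", nil, {
      consistency = cassandra.consistencies.serial,
    })
  if err then
    return nil, err
  end

  if rows and rows[1] and rows[1].id then
    return rows[1].id
  end

  -- nothing found: insert the row once, written with QUORUM
  local _, werr = coordinator:execute(
    "INSERT INTO workspaces(id, name, created_at) " ..
    "VALUES (uuid(), 'default', toTimestamp(now()))", nil, {
      consistency = cassandra.consistencies.quorum,
    })
  if werr then
    return nil, werr
  end

  return true
end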

Still, it feels like the connection gets corrupted at some point. But it is strange where it does so (migration ok, recording the migration ok, but waiting for schema consensus not ok). Perhaps it is the Cassandra server that closes the connection. Something along the lines of this: https://stackoverflow.com/questions/39343475/how-does-cassandra-server-identifies-crashed-client-connections. Could it be that some migrations make Cassandra think that the client has crashed or something?

One thing to note: on your last try the jwt migration failed once and the acl plugin migration failed twice. Some notes about the ACL migration:

  1. acl has a composite cache key
  2. acl has a foreign key
  3. acl has no uniques (uniques = {})

Everything else in your migrations went ok (on that last try).

The only distinctive thing I think those migrations do is run UPDATEs while iterating the table. So could it be that iteration goes wrong once you have enough entities and you modify that table a lot at the same time? @jeremyjpj0916, I'll create you a separate branch that splits the iterate-and-update into two loops, so let's see if that makes a difference.

@jeremyjpj0916,

this branch has one additional commit:
https://github.com/Kong/kong/commits/fix/migrations-master-iteration-separated

That moves the updates out of the iteration. It does so only for those migrations that you saw as problematic; there are older migrations with a similar pattern of updates inside the iterator. If you could do a test run with both #6439 and fix/migrations-master-iteration-separated, that would help. All results are good, as they close off possibilities.
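Roughly, the difference between the two patterns looks like this (just a sketch, not the real migration code; it assumes a coordinator with :iterate and :execute like these migrations use, and the table/column names are placeholders):

-- Pattern A: run UPDATEs while still iterating the same table
local function update_while_iterating(coordinator, default_ws_id)
  for rows, err in coordinator:iterate("SELECT id FROM acls") do
    if err then
      return nil, err
    end
    for _, row in ipairs(rows) do
      local _, uerr = coordinator:execute(
        "UPDATE acls SET ws_id = " .. default_ws_id ..
        " WHERE id = " .. row.id)
      if uerr then
        return nil, uerr
      end
    end
  end
  return true
end

-- Pattern B: collect the ids first, then run the UPDATEs in a second loop,
-- so no writes land on the table while it is still being paged through
local function update_after_iterating(coordinator, default_ws_id)
  local ids = {}
  for rows, err in coordinator:iterate("SELECT id FROM acls") do
    if err then
      return nil, err
    end
    for _, row in ipairs(rows) do
      ids[#ids + 1] = row.id
    end
  end
  for _, id in ipairs(ids) do
    local _, uerr = coordinator:execute(
      "UPDATE acls SET ws_id = " .. default_ws_id ..
      " WHERE id = " .. id)
    if uerr then
      return nil, uerr
    end
  end
  return true
end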

Gotcha, will give both of those a go and let you know what I see. I am revamping our keyspace migration scripting to be able to pull down that keyspace and its data and clone it onto a different env of infra (I've been cloning on prod infra, and we are in peak season, so it seems wrong to keep doing that haha; going to use Ansible and push and pull some files around for my v2 iteration). Should have some info later this week or early next week from testing.

Thanks again for your diligence @bungle .

Rewrote my copy-keyspace process as an Ansible playbook that can send a keyspace from a source C* cluster node to a totally different C* cluster node (assuming the keyspace does not already exist on the destination; no checks or anything fancy here):

- hosts: src_cassandra_node
  become: true
  tasks:
  - name: Make data directory for the clone files
    file:
      path: /data/cassandra/clone
      state: directory
  - name: Move into the C* folder we intend to put all our files and Export keyspace schema
    command: bash -c "cd /data/cassandra/clone && cqlsh --ssl -e 'DESCRIBE KEYSPACE {{ KEYSPACE_SOURCE_NAME }}' > {{ KEYSPACE_SOURCE_NAME }}.txt"
  - name: Alter keyspace schema name for later import in other cluster
    command: bash -c "cd /data/cassandra/clone && sed -i 's/{{ KEYSPACE_SOURCE_NAME }}/{{ KEYSPACE_DESTINATION_NAME }}/g' {{ KEYSPACE_SOURCE_NAME }}.txt"
  - name: Export table data into logical files for import later
    command: bash -c "cd /data/cassandra/clone && cqlsh --ssl -e \"use {{ KEYSPACE_SOURCE_NAME }}; describe tables;\" |  tr \" \" \"\\n\" | sort | sed -r '/^\s*$/d' | xargs -I % cqlsh --ssl -e \"COPY {{ KEYSPACE_SOURCE_NAME }}.% TO '%.dat'\""
    register: export_table_data
  - debug: msg="{{ export_table_data.stdout_lines }}"
  - name: Make sure local directory on the controller is empty
    local_action: file path=/data/cassandra/clone state=absent
  - name: Make temp local directory for cassandra files on controller
    local_action: command mkdir -p /data/cassandra/clone
  - shell: (cd /data/cassandra/clone; find . -maxdepth 1 -type f) | cut -d'/' -f2
    register: files_to_copy
  - fetch: src=/data/cassandra/clone/{{ item }} dest=/data/cassandra/clone/ flat=yes
    with_items: "{{ files_to_copy.stdout_lines }}"
  - name: Remove files and directory from remote host
    command: rm -rf /data/cassandra/clone

- hosts: dest_cassandra_node
  become: true
  tasks:
  - name: Remove files and directory from remote host
    command: rm -rf /data/cassandra/clone
  - name: Make data directory for the clone files
    file:
      path: /data/cassandra/clone
      state: directory
  - name: Move Cassandra files to destination C* node
    copy:
      src: /data/cassandra/clone/
      dest: /data/cassandra/clone
      owner: cassandra
      group: cassandra
      mode: 0755
  - name: Make sure local ansible controller directory is emptied afterward
    local_action: file path=/data/cassandra/clone state=absent
  - name: Build new keyspace and schema
    command: bash -c "cd /data/cassandra/clone && cqlsh --ssl -f '{{ KEYSPACE_SOURCE_NAME }}.txt'"
  - name: Import data to new keyspace
    command: bash -c "cd /data/cassandra/clone && cqlsh --ssl -e \"use {{ KEYSPACE_DESTINATION_NAME }}; describe tables;\" |  tr \" \" \"\\n\" | sort | sed -r '/^\s*$/d' | xargs -I % cqlsh --ssl -e \"COPY {{ KEYSPACE_DESTINATION_NAME }}.% FROM '%.dat'\""
    register: import_table_data
  - debug: msg="{{ import_table_data.stdout_lines }}"  
  - name: Repair newly made clone keyspace for good measure
    command: bash -c "nodetool repair -full {{ KEYSPACE_DESTINATION_NAME }}"
  - name: Clean out temp keyspace files and data on remote C* destination
    command: rm -rf /data/cassandra/clone

I will be resuming testing of the new bungle PR shortly, tomorrow and into the weekend/Monday!

Much faster execution than my first-pass shell script hack too.

First go at #6439 PR against my cloned prod 2.0.5 keyspace and its data. Made sure I had some oauth2 tokens present and valid in the db this time too.

kong migrations list --vv

2020/10/13 05:12:30 [debug] loading subsystems migrations...
2020/10/13 05:12:30 [verbose] retrieving keyspace schema state...
2020/10/13 05:12:30 [verbose] schema state retrieved
2020/10/13 05:12:30 [info] Executed migrations:
         core: 000_base, 003_100_to_110, 004_110_to_120, 005_120_to_130, 006_130_to_140, 007_140_to_150, 008_150_to_200
          jwt: 000_base_jwt, 002_130_to_140
          acl: 000_base_acl, 002_130_to_140
rate-limiting: 000_base_rate_limiting, 003_10_to_112
       oauth2: 000_base_oauth2, 003_130_to_140
2020/10/13 05:12:30 [info]
New migrations available:
         core: 009_200_to_210, 010_210_to_211, 011_212_to_213
          jwt: 003_200_to_210
          acl: 003_200_to_210, 004_212_to_213
rate-limiting: 004_200_to_210
       oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 05:12:30 [info]
Run 'kong migrations up' to proceed

kong migrations up --db-timeout 20 --vv

2020/10/13 05:13:55 [debug] loading subsystems migrations...
2020/10/13 05:13:55 [verbose] retrieving keyspace schema state...
2020/10/13 05:13:55 [verbose] schema state retrieved
2020/10/13 05:13:56 [debug] loading subsystems migrations...
2020/10/13 05:13:56 [verbose] retrieving keyspace schema state...
2020/10/13 05:13:56 [verbose] schema state retrieved
2020/10/13 05:13:56 [debug] migrations to run:
         core: 009_200_to_210, 010_210_to_211, 011_212_to_213
          jwt: 003_200_to_210
          acl: 003_200_to_210, 004_212_to_213
rate-limiting: 004_200_to_210
       oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 05:13:56 [info] migrating core on keyspace 'kong_prod2'...
2020/10/13 05:13:56 [debug] running migration: 009_200_to_210
2020/10/13 05:14:02 [info] core migrated up to: 009_200_to_210 (pending)
2020/10/13 05:14:02 [debug] running migration: 010_210_to_211
2020/10/13 05:14:02 [info] core migrated up to: 010_210_to_211 (pending)
2020/10/13 05:14:02 [debug] running migration: 011_212_to_213
2020/10/13 05:14:02 [info] core migrated up to: 011_212_to_213 (executed)
2020/10/13 05:14:02 [info] migrating jwt on keyspace 'kong_prod2'...
2020/10/13 05:14:02 [debug] running migration: 003_200_to_210
2020/10/13 05:14:03 [info] jwt migrated up to: 003_200_to_210 (pending)
2020/10/13 05:14:03 [info] migrating acl on keyspace 'kong_prod2'...
2020/10/13 05:14:03 [debug] running migration: 003_200_to_210
2020/10/13 05:14:03 [info] acl migrated up to: 003_200_to_210 (pending)
2020/10/13 05:14:03 [debug] running migration: 004_212_to_213
2020/10/13 05:14:03 [info] acl migrated up to: 004_212_to_213 (pending)
2020/10/13 05:14:03 [info] migrating rate-limiting on keyspace 'kong_prod2'...
2020/10/13 05:14:03 [debug] running migration: 004_200_to_210
2020/10/13 05:14:03 [info] rate-limiting migrated up to: 004_200_to_210 (executed)
2020/10/13 05:14:03 [info] migrating oauth2 on keyspace 'kong_prod2'...
2020/10/13 05:14:03 [debug] running migration: 004_200_to_210
2020/10/13 05:14:05 [info] oauth2 migrated up to: 004_200_to_210 (pending)
2020/10/13 05:14:05 [debug] running migration: 005_210_to_211
2020/10/13 05:14:05 [info] oauth2 migrated up to: 005_210_to_211 (pending)
2020/10/13 05:14:05 [verbose] waiting for Cassandra schema consensus (20000ms timeout)...
2020/10/13 05:14:07 [verbose] Cassandra schema consensus: reached in 2537ms
2020/10/13 05:14:07 [info] 9 migrations processed
2020/10/13 05:14:07 [info] 2 executed
2020/10/13 05:14:07 [info] 7 pending
2020/10/13 05:14:07 [debug] loading subsystems migrations...
2020/10/13 05:14:07 [verbose] retrieving keyspace schema state...
2020/10/13 05:14:08 [verbose] schema state retrieved
2020/10/13 05:14:08 [info]
Database has pending migrations; run 'kong migrations finish' when ready
/ $

First run of finish:
kong migrations finish --db-timeout 20 --vv

2020/10/13 05:15:11 [debug] loading subsystems migrations...
2020/10/13 05:15:11 [verbose] retrieving keyspace schema state...
2020/10/13 05:15:11 [verbose] schema state retrieved
2020/10/13 05:15:11 [debug] loading subsystems migrations...
2020/10/13 05:15:11 [verbose] retrieving keyspace schema state...
2020/10/13 05:15:11 [verbose] schema state retrieved
2020/10/13 05:15:11 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 05:15:12 [info] migrating core on keyspace 'kong_prod2'...
2020/10/13 05:15:12 [debug] running migration: 009_200_to_210
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to run migration '009_200_to_210' teardown: unable to find a default workspace
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

Second run of finish:
kong migrations finish --db-timeout 20 --vv

2020/10/13 05:15:53 [debug] loading subsystems migrations...
2020/10/13 05:15:53 [verbose] retrieving keyspace schema state...
2020/10/13 05:15:53 [verbose] schema state retrieved
2020/10/13 05:15:53 [debug] loading subsystems migrations...
2020/10/13 05:15:53 [verbose] retrieving keyspace schema state...
2020/10/13 05:15:54 [verbose] schema state retrieved
2020/10/13 05:15:54 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 05:15:54 [info] migrating core on keyspace 'kong_prod2'...
2020/10/13 05:15:54 [debug] running migration: 009_200_to_210
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to run migration '009_200_to_210' teardown: unable to find a default workspace
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

Third run of finish:
kong migrations finish --db-timeout 20 --vv

2020/10/13 05:16:29 [debug] loading subsystems migrations...
2020/10/13 05:16:29 [verbose] retrieving keyspace schema state...
2020/10/13 05:16:30 [verbose] schema state retrieved
2020/10/13 05:16:30 [debug] loading subsystems migrations...
2020/10/13 05:16:30 [verbose] retrieving keyspace schema state...
2020/10/13 05:16:30 [verbose] schema state retrieved
2020/10/13 05:16:30 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 05:16:31 [info] migrating core on keyspace 'kong_prod2'...
2020/10/13 05:16:31 [debug] running migration: 009_200_to_210
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to run migration '009_200_to_210' teardown: unable to find a default workspace
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

Fourth run of finish:
kong migrations finish --db-timeout 20 --vv

2020/10/13 05:17:11 [debug] loading subsystems migrations...
2020/10/13 05:17:11 [verbose] retrieving keyspace schema state...
2020/10/13 05:17:12 [verbose] schema state retrieved
2020/10/13 05:17:12 [debug] loading subsystems migrations...
2020/10/13 05:17:12 [verbose] retrieving keyspace schema state...
2020/10/13 05:17:12 [verbose] schema state retrieved
2020/10/13 05:17:12 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 05:17:12 [info] migrating core on keyspace 'kong_prod2'...
2020/10/13 05:17:12 [debug] running migration: 009_200_to_210
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to run migration '009_200_to_210' teardown: unable to find a default workspace
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

Umm @bungle, it's now erroring every time on trying to find a default workspace. Well, let's check the keyspace:

# cqlsh --ssl
Connected to Stargate at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.8 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
*****@cqlsh> use kong_prod2;
*****@cqlsh:kong_prod2> select * from workspaces;

 id                                   | comment | config | created_at                      | meta | name
--------------------------------------+---------+--------+---------------------------------+------+---------
 926762fb-7c62-4ddb-8535-5577af0cefac |    null |   null | 2020-10-13 05:15:11.000000+0000 | null | default

(1 rows)
*****@cqlsh:kong_prod2>

The default workspace was injected at 05:15:11 by the first migrations finish, I imagine; no idea why the rest of the processing and the subsequent finish commands are not seeing it.

Did a diff on the 009_200_to_210 core migration file just to make sure your patched file was indeed in there and identical, and it was. Seems whatever logic was dropped in this smaller PR made it less effective or introduced a new bug that prevents finding the injected workspace.

I will hold off on the other pending branch you have until you tell me to test it, in case you feel #6439 needs more attention first. If you want me to go ahead with your other experimental branch I can do that as well; let me know how you prefer to proceed.

Edit - Looks like we land here: https://github.com/Kong/kong/blob/0d64a67d34d40e0facd66b015c46f8a6ba994f71/kong/db/migrations/core/009_200_to_210.lua#L251 . I wonder why the preceding local default_ws, err = operations.cassandra_ensure_default_ws(coordinator) isn't returning an err; it just ends up with default_ws as nil, I suppose... even though I am able to query and see the row just fine in the C* db itself on all nodes in the datacenter it could possibly be using.

@jeremyjpj0916, the query is here:
https://github.com/Kong/kong/blob/0d64a67d34d40e0facd66b015c46f8a6ba994f71/kong/db/migrations/operations/200_to_210.lua#L24

I like that we got a new problem, as that might help us find the real issue. It seems like the query did not return rows. Also, it seems like the query does not return rows even when you start the migration again.

Is this query wrong:

SELECT id FROM workspaces WHERE name='default'

Can you execute it in cqlsh?

@bungle, yep, works like a charm, so I guess the question is why the CQL statement does not work in the context of Kong for me, even across multiple runs of migrations finish. And what is different about your first big PR that got scratched, which did make the migrations seemingly work and get past this step, vs what we have now in PR #6439 that gets stuck on it:

***@cqlsh:kong_prod2> select * from workspaces;

 id                                   | comment | config | created_at                      | meta | name
--------------------------------------+---------+--------+---------------------------------+------+---------
 926762fb-7c62-4ddb-8535-5577af0cefac |    null |   null | 2020-10-13 05:15:11.000000+0000 | null | default

(1 rows)
***@cqlsh:kong_prod2> SELECT id FROM workspaces WHERE name='default';

 id
--------------------------------------
 926762fb-7c62-4ddb-8535-5577af0cefac

(1 rows)

@jeremyjpj0916, what if you do:

CONSISTENCY serial; SELECT id FROM workspaces WHERE name='default';

or

CONSISTENCY quorum; SELECT id FROM workspaces WHERE name='default';

***@cqlsh:kong_prod2> CONSISTENCY serial; SELECT id FROM workspaces WHERE name='default';
Consistency level set to SERIAL.

 id
--------------------------------------
 926762fb-7c62-4ddb-8535-5577af0cefac
***@cqlsh:kong_prod2> CONSISTENCY quorum; SELECT id FROM workspaces WHERE name='default';
Consistency level set to QUORUM.

 id
--------------------------------------
 926762fb-7c62-4ddb-8535-5577af0cefac

Also tried migrations finish from the newer Kong container again today against that keyspace to see if anything changed; including the full output here, tl;dr same error:

/ $ kong version --vv
2020/10/13 18:27:33 [verbose] Kong: 2.1.4
2020/10/13 18:27:33 [debug] ngx_lua: 10015
2020/10/13 18:27:33 [debug] nginx: 1015008
2020/10/13 18:27:33 [debug] Lua: LuaJIT 2.1.0-beta3
2.1.4
/ $ kong migrations finish --vv
2020/10/13 18:27:39 [verbose] Kong: 2.1.4
2020/10/13 18:27:39 [debug] ngx_lua: 10015
2020/10/13 18:27:39 [debug] nginx: 1015008
2020/10/13 18:27:39 [debug] Lua: LuaJIT 2.1.0-beta3
2020/10/13 18:27:39 [verbose] no config file found at /etc/kong/kong.conf
2020/10/13 18:27:39 [verbose] no config file found at /etc/kong.conf
2020/10/13 18:27:39 [verbose] no config file, skip loading
2020/10/13 18:27:39 [debug] reading environment variables
2020/10/13 18:27:39 [debug] KONG_PLUGINS ENV found with "kong-siteminder-auth,kong-kafka-log,stargate-waf-error-log,mtls,stargate-oidc-token-revoke,kong-tx-debugger,kong-plugin-oauth,zipkin,kong-error-log,kong-oidc-implicit-token,kong-response-size-limiting,request-transformer,kong-service-virtualization,kong-cluster-drain,kong-upstream-jwt,kong-splunk-log,kong-spec-expose,kong-path-based-routing,kong-oidc-multi-idp,correlation-id,oauth2,statsd,jwt,rate-limiting,acl,request-size-limiting,request-termination,cors"
2020/10/13 18:27:39 [debug] KONG_ADMIN_LISTEN ENV found with "0.0.0.0:8001 deferred reuseport"
2020/10/13 18:27:39 [debug] KONG_PROXY_ACCESS_LOG ENV found with "off"
2020/10/13 18:27:39 [debug] KONG_CASSANDRA_USERNAME ENV found with "kongdba"
2020/10/13 18:27:39 [debug] KONG_CASSANDRA_PASSWORD ENV found with "******"
2020/10/13 18:27:39 [debug] KONG_PROXY_LISTEN ENV found with "0.0.0.0:8000, 0.0.0.0:8443 ssl http2 deferred reuseport"
2020/10/13 18:27:39 [debug] KONG_DNS_NO_SYNC ENV found with "off"
2020/10/13 18:27:39 [debug] KONG_DB_UPDATE_PROPAGATION ENV found with "5"
2020/10/13 18:27:39 [debug] KONG_CASSANDRA_PORT ENV found with "9042"
2020/10/13 18:27:39 [debug] KONG_HEADERS ENV found with "latency_tokens"
2020/10/13 18:27:39 [debug] KONG_DNS_STALE_TTL ENV found with "4"
2020/10/13 18:27:39 [debug] KONG_WAF_DEBUG_LEVEL ENV found with "0"
2020/10/13 18:27:39 [debug] KONG_WAF_PARANOIA_LEVEL ENV found with "1"
2020/10/13 18:27:39 [debug] KONG_CASSANDRA_REFRESH_FREQUENCY ENV found with "0"
2020/10/13 18:27:39 [debug] KONG_CASSANDRA_CONTACT_POINTS ENV found with "server00466,server00468,server00469,server00467,server00470,server00471"
2020/10/13 18:27:39 [debug] KONG_DB_CACHE_WARMUP_ENTITIES ENV found with "services,plugins,consumers"
2020/10/13 18:27:39 [debug] KONG_NGINX_HTTP_SSL_PROTOCOLS ENV found with "TLSv1.2 TLSv1.3"
2020/10/13 18:27:39 [debug] KONG_CASSANDRA_LOCAL_DATACENTER ENV found with "CTC"
2020/10/13 18:27:39 [debug] KONG_UPSTREAM_KEEPALIVE_IDLE_TIMEOUT ENV found with "30"
2020/10/13 18:27:39 [debug] KONG_DB_CACHE_TTL ENV found with "0"
2020/10/13 18:27:39 [debug] KONG_PG_SSL ENV found with "off"
2020/10/13 18:27:39 [debug] KONG_CASSANDRA_SCHEMA_CONSENSUS_TIMEOUT ENV found with "30000"
2020/10/13 18:27:39 [debug] KONG_LOG_LEVEL ENV found with "notice"
2020/10/13 18:27:39 [debug] KONG_CASSANDRA_TIMEOUT ENV found with "5000"
2020/10/13 18:27:39 [debug] KONG_NGINX_MAIN_WORKER_PROCESSES ENV found with "6"
2020/10/13 18:27:39 [debug] KONG_CASSANDRA_KEYSPACE ENV found with "kong_prod2"
2020/10/13 18:27:39 [debug] KONG_WAF ENV found with "off"
2020/10/13 18:27:39 [debug] KONG_ERROR_DEFAULT_TYPE ENV found with "text/plain"
2020/10/13 18:27:39 [debug] KONG_UPSTREAM_KEEPALIVE_POOL_SIZE ENV found with "400"
2020/10/13 18:27:39 [debug] KONG_WORKER_CONSISTENCY ENV found with "eventual"
2020/10/13 18:27:39 [debug] KONG_CLIENT_SSL ENV found with "off"
2020/10/13 18:27:39 [debug] KONG_TRUSTED_IPS ENV found with "0.0.0.0/0,::/0"
2020/10/13 18:27:39 [debug] KONG_SSL_CERT_KEY ENV found with "/usr/local/kong/ssl/kongprivatekey.key"
2020/10/13 18:27:39 [debug] KONG_MEM_CACHE_SIZE ENV found with "1024m"
2020/10/13 18:27:39 [debug] KONG_NGINX_PROXY_REAL_IP_HEADER ENV found with "X-Forwarded-For"
2020/10/13 18:27:39 [debug] KONG_DB_UPDATE_FREQUENCY ENV found with "5"
2020/10/13 18:27:39 [debug] KONG_DNS_ORDER ENV found with "LAST,SRV,A,CNAME"
2020/10/13 18:27:39 [debug] KONG_DNS_ERROR_TTL ENV found with "1"
2020/10/13 18:27:39 [debug] KONG_DATABASE ENV found with "cassandra"
2020/10/13 18:27:39 [debug] KONG_CASSANDRA_DATA_CENTERS ENV found with "DC1:3,DC2:3"
2020/10/13 18:27:39 [debug] KONG_WORKER_STATE_UPDATE_FREQUENCY ENV found with "5"
2020/10/13 18:27:39 [debug] KONG_LUA_SSL_VERIFY_DEPTH ENV found with "3"
2020/10/13 18:27:39 [debug] KONG_LUA_SOCKET_POOL_SIZE ENV found with "30"
2020/10/13 18:27:39 [debug] KONG_UPSTREAM_KEEPALIVE_MAX_REQUESTS ENV found with "50000"
2020/10/13 18:27:39 [debug] KONG_CASSANDRA_CONSISTENCY ENV found with "LOCAL_QUORUM"
2020/10/13 18:27:39 [debug] KONG_CLIENT_MAX_BODY_SIZE ENV found with "50m"
2020/10/13 18:27:39 [debug] KONG_ADMIN_ERROR_LOG ENV found with "/dev/stderr"
2020/10/13 18:27:39 [debug] KONG_DNS_NOT_FOUND_TTL ENV found with "30"
2020/10/13 18:27:39 [debug] KONG_PROXY_ERROR_LOG ENV found with "/dev/stderr"
2020/10/13 18:27:39 [debug] KONG_CASSANDRA_REPL_STRATEGY ENV found with "NetworkTopologyStrategy"
2020/10/13 18:27:39 [debug] KONG_CASSANDRA_SSL_VERIFY ENV found with "on"
2020/10/13 18:27:39 [debug] KONG_ADMIN_ACCESS_LOG ENV found with "off"
2020/10/13 18:27:39 [debug] KONG_DNS_HOSTSFILE ENV found with "/etc/hosts"
2020/10/13 18:27:39 [debug] KONG_WAF_REQUEST_FILE_SIZE_LIMIT ENV found with "50000000"
2020/10/13 18:27:39 [debug] KONG_SSL_CERT ENV found with "/usr/local/kong/ssl/kongcert.crt"
2020/10/13 18:27:39 [debug] KONG_NGINX_PROXY_REAL_IP_RECURSIVE ENV found with "on"
2020/10/13 18:27:39 [debug] KONG_SSL_CIPHER_SUITE ENV found with "intermediate"
2020/10/13 18:27:39 [debug] KONG_CASSANDRA_SSL ENV found with "on"
2020/10/13 18:27:39 [debug] KONG_ANONYMOUS_REPORTS ENV found with "off"
2020/10/13 18:27:39 [debug] KONG_WAF_MODE ENV found with "On"
2020/10/13 18:27:39 [debug] KONG_CLIENT_BODY_BUFFER_SIZE ENV found with "50m"
2020/10/13 18:27:39 [debug] KONG_WAF_PCRE_MATCH_LIMIT ENV found with "10000"
2020/10/13 18:27:39 [debug] KONG_LUA_SSL_TRUSTED_CERTIFICATE ENV found with "/usr/local/kong/ssl/kongcert.pem"
2020/10/13 18:27:39 [debug] KONG_CASSANDRA_LB_POLICY ENV found with "RequestDCAwareRoundRobin"
2020/10/13 18:27:39 [debug] KONG_WAF_AUDIT ENV found with "RelevantOnly"
2020/10/13 18:27:39 [debug] admin_access_log = "off"
2020/10/13 18:27:39 [debug] admin_error_log = "/dev/stderr"
2020/10/13 18:27:39 [debug] admin_listen = {"0.0.0.0:8001 deferred reuseport"}
2020/10/13 18:27:39 [debug] anonymous_reports = false
2020/10/13 18:27:39 [debug] cassandra_consistency = "LOCAL_QUORUM"
2020/10/13 18:27:39 [debug] cassandra_contact_points = {"server00466","server00468","server00469","server00467","server00470","server00471"}
2020/10/13 18:27:39 [debug] cassandra_data_centers = {"DC1:3","DC2:3"}
2020/10/13 18:27:39 [debug] cassandra_keyspace = "kong_prod2"
2020/10/13 18:27:39 [debug] cassandra_lb_policy = "RequestDCAwareRoundRobin"
2020/10/13 18:27:39 [debug] cassandra_local_datacenter = "DC1"
2020/10/13 18:27:39 [debug] cassandra_password = "******"
2020/10/13 18:27:39 [debug] cassandra_port = 9042
2020/10/13 18:27:39 [debug] cassandra_read_consistency = "LOCAL_QUORUM"
2020/10/13 18:27:39 [debug] cassandra_refresh_frequency = 0
2020/10/13 18:27:39 [debug] cassandra_repl_factor = 1
2020/10/13 18:27:39 [debug] cassandra_repl_strategy = "NetworkTopologyStrategy"
2020/10/13 18:27:39 [debug] cassandra_schema_consensus_timeout = 30000
2020/10/13 18:27:39 [debug] cassandra_ssl = true
2020/10/13 18:27:39 [debug] cassandra_ssl_verify = true
2020/10/13 18:27:39 [debug] cassandra_timeout = 5000
2020/10/13 18:27:39 [debug] cassandra_username = "*******"
2020/10/13 18:27:39 [debug] cassandra_write_consistency = "LOCAL_QUORUM"
2020/10/13 18:27:39 [debug] client_body_buffer_size = "50m"
2020/10/13 18:27:39 [debug] client_max_body_size = "50m"
2020/10/13 18:27:39 [debug] client_ssl = false
2020/10/13 18:27:39 [debug] cluster_control_plane = "127.0.0.1:8005"
2020/10/13 18:27:39 [debug] cluster_listen = {"0.0.0.0:8005"}
2020/10/13 18:27:39 [debug] cluster_mtls = "shared"
2020/10/13 18:27:39 [debug] database = "cassandra"
2020/10/13 18:27:39 [debug] db_cache_ttl = 0
2020/10/13 18:27:39 [debug] db_cache_warmup_entities = {"services","plugins","consumers"}
2020/10/13 18:27:39 [debug] db_resurrect_ttl = 30
2020/10/13 18:27:39 [debug] db_update_frequency = 5
2020/10/13 18:27:39 [debug] db_update_propagation = 5
2020/10/13 18:27:39 [debug] dns_error_ttl = 1
2020/10/13 18:27:39 [debug] dns_hostsfile = "/etc/hosts"
2020/10/13 18:27:39 [debug] dns_no_sync = false
2020/10/13 18:27:39 [debug] dns_not_found_ttl = 30
2020/10/13 18:27:39 [debug] dns_order = {"LAST","SRV","A","CNAME"}
2020/10/13 18:27:39 [debug] dns_resolver = {}
2020/10/13 18:27:39 [debug] dns_stale_ttl = 4
2020/10/13 18:27:39 [debug] error_default_type = "text/plain"
2020/10/13 18:27:39 [debug] go_plugins_dir = "off"
2020/10/13 18:27:39 [debug] go_pluginserver_exe = "/usr/local/bin/go-pluginserver"
2020/10/13 18:27:39 [debug] headers = {"latency_tokens"}
2020/10/13 18:27:39 [debug] host_ports = {}
2020/10/13 18:27:39 [debug] kic = false
2020/10/13 18:27:39 [debug] log_level = "notice"
2020/10/13 18:27:39 [debug] lua_package_cpath = ""
2020/10/13 18:27:39 [debug] lua_package_path = "./?.lua;./?/init.lua;"
2020/10/13 18:27:39 [debug] lua_socket_pool_size = 30
2020/10/13 18:27:39 [debug] lua_ssl_trusted_certificate = "/usr/local/kong/ssl/kongcert.pem"
2020/10/13 18:27:39 [debug] lua_ssl_verify_depth = 3
2020/10/13 18:27:39 [debug] mem_cache_size = "1024m"
2020/10/13 18:27:39 [debug] nginx_admin_directives = {}
2020/10/13 18:27:39 [debug] nginx_daemon = "on"
2020/10/13 18:27:39 [debug] nginx_events_directives = {{name="multi_accept",value="on"},{name="worker_connections",value="auto"}}
2020/10/13 18:27:39 [debug] nginx_events_multi_accept = "on"
2020/10/13 18:27:39 [debug] nginx_events_worker_connections = "auto"
2020/10/13 18:27:39 [debug] nginx_http_client_body_buffer_size = "50m"
2020/10/13 18:27:39 [debug] nginx_http_client_max_body_size = "50m"
2020/10/13 18:27:39 [debug] nginx_http_directives = {{name="client_max_body_size",value="50m"},{name="ssl_prefer_server_ciphers",value="off"},{name="client_body_buffer_size",value="50m"},{name="ssl_protocols",value="TLSv1.2 TLSv1.3"},{name="ssl_session_timeout",value="1d"},{name="ssl_session_tickets",value="on"}}
2020/10/13 18:27:39 [debug] nginx_http_ssl_prefer_server_ciphers = "off"
2020/10/13 18:27:39 [debug] nginx_http_ssl_protocols = "TLSv1.2 TLSv1.3"
2020/10/13 18:27:39 [debug] nginx_http_ssl_session_tickets = "on"
2020/10/13 18:27:39 [debug] nginx_http_ssl_session_timeout = "1d"
2020/10/13 18:27:39 [debug] nginx_http_status_directives = {}
2020/10/13 18:27:39 [debug] nginx_http_upstream_directives = {{name="keepalive_requests",value="100"},{name="keepalive_timeout",value="60s"},{name="keepalive",value="60"}}
2020/10/13 18:27:39 [debug] nginx_http_upstream_keepalive = "60"
2020/10/13 18:27:39 [debug] nginx_http_upstream_keepalive_requests = "100"
2020/10/13 18:27:39 [debug] nginx_http_upstream_keepalive_timeout = "60s"
2020/10/13 18:27:39 [debug] nginx_main_daemon = "on"
2020/10/13 18:27:39 [debug] nginx_main_directives = {{name="daemon",value="on"},{name="worker_rlimit_nofile",value="auto"},{name="worker_processes",value="6"}}
2020/10/13 18:27:39 [debug] nginx_main_worker_processes = "6"
2020/10/13 18:27:39 [debug] nginx_main_worker_rlimit_nofile = "auto"
2020/10/13 18:27:39 [debug] nginx_optimizations = true
2020/10/13 18:27:39 [debug] nginx_proxy_directives = {{name="real_ip_header",value="X-Forwarded-For"},{name="real_ip_recursive",value="on"}}
2020/10/13 18:27:39 [debug] nginx_proxy_real_ip_header = "X-Forwarded-For"
2020/10/13 18:27:39 [debug] nginx_proxy_real_ip_recursive = "on"
2020/10/13 18:27:39 [debug] nginx_sproxy_directives = {}
2020/10/13 18:27:39 [debug] nginx_status_directives = {}
2020/10/13 18:27:39 [debug] nginx_stream_directives = {{name="ssl_session_tickets",value="on"},{name="ssl_protocols",value="TLSv1.2 TLSv1.3"},{name="ssl_session_timeout",value="1d"},{name="ssl_prefer_server_ciphers",value="off"}}
2020/10/13 18:27:39 [debug] nginx_stream_ssl_prefer_server_ciphers = "off"
2020/10/13 18:27:39 [debug] nginx_stream_ssl_protocols = "TLSv1.2 TLSv1.3"
2020/10/13 18:27:39 [debug] nginx_stream_ssl_session_tickets = "on"
2020/10/13 18:27:39 [debug] nginx_stream_ssl_session_timeout = "1d"
2020/10/13 18:27:39 [debug] nginx_supstream_directives = {}
2020/10/13 18:27:39 [debug] nginx_upstream_directives = {{name="keepalive_requests",value="100"},{name="keepalive_timeout",value="60s"},{name="keepalive",value="60"}}
2020/10/13 18:27:39 [debug] nginx_upstream_keepalive = "60"
2020/10/13 18:27:39 [debug] nginx_upstream_keepalive_requests = "100"
2020/10/13 18:27:39 [debug] nginx_upstream_keepalive_timeout = "60s"
2020/10/13 18:27:39 [debug] nginx_worker_processes = "auto"
2020/10/13 18:27:39 [debug] pg_database = "kong"
2020/10/13 18:27:39 [debug] pg_host = "127.0.0.1"
2020/10/13 18:27:39 [debug] pg_max_concurrent_queries = 0
2020/10/13 18:27:39 [debug] pg_port = 5432
2020/10/13 18:27:39 [debug] pg_ro_ssl = false
2020/10/13 18:27:39 [debug] pg_ro_ssl_verify = false
2020/10/13 18:27:39 [debug] pg_semaphore_timeout = 60000
2020/10/13 18:27:39 [debug] pg_ssl = false
2020/10/13 18:27:39 [debug] pg_ssl_verify = false
2020/10/13 18:27:39 [debug] pg_timeout = 5000
2020/10/13 18:27:39 [debug] pg_user = "kong"
2020/10/13 18:27:39 [debug] plugins = {"kong-siteminder-auth","kong-kafka-log","stargate-waf-error-log","mtls","stargate-oidc-token-revoke","kong-tx-debugger","kong-plugin-oauth","zipkin","kong-error-log","kong-oidc-implicit-token","kong-response-size-limiting","request-transformer","kong-service-virtualization","kong-cluster-drain","kong-upstream-jwt","kong-splunk-log","kong-spec-expose","kong-path-based-routing","kong-oidc-multi-idp","correlation-id","oauth2","statsd","jwt","rate-limiting","acl","request-size-limiting","request-termination","cors"}
2020/10/13 18:27:39 [debug] port_maps = {}
2020/10/13 18:27:39 [debug] prefix = "/usr/local/kong/"
2020/10/13 18:27:39 [debug] proxy_access_log = "off"
2020/10/13 18:27:39 [debug] proxy_error_log = "/dev/stderr"
2020/10/13 18:27:39 [debug] proxy_listen = {"0.0.0.0:8000","0.0.0.0:8443 ssl http2 deferred reuseport"}
2020/10/13 18:27:39 [debug] real_ip_header = "X-Real-IP"
2020/10/13 18:27:39 [debug] real_ip_recursive = "off"
2020/10/13 18:27:39 [debug] role = "traditional"
2020/10/13 18:27:39 [debug] ssl_cert = "/usr/local/kong/ssl/kongcert.crt"
2020/10/13 18:27:39 [debug] ssl_cert_key = "/usr/local/kong/ssl/kongprivatekey.key"
2020/10/13 18:27:39 [debug] ssl_cipher_suite = "intermediate"
2020/10/13 18:27:39 [debug] ssl_ciphers = "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384"
2020/10/13 18:27:39 [debug] ssl_prefer_server_ciphers = "on"
2020/10/13 18:27:39 [debug] ssl_protocols = "TLSv1.1 TLSv1.2 TLSv1.3"
2020/10/13 18:27:39 [debug] ssl_session_tickets = "on"
2020/10/13 18:27:39 [debug] ssl_session_timeout = "1d"
2020/10/13 18:27:39 [debug] status_access_log = "off"
2020/10/13 18:27:39 [debug] status_error_log = "logs/status_error.log"
2020/10/13 18:27:39 [debug] status_listen = {"off"}
2020/10/13 18:27:39 [debug] stream_listen = {"off"}
2020/10/13 18:27:39 [debug] trusted_ips = {"0.0.0.0/0","::/0"}
2020/10/13 18:27:39 [debug] upstream_keepalive = 60
2020/10/13 18:27:39 [debug] upstream_keepalive_idle_timeout = 30
2020/10/13 18:27:39 [debug] upstream_keepalive_max_requests = 50000
2020/10/13 18:27:39 [debug] upstream_keepalive_pool_size = 400
2020/10/13 18:27:39 [debug] worker_consistency = "eventual"
2020/10/13 18:27:39 [debug] worker_state_update_frequency = 5
2020/10/13 18:27:39 [verbose] prefix in use: /usr/local/kong
2020/10/13 18:27:39 [debug] resolved Cassandra contact point 'server00466' to: 10.86.168.39
2020/10/13 18:27:39 [debug] resolved Cassandra contact point 'server00468' to: 10.204.90.237
2020/10/13 18:27:39 [debug] resolved Cassandra contact point 'server00469' to: 10.86.177.62
2020/10/13 18:27:39 [debug] resolved Cassandra contact point 'server00467' to: 10.87.53.57
2020/10/13 18:27:39 [debug] resolved Cassandra contact point 'server00470' to: 10.87.40.192
2020/10/13 18:27:39 [debug] resolved Cassandra contact point 'server00471' to: 10.106.152.199
2020/10/13 18:27:39 [debug] loading subsystems migrations...
2020/10/13 18:27:39 [verbose] retrieving keyspace schema state...
2020/10/13 18:27:39 [verbose] schema state retrieved
2020/10/13 18:27:39 [debug] loading subsystems migrations...
2020/10/13 18:27:39 [verbose] retrieving keyspace schema state...
2020/10/13 18:27:40 [verbose] schema state retrieved
2020/10/13 18:27:40 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 18:27:40 [info] migrating core on keyspace 'kong_prod2'...
2020/10/13 18:27:40 [debug] running migration: 009_200_to_210
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to run migration '009_200_to_210' teardown: unable to find a default workspace
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:48: in function <init_worker_by_lua:46>
        [C]: in function 'xpcall'
        init_worker_by_lua:55: in function <init_worker_by_lua:53>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:48: in function <init_worker_by_lua:46>
        [C]: in function 'xpcall'
        init_worker_by_lua:55: in function <init_worker_by_lua:53>

@jeremyjpj0916:

Can you add:

print(require "inspect"(rows))

To here:
https://github.com/Kong/kong/blob/master/kong/db/migrations/operations/200_to_210.lua#L30

On my machine when running migrations finish it prints like this (the first one is ok):

{
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
{ {
    id = "6bc76bdd-49dc-4a32-b724-90823e3e68ca"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}

Also how have you configured the Cassandra contact points? Any issues there?

@bungle Interesting data here; it looks like it prints my single workspace row entry 4 times with the same id lol?

2020/10/13 19:53:20 [debug] resolved Cassandra contact point 'server00466' to: 10.86.168.39
2020/10/13 19:53:20 [debug] resolved Cassandra contact point 'server00468' to: 10.204.90.237
2020/10/13 19:53:20 [debug] resolved Cassandra contact point 'server00469' to: 10.86.177.62
2020/10/13 19:53:20 [debug] resolved Cassandra contact point 'server00467' to: 10.87.53.57
2020/10/13 19:53:20 [debug] resolved Cassandra contact point 'server00470' to: 10.87.40.192
2020/10/13 19:53:20 [debug] resolved Cassandra contact point 'server00471' to: 10.106.152.199
2020/10/13 19:53:20 [debug] loading subsystems migrations...
2020/10/13 19:53:20 [verbose] retrieving keyspace schema state...
2020/10/13 19:53:21 [verbose] schema state retrieved
2020/10/13 19:53:21 [debug] loading subsystems migrations...
2020/10/13 19:53:21 [verbose] retrieving keyspace schema state...
2020/10/13 19:53:21 [verbose] schema state retrieved
2020/10/13 19:53:21 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 19:53:21 [info] migrating core on keyspace 'kong_prod2'...
2020/10/13 19:53:21 [debug] running migration: 009_200_to_210
{ {
    id = "926762fb-7c62-4ddb-8535-5577af0cefac"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
{ {
    id = "926762fb-7c62-4ddb-8535-5577af0cefac"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
{ {
    id = "926762fb-7c62-4ddb-8535-5577af0cefac"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
{ {
    id = "926762fb-7c62-4ddb-8535-5577af0cefac"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
{
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to run migration '009_200_to_210' teardown: unable to find a default workspace
stack traceback:
        [C]: in function 'assert'

This is interesting: it seems to get it 3 times, and then suddenly it does not get it. Let's see what happens at each step:

  1. SELECT id FROM workspaces WHERE name='default' -> ok 1
  2. updates keys of upstreams -> ok 2
  3. updates keys of targets -> ok 3
  4. updates keys of consumers -> ok 4
  5. updates keys of certificates -> fail (the first one that has the partition column)

Can we confirm that here:
https://github.com/Kong/kong/blob/master/kong/db/migrations/operations/200_to_210.lua#L406

by adding:

print(table_name)

there?

Adding:

    -- Update keys to workspace-aware formats
    ws_update_keys = function(_, connector, table_name, unique_keys, is_partitioned)
      local coordinator = assert(connector:get_stored_connection())
      local default_ws, err = cassandra_get_default_ws(coordinator)
      if err then
        return nil, err
      end

      ngx.log(ngx.ERR, "Table name: " .. table_name)

      if not default_ws then
        return nil, "unable to find a default workspace"
      end

Now executing.

Also there is 1 cert resource in there for reference:

[screenshot omitted: shows the single cert resource referenced above]

@bungle as you suspected, confirmed:

oauth2: 004_200_to_210, 005_210_to_211
2020/10/13 21:45:58 [info] migrating core on keyspace 'kong_prod2'...
2020/10/13 21:45:58 [debug] running migration: 009_200_to_210
{ {
    id = "926762fb-7c62-4ddb-8535-5577af0cefac"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
{ {
    id = "926762fb-7c62-4ddb-8535-5577af0cefac"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
2020/10/13 21:45:58 [error] 638#0: *2 [lua] 200_to_210.lua:409: ws_update_keys(): Table name: upstreams, context: ngx.timer
{ {
    id = "926762fb-7c62-4ddb-8535-5577af0cefac"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
2020/10/13 21:45:58 [error] 638#0: *2 [lua] 200_to_210.lua:409: ws_update_keys(): Table name: targets, context: ngx.timer
{ {
    id = "926762fb-7c62-4ddb-8535-5577af0cefac"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
2020/10/13 21:45:58 [error] 638#0: *2 [lua] 200_to_210.lua:409: ws_update_keys(): Table name: consumers, context: ngx.timer
{
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
2020/10/13 21:45:58 [error] 638#0: *2 [lua] 200_to_210.lua:409: ws_update_keys(): Table name: certificates, context: ngx.timer
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to run migration '009_200_to_210' teardown: unable to find a default workspace
stack traceback:

Current state of certificates schema:

*******@cqlsh:kong_prod2> describe certificates;

CREATE TABLE kong_prod2.certificates (
    partition text,
    id uuid,
    cert text,
    created_at timestamp,
    key text,
    tags set<text>,
    ws_id uuid,
    PRIMARY KEY (partition, id)
) WITH CLUSTERING ORDER BY (id ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 86400
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';
CREATE INDEX certificates_ws_id_idx ON kong_prod2.certificates (ws_id);

Stepping out for now; I can run more stuff in like 5-6 hours if you have any other tests, or if you think you have an idea of what's going on now, I will test any fixes that come along for it 👍.

@jeremyjpj0916,

Let's try one more thing. In your cqlsh:

drop index workspaces_name_idx;

AND _remove_ this:
https://github.com/Kong/kong/blob/master/kong/db/migrations/operations/200_to_210.lua#L303

Step 1 taken:

# cqlsh --ssl
Connected to Stargate at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.8 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
****@cqlsh> use kong_prod2;
****@cqlsh:kong_prod2> drop index workspaces_name_idx;
****@cqlsh:kong_prod2>

Step 2: removed the line CREATE INDEX IF NOT EXISTS workspaces_name_idx ON workspaces(name); from the file. The modified file is shown below:

-- Helper module for 200_to_210 migration operations.
--
-- Operations are versioned and specific to a migration so they remain
-- fixed in time and are not modified for use in future migrations.
--
-- If you want to reuse these operations in a future migration,
-- copy the functions over to a new versioned module.


local ngx = ngx
local uuid = require "resty.jit-uuid"
local cassandra = require "cassandra"


local default_ws_id = uuid.generate_v4()


local function render(template, keys)
  return (template:gsub("$%(([A-Z_]+)%)", keys))
end


local function cassandra_get_default_ws(coordinator)
  local rows, err = coordinator:execute("SELECT id FROM workspaces WHERE name='default'", nil, {
    consistency = cassandra.consistencies.serial,
  })
  if err then
    return nil, err
  end

  print(require "inspect"(rows))

  if not rows
     or not rows[1]
     or not rows[1].id
  then
    return nil
  end

  return rows[1].id
end


local function cassandra_create_default_ws(coordinator)
  local created_at = ngx.time() * 1000

  local _, err = coordinator:execute("INSERT INTO workspaces(id, name, created_at) VALUES (?, 'default', ?)", {
    cassandra.uuid(default_ws_id),
    cassandra.timestamp(created_at),
  }, {
    consistency = cassandra.consistencies.quorum,
  })
  if err then
    return nil, err
  end

  return cassandra_get_default_ws(coordinator)
end


local function cassandra_ensure_default_ws(coordinator)
  local default_ws, err = cassandra_get_default_ws(coordinator)
  if err then
    return nil, err
  end

  if default_ws then
    return default_ws
  end

  return cassandra_create_default_ws(coordinator)
end


--------------------------------------------------------------------------------
-- Postgres operations for Workspace migration
--------------------------------------------------------------------------------


local postgres = {

  up = {

    ----------------------------------------------------------------------------
    -- Add `workspaces` table.
    -- @return string: SQL
    ws_add_workspaces = function(_)
      return render([[

        CREATE TABLE IF NOT EXISTS "workspaces" (
          "id"         UUID                       PRIMARY KEY,
          "name"       TEXT                       UNIQUE,
          "comment"    TEXT,
          "created_at" TIMESTAMP WITH TIME ZONE   DEFAULT (CURRENT_TIMESTAMP(0) AT TIME ZONE 'UTC'),
          "meta"       JSONB,
          "config"     JSONB
        );

        -- Create default workspace
        INSERT INTO workspaces(id, name)
        VALUES ('$(ID)', 'default') ON CONFLICT DO NOTHING;

      ]], {
        ID = default_ws_id
      })
    end,

    ----------------------------------------------------------------------------
    -- Add `ws_id` field to a table.
    -- @param table_name string: name of the table, e.g. "services"
    -- @param fk_users {string:string}: map of tables and field names
    -- for other tables that use this table as a foreign key.
    -- We do NOT get these from the schemas because
    -- we want the migration to remain self-contained and unchanged no matter
    -- what changes to the schemas in the latest version of Kong.
    -- @return string: SQL
    ws_add_ws_id = function(_, table_name, fk_users)
      local out = {}
      table.insert(out, render([[

        -- Add ws_id to $(TABLE), populating all of them with the default workspace id
        DO $$
        BEGIN
          EXECUTE format('ALTER TABLE IF EXISTS ONLY "$(TABLE)" ADD "ws_id" UUID REFERENCES "workspaces" ("id") DEFAULT %L',
                         (SELECT "id" FROM "workspaces" WHERE "name" = 'default'));
        EXCEPTION WHEN DUPLICATE_COLUMN THEN
          -- Do nothing, accept existing state
        END;
        $$;


      ]], { TABLE = table_name }))

      table.insert(out, render([[

        -- Ensure (id, ws_id) pair is unique
        DO $$
        BEGIN
          ALTER TABLE IF EXISTS ONLY "$(TABLE)" ADD CONSTRAINT "$(TABLE)_id_ws_id_unique" UNIQUE ("id", "ws_id");
        EXCEPTION WHEN DUPLICATE_TABLE THEN
          -- Do nothing, accept existing state
        END$$;

      ]], { TABLE = table_name }))

      return table.concat(out, "\n")
    end,

    ----------------------------------------------------------------------------
    -- Make field unique per workspace only.
    -- @param table_name string: name of the table, e.g. "services"
    -- @param field_name string: name of the field, e.g. "name"
    -- @return string: SQL
    ws_unique_field = function(_, table_name, field_name)
      return render([[

          -- Make '$(TABLE).$(FIELD)' unique per workspace
          ALTER TABLE IF EXISTS ONLY "$(TABLE)" DROP CONSTRAINT IF EXISTS "$(TABLE)_$(FIELD)_key";

          -- Ensure (ws_id, $(FIELD)) pair is unique
          DO $$
          BEGIN
            ALTER TABLE IF EXISTS ONLY "$(TABLE)" ADD CONSTRAINT "$(TABLE)_ws_id_$(FIELD)_unique" UNIQUE ("ws_id", "$(FIELD)");
          EXCEPTION WHEN DUPLICATE_TABLE THEN
            -- Do nothing, accept existing state
          END$$;

      ]], {
        TABLE = table_name,
        FIELD = field_name,
      })
    end,

    ----------------------------------------------------------------------------
    -- Adjust foreign key to take ws_id into account and ensure it matches
    -- @param table_name string: name of the table e.g. "routes"
    -- @param fk_prefix string: name of the foreign field in the schema,
    -- which is used as a prefix in foreign key entries in tables e.g. "service"
    -- @param foreign_table_name string: name of the table the foreign field
    -- refers to e.g. "services"
    -- @return string: SQL
    ws_adjust_foreign_key = function(_, table_name, fk_prefix, foreign_table_name, is_cascade)
      return render([[

          -- Update foreign key relationship
          ALTER TABLE IF EXISTS ONLY "$(TABLE)" DROP CONSTRAINT IF EXISTS "$(TABLE)_$(FK)_id_fkey";

          DO $$
          BEGIN
            ALTER TABLE IF EXISTS ONLY "$(TABLE)"
                        ADD CONSTRAINT "$(TABLE)_$(FK)_id_fkey"
                           FOREIGN KEY ("$(FK)_id", "ws_id")
                            REFERENCES $(FOREIGN_TABLE)("id", "ws_id") $(CASCADE);
          EXCEPTION WHEN DUPLICATE_OBJECT THEN
            -- Do nothing, accept existing state
          END$$;

      ]], {
        TABLE = table_name,
        FK = fk_prefix,
        FOREIGN_TABLE = foreign_table_name,
        CASCADE = is_cascade and "ON DELETE CASCADE" or "",
      })
    end,

  },

  teardown = {

    ------------------------------------------------------------------------------
    -- Update composite cache keys to workspace-aware formats
    ws_update_composite_cache_key = function(_, connector, table_name, is_partitioned)
      local _, err = connector:query(render([[
        UPDATE "$(TABLE)"
        SET cache_key = CONCAT(cache_key, ':',
                               (SELECT id FROM workspaces WHERE name = 'default'))
        WHERE cache_key LIKE '%:';
      ]], {
        TABLE = table_name,
      }))
      if err then
        return nil, err
      end

      return true
    end,


    ------------------------------------------------------------------------------
    -- Update keys to workspace-aware formats
    ws_update_keys = function(_, connector, table_name, unique_keys)
      -- Reset default value for ws_id once it is populated
      local _, err = connector:query(render([[
        ALTER TABLE IF EXISTS ONLY "$(TABLE)" ALTER "ws_id" SET DEFAULT NULL;
      ]], {
        TABLE = table_name,
      }))
      if err then
        return nil, err
      end

      return true
    end,


    ------------------------------------------------------------------------------
    -- General function to fixup a plugin configuration
    fixup_plugin_config = function(_, connector, plugin_name, fixup_fn)
      local pgmoon_json = require("pgmoon.json")
      for plugin, err in connector:iterate("SELECT id, name, config FROM plugins") do
        if err then
          return nil, err
        end

        if plugin.name == plugin_name then
          local fix = fixup_fn(plugin.config)

          if fix then
            local sql = render(
              "UPDATE plugins SET config = $(NEW_CONFIG)::jsonb WHERE id = '$(ID)'", {
              NEW_CONFIG = pgmoon_json.encode_json(plugin.config),
              ID = plugin.id,
            })

            local _, err = connector:query(sql)
            if err then
              return nil, err
            end
          end
        end
      end

      return true
    end,
  },

}


--------------------------------------------------------------------------------
-- Cassandra operations for Workspace migration
--------------------------------------------------------------------------------


local cassandra = {

  up = {

    ----------------------------------------------------------------------------
    -- Add `workspaces` table.
    -- @return string: CQL
    ws_add_workspaces = function(_)
      return [[

          CREATE TABLE IF NOT EXISTS workspaces(
            id         uuid,
            name       text,
            comment    text,
            created_at timestamp,
            meta       text,
            config     text,
            PRIMARY KEY (id)
          );



      ]]
    end,

    ----------------------------------------------------------------------------
    -- Add `ws_id` field to a table.
    -- @param table_name string: name of the table, e.g. "services"
    -- @param fk_users {string:string}: map of tables and field names
    -- for other tables that use this table as a foreign key.
    -- We do NOT get these from the schemas because
    -- we want the migration to remain self-contained and unchanged no matter
    -- what changes to the schemas in the latest version of Kong.
    -- @return string: CQL
    ws_add_ws_id = function(_, table_name, fk_users)
      return render([[

          -- 1. Add ws_id to $(TABLE)
          ALTER TABLE $(TABLE) ADD ws_id uuid;

          -- 2. Create index to filter by workspace
          CREATE INDEX IF NOT EXISTS $(TABLE)_ws_id_idx ON $(TABLE)(ws_id);

      ]], {
        TABLE = table_name
      })
    end,

    ----------------------------------------------------------------------------
    -- Make field unique per workspace only.
    -- @param table_name string: name of the table, e.g. "services"
    -- @param field_name string: name of the field, e.g. "name"
    -- @return string: CQL
    ws_unique_field = function(_, table_name, field_name)
      -- The Cassandra strategy controls this via Lua code
      return ""
    end,

    ----------------------------------------------------------------------------
    -- Adjust foreign key to take ws_id into account and ensure it matches
    -- @param table_name string: name of the table e.g. "routes"
    -- @param fk_prefix string: name of the foreign field in the schema,
    -- which is used as a prefix in foreign key entries in tables e.g. "service"
    -- @param foreign_table_name string: name of the table the foreign field
    -- refers to e.g. "services"
    -- @return string: CQL
    ws_adjust_foreign_key = function(_, table_name, fk_prefix, foreign_table_name)
      -- The Cassandra strategy controls this via Lua code
      return ""
    end,
  },

  teardown = {

    ------------------------------------------------------------------------------
    -- Update composite cache keys to workspace-aware formats
    ws_update_composite_cache_key = function(_, connector, table_name, is_partitioned)
      local coordinator = assert(connector:get_stored_connection())
      local default_ws, err = cassandra_get_default_ws(coordinator)
      if err then
        return nil, err
      end

      if not default_ws then
        return nil, "unable to find a default workspace"
      end

      for rows, err in coordinator:iterate("SELECT id, cache_key FROM " .. table_name) do
        if err then
          return nil, err
        end

        for i = 1, #rows do
          local row = rows[i]
          if row.cache_key:match(":$") then
            local cql = render("UPDATE $(TABLE) SET cache_key = '$(CACHE_KEY)' WHERE $(PARTITION) id = $(ID)", {
              TABLE = table_name,
              CACHE_KEY = row.cache_key .. ":" .. default_ws,
              PARTITION = is_partitioned
                        and "partition = '" .. table_name .. "' AND"
                        or  "",
              ID = row.id,
            })

            local _, err = coordinator:execute(cql)
            if err then
              return nil, err
            end
          end
        end
      end

      return true
    end,

    ------------------------------------------------------------------------------
    -- Update keys to workspace-aware formats
    ws_update_keys = function(_, connector, table_name, unique_keys, is_partitioned)
      local coordinator = assert(connector:get_stored_connection())
      local default_ws, err = cassandra_get_default_ws(coordinator)
      if err then
        return nil, err
      end

      ngx.log(ngx.ERR, "Table name: " .. table_name)

      if not default_ws then
        return nil, "unable to find a default workspace"
      end

      for rows, err in coordinator:iterate("SELECT * FROM " .. table_name) do
        if err then
          return nil, err
        end

        for i = 1, #rows do
          local row = rows[i]
          if row.ws_id == nil then
            local set_list = { "ws_id = " .. default_ws }
            for _, key in ipairs(unique_keys) do
              if row[key] then
                table.insert(set_list, render([[$(KEY) = '$(WS):$(VALUE)']], {
                  KEY = key,
                  WS = default_ws,
                  VALUE = row[key],
                }))
              end
            end

            local cql = render("UPDATE $(TABLE) SET $(SET_LIST) WHERE $(PARTITION) id = $(ID)", {
              PARTITION = is_partitioned
                          and "partition = '" .. table_name .. "' AND"
                          or  "",
              TABLE = table_name,
              SET_LIST = table.concat(set_list, ", "),
              ID = row.id,
            })

            local _, err = coordinator:execute(cql)
            if err then
              return nil, err
            end
          end
        end
      end

      return true
    end,

    ------------------------------------------------------------------------------
    -- General function to fixup a plugin configuration
    fixup_plugin_config = function(_, connector, plugin_name, fixup_fn)
      local coordinator = assert(connector:get_stored_connection())
      local cassandra = require("cassandra")
      local cjson = require("cjson")

      for rows, err in coordinator:iterate("SELECT id, name, config FROM plugins") do
        if err then
          return nil, err
        end

        for i = 1, #rows do
          local plugin = rows[i]
          if plugin.name == plugin_name then
            if type(plugin.config) ~= "string" then
              return nil, "plugin config is not a string"
            end
            local config = cjson.decode(plugin.config)
            local fix = fixup_fn(config)

            if fix then
              local _, err = coordinator:execute("UPDATE plugins SET config = ? WHERE id = ?", {
                cassandra.text(cjson.encode(config)),
                cassandra.uuid(plugin.id)
              })
              if err then
                return nil, err
              end
            end
          end
        end
      end

      return true
    end,

  }

}


--------------------------------------------------------------------------------
-- Higher-level operations for Workspace migration
--------------------------------------------------------------------------------


local function ws_adjust_fields(ops, entities)
  local out = {}

  for _, entity in ipairs(entities) do

    table.insert(out, ops:ws_add_ws_id(entity.name))

    for _, fk in ipairs(entity.fks) do
      table.insert(out, ops:ws_adjust_foreign_key(entity.name,
                                                  fk.name,
                                                  fk.reference,
                                                  fk.on_delete == "cascade"))
    end

    for _, unique in ipairs(entity.uniques) do
      table.insert(out, ops:ws_unique_field(entity.name, unique))
    end

  end

  return table.concat(out, "\n")
end


local function ws_adjust_data(ops, connector, entities)
  for _, entity in ipairs(entities) do
    if entity.cache_key and #entity.cache_key > 1 then
      local _, err = ops:ws_update_composite_cache_key(connector, entity.name, entity.partitioned)
      if err then
        return nil, err
      end
    end

    local _, err = ops:ws_update_keys(connector, entity.name, entity.uniques, entity.partitioned)
    if err then
      return nil, err
    end
  end

  return true
end


postgres.up.ws_adjust_fields = ws_adjust_fields
cassandra.up.ws_adjust_fields = ws_adjust_fields


postgres.teardown.ws_adjust_data = ws_adjust_data
cassandra.teardown.ws_adjust_data = ws_adjust_data


--------------------------------------------------------------------------------
-- Super high-level shortcut for plugins
--------------------------------------------------------------------------------


local function ws_migrate_plugin(plugin_entities)

  local function ws_migration_up(ops)
    return ops:ws_adjust_fields(plugin_entities)
  end

  local function ws_migration_teardown(ops)
    return function(connector)
      return ops:ws_adjust_data(connector, plugin_entities)
    end
  end

  return {
    postgres = {
      up = ws_migration_up(postgres.up),
      teardown = ws_migration_teardown(postgres.teardown),
    },

    cassandra = {
      up = ws_migration_up(cassandra.up),
      teardown = ws_migration_teardown(cassandra.teardown),
    },
  }
end


--------------------------------------------------------------------------------


return {
  postgres = postgres,
  cassandra = cassandra,
  ws_migrate_plugin = ws_migrate_plugin,
  cassandra_get_default_ws = cassandra_get_default_ws,
  cassandra_create_default_ws = cassandra_create_default_ws,
  cassandra_ensure_default_ws = cassandra_ensure_default_ws,
}

After making said modifications, trying to bring the 2.1.4 node online with the modified PRs throws this error, which was not occurring before, causing our runtime container to crash and cycle restarts since it never becomes healthy. This is problematic because, as I understand it, even after a kong migrations up the database in its pending state should still be able to start a new runtime node until a migrations finish can be completed:

2020/10/14 05:48:09 [warn] 45#0: [lua] init.lua:457: init(): database has pending migrations:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
nginx: [warn] [lua] init.lua:457: init(): database has pending migrations:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/14 05:48:10 [error] 45#0: init_by_lua error: ...luarocks/share/lua/5.1/kong/runloop/plugins_iterator.lua:378: table index is nil
stack traceback:
        ...luarocks/share/lua/5.1/kong/runloop/plugins_iterator.lua:378: in function 'new'
        ...cal/kong/luarocks/share/lua/5.1/kong/runloop/handler.lua:488: in function 'build_plugins_iterator'
        /usr/local/kong/luarocks/share/lua/5.1/kong/init.lua:497: in function 'init'
        init_by_lua:3: in main chunk
nginx: [error] init_by_lua error: ...luarocks/share/lua/5.1/kong/runloop/plugins_iterator.lua:378: table index is nil
stack traceback:
        ...luarocks/share/lua/5.1/kong/runloop/plugins_iterator.lua:378: in function 'new'
        ...cal/kong/luarocks/share/lua/5.1/kong/runloop/handler.lua:488: in function 'build_plugins_iterator'
        /usr/local/kong/luarocks/share/lua/5.1/kong/init.lua:497: in function 'init'
        init_by_lua:3: in main chunk

Still, for science, I am going to open up a debug terminal and run the kong migrations finish command in isolation to see what happens:
kong migrations finish --vv

2020/10/14 05:53:41 [debug] loading subsystems migrations...
2020/10/14 05:53:41 [verbose] retrieving keyspace schema state...
2020/10/14 05:53:41 [verbose] schema state retrieved
2020/10/14 05:53:42 [debug] loading subsystems migrations...
2020/10/14 05:53:42 [verbose] retrieving keyspace schema state...
2020/10/14 05:53:42 [verbose] schema state retrieved
2020/10/14 05:53:42 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/14 05:53:42 [info] migrating core on keyspace 'kong_prod2'...
2020/10/14 05:53:42 [debug] running migration: 009_200_to_210
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to run migration '009_200_to_210' teardown: [Invalid] Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute thisquery despite the performance unpredictability, use ALLOW FILTERING
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

Now just to clarify @bungle, you did not expect me to re-clone the keyspace from scratch and run the migrations up and finish commands all over again before dropping that index and removing that line of code, right? Because I used the same keyspace in its kong migrations up'ed state.

Hope the above helps! Interesting new little CQL error, with Cassandra asking to use ALLOW FILTERING in the query.

@jeremyjpj0916, okay, hmm... sounds reasonable.

This:
https://github.com/Kong/kong/blob/master/kong/db/migrations/operations/200_to_210.lua#L24

needs to be changed from:

  local rows, err = coordinator:execute("SELECT id FROM workspaces WHERE name='default'", nil, {

to:

  local rows, err = coordinator:execute("SELECT id FROM workspaces WHERE name='default' ALLOW FILTERING", nil, {

What I am after here is to exclude the index from the query, just to take it out of the equation.

You do not need to start from scratch; it is faster to test these changes without doing so.
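
For clarity, with that one-line change applied the whole helper from the module pasted earlier would look roughly like this (a sketch only; the debug print is omitted, and it reuses the module's local cassandra = require "cassandra"):

    local function cassandra_get_default_ws(coordinator)
      -- With the workspaces_name_idx secondary index dropped, filtering on the
      -- non-key 'name' column requires ALLOW FILTERING; otherwise Cassandra
      -- rejects the query with the "data filtering" error shown above.
      local rows, err = coordinator:execute(
        "SELECT id FROM workspaces WHERE name='default' ALLOW FILTERING", nil, {
          consistency = cassandra.consistencies.serial,
        })
      if err then
        return nil, err
      end

      if not rows or not rows[1] or not rows[1].id then
        return nil
      end

      return rows[1].id
    end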

Change made, testing the migrations finish again!

@bungle After adding allow filtering there I get:

2020/10/14 16:31:47 [debug] loading subsystems migrations...
2020/10/14 16:31:47 [verbose] retrieving keyspace schema state...
2020/10/14 16:31:47 [verbose] schema state retrieved
2020/10/14 16:31:48 [debug] loading subsystems migrations...
2020/10/14 16:31:48 [verbose] retrieving keyspace schema state...
2020/10/14 16:31:48 [verbose] schema state retrieved
2020/10/14 16:31:48 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/14 16:31:48 [info] migrating core on keyspace 'kong_prod2'...
2020/10/14 16:31:48 [debug] running migration: 009_200_to_210
{ {
    id = "926762fb-7c62-4ddb-8535-5577af0cefac"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
{ {
    id = "926762fb-7c62-4ddb-8535-5577af0cefac"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
2020/10/14 16:31:48 [error] 23#0: *2 [lua] 200_to_210.lua:409: ws_update_keys(): Table name: upstreams, context: ngx.timer
{ {
    id = "926762fb-7c62-4ddb-8535-5577af0cefac"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
2020/10/14 16:31:48 [error] 23#0: *2 [lua] 200_to_210.lua:409: ws_update_keys(): Table name: targets, context: ngx.timer
{ {
    id = "926762fb-7c62-4ddb-8535-5577af0cefac"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
2020/10/14 16:31:48 [error] 23#0: *2 [lua] 200_to_210.lua:409: ws_update_keys(): Table name: consumers, context: ngx.timer
{
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
2020/10/14 16:31:48 [error] 23#0: *2 [lua] 200_to_210.lua:409: ws_update_keys(): Table name: certificates, context: ngx.timer
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to run migration '009_200_to_210' teardown: unable to find a default workspace
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

So back to same error.

@jeremyjpj0916 thank you for that. So it looks like the code runs fine, but then suddenly does not. Good that the index did not make a difference; bad that we still don't know the cause. I have a feeling the Cassandra driver might be getting confused or something. Perhaps we should reconnect when this happens and try again.

Is that what the old, larger PR did that seemed to work for us, which was dropped from this newer, smaller PR that was already merged (#6439)? If you have a spot to drop that reconnection into, I can test it and confirm whether it helps, since that has not been proven yet.

@jeremyjpj0916,

Can you change this:
https://github.com/Kong/kong/blob/master/kong/db/migrations/operations/200_to_210.lua#L401-L409

to:

      local coordinator = assert(connector:get_stored_connection())
      local default_ws, err = cassandra_get_default_ws(coordinator)
      if err then
        return nil, err
      end

      if not default_ws then
        connector:close()
        coordinator = assert(connector:connect_migrations())

        default_ws, err = cassandra_get_default_ws(coordinator)
        if err then
          return nil, err
        end

        if not default_ws then
          return nil, "unable to find a default workspace"
        end
      end

So we reconnect in case there is some issue with the connection.

Shall I revert the other tweaks you had me make, like dropping the index, or is it okay to leave all that as is and just add these lines (which is what I am going to attempt now unless told otherwise, hah)?

@jeremyjpj0916, I don't think it matters whether you keep them or remove them for getting info on how this goes.

No dice there either:

2020/10/14 20:15:13 [debug] loading subsystems migrations...
2020/10/14 20:15:13 [verbose] retrieving keyspace schema state...
2020/10/14 20:15:13 [verbose] schema state retrieved
2020/10/14 20:15:13 [debug] loading subsystems migrations...
2020/10/14 20:15:13 [verbose] retrieving keyspace schema state...
2020/10/14 20:15:14 [verbose] schema state retrieved
2020/10/14 20:15:14 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/14 20:15:14 [info] migrating core on keyspace 'kong_prod2'...
2020/10/14 20:15:14 [debug] running migration: 009_200_to_210
{ {
    id = "926762fb-7c62-4ddb-8535-5577af0cefac"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
{ {
    id = "926762fb-7c62-4ddb-8535-5577af0cefac"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
{ {
    id = "926762fb-7c62-4ddb-8535-5577af0cefac"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
{ {
    id = "926762fb-7c62-4ddb-8535-5577af0cefac"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
{
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
{
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to run migration '009_200_to_210' teardown: unable to find a default workspace
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

Not sure if we will reach an RCA here... I can say the C* cluster seems healthy: it handles normal Kong runtime traffic on 2.0.5 fine, direct queries look good, and there are no unusual C* logs during these migration executions, so I think something has to be going wrong on the Kong client side -> C* that is causing my migration problems. If we are out of ideas, then maybe I use the PR that was scrapped, since it worked for me personally (I had to run finish multiple times, which is kind of weird, but I am okay with doing so as long as the db ends up in a good state), just to get us over the finish line to 2.1.4 or even 2.2 from 2.0.5. Not ideal to run custom code that wasn't merged into master or a release, but I also don't want y'all burning effort on this if you don't believe it is a Kong problem, or if you are out of ideas for me to try to help reach a conclusion. Willing to keep investigating as a supportive QA if any other ideas come around, though. I think it is important we don't keep our prod regions stuck on 2.0.5 too much longer, because falling behind is never good in open source land.

In your environment it seems like the default workspace got inserted. Then we could retrieve it a couple of times, and then it disappeared. Then you start this again, and the same thing happens. The thing that worked in your environment was where we inserted the default workspace multiple times. Surprisingly that did not create multiple copies of it, perhaps because of the adjusted consistency levels. In this latest change I added another check so that it only ever tries to create it once.

I have one last idea. Let me get back to you on this. The idea is that, now that we have adjusted the consistency of the workspace query, perhaps that is the reason it fails: some other migration before it did so many changes that it takes a long time to reach consensus. We have two ways to solve it, I think:

  1. just retrieve default workspace once
  2. wait for schema consensus after each adjustment

@jeremyjpj0916

Can you try wait for schema consensus, so change this:
https://github.com/Kong/kong/blob/master/kong/db/migrations/operations/200_to_210.lua#L521-L537

to:

local function ws_adjust_data(ops, connector, entities)
  for _, entity in ipairs(entities) do
    if entity.cache_key and #entity.cache_key > 1 then
      local _, err = ops:ws_update_composite_cache_key(connector, entity.name, entity.partitioned)
      if err then
        return nil, err
      end

      local ok, err = connector:wait_for_schema_consensus()
      if not ok then
        return nil, err
      end
    end

    local _, err = ops:ws_update_keys(connector, entity.name, entity.uniques, entity.partitioned)
    if err then
      return nil, err
    end

    local ok, err = connector:wait_for_schema_consensus()
    if not ok then
      return nil, err
    end
  end

  return true
end

You can keep all the current modifications.

Going to bed, I am on my mobile and it is 3:06am my time, but I will try this in the morning and report back!

@jeremyjpj0916,

And if that does not work, here is another change that I think is finally going to work, i.e. store the ws_id at module level (yes, it is a workaround, but hopefully it at least allows you to go through the migrations step by step). This one can be found in this PR:
https://github.com/Kong/kong/pull/6477
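
The rough shape of a module-level ws_id cache looks something like this; a sketch only, with the cached_default_ws name invented here for illustration and not necessarily what the PR does:

    -- Sketch of the module-level caching workaround (illustrative only).
    local cached_default_ws

    local function cassandra_get_default_ws(coordinator)
      -- Reuse the id if it was already resolved in this process.
      if cached_default_ws then
        return cached_default_ws
      end

      local rows, err = coordinator:execute("SELECT id FROM workspaces WHERE name='default'", nil, {
        consistency = cassandra.consistencies.serial,
      })
      if err then
        return nil, err
      end

      if rows and rows[1] and rows[1].id then
        cached_default_ws = rows[1].id
      end

      return cached_default_ws
    end

The point is that once the id has been read (or created) successfully once, later teardown steps in the same process no longer depend on re-reading it from Cassandra.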

@bungle

Output after trying the schema consensus wait change:

2020/10/15 16:47:15 [debug] loading subsystems migrations...
2020/10/15 16:47:15 [verbose] retrieving keyspace schema state...
2020/10/15 16:47:15 [verbose] schema state retrieved
2020/10/15 16:47:15 [debug] loading subsystems migrations...
2020/10/15 16:47:15 [verbose] retrieving keyspace schema state...
2020/10/15 16:47:15 [verbose] schema state retrieved
2020/10/15 16:47:15 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/15 16:47:15 [info] migrating core on keyspace 'kong_prod2'...
2020/10/15 16:47:15 [debug] running migration: 009_200_to_210
{ {
    id = "926762fb-7c62-4ddb-8535-5577af0cefac"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
{ {
    id = "926762fb-7c62-4ddb-8535-5577af0cefac"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
2020/10/15 16:47:15 [verbose] waiting for Cassandra schema consensus (20000ms timeout)...
2020/10/15 16:47:15 [verbose] Cassandra schema consensus: reached in 3ms
{ {
    id = "926762fb-7c62-4ddb-8535-5577af0cefac"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
2020/10/15 16:47:15 [verbose] waiting for Cassandra schema consensus (20000ms timeout)...
2020/10/15 16:47:15 [verbose] Cassandra schema consensus: reached in 4ms
{ {
    id = "926762fb-7c62-4ddb-8535-5577af0cefac"
  },
  meta = {
    has_more_pages = false
  },
  type = "ROWS"
}
2020/10/15 16:47:16 [verbose] waiting for Cassandra schema consensus (20000ms timeout)...
2020/10/15 16:47:36 [verbose] Cassandra schema consensus: not reached in 20064ms
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to run migration '009_200_to_210' teardown: failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

As we can see, it just hangs at that certificates step, right? Otherwise schema consensus is blazing fast on all the rest of the calls. Super odd.

Going to test your other idea now too!

Argh, I think removing that index means more ALLOW FILTERING clauses need to be added on more lines.

2020/10/15 17:18:23 [info] migrating core on keyspace 'kong_prod2'...
2020/10/15 17:18:23 [debug] running migration: 009_200_to_210
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to run migration '009_200_to_210' teardown: [Invalid] Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute thisquery despite the performance unpredictability, use ALLOW FILTERING
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

I am just going to re-clone the keyspace, try with the base merged PR plus your latest suggestion that has a pending PR, and confirm whether that works 😄.

Yes, so this:

 [Read failure] Operation fail

Looks to be related to:

unable to find a default workspace

If one happens, the other one would most probably happen too when executed in the same place. So PR #6477 may get us back to the original state, where you can possibly get further by running the migration again, as long as each set of migrations passes (each is recorded after a successful run).

Testing the unaltered #6439 paired with #6477 now, against the keyspace re-cloned without that index removed. Will report back in a few minutes with results.

A quick kong migrations list after the up; the migrations up went fine as per usual:

2020/10/16 18:34:35 [debug] loading subsystems migrations...
2020/10/16 18:34:35 [verbose] retrieving keyspace schema state...
2020/10/16 18:34:36 [verbose] schema state retrieved
2020/10/16 18:34:36 [info] Executed migrations:
         core: 000_base, 003_100_to_110, 004_110_to_120, 005_120_to_130, 006_130_to_140, 007_140_to_150, 008_150_to_200, 011_212_to_213
          jwt: 000_base_jwt, 002_130_to_140
          acl: 000_base_acl, 002_130_to_140
rate-limiting: 000_base_rate_limiting, 003_10_to_112, 004_200_to_210
       oauth2: 000_base_oauth2, 003_130_to_140
2020/10/16 18:34:36 [info]
Pending migrations:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/16 18:34:36 [info]
Run 'kong migrations finish' when ready
/ $

1st finish:

2020/10/16 18:35:42 [debug] loading subsystems migrations...
2020/10/16 18:35:42 [verbose] retrieving keyspace schema state...
2020/10/16 18:35:42 [verbose] schema state retrieved
2020/10/16 18:35:42 [debug] loading subsystems migrations...
2020/10/16 18:35:42 [verbose] retrieving keyspace schema state...
2020/10/16 18:35:43 [verbose] schema state retrieved
2020/10/16 18:35:43 [debug] pending migrations to finish:
  core: 009_200_to_210, 010_210_to_211
   jwt: 003_200_to_210
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/16 18:35:43 [info] migrating core on keyspace 'kong_prod2'...
2020/10/16 18:35:43 [debug] running migration: 009_200_to_210
2020/10/16 18:35:51 [verbose] waiting for Cassandra schema consensus (20000ms timeout)...
2020/10/16 18:35:51 [verbose] Cassandra schema consensus: reached in 3ms
2020/10/16 18:35:51 [info] core migrated up to: 009_200_to_210 (executed)
2020/10/16 18:35:51 [debug] running migration: 010_210_to_211
2020/10/16 18:35:51 [verbose] waiting for Cassandra schema consensus (20000ms timeout)...
2020/10/16 18:35:51 [verbose] Cassandra schema consensus: reached in 5ms
2020/10/16 18:35:51 [info] core migrated up to: 010_210_to_211 (executed)
2020/10/16 18:35:51 [info] migrating jwt on keyspace 'kong_prod2'...
2020/10/16 18:35:51 [debug] running migration: 003_200_to_210
2020/10/16 18:35:52 [verbose] waiting for Cassandra schema consensus (20000ms timeout)...
2020/10/16 18:36:12 [verbose] Cassandra schema consensus: not reached in 20090ms
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

2nd finish:

2020/10/16 18:37:09 [debug] loading subsystems migrations...
2020/10/16 18:37:09 [verbose] retrieving keyspace schema state...
2020/10/16 18:37:09 [verbose] schema state retrieved
2020/10/16 18:37:09 [debug] loading subsystems migrations...
2020/10/16 18:37:09 [verbose] retrieving keyspace schema state...
2020/10/16 18:37:09 [verbose] schema state retrieved
2020/10/16 18:37:09 [debug] pending migrations to finish:
   acl: 003_200_to_210, 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/16 18:37:10 [info] migrating acl on keyspace 'kong_prod2'...
2020/10/16 18:37:10 [debug] running migration: 003_200_to_210
2020/10/16 18:37:26 [verbose] waiting for Cassandra schema consensus (20000ms timeout)...
2020/10/16 18:37:46 [verbose] Cassandra schema consensus: not reached in 20079ms
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

3rd finish:

2020/10/16 18:38:18 [debug] loading subsystems migrations...
2020/10/16 18:38:18 [verbose] retrieving keyspace schema state...
2020/10/16 18:38:18 [verbose] schema state retrieved
2020/10/16 18:38:18 [debug] loading subsystems migrations...
2020/10/16 18:38:18 [verbose] retrieving keyspace schema state...
2020/10/16 18:38:19 [verbose] schema state retrieved
2020/10/16 18:38:19 [debug] pending migrations to finish:
   acl: 004_212_to_213
oauth2: 004_200_to_210, 005_210_to_211
2020/10/16 18:38:19 [info] migrating acl on keyspace 'kong_prod2'...
2020/10/16 18:38:19 [debug] running migration: 004_212_to_213
2020/10/16 18:38:19 [verbose] waiting for Cassandra schema consensus (20000ms timeout)...
2020/10/16 18:38:39 [verbose] Cassandra schema consensus: not reached in 20023ms
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

4th finish:

2020/10/16 18:38:59 [debug] loading subsystems migrations...
2020/10/16 18:38:59 [verbose] retrieving keyspace schema state...
2020/10/16 18:39:00 [verbose] schema state retrieved
2020/10/16 18:39:00 [debug] loading subsystems migrations...
2020/10/16 18:39:00 [verbose] retrieving keyspace schema state...
2020/10/16 18:39:00 [verbose] schema state retrieved
2020/10/16 18:39:00 [debug] pending migrations to finish:
oauth2: 004_200_to_210, 005_210_to_211
2020/10/16 18:39:00 [info] migrating oauth2 on keyspace 'kong_prod2'...
2020/10/16 18:39:00 [debug] running migration: 004_200_to_210
2020/10/16 18:39:10 [verbose] waiting for Cassandra schema consensus (20000ms timeout)...
2020/10/16 18:39:30 [verbose] Cassandra schema consensus: not reached in 20099ms
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

5th finish:

2020/10/16 18:39:52 [debug] loading subsystems migrations...
2020/10/16 18:39:52 [verbose] retrieving keyspace schema state...
2020/10/16 18:39:53 [verbose] schema state retrieved
2020/10/16 18:39:53 [debug] loading subsystems migrations...
2020/10/16 18:39:53 [verbose] retrieving keyspace schema state...
2020/10/16 18:39:53 [verbose] schema state retrieved
2020/10/16 18:39:53 [debug] pending migrations to finish:
oauth2: 005_210_to_211
2020/10/16 18:39:53 [info] migrating oauth2 on keyspace 'kong_prod2'...
2020/10/16 18:39:53 [debug] running migration: 005_210_to_211
2020/10/16 18:39:54 [verbose] waiting for Cassandra schema consensus (20000ms timeout)...
2020/10/16 18:40:14 [verbose] Cassandra schema consensus: not reached in 20025ms
Error:
...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: [Cassandra error] cluster_mutex callback threw an error: ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: [Cassandra error] failed to wait for schema consensus: [Read failure] Operation failed - received 0 responses and 1 failures
stack traceback:
        [C]: in function 'assert'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:142: in function <...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:364: in function </usr/local/kong/luarocks/share/lua/5.1/kong/db/init.lua:314>
        [C]: in function 'pcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/concurrency.lua:45: in function 'cluster_mutex'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:126: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>
stack traceback:
        [C]: in function 'error'
        ...ong/luarocks/share/lua/5.1/kong/cmd/utils/migrations.lua:161: in function 'finish'
        ...ocal/kong/luarocks/share/lua/5.1/kong/cmd/migrations.lua:184: in function 'cmd_exec'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88>
        [C]: in function 'xpcall'
        /usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:88: in function </usr/local/kong/luarocks/share/lua/5.1/kong/cmd/init.lua:45>
        /usr/bin/kong:9: in function 'file_gen'
        init_worker_by_lua:50: in function <init_worker_by_lua:48>
        [C]: in function 'xpcall'
        init_worker_by_lua:57: in function <init_worker_by_lua:55>

6th finish (confirming it's all done):

2020/10/16 18:40:32 [debug] loading subsystems migrations...
2020/10/16 18:40:32 [verbose] retrieving keyspace schema state...
2020/10/16 18:40:32 [verbose] schema state retrieved
2020/10/16 18:40:32 [debug] loading subsystems migrations...
2020/10/16 18:40:32 [verbose] retrieving keyspace schema state...
2020/10/16 18:40:33 [verbose] schema state retrieved
2020/10/16 18:40:33 [info] No pending migrations to finish

And wrapping it up with a `kong migrations list`:

2020/10/16 18:41:11 [debug] loading subsystems migrations...
2020/10/16 18:41:11 [verbose] retrieving keyspace schema state...
2020/10/16 18:41:11 [verbose] schema state retrieved
2020/10/16 18:41:11 [info] Executed migrations:
         core: 000_base, 003_100_to_110, 004_110_to_120, 005_120_to_130, 006_130_to_140, 007_140_to_150, 008_150_to_200, 009_200_to_210, 010_210_to_211, 011_212_to_213
          jwt: 000_base_jwt, 002_130_to_140, 003_200_to_210
          acl: 000_base_acl, 002_130_to_140, 003_200_to_210, 004_212_to_213
rate-limiting: 000_base_rate_limiting, 003_10_to_112, 004_200_to_210
       oauth

So yes, it seems those two PRs paired together do get us back to the capability of the first (dropped) larger PR, at least with respect to my environment. I will verify that the tables in C* look good post-upgrade too.

Edit - Yep, the DB checkout looks good: all tables got the right ws_id set, and the cache keys/names all look like the workspace was injected into the right spot! So even though I have to run finish multiple times, it still works well enough for me to proceed. Thanks again for the hard investigation! We can close this issue since it's no longer a blocker for our upgrade path forward; running the CLI command 5-6 times does the trick.
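
For anyone hitting the same thing, here is a rough sketch of that "run it until it sticks" workaround: re-run `kong migrations finish` until it reports that nothing is pending. This is my own convenience script, not anything shipped with Kong; the attempt cap, the 30-second pause between runs, and detecting success by scanning the output for the "No pending migrations to finish" line are all arbitrary choices for my environment.

```python
# Rough sketch of the workaround: keep re-running `kong migrations finish`
# until it reports no pending migrations. Attempt cap and sleep interval
# are arbitrary, not Kong defaults.
import subprocess
import sys
import time

MAX_ATTEMPTS = 10

for attempt in range(1, MAX_ATTEMPTS + 1):
    proc = subprocess.run(
        ["kong", "migrations", "finish", "--db-timeout", "120", "--vv"],
        capture_output=True,
        text=True,
    )
    # Kong writes its log lines to stderr, so combine both streams.
    output = proc.stdout + proc.stderr
    print(f"--- finish attempt {attempt} (exit code {proc.returncode}) ---")
    print(output)

    if "No pending migrations to finish" in output:
        print(f"migrations finished after {attempt} attempt(s)")
        break

    # Give the cluster a moment to settle/gossip before retrying.
    time.sleep(30)
else:
    sys.exit(f"still failing after {MAX_ATTEMPTS} attempts; giving up")
```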

Closing this, as we got a solution that works for our Kong environments! I appreciate the collaboration, @bungle and @hishamhm. It's odd that I seem to be the only one reporting these kinds of issues, but I appreciate the dedication to improving the existing code to make things work for everyone.
