Whenever any of our APIs exposed through Kong is under high traffic, we get downtime during any update to Service/Route endpoints. We have read through how Kong does local caching/invalidation and how the local router is rebuilt whenever certain resources are updated, and we also experimented with different db_update_frequency values, but we always hit the same issue.
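For reference, these are the knobs we varied in kong.conf while chasing this (the values shown are just illustrative; they match the configuration dump further down):

# kong.conf -- cache/invalidation settings we experimented with
db_update_frequency = 5     # seconds between polls of the datastore for invalidation events
db_update_propagation = 0   # extra delay before applying invalidations (mainly relevant for Cassandra)
db_cache_ttl = 3600         # seconds an entity stays in the node's in-memory cache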
We receive the below error during this situation:
[C]: in function 'post'
/usr/local/share/lua/5.1/pgmoon/init.lua:294: in function 'execute'
...local/share/lua/5.1/kong/db/strategies/postgres/init.lua:565: in function 'page'
...local/share/lua/5.1/kong/db/strategies/postgres/init.lua:819: in function 'each'
/usr/local/share/lua/5.1/kong/db/dao/init.lua:244: in function 'each'
/usr/local/share/lua/5.1/kong/core/handler.lua:83: in function 'build_router'
/usr/local/share/lua/5.1/kong/core/handler.lua:456: in function 'before'
/usr/local/share/lua/5.1/kong/init.lua:391: in function 'access'
access_by_lua(nginx-kong.conf:89):2: in function <access_by_lua(nginx-kong.conf:89):1>, client: 172.22.17.140, server: kong, request: "GET / HTTP/1.1", host: "app.kong-test.nonprod.internal"
Other times we see this as well:
2018/09/07 10:26:49 [error] 52901#0: *73903 lua entry thread aborted: runtime error: .../share/lua/5.1/kong/db/strategies/postgres/connector.lua:157: bad request
stack traceback:
coroutine 0:
[C]: in function 'keepalive'
.../share/lua/5.1/kong/db/strategies/postgres/connector.lua:157: in function 'setkeepalive'
.../share/lua/5.1/kong/db/strategies/postgres/connector.lua:181: in function 'execute'
...local/share/lua/5.1/kong/db/strategies/postgres/init.lua:672: in function 'select'
/usr/local/share/lua/5.1/kong/db/dao/init.lua:187: in function 'select'
/usr/local/share/lua/5.1/kong/core/handler.lua:96: in function 'build_router'
/usr/local/share/lua/5.1/kong/core/handler.lua:456: in function 'before'
/usr/local/share/lua/5.1/kong/init.lua:391: in function 'access'
access_by_lua(nginx-kong.conf:89):2: in function <access_by_lua(nginx-kong.conf:89):1>, client: 172.22.17.140, server: kong, request: "GET / HTTP/1.1", host: "app.kong-test.nonprod.internal"
2018/09/07 10:26:49 [error] 52901#0: *72398 lua entry thread aborted: runtime error: /usr/local/share/lua/5.1/pgmoon/init.lua:535: attempt to index field 'sock' (a nil value)
stack traceback:
coroutine 0:
/usr/local/share/lua/5.1/pgmoon/init.lua: in function 'receive_message'
/usr/local/share/lua/5.1/pgmoon/init.lua:299: in function 'query'
.../share/lua/5.1/kong/db/strategies/postgres/connector.lua:132: in function 'connect'
.../share/lua/5.1/kong/db/strategies/postgres/connector.lua:175: in function 'execute'
...local/share/lua/5.1/kong/db/strategies/postgres/init.lua:565: in function 'page'
...local/share/lua/5.1/kong/db/strategies/postgres/init.lua:819: in function 'each'
/usr/local/share/lua/5.1/kong/db/dao/init.lua:244: in function 'each'
/usr/local/share/lua/5.1/kong/core/handler.lua:83: in function 'build_router'
/usr/local/share/lua/5.1/kong/core/handler.lua:456: in function 'before'
/usr/local/share/lua/5.1/kong/init.lua:391: in function 'access'
access_by_lua(nginx-kong.conf:89):2: in function <access_by_lua(nginx-kong.conf:89):1>, client: 172.22.17.140, server: kong, request: "GET / HTTP/1.1", host: "app.kong-test.nonprod.internal"
Is there any recommended configuration for updating Kong Routes/Services during heavy load when we have multiple Kong servers, so that we do not receive these errors? Our node configuration is below:
{
"configuration": {
"admin_access_log": "logs/admin_access.log",
"admin_error_log": "logs/error.log",
"admin_listen": [
"0.0.0.0:8001"
],
"admin_listeners": [
{
"http2": false,
"ip": "0.0.0.0",
"listener": "0.0.0.0:8001",
"port": 8001,
"proxy_protocol": false,
"ssl": false
}
],
"admin_ssl_cert_csr_default": "/usr/local/kong/ssl/admin-kong-default.csr",
"admin_ssl_cert_default": "/usr/local/kong/ssl/admin-kong-default.crt",
"admin_ssl_cert_key_default": "/usr/local/kong/ssl/admin-kong-default.key",
"admin_ssl_enabled": false,
"anonymous_reports": true,
"cassandra_consistency": "ONE",
"cassandra_contact_points": [
"127.0.0.1"
],
"cassandra_data_centers": [
"dc1:2",
"dc2:3"
],
"cassandra_keyspace": "kong",
"cassandra_lb_policy": "RoundRobin",
"cassandra_port": 9042,
"cassandra_repl_factor": 1,
"cassandra_repl_strategy": "SimpleStrategy",
"cassandra_schema_consensus_timeout": 10000,
"cassandra_ssl": false,
"cassandra_ssl_verify": false,
"cassandra_timeout": 5000,
"cassandra_username": "kong",
"client_body_buffer_size": "8k",
"client_max_body_size": "0",
"client_ssl": false,
"client_ssl_cert_csr_default": "/usr/local/kong/ssl/kong-default.csr",
"client_ssl_cert_default": "/usr/local/kong/ssl/kong-default.crt",
"client_ssl_cert_key_default": "/usr/local/kong/ssl/kong-default.key",
"custom_plugins": {},
"database": "postgres",
"db_cache_ttl": 3600,
"db_update_frequency": 5,
"db_update_propagation": 0,
"dns_error_ttl": 1,
"dns_hostsfile": "/etc/hosts",
"dns_no_sync": false,
"dns_not_found_ttl": 30,
"dns_order": [
"SRV",
"A",
"CNAME"
],
"dns_resolver": {},
"dns_stale_ttl": 4,
"error_default_type": "text/plain",
"kong_env": "/usr/local/kong/.kong_env",
"latency_tokens": true,
"log_level": "info",
"lua_package_cpath": "",
"lua_package_path": "./?.lua;./?/init.lua;",
"lua_socket_pool_size": 30,
"lua_ssl_verify_depth": 1,
"mem_cache_size": "128m",
"nginx_acc_logs": "/usr/local/kong/logs/access.log",
"nginx_admin_acc_logs": "/usr/local/kong/logs/admin_access.log",
"nginx_conf": "/usr/local/kong/nginx.conf",
"nginx_daemon": "on",
"nginx_err_logs": "/usr/local/kong/logs/error.log",
"nginx_kong_conf": "/usr/local/kong/nginx-kong.conf",
"nginx_optimizations": true,
"nginx_pid": "/usr/local/kong/pids/nginx.pid",
"nginx_worker_processes": "auto",
"pg_database": "********",
"pg_host": "*********",
"pg_password": "******",
"pg_port": 5432,
"pg_ssl": false,
"pg_ssl_verify": false,
"pg_user": "*******",
"plugins": {
"acl": true,
"aws-lambda": true,
"basic-auth": true,
"bot-detection": true,
"correlation-id": true,
"cors": true,
"datadog": true,
"file-log": true,
"hmac-auth": true,
"http-log": true,
"ip-restriction": true,
"jwt": true,
"key-auth": true,
"ldap-auth": true,
"loggly": true,
"oauth2": true,
"rate-limiting": true,
"request-size-limiting": true,
"request-termination": true,
"request-transformer": true,
"response-ratelimiting": true,
"response-transformer": true,
"runscope": true,
"statsd": true,
"syslog": true,
"tcp-log": true,
"udp-log": true
},
"prefix": "/usr/local/kong",
"proxy_access_log": "logs/access.log",
"proxy_error_log": "logs/error.log",
"proxy_listen": [
"0.0.0.0:8000",
"0.0.0.0:8443 ssl"
],
"proxy_listeners": [
{
"http2": false,
"ip": "0.0.0.0",
"listener": "0.0.0.0:8000",
"port": 8000,
"proxy_protocol": false,
"ssl": false
},
{
"http2": false,
"ip": "0.0.0.0",
"listener": "0.0.0.0:8443 ssl",
"port": 8443,
"proxy_protocol": false,
"ssl": true
}
],
"proxy_ssl_enabled": true,
"real_ip_header": "X-Real-IP",
"real_ip_recursive": "off",
"server_tokens": true,
"ssl_cert": "/usr/local/kong/ssl/kong-default.crt",
"ssl_cert_csr_default": "/usr/local/kong/ssl/kong-default.csr",
"ssl_cert_default": "/usr/local/kong/ssl/kong-default.crt",
"ssl_cert_key": "/usr/local/kong/ssl/kong-default.key",
"ssl_cert_key_default": "/usr/local/kong/ssl/kong-default.key",
"ssl_cipher_suite": "modern",
"ssl_ciphers": "ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256",
"trusted_ips": {},
"upstream_keepalive": 60
},
"hostname": "ip-172-22-134-113",
"lua_version": "LuaJIT 2.1.0-beta3",
"node_id": "6ebefee4-ad03-43e5-8681-fa1ddd90817c",
"plugins": {
"available_on_server": {
"acl": true,
"aws-lambda": true,
"basic-auth": true,
"bot-detection": true,
"correlation-id": true,
"cors": true,
"datadog": true,
"file-log": true,
"hmac-auth": true,
"http-log": true,
"ip-restriction": true,
"jwt": true,
"key-auth": true,
"ldap-auth": true,
"loggly": true,
"oauth2": true,
"rate-limiting": true,
"request-size-limiting": true,
"request-termination": true,
"request-transformer": true,
"response-ratelimiting": true,
"response-transformer": true,
"runscope": true,
"statsd": true,
"syslog": true,
"tcp-log": true,
"udp-log": true
},
"enabled_in_cluster": [
"acl",
"cors",
"key-auth",
"datadog",
"syslog",
"rate-limiting",
"request-termination",
"http-log",
"basic-auth"
]
},
"prng_seeds": {
"pid: 9519": 132422051391,
"pid: 9520": 233104212281,
"pid: 9521": 232147691847,
"pid: 9522": 231106811371
},
"tagline": "Welcome to kong",
"timers": {
"pending": 5,
"running": 0
},
"version": "0.13.1"
}
Hi,
Thank you for the detailed report. Unfortunately, the first error you provided is incomplete: it contains the stack trace but not the actual error message. Would you please post it here, or update your message? Thank you.
The second error you are encountering (bad request) should have been fixed in 0.14.0 and 0.14.1 (see https://github.com/Kong/kong/pull/3423).
Concerning the bulk of the issue (updates to Services/Routes under high throughput): we are aware of it and will be working towards improving this. Please stay tuned for updates; thanks!
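For context on the second error: "bad request" is what ngx_lua raises when a cosocket is used outside the request that created it. Without reproducing the linked patch here, the defensive pattern this class of fix follows looks roughly like this (an illustration, not the actual change):

-- Sketch of the general cosocket hygiene pattern: only return a connection
-- to the keepalive pool when its last exchange succeeded; a socket in an
-- unknown state must be closed, never pooled.
local function release(sock, last_op_ok)
  if not last_op_ok then
    sock:close()                      -- do not pool a possibly-broken socket
    return nil, "closed broken connection"
  end
  local ok, err = sock:setkeepalive() -- hand it back to the connection pool
  if not ok then
    sock:close()
    return nil, err
  end
  return true
end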
Hello,
Here is the full message:
2018/09/07 14:25:33 [error] 52902#0: *2627194 lua entry thread aborted: runtime error: /usr/local/share/lua/5.1/pgmoon/init.lua:294: bad request
stack traceback:
coroutine 0:
[C]: in function 'post'
/usr/local/share/lua/5.1/pgmoon/init.lua:294: in function 'execute'
...local/share/lua/5.1/kong/db/strategies/postgres/init.lua:565: in function 'page'
...local/share/lua/5.1/kong/db/strategies/postgres/init.lua:819: in function 'each'
/usr/local/share/lua/5.1/kong/db/dao/init.lua:244: in function 'each'
/usr/local/share/lua/5.1/kong/core/handler.lua:83: in function 'build_router'
/usr/local/share/lua/5.1/kong/core/handler.lua:456: in function 'before'
/usr/local/share/lua/5.1/kong/init.lua:391: in function 'access'
access_by_lua(nginx-kong.conf:89):2: in function <access_by_lua(nginx-kong.conf:89):1>, client: 172.22.17.140, server: kong, request: "GET / HTTP/1.1", host: "app.kong-test.nonprod.internal"
We can update to Kong 0.14.1, as it looks like that issue is taken care of in the PR you linked.
For the main issue, since you are aware of it: is there anything we can do to mitigate it during high throughput? If not, we can try to schedule creation of new Services/Routes outside of the high-throughput window. Any info on a timeline would be great too!
Thanks!
@mathematician Until the official fix is available, this PR may address the issue you are facing:
https://github.com/Kong/kong/pull/3696
@mathematician @ganeshs FYI I have posted https://github.com/Kong/kong/pull/3782 as a solution for this concurrency issue.
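The idea is to serialize router rebuilds so that concurrent requests do not stampede the datastore. A minimal sketch of that pattern, using lua-resty-lock (not the actual implementation; build_router() and the "locks" lua_shared_dict are stand-ins):

-- Serialize router rebuilds with a shared-dict lock, assuming a
-- lua_shared_dict named "locks" and a (hypothetical) expensive
-- build_router() that pages all Routes/Services from the database.
local resty_lock = require "resty.lock"

local router, router_version   -- per-worker cached router and its version

local function get_updated_router(current_version)
  if router and router_version == current_version then
    return router                  -- fast path: nothing changed
  end

  -- timeout = 0: fail immediately if someone else is already rebuilding
  local lock, err = resty_lock:new("locks", { timeout = 0 })
  if not lock then
    return router, err
  end

  local elapsed = lock:lock("router:rebuild")
  if not elapsed then
    -- a concurrent request holds the lock; serve with the stale router
    -- instead of stampeding the database
    return router
  end

  if router_version ~= current_version then  -- re-check under the lock
    router = build_router()
    router_version = current_version
  end

  lock:unlock()
  return router
end

With something like this in place, at most one rebuild runs at a time, and every other in-flight request keeps routing on the last known good router instead of hitting Postgres.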
@mathematician / @ganeshs,
Can you check whether @p0pr0ck5's PR has the desired impact on this?
Closing this, given the router rebuild mutex was written specifically to solve this, and is available in 1.0.
If there are similar reports of this in releases of Kong after 1.0, please note this here and we can re-examine the issue.