Kong: Declarative config reload sometimes leaves broken workers at 100% CPU

Created on 24 Aug 2020 · 8 comments · Source: Kong/kong

Summary

When reloading Kong's configuration, the old worker processes sometimes do not exit and instead sit at 100% CPU usage. Subsequent reloads sometimes resolve the issue, while other times they make it worse, leaving multiple sets of stale workers running.

I originally thought this might be the same as #6055, but the issue persists on 2.1.3, which contains #6121 (the change suspected to fix that issue). Since then I've also realised I don't see the same memory-fragmentation log messages.

Kong 2.0.4 with the same configuration did not have this issue.

Steps To Reproduce

  1. Start Kong with a declarative config file; in my case there are 2 workers, matching the number of vCPUs on the VM (see the kong.conf sketch after these steps).
    $ ps auxww | grep nginx
    root 22585 0.8 1.7 749376 68848 ? Ss 14:26 0:05 nginx: master process /usr/local/openresty/nginx/sbin/nginx -p /usr/local/kong -c nginx.conf
    nobody 22711 0.1 2.9 797928 112864 ? S 14:33 0:00 nginx: worker process
    nobody 22712 0.2 2.9 798052 113304 ? S 14:33 0:00 nginx: worker process

  2. $ systemctl reload kong

  3. $ ps auxww | grep nginx
    root 22585 0.9 1.7 750144 69748 ? Ss 14:26 0:06 nginx: master process /usr/local/openresty/nginx/sbin/nginx -p /usr/local/kong -c nginx.conf
    nobody 22711 11.1 2.9 797928 112864 ? R 14:33 0:25 nginx: worker process
    nobody 22712 11.1 4.5 862240 177484 ? R 14:33 0:25 nginx: worker process
    nobody 22795 0.9 2.9 798628 113128 ? S 14:37 0:00 nginx: worker process
    nobody 22796 0.8 2.9 798568 112584 ? S 14:37 0:00 nginx: worker process

Note there are now 4 workers: the original ones with PIDs 22711 and 22712 did not terminate and are each maxing out a CPU core.

It does not occur on every reload; it can sometimes take a few tries before the issue presents itself. The configuration does not need to have changed to trigger it.
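
For reference, a sketch of the relevant settings from my setup, reconstructed from the effective values in the debug startup log below (so treat it as illustrative rather than a verbatim copy of my kong.conf):

    # /etc/kong/kong.conf (relevant excerpt only)
    database = off                            # DB-less, declarative mode
    declarative_config = /etc/kong/kong.yaml  # re-read by the new workers on reload
    nginx_main_worker_processes = auto        # 2 workers on this 2-vCPU VM
    proxy_listen = 0.0.0.0:443 http2 ssl reuseport backlog=16384
    status_listen = 0.0.0.0:8100
    admin_listen = off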

Additional Details & Logs

  • Kong version 2.1.3 (RHEL 7 RPM); also reproduced on 2.1.0 and 2.1.2. The same configuration did not have this issue under version 2.0.4
  • Kong debug-level startup logs ($ kong start --vv):
2020/08/24 14:25:50 [verbose] Kong: 2.1.3
2020/08/24 14:25:50 [debug] ngx_lua: 10015
2020/08/24 14:25:50 [debug] nginx: 1015008
2020/08/24 14:25:50 [debug] Lua: LuaJIT 2.1.0-beta3
2020/08/24 14:25:50 [verbose] reading config file at /etc/kong/kong.conf
2020/08/24 14:25:50 [debug] reading environment variables
2020/08/24 14:25:50 [debug] admin_access_log = "logs/admin_access.log"
2020/08/24 14:25:50 [debug] admin_error_log = "logs/error.log"
2020/08/24 14:25:50 [debug] admin_listen = {"off"}
2020/08/24 14:25:50 [debug] anonymous_reports = true
2020/08/24 14:25:50 [debug] cassandra_contact_points = {"127.0.0.1"}
2020/08/24 14:25:50 [debug] cassandra_data_centers = {"dc1:2","dc2:3"}
2020/08/24 14:25:50 [debug] cassandra_keyspace = "kong"
2020/08/24 14:25:50 [debug] cassandra_lb_policy = "RequestRoundRobin"
2020/08/24 14:25:50 [debug] cassandra_port = 9042
2020/08/24 14:25:50 [debug] cassandra_read_consistency = "ONE"
2020/08/24 14:25:50 [debug] cassandra_refresh_frequency = 60
2020/08/24 14:25:50 [debug] cassandra_repl_factor = 1
2020/08/24 14:25:50 [debug] cassandra_repl_strategy = "SimpleStrategy"
2020/08/24 14:25:50 [debug] cassandra_schema_consensus_timeout = 10000
2020/08/24 14:25:50 [debug] cassandra_ssl = false
2020/08/24 14:25:50 [debug] cassandra_ssl_verify = false
2020/08/24 14:25:50 [debug] cassandra_timeout = 5000
2020/08/24 14:25:50 [debug] cassandra_username = "kong"
2020/08/24 14:25:50 [debug] cassandra_write_consistency = "ONE"
2020/08/24 14:25:50 [debug] client_body_buffer_size = "8k"
2020/08/24 14:25:50 [debug] client_max_body_size = "0"
2020/08/24 14:25:50 [debug] client_ssl = false
2020/08/24 14:25:50 [debug] cluster_control_plane = "127.0.0.1:8005"
2020/08/24 14:25:50 [debug] cluster_listen = {"0.0.0.0:8005"}
2020/08/24 14:25:50 [debug] cluster_mtls = "shared"
2020/08/24 14:25:50 [debug] database = "off"
2020/08/24 14:25:50 [debug] db_cache_ttl = 0
2020/08/24 14:25:50 [debug] db_cache_warmup_entities = {"services","plugins"}
2020/08/24 14:25:50 [debug] db_resurrect_ttl = 30
2020/08/24 14:25:50 [debug] db_update_frequency = 5
2020/08/24 14:25:50 [debug] db_update_propagation = 0
2020/08/24 14:25:50 [debug] declarative_config = "/etc/kong/kong.yaml"
2020/08/24 14:25:50 [debug] dns_error_ttl = 1
2020/08/24 14:25:50 [debug] dns_hostsfile = "/etc/hosts"
2020/08/24 14:25:50 [debug] dns_no_sync = false
2020/08/24 14:25:50 [debug] dns_not_found_ttl = 30
2020/08/24 14:25:50 [debug] dns_order = {"SRV","A","CNAME"}
2020/08/24 14:25:50 [debug] dns_resolver = {"127.0.0.1"}
2020/08/24 14:25:50 [debug] dns_stale_ttl = 4
2020/08/24 14:25:50 [debug] error_default_type = "application/json"
2020/08/24 14:25:50 [debug] go_plugins_dir = "off"
2020/08/24 14:25:50 [debug] go_pluginserver_exe = "/usr/local/bin/go-pluginserver"
2020/08/24 14:25:50 [debug] headers = {"latency_tokens"}
2020/08/24 14:25:50 [debug] host_ports = {}
2020/08/24 14:25:50 [debug] kic = false
2020/08/24 14:25:50 [debug] log_level = "notice"
2020/08/24 14:25:50 [debug] lua_package_cpath = ""
2020/08/24 14:25:50 [debug] lua_package_path = "./?.lua;./?/init.lua;"
2020/08/24 14:25:50 [debug] lua_socket_pool_size = 30
2020/08/24 14:25:50 [debug] lua_ssl_trusted_certificate = "/etc/pki/tls/certs/ca-bundle.crt"
2020/08/24 14:25:50 [debug] lua_ssl_verify_depth = 5
2020/08/24 14:25:50 [debug] mem_cache_size = "128m"
2020/08/24 14:25:50 [debug] nginx_admin_directives = {}
2020/08/24 14:25:50 [debug] nginx_daemon = "on"
2020/08/24 14:25:50 [debug] nginx_events_directives = {{name="worker_connections",value="auto"},{name="multi_accept",value="on"}}
2020/08/24 14:25:50 [debug] nginx_events_multi_accept = "on"
2020/08/24 14:25:50 [debug] nginx_events_worker_connections = "auto"
2020/08/24 14:25:50 [debug] nginx_http_client_body_buffer_size = "8k"
2020/08/24 14:25:50 [debug] nginx_http_client_max_body_size = "64m"
2020/08/24 14:25:50 [debug] nginx_http_directives = {{name="proxy_ssl_verify_depth",value="5"},{name="ssl_dhparam",value="/usr/local/kong/ssl/dhparam.pem"},{name="proxy_ssl_verify",value="on"},{name="client_max_body_size",value="64m"},{name="resolver",value="127.0.0.1"},{name="ssl_prefer_server_ciphers",value="off"},{name="client_body_buffer_size",value="8k"},{name="proxy_ssl_trusted_certificate",value="/etc/pki/tls/certs/ca-bundle.crt"},{name="ssl_protocols",value="TLSv1.2 TLSv1.3"},{name="ssl_session_tickets",value="off"},{name="ssl_stapling",value="on"},{name="ssl_session_timeout",value="1d"}}
2020/08/24 14:25:50 [debug] nginx_http_proxy_ssl_trusted_certificate = "/etc/pki/tls/certs/ca-bundle.crt"
2020/08/24 14:25:50 [debug] nginx_http_proxy_ssl_verify = "on"
2020/08/24 14:25:50 [debug] nginx_http_proxy_ssl_verify_depth = "5"
2020/08/24 14:25:50 [debug] nginx_http_resolver = "127.0.0.1"
2020/08/24 14:25:50 [debug] nginx_http_ssl_dhparam = "/usr/local/kong/ssl/dhparam.pem"
2020/08/24 14:25:50 [debug] nginx_http_ssl_prefer_server_ciphers = "off"
2020/08/24 14:25:50 [debug] nginx_http_ssl_protocols = "TLSv1.2 TLSv1.3"
2020/08/24 14:25:50 [debug] nginx_http_ssl_session_tickets = "off"
2020/08/24 14:25:50 [debug] nginx_http_ssl_session_timeout = "1d"
2020/08/24 14:25:50 [debug] nginx_http_ssl_stapling = "on"
2020/08/24 14:25:50 [debug] nginx_http_status_directives = {}
2020/08/24 14:25:50 [debug] nginx_http_upstream_directives = {{name="keepalive_requests",value="100"},{name="keepalive",value="60"},{name="keepalive_timeout",value="60s"}}
2020/08/24 14:25:50 [debug] nginx_http_upstream_keepalive = "60"
2020/08/24 14:25:50 [debug] nginx_http_upstream_keepalive_requests = "100"
2020/08/24 14:25:50 [debug] nginx_http_upstream_keepalive_timeout = "60s"
2020/08/24 14:25:50 [debug] nginx_main_daemon = "on"
2020/08/24 14:25:50 [debug] nginx_main_directives = {{name="daemon",value="on"},{name="worker_rlimit_nofile",value="auto"},{name="worker_processes",value="auto"}}
2020/08/24 14:25:50 [debug] nginx_main_worker_processes = "auto"
2020/08/24 14:25:50 [debug] nginx_main_worker_rlimit_nofile = "auto"
2020/08/24 14:25:50 [debug] nginx_optimizations = true
2020/08/24 14:25:50 [debug] nginx_proxy_directives = {{name="real_ip_header",value="X-Real-IP"},{name="real_ip_recursive",value="off"}}
2020/08/24 14:25:50 [debug] nginx_proxy_real_ip_header = "X-Real-IP"
2020/08/24 14:25:50 [debug] nginx_proxy_real_ip_recursive = "off"
2020/08/24 14:25:50 [debug] nginx_sproxy_directives = {}
2020/08/24 14:25:50 [debug] nginx_status_directives = {}
2020/08/24 14:25:50 [debug] nginx_stream_directives = {{name="ssl_session_timeout",value="1d"},{name="ssl_session_tickets",value="off"},{name="ssl_protocols",value="TLSv1.2 TLSv1.3"},{name="ssl_prefer_server_ciphers",value="off"}}
2020/08/24 14:25:50 [debug] nginx_stream_ssl_prefer_server_ciphers = "off"
2020/08/24 14:25:50 [debug] nginx_stream_ssl_protocols = "TLSv1.2 TLSv1.3"
2020/08/24 14:25:50 [debug] nginx_stream_ssl_session_tickets = "off"
2020/08/24 14:25:50 [debug] nginx_stream_ssl_session_timeout = "1d"
2020/08/24 14:25:50 [debug] nginx_supstream_directives = {}
2020/08/24 14:25:50 [debug] nginx_upstream_directives = {{name="keepalive_requests",value="100"},{name="keepalive",value="60"},{name="keepalive_timeout",value="60s"}}
2020/08/24 14:25:50 [debug] nginx_upstream_keepalive = "60"
2020/08/24 14:25:50 [debug] nginx_upstream_keepalive_requests = "100"
2020/08/24 14:25:50 [debug] nginx_upstream_keepalive_timeout = "60s"
2020/08/24 14:25:50 [debug] nginx_worker_processes = "auto"
2020/08/24 14:25:50 [debug] pg_database = "kong"
2020/08/24 14:25:50 [debug] pg_host = "127.0.0.1"
2020/08/24 14:25:50 [debug] pg_max_concurrent_queries = 0
2020/08/24 14:25:50 [debug] pg_port = 5432
2020/08/24 14:25:50 [debug] pg_ro_ssl = false
2020/08/24 14:25:50 [debug] pg_ro_ssl_verify = false
2020/08/24 14:25:50 [debug] pg_semaphore_timeout = 60000
2020/08/24 14:25:50 [debug] pg_ssl = false
2020/08/24 14:25:50 [debug] pg_ssl_verify = false
2020/08/24 14:25:50 [debug] pg_timeout = 5000
2020/08/24 14:25:50 [debug] pg_user = "kong"
2020/08/24 14:25:50 [debug] plugins = {"bundled"}
2020/08/24 14:25:50 [debug] port_maps = {}
2020/08/24 14:25:50 [debug] prefix = "/usr/local/kong/"
2020/08/24 14:25:50 [debug] proxy_access_log = "logs/access.log"
2020/08/24 14:25:50 [debug] proxy_error_log = "logs/error.log"
2020/08/24 14:25:50 [debug] proxy_listen = {"0.0.0.0:443 http2 ssl reuseport backlog=16384"}
2020/08/24 14:25:50 [debug] real_ip_header = "X-Real-IP"
2020/08/24 14:25:50 [debug] real_ip_recursive = "off"
2020/08/24 14:25:50 [debug] role = "traditional"
2020/08/24 14:25:50 [debug] ssl_cert = "/usr/local/kong/ssl/kong.crt"
2020/08/24 14:25:50 [debug] ssl_cert_key = "/usr/local/kong/ssl/kong.key"
2020/08/24 14:25:50 [debug] ssl_cipher_suite = "custom"
2020/08/24 14:25:50 [debug] ssl_ciphers = "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384"
2020/08/24 14:25:50 [debug] ssl_prefer_server_ciphers = "off"
2020/08/24 14:25:50 [debug] ssl_protocols = "TLSv1.2 TLSv1.3"
2020/08/24 14:25:50 [debug] ssl_session_tickets = "off"
2020/08/24 14:25:50 [debug] ssl_session_timeout = "1d"
2020/08/24 14:25:50 [debug] status_access_log = "off"
2020/08/24 14:25:50 [debug] status_error_log = "logs/status_error.log"
2020/08/24 14:25:50 [debug] status_listen = {"0.0.0.0:8100"}
2020/08/24 14:25:50 [debug] stream_listen = {"off"}
2020/08/24 14:25:50 [debug] trusted_ips = {}
2020/08/24 14:25:50 [debug] upstream_keepalive = 60
2020/08/24 14:25:50 [debug] upstream_keepalive_idle_timeout = 60
2020/08/24 14:25:50 [debug] upstream_keepalive_max_requests = 100
2020/08/24 14:25:50 [debug] upstream_keepalive_pool_size = 60
2020/08/24 14:25:50 [debug] worker_consistency = "strict"
2020/08/24 14:25:50 [debug] worker_state_update_frequency = 5
2020/08/24 14:25:50 [verbose] prefix in use: /usr/local/kong
2020/08/24 14:25:50 [debug] loading subsystems migrations...
2020/08/24 14:25:50 [verbose] retrieving cache schema state...
2020/08/24 14:25:50 [verbose] schema state retrieved
2020/08/24 14:25:50 [verbose] preparing nginx prefix directory at /usr/local/kong
2020/08/24 14:25:50 [warn] ulimit is currently set to "1024". For better performance set it to at least "4096" using "ulimit -n"
2020/08/24 14:25:50 [debug] searching for OpenResty 'nginx' executable
2020/08/24 14:25:50 [debug] /usr/local/openresty/nginx/sbin/nginx -v: 'nginx version: openresty/1.15.8.3'
2020/08/24 14:25:50 [debug] found OpenResty 'nginx' executable at /usr/local/openresty/nginx/sbin/nginx
2020/08/24 14:25:50 [debug] testing nginx configuration: KONG_NGINX_CONF_CHECK=true /usr/local/openresty/nginx/sbin/nginx -t -p /usr/local/kong -c nginx.conf
2020/08/24 14:25:51 [debug] searching for OpenResty 'nginx' executable
2020/08/24 14:25:51 [debug] /usr/local/openresty/nginx/sbin/nginx -v: 'nginx version: openresty/1.15.8.3'
2020/08/24 14:25:51 [debug] found OpenResty 'nginx' executable at /usr/local/openresty/nginx/sbin/nginx
2020/08/24 14:25:51 [debug] sending signal to pid at: /usr/local/kong/pids/nginx.pid
2020/08/24 14:25:51 [debug] kill -0 `cat /usr/local/kong/pids/nginx.pid` >/dev/null 2>&1
2020/08/24 14:25:51 [debug] starting nginx: /usr/local/openresty/nginx/sbin/nginx -p /usr/local/kong -c nginx.conf
2020/08/24 14:25:52 [debug] nginx started
2020/08/24 14:25:52 [info] Kong started
  • Kong error logs (<KONG_PREFIX>/logs/error.log)

During a reload that succeeds:

2020/08/24 14:27:48 [warn] 22619#0: load balancing method redefined in /usr/local/kong/nginx-kong.conf:67
2020/08/24 14:27:48 [notice] 22619#0: signal process started
2020/08/24 14:27:48 [notice] 22585#0: signal 1 (SIGHUP) received from 22619, reconfiguring
2020/08/24 14:27:48 [notice] 22585#0: reconfiguring
2020/08/24 14:27:48 [warn] 22585#0: load balancing method redefined in /usr/local/kong/nginx-kong.conf:67
2020/08/24 14:27:50 [notice] 22585#0: using the "epoll" event method
2020/08/24 14:27:50 [notice] 22585#0: start worker processes
2020/08/24 14:27:50 [notice] 22585#0: start worker process 22630
2020/08/24 14:27:50 [notice] 22585#0: start worker process 22631
2020/08/24 14:27:50 [notice] 22585#0: signal 17 (SIGCHLD) received from 22621
2020/08/24 14:27:50 [notice] 22631#0: *615 [lua] cache.lua:374: purge(): [DB cache] purging (local) cache, context: init_worker_by_lua*
2020/08/24 14:27:50 [warn] 22630#0: *616 [lua] globalpatches.lua:55: sleep(): executing a blocking 'sleep' (0.001 seconds), context: init_worker_by_lua*
2020/08/24 14:27:50 [notice] 22631#0: *615 [lua] cache.lua:374: purge(): [DB cache] purging (local) cache, context: init_worker_by_lua*
2020/08/24 14:27:50 [notice] 22595#0: gracefully shutting down
2020/08/24 14:27:50 [notice] 22595#0: exiting
2020/08/24 14:27:50 [notice] 22594#0: gracefully shutting down
2020/08/24 14:27:50 [notice] 22594#0: exiting
2020/08/24 14:27:50 [warn] 22630#0: *616 [lua] globalpatches.lua:55: sleep(): executing a blocking 'sleep' (0.002 seconds), context: init_worker_by_lua*
2020/08/24 14:27:50 [warn] 22630#0: *616 [lua] globalpatches.lua:55: sleep(): executing a blocking 'sleep' (0.004 seconds), context: init_worker_by_lua*
2020/08/24 14:27:50 [warn] 22630#0: *616 [lua] globalpatches.lua:55: sleep(): executing a blocking 'sleep' (0.008 seconds), context: init_worker_by_lua*
2020/08/24 14:27:50 [warn] 22630#0: *616 [lua] globalpatches.lua:55: sleep(): executing a blocking 'sleep' (0.016 seconds), context: init_worker_by_lua*
2020/08/24 14:27:50 [warn] 22630#0: *616 [lua] globalpatches.lua:55: sleep(): executing a blocking 'sleep' (0.032 seconds), context: init_worker_by_lua*
2020/08/24 14:27:50 [notice] 22631#0: *615 [kong] init.lua:330 declarative config loaded from /etc/kong/kong.yaml, context: init_worker_by_lua*
2020/08/24 14:27:50 [notice] 22594#0: exit
2020/08/24 14:27:50 [notice] 22595#0: exit
2020/08/24 14:27:50 [notice] 22585#0: signal 17 (SIGCHLD) received from 22595
2020/08/24 14:27:50 [notice] 22585#0: worker process 22595 exited with code 0
2020/08/24 14:27:50 [notice] 22585#0: signal 17 (SIGCHLD) received from 22594
2020/08/24 14:27:50 [notice] 22585#0: worker process 22594 exited with code 0
2020/08/24 14:27:50 [notice] 22585#0: signal 29 (SIGIO) received

During a reload where the old workers do not terminate:

2020/08/24 14:27:53 [warn] 22654#0: load balancing method redefined in /usr/local/kong/nginx-kong.conf:67
2020/08/24 14:27:53 [notice] 22654#0: signal process started
2020/08/24 14:27:53 [notice] 22585#0: signal 1 (SIGHUP) received from 22654, reconfiguring
2020/08/24 14:27:53 [notice] 22585#0: reconfiguring
2020/08/24 14:27:53 [warn] 22585#0: load balancing method redefined in /usr/local/kong/nginx-kong.conf:67
2020/08/24 14:27:54 [notice] 22585#0: using the "epoll" event method
2020/08/24 14:27:54 [notice] 22585#0: start worker processes
2020/08/24 14:27:54 [notice] 22585#0: start worker process 22665
2020/08/24 14:27:54 [notice] 22585#0: start worker process 22666
2020/08/24 14:27:54 [notice] 22666#0: *654 [lua] cache.lua:374: purge(): [DB cache] purging (local) cache, context: init_worker_by_lua*
2020/08/24 14:27:54 [notice] 22666#0: *654 [lua] cache.lua:374: purge(): [DB cache] purging (local) cache, context: init_worker_by_lua*
2020/08/24 14:27:55 [notice] 22585#0: signal 17 (SIGCHLD) received from 22656
2020/08/24 14:27:54 [notice] 22666#0: *654 [kong] init.lua:330 declarative config loaded from /etc/kong/kong.yaml, context: init_worker_by_lua*
  • Kong configuration (the output of a GET request to Kong's Admin port - see
    https://docs.konghq.com/latest/admin-api/#retrieve-node-information)

The Admin port is not enabled on this deployment, and the Status port doesn't seem able to show this information.

  • Operating system
    RHEL 7 x64

All 8 comments

@nvx Thank you for the detailed report and for testing over multiple versions! We will try to reproduce this on our end and continue investigating from there!

2.0.5 also looks good, so it seems 2.1.0 introduced the regression.

You could use the following systemd unit for the service instead:

    cat > /usr/lib/systemd/system/kong.service << 'EOF'
    [Unit]
    Description=Kong
    Documentation=https://docs.konghq.com/
    After=syslog.target network.target remote-fs.target nss-lookup.target

    [Service]
    ExecStart=/usr/local/bin/kong start
    ExecReload=/usr/local/bin/kong reload
    ExecStop=/usr/local/bin/kong stop
    PrivateTmp=true

    # All environment variables prefixed with `KONG_` and capitalized will override
    # the settings specified in the `/etc/kong/kong.conf.default` file.
    #
    # For example:
    #   `log_level = debug` in the .conf file -> `KONG_LOG_LEVEL=debug` env var.
    Environment=KONG_NGINX_DAEMON=off

    # You can control this limit through /etc/security/limits.conf
    LimitNOFILE=infinity

    [Install]
    WantedBy=multi-user.target
    EOF
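
(If swapping the unit in, systemd needs to re-read unit files before the new reload path takes effect; standard systemd steps, included only for completeness:)

    systemctl daemon-reload
    systemctl restart kong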

My understanding is that "kong reload" ends up sending a SIGHUP to nginx, which is the same reload behaviour as the packaged systemd unit already has today:

ExecReload=/usr/local/openresty/nginx/sbin/nginx -p /usr/local/kong -c nginx.conf -s reload

As such, I imagine it would have the same issue. Unless you accidentally meant to post that on my other, unrelated issue #6271?
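
In other words, both of the following should end up sending the same SIGHUP to the nginx master process (pid-file path taken from the startup log above), so swapping the unit would not change the reload mechanism:

    # What the packaged unit's ExecReload runs:
    /usr/local/openresty/nginx/sbin/nginx -p /usr/local/kong -c nginx.conf -s reload
    # ...which is equivalent to signalling the master directly:
    kill -HUP "$(cat /usr/local/kong/pids/nginx.pid)"
    # ...and "kong reload" should end up in the same place.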

Some findings:

  • It is very easy to reproduce: start kong with declarative config, and then issue kong reload. In many cases, the "old" workers get stuck at 100% CPU usage. In my case, it usually happens with just 2-4 reloads.
  • The "new" workers seem to work normally and process requests.
  • Another kong reload usually kills the stuck workers and returns processing to normal.
  • The "old" workers are busy-waiting to acquire the "events-last" shdict lock (see the gdb check after this list).
  • The issue was introduced by commit a15aea3ee0
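
To check whether a stuck worker is in the same busy-wait, attaching gdb to one of the old worker PIDs is enough (the PID here is just the example from the ps output in the report; substitute your own):

    $ sudo gdb -p 22711
    (gdb) bt        # the backtrace repeatedly lands in ngx_http_lua_ffi_shdict_get
    (gdb) detach
    (gdb) quit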

I have the same issue (tested on 2.1.0 and the latest 2.1.3, installed from the Ubuntu bionic deb).
Here is some debug info I've collected:
nginx call graph
Output from gdb:

...
2584  in ../ngx_lua-0.10.15/src/ngx_http_lua_shdict.c
(gdb) 
2583  in ../ngx_lua-0.10.15/src/ngx_http_lua_shdict.c
(gdb) 
2584  in ../ngx_lua-0.10.15/src/ngx_http_lua_shdict.c
(gdb) 
2618  in ../ngx_lua-0.10.15/src/ngx_http_lua_shdict.c
(gdb) 
2621  in ../ngx_lua-0.10.15/src/ngx_http_lua_shdict.c
(gdb) 
2618  in ../ngx_lua-0.10.15/src/ngx_http_lua_shdict.c
(gdb) 
2621  in ../ngx_lua-0.10.15/src/ngx_http_lua_shdict.c
(gdb) 

Program received signal SIGTRAP, Trace/breakpoint trap.
0x000056132f24af62 in ngx_http_lua_ffi_shdict_get (zone=<optimized out>, key=0x7f2a3aadc078 "events-last", key_len=11, 
    value_type=<optimized out>, str_value_buf=0x7f2a53e8ee18, str_value_len=0x7f2a53edebe0, num_value=0x7f2a53e8edb8, 
    user_flags=0x7f2a53e8ed98, get_stale=0, is_stale=0x7f2a53e8edd8, err=0x7f2a53e8ee38)
    at ../ngx_lua-0.10.15/src/ngx_http_lua_shdict.c:2623
2623  in ../ngx_lua-0.10.15/src/ngx_http_lua_shdict.c
(gdb) 

Program received signal SIGSEGV, Segmentation fault.
0x000056132f24af65 in ngx_http_lua_ffi_shdict_get (zone=<optimized out>, key=0x7f2a3aadc078 "events-last", key_len=11, 
    value_type=<optimized out>, str_value_buf=<error reading variable: Cannot access memory at address 0x10>, 
    str_value_len=<error reading variable: Cannot access memory at address 0x8>, 
    num_value=<error reading variable: Cannot access memory at address 0xb0>, 
    user_flags=<error reading variable: Cannot access memory at address 0xb8>, 
    get_stale=<error reading variable: Cannot access memory at address 0xc0>, 
    is_stale=<error reading variable: Cannot access memory at address 0xc8>, 
    err=<error reading variable: Cannot access memory at address 0xd0>)
    at ../ngx_lua-0.10.15/src/ngx_http_lua_shdict.c:2623
2623  in ../ngx_lua-0.10.15/src/ngx_http_lua_shdict.c
(gdb) 

Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
(gdb)

@nvx @migolovanov do you still see this on Kong 2.1.3? I don't run declarative config; I run with a DB, and I see the 100% maxed-out stale workers on kong reload too.

cc @hishamhm

Edit - Actually, this was merged into master 7 days ago (after 2.1.3, I imagine). Is there any way this can be hot-patched into 2.1.3 or a 2.1.4 candidate? If it fixes the zombie worker processes and 100% CPU pegging, it makes sense to cut a new version of Kong so people are not stuck on 2.1.3 with the issue.

Actually, this was merged into master 7 days ago (after 2.1.3, I imagine). Is there any way this can be hot-patched into 2.1.3 or a 2.1.4 candidate?

Correct, it got in after 2.1.3 and is planned for 2.1.4!
