https://github.com/Kong/kong/issues/3124 was closed.
I'd like to submit a test which demonstrates the memory leak.
@thibaultcha, is this okay?
Best,
Dave
@davidsiracusa Do so, but I certainly hope this isn't about values without TTLs not being LRU-evicted again. If it turns out not to be a memory leak in the plugin, but once again the conflicting shm consumption patterns of the DB cache and the plugin (as I have described _numerous times_), then I will simply point you to the numerous allusions I have already made to the issue, how to fix it, and why it is delicate to resolve as of today.
That said, we are always interested in learning about issues such as memory leaks, and we encourage users to report them.
I want to be careful and not cover already-discussed items. We run a Kong 0.11.x production environment where we have scheduled OpenResty's systemtap ngx-shm tool to periodically report the free memory of the kong_cache zone. Once it gets too low, we perform a graceful restart of the node; without this, we experience "no memory" errors. Whether we select 1024M or 2048M doesn't matter. How quickly free memory burns down to 0K depends on how busy our site is and on the cache size. A Lua dump of the cache reveals the culprit: the offenders are the keys related to rate-limiting (the per-second limits).

I patched and built OpenResty, deployed the nginx binary in staging, and was able to sustain 7k RPS for hours across 300 APIs using an artificially low cache size. A dump of the cache after testing yields only rate-limiting keys (the per-day limits). I'm reluctant to deploy the patched version, since the periodic restarts can be managed. I can reproduce this issue, but if I'm rehashing an existing discussion, I don't want to upset anyone. I only need assurance that init_ttl will land in a future OpenResty; it would be nice to know when this is planned (0.14?).
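For reference, a rough in-process way to take similar measurements without systemtap is a small Lua handler like the sketch below (not Kong code; free_space() assumes OpenResty 1.13.6.1+, where that shdict method exists). Both calls lock the zone, so keep this off hot production paths:

-- Report the kong_cache zone's free space and a sample of its keys,
-- e.g. from a debug-only location's content_by_lua_block.
local shm = ngx.shared.kong_cache

ngx.say("kong_cache free bytes: ", shm:free_space())

-- Sample up to 1024 keys to see what is filling the zone
-- (e.g. the rate-limiting counter keys mentioned above).
for _, key in ipairs(shm:get_keys(1024)) do
  ngx.say(key)
end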
So, I'm not sure how to phrase it anymore, but I will try once again - and maybe I am wrong, and maybe we'll get to fix something :)
> Whether we select 1024M, or 2048M doesn't matter.
Indeed, it should not - if you give more memory to a caching layer, it will get comfortable and consume all of it; such is its nature.
Maybe we should try an experiment, since you are willing to experiment with patches, custom builds, and short-term solutions - which is good news.
Would you try the following:
- Create a new lua_shared_dict dedicated to rate-limiting counters.
- Point the rate-limiting plugin at that new shm instead of kong_cache.

Eventually, the patch should look something like this:
diff --git a/kong/plugins/rate-limiting/policies/init.lua b/kong/plugins/rate-limiting/policies/init.lua
index 86bbd0dd..6f669f59 100644
--- a/kong/plugins/rate-limiting/policies/init.lua
+++ b/kong/plugins/rate-limiting/policies/init.lua
@@ -4,7 +4,7 @@ local redis = require "resty.redis"
local policy_cluster = require "kong.plugins.rate-limiting.policies.cluster"
local reports = require "kong.core.reports"
local ngx_log = ngx.log
-local shm = ngx.shared.kong_cache
+local shm = ngx.shared.kong_rl
local pairs = pairs
local fmt = string.format
diff --git a/kong/templates/nginx_kong.lua b/kong/templates/nginx_kong.lua
index 5639f319..81ea27fe 100644
--- a/kong/templates/nginx_kong.lua
+++ b/kong/templates/nginx_kong.lua
@@ -30,6 +30,7 @@ lua_max_running_timers 4096;
lua_max_pending_timers 16384;
lua_shared_dict kong 5m;
lua_shared_dict kong_cache ${{MEM_CACHE_SIZE}};
+lua_shared_dict kong_rl 5m;
lua_shared_dict kong_process_events 5m;
lua_shared_dict kong_cluster_events 5m;
lua_shared_dict kong_healthchecks 5m;
If you have a custom NGINX template, even better: make the change there instead of in the default template.
Now, what this does is store _only the rate-limiting counters in that shm_. You can keep that shm relatively small (but be aware that if keys get evicted, you'll lose some counters).
Now, because we split the _large_ caching shm from the _busy_ counters shm, you _should_ no longer see those "no memory" errors. Why? Because when the _busy_ shm (counters) is full, its LRU eviction frees a slot of the same size as the newly inserted value (another counter).
The issue with using the same shm for both the cache and rate-limiting (as said numerous times already) is that when the cache evicts a value from the shm, and the evicted value happens to be one of the RL counters (probably one of the per-second limits), the freed space is smaller than the new value to insert (the value from the cache, usually some JSON).
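To make the mismatch concrete, here is a minimal standalone sketch (not Kong code; it assumes a small test zone declared as "lua_shared_dict mixed 1m;") showing how a zone saturated with tiny counters can still refuse a large cache-style write:

local shm = ngx.shared.mixed

-- 1. Saturate the zone with small, counter-sized entries, like the
--    per-second rate-limiting counters.
local i = 0
while true do
  i = i + 1
  -- safe_set() never evicts, so this loop stops exactly when full
  local ok = shm:safe_set("ratelimit:" .. i, i)
  if not ok then break end
end

-- 2. Attempt a large, cache-sized write. set() evicts LRU entries to
--    make room, but each eviction frees only a tiny counter slot and
--    the eviction attempts are bounded, so the write can still fail
--    with the very "no memory" error Kong's DB cache reports.
local big_json = string.rep("x", 16 * 1024)
local ok, err = shm:set("cache:entity", big_json)
ngx.log(ngx.NOTICE, "large insert: ", ok and "ok" or ("failed: " .. err))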
> I patched, and built openresty
I assume you mean with my init_ttl patch?
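For context, that patch (since merged upstream into lua-nginx-module) adds a fourth init_ttl argument to shdict:incr(), so a counter created by incr() expires on its own instead of lingering until LRU eviction. A sketch of what a counter write looks like with it (the key format and function name are illustrative, not Kong's actual scheme):

local shm = ngx.shared.kong_rl  -- the dedicated zone from the patch above

local function increment_counter(key, window_ttl)
  -- init = 0 creates the key when absent; init_ttl = window_ttl makes
  -- the newly created counter expire with its rate-limiting window,
  -- so per-second keys vanish after a second instead of piling up.
  local newval, err = shm:incr(key, 1, 0, window_ttl)
  if not newval then
    ngx.log(ngx.ERR, "could not increment counter: ", err)
  end
  return newval
end

-- e.g. a per-second window:
-- increment_counter("ratelimit:api101:1520963249:second", 1)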
I will respond as soon as possible.
How's it going @davidsiracusa? I experienced a similar issue after upgrading to 0.11.2, and never had one previously in 0.10.x, no matter how much memory we gave Kong. Eventually, we lived with scheduled periodic graceful restarts. Hope you guys can work it out.
@jtuthehien Have you tried the above patch? It is a much more elegant and easier-to-manage workaround. Also, a full shm doesn't _necessarily_ mean things are going bad, as it is used as a caching layer. It also isn't a memory leak.
I think we'll apply the patch of #3124 to master soon, even though I don't like setting the precedent of inserting plugin-specific shms in the configuration by default...
I was waiting for a POC, which was just delivered. I will start as soon as possible.
@thibaultcha Sorry for the delay; I needed to get the POC set up.
I was able to reproduce my original "no memory" issue.
I will run any test you wish, but I first wanted to provide a summary of my own test.
I believe it provides hints.
Test summary:
I've reproduced the same error even with: mem_cache_size = 1024m
I suspect the error may go away after a while, once some of the keys expire (3600 s TTL); however, key saturation will likely occur again. We have 300+ APIs in production, and the "no memory" errors wax and wane.
The error returned by the browser:
{"message":"An unexpected error occurred"}
An error in errors.log:
2018/03/13 17:47:29 [error] 13369#0: *1253753 [lua] responses.lua:107: send_HTTP_INTERNAL_SERVER_ERROR(): failed to get from node cache: could not write lua_shared_dict: no memory, client: 108.52.24.152, server: kong, request: "GET /test101?mock=true HTTP/1.1", host: "ec2-XXX-XX-XX-XXX-XXX.us-west-1.compute.amazonaws.com:8000"
Update: I tried to hit API 101 an hour later; still getting an error.
I will try the patch. If you need detail for the test above, please let me know.
Six hours later, with no activity since the earlier testing, targeting API 101 from a browser still yields a "no memory" error. A Kong restart fixes the issue.
Ok... Why are you _once again_ describing your problem to me? I know what it is. I sent you the patch to fix it. I already explained the rationale behind why and how we could fix it. Why are we still spending time on this issue after a path to resolution was made clear (several times)? I will close this issue now. Please try the patch, and keep in touch with what is happening in this repository (because, as I said maybe a dozen times by now, we are aware and we will fix this).
Thank you for your patience.
I will post back results after the patch has been applied.
The patch should also be applied to kong/plugins/response-ratelimiting/policies/init.lua.
Perhaps constants.DICTS should be updated as well?
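For illustration, the suggested addition would look roughly like this (a hypothetical sketch only; as noted at the end of this thread, the maintainers deliberately left DICTS unchanged for backwards-compatibility reasons):

-- kong/constants.lua (hypothetical change, NOT what shipped):
DICTS = {
  "kong",
  "kong_cache",
  "kong_process_events",
  "kong_cluster_events",
  "kong_healthchecks",
  "kong_rate_limiting_counters",  -- new dedicated counters zone
},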
The patch worked; I agree with your notes.
Thanks for your help.
FYI, 0.13.1 now ships with this patch.
I noticed that DICTS in constants.lua is missing the new dictionary (kong_rate_limiting_counters) in 0.13.1.
@davidsiracusa That is intended for backwards-compatibility reasons.