https://github.com/Kong/kong/issues/3124 was closed.
I'd like to submit a test which demonstrates the memory leak.
@thibaultcha, is this okay?
Best,
Dave
@davidsiracusa Do so, but I certainly hope this isn't about values without TTLs not being LRU-evicted again. If it turns out not to be a memory leak in the plugin, but once again the conflicting shm consumption patterns of the DB cache and the plugin (as I have described _numerous times_), then I will simply point you to the numerous allusions I have already made to the issue, how to fix it, and why it is delicate to resolve as of today.
That said, we are always interested in learning about issues such as memory leaks, and we encourage users to report them.
I want to be careful and not cover already-discussed items. We run a Kong 0.11.x production environment where we have scheduled OpenResty's systemtap ngx-shm tool to periodically report the free memory of the kong_cache zone. Once it gets too low, we perform a graceful restart of the node; without this, we experience "no memory" errors. Whether we select 1024M or 2048M doesn't matter. How quickly free memory burns down to 0K depends on how busy our site is and on the cache size. A Lua dump of the cache reveals the culprit: the offenders are the keys related to rate-limiting (the per-second limits).

I patched and built OpenResty, deployed the nginx binary in staging, and was able to sustain 7k RPS for hours across 300 APIs using an artificially low cache size. A dump of the cache after testing yields only rate-limiting keys (the per-day limits). I'm reluctant to deploy the patched version, since the periodic restarts can be managed. I can reproduce this issue, but if I'm rehashing an existing discussion, I don't want to upset anyone. I only need assurance that init_ttl will land in a future OpenResty; it would be nice to know when this is planned (0.14?).
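For reference, a rough in-process way to take similar measurements without systemtap is a small Lua handler like the sketch below (not Kong code; free_space() assumes OpenResty 1.13.6.1+, where that shdict method exists). Both calls lock the zone, so keep this off hot production paths:

-- Report the kong_cache zone's free space and a sample of its keys,
-- e.g. from a debug-only location's content_by_lua_block.
local shm = ngx.shared.kong_cache

ngx.say("kong_cache free bytes: ", shm:free_space())

-- Sample up to 1024 keys to see what is filling the zone
-- (e.g. the rate-limiting counter keys mentioned above).
for _, key in ipairs(shm:get_keys(1024)) do
  ngx.say(key)
end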
So, I'm not sure how to phrase it anymore, but I will try once again - and maybe I am wrong, and maybe we'll get to fix something :)
> Whether we select 1024M, or 2048M doesn't matter.
Indeed, it should not - if you give more memory to a caching layer, it will get comfortable and consume all of it; such is its nature.
Maybe we should try an experiment, since you are willing to experiment with patches, custom builds, and short-term solutions - which is good news.
Would you try the following:
- Create a new lua_shared_dict dedicated to rate-limiting counters.
- Point the rate-limiting plugin at that new shm instead of kong_cache.

Eventually, the patch should look something like this:
diff --git a/kong/plugins/rate-limiting/policies/init.lua b/kong/plugins/rate-limiting/policies/init.lua
index 86bbd0dd..6f669f59 100644
--- a/kong/plugins/rate-limiting/policies/init.lua
+++ b/kong/plugins/rate-limiting/policies/init.lua
@@ -4,7 +4,7 @@ local redis = require "resty.redis"
local policy_cluster = require "kong.plugins.rate-limiting.policies.cluster"
local reports = require "kong.core.reports"
local ngx_log = ngx.log
-local shm = ngx.shared.kong_cache
+local shm = ngx.shared.kong_rl
local pairs = pairs
local fmt = string.format
diff --git a/kong/templates/nginx_kong.lua b/kong/templates/nginx_kong.lua
index 5639f319..81ea27fe 100644
--- a/kong/templates/nginx_kong.lua
+++ b/kong/templates/nginx_kong.lua
@@ -30,6 +30,7 @@ lua_max_running_timers 4096;
lua_max_pending_timers 16384;
lua_shared_dict kong 5m;
lua_shared_dict kong_cache ${{MEM_CACHE_SIZE}};
+lua_shared_dict kong_rl 5m;
lua_shared_dict kong_process_events 5m;
lua_shared_dict kong_cluster_events 5m;
lua_shared_dict kong_healthchecks 5m;
If you have a custom NGINX template, even better: make the change there instead of in the default template.
Now, what this does is store _only the rate-limiting counters in that shm_. You can keep that shm relatively small (but be aware that if keys get evicted, you'll lose some counters).
Now, because we split the _large_ caching shm from the _busy_ counters shm, you _should_ no longer see those "no memory" errors. Why? Because when the _busy_ shm (counters) is full, its LRU eviction frees a slot of the same size as the newly inserted value (another counter).
The issue with using the same shm for both the cache and rate-limiting (as said numerous times already) is that when the cache evicts a value from the shm, and the evicted value happens to be one of the RL counters (probably one of the per-second limits), the freed space is smaller than the new value to insert (the value from the cache, usually some JSON).
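To make the mismatch concrete, here is a minimal standalone sketch (not Kong code; it assumes a small test zone declared as "lua_shared_dict mixed 1m;") showing how a zone saturated with tiny counters can still refuse a large cache-style write:

local shm = ngx.shared.mixed

-- 1. Saturate the zone with small, counter-sized entries, like the
--    per-second rate-limiting counters.
local i = 0
while true do
  i = i + 1
  -- safe_set() never evicts, so this loop stops exactly when full
  local ok = shm:safe_set("ratelimit:" .. i, i)
  if not ok then break end
end

-- 2. Attempt a large, cache-sized write. set() evicts LRU entries to
--    make room, but each eviction frees only a tiny counter slot and
--    the eviction attempts are bounded, so the write can still fail
--    with the very "no memory" error Kong's DB cache reports.
local big_json = string.rep("x", 16 * 1024)
local ok, err = shm:set("cache:entity", big_json)
ngx.log(ngx.NOTICE, "large insert: ", ok and "ok" or ("failed: " .. err))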
> I patched, and built openresty
I assume you mean with my init_ttl patch?
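For context, that patch (since merged upstream into lua-nginx-module) adds a fourth init_ttl argument to shdict:incr(), so a counter created by incr() expires on its own instead of lingering until LRU eviction. A sketch of what a counter write looks like with it (the key format and function name are illustrative, not Kong's actual scheme):

local shm = ngx.shared.kong_rl  -- the dedicated zone from the patch above

local function increment_counter(key, window_ttl)
  -- init = 0 creates the key when absent; init_ttl = window_ttl makes
  -- the newly created counter expire with its rate-limiting window,
  -- so per-second keys vanish after a second instead of piling up.
  local newval, err = shm:incr(key, 1, 0, window_ttl)
  if not newval then
    ngx.log(ngx.ERR, "could not increment counter: ", err)
  end
  return newval
end

-- e.g. a per-second window:
-- increment_counter("ratelimit:api101:1520963249:second", 1)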
I will respond as soon as possible.
How's it going @davidsiracusa? I experienced a similar issue after upgrading to 0.11.2, and never had one previously in 0.10.x, no matter how much memory we gave Kong. Eventually, we lived with scheduled periodic graceful restarts. Hope you guys can work it out.
@jtuthehien Have you tried the above patch? It is a much more elegant and easier-to-manage workaround. Also, a full shm doesn't _necessarily_ mean things are going bad, as it is used as a caching layer. It also isn't a memory leak.
I think we'll apply the patch of #3124 to master soon, even though I don't like setting the precedent of inserting plugin-specific shms in the configuration by default...
I was waiting for a POC, which was just delivered. I will start as soon as possible.
@thibaultcha Sorry for the delay; I needed to get the POC set up.
I was able to reproduce my original "no memory" issue.
I will run any test you wish, but I first wanted to provide a summary of my own test.
I believe it provides hints.
Test summary:
I've reproduced the same error even with: mem_cache_size = 1024m
I suspect the error may go away after a while, once some of the keys expire (3600 s TTL); however, key saturation will likely occur again. We have 300+ APIs in production, and the "no memory" errors wax and wane.
The error returned by the browser:
{"message":"An unexpected error occurred"}
An error in errors.log:
2018/03/13 17:47:29 [error] 13369#0: *1253753 [lua] responses.lua:107: send_HTTP_INTERNAL_SERVER_ERROR(): failed to get from node cache: could not write lua_shared_dict: no memory, client: 108.52.24.152, server: kong, request: "GET /test101?mock=true HTTP/1.1", host: "ec2-XXX-XX-XX-XXX-XXX.us-west-1.compute.amazonaws.com:8000"
Update: I tried to hit API 101 an hour later; still getting an error.
I will try the patch. If you need detail for the test above, please let me know.
Six hours later, with no activity since the earlier testing, targeting API 101 from a browser still yields a "no memory" error. A Kong restart fixes the issue.
Ok... Why are you _once again_ describing your problem to me? I know what it is. I sent you the patch to fix it. I already explained the rationale behind why and how we could fix it. Why are we still spending time on this issue after a path to resolution was made clear (several times)? I will close this issue now. Please try the patch, and keep in touch with what is happening in this repository (because, as I said maybe a dozen times by now, we are aware and we will fix this).
Thank you for your patience.
I will post back results after the patch has been applied.
The patch should also be applied to kong/plugins/response-ratelimiting/policies/init.lua.
Perhaps constants.DICTS should be updated as well?
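For illustration, the suggested addition would look roughly like this (a hypothetical sketch only; as noted at the end of this thread, the maintainers deliberately left DICTS unchanged for backwards-compatibility reasons):

-- kong/constants.lua (hypothetical change, NOT what shipped):
DICTS = {
  "kong",
  "kong_cache",
  "kong_process_events",
  "kong_cluster_events",
  "kong_healthchecks",
  "kong_rate_limiting_counters",  -- new dedicated counters zone
},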
The patch worked; I agree with your notes.
Thanks for your help.
FYI, 0.13.1 now ships with this patch.
I noticed that DICTS in constants.lua is missing the new dictionary (kong_rate_limiting_counters) in 0.13.1.
@davidsiracusa That is intended for backwards-compatibility reasons.