Hi,
I am having a problem with Lua scripts. When using the envoy.lua HTTP filter, Envoy allocates memory after each request but does not free it. In my case, after around 920K requests, 1 GB is allocated and Envoy crashes.
This happens even when I just use the respond function:
function envoy_on_request(request_handle)
request_handle:respond({[":status"] = "200"})
end
But it does not happen when I use print:
function envoy_on_request(request_handle)
print("hello")
end
Since I am using a Docker image, I don't have a stack trace of this error, but here is the full console output:
envoy_output.txt
I have also tried calling the collectgarbage() function, but it doesn't help; it just slows down the whole of Envoy.
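What I tried was roughly this (a minimal sketch of the kind of script I used):
function envoy_on_request(request_handle)
  -- force a full GC cycle on every request; this did not stop the growth,
  -- it only slowed everything down
  collectgarbage("collect")
  request_handle:respond({[":status"] = "200"})
end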
@michalholecek thanks for the report. I haven't heard of this issue before. You should be able to use the tcmalloc heap profiler to get some heap dumps and see what is going on: https://gperftools.github.io/gperftools/heapprofile.html
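Something along these lines should work, assuming your binary is linked against tcmalloc (the paths here are just examples):
$ HEAPPROFILE=/tmp/envoy.hprof /usr/local/bin/envoy -c config/envoy.json
$ # after driving some traffic, inspect one of the numbered dumps with pprof from gperftools
$ pprof --text /usr/local/bin/envoy /tmp/envoy.hprof.0001.heap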
@mattklein123
I reproduced this bug in my environment. Envoy panics when the number of requests approaches around 900k.
$ HEAPCHECK=normal /usr/local/bin/envoy -c config/envoy.json
...
[2018-06-13 04:50:12.666][12177][error][lua] source/common/http/filter/lua/lua_filter.cc:509] script log: test
PANIC: unprotected error in call to Lua API (not enough memory)
[2018-06-13 04:50:12.736][12178][error][lua] source/common/http/filter/lua/lua_filter.cc:509] script log: test
PANIC: unprotected error in call to Lua API (not enough memory)
Thread finding failed with -1 errno=1
Could not find thread stacks. Will likely report false leak positives.
Have memory regions w/o callers: might report false leaks
Leak check _main_ detected leaks of 34375808 bytes in 1074244 objects
The 2 largest leaks:
Using local file /usr/local/bin/envoy.
Leak of 30805792 bytes in 962681 objects allocated from:
@ 4b5dff free
@ 4b125f free
@ 6963d4 free
@ 6973ff free
@ 6a605a free
@ 6a24fe free
@ 6a7fa0 free
@ 6a3c82 free
@ 6a5dfb free
@ 693da0 free
@ 567d97 free
@ 5666ff free
@ 566f1e free
@ 5602d8 free
@ 8a0422 free
@ 8a0b7f free
@ 55bab6 free
@ 8a9bae free
@ 7fd6ce0796ba start_thread
Leak of 3570016 bytes in 111563 objects allocated from:
@ 4b5dff free
@ 4b125f free
@ 6966e0 free
@ 4aca20 free
@ 4adee7 free
@ 4ce159 free
$ pprof --stacks /usr/local/bin/envoy envoy.hprof.2_11705.0020.heap
...
(pprof) top
Total: 36.3 MB
Leak of 30738624 bytes in 960582 objects allocated from:
@ 004b5dff unknown
@ 00000000004b125e free ??:0
@ 00000000006963d3 free ??:0
@ 00000000006973fe free ??:0
@ 00000000006a6059 free ??:0
@ 00000000006a24fd free ??:0
@ 00000000006a7f9f free ??:0
@ 00000000006a3c81 free ??:0
@ 00000000006a5dfa free ??:0
@ 0000000000693d9f free ??:0
@ 0000000000567d96 free ??:0
@ 00000000005666fe free ??:0
@ 0000000000566f1d free ??:0
@ 00000000005602d7 free ??:0
@ 00000000008a0421 free ??:0
@ 00000000008a0b7e free ??:0
@ 000000000055bab5 free ??:0
@ 00000000008a9bad free ??:0
@ 00007f65728646b9 start_thread ??:0
Leak of 4123296 bytes in 128853 objects allocated from:
@ 004b5dff unknown
@ 00000000004b125e free ??:0
@ 00000000006966df free ??:0
@ 00000000004aca1f free ??:0
@ 00000000004adee6 free ??:0
@ 00000000004ce158 free ??:0
Leak of 219751 bytes in 3290 objects allocated from:
@ 00b69229 unknown
Leak of 98304 bytes in 3 objects allocated from:
@ 00b4b8bb unknown
Leak of 81920 bytes in 10 objects allocated from:
@ 00890f53 unknown
@ 00000000008940fb free ??:0
@ 0000000000896242 free ??:0
@ 00000000008902ac free ??:0
@ 00000000007fc5ce free ??:0
@ 00000000006a3ac4 free ??:0
@ 00000000006a4156 free ??:0
@ 0000000000696a17 free ??:0
@ 00000000004aca1f free ??:0
@ 00000000004adee6 free ??:0
@ 00000000004ce158 free ??:0
Leak of 74256 bytes in 1 objects allocated from:
@ 005b770f unknown
@ 000000000056c9d5 free ??:0
@ 000000000056017e free ??:0
@ 000000000054b00f free ??:0
@ 000000000045f665 free ??:0
@ 000000000045f94f free ??:0
@ 000000000040844b free ??:0
@ 00007f65721aa82f __libc_start_main ??:0
...
Maybe I did something wrong when using the profiler; the result does not contain the exact function names.
@lwhile This seems to be a lack of symbol information; you can build Envoy with bazel -c dbg to get the symbols.
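For example, something like this (assuming a stock Envoy checkout):
$ bazel build -c dbg //source/exe:envoy-static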
Hi, sorry for the late response, here is my heap profile:
heap_profile_debug_38.txt
Here is also a heap profile created using the non-debug Envoy binary, which shows the leak in "spdlog::details::os::thread_id::tid":
heap_profile_non_debug_38.txt
@zyfjeff Thanks for your help.
PANIC: unprotected error in call to Lua API (not enough memory)
Thread finding failed with -1 errno=1
Could not find thread stacks. Will likely report false leak positives.
Have memory regions w/o callers: might report false leaks
Leak check _main_ detected leaks of 17614432 bytes in 550451 objects
The 2 largest leaks:
Using local file ./envoy-static-dbg.
Leak of 14306944 bytes in 447092 objects allocated from:
@ 5536a2 Envoy::Extensions::Filters::Common::Lua::ThreadLocalState::createCoroutine
@ 54201b Envoy::Extensions::HttpFilters::Lua::FilterConfig::createCoroutine
@ 53f128 Envoy::Extensions::HttpFilters::Lua::Filter::doHeaders
@ 5420f4 Envoy::Extensions::HttpFilters::Lua::Filter::decodeHeaders
@ b9dbc3 Envoy::Http::ConnectionManagerImpl::ActiveStreamDecoderFilter::decodeHeaders
@ b95d31 Envoy::Http::ConnectionManagerImpl::ActiveStream::decodeHeaders
@ b95583 Envoy::Http::ConnectionManagerImpl::ActiveStream::decodeHeaders
@ c2544a
Leak of 3307488 bytes in 103359 objects allocated from:
@ 5536a2 Envoy::Extensions::Filters::Common::Lua::ThreadLocalState::createCoroutine
@ 54201b Envoy::Extensions::HttpFilters::Lua::FilterConfig::createCoroutine
@ 53f128 Envoy::Extensions::HttpFilters::Lua::Filter::doHeaders
@ 5421f0 Envoy::Extensions::HttpFilters::Lua::Filter::encodeHeaders
@ b97420 Envoy::Http::ConnectionManagerImpl::ActiveStream::encodeHeaders
@ b998da Envoy::Http::ConnectionManagerImpl::ActiveStreamDecoderFilter::encodeHeaders
@ 53f689 Envoy::Extensions::HttpFilters::Lua::Filter::DecoderCallbacks::respond
@michalholecek what OS are you running on? I think someone else reported this. Can you search old issues?
The heap dumps are not telling me anything. The best way to do it is to grab some differential dumps, and then visualize them using --gv
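Roughly like this, where the two .heap files are consecutive dumps from the same run (file names here are just examples):
$ pprof --gv --base=/tmp/envoy.hprof.0001.heap /usr/local/bin/envoy /tmp/envoy.hprof.0005.heap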
@mattklein123
I am using Ubuntu 18.04.
By the old issue you probably mean #1684, but that problem disappeared in Envoy 1.4 and it probably wasn't Lua-related.
I have made some differential dumps and here is the visualization:
profile.pdf
@michalholecek can you provide some full /stats dumps as the "leak" is happening? It's unclear to me if there is an actual leak here or if there are just a bunch of long lived requests/streams.
@mattklein123 here are the stats just before the envoy crash:
stats_last.txt
And here is mmap_log.txt, the mmap log obtained using the Google heap profiler. The original log was 8K lines long and every line was an mmap allocation; munmap calls started only after this Lua error:
PANIC: unprotected error in call to Lua API (not enough memory)
At the end of this file, Envoy crashes.
@michalholecek sorry, not sure. Are you somehow saving Lua variables into global variables such that they never get garbage collected? I would try simplifying your script a lot and see if you still have issues (e.g., make it a NOP).
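i.e. something as small as:
function envoy_on_request(request_handle)
  -- intentionally empty
end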
@mattklein123 I have partially solved this issue. When I was changing or adding headers, the error was caused by not having enough file descriptors.
But the other part is caused by Lua not freeing memory in two cases (minimal sketches of both are at the end of this comment):
1) when the request_handle:respond() function is called
e.g. script1.lua.txt
This does not cause Lua to log an error; it just allocates 1 KB of memory and does not free it.
Stats and visualized differential dumps: stats1.txt, gv1.pdf.
2) when Lua gets a bad argument
e.g. script2.lua.txt
This causes Lua to log an error; it also allocates 1 KB of memory per request and does not free it.
Log:
[2018-06-18 14:43:53.826][17303][error][lua] source/extensions/filters/http/lua/lua_filter.cc:512] script log: script2.lua:3: bad argument #2 to 'replace' (string expected, got nil)
Stats and visualized differential dumps: stats2.txt, gv2.pdf.
In the end, both of these issues cause a segmentation fault.
In those scripts I don't use global variables, and NOP scripts work fine.
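For illustration, the two cases look roughly like this (these are not the exact attached scripts, just minimal sketches; the header names are made up):
-- case 1: only respond(); no error is logged, but memory is never freed
function envoy_on_request(request_handle)
  request_handle:respond({[":status"] = "200"}, "ok")
end
-- case 2: a bad argument; "string expected, got nil" is logged, and memory is still not freed
function envoy_on_request(request_handle)
  local value = request_handle:headers():get("x-missing-header")
  request_handle:headers():replace("x-copied-header", value) -- value is nil here
end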
@michalholecek I'm not sure what is going on here, but we have tests for both of the cases above and there are no leaks that I know of. If you can create a self-contained repro that shows the leak I can take a look.
@mattklein123 here is the repo: https://github.com/michalholecek/envoy-lua-leak
I'm marking this help wanted in case someone else wants to take a look. I'm not sure when I will get to this but will look eventually if no one else does first.
We use Lua filters with request_handle:respond extensively and thus have been getting a lot of Envoy segfaults in our fleet due to this issue. To do more debugging, I integrated the https://github.com/yaukeywang/LuaMemorySnapshotDump tool into one of our Lua filters and took snapshots before and after doing a large number of requests that execute request_handle:respond.
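The integration was along these lines (a rough sketch; the module name and the m_cMethods.DumpMemorySnapshot entry point reflect my reading of that tool's README, the trigger header is made up, and this assumes the module is on the filter's package.path):
local mri = require("MemoryReferenceInfo")
function envoy_on_request(request_handle)
  -- dump a snapshot of everything reachable from the Lua registry, triggered by a
  -- magic header so we can take one dump before and one after the load test
  if request_handle:headers():get("x-dump-lua-memory") then
    mri.m_cMethods.DumpMemorySnapshot("/tmp/", "lua-snapshot", -1)
  end
  request_handle:respond({[":status"] = "200"})
end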
The results suggest that request_handle:respond is leaking coroutines; for instance, I see thousands of lines like this in the output after the test:
thread: 0x5dd4fa68 registry.100354 1
thread: 0x4fc6f548 registry.39237 1
thread: 0x58c08260 registry.78750 1
thread: 0x477f20f8 registry.10927 1
thread: 0x52589b30 registry.50757 1
thread: 0x58c0aba8 registry.78763 1
thread: 0x58c0f768 registry.78771 1
thread: 0x58c14d78 registry.78770 1
thread: 0x477f7090 registry.1141 1
thread: 0x4a7fefe0 registry.55633 1
(the format is [address] [name] [num refs])
Looking into the implementation of request_handle:respond in the Envoy code, it looks like it's doing a yield to suspend execution and prevent code after the call from being run: https://github.com/envoyproxy/envoy/blob/fe945288faeb0fc9a0a3a31500cb726311663888/source/extensions/filters/http/lua/lua_filter.cc#L134.
I think what might be happening here is that because the coroutine is suspended indefinitely, it's not being garbage collected later on. I did some testing in a local lua interpreter by creating a bunch of suspended coroutines, and the dump from https://github.com/yaukeywang/LuaMemorySnapshotDump looked very similar.
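The standalone test was along these lines (a simplified sketch; the table here stands in for the references that Envoy's registry holds on each coroutine):
local anchored = {}
for i = 1, 100000 do
  local co = coroutine.create(function()
    coroutine.yield() -- suspend and never resume, like respond() effectively does
  end)
  coroutine.resume(co) -- runs up to the yield, leaving the coroutine suspended
  anchored[i] = co     -- keep a reference so the GC cannot reclaim it
end
collectgarbage("collect")
print(collectgarbage("count") .. " KB still in use") -- stays large while the references are held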
As a fix, I tried changing https://github.com/envoyproxy/envoy/blob/fe945288faeb0fc9a0a3a31500cb726311663888/source/extensions/filters/http/lua/lua_filter.cc#L134 to just return 0;, which stopped the leak. The problem with this approach, though, is that I think it slightly changes the behavior of request_handle:respond because the code after the call might be executed. However, users can get the old behavior by doing a request_handle:respond followed by a return, I think.
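In other words, with the return 0; change a script that must not run anything after the call would need an explicit return (just an illustration; the path check is made up):
function envoy_on_request(request_handle)
  if request_handle:headers():get(":path") == "/denied" then
    request_handle:respond({[":status"] = "403"}, "denied")
    return -- stop here explicitly, since respond() would no longer suspend the coroutine
  end
  -- code here (e.g. header rewrites for allowed requests) still runs as before
end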
Does this seem plausible? If so, any other ideas for fixing?
@yolken-stripe thanks a ton for digging into this. Let me page this code in and I will get back to you. cc @dio since he might also be interested in looking at this.
@yolken-stripe this is not a real fix but this will probably fix your issue:
index d4db822..d949fcd 100644
--- a/source/extensions/filters/http/lua/lua_filter.h
+++ b/source/extensions/filters/http/lua/lua_filter.h
@@ -124,6 +124,7 @@ public:
http_request_->cancel();
http_request_ = nullptr;
}
+ coroutine_.reset();
There is a circular reference which is preventing yielded coroutines from getting cleaned up. I will think about a real fix tomorrow. Figuring out the right way to test this is also complicated.
@mattklein123 thanks for the quick response! I tried the patch above, but it caused a segfault when I processed a request via our Lua filters:
[2018-07-13 17:50:37.422058] [2018-07-13 17:50:37.421][29033][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:125] Caught Segmentation fault, suspect faulting address 0xbcd4c2d7
[2018-07-13 17:50:37.422241] [2018-07-13 17:50:37.422][29033][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:94] Backtrace thr<29033> obj</pay/jenkins-artifacts/envoy/git-7b3935ae3f67481b73e7c46018f188155323dffb/envoy-stripe> (If unsymbolized, use tools/stack_decode.py):
[2018-07-13 17:50:37.423676] [2018-07-13 17:50:37.423][29033][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<29033> #0 0x5181c4 lua_rawgeti
[2018-07-13 17:50:37.424930] [2018-07-13 17:50:37.424][29033][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<29033> #1 0x526322 luaL_unref
[2018-07-13 17:50:37.426092] [2018-07-13 17:50:37.426][29033][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<29033> #2 0x50cf16 Envoy::Extensions::HttpFilters::Lua::Filter::~Filter()
[2018-07-13 17:50:37.427309] [2018-07-13 17:50:37.427][29033][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<29033> #3 0x8316c9 Envoy::Http::ConnectionManagerImpl::ActiveStreamDecoderFilter::~ActiveStreamDecoderFilter()
[2018-07-13 17:50:37.428487] [2018-07-13 17:50:37.428][29033][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<29033> #4 0x8366a1 std::__cxx11::_List_base<>::_M_clear()
[2018-07-13 17:50:37.429700] [2018-07-13 17:50:37.429][29033][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<29033> #5 0x83b5c8 Envoy::Http::ConnectionManagerImpl::ActiveStream::~ActiveStream()
[2018-07-13 17:50:37.430877] [2018-07-13 17:50:37.430][29033][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<29033> #6 0x83b720 Envoy::Http::ConnectionManagerImpl::ActiveStream::~ActiveStream()
[2018-07-13 17:50:37.432059] [2018-07-13 17:50:37.431][29033][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<29033> #7 0x6c09ec Envoy::Event::DispatcherImpl::clearDeferredDeleteList()
[2018-07-13 17:50:37.433251] [2018-07-13 17:50:37.433][29033][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<29033> #8 0x9ddab7 event_process_active_single_queue.isra.29
[2018-07-13 17:50:37.434422] [2018-07-13 17:50:37.434][29033][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<29033> #9 0x9de01e event_base_loop
[2018-07-13 17:50:37.437323] [2018-07-13 17:50:37.435][29033][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<29033> #10 0x6bc5a5 Envoy::Server::WorkerImpl::threadRoutine()
[2018-07-13 17:50:37.437361] [2018-07-13 17:50:37.437][29033][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<29033> #11 0x9e786d Envoy::Thread::Thread::Thread()::{lambda()#1}::_FUN()
[2018-07-13 17:50:37.437393] [2018-07-13 17:50:37.437][29033][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:104] thr<29033> obj</lib/x86_64-linux-gnu/libpthread.so.0>
[2018-07-13 17:50:37.437430] [2018-07-13 17:50:37.437][29033][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<29033> #12 0x7f251489a6b9 start_thread
[2018-07-13 17:50:37.437488] [2018-07-13 17:50:37.437][29033][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:104] thr<29033> obj</lib/x86_64-linux-gnu/libc.so.6>
[2018-07-13 17:50:37.437529] [2018-07-13 17:50:37.437][29033][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:117] thr<29033> #13 0x7f25142c741c (unknown)
[2018-07-13 17:50:37.437580] [2018-07-13 17:50:37.437][29033][critical][backtrace] bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:121] end backtrace thread 29033
As a temporary fix for us, we were thinking of just using the return 0; change mentioned above. Besides the change in behavior (which may make it inappropriate for general release), can you think of any technical problems with doing this?
As long as you do nothing after you call respond it should be safe. I have a real fix though that I'm going to post in a couple of minutes that you can test if you want.
Oh, nice. Will give your updated fix a try and let you know!
@mattklein123 tried out your fix, and seems to solve the problem. Thanks again for the quick response!
@yolken-stripe great thanks for the verification.
Thanks, @mattklein123.
That is an interesting investigation into how we hold refs to the underlying coroutines.