While debugging #3332, I found a memory leak with `spawn`. I'm using this timeout function:
```crystal
def with_timeout(timeout = 60.0, &block)
  channel = Channel(Nil).new

  spawn do
    block.call
    begin
      channel.send(nil)
    rescue Channel::ClosedError
    end
  end

  spawn do
    break_at = Time.now + timeout.seconds
    loop do
      sleep 1.0
      break if Time.now > break_at || channel.closed?
    end
    unless channel.closed?
      begin
        channel.send(nil)
      rescue Channel::ClosedError
      end
    end
  end

  channel.receive
  channel.close
end
```
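As an editorial aside: newer Crystal versions can express the same idea without the polling fiber, using a `select` expression with a `timeout` clause. The sketch below is a hypothetical replacement, not code from this report (the parameter is renamed to `limit` to avoid shadowing the `timeout` clause):

```crystal
# Sketch: timeout via select/timeout (requires a Crystal version that
# supports timeout clauses in select). If the deadline fires first, the
# worker fiber simply keeps running in the background.
def with_timeout(limit = 60.0, &block)
  done = Channel(Nil).new(1) # buffered, so the worker never blocks on send

  spawn do
    block.call
    done.send(nil)
  end

  select
  when done.receive
    # block finished in time
  when timeout(limit.seconds)
    # deadline reached; give up waiting
  end
end
```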
```crystal
require "http/server"

server = HTTP::Server.new "0.0.0.0", 8080 do |context|
  context.response.headers["Content-Type"] = "text/plain"
  str = ""
  with_timeout do
    str += "a" * 10000
    sleep rand(0.5)
  end
  context.response.print(str)
end

spawn do
  loop do
    p `ps -p #{Process.pid} -o rss`
    sleep 30
  end
end

puts "Listening on http://0.0.0.0:8080"
server.listen
```
On OS X:

```
ab -n 50000 -c 50 http://127.0.0.1:8080/
```

RSS: 695 MB.

If I remove `with_timeout`:

```
ab -n 50000 -c 50 http://127.0.0.1:8080/
```

RSS: 150 MB.
I don't see any piece of code that could leak here. Btw, 150 MB is also quite large.
Maybe related: https://github.com/crystal-lang/crystal/issues/3061
@kostya You are right, there seems to be a memory leak. This short program reproduces it, memory keeps growing, slowly but steadily:
```crystal
loop do
  spawn do
    sleep 0.1
  end
  sleep 0.005
end

puts "Running!"
sleep
```
I've been looking at the code and I see that each Fiber has a @resume_event that's created on sleep, and when the fiber finishes running it's freed. However, fibers are not immediately reclaimed: they are in a pool and are reused. So if a fiber is reused the @resume_event will still be used, which is a bit suspicious... I tried setting @resume_event to nil when the fiber finished running, but that didn't fix it, so we'll need to investigate it a bit more, but there's definitely a leak somewhere.
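To illustrate the general hazard described above (this is a generic sketch with a hypothetical `Worker` type and `POOL` constant, not Crystal's actual `Fiber` internals): a pool that reuses instances keeps every reference stored on a pooled object reachable, so the GC cannot reclaim what those references point to unless they are cleared before the object returns to the pool.

```crystal
# Generic sketch of pool-induced retention (hypothetical names,
# not Fiber internals).
class Worker
  property payload : Bytes?
end

POOL = [] of Worker

def run_job
  worker = POOL.pop? || Worker.new
  worker.payload = Bytes.new(1024 * 1024) # 1 MiB scratch buffer
  # ... do the work ...
  # Without this line, the buffer stays reachable through the pool forever:
  # worker.payload = nil
  POOL << worker
end
```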
On Linux this small loop doesn't leak; it stays at 1.6 MB. But my first program also leaks on Linux, reaching 705 MB:

```
ab -n 150000 -c 50 http://127.0.0.1:8080/
```

```
Listening on http://0.0.0.0:8080
" RSS\n 2820\n"
" RSS\n87900\n"
" RSS\n178400\n"
" RSS\n263892\n"
" RSS\n338948\n"
" RSS\n416864\n"
" RSS\n499876\n"
" RSS\n576536\n"
" RSS\n655924\n"
" RSS\n731980\n"
" RSS\n813068\n"
" RSS\n892084\n"
" RSS\n963960\n"
" RSS\n1038896\n"
" RSS\n1116220\n"
" RSS\n1197416\n"
" RSS\n1272220\n"
" RSS\n1352668\n"
" RSS\n1430256\n"
" RSS\n1507112\n"
" RSS\n1590900\n"
" RSS\n1668196\n"
" RSS\n1750616\n"
" RSS\n1825152\n"
" RSS\n1902296\n"
" RSS\n1984748\n"
" RSS\n2000984\n"
```

and it seems to stop growing at 2032600.
Btw, is this division always a float division now? https://github.com/crystal-lang/crystal/blob/91020b31b6829126fce7c458926e628812b4b553/src/fiber.cr#L89
@ozra #2968 hasn't been merged (it was not accepted _yet_). The only recent change related to division is the _floored_ one.
Alright. Sorry for sliding OT.
@kostya What happens if you put a GC.collect right at the end of the server's block? That is, a collect after each request.
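The suggested experiment looks roughly like this (a sketch using the current `HTTP::Server` API with `bind_tcp`, which postdates the positional-argument API in the report; the handler body is reduced to a stub):

```crystal
require "http/server"

# Sketch: force a full collection at the end of every request handler.
server = HTTP::Server.new do |context|
  context.response.content_type = "text/plain"
  context.response.print "ok"
  GC.collect # collect after each request instead of waiting for the GC's own schedule
end

address = server.bind_tcp "127.0.0.1", 0 # port 0 = pick any free port
spawn { server.listen }
```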
Yes, with `GC.collect` memory stabilizes at 136 MB.
So I don't think it's a leak on our side; it's just a matter of when the GC chooses to free memory. Maybe we need to tune the GC, and/or eventually write our own.
I have 5 daemons that have been running for 24 hours, listening on HTTP and parsing HTML; each averages 20% CPU usage. They should not leak memory (the timeout function above is disabled, there is no shared state, and myhtml frees all its memory after each parse).
After start, for ~2 hours, they used < 100 MB.
After 24 hours their memory usage is: min 210 MB, max 870 MB. GC stats for the 210 MB process: GC::Stats(@collections=257517, @bytes_found=438208); for the 870 MB process: GC::Stats(@collections=257560, @bytes_found=484704).
And at some point (I don't know exactly when) the 870 MB process went crazy: its CPU usage is ~370%, while the other daemons are at 25% (all processing an equal amount of tasks).
Is there any way I can debug this 870 MB process to find out what is using the memory and why CPU usage is so high? When I stop processing, CPU usage is 0%, but when I start it again it's > 300%; strace on this process doesn't show anything strange.
ltrace for 5 seconds shows:

```
% time     seconds  usecs/call     calls function
------ ----------- ----------- --------- --------------------
 69.21   12.874828         182     70689 memcmp
 15.14    2.817164         237     11838 memset
 14.68    2.729765         253     10775 memcpy
  0.22    0.040995          68       600 free
  0.17    0.030980          68       449 memmove
  0.12    0.022489          67       335 malloc
```
Hm, that GC::Stats isn't very useful; I think it was added in a PR but doesn't provide much information. There's another function you can use:

```crystal
LibGC.get_heap_usage_safe(out heap_size, out free_bytes, out unmapped_bytes, out bytes_since_gc, out total_bytes)
pp heap_size, free_bytes, unmapped_bytes, bytes_since_gc, total_bytes
```

I don't know what to do with that info afterwards, but it can be useful to see whether the GC heap is growing a lot.
A program can also be run with `GC_PRINT_STATS=1`, and you'll get some GC stats printed on the console.
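For reference, recent Crystal versions expose these same Boehm counters directly through `GC.stats` (this API postdates the discussion above):

```crystal
# Sketch: reading heap statistics via the current GC.stats API.
stats = GC.stats
puts "heap size:      #{stats.heap_size}"
puts "free bytes:     #{stats.free_bytes}"
puts "unmapped bytes: #{stats.unmapped_bytes}"
puts "since last GC:  #{stats.bytes_since_gc}"
puts "total bytes:    #{stats.total_bytes}"
```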
Maybe you can also try profiling the program to see where those memcmp calls come from, so we can learn where all that CPU activity is coming from. At least on Mac there's Instruments, which can show a profiled call trace; I guess there's a similar tool for Linux but I don't know it.
It would also be useful to know which shards you are using (I guess myhtml?) to see whether the CPU usage comes from any of those shards or from the standard library. Or maybe provide some similar code that we could use to reproduce it.
I use kemal, msgpack, myhtml and redis.
But it should not use more than 100% CPU, because parsing is single-threaded (myhtml is compiled without pthreads) and blocking.
Btw, does Crystal create 15 threads per process by default?
For the crazy process, the main thread has used 1.5 h of CPU time, and every other thread 16 min; for an OK process it's 40 min, and 1 min for its threads.
The Boehm GC always creates many threads.
I solved this leak just by creating one coroutine with https://github.com/kostya/timeouter
I think this problem is still present (asterite's example grows monotonically for me with git master); care to reopen it? I also think it's possibly related to #3983 (i.e. causing other problems :|).
Wait till it "blows up" on Linux, then report:
OK, my latest experiment, with this:

```crystal
spawn do
  loop do
    GC.collect
    sleep 0.05
  end
end

loop do
  spawn do
    sleep 0.001
  end
  sleep 0.00005
end
```

It "stabilizes" for me at about 130 MB in use (OK, actually on Linux it got up to 320 MB RSS after a few days, but at least it's not leaking quickly; as a note, calls to GC.collect seem to get much more expensive the more memory is in use).
If I run this instead:

```crystal
loop do
  spawn do
    sleep 0.001
  end
  sleep 0.00005
  GC.collect
end
```

it runs at about 1/3 the speed but uses only about 7 MB (on Linux, 20 MB after a few days).
Without any GC.collect it goes up monotonically to like 3.2GB used when I stopped it.
Valgrind reports:
```
==27653== HEAP SUMMARY:
==27653==     in use at exit: 11,576 bytes in 21 blocks
==27653==   total heap usage: 307,190 allocs, 307,169 frees, 41,787,672 bytes allocated
==27653==
==27653== LEAK SUMMARY:
==27653==    definitely lost: 8,192 bytes in 1 blocks
==27653==    indirectly lost: 0 bytes in 0 blocks
==27653==      possibly lost: 288 bytes in 1 blocks
==27653==    still reachable: 3,096 bytes in 19 blocks
==27653==         suppressed: 0 bytes in 0 blocks
==27653== Rerun with --leak-check=full to see details of leaked memory
```
This seems normal, so it "feels" as if the problem is not so much a RAM leak as a reclamation issue, or "somebody keeping a handle to old stuff".
So I'm starting to wonder whether it's related to GC.collect not being called often enough; I guess we should continue the conversation in #3997.