While debugging #3332, I found a memory leak with `spawn`. I'm using this timeout function:
```crystal
def with_timeout(timeout = 60.0, &block)
  channel = Channel(Nil).new

  spawn do
    block.call
    begin
      channel.send(nil)
    rescue Channel::ClosedError
    end
  end

  spawn do
    break_at = Time.now + timeout.seconds
    loop do
      sleep 1.0
      break if Time.now > break_at || channel.closed?
    end
    unless channel.closed?
      begin
        channel.send(nil)
      rescue Channel::ClosedError
      end
    end
  end

  channel.receive
  channel.close
end
```
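As an editorial aside: newer Crystal versions can express the same idea without the polling fiber, using a `select` expression with a `timeout` clause. The sketch below is a hypothetical replacement, not code from this report (the parameter is renamed to `limit` to avoid shadowing the `timeout` clause):

```crystal
# Sketch: timeout via select/timeout (requires a Crystal version that
# supports timeout clauses in select). If the deadline fires first, the
# worker fiber simply keeps running in the background.
def with_timeout(limit = 60.0, &block)
  done = Channel(Nil).new(1) # buffered, so the worker never blocks on send

  spawn do
    block.call
    done.send(nil)
  end

  select
  when done.receive
    # block finished in time
  when timeout(limit.seconds)
    # deadline reached; give up waiting
  end
end
```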
```crystal
require "http/server"

server = HTTP::Server.new "0.0.0.0", 8080 do |context|
  context.response.headers["Content-Type"] = "text/plain"
  str = ""
  with_timeout do
    str += "a" * 10000
    sleep rand(0.5)
  end
  context.response.print(str)
end

spawn do
  loop do
    p `ps -p #{Process.pid} -o rss`
    sleep 30
  end
end

puts "Listening on http://0.0.0.0:8080"
server.listen
```
On OS X:

```
ab -n 50000 -c 50 http://127.0.0.1:8080/
```

RSS: 695 MB.

If I remove `with_timeout`:

```
ab -n 50000 -c 50 http://127.0.0.1:8080/
```

RSS: 150 MB.
I don't see any piece of code that could leak here. Btw, 150 MB is also quite large.
Maybe related: https://github.com/crystal-lang/crystal/issues/3061
@kostya You are right, there seems to be a memory leak. This short program reproduces it, memory keeps growing, slowly but steadily:
```crystal
loop do
  spawn do
    sleep 0.1
  end
  sleep 0.005
end

puts "Running!"
sleep
```
I've been looking at the code and I see that each Fiber has a @resume_event that's created on sleep, and when the fiber finishes running it's freed. However, fibers are not immediately reclaimed: they are in a pool and are reused. So if a fiber is reused the @resume_event will still be used, which is a bit suspicious... I tried setting @resume_event to nil when the fiber finished running, but that didn't fix it, so we'll need to investigate it a bit more, but there's definitely a leak somewhere.
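To illustrate the general hazard described above (this is a generic sketch with a hypothetical `Worker` type and `POOL` constant, not Crystal's actual `Fiber` internals): a pool that reuses instances keeps every reference stored on a pooled object reachable, so the GC cannot reclaim what those references point to unless they are cleared before the object returns to the pool.

```crystal
# Generic sketch of pool-induced retention (hypothetical names,
# not Fiber internals).
class Worker
  property payload : Bytes?
end

POOL = [] of Worker

def run_job
  worker = POOL.pop? || Worker.new
  worker.payload = Bytes.new(1024 * 1024) # 1 MiB scratch buffer
  # ... do the work ...
  # Without this line, the buffer stays reachable through the pool forever:
  # worker.payload = nil
  POOL << worker
end
```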
On Linux this small loop doesn't leak; it stays at 1.6 MB. But my first program also leaks on Linux, reaching 705 MB:

```
ab -n 150000 -c 50 http://127.0.0.1:8080/
```

```
Listening on http://0.0.0.0:8080
" RSS\n 2820\n"
" RSS\n87900\n"
" RSS\n178400\n"
" RSS\n263892\n"
" RSS\n338948\n"
" RSS\n416864\n"
" RSS\n499876\n"
" RSS\n576536\n"
" RSS\n655924\n"
" RSS\n731980\n"
" RSS\n813068\n"
" RSS\n892084\n"
" RSS\n963960\n"
" RSS\n1038896\n"
" RSS\n1116220\n"
" RSS\n1197416\n"
" RSS\n1272220\n"
" RSS\n1352668\n"
" RSS\n1430256\n"
" RSS\n1507112\n"
" RSS\n1590900\n"
" RSS\n1668196\n"
" RSS\n1750616\n"
" RSS\n1825152\n"
" RSS\n1902296\n"
" RSS\n1984748\n"
" RSS\n2000984\n"
```

and it seems to stop growing at 2032600.
Btw, is this division always a float division now? https://github.com/crystal-lang/crystal/blob/91020b31b6829126fce7c458926e628812b4b553/src/fiber.cr#L89
@ozra #2968 hasn't been merged (it was not accepted _yet_). The only recent change related to division is the _floored_ one.
Alright. Sorry for sliding OT.
@kostya What happens if you put a GC.collect right at the end of the server's block? That is, a collect after each request.
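The suggested experiment looks roughly like this (a sketch using the current `HTTP::Server` API with `bind_tcp`, which postdates the positional-argument API in the report; the handler body is reduced to a stub):

```crystal
require "http/server"

# Sketch: force a full collection at the end of every request handler.
server = HTTP::Server.new do |context|
  context.response.content_type = "text/plain"
  context.response.print "ok"
  GC.collect # collect after each request instead of waiting for the GC's own schedule
end

address = server.bind_tcp "127.0.0.1", 0 # port 0 = pick any free port
spawn { server.listen }
```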
Yes, with `GC.collect` memory stabilizes at 136 MB.
So I don't think it's a leak on our side; it's just a matter of when the GC chooses to free memory. Maybe we need to tune the GC, and/or eventually write our own.
I have 5 daemons that have been running for 24 hours, listening on HTTP and parsing HTML; each averages 20% CPU usage. They should not leak memory (the timeout function above is disabled, there is no shared state, and myhtml frees all its memory after each parse).
After start, for ~2 hours, they used < 100 MB.
After 24 hours their memory usage is: min 210 MB, max 870 MB. GC stats for the 210 MB process: GC::Stats(@collections=257517, @bytes_found=438208); for the 870 MB process: GC::Stats(@collections=257560, @bytes_found=484704).
And at some point (I don't know exactly when) the 870 MB process went crazy: its CPU usage is ~370%, while the other daemons are at 25% (all processing an equal amount of tasks).
Is there any way I can debug this 870 MB process to find out what is using the memory and why CPU usage is so high? When I stop processing, CPU usage is 0%, but when I start it again it's > 300%; strace on this process doesn't show anything strange.
ltrace for 5 seconds shows:

```
% time     seconds  usecs/call     calls function
------ ----------- ----------- --------- --------------------
 69.21   12.874828         182     70689 memcmp
 15.14    2.817164         237     11838 memset
 14.68    2.729765         253     10775 memcpy
  0.22    0.040995          68       600 free
  0.17    0.030980          68       449 memmove
  0.12    0.022489          67       335 malloc
```
Hm, that GC::Stats isn't very useful; I think it was added in a PR but doesn't provide much information. There's another function you can use:

```crystal
LibGC.get_heap_usage_safe(out heap_size, out free_bytes, out unmapped_bytes, out bytes_since_gc, out total_bytes)
pp heap_size, free_bytes, unmapped_bytes, bytes_since_gc, total_bytes
```

I don't know what to do with that info afterwards, but it can be useful to see whether the GC heap is growing a lot.
A program can also be run with `GC_PRINT_STATS=1`, and you'll get some GC stats printed on the console.
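For reference, recent Crystal versions expose these same Boehm counters directly through `GC.stats` (this API postdates the discussion above):

```crystal
# Sketch: reading heap statistics via the current GC.stats API.
stats = GC.stats
puts "heap size:      #{stats.heap_size}"
puts "free bytes:     #{stats.free_bytes}"
puts "unmapped bytes: #{stats.unmapped_bytes}"
puts "since last GC:  #{stats.bytes_since_gc}"
puts "total bytes:    #{stats.total_bytes}"
```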
Maybe you can also try profiling the program to see where those memcmp calls come from, so we can learn where all that CPU activity is coming from. At least on Mac there's Instruments, which can show a profiled call trace; I guess there's a similar tool for Linux but I don't know it.
It would also be useful to know which shards you are using (I guess myhtml?) to see whether the CPU usage comes from any of those shards or from the standard library. Or maybe provide some similar code that we could use to reproduce it.
I use kemal, msgpack, myhtml and redis.
But it should not use more than 100% CPU, because parsing is single-threaded (myhtml is compiled without pthreads) and blocking.
Btw, does Crystal create 15 threads per process by default?
For the crazy process, the main thread has used 1.5 h of CPU time, and every other thread 16 min; for an OK process it's 40 min, and 1 min for its threads.
The Boehm GC always creates many threads.
I solved this leak just by creating one coroutine with https://github.com/kostya/timeouter
I think this problem is still present (asterite's example grows monotonically for me with git master); care to reopen it? I also think it's possibly related to #3983 (i.e. causing other problems :|).
Wait till it "blows up" on Linux, then report:
OK, my latest experiment, with this:

```crystal
spawn do
  loop do
    GC.collect
    sleep 0.05
  end
end

loop do
  spawn do
    sleep 0.001
  end
  sleep 0.00005
end
```

It "stabilizes" for me at about 130 MB in use (OK, actually on Linux it got up to 320 MB RSS after a few days, but at least it's not leaking quickly; as a note, calls to GC.collect seem to get much more expensive the more memory is in use).
If I run this instead:

```crystal
loop do
  spawn do
    sleep 0.001
  end
  sleep 0.00005
  GC.collect
end
```

it runs at about 1/3 the speed but uses only about 7 MB (on Linux, 20 MB after a few days).
Without any GC.collect it goes up monotonically to like 3.2GB used when I stopped it.
Valgrind reports:
```
==27653== HEAP SUMMARY:
==27653==     in use at exit: 11,576 bytes in 21 blocks
==27653==   total heap usage: 307,190 allocs, 307,169 frees, 41,787,672 bytes allocated
==27653==
==27653== LEAK SUMMARY:
==27653==    definitely lost: 8,192 bytes in 1 blocks
==27653==    indirectly lost: 0 bytes in 0 blocks
==27653==      possibly lost: 288 bytes in 1 blocks
==27653==    still reachable: 3,096 bytes in 19 blocks
==27653==         suppressed: 0 bytes in 0 blocks
==27653== Rerun with --leak-check=full to see details of leaked memory
```
This seems normal, so it "feels" as if the problem is not so much a RAM leak as a reclamation issue, or "somebody keeping a handle to old stuff".
So I'm starting to wonder whether it's related to GC.collect not being called often enough; I guess we should continue the conversation in #3997.