Crystal: memory leak

Created on 21 Sep 2016 · 17 comments · Source: crystal-lang/crystal

While debugging #3332, I found a memory leak involving spawn.

I'm using the following timeout function:

def with_timeout(timeout = 60.0, &block)
  channel = Channel(Nil).new

  spawn do
    block.call
    begin
      channel.send(nil)
    rescue Channel::ClosedError
    end
  end

  spawn do
    break_at = Time.now + timeout.seconds
    loop do
      sleep 1.0
      break if Time.now > break_at || channel.closed?
    end

    unless channel.closed?
      begin
        channel.send(nil)
      rescue Channel::ClosedError
      end
    end
  end

  channel.receive
  channel.close
end

require "http/server"

server = HTTP::Server.new "0.0.0.0", 8080 do |context|
  context.response.headers["Content-Type"] = "text/plain"

  str = ""

  with_timeout do
    str += "a" * 10000
    sleep rand(0.5)
  end

  context.response.print(str)
end

spawn do
  loop do
    p `ps -p #{Process.pid} -o rss`
    sleep 30
  end
end

puts "Listening on http://0.0.0.0:8080"
server.listen

On OS X:

ab -n 50000 -c 50 http://127.0.0.1:8080/
RSS: 695 MB

If I remove with_timeout:

ab -n 50000 -c 50 http://127.0.0.1:8080/
RSS: 150 MB

I don't see any piece of code that could leak here.
By the way, 150 MB is also quite big.

Labels: bug, stdlib

All 17 comments

@kostya You are right, there does seem to be a memory leak. This short program reproduces it; memory keeps growing, slowly but steadily:

puts "Running!"

loop do
  spawn do
    sleep 0.1
  end
  sleep 0.005
end

I've been looking at the code, and I see that each Fiber has a @resume_event that's created on sleep and freed when the fiber finishes running. However, fibers are not immediately reclaimed: they go into a pool and are reused. So if a fiber is reused, the @resume_event will still be used, which is a bit suspicious... I tried setting @resume_event to nil when the fiber finished running, but that didn't fix it, so we'll need to investigate a bit more. There's definitely a leak somewhere.
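To put numbers on that reproduction, the monitor from the original report can be combined with the minimal loop, bounded so the program terminates (a sketch; the iteration counts are arbitrary):

```crystal
# Sketch: the minimal reproduction above, bounded to a fixed number of
# iterations, with the ps-based RSS check from the original report taken
# before and after so the growth can be compared. Iteration counts are
# arbitrary; `ps -o rss=` prints the resident set size in KB, no header.
start_rss = `ps -p #{Process.pid} -o rss=`.strip.to_i

2000.times do
  spawn do
    sleep 0.01
  end
  sleep 0.0005
end

sleep 0.5 # let the last fibers finish
end_rss = `ps -p #{Process.pid} -o rss=`.strip.to_i
puts "RSS before: #{start_rss} KB, after: #{end_rss} KB"
```

With the leak present, the "after" figure keeps climbing as the iteration count is raised; without it, the two numbers should stay close.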

On Linux this small loop leaks only about 1.6 MB, but my first program also leaks on Linux: it reaches 705 MB with:

ab -n 150000 -c 50 http://127.0.0.1:8080/

Listening on http://0.0.0.0:8080
"  RSS\n 2820\n"
"  RSS\n87900\n"
"  RSS\n178400\n"
"  RSS\n263892\n"
"  RSS\n338948\n"
"  RSS\n416864\n"
"  RSS\n499876\n"
"  RSS\n576536\n"
"  RSS\n655924\n"
"  RSS\n731980\n"
"  RSS\n813068\n"
"  RSS\n892084\n"
"  RSS\n963960\n"
"  RSS\n1038896\n"
"  RSS\n1116220\n"
"  RSS\n1197416\n"
"  RSS\n1272220\n"
"  RSS\n1352668\n"
"  RSS\n1430256\n"
"  RSS\n1507112\n"
"  RSS\n1590900\n"
"  RSS\n1668196\n"
"  RSS\n1750616\n"
"  RSS\n1825152\n"
"  RSS\n1902296\n"
"  RSS\n1984748\n"
"  RSS\n2000984\n"

and it seems to stop growing at 2032600

@ozra #2968 hasn't been merged (it was not accepted _yet_). The only recent change related to division is the _floored_ one.

Alright. Sorry for sliding off-topic.

@kostya What happens if you put a GC.collect right at the end of the server's block? That is, a collect after each request.
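Applied to the server from the original report, the suggestion would look roughly like this (a sketch using the same HTTP::Server API as the report, i.e. Crystal ~0.19; the response body is a placeholder):

```crystal
require "http/server"

# Sketch: force a GC collection at the end of each request block, to test
# whether memory growth is GC laziness rather than a true leak. Uses the
# HTTP::Server API from the original report (Crystal ~0.19).
server = HTTP::Server.new "0.0.0.0", 8080 do |context|
  context.response.headers["Content-Type"] = "text/plain"
  context.response.print("hello")
  GC.collect # collect after every request
end

puts "Listening on http://0.0.0.0:8080"
server.listen
```

This trades throughput for a flat memory profile, which makes it a useful diagnostic rather than a production fix.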

Yes, with GC.collect memory stays frozen at 136 MB.

So I don't think it's a leak on our side; it's just a matter of when the GC chooses to free memory... Maybe we need to tune the GC, and/or eventually write our own.

I have 5 daemons that have been running for 24 hours. They listen on HTTP and parse HTML; each averages 20% CPU usage. They should not leak memory (the timeout function above is disabled, there is no shared state, and myhtml frees all its memory after each parse).
For the first ~2 hours after start, they used < 100 MB.
After 24 hours their memory usage is: min 210 MB, max 870 MB. GC stats for the 210 MB process: GC::Stats(@collections=257517, @bytes_found=438208); for the 870 MB process: GC::Stats(@collections=257560, @bytes_found=484704).
And at some point (I don't know exactly when) the 870 MB process went crazy: its CPU usage is ~370%, while the other daemons are at 25% (all of them processing an equal amount of tasks).

Is there any way I can debug this 870 MB process to find out what is using the memory, and why the CPU usage is so high? When I stop processing, CPU usage is 0%, but when I start again it is > 300%. strace on this process doesn't show anything strange.

ltrace over 5 seconds shows:

% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 69.21   12.874828         182     70689 memcmp
 15.14    2.817164         237     11838 memset
 14.68    2.729765         253     10775 memcpy
  0.22    0.040995          68       600 free
  0.17    0.030980          68       449 memmove
  0.12    0.022489          67       335 malloc

Hmm, that GC::Stats isn't very useful; I think it was added in a PR but it doesn't provide much information. There's another function you can use:

LibGC.get_heap_usage_safe(out heap_size, out free_bytes, out unmapped_bytes, out bytes_since_gc, out total_bytes)
pp heap_size, free_bytes, unmapped_bytes, bytes_since_gc, total_bytes

I don't know what to do with that info afterwards, but it can be useful to see whether the GC heap is growing a lot.
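Wrapped in a background fiber, that call could log the heap numbers periodically while the daemon runs (a sketch; the 10-second interval is arbitrary):

```crystal
# Sketch: a background fiber that logs Boehm GC heap numbers every 10
# seconds via the LibGC.get_heap_usage_safe call shown above. Drop it in
# near the daemon's startup; the interval is arbitrary.
spawn do
  loop do
    LibGC.get_heap_usage_safe(out heap_size, out free_bytes,
      out unmapped_bytes, out bytes_since_gc, out total_bytes)
    puts "heap=#{heap_size} free=#{free_bytes} unmapped=#{unmapped_bytes} since_gc=#{bytes_since_gc} total=#{total_bytes}"
    sleep 10
  end
end
```

A steadily climbing heap_size with small free_bytes would point at the GC growing its heap rather than failing to reclaim it.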

A program can also be run with GC_PRINT_STATS=1, and you'll get some GC stats printed on the console.

Maybe you can also try profiling the program to see where that memcmp call comes from, so we can learn where all that CPU activity originates. At least on Mac there's Instruments, which shows the profiled call trace; I guess there's a similar tool for Linux, but I don't know it.

It would also be useful to know which shards you are using (I guess myhtml?), to see whether the CPU usage comes from one of those shards or from the standard library. Or maybe provide some similar code that we could use to reproduce this...

I used kemal, msgpack, myhtml, and redis.

But it should not use more than 100% CPU, because parsing is single-threaded (myhtml is compiled without pthreads) and blocking.

By the way: does Crystal create 15 threads per process by default?

For the crazy process, the main thread has used 1.5 h of CPU time, and each other thread 16 min;
for an OK process, 40 min, and 1 min for its threads.

The Boehm GC always creates many threads.

I solved this leak by creating just one coroutine, using https://github.com/kostya/timeouter
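The idea behind that approach can be sketched like this (hypothetical code, not the actual kostya/timeouter implementation; all names here are made up): a single long-lived fiber scans a registry of deadlines, instead of spawning a fresh timer fiber for every request the way with_timeout above does.

```crystal
# Hypothetical sketch in the spirit of kostya/timeouter (NOT its actual
# code): ONE long-lived fiber scans all registered deadlines, instead of
# spawning a new timer fiber per call.
class TimeoutRegistry
  @@deadlines = {} of Channel(Nil) => Time

  # Start the single scanner fiber; call once at program startup.
  def self.start
    spawn do
      loop do
        now = Time.now
        expired = [] of Channel(Nil)
        @@deadlines.each do |channel, deadline|
          expired << channel if now >= deadline
        end
        expired.each do |channel|
          @@deadlines.delete(channel)
          channel.send(nil) rescue nil # wake up the waiter
        end
        sleep 0.1
      end
    end
  end

  def self.register(channel : Channel(Nil), timeout : Float64)
    @@deadlines[channel] = Time.now + timeout.seconds
  end

  def self.unregister(channel : Channel(Nil))
    @@deadlines.delete(channel)
  end
end
```

A request would then create a channel, register it, block on channel.receive, and unregister when its work completes, so only one extra fiber exists for the whole process regardless of request volume.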

I think this problem is still present (asterite's example grows monotonically for me with git master); care to reopen it? I also think it's possibly related to #3983 (i.e. causing other problems :|)

Wait till it "blows up" on Linux, then report.

OK, my latest experiment, with this:

spawn do
  loop do
    GC.collect
    sleep 0.05
  end
end
loop do
  spawn do
    sleep 0.001
  end
  sleep 0.00005
end

It "stabilizes" for me at about 130 MB in use (OK, on Linux it actually got up to 320 MB RSS after a few days, but at least it's not leaking quickly; as a note, calls to GC.collect seem to get much more expensive the more memory is in use).

If I run this instead:

loop do
  spawn do
    sleep 0.001
  end
  sleep 0.00005
  GC.collect
end

it runs at about 1/3 the speed but uses only about 7 MB (on Linux, 20 MB after a few days).

Without any GC.collect it grows monotonically, reaching about 3.2 GB used when I stopped it.

Valgrind reports:

==27653== HEAP SUMMARY:
==27653==     in use at exit: 11,576 bytes in 21 blocks
==27653==   total heap usage: 307,190 allocs, 307,169 frees, 41,787,672 bytes allocated
==27653==
==27653== LEAK SUMMARY:
==27653==    definitely lost: 8,192 bytes in 1 blocks
==27653==    indirectly lost: 0 bytes in 0 blocks
==27653==      possibly lost: 288 bytes in 1 blocks
==27653==    still reachable: 3,096 bytes in 19 blocks
==27653==         suppressed: 0 bytes in 0 blocks
==27653== Rerun with --leak-check=full to see details of leaked memory

That seems normal, so it "feels" as if the problem is not so much a RAM leak as a reclamation issue, i.e. "somebody keeping a handle to old stuff".

So I'm starting to wonder if it's related to "GC.collect not being called often enough", so I guess we should continue the conversation in #3997.

