See if this one uses up an inordinate amount of RAM for you too (it seems to for me on every OS I tried):
class Foo
  def finalize
    # Invoked when Foo is garbage-collected
    a = 3
    "bye #{3 + 4}"
  end
end

# Allocates (and finalizes) Foo objects forever
loop do
  Foo.new
end
I wonder if it's related to the Boehm GC finalization cycle or something. Let me know if it's reproducible. Thank you.
Objects with finalizers are collected after two GC cycles: the first cycle runs the finalizer and the second one collects the object's memory. If you add a call to GC.collect inside the loop, the memory usage will stay really low.
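A minimal sketch of that suggestion (bounded here for illustration; the original report looped forever):

```crystal
class Foo
  def finalize
    # the first GC cycle runs this; the second reclaims the object's memory
  end
end

1_000.times do
  Foo.new
  GC.collect # force cycles so finalized objects are reclaimed promptly
end

puts "heap after: #{GC.stats.heap_size} bytes"
```

With the explicit collect each iteration, both finalization cycles keep running and the heap stays small.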
For this reason I think it's a bad idea to allocate so many unreachable objects with finalizers in a tight loop. In fact, I think finalizers should be avoided whenever possible. They are useful to release OS resources, but otherwise there is almost always another (more deterministic) option to release resources or finish some work.
Can you give more details about what you are trying to accomplish with finalizers?
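For context, a sketch of the kind of deterministic alternative described above: instead of a finalizer, expose a close method plus a block-form open that guarantees cleanup (Resource and its internals are hypothetical, not from the thread):

```crystal
class Resource
  @closed = false

  def closed?
    @closed
  end

  def close
    return if @closed
    @closed = true
    # release the OS handle here (hypothetical)
  end

  # Block form guarantees release, like File.open - no finalizer needed
  def self.open
    resource = new
    begin
      yield resource
    ensure
      resource.close
    end
  end
end

Resource.open do |r|
  # use r; it is closed deterministically when the block exits
end
```

The ensure runs even if the block raises, so cleanup no longer depends on when (or whether) the GC gets around to the object.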
It does stay down with GC.collect. (Wait, is GC.collect not being called underneath? I see finalizers being called. Is there a difference between GC.collect and standard garbage collection, and a reason why GC.collect seems to be called infrequently while RAM usage goes up and up?) That exception trace still occurs occasionally (at least on OS X).
I'm trying to debug why servers on "some machines" seem to grow in used memory forever. The SSL sockets used do use finalizers FWIW, but I somehow doubt they're the culprit: https://github.com/crystal-lang/crystal/issues/3983
Maybe crystal is keeping an "old reference" to ancient fibers around, or maybe it's only allocating a small amount of memory each time, which BDW doesn't count towards its "should GC now" metric, or some odd thing?
What I'm seeing in regards to GC is that the amount of heap memory seems to keep growing, but the actual memory usage is mostly lower; see #3992
If I print GC stats I get this result:
GC::Stats(@heap_size=3233890304, @free_bytes=3233832960, @unmapped_bytes=0, @bytes_since_gc=0, @total_bytes=8000019728)
As you can see in my example the free bytes is almost the same as the heap size, maybe that's the case here as well?
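A quick arithmetic check on the stats quoted above: subtracting free_bytes from heap_size estimates the live data, and it shows the heap is almost entirely free.

```crystal
# Numbers taken from the GC::Stats output quoted above
heap_size  = 3233890304_u64
free_bytes = 3233832960_u64

live = heap_size - free_bytes
puts "live data: #{live} bytes" # 57344 bytes - only ~56 KB actually in use
puts "free fraction: #{100.0 * free_bytes / heap_size}%"
```

So a ~3 GB heap can be holding almost no live data; the GC has simply grown the heap rather than returning it.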
The good news here is that if I add an explicit GC.collect every so often:
spawn do
  loop do
    sleep 5
    GC.collect
  end
end
then RES memory seems "constrained" to around 1GB. Or with sleep 0.5 it stays around 100 MB. So the GC feels like it's working, not allowing leaks, just not being called often enough, perhaps for some odd reason?
@rdp GC.collect is expensive; RAM not being used by other programs is almost free. For that reason, it's likely that the GC will simply use more RAM instead of reclaiming more often under heavy allocation pressure (i.e. allocating in a loop).
Could we talk about GC in a single thread and close other issues? They're all related and down to the same thing: don't allocate repeatedly.
I believe @RX14 put in the final nail: don't test the GC on something it's very bad at, which warns you repeatedly about it, telling you exactly what you'll notice: a memory leak!
If you want to test the GC, do it on real applications. Don't allocate in a tight loop; reuse a single Array object, for example, instead of allocating a new one on each iteration. HTTP::Server keeps HTTP::Response objects, for example, resetting and reusing them for another request.
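A minimal sketch of that reuse pattern (the buffer and loop body are illustrative, not from the thread): keep one IO::Memory and clear it each iteration instead of building a fresh String every time.

```crystal
buffer = IO::Memory.new

1000.times do |i|
  buffer.clear               # reset length, keep the already-allocated capacity
  buffer << "request #" << i
  # hand buffer to whatever consumes it; no per-iteration String allocation
end

puts buffer.to_s # contents from the last iteration
```

The heap stops churning because the same backing capacity is reused across all 1000 iterations.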
Well, the reason I started asking these questions is that I'm seeing it in my real application. The actual number of items assigned in an array is a lot lower and it takes a lot longer for the memory to grow to a size where it causes problems, but it is causing a problem. The only option at that point is to restart the application, and then it can run fine for days.
Any tips on how to test where the leaks actually are in a real application on a production server without reducing code to things the GC wasn't made to do in the first place?
OK I see now that allocations in finalize don't leak per se, so I'm proposing we kind of hijack this as the main thread for cases of "if I perform periodic GC.collect, my process stays at 10MB, otherwise it goes up to 5GB", if that's all right. Here's an example:
class Foo
  def finalize
    "bye #{3}"
  end
end

loop {
  Foo.new
}
I guess what I'd like answered is: if you allocate memory frequently, why does a full GC.collect (which would seem useful) not get triggered, or get triggered only rarely? Are there bdwgc options that might help, for instance? :) Just for discussion, as it were...
A quick search turned up this answer, which looks interesting:
https://www.quora.com/Is-there-any-way-to-implement-a-smart-pointer-in-C
I also have an issue with runaway memory use due to repeated allocations within a loop:
fileAll = File.open("/target-path/all.xml", "a")

Dir.glob("/source-path/*.xml").sort.each do |path|
  xml = File.read(path, "utf8")
    .gsub(/>[\s]*\n[\s]*</, "><")
    # ... various replaces ...
    .gsub(/<\/Product>/, "</Product>\n")
  fileAll.puts xml
end

fileAll.close
Of course I get the classic error "GC Warning: Repeated allocation of very large block (appr. size 35000000): May lead to memory leak and poor performance.".
I have tried the code as seen above, along with various other experimental measures:
- GC.collect at top of loop
- GC.collect at bottom of loop
- open/close file within loop
- combinations of the above
In all cases, as the script processes these files of about 20-40MB, the memory usage inevitably creeps up and eventually will crash.
Most advice I see seems to be simply "don't repeatedly allocate within a loop". But ... how to avoid it?
What are you allocating that needs size 35000000 I wonder...
The script processes hundreds of files of around 20-40MB. Incredibly, the same script in Node.js doesn't crash, although its memory usage certainly levels off pretty high.
@sa-0001 the best solution would be to not load the whole file into memory, but as far as I know crystal doesn't have any good ways to work with replacements on streams the same way you can on strings, so I won't advise that.
One thing you could try is using GC.free(xml.as(Void*)). Make sure you put this after the last usage of the object. And I still can't guarantee it's safe. Obviously you shouldn't have to do this.
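For what it's worth, when the replacements that matter are purely per-line, a line-streaming sketch can avoid the huge allocations (stream_concat is a hypothetical helper; the cross-line gsub in the original snippet would still need a small carry-over buffer, which this does not handle):

```crystal
# Hypothetical helper: append each source file into dest, line by line,
# applying only per-line replacements so no single allocation approaches
# the 20-40MB file size.
def stream_concat(source_paths : Array(String), dest_path : String)
  File.open(dest_path, "a") do |out|
    source_paths.sort.each do |path|
      File.each_line(path) do |line|
        out.puts line.gsub(/<\/Product>/, "</Product>\n")
      end
    end
  end
end

# stream_concat(Dir.glob("/source-path/*.xml"), "/target-path/all.xml")
```

Each iteration then allocates at most one line's worth of String, instead of one string per whole file.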
I'm just making a ping here so that I have track on this issue - ignore me B-)
Keeping these memory discussions in one place: I'm finding crystal never seems to reduce its heap size even when traffic drops. Instead the GC seems to run less often. This may be intended behaviour.
This graphed data is a section from yesterday
https://www.duodesign.co.uk/memstats/
I've attached an image of the graph as the above link may change in the future.
At 13:14 I ran apache bench and as you'd expect the memory usage leapt up. It then stays there until I restart crystal.
Libgc is a non-compacting garbage collector, which buckets allocations into separate heaps based on allocation size. This means that the heap(s) can get fragmented. Also, I'm not sure what the heuristics for when to return heap memory to the OS actually are.
Again, if you want an answer from people who know what they're talking about, ask the libgc guys.
@ozra you can subscribe to issue notifications without commenting by clicking here:
https://github.com/ivmai/bdwgc/issues/152#issuecomment-283733454
First, libgc should be built with configure --enable-munmap option. An optional parameter can be specified along with the option - to set the desired memory blocks unmapping threshold (the number of sequential garbage collections for which a candidate block for unmapping should remain free).
Not sure if this is set? If not, it would explain the current behaviour.
After some testing with @sdogruyol's help, we discovered Go seems to release memory to the OS after around 5 mins, returning to its pre-load state after up to 30 mins. It would be good if crystal could mimic this behaviour.
It seems like a good thing to try. Right now we aren't setting that flag. I didn't even know libgc could be configured like that (I thought it would never give the memory back to the OS).
On OSX we are using bdw-gc from homebrew, which doesn't seem to set it either...
I wonder if there's a reason why it's not enabled by default. Why could it be?
The only docs I can find referencing this are here, which makes me wonder if it's a Windows-only option.
I've compiled libgc with that option on mac. I'll try building crystal with it and try some tests.
@crisward here are docs about it:
https://github.com/ivmai/bdwgc/blob/64bed186fc1ca9a654c9bca063089f5b7ead87be/doc/README.macros#L287-L291
It seems it combines it with mmap:
Causes memory to be returned to the OS under the right circumstances. This currently disables VM-based incremental collection (except for Win32 with GetWriteWatch() available).
Works under some Unix, Linux and Windows versions.
Requires USE_MMAP except for Windows.
I've got it working.
I set it to 3, so it returns memory after 3 GC cycles, i.e.:
brew install automake pkg-config #mac only
git clone https://github.com/ivmai/bdwgc.git
cd bdwgc
git clone https://github.com/ivmai/libatomic_ops.git
autoreconf -vif
automake --add-missing
./configure --enable-munmap=3
make
make check
sudo make install
If you want to reproduce this on mac, you need to do the brew steps above
Looks promising @crisward, but perhaps 3 cycles is too short for a long-running process (and might result in slowdown from freeing and reacquiring these resources to/from the OS during high traffic).
There are also interesting details unrelated to this that also apply to parallelism: by default libgc might fail in multi-threaded environments unless munmap is used (ref). I think that change is long gone, but usage of mmap and possibly parallel marking might be required for thread support.
/cc @asterite
@luislavena you may be right. However, I found that with high memory usage the GC cycle seems to slow down to once every 3 minutes, so 3 would equate to 9 minutes (see my first graph above). Go seems to start returning memory after around 5, so this seems roughly in line. Like anything, I imagine this is going to take some real-world use to find a good balance.
A couple of months ago I did some PoC work on a set of traits (modules) that would facilitate and sort of "standardize" different methods of pooling / re-using / re-initialization - sort of getting manual-style #free() methodology while still having the safety net of the GC.
I think standardizing such a framework in stdlib is well worth elaborating on - re-using objs and taking _some_ responsibility of saying when one is done with a resource of the machine is prudent and can help immensely with memory usage, avoiding the GC crunching first and foremost.
[_ED:_ the problem back then, to make it cleanly "user-coder"-friendly - keeping user's code DRY - was that the 'piler insists on doing the "reduced extent inference" in initializers - something I thought it could ignore if specifying distinct types for all ivars (despite it not being able to see that it actually holds true - for compilation performance reasons) - but I apparently still hadn't grok'd the compiler phases fully despite digging through the source over and over, haha. When I have less stress on the plate, I will be back diving into the guts of it again. _I love the stuff you've all done on the compiler. It's just so clean compared to others' I've hacked on and studied_ [names omitted]!]
I'm really interested in these aspects, have made many allocation models over the years for embedded situations etc - it's a fun puzzle. Of course, I'm strapped with time, I haven't even had time to check up on the developments in Crystal the last few months. But, anyhow, those two cents.
Obviously we all are eager to jump in and code a whole bunch of precise GC's in the future - right!? :-) If the API for the GC is kept well thought out - there's room for specialized variants optimized for multi-threaded, single-threaded, throughput, latency - whatever.
But now, I'm sailing away into "someday, maybe" mode - sorry!
@RX14 - OT (meta questions): so clicking that will land it in "participating" automatically, saving me from looking like a fool for ping-posting? B-)
@ozra yes, it'll keep you notified of the issue.
Any ideas how I'd get a compiled version of bdwgc into the heroku buildpack?
@crisward there used to be something called vulcan to compile binaries for Heroku. Not sure what the method is nowadays.
But you will need to fork the buildpack to replace the bundled libgc with the one you produced before.
@luislavena Thanks.
Just thought it worth feeding back my findings. I've overloaded a server (in dev mode) with 1000 requests and 50 concurrent connections.
ab -n 1000 -c 50 http://127.0.0.1:3000/
The app uses crystal-mysql and kemal so includes a handful of database queries per connection.
The app starts at 5.8MB of memory (after a bit of clicking around). After the bench it goes up to around 20MB. Shortly after the spike, it releases just under half the memory back (down to 11MB). Then after 3-4 GC cycles it starts to release memory back to the OS in ever-decreasing chunks, getting back to 5.8MB after about 1.5hrs, though it's already down to 6.4MB after 30mins.
Over the weekend I'm going to try and get this change into a fork of the buildpack so I can try this out on my live server using Dokku.
Nice @crisward, this is more in line with what I expected to see with the heap size.
I've looked at the Heroku buildpack, and it looks like it is getting a release of Crystal from GitHub which has libgc included in the /embedded/lib folder.
Libgc needs building with the options outlined above and adding to this release.
The buildpack currently uses https://github.com/crystal-lang/crystal/releases/download/0.21.0/crystal-0.21.0-1-linux-x86_64.tar.gz
Who would be able to do this?
@crisward that will be the Manas team, which is why I suggested you fork the buildpack and introduce your own copy in the compile script (to replace the one from the package).
You can download another file and replace it, then point your app to your custom buildpack and should work.
(I mean: your own copy of the libgc.a that you uploaded somewhere accessible by the buildpack/heroku build step)
I think it needs to be added here https://github.com/crystal-lang/omnibus-crystal/blob/master/config/software/bdw-gc.rb
Should we also send a PR to homebrew's formula to fix this?
@asterite isn't that too early? From the comment in bdwgc it might not be ready for prime time.
@luislavena Oh, I didn't know he replied to me :-)
So for now I don't think there's much we can do here other than wait, right?
I think it needs more testing. I'm going to update my local build, do a bit of that this weekend, and report back.
I might recommend a custom build of libgc.a, uploaded somewhere like S3, and then fork the buildpack to patch the embedded one; then anyone can deploy the updated version to Heroku without having to compile everything.
I'll fork crystal, the buildpack etc and give it a test. If this works, we could perhaps have an opt-in experimental buildpack release?
@crisward not sure forking crystal is required: just compile a new version of libgc.a, either by using the omnibus-crystal recipes or on Heroku itself (heroku run bash), and upload the new libgc.a somewhere you can replace the one from the package.
Will do, thanks.
I've patched the buildpack with a re-compiled libgc.a and it's working well.
I've got the GC returning memory to the OS. However, after leaving it running for over 12 hours overnight, the GC stops collecting until it almost exhausts available space, then starts GC cycles again.
It's running inside a Docker container so it's limited to 100mb, it using around 70mb.
GC is only running about every 30 minutes now. I've got my app set up so I can manually trigger a GC cycle, and memory usage comes down if I do this. But then it goes back up and GC doesn't run again until memory is high.
There is a global variable described here, https://github.com/ivmai/bdwgc#the-c-interface-to-the-allocator, named GC_free_space_divisor, which seems to allow the GC to be tuned to favour cycles over memory use. It would be good if this was exposed via the GC interface.
Anyone have ideas how I'd go about doing this? I'd like to try and feed it different values to see how it effects my applications memory consumption, ideally during runtime.
I'm doing this by changing the setting from 3 to 4 (I think). Not sure if the value is a reference or a copy? Hoping it's a reference.

lib LibGC
  $free_space_divisor = GC_free_space_divisor : UInt64
end

LibGC.free_space_divisor = 4_u64
I've done two tests now, over two 14-hour periods, the first before the change to 4. In both cases after around 12hrs I'm getting a spike in memory: from 5MB to 80MB at 3, and from 4MB to 40MB at 4.
Not sure what's causing the spike. The first was at 7:24am, the second at 8:28am the next day. I've checked the logs and there was almost no activity at all. Right now I could do with some way of listing memory allocations... I'm guessing there is no way of doing this. Also, taking 12hrs to reproduce is a real pain.
Hack the allocation funcs in crystal so they count alloc stats per class in memory, which can be dumped by signalling the app? Preallocate a big fat chunk of memory for the stat array (maybe via mmap to keep it outside the GC too if you want), so that it stays static from start all the way through...
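A rough approximation of that idea, without touching the allocator itself and without the out-of-GC mmap'd storage suggested above (everything here is hypothetical/illustrative): count constructions of the classes you suspect in a plain Hash and dump the counts on a signal.

```crystal
# Hypothetical allocation-stat counter: instrument the constructors of
# suspect classes, then `kill -USR1 <pid>` to dump the counts at runtime.
module AllocCounter
  @@counts = Hash(String, UInt64).new(0_u64)

  def self.record(name : String)
    @@counts[name] += 1
  end

  def self.counts
    @@counts
  end

  def self.dump
    @@counts.each { |name, n| puts "#{name}: #{n}" }
  end
end

class SuspectBuffer # stand-in for a class you suspect of leaking
  def initialize
    AllocCounter.record(self.class.name)
  end
end

Signal::USR1.trap { AllocCounter.dump }

3.times { SuspectBuffer.new }
AllocCounter.dump # prints "SuspectBuffer: 3"
```

It won't catch allocations you haven't instrumented, but it's a quick way to see which suspect class is churning on a production box without a 12-hour valgrind run.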
Found the cause of the spike. https://github.com/kemalcr/kemal/issues/323 fixed in kemal.
Using the flag in the GC is working well. I don't think the LibGC.free_space_divisor change is necessary.
Now I've fixed that bug I'll monitor memory for a week and let you know, but so far it's returning to the os when it can and holding around 3mb.
_(quote from comment in https://github.com/kemalcr/kemal/issues/323)_
2.7MB. I'm surprised the memory in a fibre isn't cleaned up when an exception is raised.
Are you saying that, when code raises within a fiber, it's not cleaned up? The stack there is allocated via mmap - outside of GC. I know there are a few issues reporting mysterious leaks, that I still _think are open_.
If I understood you right, you might have found a long haunting ghost!
Ping @waj @asterite just in case, otherwise - sorry to derail!
Exceptions in spawn are rescued and a message is printed about it. Then the fiber terminates normally; that shouldn't keep the fiber's stack allocated.
Alright @ysbaddaden - it was a longshot of unlikeliness :) !
I've done some experiments with the GC and spawn, with libgc compiled with --enable-munmap=3
str = "hello world " * 100000
p "string size #{str.size}"
str = nil

counter = 0
loop do
  counter += 1
  puts "#{counter}. heap #{GC.stats.heap_size} free_bytes #{GC.stats.free_bytes} bytes_since_gc #{GC.stats.bytes_since_gc}"
  sleep 1.seconds
  GC.collect
end
This outputs what I'd expect:
"string size 1200000"
1. heap 1286144 free_bytes 24576 bytes_since_gc 1211920
2. heap 1286144 free_bytes 1220608 bytes_since_gc 208
3. heap 1286144 free_bytes 1220608 bytes_since_gc 208
4. heap 1286144 free_bytes 1220608 bytes_since_gc 208
5. heap 1273856 free_bytes 1208320 bytes_since_gc 208
6. heap 65536 free_bytes 0 bytes_since_gc 208
7. heap 65536 free_bytes 0 bytes_since_gc 208
However, if I do this instead and create the large var inside a spawn:
spawn do
  str = "hello world " * 100000
  p "string size #{str.size}"
  sleep 1.seconds
  str = nil
  GC.collect
end

counter = 0
loop do
  counter += 1
  puts "#{counter}. heap #{GC.stats.heap_size} free_bytes #{GC.stats.free_bytes} bytes_since_gc #{GC.stats.bytes_since_gc}"
  sleep 1.seconds
  GC.collect
end
The output is...
1. heap 65536 free_bytes 28672 bytes_since_gc 2960
"string size 1200000"
2. heap 1286144 free_bytes 20480 bytes_since_gc 208
3. heap 1286144 free_bytes 24576 bytes_since_gc 208
4. heap 1286144 free_bytes 24576 bytes_since_gc 208
5. heap 1282048 free_bytes 20480 bytes_since_gc 208
6. heap 1282048 free_bytes 20480 bytes_since_gc 208
7. heap 1261568 free_bytes 0 bytes_since_gc 208
8. heap 1261568 free_bytes 0 bytes_since_gc 208
9. heap 1261568 free_bytes 0 bytes_since_gc 208
10. heap 1261568 free_bytes 0 bytes_since_gc 208
So it's freeing some of the bytes, but not all.
I realise this is a bit of a naive example. I'm just trying to get a better understanding of this before I bring exceptions and writing IO into this.
OK, this is weird, but maybe someone else has an idea... (I'm guessing it's a timing issue.)
If I repeat the test above but add a debug line in crystal's fiber.cr library (line 92):
protected def self.allocate_stack
  p "fiber allocate_stack"
  @@stack_pool.pop? || LibC.mmap(nil, Fiber::STACK_SIZE,
    LibC::PROT_READ | LibC::PROT_WRITE,
    LibC::MAP_PRIVATE | LibC::MAP_ANON,
    -1, 0).tap do |pointer|
    raise Errno.new("Cannot allocate new fiber stack") if pointer == LibC::MAP_FAILED
    {% if flag?(:linux) %}
      LibC.madvise(pointer, Fiber::STACK_SIZE, LibC::MADV_NOHUGEPAGE)
    {% end %}
    LibC.mprotect(pointer, 4096, LibC::PROT_NONE)
  end
end
This happens - the memory is collected; see free bytes and heap:
"fiber allocate_stack"
"fiber allocate_stack"
"fiber allocate_stack"
1. heap 65536 free_bytes 4096 bytes_since_gc 13040
"string size 1200000"
"fiber allocate_stack"
2. heap 1372160 free_bytes 106496 bytes_since_gc 208
3. heap 1372160 free_bytes 1310720 bytes_since_gc 208
4. heap 1372160 free_bytes 1310720 bytes_since_gc 208
5. heap 1368064 free_bytes 1306624 bytes_since_gc 208
6. heap 1368064 free_bytes 1306624 bytes_since_gc 208
7. heap 61440 free_bytes 0 bytes_since_gc 208
8. heap 61440 free_bytes 0 bytes_since_gc 208
9. heap 61440 free_bytes 0 bytes_since_gc 208
10. heap 61440 free_bytes 0 bytes_since_gc 208
This also works, adding a p to spawn. Weird.
def spawn(*, name : String? = nil, &block)
  p "spawn"
  fiber = Fiber.new(name, &block)
  Scheduler.enqueue fiber
  fiber
end
What happens if you output something in self.stack_pool_collect? This method is called from a Fiber every 5 seconds, if I'm not mistaken.
Nothing, I tried that. Also interestingly this does nothing...
def spawn(*, name : String? = nil, &block)
  STDOUT.puts "spawn"
  fiber = Fiber.new(name, &block)
  Scheduler.enqueue fiber
  fiber
end
The only difference between puts and p seems to be that prettyprint is invoked. I'm guessing something in there is helping.
Yep - PrettyPrint.format("spawn", STDOUT, 79) helps memory drop too.
And replacing it with Fiber.yield - does that also release the memory?
Fiber.yield also does nothing (to release memory)...
Maybe something happening in the printer.flush call performed by the pretty print...
I was looking at that; not sure why it would... I also thought it may be a timing issue, so I tried sleep 1.millisecond, but that didn't do anything; even tried sleep 1.second.
def spawn(*, name : String? = nil, &block)
  "a".inspect
  fiber = Fiber.new(name, &block)
  Scheduler.enqueue fiber
  fiber
end
Just calling inspect also seems to work; this is what pretty print does.
Same here. Next, what does inspect do?
def spawn(*, name : String? = nil, &block)
  String.build {}
  fiber = Fiber.new(name, &block)
  Scheduler.enqueue fiber
  fiber
end

also works
All of these things seem to push initial memory use up from 1286144 to 1372160. Not sure if that is significant?
Well, the string builder allocates some capacity with GC.malloc_atomic(capacity.to_u32).as(UInt8*)
Adding GC.malloc_atomic(2048.to_u32).as(UInt8*) works too.
The size of the string created inside the fiber seems to have no effect on the GC collecting. But adding GC.malloc_atomic(2048.to_u32).as(UInt8*) always helps the collection.
So the GC doesn't collect / clean up until called upon to do a new allocation - or is it definitely _only_ for plain value-alloc (malloc_atomic)?
Having GC.malloc_atomic(2048.to_u32).as(UInt8*) added to crystal's spawn method seems to cause the fiber's memory to be cleaned up. Also, messing with exceptions inside the fiber: they seem to add a further 4096 bytes to the heap for each exception. I'm not sure of the significance of that number. I'm still playing with various scenarios to get something which approximates the issue I'm having inside kemal with errors in fibers.
Out of curiosity: do both GC.malloc_atomic(2048) and GC.malloc(2048) help? And what about the size - is there a minimum required to trigger it?
Another interesting thing to point out:
benoist@benoist-i7$ crystal test.cr
1. heap 65536 free_bytes 28672 bytes_since_gc 2960
"string size 1200000"
2. heap 1286144 free_bytes 20480 bytes_since_gc 208
3. heap 1286144 free_bytes 24576 bytes_since_gc 208
4. heap 1286144 free_bytes 24576 bytes_since_gc 208
5. heap 1286144 free_bytes 24576 bytes_since_gc 208
^C[~/code]
benoist@benoist-i7$ crystal test.cr --release
1. heap 65536 free_bytes 28672 bytes_since_gc 2960
"string size 1200000"
2. heap 1286144 free_bytes 16384 bytes_since_gc 208
3. heap 1286144 free_bytes 1224704 bytes_since_gc 208
4. heap 1286144 free_bytes 1224704 bytes_since_gc 208
5. heap 1286144 free_bytes 1224704 bytes_since_gc 208
So in release mode allocating new memory before spawning is not required.
@benoist Thanks for your help with this.
Ultimately I'm trying to track down where memory is leaking to cause the issue I'm having in Kemal. Running GC.malloc_atomic(2048) does seem to help in this simple test case, but if an exception is thrown inside the fiber I get the same problem of a big spike in memory. I thought it may be to do with the socket not being closed and flushed, but I can't find any evidence of this; I've tried patching HTTP::Server::Response.write and adding a rescue with close and flush, but to no effect.
Also, my leak is happening with the --release flag too, which points to some cause other than fibers, though they could be a contributing factor.
My leaks are always preceded by an Exception: Error writing file: Broken pipe (Errno), as detailed here.
I could do with input from someone with better understanding of exactly how fibers and io work, especially in connection with the GC. @asterite @ysbaddaden ?
@crisward I've seen evidence of Exception: Error writing file: Broken pipe (Errno) in cases when connections were interrupted.
It's easy enough to recreate, using something like wrk -c2 -t2 -d1 http://127.0.0.1:3000. It seems to leak a noticeable amount when you're returning lots of data, or there are lots of things going on in the fiber.
wrk cuts out mid-request when the time runs out, so it creates the bug. Apache bench does full requests each time so doesn't create the leak, even with similar loads.
The tricky part here is it may not be a memory "leak" per se (if it is, run valgrind ./your_crystal_app and it will report a memory leak) but possibly a "left-over reference" where something is still pointing to the big block, so it cannot be reclaimed. See also the quora link pointed to before. So it's a bit tricky to find "what's pointing to what" as it were... though maybe possible :)
@rdp I don't disagree. I've done checks with Apple's Instruments and no leaks appear. Finding what's pointing where, though, is a tough one.
Not to be nitpicking, @rdp, but a left-over ref causing unwanted heap growth is a leak in my book, just like circular refs for GCs not coping with those, etc., whatever B-)
@crisward - did you create an absolute minimum case of code to reproduce the behaviour on cut-off connections? And for the mystic cleanup triggers: did you try whether standard "conservative" malloc also triggers it, just like malloc_atomic?
(the more clues the better)
@ozra GC.malloc doesn't seem to work. It frees some memory, but not all unused memory like malloc_atomic does.
I've not got a very minimal use case; I'd say that would have to be just using the standard library. Will work on it when I get a chance. I've been looking into getting valgrind working; think I'll have to do it in a VM as it's not mac os friendly (well, not Sierra at least).
It's either that, or my theory was that a "full" GC wasn't being called for some odd reason [poor math? too many small mallocs? might have to dive into the bdw-gc and see if that's the case and if so why, etc.]. Good luck!
Super simple reproduction. Though please correct me if I'm misunderstanding the data.
require "http/server"

server = HTTP::Server.new(8080) do |context|
  context.response.content_type = "text/plain"
  sleep 100.milliseconds # slow down to simulate db stuff
  bigstr = "Hello World! " * 100000
  context.response.print bigstr
end

spawn do
  counter = 0
  loop do
    counter += 1
    puts "#{counter}. heap #{GC.stats.heap_size} free_bytes #{GC.stats.free_bytes} bytes_since_gc #{GC.stats.bytes_since_gc}"
    GC.collect
    sleep 1.seconds
  end
end

puts "Listening on http://127.0.0.1:8080"
server.listen
Two tests, one using ab (Apache Bench) and one using wrk. ab does not cause broken pipe errors because it runs for n number of requests. wrk runs for n number of seconds and cuts off mid-request if necessary, so it causes broken pipes.
The results are:

```
ab -c10 -n100 http://127.0.0.1:8080/
```

```
Listening on http://127.0.0.1:8080
1. heap 131072 free_bytes 49152 bytes_since_gc 24096
2. heap 131072 free_bytes 53248 bytes_since_gc 320
3. heap 131072 free_bytes 53248 bytes_since_gc 208
4. heap 131072 free_bytes 53248 bytes_since_gc 208
5. heap 81920 free_bytes 4096 bytes_since_gc 208
6. heap 77824 free_bytes 0 bytes_since_gc 208
7. heap 19546112 free_bytes 5500928 bytes_since_gc 10573104
8. heap 19546112 free_bytes 6930432 bytes_since_gc 10465904
9. heap 19546112 free_bytes 18956288 bytes_since_gc 2992
10. heap 19546112 free_bytes 19378176 bytes_since_gc 112
11. heap 19546112 free_bytes 19378176 bytes_since_gc 112
12. heap 19542016 free_bytes 19374080 bytes_since_gc 112
13. heap 16834560 free_bytes 16666624 bytes_since_gc 112
14. heap 167936 free_bytes 0 bytes_since_gc 112
15. heap 167936 free_bytes 0 bytes_since_gc 112
16. heap 167936 free_bytes 0 bytes_since_gc 112
```
```
wrk -c10 -t10 -d1 http://127.0.0.1:8080/
```

```
1. heap 131072 free_bytes 49152 bytes_since_gc 24096
2. heap 131072 free_bytes 53248 bytes_since_gc 320
3. heap 19550208 free_bytes 6156288 bytes_since_gc 10416224
Unhandled exception in spawn:
Error writing file: Broken pipe (Errno)
0x10fda8b32: *CallStack::unwind:Array(Pointer(Void)) at ??
0x10fda8ad1: *CallStack#initialize:Array(Pointer(Void)) at ??
0x10fda8aa8: *CallStack::new:CallStack at ??
0x10fd9df61: *raise<Errno>:NoReturn at ??
0x10fe79914: *TCPSocket+@IO::FileDescriptor#unbuffered_write<Slice(UInt8)>:Int32 at ??
0x10fe795d1: *TCPSocket+@IO::Buffered#flush:TCPSocket+ at ??
0x10fe79cab: *TCPSocket+@IO::Buffered#close:Nil at ??
0x10fe74b37: *HTTP::Server::RequestProcessor#process<(OpenSSL::SSL::Socket::Server | TCPSocket+), (OpenSSL::SSL::Socket::Server | TCPSocket+), IO::FileDescriptor>:Nil at ??
0x10fe6da65: *HTTP::Server::RequestProcessor#process<(OpenSSL::SSL::Socket::Server | TCPSocket+), (OpenSSL::SSL::Socket::Server | TCPSocket+)>:Nil at ??
0x10fe6d986: *HTTP::Server#handle_client<(TCPSocket+ | Nil)>:Nil at ??
0x10fda6ce6: ~procProc(Nil)@/usr/local/Cellar/crystal-lang/0.21.1/src/http/server.cr:163 at ??
0x10fdb2ec4: *Fiber#run:(IO::FileDescriptor | Nil) at ??
0x10fda3199: ~proc2Proc(Fiber, (IO::FileDescriptor | Nil))@/usr/local/Cellar/crystal-lang/0.21.1/src/fiber.cr:29 at ??
(removed others for brevity)
4. heap 19562496 free_bytes 7094272 bytes_since_gc 10748320
5. heap 19562496 free_bytes 17895424 bytes_since_gc 112
6. heap 19562496 free_bytes 18034688 bytes_since_gc 112
7. heap 19562496 free_bytes 18034688 bytes_since_gc 112
8. heap 18808832 free_bytes 17281024 bytes_since_gc 112
9. heap 12140544 free_bytes 10612736 bytes_since_gc 112
10. heap 1527808 free_bytes 0 bytes_since_gc 112
11. heap 1527808 free_bytes 0 bytes_since_gc 112
12. heap 1527808 free_bytes 0 bytes_since_gc 112
13. heap 1527808 free_bytes 0 bytes_since_gc 112
14. heap 1527808 free_bytes 0 bytes_since_gc 112
...
141. heap 1527808 free_bytes 0 bytes_since_gc 112
```
After the broken pipes, the server shows 1.5MB of memory usage vs 0.16MB with ab and no broken pipes. Also the GC is reporting no free_bytes.
Correct me if I'm wrong, but doesn't the spawn in HTTP::Server#listen prevent the rescue in HTTP::Server::RequestProcessor#process from working?
@crisward what happens with --no-debug? That shouldn't have any impact unless the stacktrace is printed. We currently load and decode the DWARF sections (function names, file line numbers) once, then memoize them globally, in order to speed up further stacktrace prints (no re-load of the file, no re-decode of DWARF sections). Used memory shouldn't grow anymore after the first stacktrace is printed.
@ysbaddaden --no-debug doesn't seem to have an effect. However, there is something odd going on. Sporadically the process can run fine after being hit by wrk, but most of the time I get an effect similar to the above. Just to put this into context, my live server goes from 3MB to 60MB of heap after one or two broken pipes in succession. There is a lot more code running / databases etc., but I'm still surprised it can almost instantly grow 20x.
I dug around a bit in boehm over lunch, but didn't find anything obvious concerning freeing vis-à-vis atomic allocs. But it does try to take shortcuts to keep collection time down, doing partial collections sometimes. I can only speculate that some flag is set on kind-no-ptr when doing the atomic alloc which triggers a more detailed collect. But these speculations aren't helpful in _any_ way, so.. :-/
I believe there is an option in bdwgc as to whether it collects "as it goes" or not; it might be interesting to experiment with that as well, FWIW...
@ysbaddaden is there any way to free the memory used in memoizing the stack trace? Just wondering to what extent this is affecting heap allocations.
The stacktraces aren't memoized, and should be collected when the exception object is collected.
The decoded DWARF sections, used to produce human-friendly backtraces, are memoized, and won't be collected.
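The memoization described above can be sketched roughly like this (the names `decode_line_numbers` and the `Array(String)` payload are placeholders of mine, not the real stdlib internals):

```crystal
# Minimal sketch of the class-level memoization pattern: the expensive
# decode happens once, on the first stacktrace print, and the result is
# then kept in a class variable for the lifetime of the process.
struct CallStack
  @@dwarf_line_numbers : Array(String)? = nil

  def self.dwarf_line_numbers
    @@dwarf_line_numbers ||= decode_line_numbers
  end

  private def self.decode_line_numbers
    # Stands in for reading the binary and decoding its DWARF debug sections.
    ["placeholder line info"]
  end
end
```

Since the class variable is a GC root, the decoded data can never be collected unless something explicitly sets it back to nil.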
@ysbaddaden I think I've found the problem. @@dwarf_line_numbers seems to be what is eating over 60MB of RAM on my production server.
I patched CallStack with:

```crystal
struct CallStack
  def self.empty
    @@dwarf_line_numbers = nil
  end
end
```
I also created a method which causes an error:

```crystal
def trip
  1 / 0
end
```
I linked both of these to controller endpoints. If I trigger the exception, the memory of my app goes up from 1.2MB to around 60MB.
As soon as I call the empty method, the free bytes jump up to 58MB and then, shortly afterwards, the memory is returned to the OS, back down to 1.2MB of heap.
The @@dwarf_line_numbers must be a lot smaller on a Mac, as the memory only goes up to around 2MB. On Ubuntu without the release flag, it only goes up to 10MB, but the empty method still brings it back down to 1.5MB. Not sure why the release version is much larger.
The earlier suspicions of Broken pipe are unfounded; it just happened to be the only uncaught exception I was getting live.
I suppose the question that leaps to mind is: does @@dwarf_line_numbers have to be memoized?
60MB isn't significant. If compiled with --no-debug then it should use no memory at all.
Yes, memoization is required; otherwise it would require reopening and redecoding the debug sections every time an exception is printed, which is costly (both memory and CPU). This could be abused as an attack vector, too.
@crisward I'm amazed by your debugging skills 💯
@ysbaddaden what about cleaning up @@dwarf_line_numbers after a reasonable interval? I noticed there is already something that cleans up unused fibers in kernel.cr. Something similar could be done.
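A sketch of that idea, reusing the CallStack.empty patch from earlier in the thread (the 10-minute interval is an arbitrary assumption, not a proposed default):

```crystal
# Reopen CallStack to expose a way to drop the cache (same patch as above).
struct CallStack
  def self.empty
    @@dwarf_line_numbers = nil
  end
end

# Periodically drop the decoded DWARF cache. If another exception is printed
# later, the sections are simply re-decoded on demand.
spawn do
  loop do
    sleep 10.minutes
    CallStack.empty
  end
end
```

The trade-off is that a burst of exceptions re-pays the decode cost each interval, which is why making it opt-in at the application level seems sensible.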
The --no-debug option isn't easy for me to activate as I'm using the Heroku buildpack. Also, I'd quite like stack traces to be sent to Sentry; I just don't want applications which can suddenly use more memory than I expected, then hold onto it. The exception which triggered this only happens maybe once a day.
Crystal's low memory footprint is one of its benefits. We run lots of content-managed sites from a single server. The chance of them all having uncaught exceptions at the same time is very low. So being able to clear the memory used after a reasonable interval would be good. I can obviously do this at the application level, but it could be useful to other developers.
60MB is significant if you normally only use 1MB. Having said that, 60MB shouldn't be a problem. If it is a problem because you want to run as many processes as possible on a 512MB VM, compiling with --no-debug seems like a good solution.
Even if you clear the dwarf line numbers, having your application spike like that can be dangerous if memory availability is low.
In some ways, it may be better to assign @@dwarf_line_numbers on application launch. At least you know what you're dealing with, instead of working under the false assumption that you have a 1MB app running. Though I think if you're running lots of small apps, and the memory usage drops again after a minute, one or two of them spiking at once wouldn't be a problem. Also, I find it odd that different setups create vastly different memory loads for the same app.
Regarding "assign @@dwarf_line_numbers on application launch": that would affect programs with a short lifespan.
What about adding the patched method above? Perhaps call it CallStack.drop_exception_cache, then document what it does so app developers can decide if they want to use it?
I could see maybe some method to "never memoize" that people could call if
they want to keep RAM down and are OK with the cpu tradeoff...or would it
still spike to 60MB then return down even with something like that?
@ysbaddaden - attacking a swarm of micro apps expected to stay at a few MB each, running on one server, and spiking them all via requests, causing OOM kills and whatnot, could also be considered an attack vector though.
@crisward, as @ysbaddaden says: the --no-debug flag! If that's hard to flip in your deployment, then I think you're using the wrong distro/packager/framework/whatever setup!
GC and heap memory is only one side of the memory usage. What about the process memory? What about the stacks (each Fiber consumes 8MB for its stack)?
That being said, I don't understand how so much memory can be used; a hello world program allocates less than 6KB of memory. Maybe Debug::DWARF::LineNumbers retains some references that can't be freed (e.g. @io).
@ozra I use https://github.com/crystal-lang/heroku-buildpack-crystal (I have a fork with the updated Libgc and I've added --no-debug for now, but actually find stack traces useful)
@ysbaddaden I agree, I'll have a read through the code, if I uncover anything I'll report back.
I've created the same app, built for release, to run on a Mac, and the memory issue doesn't happen with stack traces. So whatever is causing this seems platform-specific. I know DWARF has a variety of formats, and there are a few platform checks within the CallStack code, so either of these may play a part. If the code was holding references to the io, it's either in one of the platform branches or isn't the culprit.
On a side note: @ysbaddaden - the stacks for fibers are mmap'ed, and the kernel maps those pages (which are 4KB for small ones on most systems) to physical memory on page faults, so most fibers should probably use _just 8KB of actual reserved physical memory_ for their stack (if I remember correctly, there's a page set off as a "protection zone" in the fiber stack alloc). Now, I might have missed something completely, or forgotten, about the alloc code...
```crystal
lib LibGC
  struct ProfStats
    heap_size : Word
    free_bytes : Word
    unmapped_bytes : Word
    bytes_since_gc : Word
    bytes_before_gc : Word
    non_gc_bytes : Word
    gc_no : Word
    markers_m1 : Word
    bytes_reclaimed_since_gc : Word
    reclaimed_bytes_before_gc : Word
  end

  fun get_prof_stats = GC_get_prof_stats(stats : ProfStats*, size : SizeT)
end

def gc_stats
  stats = Pointer(LibGC::ProfStats).malloc(1)
  LibGC.get_prof_stats(stats, sizeof(LibGC::ProfStats))
  stats.value
end
```
I've recently started using the code above to get a bit more info about the GC, for example the amount of times it runs.
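For example, reading the stats before and after a forced collection shows whether a cycle actually ran (field names follow the ProfStats binding above; this is just an illustrative snippet, not from the thread):

```crystal
# Compare bdwgc's collection counter across an explicit GC.collect.
before = gc_stats
GC.collect
after = gc_stats

puts "collections so far: #{after.gc_no} (+#{after.gc_no - before.gc_no})"
puts "heap: #{after.heap_size} bytes, free: #{after.free_bytes} bytes"
```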
My initial example seems quite stable around 2MB RAM used now with 0.30.1, thanks everyone! For followers: to adjust how often it collects, apparently you use the GC_free_space_divisor setting and possibly some others mentioned near there, in case anybody ever wants to go down that route... Also, apparently for programs that use a "lot of heap" there's a generational mode, GC_enable_incremental, that might be nice to add a method for sometime; it's said to be more responsive anyway (descriptions: https://github.com/ivmai/bdwgc/ ). Cheers!
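For reference, bindings for those two bdwgc entry points could look like the sketch below. GC_set_free_space_divisor and GC_enable_incremental are real bdwgc functions, but these Crystal fun declarations are my own sketch, not part of the stdlib's LibGC bindings:

```crystal
lib LibGC
  # A larger divisor makes bdwgc collect more often and keep the heap
  # smaller, trading CPU for memory (bdwgc's default divisor is 3).
  fun set_free_space_divisor = GC_set_free_space_divisor(value : Word)

  # Switches bdwgc into incremental/generational collection mode.
  fun enable_incremental = GC_enable_incremental
end

LibGC.set_free_space_divisor(10)
LibGC.enable_incremental
```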
Interestingly in my app on production, it tends to use 150MB of RAM, but if I have a GC loop:
```crystal
spawn do
  loop do
    sleep 1.0
    GC.collect
  end
end
```
it stays around 40MB. This isn't as bad as it once was, but might still be worth investigating at some point, at least to see whether the APIs mentioned earlier can keep the RAM down (and whether they affect performance). Leave a comment if anybody would like this reopened; if not, I might get to it "some decade or so" :)
This probably has been mentioned before in this thread (I didn't reread), but generally this is working as intended for most GCs.
Allocating memory from the OS is costly, so many GCs employ heuristics for whether to release memory back to the system or not. And the simplest metric for a heuristic like that is the number of GC cycles that the space stayed free, hence your loop doing what it does. So generally, most long-running workloads have spiky memory usage; think of a server application that processes and answers requests, then throws the state away, or a background-processing daemon, which is basically the same thing. Here there are clear performance advantages to avoiding the round trip to the OS memory allocator each time a request comes in. For short-lived programs it really doesn't matter to be good at releasing memory ASAP, as the program is going to terminate quickly anyway. So it's better to optimize the heuristic for the first case.
Yeah, those are good points. I need to think more about this. A couple more random ideas: perform a GC "on idle" (when the scheduler detects idle). Maybe add a flag "don't cache the DWARF line numbers ever" (if exceptions are rare enough and you can handle the RAM usage when you do hit one), or maybe even an option "don't add line information to exceptions ever" to avoid the RAM hit entirely (if you can't afford it at all), or "re-lookup PC locations one by one, so it never builds the large DWARF structure". Just some thoughts... :)
For followers, you can also control some GC functions using environment variables: https://github.com/ivmai/bdwgc/blob/master/doc/README.environment
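For instance, a few of the knobs documented there can be flipped without recompiling (variable names are from bdwgc's README.environment; ./server is a placeholder for your compiled binary):

```shell
# Log GC statistics on each collection and collect more aggressively
# (higher free-space divisor => smaller heap, more CPU spent collecting).
GC_PRINT_STATS=1 GC_FREE_SPACE_DIVISOR=10 ./server

# Start with a larger initial heap (in bytes) to avoid early collections.
GC_INITIAL_HEAP_SIZE=33554432 ./server
```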
