Truffleruby: Repeatedly initializing a new OptCarrot-instance breaks the VM

Created on 10 Jan 2018 · 15 Comments · Source: oracle/truffleruby

This is a follow-up to issue #919.

Using the JAR [of issue #919], a repeated call of

Optcarrot::NES.new(["--benchmark", "examples/Lan_Master.nes"]).run

keeps delivering results, but the VM slows down enormously after about 220 calls.

[Figure: optcarrot_runtime_over_total_time (x-axis: seconds, y-axis: FPS)]

bug

All 15 comments

This looks like a memory leak.
Running it locally shows RSS steadily increasing and never decreasing.
It probably starts swapping on your machine after a few hundred iterations or so.

Or alternatively, the max heap limit is reached and it starts GC-ing a lot.

[Figure: optcarrot_memory_over_total_time (x-axis: seconds, y-axis: memory in MB)]

The max heap is definitely reached, since I'm using the flags --jvm.Xmn512m --jvm.Xms1024m --jvm.Xmx2048m -J-Dgraal.TraceTruffleCompilation=false --jvm.server. The process actually exceeds the 2GB limit.

Interesting, and it seems other implementations also leak in this configuration, but not as fast.
Could you zoom in on MRI (excluding JRuby/TruffleRuby) to see what the curve looks like there?

[Figure: optcarrot_memory_over_total_time]

It seems to me that only MRI 1.8.7 is affected: it shows a steady increase. The GC has been changed several times since then, though.

Locally, MRI 2.5.0 also shows stable memory usage, around 86MB and 704k GC.stat[:heap_live_slots].

It looks like the Fibers created in the benchmark (one per .run) are leaking.
We currently implement Fibers with Java threads and it seems those threads are never GC'd.

Ah, and these Fibers never terminate (PPU#run has an infinite loop, no break or return).
So we need that fancy trick of killing the Fiber if it is no longer referenced by anything.
That's currently not implemented and very likely the fix for this leak.
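To make the "never terminate" point concrete, here is a minimal sketch (not Optcarrot's actual code; `ppu_like` is a hypothetical name) of a Fiber whose body loops forever, as PPU#run is described to do. After one resume it sits suspended inside the loop, still alive, and nothing will ever finish it unless the runtime kills it.

```ruby
begin
  require 'fiber' # Fiber#alive? needs this on older Rubies; it is built in on newer ones
rescue LoadError
  # Nothing to do: alive? is already available.
end

# Hypothetical stand-in for a Fiber whose body never returns.
ppu_like = Fiber.new do
  loop { Fiber.yield :frame } # infinite loop, no break or return
end

first = ppu_like.resume        # runs until the first Fiber.yield
still_alive = ppu_like.alive?  # suspended inside the loop, so still alive
puts "yielded #{first.inspect}, alive: #{still_alive}"
```

Dropping the last reference to such a Fiber does not finish it; on an implementation that backs each Fiber with a thread, that thread stays around too.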

A very simple reproducer for this is:

loop { Fiber.new {} }
$ jt ruby -J-Xmx2g -J-verbose:gc -e 'loop { Fiber.new {} }'
[GC (Allocation Failure)  64512K->11119K(245760K), 0.0126776 secs]
[GC (Allocation Failure)  75631K->17020K(245760K), 0.0244291 secs]
[GC (Allocation Failure)  81532K->19169K(245760K), 0.0132140 secs]
[GC (Allocation Failure)  83681K->40904K(310272K), 0.0380130 secs]
[ruby] WARNING OutOfMemoryError
-e:1:in `initialize': failed to allocate memory (NoMemoryError)

These are some great insights, you could be hired as a detective! :wink:

It got fairly obvious once I put the heap dump in the Eclipse Memory Analyzer:
[Figure: leak]
(PolyglotThread is just a subclass of java.lang.Thread)

Running

$ jruby -v
jruby 9.1.13.0 (2.3.3) 2017-09-06 8e1c115 OpenJDK 64-Bit Server VM 25.121-b14 on 1.8.0_121-b14 +jit [linux-x86_64]

# Some Fibers are killed when they become unreachable:
$ jruby -e 'i=0; loop { p i+=1; Fiber.new {begin;Fiber.yield;ensure;$stderr.puts :bye; exit! 2; end}.resume;  }'
1
2
...
50
51
bye
bye
bye
byebye

$ jruby -J-verbose:gc -d -e 'i=0; loop { p i+=1; Fiber.new {}; }'
...
12207
12208
12209
[GC (System.gc())  71262K->56294K(225280K), 0.0434084 secs]
[Full GC (System.gc())  56294K->55475K(225280K), 0.6966010 secs]
Error: Your application demanded too many live threads, perhaps for Fiber or Enumerator.
Ensure your old Fibers and Enumerators are being cleaned up.
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:714)
    at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
    at org.jruby.ext.fiber.ThreadFiber.createThread(ThreadFiber.java:253)

So even though JRuby handles killing Fibers when they become unreachable, it doesn't seem to be enough to sustain such a tight loop of creating Fibers. JRuby tries a last GC before giving up, but that doesn't seem to clear the Fibers (maybe there is not enough time after the GC finishes to let the killed Fibers release their Java threads?).

Another problem is that, since Fibers are implemented with Java threads, the only way to kill them is by throwing an exception into the Fiber. Throwing an exception executes Ruby code in ensure clauses.
Doing this when the Fiber becomes unreachable will happen concurrently with another Fiber in the same Ruby Thread, which breaks the fundamental assumption that Fibers of the same Ruby Thread do not execute concurrently and only pass control to each other explicitly.
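The claim that killing from the outside runs Ruby code can be seen with a plain Ruby Thread, used here only as a stand-in for a thread-backed Fiber (this is an illustrative assumption, not TruffleRuby's internal mechanism; `worker`, `ready` and `log` are hypothetical names). The ensure clause runs as part of the kill, and in the Fiber case that code would run while another Fiber of the same Ruby Thread is also running.

```ruby
require 'thread' # Queue lives here on older Rubies; harmless on newer ones

log = []
ready = Queue.new

worker = Thread.new do
  begin
    ready << :started  # signal that we are inside the begin block
    sleep              # block forever, like a suspended Fiber
  ensure
    log << :ensure_ran # Ruby code that runs as a side effect of being killed
  end
end

ready.pop    # wait until the worker is inside the begin block
worker.kill  # terminate it from the outside
worker.join  # wait for the ensure clause to finish
puts log.inspect
```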

MRI does not have that problem as they can deallocate the native coroutine stack without throwing an exception.

So in summary I'm not too fond of this feature, and it seems portable Ruby programs should properly finish their Fibers and not expect them to be magically GC'd, as that only works well with native coroutines.
For Optcarrot, that could be done by quitting the main_loop based on the return value of Fiber.yield.
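A minimal sketch of that idea, assuming a hypothetical emulator-style loop (this is not Optcarrot's actual main_loop; `emulation`, `frames` and the `:stop` symbol are made up for illustration). The value passed to resume becomes the return value of Fiber.yield inside the Fiber, so the caller can ask the loop to end and the Fiber finishes normally.

```ruby
begin
  require 'fiber' # Fiber#alive? needs this on older Rubies; it is built in on newer ones
rescue LoadError
  # Nothing to do: alive? is already available.
end

frames = []

# Hypothetical emulation loop that can be asked to stop.
emulation = Fiber.new do
  frame = 0
  loop do
    frame += 1
    frames << frame
    # Whatever the caller passes to the next #resume is returned here.
    command = Fiber.yield frame
    break if command == :stop
  end
  :finished # the block returns, so the Fiber terminates normally
end

emulation.resume                  # runs frame 1
emulation.resume                  # runs frame 2
result = emulation.resume(:stop)  # asks the loop to quit
finished = !emulation.alive?
puts "result: #{result.inspect}, frames: #{frames.inspect}, finished: #{finished}"
```

Because the Fiber's block returns, nothing has to be killed or collected later: the Fiber is dead as soon as the last resume returns.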

Another question in this issue is why we consume so much memory per leaked Fiber.

> Doing this when the Fiber becomes unreachable will happen concurrently with another Fiber in the same Ruby Thread, which breaks the fundamental assumption that Fibers of the same Ruby Thread do not execute concurrently and only pass control to each other explicitly.

One possibility might be to pause the current Fiber while throwing the exception into the Fiber to be recycled. But this still breaks the guarantee that a Fiber only yields explicitly.
Or we could throw the exception that finishes the Fiber to be recycled the next time the current Fiber yields, but that might never happen.
Although if Fibers never yield, there are likely not many Fibers. Still, one could create a thousand Fibers, resume them all once, then no longer reference them and keep running on the main Fiber forever.

I think the proper fix in this case is really to fix OptCarrot, if running the benchmark this way should be supported.
@Ichaelus If you are still interested in running the benchmark this way, could you make a pull request to OptCarrot to quit the emulation loop nicely, allowing prompt disposal of the Fibers and other resources?

I'll close this issue, because I think the real solution is changing OptCarrot to better support this mode of running the benchmark, and there doesn't seem to be a satisfying solution that doesn't seriously break Fiber semantics. Moreover, all solutions proposed so far seem very brittle, as they rely on GC and would need non-trivial changes.

One actual solution might be to use native coroutines (for example, from OpenJDK project Loom), but this might take a while.
