Truffleruby: Running the Optcarrot benchmark iteratively in the same VM results in a core dump

Created on 3 Jan 2018  路  16Comments  路  Source: oracle/truffleruby

Outline

  • Ubuntu Mate 16.04
  • Intel i7 870 - 4x 2.93GHz - 3.60 GHz
  • 16GB RAM

    Reproduction

  • Clone Optcarrot containing a looping function

  • Run ruby -S truffle-benchmark.rb with (an rbenv version of) TruffleRuby 0.30.2.

Error description

After ~ the 7th run, the JVM tries to dump its core. Exact message:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fea720e3ea6, pid=20200, tid=0x00007fea8b568700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_151-b12) (build 1.8.0_151-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.71-b01-internal-jvmci-0.38 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J 15644 JVMCI org.graalvm.compiler.truffle.OptimizedCallTarget.callRoot([Ljava/lang/Object;)Ljava/lang/Object; (53 bytes) @ 0x00007fea720e3ea6 [0x00007fea720e0c20+0x3286] (#run /home/michael/Projects/RubyCompare/RubySuite/optcarrot/lib/optcarrot/nes.rb:74)
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/michael/Projects/RubyCompare/RubySuite/optcarrot/hs_err_pid20200.log
Loaded disassembler from /home/michael/.rbenv/graalvm-0.30.2/jre/lib/amd64/hsdis-amd64.so
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
Aborted (core dumped)

The referenced log file can be found here. Version 0.29 does not seem to be affected of this issue.

All 16 comments

  • Run ruby -S truffle-benchmark.rb with (an rbenv version of) TruffleRuby 0.30.2.

The -S should not be needed here (but probably doesn't matter much).

Unfortunately I cannot reproduce this on Fedora 24.
Does it reproduce every time at you?

All 50 runs work successfully and it ends with:

truffle-benchmark.rb:5:in `method_missing': undefined method `start' for 50:Fixnum (NoMethodError)

Which is just that extra .start call.

I am running with the full path to bin/ruby: $ ../graalvm-0.30.2/bin/ruby -S truffle-benchmark.rb.

I see, the start call is just a fragment of the original extracted code. On my machine, the behavior has shown repeatable for ~20 times. Experiments with/without -S flag or full binpath are showing no effects.

Update: My Notebook does not run in any problems either (though using TR 0.30.1).. strange. Its runs the same Ubuntu with a slightly different Kernel and CPU. Don't know how to solve this on the desktop, though

I tried reproducing on a laptop and a server but couldn't on either.

Could you try with:

ruby --jvm.Dgraal.Dump=:2 --jvm.Dgraal.MethodFilter='*nes_rb:74*' -S truffle-benchmark.rb

It should dump some compiler graphs. You will see a message somewhere is the stdout. e.g.,:

...
Running for the 6th time
Dumping debug output in /home/gmdubosc/optcarrot/dumps/1515082214696
fps: 144.896338612011
...

If it doesn't reproduce with that command line, try removing the :2 after Dump=.

If you have a run that dumps graphs and reproduces, could you provide the hs_err file as well as the contents of the dump directory?

Thanks for investigating, @gilles-duboscq!

This is the cli-output when running the command you provided. Excluding :2 had no influence on the result.

...
Running for the 5th time
fps: 111.191726488053
checksum: 59662
Running for the 6th time
Dumping debug output in /home/michael/Projects/RubyCompare/Benches/optcarrot/dumps/1515083065092
fps: 51.849690467061
checksum: 59662
Running for the 7th time
fps: 123.352515793125
checksum: 59662
Running for the 8th time
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007ff0ca407066, pid=5138, tid=0x00007ff0e3ca6700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_151-b12) (build 1.8.0_151-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.71-b01-internal-jvmci-0.38 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J 15669 JVMCI org.graalvm.compiler.truffle.OptimizedCallTarget.callRoot([Ljava/lang/Object;)Ljava/lang/Object; (53 bytes) @ 0x00007ff0ca407066 [0x00007ff0ca403de0+0x3286] (#run /home/michael/Projects/RubyCompare/Benches/optcarrot/lib/optcarrot/nes.rb:74)
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/michael/Projects/RubyCompare/Benches/optcarrot/hs_err_pid5138.log
Loaded disassembler from /home/michael/.rbenv/graalvm-0.30.2/jre/lib/amd64/hsdis-amd64.so
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
Aborted (core dumped)

The files are attached to my comment:
1515083065092_dump.tar.gz
hs_err_pid5138.log

Thank you @Ichaelus ! The dump helped narrow down the part of the compiler that's doing the wrong transformation. I'm still looking into it.

Hi @Ichaelus, I have a potential fix which i could test against a unit test which reproduces something similar to your issue but it would help if you could confirm that it really fixes your issue.

To do so, could you replace graalvm-0.30.2/jre/lib/jvmci/graal.jar with the version from this archive:
graal.zip and run your benchmark?

I'm totally amazed! Can't stop grinning, since I tried to circumvent this error for a while now ;-)
Long story short - it works like a charm for me, thanks!

Thanks a lot for your help with diagnosing and testing this!

I'll start the review process and hopefully the fix should be in the next GraalVM version.

@eregon, i'm not sure what we should do with this issue here. Should it be closed since it's a Graal compiler issue and not a truffleruby issue? Or do you prefer to wait until there is a Graal version with the fix?

@gilles-duboscq and @davleopo Thanks a lot for the help!

I think it's best to wait until the fix is in Graal and then we can link it here, so we know the revision of the fix.

Does this fix include additional code as the graal.jar you posted earlier on?
Using the JAR, a repeated call of

Optcarrot::NES.new(["--benchmark", "examples/Lan_Master.nes"]).run

keeps delivering results, but the VM slows down enormously at ~ 220 calls.

optcarrot_runtime_over_total_time

(x-axis: Seconds, y-axis: FPS)

I know this is low-priority bug, but I felt like you could want to know about it.

Michael

The JAR i provided earlier was built by applying the same fix locally on the 0.30.2 branch.
Do you know if this performance issue after ~220 calls also happens on earlier GraalVM versions?
I'm guessing this is an other problem we need to look into.

Ok. I'll try generate a 0.28 vs. 0.30.2 report in the next couple of days.

@Ichaelus Could you open a separate bug for the slowdown?
BTW, I think this is a fairly uncommon way to run OptCarrot, notably it runs the NES startup sequence every time. But nevertheless this slowdown is worth investigating.

Ok we're at #959 now.
Yeah I'm aware of that, but I found it an interesting alternative :-)

Was this page helpful?
0 / 5 - 0 ratings

Related issues

deepj picture deepj  路  6Comments

deepj picture deepj  路  5Comments

wildmaples picture wildmaples  路  6Comments

deepj picture deepj  路  5Comments

jaime-rave-conichi picture jaime-rave-conichi  路  4Comments