I've been testing a Mandelbrot benchmark for Sulong that splits the work between Java and C: Java mainly handles threading and some looping, and C handles the calculation. While working on this, I found that Sulong is slow to iterate through Java arrays. Benchmarking with JMH shows the C versions are several orders of magnitude slower than the Java ones:
[info] Benchmark Mode Cnt Score Error Units
[info] MandelbrotC.incrementValuesC avgt 5 239.103 ± 5.537 ms/op
[info] MandelbrotC.incrementValuesJ avgt 5 0.017 ± 0.002 ms/op
[info] MandelbrotC.initValuesC avgt 5 101.384 ± 20.213 ms/op
[info] MandelbrotC.initValuesJ avgt 5 0.040 ± 0.035 ms/op
[info] MandelbrotC.stepThroughValuesC avgt 5 155.466 ± 149.067 ms/op
[info] MandelbrotC.stepThroughValuesJ avgt 5 0.019 ± 0.012 ms/op
Here's the benchmark code:
// C path: pass the JVM array to the C initValues function.
@Benchmark
@BenchmarkMode(Array(Mode.AverageTime)) @OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(time = 15, iterations = 10, timeUnit = TimeUnit.SECONDS)
def initValuesC(ms: MandelbrotState): Unit = {
  val doubleArray = Array.ofDim[Double](16000)
  ms.contexts(0).initValues(doubleArray, 16000)
}

// JVM baseline: the same loop as the C initValues, run directly on the array.
@Benchmark
@BenchmarkMode(Array(Mode.AverageTime)) @OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(time = 15, iterations = 10, timeUnit = TimeUnit.SECONDS)
def initValuesJ(blackhole: Blackhole): Unit = {
  val wid_ht = 16000
  val i0 = Array.ofDim[Double](wid_ht)
  var xy = 0
  while (xy < wid_ht) {
    i0(xy) = 2.0 / wid_ht * xy - 1.0
    i0(xy + 1) = 2.0 / wid_ht * (xy + 1) - 1.0
    xy += 2
  }
  blackhole.consume(i0)
}

// JVM baseline: read every element and store it in a field of the state object.
@Benchmark
@BenchmarkMode(Array(Mode.AverageTime)) @OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(time = 15, iterations = 10, timeUnit = TimeUnit.SECONDS)
def stepThroughValuesJ(mandelbrotState: MandelbrotState): Unit = {
  val wid_ht = 16000
  val values = Array.ofDim[Double](16000)
  var i = 0
  while (i < wid_ht) {
    mandelbrotState.blkhole = values(i)
    i += 1
  }
}

// C path: the read loop runs in the C stepThroughValues function.
@Benchmark
@BenchmarkMode(Array(Mode.AverageTime)) @OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(time = 15, iterations = 10, timeUnit = TimeUnit.SECONDS)
def stepThroughValuesC(mandelbrotState: MandelbrotState): Unit = {
  val wid_ht = 16000
  val values = Array.ofDim[Double](wid_ht)
  mandelbrotState.contexts(0).stepThroughValues.executeVoid(values, wid_ht.asInstanceOf[Object])
}

// JVM baseline: add 1 to every element in place.
@Benchmark
@BenchmarkMode(Array(Mode.AverageTime)) @OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(time = 15, iterations = 10, timeUnit = TimeUnit.SECONDS)
def incrementValuesJ(blackhole: Blackhole): Unit = {
  val wid_ht = 16000
  val values = Array.ofDim[Double](16000)
  var i = 0
  while (i < wid_ht) {
    values(i) += 1
    i += 1
  }
  blackhole.consume(values)
}

// C path: the increment loop runs in the C incrementValues function.
@Benchmark
@BenchmarkMode(Array(Mode.AverageTime)) @OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(time = 15, iterations = 10, timeUnit = TimeUnit.SECONDS)
def incrementValuesC(mandelbrotState: MandelbrotState): Unit = {
  val wid_ht = 16000
  val values = Array.ofDim[Double](wid_ht)
  mandelbrotState.contexts(0).incrementValues.executeVoid(values, wid_ht.asInstanceOf[Object])
}
And the C code:
void initValues(double* i0, long wid_ht) {
    for (long xy = 0; xy < wid_ht; xy += 2) {
        i0[xy] = 2.0 / wid_ht * xy - 1.0;
        i0[xy + 1] = 2.0 / wid_ht * (xy + 1) - 1.0;
    }
}

double blkhole;

void stepThroughValues(double* i0, long wid_ht) {
    for (long xy = 0; xy < wid_ht; xy++) {
        blkhole = i0[xy];
    }
}

void incrementValues(double* i0, long wid_ht) {
    for (long xy = 0; xy < wid_ht; xy++) {
        i0[xy] += 1;
    }
}
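For anyone who wants to sanity-check the three kernels outside of Sulong first, here is a minimal sketch that runs them on a plain C heap array and counts wrong results. The kernels are copied from above; `closeTo` and `checkKernels` are hypothetical helpers that are not part of the benchmark repository, and the sketch assumes an ordinary native C compiler rather than the Sulong toolchain.

```c
#include <stdlib.h>

/* Kernels copied from the issue. */
void initValues(double* i0, long wid_ht) {
    for (long xy = 0; xy < wid_ht; xy += 2) {
        i0[xy] = 2.0 / wid_ht * xy - 1.0;
        i0[xy + 1] = 2.0 / wid_ht * (xy + 1) - 1.0;
    }
}

double blkhole;

void stepThroughValues(double* i0, long wid_ht) {
    for (long xy = 0; xy < wid_ht; xy++) {
        blkhole = i0[xy];
    }
}

void incrementValues(double* i0, long wid_ht) {
    for (long xy = 0; xy < wid_ht; xy++) {
        i0[xy] += 1;
    }
}

/* Hypothetical helper: compare doubles with a small absolute tolerance. */
static int closeTo(double a, double b) {
    double d = a - b;
    return (d < 0 ? -d : d) < 1e-12;
}

/* Hypothetical helper: runs all three kernels on a native array and returns
 * the number of mismatches (0 means every check passed), or -1 if the
 * allocation fails. wid_ht must be positive and even, because initValues
 * writes two elements per iteration. */
int checkKernels(long wid_ht) {
    double* values = (double*)calloc((size_t)wid_ht, sizeof *values);
    if (values == NULL) {
        return -1;
    }
    int mismatches = 0;

    /* After initValues, element i should be 2/wid_ht * i - 1. */
    initValues(values, wid_ht);
    for (long i = 0; i < wid_ht; i++) {
        if (!closeTo(values[i], 2.0 / wid_ht * i - 1.0)) mismatches++;
    }

    /* After incrementValues, the "- 1" is cancelled out. */
    incrementValues(values, wid_ht);
    for (long i = 0; i < wid_ht; i++) {
        if (!closeTo(values[i], 2.0 / wid_ht * i)) mismatches++;
    }

    /* stepThroughValues leaves the last element in blkhole. */
    stepThroughValues(values, wid_ht);
    if (!closeTo(blkhole, values[wid_ht - 1])) mismatches++;

    free(values);
    return mismatches;
}
```

Calling `checkKernels(16000)` from a small `main` of your own should return 0 when the kernels behave as intended. That only confirms the C code is correct natively; it says nothing about how fast the same code runs on a Java array through Sulong.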
You can run the benchmarks I wrote from this version of my benchmark repository: https://github.com/markehammons/languageshootout-jmh/commit/336523efd1591cd598df02315185e6deb13f9348
You will need the environment variable GRAALVM_HOME set to the location of your GraalVM installation. Then run sbt irCompile followed by sbt "jmh:run MandelbrotC.*Values"
Thanks for the report!
This is actually two separate issues, one in Sulong, the other in the Java interop code.
The first issue is already fixed (https://github.com/graalvm/sulong/commit/abadb1c796d766733b8d95df1b86e76ae36dce56). That fix removes the biggest chunk of the performance difference:
[info] Benchmark Mode Cnt Score Error Units
[info] MandelbrotC.incrementValuesC avgt 5 1.543 ± 0.249 ms/op
[info] MandelbrotC.incrementValuesJ avgt 5 0.022 ± 0.003 ms/op
[info] MandelbrotC.initValuesC avgt 5 0.523 ± 0.029 ms/op
[info] MandelbrotC.initValuesJ avgt 5 0.068 ± 0.006 ms/op
[info] MandelbrotC.stepThroughValuesC avgt 5 0.979 ± 0.013 ms/op
[info] MandelbrotC.stepThroughValuesJ avgt 5 0.016 ± 0.001 ms/op
I'm currently working on the second issue in the Java interop code. I'm confident we can get the C performance very close, if not equal, to the Java performance.
@rschatz Great! I can't wait to see the results! Thanks for the quick work!
The second issue should be fixed in https://github.com/oracle/graal/commit/dc37c9a244fc97a92b7b37f716e484cb4b7ee5ed
Looking a lot better already ;)
[info] Benchmark Mode Cnt Score Error Units
[info] MandelbrotC.incrementValuesC avgt 5 0.031 ± 0.006 ms/op
[info] MandelbrotC.incrementValuesJ avgt 5 0.021 ± 0.001 ms/op
[info] MandelbrotC.initValuesC avgt 5 0.068 ± 0.001 ms/op
[info] MandelbrotC.initValuesJ avgt 5 0.068 ± 0.003 ms/op
[info] MandelbrotC.stepThroughValuesC avgt 5 0.023 ± 0.001 ms/op
[info] MandelbrotC.stepThroughValuesJ avgt 5 0.016 ± 0.001 ms/op
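For a sense of scale, the C-to-Java ratios implied by the three result tables in this thread can be worked out with a few lines of C. This is only illustrative arithmetic on the reported scores: it ignores the error bars (which are large in the first table), and `Row`, `slowdown`, and `printSlowdowns` are made-up names, not part of the benchmark.

```c
#include <stdio.h>

/* Scores in ms/op copied from the three JMH tables in this thread, in order:
 * original report, after the Sulong fix, after the Java interop fix. */
struct Row {
    const char* name;
    double c_ms[3];
    double java_ms[3];
};

static const struct Row rows[] = {
    {"incrementValues",   {239.103, 1.543, 0.031}, {0.017, 0.022, 0.021}},
    {"initValues",        {101.384, 0.523, 0.068}, {0.040, 0.068, 0.068}},
    {"stepThroughValues", {155.466, 0.979, 0.023}, {0.019, 0.016, 0.016}},
};

/* How many times slower the C path is than the Java path. */
double slowdown(double c_ms, double java_ms) {
    return c_ms / java_ms;
}

/* Prints one line per benchmark with the ratio for each of the three tables. */
void printSlowdowns(void) {
    for (size_t r = 0; r < sizeof rows / sizeof rows[0]; r++) {
        printf("%-18s", rows[r].name);
        for (int stage = 0; stage < 3; stage++) {
            printf(" %10.1fx", slowdown(rows[r].c_ms[stage], rows[r].java_ms[stage]));
        }
        printf("\n");
    }
}
```

For incrementValues this works out to roughly 14,000x in the original report, about 70x after the Sulong fix, and about 1.5x after the interop fix; initValues ends up at parity.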
Looking forward to testing it. I'm guessing this will be put into graalvm-ce-1.0.0-rc4?
I guess in theory I could build GraalVM myself.
It will be in the release targeted for the beginning of August.