The attached microbenchmark shows an Escape Analysis case where GraalVM 1.0.0-rc12 on Linux x86 falls short of the classic C2 compiler.
The benchmark shows that a local array is kept off the heap when the compiler detects that it does not escape, but the array must be below a certain size to be eligible.
The good news: comparing the two array65 runs, Graal comes out ahead. With the smaller arrays, however, the expected optimization does not kick in under Graal, and C2 wins by a wide margin.
Feedback appreciated.
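For context, the optimization only applies when the compiler can prove the array never leaves the method. A minimal contrast, illustrative only and not part of the benchmark:

static int[] escaping()
{
    int[] a = new int[63];
    a[0] = 42;
    return a; // the array escapes via the return value and must live on the heap
}

static int nonEscaping()
{
    int[] a = new int[63];
    a[0] = 42;
    return a[0]; // only a value escapes; the array itself is eligible for elimination
}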
Options: -Xms2g -Xmx2g -XX:+AlwaysPreTouch -XX:+UseJVMCICompiler
Benchmark                                    Mode  Cnt     Score  Error   Units
EscapeAnalysis.array63                       avgt    2    47.312          ns/op
EscapeAnalysis.array63:·gc.alloc.rate        avgt    2  4978.118         MB/sec
EscapeAnalysis.array63:·gc.alloc.rate.norm   avgt    2   272.247           B/op
EscapeAnalysis.array63:·gc.count             avgt    2   101.000         counts
EscapeAnalysis.array64                       avgt    2    47.687          ns/op
EscapeAnalysis.array64:·gc.alloc.rate        avgt    2  4942.030         MB/sec
EscapeAnalysis.array64:·gc.alloc.rate.norm   avgt    2   272.211           B/op
EscapeAnalysis.array64:·gc.count             avgt    2   100.000         counts
EscapeAnalysis.array65                       avgt    2    47.883          ns/op
EscapeAnalysis.array65:·gc.alloc.rate        avgt    2  5062.994         MB/sec
EscapeAnalysis.array65:·gc.alloc.rate.norm   avgt    2   280.170           B/op
EscapeAnalysis.array65:·gc.count             avgt    2   102.000         counts
Under Graal, the arrays of size 63 and 64 are not kept off the heap, as the comparison with C2 below shows.
Options: -Xms2g -Xmx2g -XX:+AlwaysPreTouch -XX:-UseJVMCICompiler
Benchmark                                    Mode  Cnt     Score  Error   Units
EscapeAnalysis.array63                       avgt    2    24.803          ns/op
EscapeAnalysis.array63:·gc.alloc.rate        avgt    2    ≈ 10⁻⁴         MB/sec
EscapeAnalysis.array63:·gc.alloc.rate.norm   avgt    2    ≈ 10⁻⁶           B/op
EscapeAnalysis.array63:·gc.count             avgt    2       ≈ 0         counts
EscapeAnalysis.array64                       avgt    2    24.427          ns/op
EscapeAnalysis.array64:·gc.alloc.rate        avgt    2    ≈ 10⁻⁴         MB/sec
EscapeAnalysis.array64:·gc.alloc.rate.norm   avgt    2    ≈ 10⁻⁶           B/op
EscapeAnalysis.array64:·gc.count             avgt    2       ≈ 0         counts
EscapeAnalysis.array65                       avgt    2    52.076          ns/op
EscapeAnalysis.array65:·gc.alloc.rate        avgt    2  4658.371         MB/sec
EscapeAnalysis.array65:·gc.alloc.rate.norm   avgt    2   280.000           B/op
EscapeAnalysis.array65:·gc.count             avgt    2    94.000         counts
C2 keeps the arrays off the heap for sizes 63 and 64.
package org.sample;

import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 2, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 2, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(1)
public class EscapeAnalysis
{
    // random values keep the compiler from constant-folding the array contents
    final Random r = new Random();

    @Benchmark
    public long array63()
    {
        int[] a = new int[63]; // below C2's 64-element limit
        a[0] = r.nextInt();
        a[1] = r.nextInt();
        return a[0] + a[1];
    }

    @Benchmark
    public long array64()
    {
        int[] a = new int[64]; // exactly at C2's limit
        a[0] = r.nextInt();
        a[1] = r.nextInt();
        return a[0] + a[1];
    }

    @Benchmark
    public long array65()
    {
        int[] a = new int[65]; // above the limit, always heap-allocated
        a[0] = r.nextInt();
        a[1] = r.nextInt();
        return a[0] + a[1];
    }
}
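For reference, the ·gc.* rows above come from JMH's GC profiler. Assuming a standard JMH uberjar (the jar path below is a placeholder), a run along these lines should reproduce them:

java -Xms2g -Xmx2g -XX:+AlwaysPreTouch -XX:+UseJVMCICompiler -jar target/benchmarks.jar EscapeAnalysis -prof gc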
@lukasstadler can you think of a reason Graal's partial escape analysis (PEA) is not virtualizing the shorter arrays?
See https://github.com/openjdk-mirror/jdk7u-hotspot/blob/master/src/share/vm/opto/c2_globals.hpp#L459: C2 allows up to 64 elements in scalar-replaced arrays, whereas the limit is configured to 32 in Graal: https://github.com/oracle/graal/blob/master/compiler/src/org.graalvm.compiler.core.common/src/org/graalvm/compiler/core/common/GraalOptions.java#L87
It would be a simple change, but I haven't seen real-world cases so far that benefit from this.
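For anyone who wants to test the threshold hypothesis without patching either compiler, both limits should be tunable from the command line (flag names taken from the sources linked above; the exact mechanism may vary by release):

# C2: array size limit for scalar replacement, default 64
java -XX:EliminateAllocationArraySizeLimit=32 -jar target/benchmarks.jar EscapeAnalysis
# Graal: raise its limit to C2's default of 64
java -XX:+UseJVMCICompiler -Dgraal.MaximumEscapeAnalysisArrayLength=64 -jar target/benchmarks.jar EscapeAnalysis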
It would be good to keep such settings aligned to avoid simple performance regressions like this one. Of course, if there is a good reason for 32, I am all for it; otherwise we could go with the same number, provided it does not hurt.
We should be able to handle 128 elements without compiler footprint problems. The likely use case is patterns arising from fully unrolled loops in crypto or similar math code.
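To make that use case concrete, here is an illustrative sketch (not code from this issue) of a SHA-256-style message schedule: the 64-element int[] is a pure temporary, and once the fixed-bound loops are fully unrolled every index is a constant, which is exactly the shape escape analysis can virtualize, provided the size limit covers 64 elements:

// SHA-256-style message schedule expansion; "block" holds 16 words.
// The 64-element w[] never escapes, so with an EA array limit of 64
// or more (and full unrolling) the allocation can be eliminated.
static int scheduleSum(int[] block)
{
    int[] w = new int[64];
    for (int t = 0; t < 16; t++)
        w[t] = block[t];
    for (int t = 16; t < 64; t++)
    {
        int s0 = Integer.rotateRight(w[t - 15], 7) ^ Integer.rotateRight(w[t - 15], 18) ^ (w[t - 15] >>> 3);
        int s1 = Integer.rotateRight(w[t - 2], 17) ^ Integer.rotateRight(w[t - 2], 19) ^ (w[t - 2] >>> 10);
        w[t] = w[t - 16] + s0 + w[t - 7] + s1;
    }
    int sum = 0;
    for (int t = 0; t < 64; t++)
        sum += w[t];
    return sum;
}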
This should be fixed by 8b51efbeedf1c7529fd070daccbece0a61584e4b.