The attached microbenchmark shows an Escape Analysis case where GraalVM 1.0.0-rc12 on Linux x86 falls short of the classic C2 compiler.
The benchmark shows that a local array is kept off the heap when the compiler detects that it does not escape, but the array must be below a certain size to be eligible.
The good news: comparing the two array65 runs, Graal comes out ahead. With the smaller arrays, however, the expected optimization does not kick in under Graal, and C2 wins by a wide margin.
Feedback appreciated.
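For context, the optimization only applies when the compiler can prove the array never leaves the method. A minimal contrast, illustrative only and not part of the benchmark:

static int[] escaping()
{
    int[] a = new int[63];
    a[0] = 42;
    return a; // the array escapes via the return value and must live on the heap
}

static int nonEscaping()
{
    int[] a = new int[63];
    a[0] = 42;
    return a[0]; // only a value escapes; the array itself is eligible for elimination
}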
Options: -Xms2g -Xmx2g -XX:+AlwaysPreTouch -XX:+UseJVMCICompiler
Benchmark                                    Mode  Cnt     Score  Error   Units
EscapeAnalysis.array63                       avgt    2    47.312          ns/op
EscapeAnalysis.array63:·gc.alloc.rate        avgt    2  4978.118         MB/sec
EscapeAnalysis.array63:·gc.alloc.rate.norm   avgt    2   272.247           B/op
EscapeAnalysis.array63:·gc.count             avgt    2   101.000         counts
EscapeAnalysis.array64                       avgt    2    47.687          ns/op
EscapeAnalysis.array64:·gc.alloc.rate        avgt    2  4942.030         MB/sec
EscapeAnalysis.array64:·gc.alloc.rate.norm   avgt    2   272.211           B/op
EscapeAnalysis.array64:·gc.count             avgt    2   100.000         counts
EscapeAnalysis.array65                       avgt    2    47.883          ns/op
EscapeAnalysis.array65:·gc.alloc.rate        avgt    2  5062.994         MB/sec
EscapeAnalysis.array65:·gc.alloc.rate.norm   avgt    2   280.170           B/op
EscapeAnalysis.array65:·gc.count             avgt    2   102.000         counts
Under Graal, the arrays of size 63 and 64 are not kept off the heap, as the comparison with C2 below shows.
Options: -Xms2g -Xmx2g -XX:+AlwaysPreTouch -XX:-UseJVMCICompiler
Benchmark                                    Mode  Cnt     Score  Error   Units
EscapeAnalysis.array63                       avgt    2    24.803          ns/op
EscapeAnalysis.array63:·gc.alloc.rate        avgt    2    ≈ 10⁻⁴         MB/sec
EscapeAnalysis.array63:·gc.alloc.rate.norm   avgt    2    ≈ 10⁻⁶           B/op
EscapeAnalysis.array63:·gc.count             avgt    2       ≈ 0         counts
EscapeAnalysis.array64                       avgt    2    24.427          ns/op
EscapeAnalysis.array64:·gc.alloc.rate        avgt    2    ≈ 10⁻⁴         MB/sec
EscapeAnalysis.array64:·gc.alloc.rate.norm   avgt    2    ≈ 10⁻⁶           B/op
EscapeAnalysis.array64:·gc.count             avgt    2       ≈ 0         counts
EscapeAnalysis.array65                       avgt    2    52.076          ns/op
EscapeAnalysis.array65:·gc.alloc.rate        avgt    2  4658.371         MB/sec
EscapeAnalysis.array65:·gc.alloc.rate.norm   avgt    2   280.000           B/op
EscapeAnalysis.array65:·gc.count             avgt    2    94.000         counts
C2 keeps the arrays off the heap for sizes 63 and 64.
package org.sample;

import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 2, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 2, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(1)
public class EscapeAnalysis
{
    // random values keep the compiler from constant-folding the array contents
    final Random r = new Random();

    @Benchmark
    public long array63()
    {
        int[] a = new int[63]; // below C2's 64-element limit
        a[0] = r.nextInt();
        a[1] = r.nextInt();
        return a[0] + a[1];
    }

    @Benchmark
    public long array64()
    {
        int[] a = new int[64]; // exactly at C2's limit
        a[0] = r.nextInt();
        a[1] = r.nextInt();
        return a[0] + a[1];
    }

    @Benchmark
    public long array65()
    {
        int[] a = new int[65]; // above the limit, always heap-allocated
        a[0] = r.nextInt();
        a[1] = r.nextInt();
        return a[0] + a[1];
    }
}
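For reference, the ·gc.* rows above come from JMH's GC profiler. Assuming a standard JMH uberjar (the jar path below is a placeholder), a run along these lines should reproduce them:

java -Xms2g -Xmx2g -XX:+AlwaysPreTouch -XX:+UseJVMCICompiler -jar target/benchmarks.jar EscapeAnalysis -prof gc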
@lukasstadler can you think of a reason Graal's partial escape analysis (PEA) is not virtualizing the shorter arrays?
See https://github.com/openjdk-mirror/jdk7u-hotspot/blob/master/src/share/vm/opto/c2_globals.hpp#L459: C2 allows up to 64 elements in scalar-replaced arrays, whereas the limit is configured to 32 in Graal: https://github.com/oracle/graal/blob/master/compiler/src/org.graalvm.compiler.core.common/src/org/graalvm/compiler/core/common/GraalOptions.java#L87
It would be a simple change, but I haven't seen real-world cases so far that benefit from this.
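For anyone who wants to test the threshold hypothesis without patching either compiler, both limits should be tunable from the command line (flag names taken from the sources linked above; the exact mechanism may vary by release):

# C2: array size limit for scalar replacement, default 64
java -XX:EliminateAllocationArraySizeLimit=32 -jar target/benchmarks.jar EscapeAnalysis
# Graal: raise its limit to C2's default of 64
java -XX:+UseJVMCICompiler -Dgraal.MaximumEscapeAnalysisArrayLength=64 -jar target/benchmarks.jar EscapeAnalysis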
It would be good to keep such settings aligned to avoid simple performance regressions like this one. Of course, if there is a good reason for 32, I am all for it; otherwise we could go with the same number, provided it does not hurt.
We should be able to handle 128 elements without compiler footprint problems. The likely use case is patterns arising from fully unrolled loops in crypto or similar math code.
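To make that use case concrete, here is an illustrative sketch (not code from this issue) of a SHA-256-style message schedule: the 64-element int[] is a pure temporary, and once the fixed-bound loops are fully unrolled every index is a constant, which is exactly the shape escape analysis can virtualize, provided the size limit covers 64 elements:

// SHA-256-style message schedule expansion; "block" holds 16 words.
// The 64-element w[] never escapes, so with an EA array limit of 64
// or more (and full unrolling) the allocation can be eliminated.
static int scheduleSum(int[] block)
{
    int[] w = new int[64];
    for (int t = 0; t < 16; t++)
        w[t] = block[t];
    for (int t = 16; t < 64; t++)
    {
        int s0 = Integer.rotateRight(w[t - 15], 7) ^ Integer.rotateRight(w[t - 15], 18) ^ (w[t - 15] >>> 3);
        int s1 = Integer.rotateRight(w[t - 2], 17) ^ Integer.rotateRight(w[t - 2], 19) ^ (w[t - 2] >>> 10);
        w[t] = w[t - 16] + s0 + w[t - 7] + s1;
    }
    int sum = 0;
    for (int t = 0; t < 64; t++)
        sum += w[t];
    return sum;
}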
This should be fixed by 8b51efbeedf1c7529fd070daccbece0a61584e4b.