Graal: Array copying is way slower than Hotspot even System.arraycopy

Created on 6 Feb 2019  路  16Comments  路  Source: oracle/graal

What

This examples demonstrates to my students that System.arraycopy is not faster than a manual arrayCopy in the long run due to the compiling and optimizations Hotspot applies. Hence the old saying, use arrayCopy it is native and much faster is not longer valid except when the system still warms up.

When running Graal, the entire benchmark is slower but also the manual array copy is far behind the System.arraycopy part. There seems to be more behind that, than just a different compilation of the naive manual array copy but also something that makes the native version slower.

Feedback appreciated.

Results

GraalVM 1.0.0-rc12 (build 25.192-b12-jvmci-0.54, mixed mode)

Options: -Xms2g -Xmx2g -XX:+AlwaysPreTouch -XX:+UseJVMCICompiler

Benchmark                               Mode  Cnt       Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  103022.322          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2   78596.127          ns/op

Graal no JVMCI

Options: -Xms2g -Xmx2g -XX:+AlwaysPreTouch -XX:-UseJVMCICompiler

Benchmark                               Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  53958.507          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2  53771.134          ns/op

OpenJDK 1.8.0.192

Options: -Xms2g -Xmx2g -XX:+AlwaysPreTouch

Benchmark                               Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  54005.722          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2  54137.458          ns/op

Code

package org.sample;

import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@Warmup(iterations = 2, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 2, time = 2, timeUnit = TimeUnit.SECONDS)
public class IntrinsicsArrayCopy_AverageTime
{
    final int SIZE = 100_000;
    final int[] src = new int[SIZE];

    @Setup
    public void setup()
    {
        Random r = new Random();
        for (int i = 0; i < SIZE; i++)
        {
            src[i] = r.nextInt(1000);
        }
    }

    public int[] manualArrayCopy()
    {
        int[] target = new int[SIZE];
        for (int i = 0; i < SIZE; i++)
        {
            target[i] = src[i];
        }

        return target;
    }

    public int[] systemArrayCopy()
    {
        int[] target = new int[SIZE];
        System.arraycopy(src, 0, target, 0, SIZE);

        return target;
    }

    @Benchmark
    public int[] manual()
    {
        return manualArrayCopy();
    }

    @Benchmark
    public int[] system()
    {
        return systemArrayCopy();
    }   
}

All 16 comments

@rschwietzke thanks for the report. Is this measured with GraalVM EE or GraalVM CE?

GraalVM CE

Ok, thanks. We will investigate further.

Just an update on the data with RC13. Unchanged behavior. Hotspot is way faster for System.arrayCopy as well as the manual version.

JDK 1.8.0_201, Java HotSpot(TM) 64-Bit Server VM, 25.201-b09

Benchmark                               Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  60979.437          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2  61301.317          ns/op

JDK 1.8.0_192, GraalVM CE 1.0.0-rc12, 25.192-b12-jvmci-0.54

```Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 2 104206.081 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 2 80284.785 ns/op

### JDK 1.8.0_202, OpenJDK GraalVM CE 1.0.0-rc13, 25.202-b08-jvmci-0.55

Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 2 105197.449 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 2 82337.270 ns/op
```

Any explanation or progress ?

@rschwietzke could you rerun these benchmarks with a current version of GraalVM? 20.0.0 is scheduled for release next week, and preview builds are available at https://github.com/graalvm/graalvm-ce-dev-builds/releases

For reference, here are numbers for the current master version of GraalVM CE on my machine:
with -XX:-UseJVMCICompiler:

Benchmark                               Mode  Cnt      Score      Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    5  28772.107 卤 2712.440  ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    5  28647.960 卤  793.576  ns/op

with -XX:+UseJVMCICompiler:

Benchmark                               Mode  Cnt      Score      Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    5  67083.638 卤 6759.119  ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    5  28977.734 卤 1710.920  ns/op

The performance of System.arraycopy is now identical. The manual array copy on CE is slower than System.arraycopy because CE does not have loop vectorization. (For reference, manual performance with EE's loop vectorizer is equivalent to HotSpot and System.arraycopy.)
These runs were with -wi 5 -i 5 to get some more warmup and measurement iterations. Running only 2 iterations of each as specified by the benchmark gives the same performance. Still, insufficient warmup may have been one factor when this issue was opened a year ago.

The warmup was designed to be that way because it was about the preference of System.arrayCopy because of the system needed to warm. The identical performance of Hotspot was the target because the GraalVM goal is the same or better than Hotspot. Will run the tests again in a the next hours.

When I run this test locally I see exactly what I'd expect, which is that CE and C2 have the same performance on system but the manual copy is slower on CE because of missing optimizations. I also bumped up the iterations and warmup to 4 each since 2 iterations is way too few depending on the machine you are running on. Graal warmup can be worse than C2 and you need to ensure that you aren't simply seeing warmup differences. Here are my results on a recent imac with CE 19.3.1.

C2

 -Xms2g -Xmx2g -XX:+AlwaysPreTouch -XX:-UseJVMCICompiler -jar arraycopybenchmark.jar -f 1 -wi 4 -i 4
Benchmark                               Mode  Cnt      Score     Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    4  25821.679 卤 349.242  ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    4  26217.886 卤 528.460  ns/op

Graal

java -Xms2g -Xmx2g -XX:+AlwaysPreTouch -XX:+UseJVMCICompiler -jar arraycopybenchmark.jar -f 1 -wi 4 -i 4
Benchmark                               Mode  Cnt      Score      Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    4  58021.573 卤 1096.901  ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    4  25973.943 卤  206.503  ns/op

Does increasing the number of iterations change what you see? What kind of machine are you running on? Does it reproduce on other machines?

Here we go. A rerun on all typical JVMs including the dev version in question. Executed on a Thinkpad T450s Intel i7 with 20GB RAM on AC power running Ubuntu 18.04 Kernel 4.15. Warmup was increase to 4x 5 sec.

All tests copy from a global array of int to a method local one which is returned.

  • manual runs a for-loop
  • system uses System.arraycopy
  • arrays uses Arrays.copyOf
# OpenJDK 8
# JDK 1.8.0_242, OpenJDK 64-Bit Server VM, 25.242-b08
enchmark                               Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  81753.527          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2  78014.515          ns/op
IntrinsicsArrayCopy_AverageTime.arrays  avgt    2  55316.282          ns/op
# OpenJDk 11.0.2
# JDK 11.0.2, OpenJDK 64-Bit Server VM, 11.0.2+9
Benchmark                               Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  59394.494          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2  57204.290          ns/op
IntrinsicsArrayCopy_AverageTime.arrays  avgt    2  55791.300          ns/op
# OpenJDK 11.0.6
# JDK 11.0.6, OpenJDK 64-Bit Server VM, 11.0.6+10
Benchmark                               Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  89652.132          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2  56363.240          ns/op
IntrinsicsArrayCopy_AverageTime.arrays  avgt    2  60527.110          ns/op
# GraalVM 1.0.0RC12, JDK 8
# JDK 1.8.0_192, GraalVM 1.0.0-rc12, 25.192-b12-jvmci-0.54
Benchmark                               Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  82002.283          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2  58607.645          ns/op
IntrinsicsArrayCopy_AverageTime.arrays  avgt    2  57061.915          ns/op
# GraalVM 19.1.1 JDK8
# JDK 1.8.0_222, OpenJDK 64-Bit GraalVM CE 19.1.1, 25.222-b08-jvmci-19.1-b01
Benchmark                               Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  82871.311          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2  57831.741          ns/op
IntrinsicsArrayCopy_AverageTime.arrays  avgt    2  56808.642          ns/op
# GraalVM 19.2.1 JDK8
# JDK 1.8.0_232, OpenJDK 64-Bit GraalVM CE 19.2.1, 25.232-b07-jvmci-19.2-b03
Benchmark                               Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  79876.667          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2  56967.748          ns/op
IntrinsicsArrayCopy_AverageTime.arrays  avgt    2  59907.105          ns/op
# GraalVM 20.0.b02 JDK 11
# JDK 11.0.6, OpenJDK 64-Bit Server VM, 11.0.6+9-jvmci-20.0-b02
Benchmark                               Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  86256.789          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2  61140.121          ns/op
IntrinsicsArrayCopy_AverageTime.arrays  avgt    2  56324.370          ns/op

Fancy, cannot reproduce that anymore with all the different versions. Even with the old one. Not sure if this relates to a change in the kernel for instance. Also checked other measurements I did for a conference, which still says manual is as fast as system. That means a kernel change seems to be likely aka any of the Intel patches - https://training.xceptance.com/java/420-high-performance.html#/32

So let's consider that resolved for now and thanks for you work.

Update: Found it. JDK 11.0.2 shows what I measured before. Added the data to the measurements above. So we seem to have a regression starting with JDK 11.0.3 or so... or it was overoptimized and that was incorrect. Not sure. Still odd though because manual performance perfectly.

Below are results from Core i7 7700HQ:

Amazon Corretto 8

# VM version: JDK 1.8.0_242, OpenJDK 64-Bit Server VM, 25.242-b07
# VM invoker: /usr/lib/jvm/corretto-8/jre/bin/java
Benchmark                               Mode  Cnt      Score     Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    5  33027.384 卤 811.564  ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    5  33011.317 卤 536.010  ns/op

OpenJDK 15 (early access build)

# VM version: JDK 15-ea, OpenJDK 64-Bit Server VM, 15-ea+9-264
# VM invoker: /usr/lib/jvm/openjdk-15/bin/java
Benchmark                               Mode  Cnt      Score     Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    5  34536.560 卤 493.867  ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    5  32877.251 卤 117.639  ns/op

GraalVM CE 20.1 Java 8 (dev build)

# VM version: JDK 1.8.0_242, OpenJDK 64-Bit Server VM GraalVM CE 20.1.0-dev, 25.242-b06-jvmci-20.0-b02
# VM invoker: /usr/lib/jvm/graalvm-ce-java8/jre/bin/java
Benchmark                               Mode  Cnt      Score       Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    5  69117.907 卤 37753.352  ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    5  33158.599 卤  1006.665  ns/op

GraalVM CE 20.1 Java 11 (dev build)

# VM version: JDK 11.0.6, OpenJDK 64-Bit Server VM, 11.0.6+9-jvmci-20.0-b02
# VM invoker: /usr/lib/jvm/graalvm-ce-java11/jre/bin/java
Benchmark                               Mode  Cnt      Score       Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    5  72699.161 卤 28786.334  ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    5  34478.258 卤   854.991  ns/op

Can you try Graal on JDK 11 too, please?

Done

Very interesting to see that still get results similar to my first measurements despite me being mostly not able to reproduce. Might try JDK 15 tomorrow.

I modified the previous benchmark to use JMH SampleTime (percentiles):
https://github.com/bourgesl/perfFX/blob/master/src/main/java/org/sample/IntrinsicsArrayCopy_SampleTime.java

Here are my results (JDK 11.0.6, Graal VM CE 19.3.1, Graal VM EE 19.3.1) made on Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz (AVX2, not AVX512):

Benchmark                                                Mode     Cnt        Score     Error  Units
IntrinsicsArrayCopy_AverageTime.manual                 sample  298403    16773,412 卤  19,050  ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.00    sample            15952,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.50    sample            16208,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.90    sample            17120,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.95    sample            17472,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.99    sample            32672,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.999   sample            50112,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.9999  sample            84776,858            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p1.00    sample           318976,000            ns/op
IntrinsicsArrayCopy_AverageTime.system                 sample  153675    16516,329 卤 104,729  ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.00    sample            15168,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.50    sample            15616,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.90    sample            16512,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.95    sample            16960,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.99    sample            34752,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.999   sample            77609,472            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.9999  sample           158953,677            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p1.00    sample          4284416,000            ns/op
Benchmark                                                Mode     Cnt        Score     Error  Units
IntrinsicsArrayCopy_AverageTime.manual                 sample   86663   115330,771 卤 267,108  ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.00    sample           111360,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.50    sample           111360,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.90    sample           118016,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.95    sample           131584,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.99    sample           180736,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.999   sample           260694,016            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.9999  sample           390830,490            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p1.00    sample          5234688,000            ns/op
IntrinsicsArrayCopy_AverageTime.system                 sample  296017    16954,758 卤  47,524  ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.00    sample            15984,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.50    sample            16416,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.90    sample            17312,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.95    sample            17536,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.99    sample            33588,480            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.999   sample            73213,696            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.9999  sample           130174,848            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p1.00    sample          3731456,000            ns/op
Benchmark                                                Mode     Cnt        Score     Error  Units
IntrinsicsArrayCopy_AverageTime.manual                 sample  154629    16353,394 卤 120,066  ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.00    sample            15344,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.50    sample            15632,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.90    sample            16544,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.95    sample            16928,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.99    sample            33920,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.999   sample            51567,360            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.9999  sample           146460,160            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p1.00    sample          3874816,000            ns/op
IntrinsicsArrayCopy_AverageTime.system                 sample  182424    16609,442 卤  42,102  ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.00    sample            15360,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.50    sample            15760,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.90    sample            16640,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.95    sample            17024,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.99    sample            35008,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.999   sample            84608,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.9999  sample           173699,840            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p1.00    sample           344064,000            ns/op

Summary:

  • JDK: manual % system = 0.94
Benchmark                                                Mode     Cnt        Score     Error  Units
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.99    sample            32672,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.99    sample            34752,000            ns/op
  • graal CE: manual % system = 5.38
Benchmark                                                Mode     Cnt        Score     Error  Units
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.99    sample           180736,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.99    sample            33588,480            ns/op
  • graal EE: manual % system = 0.97
Benchmark                                                Mode     Cnt        Score     Error  Units
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.99    sample            33920,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.99    sample            35008,000            ns/op

Conclusion:
GraalVM EE 19.3 is on par with JDK 11 in contrary to GraalVM CE 19.3 where manual copy is 5x slower (missing VectorizeLoops optimization)

Do you plan to merge VectorizeLoops optimizations in GraalVM CE ?
It is not available in GraalVM CE 20.02

At this time, we don't have plans for adding vectorization to GraalVM CE.

Was this page helpful?
0 / 5 - 0 ratings