Graal: Array copying is way slower than Hotspot even System.arraycopy

Created on 6 Feb 2019 · 16Comments · Source: oracle/graal

What

This examples demonstrates to my students that System.arraycopy is not faster than a manual arrayCopy in the long run due to the compiling and optimizations Hotspot applies. Hence the old saying, use arrayCopy it is native and much faster is not longer valid except when the system still warms up.

When running Graal, the entire benchmark is slower but also the manual array copy is far behind the System.arraycopy part. There seems to be more behind that, than just a different compilation of the naive manual array copy but also something that makes the native version slower.

Feedback appreciated.

Results

GraalVM 1.0.0-rc12 (build 25.192-b12-jvmci-0.54, mixed mode)

Options: -Xms2g -Xmx2g -XX:+AlwaysPreTouch -XX:+UseJVMCICompiler

Benchmark                               Mode  Cnt       Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  103022.322          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2   78596.127          ns/op

Graal no JVMCI

Options: -Xms2g -Xmx2g -XX:+AlwaysPreTouch -XX:-UseJVMCICompiler

Benchmark                               Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  53958.507          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2  53771.134          ns/op

OpenJDK 1.8.0.192

Options: -Xms2g -Xmx2g -XX:+AlwaysPreTouch

Benchmark                               Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  54005.722          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2  54137.458          ns/op

Code

package org.sample;

import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@Warmup(iterations = 2, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 2, time = 2, timeUnit = TimeUnit.SECONDS)
public class IntrinsicsArrayCopy_AverageTime
{
    final int SIZE = 100_000;
    final int[] src = new int[SIZE];

    @Setup
    public void setup()
    {
        Random r = new Random();
        for (int i = 0; i < SIZE; i++)
        {
            src[i] = r.nextInt(1000);
        }
    }

    public int[] manualArrayCopy()
    {
        int[] target = new int[SIZE];
        for (int i = 0; i < SIZE; i++)
        {
            target[i] = src[i];
        }

        return target;
    }

    public int[] systemArrayCopy()
    {
        int[] target = new int[SIZE];
        System.arraycopy(src, 0, target, 0, SIZE);

        return target;
    }

    @Benchmark
    public int[] manual()
    {
        return manualArrayCopy();
    }

    @Benchmark
    public int[] system()
    {
        return systemArrayCopy();
    }   
}

Source

rschwietzke

All 16 comments

@rschwietzke thanks for the report. Is this measured with GraalVM EE or GraalVM CE?

dougxc on 7 Feb 2019

GraalVM CE

rschwietzke on 7 Feb 2019

Ok, thanks. We will investigate further.

dougxc on 7 Feb 2019

Just an update on the data with RC13. Unchanged behavior. Hotspot is way faster for System.arrayCopy as well as the manual version.

JDK 1.8.0_201, Java HotSpot(TM) 64-Bit Server VM, 25.201-b09

Benchmark                               Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  60979.437          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2  61301.317          ns/op

JDK 1.8.0_192, GraalVM CE 1.0.0-rc12, 25.192-b12-jvmci-0.54

```Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 2 104206.081 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 2 80284.785 ns/op

### JDK 1.8.0_202, OpenJDK GraalVM CE 1.0.0-rc13, 25.202-b08-jvmci-0.55

Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 2 105197.449 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 2 82337.270 ns/op
```

rschwietzke on 16 Mar 2019

Any explanation or progress ?

bourgesl on 11 Feb 2020

@rschwietzke could you rerun these benchmarks with a current version of GraalVM? 20.0.0 is scheduled for release next week, and preview builds are available at https://github.com/graalvm/graalvm-ce-dev-builds/releases

For reference, here are numbers for the current master version of GraalVM CE on my machine:
with -XX:-UseJVMCICompiler:

Benchmark                               Mode  Cnt      Score      Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    5  28772.107 ± 2712.440  ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    5  28647.960 ±  793.576  ns/op

with -XX:+UseJVMCICompiler:

Benchmark                               Mode  Cnt      Score      Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    5  67083.638 ± 6759.119  ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    5  28977.734 ± 1710.920  ns/op

The performance of System.arraycopy is now identical. The manual array copy on CE is slower than System.arraycopy because CE does not have loop vectorization. (For reference, manual performance with EE's loop vectorizer is equivalent to HotSpot and System.arraycopy.)
These runs were with -wi 5 -i 5 to get some more warmup and measurement iterations. Running only 2 iterations of each as specified by the benchmark gives the same performance. Still, insufficient warmup may have been one factor when this issue was opened a year ago.

gergo- on 12 Feb 2020

👍1

The warmup was designed to be that way because it was about the preference of System.arrayCopy because of the system needed to warm. The identical performance of Hotspot was the target because the GraalVM goal is the same or better than Hotspot. Will run the tests again in a the next hours.

rschwietzke on 12 Feb 2020

When I run this test locally I see exactly what I'd expect, which is that CE and C2 have the same performance on system but the manual copy is slower on CE because of missing optimizations. I also bumped up the iterations and warmup to 4 each since 2 iterations is way too few depending on the machine you are running on. Graal warmup can be worse than C2 and you need to ensure that you aren't simply seeing warmup differences. Here are my results on a recent imac with CE 19.3.1.

 -Xms2g -Xmx2g -XX:+AlwaysPreTouch -XX:-UseJVMCICompiler -jar arraycopybenchmark.jar -f 1 -wi 4 -i 4
Benchmark                               Mode  Cnt      Score     Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    4  25821.679 ± 349.242  ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    4  26217.886 ± 528.460  ns/op

Graal

java -Xms2g -Xmx2g -XX:+AlwaysPreTouch -XX:+UseJVMCICompiler -jar arraycopybenchmark.jar -f 1 -wi 4 -i 4
Benchmark                               Mode  Cnt      Score      Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    4  58021.573 ± 1096.901  ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    4  25973.943 ±  206.503  ns/op

Does increasing the number of iterations change what you see? What kind of machine are you running on? Does it reproduce on other machines?

tkrodriguez on 12 Feb 2020

Here we go. A rerun on all typical JVMs including the dev version in question. Executed on a Thinkpad T450s Intel i7 with 20GB RAM on AC power running Ubuntu 18.04 Kernel 4.15. Warmup was increase to 4x 5 sec.

All tests copy from a global array of int to a method local one which is returned.

manual runs a for-loop
system uses System.arraycopy
arrays uses Arrays.copyOf

# OpenJDK 8
# JDK 1.8.0_242, OpenJDK 64-Bit Server VM, 25.242-b08
enchmark                               Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  81753.527          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2  78014.515          ns/op
IntrinsicsArrayCopy_AverageTime.arrays  avgt    2  55316.282          ns/op

# OpenJDk 11.0.2
# JDK 11.0.2, OpenJDK 64-Bit Server VM, 11.0.2+9
Benchmark                               Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  59394.494          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2  57204.290          ns/op
IntrinsicsArrayCopy_AverageTime.arrays  avgt    2  55791.300          ns/op

# OpenJDK 11.0.6
# JDK 11.0.6, OpenJDK 64-Bit Server VM, 11.0.6+10
Benchmark                               Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  89652.132          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2  56363.240          ns/op
IntrinsicsArrayCopy_AverageTime.arrays  avgt    2  60527.110          ns/op

# GraalVM 1.0.0RC12, JDK 8
# JDK 1.8.0_192, GraalVM 1.0.0-rc12, 25.192-b12-jvmci-0.54
Benchmark                               Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  82002.283          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2  58607.645          ns/op
IntrinsicsArrayCopy_AverageTime.arrays  avgt    2  57061.915          ns/op

# GraalVM 19.1.1 JDK8
# JDK 1.8.0_222, OpenJDK 64-Bit GraalVM CE 19.1.1, 25.222-b08-jvmci-19.1-b01
Benchmark                               Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  82871.311          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2  57831.741          ns/op
IntrinsicsArrayCopy_AverageTime.arrays  avgt    2  56808.642          ns/op

# GraalVM 19.2.1 JDK8
# JDK 1.8.0_232, OpenJDK 64-Bit GraalVM CE 19.2.1, 25.232-b07-jvmci-19.2-b03
Benchmark                               Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  79876.667          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2  56967.748          ns/op
IntrinsicsArrayCopy_AverageTime.arrays  avgt    2  59907.105          ns/op

# GraalVM 20.0.b02 JDK 11
# JDK 11.0.6, OpenJDK 64-Bit Server VM, 11.0.6+9-jvmci-20.0-b02
Benchmark                               Mode  Cnt      Score   Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    2  86256.789          ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    2  61140.121          ns/op
IntrinsicsArrayCopy_AverageTime.arrays  avgt    2  56324.370          ns/op

Fancy, cannot reproduce that anymore with all the different versions. Even with the old one. Not sure if this relates to a change in the kernel for instance. Also checked other measurements I did for a conference, which still says manual is as fast as system. That means a kernel change seems to be likely aka any of the Intel patches - https://training.xceptance.com/java/420-high-performance.html#/32

So let's consider that resolved for now and thanks for you work.

Update: Found it. JDK 11.0.2 shows what I measured before. Added the data to the measurements above. So we seem to have a regression starting with JDK 11.0.3 or so... or it was overoptimized and that was incorrect. Not sure. Still odd though because manual performance perfectly.

rschwietzke on 12 Feb 2020

Below are results from Core i7 7700HQ:

Amazon Corretto 8

# VM version: JDK 1.8.0_242, OpenJDK 64-Bit Server VM, 25.242-b07
# VM invoker: /usr/lib/jvm/corretto-8/jre/bin/java
Benchmark                               Mode  Cnt      Score     Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    5  33027.384 ± 811.564  ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    5  33011.317 ± 536.010  ns/op

OpenJDK 15 (early access build)

# VM version: JDK 15-ea, OpenJDK 64-Bit Server VM, 15-ea+9-264
# VM invoker: /usr/lib/jvm/openjdk-15/bin/java
Benchmark                               Mode  Cnt      Score     Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    5  34536.560 ± 493.867  ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    5  32877.251 ± 117.639  ns/op

GraalVM CE 20.1 Java 8 (dev build)

# VM version: JDK 1.8.0_242, OpenJDK 64-Bit Server VM GraalVM CE 20.1.0-dev, 25.242-b06-jvmci-20.0-b02
# VM invoker: /usr/lib/jvm/graalvm-ce-java8/jre/bin/java
Benchmark                               Mode  Cnt      Score       Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    5  69117.907 ± 37753.352  ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    5  33158.599 ±  1006.665  ns/op

GraalVM CE 20.1 Java 11 (dev build)

# VM version: JDK 11.0.6, OpenJDK 64-Bit Server VM, 11.0.6+9-jvmci-20.0-b02
# VM invoker: /usr/lib/jvm/graalvm-ce-java11/jre/bin/java
Benchmark                               Mode  Cnt      Score       Error  Units
IntrinsicsArrayCopy_AverageTime.manual  avgt    5  72699.161 ± 28786.334  ns/op
IntrinsicsArrayCopy_AverageTime.system  avgt    5  34478.258 ±   854.991  ns/op

plokhotnyuk on 12 Feb 2020

Can you try Graal on JDK 11 too, please?

rschwietzke on 12 Feb 2020

Done

plokhotnyuk on 12 Feb 2020

Very interesting to see that still get results similar to my first measurements despite me being mostly not able to reproduce. Might try JDK 15 tomorrow.

rschwietzke on 12 Feb 2020

I modified the previous benchmark to use JMH SampleTime (percentiles):
https://github.com/bourgesl/perfFX/blob/master/src/main/java/org/sample/IntrinsicsArrayCopy_SampleTime.java

Here are my results (JDK 11.0.6, Graal VM CE 19.3.1, Graal VM EE 19.3.1) made on Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz (AVX2, not AVX512):

jdk: https://github.com/bourgesl/perfFX/blob/master/results/jmh_array.txt

Benchmark                                                Mode     Cnt        Score     Error  Units
IntrinsicsArrayCopy_AverageTime.manual                 sample  298403    16773,412 ±  19,050  ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.00    sample            15952,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.50    sample            16208,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.90    sample            17120,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.95    sample            17472,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.99    sample            32672,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.999   sample            50112,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.9999  sample            84776,858            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p1.00    sample           318976,000            ns/op
IntrinsicsArrayCopy_AverageTime.system                 sample  153675    16516,329 ± 104,729  ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.00    sample            15168,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.50    sample            15616,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.90    sample            16512,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.95    sample            16960,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.99    sample            34752,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.999   sample            77609,472            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.9999  sample           158953,677            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p1.00    sample          4284416,000            ns/op

graal CE: https://github.com/bourgesl/perfFX/blob/master/results/jmh_array_ce.txt

Benchmark                                                Mode     Cnt        Score     Error  Units
IntrinsicsArrayCopy_AverageTime.manual                 sample   86663   115330,771 ± 267,108  ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.00    sample           111360,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.50    sample           111360,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.90    sample           118016,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.95    sample           131584,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.99    sample           180736,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.999   sample           260694,016            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.9999  sample           390830,490            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p1.00    sample          5234688,000            ns/op
IntrinsicsArrayCopy_AverageTime.system                 sample  296017    16954,758 ±  47,524  ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.00    sample            15984,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.50    sample            16416,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.90    sample            17312,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.95    sample            17536,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.99    sample            33588,480            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.999   sample            73213,696            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.9999  sample           130174,848            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p1.00    sample          3731456,000            ns/op

graal EE: https://github.com/bourgesl/perfFX/blob/master/results/jmh_array_ee.txt

Benchmark                                                Mode     Cnt        Score     Error  Units
IntrinsicsArrayCopy_AverageTime.manual                 sample  154629    16353,394 ± 120,066  ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.00    sample            15344,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.50    sample            15632,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.90    sample            16544,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.95    sample            16928,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.99    sample            33920,000            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.999   sample            51567,360            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.9999  sample           146460,160            ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual·p1.00    sample          3874816,000            ns/op
IntrinsicsArrayCopy_AverageTime.system                 sample  182424    16609,442 ±  42,102  ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.00    sample            15360,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.50    sample            15760,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.90    sample            16640,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.95    sample            17024,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.99    sample            35008,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.999   sample            84608,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.9999  sample           173699,840            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p1.00    sample           344064,000            ns/op

Summary:

JDK: manual % system = 0.94

Benchmark                                                Mode     Cnt        Score     Error  Units
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.99    sample            32672,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.99    sample            34752,000            ns/op

graal CE: manual % system = 5.38

Benchmark                                                Mode     Cnt        Score     Error  Units
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.99    sample           180736,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.99    sample            33588,480            ns/op

graal EE: manual % system = 0.97

Benchmark                                                Mode     Cnt        Score     Error  Units
IntrinsicsArrayCopy_AverageTime.manual:manual·p0.99    sample            33920,000            ns/op
IntrinsicsArrayCopy_AverageTime.system:system·p0.99    sample            35008,000            ns/op

Conclusion:
GraalVM EE 19.3 is on par with JDK 11 in contrary to GraalVM CE 19.3 where manual copy is 5x slower (missing VectorizeLoops optimization)

bourgesl on 13 Feb 2020

Do you plan to merge VectorizeLoops optimizations in GraalVM CE ?
It is not available in GraalVM CE 20.02

bourgesl on 19 Feb 2020

At this time, we don't have plans for adding vectorization to GraalVM CE.

dougxc on 19 Feb 2020

😕4

Was this page helpful?

0 / 5 - 0 ratings

Related issues

[native-image] memory corruption in byte arrays obtained via getPrimitiveArrayCritical

johanvos · 3Comments

What .dlls are needed for a Windows binary?

borkdude · 3Comments

Build fails by "Error: No instances of java.io.FilePermission are allowed in the image heap as this class should be initialized at image runtime."

oroppas · 3Comments

Regression introduced: IndexOutOfBoundsException creating Micronaut native-image apps

ilopmar · 3Comments

Can AOT compilations run without an operating system?

ghost · 3Comments