This examples demonstrates to my students that System.arraycopy is not faster than a manual arrayCopy in the long run due to the compiling and optimizations Hotspot applies. Hence the old saying, use arrayCopy it is native and much faster is not longer valid except when the system still warms up.
When running Graal, the entire benchmark is slower but also the manual array copy is far behind the System.arraycopy part. There seems to be more behind that, than just a different compilation of the naive manual array copy but also something that makes the native version slower.
Feedback appreciated.
Options: -Xms2g -Xmx2g -XX:+AlwaysPreTouch -XX:+UseJVMCICompiler
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 2 103022.322 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 2 78596.127 ns/op
Options: -Xms2g -Xmx2g -XX:+AlwaysPreTouch -XX:-UseJVMCICompiler
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 2 53958.507 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 2 53771.134 ns/op
Options: -Xms2g -Xmx2g -XX:+AlwaysPreTouch
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 2 54005.722 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 2 54137.458 ns/op
package org.sample;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(1)
@Warmup(iterations = 2, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 2, time = 2, timeUnit = TimeUnit.SECONDS)
public class IntrinsicsArrayCopy_AverageTime
{
final int SIZE = 100_000;
final int[] src = new int[SIZE];
@Setup
public void setup()
{
Random r = new Random();
for (int i = 0; i < SIZE; i++)
{
src[i] = r.nextInt(1000);
}
}
public int[] manualArrayCopy()
{
int[] target = new int[SIZE];
for (int i = 0; i < SIZE; i++)
{
target[i] = src[i];
}
return target;
}
public int[] systemArrayCopy()
{
int[] target = new int[SIZE];
System.arraycopy(src, 0, target, 0, SIZE);
return target;
}
@Benchmark
public int[] manual()
{
return manualArrayCopy();
}
@Benchmark
public int[] system()
{
return systemArrayCopy();
}
}
@rschwietzke thanks for the report. Is this measured with GraalVM EE or GraalVM CE?
GraalVM CE
Ok, thanks. We will investigate further.
Just an update on the data with RC13. Unchanged behavior. Hotspot is way faster for System.arrayCopy as well as the manual version.
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 2 60979.437 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 2 61301.317 ns/op
```Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 2 104206.081 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 2 80284.785 ns/op
### JDK 1.8.0_202, OpenJDK GraalVM CE 1.0.0-rc13, 25.202-b08-jvmci-0.55
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 2 105197.449 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 2 82337.270 ns/op
```
Any explanation or progress ?
@rschwietzke could you rerun these benchmarks with a current version of GraalVM? 20.0.0 is scheduled for release next week, and preview builds are available at https://github.com/graalvm/graalvm-ce-dev-builds/releases
For reference, here are numbers for the current master version of GraalVM CE on my machine:
with -XX:-UseJVMCICompiler:
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 5 28772.107 卤 2712.440 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 5 28647.960 卤 793.576 ns/op
with -XX:+UseJVMCICompiler:
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 5 67083.638 卤 6759.119 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 5 28977.734 卤 1710.920 ns/op
The performance of System.arraycopy is now identical. The manual array copy on CE is slower than System.arraycopy because CE does not have loop vectorization. (For reference, manual performance with EE's loop vectorizer is equivalent to HotSpot and System.arraycopy.)
These runs were with -wi 5 -i 5 to get some more warmup and measurement iterations. Running only 2 iterations of each as specified by the benchmark gives the same performance. Still, insufficient warmup may have been one factor when this issue was opened a year ago.
The warmup was designed to be that way because it was about the preference of System.arrayCopy because of the system needed to warm. The identical performance of Hotspot was the target because the GraalVM goal is the same or better than Hotspot. Will run the tests again in a the next hours.
When I run this test locally I see exactly what I'd expect, which is that CE and C2 have the same performance on system but the manual copy is slower on CE because of missing optimizations. I also bumped up the iterations and warmup to 4 each since 2 iterations is way too few depending on the machine you are running on. Graal warmup can be worse than C2 and you need to ensure that you aren't simply seeing warmup differences. Here are my results on a recent imac with CE 19.3.1.
C2
-Xms2g -Xmx2g -XX:+AlwaysPreTouch -XX:-UseJVMCICompiler -jar arraycopybenchmark.jar -f 1 -wi 4 -i 4
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 4 25821.679 卤 349.242 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 4 26217.886 卤 528.460 ns/op
Graal
java -Xms2g -Xmx2g -XX:+AlwaysPreTouch -XX:+UseJVMCICompiler -jar arraycopybenchmark.jar -f 1 -wi 4 -i 4
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 4 58021.573 卤 1096.901 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 4 25973.943 卤 206.503 ns/op
Does increasing the number of iterations change what you see? What kind of machine are you running on? Does it reproduce on other machines?
Here we go. A rerun on all typical JVMs including the dev version in question. Executed on a Thinkpad T450s Intel i7 with 20GB RAM on AC power running Ubuntu 18.04 Kernel 4.15. Warmup was increase to 4x 5 sec.
All tests copy from a global array of int to a method local one which is returned.
System.arraycopyArrays.copyOf# OpenJDK 8
# JDK 1.8.0_242, OpenJDK 64-Bit Server VM, 25.242-b08
enchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 2 81753.527 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 2 78014.515 ns/op
IntrinsicsArrayCopy_AverageTime.arrays avgt 2 55316.282 ns/op
# OpenJDk 11.0.2
# JDK 11.0.2, OpenJDK 64-Bit Server VM, 11.0.2+9
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 2 59394.494 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 2 57204.290 ns/op
IntrinsicsArrayCopy_AverageTime.arrays avgt 2 55791.300 ns/op
# OpenJDK 11.0.6
# JDK 11.0.6, OpenJDK 64-Bit Server VM, 11.0.6+10
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 2 89652.132 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 2 56363.240 ns/op
IntrinsicsArrayCopy_AverageTime.arrays avgt 2 60527.110 ns/op
# GraalVM 1.0.0RC12, JDK 8
# JDK 1.8.0_192, GraalVM 1.0.0-rc12, 25.192-b12-jvmci-0.54
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 2 82002.283 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 2 58607.645 ns/op
IntrinsicsArrayCopy_AverageTime.arrays avgt 2 57061.915 ns/op
# GraalVM 19.1.1 JDK8
# JDK 1.8.0_222, OpenJDK 64-Bit GraalVM CE 19.1.1, 25.222-b08-jvmci-19.1-b01
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 2 82871.311 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 2 57831.741 ns/op
IntrinsicsArrayCopy_AverageTime.arrays avgt 2 56808.642 ns/op
# GraalVM 19.2.1 JDK8
# JDK 1.8.0_232, OpenJDK 64-Bit GraalVM CE 19.2.1, 25.232-b07-jvmci-19.2-b03
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 2 79876.667 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 2 56967.748 ns/op
IntrinsicsArrayCopy_AverageTime.arrays avgt 2 59907.105 ns/op
# GraalVM 20.0.b02 JDK 11
# JDK 11.0.6, OpenJDK 64-Bit Server VM, 11.0.6+9-jvmci-20.0-b02
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 2 86256.789 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 2 61140.121 ns/op
IntrinsicsArrayCopy_AverageTime.arrays avgt 2 56324.370 ns/op
Fancy, cannot reproduce that anymore with all the different versions. Even with the old one. Not sure if this relates to a change in the kernel for instance. Also checked other measurements I did for a conference, which still says manual is as fast as system. That means a kernel change seems to be likely aka any of the Intel patches - https://training.xceptance.com/java/420-high-performance.html#/32
So let's consider that resolved for now and thanks for you work.
Update: Found it. JDK 11.0.2 shows what I measured before. Added the data to the measurements above. So we seem to have a regression starting with JDK 11.0.3 or so... or it was overoptimized and that was incorrect. Not sure. Still odd though because manual performance perfectly.
Below are results from Core i7 7700HQ:
# VM version: JDK 1.8.0_242, OpenJDK 64-Bit Server VM, 25.242-b07
# VM invoker: /usr/lib/jvm/corretto-8/jre/bin/java
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 5 33027.384 卤 811.564 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 5 33011.317 卤 536.010 ns/op
# VM version: JDK 15-ea, OpenJDK 64-Bit Server VM, 15-ea+9-264
# VM invoker: /usr/lib/jvm/openjdk-15/bin/java
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 5 34536.560 卤 493.867 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 5 32877.251 卤 117.639 ns/op
# VM version: JDK 1.8.0_242, OpenJDK 64-Bit Server VM GraalVM CE 20.1.0-dev, 25.242-b06-jvmci-20.0-b02
# VM invoker: /usr/lib/jvm/graalvm-ce-java8/jre/bin/java
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 5 69117.907 卤 37753.352 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 5 33158.599 卤 1006.665 ns/op
# VM version: JDK 11.0.6, OpenJDK 64-Bit Server VM, 11.0.6+9-jvmci-20.0-b02
# VM invoker: /usr/lib/jvm/graalvm-ce-java11/jre/bin/java
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual avgt 5 72699.161 卤 28786.334 ns/op
IntrinsicsArrayCopy_AverageTime.system avgt 5 34478.258 卤 854.991 ns/op
Can you try Graal on JDK 11 too, please?
Done
Very interesting to see that still get results similar to my first measurements despite me being mostly not able to reproduce. Might try JDK 15 tomorrow.
I modified the previous benchmark to use JMH SampleTime (percentiles):
https://github.com/bourgesl/perfFX/blob/master/src/main/java/org/sample/IntrinsicsArrayCopy_SampleTime.java
Here are my results (JDK 11.0.6, Graal VM CE 19.3.1, Graal VM EE 19.3.1) made on Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz (AVX2, not AVX512):
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual sample 298403 16773,412 卤 19,050 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.00 sample 15952,000 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.50 sample 16208,000 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.90 sample 17120,000 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.95 sample 17472,000 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.99 sample 32672,000 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.999 sample 50112,000 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.9999 sample 84776,858 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p1.00 sample 318976,000 ns/op
IntrinsicsArrayCopy_AverageTime.system sample 153675 16516,329 卤 104,729 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.00 sample 15168,000 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.50 sample 15616,000 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.90 sample 16512,000 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.95 sample 16960,000 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.99 sample 34752,000 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.999 sample 77609,472 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.9999 sample 158953,677 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p1.00 sample 4284416,000 ns/op
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual sample 86663 115330,771 卤 267,108 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.00 sample 111360,000 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.50 sample 111360,000 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.90 sample 118016,000 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.95 sample 131584,000 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.99 sample 180736,000 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.999 sample 260694,016 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.9999 sample 390830,490 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p1.00 sample 5234688,000 ns/op
IntrinsicsArrayCopy_AverageTime.system sample 296017 16954,758 卤 47,524 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.00 sample 15984,000 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.50 sample 16416,000 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.90 sample 17312,000 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.95 sample 17536,000 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.99 sample 33588,480 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.999 sample 73213,696 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.9999 sample 130174,848 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p1.00 sample 3731456,000 ns/op
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual sample 154629 16353,394 卤 120,066 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.00 sample 15344,000 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.50 sample 15632,000 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.90 sample 16544,000 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.95 sample 16928,000 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.99 sample 33920,000 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.999 sample 51567,360 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.9999 sample 146460,160 ns/op
IntrinsicsArrayCopy_AverageTime.manual:manual路p1.00 sample 3874816,000 ns/op
IntrinsicsArrayCopy_AverageTime.system sample 182424 16609,442 卤 42,102 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.00 sample 15360,000 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.50 sample 15760,000 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.90 sample 16640,000 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.95 sample 17024,000 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.99 sample 35008,000 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.999 sample 84608,000 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.9999 sample 173699,840 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p1.00 sample 344064,000 ns/op
Summary:
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.99 sample 32672,000 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.99 sample 34752,000 ns/op
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.99 sample 180736,000 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.99 sample 33588,480 ns/op
Benchmark Mode Cnt Score Error Units
IntrinsicsArrayCopy_AverageTime.manual:manual路p0.99 sample 33920,000 ns/op
IntrinsicsArrayCopy_AverageTime.system:system路p0.99 sample 35008,000 ns/op
Conclusion:
GraalVM EE 19.3 is on par with JDK 11 in contrary to GraalVM CE 19.3 where manual copy is 5x slower (missing VectorizeLoops optimization)
Do you plan to merge VectorizeLoops optimizations in GraalVM CE ?
It is not available in GraalVM CE 20.02
At this time, we don't have plans for adding vectorization to GraalVM CE.