Graal: [native-image] native image performance is worse than running on the JVM

Created on 14 Feb 2019 · 13 comments · Source: oracle/graal

Graal Version: 1.0.0-RC12, EE and CE

Expected Behaviour: native image executes faster than same program executed on JVM

Actual Behaviour: JVM execution is faster

I wrote some benchmarks with JMH in order to evaluate performance of Graal, both on JVM and as native image. You can find the code at https://github.com/turing85/graal-playground

When you execute the fibonacci, prime-number, and stream benchmarks, you will observe that the native image is always slower, sometimes up to 5x slower.
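For reference, a benchmark in that repository has roughly the following shape -- this is a hedged sketch with hypothetical class and method names, not code copied from the linked repo:

    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;

    public class FibonacciBenchmark {

        // Naive recursion stresses call overhead and inlining decisions --
        // exactly the territory where JIT profile information helps most.
        private static long fib(int n) {
            return n < 2 ? n : fib(n - 1) + fib(n - 2);
        }

        @Benchmark
        @BenchmarkMode(Mode.AverageTime)
        @OutputTimeUnit(TimeUnit.MILLISECONDS)
        public long recursiveFib() {
            return fib(30); // return the result so JMH prevents dead-code elimination
        }
    }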

Some of the performance degradation may be related to https://github.com/oracle/graal/issues/974, but since the benchmarks cover other areas as well (e.g. recursion and pure iteration performance), this cannot be the sole reason.

native-image

All 13 comments

Observed the same issue on 1.0.0-rc14 CE following the Top 10 Things To Do With GraalVM article (source).

Steps to reproduce:

  1. With the source:
     cd ./graalvm-ten-things-master
     make large.txt                    # generate the input file to feed the demo app
     ~/GraalVM/bin/javac TopTen.java   # with javac from the GraalVM distribution
     ~/GraalVM/bin/native-image TopTen
  2. Run gtime -v ~/GraalVM/bin/java TopTen large.txt:
     Elapsed (wall clock) time (h:mm:ss or m:ss): 0:30.9
  3. Run gtime -v ./topten large.txt:
     Elapsed (wall clock) time (h:mm:ss or m:ss): 1:04.90

The problem here is that you have a misguided expectation:

Expected Behaviour: native image executes faster than same program executed on JVM

Why would you assume this? Native images give you certain gains at the cost of other losses.

JVM execution of an application includes runtime optimization of the code that profits from profile information built up during execution. That includes the opportunity to inline a lot more of the code, locate hot code on direct paths (i.e. ensure better instruction cache locality) and cut out a lot of the code on cold paths (yes, on the JVM a lot of code does not get compiled until something tries to execute it -- it is replaced with a trap that causes deoptimization and recompilation). Removal of cold paths provides far more optimization opportunities than are available to ahead-of-time compilation, because it significantly reduces the branch complexity and combinatoric logic of the smaller amount of hot code that is compiled.
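As a hedged illustration of the cold-path point (hypothetical code, not taken from the benchmarks in question): the JIT can compile only the branch the profile has actually seen and plant a deoptimization trap on the other one.

    class Parser {
        // The profile shows the negative branch is never taken, so the JIT
        // compiles only the hot path and replaces the cold branch with a trap
        // that falls back to the interpreter (and triggers recompilation) if
        // it is ever hit. An AOT compiler must emit code for both branches.
        static int parsePositive(String s) {
            int value = Integer.parseInt(s);
            if (value < 0) { // cold in practice
                throw new IllegalArgumentException("negative: " + value);
            }
            return value; // hot path, compiled tightly
        }
    }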

By contrast, native image execution has to cater for all possible execution paths when it compiles code offline, since it does not know which are the hot or cold paths and cannot use the trick of planting a trap and recompiling if it is hit. For the same reason it cannot load the dice to ensure that code cache conflicts are minimized by co-locating hot paths. Native image generation is able to remove some code because of the closed-world hypothesis, but that is often not enough to make up for all the benefits that profiling and runtime deopt & recompile provide to the JVM JIT compiler.

n.b. the price you pay for that potentially higher JVM speed is in footprint and startup time because

i) it takes some time before the JIT kicks in and fully optimizes the code
ii) the JVM has to retain a lot more metadata and compiler/profiler data to support the better optimizations that it can offer

The reason for i) is that code needs to run interpreted for some time and, possibly, be compiled several times before all potential optimizations are realized. An implication of i) is that for small, short-lived applications a native image may well be a better bet: although the compiled code is not as well optimized, it is available straight away.

There are several reasons for ii). The JVM does not have a closed-world assumption, so it has to be able to recompile code if loading of new classes implies that it needs to revise optimistic assumptions made at compile time. For example, if an interface has only one implementation, it can make a call jump directly to that code. However, in the rare case where a second implementation class is loaded, the call site needs to be patched to test the type of the receiver instance and jump to the code that belongs to its class. Supporting optimizations like this one requires keeping track of a lot more details of the class base than a native image does, including recording the full class and interface hierarchy, details of which methods override other methods, all method bytecode, etc. In a native image most of the details of class structure and bytecode can be ignored at runtime.
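A hedged sketch of that interface example (hypothetical types, not from any particular codebase):

    interface Shape { double area(); }

    final class Circle implements Shape {
        private final double r;
        Circle(double r) { this.r = r; }
        public double area() { return Math.PI * r * r; }
    }

    class Geometry {
        // While Circle is the only implementation ever loaded, the JIT can
        // turn s.area() into a direct, inlinable call. If a second
        // implementation is loaded later, the JVM deoptimizes and patches
        // the call site with a receiver type check -- which is why it keeps
        // the class hierarchy and bytecode around at runtime.
        static double totalArea(Shape[] shapes) {
            double sum = 0;
            for (Shape s : shapes) {
                sum += s.area();
            }
            return sum;
        }
    }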

The JVM also has to cope with changes to the class base or execution profiles that result in a thread going down a previously cold path. At that point the JVM has to jump out of the compiled code into the interpreter and recompile the code to cater for a new execution profile that includes the previously cold path. That requires keeping runtime info that allows a compiled stack frame to be replaced with one or more interpreter frames. It also requires runtime-extensible profile counters to be allocated and updated to track what has or has not been executed.

Finally, the JVM also supports full reflection for runtime-loaded classes. This also requires it to keep a lot more details of the code base in memory.

It's no surprise that a native image improves on some aspects of JVM performance at the cost of others. The JVM has been worked on for roughly 30 years by some of the most talented people in the software industry, and much of that work has been spent carefully and continuously improving performance. However, the JVM designers have regularly made trade-offs that have increased memory use to gain speed. The native image option offered by GraalVM has to a large degree made the reverse trade-off. It would be very naive to think that GraalVM is going to come up with significant memory savings at no cost. If that were possible, the JVM would already have done it.

We are not there yet, but our goal is to have native images with profile-guided optimizations (PGO) run at equivalent peak performance to the JIT-compiled JVM version.

The main downsides of using GraalVM in AOT mode instead of JIT mode are:

  • Longer build times due to the image generation process
  • Target platform must be known at build time
  • Set of classes of the application and reflection usage must be known at build time (see the sketch after this list)
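For the reflection point, a hedged sketch (hypothetical class and method names) of a call that the closed-world analysis cannot resolve on its own -- with native-image, such a target has to be registered explicitly, e.g. via a reflection configuration file supplied at build time:

    class PluginLoader {
        // The class name arrives at runtime (e.g. from a config file), so
        // static analysis cannot see which class is meant; without explicit
        // registration the lookup fails in a native image.
        static Object load(String className) throws ReflectiveOperationException {
            return Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
        }
    }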

So @elgris I tested PGO for this example and it should give a very substantial speed-up of ~50%. Can you try the following on your machine?

native-image --pgo-instrument TopTen     # build an instrumented image
./topten large.txt                       # run the workload; the profile is written to default.iprof
native-image --pgo=default.iprof TopTen  # rebuild using the collected profile
gtime -v ./topten large.txt              # time the optimized image

We are not there yet, but our goal is to have native images with profile-guided optimizations (PGO) run at equivalent peak performance to the JIT-compiled JVM version.

The main downsides of using GraalVM in AOT mode instead of JIT mode are:

  • Longer build times due to the image generation process
  • Target platform must be known at build time
  • Set of classes of the application and reflection usage must be known at build time

You miss one key downside point:

  • An application profile must be provided for PGO, and it must accurately represent application behaviour for all future executions of the AOT image.

The original questioner has an unreasonable expectation that AOT is automagically faster than JIT, and your answer perpetuates that misguided assumption. There are benefits to both options, and users need an accurate model of the tradeoffs to understand which is most appropriate for their usage.

No, I do not think this is a key downside point. One could argue both ways - i.e., depending on the scenario, it can be an advantage. In JIT mode, most VMs (like, e.g., HotSpot) assume that the profile collected during the startup/warmup period accurately represents application behaviour for the whole run. This can be a major downside, because applications typically exercise exceptional behaviour during that time frame. We often see workloads that would run faster if their profile were not polluted during the startup sequence. It also makes performance much less predictable, because minor changes in behaviour during startup can have a large influence on the machine code generated for the steady state.

My answer states that "equivalent peak performance" is possible. There is nothing about that statement that would support an "AOT is automagically faster than JIT" claim. Both JIT and AOT mode depend on profiling feedback reflecting actual application behaviour for maximum performance. In terms of total execution runtime, the expectation is indeed that AOT would always be faster for shorter-running programs (e.g., when only executing for a few seconds).

@thomaswue now that is really cool. Is it possible to also collect the profile when running the same code on regular HotSpot with Graal?
I can imagine a scenario where you prepare a profile for your web app by shooting at it with Gatling. Then you AOT-compile it with that specific profile and voilà, you have a better-performing web app, albeit natively compiled. Instead of two native build passes you could get away with one.

Yes. It is also possible to collect the profile when running via GraalVM on HotSpot. @JaroslavTulach created a prototype and describes it in the top section of http://wiki.apidesign.org/wiki/FourthGraalAdventures. We will add documented flags for this functionality to our next release.

@thomaswue please do tell me that you can even feed that very same profile back into new JVM instances without having to AOT-compile, and that this is also going to work for precompiled libgraal-enabled JVMs, which could eagerly JIT-compile at VM startup... and my evening will be complete.

The collecting will also work for libgraal enabled JVMs when the GraalVM JIT compiler is itself AOT compiled.

On the question of using those profiles in JIT mode during VM startup to eagerly JIT-compile: it could be possible, but I do not expect the effects to be very substantial. Maybe small startup speed-ups in the 10% range could be realistic to achieve this way, but way below the 50x improvements from real AOT compilation.

We do have however research in the pipeline to further blur the line between JIT and AOT. If one of the paths we are currently pursuing is successful, we might indeed make your evening truly complete ;-). Stay tuned.

Thanks for your interest in PGO, and especially in collecting profiles in HotSpot mode. I have created a Geometry example to demonstrate the overall benefits of using the --pgo-instrument and --pgo options. It works great with GraalVM EE 19.0.2.

There is also a HotSpotMode branch showing how to collect the data directly in JVM mode. It kind of works in the 19.0.2 version, but there are still bugs that need to be fixed to bring the quality up to production level.

@JaroslavTulach looks good, thank you.

Has there been any progress on this issue and on bringing it into production?
