Sdk: Dart native slower than Dart VM?

Created on 13 Nov 2019 · 34 Comments · Source: dart-lang/sdk

I was (am) pretty excited about the dart2native announcement, and decided to test it.

Where I would really expect this to shine, is with the sort of heavy number crunching that generally makes scripting languages fall short or delegate the heavy work to C.

So I installed the image package version 2.1.8, and wrote a very basic script:

import 'dart:io';
import 'package:image/image.dart';

int calculate() {
  var stopwatch = new Stopwatch()..start();

  var image = decodeImage(File('input.jpg').readAsBytesSync());

  Image thumbnail = copyResize(image, width: 200, interpolation: Interpolation.average);

  File('output.jpg').writeAsBytesSync(encodePng(thumbnail));

  return stopwatch.elapsed.inMilliseconds;
}

And a basic console front-end:

import 'package:image_test/image_test.dart' as image_test;

main(List<String> arguments) {
  print('Time taken: ${image_test.calculate()}!');
}

I'm feeding it a big photo of 5760 x 3840 px, and as you can see, I'm using presumably the most expensive Interpolation algo available.

Run this with the VM:

> dart bin\main.dart
Time taken: 1262!

Let me interject here and say, this is by far the fastest I've ever seen any scripting language resize an image of this size - this library is single-threaded, so that is really incredibly fast! Kudos on delivering probably one of the fastest scripting language VMs ever created! 🤩

But (obviously?) I was expecting this to be even faster when compiled to a native binary.

So I built it:

> dart2native bin\main.dart -o bin\image_test.exe
Generated: c:\workspace\dart\image-test\bin\image_test.exe

And ran it:

> bin\image_test
Time taken: 2084!

About 65% slower?

I ran both many times, and the results are pretty consistent.

I also pulled up a CPU monitor, and it does look like the Dart VM uses more CPU power - I see a spike on two CPU cores, whereas with the compiled binary, I see a spike only on a single core. Presumably the code runs single-threaded on the Dart VM, and the second CPU core spike is due to the VM making optimizations or doing garbage collection on the fly or something?

Anyhow, this result is more than a little surprising to me.

Note that I'm using the 64-bit Windows build, with Dart VM version 2.6.1 (Mon Nov 11 13:12:24 2019 +0100) - perhaps this isn't fully optimized for Windows yet?

Or perhaps the compiler has not been optimized for raw number crunching yet? I suppose the VM has been around for a lot longer and the native compiler is still very new, so maybe the VM has optimizations that the native compiler doesn't have yet?

area-vm type-performance vm-native

But (obviously?) I was expecting this to be even faster when compiled to a native binary.

It is a common misconception - which comes often enough that we should probably add it to an FAQ (/cc @mit-mit).

AOT and JIT compilation have different performance trade-offs. JIT has access to an accurate runtime profile of your application (including information about which parts of the code are hot, which classes are allocated and which receiver types are seen by each individual call site). Using this information JIT can speculate and produce very good machine code tailored for how your program is actually running. This speculation does not even have to be correct for arbitrary inputs to your program - because JIT can always fall back to a slower version dynamically. That is why JIT usually gives you very good peak performance. However you have to pay for this with startup and warmup latency - which is visible when you need to run a lot of code before your application puts the first pixel on the screen.
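One way to see the warmup effect described above is to time the same function over several rounds. The following is only a sketch: sumOfSquares and the input size are made up for illustration, and the actual numbers depend entirely on the machine and SDK version.

```dart
// Hypothetical micro-benchmark; the function and sizes are illustrative only.
int sumOfSquares(List<int> values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i] * values[i];
  }
  return total;
}

void main() {
  final values = List<int>.generate(1000000, (i) => i % 256);
  // Time each round separately so that early (cold) rounds can be
  // compared with later (warmed-up) rounds.
  for (var round = 0; round < 10; round++) {
    final stopwatch = Stopwatch()..start();
    final result = sumOfSquares(values);
    stopwatch.stop();
    print('round $round: ${stopwatch.elapsedMicroseconds} us (result $result)');
  }
}
```

Run under the VM, the first rounds would typically be expected to be slower than the later ones; an AOT-compiled binary should show much flatter timings. A single end-to-end measurement hides this difference.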

AOT has a different story - it does not actually know how your code will run. It has to look at the application as whole, run various global analyses and try to recover information that JIT gets by observing the execution. It can't speculate - it has to produce the code that is guaranteed to work. Sometimes AOT can figure things out, sometimes it falls short of following the flow of types through the program and has to produce generic and rather inefficient code.

You might ask here: wait, is not Dart statically typed? why do we even need any sort of global analyses?

The answer to this question is: yes, Dart is statically typed, but static types don't necessarily give you enough information to produce good code. Take for example a variable v of static type List<int>. This variable can contain any of the following: const [10], [10], Uint8List(1) and Int32List(1) (and more!). Which means that in the general case an access v[0] needs to be compiled in a way that supports all of them - which is rather inefficient compared to what an element access specialised for a particular list type would look like.
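To make the List<int> point concrete, here is a small sketch in which one statically typed function receives four different list implementations. The helper name firstElement is invented for this example, and the printed runtime type names are implementation details that vary between SDK versions.

```dart
import 'dart:typed_data';

// Statically, `v` is only known to be a List<int>; the concrete class
// behind it is decided by the caller at run time.
int firstElement(List<int> v) => v[0];

void main() {
  final candidates = <List<int>>[
    const [10], // immutable constant list
    [10], // ordinary growable list
    Uint8List.fromList([10]), // typed list of 8-bit unsigned elements
    Int32List.fromList([10]), // typed list of 32-bit signed elements
  ];
  for (final v in candidates) {
    // The same v[0] inside firstElement has to work for every entry.
    print('${v.runtimeType}: ${firstElement(v)}');
  }
}
```

All four calls type-check against the same signature, which is why the static type alone does not tell the compiler which indexing code to emit.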

This just scratches the surface of the problem - in reality the situation is even more complex.

That said, we do try to bring the difference between AOT and JIT down as much as possible where it matters.

Yep, I understand all of that.

And for complex functions, I would expect the JIT might be faster.

But for very simple functions, just sheer number crunching, AOT ought to be faster, since it doesn't need to do any of the run-time analyses or optimizations that the JIT needs to do.

And once you enter a very long loop, you know the data-type of the list before you start processing it, so at the very least, that should be faster?

It isn't:

> dart bin/resize.dart

1150 decodeJpg
743 copyResize
888 encodeJpg

> dart2native bin/resize.dart -o bin/resize.exe
> bin\resize.exe

1787 decodeJpg
1050 copyResize
1432 encodeJpg

I'd expect copyResize to be faster, at least?

The code is accurately type-hinted here and here to avoid e.g. type-checking list elements, so AOT really ought to be faster at least for this case, I think?

There should be enough static information available in this case for an AOT to at least beat the JIT on a tight closed loop with well-known types?

I'm not trying to be poignant here, but if AOT is going to be consistently slower than JIT, why even compile to native binary in the first place? Wouldn't it be more efficient to compile to bytecode and link the JIT run-time into the executable?

Wouldn't it be considerably less work and maintenance, too? You have the bytecode compiler and JIT run-time available anyhow - I'm sure maintaining a cross-platform binary back-end for the language is a pretty substantial effort.

Beyond producing stand-alone executables, what value proposition does dart2native have over the VM?

I was (am) excited about being able to produce stand-alone executables, but maybe compiling to native binaries isn't the best or simplest approach? If you could simply embed the JIT engine in a stand-alone executable instead, we'd have the same portability, ease of deployment, better performance, access to reflection, etc. without any further ado.

Perhaps the main benefit of a native binary over an embedded VM approach would be the smaller file size - but is that very important in this day and age? My example with a web server that resizes images comes out around 8 megabytes anyway. I don't know how big an embedded JIT would be, but the bytecode likely would be a few hundred KB, so for many common use-cases, I suspect an embedded VM might even be comparable in terms of size?

And once you enter a very long loop, you know the data-type of the list before you start processing it, so at the very least, that should be faster?

To get peak performance you need to know data-type of the list at compile time - knowing that list type is invariant of the loop could theoretically help, but you would still have some sort of virtual dispatch in the loop itself.

[Note that the type annotation Uint8List does not yield enough information to enable the fastest possible way to access the list, because Uint8List has multiple representations: at the very least it could be a normal Uint8List or it could be a view into another typed list].
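The two representations mentioned in the note can be produced with the standard dart:typed_data constructors. This is a minimal sketch; windowOf and readFirst are hypothetical helpers written for this example.

```dart
import 'dart:typed_data';

// Accepts any Uint8List, whether it owns its buffer or is a view.
int readFirst(Uint8List bytes) => bytes[0];

// Returns a view onto `length` bytes of `source`, starting at `start`.
// No data is copied; both lists share one underlying buffer.
Uint8List windowOf(Uint8List source, int start, int length) =>
    Uint8List.view(source.buffer, source.offsetInBytes + start, length);

void main() {
  final whole = Uint8List(8); // a "normal" Uint8List
  final window = windowOf(whole, 2, 4); // a view into the same bytes

  window[0] = 42; // writes through to whole[2]
  print(whole[2]); // 42
  print(readFirst(window)); // 42
  print(window.length); // 4
}
```

Both values have the static type Uint8List, yet an element access on the view has to account for an offset into a shared buffer, which is the extra case the compiler must handle.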

But for very simple functions, just sheer number crunching, AOT ought to be faster, since it doesn't need to do any of the run-time analyses or optimizations that the JIT needs to do.

Again it is not a straightforward comparison. JIT for example has the chance to speculate on bitwidth of the numbers involved. AOT has to be conservative and prove things.

In general we do have a problem that our AOT compiler does not produce the best numeric code for tight loops (especially with integers) and this is something that we plan to eventually fix.

I looked at the code generated for copyResize - it is true that we can't produce the best code for accessing sData - but I don't think it is the biggest performance sink in the code. I think the biggest issue is that we don't keep r, g, b and a properly unboxed and that we do some pretty bad stuff with si because it is used both in arithmetic and in the indexing operation. I have filed a couple of issues to fix that.

if AOT is going to be consistently slower than JIT, why even compile to native binary in the first place?

As I have indicated before - we would really like to bring AOT performance as close to the JIT performance as possible. We are working on it continuously. It takes time because it is not a trivial problem. It is much easier to make a fast JIT for a language like Dart than a fast AOT, especially if you take certain additional constraints like code size into account. (Dart AOT was originally created for mobile devices - so every byte counts).

The reason to use AOT in the first place is low latency startup and good performance (it might not be as high as JIT performance in all cases, but it is still good enough for many kinds of applications). Also you can use AOT in places where you can't use JIT (e.g. iOS).

If you don't care about startup latency and care about peak performance - then you should certainly use JIT at the moment.

Just found this issue; the same thing is happening here.

My code is nearly 80 lines, fully type annotated (no dynamics) and makes use of const / final variables and const constructors and fixed length lists when possible. Also, the part of the code where most of the time is spent is on a switch (this is normally optimized into a jump table in some compilers).

I think more consideration should be given to optimizations, especially since it took quite some time to compile the Dart code AOT (I thought part of that was due to optimizations?).

I will leave the code here, in case you may use it for further improvements,
Have a nice day!

https://github.com/ConsoleTVs/dartVM

We would really like to bring AOT performance as close to the JIT performance as possible. We are working on it continuously.

There is a ray of hope in this sentence, and just because of this I am switching to Dart AOT for full-stack (back-end, front-end) development. Having said that, I think we can try to learn from other statically typed AOT-compiled languages like Go, Rust, Crystal, and Nim.

The issue is not about learning other languages. I can code in almost 20. The thing is that, as far as I can see, AOT compilation is only meant for startup-sensitive apps. However, most people expect run-time performance rather than startup performance.

@ConsoleTVs you can replace List<int> with Int32List to speed up the AOT version of the code. We currently lose some type information in the backend that is needed to produce good code (filed https://github.com/dart-lang/sdk/issues/39515 to track fixing that)
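The suggested change amounts to narrowing a declared type. The sketch below uses invented function names (sumGeneric, sumTyped) and is not taken from the linked repository; whether the typed variant is actually faster depends on the SDK version and has to be measured.

```dart
import 'dart:typed_data';

// Generic version: `values` may be any implementation of List<int>.
int sumGeneric(List<int> values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}

// Narrowed version: the parameter type promises an Int32List, which gives
// the compiler fewer list implementations to consider.
int sumTyped(Int32List values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}

void main() {
  final plain = List<int>.generate(1000, (i) => i);
  final typed = Int32List.fromList(plain);
  print(sumGeneric(plain)); // 499500
  print(sumTyped(typed)); // 499500
}
```

Note that Int32List stores 32-bit signed elements, so this swap only preserves behaviour when every value fits in that range.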

Great to hear! This could make the AOT version run faster, taking 12.380 s instead of 16.945 s (don't take those numbers seriously, those are not accurate). Still an improvement!

In this new serverless world of AWS Lambda, startup performance, run-time performance and CPU/memory efficiency are important.

In this new serverless world of AWS Lambda, startup performance, run-time performance and CPU/memory efficiency are important.

What are you trying to prove?

Nothing.

May I ask, why g++ is able to produce so much faster code with AOT. Beating almost every JIT in existence. Yet Dart, with all those static types and ahead of time information becomes incompetent in front of JIT.

The languages using JIT, actually have their complex logic(and often their core library) implemented in statically typed and AOT compiled language.

And the false propaganda I have seen here, right after dart's AOT release is that JIT is faster than AOT.

even Java's JIT cannot beat C/C++'s AOT.
All the optimizations that the JIT is busy doing, is enough to slow it down below the AOT speed.
AOT is awaiting optimizations.

@thomasb892 Because of this: We currently lose some type information in the backend that is needed to produce good code

@thomasb892

May I ask, why g++ is able to produce so much faster code with AOT.

Because g++ is compiling C++, which is a much lower-level language. Imagine you write something like this in C++:

struct S {
  int f;
};

int foo(int a, std::vector<int>& b, S* p) {
  return a + b[0] + p->f;  
} 

When a C++ compiler compiles this function it does not have to worry that a can be nullptr (because it can't - int is a primitive type, not a pointer), that b is anything but actually std::vector, that p->f is a method call rather than just an access to an int type member at fixed offset.

In Dart none of this is true.

class S {
  final int f;
  S(this.f);
}

int foo(int a, List<int> b, S p) {
  return a + b[0] + p.f;  
} 

a can be null, b can be null or any instance of any implementation of List<int>, p can point to SImpl defined as

class SImpl implements S {
  get f => throw "Hahaha";
} 

and so on and so forth.

So comparing Dart AOT to C++ does not really help. Compiling C++ is easier.
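For readers who want to run the Dart side of this comparison, here is a self-contained sketch. It assumes a null-safe Dart release (newer than the 2.6 SDK discussed in this thread), where a and b can no longer be null, so it only demonstrates the remaining point: p.f may be a getter call rather than a field load. StateError stands in for the thrown string of the original example.

```dart
class S {
  final int f;
  S(this.f);
}

// Implements the interface of S, but replaces the field with a getter
// that throws instead of returning a value.
class SImpl implements S {
  @override
  int get f => throw StateError('Hahaha');
}

// The compiler cannot assume `p.f` is a load from a fixed offset,
// because `p` may be any class that implements S.
int foo(int a, List<int> b, S p) => a + b[0] + p.f;

void main() {
  print(foo(1, [2], S(3))); // 6
  try {
    foo(1, [2], SImpl());
  } on StateError catch (e) {
    print('p.f turned out to be a getter call: ${e.message}');
  }
}
```

The second call compiles against exactly the same signature as the first, which is what forces the general, slower code path unless the compiler can prove that no such implementation exists in the program.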

(As a sidenote: even a C++ compiler can be assisted by PGO, e.g. you can get significant performance improvements from re-laying out binaries or doing profile guided devirtualization - which highlights pure AOT's shortcomings).

@mraleph I think Dart was supposed to get strict nulls soon? Which should address that issue at least.

Yes, we definitely plan on doing VM perf optimisations once we have null safety landed.

@mit-mit Any plans for AOT perf optimisation? Flutter uses AOT on iOS, and it is also possible on Android. And for serverless, startup speed, memory usage and runtime perf are all important.

@windrunner414 we are continuously working on improving performance of AOT code.

If you have some specific code in mind which you think runs slow please file a separate issue. Then we can take a look and suggest if we can do something on our side to make the code faster or if the code could be changed to make it faster.

@mraleph the raw number crunching performed by the image library I mentioned in this issue is definitely a good candidate? It ought to perform better with AOT, as it's all statically-typed and, well, this is what CPUs do best. Getting close to bare-metal performance ought to be possible. :smile:

@mindplay-dk while in general we want to improve performance of working with numbers, I would say that using pure Dart ports of image manipulation routines does not make sense to me - if performance is important - instead I'd recommend calling some native library to do the image manipulation (you can sandbox it if you are worried about vulnerabilities).

@mraleph

List b is only supposed to be passed by reference. So it can be a pointer. Therefore it can be null. In C/C++ they are mostly pointers otherwise it's slow. We could use everything as pointers.

Also that OOP code, even C++ does it. And does it rather fast. Dart AOT has a lot of potential.

When the new null safety lands, we can know which values are never null. For nullable objects, maybe a check can be enforced: you can't call anything on them unless you first check for null.
And for S, maybe we don't need to care about its runtimeType and can just use the offset within S, like C++: pass the S pointer rather than the SImpl pointer.

class S {int a=1;}
class S1 {int b=2;}
class SImpl extends S with S1 {int c=3;}

*SImpl, *S -> int a
      *S1 -> int b
            int c

void s(S1 s1) => print(s1.b);
SImpl anyImpl = SImpl();

If we call s(anyImpl), we pass the *S1. I think we can know what type anyImpl is unless it's dynamic. If it's dynamic, checking the runtimeType is necessary, but if not, this step can be skipped.

There are possibly more tricks one could use to speed up AOT because of Dart being very similar to Java. Android shifted from Dalvik VM(JIT) to ART runtime(AOT) and it has only been faster ever since.

Maybe we could learn from ART.

@thomasb892

List b is only supposed to be passed by reference. So it can be a pointer. Therefore it can be null. In C/C++ they are mostly pointers otherwise it's slow. We could use everything as pointers.
Also that OOP code, even C++ does it. And does it rather fast. Dart AOT has a lot of potential.

I am not sure I understand what you are trying to say here. Yes, b is a pointer. Yes, it can be null. What's next? In reality it is more of a problem that variables of primitive types (like int) can be null - this is much bigger issue for performance than that variables of "complex" types like List<...> can be null.

That's where C++ differs a lot from Dart - variables of primitive types can't ever be nullptr there. Also, if you use pointers in C++ and then dereference them, the compiler is actually free to assume that the pointer is not nullptr (it is UB to dereference a NULL pointer). In Dart, null is an actual object which has some methods (null.hashCode and null.toString work), while an attempt to call anything else on null will trigger null.noSuchMethod. A drastic difference from C++. Though again: nullability is the biggest issue for primitives. For something like List<> the biggest issue is that often you don't know which implementation of List<> you are getting. It's as if in C++ instead of passing around std::vector<T>& you would pass around some sort of abstract interface with virtual methods and std::vector<> was one of the possible implementations. (Though it is even more complicated than that because of the covariance in Dart - C++ templates are invariant).

There are possibly more tricks one could use to speed up AOT because of Dart being very similar to Java. Android shifted from Dalvik VM(JIT) to ART runtime(AOT) and it has only been faster ever since.

Yes, there are tricks to speed up AOT. If you actually look through the git history you will discover that we are constantly applying new ones :)

Note that these days ART does not actually use a simple AOT - since Android N it actually uses profile guided AOT which is driven by profiles collected at runtime. You don't compile the whole app on installation - instead you run the application in a JIT and then use some background process to recompile hot parts of your application based on the profile information. Since Android 8 this profile information contains among other things inline cache states - which allows "AOT" (I'd rather call it _asynchronous JIT_ though) to perform speculative optimisations.

Also as I have said before: when compiling Java you don't face all the same challenges that you face when compiling Dart - for example Java int and double are non-nullable primitives just like in C++.

@windrunner414

When the new null safety lands, we can know which values are never null. For nullable objects, maybe a check can be enforced: you can't call anything on them unless you first check for null.

It is true, though it must be clarified that initially most applications would be run in hybrid opt-in/opt-out mode in which you can actually violate non-nullability promises. Only if your application is fully opted in (no dependencies are opted-out) and you are running in _strong checking mode_ you can be sure that int x is never null. We do plan to make good use of non-nullability information for such applications.
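As a rough illustration of what the non-nullability information looks like in source code, the sketch below assumes a fully opted-in, null-safe program. The function names are invented for this example.

```dart
// In sound null-safe code an `int` parameter can never hold null,
// so no null check is needed before the multiplication.
int doubled(int x) => x * 2;

// A nullable parameter must be checked before use; after the check the
// analyzer promotes `y` from `int?` to `int`.
int doubledOrZero(int? y) {
  if (y == null) return 0;
  return y * 2;
}

void main() {
  print(doubled(21)); // 42
  print(doubledOrZero(null)); // 0
  print(doubledOrZero(4)); // 8
}
```

Only the first function gives the compiler a guarantee it can rely on without a runtime check, and, as noted above, only when no opted-out code can pass a null through.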

And for S, maybe we don't need to care about its runtimeType and can just use the offset within S, like C++: pass the S pointer rather than the SImpl pointer.

Yeah, I know how C++ implements inheritance. I am not sure why you are bringing it up here though. Notice that the original example with SImpl replaces a field with a getter. How is this technique helpful in addressing that? (It is not)

It's an interesting question whether there is a lot of performance sensitive code like that to begin with.

Leaving that aside (assuming for example this sort of code was important and we wanted to apply this technique), I can see at least few challenges applying it:

  • it introduces inner pointers - which require some complexity in the GC
  • it does not play well with covariance, think about List<SImpl> which is a List<S> or about void Function(S) which is void Function(SImpl). This means when you cast objects like this you need to introduce wrappers which normalise pointer representation. This quickly gets complicated.
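The covariance mentioned in the second bullet can be observed directly. In this sketch S and SImpl are empty stand-in classes and addPlainS is a hypothetical helper; the example relies on Dart's runtime check for writes into a covariantly typed list.

```dart
class S {}

class SImpl extends S {}

// Tries to add a plain S to the list. Returns false if the runtime
// covariance check rejects the element.
bool addPlainS(List<S> items) {
  try {
    items.add(S());
    return true;
  } on TypeError {
    return false;
  }
}

void main() {
  final List<SImpl> impls = [SImpl()];
  // Allowed statically: List<SImpl> is a subtype of List<S>.
  final List<S> asBase = impls;

  print(addPlainS(asBase)); // false: the list really holds SImpl only
  print(addPlainS(<S>[])); // true
}
```

Because the same object is reachable through both static types, any scheme that gives S and SImpl references different pointer representations would have to convert every element whenever such a list is viewed through the other type.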

@mraleph We can know at compile time whether it might be a getter/setter, and let S and every implementation of S have a getter and setter, not just int f. It may improve performance, but you are right: there are many challenges and it's complicated.

Correct me if I'm wrong. Dart is now a truly statically typed language (since Dart 2.0) and is being used to develop mobile apps. Flutter is a cross-platform framework with native performance. To better compete with native apps (written in Java/Kotlin/Swift), high performance is important.
So, is there any plan to support unboxed types (something like Java value types, inline classes[1], value types without object identity) to further improve performance and reduce memory usage?

_[1]State of Valhalla. The Road to Valhalla(https://cr.openjdk.java.net/~briangoetz/valhalla/sov/01-background.html)_

@hooluupog Feature Request for value types is better raised at dart-lang/language, because it is a language design decision. We have discussed adding value types for many years now - and so far there have been much higher priority issues to tackle.

@mraleph Okay, got it.

Forgive me if I'm getting this wrong but it sounds like it comes down to losing type metadata for the sake of file sizes? Are there other performance issues that this is important for? I can understand why this would be important for mobile apps and for dart2js but for server-side apps and CLIs, performance would be much more important than file size IMO.

Forgive me if I'm getting this wrong but it sounds like it comes down to losing type metadata for the sake of file sizes?

No, I am not sure which part of this thread made you think this way.

It is true though that we take code size in consideration - which impacts for example our inlining decisions (AOT inlining is much less aggressive than JIT inlining as a result), but that's a somewhat separate topic.

Forgive me, but this is kind of off topic: for a backend server application like aqueduct, is it recommended to deploy to production in AOT or JIT on, say, Google's Cloud Run (semi serverless)? @devisions has been adding AOT support to aqueduct here

That would depend on a bunch of factors, such as how frequently your backend spins down/up, what kind of code it runs, etc. I'd recommend doing some benchmarking for your particular workload.

@sjapps It is true that - as @mit-mit Michael said - some stress testing would be needed on both AOT (native) and JIT (non-native), as your application's behavior under JIT optimizations may sometimes be better than with the native version.

Indeed, startup time and memory usage may favor your needs and expectation.

This is applicable to all other similar platforms, such as Java (more specifically, look for Quarkus with GraalVM).

Oh, and the lovely AOT capability of Aqueduct has been added by the Aqueduct Team and @joeconwaystk himself. I am starting to invest time in it as I would love to contribute, and in this particular case I was just a messenger and gave back some feedback. 😊
