Sdk: dart2native performance - inefficient object allocation in async code

Created on 23 Sep 2020 · 10 Comments · Source: dart-lang/sdk

I found a surprising case where dart2native produced async code that was 50% slower than the same code on the VM, in fairly trivial code.

In general, I was not expecting dart2native to be slower than the Dart VM for a simple file-reading operation. That's a bummer and I don't fully understand it, but I can live with a 10-20% drop.

However, a simple async loop dropped to _half_ its speed on native, which I think could be considered a bug.

Hypothesis:

It seems that dart2native is missing an important memory-allocation optimization in async code; the same optimization does kick in for synchronous code.

Compare these snippets:

```dart
final handle = file.openSync();
var total = 0;
do {
  total += blockSize;
  await handle.read(blockSize).then((Uint8List block) { // <-- result is re-allocated on each read
    // process result
  });
} while (total < length);
```

```dart
final handle = file.openSync();
var total = 0;
var block = Uint8List(blockSize); // <-- pre-allocation
do {
  total += blockSize;
  await handle.readInto(block).then((n) {
    // process result
  });
} while (total < length);
```
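
To make the comparison runnable end-to-end, here is a minimal self-contained harness along the lines of the snippets above; the file path and block size are placeholders I chose, not the exact values behind the measurements below.

```dart
import 'dart:io';
import 'dart:typed_data';

const blockSize = 64 * 1024; // hypothetical block size

Future<Duration> readLoop(String path, {bool preallocate = false}) async {
  final file = File(path);
  final length = file.lengthSync();
  final handle = file.openSync();
  final block = Uint8List(blockSize); // reused only when preallocate is true
  final watch = Stopwatch()..start();
  var total = 0;
  do {
    total += blockSize;
    if (preallocate) {
      await handle.readInto(block); // fills the existing buffer
    } else {
      await handle.read(blockSize); // allocates a fresh Uint8List each time
    }
  } while (total < length);
  await handle.close();
  return watch.elapsed;
}

Future<void> main() async {
  const path = 'testdata.bin'; // hypothetical 1 GB test file
  print('async : ${await readLoop(path)}');
  print('async2: ${await readLoop(path, preallocate: true)}');
}
```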

Reproduce

I ran each test 4 times and noted the best result.

VM performance

| case | command | duration | speed | reaction |
| :--- | :--- | ---: | ---: | --- |
| sync loop | dart readfiletest.dart read 1 GB sync | 4.009 s | 255 MB/s | 😐 |
| sync loop with pre-allocated block | dart readfiletest.dart read 1 GB sync2 | 3.430 s | 304 MB/s | 🙂 |
| async loop | dart readfiletest.dart read 1 GB async | 3.481 s | 294 MB/s | 🙂 |
| async loop with pre-allocated block | dart readfiletest.dart read 1 GB async2 | 3.541 s | 290 MB/s | 🙂 |

Note that pre-allocating gives a significant performance boost in _sync_ code as well, though the difference is not as dramatic.
Perhaps this invalidates the above hypothesis.

Native performance (Windows)

| case | command | duration | speed | reaction |
| :--- | :--- | ---: | ---: | --- |
| sync loop | filereadtest.exe read 1 GB sync | 4.302 s | 238 MB/s | 😕 |
| sync loop with pre-allocated block | filereadtest.exe read 1 GB sync2 | 4.322 s | 234 MB/s | 😕 |
| async loop | filereadtest.exe read 1 GB async | 7.733 s | 132 MB/s | 😮 |
| async loop with pre-allocated block | filereadtest.exe read 1 GB async2 | 4.483 s | 228 MB/s | 😕 |

The biggest surprise to me is that _none_ of the implementations is faster natively than on the VM.

C++ performance

(for comparison)

| case | command | duration | speed | reaction |
| :--- | :--- | ---: | ---: | --- |
| sync loop with re-used block | build in release mode: readfiletest2.cpp | 2.209 s | 463 MB/s | 😁 |

I hope someone can provide some insight into these performance differences, and into how to get better raw file I/O performance.

area-vm

All 10 comments

cc @mkustermann

@boukeversteegh Thank you for reporting this bug.

The cause of this is a loss of type information in our optimizing compiler. After it loses the Uint8List type, an optimization no longer applies, which means that instead of accessing the bytes in the loop directly, it makes a call for each byte access.
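
To illustrate with a hypothetical Dart sketch (not taken from the benchmark): keeping the parameter's Uint8List type lets `block[i]` compile to a direct typed-data load, while losing it turns each access into a call.

```dart
import 'dart:typed_data';

// Hypothetical illustration: the closure's Uint8List parameter is
// type-checked on entry (the AssertAssignable discussed below). If the
// optimizer retains that fact, block[i] lowers to a direct typed-data
// load; if it is lost at the parameter's phi node, every block[i] goes
// through a generic call instead.
Future<int> sumNext(Future<Uint8List> data) {
  return data.then((Uint8List block) {
    var total = 0;
    for (var i = 0; i < block.length; i++) {
      total += block[i];
    }
    return total;
  });
}
```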

There are several issues here:

Issue 1) The closure gets multiple entry points. The phi node for the parameter should get Redefinition(vp, {Uint8List}) and AssertAssignable(vp, Uint8List) as inputs, but it does not, i.e. the AssertAssignable is not used via data-dependency-for-control-dependency.

This can be fixed by:

```diff
diff --git a/runtime/vm/compiler/frontend/kernel_to_il.cc b/runtime/vm/compiler/frontend/kernel_to_il.cc
index 5ff3f4325c2..3af403e51b9 100644
--- a/runtime/vm/compiler/frontend/kernel_to_il.cc
+++ b/runtime/vm/compiler/frontend/kernel_to_il.cc
@@ -1822,8 +1822,9 @@ void FlowGraphBuilder::BuildArgumentTypeChecks(
     Fragment* checks = is_covariant ? explicit_checks : implicit_checks;

     *checks += LoadLocal(param);
-    *checks += CheckAssignable(*target_type, name,
-                               AssertAssignableInstr::kParameterCheck);
+    *checks += AssertAssignableLoadTypeArguments(TokenPosition::kNoSource,
+                                                 *target_type, name, AssertAssignableInstr::kParameterCheck);
+    *checks += StoreLocal(param);
     *checks += Drop();

     if (!is_covariant && implicit_redefinitions != nullptr && optimizing_) {
```

Issue 2) Even with the above fix, a Canonicalize pass will remove the data-dependency-for-control-dependency when removing the phi node in PhiInstr::Canonicalize by fully unwrapping both inputs.

Issue 3) The TypedData optimizer uses the CompileType of the input definition rather than the (potentially more precise) type of the input Value, which can be fixed by:

```diff
diff --git a/runtime/vm/compiler/call_specializer.cc b/runtime/vm/compiler/call_specializer.cc
index 2c7f4dcd64d..d555f8b51eb 100644
--- a/runtime/vm/compiler/call_specializer.cc
+++ b/runtime/vm/compiler/call_specializer.cc
@@ -1571,16 +1571,16 @@ void TypedDataSpecializer::TryInlineCall(TemplateDartCall<0>* call) {

     const intptr_t receiver_index = call->FirstArgIndex();

-    CompileType* receiver_type = call->ArgumentAt(receiver_index + 0)->Type();
+    CompileType* receiver_type = call->ArgumentValueAt(receiver_index + 0)->Type();

     CompileType* index_type = nullptr;
     if (is_index_get || is_index_set) {
-      index_type = call->ArgumentAt(receiver_index + 1)->Type();
+      index_type = call->ArgumentValueAt(receiver_index + 1)->Type();
     }

     CompileType* value_type = nullptr;
     if (is_index_set) {
-      value_type = call->ArgumentAt(receiver_index + 2)->Type();
+      value_type = call->ArgumentValueAt(receiver_index + 2)->Type();
     }

     auto& type_class = Class::Handle(zone_);
```

@mraleph Does Issue 2 above look like a problem to you (i.e. the fact that we remove control-dependency-as-data-dependency chains during Canonicalize)?

The original performance problem should be fixed in master now.

@boukeversteegh thanks for reporting this. It would be great if you could re-measure and report an updated table here!

Does the following release contain the fix?

https://storage.googleapis.com/dart-archive/channels/dev/release/2.11.0-180.0.dev/sdk/dartsdk-windows-x64-release.zip

Because I'm not sure I'm using the correct release, I just did a very quick rerun comparing the old version with the latest dev.
I ran each test just once. The results seem to indicate that I'm not running the correct release...

If it's the correct release, I will run with best out of 4 and include the Dart VM results.

Native performance (Windows)

| case | command | 2.9.3 stable | 2.11.0-180.0.dev |
| :--- | :--- | ---: | ---: |
| sync loop | filereadtest.exe read 1 GB sync | 228 MB/s | 230 MB/s |
| sync loop with pre-allocated block | filereadtest.exe read 1 GB sync2 | 235 MB/s | 237 MB/s |
| async loop | filereadtest.exe read 1 GB async | 114 MB/s | 115 MB/s |
| async loop with pre-allocated block | filereadtest.exe read 1 GB async2 | 160 MB/s | 161 MB/s |

@boukeversteegh No, it's not available in the dev channel, only on master. If you want to try it out, you can download it from

https://storage.googleapis.com/dart-archive/channels/be/raw/hash/aaff0b67f0bee8b99763f00ded653c007bcc1933/sdk/dartsdk-windows-x64-release.zip

Just reran all the tests, best out of 4. Very good improvements on the async implementation!

It turned from being the _slowest_ into the _fastest_ one, for AOT. 😄

VM Performance

|command|2.9.3|master|
| --- | --: | --: |
|read 1 GB sync|280 MB/s|279 MB/s|
|read 1 GB sync2|329 MB/s|332 MB/s|
|read 1 GB async|317 MB/s|316 MB/s|
|read 1 GB async2|309 MB/s|309 MB/s|

Native Performance

|command|2.9.3|master|reaction|
| --- | --: | --: | --- |
|read 1 GB sync|229 MB/s|258 MB/s|😊|
|read 1 GB sync2|236 MB/s|248 MB/s|😊|
|read 1 GB async|114 MB/s|262 MB/s|😁👍|
|read 1 GB async2|162 MB/s|250 MB/s|😁👍|

Thank you very much for this amazingly quick fix!

@boukeversteegh great to hear!

I'm curious, by the way: how, and for what kind of app, are you using Dart Native?

@mit-mit thank you for your interest!

I need to build a set of apps, CLI tools, websites, and backend applications revolving around a core technology that is I/O-heavy (reading git repositories with many pack files). At the moment I'm focusing on the CLI tools and writing an SDK.

To avoid a complicated multi-language stack, and to make the core tech portable to mobile, I chose Dart.

dart2native interested me in particular for the following reasons:

  • the ability to distribute CLI tools as binaries, so users won't need to install Dart
  • potential performance gains when running Dart on my servers (so far, I've let go of that expectation)

I started running some tests to figure out the fastest way to read files in Dart, and then stumbled upon this issue.

What I've learned so far:

  • async file I/O is slow, even when doing CPU-heavy operations on the data asynchronously
  • random file access is slow in general; it is much faster to load the whole file into memory, even compared to reading the file sequentially in small blocks
  • file.readAsBytesSync() is the fastest, but it cannot read files over 1 GB (a chunked workaround is sketched below)
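
As a workaround for that last point, here is a minimal sketch of reading a large file sequentially into one pre-allocated buffer with the synchronous API; the path and block size are placeholders I chose.

```dart
// Minimal sketch: sequential synchronous reads into one pre-allocated
// buffer, avoiding both per-read allocation and the readAsBytesSync limit.
import 'dart:io';
import 'dart:typed_data';

void main() {
  const path = 'testdata.bin'; // hypothetical large file
  const blockSize = 1 << 20; // 1 MiB per read
  final handle = File(path).openSync();
  final block = Uint8List(blockSize);
  var bytesRead = 0;
  while (true) {
    final n = handle.readIntoSync(block);
    if (n == 0) break; // end of file
    bytesRead += n;
    // process the first n bytes of `block` here
  }
  handle.closeSync();
  print('read $bytesRead bytes');
}
```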

If you have any tips for improving raw throughput in Dart, I'd be happy to hear!

> If you have any tips for improving raw throughput in Dart, I'd be happy to hear!

You might want to consider doing memory-mapped I/O (which can be done through dart:ffi) instead of using the dart:io APIs. I'd expect that to be considerably faster if you need to work with large files.
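
A minimal sketch of that idea, POSIX-only (Linux/macOS) with error handling elided; the constant values for O_RDONLY, PROT_READ, and MAP_PRIVATE are the common Linux ones and are an assumption, as is the placeholder file name:

```dart
import 'dart:ffi';
import 'dart:io';

import 'package:ffi/ffi.dart';

typedef _OpenC = Int32 Function(Pointer<Utf8> path, Int32 flags);
typedef _OpenDart = int Function(Pointer<Utf8> path, int flags);
typedef _MmapC = Pointer<Uint8> Function(Pointer<Void> addr, IntPtr length,
    Int32 prot, Int32 flags, Int32 fd, IntPtr offset);
typedef _MmapDart = Pointer<Uint8> Function(
    Pointer<Void> addr, int length, int prot, int flags, int fd, int offset);
typedef _MunmapC = Int32 Function(Pointer<Uint8> addr, IntPtr length);
typedef _MunmapDart = int Function(Pointer<Uint8> addr, int length);
typedef _CloseC = Int32 Function(Int32 fd);
typedef _CloseDart = int Function(int fd);

void main() {
  final libc = DynamicLibrary.process();
  // open(2) is variadic in C; calling it with two arguments this way
  // works on common ABIs.
  final open = libc.lookupFunction<_OpenC, _OpenDart>('open');
  final mmap = libc.lookupFunction<_MmapC, _MmapDart>('mmap');
  final munmap = libc.lookupFunction<_MunmapC, _MunmapDart>('munmap');
  final close = libc.lookupFunction<_CloseC, _CloseDart>('close');

  const oRdonly = 0; // O_RDONLY (Linux)
  const protRead = 1; // PROT_READ (Linux)
  const mapPrivate = 2; // MAP_PRIVATE (Linux)

  const path = 'testdata.bin'; // hypothetical large file
  final length = File(path).lengthSync();

  final pathPtr = path.toNativeUtf8();
  final fd = open(pathPtr, oRdonly);
  malloc.free(pathPtr);

  // Map the whole file read-only; the OS pages bytes in on demand.
  final ptr = mmap(nullptr, length, protRead, mapPrivate, fd, 0);

  // View the mapping as a Uint8List without copying.
  final bytes = ptr.asTypedList(length);
  var checksum = 0;
  for (var i = 0; i < length; i++) {
    checksum = (checksum + bytes[i]) & 0x7fffffff;
  }
  print('checksum: $checksum');

  munmap(ptr, length);
  close(fd);
}
```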
