Sdk: dart2native performance issue

Created on 6 Nov 2019 · 3 Comments · Source: dart-lang/sdk

I used dart2native to compile a small Dart program that reads through a large word vector file (typically > 1 GB) and prints some statistics. The execution times were disappointing: the compiled native executable took real 0m37.340s (user 0m35.359s), while the same program run uncompiled on the Dart VM took only real 0m23.449s (user 0m23.894s). This was on a MacBook Air.
My Dart program, vec-test.dart, is listed at the bottom. I run it like this: dart vec-test.dart some_file.vec
A vec file (text) may be found at https://fasttext.cc/docs/en/crawl-vectors.html, e.g. https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.no.300.vec.gz

  • Dart SDK Version (dart --version)
    Dart VM version: 2.6.0 (Thu Oct 24 17:52:22 2019 +0200) on "macos_x64"

  • Whether you are using Windows, MacOSX, or Linux (if applicable)
    macOS 10.14.6, Mojave

import "dart:io";
import "dart:convert";
import "dart:async";
import "dart:math";

RegExp norwRx = new RegExp(r"^[a-zæøåé]+\-?[a-zæøåé]*$");

bool nordic(String word) {
  return true;
}

main(List<String> arguments) {
  final filename = arguments.first;
  final file = new File(filename);
  Stream<List<int>> inputStream = file.openRead();

  final verbose = arguments.length > 1;
  int wordLines;
  int dims;
  int goodCount = 0;
  double maxLen = 0.0;
  String maxWord;

  inputStream
    .transform(utf8.decoder)
    .transform(new LineSplitter())
    .listen((String line) {
        List<String> parts = line.split(" ");
        if (parts.length <= 3) {  // may be a trailing space
          wordLines = int.parse(parts[0]);
          dims = int.parse(parts[1]);
          print("$wordLines word lines, $dims dimensions");
        } else if(norwRx.hasMatch(parts[0])) {
          double sumOfSquares = 0.0;
          for (int i=1; i<=dims; i++) {
            double d = double.parse(parts[i]);
            sumOfSquares += d * d;
          }
          double vLen = sqrt(sumOfSquares);
          goodCount++;
          if (vLen > maxLen) {
            maxLen = vLen;
            maxWord = parts[0];
          }
          if (verbose) {
            print("${parts[0]}: $vLen");
          }
        }
      },
      onDone: () {
        print("\naccepted words: $goodCount");
        print("maxLen=$maxLen, for '$maxWord'");
      },
      onError: (e) { print(e.toString()); }
    );
}

Labels: area-vm, type-performance, vm-native

All 3 comments

RegExp seems to be much slower with dart2native. Here is another case:

#!/usr/bin/env dart

void main() {
  final re = RegExp(r'^/foo/bar/baz/(.+)$');
  final s = '/foo/bar/baz/the_five_boxing_wizards/jump/quickly';

  final results = List<String>(100000);
  final sw = Stopwatch()..start();
  for (var i = 0; i < results.length; i += 1) {
    results[i] = re.firstMatch(s)?.group(1);
  }
  print('${sw.elapsedMilliseconds}ms');
}

Running this directly (i.e., through dart) on my Linux laptop, it completes in ~30ms.

After compiling with dart2native and running the result, it instead completes in ~500ms(!).

Dart VM version: 2.6.1 (Mon Nov 11 13:12:24 2019 +0100) on "linux_x64"

Edit: looked into this a bit more and this seems to be the same issue as #37774, #39139.
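
For a fixed-prefix pattern like the one above, the same captured text can be obtained without RegExp at all. This is not something proposed in this thread, only a minimal sketch of a possible workaround while the AOT RegExp slowdown exists. The helper name `tailAfterPrefix` is hypothetical, and returning '' for "no match" is an assumption made so the code does not depend on null handling.

```dart
/// Hypothetical helper: returns what group(1) of ^<prefix>(.+)$ would
/// capture, or '' when that pattern would not match [s].
String tailAfterPrefix(String s, String prefix) {
  if (!s.startsWith(prefix)) return '';
  final tail = s.substring(prefix.length);
  // `.` does not match line terminators, so a tail containing one
  // would make the original (non-multiline) pattern fail.
  for (final unit in tail.codeUnits) {
    if (unit == 0x0A || unit == 0x0D || unit == 0x2028 || unit == 0x2029) {
      return '';
    }
  }
  // An empty tail also yields '', matching the `.+` (one or more) rule.
  return tail;
}

void main() {
  const prefix = '/foo/bar/baz/';
  const s = '/foo/bar/baz/the_five_boxing_wizards/jump/quickly';

  final results = List<String>.filled(100000, '');
  final sw = Stopwatch()..start();
  for (var i = 0; i < results.length; i += 1) {
    results[i] = tailAfterPrefix(s, prefix);
  }
  print('${sw.elapsedMilliseconds}ms');
}
```

This only applies when the pattern is a literal prefix followed by "anything"; it does not address the underlying RegExp performance difference.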

I'm running into the same issue on Flutter, where in the release build (which is AOT) the RegExp performance is heavily degraded; see the profiling below. Parsing the same content took 1476ms (AOT) vs 84ms (JIT).

From the profiling, the culprit seems to be _ExecuteMatchSticky. In other profiling runs I did, the effect was much more pronounced; in hindsight I should have taken screenshots of those.

[Image: Screen Shot 2020-04-04 at 2 49 01 pm]

[Image: Screen Shot 2020-04-04 at 2 48 36 pm]

My flutter doctor output, for the Dart version:

[✓] Flutter (Channel unknown, v1.15.9, on Mac OS X 10.15.4 19E266, locale en-AU)
    • Flutter version 1.15.9 at /Users/fwang/Documents/Personal/flutter
    • Framework revision cc52a903a8 (4 weeks ago), 2020-03-04 18:59:18 -0800
    • Engine revision 810727bf3f
    • Dart version 2.8.0 (build 2.8.0-dev.11.0 57462f9ca5)

I am happy to provide more details if needed.

I have another very simple example where the compiled performance is about 85% worse than in the VM:

void main() {
  final List<int> result = List(3000);
  for (int i = 0; i < 15; i++) {
    final start = DateTime.now();
    for (int j = 0; j < 3000; j++) {
      result[j] = i;
    }
    int sum = 0;
    for (int j = 0; j < 3000; j++) {
      sum += result[j];
    }
    // final int sum = result.reduce((t, v) => t + v);
    final end = DateTime.now();
    print('${end.difference(start).inMicroseconds}, $sum');
  }
}

VM results (you can see the JIT warming up in the first 3 rows, that's fine):

93, 0
154, 3000
85, 6000
7, 9000
6, 12000
6, 15000
7, 18000
6, 21000
6, 24000
6, 27000
6, 30000
6, 33000
6, 36000
7, 39000
6, 42000

dart2native results:

13, 3000
12, 6000
11, 9000
11, 12000
12, 15000
11, 18000
11, 21000
11, 24000
11, 27000
11, 30000
11, 33000
11, 36000
11, 39000
13, 42000

P.S. When using final int sum = result.reduce((t, v) => t + v); to calculate the sum instead of the manual loop, it takes 2-8 times longer on the VM and about 2 times longer when compiled.

_Edit_: Using Uint32List (or Uint8List, which is enough in this case) instead of List<int> makes the compiled performance the same as in the VM. Can't the compiler do this optimisation on its own?
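
The typed-list variant mentioned in the edit might look like the sketch below. It is a reconstruction, not the exact code that was measured: the loop body is moved into a helper named `fillAndSum` (a hypothetical name), and Stopwatch replaces DateTime.now() for timing. Uint32List comes from dart:typed_data and stores fixed-width 32-bit unsigned integers, so it only suits values that fit in that range, which the values 0 to 14 used here do.

```dart
import 'dart:typed_data';

/// One benchmark round: fill [result] with [value], then sum the elements.
/// The parameter is typed as Uint32List rather than List<int>, so the
/// concrete typed-data class is visible at the call site.
int fillAndSum(Uint32List result, int value) {
  for (var j = 0; j < result.length; j++) {
    result[j] = value;
  }
  var sum = 0;
  for (var j = 0; j < result.length; j++) {
    sum += result[j];
  }
  return sum;
}

void main() {
  // Same size and round count as the List<int> example above.
  final result = Uint32List(3000);
  for (var i = 0; i < 15; i++) {
    final sw = Stopwatch()..start();
    final sum = fillAndSum(result, i);
    sw.stop();
    print('${sw.elapsedMicroseconds}, $sum');
  }
}
```

Running this both with dart and as a dart2native executable allows the same comparison as above. Timings will vary by machine and SDK version, so none are listed here.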
