Sdk: UTF8.decode is slow (performance)

Created on 21 Jan 2018 · 15 comments · Source: dart-lang/sdk

UTF8.decode seems surprisingly slow for large buffers.

Using this Flutter program:

import 'dart:async';
import 'dart:convert';
import 'dart:typed_data';

import 'package:flutter/services.dart';

Future<Null> main() async {
  ByteData data = await rootBundle.load('LICENSE');
  for (int i = 1024; i < data.lengthInBytes; i += 1024) {
    final Stopwatch watch = new Stopwatch()..start();
    UTF8.decode(data.buffer.asUint8List(0, i));
    int time = watch.elapsedMilliseconds;
    print('${"$i bytes".padLeft(15)} ${"${time}ms".padLeft(10)} ${(1000 / 1024 * i / time).toStringAsFixed(1)} KB per second');
  }
}

...I find that on a Pixel 2 XL, 4KB is parsed in about 1ms, 40KB in about 12ms, and 400KB in about 135ms. This is only about 3MB per second. Then again, I'm not sure what would be considered good throughput on this device. Maybe this is as good as we can get?

Presumably this could be fixed in the most radical fashion by supporting UTF-8 as a native data type.

area-vm

All 15 comments

All of the decoding is implemented in Dart so performance suffers in AOT mode.

JIT mode (debug) does around 15-20 MB per second vs 1.7-2 MB per second in AOT mode on a Nexus 5X.

  • [ ] Short term this can be addressed by moving UTF8 decoding into C++, if it is a bottleneck.

We expect things to become better in AOT mode after the Dart 2 transition, as AOT code quality should improve considerably. However, it is not trivial to reach parity with JIT on this benchmark.

LICENSE is a Latin1 file so it essentially tests how fast we can determine that and create a string out of it.

Most important loops for the benchmark are in scanOneByteCharacters, _scanCodeUnits and _createOneByteString.

  • [ ] Note: _scanCodeUnits redoes work that was already done by scanOneByteCharacters, so it is obviously wasteful and needs to be fixed first.

Here is, for example, how scanOneByteCharacters looks:

    int scanOneByteCharacters(List<int> units, int from) {
      final to = endIndex;
      final mask = _ONE_BYTE_LIMIT;
      for (var i = from; i < to; i++) {
        final unit = units[i];
        if ((unit & mask) != unit) return i - from;
      }
      return to - from;
    }

A List<int> units parameter does not really tell the AOT compiler much (even in Dart 2 mode), because there are multiple types implementing lists. The JIT, on the other hand, is capable of establishing that units is a Uint8 view (by observing the type feedback collected at runtime) and of completely inlining units[i], because the call site is monomorphic.

In Dart 2 mode we might be able to establish the same thing using type flow analysis, but I am not very optimistic about that, because it is likely that people invoke UTF8.decode on various List<int> implementations.

It might be that we need to implement something heavier as a real solution - e.g. some mechanism to duplicate / "version" performance sensitive code that manipulates typed lists.
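
For illustration, here is a minimal sketch of what such manual "versioning" could look like for scanOneByteCharacters. The free-standing parameters and helper names are hypothetical; the real decoder keeps endIndex and _ONE_BYTE_LIMIT as instance state.

import 'dart:typed_data';

const int _oneByteLimit = 0x7f; // plays the role of _ONE_BYTE_LIMIT

// Generic copy, kept for arbitrary List<int> implementations.
int _scanGeneric(List<int> units, int from, int to) {
  for (var i = from; i < to; i++) {
    final unit = units[i];
    if ((unit & _oneByteLimit) != unit) return i - from;
  }
  return to - from;
}

// Specialized copy: with a statically known Uint8List receiver the
// compiler can inline units[i] instead of going through a dispatched call.
int _scanUint8(Uint8List units, int from, int to) {
  for (var i = from; i < to; i++) {
    final unit = units[i];
    if ((unit & _oneByteLimit) != unit) return i - from;
  }
  return to - from;
}

int scanOneByteCharacters(List<int> units, int from, int to) =>
    units is Uint8List
        ? _scanUint8(units, from, to)
        : _scanGeneric(units, from, to);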

We have already solved this problem for dart2js/browser.

There is a hook for host-specific acceleration.
The dart2js runtime uses this hook to do decoding via the browser's TextDecoder, which provided a 10x speedup on browsers that support TextDecoder.
VM/Flutter could also use this hook to detect byte arrays and use an alternative implementation.

If you hook in a second implementation, you could use C++, or clone the current implementation and modify it to use the byte-array implementation type. If you do this, I would advise bailing out to the main algorithm to handle error cases, since the Dart runtime is incompatible with the browser implementation in error cases. We want to fix that discrepancy by changing the Dart library to match the browser, as that would make the in-browser conversion more efficient by avoiding a validation scan. If you fall back to the main implementation on edge cases, that will ensure we need to fix edge cases in only one implementation.
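
As a rough, runnable sketch of that "intercept and fall back" shape (the names _tryFastUtf8Decode and decodeUtf8 are hypothetical, and the fast path here is just a pure-ASCII stand-in, not the real _convertIntercepted hook):

import 'dart:convert';
import 'dart:typed_data';

// Hypothetical fast path: only accelerate byte arrays, and bail out to the
// main algorithm for anything that is not plain ASCII, so error handling
// stays defined by the single shared implementation.
String? _tryFastUtf8Decode(List<int> codeUnits) {
  if (codeUnits is! Uint8List) return null;
  for (final unit in codeUnits) {
    if (unit > 0x7f) return null; // multi-byte or malformed: fall back
  }
  return String.fromCharCodes(codeUnits);
}

String decodeUtf8(List<int> codeUnits) =>
    _tryFastUtf8Decode(codeUnits) ?? utf8.decode(codeUnits);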

@lrhn fyi.

Thanks for the tip about _convertIntercepted @rakudrama. We could use it if we decide to do a short term solution for just JSON.decode.

The real problem however is much larger - if the user writes similar code they will see similar slowness (especially comparing JIT vs AOT), so I would like us to think about it in that context too.

> it is likely that people invoke UTF8.decode on various List<int> implementations.

It seems extremely likely to me that any large inputs are going to be from ByteData instances.

FWIW, I would be perfectly fine with specialized ByteData or Uint8List variants of this API (and other similar APIs that take List<int>). In general in Flutter land there are definitely codepaths where we don't even support the non-ByteData case.
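
In the meantime, a caller can already hand the existing utf8.decode a view over the ByteData's buffer instead of copying the bytes out first. A small usage sketch; decodeByteData is just an illustrative helper name:

import 'dart:convert';
import 'dart:typed_data';

// Wrap the ByteData's underlying buffer in a Uint8List view (no copy)
// and decode that directly.
String decodeByteData(ByteData data) {
  final Uint8List bytes =
      data.buffer.asUint8List(data.offsetInBytes, data.lengthInBytes);
  return utf8.decode(bytes);
}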

https://github.com/dart-lang/sdk/commit/12c461b15ae3ad4c503db256cd08f0fe1d66ee34 by @aartbik improved performance of this microbenchmark on 64-bit platforms (4-5x on my measurements). I have not measured ARM32 yet.

Before the change:

I/flutter (22444):    622592 bytes      337ms 1804.2 KB per second

after the change:

I/flutter (22623):    622592 bytes       66ms 9212.1 KB per second

Focusing on just the Dart part with a benchmark that does just

String unpack(Uint8List data) {
  return utf8.decode(data);
}

using a synthetic Uint8List. This still yields a big JIT vs AOT difference:

JIT executed in 0:00:12.020120
AOT executed in 0:01:12.535360

Here is a quick profile:

+   35.62%    35.62%  dart_precompile  libb.so                   [.] Precompiled__Utf8Decoder_8003594_convert_scanOneByteCharacters
+   29.98%    29.98%  dart_precompile  libb.so                   [.] Precompiled__Uint8List_6027147____1007                   
+   22.21%    22.02%  dart_precompile  libb.so                   [.] Precompiled__StringBase_0150898__createOneByteString

Just to make sure we are measuring the same thing, here is a stand-alone toy benchmark that operates on long lists of ASCII 'X' bytes, with JIT/AOT timings below.

JIT: executed in 0:00:02.853522 (verify="XXXXXXXXX")
AOT: executed in 0:00:16.729649 (verify="XXXXXXXXX")

benchmark:

import 'dart:convert';
import 'dart:typed_data';

String unpack(Uint8List data) {
  return utf8.decode(data);
}

main(List<String> args) {
  var data = new Uint8List.fromList(
    new List<int>.generate(1000000, (int i) => 88)
  );
  String x = null;
  Stopwatch stopwatch = new Stopwatch()..start();
  for (int i = 0; i < 1000; i++) {
    x = unpack(data);
  }
  print('executed in ${stopwatch.elapsed} (verify="${x.substring(1,10)}")');
}

@aartbik It might also be good to have a benchmark that measures Uint8List view performance, because when decoding binary formats people pass views instead of copying bytes out.

Sure, I made it "your" format as follows:

main(List<String> args) {
  var data = new Uint8List.fromList(
    new List<int>.generate(1000000, (int i) => 88)
  );
  if (args.contains('--use-view')) {
    data = new Uint8List.view(data.buffer);
  }
  // AS BEFORE
}

which gives me

JIT-direct/view:
executed in 0:00:02.555726 (verify="XXXXXXXXX")
executed in 0:00:05.349252 (verify="XXXXXXXXX")

AOT-direct/view
executed in 0:00:16.817085 (verify="XXXXXXXXX")
executed in 0:00:26.025103 (verify="XXXXXXXXX")

In the long run, something like cloning for type may help here. In the short run, we could get the UTF-8 decoding performance up by specializing a bit more in the library code. The _convertIntercepted() idea is very interesting, but it is perhaps done a bit too high in the calling tree; specialization seems only necessary for the leaf methods where most of the work is done. Still, let me play around with this a bit...

Ah, I thought that using _generics_ would make it extremely easy to specialize library code in a non-intrusive way, but Dart does not follow the C++ approach of generating a different version for each instantiation. Without that sort of generics, the code duplication of such a solution seems to become a bit excessive (any best practices on this?).

@alexmarkov @miyoyo

One of the ideas I entertained is that we can do _a Kernel to Kernel transformation_ based on TFA analysis results and an explicit pragma hint, e.g. in core libraries we do

@pragma('vm.type-specialize')
void convert(List<int> codeUnits, int startIndex, int endIndex) {
  /* body */
}

which would be a hint for a subsequent transformation to look at the TFA analysis results and see which types can reach codeUnits. Then, if for example _List<int> and Uint8List are the reaching types, it can transform the code like this:

@pragma('vm.type-specialize')
void convert(List<int> codeUnits, int startIndex, int endIndex) {
  if (codeUnits is Uint8List) {
    _convert$Uint8List(codeUnits, ...);
  } else if (codeUnits is _List) {
    _convert$_List(codeUnits, ...);
  }
}

void  _convert$Uint8List(Uint8List codeUnits, ...) {
  // copy of body specialized for Uint8List
}

This is one possible approach that works for library code that people are willing to annotate with hints.

Another approach would be to try to change how we do []= and [] in loops. Currently, when we do:

for (var i = from; i < to; i++) {
  final unit = units[i];  // this will be an inline-cache based dispatch (unless the call site is monomorphic)
  if ((unit & mask) != unit) return i - from;
}

maybe the cost of invoking [] can be reduced, e.g. if the lookup of [] can be hoisted out of the loop:

// pseudo-code
final getIndexedPtr = getIndexingStubFor(units);
for (var i = from; i < to; i++) {
  final unit = getIndexedPtr(units, i);  // fast invocation, maybe even without setting up a pool pointer, etc.
  if ((unit & mask) != unit) return i - from;
}

Sidenote: it might be good to check that the rest of the code in hot functions actually looks good in AOT mode (i.e. that [] access is the only slow operation).

@aartbik I was asked to help with some protobuf decoding speed and it seems like we happen to spend significant time in string decoding. Hope it's ok with you if I look into this?

One aspect might be to unify the object layouts of typed data views and non-views. Currently, if code is specialized via if (data is Uint8List) { ... }, the compiler cannot inline accesses to data, since there are two implementations of it, namely _Uint8ArrayList and _Uint8ArrayView, where the views have a different layout than the non-views.

Recently I've worked on unifying the heap object layout of our typed data classes so that our AOT compiler can directly access the bytes of typed data members (e.g. of Uint8List), independent of whether the object is an internal/external typed data array or a view on those. (**)

All of the work combined has led to the following numbers (measured on the benchmark mentioned above in https://github.com/dart-lang/sdk/issues/31954#issuecomment-407503177, on my workstation):

JIT non-view/view
executed in 0:00:03.429451 (verify="XXXXXXXXX")
executed in 0:00:06.607781 (verify="XXXXXXXXX")

AOT (view and non-view roughly the same)
executed in 0:00:05.855320 (verify="XXXXXXXXX")

So the AOT is almost on par with the JIT (for this particular benchmark).

There is still room for improvement and we continue working on it!

(**) Note: Our AOT compiler will only directly access the bytes (e.g. in foo(Uint8List bytes) => bytes[0];) if there are no 3rd-party user classes implementing these interfaces.

The AOT typed data optimizations gave a huge performance gain, though the Dart code is only taking advantage of them for the ASCII case at the moment.

@askeksa-google is now going to extend this fast case to arbitrary Unicode.

We'll try to keep optimizing the Dart implementation, though we may still consider intrinsifying it afterwards.
