Writing this code makes me sad. The VM can do this much faster and easier than I can from inside Dart.
void _copyInt8(ByteData buffer, int offset, Int8List value) {
  final int count = value.length;
  for (int i = 0; i < count; ++i) {
    buffer.setInt8(offset + i, value[i]);
  }
}

void _copyUint8(ByteData buffer, int offset, Uint8List value) {
  final int count = value.length;
  for (int i = 0; i < count; ++i) {
    buffer.setUint8(offset + i, value[i]);
  }
}

void _copyInt16(ByteData buffer, int offset, Int16List value) {
  final int count = value.length;
  const int stride = 2;
  for (int i = 0; i < count; ++i) {
    buffer.setInt16(offset + i * stride, value[i], Endian.little);
  }
}

void _copyUint16(ByteData buffer, int offset, Uint16List value) {
  final int count = value.length;
  const int stride = 2;
  for (int i = 0; i < count; ++i) {
    buffer.setUint16(offset + i * stride, value[i], Endian.little);
  }
}

void _copyInt32(ByteData buffer, int offset, Int32List value) {
  final int count = value.length;
  const int stride = 4;
  for (int i = 0; i < count; ++i) {
    buffer.setInt32(offset + i * stride, value[i], Endian.little);
  }
}

void _copyUint32(ByteData buffer, int offset, Uint32List value) {
  final int count = value.length;
  const int stride = 4;
  for (int i = 0; i < count; ++i) {
    buffer.setUint32(offset + i * stride, value[i], Endian.little);
  }
}

void _copyInt64(ByteData buffer, int offset, Int64List value) {
  final int count = value.length;
  const int stride = 8;
  for (int i = 0; i < count; ++i) {
    buffer.setInt64(offset + i * stride, value[i], Endian.little);
  }
}

void _copyUint64(ByteData buffer, int offset, Uint64List value) {
  final int count = value.length;
  const int stride = 8;
  for (int i = 0; i < count; ++i) {
    buffer.setUint64(offset + i * stride, value[i], Endian.little);
  }
}
/cc @Hixie @zanderso
Unfortunately it isn't as well documented as it should be, but the fromList constructors and the setRange methods on the typed data types are implemented in the VM when the argument types are also typed data. So:
Uint8List original;
...
Uint8List copy = new Uint8List.fromList(original);
// copy.buffer.asByteData() to get a ByteData.
and
Uint8List l1;
Uint8List l2;
...
l1.setRange(start, end, l2);
are implemented in the VM. See https://github.com/dart-lang/sdk/blob/master/runtime/lib/typed_data_patch.dart#L104 and https://github.com/dart-lang/sdk/blob/master/runtime/lib/typed_data.cc#L108
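Spelled out end to end, the fromList path looks like this (a minimal sketch; the copyAsByteData helper name is mine, not from the SDK):

```dart
import 'dart:typed_data';

/// Copies [original] via the VM-intrinsic fromList path and exposes the
/// result as a ByteData. (Helper name is illustrative.)
ByteData copyAsByteData(Uint8List original) {
  final Uint8List copy = Uint8List.fromList(original);
  return copy.buffer.asByteData();
}

void main() {
  final ByteData bytes = copyAsByteData(Uint8List.fromList([1, 2, 3, 4]));
  print(bytes.getUint8(2)); // prints 3
}
```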
Does ByteData support a setRange? I guess I need to create a view of the underlying ByteBuffer for each type that I want to set as a range?
Right. ByteData doesn't have a setRange, but you can, for example, do byteData.buffer.asInt16List(...).setRange(...)
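Concretely, that pattern looks like the sketch below. Two caveats worth noting: the view's byte offset must be a multiple of the element size, and typed-data views read and write in host byte order (the helper name is mine):

```dart
import 'dart:typed_data';

/// Copies [value] into [buffer] at [byteOffset] via a typed-data view,
/// avoiding the per-element loop. [byteOffset] must be 2-byte aligned,
/// and the view writes in host byte order.
void copyInt16ViaView(ByteData buffer, int byteOffset, Int16List value) {
  buffer.buffer
      .asInt16List(buffer.offsetInBytes + byteOffset, value.length)
      .setRange(0, value.length, value);
}

void main() {
  final ByteData buffer = ByteData(16);
  copyInt16ViaView(buffer, 4, Int16List.fromList([1, -2, 3]));
  print(buffer.getInt16(4, Endian.host)); // prints 1
}
```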
Do you have a sense for where the crossover point is in performance? For example, is there some length below which _copyInt16 is faster than byteData.buffer.asInt16List(...).setRange(...)? I guess I can make a microbenchmark and figure that out myself.
I'm not sure. The answer might also be different between AOT and JIT. Maybe @mraleph can help.
The context of this question is the Dart FIDL2 encoder for Fuchsia:
https://fuchsia-review.googlesource.com/c/topaz/+/120996
I don't have any evidence that this code is a measurable part of the profile. I just felt silly writing these memcpy equivalents.
I ran some microbenchmarks on my x64 desktop:
https://gist.github.com/zanderso/51fcfd0a797f5d486200047a7ece92b7
50 elements seems like a reasonable cross-over point, except for Int64List where it looks like there is a bug:
CopyInt8Benchmark loop 50(RunTime): 1.4557483528209487 us.
CopyInt8Benchmark setRange 50(RunTime): 1.4949139273621506 us.
CopyInt16Benchmark loop 50(RunTime): 2.267146794201098 us.
CopyInt16Benchmark setRange 50(RunTime): 1.609126353375262 us.
CopyInt32Benchmark loop 50(RunTime): 2.0314798924128294 us.
CopyInt32Benchmark setRange 50(RunTime): 1.6735112547726414 us.
CopyInt64Benchmark loop 50(RunTime): 36.02046007132051 us.
CopyInt64Benchmark setRange 50(RunTime): 1.746860488841064 us.
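For reference, the general shape of such a comparison can be sketched with a plain Stopwatch (this is not the benchmark_harness code from the gist; function names are illustrative):

```dart
import 'dart:typed_data';

// Manual per-element copy, as in the original helpers.
void copyLoop(ByteData buffer, int offset, Int16List value) {
  for (int i = 0; i < value.length; ++i) {
    buffer.setInt16(offset + i * 2, value[i], Endian.little);
  }
}

// View + setRange copy; writes in host byte order.
void copySetRange(ByteData buffer, int offset, Int16List value) {
  buffer.buffer
      .asInt16List(buffer.offsetInBytes + offset, value.length)
      .setRange(0, value.length, value);
}

void main() {
  const int count = 50;
  const int iterations = 100000;
  final ByteData buffer = ByteData(count * 2);
  final Int16List value =
      Int16List.fromList(List<int>.generate(count, (i) => i - 25));

  final variants = <String, void Function(ByteData, int, Int16List)>{
    'loop': copyLoop,
    'setRange': copySetRange,
  };
  variants.forEach((String name, void Function(ByteData, int, Int16List) copy) {
    final Stopwatch sw = Stopwatch()..start();
    for (int i = 0; i < iterations; ++i) {
      copy(buffer, 0, value);
    }
    print('$name: ${sw.elapsedMicroseconds / iterations} us per copy');
  });
}
```

Timing a single call is dominated by noise, so each variant is run in a tight loop and the average is reported; absolute numbers will differ from the gist's harness-based figures.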
/cc @alexmarkov
We don't have an inline version of the setInt64 operation - we recognize it, but we don't have any special handling for it for some reason (probably we simply forgot about it - our support for unboxed 64-bit integers is pretty sketchy and definitely something we have on our radar to fix soon).
In general I would recommend using setRange(...) whenever possible. I think it will be easier for us to recognize it and essentially short-circuit it to memcpy where applicable.
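That suggests a workaround for the slow Int64 case above: route the copy through an Int64List view and setRange instead of per-element setInt64 calls. A sketch (helper name is mine; VM only, since Int64List is unsupported on the web):

```dart
import 'dart:typed_data';

/// Copies [value] into [buffer] at [byteOffset] without hitting the
/// unoptimized setInt64 path. [byteOffset] must be 8-byte aligned,
/// and the view writes in host byte order.
void copyInt64ViaSetRange(ByteData buffer, int byteOffset, Int64List value) {
  buffer.buffer
      .asInt64List(buffer.offsetInBytes + byteOffset, value.length)
      .setRange(0, value.length, value);
}

void main() {
  final ByteData buffer = ByteData(32);
  copyInt64ViaSetRange(buffer, 8, Int64List.fromList([1, -1]));
  print(buffer.getInt64(16, Endian.host)); // prints -1
}
```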
After working on optimizing a computation-bound problem today (a small network protocol benchmark over a loopback socket, which hits 200 MB/s if reading is stubbed out), I found that setRange is very slow in many cases. Replacing setRange calls with a for loop greatly increases performance, since in my benchmark only small amounts of data are copied into staging buffers; in one case I gained around 30 MB/s by removing a single setRange call.
I've also found that array views are quite slow, but that's being handled by #35154.