Writing this code makes me sad. The VM can do this much faster and easier than I can from inside Dart.
void _copyInt8(ByteData buffer, int offset, Int8List value) {
  final int count = value.length;
  for (int i = 0; i < count; ++i) {
    buffer.setInt8(offset + i, value[i]);
  }
}

void _copyUint8(ByteData buffer, int offset, Uint8List value) {
  final int count = value.length;
  for (int i = 0; i < count; ++i) {
    buffer.setUint8(offset + i, value[i]);
  }
}

void _copyInt16(ByteData buffer, int offset, Int16List value) {
  final int count = value.length;
  const int stride = 2;
  for (int i = 0; i < count; ++i) {
    buffer.setInt16(offset + i * stride, value[i], Endian.little);
  }
}

void _copyUint16(ByteData buffer, int offset, Uint16List value) {
  final int count = value.length;
  const int stride = 2;
  for (int i = 0; i < count; ++i) {
    buffer.setUint16(offset + i * stride, value[i], Endian.little);
  }
}

void _copyInt32(ByteData buffer, int offset, Int32List value) {
  final int count = value.length;
  const int stride = 4;
  for (int i = 0; i < count; ++i) {
    buffer.setInt32(offset + i * stride, value[i], Endian.little);
  }
}

void _copyUint32(ByteData buffer, int offset, Uint32List value) {
  final int count = value.length;
  const int stride = 4;
  for (int i = 0; i < count; ++i) {
    buffer.setUint32(offset + i * stride, value[i], Endian.little);
  }
}

void _copyInt64(ByteData buffer, int offset, Int64List value) {
  final int count = value.length;
  const int stride = 8;
  for (int i = 0; i < count; ++i) {
    buffer.setInt64(offset + i * stride, value[i], Endian.little);
  }
}

void _copyUint64(ByteData buffer, int offset, Uint64List value) {
  final int count = value.length;
  const int stride = 8;
  for (int i = 0; i < count; ++i) {
    buffer.setUint64(offset + i * stride, value[i], Endian.little);
  }
}
/cc @Hixie @zanderso
Unfortunately it isn't as well documented as it should be, but the fromList constructors and the setRange methods on the typed data types are implemented in the VM when the argument types are also typed data. So:
Uint8List original;
...
Uint8List copy = new Uint8List.fromList(original);
// copy.buffer.asByteData() to get a ByteData.
and
Uint8List l1;
Uint8List l2;
...
l1.setRange(start, end, l2);
are implemented in the VM. See https://github.com/dart-lang/sdk/blob/master/runtime/lib/typed_data_patch.dart#L104 and https://github.com/dart-lang/sdk/blob/master/runtime/lib/typed_data.cc#L108
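Spelled out end to end, the fromList path looks like this (a minimal sketch; the copyAsByteData helper name is mine, not from the SDK):

```dart
import 'dart:typed_data';

/// Copies [original] via the VM-intrinsic fromList path and exposes the
/// result as a ByteData. (Helper name is illustrative.)
ByteData copyAsByteData(Uint8List original) {
  final Uint8List copy = Uint8List.fromList(original);
  return copy.buffer.asByteData();
}

void main() {
  final ByteData bytes = copyAsByteData(Uint8List.fromList([1, 2, 3, 4]));
  print(bytes.getUint8(2)); // prints 3
}
```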
Does ByteData support a setRange? I guess I need to create a view of the underlying ByteBuffer for each type that I want to set as a range?
Right. ByteData doesn't have a setRange, but you can, for example, do byteData.buffer.asInt16List(...).setRange(...)
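Concretely, that pattern looks like the sketch below. Two caveats worth noting: the view's byte offset must be a multiple of the element size, and typed-data views read and write in host byte order (the helper name is mine):

```dart
import 'dart:typed_data';

/// Copies [value] into [buffer] at [byteOffset] via a typed-data view,
/// avoiding the per-element loop. [byteOffset] must be 2-byte aligned,
/// and the view writes in host byte order.
void copyInt16ViaView(ByteData buffer, int byteOffset, Int16List value) {
  buffer.buffer
      .asInt16List(buffer.offsetInBytes + byteOffset, value.length)
      .setRange(0, value.length, value);
}

void main() {
  final ByteData buffer = ByteData(16);
  copyInt16ViaView(buffer, 4, Int16List.fromList([1, -2, 3]));
  print(buffer.getInt16(4, Endian.host)); // prints 1
}
```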
Do you have a sense for where the crossover point is in performance? For example, is there some length below which _copyInt16 is faster than byteData.buffer.asInt16List(...).setRange(...)? I guess I can make a microbenchmark and figure that out myself.
I'm not sure. The answer might also be different between AOT and JIT. Maybe @mraleph can help.
The context of this question is the Dart FIDL2 encoder for Fuchsia:
https://fuchsia-review.googlesource.com/c/topaz/+/120996
I don't have any evidence that this code is a measurable part of the profile. I just felt silly writing these memcpy equivalents.
I ran some microbenchmarks on my x64 desktop:
https://gist.github.com/zanderso/51fcfd0a797f5d486200047a7ece92b7
50 elements seems like a reasonable cross-over point, except for Int64List where it looks like there is a bug:
CopyInt8Benchmark loop 50(RunTime): 1.4557483528209487 us.
CopyInt8Benchmark setRange 50(RunTime): 1.4949139273621506 us.
CopyInt16Benchmark loop 50(RunTime): 2.267146794201098 us.
CopyInt16Benchmark setRange 50(RunTime): 1.609126353375262 us.
CopyInt32Benchmark loop 50(RunTime): 2.0314798924128294 us.
CopyInt32Benchmark setRange 50(RunTime): 1.6735112547726414 us.
CopyInt64Benchmark loop 50(RunTime): 36.02046007132051 us.
CopyInt64Benchmark setRange 50(RunTime): 1.746860488841064 us.
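For reference, the general shape of such a comparison can be sketched with a plain Stopwatch (this is not the benchmark_harness code from the gist; function names are illustrative):

```dart
import 'dart:typed_data';

// Manual per-element copy, as in the original helpers.
void copyLoop(ByteData buffer, int offset, Int16List value) {
  for (int i = 0; i < value.length; ++i) {
    buffer.setInt16(offset + i * 2, value[i], Endian.little);
  }
}

// View + setRange copy; writes in host byte order.
void copySetRange(ByteData buffer, int offset, Int16List value) {
  buffer.buffer
      .asInt16List(buffer.offsetInBytes + offset, value.length)
      .setRange(0, value.length, value);
}

void main() {
  const int count = 50;
  const int iterations = 100000;
  final ByteData buffer = ByteData(count * 2);
  final Int16List value =
      Int16List.fromList(List<int>.generate(count, (i) => i - 25));

  final variants = <String, void Function(ByteData, int, Int16List)>{
    'loop': copyLoop,
    'setRange': copySetRange,
  };
  variants.forEach((String name, void Function(ByteData, int, Int16List) copy) {
    final Stopwatch sw = Stopwatch()..start();
    for (int i = 0; i < iterations; ++i) {
      copy(buffer, 0, value);
    }
    print('$name: ${sw.elapsedMicroseconds / iterations} us per copy');
  });
}
```

Timing a single call is dominated by noise, so each variant is run in a tight loop and the average is reported; absolute numbers will differ from the gist's harness-based figures.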
/cc @alexmarkov
We don't have an inline version of the setInt64 operation - we recognize it, but we don't have any special handling for it for some reason (probably we simply forgot about it - our support for unboxed 64-bit integers is pretty sketchy and definitely something we have on our radar to fix soon).
In general I would recommend using setRange(...) whenever possible. I think it will be easier for us to recognize it and essentially short-circuit it to memcpy where applicable.
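That suggests a workaround for the slow Int64 case above: route the copy through an Int64List view and setRange instead of per-element setInt64 calls. A sketch (helper name is mine; VM only, since Int64List is unsupported on the web):

```dart
import 'dart:typed_data';

/// Copies [value] into [buffer] at [byteOffset] without hitting the
/// unoptimized setInt64 path. [byteOffset] must be 8-byte aligned,
/// and the view writes in host byte order.
void copyInt64ViaSetRange(ByteData buffer, int byteOffset, Int64List value) {
  buffer.buffer
      .asInt64List(buffer.offsetInBytes + byteOffset, value.length)
      .setRange(0, value.length, value);
}

void main() {
  final ByteData buffer = ByteData(32);
  copyInt64ViaSetRange(buffer, 8, Int64List.fromList([1, -1]));
  print(buffer.getInt64(16, Endian.host)); // prints -1
}
```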
After working on optimizing a computation-bound problem today (a small network protocol benchmark over a loopback socket, which hits 200 MB/s if reading is stubbed out), I found that setRange is very slow in many cases. Replacing setRange calls with a for loop greatly increases performance, since in my benchmark only small amounts of data are copied into staging buffers; in one case I gained around 30 MB/s by removing a single setRange call.
I've also found that array views are quite slow, but that's being handled by #35154.