I hope this is the right place to ask the question and that this hasn't already been discussed to death elsewhere. Feel free to direct me elsewhere or close the issue if I've missed something.
In short, I am one of the maintainers of exceljs (which is basically a transform stream taking in a read stream, unzipping its contents, running an XML parser on the unzipped chunks and then emitting back excel-related events) and we're in the process of adding support for async iteration via for await of (https://github.com/exceljs/exceljs/pull/1135).
In doing that, we've noticed that for await of is significantly slower than the current .on('data',..) based approach. Our benchmark is not a microbenchmark, but a full end-to-end benchmark incl. creating and analyzing excel objects in memory (https://github.com/exceljs/exceljs/pull/1139). Switching to for await of (vs. handling the events in sync callbacks) decreased performance by around 60%.
I have debugged this issue (https://github.com/lddubeau/saxes/issues/32) and in short, it arises because for every chunk/event passed into our transform stream, we emit an order of magnitude more chunks/events. What's causing the performance gap is that the callback code runs through these emitted chunks/events mostly synchronously, whereas the current implementation of Symbol.asyncIterator on Readable calls setImmediate between each event, which is quite expensive. I wrote a simple microbenchmark to compare for of against for await of on the same array or iterator, and the difference is around 10x.
So we've come up with a 'hack': instead of emitting one-by-one all of the chunks/events that our transform produces, we now gather them up in an array and emit that array once. Phrased another way, instead of calling this.push() for every excel-related event that we produce, we call this.events.push() many times for each chunk written into our stream (where this.events is just an array initialized in the constructor) and then finally this.push(this.events) once we're done consuming the chunk (at which point we also reset this.events to an empty array). Clever, but now consuming the stream is ugly. Instead of a single loop, we now write:
// We'd like to write this, but it's slow
for await (const chunk of readable) { ... }
// This is within 2% of the runtime of the callback based approach, but not very ergonomic
for await (const chunks of readable) {
for (const chunk of chunks) { ... }
}
I think this performance issue will bite a lot of people because it's so easy to fall into and, at least to me, it came as a surprise. I remember reading that readline has similar performance issues (and, similarly to the above, it produces a lot more events than it takes in) and it would probably also see performance improvements from the above approach.
My question boils down to this: Is there a fundamental reason in the spec around async iteration or streams that we have to go through setImmediate even when the read buffer still has items in it (i.e., couldn't we call .next() synchronously)? Is it something that V8 can/will eventually optimize? If no to both questions, what should library authors do to give users all the advantages of async iteration without sacrificing performance?
Roping in @BridgeAR as a fellow performance nerd and the only one I know here ;)
I'm sure there's some clever name for this like 'async iteration performance impacted by transform stream event amplification' ;)
@alubbe are you saying that it calls emit('data') synchronously? If yes, nothing using promises will ever be faster.
That being said, can you point to where the async iterator implementation is calling setImmediate? I can't find such a thing.
@nodejs/streams
Note that you are adding an additional PassThrough in https://github.com/exceljs/exceljs/pull/1135/files#diff-a7c32d962b3105f42a2ae84812252274R60.
As a rule of thumb, each new transform that you add cuts throughput by ~30-50%.
I'm not sure it's all due to that, but that would be my first starting point for improving the perf there.
From my tests, which consisted of simply reading from a file, the async iterators had similar performance, but there might be something that I overlooked.
Can you produce a couple of self-runnable examples that just use Node core? We'll look into them to see if there are ways in which we can optimize things.
The PassThrough is used in both versions to force unzipper to support async iteration (it's a temporary hack, see https://github.com/ZJONSSON/node-unzipper/issues/178), so it doesn't affect the numbers I mentioned.
Thanks for all of your input so far, I will try to build an example around readline as soon as I can.
And here it is: https://github.com/alubbe/readline-benchmark
The repo benchmarks 3 ways of using readline. Firstly, using the stream interface:
await runProfiling('readline stream interface', () => new Promise((resolve, reject) => {
const rl = readline.createInterface({
input: fs.createReadStream('big.txt'),
});
let i = 0;
rl.on('line', (line) => {
i += 1;
});
rl.on('error', reject);
rl.on('close', () => {
console.log(`Read ${i} lines`);
resolve();
});
}));
On my mac (node 13.9.0), this takes between 32 and 35 ms and uses around 3 MB of memory.
Then, it benchmarks async iteration as recommended in the docs:
await runProfiling('readline async iteration', async () => {
const rl = readline.createInterface({
input: fs.createReadStream('big.txt'),
});
let i = 0;
for await (const line of rl) {
i += 1;
}
console.log(`Read ${i} lines`);
});
On my mac (node 13.9.0), this takes around 50 ms and uses around 9 - 10 MB of memory.
Lastly, I modified two lines in the prototype of readline's Interface to enable combining async iteration with sync iteration:
await runProfiling('readline async iteration via array of lines', async () => {
const rl = readline.createInterface({
input: fs.createReadStream('big.txt'),
});
let i = 0;
for await (const lines of rl) {
for (const line of lines) {
i += 1;
}
}
console.log(`Read ${i} lines`);
});
On my mac (node 13.9.0), this takes between 30 and 34 ms and uses around 3 MB of memory.
Here's the full output:
####################################################
WARMUP: Current memory usage: 1.93 MB
WARMUP: readline stream interface profiling started
Read 128457 lines
WARMUP: readline stream interface profiling finished in 73ms
WARMUP: Current memory usage (before GC): 3.41 MB
WARMUP: Current memory usage (after GC): 1.98 MB
####################################################
RUN 1: readline stream interface profiling started
Read 128457 lines
RUN 1: readline stream interface profiling finished in 35ms
RUN 1: Current memory usage (before GC): 3.6 MB
RUN 1: Current memory usage (after GC): 2.24 MB
####################################################
RUN 2: readline stream interface profiling started
Read 128457 lines
RUN 2: readline stream interface profiling finished in 32ms
RUN 2: Current memory usage (before GC): 3.04 MB
RUN 2: Current memory usage (after GC): 2.24 MB
####################################################
RUN 3: readline stream interface profiling started
Read 128457 lines
RUN 3: readline stream interface profiling finished in 32ms
RUN 3: Current memory usage (before GC): 3.22 MB
RUN 3: Current memory usage (after GC): 2.25 MB
####################################################
WARMUP: Current memory usage: 2.24 MB
WARMUP: readline async iteration profiling started
Read 128457 lines
WARMUP: readline async iteration profiling finished in 67ms
WARMUP: Current memory usage (before GC): 3 MB
WARMUP: Current memory usage (after GC): 2.31 MB
####################################################
RUN 1: readline async iteration profiling started
Read 128457 lines
RUN 1: readline async iteration profiling finished in 68ms
RUN 1: Current memory usage (before GC): 10.17 MB
RUN 1: Current memory usage (after GC): 2.36 MB
####################################################
RUN 2: readline async iteration profiling started
Read 128457 lines
RUN 2: readline async iteration profiling finished in 49ms
RUN 2: Current memory usage (before GC): 9.21 MB
RUN 2: Current memory usage (after GC): 2.36 MB
####################################################
RUN 3: readline async iteration profiling started
Read 128457 lines
RUN 3: readline async iteration profiling finished in 48ms
RUN 3: Current memory usage (before GC): 8.86 MB
RUN 3: Current memory usage (after GC): 2.34 MB
####################################################
WARMUP: Current memory usage: 2.31 MB
WARMUP: readline async iteration via array of lines profiling started
Read 128457 lines
WARMUP: readline async iteration via array of lines profiling finished in 31ms
WARMUP: Current memory usage (before GC): 3.16 MB
WARMUP: Current memory usage (after GC): 2.31 MB
####################################################
RUN 1: readline async iteration via array of lines profiling started
Read 128457 lines
RUN 1: readline async iteration via array of lines profiling finished in 31ms
RUN 1: Current memory usage (before GC): 3.12 MB
RUN 1: Current memory usage (after GC): 2.32 MB
####################################################
RUN 2: readline async iteration via array of lines profiling started
Read 128457 lines
RUN 2: readline async iteration via array of lines profiling finished in 34ms
RUN 2: Current memory usage (before GC): 3.03 MB
RUN 2: Current memory usage (after GC): 2.33 MB
####################################################
RUN 3: readline async iteration via array of lines profiling started
Read 128457 lines
RUN 3: readline async iteration via array of lines profiling finished in 30ms
RUN 3: Current memory usage (before GC): 3.61 MB
RUN 3: Current memory usage (after GC): 2.33 MB
This reproduces my exceljs issue quite neatly: the event amplification causes a performance dip, both in CPU and memory. Is there anything library authors, Node core or V8 can do, or should we recommend this for await of + for of combination for performance-sensitive/low-level libraries? I remember reading that the async/await spec allows for optimizations around Promises that resolve synchronously or have already resolved, but I have no idea if that could apply here.
readline async iteration has known performance issues (as it adds an internal transform stream) and is therefore not really representative, as we would focus on removing that Transform instead.
Yes, but benchmark 3 uses that internal transform and is as fast as benchmark 1, which does not, right? If you'd like me to try this out with something else, let me know what I should benchmark.
Also, benchmarks 2 and 3 both use the internal transform but there's a big performance difference between them, so I think this is a discussion independent of removing the transform (which of course is a great idea!).
I could also write a naive readline implementation as a transform stream from scratch (just to focus on the amplification of events) and benchmark that. Would that help you, since it wouldn't use code that is not yet fully optimized?
That would help, yes.
(Btw did you mean this extra readable stream here when you said transform https://github.com/nodejs/node/blob/efec6811b667b6cf362d648bc599b667eebffce0/lib/readline.js#L1073-L1104 ?)
I've created two naive readline implementations as transform streams (where lineEnding is the regex /\r?\n|\r(?!\n)/):
class NaiveReadline extends Stream.Transform {
constructor() {
super({readableObjectMode: true, writableObjectMode: true});
this.buffer = '';
}
_transform(chunk, _encoding, callback) {
let pausing = false;
chunk = this.buffer + chunk;
const lines = chunk.split(lineEnding);
this.buffer = lines.pop();
for (const line of lines) {
if (!this.push(line) && !pausing) {
pausing = true;
this.pause();
}
}
if (!pausing) return callback();
this.once('readable', callback);
}
};
class NaiveReadline2 extends Stream.Transform {
constructor() {
super({readableObjectMode: true, writableObjectMode: true});
this.buffer = '';
}
_transform(chunk, _encoding, callback) {
chunk = this.buffer + chunk;
const lines = chunk.split(lineEnding);
// note: this.buffer is never updated here, so lines spanning chunk
// boundaries get split (hence the slightly different line count below)
callback(null, lines);
}
};
I hope they are correct enough for this benchmark (obviously a ton of error handling and event listener clean up is missing).
In any case, let's consume it via stream events:
await runProfiling('naive readline stream interface', () => new Promise((resolve, reject) => {
const rl = new NaiveReadline();
let i = 0;
rl.on('data', (line) => {
i += 1;
});
rl.on('error', reject);
rl.on('end', () => {
console.log(`Read ${i} lines`);
resolve();
});
fs.createReadStream('big.txt').pipe(rl);
}));
On my mac (node 13.9.0), this takes between 32 and 35 ms and uses around 3 MB of memory (same as node's readline).
Let's benchmark async iteration:
await runProfiling('naive readline async iteration', async () => {
let i = 0;
for await (const line of fs.createReadStream('big.txt').pipe(new NaiveReadline())) {
i += 1;
}
console.log(`Read ${i} lines`);
});
On my mac (node 13.9.0), this takes around 50 ms and uses around 4 - 10 MB of memory. Again, similar to node's readline.
Lastly, let's test NaiveReadline2 with its simple 1 event -> 1 array of events setup:
await runProfiling('naive readline2 async iteration via array of lines', async () => {
let i = 0;
for await (const lines of fs.createReadStream('big.txt').pipe(new NaiveReadline2())) {
for (const line of lines) {
i += 1;
}
}
console.log(`Read ${i} lines`);
});
On my mac (node 13.9.0), this takes around 30ms and uses around 3 MB of memory - again highlighting the original issue of event amplification slowing down for await of.
Here's the full output:
####################################################
WARMUP: Current memory usage: 1.91 MB
WARMUP: naive readline stream interface profiling started
Read 128457 lines
WARMUP: naive readline stream interface profiling finished in 47ms
WARMUP: Current memory usage (before GC): 3.93 MB
WARMUP: Current memory usage (after GC): 2 MB
####################################################
RUN 1: naive readline stream interface profiling started
Read 128457 lines
RUN 1: naive readline stream interface profiling finished in 40ms
RUN 1: Current memory usage (before GC): 4.11 MB
RUN 1: Current memory usage (after GC): 2.01 MB
####################################################
RUN 2: naive readline stream interface profiling started
Read 128457 lines
RUN 2: naive readline stream interface profiling finished in 33ms
RUN 2: Current memory usage (before GC): 2.27 MB
RUN 2: Current memory usage (after GC): 2.01 MB
####################################################
RUN 3: naive readline stream interface profiling started
Read 128457 lines
RUN 3: naive readline stream interface profiling finished in 33ms
RUN 3: Current memory usage (before GC): 2.86 MB
RUN 3: Current memory usage (after GC): 2.02 MB
####################################################
WARMUP: Current memory usage: 2.02 MB
WARMUP: naive readline async iteration profiling started
Read 128457 lines
WARMUP: naive readline async iteration profiling finished in 69ms
WARMUP: Current memory usage (before GC): 4.45 MB
WARMUP: Current memory usage (after GC): 2.08 MB
####################################################
RUN 1: naive readline async iteration profiling started
Read 128457 lines
RUN 1: naive readline async iteration profiling finished in 61ms
RUN 1: Current memory usage (before GC): 3.37 MB
RUN 1: Current memory usage (after GC): 2.09 MB
####################################################
RUN 2: naive readline async iteration profiling started
Read 128457 lines
RUN 2: naive readline async iteration profiling finished in 48ms
RUN 2: Current memory usage (before GC): 9.98 MB
RUN 2: Current memory usage (after GC): 2.1 MB
####################################################
RUN 3: naive readline async iteration profiling started
Read 128457 lines
RUN 3: naive readline async iteration profiling finished in 58ms
RUN 3: Current memory usage (before GC): 9.62 MB
RUN 3: Current memory usage (after GC): 2.13 MB
####################################################
WARMUP: Current memory usage: 2.09 MB
WARMUP: naive readline2 async iteration via array of lines profiling started
Read 128557 lines
WARMUP: naive readline2 async iteration via array of lines profiling finished in 32ms
WARMUP: Current memory usage (before GC): 2.81 MB
WARMUP: Current memory usage (after GC): 2.07 MB
####################################################
RUN 1: naive readline2 async iteration via array of lines profiling started
Read 128557 lines
RUN 1: naive readline2 async iteration via array of lines profiling finished in 29ms
RUN 1: Current memory usage (before GC): 2.83 MB
RUN 1: Current memory usage (after GC): 2.07 MB
####################################################
RUN 2: naive readline2 async iteration via array of lines profiling started
Read 128557 lines
RUN 2: naive readline2 async iteration via array of lines profiling finished in 30ms
RUN 2: Current memory usage (before GC): 3.15 MB
RUN 2: Current memory usage (after GC): 2.07 MB
####################################################
RUN 3: naive readline2 async iteration via array of lines profiling started
Read 128557 lines
RUN 3: naive readline2 async iteration via array of lines profiling finished in 31ms
RUN 3: Current memory usage (before GC): 3.33 MB
RUN 3: Current memory usage (after GC): 2.09 MB
Let me know if this is more helpful or if there are any other checks you'd like me to run.
(Btw did you mean this extra readable stream here when you said transform
Yes.
Some notes:
1. NaiveReadline2 is not quite equivalent, as it never updates this.buffer.
2. pause() should really not be used in that way from within _transform.
3. One thing you might want to try is to ditch the Transform stream completely, async iterate the fs.createReadStream() directly, and do the event amplification there. This would give us a hint whether it's our implementation or an inherent cost of the mechanism.
Regarding 2., I actually first wrote a version with pause() and the numbers came out the same.
Regarding your suggestion, could you write some pseudo code of what you have in mind? Are you thinking of an async iterator that yields each produced line vs. one that yields an array of lines?
Something like:
for await (let chunk of fs.createReadStream(file)) {
const lines = ...
for (let line of lines) {
yield line
}
}
I've finally found some time to revisit this, and in short, the above is indeed a lot slower than yielding the lines array and letting the consumer iterate over it. I strongly suspect this is because for await of does not/cannot run synchronously even when it already has the chunks available. Any ideas on how to achieve fast performance without yielding arrays?
Here is what I benchmarked (all files are in the repo):
const fs = require('fs');
const NaiveReadline = require('./NaiveReadline');
const NaiveReadline2 = require('./NaiveReadline2');
const runProfiling = require('./runProfiling');
const lineEnding = /\r?\n|\r(?!\n)/;
(async () => {
await runProfiling('naive readline stream interface', () => new Promise((resolve, reject) => {
const rl = new NaiveReadline();
let i = 0;
rl.on('data', (line) => {
i += 1;
});
rl.on('error', reject);
rl.on('end', () => {
console.log(`Read ${i} lines`);
resolve();
});
fs.createReadStream('big.txt').pipe(rl);
}));
await runProfiling('naive readline async iteration', async () => {
async function* readline(stream) {
let buffer = '';
for await (let chunk of stream) {
chunk = buffer + chunk;
const lines = chunk.split(lineEnding);
buffer = lines.pop();
for (const line of lines) {
yield line
}
}
}
let i = 0;
for await (const line of readline(fs.createReadStream('big.txt'))) {
i += 1;
}
console.log(`Read ${i} lines`);
});
await runProfiling('naive readline2 async iteration via array of lines', async () => {
async function* readline2(stream) {
let buffer = '';
for await (let chunk of stream) {
chunk = buffer + chunk;
const lines = chunk.split(lineEnding);
buffer = lines.pop();
yield lines;
}
}
let i = 0;
for await (const lines of readline2(fs.createReadStream('big.txt'))) {
for (const line of lines) {
i += 1;
}
}
console.log(`Read ${i} lines`);
});
await runProfiling('naive readline3 async iteration via an iterator', async () => {
async function readline3(stream, iterator) {
let buffer = '';
for await (let chunk of stream) {
chunk = buffer + chunk;
const lines = chunk.split(lineEnding);
buffer = lines.pop();
for (const line of lines) {
iterator(line)
}
}
}
let i = 0;
await readline3(fs.createReadStream('big.txt'), line => {
i += 1;
})
console.log(`Read ${i} lines`);
});
})();
And here are the benchmarking results:
####################################################
WARMUP: Current memory usage: 1.91 MB
WARMUP: naive readline stream interface profiling started
Read 128457 lines
WARMUP: naive readline stream interface profiling finished in 94ms
WARMUP: Current memory usage (before GC): 2.87 MB
WARMUP: Current memory usage (after GC): 1.99 MB
####################################################
RUN 1: naive readline stream interface profiling started
Read 128457 lines
RUN 1: naive readline stream interface profiling finished in 42ms
RUN 1: Current memory usage (before GC): 5.93 MB
RUN 1: Current memory usage (after GC): 2.02 MB
####################################################
RUN 2: naive readline stream interface profiling started
Read 128457 lines
RUN 2: naive readline stream interface profiling finished in 37ms
RUN 2: Current memory usage (before GC): 2.27 MB
RUN 2: Current memory usage (after GC): 2.26 MB
####################################################
RUN 3: naive readline stream interface profiling started
Read 128457 lines
RUN 3: naive readline stream interface profiling finished in 34ms
RUN 3: Current memory usage (before GC): 2.61 MB
RUN 3: Current memory usage (after GC): 2.27 MB
####################################################
WARMUP: Current memory usage: 2.27 MB
WARMUP: naive readline async iteration profiling started
Read 128457 lines
WARMUP: naive readline async iteration profiling finished in 78ms
WARMUP: Current memory usage (before GC): 2.71 MB
WARMUP: Current memory usage (after GC): 2.27 MB
####################################################
RUN 1: naive readline async iteration profiling started
Read 128457 lines
RUN 1: naive readline async iteration profiling finished in 82ms
RUN 1: Current memory usage (before GC): 4.98 MB
RUN 1: Current memory usage (after GC): 2.3 MB
####################################################
RUN 2: naive readline async iteration profiling started
Read 128457 lines
RUN 2: naive readline async iteration profiling finished in 62ms
RUN 2: Current memory usage (before GC): 5.49 MB
RUN 2: Current memory usage (after GC): 2.31 MB
####################################################
RUN 3: naive readline async iteration profiling started
Read 128457 lines
RUN 3: naive readline async iteration profiling finished in 60ms
RUN 3: Current memory usage (before GC): 5.32 MB
RUN 3: Current memory usage (after GC): 2.3 MB
####################################################
WARMUP: Current memory usage: 2.29 MB
WARMUP: naive readline2 async iteration via array of lines profiling started
Read 128457 lines
WARMUP: naive readline2 async iteration via array of lines profiling finished in 27ms
WARMUP: Current memory usage (before GC): 3.43 MB
WARMUP: Current memory usage (after GC): 2.33 MB
####################################################
RUN 1: naive readline2 async iteration via array of lines profiling started
Read 128457 lines
RUN 1: naive readline2 async iteration via array of lines profiling finished in 27ms
RUN 1: Current memory usage (before GC): 3.61 MB
RUN 1: Current memory usage (after GC): 2.31 MB
####################################################
RUN 2: naive readline2 async iteration via array of lines profiling started
Read 128457 lines
RUN 2: naive readline2 async iteration via array of lines profiling finished in 46ms
RUN 2: Current memory usage (before GC): 3.27 MB
RUN 2: Current memory usage (after GC): 2.32 MB
####################################################
RUN 3: naive readline2 async iteration via array of lines profiling started
Read 128457 lines
RUN 3: naive readline2 async iteration via array of lines profiling finished in 24ms
RUN 3: Current memory usage (before GC): 3.6 MB
RUN 3: Current memory usage (after GC): 2.37 MB
####################################################
WARMUP: Current memory usage: 2.32 MB
WARMUP: naive readline3 async iteration via an iterator profiling started
Read 128457 lines
WARMUP: naive readline3 async iteration via an iterator profiling finished in 26ms
WARMUP: Current memory usage (before GC): 10.84 MB
WARMUP: Current memory usage (after GC): 2.35 MB
####################################################
RUN 1: naive readline3 async iteration via an iterator profiling started
Read 128457 lines
RUN 1: naive readline3 async iteration via an iterator profiling finished in 25ms
RUN 1: Current memory usage (before GC): 10.48 MB
RUN 1: Current memory usage (after GC): 2.36 MB
####################################################
RUN 2: naive readline3 async iteration via an iterator profiling started
Read 128457 lines
RUN 2: naive readline3 async iteration via an iterator profiling finished in 24ms
RUN 2: Current memory usage (before GC): 10.98 MB
RUN 2: Current memory usage (after GC): 2.36 MB
####################################################
RUN 3: naive readline3 async iteration via an iterator profiling started
Read 128457 lines
RUN 3: naive readline3 async iteration via an iterator profiling finished in 34ms
RUN 3: Current memory usage (before GC): 10.98 MB
RUN 3: Current memory usage (after GC): 2.37 MB
Have you tried to use yield* lines? That should transfer control to the iterator of lines.
Overall you are correct: for await is always asynchronous for each chunk (one microtick) and it seems that causes significant issues in your specific case.
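Applied to the earlier naive readline generator, that suggestion would look roughly like this sketch:

```javascript
const lineEnding = /\r?\n|\r(?!\n)/;

async function* readlineDelegating(stream) {
  let buffer = '';
  for await (let chunk of stream) {
    chunk = buffer + chunk;
    const lines = chunk.split(lineEnding);
    buffer = lines.pop();
    yield* lines; // delegate to the array's iterator, one line at a time
  }
  if (buffer) yield buffer; // flush the trailing partial line
}
```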
Unfortunately, yield* lines is even slower than all the other alternatives.
Going back to the original question of this issue, is this a fundamental thing or is it something where the spec allows a type of micro-tick optimization that v8 could implement, similar to https://v8.dev/blog/fast-async ?
The vast majority of the time is spent inside V8 and it is due to the overhead of yielding a lot of data. I _hope_ that an optimization could be done on their side.
As you can see in https://upload.clinicjs.org/public/6e4f252b53ea9834559fff3bc24d62283900d7a565cb99ef9bdae1c459deaf15/28374.clinic-flame.html#selectedNode=50&zoomedNode=&exclude=8000&merged=true, a significant amount of time is spent in managing the generators.
Here is a flamegraph done with linux perf:
flamegraph.html.gz
As you can see, most of the overhead is in the logic V8 uses to handle async generators :/.
cc @nodejs/v8, do you think there is any optimization possible?
yield* in async generators is pretty complex, and has to perform a lot of awaits. I think its unlikely you'd get better performance when using it.
yeah, yield* is slow, but can we get these two to be roughly the same speed?
// 1 - every chunk gets yielded individually, easy to use, currently a lot slower than 2
for await (const chunk of iterable) { ... }
// 2 - chunks get yielded as arrays as they are produced, forces the consumer to write a second for loop, but much faster
for await (const chunks of iterable) {
for (const chunk of chunks) { ... }
}
I had a couple more thoughts this morning:
export async function* transformSomeStreamGenerator(stream, opts) {...}
export async function transfromSomeStream(stream, opts) {
return Readable.from((async function*() {
for await (const chunks of transformSomeStreamGenerator(stream, opts)) {
for (const chunk of chunks) {
yield chunk
}
}
})())
}
This way transfromSomeStream can be used for piping and transformSomeStreamGenerator could be used for high-performance for-await-of, where the consumer is expected to write the for (const chunk of chunks) part themselves.
Any thoughts on this?
You are missing a few async keywords in your example.
In the 2) case, I can't see if it will have any perf benefit at all.
You are missing a few async keywords in your example.
Added ;)
In the 2) case, I can't see if it will have any perf benefit at all.
It would allow two ways of using it:
// 1 - every chunk gets yielded individually, easy to use
// currently a lot slower than 2
for await (const chunk of transfromSomeStream(stream)) { ... }
// 2 - chunks get yielded as arrays as they are produced,
// forces the consumer to write a second for loop, but much faster
for await (const chunks of transformSomeStreamGenerator(stream)) {
for (const chunk of chunks) { ... }
}
Happy to write another benchmark for this - but again, this would assume there is no way for v8 to achieve something close to parity. Can anyone comment on that?
I don't understand why you are using Readable.from() in transfromSomeStream.
Sure, let me give you some context. I'm re-writing the internals of exceljs, throwing out streams and replacing them with generators (sync wherever possible, async when needed). However, for backwards compatibility, we still need to expose a stream-like, eventemitter-based interface to our consumers.
So, our main logic is now implemented in a parse function (https://github.com/exceljs/exceljs/pull/1135/files#diff-2fe0207cb76e1f94add746d745867d8eR103), which is an async generator. Because of the issues discussed here, it yields an array of events instead of yielding each event by itself (https://github.com/exceljs/exceljs/pull/1135/files#diff-2fe0207cb76e1f94add746d745867d8eR346).
Based on that I refactored the read function to just call parse, synchronously loop through the events array and emit each event individually (i.e., keep the existing functionality): https://github.com/exceljs/exceljs/pull/1135/files#diff-2fe0207cb76e1f94add746d745867d8eR90-R100
As you can see, for legacy reasons, this is not exactly a streams3 implementation, but if it had been (i.e., if exposing a streams3 implementation weren't breaking backwards compatibility), I would have used
export async function read(stream, opts) {
return Readable.from((async function*() {
for await (const chunks of parse(stream, opts)) {
for (const chunk of chunks) {
yield chunk
}
}
})())
}
Now, my worry is that consumers will start using for-await on read rather than parse, because it's nicer/more ergonomic to receive each event individually rather than an array of events. And then at some point down the line their code runs slowly in production, they forget that they didn't use parse, and they blame the library or Node.js. So in short, I would like for-await to be equally fast on read and parse (equivalently, to have parse yield each event individually).
To re-state my question, is there any way for v8 optimizations to achieve parity or something close to it? I.e., can we somehow avoid paying the full price of constantly going back to the microtask/event queue on each iteration, when we already have the next element and could just do the iteration immediately?
I do not have an answer to your question.
I think you've already helped a lot in boiling this issue down to a v8 question rather than a node streams question. When I have a moment, I'll create an issue on the v8 tracker and ask for feedback there - or is there anyone from the @nodejs/v8 team that you pinged that has any new ideas before I do so?
I would go with an issue on v8.
I so far haven't been involved too much with this part of V8 so I don't have anything to say from the top of my head; this is something we'll have to look into. Thanks for the report!
Filing issues like this on the V8 issue tracker for V8 specific issues is definitely useful, but we're following @nodejs/v8 as well.
Okay I've tried writing it up as best as I could: https://bugs.chromium.org/p/v8/issues/detail?id=10385
If I've left out any details, please let me know!
I've done some more thinking - consider this code:
const fs = require('fs').promises;
const iterations = 5;
const arr = Array.from({ length: iterations }, (_, i) => i + 1);
(async () => {
  for await (const i of arr) {
    console.log('1 -', i);
  }
})();
(async () => {
  for await (const i of arr) {
    console.log('2 -', i);
  }
})();
(async () => {
  console.log(await fs.readFile('./hello.txt', 'utf8'));
})();
Right now, this prints
1 - 1
2 - 1
1 - 2
2 - 2
1 - 3
2 - 3
1 - 4
2 - 4
1 - 5
2 - 5
world
(since 'world' comes from the event queue, and the rest comes from microtasks).
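That queue distinction can be seen in isolation with a small sketch, using setImmediate to stand in for the event/macrotask queue:

```javascript
// Microtasks (promise callbacks) drain completely before the event
// (macrotask) queue is serviced again, which is why 'world' prints last.
const order = [];
setImmediate(() => order.push('macrotask'));
Promise.resolve().then(() => order.push('microtask 1'));
Promise.resolve().then(() => order.push('microtask 2'));
setImmediate(() => {
  order.push('done');
  console.log(order.join(', ')); // microtask 1, microtask 2, macrotask, done
});
```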
But, according to the spec, couldn't it print this?
1 - 1
1 - 2
1 - 3
1 - 4
1 - 5
2 - 1
2 - 2
2 - 3
2 - 4
2 - 5
world
Because if so, one could make for-await-of as performant as for-of. Does anyone here know whether that would be spec-compliant?
Whatever V8 prints when that code is run is spec-compliant, and there's no other version that would be spec-compliant.
@alubbe Did you consider consuming your chunks with a coroutine? Take a look at, e.g., how Bluebird's Promise.coroutine is implemented. The advantage of using generators to approximate async methods is exactly what you pointed out initially: you don't have to spend time waiting for setImmediate. When you yield control to your coroutine, it can give control back synchronously, or it can do something that is really async and give control back asynchronously.
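For reference, the core idea can be sketched like this (a simplified driver, not Bluebird's actual implementation): resume the generator synchronously whenever the yielded value isn't a thenable, and only pay for a real await when it is:

```javascript
// A coroutine driver: plain yielded values are fed back synchronously
// in a loop; only genuine thenables cost a microtask round-trip.
function coroutine(genFn) {
  return (...args) =>
    new Promise((resolve, reject) => {
      const gen = genFn(...args);
      function step(input) {
        while (true) {
          let result;
          try {
            result = gen.next(input);
          } catch (err) {
            return reject(err);
          }
          if (result.done) return resolve(result.value);
          const value = result.value;
          if (value && typeof value.then === 'function') {
            // Truly async: wait for it, then resume.
            return value.then(step, reject);
          }
          // Synchronous value: loop again without touching any queue.
          input = value;
        }
      }
      step(undefined);
    });
}

// Usage: three synchronous resumes, then one genuine await.
const run = coroutine(function* () {
  let total = 0;
  for (const n of [1, 2, 3]) total += yield n; // resumed synchronously
  total += yield Promise.resolve(10);          // one real microtask hop
  return total;
});

run().then((total) => console.log(total)); // prints 16
```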
I've continued to think about this problem, and I think I'm in a fairly unique position to possibly be able to offer a solution. I've written up my thoughts in an issue in the iter-tools package, which I maintain.
I think if I was really savvy about how I go about this I'd write up a proposal for a new WKS and its implied semantics and submit it to TC39. I think they require a sponsor though. I wonder if I know anyone who would hear me out. I guess I'd have to cook up a POC first.
@conartist6 yes, that was the alternative that I considered - to export a helper function responsible for iteration, taking a callback that gets each item passed in as an argument. However, I'm a strong proponent of using for loops over nesting functions, because a) you can use break and everything gets closed/cleaned up for you, and b) you get simpler stack traces with less indirection.
So what I'm really looking for is to make the for-await-of construct smarter in v8, because I don't think I can come up with a smarter Symbol.asyncIterator implementation that would yield the performance I'm looking for (and which your approach would probably achieve). In essence, a marriage of form and functionality that currently doesn't exist.
Yeah I generally like working with for loops too. Let me fork the current plugin that transpiles them to coroutines, and I think I can show you that you can use the for loop syntax to do what we're both talking about. I am assuming if I build this you have a branch where you could test the performance?
Of course the transpiled stack traces won't be great, but if this works we could lobby TC39 to standards track it and then the stack traces would come eventually. I think it could just use the current for await .. of syntax since it's a nonbreaking change.
I think I can make async generator functions work this way too while keeping them backwards compatible.
I want to point out that this isn't really a niche case. I benchmarked it and found that for .. of loops are more than 250x faster than for await .. of loops. In the common case the _vast_ majority of that extra work is unnecessary. To loop over the characters read from an average chunk (65535 bytes) of a file, we're talking about 1 await that is necessary and 131,069 that are not. This is of course not node's fault; it's the fault of the language spec, which somehow failed to anticipate that forcing one of the highest-level and most costly constructs in the language onto what should be a useful abstraction over a very low-level API might not promote its usage. I'd have to do some profiling to see where the time is being spent, but I suspect that it isn't queueMicrotask that is expensive. Rather, it's saving the execution state into an abstract closure and unpacking it (twice (sequentially (for every character))) that adds orders of magnitude worth of complexity. Anyway, here is my benchmark.
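A self-contained version of that kind of microbenchmark looks roughly like this (a sketch, not the linked benchmark; absolute numbers will vary by machine and Node version):

```javascript
// Rough microbenchmark: iterate the same in-memory array with for..of
// and with for await..of, which awaits every element even though each
// one is already available synchronously.
const N = 1e5;
const arr = new Array(N).fill(1);

function syncSum() {
  let sum = 0;
  for (const x of arr) sum += x;
  return sum;
}

async function asyncSum() {
  let sum = 0;
  for await (const x of arr) sum += x;
  return sum;
}

(async () => {
  let t = process.hrtime.bigint();
  syncSum();
  const syncNs = process.hrtime.bigint() - t;

  t = process.hrtime.bigint();
  await asyncSum();
  const asyncNs = process.hrtime.bigint() - t;

  console.log(`for..of:       ${syncNs} ns`);
  console.log(`for await..of: ${asyncNs} ns`);
})();
```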
I am looking for a champion for my proposal to alter the language specification to permit skipping this useless work. If anyone could point me in the direction of a TC39 delegate who might be interested I'm all ears. I've designed my proposal to be 100% backwards compatible.
Aha, @devsnek is one. Interested?
es.discourse.group is an official tc39 forum full of delegates. if someone there is interested I'm sure they will offer.
I'll take that as a no. Ah well, thanks for the fast response.
I can't be your champion, but I can be your cheerleader ;) What you're proposing is exactly what I was after, because in my investigations here and on the v8 issue it became clear to me that we're leaving performance on the table, but we'd probably need to fix/improve this at the language level. I lack the knowledge to write this up, but I'm going to follow the discussions on https://es.discourse.group/t/syncandasynciterator/554
Thank you for taking this up!