Node.js appears to be inconsistent in how it signals the end of a stream, which makes it hard to know when to pass execution on to the next step, especially in the presence of errors.
```js
const Transform = require('stream').Transform;
const fs = require('fs');

function testPipe(T) {
  var s = fs.createReadStream('index.js');
  s.pipe(T);
  s.on('error', function() { console.log('source.error'); });
  s.on('finish', function() { console.log('source.finish'); });
  s.on('end', function() { console.log('source.end'); });
  s.on('close', function() { console.log('source.close'); });
  T.on('error', function() { console.log('destination.error'); });
  T.on('finish', function() { console.log('destination.finish'); });
  T.on('end', function() { console.log('destination.end'); });
  T.on('close', function() { console.log('destination.close'); });
}

testPipe(new Transform({
  transform: function(chunk, encoding, callback) {
    console.log('_transform');
    return void callback();
  },
  flush: function(callback) {
    console.log('_flush');
    return void callback();
  }
}));
```
The documentation says:
The 'end' event is emitted when there is no more data to be consumed from the stream.
The 'end' event will not be emitted unless the data is completely consumed. This can be accomplished by switching the stream into flowing mode, or by calling stream.read() repeatedly until all data has been consumed.
However, no "end" event is produced from the destination stream:
source.end
_final
destination.finish
source.close
A similar result is seen for a transform stream:
_flush
destination.finish
source.end
source.close
The data has clearly been "completely consumed", as indicated by the fact that _flush has been called and returned.
The stream emits "finish" which suggests people should use this. However, this only indicates the data has been read, not that it's processed. Further, its behavior is inconsistent:
```js
const T = new Transform({
  transform: function(chunk, encoding, callback) {
    console.log('_transform');
    return void callback(new Error('Syntax error'));
  },
  flush: function(callback) {
    console.log('_flush');
    return void callback();
  }
});
```
We get:
destination.error
But if there's a problem while processing the end of the document (say, more data was expected), then a different set of events is raised. This:
```js
const T = new Transform({
  transform: function(chunk, encoding, callback) {
    console.log('_transform');
    return void callback();
  },
  flush: function(callback) {
    console.log('_flush');
    return void callback(new Error('Unexpected end-of-file'));
  }
});
```
Produces:
_flush
destination.error
destination.finish
source.end
source.close
Note how destination.finish is emitted after destination.error.
The Node.js documentation provides an example of a Writable stream that consumes data. However, it doesn't specify how to actually use the example in a context with events and actual streams.
When piping a file (using fs.createReadStream) into the example, again no "end" event is emitted:
source.end
destination.finish
source.close
The documentation omits any description of how to determine when all processing is complete (either when the stream is consumed and fully processed, or once there is a fatal error indicating no more processing will occur).
I've asked several developers to describe how to consume a stream and then call a function exactly once, and nobody has been able to. This suggests to me that the behavior of streams is too complex, and/or under-documented.
I would like to suggest a specific course of action, especially specific edits to the documentation, however it's not clear to me what the intended behavior of streams is even supposed to be.
I've documented a case where Node.js fails to guarantee that I can have a function called exactly once, or that it's difficult to write applications following this behavior without significant risk of introducing bugs.
There are several possible courses of action:
Emit 'end' after an error (if 'finish' is getting called after 'error', then surely it should be safe to also emit 'end', right?)

I'm not entirely sure; I'm not that familiar with streams, but I'd like to help.
This part:
However, no "end" event is produced from the destination stream:
Did you forget to include the code for that? I don't see a '_write' log anywhere.
But assuming you only replaced Transform stream with Writable stream:
The Writable stream doesn't have an 'end' event; it only has a 'finish' event, so it is correct that there is no such event.
Please correct me if I misunderstood the case.
Transform case:
A similar result is seen for a transform stream:
In your case you indeed consume the first stream s via s.pipe(T), and hence get an 'end' event for it, but you never consume the transform stream T, so you don't get an 'end' event from T. If you write it as follows:
```js
const t = new Transform({
  transform: function(chunk, encoding, callback) {
    console.log('_transform');
    return void callback();
  },
  flush: function(callback) {
    console.log('_flush');
    return void callback();
  }
});

testPipe(t);
t.on('data', () => {});
```
Even more so, your T stream is empty (you didn't push anything in transform), so the listener on 'data' won't even be called (because there is no data). You can also use t.resume() instead of t.on('data', () => {}) with the same result.
You referred to this doc:
The 'end' event is emitted when there is no more data to be consumed from the stream.
The 'end' event will not be emitted unless the data is completely consumed. This can be accomplished by switching the stream into flowing mode, or by calling stream.read() repeatedly until all data has been consumed.
And that is indeed correct: you have consumed all data from the source stream s by pipe'ing it, hence you get an 'end' event for the source stream, but you didn't consume the resulting transform stream T (in my example t). Note it is a duplex stream, so even though pipe has ended its Writable part (hence the 'finish' event), the Readable part of it is still alive and doesn't know that there is no data to be read (because the stream is empty), as nobody has called read() yet. Therefore you only get the flush callback indicating that the whole source has been transformed, but that doesn't mean the 'end' event will be fired.
You may also come across this line in the documentation:
The 'end' event is emitted after all data has been output, which occurs after the callback in transform._flush() has been called.
Which may be misleading IMO. As far as I understand, it means that the 'end' event, if emitted, will always come after the flush callback (as an ordering of events), not that calling the flush callback guarantees an 'end' event.
There is a better description in the _flush() doc:
Custom Transform implementations may implement the transform._flush() method. This will be called when there is no more written data to be consumed, but before the 'end' event is emitted signaling the end of the Readable stream.
The stream emits "finish" which suggests people should use this. However, this only indicates the data has been read, not that it's processed.
No, it actually means that the Writable part of the Transform stream has ended, i.e. all data has been transformed/processed (because pipe is not lazy). You may not see many _transform calls because of the watermark (the default highWaterMark for fs is 64k, as noted here); try doing the following:

```js
var s = fs.createReadStream('index.js', { highWaterMark: 10 });
```

This way you will see that there are a lot of transform calls before the flush.
As for the error cases:
I didn't clearly understand what you mean by 'the end of the document' in:
But if there's a problem while processing the end of the document
So please clarify, because _flush is called after the end of the document (assuming it means there is no more data in the Readable stream; see the link above).
Though, in the first case the stream ended in the middle because of the error, so neither source nor destination has ended and there is an error.
But in the second case the source stream has finished (because _flush is called after EOF), so you obviously will get source.end and source.close. As for destination.error and destination.finish: you get 'error' because your stream errored, obviously, but at the same time the Writable part of the Transform (Duplex) stream has ended, because of source.end, therefore you also get the 'finish' event.
This became quite a lot of text, feel free to correct me as streams are indeed complex =).
@lucamaraschi Thanks for the input!
Did you forget to include the code for that?
Er, I provided the code for the Transform example, but the Writable code is similar:
```js
const Writable = require('stream').Writable;
const fs = require('fs');

function testPipe(T) {
  var s = fs.createReadStream('index.js');
  s.pipe(T);
  s.on('error', function() { console.log('source.error'); });
  s.on('finish', function() { console.log('source.finish'); });
  s.on('end', function() { console.log('source.end'); });
  s.on('close', function() { console.log('source.close'); });
  T.on('error', function() { console.log('destination.error'); });
  T.on('finish', function() { console.log('destination.finish'); });
  T.on('end', function() { console.log('destination.end'); });
  T.on('close', function() { console.log('destination.close'); });
}

testPipe(new Writable({
  write: function(chunk, encoding, callback) {
    console.log('_write');
    return void callback();
  },
  final: function(callback) {
    console.log('_final');
    return void callback();
  }
}));
```
I'd encourage you to run the example and see the output.
Adding a 'data' listener seems to fix the case, but this is unexpected: the documentation for the 'end' event doesn't say it depends on another listener being registered.
I'm not aware of any other cases where adding a listener changes the behavior of the application; this is not "functional" behavior and is rather unexpected.
The documentation should be very clear about this, if this is the intended behavior.
This also doesn't work for a Writable stream, which by definition should only emit 'finish'.
it actually means that Writable part of the Transform stream has ended
Noted.
So please clarify, because _flush is called after the end of the document (assuming it means there is no more data in the Readable stream; see the link above).
By "end of document" I mean EOF, processing that you have to do after the document ends, such as raising an error if you forgot a closing curly brace (for example).
Because a write destination still has to verify the syntax, emitting 'finish' before this verification has been performed would be premature. I'm not aware of any other libraries that emit an EOF event before the actual error checking has been performed.
So in short, I think I can narrow this down to two problems:
(1) For Readable and Transform streams, Node.js should be more clear about when the 'end' event is emitted. The event isn't being emitted in every case that the documentation says it should be.
(2) For Writable and Transform streams, Node.js should decide if 'error' stops processing (and therefore preempts 'finish'), or if 'finish' always emits, even after an error. It doesn't make sense to raise an error then sometimes call 'finish' and sometimes not call 'finish'.
So as for your first message:
About the Writable example, I'm not sure what you are implying; I get the following results:
source.end
_final
destination.finish
source.close
Where the only confusing thing may be that source.end happens before _final and destination.finish, but that is due to the Writable actually having a _final callback. If you write it as follows:
```js
testPipe(new Writable({
  write: function(chunk, encoding, callback) {
    console.log('_write');
    return void callback();
  },
}));
```
You will get:
destination.finish
source.end
source.close
Which is consistent with Transform example.
Anything else I have missed?
Adding the 'data' event seems to fix the case; but this is unexpected. The 'end' event doesn't say it's dependent on having another registered event.
That's not about the 'data' event per se; it's about switching the Readable stream into flowing mode, which gets it going (a 'data' listener switches the stream to flowing mode if there are no 'readable' listeners). As I noted, you could also have just called stream.resume(), or manually called stream.read() until you get all the data. So the main thing, as noted in the documentation of the 'end' event, is:
The 'end' event will not be emitted unless the data is completely consumed. This can be accomplished by switching the stream into flowing mode, or by calling stream.read() repeatedly until all data has been consumed.
So even if you didn't push anything in the Transform example, the stream doesn't know that it has no data until someone tries to read() via any of those methods; hence you are not getting an 'end' event.
As for the
I'm not aware of any other cases where adding a listener changes the behavior of the application; this is not "functional" behavior and rather unexpected.
This is indeed unexpected, but that's AFAIK legacy that remains from old streams and for now we have to deal with it.
The documentation should be very clear about this, if this is the intended behavior.
Well, it is noted as such; it is not that visible, but it is there in the documentation:
Attaching a 'data' event listener to a stream that has not been explicitly paused will switch the stream into flowing mode. Data will then be passed as soon as it is available.
Though that's indeed quite a read =)
As for the last problem:
Because a write destination still has to verify the syntax, emitting 'finish' before this verification has been performed would be premature.
It is okay to emit 'finish' because it only means that the Writable part of the Duplex/Transform stream has ended; basically, this stream will no longer have any data passed to it.
Even more so, this always happens after the _flush callback, so you still have the ability to add any additional data you need in _flush.
So basically any consumer would wait on the 'readable' or 'data' and 'end' events, because those actually correspond to the Readable part of the Duplex/Transform stream, the part you actually get the data from.
(1) For Readable and Transform streams, Node.js should be more clear about when the 'end' event is emitted. The event isn't being emitted in every case that the documentation says it should be.
So after my explanation, could you please try to list what parts are unclear exactly, to be able to work from that?
(2) For Writable and Transform streams, Node.js should decide if 'error' stops processing (and therefore preempts 'finish'), or if 'finish' always emits, even after an error. It doesn't make sense to raise an error then sometimes call 'finish' and sometimes not call 'finish'.
I'll explain the behavior as I've already started, but my own opinion is at the end.
Well, the 'error' event only signals an error; the place where it happens decides whether 'finish' will or will not be emitted. So basically there are 2 cases (at least I hope so =)):
- An error in a 'processing' callback (the transform() or write() callback): you won't get a 'finish' event (and most others), because the processing ended prematurely, and "The stream is not closed when the 'error' event is emitted."
- An error in a 'post-processing' callback (the flush() or final() callback): even if an error happens there, you will still get a 'finish' event, because the Writable part of the stream (or the writable stream itself) is basically finished (those callbacks may add some data, but the stream itself should end) whether those callbacks succeed or not; they may delay the 'finish' event, but not cancel it. The main reason for those callbacks is:
  - final(): to close any resource handles and do some post-processing. "This optional function will be called before the stream closes, delaying the 'finish' event until callback is called." (doc)
  - flush(): to write additional data or perform final checks. "This will be called when there is no more written data to be consumed, but before the 'end' event is emitted signaling the end of the Readable stream." (doc)

So basically the fact that you get 'finish' on a Transform stream doesn't mean success; it only means that there is no more data to process.

So from this, for me, even though the behavior of 'error' and 'finish' is understandable, it may be unclear that an error in a 'post-processing' callback will not prevent the stream from ending, so indeed this is probably a topic to discuss.
@awwright is this still relevant or can we close it?
@lucamaraschi I still think the documentation needs some work at the very least. Where the documentation lists some of the exceptions to behavior, it should list all the exceptions to behavior.
Calling destroy on a writable stream appears to have changed behavior between node 8 and node 10. In node 8, it would emit a 'finish', in node 10 it emits a 'close' instead. O_o
Further, _final is inconsistent. In cases where a readable is piped to the writable, _final is called as per the docs. In the case where .end(...) is called directly on the writable, _final is not called, it seems.
I am definitely calling the callback from the calls to _write()...
EDIT: Figured it out.
```js
function prefinish(stream, state) {
  if (!state.prefinished && !state.finalCalled) {
    if (typeof stream._final === 'function' && !state.destroyed) {
      state.pendingcb++;
      state.finalCalled = true;
      process.nextTick(callFinal, stream, state);
    } else {
      state.prefinished = true;
      stream.emit('prefinish');
    }
  }
}
```
The use of process.nextTick() here is what's throwing me off. The documentation is very vague about in which order things are called.
I made the (incorrect) assumption that if all of my code inside my stream is synchronous, then I can safely write, end, and finish up a stream in one synchronous call.
This is not the case, as the _final method is called on the next tick. In a synchronous application, this means that _final is never called.
Highly frustrating, the documentation should be updated for sure.
However, no "end" event is produced from the destination stream:
That is because the destination is a Writable, and a Writable emits 'finish' whereas a Readable emits 'end'. They deliberately use different names for conceptually the same thing so that you can distinguish which side of a Duplex or Transform has ended.
A similar result is seen for a transform stream:
_flush
destination.finish
source.end
source.close
The data has clearly been "completely consumed", as indicated by the fact that _flush has been called and returned. The stream emits "finish" which suggests people should use this. However, this only indicates the data has been read, not that it's processed. Further, its behavior is inconsistent:
Again the 'finish' event is emitted by the Writable side. It is awkward that _flush and 'finish' are executed before the source 'end' and 'close' but this is not incorrect because the Readable and Writable in this example are different streams and therefore operate largely independently of one another.
However, I say "largely" because there is this from the Transform._flush documentation:
[Transform._flush] will be called when there is no more written data to be consumed, but before the 'end' event is emitted signaling the end of the Readable stream.
So there is some coordination between the source Readable and destination Writable. Why? Not sure.
[Error in Transform._transform produces:]
destination.error
[Error in Transform._flush produces:]
_flush
destination.error
destination.finish
source.end
source.close
Note how destination.finish is emitted after destination.error.
That's not really unexpected. I would rather know about the error first so that I might skip or augment whatever is executed in finish.
And all of the extra events in the second case can be attributed to the fact that the source ended.
I'm going to quit here but if there is something else that you still think is not explainable feel free to reply and point that out.
Ping @nodejs/streams
I propose we close this.
I wouldn't mind looking into this if someone could make a new summary of these issues based on 14+.
I believe some of these issues might have been resolved in the stream consistency efforts done in node 14 and beyond.
Reading this again, it seems to be as designed. The mentioned timing issues I believe are resolved in 14+.