Currently the Readable streams documentation states that highWaterMark is measured in bytes for non-object mode streams. However, once .setEncoding() has been called on a Readable stream, highWaterMark is effectively measured in characters and _not_ bytes: state.length += chunk.length runs _after_ chunk has been replaced with a decoded string, and state.length is what gets compared against highWaterMark.
This seems to be an issue since at least v0.10, so I wasn't sure if we should just update the documentation to note this behavior or if we should change the existing behavior to match the documentation.
/cc @nodejs/streams
My initial take on this is that measuring characters is a bug.
@jasnell I agree! This is a nice issue for anybody who wants to start contributing to streams and to core. cc @nodejs/streams.
I'd be interested in digging into this to learn more about streams.... will take a look in the morning
One other issue to think about: what should happen when someone calls setEncoding() while state.length > 0? Do we convert all the buffered data to strings and update state.length, or just loop through the buffer using Buffer.byteLength(chunk, encoding) to update state.length?
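For illustration, the second option could look roughly like this (a hedged sketch only; recomputeLengthInBytes is a hypothetical helper, not actual core code):

```javascript
// Hypothetical helper sketching the "loop through the buffer" option above.
// Buffer.byteLength accepts both strings and Buffers, so it handles a
// buffer list containing mixed data.
function recomputeLengthInBytes(chunks, encoding) {
  let length = 0;
  for (const chunk of chunks) {
    length += Buffer.byteLength(chunk, encoding);
  }
  return length;
}

// '€€' is 6 bytes in UTF-8, Buffer.from('ab') is 2 bytes:
console.log(recomputeLengthInBytes(['€€', Buffer.from('ab')], 'utf8')); // 8
```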
I think we should just use byte length for highWaterMark. Converting values... sounds tricky.
Hi! It seems like it's been a while since anyone mentioned anything on this issue. Does anyone mind if I take a go at this?
@jessicaquynh please do! Ping @nodejs/streams if you need any help.
@mcollina Thanks! Should I also include a test for this change?
@jessicaquynh yes please!
Great, will do! Thanks @cjihrig!
@nodejs/streams would love some guidance! The implementation that I made has caused a side-effect and I am unsure what to do to fix it.
I have changed this line so that (if not in objectMode) the chunk.length is computed as Buffer.byteLength(chunk, encoding) instead.
However, the most recent version of _stream_readable.js has this line that conflicts with the change.
This if statement gets entered prematurely when the byteLength is quite low, say 2 or 4, and it throws an error because state.buffer.head is undefined. In those instances, the string chunk length was 0. I notice that the file on master and the file in this issue's reference contain a different howMuchToRead function. So perhaps the new function got rewritten without this bug in mind?
Anyhow, any help or feedback would be greatly appreciated! Thanks! :)
> So perhaps the new function got rewritten without this bug in mind?
I do not think so, no.
The best approach is to write a new unit test that exhibits the described issue, and push it to a branch on your profile. I would like to have a closer look at both the change and the test.
As a side note, Buffer.byteLength(str) calls str.toString(), which might be the issue here.
@mcollina That was a very complicated process, but I ended up learning a lot along the way! Would love your advice and feedback!
I wrote a test for this at a lower byte level using the em dash and the euro character, as their character length and byte length are not the same.
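For reference, both characters are one JS character each but three bytes in UTF-8, which is what makes them useful for this test:

```javascript
// Em dash (U+2014) and euro sign (U+20AC): 1 character, 3 UTF-8 bytes each.
console.log('—'.length, Buffer.byteLength('—', 'utf8')); // 1 3
console.log('€'.length, Buffer.byteLength('€', 'utf8')); // 1 3
```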
@jessicaquynh nice one! I think the problem is here: https://github.com/jessicaquynh/node/blob/issue%236798/lib/_stream_readable.js#L253. The state.buffer.head.data can be either a Buffer or a string, which means that we need to use Buffer.byteLength.
Let me know how it goes!
@mcollina hey! I am not sure if I implemented your fix correctly, but it's still failing a few other tests. I wasn't exactly sure how to reproduce beyond introducing the process.on('uncaughtException') listener. Let me know if this is incorrect or bad practice!
Nonetheless, here's an updated test that reproduces the error. If you push in an exact buffer source, it works as expected. But should you split the buffer and the split is not exact, it attempts to return the empty buffer head!
Some notes: I fear this is definitely not a _good first contribution_. I was wrong to tag it as one.
Anyway, we also need to update https://github.com/jessicaquynh/node/blob/b587c1f2a7f21fc00b5c116d48a9c44d62feae11/lib/_stream_readable.js#L865-L884, as it's making its calculations on the character length, not the bytes.
There is still a thing that worries me: characters are unsplittable, yet we can send the first half of a UTF-8 sequence through when the consumer expects it to be a string. Not sure if it's going to be an issue code-wise.
@mcollina Ahah :) I will admit it's been a daunting (although rewarding) task. Thank you for your aid in this process.
I have implemented your recommended changes. There is still one error in the parallel tests, and it has to do with a readline implementation: namely, this test, which throws the empty buffer head error.
From my interpretation it looks like the byte order mark is causing the length to be off?
@jessicaquynh can you push your branch? https://github.com/nodejs/node/compare/master...jessicaquynh:issue%236798 does not have all the fixes (especially the howMuchToRead ones). I want to run the tests on your branch too.
The problem is related to https://github.com/jessicaquynh/node/blob/b587c1f2a7f21fc00b5c116d48a9c44d62feae11/lib/_stream_readable.js#L865-L884. As you can see, it's still accessing the length without passing through Buffer.byteLength.
@mcollina Sorry about that! I forgot to push up last night with fromListPartial fix. It has now been updated and pushed to be the same as my local branch.
Additionally, I was wondering if I could ask for clarification:
> There is still a thing that worries me: characters are unsplittable, yet we can send the first half of a UTF-8 sequence through when the consumer expects it to be a string. Not sure if it's going to be an issue code-wise.
Is this in regard to the instances in the test where the Buffer splits the euro/em-dash characters across different chunks?
> Is this in regard to the instances in the test where the Buffer splits the euro/em-dash characters across different chunks?
Yes!
In current master, the logic is _exact_, and it treats bytes and chars in the same way.
Meaning that, if you try to .read(16), it will read either 16 bytes or characters.
If we count only characters, what should we do? If we have 15 bytes followed by a 3-byte UTF-8 sequence, we can read either 15 or 18 bytes; we cannot read 16 or 17. This is very likely what is happening here.
(Maybe I am wrong about that utf story, _this change_ is really _hard_).
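A small standalone sketch of the counting problem above (illustration only, not core code): 15 one-byte characters followed by one 3-byte character, sliced at 16 bytes.

```javascript
// 15 bytes of 'a' plus a 3-byte '€': 18 bytes total.
const data = Buffer.concat([
  Buffer.from('a'.repeat(15), 'utf8'),
  Buffer.from('€', 'utf8'),
]);

// Slicing at 16 bytes cuts inside the '€' sequence; decoding the slice
// yields a replacement character (U+FFFD) at the end instead of '€'.
const decoded = data.slice(0, 16).toString('utf8');
console.log(decoded.endsWith('\ufffd')); // true
```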
I'm seeing two tests failing:
In case of 1., there is something wrong going on with hex encoding.
```
[ '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '61616161616161616161',
  '' ]
```

```
[ '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161' ]
```
All the data is there, but organized differently. There might be an issue in onEofChunk, again accessing chunk.length: https://github.com/jessicaquynh/node/blob/cedc393221c305eb87fc50fde2191664bddac697/lib/_stream_readable.js#L405
Maybe this is just too hard to fix in general? The use case of piping with setEncoding is kinda edge-casy anyway.
@mafintosh probably, at this point I am wondering if it's feasible, and if it decreases throughput. Maybe we are just worrying for nothing.
Hey @mcollina, I've added your suggestion, but it actually surfaces more errors in the tests.
I'm seeing two tests failing:
test/parallel/test-stream2-set-encoding.js
I noticed in my first changes that this same error was happening when I was using the state's encoding in the readableAddChunk function, here.
I am not exactly sure what the issue is, so I tried experimenting without setting the encoding in fromListPartial, here, and that seemed to work.
Though I am not exactly sure why, because the failing test/parallel/test-stream2-set-encoding.js checks for hex encoding, and with the empty string it defaults to utf8. If you wouldn't mind explaining this difference in behaviour, I'd very much appreciate it :)
Nonetheless, test/parallel/test-stream-preprocess.js is still failing with this error:

```js
return Buffer.byteLength(state.buffer.head.data)
```
@jessicaquynh sorry this is taking a bit of time :( This week and the next will be a bit busy; I will try to answer as quickly as I can.
@mcollina no worries! I appreciate all your help :)
ping @nodejs/streams
I think the best approach about this issue is to update the documentation. An actual fix proved to be way harder, and it might slow things down considerably.
Would anyone update the docs? Maybe @jessicaquynh?
@mcollina I definitely can; are you thinking of putting it in as a "gotcha" statement?
@jessicaquynh yes, I think that would work, let's see what other people think when you fire the PR.
ping @mscdex
Yes, I think this could be closed.