Currently the Readable streams documentation states that highWaterMark is measured in bytes for non-object mode streams. However, once .setEncoding() has been called on a Readable stream, highWaterMark is effectively measured in characters and _not_ bytes: state.length += chunk.length runs _after_ chunk has been replaced with a decoded string, and state.length is what gets compared against highWaterMark.
This seems to be an issue since at least v0.10, so I wasn't sure if we should just update the documentation to note this behavior or if we should change the existing behavior to match the documentation.
/cc @nodejs/streams
My initial take on this is that measuring characters is a bug.
@jasnell I agree! This is a nice issue for anybody who wants to start contributing to streams and to core. cc @nodejs/streams.
I'd be interested in digging into this to learn more about streams.... will take a look in the morning
One other issue to think about: what should happen when someone calls setEncoding() while state.length > 0? Do we convert all the buffered data to strings and update state.length, or just loop through the buffer using Buffer.byteLength(chunk, encoding) to update state.length?
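For illustration, the second option could look roughly like this (a hedged sketch only; recomputeLengthInBytes is a hypothetical helper, not actual core code):

```javascript
// Hypothetical helper sketching the "loop through the buffer" option above.
// Buffer.byteLength accepts both strings and Buffers, so it handles a
// buffer list containing mixed data.
function recomputeLengthInBytes(chunks, encoding) {
  let length = 0;
  for (const chunk of chunks) {
    length += Buffer.byteLength(chunk, encoding);
  }
  return length;
}

// '€€' is 6 bytes in UTF-8, Buffer.from('ab') is 2 bytes:
console.log(recomputeLengthInBytes(['€€', Buffer.from('ab')], 'utf8')); // 8
```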
I think we should just use byte length for highWaterMark. Converting values... sounds tricky.
Hi! It seems like it's been a while since anyone mentioned anything on this issue. Does anyone mind if I take a go at this?
@jessicaquynh please do! Ping @nodejs/streams if you need any help.
@mcollina Thanks! Should I also include a test for this change?
@jessicaquynh yes please!
Great, will do! Thanks @cjihrig!
@nodejs/streams would love some guidance! The implementation that I made has caused a side-effect and I am unsure what to do to fix it.
I have changed this line so that (if not in objectMode) the chunk.length is computed as Buffer.byteLength(chunk, encoding) instead.
However, the most recent version of _stream_readable.js has this line that conflicts with the change.
This if statement gets entered prematurely when the byteLength is quite low, say 2 or 4, and it throws an error because state.buffer.head is undefined. In those instances, the string chunk length was 0. I notice that the file on master and the file in this issue's reference contain a different howMuchToRead function. So perhaps the new function got rewritten without this bug in mind?
Anyhow, any help or feedback would be greatly appreciated! Thanks! :)
> So perhaps the new function got rewritten without this bug in mind?
I do not think so, no.
The best approach is to write a new unit test that exhibits the described issue, and push it to a branch on your profile. I would like to have a closer look at both the change and the test.
As a side note, Buffer.byteLength(str) calls str.toString(), which might be the issue here.
@mcollina That was a very complicated process, but I ended up learning a lot along the way! Would love your advice and feedback!
I wrote a test for this at a lower byte level using the em dash and the euro character, as their character length and byte length are not the same.
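For reference, both characters are one JS character each but three bytes in UTF-8, which is what makes them useful for this test:

```javascript
// Em dash (U+2014) and euro sign (U+20AC): 1 character, 3 UTF-8 bytes each.
console.log('—'.length, Buffer.byteLength('—', 'utf8')); // 1 3
console.log('€'.length, Buffer.byteLength('€', 'utf8')); // 1 3
```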
@jessicaquynh nice one! I think the problem is here: https://github.com/jessicaquynh/node/blob/issue%236798/lib/_stream_readable.js#L253. The state.buffer.head.data can be either a Buffer or a string, which means that we need to use Buffer.byteLength.
Let me know how it goes!
@mcollina hey! I am not sure if I implemented your fix correctly, but it's still failing a few other tests. I wasn't exactly sure how to reproduce beyond introducing the process.on('uncaughtException') listener. Let me know if this is incorrect or bad practice!
Nonetheless, here's an updated test that reproduces the error. If you push in an exact buffer source, it works as expected. But should you split the buffer and the split is not exact, it attempts to return the empty buffer head!
Some notes: I fear this is definitely not a _good first contribution_. I was wrong to tag it as one.
Anyway, we also need to update https://github.com/jessicaquynh/node/blob/b587c1f2a7f21fc00b5c116d48a9c44d62feae11/lib/_stream_readable.js#L865-L884, as it's making its calculations on the character length, not the bytes.
There is still a thing that worries me: characters are unsplittable, yet we can send the first half of a UTF-8 sequence through when the consumer expects it to be a string. Not sure if it's going to be an issue code-wise.
@mcollina Ahah :) I will admit it's been a daunting (although rewarding) task. Thank you for your aid in this process.
I have implemented your recommended changes. There is still one error in the parallel tests, and it has to do with a readline implementation: namely, this test, which throws the empty buffer head error.
From my interpretation it looks like the byte order mark is causing the length to be off?
@jessicaquynh can you push your branch? https://github.com/nodejs/node/compare/master...jessicaquynh:issue%236798 does not have all the fixes (especially the howMuchToRead ones). I want to run the tests on your branch too.
The problem is related to https://github.com/jessicaquynh/node/blob/b587c1f2a7f21fc00b5c116d48a9c44d62feae11/lib/_stream_readable.js#L865-L884. As you can see, it's still accessing the length without passing through Buffer.byteLength.
@mcollina Sorry about that! I forgot to push up last night with fromListPartial fix. It has now been updated and pushed to be the same as my local branch.
Additionally, I was wondering if I could ask for clarification:
> There is still a thing that worries me: characters are unsplittable, yet we can send the first half of a UTF-8 sequence through when the consumer expects it to be a string. Not sure if it's going to be an issue code-wise.
Is this in regard to the instances in the test where the Buffer splits the euro/em-dash characters across different chunks?
> Is this in regard to the instances in the test where the Buffer splits the euro/em-dash characters across different chunks?
Yes!
In current master, the logic is _exact_, and it treats bytes and chars in the same way.
Meaning that, if you try to .read(16), it will read either 16 bytes or characters.
If we count only characters, what should we do? If we have 15 bytes followed by a 3-byte UTF-8 sequence, we can read either 15 or 18 bytes; we cannot read 16 or 17. This is very likely what is happening here.
(Maybe I am wrong about that utf story, _this change_ is really _hard_).
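A small standalone sketch of the counting problem above (illustration only, not core code): 15 one-byte characters followed by one 3-byte character, sliced at 16 bytes.

```javascript
// 15 bytes of 'a' plus a 3-byte '€': 18 bytes total.
const data = Buffer.concat([
  Buffer.from('a'.repeat(15), 'utf8'),
  Buffer.from('€', 'utf8'),
]);

// Slicing at 16 bytes cuts inside the '€' sequence; decoding the slice
// yields a replacement character (U+FFFD) at the end instead of '€'.
const decoded = data.slice(0, 16).toString('utf8');
console.log(decoded.endsWith('\ufffd')); // true
```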
I'm seeing two tests failing:
In case of 1., there is something wrong going on with hex encoding.
```
[ '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '61616161616161616161',
  '' ]
```

```
[ '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161',
  '6161616161' ]
```
All the data is there, but organized differently. There might be an issue in onEofChunk, again accessing chunk.length: https://github.com/jessicaquynh/node/blob/cedc393221c305eb87fc50fde2191664bddac697/lib/_stream_readable.js#L405
Maybe this is just too hard to fix in general? The use case of piping with setEncoding is kinda edge-casy anyway.
@mafintosh probably, at this point I am wondering if it's feasible, and if it decreases throughput. Maybe we are just worrying for nothing.
Hey @mcollina, I've added your suggestion, but it actually surfaces more errors in the tests.
I'm seeing two tests failing:
test/parallel/test-stream2-set-encoding.js
I noticed in my first changes that this same error was happening when I was using the state's encoding in the readableAddChunk function, here.
I am not exactly sure what the issue is, so I tried experimenting without setting the encoding in fromListPartial, here, and that seemed to work.
Though I am not exactly sure why, because the failing test/parallel/test-stream2-set-encoding.js checks for hex encoding, and with the empty string it defaults to utf8. If you wouldn't mind explaining this difference in behaviour, I'd very much appreciate it :)
Nonetheless, test/parallel/test-stream-preprocess.js is still failing with this error:

```js
return Buffer.byteLength(state.buffer.head.data)
```
@jessicaquynh sorry this is taking a bit of time :( This week and the next will be a bit busy; I will try to answer as quickly as I can.
@mcollina no worries! I appreciate all your help :)
ping @nodejs/streams
I think the best approach about this issue is to update the documentation. An actual fix proved to be way harder, and it might slow things down considerably.
Would anyone update the docs? Maybe @jessicaquynh?
@mcollina I definitely can; are you thinking of putting it in as a "gotcha" statement?
@jessicaquynh yes, I think that would work, let's see what other people think when you fire the PR.
ping @mscdex
Yes, I think this could be closed.