Node: Is it a bug to return BOM header after calling fs.readFile when the text encoding (like utf8) is specified?

Created on 22 May 2016 · 9Comments · Source: nodejs/node

Version:: all version so far
Platform:: all platform
Subsystem: ...

When loading text content from filepath that are utf-8 encode and having a BOM header

fs.readFile(filepath, 'utf8', (err, content) => {

});

the content returned still contains the BOM header, in addition to the text content. It causes a lot of wasted efforts to remove such a header in custom codes and is not the kind of behavior in other languages ...

Removing such a header in node shouldn't affect most of the existing codes that depend on it since such a header is most likely not used by custom codes anyway

feature request fs question

Source

webarchymeta

👍4

Most helpful comment

It would potentially be possible to add a omitBOM: false|true option to the fs.readFile API and have it attempt to strip the BOM for known encodings (utf8 and utf16le). A PR would be welcomed :-) That's no guarantee that it would land because others would need to weigh in on it but I'd welcome it for sure.

jasnell on 7 Jun 2016

👍4

All 9 comments

It seems developers leave this to userland: https://github.com/nodejs/node/issues/3040

vsemozhetbyt on 22 May 2016

👍1

I don't understand the rational behind such a decision, it's seems to be semantically incorrect ... It does not take a lot to do it in the core at all

webarchymeta on 22 May 2016

❤1

Well, I also am tired of writing time after time:

  const rli = rl.createInterface({input: fs.createReadStream(path, 'utf8')});
  let lineNumber = 0;
  rli.on('line', line => {
    if (++lineNumber === 1) line = line.replace(bomRE, '');
    //...
  })

May be something will change in the distant future.

vsemozhetbyt on 23 May 2016

There are at multiple (15+) different Byte Order Mark (BOM) values that can be received from a stream. Actual value of BOM is very relevant to someone implementing a generic parser capable of handling anything thrown at it - or at least fail gracefully.

If core receives the BOM and stops passing it through then core should also handle all the nuances of every encoding and endianness accordingly and somehow present them in a uniform way for userland stream. That seems unlikely to happen and no-one wants the core to make dumbed down assumptions as a shortcut just because majority of users are expecting to receive UTF-8 on little-endian CPU.

So, it is true that requesting stream in UTF-8 has fixed BOM of EF BB BF (hex), but that does not apply to other possible encoding/endianess combinations.

Maybe optional parameter for dropping BOMs from read streams could be offered?

imyller on 23 May 2016

Yeh, what you suggested could be an option for the time being ...

But still ... I am not an expert in text encoding so I do not know a lot intricacies about it. But to me it seems that byte ordering is platform dependent only, after having decoded a block of text successfully from the raw bytes the node core had already made the right conversion (using the BOM) and send the resulting text into the userland. What is the use of the BOM information after that, inside the userland, which most like doing only platform independent programming?

For those who make generic parsers (the real men :-)), they can always read the raw bytes into a Buffer object and start from there. Am I missing something in my statements?

webarchymeta on 23 May 2016

@webarchymeta

What is the use of the BOM information after that, inside the userland, which most like doing only platform independent programming?

Exactly. For example: embedded system code that gets cross-compiled to multiple platforms, code interfacing with legacy systems/hardware or mission-critical code with need to gracefully handle every possible input scenario; no matter how implausible.

For those who make generic parsers (the real men :-)), they can always read the raw bytes into a Buffer object and start from there. Am I missing something in my statements?

You might be right about that, but I'm still arguing that BOMs are not by any means irrelevant nuisance and exposing them makes Node.js stronger in capability than it would be if hiding them. Streams starting with BOMs are defined in unicode standard.

That said, it I still think that it would be good idea to offer option for BOM stripped streams - just for developer comfort. Also, BOMs could be used to implement ReadStream with true unicode conversion between source content encoding (defined by its BOM) and output encoding requested by the user. Especially in that case outputting BOM bytes could very well be optional if conversion is reliable.

imyller on 23 May 2016

jasnell on 7 Jun 2016

👍4

Maybe for streams it would be also very helpful.

vsemozhetbyt on 7 Jun 2016

This issue has been inactive for sufficiently long that it seems like perhaps it should be closed. Feel free to re-open (or leave a comment requesting that it be re-opened) if you disagree. I'm just tidying up and not acting on a super-strong opinion or anything like that.