Node: Buffer.toString() doesn't trim the trailing `\0`

Created on 20 Jan 2016 · 8 Comments · Source: nodejs/node

Node version is 5.4.0.

> const buf = new Buffer([45, 45, 45, 45, 0, 0, 0, 0]);
undefined
> buf.toString('utf8');
'----\u0000\u0000\u0000\u0000'
> buf.toString('ascii');
'----\u0000\u0000\u0000\u0000'
> buf.toString('ascii').length;
8

As the code above shows, if the buffer contains a string padded with \0 at the end, Buffer.toString() returns that padding as part of the string. The padding is never shown in output, and it also affects String.length.

The correct result should be ----, with a length of 4; the trailing \0 shouldn't be part of the final string.

buffer

All 8 comments

\u0000 is a legitimate ASCII and Unicode character, the first in the block of C0 control codes. The 0x00 byte is the correct encoding of this character in UTF-8 and ASCII.

I agree that this is intentional (and desired) behavior. You could ask the same thing about built-in strings.

This causes many problems.

  • String.trim() will not remove the \0; I don't think it expects \0 anyway.
  • Output such as console.log() will not show \u0000: not as a space, not as '\u0000', nothing. The result looks the same as ----, yet the length is 8, not 4.
  • Functions like lodash's pad(), padEnd(), and padStart() use the length, which includes those \0 characters, so they cannot pad the strings correctly.

The problem came from parsing a binary file in which some fixed-length areas contain \0-terminated/padded strings. I think this case is very common when parsing binary files or network packets.

My current workaround is buf.toString('utf8').replace(/\0/g, ''); however, I think Buffer.toString() or maybe String.trim() should handle this case.
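One caveat with the workaround above: replace(/\0/g, '') removes every \0, not only the padding at the end. A minimal sketch of the difference follows. The helper names stripAllNuls and stripTrailingNuls are made up for illustration, and Buffer.from assumes a newer Node release than the 5.4.0 in the report.

```javascript
// Hypothetical helpers for illustration; not part of Node or lodash.
function stripAllNuls(str) {
  // Removes every NUL character, including any inside the string.
  return str.replace(/\0/g, '');
}

function stripTrailingNuls(str) {
  // Removes only the run of NUL characters at the very end (the padding).
  return str.replace(/\0+$/, '');
}

// Eight bytes: "--", one interior NUL, "--", then three NULs of padding.
const padded = Buffer.from([45, 45, 0, 45, 45, 0, 0, 0]).toString('utf8');

console.log(padded.length);                    // 8
console.log(stripAllNuls(padded).length);      // 4
console.log(stripTrailingNuls(padded).length); // 5 (interior NUL kept)
```

Which of the two is appropriate depends on whether an interior \0 can be meaningful in the format being parsed.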

If a binary format has null-padded fixed-length strings, then the fact that the nulls should be removed is a policy of the format, and therefore the responsibility of the parser for that format.

Buffer.toString() does exactly what it says on the box: Convert the bytes to characters using the given encoding. It is not in a position to unilaterally omit specific characters to save you a few lines in your parser.

String.trim() is part of the ECMAScript spec, is implemented by browsers, and is used in limitless amounts of code. It also cannot be changed to help you write your parser.
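As a rough sketch of what "the responsibility of the parser" could look like, the helper below reads a fixed-width field and stops at the first 0x00 byte before decoding. The name readPaddedString and its parameters are hypothetical. It assumes an encoding in which a zero byte can only mean NUL (ascii, latin1, utf8); it would not be correct for UTF-16 data.

```javascript
// Hypothetical helper: read a fixed-width, NUL-terminated or NUL-padded
// string field starting at `offset` and spanning `width` bytes.
function readPaddedString(buf, offset, width, encoding = 'utf8') {
  const field = buf.subarray(offset, offset + width); // view, no copy
  const nul = field.indexOf(0);                       // first 0x00 byte, or -1
  const end = nul === -1 ? field.length : nul;
  return field.toString(encoding, 0, end);            // decode bytes before the NUL
}

// Example record: an 8-byte field followed by a 4-byte field.
const record = Buffer.from([45, 45, 45, 45, 0, 0, 0, 0, 65, 66, 0, 0]);

console.log(readPaddedString(record, 0, 8)); // '----'
console.log(readPaddedString(record, 8, 4)); // 'AB'
```

Cutting at the byte level means the decoded string never contains the padding, so no string cleanup is needed afterwards.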

What @mscdex and @jesseschalken said. Regarding the length, there are countless other invisible characters in Unicode.
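To illustrate the point about length, here is a small sketch with a few invisible characters. The sample set is chosen for illustration only, not taken from the thread.

```javascript
// Each sample is "----" followed by one character that usually renders as nothing.
const samples = {
  'NUL (U+0000)': '----\u0000',
  'zero-width space (U+200B)': '----\u200B',
  'zero-width joiner (U+200D)': '----\u200D',
};

for (const [name, s] of Object.entries(samples)) {
  // Every sample reports a length of 5, although only four dashes are visible.
  console.log(name, s.length);
}
```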

Closing as not a bug.

Do you have any advice for dealing with Unicode characters like '\u0000' and 'C?C>'? I've been looking hard, but there doesn't seem to be a standard for taking Unicode strings extracted from file binaries, for example, and converting them to a human-readable form.

OP, I ran into this problem too. As a C++ programmer, I built a web client with Node.js, but all the tasks and data are dispatched by a C++ server. When the web front end receives the data, I have to write a lot of code to translate it. What's surprising is that Node.js itself is written in C++... Maybe I should merge the Node.js source code into my server, which sounds similar to what frameworks like nginx do.

To anyone who comes across this: .trim()'s behavior is defined in the ECMAScript specification such that \0 (NULL) characters are not removed.

Try .replace(/^[\s\uFEFF\xA0\0]+|[\s\uFEFF\xA0\0]+$/g, "") instead of .trim().

This is based on MDN's polyfill for .trim(), but adds \0 characters.
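A short sketch of the suggestion above, wrapping the regex in a helper and comparing it with .trim(). The function name trimWithNul is hypothetical.

```javascript
// Hypothetical helper wrapping the regex from the comment above.
// Strips whitespace, BOM, NBSP, and NUL from both ends; interior characters are kept.
function trimWithNul(str) {
  return str.replace(/^[\s\uFEFF\xA0\0]+|[\s\uFEFF\xA0\0]+$/g, '');
}

const raw = '  ----\0\0\0\0';

console.log(raw.trim().length);       // 8: leading spaces removed, NULs kept
console.log(trimWithNul(raw).length); // 4
```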
