If I try to read a big file (582,170,692 bytes, ~ 555 MB) into a buffer, it is OK. If I add an encoding and try to get a string, I get an error.
```
> require('fs').readFileSync('ru-ru_Wiki-2007-01-03.dsl').length
582170692
> require('fs').readFileSync('ru-ru_Wiki-2007-01-03.dsl', 'utf16le').length
Error: "toString()" failed
    at Buffer.toString (buffer.js:513:11)
    at Object.fs.readFileSync (fs.js:511:41)
    at repl:1:15
    at sigintHandlersWrap (vm.js:22:35)
    at sigintHandlersWrap (vm.js:96:12)
    at ContextifyScript.Script.runInThisContext (vm.js:21:12)
    at REPLServer.defaultEval (repl.js:313:29)
    at bound (domain.js:280:14)
    at REPLServer.runBound [as eval] (domain.js:293:12)
    at REPLServer.<anonymous> (repl.js:513:10)
```
It seems the string does not exceed the spec limit. Are there any other undocumented (or documented elsewhere) limits for `fs.readFileSync()` or `Buffer.toString()`?
I've found the de facto limit for the current V8: 268,435,440 characters (`Math.pow(2, 28) - 16`), i.e. 536,870,880 bytes in UTF-16.
This test code is OK:

```js
const fs = require('fs');
fs.writeFileSync('bigfile.txt', `\uFEFF${'*'.repeat(Math.pow(2, 28) - 16 - 1)}`, 'utf16le');
console.log(fs.readFileSync('bigfile.txt', 'utf16le').length);
```
If I add just one more character, it throws the error. Should this be documented somewhere?
FWIW the limit comes from here. ChakraCore uses a much different value that depends on `INT_MAX` (on my system that would be 2,147,483,646, roughly 8x larger than V8's static limit). With that in mind, I'm not sure how useful it is to document a VM-specific limit like this...
I think the docs recommendation should be (if it is not already) to use raw Buffers for any very large data. (IIRC, `toString()` on a buffer of that size is not exactly trivial?)
Just in case anyone else finds themselves at this issue from Google: I ran into this while trying to synchronously (no readline, streams, etc.) read a 400 MB JSON file line by line. As suggested, I used raw Buffers to solve this, aided by the buffer-split package.
```js
const fs = require('fs');
const bsplit = require('buffer-split');

function readLineJSON(path) {
  const buf = fs.readFileSync(path); // omitting the encoding returns a Buffer
  const delim = Buffer.from('\n');
  return bsplit(buf, delim)
    .map((x) => x.toString())
    .filter((x) => x !== '')
    .map(JSON.parse);
}
```
@vsemozhetbyt … is there anything here you’d like to see? Would you want to open a docs PR yourself?
I have no definite opinion on what should be added or how. There seems to be no consensus on whether we should document engine-specific limits, so feel free to close this until a new decision is made.
We should certainly improve the error message:

```cpp
#define SB_STRING_TOO_LONG_ERROR \
  v8::Exception::Error(OneByteString(isolate, "\"toString()\" failed"))
```
Edit: @addaleax Just noticed your comment in the code. I could not find an open issue for this, is there? Any specific reason this has not been changed yet?

> I could not find an open issue for this, is there? Any specific reason this has not been changed yet?
@tniessen No, not beyond the discussion in https://github.com/nodejs/node/pull/12765. The reason this has not been changed yet is that since it’s semver-major it would target Node 9, which gives plenty of time, and the fact that at some point we’re going to have to go through our native errors anyway to upgrade them to the new error code system. (Also, most of the ToDos from that PR might be suitable for first-time contributions from people with a C++ background.)
In which version of Node should we expect huge files to be supported?
@Extarys As per this blog post, the max string length was increased in V8 6.2, i.e. the latest Node.js LTS version (8.11.3) already supports them.
To be more exact: the new limit is mentioned in the "Increased max string length" section: `2**30 - 25` on 64-bit systems. That is 1,073,741,799 code units, or nearly 1 GB in ASCII and nearly 2 GB in UTF-16 LE (the UTF-8 limit is less predictable, but around 1 GB should be OK at least).
Thanks for this update! That will be a great help when importing big logs and the like.
Sorry to ping on a closed thread, but I can't Google my way out of asking this: how do I set `buffer.constants.MAX_STRING_LENGTH` to the new maximum? The docs say:

> `<integer>` The largest length allowed for a single string instance.
>
> Represents the largest length that a string primitive can have, counted in UTF-16 code units. This value may depend on the JS engine that is being used.

but I'm not familiar with UTF-16 code units or how to use them. Do I just write `buffer.constants.MAX_STRING_LENGTH = 2**30 - 25`? I found this blog post, which uses the `2**30...` syntax.
You can't change the `buffer.constants.MAX_STRING_LENGTH` value; it is read-only. You can only use it to retrieve information about the limit set by the JS engine.
Ah, thank you for clarifying. Just `console.log` it?

Or use it in `===`, `>`, `<` etc. comparisons.