JavaScript is, without some custom boilerplate, unable to properly deal with Unicode characters/codepoints outside the BMP, i.e., ones whose encoding requires more than 16 bits.
This limitation seems to carry over to PEG.js, as shown in the example below.
In particular, I'd like to be able to specify ranges such as `[\u1D400-\u1D419]` (which presently turns into `[ᵀ0-ᵁ9]`) or equivalently `[𝐀-𝐙]` (which throws an "Invalid character range" error). (And using the newish ES6 notation `[\u{1D400}-\u{1D419}]` results in the following error: `SyntaxError: Expected "!", "$", "&", "(", ".", character class, comment, end of line, identifier, literal, or whitespace but "[" found.`)
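For reference, plain JavaScript treats `\u` escapes the same way, which is where the mangled range comes from: `\uXXXX` always consumes exactly four hex digits, so `\u1D400` is read as U+1D40 plus a literal `0`. A quick illustration (nothing PEG.js-specific):

```js
'\u1D400';                       // 'ᵀ0'  -> U+1D40 followed by the character "0"
'\u{1D400}' === '\uD835\uDC00';  // true  -> the astral character is a surrogate pair
'\u{1D400}'.length;              // 2     -> two UTF-16 code units
```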
Might there be a way to make this work that does not require changes to PEG.js?
Example code:
This grammar:
//MathUpper = [𝐀-𝐙]+
MathUpperEscaped = [\u1D400-\u1D419]+
Expected behavior:
The parser generated from the given grammar successfully parses, for example, "𝐀𝐁𝐂".
Actual behavior:
A parse error: Line 1, column 1: Expected [ᵀ0-ᵁ9] but "
(Or, when uncommenting the other rule, an "Invalid character range" error.)
To be entirely honest, aside from updating Unicode support for the PEG.js grammar parser and the JavaScript example, I have little to no knowledge about Unicode, so I am currently unable to fix this problem (it's clearly stated in both of those grammars: _Non-BMP characters are completely ignored_).
For now, while working on more urgent personal and work-related projects (including _PEG.js 0.x_), I'll continue waiting for someone who understands Unicode better to offer some PRs, or eventually get around to it after _PEG.js v1_, sorry buddy.
FYI, surrogate pairs appear to work. The grammar
start = result:[\uD83D\uDCA9]+ {return result.join('')}
parses 💩, which is U+1F4A9. Note the `result.join('')` puts the surrogate pair back together; otherwise you get `['\uD83D','\uDCA9']` instead of hankey. Ranges would be problematic.
More on surrogate pairs: https://en.wikipedia.org/wiki/UTF-16#U+010000_to_U+10FFFF
This by no means replaces what the OP asked for.
@drewnolan Thanks for the heads up!
Unfortunately that grammar also parses `\uD83D\uD83D`.
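The same behaviour can be reproduced with a plain JS regex, since a character class without the `u` flag ranges over individual code units; this is just an illustration, not PEG.js output:

```js
/^[\uD83D\uDCA9]+$/.test('\uD83D\uDCA9');  // true  (💩, as intended)
/^[\uD83D\uDCA9]+$/.test('\uD83D\uD83D');  // true  (two lone high surrogates also match)
/^[\uD83D\uDCA9]+$/u.test('\uD83D\uD83D'); // false (with /u the class is the code point U+1F4A9)
```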
For others who have run into this issue: I'm lucky in that I only need to handle a small subset of codepoints outside the BMP, so I ended up mapping them into the BMP's private use area prior to parsing and inverting this mapping right afterwards.
This solution is obviously fraught with problems in the general case, but it works well in my problem domain.
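For anyone wanting to copy this trick, here is a minimal sketch of the mapping approach; the table and helper names are made up for illustration, and only the handful of needed code points are mapped:

```js
// Map a small set of astral code points into the BMP Private Use Area
// before parsing, and invert the mapping afterwards.
const astralToPua = new Map([
  [0x1D400, 0xE000], // 𝐀 -> U+E000
  [0x1D401, 0xE001], // 𝐁 -> U+E001
  // ...one entry per astral code point the grammar needs
]);
const puaToAstral = new Map([...astralToPua].map(([a, p]) => [p, a]));

const toPua = s => [...s]
  .map(ch => astralToPua.has(ch.codePointAt(0))
    ? String.fromCharCode(astralToPua.get(ch.codePointAt(0)))
    : ch)
  .join('');

const fromPua = s => [...s]
  .map(ch => puaToAstral.has(ch.charCodeAt(0))
    ? String.fromCodePoint(puaToAstral.get(ch.charCodeAt(0)))
    : ch)
  .join('');

// parser.parse(toPua(input)), then run fromPua() over any matched strings.
```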
@futagoza - I'll do my best to explain. You are facing several problems here.
`utf-16`, `ucs4`, et cetera: these are the ways that the codepoints, which are the intended data, are encoded as bytes. `utf-16-le`, for example, lets you encode most letters as single two-byte code units, but uses groups of code units to express high-value characters up to `0x10ffff`.
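For concreteness, this is how code units and code points look from plain JavaScript:

```js
const s = '\u{1D400}';           // 𝐀 MATHEMATICAL BOLD CAPITAL A
s.length;                        // 2        -> two code units (a surrogate pair)
s.charCodeAt(0).toString(16);    // 'd835'   -> high surrogate
s.charCodeAt(1).toString(16);    // 'dc00'   -> low surrogate
s.codePointAt(0).toString(16);   // '1d400'  -> the actual code point
[...s].length;                   // 1        -> the string iterator walks code points
```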
I want this too, but realistically, this is not going to happen.
holy shit, someone committed a full string parser replacement almost a year ago, and they recognized the overhead, so they let us use standard JS strings usually
WHY ISN'T THAT MERGED
@StoneCypher I love the fire in your heart! But why give the current maintainer a hard time? No one is owed anything. Why not maintain your own fork?
There is no current maintainer. The person who took PEG over has never released anything. He worked on the next minor for three years, then said he didn't like how it looked, is throwing all of peg.js
away, and starting over from something he wrote from scratch in a different language, with an incompatible AST.
The tool has lost half its userbase waiting three years on this man to commit one-line fixes that other people wrote, like es6 module support, typescript support, arrow support, extended unicode, et cetera.
There are a dozen people asking him to merge and he keeps saying "no, this is my hobby project now and I don't like what it is"
Lots of people have companies based on this parser. They're completely screwed.
This man promised to be a maintainer for an extremely important tool, and hasn't done any maintenance. It's time to let someone else keep this library in good standing now.
Why not maintain your own fork?
I have for three years now. My peg has almost a third of the issue tracker fixed.
I had to clone it, rename it, and make a new fork to fix the size problem to try to commit it, because mine has drifted too much
It's time for everyone else to receive these fixes, as well as the ones that have been sitting in the tracker since 2017.
This guy isn't maintaining peg; he's letting it die.
It's time for change.
@drewnolan - so, I'm not sure whether this is interesting or not, but, surrogate pairs don't actually work. It's just that, by coincidence, they usually do.
In order to understand the underlying problem you have to think about the encoding level bit pattern, not the representation level one.
That is, if you have a Unicode value of 240, most people will think "Oh, he means `0b1111 0000`." But actually, that's not how it gets represented; anything over 127 is represented by two bytes, because the top bit is a flag, not a value bit. So 240 is actually `0b0000 0001 0111 0000` in storage (except in UTF-7, which is real and not a typo, where things get super weird. And yes, I know Wikipedia says it's not in use. Wikipedia is wrong. It's what SMS is sent over; it might be the most common character encoding by total traffic.)
So here's the problem.
If you write a byte STUV WXYZ, then in utf16, from ucs4 data, if your thing gets cut in half, pretty often you can just staple it back together.
One time in 128 you cannot, for characters natively over two bytes' encoding. (Sounds like an awfully specific number, doesn't it?)
Because when that top bit in the second byte position is in use, cutting it in half will add a zero where that should have been a one. Stapling them back together side by side as binary data doesn't remove the value concept again. The decoded value is not equivalent to the encoded value, and you're appending decodings, not encodings.
It so happens that most emoji are outside this range. However, big chunks of several languages aren't, including rare chinese, most math, and musical symbols.
Granted, your approach is good enough for almost all emoji, and for every human language common enough to get in by Unicode 6, which is a huge improvement over the status quo.
But this PR really should be merged, once it's sufficiently tested both for correctness and against unexpected performance problems (remember, Unicode performance problems are why PHP died).
It seems the `.` (dot character) expression also needs Unicode mode. Compare:
const string = '-🐎-👱-';
const symbols = (string.match(/./gu));
console.log(JSON.stringify(symbols, null, ' '));
const pegResult = require('pegjs/')
.generate('root = .+')
.parse(string);
console.log(JSON.stringify(pegResult, null, ' '));
Output:
[
 "-",
 "🐎",
 "-",
 "👱",
 "-"
]
[
 "-",
 "\ud83d",
 "\udc0e",
 "-",
 "\ud83d",
 "\udc71",
 "-"
]
I worked recently on this, using #616 as a basis and modifying it to use the ES6 syntax `\u{hhhhhh}`; I will create a PR in a few hours.
Computing the UTF-16 regex ranges split by surrogates is a bit complicated, so I used https://github.com/mathiasbynens/regenerate for this; it would be the first dependency of the pegjs package, which I hope is acceptable (there are also polyfills for Unicode properties which could be added as dependencies, see #648). See Wikipedia if you don't know UTF-16 surrogates.
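For illustration, this is roughly how regenerate computes that split for the range from the original post (output shown as comments; exact formatting may differ slightly):

```js
const regenerate = require('regenerate');

regenerate().addRange(0x1D400, 0x1D419).toString();
// -> '\uD835[\uDC00-\uDC19]'

regenerate().addRange(0x1D400, 0x1D419).toString({ hasUnicodeFlag: true });
// -> '[\u{1D400}-\u{1D419}]'
```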
To make PEG.js compatible with the whole of Unicode, there are various levels, the last one (point 5) being the rule `.` catching 1 or 2 code units. For most points we can manage to be backward-compatible and generate parsers very similar to older ones, except for point 5, because the result of a parsing can depend on whether the dot rule captures one or two code units. For this I propose to add a runtime option to let the user choose between two or three behaviours.
Regex classes can be analysed statically during parser generation to check if they have a fixed length (in number of code units). There are 3 cases:

1. only BMP / single-code-unit alternatives;
2. only two-code-unit (surrogate pair) alternatives;
3. one or two code units, depending on the input at runtime.

For now the bytecode assumes a regex class is always one code unit (see here). With the static analysis, we could change this parameter of the bytecode instruction to 1 or 2 for the first two cases. But for the third case, I guess a new bytecode instruction should be added to obtain, at runtime, the number of code units matched and increase the cursor accordingly. Other options without a new bytecode instruction would be: 1. to always compute the number of code units matched, but this is a performance penalty during parsing for BMP-only parsers, so I don't like this option; 2. to check whether the current code unit is a high surrogate followed by a low surrogate and increment by 1 or 2, but this assumes the grammar always has well-formed UTF-16 surrogates, without the possibility to write grammars with lone surrogates (see next point), and it is also a performance penalty for BMP-only parsers.
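For the third case, here is a rough sketch of what the runtime check could look like; this is an illustration only, not the actual bytecode or its instruction names:

```js
// The class is pre-split into a BMP regex and a surrogate-pair regex; the
// cursor advance (1 or 2 code units) is only known once the input is seen.
function matchClassLength(input, pos, bmpRegex, astralRegex) {
  if (astralRegex && pos + 1 < input.length &&
      astralRegex.test(input.substring(pos, pos + 2))) {
    return 2; // consumed a surrogate pair
  }
  if (bmpRegex && bmpRegex.test(input.charAt(pos))) {
    return 1; // consumed a single code unit
  }
  return 0;   // no match
}

// e.g. a class covering "a" and U+1D400..U+1D419:
matchClassLength('\u{1D400}', 0, /^[a]$/, /^\uD835[\uDC00-\uDC19]$/); // 2
matchClassLength('a', 0, /^[a]$/, /^\uD835[\uDC00-\uDC19]$/);         // 1
```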
There is the question of lone surrogates (a high surrogate without a low surrogate after it, or a low surrogate without a high surrogate before it). My opinion is that a regex class should be exclusively one or the other: either lone surrogates or well-formed UTF-16 Unicode characters (BMP, or a high surrogate followed by a low surrogate). Otherwise there is the danger that grammar authors unaware of UTF-16 subtleties mix both and do not understand the result; grammar authors who want to manage UTF-16 surrogates themselves can do it with PEG rules describing the links between specific high and low surrogates. I propose to add a visitor enforcing this rule during parser generation.
To conclude, it is probably easier to manage the question of lone surrogates in PEG than in regexes, because the PEG parser always advances: either the next code unit is recognised or it is not. This is in contrast with regexes, where backtracking could possibly associate or dissociate a high surrogate with a low surrogate and consequently change the number of Unicode characters matched, etc.
The PR for the ES6 syntax for astral Unicode characters is #651, based on #616, and the development for classes is https://github.com/Seb35/pegjs/commit/0d33a7a4e13b0ac7c55a9cfaadc16fc0a5dd5f0c, implementing points 2 and 3 in my comment above, with only a quick hack for the cursor increment (point 4) and nothing for now for the dot rule `.` (point 5).
My current development about this issue is mostly finished; the more advanced work is in https://github.com/Seb35/pegjs/tree/dev-astral-classes-final. All five points mentioned above are treated, and the global behaviour tries to mimic JS regexes regarding the edge cases (and there are a lot of them).
The global behaviour is governed by the option `unicode`, similar to the flag `unicode` in JS regexes: the cursor is increased by 1 Unicode character (1 or 2 code units) depending on the actual text (e.g. `[^a]` matches the text "💯" and the cursor is increased by 2 code units). When the option `unicode` is false, the cursor is always increased by 1 code unit.
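Hypothetical usage, assuming the `unicode` option lands as described above (it is not part of any released PEG.js):

```js
const peg = require('pegjs');
const parser = peg.generate('start = [^a]', { unicode: true });
parser.parse('\u{1F4AF}'); // "💯": one Unicode character, two code units consumed
```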
Regarding the input, I'm not sure if we should design PEG.js the same way as JS regexes: should we authorise `[\u{1F4AD}-\u{1F4AF}]` (equivalent to `[\uD83D\uDCAD-\uD83D\uDCAF]`) in the grammar in non-Unicode mode? We can make a distinction between "input Unicode" and "output Unicode".
Personally, I guess I would prefer that we authorise input Unicode, either permanently or with an option defaulting to `true`, since there is no significant overhead and this would enable the possibility for everyone by default; but output Unicode should default to `false`, because the performance of the generated parsers is better (always a cursor increment of 1).
Regarding this issue in general (and about output Unicode defaulting to `false`), we should keep in mind that it is already possible to encode Unicode characters in our grammars, at the price of understanding how UTF-16 works:
// rule matching [\u{1F4AD}-\u{1F4AF}]
my_class = "\uD83D" [\uDCAD-\uDCAF]
// rule matching any Unicode character
my_strict_unicode_dot_rule = $( [\u0000-\uD7FF\uE000-\uFFFF] / [\uD800-\uDBFF] [\uDC00-\uDFFF] )
// rule matching any Unicode character or a lone surrogate
my_loose_unicode_dot_rule = $( [\uD800-\uDBFF] [\uDC00-\uDFFF]? / [\u0000-\uFFFF] )
So a grammar author who wants both a quick parser and the ability to recognise Unicode characters in specific parts of her/his grammar can use such rules. Consequently this issue is merely about simplifying Unicode management without diving into UTF-16 internals.
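For example, the strict dot rule above already works with current PEG.js releases, without any new option; a quick sanity check (assuming `pegjs` is installed):

```js
const peg = require('pegjs');
const parser = peg.generate(`
  start = my_strict_unicode_dot_rule+
  my_strict_unicode_dot_rule = $( [\\u0000-\\uD7FF\\uE000-\\uFFFF] / [\\uD800-\\uDBFF] [\\uDC00-\\uDFFF] )
`);
parser.parse('a\u{1F4AF}b'); // [ 'a', '💯', 'b' ]
```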
About the implementation: in my first attempt I considered that the grammar text was encoded in Unicode characters and the 'dot' rule of the PEG.js parser recognised Unicode characters. The second and final attempt reverted this (the dot rule is always 1 code unit, for quicker parsing), and there is a small algorithm in the visitor prepare-unicode-classes.js to reconstruct Unicode characters that were split inside character classes (e.g. `[\uD83D\uDCAD-\uD83D\uDCAF]` is syntactically recognised as `[ "\uD83D", [ "\uDCAD", "\uD83D" ], "\uDCAF" ]` and this algorithm transforms it to `[ [ "\uD83D\uDCAD", "\uD83D\uDCAF" ] ]`). I considered writing this in the grammar itself, but it would have been long and, more importantly, there are multiple ways to encode the characters ("💯", "\uD83D\uDCAF", "\u{1F4AF}"), so it is easier to write it in a visitor.
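A rough sketch of that reconstruction step (an illustration, not the actual prepare-unicode-classes.js code):

```js
// Re-join surrogate pairs that the code-unit-level class parser split apart,
// e.g. [ "\uD83D", [ "\uDCAD", "\uD83D" ], "\uDCAF" ]
//   -> [ [ "\uD83D\uDCAD", "\uD83D\uDCAF" ] ]
const isHigh = cu => cu >= '\uD800' && cu <= '\uDBFF';
const isLow  = cu => cu >= '\uDC00' && cu <= '\uDFFF';

function joinSurrogates(parts) {
  // 1. Flatten to a stream of code units, marking range dashes.
  const tokens = [];
  for (const p of parts) {
    if (Array.isArray(p)) tokens.push(p[0], '-', p[1]);
    else tokens.push(p);
  }
  // 2. Glue adjacent high+low surrogates back into single characters.
  const chars = [];
  for (let i = 0; i < tokens.length; i++) {
    if (isHigh(tokens[i]) && isLow(tokens[i + 1])) {
      chars.push(tokens[i] + tokens[i + 1]);
      i++;
    } else {
      chars.push(tokens[i]);
    }
  }
  // 3. Rebuild single characters and [lo, hi] ranges.
  const result = [];
  for (let i = 0; i < chars.length; i++) {
    if (chars[i + 1] === '-' && chars[i + 2] !== undefined) {
      result.push([chars[i], chars[i + 2]]);
      i += 2;
    } else {
      result.push(chars[i]);
    }
  }
  return result;
}
```

This naive version assumes the class contains no literal "-" character; the real visitor has to be more careful about that.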
I added two opcodes in the second attempt:

- one that checks whether the next two code units form a surrogate pair: `(input.charCodeAt(currPos) & 0xFC00) === 0xD800 && input.length > currPos + 1 && (input.charCodeAt(currPos+1) & 0xFC00) === 0xDC00`
- one that tests a two-code-unit class: `classes[c].test(input.substring(currPos, currPos+2))`

The matched code units are then consumed with `ACCEPT_N`, and the character classes are split into two regexes of fixed length (1 or 2 code units). I did some optimisations with the "match" feature, eliminating during generation the "dead code" paths depending on the mode (Unicode or not) and the character class.
Note also that regexes are always positive in this implementation: inverted classes are handled by negating the result in the bytecode. This made it easier to avoid edge cases around surrogates.
I wrote some documentation but I will probably add some more (perhaps a guide to quickly explain the details of the option(s) unicode and the snippets with the home-made Unicode dot rule). I will add tests before submitting it as a PR.