Use any valid non-ASCII ECMAScript character in your variable names and you're going to get this error: Unexpected token ILLEGAL
I think this is a serious issue because it's stopping some users from using your checker.
And man, we are in 2015, why is UTF-8 support still an issue?
TL;DR - it's a known problem and limitation of our current lexer. We have ideas for how to fix it but it's probably a bunch of work and we won't be able to get to it soon. We're happy to take PRs though :)
It's a current limitation of the lexer. We've hand-written the parser but still use a lexer generator to lex the file. ocamllex, the lexer generator we use, generates a state machine that goes byte by byte. So sure, we could lex the unicode characters as UTF-8 and just teach the state machine about it.
The big problem with dealing with UTF-8 in ocamllex is that the spec for which Unicode characters are allowed where is really complicated. Esprima, a JS parser written in JS, uses a giant regex with UTF-16 ranges to decide if a Unicode character is allowed as the start of an identifier or in an identifier. When dealing with UTF-8, you don't have the luxury of these UTF-16 ranges. I wrote a little script to generate a UTF-8 regex for the allowed character classes, and it was ginormous. Plugging it into ocamllex resulted in way too many states for the poor little state machine.
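To make the byte-by-byte problem concrete, here is a minimal sketch (not the script mentioned above) of how a single code point turns into a multi-byte UTF-8 sequence. The helper names `utf8Bytes` and `hex` are made up for illustration, and surrogate code points are not rejected. A byte-oriented state machine has to match every such sequence one byte at a time, which is why a range that is trivial to write in code points expands into many byte-level alternatives.

```javascript
// Encode one Unicode code point as UTF-8 bytes, following the standard
// UTF-8 layout. Hypothetical helper; it does not validate surrogates.
function utf8Bytes(codePoint) {
  if (codePoint < 0x80) return [codePoint];
  if (codePoint < 0x800) {
    return [0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f)];
  }
  if (codePoint < 0x10000) {
    return [
      0xe0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3f),
      0x80 | (codePoint & 0x3f),
    ];
  }
  return [
    0xf0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3f),
    0x80 | ((codePoint >> 6) & 0x3f),
    0x80 | (codePoint & 0x3f),
  ];
}

// Format a byte array as space-separated hex for display.
const hex = (bytes) => bytes.map((b) => b.toString(16)).join(" ");

console.log(hex(utf8Bytes(0x3c0)));  // U+03C0 (π) is two bytes
console.log(hex(utf8Bytes(0x4e2d))); // U+4E2D (中) is three bytes
```

The lexer never sees "π" as one unit, only the byte 0xCF followed by 0x80, so every allowed identifier character has to be spelled out as a path of byte transitions.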
The fix will probably be handwriting a new lexer, switching to a different lexer generator, or pre-processing the input source code. So yeah, despite it being 2015 unicode support is still non-trivial :(
A quick, incomplete and dirty hack is to support at least Latin Extended A & B. I'm not suggesting support for precomposed characters. This kind of support would cover all Latin alphabets in Europe, South and North America and Africa. There are only 336 characters, so this should not be too taxing on your state machine.
Non-ISO-8859-1 identifiers are used in the popular d3 library:
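As a rough illustration of why this range would be cheap for a byte-oriented lexer: assuming "Latin Extended A & B" means the Unicode blocks U+0100 through U+024F, every character in it is a two-byte UTF-8 sequence, and the whole range collapses to two byte-range rules. The function name `isLatinExtendedAB` is hypothetical, and this sketch only tests block membership, not whether each character is a valid identifier character.

```javascript
// Assumed range: Latin Extended-A and Latin Extended-B, U+0100..U+024F.
const FIRST = 0x0100;
const LAST = 0x024f;
console.log(LAST - FIRST + 1); // size of the range in code points

// Byte-level membership test: U+0100 encodes as C4 80 and U+024F as C9 8F,
// so the range is [C4-C8][80-BF] plus C9[80-8F].
function isLatinExtendedAB(b0, b1) {
  if (b0 >= 0xc4 && b0 <= 0xc8) return b1 >= 0x80 && b1 <= 0xbf;
  if (b0 === 0xc9) return b1 >= 0x80 && b1 <= 0x8f;
  return false;
}

// Count how many two-byte sequences the rule accepts.
let accepted = 0;
for (let b0 = 0xc2; b0 <= 0xdf; b0++) {
  for (let b1 = 0x80; b1 <= 0xbf; b1++) {
    if (isLatinExtendedAB(b0, b1)) accepted++;
  }
}
console.log(accepted);
```

Both numbers come out the same, so the two byte rules cover exactly that range and nothing else.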
var ε = 1e-6, ε2 = ε * ε, π = Math.PI, τ = 2 * π, τε = τ - ε, halfπ = π / 2, d3_radians = π / 180, d3_degrees = 180 / π;
When trying to import this library from a flow-checked file
/* @flow */
import * as d3 from 'd3'
...
flow fails to parse d3.js and therefore does not allow importing it:
js/myfile.js:3
3: import * as d3 from 'd3'
^^^^ d3. Required module not found
node_modules/d3/d3.js:1260
1260: var ?? = 1e-6, ε2 = ε * ε, π = Math.PI, τ = 2 * π, τε = τ - ε, halfπ = π / 2, d3_radians = π / 180, d3_degrees = 180 / π;
^ Unexpected token ILLEGAL
Curiously, simply quoting keys in an object fixes the issue:
const CONVERSIONS = {
m: 1,
cm: 100,
mm: 1000,
"μm": 1000000
}
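A small sketch of that workaround in use. The likely reason it helps is that a quoted key is a string literal rather than an identifier, and strings were already handled by the lexer. The `toUnit` helper is hypothetical. Note that the lookup also goes through a string (bracket access); writing `CONVERSIONS.μm` would put the character back into identifier position, and presumably trigger the same parse error.

```javascript
const CONVERSIONS = {
  m: 1,
  cm: 100,
  mm: 1000,
  "μm": 1000000, // quoted, so the key is a string literal, not an identifier
};

// Hypothetical helper: convert a length in meters to the given unit.
function toUnit(meters, unit) {
  return meters * CONVERSIONS[unit]; // bracket access keeps the key a string
}

console.log(toUnit(2, "μm"));
```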
@gabelevi I tried looking into the source but I am new to OCaml. Could you use something like [$_[:alpha:]][$_[:alnum:]]* for identifiers? I've been using this pattern for syntax highlighting in Sublime and it works with Chinese, Russian, etc. characters.
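For comparison, here is roughly what that pattern looks like in JavaScript itself, which has no POSIX classes but does support Unicode property escapes with the `u` flag. This is only a sketch of the idea, not something ocamllex can consume: it works on code points rather than bytes, and the name `looksLikeIdentifier` is made up. It also ignores reserved words and \u escapes.

```javascript
// ID_Start / ID_Continue are the Unicode properties that ECMAScript
// identifier names are built on; $ and _ are allowed in addition,
// as are ZWNJ/ZWJ (U+200C, U+200D) after the first character.
const IDENTIFIER = /^[$_\p{ID_Start}][$\u200c\u200d\p{ID_Continue}]*$/u;

// Hypothetical helper: true if the string has the shape of an identifier.
function looksLikeIdentifier(name) {
  return IDENTIFIER.test(name);
}

for (const name of ["halfπ", "ε2", "переменная", "变量", "2π", "a-b"]) {
  console.log(name, looksLikeIdentifier(name));
}
```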
The commit above should improve things a lot. We should now support unicode identifiers and whitespace, and although we already "supported" unicode in strings, the loc and range in the AST and in error locations are now accurate.
We don't yet support \u escapes in identifiers. So let 💩 = "poop" works, but let \u{1F4A9} = "poop" doesn't, yet.
Tracking the \u-in-identifier issue in https://github.com/facebook/flow/issues/3837
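For anyone unsure what a \u escape in an identifier means: in ECMAScript, an escape inside an identifier name stands for the character itself, so the escaped and unescaped spellings refer to the same binding. A minimal sketch using π (U+03C0); the variable names are only for illustration, and per the comment above this is the form that is not parsed yet.

```javascript
// \u03C0 and \u{3C0} both spell the identifier character π.
const \u03C0 = Math.PI;      // declares π
const half\u{3C0} = π / 2;   // declares halfπ, reading π by its literal name

console.log(π === Math.PI);
console.log(halfπ === Math.PI / 2);
```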