Flow: No utf-8 support

Created on 15 Jun 2015 · 8 Comments · Source: facebook/flow

Use any valid non-ASCII ECMAScript character in your variable names and you're going to get this error: Unexpected token ILLEGAL
I think this is a serious issue because it's stopping some users from using your checker.

And man, we are in 2015. Why is UTF-8 support still an issue?

help wanted parsing

matematicapentrutoti

Most helpful comment

The commit above should improve things a lot. We should now support unicode identifiers and whitespace, and although we already "supported" unicode in strings, the loc and range in the AST and in error locations are now accurate.

We don't yet support \u escapes in identifiers. So let 💩 = "poop" works, but let \u{1F4A9} = "poop" doesn't, yet.
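For anyone unsure what "\u escapes in identifiers" means: ECMAScript lets an identifier be spelled with escape sequences, and the escaped and literal spellings name the same binding. A small sketch in plain JavaScript (not Flow-specific), using ε and π from the d3 snippet in this thread since both are valid identifier start characters:

```javascript
// Both declarations use escape sequences inside the identifier itself.
// \u03b5 is the ES5 four-digit form, \u{3c0} is the ES2015 braced form.
var \u03b5 = 1e-6;      // same binding as `ε`
var \u{3c0} = Math.PI;  // same binding as `π`

// The literal spellings refer to the variables declared above.
console.log(ε === 1e-6);
console.log(π === Math.PI);
```

A lexer that handles literal Unicode identifiers but not these escapes would accept the last two lines and reject the two declarations.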

mroch on 6 Apr 2017

All 8 comments

http://es5.github.io/x7.html#x7.6

matematicapentrutoti on 15 Jun 2015

TL;DR - it's a known problem and limitation of our current lexer. We have ideas for how to fix it but it's probably a bunch of work and we won't be able to get to it soon. We're happy to take PRs though :)

It's a current limitation of the lexer. We've hand-written the parser but still use a lexer generator to lex the file. ocamllex, the lexer generator we use, generates a state machine that goes byte by byte. So sure, we could lex the unicode characters as UTF-8 and just teach the state machine about it.

The big problem with dealing with UTF-8 in ocamllex is that the spec for which Unicode characters are allowed where is really complicated. Esprima, a JS parser written in JS, uses a giant regex with UTF-16 ranges to decide if a Unicode character is allowed as the start of an identifier or in an identifier. When dealing with UTF-8, you don't have the luxury of these UTF-16 ranges. I wrote a little script to generate a UTF-8 regex for the allowed character classes, and it was ginormous. Plugging it into ocamllex resulted in way too many states for the poor little state machine.
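To make the range problem concrete, here is a rough sketch (not the script mentioned above, which isn't shown in this thread) of how code points turn into UTF-8 bytes. The helper names `utf8Bytes` and `hex` are made up for illustration, and surrogate code points are not rejected. Note how the small contiguous range ε..π (U+03B5..U+03C0) already crosses a lead-byte boundary, so a byte-level lexer needs more than one pattern to cover it:

```javascript
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
function utf8Bytes(codePoint) {
  if (codePoint < 0x80) return [codePoint];
  if (codePoint < 0x800) {
    return [0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f)];
  }
  if (codePoint < 0x10000) {
    return [
      0xe0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3f),
      0x80 | (codePoint & 0x3f),
    ];
  }
  return [
    0xf0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3f),
    0x80 | ((codePoint >> 6) & 0x3f),
    0x80 | (codePoint & 0x3f),
  ];
}

// Hypothetical helper: format bytes as lowercase hex for display.
const hex = (bytes) => bytes.map((b) => b.toString(16)).join(" ");

console.log(hex(utf8Bytes(0x3b5)));   // ε  -> ce b5
console.log(hex(utf8Bytes(0x3c0)));   // π  -> cf 80 (different lead byte)
console.log(hex(utf8Bytes(0x1f4a9))); // 💩 -> f0 9f 92 a9
```

Every code point range in the identifier tables has to be split along boundaries like these, which is where the state count comes from.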

The fix will probably be handwriting a new lexer, switching to a different lexer generator, or pre-processing the input source code. So yeah, despite it being 2015 unicode support is still non-trivial :(

gabelevi on 15 Jun 2015

A quick, incomplete and dirty hack is to support at least Latin Extended A & B. I'm not suggesting support for precomposed characters. This kind of support would cover all Latin alphabets in Europe, South and North America and Africa. There are only 336 characters, so this should not be too taxing on your state machine.
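As a sanity check on that number, assuming "Latin Extended A & B" means the Unicode blocks U+0100..U+017F and U+0180..U+024F: all of those code points are two-byte UTF-8 sequences, and they fit in two byte-range patterns. The function name below is hypothetical, and this is JavaScript for illustration only, not ocamllex syntax:

```javascript
// Hypothetical check: do these two bytes encode a code point in
// Latin Extended-A (U+0100-U+017F) or Latin Extended-B (U+0180-U+024F)?
function isLatinExtendedABytes(b0, b1) {
  // Lead bytes C4..C8 cover U+0100..U+023F with any continuation byte.
  if (b0 >= 0xc4 && b0 <= 0xc8) return b1 >= 0x80 && b1 <= 0xbf;
  // Lead byte C9 covers U+0240..U+024F with continuation bytes 80..8F.
  if (b0 === 0xc9) return b1 >= 0x80 && b1 <= 0x8f;
  return false;
}

// Count how many two-byte code points (U+0080-U+07FF) the check accepts.
const encoder = new TextEncoder();
let latinExtendedCount = 0;
for (let cp = 0x80; cp <= 0x7ff; cp++) {
  const [b0, b1] = encoder.encode(String.fromCodePoint(cp));
  if (isLatinExtendedABytes(b0, b1)) latinExtendedCount++;
}
console.log(latinExtendedCount); // 336
```

So the byte-level rule would be roughly `[C4-C8][80-BF] | C9[80-8F]`, which is a handful of extra states rather than thousands.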

matematicapentrutoti on 16 Jun 2015

Non-ISO-8859-1 identifiers are used in the popular d3 library:

var ε = 1e-6, ε2 = ε * ε, π = Math.PI, τ = 2 * π, τε = τ - ε, halfπ = π / 2, d3_radians = π / 180, d3_degrees = 180 / π;

When trying to import this library from a Flow-checked file

/* @flow */
import * as d3 from 'd3'
...

Flow fails to parse d3.js and therefore does not allow importing it:

js/myfile.js:3
  3: import * as d3 from 'd3'
                         ^^^^ d3. Required module not found

node_modules/d3/d3.js:1260
1260:   var ?? = 1e-6, ε2 = ε * ε, π = Math.PI, τ = 2 * π, τε = τ - ε, halfπ = π / 2, d3_radians = π / 180, d3_degrees = 180 / π;
            ^ Unexpected token ILLEGAL

tomkur on 26 Feb 2016

Curiously, simply quoting keys in an object fixes the issue:

const CONVERSIONS = {
  m: 1,
  cm: 100,
  mm: 1000,
  "μm": 1000000
}
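A likely reason this works is that the quoted key is lexed as a string literal rather than as an identifier, and Unicode inside strings was already tolerated. The cost is that the key can only be read with bracket access. A small sketch, where `fromMeters` is a made-up helper and not part of any library:

```javascript
const CONVERSIONS = {
  m: 1,
  cm: 100,
  mm: 1000,
  "μm": 1000000, // quoted, so the lexer sees a string, not an identifier
};

// Hypothetical helper: convert a length in meters to the given unit.
function fromMeters(value, unit) {
  // Bracket access keeps the non-ASCII name inside a string as well;
  // CONVERSIONS.μm would put it back into identifier position.
  return value * CONVERSIONS[unit];
}

console.log(fromMeters(2, "μm")); // 2000000
```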

FlowType Try

STRML on 21 Aug 2016

@gabelevi I tried looking into the source but I am new to OCaml. Could you use something like [$_[:alpha:]][$_[:alnum:]]* for identifiers? I've been using this pattern for syntax highlighting in Sublime and it works with Chinese, Russian, etc. characters.
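For comparison, the same idea can be written in JavaScript itself with Unicode property escapes. This is only a sketch of the character-class approach: it assumes a runtime with ES2018 regex support, says nothing about what ocamllex can express, and ignores reserved words and \u escapes. `IDENTIFIER_NAME` and `isIdentifierName` are names invented here:

```javascript
// Start: $, _, or any ID_Start character.
// Rest:  $, ZWNJ (U+200C), ZWJ (U+200D), or any ID_Continue character.
const IDENTIFIER_NAME = /^[$_\p{ID_Start}][$\u200c\u200d\p{ID_Continue}]*$/u;

// Hypothetical helper: does the whole string look like an identifier name?
function isIdentifierName(text) {
  return IDENTIFIER_NAME.test(text);
}

for (const sample of ["ε2", "halfπ", "变量", "2π", "💩"]) {
  console.log(sample, isIdentifierName(sample));
}
```

The first three samples are accepted. "2π" is rejected because it starts with a digit, and "💩" is rejected because it is a symbol rather than an ID_Start character.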