Fable: String enumeration doesn't handle Unicode surrogate pairs

Created on 2 Dec 2017  ·  13Comments  ·  Source: fable-compiler/Fable

Description

Strings in both F# and JavaScript are UTF-16, i.e. sequence of 16-bit code units, so it should be possible to have the same code unit enumeration behavior. This is important for applications such as parsers. However it appears surrogate code units are lost in translation in Fable.

Given this string:

let unicodeString = ".\U0001f404."

It is 3 characters consisting of period, cow (U+1F404), and period:

printfn "%s" unicodeString
.🐄.

Since 1F404 is more than 16 bits, the cow is encoded in UTF-16 with two code units U+D83D U+DC04 (surrogate pair).

We can verify that the UTF-16 encoded string in F# and Javascript is actually 4 code units total:

printfn "length = %d" <| String.length unicodeString
length = 4

We can (maybe?) enumerate the code units and print their values:

unicodeString |> Seq.iter (fun c -> printfn "%c %x %d" c (int c) (int c))

CORRECT output F# Interactive:
. 2e 46
� d83d 55357
� dc04 56324
. 2e 46

INCORRECT output Fable:
. 2e 46
🐄 d83d 55357
. 2e 46

Fable enumerates the surrogate pair as one resolved character (but not for the int conversion where it only sees the first half of the pair).

Similar discrepancy with Seq.map and Seq.toList/List.ofSeq:

unicodeString |> List.ofSeq |> printfn "%A"
['.'; '�'; '�'; '.']  (F#)
[.; 🐄; .]  (Fable)

unicodeString |> Seq.map int |> Seq.toList |> printfn "%A"
[46; 55357; 56324; 46]  (F#)
[46; 55357; 46]  (Fable)

We can work around it by manually indexing the string:

let enumerateString s = seq { 0 .. String.length s - 1 } |> Seq.map ( fun i -> s.[i] )

unicodeString |> enumerateString |> List.ofSeq |> printfn "%A"
['.'; '�'; '�'; '.']  (F#)
[.; �; �; .]  (Fable)

unicodeString |> enumerateString |> Seq.map int |> Seq.toList |> printfn "%A"
[46; 55357; 56324; 46]  (F#)
[46; 55357; 56324; 46]  (Fable)

Should Fable handle string enumeration into code units like this natively?

Related information

  • Fable version (dotnet fable --version): 1.3.4
  • Operating system: Windows 10

Most helpful comment

@Pauan

Windows 10:

Browser | Indexing | charCodeAt | ES2015 iteration
-- | -- | -- | --
Chrome 62.0.3202.94 (Official Build) (64-bit) | 815,504 | 564,875 | 338,517
Firefox Developer Edition 58.0b8 (64-bit) | 1,385,233 | 1,831,598 | 381,374
Edge 41.16299.15.0/EdgeHTML 16.16299 | 1,625,288 | 215,262 | 24,892

All 13 comments

Thanks a lot for such a detailed report, @chadunit! Interesting, right now Fable is just using JS default string iteration. It's funny that indexing and iteration work differently. I know that they introduced full Unicode points in ES6, maybe in old browser indexing doesn't even work.

I remember that at the beginning I was pondering to transpile chars to numbers, but I decided to use strings at the end precisely because JS gives you 1-char strings when iterating a string. I think this was mainly for convenience, I don't remember if there was a special reason to use only native string iteration. If there's none I guess it's possible to create our custom iterator (based on indexing) whenever a string is iterated. Currently Fable ignores casting, but we can detect the case when a string is being upcast to char seq and use a custom function.

@alfonsogarciacaro ECMAScript 2015 string iteration does indeed iterate over Unicode code points. Unicode existed in earlier versions of ECMAScript, but it was cumbersome to use, so ECMAScript 2015 made it easier.

You can use an old-school for loop if you want to iterate over 16-bit code units:

const length = string.length;

for (let i = 0; i < length; ++i) {
  console.log(string[i]);
}

This will work in all browsers, and it's probably faster too.

If you want to return integers rather than 1-length strings, you can use string.charCodeAt instead:

const length = string.length;

for (let i = 0; i < length; ++i) {
  console.log(string.charCodeAt(i));
}

This also works in all browsers.

Also, I just now created a JSPerf which compares the various JS string iteration techniques.

As I expected, ES2015 iteration is extremely slow.

Surprisingly, it turns out that charCodeAt is significantly faster than all the other methods.

Thanks a lot for the comments and for the benchmark @Pauan, that's really helpful. Interestingly, in my case indexing was the fastest method (and also surprisingly charAt was 50% slower. Right now, there are other methods in Fable that are expecting char is a string at runtime, so it makes sense to use indexing for now :+1:

Indexing fastest for me too.

I don't know where the line is for Fable handling JS/F# behavior differences vs. user understanding and working around. I come from .NET background so JS aspect of Fable is more of an "implementation detail" to me :smile:. The fewer behavior differences the easier it is on my brain but depends on overall vision for Fable probably.

So I personally would prefer if Fable did coerce the iterator to behave exactly like F#. And maybe users wanting native ES2015 code point iteration could get it through JsInterop somehow? But the issue is easily worked around so I'd understand leaving it or at least considering it pretty low in priority.

@alfonsogarciacaro @chadunit What browsers are you using? I tested on Firefox 57.0b4 and Chrome 61.0.3163.100 (both on 64-bit Linux)

@chadunit Personally, I agree. Even though I'm a JavaScript guy, I'd rather Fable aim for good F# compatibility, even if it sometimes clashes with JS compatibility. Then you would use the FFI to bridge between F# and JS.

But I know I'm unusual, a lot of JavaScript people would much prefer for smooth JS compatibility by default, even if it clashes with F#.

@Pauan

Windows 10:

Browser | Indexing | charCodeAt | ES2015 iteration
-- | -- | -- | --
Chrome 62.0.3202.94 (Official Build) (64-bit) | 815,504 | 564,875 | 338,517
Firefox Developer Edition 58.0b8 (64-bit) | 1,385,233 | 1,831,598 | 381,374
Edge 41.16299.15.0/EdgeHTML 16.16299 | 1,625,288 | 215,262 | 24,892

@chadunit Very interesting, thanks! These are my results:

NixOS Unstable GNU/Linux (64-bit)

| Browser | Indexing | charCodeAt | ES2015 iteration |
| ---- | ---- | ---- | ---- |
| Chrome 61.0.3163.100 (Official Build) (64-bit) | 735,247 | 1,639,718 | 304,294 |
| Firefox Developer Edition 57.0b4 (64-bit) | 1,188,943 | 1,519,161 | 357,225 |

I don't really trust JSPerf though, so I should benchmark it with my custom benchmarking code.

When I benchmark it in my custom benchmark library, I get this result on NodeJS v6.11.1:

String iteration:
  +------------------+--------------+---------------+--------------+------------+
  | Name             | Milliseconds | Iterations/ms | % Difference |   % Slower |
  +------------------+--------------+---------------+--------------+------------+
  | charCodeAt       |    0.00160ms |  626.30893/ms |   100.00000% |   0.00000% |
  | Indexing         |    0.00207ms |  482.25136/ms |    76.99896% |  29.87188% |
  | codePointAt      |    0.00429ms |  233.29633/ms |    37.24940% | 168.46069% |
  | ES2015 iteration |    0.00960ms |  104.14109/ms |    16.62775% | 501.40425% |
  +------------------+--------------+---------------+--------------+------------+

So charCodeAt is still significantly faster than everything else (including ~30% faster than indexing). It sounds like there's some performance differences between Windows and Linux.

In any case, the real point is that ES2015 iteration is very slow, and also gives the incorrect answer in this situation, so using indexing would be a good idea.

@Pauan What about str.split('')? I just noticed we're already using that to translate .NET's str.ToCharArray(). Or it would be better to add our own implementation to fable-core?

function toCharArray(str) {
  if (typeof str !== "string") {
    throw new Error("Argument str is not of type string");
  }
  var len = str.length;
  var ar = new Array(len);
  for (var i = 0; i < len; i++) {
    ar[i] = str[i];
  }
  return ar;
}

Also FYI, I've already verified which code needs to be edited to fix this issue, but there's currently an optimization made by Fable that erases some coertions and I need to modify it in order to know exactly when a string is being cast to a char seq. That's why this is taking a bit longer.

@alfonsogarciacaro str.split('') will split based upon 16-bit code units (i.e. the correct F# behavior). In general anything that existed in JavaScript prior to ES2015 will be using 16-bit code units.

I don't think it's necessary to add a custom implementation, but it wouldn't hurt anything either. It might even be faster than using str.split(''). And if we have a custom implementation, it's easy to switch to charCodeAt later if we want to.

It took some time but this should be finally fixed in latest Fable 2 beta. Please reopen if there're still problems.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

et1975 picture et1975  ·  3Comments

MangelMaxime picture MangelMaxime  ·  3Comments

jwosty picture jwosty  ·  3Comments

SirUppyPancakes picture SirUppyPancakes  ·  3Comments

theprash picture theprash  ·  3Comments