Three.js: unicode character split issue in function createPaths(src/extras/core/Font.js)

Created on 4 May 2018 · 10 Comments · Source: mrdoob/three.js

Description of the problem

When creating paths for ordinary text, the current function createPaths( text, size, divisions, data ) works fine, but it fails for characters whose code point is above U+FFFF because of the split call below:
var chars = String( text ).split( '' );
For example, the character "𝄞" (U+1D11E) is split into two parts, so it cannot be looked up correctly in the font glyphs.
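
A minimal sketch of the behavior described above, runnable in Node or a browser console. The sample string is an assumption chosen for illustration; "𝄞" is written as its UTF-16 surrogate pair so the two halves are visible.

```javascript
// U+1D11E (musical symbol G clef) written as its UTF-16 surrogate pair.
var text = 'a\uD834\uDD1E';

// split( '' ) cuts between UTF-16 code units, not between characters.
var units = String( text ).split( '' );

console.log( text.length );  // 3: one unit for 'a', two for the clef
console.log( units.length ); // 3

// The clef ends up as two separate lone surrogates, neither of which
// would match a glyph key for the full character.
console.log( units[ 1 ] === '\uD834', units[ 2 ] === '\uDD1E' ); // true true
```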

I guess few people have a requirement like mine, so the impact is probably close to zero. Still, it would be better to handle these scenarios properly. 😃

Thanks.

Three.js version

[ ] Dev
[x] r92
[ ] ...

Browser

[ ] All of them
[x] Chrome
[ ] Firefox
[ ] Internet Explorer

OS

[ ] All of them
[ ] Windows
[x] macOS
[ ] Linux
[ ] Android
[ ] iOS

Hardware Requirements (graphics card, VR Device, ...)

Bug

Source

mooncaker816

All 10 comments

I think Array.from() should solve the problem: https://jsfiddle.net/5LLchg9q/

Array.from() is not supported in IE 11 and other older browsers but we could still use split() as a fallback. Or we just add a polyfill.

Mugen87 on 4 May 2018

Yes, both Array.from(text) and [...text] work.
I think using split() as a fallback is reasonable.

mooncaker816 on 4 May 2018

So instead of

var chars = String( text ).split( '' );

let's do this:

var chars = Array.from ? Array.from( text ) : String( text ).split( '' ); // see #13988
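
The same idea wrapped in a small helper, as a sketch only. The name splitChars is hypothetical and not part of three.js. One assumption differs from the line above: the input is coerced with String() in both branches, because Array.from( 123 ) returns an empty array while String( 123 ).split( '' ) does not.

```javascript
// Hypothetical helper (not three.js API): split text into characters,
// keeping surrogate pairs together where Array.from is available.
function splitChars( text ) {

	// Coerce first so both branches see the same string.
	var str = String( text );

	// Array.from iterates a string by code point; split( '' ) is the
	// fallback for environments without it (for example IE 11).
	return Array.from ? Array.from( str ) : str.split( '' );

}

var chars = splitChars( 'a\uD834\uDD1Eb' );

console.log( chars.length ); // 3 where Array.from exists
console.log( chars[ 1 ] === '\uD834\uDD1E' ); // true where Array.from exists
```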

Would you like to do a PR with the change?

Mugen87 on 4 May 2018

Hi @Mugen87, PR #13998 has been created, please review. Thanks.

mooncaker816 on 5 May 2018

https://jsfiddle.net/5LLchg9q/1/

Why not use
String.prototype.match.call(string, /[\uD800-\uDBFF][\uDC00-\uDFFF]?|[^\uD800-\uDFFF]|./g)
instead of
String( text ).split( '' )

gero3 on 5 May 2018

@gero3 Can you please explain the regex a bit? 😇

Mugen87 on 5 May 2018

@Mugen87,
Code points in the range U+10000 to U+10FFFF are, I believe, encoded in JS strings (UTF-16) as a pair of surrogates: the high surrogate range is \uD800 - \uDBFF and the low surrogate range is \uDC00 - \uDFFF.
For example, U+10001 => \uD800\uDC01.
high surrogate = Math.floor((unicode point - 0x10000) / 0x400) + 0xD800
low surrogate = (unicode point - 0x10000) % 0x400 + 0xDC00
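
The two formulas above can be checked with a short sketch. The function name toSurrogatePair is an assumption made for illustration; String.fromCodePoint is the standard ES2015 built-in used here for comparison.

```javascript
// Hypothetical helper: apply the two formulas to a code point
// in the range U+10000 to U+10FFFF.
function toSurrogatePair( codePoint ) {

	var offset = codePoint - 0x10000;
	var high = Math.floor( offset / 0x400 ) + 0xD800;
	var low = ( offset % 0x400 ) + 0xDC00;

	return [ high, low ];

}

var pair = toSurrogatePair( 0x1D11E ); // "𝄞"
console.log( pair[ 0 ].toString( 16 ), pair[ 1 ].toString( 16 ) ); // d834 dd1e

// Cross-check against the string the engine builds for the same code point.
var builtIn = String.fromCodePoint( 0x1D11E );
console.log( builtIn.charCodeAt( 0 ) === pair[ 0 ] ); // true
console.log( builtIn.charCodeAt( 1 ) === pair[ 1 ] ); // true
```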

So I guess @gero3 intends to match all such pairs with [\uD800-\uDBFF][\uDC00-\uDFFF] and to skip an invalid lone surrogate by using [^\uD800-\uDFFF]? (Although a lone surrogate should not occur in practice.)

If this is true, then I guess we should use one of the regexes below:
/[\uD800-\uDBFF][\uDC00-\uDFFF]|[^\uD800-\uDFFF]/g
or
/[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]/g
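
A sketch comparing the regex proposed earlier in the thread with the first stricter variant above. The variable names and the sample string (which ends in a deliberately invalid lone low surrogate) are assumptions for illustration. Run this way, the earlier regex appears to keep lone surrogates as separate matches, while the stricter one drops them.

```javascript
// Regex from the earlier comment: the optional low surrogate and the
// trailing "." let lone surrogates through as their own matches.
var lenient = /[\uD800-\uDBFF][\uDC00-\uDFFF]?|[^\uD800-\uDFFF]|./g;

// Stricter variant: only complete pairs and non-surrogate units match.
var strict = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[^\uD800-\uDFFF]/g;

// 'a', a valid pair for U+1D11E, then a lone low surrogate.
var sample = 'a\uD834\uDD1E\uDC00';

// match() returns null when nothing matches, so fall back to [].
var lenientParts = sample.match( lenient ) || [];
var strictParts = sample.match( strict ) || [];

console.log( lenientParts.length ); // 3: 'a', the pair, the lone surrogate
console.log( strictParts.length );  // 2: 'a', the pair
```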

@gero3, please correct me if I'm wrong, thanks.

https://jsfiddle.net/vj49vuxb/2/

mooncaker816 on 6 May 2018

@mooncaker816 If you are happy with the current implementation of your PR, I would prefer this way. It's just easier to read than introducing a new regex.

Mugen87 on 10 May 2018

The regex looks like a condensed version of https://github.com/dotcypress/runes