Node: toLocaleUpperCase() not working for Georgian locale ('ka')

Created on 25 Aug 2018 · 21 Comments · Source: nodejs/node

  • Version: v10.7.0
  • Platform: Darwin 17.7.0 Darwin Kernel Version 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64 x86_64
  • Subsystem:

toLocaleUpperCase() does not work properly for Georgian locale.

_v10.7.0 and above_

> 'იანვარი'.toLocaleUpperCase();
'ᲘᲐᲜᲕᲐᲠᲘ'

Expected behaviour

_Till node v10.6.0_

> 'იანვარი'.toLocaleUpperCase();
'იანვარი'
V8 Engine i18n-api

All 21 comments

FWIW:

> [...'იანვარი'.toLocaleUpperCase()].map(ch => ch.codePointAt().toString(16))
[
  '1c98', // GEORGIAN MTAVRULI CAPITAL LETTER IN  (U+1C98)
  '1c90', // GEORGIAN MTAVRULI CAPITAL LETTER AN  (U+1C90)
  '1c9c', // GEORGIAN MTAVRULI CAPITAL LETTER NAR (U+1C9C)
  '1c95', // GEORGIAN MTAVRULI CAPITAL LETTER VIN (U+1C95)
  '1c90', // GEORGIAN MTAVRULI CAPITAL LETTER AN  (U+1C90)
  '1ca0', // GEORGIAN MTAVRULI CAPITAL LETTER RAE (U+1CA0)
  '1c98'  // GEORGIAN MTAVRULI CAPITAL LETTER IN  (U+1C98)
]

cc @nodejs/v8 @nodejs/intl Is it regression or fix?

Was there an ICU update between 10.6 and 10.7?

Edit: yes: https://github.com/nodejs/node/commit/122ae24f62de6f848eadcf72b75dff6114cf0079
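To check which ICU and Unicode versions a given Node.js binary bundles, one option is to read `process.versions`. This is a minimal sketch, assuming a build with Intl support (the `icu` and `unicode` keys are absent otherwise); the helper name `describeIntlData` is made up for illustration.

```javascript
// Hypothetical helper: report the ICU / Unicode data bundled with this Node.js binary.
function describeIntlData() {
  const { node, v8, icu, unicode } = process.versions;
  return {
    node,
    v8,
    // Both keys are undefined when Node.js was built without Intl support.
    icu: icu || 'none',
    unicode: unicode || 'unknown',
  };
}

console.log(describeIntlData());
```

Running this under two Node.js releases side by side shows whether the ICU or Unicode data changed between them.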

The ICU 62 changelog says:

The Unicode 11.0 changes may also require some code/tests to be fixed. Notably:

  1. Word break now groups white space together.
  2. Segmentation in general simplifies tests for emoji sequences.
  3. Casing behaves differently for Georgian, and differently for that than for any other script.

And the Unicode 11.0.0 changelog says:

Casing Issues
Casing behavior for the Georgian script has changed significantly. There is a new set of Mtavruli capital letters (U+1C90..U+1CBA, U+1CBD..U+1CBF) in Unicode 11.0, with case mappings to the existing Mkhedruli letters (U+10D0..U+10FA, U+10FD..U+10FF). In prior versions of the Unicode Standard, Mkhedruli Georgian was considered a monocameral (non-casing) script, and the Mkhedruli Georgian letters were gc=Lo. Starting with Version 11.0, those Mkhedruli Georgian letters are now gc=Ll, and have uppercase mappings to Mtavruli Georgian capital letters. This change will have major implications for Georgian implementations, including changes for input methods, fonts, casing, and string matching. Existing implementations have treated Mtavruli headlines and other uses for textual emphasis as a text style, so there will also be significant issues for document conversion and upgrade.

Another complication for Georgian is that the primary orthography does not use titlecasing, and the Mkhedruli Georgian letters do not have titlecase mappings to Mtavruli letters. This is unique among bicameral systems in the Unicode Standard, so casing implementations should be prepared for this exception.
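The code point ranges quoted above imply a fixed offset between each Mkhedruli letter and its Mtavruli capital (0x1C90 - 0x10D0). The sketch below compares that offset mapping with what `toUpperCase()` returns. It assumes a runtime with Unicode 11 or later data; on older data `toUpperCase()` leaves the string unchanged and the first comparison prints `false`.

```javascript
// Offset between Mkhedruli (U+10D0..) and Mtavruli (U+1C90..) letters.
const MTAVRULI_OFFSET = 0x1c90 - 0x10d0; // 0xbc0

const lower = '\u10d8\u10d0\u10dc\u10d5\u10d0\u10e0\u10d8'; // "იანვარი"
const upper = lower.toUpperCase();

// Shift every letter by the fixed offset to build the expected Mtavruli form.
const shifted = [...lower]
  .map(ch => String.fromCodePoint(ch.codePointAt(0) + MTAVRULI_OFFSET))
  .join('');

// Matches the built-in mapping only when the runtime has Unicode 11+ data.
console.log(upper === shifted);

// Lowercasing the result gets back to the original string.
console.log(upper.toLowerCase() === lower);
```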

Should we treat such changes as semver-major?

I don't know. Is it something that should be fixed upstream in V8?

/cc @nodejs/intl

Hi, I maintain https://github.com/moment/moment , and our builds are breaking because of this issue. Any idea if/when it will be fixed?

@marwahaha what is the breakage?

@targos @marwahaha it's not clear what you mean by 'fixed'. What do you see is the bug here?

[Screenshot, 2018-10-22, 8:36:39 a.m.]

But, this is after downloading a font that supports the new characters: https://app.box.com/s/psnogufec39aq486uny1o300j7c6hk76/file/317278048647

https://www.unicode.org/mail-arch/unicode-ml/y2018-m07/0063.html discusses this some.

Here's the problem. The vast majority of Georgian fonts do not yet have the new uppercase characters. So when any system uses case mapping to uppercase text (e.g. browsers interpreting CSS's text-transform: capitalize), then the users of Georgian will see boxes ("tofu") if the font they are using does not have the glyphs.
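For applications that must render uppercased text with fonts lacking the Mtavruli glyphs, one possible workaround is to map the new capitals back to Mkhedruli after uppercasing. This is only a sketch: `toUpperCaseKeepMkhedruli` is a hypothetical helper, not something proposed in this thread, and it deliberately discards the Unicode 11 casing for Georgian.

```javascript
// Mtavruli capitals added in Unicode 11: U+1C90..U+1CBA and U+1CBD..U+1CBF.
const MTAVRULI = /[\u1c90-\u1cba\u1cbd-\u1cbf]/g;
// Distance from a Mtavruli capital back to its Mkhedruli letter.
const MKHEDRULI_SHIFT = 0x1c90 - 0x10d0;

// Hypothetical helper: uppercase everything except Georgian, which is
// mapped back to Mkhedruli so that older fonts can still render it.
function toUpperCaseKeepMkhedruli(text) {
  return text
    .toUpperCase()
    .replace(MTAVRULI, ch => String.fromCharCode(ch.charCodeAt(0) - MKHEDRULI_SHIFT));
}

console.log(toUpperCaseKeepMkhedruli('node \u10d8\u10d0\u10dc\u10d5\u10d0\u10e0\u10d8'));
// "NODE იანვარი"
```

On runtimes with pre-Unicode 11 data the replace step is a no-op, so the helper behaves the same on both sides of the ICU update.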

See a site for example https://bpgfonts.wordpress.com/download/

@vsemozhetbyt

Should we treat such changes as semver-major?

Traditionally we haven't. Functionality isn't removed.

As an example, /\p{Emoji}/u.test('🛹') (is Skateboard an emoji?) will also return true for 10.7.0 and false for 10.6.0. (I tried to find a similar case for Node 8.9.0 / 8.10.0, which added Unicode 10, such as t-rex 🦕, but \p{Emoji} was not turned on then. There is probably some similar example I could do with regex.)

(Edit: I don't have a font with the skateboard yet, either.)
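Since behavior like this follows the bundled Unicode data rather than the Node.js major version, code that depends on it can probe the runtime directly. A minimal sketch, assuming a runtime that supports `\p{...}` property escapes; `probeUnicodeData` is a hypothetical name.

```javascript
// Hypothetical probe: ask the runtime what its Unicode data knows,
// instead of comparing version strings.
function probeUnicodeData() {
  return {
    // U+1F6F9 SKATEBOARD was added in Unicode 11.
    skateboardIsEmoji: /\p{Emoji}/u.test('\u{1F6F9}'),
    // Mkhedruli letters gained uppercase mappings in Unicode 11.
    georgianHasCase: '\u10d0'.toUpperCase() !== '\u10d0',
  };
}

console.log(probeUnicodeData());
```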

I actually think the following may be what is breaking moment.js. Looks like a bug somewhere.

new RegExp('ოქტ', 'i').test('ოქტ'.toUpperCase()) // == 'ᲝᲥᲢ'

… returns false, but should be true. (breaks in Safari 12.0 also, hm)

Update: filed a v8 bug https://bugs.chromium.org/p/v8/issues/detail?id=8348

Update update: the bug is in moment.

Needs to be new RegExp('ოქტ', 'ui').test('ოქტ'.toUpperCase()) // == 'ᲝᲥᲢ' (need that 'u' flag)
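The difference between the two flag combinations can be checked side by side. This sketch assumes nothing beyond built-in `RegExp` and `String` methods. The result of the plain `'i'` match is not annotated because it depends on the V8 version, as the rest of this thread discusses.

```javascript
const month = '\u10dd\u10e5\u10e2'; // "ოქტ"
const upper = month.toUpperCase();  // "ᲝᲥᲢ" with Unicode 11+ data

// With the 'u' flag, case-insensitive matching uses Unicode case folding.
const unicodeMatch = new RegExp(month, 'ui').test(upper);
console.log(unicodeMatch);

// Without 'u', the outcome varies between the V8 versions in this thread.
console.log(new RegExp(month, 'i').test(upper));

// Alternative that avoids regex case folding: lowercase both sides first.
const normalizedMatch = month.toLowerCase() === upper.toLowerCase();
console.log(normalizedMatch); // true
```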

I'd like to close this as working as designed. toUpperCase() is working properly for Unicode 11.

@nodejs/v8 if we want to float the eventual v8 fix, will we want a new issue? or repurpose this one? The original issue is not a bug, but correct behavior for toLocaleUpperCase(). However, there's a regex bug.

@srl295 you'd have to eventually make one at https://bugs.chromium.org/p/v8/ so that it could be properly referenced.

Oh, you mean in Node? I guess you could just backport the commit to master once it's in V8 LKGR. Let me know if you need help with that.

This still seems to fail on Node 10.20.1, although not on 8.17.0.
https://travis-ci.org/github/moment/moment/builds/688016559

@marwahaha The V8 fix (v8/v8@bb24140cb3eef5452e7a74f96a8261f6c049dd02) hasn't been back-ported to V8 6.8, the version that ships with Node.js v10.x. Node.js v12.x contains the fix though, it bundles V8 7.8. (The fix was merged in 7.4 or 7.5.)

v10.x enters maintenance mode tomorrow and it's not a trivial fix (risk of regressions) so I suggest taking no action.

@bnoordhuis I think that sounds right on a short re-read of the bugs.

Thanks, closing then.
