JRuby: Modifiers are dropped in \X regular expression matches

Created on 30 Oct 2017 · 5 Comments · Source: jruby/jruby

The \X regular expression escape matches an "extended grapheme cluster".

This issue is about how that match goes wrong in JRuby.

Environment

Versions:

JRuby version: jruby 9.1.13.0 (2.3.3) 2017-09-06 8e1c115 Java HotSpot(TM) 64-Bit Server VM 25.92-b14 on 1.8.0_92-b14 +jit [darwin-x86_64]
Operating system and platform: Darwin Olles-MacBook-Pro.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64

Expected Behavior

$ /usr/bin/ruby -e "p 'åäöÅÄÖ'.unicode_normalize(:nfd).match /(\X)/"
#<MatchData "å" 1:"å">

The ring above the "a" is a "modifier" (a combining character). Here, in MRI, it is included in the MatchData.

Actual Behavior

$ jruby -e "p 'åäöÅÄÖ'.unicode_normalize(:nfd).match /(\X)/"
#<MatchData "a" 1:"a">

Note the absence of the "modifier".
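
To make the difference concrete, here is a minimal sketch that inspects the decomposed string and the first \X match. The helper names `nfd_codepoints` and `first_nfd_cluster` are hypothetical and do not come from the report; only `unicode_normalize`, `codepoints`, and the `\X` escape are standard Ruby.

```ruby
# Hypothetical helpers (not from the issue) for inspecting the match.

# Code points of the NFD-decomposed form, as "U+XXXX" labels.
def nfd_codepoints(str)
  str.unicode_normalize(:nfd).codepoints.map { |cp| format("U+%04X", cp) }
end

# First extended grapheme cluster of the NFD form, as matched by \X.
def first_nfd_cluster(str)
  str.unicode_normalize(:nfd)[/\X/]
end

# "å" decomposes to the base letter "a" plus U+030A COMBINING RING ABOVE.
p nfd_codepoints("\u00E5")

# A \X that honors grapheme clusters captures 2 code points here;
# the behavior reported above corresponds to 1.
p first_nfd_cluster("\u00E5").codepoints.length
```

Comparing the printed code point count between MRI and JRuby should show whether the combining mark is being dropped.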

For background on what I'm talking about, here are some links.

StackOverflow answer about "What is even \X?"

Keywords: extended grapheme cluster

Source

olleolleolle

All 5 comments

The unicode_normalize call appears to be working properly here, decomposing (in this case) each accented letter into a base letter followed by a combining character. The subsequent \X scan fails to consume those combining characters and only produces the non-combining 'a', 'o', 'A', 'O' characters.

This is likely missing or incorrect logic in joni. I'm reading up on how parsers and regex engines are expected to handle Unicode text normalized into combining characters.

headius on 30 Oct 2017
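
The observation in the comment above can be turned into a small check over the whole test string. This is a sketch under assumptions: `nfd_clusters` is a hypothetical helper, not part of JRuby or joni, and the string is written with escapes so the snippet does not depend on source encoding.

```ruby
# Hypothetical helper: split the NFD form of a string into \X clusters.
def nfd_clusters(str)
  str.unicode_normalize(:nfd).scan(/\X/)
end

# "åäöÅÄÖ" written with escapes; each letter decomposes to base + one mark.
clusters = nfd_clusters("\u00E5\u00E4\u00F6\u00C5\u00C4\u00D6")

# With a \X that groups combining marks, this gives 6 clusters
# of 2 code points each.
p clusters.length
p clusters.map { |c| c.codepoints.length }
```

Any cluster with a single code point in this output would indicate that a combining mark was separated from its base letter.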

@olleolleolle So one obvious workaround would be to not normalize, or to normalize to one of the composed forms, NFC or NFKC, rather than the decomposed forms:

$ jruby -e "p 'åäöÅÄÖ'.unicode_normalize(:nfc).match(/(\X)/)[1]"
"å"

headius on 30 Oct 2017

This only solves the test case. If you use the string "أُحِبُّ ٱلْقِرَاءَةَ كَثِيرًا‎" instead, it fails even in NFC, since there are no precomposed variants for Arabic letters with vowel marks.

auroranockert on 30 Oct 2017

❤1
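
The point in the comment above can be illustrated with a short sketch comparing how NFC treats a Latin and an Arabic sequence. `nfc_codepoint_count` is a hypothetical helper introduced only for this example, and the two-character Arabic sample (BEH plus DAMMA) is chosen here as a minimal stand-in for the longer string.

```ruby
# Hypothetical helper: number of code points left after NFC normalization.
def nfc_codepoint_count(str)
  str.unicode_normalize(:nfc).codepoints.length
end

# Latin "a" + U+030A COMBINING RING ABOVE composes into the single U+00E5.
p nfc_codepoint_count("a\u030A")

# Arabic BEH (U+0628) + DAMMA (U+064F) has no precomposed form,
# so NFC leaves both code points in place.
p nfc_codepoint_count("\u0628\u064F")
```

Because the Arabic pair stays decomposed, \X still has to group the base letter and its mark, so the NFC workaround cannot hide the bug there.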

Oh, closed by mistake! Re-opened.

olleolleolle on 31 Oct 2017

I'm going to close this in favor of #4568, which details the standard, the commits that added it to MRI, and so on.

headius on 28 Nov 2017

🎉1 👍1
