Jruby: Modifiers are dropped in \X regular expression matches

Created on 30 Oct 2017  ·  5 Comments  ·  Source: jruby/jruby

The \X regular expression escape matches one "extended grapheme cluster".

This issue is about how that match goes wrong in JRuby.

Environment

Versions:

  • JRuby version: jruby 9.1.13.0 (2.3.3) 2017-09-06 8e1c115 Java HotSpot(TM) 64-Bit Server VM 25.92-b14 on 1.8.0_92-b14 +jit [darwin-x86_64]
  • Operating system and platform: Darwin Olles-MacBook-Pro.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64

Expected Behavior

$ /usr/bin/ruby -e "p 'åäöÅÄÖ'.unicode_normalize(:nfd).match /(\X)/"
#<MatchData "å" 1:"å">

The ring above the "a" is a "modifier" (a combining mark). In MRI, it is included in the MatchData.

Actual Behavior

$ jruby -e "p 'åäöÅÄÖ'.unicode_normalize(:nfd).match /(\X)/"
#<MatchData "a" 1:"a">

Note the absence of the "modifier".
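
To make the difference concrete, here is a minimal sketch of what NFD does to a single "å" and what \X is expected to return for it. It uses \u escapes rather than literal characters so it does not depend on the file encoding. The variable names are illustrative only, and the output comments reflect MRI behaviour.

```ruby
composed   = "\u00E5"                          # "å" as one precomposed code point
decomposed = composed.unicode_normalize(:nfd)  # "a" followed by U+030A COMBINING RING ABOVE

p composed.codepoints    # [229]
p decomposed.codepoints  # [97, 778]

# \X should consume the base letter together with its combining mark.
cluster = decomposed[/\X/]
p cluster.codepoints     # MRI: [97, 778]; the JRuby result reported above corresponds to [97]
```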

Read more

For background on what I'm talking about, here is a link:

StackOverflow answer about "What is even \X?"

Keywords: extended grapheme cluster

All 5 comments

unicode_normalize appears to be working properly here: NFD expands (in this case) every accented letter into a base letter followed by a combining character. The subsequent \X scan fails to consume those combining characters and produces only the non-combining 'a', 'o', 'A', 'O' characters.

This is likely missing or incorrect logic in Joni, JRuby's regular expression engine. I'm reading up on how parsers and regex engines are expected to handle Unicode text that has been normalized into combining characters.
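
A small check along these lines may help when comparing engines. It is only a sketch: it scans the whole decomposed test string with \X and prints the code points of each match. On MRI every cluster holds two code points (base letter plus mark). Clusters containing a single base letter would correspond to the behaviour described in this comment.

```ruby
# "åäöÅÄÖ" written with escapes, then decomposed to base letter + combining mark
text = "\u00E5\u00E4\u00F6\u00C5\u00C4\u00D6".unicode_normalize(:nfd)

clusters = text.scan(/\X/)

# Print each match as a list of code points, e.g. "U+0061 U+030A"
clusters.each do |c|
  puts c.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
end

p clusters.length                             # MRI: 6
p clusters.all? { |c| c.codepoints.length == 2 }  # MRI: true
```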

@olleolleolle So one obvious workaround would be to not normalize, or to normalize to one of the composed forms, NFC or NFKC, rather than the decomposed forms:

$ jruby -e "p 'åäöÅÄÖ'.unicode_normalize(:nfc).match(/(\X)/)[1]"
"å"

This only solves the test case. If you use the string "أُحِبُّ ٱلْقِرَاءَةَ كَثِيرًا‎" instead, it fails even in NFC, since there are no precomposed variants for Arabic letters with vowel marks.
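
The point about Arabic can be sketched as follows. The first part shows that NFC leaves a letter plus vowel mark as two code points. The second part is a hypothetical approximation that is not from this thread: matching one non-mark character followed by any marks, via the \p{M} property. It is not a full implementation of extended grapheme clusters (it ignores Hangul syllable sequences, emoji sequences, and CRLF, for example), and it has not been verified against the JRuby version in this report.

```ruby
# ARABIC LETTER BEH (U+0628) + ARABIC FATHA (U+064E): no precomposed form exists,
# so NFC cannot merge them into one code point.
arabic = "\u0628\u064E"
nfc    = arabic.unicode_normalize(:nfc)
p nfc.codepoints.length  # 2

# Hypothetical approximation (assumption, not the thread's fix):
# one non-mark character followed by zero or more marks.
approx_cluster = /\P{M}\p{M}*/

# "a" + combining ring, then BEH + FATHA
sample = "\u00E5\u0628\u064E".unicode_normalize(:nfd)
approx = sample.scan(approx_cluster)
p approx.map(&:length)   # [2, 2]
```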

Oh, closed by mistake! Re-opened.

I'm going to close this in favor of #4568, which details the standard, the commits that added it to MRI, and so on.
