The \X regular expression matches an "extended grapheme cluster".
This issue is about how that match goes wrong in JRuby.
Versions:
$ /usr/bin/ruby -e "p 'åäöÅÄÖ'.unicode_normalize(:nfd).match /(\X)/"
#<MatchData "å" 1:"å">
The ring above the a is a combining "modifier" character. Here, in MRI, it is included in the MatchData.
$ jruby -e "p 'åäöÅÄÖ'.unicode_normalize(:nfd).match /(\X)/"
#<MatchData "a" 1:"a">
Note the absence of the "modifier".
For background on what I'm talking about, here are some links.
StackOverflow answer about "What is even \X?"
Keywords: extended grapheme cluster
The unicode_normalize call appears to be working properly here, decomposing (in this case) each accented letter into a base character followed by a combining character. The subsequent \X scan fails to consume those combining characters and only produces the non-combining 'a', 'o', 'A', 'O' characters.
This is likely missing or incorrect logic in joni. I'm reading up on how parsers and regex are expected to handle unicode normalized into combining characters.
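To make the decomposition described above concrete, here is a small sketch (not part of the original report) that prints the code points of "å" before and after NFD normalization. The helper name `codepoints_of` is made up for illustration.

```ruby
# "å" as a single precomposed code point: U+00E5 LATIN SMALL LETTER A WITH RING ABOVE.
composed = "\u00E5"

# NFD splits it into a base letter plus a combining mark.
decomposed = composed.unicode_normalize(:nfd)

# Hypothetical helper: format each code point as U+XXXX.
def codepoints_of(str)
  str.codepoints.map { |cp| format("U+%04X", cp) }
end

p codepoints_of(composed)    # => ["U+00E5"]
p codepoints_of(decomposed)  # => ["U+0061", "U+030A"]
```

A correct \X match on the decomposed string should cover both code points (U+0061 and U+030A), which is what the MRI output earlier in the thread shows.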
@olleolleolle So one obvious workaround would be to not normalize, or to normalize to one of the composed forms (NFC or NFKC) rather than the decomposed forms:
$ jruby -e "p 'åäöÅÄÖ'.unicode_normalize(:nfc).match(/(\X)/)[1]"
"å"
This only solves the test case. If you use the string "أُحِبُّ ٱلْقِرَاءَةَ كَثِيرًا" instead, it fails even in NFC, since there are no precomposed variants for Arabic letters with vowel marks.
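A minimal sketch of why NFC does not help in that case. It assumes a two-code-point sample (ALEF WITH HAMZA ABOVE followed by DAMMA) instead of the full sentence, and the variable names are illustrative only.

```ruby
# Assumed sample: U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE + U+064F ARABIC DAMMA.
sample = "\u0623\u064F"

nfc = sample.unicode_normalize(:nfc)

# There is no precomposed code point for this pair, so NFC leaves
# the vowel mark as a separate combining character.
p nfc.codepoints.length  # => 2
p nfc == sample          # => true

# With a correct \X, the first match should span both code points;
# an engine with the bug described in this issue would return only the base letter.
first_cluster = nfc[/\X/]
p first_cluster.codepoints.length
```

So any input that has combining marks without precomposed equivalents still depends on \X handling combining characters correctly.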
Oh, closed by mistake! Re-opened.
I'm going to close this in favor of #4568, which details the standard, the commits that added it to MRI, and so on.