Julia: `isalpha` should use Unicode property `Alphabetic`; rename to `isletter`

Created on 30 Apr 2018 · 5 Comments · Source: JuliaLang/julia

Right now, `isalpha` simply checks whether the given character is in one of the L categories (`isalpha(c::AbstractChar) = UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_LO`). This is almost correct, but the Unicode Alphabetic property covers more than these categories: it also includes the Nl category (letter numbers, e.g. Roman numerals) and, crucially, a set of characters designated Other_Alphabetic that live in Mc and Mn (spacing and non-spacing marks). Many code points in Indic texts, e.g. most occurrences of vowels in Tamil texts, are in this Other_Alphabetic list.
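To make the gap concrete, here is a minimal sketch in Python, chosen only because its standard `unicodedata` module exposes general categories. It is not Julia's implementation. The helper name `in_letter_categories` is hypothetical and simply mirrors the L-category test quoted above.

```python
import unicodedata

def in_letter_categories(ch: str) -> bool:
    """Mirror of the L-category test: True only for Lu, Ll, Lt, Lm, Lo."""
    return unicodedata.category(ch).startswith("L")

# One plain letter, one letter number (Nl), one Tamil letter,
# and one Tamil vowel sign (a spacing mark, Mc).
samples = ["\u0041", "\u2167", "\u0b95", "\u0bbe"]

for ch in samples:
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
        f"category {unicodedata.category(ch)}, "
        f"letter-category test = {in_letter_categories(ch)}"
    )
```

The Roman numeral and the vowel sign are rejected by the category test even though both would count under the Alphabetic property as described above.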

Among the few other programming languages I tried this check on, Ruby (`\p{Alpha}`) and Java (`Character.isAlphabetic`) get this right (the Java documentation explicitly explains the Alphabetic property), while Python 2 and 3 both seem to get it wrong (`"அதிகாலை".isalpha()`). Perl also correctly identifies the Other_Alphabetic characters under `\p{Alpha}` (though it also seems to have additional magic on top).
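The Python observation can be reproduced with the standard library alone. This is a small sketch for Python 3, whose documentation defines `str.isalpha()` in terms of the letter general categories rather than the Alphabetic property. The variable names are illustrative.

```python
import unicodedata

word = "அதிகாலை"

# True only if every character has a letter general category.
print(word.isalpha())

# List the code points that make the check fail.
offenders = [ch for ch in word if not ch.isalpha()]
for ch in offenders:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
```

The characters it lists are the dependent vowel signs, which is where the category-based check and the Alphabetic property disagree.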

Other_Alphabetic apparently applies to 1300 code points according to the Unicode PropList, so there are letters from quite a few scripts that currently fail `isalpha`.
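A count like this can be derived by summing the ranges listed for a property. The sketch below assumes the usual `PropList.txt` line layout (`range ; Property # comment`) and runs on a short excerpt written in that layout for illustration only. `EXCERPT` and `count_property` are hypothetical names, and the real file would have to be obtained from the Unicode Character Database.

```python
# A few lines in the PropList.txt layout, for illustration only;
# the real file is far longer.
EXCERPT = """\
0345          ; Other_Alphabetic # Mn       COMBINING GREEK YPOGEGRAMMENI
05B0..05BD    ; Other_Alphabetic # Mn  [14] HEBREW POINT SHEVA..HEBREW POINT METEG
0BBE..0BBF    ; Other_Alphabetic # Mc   [2] TAMIL VOWEL SIGN AA..TAMIL VOWEL SIGN I
0BC0          ; Other_Alphabetic # Mn       TAMIL VOWEL SIGN II
0022          ; Quotation_Mark # Po       QUOTATION MARK
"""

def count_property(text: str, prop: str) -> int:
    """Count the code points assigned to `prop` in PropList-style text."""
    total = 0
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        span, name = (field.strip() for field in data.split(";", 1))
        if name != prop:
            continue
        first, _, last = span.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo  # single code point or a range
        total += hi - lo + 1
    return total

print(count_property(EXCERPT, "Other_Alphabetic"))
```

Passing the contents of the full file instead of `EXCERPT` would give the total for whichever Unicode version that file belongs to.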

I'm not sure whether utf8proc supports querying either the Alphabetic or the Other_Alphabetic property (the `utf8proc_property_struct` doesn't seem to have either), so this might have to be implemented there first. This is also possibly related to #25653 with regard to implementation.

unicode

All 5 comments

There are a whole bunch of Unicode character properties that aren't currently in utf8proc, e.g. Other_Alphabetic and Sentence_Terminal and Quotation_Mark and ...

I suspect that, rather than cramming all of these into utf8proc, it would be better to keep utf8proc focused mainly on normalization and have a separate package of UnicodeProperties with a bunch of optimized 2-stage tables (exposed as e.g. a new AbstractSet type) for different character properties.
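As a rough illustration of the two-stage table idea, here is a sketch in Python rather than Julia. It is not an existing package: `TwoStageSet`, the block size, and the sample ranges are all assumptions made for the example. The first stage maps each 256-code-point slice to a block number, and the second stage stores each distinct block once.

```python
BLOCK = 256              # code points per second-stage block
MAX_CP = 0x110000        # one past the last Unicode code point

class TwoStageSet:
    """Hypothetical set-like container for one character property."""

    def __init__(self, ranges):
        flags = bytearray(MAX_CP)
        for lo, hi in ranges:
            flags[lo:hi + 1] = b"\x01" * (hi - lo + 1)

        self.blocks = []  # stage 2: deduplicated blocks of flags
        self.index = []   # stage 1: block number for each slice
        seen = {}
        for start in range(0, MAX_CP, BLOCK):
            block = bytes(flags[start:start + BLOCK])
            if block not in seen:
                seen[block] = len(self.blocks)
                self.blocks.append(block)
            self.index.append(seen[block])

    def __contains__(self, ch):
        hi, lo = divmod(ord(ch), BLOCK)
        return self.blocks[self.index[hi]][lo] == 1

# Illustrative ranges only, not the full Other_Alphabetic list.
sample = TwoStageSet([(0x0345, 0x0345), (0x0BBE, 0x0BBF)])
print("\u0bbe" in sample, "a" in sample, len(sample.blocks))
```

Because most slices are identical (all zeros), only a handful of blocks are stored, and a membership test costs two index lookups.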

In the meantime, maybe `isalpha` should be renamed to `isletter`, analogous to Go's `unicode.IsLetter`.

+1 for isletter; I can never remember whether isalpha is "is alphabetic" or "is alphanumeric."

Triage is ok with renaming to isletter.

If someone wants to make a PR doing this rename, that would be good; I don't think it's going to happen otherwise, though. @digital-carver? (or @ararslan if you feel like it)
