Right now, `isalpha` simply checks whether the given character is in one of the L categories (`isalpha(c::AbstractChar) = UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_LO`). This is almost correct, except that the Unicode `Alphabetic` property covers not only these categories but also the `Nl` category (letter numbers, e.g. Roman numerals) and, crucially, a set of characters defined as `Other_Alphabetic` that live in `Mc` and `Mn` (spacing and non-spacing marks). A lot of code points in Indic texts, e.g. most occurrences of vowels in Tamil texts, are characters found in this `Other_Alphabetic` list.
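To make the failure concrete, here is a rough Python sketch (not the Julia implementation) that mimics an L-categories-only check with the standard `unicodedata` module and lists which characters of a Tamil word fall outside it. The helper names `is_letter_category` and `failing_chars` are made up for illustration.

```python
import unicodedata

# General categories covered by an "L categories only" check (Lu through Lo).
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}


def is_letter_category(ch: str) -> bool:
    """Hypothetical stand-in for the category-range check: True only for L* categories."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


def failing_chars(text: str):
    """Return (code point, general category) for every character that fails the check."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch))
        for ch in text
        if not is_letter_category(ch)
    ]


word = "அதிகாலை"
for code_point, category in failing_chars(word):
    # Each rejected character here is a Tamil vowel sign in category Mc.
    print(code_point, category)
```

The consonants and the initial vowel pass as `Lo`; the dependent vowel signs are the ones that get rejected, which is why a plain L-category test reports the whole word as non-alphabetic.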
Among the few other programming languages I tried this check on, Ruby (`\p{Alpha}`) and Java (`Character.isAlphabetic`) get this right (the Java documentation explicitly explains the `Alphabetic` property), while Python 2 and 3 (`"அதிகாலை".isalpha()`) both seem to get it wrong. Perl also identifies the `Other_Alphabetic` characters correctly under `\p{Alpha}` (though it also seems to have additional magic on top).
`Other_Alphabetic` apparently applies to 1300 code points according to the Unicode PropList, so there are letters from quite a few scripts that currently fail `isalpha`.
I'm not sure if `utf8proc` supports querying for either the `Alphabetic` or the `Other_Alphabetic` property (the `utf8proc_property_struct` doesn't seem to have either property), so this might have to be implemented there first. Also, possibly related to #25653 with regard to implementation.
There are a whole bunch of Unicode character properties that aren't currently in utf8proc, e.g. `Other_Alphabetic` and `Sentence_Terminal` and `Quotation_Mark` and ...
I suspect that, rather than cramming all of these into utf8proc, it would be better to keep utf8proc focused mainly on normalization and have a separate package of UnicodeProperties with a bunch of optimized 2-stage tables (exposed as e.g. a new `AbstractSet` type) for different character properties.
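For anyone unfamiliar with the idea, a 2-stage table exposed as a set could look roughly like the Python sketch below. This is only an illustration of the data structure, not a proposal for the actual package: the class name `TwoStageSet`, the 256-code-point block size, and the three-element sample (Tamil vowel signs standing in for the real `Other_Alphabetic` list) are all assumptions.

```python
class TwoStageSet:
    """Immutable set of code points backed by a two-stage lookup table (illustrative)."""

    BLOCK_BITS = 8                  # assumed block size: 256 code points per block
    BLOCK_SIZE = 1 << BLOCK_BITS
    MAX_CP = 0x110000               # one past the last Unicode code point

    def __init__(self, code_points):
        members = set(code_points)
        self._blocks = []           # stage 2: unique blocks of membership flags
        self._index = []            # stage 1: which unique block each range uses
        seen = {}                   # block contents -> position in self._blocks
        for start in range(0, self.MAX_CP, self.BLOCK_SIZE):
            block = tuple(cp in members for cp in range(start, start + self.BLOCK_SIZE))
            if block not in seen:
                seen[block] = len(self._blocks)
                self._blocks.append(block)
            self._index.append(seen[block])

    def __contains__(self, ch):
        cp = ord(ch) if isinstance(ch, str) else ch
        if not 0 <= cp < self.MAX_CP:
            return False
        # Stage 1 lookup on the high bits, stage 2 lookup on the low bits.
        block = self._blocks[self._index[cp >> self.BLOCK_BITS]]
        return block[cp & (self.BLOCK_SIZE - 1)]


# Tiny hand-picked sample, NOT the real property data.
sample = TwoStageSet([0x0BBE, 0x0BBF, 0x0BC8])
print("ி" in sample)   # membership test on a character
print("a" in sample)
print(len(sample._blocks), "unique blocks stored")
```

The point of the layout is that identical blocks are stored only once, so a sparse property costs one small index array plus a handful of distinct blocks, and a membership test is two array lookups.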
In the meantime, maybe `isalpha` should be renamed to `isletter`, analogous to Go.
+1 for `isletter`; I can never remember whether `isalpha` is "is alphabetic" or "is alphanumeric."
Triage is ok with renaming to `isletter`.
If someone wants to make a PR doing this rename, that would be good; I don't think it's going to happen otherwise, though. @digital-carver? (or @ararslan if you feel like it)