Julia: Replace isnumber(), etc. with a single function

Created on 10 Dec 2015  ·  33 Comments  ·  Source: JuliaLang/julia

This was mentioned by @JeffBezanson and @ScottPJones at https://github.com/JuliaLang/julia/issues/14340#issuecomment-163450531. It looks like we could replace some or all of isalpha, isalnum, iscntrl, isgraph, islower, isnumber, isprint, ispunct and isupper with a single function testing for a given Unicode character general category. (isspace and isascii cross several categories and must thus be kept; isdigit is also more restrictive than isnumber.)

The simplest API would be something like charcategory(x::Char) -> Symbol. This would force writing e.g. all(c->charcategory(c) == :Lu, s) to check whether all characters of a string are uppercase, but it would at least have the advantage of clarity (https://github.com/JuliaLang/julia/issues/14156#issuecomment-159986925), and would be fast as soon as Jeff's work on anonymous functions is merged.
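A minimal sketch of what such a function could look like in current Julia. The name charcategory comes from this proposal and is not an existing Base function; the sketch leans on Base.Unicode.category_abbrev, an unexported internal that may change between versions. It uses :Lu, the two-letter abbreviation for the uppercase-letter category.

```julia
# Hypothetical helper named after the proposal in this issue.
# Base.Unicode.category_abbrev is unexported and returns strings such as "Lu".
charcategory(c::AbstractChar) = Symbol(Base.Unicode.category_abbrev(c))

# True when every character of `s` is in the uppercase-letter category.
# `alluppercase` is an illustrative name, not part of any library.
alluppercase(s::AbstractString) = all(c -> charcategory(c) == :Lu, s)

charcategory('A')     # :Lu
charcategory('a')     # :Ll
alluppercase("ABC")   # true
alluppercase("AbC")   # false
```

Because the category is returned as a value, it can also be stored, printed, or collected, for example unique(charcategory.(collect(s))) to list the categories present in a string.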

An intermediate solution would be to keep the most commonly used functions like isnumber, islower, isprint and isupper, but deprecate isalpha, isalnum, isgraph, ispunct and iscntrl. Maybe even isupper and islower could be deprecated: isn't it more common to use lowercase or uppercase if you care about the result, or to check for a specific character? Actually, even isnumber might be deprecated, as it is easily confused with isdigit, which I suspect is the most commonly needed test (for parsing and conversion).

stdlib strings unicode

All 33 comments

Do you not like ischartype(class, x::Union{Char, AbstractString}), which is shorter and much easier to understand for new people for the string case than using all? (it could still be _implemented_ using all, and not need an anonymous function either). I also think ischartype(Lower, c) is clearer than charcategory(c) == :L, and can be a lot faster as well.

A function returning the char category (AFAIK, class isn't a defined Unicode term) is more general. If the symbol comparison is slow, we could use an enum instead.

How is it necessarily more general?
Using types allows you to express things much more powerfully, and extend them later.
For example:

abstract CharCategory
abstract Alphabetic <: CharCategory
abstract LowerCase <: Alphabetic
abstract UpperCase <: Alphabetic
abstract UpperCategory <: UpperCase
abstract TitleCase     <: UpperCase
abstract MathLetter   <: Alphabetic
...
const global cat_to_type = [UpperCategory, LowerCase, TitleCase, MathLetter, OtherLetter, ...]

Also, it turns out that using all in that way to see if a string is considered lowercase or uppercase is not actually correct, see the Unicode documentation:

Note that for “caseless” characters, such as U+02B0, U+1D34, and U+02BD, isLowerCase
and isUpperCase are both True, because the inclusion of a caseless letter in a string is not
criterial for determining the casing of the string—a caseless letter always case maps to itself.

Writing "class" there was just a late-night (early-morning) slip on my part; you are right, it should have been "category".

ischartype{T<:CharCategory}(::Type{T}, ch::Char) is what I really am proposing.
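For readers on Julia 1.x, where abstract types are declared with abstract type ... end and type parameters go in a where clause, the proposal might be sketched as follows. The helper categorytype and the catch-all OtherCategory are assumptions added for illustration, and only three category codes are mapped; everything else (including Lm and Lo) falls through to OtherCategory.

```julia
abstract type CharCategory end
abstract type Alphabetic    <: CharCategory end
abstract type LowerCase     <: Alphabetic end
abstract type UpperCase     <: Alphabetic end
abstract type UpperCategory <: UpperCase end
abstract type TitleCase     <: UpperCase end
abstract type OtherCategory <: CharCategory end  # hypothetical catch-all

# Hypothetical mapping from a character to a category type.
# Uses unexported Base.Unicode internals; only Lu, Ll and Lt are handled here.
function categorytype(c::AbstractChar)
    code = Base.Unicode.category_code(c)
    if code == Base.Unicode.UTF8PROC_CATEGORY_LU
        return UpperCategory
    elseif code == Base.Unicode.UTF8PROC_CATEGORY_LL
        return LowerCase
    elseif code == Base.Unicode.UTF8PROC_CATEGORY_LT
        return TitleCase
    else
        return OtherCategory
    end
end

# The proposed test: subtyping lets one call cover several categories.
ischartype(::Type{T}, c::AbstractChar) where {T<:CharCategory} = categorytype(c) <: T
ischartype(::Type{T}, s::AbstractString) where {T<:CharCategory} =
    all(c -> ischartype(T, c), s)

ischartype(UpperCase, 'A')            # true (uppercase letter)
ischartype(UpperCase, '\u01c5')       # true (titlecase letter)
ischartype(UpperCategory, '\u01c5')   # false
ischartype(LowerCase, "aBc")          # false
```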

What I meant by "general" is that it allows us to save the result (category), e.g. to print it for debugging purposes, or to list all categories encountered in a string. With ischartype, you would need to check for each category separately. It's like forcing you to write issqrt(x, 2) instead of sqrt(x) == 2.

Also, it turns out that using all in that way to see if a string is considered lowercase or uppercase is not actually correct, see the Unicode documentation:
Note that for “caseless” characters, such as U+02B0, U+1D34, and U+02BD, isLowerCase
and isUpperCase are both True, because the inclusion of a caseless letter in a string is not
criterial for determining the casing of the string—a caseless letter always case maps to itself.

Well, you can check whether the characters are in category Lu _or_ Lm _or_ Lo.

We could also make the function even more general, say charprop{T<:UnicodeProperty}(::Type{T}, c::Char), with possible types CharCategory, CharCasing, CharScript, CharWhitespace... It would return a boolean or a symbol/enum depending on the requested information.

There already is a category_code(c) function (not exported, you'd need to use Base.UTF8proc.category_code(c)), that returns an integer 0-29, along with constants such as
UTF8PROC_CATEGORY_LU. Those should probably be redone as an enum.
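As a rough illustration of the enum idea, the sketch below wraps the integer code in an enum. It assumes a recent Julia where the function lives at Base.Unicode.category_code (still unexported), and assumes the codes 0 to 29 follow the utf8proc ordering; the names GeneralCategory and generalcategory are made up for this example.

```julia
# Hypothetical enum mirroring the utf8proc category codes 0 through 29.
@enum GeneralCategory begin
    Cn = 0
    Lu; Ll; Lt; Lm; Lo
    Mn; Mc; Me
    Nd; Nl; No
    Pc; Pd; Ps; Pe; Pi; Pf; Po
    Sm; Sc; Sk; So
    Zs; Zl; Zp
    Cc; Cf; Cs; Co
end

# Hypothetical wrapper. Malformed or out-of-range characters produce codes
# outside 0:29 in recent Julia versions, so this conversion would throw for them.
generalcategory(c::AbstractChar) = GeneralCategory(Base.Unicode.category_code(c))

generalcategory('A')   # Lu
generalcategory('7')   # Nd
```

An enum keeps comparisons as cheap as integer comparisons while still printing a readable name.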

ischartype wouldn't force you to check each category separately at all - in fact, it makes things simpler, because you can use type inheritance as shown above.
ischartype(Alphabetic, x) for example, or ischartype(UpperCase, x) (returns true for titlecase or uppercase characters).

I also didn't say that you'd just have ischartype, and I do really like your charprop function idea.
So, charprop(CharCategory, ch) would return the abstract category type?

So, if I had a set of UnicodePropertys, as you listed, and the set of types for CharCategory as I proposed above, along with your nice charprop function and my convenience ischartype function
for categories, do you think it might possibly be accepted as a PR?

(I'm not sold on the name for ischartype, if anyone can suggest something better, but not _too_ long,
I'd appreciate it! Let the bikeshedding begin!)

Turns out, ischartype(UpperCase,x) would be the same as
isa(charprop(CharCategory, c), UpperCase), just shorter.

ischartype(cat::CharCategory, c::Char) = isa(charprop(CharCategory, c), cat)

Yes, that was my point. I don't think we need both, especially since you wouldn't write isa(charprop(CharCategory, c), cat) but charprop(CharCategory, c) <: cat.
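The distinction between isa and <: matters here because charprop would return a type, not an instance. A small sketch, reusing two of the type names from earlier in the thread:

```julia
abstract type CharCategory end
abstract type UpperCase <: CharCategory end
abstract type TitleCase <: UpperCase end

# A type is a subtype of its ancestors...
TitleCase <: UpperCase              # true

# ...but it is not an *instance* of them, so isa returns false.
isa(TitleCase, UpperCase)           # false

# isa only succeeds against a type of types.
isa(TitleCase, Type{<:UpperCase})   # true
```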

OK then, would you support a PR where I did all that I talked about above, less the addition of ischartype? If so, I'll go ahead and spend the time on it.

I would support it, but let's ask about the opinion of others...

A common use case is a question such as "is this character an ASCII letter or digit?", as in isascii(ch) && (isalpha(ch) || isdigit(ch)). Can this be efficiently expressed with the new function, maybe more efficiently than currently?

It doesn't look like this use case would really gain from the present proposal. isascii and isdigit are quite trivial to compute (they don't call utf8proc at all), and would be kept anyway. Also, calling isalpha is quite wasteful for ASCII characters. So I'd rather write this as '0' <= c <= '9' || 'a' <= c <= 'z' || 'A' <= c <= 'Z'.
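Wrapped in a function, the range comparison above could look like this (isasciialnum is a name chosen for this example, not an existing function):

```julia
# ASCII letter-or-digit test using plain code point comparisons,
# with no Unicode category lookup involved.
isasciialnum(c::AbstractChar) =
    '0' <= c <= '9' || 'a' <= c <= 'z' || 'A' <= c <= 'Z'

isasciialnum('q')   # true
isasciialnum('_')   # false
isasciialnum('é')   # false (a letter, but not ASCII)
```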

That said, I guess it would be written inefficiently as something like isascii(c) && (charprop(CharCategory, c) in (Unicode.Ll, Unicode.Lu, Unicode.Nd)). We could keep isalpha to make this shorter, but it would also prompt people to write slow code without realizing.

isalpha(c) = charprop(CharCategory, c) <: Alphabetic
I think that would be fast - it uses the type hierarchy of the CharCategory types.

That's fast, but not as fast as a plain comparison on integer values, which is all you need when working with ASCII strings. In the scenario @eschnett showed, computing the Unicode category is a waste (as I said in my previous comment).

Right, that is for the non ASCIIString case

I'm not sure what problem is being solved here.

Let's not do this.

The types version of this doesn't help anything, but doing it with symbols as originally proposed would work

True, but doesn't really seem worth it.

How about deprecating some of these functions, as mentioned in the description? Things like isnumber are really confusing, since you often want isdigit instead. I suspect we'll want to provide a function to extract the Unicode general category of a character at some point, as is possible in many languages. If we do that, we'll have two ways of getting the same information. I'd rather shrink the API as much as possible for 1.0, and add functions later if they turn out to be useful.

Agreed, these names are pretty obscure and not that widely used. Better to not be stuck with them for all of 1.x

Note that we do have Base.UTF8proc.category_* to get category codes, although we shouldn't export this without cleaning it up, probably to return an enum.

Even if you export a function to get category codes, however, many of the isfoo functions check for several categories (which is annoying to do by hand, especially since most users don't understand Unicode well enough to implement a function like isprint from category codes) or only check a subset of the categories (like iscntrl).
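To illustrate the point about multi-category checks, here is a sketch of a punctuation test written by hand from category abbreviations. The helper names are hypothetical, and Base.Unicode.category_abbrev is an unexported internal; the list of seven punctuation categories is the standard P* group of Unicode general categories.

```julia
# Hypothetical helper: is `c` in any of the given two-letter general categories?
incategories(c::AbstractChar, cats::Symbol...) =
    Symbol(Base.Unicode.category_abbrev(c)) in cats

# The punctuation group spans seven separate general categories.
const PUNCTUATION = (:Pc, :Pd, :Ps, :Pe, :Pi, :Pf, :Po)

ispunct_sketch(c::AbstractChar) = incategories(c, PUNCTUATION...)

ispunct_sketch('!')   # true  (Po)
ispunct_sketch('_')   # true  (Pc)
ispunct_sketch('+')   # false (Sm, a math symbol)
```

Having to know that '+' is a symbol rather than punctuation is the kind of detail most users would not want to track by hand.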

See also #5939, and #8233. Note also that, if I recall correctly from the discussion, there are several modern languages that choose to have these functions, e.g. Go.

Can we move them under a namespace? They're not especially generic things that you would ever call on anything but a Char or AbstractString. Go's spellings of these are also notably more obvious than ours.

They're also not all that widely-used, looks like on the order of 5-15 packages depending which one.

Moving these to a Unicode module sounds like a good idea. Note that the Go functions are only provided by a package. On the other hand, Rust includes them in the stdlib (though due to their classic OOP approach these functions are grouped under the char type).

What's even more interesting is that Swift has removed these functions in version 3 in favor of patterns like c in CharacterSet.alphanumerics. This is very similar to the solution proposed here. We could provide meta-categories like Swift, i.e. one for each of the existing functions, so that users do not have to list Unicode categories one by one. The advantage is that with a unified pattern, you can query a lot of different properties (not only categories).
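A Swift-like membership pattern could be approximated in Julia by overloading in for a small set type. Everything below is a hypothetical sketch: CategorySet, letters and alphanumerics are invented names, the category lists are illustrative rather than a copy of Swift's definitions, and category_abbrev is an unexported Base internal.

```julia
# Hypothetical "meta-category": a named collection of general categories.
struct CategorySet
    abbrevs::Vector{String}
end

# Allows the `c in set` spelling for characters.
Base.in(c::AbstractChar, s::CategorySet) =
    Base.Unicode.category_abbrev(c) in s.abbrevs

const letters       = CategorySet(["Lu", "Ll", "Lt", "Lm", "Lo"])
const alphanumerics = CategorySet(vcat(letters.abbrevs, ["Nd", "Nl", "No"]))

'a' in letters         # true
'1' in letters         # false
'1' in alphanumerics   # true
'!' in alphanumerics   # false
```

The same in pattern could extend to properties other than the general category by defining further set-like types.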

All character class stuff should be moved into a Unicode stdlib package, including all these functions. The Unicode prefix will also make it clear that these are string-specific functions.

That package can also re-export stuff from Base.UTF8proc.

Functions have been moved to the Unicode stdlib module. Keeping this issue open since it would still make sense to provide an API to get Unicode character properties like general category.

I think an API to get the category code should be a separate issue.

We can iterate on the API of the Unicode package in the future, fortunately.

Seems this issue's 1.0 items have been completed?

Yup, kicked to the 1.x milestone for further iteration.
