I find it kind of confusing that the standard library for string processing is called Unicode instead of something like Strings. Isn't it just an implementation detail of Julia that strings happen to be Unicode? It seems weird to force end users, many of whom are probably scientific users who might not know anything about character encodings, to know that Julia strings are stored as Unicode in order to call common string processing functions.
For example, functions like Unicode.islower don't seem intrinsically tied to Unicode (if anything, that function is only meaningful for Latin encodings).
I would argue that these functions shouldn't have left Base, though that ship has sailed.
I very much agree that the name Unicode feels like an implementation detail that gets exposed. I realize that perhaps technically it isn't, but splitting functions up into different packages depending on whether they depend on the Unicode version is not very user-friendly. I would also want to put them back into Base purely from a user-friendliness perspective, speaking as someone who knows almost nothing of implementation details...
The definition of what it means to lowercase something or to be lowercase is specific to Unicode and it changes with different versions of the standard. It might make sense to also have an ASCII module and have ASCII-specific predicates and transformations, since often one does not care about case-transforming Elvish. If someone wants to propose a definition of these functions that is not dependent on the Unicode standard, I think the onus is on them to provide a precise definition of what the functions mean without referring to that standard.
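To make the ASCII idea concrete, here is a minimal sketch of what ASCII-only predicates and transformations could look like. The module name ASCIICase and all of its function names are hypothetical, made up for illustration; nothing like this is claimed to exist in Base or the standard library.

```julia
# Hypothetical module: the name and every function in it are illustrative only.
module ASCIICase

export ascii_islower, ascii_isupper, ascii_lowercase, ascii_uppercase

# True only for the 26 ASCII lowercase / uppercase letters.
ascii_islower(c::Char) = 'a' <= c <= 'z'
ascii_isupper(c::Char) = 'A' <= c <= 'Z'

# ASCII letters differ by a fixed offset of 32 between the two cases;
# every other character is returned unchanged.
ascii_lowercase(c::Char) = ascii_isupper(c) ? c + 32 : c
ascii_uppercase(c::Char) = ascii_islower(c) ? c - 32 : c

# Apply the per-character transform across a whole string.
ascii_lowercase(s::AbstractString) = map(ascii_lowercase, s)
ascii_uppercase(s::AbstractString) = map(ascii_uppercase, s)

end # module

using .ASCIICase

println(ascii_uppercase("straße"))  # non-ASCII characters pass through untouched
```

The definition here needs no reference to any standard beyond ASCII itself, which is the property the comment above is asking for; the price is that anything outside a-z and A-Z is simply left alone.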
I think "return the lowercase character corresponding to the given character" is precise enough by our usual standards of generic function meanings. If I made up my own encoding where "A" is 9999 and "a" is 9998, defining lowercase(JeffChar(9999)) to return JeffChar(9998) seems legit to me.
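A rough sketch of that thought experiment, assuming JeffChar is a plain struct wrapping a code number (the type, its field name, and the two code values are all hypothetical and taken only from the comment above):

```julia
# Hypothetical character type for a made-up encoding:
# code 9999 stands for "A" and code 9998 stands for "a".
struct JeffChar
    code::UInt32
end

# Extend the generic Base functions with methods for the new type.
# Codes that have no case counterpart are returned unchanged.
Base.lowercase(c::JeffChar) = c.code == 9999 ? JeffChar(9998) : c
Base.uppercase(c::JeffChar) = c.code == 9998 ? JeffChar(9999) : c

println(lowercase(JeffChar(9999)))
```

This only shows that the generic function can be given a method for a non-Unicode type; it says nothing about what the result should be for characters whose case mapping is not obvious.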
That's a particularly unhelpful example since the question is not how the letters A and a are encoded; that does not matter at all. The question is how to transform other characters whose case transformation is non-obvious according to obvious ASCII rules. What are the uppercase versions of ÿ or ß, for example?
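One way to see what a given Julia build answers is simply to ask it. The snippet below is a small sketch that prints each character and its uppercase mapping with code points; the result for ß is deliberately not stated here, since it depends on the Unicode data bundled with the Julia version being run.

```julia
# uppercase(::Char) applies the case mapping that ships with the running Julia version.
for c in ('a', 'ÿ', 'ß')
    u = uppercase(c)
    println(c, " (U+", string(UInt32(c), base = 16, pad = 4), ") -> ",
            u, " (U+", string(UInt32(u), base = 16, pad = 4), ")")
end
```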
Here's what I would suggest we do: Keep these functions in Base. For each Julia version x.y.z, document that the functions implement/correspond to a particular Unicode standard. People who want compliance with bleeding edge or old Unicode standards could use a separate package as needed.
If we call the package Strings, it seems to give us license to move a bunch more functions there from Base, which might be nice. We could have both Strings and Unicode (for truly Unicode-specific things like graphemes) or Strings.Unicode.
OK, I'll submit a PR soon to do the renaming.
Here is the full list of exports, grouped by similar functionality:
graphemes
textwidth
isvalid
islower, isupper, isalpha, isdigit, isxdigit, isnumeric, isalnum, iscntrl, ispunct, isspace, isprint, isgraph
lowercase, uppercase, titlecase, lcfirst, ucfirst
Grapheme iteration and Unicode validity are inherently Unicode-related. Other character sets could define notions of these. Character class predicates could be extended to other character sets, as could character transformations. Having Strings.isspace etc. not operate on strings may be a bit surprising.
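A short sketch of the difference between these groups, written against a Julia 1.x release (where graphemes lives in the Unicode standard library and isvalid and isspace are available from Base), which may differ from the layout being discussed in this thread:

```julia
using Unicode  # standard library; provides graphemes

s = "e\u0301"  # 'e' followed by a combining acute accent

n_chars     = length(s)                      # counts code points
n_graphemes = length(collect(graphemes(s)))  # counts user-perceived characters

println("code points: ", n_chars, ", graphemes: ", n_graphemes)
println("valid string? ", isvalid(s))

# The character class predicates take a single character, not a string,
# so checking a whole string means applying them element-wise.
println("all whitespace? ", all(isspace, " \t"))
```

The last line is the surprise mentioned above: a predicate in a module named Strings would still be a per-character test.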
Strings and Strings.Unicode seem fine, though I really think that some of the basic functionality, e.g. lowercase, uppercase, etc., should live in Base.
I think that's why we were also going to move a bunch of other string functionality (search, chomp, isvalid, that sort of thing) in too.
At least for me, one of the main motivations for moving all these functions to stdlib was to clean the Base namespace of all the is* predicates and group them under a common module (see discussion at https://github.com/JuliaLang/julia/issues/14347). I don't really care what the module is called.
What are the uppercase versions of ÿ or ß, for example?
@StefanKarpinski German added an uppercase ß in 2017. Unicode 5.1: U+1E9E
That was a rhetorical question, @jrklasen. My point is that the answer to the question is not self-evident – that you referred to the Unicode version in which it was added makes the point quite well.
Sorry, I was not quite clear. I just wanted to point out that the uppercase and lowercase of ß aren't simply defined by Unicode, but by the "Rat für deutsche Rechtschreibung" (Council for German Orthography), which changed it last year from (ss|ß) <-> SS to ss <-> SS and ß <-> ẞ (U+1E9E). If the package has another name than Unicode this could be adopted.
But the package specifically implements Unicode uppercasing, not Council for German Orthography uppercasing. That can be implemented by another package, but it's not what this one implements.
There seem to be three common objections:
1. [...] using to use functions like lowercase.
2. [...] using but find writing using Unicode specifically annoying.
3. [...]

While I can understand 1, we moved these things out because it seemed unfortunate to have names like isvalid or isspace exported from Base – they're not terribly generic or general. We could move them back, but I don't think that's great. We could move some of them back and leave others in the Unicode package, but where do you draw the line?
I find 2 a bit whiny. Is the name "Unicode" really that scary? It's 2018 and Unicode is here to stay, being afraid of the name seems silly. If you're doing case transformations on arbitrary code points, you are doing something that is defined in terms of the Unicode standard and it seems not so unreasonable that you be aware of it.
Point 3 seems to be the most compelling technical argument, but as I said, I also am having a hard time being convinced that it's important to support character sets that can't be mapped into Unicode.
I wonder if there's a solution here similar to the Pkg3 "This package is not installed. Install? [Y/n]" prompt.
Case changing is so basic that I think it's really silly to require using anything to get it. I think the answer to "where do you stop" would be a survey of what other languages provide out of the box without requiring any imports. The lowest common denominator of those seems like it would be a good candidate for Base, while the rest can live in Unicode (or whatever people want to call it).
Other recent languages adopt a variety of approaches:
- [...] Unicode as methods of its char type. Of course that's a bit different from Julia since methods are attached to a type, which is almost like having them in a separate module (though the advantage is that you don't need to load it explicitly).
- [...] is* functions, but as a series of character set objects.

Ruby does the same thing as Python. Matlab and R have basic string functions such as case changing readily available without loading a library.
I kind of like the character set object approach but that is somewhat orthogonal to whether those objects are in Base or not.
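For what it's worth, a character set object could look roughly like this in Julia. This is only a sketch under my own assumptions: the CharClass type and the Letter and Space constants are hypothetical names, and the predicates used are the ones exported by Base in Julia 1.x (isletter, isspace).

```julia
# Hypothetical type: a named class of characters backed by a predicate.
struct CharClass
    name::String
    pred::Function
end

# Membership test, so that `c in Letter` reads naturally.
Base.in(c::AbstractChar, cc::CharClass) = cc.pred(c)

const Letter = CharClass("Letter", isletter)
const Space  = CharClass("Space", isspace)

println('a' in Letter)
println(' ' in Space)
```

With this shape only a handful of objects would need a home, rather than a dozen is* functions, but as noted that is independent of whether they sit in Base.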
Matlab and R have basic string functions such as case changing readily available without loading a library.
I'm not sure that Matlab and R are the languages to follow when it comes to modularity or strings.
Other recent languages adopt a variety of approaches
Some languages allow/encourage interactive use, where others prefer a more programmatic workflow. Better to look at those languages with similar usage patterns...
Continuing from https://github.com/JuliaLang/julia/pull/25416#issuecomment-356043200 since we should have the conversation here, not on the PR.
@JeffBezanson, what is your proposal specifically? Case transforms in Base and everything else in Unicode? Or some of the character class predicates in Base (some of them do not depend on Unicode at all), and some in Unicode? One of the original motivations was getting isvalid and isnumeric and such into their own namespace.
isvalid is now a pretty important function so I think we'll have to keep it.
It's unfortunate that some of the predicate names like isspace and isdigit are pretty obvious while others like isnumeric are more generic. I don't know where to draw the line so I'd just keep all of them in Base (except isassigned).
A couple of the predicates are redundant and can maybe be deprecated:
isalnum is identical to isalpha(c) || isnumeric(c)
isgraph is identical to isprint(c) && !isspace(c)
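Written out, the two equivalences are one-liners. The sketch below uses hypothetical helper names and is written against Julia 1.x, where the predicate this thread calls isalpha is named isletter:

```julia
# Hypothetical helper names; isletter is the Julia 1.x spelling of isalpha.
isalnum_sketch(c::AbstractChar) = isletter(c) || isnumeric(c)
isgraph_sketch(c::AbstractChar) = isprint(c) && !isspace(c)

for c in ('a', '7', ' ', '!')
    println(repr(c), ": alnum=", isalnum_sketch(c), ", graph=", isgraph_sketch(c))
end
```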