As discussed on the mailing list, it is very confusing that

```julia
const μ = 3
µ + 1
```

throws a `µ not defined` exception (because Unicode codepoints 0x00b5 and 0x03bc are rendered almost identically). This could easily be encountered in real usage because option-m on a Mac produces 0x00b5 ("micro sign"), which is different from 0x03bc ("Greek small letter mu").
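For concreteness, the two codepoints can be inspected at the REPL (a quick sketch, using `uint16` as in the Julia of this era):

```julia
julia> uint16('µ')   # what option-m on a Mac produces: MICRO SIGN
0x00b5

julia> uint16('μ')   # GREEK SMALL LETTER MU
0x03bc
```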
It would be good if Julia internally stored a table of easily confused Unicode codepoints, i.e. homoglyphs, and used it to help prevent these sorts of confusions. Three possibilities are:

1. `foo not defined` exceptions could check whether a homograph of `foo` is defined and let the user know if so.
2. The parser could emit a warning when homographic identifiers appear in the same file.
3. The parser could canonicalize homoglyphs, treating them as the same identifier.

My preference would be for the third option. I don't see any useful purpose being served by treating μ and µ as distinct identifiers.
+100 for this. Any strategy for ensuring that homoglyphs are merged seems like a big improvement to me.
+1 for canonicalizing everything.
(We should probably also normalize the Unicode identifiers, in addition to canonicalizing homoglyphs.)
One possible software package that we could adapt for this is utf8proc, which is MIT-licensed and fairly compact (600 lines of code plus a 1 MB data file). It looks like it does Unicode normalization, but not homograph canonicalization (except for a small number of special cases?). (Update: looks like it handles homoglyphs for us after all.)
+1 for canonicalization _and_ normalization.
We certainly don't want the same disambiguation issues with combining diacritics and nonprinting control characters (like the right-to-left specifier). The Unicode list contains quite a few characters with combining diacritics already; I'm not sure it's exhaustive, though.
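For instance, the same accented letter can arrive as one codepoint or as two (a small sketch, with the codepoints written as escapes):

```julia
# Two encodings of a visually identical "è":
"\u00e8"     # precomposed: LATIN SMALL LETTER E WITH GRAVE
"e\u0300"    # decomposed: "e" followed by COMBINING GRAVE ACCENT
```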
Actually, it looks like the utf8proc library completely solves this problem, because it implements (among other things) the standard "KC" Unicode normalization which canonicalizes homoglyphs.
I just compiled the utf8proc library and called it from Julia via:

```julia
function snorm(s::ByteString, options=0)
    # utf8proc_map writes a pointer to the newly allocated result into r[1]
    r = Ptr{Uint8}[C_NULL]
    e = ccall((:utf8proc_map,:libutf8proc), Int,
              (Ptr{Uint8}, Csize_t, Ptr{Ptr{Uint8}}, Cint),
              s, sizeof(s), r, options)
    # negative return values are utf8proc error codes
    e < 0 && error(bytestring(ccall((:utf8proc_errmsg,:libutf8proc),
                                    Ptr{Uint8}, (Int,), e)))
    return bytestring(r[1])
end
```
and then

```julia
julia> s = "µ"

julia> uint16(snorm(s)[1])
0x00b5

julia> uint16(snorm(s, (1<<1) | (1<<2) | (1<<3) | (1<<5) | (1<<12))[1])
0x03bc
```

works (the second argument is various canonicalization flags copied from the `utf8proc.h` header file).
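For readability, those magic bits could be given names mirroring the constants in `utf8proc.h` (values copied by hand from the header; worth double-checking against your version):

```julia
# Option flags from utf8proc.h (assumed values; verify against the header):
const UTF8PROC_STABLE  = (1<<1)   # only apply stable Unicode mappings
const UTF8PROC_COMPAT  = (1<<2)   # replace compatibility characters
const UTF8PROC_COMPOSE = (1<<3)   # compose characters (NFC/NFKC)
const UTF8PROC_IGNORE  = (1<<5)   # strip "default ignorable" codepoints
const UTF8PROC_LUMP    = (1<<12)  # lump certain look-alike characters together

# The call above is then equivalent to:
nfkc(s) = snorm(s, UTF8PROC_STABLE | UTF8PROC_COMPAT | UTF8PROC_COMPOSE |
                   UTF8PROC_IGNORE | UTF8PROC_LUMP)
```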
Moreover, the utf8proc canonicalization functions (including Unicode-aware case-folding and diacritical-stripping) would be useful to have in Julia anyway. I vote that we just put the whole utf8proc into `deps` and export some version of this functionality in `Base`, in addition to canonicalizing identifiers.
Awesome, thanks for doing the legwork on this.
That sounds like a really good idea to me.
KC has one case that we probably don't care about but seems worth mentioning: superscript numerals will be normalized to normal numerals. (We probably don't care because why would you have superscript numerals in a numeric literal, but this seems like the sort of thing to be abused in a future International Obfuscated Julia Coding Contest.)
That's not totally ideal; `N²` is a cute variable name :)

I've actually used `χ²` somewhere.
We also have to avoid normalizing out different styled letters that represent different symbols in mathematics.
The problem with `²` is that it seems to mean `^2`, so maybe it's better not to encourage it.
@JeffBezanson may be referring to what UAX #15 calls font variants (see Fig. 2). They give as an example `\mathfrak H` vs `\bbold H`, but I suspect regular `\phi` vs script `\varphi` is the one that would come up fairly often. (Ironically, GitHub won't let me enter the characters...)
So it seems that we are leaning toward canonical equivalence, as opposed to full compatibility equivalence, in which case NFD may be sufficient rather than NFKC.
For variable names, I don't see superscripts/subscripts being as much of a problem, other than that, e.g., `χ²` will be the same identifier as `χ2`; if you are distinguishing these, I might think you were mad.
Our use case is very different from something like a text formatter, which wants to know that superscript 2 is a 2. In a programming language any characters that look different should be considered different. We can perhaps be flexible about superscripts, but font variants of letters have to be supported.
The initial issue raised involved confusion over U+00B5 MICRO SIGN and U+03BC GREEK SMALL LETTER MU. Normalization type NFD would not fix this problem since U+00B5 has only a compatibility decomposition to U+03BC and not a canonical decomposition. NFKC will fix that issue. The utility at http://unicode.org/cldr/utility/transform.jsp?a=Any-NFKC%0D%0A&b=µ is useful for this.
@JeffBezanson, I'm not convinced that "characters that look different should be considered different." One problem is that, unlike LaTeX, we cannot rely on a particular font/glyph being used to render particular codepoints. U+00B5 and U+03BC look distinct in some fonts (one is rendered italic) and not in others, for example. Moreover, even when codepoints are rendered distinctly, the difference will often be subtle (χ² versus χ2) and hence an invitation for bugs and confusion. (That's why these variants work for phishing scams, after all.)
I would prefer to simply state that identifiers are canonicalized to NFKC, so that only characters that look entirely distinct (as opposed to potentially slight font variations) are treated as distinct identifiers. It's useful to have variables named `µ` and `π`, but Julia shouldn't pretend that it is LaTeX.
There are several different levels of distinction being discussed:

- μ vs. µ and other things listed here that are semantically distinct but will often cause confusion and frustration due to very similar rendering.
- χ² vs. χ2.

These call for different approaches. To deal with "indistinguishables" it's pretty clear that we should just normalize them. At the other end of the spectrum, this is a pretty lousy way to deal with "weak confusables" – imagine using both χ² and χ2 in some code and being really confused when they are silently treated as the same identifier! For weak confusables, I suspect the best behavior is to treat them as distinct but emit a warning if two weakly confusable identifiers are used in the same file (or scope). In the middle, strong confusables are a tougher call – both automatically normalizing them to be the same (like with indistinguishables) and warning if they appear in the same file/scope (like weak confusables) are reasonable approaches. However, I tend to favor the warning.
I've intentionally avoided Unicode terms here to keep the problem statement separate from the solution. I suspect that we should first normalize source to NFD, which takes care of collapsing "indistinguishables". Then we should warn if two identifiers are the same modulo "compatibles" and "confusables". That means that using composed and uncomposed versions of è in the same source file would just silently work – they mean the same thing – but using both χ² and χ2 or ﬃ and ffi in the same file would produce a warning and then proceed to treat them as distinct.
@StefanKarpinski Good summary! But I think you have the wrong conclusion.
I was once challenged to find out why `10l` would compare unequal to `101` in a C program (it was more elaborate than that), but because of the font I could not find the bug.
My preference would definitely be for Julia to consider all possibly ambiguous characters equal, and to give a warning/error if someone uses identifiers that are considered equal because of rules 2 and 3. I do not read Unicode codepoints, and I do not have a different word for ﬃ and ffi, and I can't even see the difference when I am focused on logic. To me, programming is about expressing ideas, and using both ﬃ and ffi as different variables in the same scope would be the worst offence to any code style guide.
Well, that's why it should warn. Whether it considers them the same or different is somewhat irrelevant when it causes a warning. I guess one benefit of considering such things the same rather than keeping them different is ease of implementation: if the analysis is done at the file level, you can canonicalize an entire source file and warn if two "confusable" identifiers are used in the same source file and then hand the canonicalized program off to the rest of the parsing process without worrying any further. Then again, you can do the same without considering them the same by doing the confusion warning at the same step but leaving confusable identifiers different.
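A rough sketch of that file-level check (hypothetical helper; assumes an NFKC-style normalizer like the `normalize_string` used elsewhere in this thread):

```julia
# Sketch: warn about distinct identifiers in one file whose NFKC forms collide.
function warn_confusables(idents)
    seen = Dict{String,String}()   # canonical NFKC form => first spelling seen
    for id in idents
        canon = normalize_string(id, :NFKC)
        if haskey(seen, canon) && seen[canon] != id
            warn("'$(seen[canon])' and '$id' are confusable identifiers")
        else
            seen[canon] = id
        end
    end
end
```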
As a practical matter, it is far easier to implement and explain canonicalization to NFKC, taking advantage of the existing standard and utf8proc, than it would be to implement and document our own nonstandard normalization. (There are a _lot_ of codepoints we'd have to argue over.)
We can also certainly issue a warning whenever a file contains identifiers that are distinct from their canonicalized versions. (But I think it would be an unfriendly practice to issue a warning _instead_ of canonicalizing.)
It seems unfortunate to me to canonicalize distinct characters that Unicode provides specifically for their use in mathematics.
Should we use a different normalization, maybe NFD, for string literals?
I don't think string literals should be normalized at all by default, although we should provide functions to do normalization if that is desired. The user should be able to enter any Unicode string they want.
+1 for what @stevengj said. There's something to be said for preserving user input as much as possible. (What if the user wants to implement a custom normalization, for example...)
Just to be perverse, let's say we normalize to NFKC, and Quaternions.jl gets renamed ℍ.jl. Then `using ℍ` would look for `.julia/H/src/H.jl`?
I've actually rampantly made the assumption that package names are ASCII, largely because I think it's opening a whole can of worms to use non-ASCII characters in package names.

I'm much more concerned about identifier names. I don't think merging ℍ and H makes sense for us.
@stevengj – what about the χ² vs. χ2 issue? Your proposal silently treats them as the same, which strikes me as almost as bad as the (thus far hypothetical) problems we're trying to avoid here.
Actually, no, it's worse – at least you can look at the contents of your source file and discover that two similar looking identifiers are actually the same. If χ² and χ2 are treated as the same identifier, there's no way to figure it out short of finding the obscure appendix of the Julia manual that explains this behavior. I find that unacceptable.
I would like to point out that (on my Mac) even the strongly confusable symbols render noticeably differently. Swapping one for the other would maintain meaning, but would lose a significant amount of typographic readability.
I agree that this normalization should only apply to symbols (variable names), and I think it should only apply to indistinguishables. Hopefully nobody tries to use X2, χ² and χ2 in their code, in much the same way as avoiding similar words (like I vs l) is a good idea.
Everyone agrees that you shouldn't use both `Ill1I1` and `Il1IlI` as variable names, but nobody thinks a language should silently canonicalize them to the same thing. That seems to be what @stevengj is arguing for.
Yes, I think Julia should canonicalize ℍ to H internally. You are free to use ℍ as a variable name if you want, you just aren't free to use it as a variable distinct from H. Why is this such a loss for the language?
Conceptually, this is quite a familiar thing. If I use a syntax-highlighting text editor, it might change the font of certain variables. No one thinks that this changes the meaning of the identifiers.
To ordinary programmers (as opposed to Unicode geeks), a µ is a μ. I shudder to think of trying to explain this distinction to my students. (In contrast, everyone understands that I and l and 1 are distinct characters even though they look similar.)
That's not the part that's problematic. The problem is doing it silently. If you happen to have an editing environment where ℍ and H are obviously quite different, then it is completely surprising – in a way that's impossible to discover the cause of – that they are treated as the same identifier. That is not ok.
I don't know that it's surprising. My reaction would be _Oh, it treats different fonts as the same identifier. I guess that makes sense._ Because to ordinary people, ℍ and H are the "same character" in different fonts. (And if you're a Unicode nerd, you know about normalizations. But the vast majority of scientific programmers are not Unicode nerds.)
@stevengj I don't think that would be your reaction. You wouldn't even have considered that ℍ and H had anything to do with one another. Without a warning, you wouldn't even notice that two different identifiers are considered identical. See this potential example:

```julia
julia> ℍ = 2

[many complex lines of code]

julia> H = 1

julia> ℍ
1    # WTF?!
```
Yes, exactly. That's really not ok. If there's a warning, then you know something bad is going on.
In whatever editor/IDE you write your code, you use the _same_ font for all the code in the same window (you might change the font, of course, but the changed font applies to every character in your working area). I would never expect the editor to use font A for one variable while using font B for another. Therefore, I would expect the same name to appear exactly the same in my editor -- when they look different, they are different.
Here are my two cents:

I've never encountered such problems in real coding practice, but I understand that this may become a concern in particular contexts. For such cases, I think a better way might be to provide tools to detect identifiers that look strikingly similar and modify them with the code author's approval.

Blindly treating two identifiers as the same thing just because they may look similar (_e.g._ H and ℍ) is, to me, a recipe for disastrous confusion.
That being said, if two characters _always look the same_ and there are virtually no ways to distinguish them visually, it might be safe to canonicalize them. But we should be conservative about this.
I wonder if Julia is really the first programming language to face such issues? I guess that many languages still stick to ASCII identifiers to be safe. I know that Java has Unicode identifiers, but my quick googling only turned up heated debates on whether to use Unicode identifiers at all.
The Fortress programming language uses Unicode extensively, but even they have had absolutely nothing to say about normalization issues in the language specification. (pdf) From what I can tell, one usually codes the symbols as ASCII identifiers rather than inputting them directly.
The bold, italic, and sans-serif attributes in the mathematical variants of mu do not represent different fonts. In fact, each has a unique Unicode code point.

On the other hand, the editor may substitute a character from a different font if the requested character is not available. To be specific, in Xcode I use Monaco in the editor. If I insert a Greek letter mu U+03BC, the editor actually uses the mu from Lucida Grande because that character is not available in Monaco.
Yes, that's right. Unicode doesn't care about fonts; it provides differently-styled letters precisely because they are used as distinct symbols in mathematics. If it weren't for that use case (which is our use case), those characters wouldn't exist.
Many in the lisp/scheme world argue for case-insensitive identifiers because to them letter case is just a personal style choice, with the same character underneath. For example some people like to name functions in all-uppercase where they are defined and otherwise use lowercase. However, those people are wrong.
Just to be clear, the mathematical variants of mu (bold, italic, sans-serif) are distinct Unicode code points and can be present in the same font. On the other hand, a code editor might borrow a character from another font if it is not available in the requested font. Xcode does this.

By the way, I looked more carefully at micro versus mu, and in Xcode's default font Menlo, they appear to be identical. I don't mean similar. I mean that at 288 point on the screen, overlaid on top of each other, they look identical.
I understand that different codepoints have nothing to do with choosing different fonts. I just think most people will perceive them as different fonts of the "same character".
Whatever various standard normalizations might say, I think there is a real distinction between characters that are truly identical (like the two mus), and characters that are the same abstract letter but intended to look quite different, like H vs. double-struck H. "Same character" is of course subjective and depends on the application, but in math double-struck letters are decidedly different symbols with different meanings.
One reasonable solution would be to restrict the set of characters in identifiers to a documented subset of Unicode. Allowing arbitrary characters in identifiers seems to be inviting problems.
That would need to be a fairly large subset, though - what's the point of Unicode identifiers if you don't support the various languages of the world?
I would prefer restricting identifiers to ASCII characters.
For maximum portability, better just limit it to uppercase. No, to be really safe, letters A-E only.
> For maximum portability, better just limit it to uppercase
Well, one to six letters ought to be enough to name every variable you could possibly want.
For the greatest portability, we should treat all unicode letters the same and distinguish variables only by their length.
More seriously, though, we are not in a good place right now. The majority opinion (or maybe just mine) is that neither NFC nor NFKC is entirely suitable. The former will not normalize Greek mu μ and micro µ, while the latter would normalize ℍ and H, and χ² and χ2.
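To make the tradeoff concrete, here is what each form does to the contested pairs (a sketch using a `normalize_string`-style API as in the snippets below; expected outputs per the Unicode tables):

```julia
normalize_string("µ", :NFC)    # => "µ"   MICRO SIGN survives NFC
normalize_string("µ", :NFKC)   # => "μ"   folded to GREEK SMALL LETTER MU
normalize_string("ℍ", :NFKC)   # => "H"   double-struck capital folded too
normalize_string("χ²", :NFKC)  # => "χ2"  superscript two becomes a plain digit
```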
At this point, I would suggest NFD/NFC by default, because I'm pretty sure we don't want to mess with combining diacritics regardless, and print warnings if NFKD-equivalent identifiers exist in scope. (NFD may be sufficient, since we don't necessarily need to recompose the Unicode string for an identifier name, although introspection would be less pretty.)
The other choice is a custom canonicalization...
Normalizing ℍ and H, and χ² and χ2 may not be a real issue if 1) the user does not have to see nor use the canonical form, and 2) a warning is printed when both are used in the same context.
That's exactly what I proposed. A good first-order approximation of my proposal is:

1. Normalize all identifiers to NFC.
2. Print a warning when two distinct identifiers in the same file are NFKC-equivalent.

There may be additional character equivalences that should trigger warnings, but we can add those as they come up.
I would imagine that it would be easier to raise an error if different symbols "canonicalize" equal (Stefan's second point), rather than give a warning and continue. There also does not seem to be a unanimous opinion on whether we should merge the variables or keep them separate, and making it an error solves that problem. If you get a warning you should fix it anyway, and I would say sooner is better than later for that kind of thing.
I agree with Stefan's step (1), but I truly don't understand the problem of ℍ vs. H. I don't think anybody uses a font that renders these the same. For program source, the default should be to keep different things different. Attempting to apply all sorts of knowledge about what is and isn't "the same letter" puts a programming language in the linguistics business, where it does not belong.
I think we should do (1) and wait and see if there are actually ever any issues that necessitate (2).
@IainNZ's blog post, for future reference: http://iaindunning.com/2014/julia-unicode.html. In particular, I think it will be useful to consider Python's corresponding discussion: http://legacy.python.org/dev/peps/pep-3131/.
@iainnz, Python 3 uses NFKC, I think:

```
~$ python3
Python 3.2.3 (default, Feb 20 2013, 14:44:27)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> µ = 1
>>> μ
1
```
Yes indeed, just found this too: http://docs.python.org/3/reference/lexical_analysis.html#identifiers
I still don't think that's the right choice. Instead, you should make it an error to use two different NFKC-equivalent identifiers in the same scope.
Note that NFKC would also fix #5903, I think.
Indeed it does, it's just that we still don't really want blanket NFKC normalization:

```julia
julia> normalize_string("：", :NFC)    # fullwidth colon
"："

julia> normalize_string("：", :NFKC)
":"
```
It might be useful to read through the Python discussions on why they chose NFKC, and explicitly discussed and rejected the possibility of flagging compatible characters as an error. It seems that for users of several non-English languages, it is actually quite difficult in practice to avoid cases of the "same" identifier in NFC-inequivalent forms, e.g. in Japanese or in Korean or in Serbian and Croatian, and supporting users in these languages was a strong motivating factor in their decision (see the conclusion of the linked thread). The example of the punctuation characters from #5903 is yet another one of these unintentional inequivalencies for non-English users. (In these cases, giving an error as @StefanKarpinski suggests, or even just a warning, would be a huge headache: one of the linked authors wrote, "as a daily user of several Japanese input methods, I can tell you it would be a massive pain in the ass if Python doesn't convert those, and errors would be an on-the-minute-every-minute annoyance.")
I really think that using NFKC has far more advantages (avoiding extreme confusion in the many many cases where NFC-inequivalent identifiers are typically read as equivalent by mundanes) than disadvantages (treating e.g. H and ℍ as the same identifier).
FWIW, I agree with @stevengj.
That is compelling, but silently equating identifiers that look different also has real problems. Identifiers are not human language words; a programming language should not have an opinion about what sequences of characters are linguistically the same. Some in the lisp/scheme world want letter case not to matter, for example, but then it is pointed out that in some languages case is more significant than it is in English. I know I am glossing over a lot here, and there is enough complexity out there to provide counterexamples for almost everything, but I feel that generally making different characters different is the most transparent and avoids the most debates.
@JeffBezanson, programs are read and written by humans, not just machines. And the language doesn't need its _own_ opinion on the subject of which characters are linguistically the same, because the Unicode standard has helpfully decided these issues on a character-by-character and language-by-language basis for us already.
Folding characters that look totally different doesn't necessarily help readability. It seems like we need more normalization forms to pick from. I think many strange things would happen, like treating superscripts as normal digits.
Superscripts would only be counted as digits when they are part of identifiers, not as numbers. As someone else wrote in one of these threads, I seriously question the sanity of someone who uses χ2 and χ² and χ２ as distinct identifiers in the same scope.
It's a tradeoff. NFKC does indeed fold some characters that look fairly distinct, like H and ℍ. But it also folds lots of characters that look extremely similar. I think the benefits of the latter outweigh the drawback of the former.
(I suppose we could do NFKC + a small list of exceptions that are treated as distinct, if NFKC caused too many problems...but I don't hear anyone complaining about it in Python3.)
We should also probably be more strict about what codepoints are allowed in identifiers, similar to Python. Currently you can do things like:

```julia
julia> ² = 1        # superscript two as an identifier
1

julia> 2²           # parses as juxtaposition: 2 * ²
2

julia> ２ = 1       # fullwidth two as an identifier
1

julia> 1 + ２
2

julia> –3 = 3       # en-dash followed by 3 is a single identifier
3

julia> -3 + –3
0
```
Yes, we should certainly restrict identifiers more. That will actually help a lot, since it will prevent punctuation in identifiers, making problems like fullwidth `＝` more transparent.
I don't think restricting identifiers somewhat, while a good idea in itself, will really solve the problem posed by input in other languages. Instead of `b＝3 not defined` you will get `b＝3 is not a valid identifier` or similar.
I'm re-opening this issue, because the discussions of the Python developers, combined with things like #5903, are beginning to make me feel strongly that NFKC would do much more good than harm when you take into account non-English scripts.
If we are going to come up with a custom normalization (as was suggested in #5903), it seems better to do NFKC+exceptions than NFC+exceptions, since I suspect that there are far more codepoints that NFC leaves distinct but that we would want to treat as equivalent than codepoints that NFKC merges but that we would want to keep distinct. But I think we should do NFKC and wait for real-world complaints before adding exceptions, whereas we already have real-world complaints about NFC.
And we really should finalize the normalization question before 0.3 (rather than switching to NFC in 0.3 and NFKC at a later date), since this is purely a policy decision (the code change to use NFKC is a trivial one-line patch).
Unlike Python, we have first-class symbols. I don't think we want different symbols referencing the same memory (very bad for type inference and macros), nor do we want to completely normalize the symbols (bad for readability, error messages, and output).
If we do this it will have to be very early, at parse time.
I'd like to see some more examples where NFC canonicalization is insufficient. What does it mean that different input methods give different representations? Are we talking about things that look alike but are encoded differently? Shouldn't that be handled by NFC? If that's not the case, then there really is a missing canonicalization level in the standard.
@vtjnash, you can already create symbols that are not valid identifiers by `symbol("....")`, including non-NFC symbols. Everyone agreed that we should do normalization only of parsed identifiers, and the only question is NFC (implemented now) versus NFKC. As @JeffBezanson alluded to, this is already being done at parse time for NFC, including for `:symbol` literals; the implementation would be a one-line patch to change NFC to NFKC.
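For example (a sketch; assumes the parse-time NFC normalization just described, while `symbol` performs none):

```julia
# "é" can be encoded decomposed ("e" + combining acute) or precomposed (U+00E9).
s1 = symbol("e\u0301")    # constructed directly: parser never sees it, stays decomposed
s2 = :é                   # parsed identifier: NFC-normalized to the precomposed form
s1 === s2                 # hypothetically false: the byte sequences differ
```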
@StefanKarpinski, read the examples I linked above from other languages in the Python discussion, where depending on the input method you may get e.g. a "fullwidth" or "halfwidth" version of a character, or you may get a ligature instead of two separate characters. Or simply see #5903. Or see the Unicode confusables list, which includes the seven(!) different variants of μ. Kompatible characters are often rendered slightly differently, e.g. ２ vs. 2 or µ vs. μ or ﬁ vs. fi, but the differences can be extremely subtle (and depend strongly on the font!). Typical users don't distinguish between them semantically _even when they look slightly different_ (e.g. would you treat "ﬁnally" and "finally" as different words?).
As I said, this is a tradeoff. Using NFKC will render some characters as equivalent that have fairly distinct glyphs in common fonts, e.g. H vs. ℍ. But this disadvantage needs to be weighed against the huge number of extremely similar renderings for distinct codepoints that are generated as a side-effect of input in many non-English languages, which we _want_ to treat as equivalent (#5903). If we use NFC, we will almost certainly need to add special exceptions for at least some of these characters. If we use NFKC, we can add special exceptions if there is a desperate real-world need to use (e.g.) H and ℍ as distinct identifiers in the same scope, but I think we'll need a lot fewer exceptions (if any).
It's important to remember how much you are relying on the font if you expect Kompatible codepoints to render differently (even if only slightly). On my browser, "ﬁnally" and "finally" are indistinguishable in the regular text font, but are distinct in the code font: `ﬁnally` versus `finally`. Conversely, µ and μ are distinct in the text, but are nearly indistinguishable in code: `µ` vs. `μ`. (Of course, you could design a font in which even "a" and "b" render identically, but that is not a practical concern.) But my biggest concern is the difference in codepoints generated by non-English input methods.
To agree with Steven, I think we should be optimizing for the least distinguishable font that might plausibly be used.
The visual distinguishability argument of various font renderings could also be applied to `1` and `l` and `I` (1, l, I), and to a lesser extent to `0` and `O` (0, O). This is not a Unicode-specific problem at all, but somehow we've all become inured to it when the characters are familiar to us. Should we thence argue for canonicalized equivalences amongst these sets of characters as well? Surely everyone here would acknowledge how ridiculous this sounds. Font designers have become wiser about visually disambiguating these ASCII characters over time. Maybe we ought to tell people to just use better fonts.
I agree with that, but allowing code that uses seven different variants of µ and just pretending they're all the same is crazy. It may be a pain when you're using different input modalities, but that shit needs to be standardized in the code, not just inside the parser. That's why I'm suggesting automatic normalization to NFC and an error on NFKC collisions. We can also provide a tool that will map all NFKC variants in a source file to one of them (the first that occurs?), but just allowing a crazy mix doesn't seem like a good idea. Note that we can easily go from considering NFKC collisions to be an error to normalizing them to be the same, whereas you cannot go in the other direction without breaking people's programs.
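A rough sketch of what that rewriting tool could look like (purely hypothetical; a real version would use the parser's tokenizer rather than this naive identifier regex):

```julia
# Rewrite every NFKC-equivalence class of identifiers in `src` to the
# first spelling encountered. Illustrative only.
function unify_confusables(src::String)
    first_seen = Dict{String,String}()    # NFKC form => first spelling seen
    replace(src, r"[\pL_][\pL\pN_!]*", id -> begin
        canon = normalize_string(id, :NFKC)
        get!(first_seen, canon, id)       # reuse the first spelling for this class
    end)
end
```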
> To agree with Steven, I think we should be optimizing for the least distinguishable font that might plausibly be used.
So Wingdings?
I found the discussions linked to by PEP 3131 quite insightful. I've personally experienced the kind of character wrangling that goes on in Japanese, which most Japanese appear to accept stoically as part of the computing experience. The half-width vs. full-width characters seem especially thorny.
Another case to consider: @JeffBezanson and I personally find the code point tables like the U+3300-series (pdf) quite horrific in this regard. Does Japanese _really_ need a single Unicode character for ㍇ = マンション (apartment, _lit._ "mansion") or ㌖=キロメートル (kilometre)? Apparently the Unihan people decided that they did. These things are _definitely_ visually distinguishable, yet allowing both identifiers in the same code seems to just want to invite trouble.
1 and I don't have the same semantic meaning to a native reader, hence we learn from an early age to distinguish them. "ﬁnally" and "finally", on the other hand, have the same meaning, so we learn from an early age _not_ to distinguish them. Do we really want to fight that?
And, as was discussed in the Python list, the problem with giving errors is that this could easily produce a continual stream of errors.
The argument for relying on visual distinguishability due to font renderings can be restated as: if there are two or more characters that are visually indistinguishable, then we should treat them as equivalent. This places a serious burden on the particular choice of rendering engine, font and so on, which is simply not within the ability of a programming language to control.
Furthermore, there is an assumption built into this argument that the same character can have only one visually distinguishable rendering. This is very much false. Modern fonts that support OpenType features are free to implement all sorts of rendering details, like font variants for the same characters (example) and discretionary ligatures (examples), that _may or may not_ be generated by _in situ_ font substitutions with the ligature characters @stevengj has used as examples.
The same code point can even have multiple visually distinguishable renderings, and this is officially sanctioned by the Unicode standard. The U+3400-series code table (large pdf) spells it all out in gory detail. The Unihan folks decided to _not_ explicitly map code points to Han (Chinese) characters, but rather do so indirectly by defining mappings of Unicode code points to code points in _older_ standards. The result is that code points like U+34D8 exist, where there are characters that can be rendered with or without an additional dot _at the rendering engine's discretion_. And yes, a single dot can be semantically very important - consider 大 (big) and 太 (great). Even more subtle are the different characters 已 (already), 己 (self), and 巳(ordinal 6), which differ only in the length of the second downward stroke (picture). And despite this, Unicode allows for this level of rendering variability which is sometimes semantically very important and at other times not at all.
The semantic argument is even more nefarious: there are multiple code points that are semantically equivalent in ways that go far beyond the mu vs. micro problem. Unihan (UAX 38) defines an additional layer of equivalence for the Han code points to deal with so-called semantic variants. It turns out that Unihan defines _two_ such equivalences, for partial semantic overlap (e.g. 井, water-well vs. 丼 food bowl, or also well) and complete semantic overlap (e.g. 兎 and 兔, both meaning rabbit) where the different characters would be interpreted by many native Chinese speakers to be equivalent written alternatives. These characters are _not_ NFKC-equivalent; should we then also be in the business of canonicalization by semantic equivalence as well? Where does the madness end?
I think people from CJK countries are used to using English letters as variable names; CJK characters are mainly used in strings. This is because it's not convenient to switch between English and CJK IMEs. The only habit-changing thing from Julia would be that Greek letters can be used as names, which makes maths easier to read.

I agree with @stevengj that similar letters should be dealt with when they are used as variable/function names, and so should fullwidth math signs, letters, and punctuation. Just leaving them as they are in strings would be sufficient for me.
> If we do this it will have to be very early, at parse time.
Of course. But I think, for sanity, that it needs to be a universal property of symbols. Being able to make symbols through `symbol()` that can't be made directly is different from passing `symbol()` exactly the same sequence of bytes and getting back a different variable than if it went through the parser directly (and vice versa). If symbols aren't always in canonical form, then it seems that getfield and function argument splatting could also be problematic.
I think all symbols should be normalized the same way regardless of how they are entered. While it makes sense to me to treat different ways of writing the same unicode character as equivalent, it doesn't make sense to me to treat different unicode characters as sometimes equivalent.
> To agree with Steven, I think we should be optimizing for the least distinguishable font that might plausibly be used.
I disagree. The beauty of having unicode identifiers is in being able to use them freely, even if it means you need to upgrade your tools.
Normalizing symbols won't fix #5903, since by the time the parser has decided it is a symbol, it is too late to redefine it as a separate operator. Instead, I think it is more akin to the question of whether arbitrary expressions can be used as infix operators. Since they can't, it is a limited subset of operators that would be affected by allowing full-width alternatives to the half-width punctuation. Therefore, I believe that it is reasonable to make that modification without resorting to full NFKC for all symbols.
Somewhat unrelated, but I would require that all code be normalized to the standard ASCII half-width operators for pull requests to any of my repositories. Even if the variants are defined to work identically and differ only slightly visually, it poses a maintenance hazard if find/replace doesn't see them that way.
As far as I can tell, no one is opposed to NFC normalization, so we should probably do that. For anything beyond that, perhaps we should wait until we have more input from users in languages that utilise non-latin character sets, since these are the parties most affected. As a monoglot, I have no real opinion as to what would be best, but I suspect the answer could be different for different languages.
I think @vtjnash may have hit upon a good solution, at least in the interim: provide recommended guidelines, along with a script for testing whether code satisfies those guidelines, which could be used as an appropriate git hook or incorporated into Travis tests.

This could be enforced for Base and other JuliaLang repos, but if people really want to use two different mus in their own code, then they can. Moreover, these guidelines could later be amended based on feedback without breaking existing code.
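A minimal sketch of such a check script (hypothetical; it only enforces the least controversial guideline, that each source file already be NFC-normalized):

```julia
# check_nfc.jl -- exit nonzero if any listed source file is not in NFC form.
# Usage, e.g. from a pre-commit hook: julia check_nfc.jl src/*.jl
failed = false
for file in ARGS
    src = readall(file)
    if src != normalize_string(src, :NFC)
        println(STDERR, "$file: contains non-NFC-normalized text")
        failed = true
    end
end
failed && exit(1)
```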
I concur with @simonbyrne that we keep NFC and be cautious about going beyond.
A guideline about choosing names is, to me, better than forcing a controversial behavior (_i.e._ quietly treating identifiers that look noticeably different as identical).
In terms of Asian full-width characters, I think it might be better to raise an error (or at least a warning) when people use ＝ (fullwidth) instead of = (normal). I believe that a better approach is to detect such cases and give warnings (this way we also encourage programmers to use symbols consistently) than to quietly do the job (not necessarily in a correct way) by _guessing_ the author's intention.
I don't claim that NFKC is perfect; it is certainly possible to write obfuscated code even in ASCII. Just that it will cause far fewer problems than the alternative of NFC. The fact that there is no perfect solution is not an argument that we should do nothing. NFKC is a widely accepted, standardized, and continually updated way of normalizing strings so that different input methods generally (if not always) produce the same codepoints, and so that many (even if not all) codepoints with slightly different renderings but similar meanings are identified with one another. Losing the ability to use ＝ and = as distinct identifiers seems like a small price to pay for this benefit. Why is NFC a better choice?
@vtjnash, the question of whether `symbol(foo)` should be normalized too is orthogonal to this discussion, since the same question applies to NFC.
Ok, "do nothing" might not be the best solution, but it does have the nice property of being very transparent. You can see what's going on just by looking at code points and seeing that they are different. Similarly, erring on the side of treating identifiers as different will tend to produce not-defined errors, while silently equating identifiers will tend to produce subtle bugs.
Probably almost nobody has a good intuitive grasp of what NFKC does. If it were really true that it specifically targeted differences due to input method, that might be valuable, but instead it strikes me as a giant random list of equated code sequences.
Moving, as too contentious to block 0.3.
This is why I've been arguing for an error. Our general philosophy is that if there's no obvious one right interpretation of something, raise an error. NFC is fine-grained enough that we can be sure that NFC-equivalent identifiers are meant to be the same. NFKC is coarse-grained enough that we can be sure that NFKC-distinct identifiers are clearly meant to be different. Everything between is no man's land. So we should throw an error. Otherwise, we are implicitly guessing what the user really meant. Not canonicalizing to NFKC is guessing that distinct identifiers are actually meant to be different. Canonicalizing to NFKC is guessing that distinct but NFKC-equivalent identifiers are meant to be the same. Either strategy will inevitably be wrong some of the time.
> NFKC-distinct identifiers are clearly meant to be different
Only if you exclude Chinese (Unihan) characters; I've already provided counterexamples.
I'm willing to say that if Unihan has decided to ignore the standards on this matter, that is not our problem.
That sentence is illogical; Unihan _is_ part of the Unicode standard. You can say that the standard is inconsistent. All I'm saying is that none of the arguments I have heard in favor of NFKC are actually sufficient to cover the corner cases in Unihan.
Unless there's some even coarser equivalence standard that works for Unihan as well, NFKC is the best we've got and we're not going to get into the business of deciding what Unicode characters should or shouldn't be considered equivalent. If there isn't such a standard, then the mismatch between Unihan and NFKC is the Unicode consortium's problem as I said, not ours.
I agree with @StefanKarpinski: there's not much to win by silently normalizing identifiers using NFKC. If we report an error/warning, people will notice the problem early and avoid much trouble. Julia IDEs will be made smart enough to detect cases where two identifiers are equal after NFKC normalization, and will suggest that you adapt automatically when typing them. OTOH, if the parser does the normalization, you will never be able to trust `grep` to find an identifier, because of the many possible variants.
I'm just concerned that the ambiguity detection might be silent while you develop your own code, and then throw an error when someone tries to use that code together with something else. I think it would be much better if we can find some (more restrictive) way that reports all the ambiguities independent of what the code is combined with.
Seems like this can be closed.
Perhaps Lint is the right place to catch this.