Julia: Julia doesn't like Pizza

Created on 15 Jul 2013  ·  61 Comments  ·  Source: JuliaLang/julia

julia> x = '\U1f355'
'\U1f355'

julia> charwidth(x)
0

Labels: bug, unicode, upstream

All 61 comments

I nominate this for most interesting issue subject.

In other news, this looks like something to take up with wcwidth; it doesn't even think pizza is printable:

$ cat test.c
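#define _XOPEN_SOURCE 700   /* expose wcwidth() in <wchar.h> */
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
    /* Use the environment's locale; in the default "C" locale wcwidth()
       typically reports -1 for every non-ASCII character. */
    setlocale(LC_ALL, "");
    wchar_t pizza = 0x1f355;  /* U+1F355 SLICE OF PIZZA */
    printf("%d\n", wcwidth(pizza));
    return 0;
}

$ gcc -o test test.c && ./test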
-1

We don't allow negative lengths, so we clip to 0, which seems pretty reasonable to me.
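For readers following along, the clipping described here can be sketched in a couple of lines. This is only an illustration in Python, not Julia's actual implementation; `clip_width` is a hypothetical name.

```python
def clip_width(raw_width: int) -> int:
    """Map a wcwidth-style -1 ("not printable") to 0 so widths are never negative."""
    return max(raw_width, 0)


# -1 is what the C test program reported for the pizza character.
print(clip_width(-1))
print(clip_width(2))
```

The cost of clipping is that an unrecognized but visible glyph is counted as taking no columns, which is exactly the symptom in the original report.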

This prints just fine on my terminal and looks to have a width of two. Not sure how we should handle this given that the C function is wrong.

Revisiting this, it appears to work on OSX Mavericks 10.9, although it reports a charwidth() of 1, so the ending single quote that wraps the character is overlaid on some rounds of pepperoni.

image

Unsurprisingly, Apple seems to be the first to update their Unicode tables. I think our only options here are to rely on whatever libc is available, or use our own Unicode tables. Not sure we want to get into all that.

Is it possible to test some characters during build time and emit runtime warnings if charwidths are known to be wrong?
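A rough sketch of what such a check could look like, in Python for illustration only. The expected widths are the ones quoted elsewhere in this thread; `find_width_mismatches` and `stale_libc_width` are hypothetical names, and the latter is a made-up stand-in for a system `wcwidth` with old tables.

```python
# Spot-check table: code point -> expected column width (values quoted in this thread).
EXPECTED_WIDTHS = {
    0x4352: 2,   # CJK ideograph, wcwidth == charwidth == 2
    0x1F355: 2,  # pizza, per utf8proc
    0x1F428: 2,  # koala, per strwidth
}


def find_width_mismatches(width_of, expected=EXPECTED_WIDTHS):
    """Return (code_point, expected, reported) for every disagreement."""
    mismatches = []
    for code_point, want in sorted(expected.items()):
        got = width_of(code_point)
        if got != want:
            mismatches.append((code_point, want, got))
    return mismatches


def stale_libc_width(code_point: int) -> int:
    """Hypothetical stand-in for a libc whose tables predate the emoji."""
    return -1 if code_point >= 0x1F000 else 2


for code_point, want, got in find_width_mismatches(stale_libc_width):
    print(f"warning: U+{code_point:04X} expected width {want}, system reports {got}")
```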

Eh, what's the point? That's just a warning that people are going to ignore or re-open this issue. If the system libc has the wrong character width for something, then you get mangled garbage. Get a better OS.

The fact that this even exists as a character is a clear sign that unicode allows too many bits :-).

What's the status here? @Keno?

Looks like somebody submitted a patch to glibc _yesterday_ to update their unicode data:
https://sourceware.org/ml/libc-alpha/2014-06/msg00585.html

Did the bump from libutf8proc to libmojibake solve this? #7917.

I don't think the char width problem has been solved yet. @jiahao went through all the codepoints and computed the correct widths, but this information has not made it into libmojibake.

ping @jiahao

The last time I discussed this with @stevengj, we weren't entirely settled on whether the new charwidth function should be submitted as an entirely new function to JuliaLang/libmojibake#2 or as a patch to Julia's existing charwidth.

A correct charwidth would be useful to projects other than Julia, but it would mean adding significant new functionality to libmojibake, which would then cease to be just a lightly updated fork of a minimal Unicode handling library.

I really think it makes sense for it to be in libmojibake. It's in line with the other functionality in there, and won't bloat the library by a large percentage.

I don't actually care much either way, but I'm fine with putting it in libmojibake.

The advantage of putting it in Julia (replacing src/support/wcwidth.c) is that it will still work if someone is using the system utf8proc (which seems not unlikely on e.g. Fedora, especially if the Unicode-7 support in libmojibake gets folded upstream).

Ok, then let's drop it in as our wcwidth.c to fix this issue, and possibly move it to libmojibake later.

utf8proc now includes an up-to-date utf8proc_charwidth function based on @jiahao's analysis (JuliaLang/utf8proc#27), so we can fix this issue by upgrading to the latest utf8proc and using this function instead of wcwidth. (utf8proc's charwidth for U+1f355 is 2.)

We should probably turn the name back to utf8proc here. I'd also slightly prefer going back to using a tarball for it (once 1.2.0 is tagged) rather than a submodule.

+1

I slightly prefer submodules, since the tarballs tend to leave a bunch of old versions littering the deps/ directory when the version is upgraded. Submodules are also somewhat more flexible, since we can link a pre-release version if there is an urgent need (e.g. a bugfix).

Submodules tend to confuse newcomers when versions are upgraded: changing the submodule introduces puzzling diffs after they git pull. It's also a bit messier for packagers who want to use system versions. It should be possible to set UTF8PROC_VERSION to a non-release sha for testing, and GitHub should just make the right tarball for us. Either way though.

This is not quite fixed; the display of charwidth == 2 characters is still off.
screen shot 2015-03-30 at 9 31 29 pm

Unicode tab completion is also still a bit wonky; this is what it looks like for me when entering theta (which has charwidth == 1).

screen shot 2015-03-30 at 9 35 31 pm

screen shot 2015-03-30 at 9 35 44 pm

screen shot 2015-03-30 at 9 35 59 pm

screen shot 2015-03-30 at 9 36 12 pm

It looks like there is something in the LineEdit code that is using the length (in codepoints) rather than the strwidth? For example, strwidth("🐨") == 2 (U+1f428), but if I type 🐨 abcd and then hit Ctrl-a to go back to the beginning of the line, then the formatting suddenly shifts to squish the koala partially under the a (as if it had width 1).
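To make the distinction concrete, here is a small Python sketch contrasting the number of code points with an estimate of the number of terminal columns. It is an approximation built on the East_Asian_Width property from Python's `unicodedata` module, not utf8proc's table, and `display_width` is a hypothetical helper. The result for the koala assumes an interpreter whose Unicode data is version 9 or later.

```python
import unicodedata


def display_width(text: str) -> int:
    """Estimate terminal columns: wide/fullwidth = 2, combining marks = 0, else 1."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks share the column of their base character
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


line = "\U0001F428 abcd"  # koala, space, "abcd"
print("code points:", len(line))
print("columns:", display_width(line))
```

A line editor that positions the cursor by the first number instead of the second ends up one column short for every wide character to the left of the cursor.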

@jakebolewski, the \theta tab-completion example works fine for me. Maybe it is a font issue?

I'm not sure we/LineEdit can do anything about this --- if the terminal or some other system component had incorrect character widths, wouldn't these kinds of problems persist? We don't control the actual display.

Aren't we telling the OS what cells to print each character in when doing line editing?

It actually looks like an iTerm bug: changing to the default OSX terminal results in correct behavior. All fonts in iTerm that I have tried have the same problem.

I suppose that, in theory, we could print charwidth - wcwidth spaces or something like that to compensate for a buggy wcwidth?

Yes, we could print spaces to compensate. I feel like it's not worth the effort though, since there is unlimited scope for downstream display bugs.

If you believe there is an iTerm bug, that should be filed upstream. They are usually very responsive.

@Keno, this ended up not being a bug in iTerm. You need to turn off "treat ambiguous-width characters as double width" in the Text section under preferences.
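For context, theta (U+03B8) carries the East_Asian_Width value "A" (ambiguous), which is the class that preference refers to. The Python sketch below shows how one character can come out as 1 or 2 columns depending on such a policy. How iTerm implements the setting internally is not something this thread establishes, so `width_with_policy` is a hypothetical model only.

```python
import unicodedata


def width_with_policy(ch: str, ambiguous_is_wide: bool) -> int:
    """Column width of one character under a given ambiguous-width policy."""
    category = unicodedata.east_asian_width(ch)
    if category in ("W", "F"):
        return 2
    if category == "A":
        return 2 if ambiguous_is_wide else 1
    return 1


theta = "\u03b8"
print(unicodedata.east_asian_width(theta))
print(width_with_policy(theta, ambiguous_is_wide=True))
print(width_with_policy(theta, ambiguous_is_wide=False))
```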

I'm using Terminal, not iTerm. What confuses me is that 🐨 abcd looks fine when I first type it, and is only wonky when I hit ctrl-a ... this makes me think it might be a bug in LineEdit.refresh_line.

Can you screenshot what you see? I don't quite understand, and I suspect terminal differences might make it hard to discuss without one.

Procedure: paste 🐨, then type abcd. Then type ctrl-a. Screenshot before ctrl-a:
image
and after ctrl-a:
image

I have a sudden urge to reprogram a key on my keyboard to type a koala. Or do I use the snowman more often?

We need tab completion for emoji: \:koala:

I have Base.REPLCompletions.latex_symbols["\\koala"] = "\U1f428" in my .juliarc.jl.

(On a more serious note, I see exactly the same behavior for \Longrightarrow = ⟹ = U+27f9, which also has charwidth == 2 and wcwidth == 1 on my machine.)

Ok, thanks. I can reproduce and take a look.

I suspect what might be happening here is that on input the terminal properly moves the cursor by two spaces but on output it does not.

Blech, I suppose there's not much we can do about it. But iTerm2 seems to have the same behavior, and maybe they can fix it on their end.

But the weird thing is that in bash with the same terminal the results are the same (both wrong) for input and output. What are we doing differently for input than for output?

Well, maybe @gnachman can shed some light on whether that's the intended behavior:
screen shot 2015-03-31 at 4 51 46 pm
(i.e. the double width write only taking up a single cell). It might also be of course that iTerm's unicode tables are out of date. If the behavior is by design I suspect we can just print spaces to make up for it (though we might run into trouble with the space overdrawing the character - can we do a cursor move instead of printing a space?).

I think it's out-of-date/inaccurate Unicode tables. If I use 䍒 = U+4352, which has wcwidth==charwidth==2, then "a\u4352b" prints correctly with a double-width between the a and b, without overwriting, in both Terminal and iTerm2. Maybe iTerm2 can just use utf8proc...

(Of course, it may be by design that the terminals match the buggy system wcwidth, so that column alignment in programs using wcwidth works. That would be an argument for doing a cursor move by charwidth-wcwidth.)

But I still don't understand how it could work okay for input (for us, but not for bash).

I'm against trying to work around it with hacks like spaces or extra cursor moves. The cost is (1) complexity and maintainability, (2) incompleteness (we can't reasonably intercept write data), (3) I can imagine it making matters worse in other, unforeseen circumstances.

I think we can probably work around it by doing a cursor move (writing spaces is problematic, because iTerm's redraw code would draw the space on top of the koala), but it just seems like such an ugly solution. I am entirely fine with telling people to use a terminal with proper unicode tables if there is any chance this can be fixed there.

It looks like Julia uses some combination of EastAsianWidth.txt and advances extracted from GNU Unifont, as described here:

https://github.com/JuliaLang/utf8proc/blob/master/data/charwidths.jl

While it's tempting to adopt the same widths that Julia uses in iTerm2, I'm a little worried about what will break. In order for this rickety system to work, all apps must be in agreement with the terminal emulator about the width of all characters. For example, bash and vim 7.3 think pizza is narrow. Mind you, it looks like crap because they're wrong, but at least interactive editing works.

@gnachman, unfortunately I think there are conflicting goals in Julia and iTerm (or Terminal) here.

In Julia, we need a portable wcwidth replacement because (a) we want consistent cross-platform results for Unicode processing, (b) the Windows wcwidth is broken (16-bit wchar_t), and (c) wcwidth is clearly out-of-date on most other systems. There isn't any clear standard to follow, but the procedure involving Unifont etcetera, after an exhaustive analysis by @jiahao, seemed to be the most reliable practicable approach.
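Point (b) can be illustrated without Windows at hand. Pizza lies outside the Basic Multilingual Plane, so in UTF-16 it takes two 16-bit code units (a surrogate pair), and a function that receives a single 16-bit wchar_t never sees the whole character. A minimal Python sketch, with `utf16_code_units` as a hypothetical helper name:

```python
def utf16_code_units(ch: str) -> list:
    """Return the 16-bit code units that encode `ch` in UTF-16."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


units = utf16_code_units("\U0001F355")
print([hex(unit) for unit in units])
```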

iTerm, however, only has to work on MacOS, and has to work with lots of programs that rely on MacOS's buggy wcwidth (which reports e.g. that pizza is narrow), so it makes a certain amount of sense for it to adopt whatever width the system wcwidth reports, no matter how crappy it looks.

So, we might have to do a cursor move by charwidth-wcwidth if we want it to be both nicely editable and to look nice. Or use wcwidth if we want it to be nicely editable regardless of how it looks.
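A sketch of the first option, in Python for illustration. It emits the standard "cursor forward" control sequence (CSI n C) for the columns the terminal is presumed not to have advanced. `cursor_compensation` is a hypothetical name, and treating a negative wcwidth as 0 is an assumption made here, not something the thread settles.

```python
def cursor_compensation(charwidth: int, wcwidth: int) -> str:
    """Escape sequence that advances the cursor by charwidth - wcwidth columns."""
    shortfall = charwidth - max(wcwidth, 0)  # assumption: treat -1 as 0
    return f"\x1b[{shortfall}C" if shortfall > 0 else ""


# Pizza: charwidth == 2 but the system wcwidth says 1, so move one extra column.
print(repr(cursor_compensation(2, 1)))
# U+4352: both agree on 2, so nothing needs to be emitted.
print(repr(cursor_compensation(2, 2)))
```

This only helps if the terminal really advances by wcwidth columns, which is the open question in the comments above.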

@stevengj iTerm2 does need to interoperate with anything you can ssh, telnet, etc. to, not to mention Julia, so I'm open to giving users a way to opt in to a more sensible wcwidth(). I don't use wcwidth() on the client so I _could_ use utf8proc_charwidth in the right circumstances. Since AFAIK only Julia departs from the standard, there'd need to be a new escape sequence to tell the terminal emulator to switch character-width lookup tables.

OTOH, since Julia is the black sheep in this regard, it probably makes the most sense for Julia to print a space after characters that it treats as wide but wcwidth does not. And deal with cursor movement across them correctly, etc. That'll work with every terminal out there. If a window gets resized it won't wrap correctly, though. Terminal and iTerm2 will both refuse to "break" a fullwidth character into two half-width pieces, choosing instead to move the whole thing to the start of the next line, but that's a small price to pay.

@gnachman If I print a space next to a character, is there a chance the space will get drawn on top of it during a redraw? I think I've seen that behavior in my experiments.

@Keno Yes, that can happen. I'm working on a fix to that issue in my refactor_drawing branch. Feel free to try it if you're feeling brave :). I expect to merge it into master in a week or two. Terminal.app doesn't have that issue, so that approach is safe to use.

It's not our preference to depart from standards here. Just looking at that glyph, clearly somebody thinks it is double-width. As @stevengj said there is no clear standard.

wcwidth is only "standard" in the sense that it is used by many programs; it is not consistent even between MacOS versions, much less across operating systems, and is invariably out of date.

Note that UAX#11 provides a clear standard for a subset of Unicode, and wcwidth as of MacOS 10.10.2 does not conform to Unicode 7 in the sense that it reports -1 (not printable / not recognized) for many of the characters listed in UAX#11 as having width 1 or 2 (narrow/wide).

@gnachman The rationale and details of the analyses used to justify Julia's implementation are explained in JuliaLang/utf8proc#2 and JuliaLang/utf8proc#27 and in this notebook, which amongst other things details the exact discrepancies between my system wcwidth and the analysis. In all the cases I examined, I could not find a reason to justify the system answer over the analysis outlined in the issues and notebook.

As @JeffBezanson and @stevengj have already stated, there is _no_ standard governing character widths, and so it is not possible to characterize Julia as "departing from the standard". On the contrary, it appears that not enough thought has gone into any other implementation for the purpose of determining character widths.

To illustrate our reasoning, consider the pizza character U+1F355. The relevant entry in EastAsianWidth.txt is:

1F330..1F37D;N # So [78] CHESTNUT..FORK AND KNIFE WITH PLATE

which assigns it the "neutral" category (not "narrow", which is coded as "Na"). Thus it falls into the nebulous category where UAX 11 has essentially nothing to say because the character doesn't exist in legacy East Asian encodings. UAX 11 even says in its Scope not to consider it an authoritative source on character widths, but rather that:

"The East_Asian_Width is an informative property... the guidelines on use of this property should be considered recommendations based on a particular legacy practice that may be overridden by implementations as necessary."
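For anyone who wants to check the entry quoted above mechanically, here is a minimal Python sketch that parses one line in the `first..last;category # comment` layout and tests whether pizza falls inside it. `parse_eaw_line` is a hypothetical helper written for this illustration, not part of utf8proc or Julia.

```python
def parse_eaw_line(line: str):
    """Parse 'XXXX..YYYY;CAT # comment' into (first, last, category), or None."""
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    code_range, category = (part.strip() for part in data.split(";", 1))
    first, _, last = code_range.partition("..")
    low = int(first, 16)
    high = int(last, 16) if last else low  # single code points have no '..'
    return low, high, category


entry = "1F330..1F37D;N # So [78] CHESTNUT..FORK AND KNIFE WITH PLATE"
low, high, category = parse_eaw_line(entry)
print(low <= 0x1F355 <= high, category)
```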

In the absence of a clear standard, the best I could come up with is to look at a font that actually bothered to provide a glyph for that code point, hence settling on Unifont, which provides this glyph:

uni01f3

Note that the character width assigned by inspecting the advance width from Unifont agrees with the eyeball comparison of the reference glyph in the Unicode character charts (pdf).

screen shot 2015-04-01 at 11 39 54 pm

Superimposed for reference is a square box. I do not see any reason why this should be 'narrow' instead of 'fullwidth'.

@jiahao, I wasn't criticizing your work. The informal agreement between client and server, which as you note is underspecified, is what is rickety. Your work is really valuable--I wish it (or something like it) were widely adopted.

I had believed that EastAsianWidth.txt was "the standard", but I'm persuaded that there isn't really one at all. AFAIK most apps treat N as narrow, but it leads to the problems described in this bug.

It sounds like this should be reported upstream/more widely, if it hasn't been already.

Unfortunately, it seems like the only upstream that can really fix this is libc, in order to fix wcwidth. I don't know where to file this kind of low-level bug report with Apple (??), and Microsoft is hopeless because of their wchar_t size, but it would be worthwhile for someone to check utf8proc against the latest GNU libc and file a bug report for discrepancies where libc is clearly wrong.

Julia has the right (most up-to-date) char widths, so it's up to users to demand that their terminal emulators display them properly. Most likely, that will happen gradually, with various companies (#7267) lagging more or less behind the standards committee.

Britain may be leaving the EU, but Unicode 9 came out and fixed this issue for us, so overall it's a pretty good day:
screen shot 2016-06-24 at 12 29 30 am
