julia 🚀 - Julia doesn't like Pizza

I nominate this for most interesting issue subject.

In other news, this looks like something to take up with wcwidth; it doesn't even think pizza is printable:

$ cat test.c
#include <wchar.h>
#include <stdio.h>

int main( void ) {
    wchar_t pizza = 0x1f355; /* U+1F355 SLICE OF PIZZA */
    printf("%d\n", wcwidth(pizza) );
    return 0;
}

$  gcc -o test test.c && ./test
-1

We don't allow negative lengths, so we clip to 0, which seems pretty reasonable to me
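A minimal sketch of the clipping described above, assuming a POSIX system that provides `wcwidth`. The helper name `clipped_wcwidth` is hypothetical and is not taken from Julia's source. Note also that `wcwidth` consults the current locale, so a program that never calls `setlocale` may see different results than one running under a UTF-8 locale.

```c
#define _XOPEN_SOURCE 700  /* exposes wcwidth() on glibc */
#include <wchar.h>

/* Hypothetical helper: wcwidth() returns -1 for code points it treats as
 * non-printable or does not recognize, so clip negative results to 0. */
int clipped_wcwidth(wchar_t c) {
    int w = wcwidth(c);
    return w < 0 ? 0 : w;
}
```

With this, a character the libc does not know about is simply treated as taking no columns, which is what produces the overlapping output discussed below.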

staticfloat on 15 Jul 2013

This prints just fine on my terminal and looks to have a width of two. Not sure how we should handle this given that the C function is wrong.

StefanKarpinski on 15 Jul 2013

Revisiting this, it appears to work on OSX Mavericks 10.9, although it reports a charwidth() of 1, so the ending single quote that wraps the character is overlaid on some rounds of pepperoni.

jiahao on 5 Nov 2013

Unsurprisingly, Apple seems to be the first to update their Unicode tables. I think our only options here are to rely on whatever libc is available, or use our own Unicode tables. Not sure we want to get into all that.

JeffBezanson on 6 Nov 2013

Is it possible to test some characters during build time and emit runtime warnings if charwidths are known to be wrong?
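One way such a check could look, as a sketch only: compare a width function against a short list of probe characters whose widths are believed to be correct. Everything here is hypothetical (`width_probe`, `count_width_mismatches`, and the `stale_width` stand-in for an out-of-date libc); the expected width of 2 for U+1F355 is the value reported later in this thread.

```c
#include <stddef.h>

/* One probe: a code point and the width we believe is correct. */
struct width_probe {
    unsigned long codepoint;
    int expected;
};

/* Count probes for which width_fn disagrees with the expected width.
 * width_fn stands in for the system wcwidth() or any replacement. */
size_t count_width_mismatches(int (*width_fn)(unsigned long),
                              const struct width_probe *probes, size_t n) {
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        if (width_fn(probes[i].codepoint) != probes[i].expected)
            bad++;
    }
    return bad;
}

/* Hypothetical stand-in for a stale libc: nothing above U+FFFF is known. */
int stale_width(unsigned long cp) {
    return cp > 0xFFFF ? -1 : 1;
}
```

A non-zero count could then trigger the warning; whether the warning is useful is a separate question.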

jiahao on 6 Nov 2013

Eh, what's the point? That's just a warning that people are going to ignore or re-open this issue. If the system libc has the wrong character width for something, then you get mangled garbage. Get a better OS.

StefanKarpinski on 6 Nov 2013

The fact that this even exists as a character is a clear sign that unicode allows too many bits :-).

timholy on 6 Nov 2013

What's the status here? @Keno?

quinnj on 23 Jun 2014

Looks like somebody submitted a patch to glibc _yesterday_ to update their unicode data:
https://sourceware.org/ml/libc-alpha/2014-06/msg00585.html

JeffBezanson on 23 Jun 2014

Did the bump from libutf8proc to libmojibake solve this? #7917.

quinnj on 21 Aug 2014

I don't think the char width problem has been solved yet. @jiahao went through all the codepoints and computed the correct widths, but this information has not made it into libmojibake.

jakebolewski on 21 Aug 2014

ping @jiahao

quinnj on 29 Aug 2014

The last time I discussed this with @stevengj, we weren't entirely settled on whether the new charwidth function should be submitted as an entirely new function to JuliaLang/libmojibake#2 or as a patch to Julia's existing charwidth.

A correct charwidth would be useful to projects other than Julia, but it would mean significant new functionality to libmojibake and it would cease to be just a lightly updated fork of a minimal Unicode handling library.

jiahao on 29 Aug 2014

I really think it makes sense for it to be in libmojibake. It's in line with the other functionality in there, and won't bloat the library by a large percentage.

JeffBezanson on 29 Aug 2014

I don't actually care much either way, but I'm fine with putting it in libmojibake.

stevengj on 29 Aug 2014

The advantage of putting it in Julia (replacing src/support/wcwidth.c) is that it will still work if someone is using the system utf8proc (which seems not unlikely on e.g. Fedora, especially if the Unicode-7 support in libmojibake gets folded upstream).

stevengj on 21 Nov 2014

Ok, then let's drop it in as our wcwidth.c to fix this issue, and possibly move it to libmojibake later.

JeffBezanson on 21 Nov 2014

utf8proc now includes an up-to-date utf8proc_charwidth function based on @jiahao's analysis (JuliaLang/utf8proc#27), so we can fix this issue by upgrading to the latest utf8proc and using this function instead of wcwidth. (utf8proc's charwidth for U+1f355 is 2.)

stevengj on 12 Mar 2015

We should probably turn the name back to utf8proc here. I'd also slightly prefer going back to using a tarball for it (once 1.2.0 is tagged) rather than a submodule.

tkelman on 12 Mar 2015

+1

nalimilan on 13 Mar 2015

I slightly prefer submodules, since the tarballs tend to leave a bunch of old versions littering the deps/ directory when the version is upgraded. Submodules are also somewhat more flexible, since we can link a pre-release version if there is an urgent need (e.g. a bugfix).

stevengj on 13 Mar 2015

Submodules tend to confuse newcomers when versions are upgraded, introducing confusing diffs after they git pull when we change the submodule. It's also a bit messier for packagers who want to use system versions. It should be possible to set UTF8PROC_VERSION to a non-release sha for testing, and github should just make the right tarball for us. Either way though.

tkelman on 14 Mar 2015

This is not quite fixed, the display of charwidth == 2 characters is still off.
[screenshot]

jakebolewski on 31 Mar 2015

Unicode tab completion is also still a bit wonky. This is what it looks like for me when entering theta (which has charwidth == 1):

[four screenshots of the tab-completion sequence]

jakebolewski on 31 Mar 2015

It looks like there is something in the LineEdit code that is using the length (in codepoints) rather than the strwidth? For example, strwidth("🐨") == 2 (U+1f428), but if I type 🐨 abcd and then hit Ctrl-a to go back to the beginning of the line, then the formatting suddenly shifts to squish the koala partially under the a (as if it had width 1).
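To make the distinction concrete, here is a small sketch that counts code points and terminal columns separately for a UTF-8 string. It assumes valid UTF-8 input, and `toy_width` is a made-up two-entry width table for illustration, not utf8proc's or Julia's real one.

```c
#include <stddef.h>

/* Decode one UTF-8 sequence starting at s; store the code point in *cp and
 * return the number of bytes consumed. Assumes valid UTF-8 input. */
static size_t decode_utf8(const unsigned char *s, unsigned long *cp) {
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((unsigned long)(s[0] & 0x0F) << 12) |
              ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    *cp = ((unsigned long)(s[0] & 0x07) << 18) |
          ((unsigned long)(s[1] & 0x3F) << 12) |
          ((unsigned long)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    return 4;
}

/* Toy width table for illustration only: the koala is 2, all else is 1. */
static int toy_width(unsigned long cp) {
    return cp == 0x1F428 ? 2 : 1;
}

/* Number of code points in a NUL-terminated UTF-8 string. */
size_t codepoint_count(const char *str) {
    const unsigned char *s = (const unsigned char *)str;
    size_t n = 0;
    unsigned long cp;
    while (*s) {
        s += decode_utf8(s, &cp);
        n++;
    }
    return n;
}

/* Number of terminal columns the string occupies, according to toy_width. */
size_t column_count(const char *str) {
    const unsigned char *s = (const unsigned char *)str;
    size_t cols = 0;
    unsigned long cp;
    while (*s) {
        s += decode_utf8(s, &cp);
        cols += (size_t)toy_width(cp);
    }
    return cols;
}
```

For the koala followed by " abcd", the two counts differ by one, which is the size of the shift seen after Ctrl-a if redraw code positions the cursor by code points instead of columns.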

@jakebolewski, the \theta tab-completion example works fine for me. Maybe it is a font issue?

stevengj on 31 Mar 2015

I'm not sure we/LineEdit can do anything about this --- if the terminal or some other system component had incorrect character widths, wouldn't these kinds of problems persist? We don't control the actual display.

JeffBezanson on 31 Mar 2015

Aren't we telling the OS what cells to print each character in when doing line editing?

stevengj on 31 Mar 2015

It looks like an iTerm bug actually; changing to the default OSX terminal results in correct behavior. All fonts in iTerm that I have tried have the same problem.

jakebolewski on 31 Mar 2015

I suppose that, in theory, we could print charwidth - wcwidth spaces or something like that to compensate for a buggy wcwidth?
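As a rough sketch of that arithmetic: the function below takes the two widths as plain integers so it does not depend on any particular library. The name `compensation_cells` is hypothetical, and treating a negative (unknown) system width as 0 is an assumption about how the terminal behaves.

```c
/* Cells still owed after the terminal has advanced by the system width.
 * A negative (unknown) system width is treated as 0; if the system width
 * is already at least charwidth, nothing is owed. */
int compensation_cells(int charwidth, int sys_wcwidth) {
    if (sys_wcwidth < 0)
        sys_wcwidth = 0;
    int diff = charwidth - sys_wcwidth;
    return diff > 0 ? diff : 0;
}
```

For a character with charwidth == 2 and wcwidth == 1 this gives one extra cell; when the two agree it gives zero.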

stevengj on 31 Mar 2015

Yes, we could print spaces to compensate. I feel like it's not worth the effort though, since there is unlimited scope for downstream display bugs.

JeffBezanson on 31 Mar 2015

If you believe there is an iTerm bug, that should be filed upstream. They are usually very responsive.

Keno on 31 Mar 2015

@Keno, it ended up not being a bug in iTerm. You need to turn off "treat ambiguous-width characters as double width" in the Text section under preferences.

jakebolewski on 31 Mar 2015

I'm using Terminal, not ITerm. What confuses me is that 🐨 abcd looks fine when I first type it, and is only wonky when I hit ctrl-a ... this makes me think it might be a bug in LineEdit.refresh_line.

stevengj on 31 Mar 2015

Can you screenshot what you see? I don't quite understand, and I suspect terminal differences might make it hard to discuss without one.

Keno on 31 Mar 2015

Procedure: paste 🐨, then type abcd. Then type ctrl-a. Screenshot before ctrl-a:

and after ctrl-a:

stevengj on 31 Mar 2015

I have a sudden urge to reprogram a key on my keyboard to type a koala. Or do I use the snowman more often?

JeffBezanson on 31 Mar 2015

We need tab completion for emoji: \:koala:

jakebolewski on 31 Mar 2015

I have Base.REPLCompletions.latex_symbols["\\koala"] = "\U1f428" in my .juliarc.jl.

stevengj on 31 Mar 2015

(On a more serious note, I see exactly the same behavior for \Longrightarrow = ⟹ = U+27f9, which also has charwidth == 2 and wcwidth == 1 on my machine.)

stevengj on 31 Mar 2015

Ok, thanks. I can reproduce and take a look.

Keno on 31 Mar 2015

I suspect what might be happening here is that on input the terminal properly moves the cursor by two spaces but on output it does not.

Keno on 31 Mar 2015

Blech, I suppose there's not much we can do about it. But iTerm2 seems to have the same behavior, and maybe they can fix it on their end.

But the weird thing is that in bash with the same terminal the results are the same (both wrong) for input and output. What are we doing differently for input than for output?

stevengj on 31 Mar 2015

Well, maybe @gnachman can shed some light on whether that's the intended behavior:
[screenshot]
(i.e. the double width write only taking up a single cell). It might also be of course that iTerm's unicode tables are out of date. If the behavior is by design I suspect we can just print spaces to make up for it (though we might run into trouble with the space overdrawing the character - can we do a cursor move instead of printing a space?).

Keno on 31 Mar 2015

I think it's out-of-date/inaccurate Unicode tables. If I use 䍒 = U+4352, which has wcwidth==charwidth==2, then "a\u4352b" prints correctly with a double-width 䍒 between the a and b, without overwriting, in both Terminal and iTerm2. Maybe iTerm2 can just use utf8proc...

(Of course, it may be by design that the terminals match the buggy system wcwidth, so that column alignment in programs using wcwidth works. That would be an argument for doing a cursor move by charwidth-wcwidth.)

But I still don't understand how it could work okay for input (for us, but not for bash).

stevengj on 31 Mar 2015

I'm against trying to work around it with hacks like spaces or extra cursor moves. The cost is (1) complexity and maintainability, (2) incompleteness (we can't reasonably intercept write data), (3) I can imagine it making matters worse in other, unforeseen circumstances.

JeffBezanson on 31 Mar 2015

I think we can probably work around it by doing a cursor move (writing spaces is problematic, because iTerm's redraw code would draw the space on top of the koala), but it just seems like such an ugly solution. I am entirely fine with telling people to use a terminal with proper unicode tables if there is any chance this can be fixed there.
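For reference, a cursor move of this kind would normally use the ANSI "cursor forward" (CUF) control sequence, `ESC [ n C`. The sketch below only formats that sequence into a buffer; the function name `format_cursor_forward` is hypothetical, and whether a given terminal redraws correctly after such a move is exactly the open question here.

```c
#include <stdio.h>

/* Write the ANSI "cursor forward" (CUF) sequence ESC [ n C into buf.
 * Returns the number of characters written, or 0 if no move is needed. */
int format_cursor_forward(char *buf, size_t size, int cells) {
    if (cells <= 0) {
        if (size > 0)
            buf[0] = '\0';
        return 0;
    }
    return snprintf(buf, size, "\x1b[%dC", cells);
}
```

Unlike writing a space, this sequence does not put a character into the skipped cell, so there is nothing for the terminal to draw on top of the wide glyph.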

Keno on 31 Mar 2015

It looks like Julia uses some combination of EastAsianWidth.txt and advances extracted from GNU Unifont, as described here:

https://github.com/JuliaLang/utf8proc/blob/master/data/charwidths.jl

While it's tempting to adopt the same widths that Julia uses in iTerm2, I'm a little worried about what will break. In order for this rickety system to work, all apps must be in agreement with the terminal emulator about the width of all characters. For example, bash and vim 7.3 think pizza is narrow. Mind you, it looks like crap because they're wrong, but at least interactive editing works.

gnachman on 31 Mar 2015

@gnachman, unfortunately I think there are conflicting goals in Julia and iTerm (or Terminal) here.

In Julia, we need a portable wcwidth replacement because (a) we want consistent cross-platform results for Unicode processing, (b) the Windows wcwidth is broken (16-bit wchar_t), and (c) wcwidth is clearly out-of-date on most other systems. There isn't any clear standard to follow, but the procedure involving Unifont etcetera, after an exhaustive analysis by @jiahao, seemed to be the most reliable practicable approach.
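Point (b) can be illustrated with a short sketch: a code point above U+FFFF does not fit in one 16-bit unit and has to be split into a UTF-16 surrogate pair, so no single 16-bit `wchar_t` value can represent pizza. The function name `utf16_encode` is hypothetical, and the input is assumed to be a valid Unicode scalar value.

```c
#include <stdint.h>

/* Encode a code point as UTF-16 into out[0..1]; return the number of
 * 16-bit units used (1 or 2). Assumes cp is a valid scalar value. */
int utf16_encode(uint32_t cp, uint16_t out[2]) {
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

U+1F355 needs two units (0xD83C, 0xDF55), whereas U+4352 fits in one.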

iTerm, however, only has to work on MacOS, and has to work with lots of programs that rely on MacOS's buggy wcwidth (which reports e.g. that pizza is narrow), so it makes a certain amount of sense for it to adopt whatever width the system wcwidth reports, no matter how crappy it looks.

So, we might have to do a cursor move by charwidth-wcwidth if we want it to be both nicely editable and to look nice. Or use wcwidth if we want it to be nicely editable regardless of how it looks.

stevengj on 1 Apr 2015

@stevengj iTerm2 does need to interoperate with anything you can ssh, telnet, etc. to, not to mention Julia, so I'm open to giving users a way to opt in to a more sensible wcwidth(). I don't use wcwidth() on the client so I _could_ use utf8proc_charwidth in the right circumstances. Since AFAIK only Julia departs from the standard, there'd need to be a new escape sequence to tell the terminal emulator to switch character-width lookup tables.

OTOH, since Julia is the black sheep in this regard, it probably makes the most sense for Julia to print a space after characters that it treats as wide but wcwidth does not. And deal with cursor movement across them correctly, etc. That'll work with every terminal out there. If a window gets resized it won't wrap correctly, though. Terminal and iTerm2 will both refuse to "break" a fullwidth character into two half-width pieces, choosing instead to move the whole thing to the start of the next line, but that's a small price to pay.

gnachman on 1 Apr 2015

@gnachman If I print a space next to a character, is there a chance the space will get drawn on top of it during a redraw? I think I've seen that behavior in my experiments.

Keno on 1 Apr 2015

@Keno Yes, that can happen. I'm working on a fix to that issue in my refactor_drawing branch. Feel free to try it if you're feeling brave :). I expect to merge it into master in a week or two. Terminal.app doesn't have that issue, so that approach is safe to use.

gnachman on 1 Apr 2015

It's not our preference to depart from standards here. Just looking at that glyph, clearly somebody thinks it is double-width. As @stevengj said there is no clear standard.

JeffBezanson on 1 Apr 2015

wcwidth is only "standard" in the sense that it is used by many programs; it is not consistent even between MacOS versions, much less across operating systems, and is invariably out of date.

Note that UAX#11 provides a clear standard for a subset of Unicode, and wcwidth as of MacOS 10.10.2 does not conform to Unicode 7 in the sense that it reports -1 (not printable / not recognized) for many of the characters listed in UAX#11 as having width 1 or 2 (narrow/wide).

stevengj on 1 Apr 2015

@gnachman The rationale and details of the analyses used to justify Julia's implementation are explained in JuliaLang/utf8proc#2 and JuliaLang/utf8proc#27 and in this notebook, which amongst other things details the exact discrepancies between my system wcwidth and the analysis. In all the cases I examined, I could not find a reason to justify the system answer over the analysis outlined in the issues and notebook.

As @JeffBezanson and @stevengj have already stated, there is _no_ standard governing character widths, and so it is not possible to characterize Julia as "departing from the standard". On the contrary, it appears that not enough thought has gone into any other implementation for the purpose of determining character widths.

To illustrate our reasoning, consider the pizza character U+1F355. The relevant entry in EastAsianWidth.txt is:

1F330..1F37D;N # So [78] CHESTNUT..FORK AND KNIFE WITH PLATE

which assigns it the "neutral" category (not "narrow", which is coded as "Na"). Thus it falls into the nebulous category where UAX 11 has essentially nothing to say because the character doesn't exist in legacy East Asian encodings. (UAX 11 even says in its Scope not to consider it an authoritative source on character widths, but rather that "The East_Asian_Width is an informative property... the guidelines on use of this property should be considered recommendations based on a particular legacy practice that may be overridden by implementations as necessary.")
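As an illustration of how an entry like the one quoted above can be read mechanically, here is a minimal parsing sketch. The struct and function names (`eaw_entry`, `parse_eaw_line`, `eaw_contains`, `eaw_is_neutral`) are hypothetical; this is not the code used in utf8proc's data scripts, and it handles only the "range;property" and "codepoint;property" line shapes.

```c
#include <stdio.h>
#include <string.h>

/* One parsed entry: an inclusive code point range and its property value. */
struct eaw_entry {
    unsigned long first, last;
    char prop[4]; /* e.g. "N", "Na", "W", "F", "H", "A" */
};

/* Parse "XXXX..YYYY;P ..." or "XXXX;P ..."; return 1 on success, 0 if the
 * line is a comment, blank, or otherwise not an entry. */
int parse_eaw_line(const char *line, struct eaw_entry *e) {
    if (sscanf(line, "%lx..%lx;%3[A-Za-z]", &e->first, &e->last, e->prop) == 3)
        return 1;
    if (sscanf(line, "%lx;%3[A-Za-z]", &e->first, e->prop) == 2) {
        e->last = e->first;
        return 1;
    }
    return 0;
}

/* Does the entry's range cover the given code point? */
int eaw_contains(const struct eaw_entry *e, unsigned long cp) {
    return cp >= e->first && cp <= e->last;
}

/* "N" means neutral; "Na" would mean narrow. */
int eaw_is_neutral(const struct eaw_entry *e) {
    return strcmp(e->prop, "N") == 0;
}
```

Applied to the quoted line, the range covers U+1F355 and the property is "N", so the file alone does not settle whether the glyph takes one column or two.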

In the absence of a clear standard, the best I could come up with is to look at a font that actually bothered to provide a glyph for that code point, hence settling on Unifont, which provides this glyph:

[image: uni01f3]

Note that the character width assigned by inspecting the advance width from Unifont agrees with the eyeball comparison of the reference glyph in the Unicode character charts (pdf).

[screenshot: reference glyph from the Unicode character charts]

Superimposed for reference is a square box. I do not see any reason why this should be 'narrow' instead of 'fullwidth'.

jiahao on 2 Apr 2015

@jiahao, I wasn't criticizing your work. The informal agreement between client and server, which as you note is underspecified, is what is rickety. Your work is really valuable--I wish it (or something like it) were widely adopted.

I had believed that EastAsianWidth.txt was "the standard", but I'm persuaded that there isn't really one at all. AFAIK most apps treat N as narrow, but it leads to the problems described in this bug.

gnachman on 2 Apr 2015

It sounds like this should be reported upstream/more widely, if it hasn't been already.

timholy on 2 Apr 2015

Unfortunately, it seems like the only upstream that can really fix this is libc, in order to fix wcwidth. I don't know where to file this kind of low-level bug report with Apple (??), and Microsoft is hopeless because of their wchar_t size, but it would be worthwhile for someone to check utf8proc against the latest GNU libc and file a bug report for discrepancies where libc is clearly wrong.

stevengj on 2 Apr 2015

glibc#4335

jiahao on 2 Apr 2015

Julia has the right (most up-to-date) char widths, so it's up to users to demand that their terminal emulators display them properly. Most likely, that'll happen gradually, with various companies (#7267) lagging more or less behind the standards committee.

vtjnash on 14 Mar 2016

Britain may be leaving the EU, but Unicode 9 came out and fixed this issue for us, so overall it's a pretty good day:
[screenshot]

Keno on 24 Jun 2016

iTerm2 PR: https://github.com/gnachman/iTerm2/pull/294

Keno on 24 Jun 2016
