Xterm.js: Emoji skin modifiers don't work

Created on 4 Sep 2017 · 10Comments · Source: xtermjs/xterm.js

Pasting "👍🏻 👈🏿"

screen shot 2017-09-04 at 10 50 09 am

It looks like the skin modifier is being put into a new cell, when it should actually be added to the single emoji.

areparser typenhancement

Source

Tyriar

👍2

Most helpful comment

@Tyriar Imho this needs a more advanced character evaluation than xterm.js has atm - the current approach can only handle leading combining chars, while this one follows the char to be modified (http://unicode.org/reports/tr51/#Emoji_Modifiers_Table). The info about the correct rendering is in the sequence of chars (+ unicode spec), it is just a nightmare to cover all those edge cases (emoji modifiers were not part of the official spec until 2015).

jerch on 5 Sep 2017

👍2

All 10 comments

I got it working with the following code:

diff --git a/src/Parser.ts b/src/Parser.ts
index b047f91..0a4aeb9 100644
--- a/src/Parser.ts
+++ b/src/Parser.ts
@@ -187,7 +187,7 @@ export class Parser {
     let cs;
     let ch;
     let code;
-    let low;
+let skipChars = 0;

     const cursorStartX = this._terminal.buffer.x;
     const cursorStartY = this._terminal.buffer.y;
@@ -204,6 +204,10 @@ export class Parser {
     }

     for (; this._position < l; this._position++) {
+      if (skipChars > 0) {
+        skipChars--;
+        continue;
+      }
       ch = data[this._position];

       // FIXME: higher chars than 0xa0 are not allowed in escape sequences
@@ -212,7 +216,7 @@ export class Parser {
       if (0xD800 <= code && code <= 0xDBFF) {
         // we got a surrogate high
         // get surrogate low (next 2 bytes)
-        low = data.charCodeAt(this._position + 1);
+        let low = data.charCodeAt(this._position + 1);
         if (isNaN(low)) {
           // end of data stream, save surrogate high
           this._terminal.surrogate_high = ch;
@@ -220,10 +224,19 @@ export class Parser {
         }
         code = ((code - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;
         ch += data.charAt(this._position + 1);
+        let i = 2;
+        skipChars = 1;
+        while (0xD800 <= data.charCodeAt(this._position + i) && data.charCodeAt(this._position + i) <= 0xDBFF) {
+          ch += data.charAt(this._position + i);
+          ch += data.charAt(this._position + i + 1);
+          i += 2;
+          skipChars++;
+        }
       }
       // surrogate low - already handled above
-      if (0xDC00 <= code && code <= 0xDFFF)
+      if (0xDC00 <= code && code <= 0xDFFF) {
         continue;
+      }

       switch (this._state) {
         case ParserState.NORMAL:

However, doing this screws up the prompt as 2 characters (according to the shell) become one, which means you can go left into the $PS1.

screen shot 2017-09-04 at 11 24 24 am

We should fix that before going forward with supporting modifiers and the zero-width joiner. Not sure how best to do this, perhaps characters need some more metadata indicating how many characters are joined into the cell, then left/right arrow repeats the action that number of times?

Tyriar on 4 Sep 2017

jerch on 5 Sep 2017

👍2

Oh I was wrong with my last statement, the input does not support leading but only following combining characters atm (combining chars with zero width will end up in the last active cell). So the emoji problem should be fixable by adding the modifiers to the wcwidth combining char table COMBINING_HIGH.

jerch on 5 Sep 2017

Ok the problem gets quite nasty - the correct handling of an emoji + modifier in the browser depends on the system font renderer. There is no platform independent way to solve this, seems this feature is to new.

testing results:

Ubuntu 14 prints 2 block symbols with unicode number [][] (no visual glyphs)
Ubuntu 16 prints the glyphs separated (thumbup + colored field behind), it does not combine the characters
OSX 10.12 correctly does the trick

Thats the point where the wcwidth implementation has to reassemble the system wcwidth to get correct widths. Only way I can think of to solve this is to prerender the emoji+modifier offline and measure the text width to see if it actually combines. If it does not combine the Unicode spec suggests to print them both or only the first one. The latter seems reasonable in a terminal environment to preserve the row width.
Not sure, if emojis with modifiers are worth that trouble, what you think?

jerch on 5 Sep 2017

Well it would be good to support it, I'm not exactly to keen on jumping on it atm though.

The offscreen measuring thing looks like it would solve the problem. I'm not sure we would want to touch wcwidth though? Isn't that used to determine the width that the character consumes, not the number of characters in a single "cell"?

Tyriar on 5 Sep 2017

Well wcwidth is part of the deal: currently combining chars end up in the last cell. This is because wcwidth determined it as a char with zero width. With systems treating emoji modifiers differently it does not work that easy anymore:

OSX: modifier combines --> wcwidth should report width of zero
Ubuntu: modifier does not combine --> char has its own visual representation and needs its space, wcwidth should report whatever space it takes to draw it (alternatively drop that char at a higher level before rendering)

If you have a look at the Unicode spec above it gets even trickier with the ZERO WIDTH JOINER (http://unicode.org/reports/tr51/#Emoji_ZWJ_Sequences).

Maybe its time to get the hands dirty with splitting up the stream into graphemes (see https://github.com/devongovett/grapheme-breaker) and feed those to wcwidth. wcwidth then needs a way to dynamically determine the width of unkown graphemes, which could be some offscreen thingy.

_NOTE_ - any implementation in that direction will have a big negative performance impact, since every char has to be inspected at least once (like now with wcwidth) and multiple times in case combining chars are following.

jerch on 5 Sep 2017

Relating #1059

mandel59 on 11 May 2018

To align to the behavior of remote wcwidth implementations rather than Unicode specification is one of solutions: for instance, mintty renders a keycap emoji in 1 cell and a rainbow flag emoji in 4 cells. Although it is ugly, it solves the probrem about cursor location.

https://github.com/mintty/mintty/wiki/Tips#installing-emoji-resources

mandel59 on 11 May 2018

@mandel59 Well yes and no. Problem behind is that xterm.js can run on a total different system than the backend due to the browser/network stack in between, and both systems involved can differ about the wcwidth and the final render output width (which is even worse because font capabilities dependent). Currently xterm.js's wcwidth is not frontend aware, it simply applies the same rules for all systems. But even being frontend system aware does not solve the problem:

Imagine the following for char XY:

backend: wcwidth 1
frontend: wcwidth 2

Which one to apply now? If we take the 1 of the backend the rendered char is still drawn as wide char due to the way the frontend wants to render it - the cursor ends up on top of or "in the middle" of the glyph. If we take the 2 of the frontend we violate the grid layout of the terminal and will run one cell short at the end of the line, because the pty on the backend side just saw the 1 there.

Imho there is no real solution for this problem. In a perfect world wcwidth is the same for backend and frontend and we dont face this problem at all. For systems where the backend runs on the same system as the frontend (like a local terminal) it can be fixed this way, yes. For remote systems, it is a matter of priority - go with the backend for correct cell alignment with possible over/under draws or go with the frontend for correct draw but broken cell alignment?

Note that most of the unicode chars did not change the wcwidth spec for years and do not create these hassles. The emojies were changed recently from a width of 1 to 2, therefore the problem.

jerch on 11 May 2018

Background on this: Those composed characters are part of the grapheme parsing rules in Unicode, same as country flags and others.

We will not see grapheme support anytime soon in xterm.js, as all cmdline programs get them wrong currently. Also note that they are specced to produce different outputs, technically the hand symbol + the box is correct. If the font renderer has a merged symbol for the codepoints it is free to use that prolly shorter symbol. That ambiguity is a nightmare for monospaced grids, but Unicode does not care much in this regard.

@Tyriar Should we close this issue as wont fix?

jerch on 15 Nov 2019

👍1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Buffer performance improvements

Tyriar · 73Comments

Custom addons

feamsr00 · 39Comments

Introduce some build system to enable more than one js file

Tyriar · 26Comments

Support 1337 escape sequence to show inline images

cristynkells · 26Comments

Can xterm attach use socket. io?

BengBu-YueZhang · 63Comments