Pixi.js: The space that is followed by CJK should not be a breaking space in TextMetrics

Created on 3 Nov 2020  ·  10 Comments  ·  Source: pixijs/pixi.js

Expected Behavior

The space that is followed by Chinese/Japanese/Korean should not be a breaking space.
Like how it works in DOM as well:
[Image: screenshot 2020-11-03 15 50 50]
[Image: export - 2020-11-03T160427 408]

Current Behavior

The space will be a breaking space no matter what the following character is. The breaking space works correctly in English, but it is not correct with Chinese/Japanese/Korean.
[Image: export - 2020-11-03T160407 935]

Possible Solution

TextMetrics.isBreakingSpace may need to check the next character as well.
If the character is CJK, the space should not be a breaking space.

let isUnbreakSpace = false;

if (typeof nextChar === 'string')
{
    const matchedAsCJK = nextChar.match(/[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff66-\uff9f]/g);

    isUnbreakSpace = !!matchedAsCJK;
}

return (TextMetrics._breakingSpaces.indexOf(char.charCodeAt(0)) >= 0) && !isUnbreakSpace;
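
For reference, the idea above can be written as a standalone function that runs without PixiJS. This is only a minimal sketch: `BREAKING_SPACES` is an assumed stand-in for `TextMetrics._breakingSpaces` (just tab and space), and `isBreakingSpaceSketch` is a hypothetical name. The regex drops the `g` flag because `test()` on a global regex keeps `lastIndex` state between calls.

```javascript
// Assumption: a stand-in for TextMetrics._breakingSpaces (tab and space only).
const BREAKING_SPACES = [0x0009, 0x0020];

// Same ranges as the snippet above, without the `g` flag so that
// `test()` does not carry `lastIndex` state between calls.
const CJK_CHAR = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff66-\uff9f]/;

// Hypothetical helper: a space is breaking unless the next character is CJK.
function isBreakingSpaceSketch(char, nextChar) {
  if (typeof char !== 'string') {
    return false;
  }

  const isSpace = BREAKING_SPACES.indexOf(char.charCodeAt(0)) >= 0;
  const nextIsCJK = typeof nextChar === 'string' && CJK_CHAR.test(nextChar);

  return isSpace && !nextIsCJK;
}

console.log(isBreakingSpaceSketch(' ', 'a'));  // true
console.log(isBreakingSpaceSketch(' ', 'テ')); // false
console.log(isBreakingSpaceSketch(' '));       // true (no next character)
```

Note that these ranges cover kana and Han ideographs but not Hangul syllables (U+AC00 to U+D7AF), so with this check a space before Korean text would still be treated as breaking.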

Steps to Reproduce

const style = new PIXI.TextStyle({
    breakWords: true,
    fontSize: 13,
    fontWeight: "bold",
    lineJoin: "bevel",
    stroke: "#896161",
    whiteSpace: "pre-line",
    wordWrap: true,
    wordWrapWidth: 285
});
const text = new PIXI.Text('テストテキスト テストテキスト テストテキスト テストテキスト ', style);

Environment

  • pixi.js version: e.g. 5.1.6
  • Browser & Version: e.g. Chrome 86
  • OS & Version: e.g. Mac OS 10.15
🕷 Bug

All 10 comments

Reproduction: https://codepen.io/sukantpal/full/dyXKWKy

The bounding box is a little smaller (^) than the box you're showing under "Current Behavior".

CJK isn't supported out of the box, atm, and there are a number of issues with it. I was going to make a plugin a long time ago to add support, and some refactor work for Text and TextMetrics allowed the easy addition, but it just never got done, as I suck.
Not sure if it'd be something natively included in PixiJS itself, as it requires knowing what to do with certain symbols, for certain languages, via regex.
I need to take another look. I have a method in my bespoke version of PixiJS, but there I just outright know what language the game is in, so I don't have to be as smart about things.

@SukantPal
Thank you for the reproduction. I didn't use the exact same code, so the width is smaller😉

@themoonrat
It sounds like a great idea to allow additions to Text and TextMetrics. It is hard to solve the problem perfectly for all languages, so it might be easier to let developers solve it themselves through such an addition.

This issue actually affects my current project. If there is anything I can help with, please let me know😊

@huang-yuwei Since I'm really struggling to properly contribute these days.... the following is the contents of a patch file. Would be interesting to know if it does solve your issues.

It _does_ rely on looking at certain characters, which I believe is accurate, and it relies on you creating and setting a new LANGUAGE property in settings of what the language is, so PIXI.settings.LANGUAGE = "zh-TW" for example.
The best way would be to auto-detect, but that's where I'm unsure whether you start to suffer performance penalties from running regexes every time. So in this version, I state what the language actually is from my app, and go from there.

diff --git "a/packages/text/src/TextMetrics.ts" "b/packages/text/src/TextMetrics.ts"
index 95e94453..26a368d1 100644
--- "a/packages/text/src/TextMetrics.ts"
+++ "b/packages/text/src/TextMetrics.ts"
@@ -1,3 +1,4 @@
+import { settings } from '@pixi/settings';
 import { TextStyle, TextStyleWhiteSpace } from './TextStyle';

 interface IFontMetrics {
@@ -8,6 +9,18 @@ interface IFontMetrics {

 type CharacterWidthCache = { [key: string]: number };

+/* eslint-disable no-control-regex */
+const regexBasicLatin = /[\u0000-\u00ff]/;
+const regexCannotStartZhCn = /[,!%),.:;>?]}¢¨°·ˇˉ―‖’”„‟†‡›℃∶、。〃〆〈《「『〕〗〞︵︹︽︿﹃﹘﹚﹜!"%'),.:;?]`|}~⦅]/;
+const regexCannotEndZhCn = /[,$(*,£¥·‘“〈《「『【〔〖〝﹗﹙﹛$(.[{£¥]/;
+const regexCannotStartZhTw = /[,!),.:;?]}¢·–—’”•‥„‧ †╴ 、。〆〈《「『〕〞︰︱︲︳︵︷︹︻︽︿﹁﹃﹏﹐﹑﹒﹓﹔﹕﹖﹘﹚﹜!),.:;?]|}、]/;
+const regexCannotEndZhTw = /[,([{£¥‘“‵々〇〉》」〔〝︴︶︸︺︼︾﹀﹂﹗﹙﹛({]/;
+const regexCannotStartJaJp = /[!%),.:;?]}¢°’”‟†‡℃、。〄〆〈《「『〕゛゜ゝゞ・ゝゞ!%),.:;?]}。 」 、 ・ ゙ ゚ ⦅]/;
+const regexCannotEndJaJp = /[$([\{£¥‘“々〇〉》」〔$([{「 ⦆¥]/;
+const regexCannotStartKoKr = /[!%),.:;?\]}¢°’”†‡℃〆〈《「『〕!%),.:;?]}⦅]/;
+const regexCannotEndKoKr = /[$([\{£¥‘“々〇〉》」〔$([{⦆¥₩]/;
+/* eslint-enable no-control-regex */
+
 /**
  * The TextMetrics object represents the measurement of a block of text with a specified style.
  *
@@ -574,7 +587,11 @@ export class TextMetrics
      */
     static canBreakWords(_token: string, breakWords: boolean): boolean
     {
-        return breakWords;
+        const isCJK = settings.LANGUAGE.indexOf('zh-') !== -1
+                    || settings.LANGUAGE.indexOf('ja-') !== -1
+                    || settings.LANGUAGE.indexOf('ko-') !== -1;
+
+        return breakWords || isCJK;
     }

     /**
@@ -595,6 +612,59 @@ export class TextMetrics
     static canBreakChars(_char: string, _nextChar: string, _token: string, _index: number,
         _breakWords: boolean): boolean
     {
+        const isCJK = settings.LANGUAGE.indexOf('zh-') !== -1
+        || settings.LANGUAGE.indexOf('ja-') !== -1
+        || settings.LANGUAGE.indexOf('ko-') !== -1;
+
+        if (isCJK)
+        {
+            if (_nextChar)
+            {
+                if (_char === ' ')
+                {
+                    return true;
+                }
+
+                if (regexBasicLatin.exec(_char))
+                {
+                    return false;
+                }
+
+                let regexCannotStart;
+                let regexCannotEnd;
+
+                if (settings.LANGUAGE === 'zh-CN')
+                {
+                    regexCannotStart = regexCannotStartZhCn;
+                    regexCannotEnd = regexCannotEndZhCn;
+                }
+                else if (settings.LANGUAGE === 'zh-TW')
+                {
+                    regexCannotStart = regexCannotStartZhTw;
+                    regexCannotEnd = regexCannotEndZhTw;
+                }
+                else if (settings.LANGUAGE === 'ja-JP')
+                {
+                    regexCannotStart = regexCannotStartJaJp;
+                    regexCannotEnd = regexCannotEndJaJp;
+                }
+                else if (settings.LANGUAGE === 'ko-KR')
+                {
+                    regexCannotStart = regexCannotStartKoKr;
+                    regexCannotEnd = regexCannotEndKoKr;
+                }
+
+                if (regexCannotEnd.exec(_char) || regexCannotStart.exec(_nextChar))
+                {
+                    return false;
+                }
+            }
+            else
+            {
+                return false;
+            }
+        }
+
         return true;
     }
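
One thing to watch in the patch above: `isCJK` is true for any `zh-`, `ja-` or `ko-` setting, but rules are only assigned for four exact values, so a setting such as `zh-HK` would leave `regexCannotStart` undefined and the `exec` call would throw. A lookup table with a fallback avoids that. The following is only a sketch under assumptions: the rule sets are tiny illustrative subsets, the space and basic-Latin branches of the patch are omitted, and `RULES`, `rulesFor` and `canBreakCharsSketch` are hypothetical names, not part of PixiJS.

```javascript
// Illustrative subsets only; the patch above uses much longer character lists.
const RULES = new Map([
  ['zh', { cannotStart: /[、。》」』]/, cannotEnd: /[《「『]/ }],
  ['ja', { cannotStart: /[、。」』ー]/, cannotEnd: /[「『]/ }],
  ['ko', { cannotStart: /[、。」』]/, cannotEnd: /[「『]/ }],
]);

// Hypothetical helper: exact match first, then the primary subtag
// ("zh-HK" -> "zh"), then null for languages without rules.
function rulesFor(language) {
  const primary = String(language).split('-')[0];

  return RULES.get(language) || RULES.get(primary) || null;
}

function canBreakCharsSketch(char, nextChar, language) {
  const rules = rulesFor(language);

  if (!rules) {
    return true; // no CJK rules for this language: default behaviour
  }
  if (!nextChar) {
    return false; // mirrors the patch: no break after the last character
  }

  return !(rules.cannotEnd.test(char) || rules.cannotStart.test(nextChar));
}

console.log(canBreakCharsSketch('テ', '。', 'ja-JP')); // false: "。" cannot start a line
console.log(canBreakCharsSketch('中', '文', 'zh-HK')); // true, and no exception
console.log(canBreakCharsSketch('a', 'b', 'en-US'));   // true
```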

@themoonrat
Thank you for providing the code. I have tried it in my environment.
I replaced canBreakChars and canBreakWords with your code, and the breaking space problem is still there.
Besides, after checking the use cases in Ja and Zh, the text is not always in a single language, so setting one fixed language may not be useful.

  1. The languages might sometimes be mixed. For example: PIXI株式会社, Kenさん.

    • Expected:

      [Image: screenshot 2020-11-05 13 26 28]

    • Current:

      [Image: screenshot 2020-11-05 13 27 41]

  2. Text written as a list will contain a number or a dot. For example: - テキストテキストテキストテキストテキストテキスト

    • Expected:

      [Image: screenshot 2020-11-05 13 46 30]

    • Current:

      [Image: screenshot 2020-11-05 13 46 13]

Even though it is not working, you gave me a direction to think about this issue. I didn't know that canBreakChars existed.
I will also try it myself, and hope to share my progress with you soon😀

@huang-yuwei I've had a dig through old PRs, and remembered https://github.com/pixijs/pixi.js/pull/4447

So, that PR made changes to TextMetrics ... but at the time we were making API changes that exposed canBreakChars and canBreakWords, which could theoretically make it possible to write a plugin that overrides those functions and uses them to make CJK text work. So his PR was good, and we were hoping to update it to use the new functions, but never got around to it.

I converted my own hacky version of CJK text to use these newly exposed functions, but it's still not perfect, not a plugin, and not good enough for public consumption.

If you could be that missing link to bring this all together and get CJK working that'd be amazing! That other PR has some unit tests that might help, too.

If you need any help, let me know :)

@themoonrat Thank you for sharing. I will take look at the PR 😊

@themoonrat
Hi, I have figured out some ways to get CJK working, and would like to hear your feedback 🙏

My approach has three goals:

  1. As you said, apply this as a plugin instead of changing PIXI itself.
  2. Fix the Kinsoku Shori issue that you wanted to address. The CANNOT-START and CANNOT-END lists are based on your code and https://en.wikipedia.org/wiki/Line_breaking_rules_in_East_Asian_languages.
  3. Fix the breaking-space issue when the space is not followed by a Latin character (I found this more accurate and simpler than checking for CJK characters).

The changes I have made are:

  1. Following the behavior in the DOM, I think the better approach is to apply Kinsoku Shori without detecting the language. It is quite likely that one text element uses more than one language. Therefore, instead of detecting the language, the Kinsoku Shori rules are applied whenever developers use the plugin.

    • To achieve this, the plugin needs to overwrite canBreakChars.

    • For this part, nothing needs to be changed in PIXI.

```
// Line breaking rules in CJK (Kinsoku Shori)
// A literal `]` is escaped inside each character class so it does not close the class early.
const regexCannotStartZhCn = /[!%),.:;?\]}¢°·'""†‡›℃∶、。〃〆〕〗〞﹚﹜!"%'),.:;?!\]}~]/;
const regexCannotEndZhCn = /[$(£¥·'"〈《「『【〔〖〝﹙﹛$(.[{£¥]/;
const regexCannotStartZhTw = /[!),.:;?\]}¢·–— '"•" 、。〆〞〕〉》」︰︱︲︳﹐﹑﹒﹓﹔﹕﹖﹘﹚﹜!),.:;?︶︸︺︼︾﹀﹂﹗\]|}、]/;
const regexCannotEndZhTw = /[([{£¥'"‵〈《「『〔〝︴﹙﹛({︵︷︹︻︽︿﹁﹃﹏]/;
const regexCannotStartJaJp = /[)\]}〕〉》」』】〙〗〟'"⦆»ヽヾーァィゥェォッャュョヮヵヶぁぃぅぇぉっゃゅょゎゕゖㇰㇱㇲㇳㇴㇵㇶㇷㇸㇹㇺㇻㇼㇽㇾㇿ々〻‐゠–〜? ! ‼ ⁇ ⁈ ⁉・、:;,。.]/;
const regexCannotEndJaJp = /[([{〔〈《「『【〘〖〝'"⦅«—...‥〳〴〵]/;
const regexCannotStartKoKr = /[!%),.:;?\]}¢°'"†‡℃〆〈《「『〕!%),.:;?\]}]/;
const regexCannotEndKoKr = /[$([{£¥'"々〇〉》」〔$([{⦆¥₩ #]/;

// Merge the per-language lists so the rules apply regardless of language.
const regexCannotStart = new RegExp(
  `${regexCannotStartZhCn.source}|${regexCannotStartZhTw.source}|${regexCannotStartJaJp.source}|${regexCannotStartKoKr.source}`,
);
const regexCannotEnd = new RegExp(
  `${regexCannotEndZhCn.source}|${regexCannotEndZhTw.source}|${regexCannotEndJaJp.source}|${regexCannotEndKoKr.source}`,
);

// A break is disallowed after a CANNOT-END character or before a CANNOT-START character.
PIXI.TextMetrics.canBreakChars = function canBreakChars(char, nextChar) {
  if (nextChar) {
    if (regexCannotEnd.exec(char) || regexCannotStart.exec(nextChar)) {
      return false;
    }
  }
  return true;
};
```
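
To see what rules of this kind do without loading PixiJS, the same check can be run over a string to list where a line may break. This is a minimal sketch under assumptions: the two regexes are small hand-picked subsets rather than the full lists, and `breakOpportunities` is a hypothetical helper, not a PixiJS API.

```javascript
// Illustrative subsets of the Kinsoku Shori lists.
const cannotStart = /[、。」』ー]/; // characters that must not begin a line
const cannotEnd = /[「『]/;         // characters that must not end a line

function canBreakChars(char, nextChar) {
  if (nextChar) {
    if (cannotEnd.test(char) || cannotStart.test(nextChar)) {
      return false;
    }
  }
  return true;
}

// Returns every index i where a break between chars[i] and chars[i + 1] is allowed.
function breakOpportunities(text) {
  const chars = Array.from(text);
  const result = [];

  for (let i = 0; i < chars.length - 1; i++) {
    if (canBreakChars(chars[i], chars[i + 1])) {
      result.push(i);
    }
  }
  return result;
}

console.log(breakOpportunities('「テスト」。')); // [ 1, 2 ]
```

In this example the only allowed breaks are inside テスト: the opening bracket may not end a line, and neither the closing bracket nor the full stop may start one.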

  2. As I mentioned in this issue, a space is not always a breaking space when the following character is CJK (or, more generally, not Latin).

    • To achieve this, the plugin needs to overwrite isBreakingSpace.

    • Check whether the next character is a Latin character. If it is, the space stays a breaking space; otherwise it is treated as non-breaking.

    • For this part, isBreakingSpace needs to receive nextChar as well. Therefore, I suggest applying this change in PIXI. (If you are okay with it, I will push a PR later.)

```
// Space breaking rules in CJK
// The literal `]` is escaped so it stays inside the character class.
const regexBasicLatin = /[a-zA-Z0-9()_!#$%?^&,."\'\]]/;

PIXI.TextMetrics.isBreakingSpace = function isBreakingSpace(char, nextChar) {
  if (typeof char !== 'string') {
    return false;
  }

  const isBreakingSpace =
    PIXI.TextMetrics._breakingSpaces.indexOf(char.charCodeAt(0)) >= 0;

  // A space followed by a non-Latin character is treated as non-breaking.
  if (isBreakingSpace && nextChar) {
    const unbreakableSpace = !regexBasicLatin.exec(nextChar);
    if (unbreakableSpace) return false;
  }

  return isBreakingSpace;
};
```
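
The rule can also be checked in isolation. The sketch below assumes a two-entry `BREAKING_SPACES` list as a stand-in for `PIXI.TextMetrics._breakingSpaces` and uses the hypothetical name `LATIN_NEXT` for the character class; it is not the PixiJS implementation.

```javascript
// Assumption: a stand-in for PIXI.TextMetrics._breakingSpaces (tab and space only).
const BREAKING_SPACES = [0x0009, 0x0020];

// Basic Latin letters, digits and some punctuation, with the literal `]` escaped.
const LATIN_NEXT = /[a-zA-Z0-9()_!#$%?^&,."'\]]/;

function isBreakingSpace(char, nextChar) {
  if (typeof char !== 'string') {
    return false;
  }

  const breaking = BREAKING_SPACES.indexOf(char.charCodeAt(0)) >= 0;

  // A space followed by anything outside LATIN_NEXT is treated as non-breaking.
  if (breaking && nextChar && !LATIN_NEXT.test(nextChar)) {
    return false;
  }
  return breaking;
}

console.log(isBreakingSpace(' ', 'K'));  // true: a space before "Kenさん" can break
console.log(isBreakingSpace(' ', 'テ')); // false
console.log(isBreakingSpace(' ', 'п'));  // false
console.log(isBreakingSpace(' '));       // true: no next character
```

The third call is worth noting: because the test is "not Latin" rather than "is CJK", a space before Cyrillic (or any other script outside the class) is also treated as non-breaking, which may or may not be what a given project wants.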

You can test it here as well: https://codepen.io/huang-yuwei/pen/GRqbEmm

Thank you very much for the code and for sharing #4447 with me. Both were very helpful.

we need to receive the nextChar in isBreakingSpace as well. Therefore, I suggest applying this change in PIXI.
I'm happy for this change to take place, yes. It doesn't affect current behaviour, and it'll follow the same param order as a different function that already requires it, canBreakChars.
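
A rough picture of why the extra parameter is backward compatible: a tokenizer can pass the following character on every call, and a predicate that ignores it behaves exactly as before. The tokenizer below is a simplified stand-in written for illustration (`tokenizeSketch` and both predicates are hypothetical names), not the one in PixiJS.

```javascript
// Default predicate: ignores nextChar, so existing behaviour is unchanged.
function defaultIsBreakingSpace(char, _nextChar) {
  return char === ' ';
}

// Override: a space only breaks at the end of the text or before a basic Latin letter or digit.
function cjkAwareIsBreakingSpace(char, nextChar) {
  if (char !== ' ') {
    return false;
  }
  return !nextChar || /[a-zA-Z0-9]/.test(nextChar);
}

// Splits text into tokens, emitting each breaking space as its own token.
function tokenizeSketch(text, isBreakingSpace) {
  const tokens = [];
  let token = '';

  for (let i = 0; i < text.length; i++) {
    const char = text[i];
    const nextChar = text[i + 1];

    if (isBreakingSpace(char, nextChar)) {
      if (token !== '') tokens.push(token);
      tokens.push(char);
      token = '';
    } else {
      token += char;
    }
  }
  if (token !== '') tokens.push(token);

  return tokens;
}

console.log(tokenizeSketch('ab テ cd', defaultIsBreakingSpace));
// [ 'ab', ' ', 'テ', ' ', 'cd' ]
console.log(tokenizeSketch('ab テ cd', cjkAwareIsBreakingSpace));
// [ 'ab テ', ' ', 'cd' ]
```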

Don't see anything wrong at a glance with what you've done... once the plugin has a repo or a pr, then I can easily replace my hacky version with your nice version and experiment with a number of games I have and see if I can spot any issues :)

This is exciting!

@themoonrat thank you for the comment and all of the help🙌
It does sound exciting!

I have created PR #7023, and will try the plugin later.
Will keep in touch with you 😉
