VS Code: Feature request: Treat Chinese text as a sequence of Chinese characters when using `Ctrl+Left/Right`

Created on 17 May 2018  ·  8 Comments  ·  Source: microsoft/vscode

Currently, VS Code treats a long run of Chinese text as a single "word". Each press of Ctrl+Left/Right moves the cursor to the beginning or end of the whole run.

The request is to treat Chinese text as a sequence of characters, so that each Ctrl+Left/Right moves the cursor by one character. This is the default behavior of the system's text programs.

Example (| marks the cursor):

|本文的学习公式
// Ctrl+Right
本文的学习公式|

Expected:

|本文的学习公式
// Ctrl+Right
本|文的学习公式
// Ctrl+Right
本文|的学习公式
// Ctrl+Right
本文的|学习公式

(Of course, it would be even better if it supported word segmentation.)

editor-core feature-request

All 8 comments

It would be better if VS Code supported word segmentation the way Chrome does, although I know this requires a large dictionary and would increase the package size considerably. If it does not segment words, I suggest that the cursor move one sentence at a time instead of one character at a time; personally, I think jumping a single Chinese character per <Ctrl+Right> is too slow.

However, it does not currently move one sentence at a time either, because it does not recognize Chinese punctuation.

Examples:

|output gate 会影响结果,因此该模型有两个版本,分别为是否使用
// <Ctrl+Right>
output| gate 会影响结果,因此该模型有两个版本,分别为是否使用
// <Ctrl+Right>
output gate| 会影响结果,因此该模型有两个版本,分别为是否使用
// <Ctrl+Right>
output gate 会影响结果,因此该模型有两个版本,分别为是否使用|

This is a longstanding problem which virtually all East-Asian developers will notice once they start editing natural sentences (say, in Markdown) on vscode. I think this is fundamentally a problem of wrong word-splitting for CJK languages (and perhaps Thai, too), which use no spaces to delimit words. A similar problem happens when you double-click a word in a line (the whole line will be selected instead of the target word) and when you trigger an autocompletion using Ctrl+Space (a whole line will be shown as a candidate).

Ideally, dictionary-based word segmentation is desirable (this is available on MS Word, Google Chrome browser, etc), but it's not 100% correct, and I'm not sure if it is really necessary for a code editor. Another practical approach that works at least in Japanese is to split words based on character types, because a typical Japanese text is a mixture of kanji, hiragana and katakana (This algorithm is implemented on most domestic text editors and even MS Notepad.exe). Character types can be easily determined via Unicode code points.

Example:

(1) 吾輩は猫である。名前はまだない。
(2) 吾輩|は|猫|で|ある|。|名前|は|まだ|ない|。
(3) 吾輩|は|猫|である|。|名前|はまだない|。

(1): Natural Japanese text with two sentences. 。 is a Japanese period.; (2): Dictionary-based word boundaries (|), available on MS Word, Chrome, etc.; (3): Codepoint-based kana-kanji boundaries, available on Firefox, Notepad.exe, etc.
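
A minimal sketch of approach (3), assuming a simplified table of Unicode ranges. The names `classify` and `segmentByClass` are hypothetical and are not taken from VS Code or from any extension; the ranges cover only common BMP blocks and are not exhaustive.

```typescript
type CharClass = "space" | "latin" | "hiragana" | "katakana" | "han" | "punct" | "other";

// Classify a code point by Unicode range. Assumption of this sketch: only
// common BMP blocks are covered, which is enough for the example text.
function classify(cp: number): CharClass {
  if (cp === 0x20 || cp === 0x09 || cp === 0x3000) return "space";
  if ((cp >= 0x30 && cp <= 0x39) || (cp >= 0x41 && cp <= 0x5a) || (cp >= 0x61 && cp <= 0x7a)) {
    return "latin"; // ASCII digits and letters
  }
  if (cp >= 0x3040 && cp <= 0x309f) return "hiragana";
  if (cp >= 0x30a0 && cp <= 0x30ff) return "katakana";
  if ((cp >= 0x3400 && cp <= 0x4dbf) || (cp >= 0x4e00 && cp <= 0x9fff)) return "han";
  if (
    (cp >= 0x21 && cp <= 0x7e) ||     // remaining printable ASCII is punctuation
    (cp >= 0x3001 && cp <= 0x303f) || // CJK Symbols and Punctuation
    (cp >= 0xff01 && cp <= 0xff0f) || // full-width punctuation
    (cp >= 0xff1a && cp <= 0xff20) ||
    (cp >= 0xff3b && cp <= 0xff40) ||
    (cp >= 0xff5b && cp <= 0xff65)
  ) {
    return "punct";
  }
  return "other";
}

// Split text wherever the character class changes.
function segmentByClass(text: string): string[] {
  const segments: string[] = [];
  let current = "";
  let currentClass: CharClass | null = null;
  for (const ch of text) {
    const cls = classify(ch.codePointAt(0)!);
    if (currentClass !== null && cls !== currentClass) {
      segments.push(current);
      current = "";
    }
    current += ch;
    currentClass = cls;
  }
  if (current !== "") segments.push(current);
  return segments;
}

// Prints: 吾輩|は|猫|である|。|名前|はまだない|。
console.log(segmentByClass("吾輩は猫である。名前はまだない。").join("|"));
```

On the example sentence this reproduces the boundaries shown in (3). Consecutive characters of the same class always stay together, so a long run of kanji is still one segment.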

There is already a popular extension that does (3) above for Japanese text. Unfortunately, it works on Ctrl+Left/Right but nowhere else. It does not work on double-clicks, Ctrl+D, autocompletion, text search, and so on.

Personally, I think (3) should be implemented as part of the basic functionality of VS Code, considering that it is available in most other decent text editors. The dictionary-based solution (2) may be too costly to include in the main vscode repository, but I hope there is a way to allow extension developers to override the word-boundary detection algorithm or the double-click behavior.


By the way, in the meantime, you can alleviate this problem by tweaking the "editor.wordSeparators" setting and adding multibyte punctuation marks such as full-width periods and commas. With this, the cursor will at least stop at (double-byte) periods and commas when you use Ctrl+Left/Right.
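
As a rough illustration of the "editor.wordSeparators" workaround, the sketch below builds a settings value by appending full-width punctuation to the separator list. Two things here are assumptions: the `defaultSeparators` string is what I understand the default to be (check it against your own VS Code version), and the set of full-width marks is my own choice, since the thread does not list the exact characters.

```typescript
// Assumed default of "editor.wordSeparators"; verify before relying on it.
const defaultSeparators = "`~!@#$%^&*()-=+[{]}\\|;:'\",.<>/?";

// Full-width punctuation to append (an illustrative choice, not from the thread).
const fullWidthMarks = [
  "\u3002", // ideographic full stop
  "\u3001", // ideographic comma
  "\uFF0C", // full-width comma
  "\uFF0E", // full-width full stop
  "\uFF01", // full-width exclamation mark
  "\uFF1F", // full-width question mark
].join("");

// Fragment that could be merged into settings.json.
const settingsFragment = {
  "editor.wordSeparators": defaultSeparators + fullWidthMarks,
};

console.log(JSON.stringify(settingsFragment, null, 2));
```

This only makes the cursor stop at punctuation. It does nothing for the many words that usually sit between two punctuation marks.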

I searched for related issues about CJK text navigation. I learned that "selection/navigation via double-click/keyboard" and "extracting words for autocompletion" are technically two different areas, but they are conceptually related anyway.

Keyboard navigation & Double click:

  • #27017 "Double click to select word that don't recognize Chinese punctuation": Probably shares the same root cause as this issue. It suggests using the wordSeparators config, which is better than nothing, but not ideal for the aforementioned reason. Obviously there are usually many words between two commas/periods.
  • #25208 "Moving cursor using Ctrl+(left/right arrow) in Chinese and English mixed text": Very similar to this issue, except that #25208 is about separation between Chinese characters and English, whereas this one is mainly about separation between two Chinese characters (or between a Chinese character and a punctuation mark). Either way, these are all cases that wordSeparators cannot handle.

Word extraction for autocompletion:

  • #37202 Suggestions for wordSeparators Not working even after #15177 was marked as fixed
  • #15177 autocomplete doesn't honor full width period. This was recently marked as "fixed". I confirmed in the latest Insiders that words are extracted taking commas/periods into consideration, but its usefulness is limited because there are usually many words between them. And why is this markdown-only? I think something like this should be enabled in plaintext.

So in conclusion, IMHO vscode should (by default, regardless of the language) assume there is a word boundary whenever the character type changes between "Latin alphabet/number", "CJK unified ideograph (hanja/kanji)", "punctuation marks (incl. multibyte ones)", "Japanese hiragana" and "Japanese katakana", even if there is no space. In addition, when Ctrl+Right is pressed inside a sequence of multiple "CJK unified ideographs", Chinese users (seem to) want the cursor to move by one character, whereas Japanese users usually want the cursor to move to the end of the sequence, as described by (3) above. This may have to be configurable, with locale-based default values.

// Japanese
これは日|本語の文章
// ctrl + right
これは日本語|の文章

// Chinese
本文|的学习公式
// ctrl + right
本文的|学习公式
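
The two behaviors above can be sketched with a single option. Everything in this sketch is hypothetical: `scriptOf`, `WordNavOptions`, `ctrlRight` and `defaultsForLocale` are illustrative names, the code point ranges are a simplified BMP-only table, and none of it reflects how VS Code actually implements word navigation.

```typescript
type Script = "space" | "latin" | "hiragana" | "katakana" | "han" | "other";

// Simplified classification by code point range (assumption: BMP blocks only).
function scriptOf(cp: number): Script {
  if (cp === 0x20 || cp === 0x09 || cp === 0x3000) return "space";
  if ((cp >= 0x30 && cp <= 0x39) || (cp >= 0x41 && cp <= 0x5a) || (cp >= 0x61 && cp <= 0x7a)) {
    return "latin";
  }
  if (cp >= 0x3040 && cp <= 0x309f) return "hiragana";
  if (cp >= 0x30a0 && cp <= 0x30ff) return "katakana";
  if ((cp >= 0x3400 && cp <= 0x4dbf) || (cp >= 0x4e00 && cp <= 0x9fff)) return "han";
  return "other";
}

interface WordNavOptions {
  // true: stop after every ideograph; false: stop at the end of the ideograph run.
  hanStepsByCharacter: boolean;
}

// Returns the UTF-16 offset the cursor would move to on Ctrl+Right.
function ctrlRight(text: string, pos: number, options: WordNavOptions): number {
  let i = pos;
  // Skip whitespace in front of the next word.
  while (i < text.length && scriptOf(text.codePointAt(i)!) === "space") i++;
  if (i >= text.length) return text.length;

  const first = text.codePointAt(i)!;
  const script = scriptOf(first);
  i += first > 0xffff ? 2 : 1;
  if (script === "han" && options.hanStepsByCharacter) return i;

  // Otherwise consume the rest of the run that shares the same character type.
  while (i < text.length) {
    const cp = text.codePointAt(i)!;
    if (scriptOf(cp) !== script) break;
    i += cp > 0xffff ? 2 : 1;
  }
  return i;
}

// Locale-based default: per-character stepping for Chinese locales only.
function defaultsForLocale(locale: string): WordNavOptions {
  return { hanStepsByCharacter: locale.toLowerCase().startsWith("zh") };
}

console.log(ctrlRight("これは日本語の文章", 4, defaultsForLocale("ja")));  // 6
console.log(ctrlRight("本文的学习公式", 2, defaultsForLocale("zh-CN"))); // 3
```

The offsets 6 and 3 correspond to the cursor positions in the Japanese and Chinese examples above. Keeping the choice in an options object means a user setting could override the locale default.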

@smikitky thanks for your detailed investigation ;) IMO word navigation should work seamlessly with CJKV, as ASCII word separators can't handle CJKV words. I do have a prototype that delegates the word segmentation to the browser instead of dealing with that ourselves, and I will work on that in the near term, stay tuned.
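
The comment does not say which browser API the prototype uses. One standard API that exposes the runtime's own word segmentation is `Intl.Segmenter`, so the sketch below uses it purely as an assumption. The helper names `wordBoundaries` and `ctrlRightWithSegmenter` are hypothetical, and the local type declarations exist only so the sketch compiles when the TypeScript `lib` setting predates `Intl.Segmenter`.

```typescript
// Minimal local typing for the parts of Intl.Segmenter used here.
interface WordSegment {
  segment: string;
  index: number;
  isWordLike?: boolean;
}
type WordSegmenterCtor = new (
  locales?: string,
  options?: { granularity: "word" }
) => { segment(input: string): Iterable<WordSegment> };

const WordSegmenter = (Intl as unknown as { Segmenter: WordSegmenterCtor }).Segmenter;

// End offsets of every segment the runtime reports, in ascending order.
function wordBoundaries(text: string, locale: string): number[] {
  const segmenter = new WordSegmenter(locale, { granularity: "word" });
  const offsets: number[] = [];
  for (const part of segmenter.segment(text)) {
    offsets.push(part.index + part.segment.length);
  }
  return offsets;
}

// Move to the first segment end located after the current position.
// Simplification: punctuation and whitespace segments also count as stops.
function ctrlRightWithSegmenter(text: string, pos: number, locale: string): number {
  for (const end of wordBoundaries(text, locale)) {
    if (end > pos) return end;
  }
  return text.length;
}

console.log(wordBoundaries("本文的学习公式", "zh"));
```

The exact boundaries depend on the dictionary data shipped with the runtime, so no expected output is shown.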

I'd like to remind you of Ctrl + Delete, which I think may share the same logic as Ctrl + Arrow, and which is even more upsetting because you can easily delete too many characters by accident.

@rebornix This issue was once included in iteration plans, but I'm seeing no recent activity. Since we're nearing the end of the housekeeping iteration, can I ask if you have any update on this?

Let's see if we can have time for it during holiday time.
