Gutenberg: Word Count in content structures does not count Chinese words properly

Created on 9 Feb 2019 · 7 Comments · Source: WordPress/gutenberg

Describe the bug
When writing a post in Chinese, the word count shown in the content structure is inaccurate.

To Reproduce
Steps to reproduce the behavior:

  1. Use http://generator.lorem-ipsum.info/_chinese to generate some sample Chinese text
  2. Create new page in Gutenberg
  3. Paste in content
  4. Click the info icon at the top
  5. See an incorrect word count

Expected behavior
When the same content is pasted into a word processor such as Pages, the word count is significantly larger than the one shown in Gutenberg. The expected behavior is an accurate word count regardless of the language used.

Screenshots
You can see a large amount of content but it is showing only 10 words

Desktop (please complete the following information):

  • OS: MacOS 10.14.2
  • Browser: Firefox
  • Version> 64
Internationalization (i18n) [Type] Bug

All 7 comments

It looks like this happens because in some languages words may not be separated by spaces. For example, 这是鸟 means "This is a bird": it is three words with no spaces between them.
Counting words is a complex problem. In some languages the best approach may be to count each character and apply a character-to-word ratio, but documents can mix languages, so we would need to identify the best method to use for each segment.
This external page describes an algorithm for counting words and may be helpful for this issue: https://docs.sdl.com/LiveContent/content/en-US/SDL%20WorldServer-v3/GUID-376E123B-1C7E-4D64-82B0-1D33F088ABD5
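
To make the problem concrete, here is a minimal sketch of why a whitespace-based count fails for Chinese and how mixed text could be split into per-script segments before counting. The function names `countByWhitespace` and `splitByScript` are hypothetical and are not part of Gutenberg. The sketch only distinguishes Han characters from everything else, which is an assumption made for brevity.

```javascript
// Naive count: split on runs of whitespace (hypothetical helper).
function countByWhitespace(text) {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/u).length;
}

// Split text into runs of Han characters and runs of everything else,
// so each run can later be counted with a method suited to its script.
// Assumption: only the Han script is treated specially here.
function splitByScript(text) {
  const runs = [];
  const pattern = /(\p{Script=Han}+)|([^\p{Script=Han}]+)/gu;
  for (const match of text.matchAll(pattern)) {
    runs.push({
      script: match[1] !== undefined ? 'han' : 'other',
      text: match[0],
    });
  }
  return runs;
}

console.log(countByWhitespace('This is a bird')); // 4
console.log(countByWhitespace('这是鸟')); // 1, although it is three words
console.log(splitByScript('Hello 你好'));
// [ { script: 'other', text: 'Hello ' }, { script: 'han', text: '你好' } ]
```

Each `han` run could then be counted per character (or with a ratio), while `other` runs keep the whitespace-based count.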

@jorgefilipecosta I think there may be two ways to fix this bug. One way is to follow atom-word-counter and present both the word count (based on English words) and the character count (all characters excluding white space). This requires only small changes in the UI.
The other way is to follow Microsoft Word: in sentences that mix East Asian and Latin-script languages, such as "Hello 你好", there are three words (one English word plus two Chinese characters). This requires a significant change to the count function, especially matchWords.
Which one is better? Does anyone have any idea?
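
As a rough illustration of the second option, the sketch below counts every Han character as one word and counts the remaining text by whitespace-separated tokens, so "Hello 你好" yields three. `countWordsMixed` is a hypothetical name; this is not the existing `matchWords` implementation, nor a claim about how Microsoft Word works internally. It also assumes only Han characters need per-character treatment, so kana, Hangul, Thai and others are not handled.

```javascript
// Hypothetical mixed-script word count.
// Assumption: each Han character is one word; everything else is
// counted as whitespace-separated tokens containing a letter or digit.
function countWordsMixed(text) {
  const hanChars = text.match(/\p{Script=Han}/gu) || [];

  // Replace Han characters with spaces so they also act as separators
  // between adjacent Latin-script tokens.
  const rest = text.replace(/\p{Script=Han}/gu, ' ');
  const tokens = rest
    .split(/\s+/u)
    .filter((token) => /[\p{L}\p{N}]/u.test(token));

  return hanChars.length + tokens.length;
}

console.log(countWordsMixed('Hello 你好')); // 3
console.log(countWordsMixed('这是鸟')); // 3
```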

Thank you for summarizing and sharing your thoughts @Jackie6.
I am also not sure which option is better in a case like this, cc: @jasmussen, @mapk, @kjellr in case you have some thoughts on this.

Great ticket. It seems like the two options presented appear to be the "easy" version (count words and characters), and the hard version (be aware of the language when counting words).

It _seems_ like the latter is the better user experience, but it could be so difficult that, unless we get solid pull requests, it may take a while for this to appear. The former, by contrast, is probably easy to build, and a character count would likely be useful regardless of language.

Keeping in mind we mean to merge the Document Outline tool with the Block Navigation tool, we could possibly build solution 1 at the same time, and then consider upgrading to version 2 at a later time?

@sandymcfadden, there is #14589 opened with a proposal of how to resolve this issue as suggested in the discussion above.

Related: #24823 was merged, but it seems this issue is still relevant.

cc @david-szabo97

@swissspidy Ugh, this is a difficult topic.
IMHO, the best would be to move to the list view and do what Google Docs does.
Show Words, Characters, and Characters excluding spaces. Even though Words is not useful information for languages like Chinese or Japanese, I don't think we can accurately handle all languages. If we show all three variations, we can leave it to the writer to decide which information is useful to them.
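
A minimal sketch of what computing those three numbers might look like, assuming a whitespace-based word count and counting characters as Unicode code points. `documentStats` is a hypothetical helper, not an existing Gutenberg API.

```javascript
// Hypothetical helper returning the three counts suggested above.
function documentStats(text) {
  const trimmed = text.trim();
  return {
    // Whitespace-separated tokens; not meaningful for Chinese or Japanese.
    words: trimmed === '' ? 0 : trimmed.split(/\s+/u).length,
    // Array.from iterates by code point, so astral characters count once.
    characters: Array.from(text).length,
    charactersExcludingSpaces: Array.from(text.replace(/\s/gu, '')).length,
  };
}

console.log(documentStats('Hello 你好'));
// { words: 2, characters: 8, charactersExcludingSpaces: 7 }
```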

