Common-voice: Too many text errors in zh-HK: the Cantonese page mixes Mandarin and Cantonese and is neither pure Mandarin nor pure Cantonese

Created on 2 Jun 2020  ·  21 Comments  ·  Source: mozilla/common-voice

(English translation below)

呢個頁面 https://voice.mozilla.org/zh-HK 太多文本錯誤,官唔官粵唔粵,唔知乜鬼雜交怪胎。求其搵幾個例子:

  1. 「您」:粵語中冇「您」呢隻字,「您」字係北京官話方言字,粵語中第二人稱代詞剩得 nei5,寫作「你」。如果真係要表達第二人稱專稱,會寫作「閣下」。
  2. 「的」:粵語入邊嘅屬格標記係「嘅」,粵拼 ge3。「的」普拼 de 係官話字,唔存在於粵語中。
  3. 「一些」:粵語入邊用「一啲」,「一些」係官話詞彙。
  4. 「也」:「也」係官話詞彙,粵語入邊用「都」。
  5. 「不」:粵語入邊否定詞用「唔」(或者「毋」),「不」字係官話表達。

希望 zh-HK 講清楚,究竟頁面係要用粵語定係官話。如果係粵語就用純正粵語,係想用官話就用純官話,唔好官粵夾雜,會誤導人,讀起身亦都好唔舒服。或者乾脆新開一個語言頁面,就叫「粵語(香港)」,原先嘅頁面就叫「國語(香港)」或者「普通話(香港)」,可以避免誤導人。

The page https://voice.mozilla.org/zh-HK has too many lexical errors. It is supposed to be written in Cantonese, but right now it is written in a mixture of Cantonese and Mandarin. A few examples:

  1. "您": This word exists only in Beijing Mandarin. There is no "您" in Cantonese; Cantonese uses only "你".
  2. "的": The possessive marker in Cantonese is "嘅", pronounced ge3. "的" is the possessive marker of Mandarin.
  3. "一些": Cantonese uses "一啲"; "一些" is a Mandarin word.
  4. "也": This is also a Mandarin word; in Cantonese it is "都".
  5. "不": The negation marker of Cantonese is "唔" or "毋"; "不" is the Mandarin negation marker.

To sum up, the language option name "Chinese (Hong Kong)" is ambiguous. It should be renamed "Cantonese (Hong Kong)", and a new option "Mandarin (Hong Kong)" should be added. There is no language called Chinese: Chinese is a macro-language, not a language. Mandarin and Cantonese are two distinct, mutually unintelligible languages under the Chinese macro-language. Saying "I speak Chinese" is as absurd as saying "I speak Germanic".

I am preparing a pull request to fix this issue. Please merge it when it is ready. Thank you for your assistance.

All 21 comments

Hello! Thanks for reporting this. It sounds like the specific errors you're noticing relate to the way the site itself is translated, not to the sentences that are being read out? If so, please make those corrections through Pontoon, the Mozilla localization tool. The Common Voice repo automatically pulls from Pontoon, so any PRs you make here will just be overwritten immediately. Here's the page for zh-HK:

https://pontoon.mozilla.org/zh-HK/common-voice/web/locales/en/messages.ftl/

As for Cantonese/Mandarin vs. Chinese - thanks for the feedback, it's definitely an issue we're aware of. We're working towards a more holistic locale vs. language strategy for the entire site that will hopefully address this and similar issues.

Hi! Yes, it is just the translation of the web page, not the data. I was going to make a PR to modify the file /web/locales/zh-HK/messages.ftl. Because most of the errors are systematic, I can use batch replace in my text editor and fix them very quickly. But regarding Pontoon, does that mean I can't open a PR and can only use Pontoon to change the translation? Thank you!
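
For anyone who wants to locate the systematic errors before suggesting fixes in Pontoon, here is a minimal sketch in Python. It assumes a local copy of a Fluent file made up of simple single-line `message-id = value` entries (multi-line values and attributes are skipped). The function name `find_mandarin_markers` and the lookup table are illustrative only and are not part of Common Voice or Pontoon. The table contains just the five pairs listed in this issue.

```python
import re

# Mandarin forms and the Cantonese replacements named in this issue.
# Illustrative table only; a real review would need a longer list.
MANDARIN_TO_CANTONESE = {
    "您": "你",
    "一些": "一啲",
    "的": "嘅",
    "也": "都",
    "不": "唔",
}

# Matches simple single-line Fluent entries such as `help-text = ...`.
# Comment lines (starting with #) and indented continuation lines do not match.
MESSAGE_LINE = re.compile(r"^([A-Za-z][\w-]*)\s*=\s*(.*)$")


def find_mandarin_markers(ftl_text):
    """Return (line_number, message_id, marker, suggestion) for each hit."""
    hits = []
    for line_number, line in enumerate(ftl_text.splitlines(), start=1):
        match = MESSAGE_LINE.match(line)
        if not match:
            continue
        message_id, value = match.groups()
        for marker, suggestion in MANDARIN_TO_CANTONESE.items():
            if marker in value:
                hits.append((line_number, message_id, marker, suggestion))
    return hits


if __name__ == "__main__":
    sample = "# comment\nhelp-text = 您可以錄製一些句子\nok-text = 你可以錄音\n"
    for hit in find_mandarin_markers(sample):
        print(hit)
```

The sketch only reports candidates and does not rewrite anything. A blind batch replace would be risky because some characters are legitimate inside Cantonese words (for example "的" in "的士", taxi), so each hit still needs a human decision.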

> But regarding the tool Pontoon, does that mean I can't do a PR, but instead can only use the Pontoon to change the translation?

Hi @laubonghaudoi yes, our text localization is managed by the expert communities that contribute to Pontoon. Pontoon tooling enables checks and balances to ensure translation across locales is as accurate as possible. Common Voice adopts localization (including the primary locales themselves) from Pontoon.

As @phirework mentioned, we're aware that written and spoken language do not always correlate to a single locale; it will be a large undertaking to revise our platform to improve this in a way that works across many languages. This work most likely won't start until 2021. For now, please contribute any recommended changes to text translation via the Pontoon tool. Thank you!

cc @bacharakis for visibility.

Hi @mbransn, thank you so much for the clarification. Actually, we have a general complaint about a bigger issue, which is the language classification in Mozilla's standard. Right now Mozilla uses "Chinese Hong Kong", code name "zh-hk", as the language name, but this is ambiguous and misleading because Chinese can be Cantonese or Mandarin, which are very different and mutually unintelligible. In the long run, it should be split into "Cantonese Hong Kong" and "Mandarin Hong Kong". For now I will just assume "Chinese Hong Kong" to be Cantonese and change the page translation into pure Cantonese. May I know when the change will take effect?

P.S. I think the cause of the current incorrect translation is that the original translator of the page assumed "Chinese Hong Kong" == Mandarin, while some later revisions assumed "Chinese Hong Kong" == Cantonese, resulting in this mess. But the Common Voice project is collecting Cantonese voice data, not Mandarin, so I am sure that the assumption "Chinese" == Cantonese wins here.

cc @sammyfung and local l10n team

To your last point @laubonghaudoi, I think the common voice core team is working on better language/dialect management and they are considering moving zh-hk to yue, but that work is still in early stages @peiying2

I am currently the main approver of the Pontoon translations. Previously I focused only on making sure the translations of terminology were at least consistent, but yes, there are many cases of bad Cantonization. I agree there is a lot of room to improve and to unify the register. I can see some new suggestions :) and look forward to seeing more!

Thanks to @laubonghaudoi for creating this issue. Here is my corrected comment.

The purpose of the Common Voice project is to collect voice data. For Hong Kong, that means Cantonese. Therefore, I prefer to use Yue (vocal format, 口語) on the website UI instead of the written format (書面語). The Cantonese part of the Common Voice project is not managed only by Hong Kong volunteers, so a mixture of the two Cantonese formats is used in the UI translation.

Please submit your PR.

@laubonghaudoi raised two issues:

1. The text in zh-HK web pages is a mixture of Cantonese and Mandarin.
I wasn't involved in the translation of the zh-HK text. I guess the text was originally migrated from the zh-CN or zh-TW pages. This is understandable as a way to launch zh-HK quickly and start collecting Cantonese voice data. Later, other volunteers helped translate some of the text into Cantonese. Probably because of limited resources, not 100% of it was translated. Please correct me if I am wrong.

As Cantonese speakers also understand written Mandarin, I think a 100% translation would be good, but I would rather put effort into collecting more Cantonese voice data. I suggest that @laubonghaudoi use the Pontoon tool to improve the text if he really feels "uncomfortable from head to toe" (讀起身亦都好唔舒服) with the "crossbreeding monster" (雜交怪胎).

2. Mozilla mixed up "Cantonese Hong Kong" and "Mandarin Hong Kong".
Being a local Hong Kong person and native Cantonese speaker, zh-HK has one and only one meaning to me: the Cantonese used by Hong Kong people. Its pronunciation, vocabulary and grammar are very different from Mandarin. I don't think either Mozilla or the language code zh-HK is causing any confusion or misunderstanding. Could @laubonghaudoi elaborate on what "Mandarin Hong Kong" is?

I think what @laubonghaudoi meant by mixing up "Cantonese Hong Kong" and "Mandarin Hong Kong" was that part of the UI and some phrases in the voice project are based on '書面語', which by definition is a Mandarin-based form of writing (and hence contradicts the notion that 'zh-HK has one and only one meaning to me: the Cantonese used by Hong Kong people'). Changing all phrases to the Cantonese-based form of writing (i.e. '粵文') should clear up all confusion.


In response to @Ahhang20:

> [...] I don't think either Mozilla or the language code zh-HK is causing any confusion or misunderstanding.

I agree with @dtylam that tagging the locale as yue instead of zh-HK would be better for the avoidance of doubt. (To the uninitiated or to foreign ears, the word 'Chinese' means Mandarin.) Maybe zh-HK could be defined as an alias of yue if there is a strong wish to keep this label.
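
The alias idea can be sketched as a small lookup, shown here in Python. This is only an illustration under stated assumptions: the `zh-hk` to `yue` mapping is the proposal made in this thread, not current Common Voice behaviour, and the names `LOCALE_ALIASES`, `resolve_locale` and `describe` are hypothetical. The member codes and names follow ISO 639-3, which lists `zho` as a macrolanguage.

```python
# Some member languages of the ISO 639-3 macrolanguage zho (Chinese).
CHINESE_MACROLANGUAGE_MEMBERS = {
    "cmn": "Mandarin Chinese",
    "yue": "Yue Chinese (Cantonese)",
    "hak": "Hakka Chinese",
    "nan": "Min Nan Chinese",
}

# Hypothetical alias table: zh-hk -> yue is the proposal in this thread,
# not something Common Voice is known to implement.
LOCALE_ALIASES = {"zh-hk": "yue"}


def resolve_locale(code):
    """Normalize a locale code and map a known alias to its language code."""
    normalized = code.strip().lower()
    return LOCALE_ALIASES.get(normalized, normalized)


def describe(code):
    """Return the language name for a code, or mark it as unknown."""
    resolved = resolve_locale(code)
    return CHINESE_MACROLANGUAGE_MEMBERS.get(resolved, "unknown: " + resolved)


if __name__ == "__main__":
    for code in ("zh-HK", "yue", "cmn", "zh-TW"):
        print(code, "->", describe(code))
```

With an alias like this, existing links and data labelled zh-HK would keep working while the displayed language name becomes unambiguous.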

Thank you so much for the clarifications, guys. Please allow me to explain this issue in more depth to avoid further misunderstandings.

> I don't think either Mozilla or the language code zh-HK is causing any confusion or misunderstanding. Could @laubonghaudoi elaborate on what "Mandarin Hong Kong" is?

I am a native Cantonese speaker from mainland China, not Hong Kong, and therefore I find the code "zh-hk" very confusing and misleading. As I said, the term "Chinese" is ambiguous: Chinese is not a language but a macro-language, which can be verified in the ISO 639-3 standard here: https://iso639-3.sil.org/code/zho
So you can see why we oppose using the word "Chinese" or "Chinese Hong Kong": people have different assumptions about and understandings of the same word. People from Hong Kong, as @Ahhang20 just said, assume "Chinese => Cantonese", while people from other parts of the world (including me, from the mainland) assume "Chinese => Mandarin". If somebody says "I speak Chinese", you simply can't tell whether they mean Cantonese or Mandarin, which are two mutually unintelligible languages. So we suggest using only the terms "Cantonese" and "Mandarin" to avoid any confusion or misunderstanding.

> Being a local Hong Kong person and native Cantonese speaker, zh-HK has one and only one meaning to me: the Cantonese used by Hong Kong people. Its pronunciation, vocabulary and grammar are very different from Mandarin.

From a linguistic perspective, there is no such thing as "vocal format 口語" or "written format 書面語"; the so-called "written format" is actually written Mandarin, not Cantonese. This involves a more complicated historical issue. Hong Kong has long been a diglossic society, which means people there use two different languages at the same time: they speak Cantonese and write Mandarin. Such divergence is called diglossia. However, the upper class of Hong Kong tried to resolve this divergence by inventing the terms "Chinese, written format" and "Chinese, vocal format" and propagated these pseudo-concepts to the public. There is no such thing as written-format Chinese or vocal-format Chinese, but many people in Hong Kong still believe in it today. That's why we are trying to mitigate the effects of these pseudo-concepts and strongly suggest using only "Mandarin" and "Cantonese".

Please refer to the Wikipedia explanations here about the diglossic situation in Hong Kong.


> The purpose of the Common Voice project is to collect voice data. For Hong Kong, that means Cantonese. Therefore, I prefer to use Yue (vocal format, 口語) on the website UI instead of the written format (書面語). The Cantonese part of the Common Voice project is not managed only by Hong Kong volunteers, so a mixture of the two Cantonese formats is used in the UI translation.
>
> Please submit your PR.

> To your last point @laubonghaudoi, I think the common voice core team is working on better language/dialect management and they are considering moving zh-hk to yue, but that work is still in early stages @peiying2

@sammyfung @dtylam Thank you so much! Please do use the term Cantonese (Yue) instead of Chinese Hong Kong; it would clear up all the confusion. Also, does that mean I can just submit a PR instead of using Pontoon to update the translations?


Allow me to elaborate on why I said the current texts are "uncomfortable from head to toe" (讀起身亦都好唔舒服) and a "crossbreeding monster" (雜交怪胎). This is mainly caused by the pseudo-concept I mentioned above. With the invention of the terms "Chinese, vocal format" and "Chinese, written format", the border between Mandarin and Cantonese was blurred and people began mixing the vocabulary and grammar of the two languages. That's why many Hong Kongers today can't distinguish pure written Cantonese from pure written Mandarin and instead write a mixture of the two (the crossbreeding monster I mentioned), even though nobody talks like that in real life. As I described at the beginning, words like "您", "的" and "也" are pure Mandarin words, and nobody uses them in Cantonese in real life.


Also, regarding the Chinese macro-language issue, I have a proposal. Can Mozilla change the extant Chinese page, code name zh, into Mandarin, code name cmn? The reason is stated above, we should avoid making any assumptions about the "Chinese" macro-language, and only use the language names directly to avoid any misunderstandings or confusions.

> I am a native Cantonese speaker from mainland China, not Hong Kong, and therefore I find the code "zh-hk" very confusing and misleading. As I said, the term "Chinese" is ambiguous: Chinese is not a language but a macro-language, which can be verified in the ISO 639-3 standard here: https://iso639-3.sil.org/code/zho
> So you can see why we oppose using the word "Chinese" or "Chinese Hong Kong": people have different assumptions about and understandings of the same word. People from Hong Kong, as @Ahhang20 just said, assume "Chinese => Cantonese", while people from other parts of the world (including me, from the mainland) assume "Chinese => Mandarin". If somebody says "I speak Chinese", you simply can't tell whether they mean Cantonese or Mandarin, which are two mutually unintelligible languages. So we suggest using only the terms "Cantonese" and "Mandarin" to avoid any confusion or misunderstanding.

I think @Ahhang20 was talking about "zh-hk", not "zh".

"zh-hk" is Chinese in Hong Kong, and Mandarin is not considered part of 'zh-hk', because Cantonese and English, not Mandarin, are the official languages in Hong Kong.

And I tell you that Hong Kong people say "we speak Cantonese" instead of "we speak Chinese".

So zh-hk and en-hk can always represent Cantonese and English in Hong Kong, and therefore I disagree with the above opinion from @laubonghaudoi.

> @sammyfung @dtylam Thank you so much! Please do use the term Cantonese (Yue) instead of Chinese Hong Kong; it would clear up all the confusion. Also, does that mean I can just submit a PR instead of using Pontoon to update the translations?

I think that Cantonese (Yue) and Chinese (Hong Kong) mean the same thing to me and to Hong Kong people, because the Chinese spoken in Hong Kong is Cantonese.

So I think we can keep "Chinese (Hong Kong)". Even if we change it to "Yue (Hong Kong)", the code is still zh-hk.

> Also, regarding the Chinese macro-language issue, I have a proposal. Can Mozilla change the extant Chinese page, code name zh, into Mandarin, code name cmn? The reason is stated above, we should avoid making any assumptions about the "Chinese" macro-language, and only use the language names directly to avoid any misunderstandings or confusions.

I think zh-hk is good enough for Hong Kong people.

> And I tell you that Hong Kong people say "we speak Cantonese" instead of "we speak Chinese".

This is clearly not true. Right now there are still many people (not only in Hong Kong but also in overseas Cantonese communities) who use the term "Chinese 中文" to refer to Cantonese, saying "We speak Chinese" to mean "We speak Cantonese": https://www.youtube.com/watch?v=sc28HYRQ9iA

And this is exactly why we should avoid using the term "Chinese 中文" to represent any language. Please be aware that the language options and code names are not only for Hong Kong people but for people around the world. We should look beyond the narrow scope of Hong Kong. There are over 80 million Cantonese speakers around the world, and the population of Hong Kong makes up only a small fraction of them. When people from outside Hong Kong see the word "Chinese 中文" or the code name "zh", the first thing that comes to mind is Mandarin, not Cantonese. We should dissolve this misunderstanding by using accurate terminology, that is, the name of the individual language (Cantonese and Mandarin, and also Hakka, Hokkien and other minority Chinese languages for future consideration) rather than the name of a macro-language (Chinese).

P.S. I am perfectly fine with not supporting "Mandarin Hong Kong" or "English Hong Kong" and assuming Cantonese to be the only official language in Hong Kong. What I propose and request is to avoid misleading and confusing names (the name of a macro-language instead of an individual language) and to clarify precisely which language we are using. I brought up the "Mandarin Hong Kong" split simply because I know there are a number of stubborn people in Hong Kong who still insist on writing in the so-called "written format 書面語", which is actually Mandarin. To cater to their preference we can keep a "Mandarin Hong Kong" option.

There are historical reasons for the current situation, in which Chinese is used to refer to Cantonese in Hong Kong, Mandarin in China, and Mandarin in Taiwan. Probably a big part of it is that I lacked the relevant knowledge when I tried to initiate the Chinese effort at the beginning of this project several years ago. _It's not about the scope of view of any one place, but about the scope of the people who contribute the most_.

We now have an updated language and accent strategy, and we have also suggested re-categorizing the Chinese languages into Mandarin (with a Traditional Chinese corpus), Mandarin (with a Simplified Chinese corpus) and Cantonese, taking ISO639-1 as a reference.

However, please understand that Common Voice is just a small open source project in Mozilla with limited resources, yet it is dealing with a global-scale problem. We need to make progress one step at a time toward that ideal scenario while continuing to promote the project, encouraging more people to contribute, and achieving some results, in the hope that the results of each stage keep the project going.

We do need more people with knowledge of linguistics to join and help. Please be patient, and if you think something is wrong, send your suggestion via Pontoon to get it fixed.

The discussion of the different sub-issues raised here has become too long. It is not only about linguistics; it also involves technical questions, e.g. which code scheme to support. I suggest that @laubonghaudoi first contribute minor fixes to the zh-hk UI translation if he is interested in contributing.

We may discuss sub-issues one by one later.

> We do need more people with knowledge of linguistics to join and help. Please be patient, and if you think something is wrong, send your suggestion via Pontoon to get it fixed.

Thank you! So should I make changes only with Pontoon, or can I submit a PR to fix them once and for all? Most of the translation errors are systematic, so if I can make a PR I can use my text editor to do a batch replace and fix them much faster. And I can also invite other Cantonese linguists (members of https://github.com/lshk-org) to review the translation before merging.

And as a Hongkonger, I would personally like to express my concern about discussing Cantonese and the meaning of Chinese (Hong Kong) with non-Hongkongers, and I keep wondering whether anything in this discussion conflicts with the Code of Conduct.

I think UI translation is a simple task that doesn't require a review from linguists.

Some of the discussion in this issue is difficult and complicated. Cantonese linguists who were born in Hong Kong may have different opinions from, and disagree with, Cantonese speakers who were born outside Hong Kong.

> Should I make changes only with Pontoon, or can I submit a PR to fix them once and for all? Most of the translation errors are systematic, so if I can make a PR I can use my text editor to do a batch replace and fix them much faster.

Just on Pontoon. I understand that directly modifying the string file seems faster, but website UI l10n is a one-way procedure. To prevent your contribution from being overwritten, please focus your work there.

> And I can also invite other Cantonese linguists (members of https://github.com/lshk-org) to review the translation before merging.

You and your colleagues are more than welcome to share opinions and suggestions on Discourse, especially in strategy discussions such as the one on accents and languages. There is also a Chinese sub-category where we can discuss in Chinese. (It is named "Mandarin (Taiwan)", but for now we use it for everything related to Chinese.)

@mbransn I think we can close this issue, as we can now improve the UI translation on Pontoon and continue the language strategy discussion on Discourse.

@sammyfung @irvin I just made some proposed changes via Pontoon, please approve and we can close this issue. And we can discuss the related sub-issues in other posts.

@laubonghaudoi @irvin @sammyfung @dtylam thank you for the constructive back and forth here. Closing this issue, as UI translation edits are happening in Pontoon. Agreed that Discourse is the best place for language strategy discussions. @bacharakis, @nukeador and I will continue to monitor there.
