Common-voice: Add multilanguage support

Created on 21 Jun 2017  ·  56 Comments  ·  Source: mozilla/common-voice

Please add multilanguage support. I mean making the web, iOS and Android apps translatable and supporting recording in more languages besides English, for instance Spanish, French, German, Catalan...

I'm sure this feature will be a must in the near future, so it's better to design and add it from the start.

Enhancement

All 56 comments

Hi @jmontane. Thank you for the issue. We do indeed want to add support for multiple languages, and I agree the sooner the better!

Right now, we are working on finishing up the specs (https://mozilla.github.io/voice-web/docs/specs/index.html), but we will get to this soon.

Also looking forward to seeing this feature.
I'm willing to contribute to the Polish database.

I can provide non-native Japanese samples, validation, and localization. I think it's important to expand this tool's capability to more languages.

I wanna help with Portuguese samples.

Maybe Tamil texts can be obtained from Tamil Wikipedia? Gov of Tamil Nadu has passed an order asking a local University to submit articles in Wikipedia under the CC licence.

I could help with Portuguese and non-native French and I'm really looking forward to that feature, as Portuguese speech datasets are rather hard to find. Is there any way to start contributing with 'nice, short sentences' in languages other than English? 😄

I'm just gonna quote myself (such a narcissist) from https://github.com/mozilla/voice-web/issues/723#issuecomment-354973769

Last week we learned about the localization efforts inside of Mozilla, specifically Pontoon and Fluent. We’re eager to leverage these tools as they offer contributors a simple and direct way to provide translations, without needing to be familiar with Git and GitHub. The Pontoon website linked above (hopefully) makes that point for me. It’ll certainly be a bit more of an effort to get these tools configured, compared to other translation libraries we talked about, but it also gives us more. I’m happy to take on the configuration efforts and after that find a way to bring in the work that’s already been done on localization.
As always, happy to hear your thoughts on this!

On a more technical note, here are some early TODOs I stumbled over, which I want to flag here so we don’t forget:
  • Fluent requires Node >= 8, so we either have to upgrade prod (which isn’t a bad idea anyways) or ask Stas if that number can be lowered
  • Use the new version of UglifyJS (otherwise Fluent didn’t work in prod for me): https://webpack.js.org/plugins/uglifyjs-webpack-plugin/#install

Pretty sure the OP is not talking about translation here, but mainly about collecting sentences in a foreign language to build a foreign-language corpus. Even if the interface is in English, that should not prevent people from using it (it's so intuitive anyway), at least at first.

@X-Ryl669 That's an interesting distinction. Though I think we'd want to combine these efforts so that people not well-versed in English can also contribute to their language's dataset.

I've just got my new microphone - so let me contribute my native German voice samples, to make this awesome project even better! 😎🤘

Whoops, Fluent is just one step of the way. A bit early to close this.

Really?
Does #820 add multilanguage recording features?

Indeed it does not. #820 just laid the groundwork for making the site localizable. So it's just one of many steps we have to (and will) take to fully add multilanguage support.

Could we already start by creating sentences similar to #341 for other languages? I would be willing to add Dutch sentences. I think we do need to have some criteria for the input as the same people who check it for English might not be able to check it for other languages.

There is an issue about a process for collecting sentences for other languages' corpora: #756. A sentence proposition and validation system has been loosely proposed and @mikehenrty stated that the consolidation of such processes is on the roadmap for 2018.

I just wanted to tell you, that I'm already collecting German sentences since October or so, when @mikehenrty first said that a German version of Common Voice is planned. Today I've crossed the magic threshold of 10,000 lines. Maybe a good moment to open a thread similar to #341.

I'm ready to help proofreading them :)

Here are the CC-0 sentences for Chuvash (note, audio files already exist for all of these and can be found in the dump). The file sentence indexes will match up. Note, I've deliberately selected the shorter sentences. Can't wait to get these in :D

Here is a large collection of spoken and written Dutch sentences: http://tst-centrale.org/nl/tst-materialen/corpora/corpus-gesproken-nederlands-detail

The content is free for non-commercial usage. There is both a written and a recorded version (roughly 900 hours). The source also includes a large amount of metadata (dialect, gender, ...). Might be useful. I wrote them an email to check whether we could get our hands on the written corpus.

Here is also a list of roughly 350 Dutch sentences of my own, released under CC-0.
https://pastebin.com/EkPRxXzS

@Gregoor, is there any way to participate in the localisation efforts implemented in #820? I couldn't find anything in https://github.com/mozilla/voice-web/blob/master/CONTRIBUTING.md. Apparently there are some commits recently that mention Pontoon (like 2ebef0dc3b358808e57e3e8e6d44b8b2d8930813).

Hey @danielsjf, yes there is: https://pontoon.mozilla.org/projects/common-voice/ 🙂
It's not in the Contributing file because we want to explain how to be part of that effort on the site itself. We'll add it to the site in the coming weeks.

oh, somehow missed that, I'm going to Pontoon now to improve the German translation :)

The first pontoon commits for German language arrived recently, but nobody has yet opened a German thread similar to #341. So now I'm posting my text corpus right here.

It includes hundreds of common phrases, names, places, stupid word jokes, allusions, technobabble/science, food, sports, animals, a short story, poems, and lots of other stuff. Well, more than half a year of writing, summing up to 11k sentences. It will take you hours to read them :P

Feel free to proofread, correct, merge or do whatever you want with them. Can't wait for the launch of Common Voice's German version. :-)
11k-german-sentences.txt

wow that's a lot of text!
cool :)

now that translation is done, I'll probably proofread those sentences over time.

I'm looking forward to the German/multi language release as well.

It would be helpful if we could test out the German translation on the staging server voice.allizom.org before that though (text width for buttons, context issues in the translation etc.)
Would be great if someone could push the upstream some time.

Oooh there is a staging server, sweet... it would be great if we could test out Chuvash and Tatar uploads too :)

Yeah, the staging server is certainly helpful.
I found the URL in the project info page of common voice on Pontoon.

Another ~430 Dutch sentences: https://pastebin.com/aEHCMSgM released under CC-0.

Are we supposed to keep adding all the sentences for each language here or can we make different issues per language?

I should add some additional context for Japanese support, because it is very complicated. Japanese has a really complicated writing system and will require special handling. Some background - there are three main writing systems:

  • Hiragana, which is used mainly for Japanese words, looks like ひらがな
  • Katakana, which is used mainly for loanwords, looks like カタカナ
  • Kanji, which are based on borrowed Chinese characters and have complicated rules, looks like 漢字
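
As a rough illustration of how software can tell the three scripts apart, here is a minimal Python sketch that classifies characters by Unicode block. The function names are made up for this example, and the kanji range only covers the basic CJK Unified Ideographs block, so rarer characters in the extension blocks would be reported as "other".

```python
def script_of(ch):
    """Classify a single character by its Unicode block (simplified)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:   # Hiragana block
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:   # Katakana block
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs (basic block only)
        return "kanji"
    return "other"


def scripts_in(text):
    """Return the distinct scripts found in text, in order of first appearance."""
    seen = []
    for ch in text:
        script = script_of(ch)
        if script not in seen:
            seen.append(script)
    return seen


print(scripts_in("ひらがな"))  # ['hiragana']
print(scripts_in("カタカナ"))  # ['katakana']
print(scripts_in("漢字"))      # ['kanji']
```

A check like this could be used to flag sentences written entirely in hiragana, which (as explained below) are hard to read.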

Hiragana and katakana both map (almost) 1-to-1 with syllables in spoken Japanese. There are some exceptions:

  • Sometimes a small version of よ (yo), ゆ (yu), or や (ya) is added to another character, like きょ (ki+yo=kyo)
  • Katakana sometimes uses weird stuff like ヴァ (va, a combination of ウ (u) with a diacritical mark plus ア (a)) to map sounds for loanwords, the full list of weird sounds is out there. Some Japanese people will have trouble pronouncing these so it might sound like ば (ba) or so
  • Some symbols have a grammatical function that is separate from their use in words, and have different sounds in those cases. For example, は (ha) sounds like わ (wa) when used as a grammatical marker, but sounds like ha when used e.g. in はじめて (初めて)

Then kanji comes into play. Each kanji generally has at least two pronunciations: 訓読み (kunyomi), which is the Japanese pronunciation and is generally accompanied by hiragana, and 音読み (onyomi), which is the Chinese pronunciation and is generally used for compound words of several kanji. There are often more than two pronunciations. There's also a separate set of pronunciations used for names, which are often just a crap-shoot. Most sentences have a mix of hiragana and kanji and often some katakana. Sentences written entirely in hiragana are difficult to understand or read clearly, so kanji is unavoidable. A full example sentence using all three looks like this: これは私の一番好きなパソコンです ("this is my favorite computer"). Broken down this is:

  • これ: This, hiragana
  • は: Grammatical marker indicating the topic, hiragana
  • 私: Me, kanji
  • の: Grammatical connective tissue, makes the preceding term an adjective (me -> my), hiragana
  • 一番: number one, kanji
  • 好きな: liked (adjective), kanji plus hiragana
  • パソコン: pasokon, a loanword short for personal computer, katakana
  • です: the verb "to be", hiragana

You could write the whole thing in hiragana, これはわたしのいちばんすきなぱそこんです, but that's nigh-unreadable.

There are open source tools like anthy and mozc which can be used to provide suggestions for kanji based on hiragana inputs (it also suggests hiragana or katakana writings for some words as appropriate, since kanji is not always used). This is leveraged in many IMEs that let you type in Japanese on a computer. I also have a project which compiles anthy with wasm and wires up a nice JavaScript interface for doing kanji suggestions in-browser. There's a live demo here. If you take the hiragana sentence I wrote above and paste it into the second textbox, you can get a similar breakdown from anthy for the possible interpretations.

This'll get us pretty far but getting all the way will be difficult. Japanese has many homonyms which use different kanji - for example, きる could be written as 着る or 切る or 斬る and each has a different meaning which has to be inferred from context (respectively: to wear, to cut, to kill someone with a blade).

On voice, example sentences will have to be shown with the appropriate written style (kanji+hiragana+katakana), optionally with furigana (which are hiragana readings written above the kanji to assist in reading, can be done in HTML with rubies), but when it eventually is fed into voice recognition tools it should be listening for hiragana and attempting to turn them into kanji with anthy or what have you.
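
For what it's worth, the furigana markup mentioned above could be generated with something like the following Python sketch. The helper names and the (text, reading) input format are assumptions made for this example, not anything the site currently uses; the readings would have to come from the sentence contributors or a tool like anthy.

```python
from html import escape


def ruby(base, reading):
    """Wrap base text in HTML ruby markup with its reading as furigana."""
    # <rp> supplies fallback parentheses for browsers without ruby support.
    return (
        f"<ruby>{escape(base)}<rp>(</rp>"
        f"<rt>{escape(reading)}</rt><rp>)</rp></ruby>"
    )


def annotate(segments):
    """Render (text, reading) pairs; a reading of None means no furigana."""
    return "".join(
        ruby(text, reading) if reading else escape(text)
        for text, reading in segments
    )


print(annotate([("私", "わたし"), ("の", None)]))
# <ruby>私<rp>(</rp><rt>わたし</rt><rp>)</rp></ruby>の
```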

@SirCmpwn thanks for the description. This shouldn't be a problem as (afaik) people won't be asked to type anything in Japanese. There are two tasks:

  1. Listen to a segment, read a text: Does the segment match the text?
  2. Read a text, pronounce it out loud.

The texts will be supplied by the community as text files, "offline" (outside of the interface), so probably an IME will not be necessary.

@jf99 is the 11k-german-sentences.txt part of any repo? If not: What do you think about forking this project and committing the file, so I can work on top of that?

I'm willing to contribute to the Thai database.

I'm wondering whether quotation marks and telephone numbers are actually a good idea to put into the strings. Probably yes, but it's still something that has to be taken into account while training, right? (since it's not syntax but semantic)
What do you think?
jf99's examples include both btw.

@djfe keep them in. They can always be sedded out, but it's hard to put them back in if you remove them.
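
To illustrate the point, stripping quotation marks afterwards is a one-line substitution. Here is a minimal Python equivalent of the sed approach; the set of quote characters and the sample sentence are just assumptions for this sketch, and apostrophes are deliberately left alone.

```python
import re

# Straight, German, English and French double quotation marks.
QUOTES = re.compile(r'["„“”«»]')


def strip_quotes(sentence):
    """Remove double quotation marks, leaving all other characters untouched."""
    return QUOTES.sub("", sentence)


print(strip_quotes("Er sagte: „Guten Morgen“."))
# Er sagte: Guten Morgen.
```

Going the other way (restoring quotes that were removed) has no such simple rule, which is why keeping them in the source text is the safer default.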

I was putting in quotation marks since we already have hundreds of them in the English dataset. But still, your question is justified. There are quite a few things in my text where I was unsure how to write them. Some examples:

  • units: "km" vs. "Kilometer" or "ct/min" vs. "Cent pro Minute" or "%" vs. "Prozent"
  • prices: "69,99 Euro" (how you write it) vs. "69 Euro 99" (how you say it)
  • same with time: "5:02 Uhr" vs. "5 Uhr 2"
  • in general numbers: "22" vs. zweiundzwanzig
  • sports: "3:0" vs. "3 zu 0", same with "Maßstab von 1:100000"
  • Are loanwords from foreign languages with foreign characters okay, btw? Thinking of "Déjà-vu".

The list is likely to be incomplete.

@jf99 that's exactly what I was talking about :)

In general it would be good to have a clearer outline of what form of text the developers expect.

The good thing is: the text can be edited (even after the recordings are done) to be better readable for machines, but the voice will stay the same regardless of the way it is written, right?

I haven't looked too much at the English data set, it's good to know that quotation marks are often used there.

@jf99 @Djfe Ideally everything should be written out in text explicitly.

By that I mean "2010" should be explicitly written as "two thousand and ten", "km" should be written as "kilometer", "69,99 Euro" as "sixty nine euros and ninety nine cents".... This prevents any uncertainty from creeping in as to how one reads a number.

For example "2010" could be read as "two zero one zero", "twenty ten", "two thousand and ten", "two hundred one zero", "twenty one zero", or any number of other ways. So just having the number "2010" allows people to read the text in various ways and is ambiguous.

As one of the explicit goals of this project is to train speech-to-text engines, having the text able to be unambiguously read in only one way, mod accents and the like, should be a goal of the added texts.
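
For the German sentence lists discussed here, part of this rewriting could be assisted by a script. Below is a minimal Python sketch, assuming only standalone numbers from 0 to 999 should be converted automatically; the function names are made up for this example. Longer digit runs (such as "2010") and anything that looks like a price, time, or score are left untouched, because those are exactly the cases where a human has to decide how the text should be read.

```python
import re

SMALL = [
    "null", "eins", "zwei", "drei", "vier", "fünf", "sechs", "sieben",
    "acht", "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn",
    "fünfzehn", "sechzehn", "siebzehn", "achtzehn", "neunzehn",
]
TENS = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig",
        6: "sechzig", 7: "siebzig", 8: "achtzig", 9: "neunzig"}


def german_number(n):
    """Spell out an integer from 0 to 999 in German."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return SMALL[n]
    if n < 100:
        tens, unit = divmod(n, 10)
        if unit == 0:
            return TENS[tens]
        # "eins" becomes "ein" inside compounds: einundzwanzig
        return ("ein" if unit == 1 else SMALL[unit]) + "und" + TENS[tens]
    hundreds, rest = divmod(n, 100)
    prefix = ("ein" if hundreds == 1 else SMALL[hundreds]) + "hundert"
    return prefix if rest == 0 else prefix + german_number(rest)


# One to three digits that are not part of a longer number, price or time.
STANDALONE = re.compile(r"(?<!\d)(?<!\d[,.:])\d{1,3}(?!\d|[,.:]\d)")


def spell_out_digits(sentence):
    """Replace standalone 1-3 digit numbers with their German spelling."""
    return STANDALONE.sub(lambda m: german_number(int(m.group())), sentence)


print(german_number(22))                        # zweiundzwanzig
print(spell_out_digits("Ich habe 23 Äpfel."))   # Ich habe dreiundzwanzig Äpfel.
print(spell_out_digits("69,99 Euro"))           # 69,99 Euro (left for a human)
```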

Seriously, a native German speaker realistically won't read "2010" as 2-0-1-0. Maybe that is a problem for other strings or for other languages where there is more ambiguity, but especially when it is (formatted like) a year (2010) or part of a sentence ("I have 23 apples."), nobody will say "I have two three apples." On the contrary, having to read "two thousand and ten" confuses me more than just 2010. Same with 33,34€ – there is usually a known standard way of speaking them.

And if there were actually two accepted ways (say "sixty nine and ninety nine euro" and "sixty nine euros, ninety nine cents" (without "and"), plus your way), it is actually good to train the AI to recognize all of them.
Later the AI also may get "31,45€" as input and has to speak it, so…

The only exception may be if it is really not clear (also from the context!) how to speak it, e.g. if there is a (long) number without any context, e.g. "71670". There it is not clear how to speak it.

@rugk Basically all languages have these ambiguities including German.

For example consider the time "3:25" in German. This can be said "drei Uhr fünfundzwanzig" or it can be said "fünf vor halb vier".

As to your suggestion that "if there would actually be two accepted ways...it is actually good to train the AI to recognize all ways": I'm sorry to say speech-to-text engines don't work that way. They need to be trained on consistently labeled data or they don't work. The transcription has to match the audio.

AFAIK it is not clear how DeepSpeech will be used. It can also be used to recognize speech, not only for TTS.

@rugk Deep Speech is used for converting speech to text _not_ for converting text to speech.

@kdavis-mozilla Thanks for the clarification! Unfortunately, some sentences will probably get too long due to this requirement.

For example consider the time "3:25" in German. This can be said "drei Uhr fünfundzwanzig" or it can be said "fünf vor halb vier".

Nobody would say "fünf vor halb vier" when he's supposed to read "3:25 Uhr", though. However, conflicts like "zwo" vs. "zwei" and "fuffzig" vs. "fünfzig" are more likely to happen, which supports your argument. If we write out all numbers explicitly, then the way to go is apparently to have some function _after_ the actual STT, which converts strings like "drei Uhr fünfundzwanzig" into "3:25 Uhr".

My question about Déjà-vu is still unanswered.
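
The post-processing function suggested above (turning a written-out time back into digits after the actual STT) could look roughly like this Python sketch. It only handles the "X Uhr Y" pattern from the example and is not part of DeepSpeech or any existing tool; all names are made up for illustration.

```python
import re

SMALL = [
    "null", "eins", "zwei", "drei", "vier", "fünf", "sechs", "sieben",
    "acht", "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn",
    "fünfzehn", "sechzehn", "siebzehn", "achtzehn", "neunzehn",
]
TENS = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig"}


def german_word(n):
    """Spell out 0..59 in German (enough for hours and minutes)."""
    if n < 20:
        return SMALL[n]
    tens, unit = divmod(n, 10)
    if unit == 0:
        return TENS[tens]
    return ("ein" if unit == 1 else SMALL[unit]) + "und" + TENS[tens]


# Reverse lookup table: number word -> value.
WORD_TO_NUMBER = {german_word(n): n for n in range(60)}
WORD_TO_NUMBER["ein"] = 1  # as in "ein Uhr zehn"

TIME_PHRASE = re.compile(r"(\w+) Uhr (\w+)")


def to_clock_time(text):
    """Rewrite '<hour word> Uhr <minute word>' as 'H:MM Uhr'."""
    def replace(match):
        hour = WORD_TO_NUMBER.get(match.group(1).lower())
        minute = WORD_TO_NUMBER.get(match.group(2).lower())
        if hour is None or minute is None or hour > 23:
            return match.group(0)  # not a recognizable time, leave unchanged
        return f"{hour}:{minute:02d} Uhr"
    return TIME_PHRASE.sub(replace, text)


print(to_clock_time("drei Uhr fünfundzwanzig"))  # 3:25 Uhr
print(to_clock_time("fünf Uhr zwei"))            # 5:02 Uhr
```

Variants like "zwo" or "fuffzig" would need extra entries in the lookup table.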

@kdavis-mozilla Aha, and it still needs to be trained for the exact thing. That is sad.

However, conflicts like "zwo" vs. "zwei" and "fuffzig" vs. "fünfzig" are more likely to happen

At least for these, that is ok. AFAIK it was once said accents (or similar) are accepted and should be supplied. That's also why I thought different ways to speak (with and, without and, etc.) are useful for the AI.

@jf99 As for "fünf vor halb vier", I've heard people use this form, and it is how I was taught. I remember distinctly as I thought it absurd. (Notice I didn't add the "Uhr" to "3:25" as, yes, if one adds the "Uhr" then one usually reads it one way.)

Also, I didn't even mention the "7:15" and "7:45" problem which is far worse. For a lot more German examples see the discussion[1].

As for "Déjà-vu" in English, say. It's a harder question.

My gut says keep it in, but my only worry from a data set/machine learning point of view would be the sparsity of data for characters such as "é" in English. Would the system be able to learn "é" with such sparsity? Also where does this stop? Adding in Kanji too?

I don't have a hard and fast answer. I think most English dictionaries would include "Déjà-vu", but all speech corpora I've used in English limit themselves to the characters "a" to "z" "<space>" and "'"

@kdavis-mozilla I would try and stay as close to the original text as possible. Character substitution (e.g. é → e) can and should be done in a preprocessing stage. In my opinion the same goes for numeral expressions. Regarding Kanji, it depends on the target application. My feeling would be that this should also be done with preprocessing.

In general in my experience, the raw data should ideally be massaged only in a deterministic and repeatable way.
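
As an example of such a deterministic, repeatable step, the é → e substitution can be done with Python's standard `unicodedata` module. This is only a sketch; the `keep` parameter is an assumption added here so that letters which belong to a language's own alphabet (German umlauts, say) are not flattened along with the accents.

```python
import unicodedata


def strip_accents(text, keep=""):
    """Remove combining accents, except on characters listed in keep."""
    # Normalize to composed form first so lookups in keep are reliable.
    text = unicodedata.normalize("NFC", text)
    keep = unicodedata.normalize("NFC", keep)
    out = []
    for ch in text:
        if ch in keep:
            out.append(ch)
            continue
        # Decompose into base letter + combining marks, then drop the marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(strip_accents("Déjà-vu"))                            # Deja-vu
print(strip_accents("Déjà-vu für alle", keep="äöüÄÖÜ"))    # Deja-vu für alle
```

Because the original text is kept as-is and the mapping is applied at training time, the choice can be revisited later without touching the recordings.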

@ftyers The discussion here is concerned with what the raw data/original text is. In other words, what should the texts that people read contain.

As these texts are created by contributions, we have control over what they contain. Thus, we can require all numerical expressions "2010", for example, be written out "two thousand and ten".
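
A requirement like this is easy to check mechanically before a sentence list is accepted. The following Python sketch simply flags every sentence that still contains a digit so a contributor can rewrite it; the function name and the sample sentences are made up for illustration.

```python
import re

DIGIT = re.compile(r"\d")


def find_sentences_with_digits(sentences):
    """Return (line number, sentence) pairs for sentences containing digits."""
    return [
        (number, sentence)
        for number, sentence in enumerate(sentences, start=1)
        if DIGIT.search(sentence)
    ]


sample = [
    "Es ist fünf Uhr zwei.",
    "Es ist 5:02 Uhr.",
    "Das kostet 69,99 Euro.",
]
for number, sentence in find_sentences_with_digits(sample):
    print(f"line {number}: {sentence}")
# line 2: Es ist 5:02 Uhr.
# line 3: Das kostet 69,99 Euro.
```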

What about numbers in names?
"Wendelstein 7-X"
EDIT, or: hr3 (broadcasting station)
EDIT2: A5 (highway)
HH-AB-123 => HH-AB-eins-zwei-drei ?
[email protected]

Currently I'm rewriting all numbers in @jf99 's file, while proofreading it. (so: nobody else has to do it :) )

@Djfe What's your progress so far? Are you going to open a pull request with the corrected sentences when you're done proofreading? We could discuss details about single sentences there.

just so you know @jf99 , 11K sentences is the size of your average novel. You've written a novel, auf Deutsch. Except I bet the story doesn't make much sense. Or does it??

Edit: wrong link

I'm planning on finishing it this weekend. But I can't make any promises. I'll do a pull request with all my improvements after this weekend (on sunday or monday).

I reworked the Dutch sentences given the rules stated above (e.g. no numbers). I've added more lines as well so the total is now around ~1400 sentences. All released under CC-0.

I also had another issue with CO2. Is it fine if this one contains a number?

https://pastebin.com/i2KNbHZM

@danielsjf I would vote against chemical abbreviations or abbreviations in general. In this case 'CO2' can be pronounced as 'C-O-2' or as 'carbon-dioxide'

I have some more thoughts on numbers:
Germans (very often/usually) speak numbers differently than they write them :/
https://www.youtube.com/watch?v=0Z4ggIyzc-g

Shall we use the writing how they are spoken, or how they are written?
Because in the wild they are certainly spoken differently so the speech recognition engine needs to be able to understand them.

Or do you think everybody will speak them correctly anyways? (how they are usually pronounced)

Or do you think everybody will speak them correctly anyways? (how they are usually pronounced)

As for German numbers, yes, very sure. At least German native speakers know no other way to speak them… :smile:

@Djfe What the video demonstrates is just mumbling. We are certainly NOT going to write sentences with wrong spelling of numbers. I think, it's pretty clear how to write numbers now, except for cases where there is no way to write them out as words (license plates, names of radio stations etc.).

We're almost there, if someone wants to do some early testing & feedback: https://github.com/mozilla/voice-web/pull/982
It still has some rough edges, I'll try to have it in a good shape on staging early next week.
