Common-voice: Write some nice, short sentences for people to read.

Created on 23 Jul 2017 · 99 Comments · Source: mozilla/common-voice

Due to bugs #23 and #319 in our current sentence collection, we are trying to diversify and refresh our sentences in #333. We would love your help! If you would like to contribute your own writing to the Common Voice sentences, please put your sentences (one per line) in a publicly linkable document (e.g. Pastebin), then add a comment to this bug with a link to those sentences.

Criteria:

please write them yourselves. don't copy and paste from somewhere else.
try to make the sentences conversational, i.e. easy to read out loud.
you must agree to releasing your sentences to the public domain with a cc-0 license.
more than 50 sentences per link, but less than 500 please.
be nice, don't use offensive language. we aren't collecting that kinda material.
i'll be reading each one, and i may remove some but i will let you know why.
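The size criteria above can be checked before posting a link. The following is a minimal sketch, assuming the submission is plain text with one sentence per line; the function name `check_submission` and its messages are hypothetical and not part of any Common Voice tooling.

```python
def check_submission(text: str, min_count: int = 50, max_count: int = 500) -> list[str]:
    """Return a list of problems; an empty list means the size criteria are met."""
    # One sentence per line; blank lines are ignored.
    sentences = [line.strip() for line in text.splitlines() if line.strip()]
    problems = []
    # "More than 50 ... but less than 500" is read here as strict bounds.
    if len(sentences) <= min_count:
        problems.append(f"only {len(sentences)} sentences, need more than {min_count}")
    if len(sentences) >= max_count:
        problems.append(f"{len(sentences)} sentences, need fewer than {max_count}")
    return problems


sample = "\n".join(f"This is sentence number {i}." for i in range(60))
print(check_submission(sample) or "looks fine")
```

The other criteria (original writing, conversational tone, CC-0 agreement, no offensive language) need a human reviewer and are not covered by this check.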

Thanks!

help wanted

mikehenrty

👍3

Most helpful comment

why not take text from wikipedia? https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content#Text_content

In my experience, it is really hard to find interesting, readable sentences from Wikipedia. The language there tends to be quite formal and sometimes awkward to speak. I built some tools to chop up text here, but never got around to adding wikipedia sentences to this project. That said, I'm always open to accepting contributions. 👍

The other thing I like about asking people for sentence donations (which is just an experiment for now), is that we get a lot of clever, interesting messages. Then, someone halfway around the world will read these messages out loud.. I think that's so cool :)

mikehenrty on 29 Jul 2017

👍5 😄2

All 99 comments

Given that this is a specific bug... and is therefore addressable as a software item. I have elsewhere offered up already public domain resources as good sources for sentences:

https://en.wikipedia.org/wiki/Wikipedia:Public_domain_resources

As an open healthcare data person, I obviously would love the opportunity to inject your sentence data set with layman's healthcare terms. Assuming no one else is available to incorporate the various resources that I am discussing here, I might be able to devote some resources towards scraping some of the resources that have solid APIs, and running them through some per-source filters that would serve to ensure that they do not have unusual acronyms or other confusing industry jargon.

I would recommend that you develop a corpus that details the following information for each source of sentences:

The sentence itself
The url source for the sentence
A reference to the rule (or contribution agreement to the cc-0 license) that provides evidence that the sentence in question is available under the public domain.
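A minimal sketch of what one such corpus record could look like, assuming a JSON-lines file with one record per sentence. The class name, field names, and example values are hypothetical; the thread does not define a format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SentenceRecord:
    """One corpus entry; all field names are illustrative assumptions."""
    sentence: str          # the sentence itself
    source_url: str        # the URL the sentence came from
    license_evidence: str  # rule or contribution agreement showing public-domain status


def to_json_line(record: SentenceRecord) -> str:
    """Serialize a record as one line of JSON, keeping non-ASCII text readable."""
    return json.dumps(asdict(record), ensure_ascii=False)


record = SentenceRecord(
    sentence="The sea is serene.",
    source_url="https://example.org/sentences.txt",  # placeholder URL
    license_evidence="Contributor agreed to CC-0 in the submission comment",
)
print(to_json_line(record))
```

Keeping the license evidence next to each sentence means a sentence can later be removed, or its status audited, without consulting a separate list.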

I think I could write some scripts that would generate a few million (at least) sentences that met some basic rules that I could demonstrate were public domain. Or someone else could, if I fail to deliver (which, I should warn, is a frequent occurrence; this is only tangentially related to my day job, after all).

Let me know if this would be helpful.

Regards,
-FT

ftrotter on 24 Jul 2017

I'm back! Here's another hundred:

https://pastebin.com/1CU7GQYs

CC-0 or whatever license you need. There's a good few simple sentences.

ajaydee on 24 Jul 2017

❤2

Hope this helps.
https://pastebin.com/BRsfLuNt

akshit13 on 24 Jul 2017

❤1

Thanks @ftrotter, that is very helpful. Let's discuss that on Discourse, and see how many contributions we can get from this thread.

mikehenrty on 24 Jul 2017

Thanks @akshit13 and @ajaydee, I added your contributions.

mikehenrty on 24 Jul 2017

There you go. :-)

https://gist.github.com/orschiro/bb64b1bf56e55dc8741e9c764c5560c0

orschiro on 25 Jul 2017

This is a randomized set of sentences from some of my older "computer vision" reviews. I waive all rights to the text. Use it any way you like.
https://pastebin.com/DTezZ7rA

However, I would urge you to use some large diverse contemporary corpus.

michal-hradis on 25 Jul 2017

Thanks @orschiro and @michal-hradis!

However, I would urge you to use some large diverse contemporary corpus.

We are looking into that on the PR below. In the meantime, we are experimenting with gathering our corpus as we go.
https://github.com/mozilla/voice-web/pull/304

mikehenrty on 25 Jul 2017

👍2

Here is another 300 sentences:
https://pastebin.com/Qf7Ykcbz

jf99 on 25 Jul 2017

@jf99, some good ones in there.

The servers of the common voice project couldn't handle the heavy load.

We're working on it! :D

mikehenrty on 27 Jul 2017

Some more for ya:
https://pastebin.com/tCWPrxZJ

psullivan6 on 27 Jul 2017

Some help:
https://gist.github.com/Sposito/cb6bd11de5567c92329fbb2c5b70f612

Sposito on 28 Jul 2017

Thanks @psullivan6 and @Sposito!

mikehenrty on 28 Jul 2017

why not take text from wikipedia? https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content#Text_content

fastrizwaan on 29 Jul 2017

👍1

why not take text from wikipedia? https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content#Text_content

mikehenrty on 29 Jul 2017

👍5 😄2

is that we get a lot of clever, interesting messages.

Can you give an example you particularly like? :-)

orschiro on 29 Jul 2017

Can you give an example you particularly like? :-)

https://twitter.com/mikehenrty/status/890350612923207680 :-D

mikehenrty on 30 Jul 2017

❤1

Nice one! :-)

orschiro on 30 Jul 2017

Woohoo, I'm famous on twitter! :D

Then, someone halfway around the world will read these messages out loud.. I think that's so cool :)

Yeah, I also find this funny. Although, from validating sentences I have the impression that most people don't get the joke when they're reading sentences like My review of the sun: one star.

Anyway, here is another batch of 400 lines for you to add. Some are just for fun, but for the majority I tried to include words that are not yet covered by the other sentences.
https://pastebin.com/Wqk4c24t

jf99 on 31 Jul 2017

❤2 😄2 👍1

Some help from me: https://gist.github.com/est31/b97b8cad99e5a6ae36d4712e81b5490f

est31 on 1 Aug 2017

❤1

Here's a simple Python program to generate random noun+adjective sentences: https://github.com/fastrizwaan/sentence
example sentences:

sea is serene
not only the air is sweet but also wet
if land is sweet then the music is wet
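The general idea can be sketched as template filling. This is a minimal illustration and not the linked program: the word lists and templates below are assumptions, chosen only to reproduce the shape of the example sentences.

```python
import random

# Hypothetical word lists; the linked program may use entirely different ones.
NOUNS = ["sea", "air", "land", "music", "fire", "water"]
ADJECTIVES = ["serene", "sweet", "wet", "cold", "dry"]

# Templates modelled on the three example sentences above.
TEMPLATES = [
    "{n1} is {a1}",
    "not only the {n1} is {a1} but also {a2}",
    "if {n1} is {a1} then the {n2} is {a2}",
]


def random_sentence(rng: random.Random) -> str:
    """Fill a randomly chosen template with distinct nouns and adjectives."""
    n1, n2 = rng.sample(NOUNS, 2)
    a1, a2 = rng.sample(ADJECTIVES, 2)
    # str.format ignores keyword arguments that a template does not use.
    return rng.choice(TEMPLATES).format(n1=n1, n2=n2, a1=a1, a2=a2)


rng = random.Random(0)  # seeded so repeated runs give the same sentences
for _ in range(3):
    print(random_sentence(rng))
```

Nothing in this sketch checks that a generated sentence makes sense; it only guarantees the grammatical frame.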

fastrizwaan on 1 Aug 2017

@fastrizwaan Thanks for the contribution.

While your code looks to be able to generate an almost infinite set of grammatical sentences, some of the sentences will be nonsensical

_If fire is cold then the water is dry_

The most famous example of such a sentence is from Noam Chomsky

_Colorless green ideas sleep furiously_

Which is grammatical, but makes no sense.

The problem with such grammatical but nonsensical sentences is that if a speech recognition engine is trained on such data, it will, through the language model it learns explicitly or implicitly, learn to expect such nonsensical sentences. This decreases the accuracy on the sentences that we want to actually work, the ones that make sense.

So generically we want to include only sentences that make sense in order to improve the accuracy of any speech recognition engine trained on the resulting data.

The problem then is that automatically generating sentences that are grammatical _and_ make sense is a bit harder.

kdavis-mozilla on 1 Aug 2017

👍3

@mikehenrty

In my experience, it is really hard to find interesting, readable sentences from Wikipedia.

Wikipedia is also home to gigabytes upon gigabytes of discussions about article development, internal politics, and much more. Much of it will be too wiki-meta to apply to the general population, but I think leveraging the extensive history of the Reference Desk, an off-topic Q/A forum, would be a very effective way of collecting example sentences.

Riamse on 2 Aug 2017

Here's some more for you:
https://pastebin.com/tDzX2uKw

rupshabagchi on 2 Aug 2017

Thanks @jf99 (again!), @est31 and @rupshabagchi! Your sentences are added!

mikehenrty on 3 Aug 2017

Good call out @Riamse. I took a look and there does seem to be quite a bit of data, but extracting usable sentences will take some work. PRs are definitely welcome, but in the meantime I'm going to continue collecting sentences from people's personal writings if I can. We're really getting great stuff!

mikehenrty on 3 Aug 2017

Did you have a look at tatoeba.org... it seems to have a big corpus of sentences (many with audio recordings).

dfordivam on 3 Aug 2017

👍1

I thought I could share this.

Sentence
Imagine a world in which every single human being can freely share in the sum of all knowledge. That’s our commitment.

Source
https://wikimediafoundation.org/wiki/Home

sivaraam on 6 Aug 2017

Some more sentences for you: https://pastebin.com/pWM1y61d

Hopefully some are useful! I agree to release my sentences to the public domain with a cc-0 license.

Ceejaydeepee on 10 Aug 2017

Do you also want sentences with specialty/jargon vocabulary at this point? In my case, it would be mostly IT/software/computer jargon, but I can also come up with some physics and math stuff, if that's helpful :)

nevik on 13 Aug 2017

https://pastebin.com/3VHV0cY1 here you have some poetry inspired by goethe (or excerpts of a poem i wrote a while ago)

ghost on 15 Aug 2017

Here comes a new load of sentences...
https://pastebin.com/gLTMV3kb

jf99 on 18 Aug 2017

https://gist.github.com/vkatsikaros/551c284c609a62a4fa8035b612e0838f
I agree to release my sentences to the public domain with a cc-0 license.

vkatsikaros on 10 Sep 2017

est31 on 16 Sep 2017

Had some spare time. I agree to release this under the CC-0 License.

thehowl on 4 Oct 2017

Thanks everyone! We'll be adding these soon!

Do you also want sentences with specialty/jargon vocabulary at this point? In my case, it would be mostly IT/software/computer jargon, but I can also come up with some physics and math stuff, if that's helpful :)

Yes, some jargon is good!

mikehenrty on 4 Oct 2017

Let me push 100 more sentences into the queue.
https://pastebin.com/pWtbqTW6

jf99 on 7 Oct 2017

Here are my sentences.

https://pastebin.com/e0GzTMNd

HRusnica on 9 Oct 2017

Here is my sentence:
https://pastebin.com/Ze45KJjC

tamarahills on 13 Oct 2017

Here are a few hundred sentences I wrote.
https://pastebin.com/WUtc8CBy

Some thoughts as I was writing them:

I chose names to use in the sentence fairly randomly from a list of common names. Would it be advantageous and worthwhile to be able to randomly swap out names with other names to be able collect more pronunciations of names and in different contexts?

Similarly, going off of https://discourse.mozilla.org/t/200-ways-to-hear-him-and-only-10-ways-to-hear-her/17301, I tried to use a fairly balanced mix of "him" and "her." I agree that in cases where there is not also a name in the sentence, the "his" and "her" could be randomly changed. Not sure how feasible these kinds of "data augmentation" things are, or whether they create a concern about duplicate sentences.

I tried to write some sentences with several variations of how numbers could be used. I wasn't sure whether I should have written out words for the numbers or not. Often writing out the numbers makes the sentences less realistic (no one would write out "Pi is equal to three point one four one five" for "Pi is equal to 3.1415"), but it does make the text more closely show what is actually spoken. There is some variation in valid ways to pronounce numbers (such as saying "1087 Oak Street" as "ten eighty-seven" or "one zero eight seven", or saying the year 2030 as "twenty thirty" or "two thousand thirty"). I am not sure whether that ambiguity and variation would be advantageous to have captured in the dataset, as an STT system would have to deal with it, or whether it would be better to spell out the desired pronunciation. Additionally, another data augmentation scheme could be to randomly change out numbers.
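To make the ambiguity concrete, here is a minimal sketch that lists a few candidate spoken forms for a four-digit number. The function names are hypothetical, the rules cover only the patterns mentioned above (paired digits, full cardinal, digit by digit), and real usage has more variants than this.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def cardinal(n: int) -> str:
    """Spell out an integer from 0 to 9999 as a plain cardinal number."""
    thousands, rest = divmod(n, 1000)
    hundreds, low = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(f"{ONES[thousands]} thousand")
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if low or not parts:
        parts.append(two_digits(low))
    return " ".join(parts)


def readings(number: str) -> list[str]:
    """Return candidate spoken forms for a four-digit string such as "1087"."""
    if len(number) != 4 or not number.isdigit():
        raise ValueError("expected exactly four digits")
    high, low = int(number[:2]), int(number[2:])
    forms = [cardinal(int(number)), " ".join(ONES[int(d)] for d in number)]
    # The paired reading ("ten eighty-seven") is only added when both halves
    # are at least ten; other cases need extra rules that this sketch omits.
    if high >= 10 and low >= 10:
        forms.insert(0, f"{two_digits(high)} {two_digits(low)}")
    return forms


for text in ("1087", "2030"):
    print(text, "->", readings(text))
```

A dataset could either record which reading the speaker actually used, or fix one reading in the prompt text; the sketch only enumerates the options.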

DNGros on 17 Oct 2017

Hi @DNGros, thanks for your sentences and for your thoughts here.

Swapping out names and pronoun usage is a good idea, but also a little hard to do (there are many contextual changes that make it not a straight swap). We might look into that in the future, but for now we will stick to just collecting sentences, and trying to keep the pronoun and name use balanced manually. I also agree this is not scalable in the long run.

As for the numbers, when we publish the dataset, we will turn all the numbers into the spelled-out versions. This is a requirement from our speech recognition team, and therefore probably a requirement of most users of this data.

mikehenrty on 18 Oct 2017

Hi there! I am willing to add more sentences to this project. I have a few questions.

Do you want any random sentences, or is there any priority based on tone or other factors?
Can I add local words that are famous globally? E.g. "accha" has been accepted by Oxford as meaning "okay", and "Jugaad" is also a famous word in India.

ghost on 27 Oct 2017

👍1

Hi, I've done some for you, all released under CC0. I've tried to include a wide variety of the most common verbs and nouns in English, as well as quite a few numbers. Hope that helps.

https://pastebin.com/ZpWty4LR

MichaelNMaggs on 30 Oct 2017

Hi @drashti4!

Random sentences are fine for now.
Sure, if a word is in the Oxford dictionary, consider it fair game :)

mikehenrty on 1 Nov 2017

👍1

Are you still in need of more? It doesn't seem that any of the sentences suggested over the last month have been used yet.

MichaelNMaggs on 3 Nov 2017

Yes, we still need more! I would also love some help over the next month getting the added sentences into PRs.

mikehenrty on 7 Nov 2017

Great! In that case, here are some more for you, again released under CC0. In this set I've tried to inject some of the commonest British English words and idioms, as suggested by a useful list on Wikipedia. The intent is to widen the range of international expressions, so that the corpus doesn't become overly focused on US-English.

https://pastebin.com/vCLjK3DQ

MichaelNMaggs on 7 Nov 2017

Hi there, I've done a small one for you:

https://pastebin.com/SUKYwB93

jelford on 17 Nov 2017

Here are 75 more, under cc-0.

https://pastebin.com/6WWa5A5S

itsravenous on 1 Dec 2017

Hi! Here are some very simple phrases from me: https://pastebin.com/bTKWpHiU

RLarissa on 6 Dec 2017

I tried to add a bunch of sentences with the goal of hitting some things I hadn't seen in the current set. A lot of it is trying to tease out words with obvious pronunciation differences in different accents, and some words which just have different pronunciations depending on who taught you.

I also added names, "English" words from other languages, words where it sounds like two words put together, and double contractions in common use.

I also tried to hit some random sampling of this set of splits and mergers in English.

https://gist.github.com/kscz/8f2581641caca265945786fe99274966

EDIT: Having spent some time speaking sentences, there are a few sentences in the database I thought were... off. Things like incomplete sentences which I would like to pull. Is there a complete list of the sentences in the database?

kscz on 1 Jan 2018

👍1

I may have missed something, but it looks as though the last commit for this issue was 10th August last year, more than five months ago. As it's discouraging for people to be invited to prepare and add sentence collections that are then not used, I suggest that this issue should be put on hold until a volunteer becomes available who can deal with it.

MichaelNMaggs on 15 Jan 2018

@MichaelNMaggs the last edit of the directory is from December 2017. Have a look yourself: https://github.com/mozilla/voice-web/commits/master/server/data

est31 on 15 Jan 2018

@est31 Thanks. I did see that, but none of those commits add the sentence collections from this issue #341, do they? Or if they do, I can't see any easy way to tell which of the collections suggested in this thread have been used so far. None of the commits reference #341 since 10 August.

MichaelNMaggs on 15 Jan 2018

50 sentences:
https://gist.github.com/jakethakur22/757500fdece319401be87c8e371d1501

I hope they’re useful! I agree to license them under whatever licenses you need to use them.

jakethakur on 18 Apr 2018

I fear that nobody has made use of any of the sentences posted to this issue since at least August 2017. I suggest it should be closed unless/until someone is able to pick up and use the existing submissions.

MichaelNMaggs on 18 Apr 2018

Thanks for the flag @MichaelNMaggs, indeed we are behind bringing these in.

However, we still plan to! We just need to review them (which due to our small team we have not had much time yet to do). That is why we are working on an event to get these sentences reviewed and into the project.

See here for more details:
https://github.com/mozilla/global-sprint/issues/259

mikehenrty on 19 Apr 2018

👍1

50 sentences, English, CC-0:
https://pastebin.com/bjiHHDiE

jdittrich on 4 May 2018

I am contributing 100 sentences in English, subject to CC-0 1.0 "Public Domain Dedication."
Link to the pastebin: MyPasteBin
I hope these are helpful for the project. Looking forward to contributing in other ways in the future!

Tignor82 on 20 May 2018

@Tignor82: Thanks for your contribution! The link does not work – could you correct it? (what one sees is correct, but what you get on click is not)

jdittrich on 20 May 2018

Sorry about that. I just fixed the link.

Tignor82 on 20 May 2018

👍1

@mikehenrty can/shall we post German sentences here, too, or do they have another place?

jdittrich on 7 Jun 2018

I am contributing 200 more English sentences, subject to CC-0 1.0 "Public Domain Dedication."
Link to my PasteBin: MyPasteBin

I hope these sentences are useful for the project!

Tignor82 on 14 Jun 2018

Here are 71 more French sentences : https://framabin.org/p/?762581a57bfd69d4#XYNbIcksgQj1kPpAVtYXfmBgef1voV+uW7EWFdcMyXY=

Booteille on 23 Jun 2018

First 58 sentences in Spanish

https://pastebin.com/0pLpAa8f

Great job!

renderpci on 23 Jun 2018

And 84 more sentences in French : https://framabin.org/p/?3e27b7ce0ef925da#OuAxLeQRtG9gCABKmQgL6OL+nyceq2Mj3M/hbDAinnU=

Booteille on 24 Jun 2018

And 139 sentences more in Spanish:

https://pastebin.com/UU6Tppqz

renderpci on 24 Jun 2018

@mikehenrty can/shall we post German sentences here, too, or do they have another place?

@jdittrich, all languages are welcome, but we will need some time to get these into the pipeline. I'm working on getting some help for this now.

mikehenrty on 25 Jun 2018

72 Spanish sentences more:

https://pastebin.com/4KTZyeh6

renderpci on 29 Jun 2018

I just uploaded ~~3000~~ 5000 new sentences in German to my fork at https://github.com/jf99/voice-web/tree/german/server/data/de (files batch01.txt to ~~batch12.txt~~ batch20.txt). I split the data into files of 250 lines, as this is probably more practical for proofreaders than a single file of ~~3000~~ 5000 lines.

Also note that lines 4823-10949 of 11k-german-sentences.txt are not yet proofread. Shall I extract them and make smaller batches out of it?

The new sentences can either be directly added to the corpus or you can create pull requests to my fork repository if you want to discuss something.
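The batching described above can be sketched as follows. This is a minimal illustration assuming one sentence per line and the `batchNN.txt` naming used in the fork; the function name `write_batches` is hypothetical and this is not the script that produced those files.

```python
import tempfile
from pathlib import Path


def write_batches(lines: list[str], out_dir: Path, batch_size: int = 250) -> list[Path]:
    """Write lines into numbered files of at most batch_size lines each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, start in enumerate(range(0, len(lines), batch_size), start=1):
        path = out_dir / f"batch{index:02d}.txt"  # batch01.txt, batch02.txt, ...
        chunk = lines[start:start + batch_size]
        path.write_text("\n".join(chunk) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


# Demo in a temporary directory with placeholder sentences.
with tempfile.TemporaryDirectory() as tmp:
    created = write_batches([f"Satz Nummer {i}." for i in range(5000)], Path(tmp))
    print(len(created), "files, last one is", created[-1].name)
```

With 5000 lines and a batch size of 250, this yields 20 files, matching the batch01.txt to batch20.txt range mentioned above.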

jf99 on 16 Jul 2018

Just as an official update here:

First: thank you all for your contributions to our text collection effort. Right now, we are working on building a better workflow for collecting and reviewing these sentences (outside of github). I imagine this is more than a month of work, but we will update this thread when we have that process in place. Stay tuned, and thank you for all your help so far!

mikehenrty on 17 Jul 2018

🎉4 👍1

To everyone who has participated in this issue, I want to point out that we are asking for your feedback on the sentence collection process, thanks!

https://discourse.mozilla.org/t/we-want-your-feedback-improving-the-sentence-collection/30358

nukeador on 23 Jul 2018

Nice discussion at discourse. Unfortunately, I cannot participate, because the only working way for me to log into discourse was my Github account. Now, Github requires 2FA for external logins and I'm not going to give them my phone number. NO WAY.

jf99 on 24 Jul 2018

@jf99 You should be able to login with the same email you used for your github account using the regular login, not the github one.

nukeador on 24 Jul 2018

@nukeador Nope. After clicking on the link in the email:

Sorry, you may not login using passwordless email. We require login to be performed using the most secure method available for your account, which is GitHub.

There's also a thread about that at discourse:
https://discourse.mozilla.org/t/2fa-cannot-be-mandatory/30352

Mandatory 2FA for a forum is ridiculous. Anyway, this is going offtopic. Djfe already said lots of things at discourse I'd agree to and I currently don't have much time either.

jf99 on 25 Jul 2018

Hi @jf99, if you reach out to me via email, I can probably help you. Feel free to contact me on [email protected]

Thank you!
Best regards,
Henrik

hmitsch on 25 Jul 2018

Hi @jf99 ,

Did you know that you can use github with 2FA without providing a phone number?
https://help.github.com/articles/securing-your-account-with-two-factor-authentication-2fa/

This article would seem to indicate that you have your choice of methods to 2FA.

SMS
TOTP ( An app like Google Authenticator or FreeOTP )
or even a U2F token like a Yubikey.

I know I personally use 2FA for every service on the web that I can because I consider it to be good personal operational security. A single account compromise can lead to an attacker pivoting into other parts of your life.

Further when you do use 2FA to sign into things like Github the session lifetimes are pretty long. So you don't have to pull out the 2FA device again for what often feels like months between sign ins.

It sounds like @hmitsch is doing an awesome job getting you some help to go back to email as a single factor and if that's what you decide works best for you that is great! Just wanted to highlight the ways you can keep 2FA+Github a little more useable.

Thanks for reading.

Your friendly neighborhood security engineer,

Andrew

andrewkrug on 2 Aug 2018

I am adding 100 sentences in Indonesian language here. I will add more there.

bagustris on 28 Aug 2018

I created 50 more sentences, I hope it helps:

https://pastebin.com/J5mke3fz

AcAntellAno on 1 Sep 2018

zh-hk
General: https://pastebin.com/fzzesRfB
God of Gamblers II (A zh-hk movie)(Rewritten for better pronunciation): https://pastebin.com/GAEDrjrY
Hope this helps.

YuetAu on 15 Sep 2018

50 Polish sentences:

https://pastebin.com/zJktPtau

plisieck on 16 Sep 2018

❤1 👍1

100 German sentences from me:

https://pastebin.com/rT5JtUUs

DonHege on 21 Sep 2018

@DonHege Thanks for your contribution! I have reviewed your sentences and made a pull request. Have a look! #1481

jf99 on 21 Sep 2018

61 English sentences:
https://pastebin.com/8xWgCaeJ

nsb-xps on 5 Oct 2018

I submitted two sets of (British English) sentences on 29 Oct and 7 Nov 2017. So far as I'm aware they have not been used. Were they unsuitable in some way?

https://pastebin.com/ZpWty4LR
https://pastebin.com/vCLjK3DQ

MichaelNMaggs on 7 Oct 2018

Hi @MichaelNMaggs. Thanks again for your continued help with this project.

Right now we are building a tool to collect public domain sentences for Common Voice. The discussion around that tool happened here:
https://discourse.mozilla.org/t/we-want-your-feedback-improving-the-sentence-collection/30358

Since that discussion, we have started to create a "Sentence Collection tool" (basically a website to submit and review sentences). Once that tool is in good enough shape, we are going to go back through this entire thread (and other places) and get all the sentences into that tool for review and eventually into Common Voice itself to be spoken. This process is taking some time, but we are fully committed to it.

Thanks for your patience!

mikehenrty on 8 Oct 2018

👍1

I want to contribute 53 sentences (Traditional Chinese)
https://pastebin.com/7JTZ1ncy

nixczhou on 18 Nov 2018

@Fatimuskii Thank you for your interest. Here are some guidelines regarding contributing to this project. You can put your questions here to get answers more quickly: https://discourse.mozilla.org/t/readme-how-to-see-my-language-on-common-voice/31530/10.

peiying2 on 28 Nov 2018

I think this and similar issues can be closed, since a tool and guidelines are being developed as discussed in https://discourse.mozilla.org/t/we-want-your-feedback-improving-the-sentence-collection/30358.

davidak on 4 Dec 2018

I contribute 50 sentences (Vietnamese sentence)
https://pastebin.com/Dmr2a3BP

train255 on 6 Dec 2018

https://pastebin.com/8wq3cSm1
More than 50 sentences in Spanish

Fatimuskii on 26 Dec 2018

Hello everyone and happy new year.

As we have been commenting over Common Voice discourse, we want to make sure sentences are properly reviewed and there are some automated quality checks so they are useful for the algorithms.

During the next 6 days we are doing some quality control on the new sentence collection tool, and we hope to have it in beta form by the middle of this month so all of you can submit and review sentences yourselves:

https://discourse.mozilla.org/t/sentence-collection-tool-development-topic/33390/5

(This means we won't be using PRs or this issue to collect sentences. Please keep them in any format you can copy and paste; as soon as the tool is ready, we'll inform everyone.)

Cheers.

nukeador on 2 Jan 2019

50 sentences more in Spanish
https://pastebin.com/rFuBFfJb

Fatimuskii on 3 Jan 2019

50 more sentences in Spanish
https://pastebin.com/wkRC2Xij

Fatimuskii on 4 Jan 2019

50 more sentences
https://pastebin.com/G2G0imMR

Fatimuskii on 4 Jan 2019

50 more in Spanish
https://pastebin.com/pYQpCFSH

Fatimuskii on 4 Jan 2019

50 more sentences in Spanish https://pastebin.com/Mxsf1CJH

Fatimuskii on 4 Jan 2019

Fatima, thanks for your efforts, please check my previous message.

Keep collecting these sentences and we will inform you once the sentence collector tool is ready :-)

nukeador on 4 Jan 2019

Hello everyone,

I'm super excited to announce that after a few months of intense work, today we launch the Sentence Collection Tool site for all Common Voice contributors. We are considering this a first beta version, but fully functional after some weeks of testing.

All sentences submitted, reviewed and validated using this tool will be incorporated into the main Common Voice site. We will point to this as the way to submit sentences to the project moving forward.

What is this tool?

This tool facilitates submitting, reviewing and validating sentences in different locales so they can be incorporated into the main Common Voice site, where people can read them and donate their voice.

Why this tool?

The previous process to gather sentences was a bit unstructured, with too many places to go and unclear guidelines. In order for sentences to be useful for the Deep Speech algorithm, there are certain "hard requirements" this tool enforces to avoid problems in the future.

We also aim to keep improving the tool to make the experience even easier for everyone!

How can I start using it?

Just go to the Sentence Collection tool site and start submitting and reviewing sentences in your locales. Make sure you check the How-to page to understand how to use the tool.

Where do I report issues or ideas?

Our github project page is the best place to report any issues with the site. If you want to discuss with the rest of the community an idea or new feature you can do that in our discourse.

How can I help with the development?

This tool is developed by the Common Voice volunteers. Anyone can be involved in the development; you just need to know React or Kinto and chime in on our github project to learn more.

If you are not technical, don't worry! We usually open conversations on discourse to give everyone the chance to influence the direction of the project.

Special thanks

I would like to extend a special recognition and thank you to some people who have been responsible for getting this tool launched.

@mhenretty for his idea and initial development
@MKohler for taking the technical lead as volunteer.
@Gweber for his support from the voice web side.
Deep Speech team for their guidance on validation (@josh_meyer, @kdavis)
Kinto team for their support optimizing the code (@leplatrem)
Every volunteer who was involved in the QA testing phase during the last weeks (you were really fundamental)
- @ftyers
- @mozillakab
- @gtimoshaz
- @Txopi
- @tauheedul
- @irvin
- @danielsjf
- @whehd16
- @dcela
- @freaktechnik