Common-voice: Write some nice, short sentences for people to read.

Created on 23 Jul 2017  ·  99 Comments  ·  Source: mozilla/common-voice

Due to bugs #23 and #319 in our current sentence collection, we are trying to diversify and refresh our sentences in #333. We would love your help! If you would like to contribute to Common Voice sentences with your own writings, please put your sentences (one per line) in a publicly linkable document (e.g. Pastebin), then add a comment to this bug with a link to those sentences.

Criteria:

  • please write them yourselves. don't copy and paste from somewhere else.
  • try to make the sentences conversational, i.e. easy to read out loud.
  • you must agree to releasing your sentences to the public domain with a cc-0 license.
  • more than 50 sentences per link, but less than 500 please.
  • be nice, don't use offensive language. we aren't collecting that kinda material.
  • i'll be reading each one, and i may remove some but i will let you know why.
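
A rough sketch of how a contributor might check a submission against the size criteria above before posting a link. The function name `check_submission` and its parameters are made up for illustration; nothing here is part of the Common Voice tooling.

```python
def check_submission(text, min_count=51, max_count=499):
    """Return (ok, sentences) for a one-sentence-per-line submission.

    The bounds encode "more than 50, but less than 500" from the criteria.
    This is an illustrative helper, not part of any Common Voice tool.
    """
    # Ignore blank lines and strip surrounding whitespace.
    sentences = [line.strip() for line in text.splitlines() if line.strip()]
    ok = min_count <= len(sentences) <= max_count
    return ok, sentences
```

It only counts lines; the other criteria (own writing, conversational tone, no offensive language) still need a human reader.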

Thanks!

help wanted

All 99 comments

Given that this is a specific bug, and is therefore addressable as a software item, I have elsewhere offered up already-public-domain resources as good sources for sentences:

https://en.wikipedia.org/wiki/Wikipedia:Public_domain_resources

As an open healthcare data person, I obviously would love the opportunity to inject your sentence data set with layman's healthcare terms. Assuming no one else is available to incorporate the various resources that I am discussing here, I might be able to devote some resources towards scraping some of the resources that have solid APIs, and running them through some per-source filters that would serve to ensure that they do not have unusual acronyms or other confusing industry jargon.

I would recommend that you develop a corpus that details the following information for each source of sentences:

  • The sentence itself
  • The url source for the sentence
  • A reference to the rule (or contribution agreement to the cc-0 license) that provides evidence that the sentence in question is available under the public domain.
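
One possible shape for such a corpus record, sketched in Python. The class name `SentenceRecord`, its field names, and the TSV layout are assumptions made for illustration; the thread does not define a format.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class SentenceRecord:
    """One corpus entry; the field names are hypothetical."""
    sentence: str          # the sentence itself
    source_url: str        # where the sentence came from
    license_evidence: str  # rule or CC-0 agreement showing it is public domain


def to_tsv(records):
    """Serialize records to tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(SentenceRecord)],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()
```

Keeping the license evidence next to each sentence means any single line can be audited without consulting a separate file.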

I think I could write some scripts that would generate a few million (at least) sentences that met some basic rules that I could demonstrate were public domain. Or someone else could if I fail to deliver (which I should warn is a frequent occurrence; this is only tangentially related to my day job, after all).

Let me know if this would be helpful.

Regards,
-FT

I'm back! Here's another hundred:

https://pastebin.com/1CU7GQYs

CC-0 or whatever license you need. There's a good few simple sentences.

Thanks @ftrotter, that is very helpful. Let's discuss that on Discourse, and see how many contributions we can get from this thread.

Thanks @akshit13 and @ajaydee, I added your contributions.

This is a randomized set of sentences from some of my older "computer vision" reviews. I waive all rights to the text. Use it any way you like.
https://pastebin.com/DTezZ7rA

However, I would urge you to use some large diverse contemporary corpus.

Thanks @orschiro and @michal-hradis!

However, I would urge you to use some large diverse contemporary corpus.

We are looking into that on the PR below. In the meantime, we are experimenting with gathering our corpus as we go.
https://github.com/mozilla/voice-web/pull/304

Here is another 300 sentences:
https://pastebin.com/Qf7Ykcbz

@jf99, some good ones in there.

The servers of the common voice project couldn't handle the heavy load.

We're working on it! :D

Some more for ya:
https://pastebin.com/tCWPrxZJ

Thanks @psullivan6 and @Sposito!

why not take text from wikipedia? https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content#Text_content

why not take text from wikipedia? https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content#Text_content

In my experience, it is really hard to find interesting, readable sentences from Wikipedia. The language there tends to be quite formal and sometimes awkward to speak. I built some tools to chop up text here, but never got around to adding wikipedia sentences to this project. That said, I'm always open to accepting contributions. 👍

The other thing I like about asking people for sentence donations (which is just an experiment for now), is that we get a lot of clever, interesting messages. Then, someone halfway around the world will read these messages out loud.. I think that's so cool :)

is that we get a lot of clever, interesting messages.

Can you give an example you particularly like? :-)

Can you give an example you particularly like? :-)

https://twitter.com/mikehenrty/status/890350612923207680 :-D

Nice one! :-)

Woohoo, I'm famous on twitter! :D

Then, someone halfway around the world will read these messages out loud.. I think that's so cool :)

Yeah, I also find this funny. Although, from validating sentences I have the impression that most people don't get the joke when they're reading sentences like "My review of the sun: one star."

Anyway, here is another batch of 400 lines for you to add. Some are just for fun, but for the majority I tried to include words that are not yet covered by the other sentences.
https://pastebin.com/Wqk4c24t

Here's a simple python program to generate random noun+adjective sentences https://github.com/fastrizwaan/sentence
example sentences:

sea is serene
not only the air is sweet but also wet
if land is sweet then the music is wet
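
For readers who want a feel for this kind of template-based generation, here is a minimal sketch. It is not the code from the linked repository; the word lists, templates, and the name `random_sentence` are assumptions chosen to mirror the example sentences above.

```python
import random

# Tiny illustrative word lists; a real generator would use larger ones.
NOUNS = ["sea", "air", "land", "music"]
ADJECTIVES = ["serene", "sweet", "wet"]

# Templates modelled on the three example sentences.
TEMPLATES = [
    "{n1} is {a1}",
    "not only the {n1} is {a1} but also {a2}",
    "if {n1} is {a1} then the {n2} is {a2}",
]


def random_sentence(rng):
    """Fill a randomly chosen template with distinct nouns and adjectives."""
    n1, n2 = rng.sample(NOUNS, 2)
    a1, a2 = rng.sample(ADJECTIVES, 2)
    template = rng.choice(TEMPLATES)
    # Unused keyword arguments are ignored by str.format.
    return template.format(n1=n1, n2=n2, a1=a1, a2=a2)
```

Calling `random_sentence(random.Random())` returns one sentence per call. Nothing in the sketch checks whether the noun and adjective make sense together.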

@fastrizwaan Thanks for the contribution.

While your code looks to be able to generate an almost infinite set of grammatical sentences, some of the sentences will be nonsensical:

_If fire is cold then the water is dry_

The most famous example of such a sentence is from Noam Chomsky

_Colorless green ideas sleep furiously_

Which is grammatical, but makes no sense.

The problem with such grammatical but nonsensical sentences is that if a speech recognition engine is trained on such data, it will, through the language model it learns explicitly or implicitly, learn to expect such nonsensical sentences. This decreases the accuracy on the sentences that we want to actually work, the ones that make sense.

So generally we want to include only sentences that make sense in order to improve the accuracy of any speech recognition engine trained on the resulting data.

The problem then becomes that automatically generating sentences that are grammatical _and_ make sense is a bit harder.

@mikehenrty

In my experience, it is really hard to find interesting, readable sentences from Wikipedia.

Wikipedia is also home to gigabytes upon gigabytes of discussions about article development, internal politics, and much more. Much of it will be too wiki-meta to apply to the general population, but I think leveraging the extensive history of the Reference Desk, an off-topic Q/A forum, would be a very effective way of collecting example sentences.

Here's some more for you:
https://pastebin.com/tDzX2uKw

Thanks @jf99 (again!), @est31 and @rupshabagchi! Your sentences are added!

Good call out @Riamse. I took a look and there does seem to be quite a bit of data, but extracting usable sentences will take some work. PRs are definitely welcome, but in the meantime I'm going to continue collecting sentences from people's personal writings if I can. We're really getting great stuff!

Did you have a look at tatoeba.org... it seems to have a big corpus of sentences (many with audio recordings).

I thought I could share this.

Sentence
Imagine a world in which every single human being can freely share in the sum of all knowledge. That’s our commitment.

Source
https://wikimediafoundation.org/wiki/Home

Some more sentences for you: https://pastebin.com/pWM1y61d

Hopefully some are useful! I agree to release my sentences to the public domain with a cc-0 license.

Do you also want sentences with specialty/jargon vocabulary at this point? In my case, it would be mostly IT/software/computer jargon, but I can also come up with some physics and math stuff, if that's helpful :)

https://pastebin.com/3VHV0cY1 here you have some poetry inspired by goethe (or excerpts of a poem i wrote a while ago)

Here comes a new load of sentences...
https://pastebin.com/gLTMV3kb

https://gist.github.com/vkatsikaros/551c284c609a62a4fa8035b612e0838f
I agree to release my sentences to the public domain with a cc-0 license.

Had some spare time. I agree to release this under the CC-0 License.

Thanks everyone! We'll be adding these soon!

Do you also want sentences with specialty/jargon vocabulary at this point? In my case, it would be mostly IT/software/computer jargon, but I can also come up with some physics and math stuff, if that's helpful :)

Yes, some jargon is good!

Let me push 100 more sentences into the queue.
https://pastebin.com/pWtbqTW6

Here are my sentences.

https://pastebin.com/e0GzTMNd

Here is my sentence:
https://pastebin.com/Ze45KJjC

Here are a few hundred sentences I wrote.
https://pastebin.com/WUtc8CBy

Some thoughts as I was writing them:

I chose names to use in the sentences fairly randomly from a list of common names. Would it be advantageous and worthwhile to be able to randomly swap out names with other names, to be able to collect more pronunciations of names and in different contexts?

Similarly, going off of https://discourse.mozilla.org/t/200-ways-to-hear-him-and-only-10-ways-to-hear-her/17301, I tried to use a fairly balanced mix of "him" and "her." I agree that in cases where there is not also a name in the sentence, the "his" and "her" could be randomly changed. Not sure how feasible these kinds of "data augmentation" things are or whether they create concerns about duplicate sentences.

I tried to write some sentences with several variations of how numbers could be used. I wasn't sure whether I should have written out words for the numbers or not. Often writing out the numbers makes the sentences less realistic (no one would write "Pi is equal to three point one four one five" for "Pi is equal to 3.1415"), but it does more closely show the spoken sounds. There is some variation in valid ways to pronounce numbers (such as saying "1087 Oak Street" as "ten eighty-seven" or "one zero eight seven", or saying the year 2030 as "twenty thirty" or "two thousand thirty"). I am not sure whether it would be advantageous to capture that ambiguity and variation in the dataset, since an STT system would have to deal with it, or whether it would be better to spell out the desired pronunciation. Additionally, another data augmentation scheme could be to randomly change out numbers.

Hi @DNGros, thanks for your sentences and for your thoughts here.

Swapping out names and pronoun usage is a good idea, but also a little hard to do (there are many contextual changes that make it not a straight swap). We might look into that in the future, but for now we will stick to just collecting sentences, and trying to keep the pronoun and name use balanced manually. I also agree this is not scalable in the long run.
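
To make the "not a straight swap" point concrete, here is a naive name-swapping sketch. The name list and the function `swap_names` are hypothetical; this is not something the project does.

```python
import random
import re

# Illustrative name list; any list of first names would do.
NAMES = ["Alice", "Maria", "James", "Omar"]


def swap_names(sentence, rng, names=NAMES):
    """Replace each known name with a randomly chosen one.

    Repeated mentions of the same name get the same replacement.
    Pronouns are left untouched, which is exactly the weakness:
    "Alice lost her keys" can become "James lost her keys".
    """
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    mapping = {}

    def replace(match):
        original = match.group(1)
        if original not in mapping:
            mapping[original] = rng.choice(names)
        return mapping[original]

    return pattern.sub(replace, sentence)
```

The sketch can also map two different names onto the same replacement, which is one more way augmented sentences can drift from natural usage.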

As for the numbers, when we publish the dataset, we will turn all the numbers into the spelled-out versions. This is a requirement from our speech recognition team, and therefore probably a requirement of most users of this data.
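
A minimal sketch of what spelling out numbers could look like. It is not the project's actual normalization code: it handles only whole numbers from 0 to 9999, leaves decimals such as 3.1415 untouched, and always picks a single reading ("two thousand thirty", never "twenty thirty"). All names below are made up for illustration.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Whole numbers of one to four digits that are not part of a decimal,
# a comma-grouped number, or a longer run of digits.
WHOLE_NUMBER = re.compile(r"(?<![\d.,])\d{1,4}(?![\d]|[.,]\d)")


def spell_below_hundred(n):
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def spell_number(n):
    """Spell out an integer from 0 to 9999 using one fixed reading."""
    if not 0 <= n <= 9999:
        raise ValueError("only 0..9999 is supported in this sketch")
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(f"{ONES[thousands]} thousand")
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest or not parts:
        parts.append(spell_below_hundred(rest))
    return " ".join(parts)


def spell_numbers_in(sentence):
    """Replace supported whole numbers in a sentence with words."""
    return WHOLE_NUMBER.sub(lambda m: spell_number(int(m.group())), sentence)
```

Because one fixed reading is chosen, a transcript produced this way will not match a speaker who says "ten eighty-seven" for 1087, which is the ambiguity raised earlier in the thread.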

Hi there! I am willing to add more sentences to this project. I have a few questions.

  • Do you want any random sentences, or is there any priority based on tone or other factors?
  • Can I add local words that are famous globally? E.g. "accha" is accepted by Oxford as meaning okay, and "Jugaad" is also a famous word in India.

Hi, I've done some for you, all released under CC0. I've tried to include a wide variety of the most common verbs and nouns in English, as well as quite a few numbers. Hope that helps.

https://pastebin.com/ZpWty4LR

Hi @drashti4!

  • Random sentences are fine for now.
  • Sure, if a word is in the Oxford dictionary, consider it fair game :)

Are you still in need of more? It doesn't seem that any of the sentences suggested over the last month have been used yet.

Yes, we still need more! I would also love some help getting the sentences added in the next month into PRs.

Great! In that case, here are some more for you, again released under CC0. In this set I've tried to inject some of the commonest British English words and idioms, as suggested by a useful list on Wikipedia. The intent is to widen the range of international expressions, so that the corpus doesn't become overly focused on US-English.

https://pastebin.com/vCLjK3DQ

Hi there, I've done a small one for you:

https://pastebin.com/SUKYwB93

Here are 75 more, under cc-0.

https://pastebin.com/6WWa5A5S

Hi! Here are some very simple phrases from me: https://pastebin.com/bTKWpHiU

I tried to add a bunch of sentences with the goal of hitting some things I hadn't seen in the current set. A lot of it is trying to tease out words with obvious pronunciation differences in different accents, and some words which just have different pronunciations depending on who taught you.

I also added names, "English" words from other languages, words where it sounds like two words put together, and double contractions in common use.

I also tried to hit some random sampling of this set of splits and mergers in English.

https://gist.github.com/kscz/8f2581641caca265945786fe99274966

EDIT: Having spent some time speaking sentences, there are a few sentences in the database I thought were... off. Things like incomplete sentences which I would like to pull. Is there a complete list of the sentences in the database?

I may have missed something, but it looks as though the last commit for this issue was 10th August last year, more than five months ago. As it's discouraging for people to be invited to prepare and add sentence collections that are then not used, I suggest that this issue should be put on hold until a volunteer becomes available who can deal with it.

@MichaelNMaggs the last edit of the directory is from December 2017. Have a look yourself: https://github.com/mozilla/voice-web/commits/master/server/data

@est31 Thanks. I did see that, but none of those commits add the sentence collections from this issue #341, do they? Or if they do, I can't see any easy way to tell which of the collections suggested in this thread have been used so far. None of the commits reference #341 since 10 August.

50 sentences:
https://gist.github.com/jakethakur22/757500fdece319401be87c8e371d1501

I hope they’re useful! I agree to license them under whatever licenses you need to use them.

I fear that nobody has made use of any of the sentences posted to this issue since at least August 2017. I suggest it should be closed unless/until someone is able to pick up and use the existing submissions.

Thanks for the flag @MichaelNMaggs, indeed we are behind on bringing these in.

However, we still plan to! We just need to review them (which, due to our small team, we have not yet had much time to do). That is why we are working on an event to get these sentences reviewed and into the project.

See here for more details:
https://github.com/mozilla/global-sprint/issues/259

50 sentences, English, CC-0:
https://pastebin.com/bjiHHDiE

I am contributing 100 sentences in English, subject to CC-0 1.0 "Public Domain Dedication."
Link to the pastebin: MyPasteBin
I hope these are helpful for the project. Looking forward to contributing in other ways in the future!

@Tignor82: Thanks for your contribution! The link does not work – could you correct it? (what one sees is correct, but what you get on click is not)

Sorry about that. I just fixed the link.

@mikehenrty can/shall we post German sentences here, too, or do they have another place?

I am contributing 200 more English sentences, subject to CC-0 1.0 "Public Domain Dedication."
Link to my PasteBin: MyPasteBin

I hope these sentences are useful for the project!

Here are 71 more French sentences : https://framabin.org/p/?762581a57bfd69d4#XYNbIcksgQj1kPpAVtYXfmBgef1voV+uW7EWFdcMyXY=

First 58 sentences in Spanish

https://pastebin.com/0pLpAa8f

Great job!

And 84 more sentences in French : https://framabin.org/p/?3e27b7ce0ef925da#OuAxLeQRtG9gCABKmQgL6OL+nyceq2Mj3M/hbDAinnU=

And 139 sentences more in Spanish:

https://pastebin.com/UU6Tppqz

@mikehenrty can/shall we post German sentences here, too, or do they have another place?

@jdittrich, all languages are welcome, but we will need some time to get these into the pipeline. I'm working on getting some help for this now.

72 Spanish sentences more:

https://pastebin.com/4KTZyeh6

I just uploaded 5000 new sentences in German to my fork at https://github.com/jf99/voice-web/tree/german/server/data/de (files batch01.txt to batch20.txt). I split the data into files of 250 lines, as this is probably more practical for proofreaders than a single file of 5000 lines.

Also note that lines 4823-10949 of 11k-german-sentences.txt are not yet proofread. Shall I extract them and make smaller batches out of it?

The new sentences can either be directly added to the corpus or you can create pull requests to my fork repository if you want to discuss something.
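
The 250-line batching described above can be sketched roughly as follows. The helper names are hypothetical and this is not the script actually used for the fork; only the `batchNN.txt` naming pattern comes from the comment.

```python
def batch_lines(lines, size=250):
    """Split a list of sentences into consecutive batches of `size` lines."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def batch_filenames(count):
    """Return zero-padded names batch01.txt, batch02.txt, ..."""
    return [f"batch{i:02d}.txt" for i in range(1, count + 1)]


def write_batches(lines, size=250):
    """Write each batch to its own file in the current directory."""
    batches = batch_lines(lines, size)
    for name, batch in zip(batch_filenames(len(batches)), batches):
        with open(name, "w", encoding="utf-8") as handle:
            handle.write("\n".join(batch) + "\n")
```

With 5000 input lines and the default size, this yields 20 files of 250 lines each.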

Just as an official update here:

First: thank you all for your contributions to our text collection effort. Right now, we are working on building a better workflow for collecting and reviewing these sentences (outside of github). I imagine this is more than a month of work, but we will update this thread when we have that process in place. Stay tuned, and thank you for all your help so far!

To everyone who has participated in this issue, I want to point out that we are asking for your feedback on the sentence collection process, thanks!

https://discourse.mozilla.org/t/we-want-your-feedback-improving-the-sentence-collection/30358

Nice discussion at discourse. Unfortunately, I cannot participate, because the only working way for me to log into discourse was my Github account. Now, Github requires 2FA for external logins and I'm not going to give them my phone number. NO WAY.

@jf99 You should be able to log in with the same email you used for your github account using the regular login, not the github one.

@nukeador Nope. After clicking on the link in the email:

Sorry, you may not login using passwordless email. We require login to be performed using the most secure method available for your account, which is GitHub.

There's also a thread about that at discourse:
https://discourse.mozilla.org/t/2fa-cannot-be-mandatory/30352

Mandatory 2FA for a forum is ridiculous. Anyway, this is going offtopic. Djfe already said lots of things at discourse I'd agree to and I currently don't have much time either.

Hi @jf99, if you reach out to me via email, I can probably help you. Feel free to contact me on [email protected]

Thank you!
Best regards,
Henrik

Hi @jf99 ,

Did you know that you can use github with 2FA without providing a phone number?
https://help.github.com/articles/securing-your-account-with-two-factor-authentication-2fa/

This article would seem to indicate that you have your choice of methods to 2FA.

  • SMS
  • TOTP ( An app like Google Authenticator or FreeOTP )
  • or even a U2F token like a Yubikey.

I know I personally use 2FA for every service on the web that I can because I consider it to be good personal operational security. A single account compromise can lead to an attacker pivoting into other parts of your life.

Further when you do use 2FA to sign into things like Github the session lifetimes are pretty long. So you don't have to pull out the 2FA device again for what often feels like months between sign ins.

It sounds like @hmitsch is doing an awesome job getting you some help to go back to email as a single factor and if that's what you decide works best for you that is great! Just wanted to highlight the ways you can keep 2FA+Github a little more useable.

Thanks for reading.

Your friendly neighborhood security engineer,

Andrew

I am adding 100 sentences in Indonesian language here. I will add more there.

I created 50 more sentences, I hope it helps:

https://pastebin.com/J5mke3fz

zh-hk
General: https://pastebin.com/fzzesRfB
God of Gamblers II (A zh-hk movie)(Rewritten for better pronunciation): https://pastebin.com/GAEDrjrY
Hope this helps.

50 Polish sentences:

https://pastebin.com/zJktPtau

100 German sentences from me:

https://pastebin.com/rT5JtUUs

@DonHege Thanks for your contribution! I have reviewed your sentences and made a pull request. Have a look! #1481

61 English sentences:
https://pastebin.com/8xWgCaeJ

I submitted two sets of (British English) sentences on 29 Oct and 7 Nov 2017. So far as I'm aware they have not been used. Were they unsuitable in some way?

https://pastebin.com/ZpWty4LR
https://pastebin.com/vCLjK3DQ

Hi @MichaelNMaggs. Thanks again for your continued help with this project.

Right now we are building a tool to collect public domain sentences for Common Voice. The discussion around that tool happened here:
https://discourse.mozilla.org/t/we-want-your-feedback-improving-the-sentence-collection/30358

Since that discussion, we have started to create a "Sentence Collection tool" (basically a website to submit and review sentences). Once that tool is in good enough shape, we are going to go back through this entire thread (and other places) and get all the sentences into that tool for review and eventually into Common Voice itself to be spoken. This process is taking some time, but we are fully committed to it.

Thanks for your patience!

I want to contribute 53 sentences (Traditional Chinese)
https://pastebin.com/7JTZ1ncy

@Fatimuskii Thank you for your interest. Here are some guidelines regarding contributing to this project. You can put your questions here to get answers more quickly: https://discourse.mozilla.org/t/readme-how-to-see-my-language-on-common-voice/31530/10.

I think this and similar issues can be closed since a tool and guidelines are being developed, as discussed in https://discourse.mozilla.org/t/we-want-your-feedback-improving-the-sentence-collection/30358.

I contribute 50 sentences (Vietnamese sentence)
https://pastebin.com/Dmr2a3BP

https://pastebin.com/8wq3cSm1
More than 50 sentences in Spanish

Hello everyone and happy new year.

As we have been commenting on the Common Voice Discourse, we want to make sure sentences are properly reviewed and there are some automated quality checks so they are useful for the algorithms.

During the next 6 days we are doing some quality control on the new sentence collection tool and we hope to have it in beta form by the middle of this month so all of you can submit and review sentences yourselves:

https://discourse.mozilla.org/t/sentence-collection-tool-development-topic/33390/5

(This means we won't be using PRs or this issue to collect sentences. Please keep them in any format you can copy and paste; as soon as the tool is ready we'll inform everyone.)

Cheers.

50 sentences more in Spanish
https://pastebin.com/rFuBFfJb

50 more sentences in Spanish
https://pastebin.com/wkRC2Xij

50 more sentences
https://pastebin.com/G2G0imMR

50 more in Spanish
https://pastebin.com/pYQpCFSH

50 more sentences in Spanish https://pastebin.com/Mxsf1CJH

Fatima, thanks for your efforts, please check my previous message.

Keep collecting these sentences and we will inform you once the sentence collector tool is ready :-)

Hello everyone,

I'm super excited to announce that after a few months of intense work, today we launch the Sentence Collection Tool site for all Common Voice contributors. We are considering this a first beta version, but fully functional after some weeks of testing.

All sentences submitted, reviewed and validated using this tool will be incorporated into the main Common Voice site. We will point to this as the way to submit sentences to the project moving forward.

What is this tool?

This tool facilitates the task of submitting, reviewing and validating sentences in different locales so they can be incorporated into the main Common Voice site, where people can read them and donate their voice.

Why this tool?

The previous process to gather sentences was a bit unstructured, with too many places to go and unclear guidelines. In order for sentences to be useful for the Deep Speech algorithm, there are certain "hard requirements" this tool enforces to avoid problems in the future.

We also aim to keep improving the tool to make the experience even easier for everyone!

How can I start using it?

Just go to the Sentence Collection tool site and start submitting and reviewing sentences in your locales. Make sure you check the How-to page to understand how to use the tool.

Where do I report issues or ideas?

Our github project page is the best place to report any issues with the site. If you want to discuss with the rest of the community an idea or new feature you can do that in our discourse.

How can I help with the development?

This tool is developed by the Common Voice volunteers. Anyone can be involved in the development; you just need to know React or Kinto. Chime in on our github project to learn more.

If you are not technical, don't worry! We usually open conversations on discourse to give everyone the chance to influence the direction of the project.

Special thanks

I would like to extend a special recognition and thank you to some people who made the launch of this tool possible.

  • @mhenretty for his idea and initial development
  • @MKohler for taking the technical lead as volunteer.
  • @Gweber for his support from the voice web side.
  • Deep Speech team for their guidance on validation (@josh_meyer, @kdavis)
  • Kinto team for their support optimizing the code (@leplatrem)
  • Every volunteer who was involved in the QA testing phase during the last weeks (you were really fundamental)

    • @ftyers
    • @mozillakab
    • @gtimoshaz
    • @Txopi
    • @tauheedul
    • @irvin
    • @danielsjf
    • @whehd16
    • @dcela
    • @freaktechnik

Thank you everyone!
