Common-voice: We need a larger corpus of text input

Created on 11 May 2017  ·  31 Comments  ·  Source: mozilla/common-voice

Right now, we only have 3K sentences or so.

Three possible new sources are:

  1. Wikimedia text
  2. BYU Corpus
  3. Leipzig Corpora
Enhancement

All 31 comments

Can we get a measure of success here? How big do we need?

This? http://wortschatz.uni-leipzig.de/en/download/

Added to list.

Can we get a measure of success here? How big do we need?

Right, good question. From what I understand from @kdavis-mozilla, it's not super useful to have a lot of different people repeating the same phrases over and over (although a few repeats is ok). We can put the number of required sentences at around the expected users. One possible projection is:
```
100K users = 100K sentences
```

Shouldn't we work back from number of hours?

10K hours = 600K mins = 12M sentences (@3secs per sentence)

10K hours = 600K mins = 12M sentences (@3secs per sentence)

Yes, we will eventually need that to fulfill our goal, but this bug is about the June activation, so I'm not setting my sights quite so high yet.
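
Working back from hours is simple arithmetic, and a short Python sketch makes the dependence on clip length explicit. It assumes every sentence is recorded exactly once; the function name `sentences_needed` is hypothetical, not part of any Common Voice tooling. At 3 seconds per sentence, 10K hours works out to 12M sentences, and longer clips lower the count (9M at 4 seconds).

```python
def sentences_needed(hours: float, seconds_per_sentence: float) -> int:
    """Unique sentences needed to fill `hours` of audio, one recording each."""
    total_seconds = hours * 3600
    return round(total_seconds / seconds_per_sentence)


# Compare a few assumed average clip lengths for a 10K-hour target.
for secs in (3, 4, 5):
    print(f"{secs}s per sentence -> {sentences_needed(10_000, secs):,} sentences")
```

If a few repeats per sentence are acceptable, as suggested earlier in the thread, divide the result by the number of repeats.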

One must also juggle the extra ball of legality.

For example the Leipzig Corpora terms of use state

Any data provided by Projekt Deutscher Wortschatz are subject to copyright. Permission for use is granted free of charge solely for non-commercial personal and scientific purposes

and further

...any commercial use of the data obtained is forbidden without explicit written permission by the copyright owner

So on our time scale Leipzig is a no. Similarly, the BYU Corpus is targeted at low-usage academics:

We are committed to keeping the BYU corpora free -- for those universities that have light to moderate use, and which cannot afford a license. As a result, there is no cost to use the corpora, as long as your class or department has less than 250 queries each day.

Wikipedia seems the best bet, but then one has to worry about copyright[1] there too:

You are free to:
• Share and Reuse our articles and other media under free and open licenses.
• ..

Under the following conditions:
• Lawful Behavior – You do not violate copyright or other laws.
• ..

I think this might be a place to ask for Brian's help in interpreting these terms. Let's add that to our next agenda with him -- maybe @mikehenrty you can ping him with an email in advance?

@geroter I think he can help with other ToS, but I think these are pretty clear in forbidding our use, and I don't think we have time to parse subtleties.

The easiest way out is to simply use text for which the copyright has expired in all countries.

Gutenberg[1] houses texts for which the copyright has expired in the US[2], a good starting point.

For example, the Gutenberg[3] license, in its non-normative text, says the following about texts for which the US copyright has expired:

If you strip the Project Gutenberg license and all references to Project Gutenberg from the text, you are left with a text unprotected by U.S. intellectual property law. You can do anything you want with that text in the United States and most of the rest of the world.

This is the kind of clarity we need.

We don't have time to parse Wikipedia into copyrighted and non-copyrighted texts, a task that would be hard even if we did have the time.

[1] https://www.gutenberg.org/
[2] https://www.gutenberg.org/wiki/Gutenberg:Terms_of_Use#Book_Copyright
[3] https://www.gutenberg.org/wiki/Gutenberg:The_Project_Gutenberg_License

Ya, I meant stuff like Wikipedia or others that might be murkier.

I also do wonder if using copyrighted info in a different context (e.g. outside of a full text) is a problem. For example, I can quote paragraphs of a novel on my own commercial website without it being copyright infringement. Google does searches on copyrighted materials and makes money off ads on the pages of those searches. So, I'd like Brian to walk us through the details of fair use in this case.

For the moment, I've written a script to extract sentences from Project Gutenberg and curate them based on length and a reading-complexity metric. The current sentence set is from The War of the Worlds.

https://github.com/mozilla/voice-web/blob/master/tools/gen.js

Where'd the reading complexity metric come from? Has a non-obvious algorithm.

https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests

I didn't actually test to see how well it works.
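
To make the curation step concrete, here is a minimal Python sketch of filtering sentences by length and Flesch–Kincaid grade level. It is not a port of `gen.js`: the syllable counter is a rough vowel-group heuristic, and the thresholds (`min_words`, `max_words`, `max_grade`) and all function names are assumptions chosen for illustration. The grade is computed per sentence, so the words-per-sentence term is just the word count.

```python
import re

WORD = re.compile(r"[A-Za-z']+")


def count_syllables(word: str) -> int:
    """Rough heuristic: count runs of vowels (not a dictionary lookup)."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Treat a trailing "e" as silent when another vowel group exists.
    if lowered.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)


def fk_grade(sentence: str) -> float:
    """Flesch-Kincaid grade level of a single sentence."""
    words = WORD.findall(sentence)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) + 11.8 * (syllables / len(words)) - 15.59


def split_sentences(text: str) -> list:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def curate(text: str, min_words: int = 3, max_words: int = 14,
           max_grade: float = 8.0) -> list:
    """Keep sentences within the word-count range and below max_grade."""
    kept = []
    for sentence in split_sentences(text):
        n_words = len(WORD.findall(sentence))
        if min_words <= n_words <= max_words and fk_grade(sentence) <= max_grade:
            kept.append(sentence)
    return kept
```

A real pipeline would also need to strip the Project Gutenberg header and license text before splitting, and the naive splitter will mishandle abbreviations such as "Mr." or "M. A.".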

Hello, how about adding some example sentences from Wiktionary and the GNU Collaborative International Dictionary of English? Both contain many short sentences, such as these from the GNU Collaborative International Dictionary of English (from the definition of "master"):

Little masters, certain German engravers of the 16th century, so called from the extreme smallness of their prints.

Master in chancery, an officer of courts of equity, who acts as an assistant to the chancellor or judge, by inquiring into various matters referred to him, and reporting thereon to the court.

Master of arts, one who takes the second degree at a university; also, the degree or title itself, indicated by the abbreviation M. A., or A. M.

Here are more example sentences, from Wiktionary (from the definition of "whether"):

He chose the correct answer, but I don't know whether it was by luck or by skill.

Do you know whether he's coming?

He's coming, whether you like it or not.

The FAQ says that this corpus will be used to train voice recognition. One thing that I frequently say into voice recognition tools, but haven't seen in any of the example sentences, is numbers. It might be worth prioritizing the addition of sentences with spoken numbers in them so that the corpus is more useful to real-world AI.

Indeed, you need more sources. The current sentences all sound like "Arabs meet in Englishman in the desert" and could all belong to one fairy tale or maybe a fable (considering the animals, which often act like humans).

Consider looking at CC0 (or, if allowed, CC-BY or CC-BY-SA) blogs, like http://dougbelshaw.com/blog/, https://people.gnome.org/~michael/, and http://blog.ninapaley.com/.

Unfortunately, it's hard to find such blogs, so perhaps also ask people to submit CC0-licensed blogs (or license their blogs with this license).

Maybe you could get hold of:
http://catalog.elra.info/product_info.php?products_id=1032
http://catalog.elra.info/product_info.php?products_id=1033

Both datasets have been used by RASR in the past

And ELDA has already been reaching out to you ;)
Discourse thread

Hi,
I left this comment on your discussion board, but I thought it might be helpful here as well.

Hello. Thank you for an amazingly simple implementation of a wonderful idea. Rather than randomly including writing, you might consider using some already qualified public domain resources. There is a list of such resources here: https://en.wikipedia.org/wiki/Wikipedia:Public_domain_resources

Of course, while you could consider any resource from that page, I would ask that you specifically consider including healthcare text that uses simple medical terminology. Not the kind of things that doctors say about healthcare (that is too technical and specific), but the types of things that patients might like to read and/or discuss in everyday terms.

An amazing resource for this that is written in lay terms is the Medline plus website. Not everything on Medline is public domain, but they specify what is covered and what is not here:
https://medlineplus.gov/copyright.html

Note that the Medline encyclopedia is licensed content and is therefore not public domain. But the Health Topics are public domain:
https://medlineplus.gov/acousticneuroma.html
As are the FAQ answers
https://medlineplus.gov/faq/disease.html
And the medline plus magazine
https://medlineplus.gov/magazine/issues/summer17/articles/summer17pg13-14.html

Obviously, as a healthcare data journalist I have an ax to grind here, but there are a huge number of English sentences here that are not medically contextual. For instance, the sentence "The people who write the materials are the ones who decide if they are easy to read." is found on one of the FAQ pages. Moreover, while the terms in Medline are intended to be "layman's terms", they include words like "Alzheimer", which are common enough but will likely have huge pronunciation differences.

I should note that the sections on women's health topics in Medline are likely to include more sentences including female pronouns. I will make that comment on the other github page as well.

Given that you are obviously also interested in resources that are not medical, I would also suggest the Federal Register, which is also without copyright.
example text:
https://www.gpo.gov/fdsys/pkg/FR-2015-01-02/html/2014-30754.htm

It should be relatively simple to run a script which removes all sentences that include the gobbledygook of the internal reference system, as well as acronyms (NIST, NASA, etc.). Once that is done, this would be a huge corpus of sentences that should be composed of relatively clear text. If you wanted to ensure that the sentences were even more "common language-full", you might simply exclude everything except the contents of the executive summaries of the articles, which are intended to be relatively jargon-free.
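
As a rough illustration of such a filter, the Python sketch below drops any sentence containing an all-caps acronym or a legal cross-reference marker. The two patterns are assumptions about what the "internal reference system" looks like (section signs, "U.S.C." citations, and paragraph markers such as "(a)(1)"); they were not derived from a survey of Federal Register text and would need tuning against real documents.

```python
import re

# Two or more consecutive capitals, e.g. NIST, NASA, CFR, FR.
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")
# Assumed markers of legal cross-references: "§", "U.S.C.", "(a)(1)".
CITATION = re.compile(r"§|\bU\.S\.C\.|\(\w\)\(\w+\)")


def is_clean(sentence: str) -> bool:
    """True if the sentence has no acronym and no citation marker."""
    return not (ACRONYM.search(sentence) or CITATION.search(sentence))


def filter_sentences(sentences: list) -> list:
    """Keep only the sentences that pass is_clean."""
    return [s for s in sentences if is_clean(s)]
```

This errs on the side of discarding: any fully capitalized word, including emphasis or headings, causes the sentence to be dropped, which seems acceptable when the source is large.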

If that is still not enough, you should consider including the text of comments made to various regulations on regulations.gov. Most people are unaware that the comments that they make on regulations themselves become public domain. See here: https://www.regulations.gov/userNotice

This data is available via an API, and here is an example:
https://www.regulations.gov/document?D=VA-2016-VHA-0011-184061

HTH,
-FT

From @rugk

Maybe you can use http://shtooka.net/ as another source.

However, http://shtooka.net/ already contains the voice examples, so it does not really belong to this issue…

Yeah, but you can integrate the text with the corresponding voice samples and collect even more voice samples for the same text with this service; that's why it belongs to this particular issue/topic.

Mozilla will probably integrate our ideas from this issue from time to time, so it's helpful to have all suggestions in one place.

Regarding the use of Wikipedia, I'd point out that there exist the Spoken Wikipedia Corpora: http://nats.gitlab.io/swc/ (CC BY-SA 4.0 license):

In an ideal speech corpus, most utterances would be unique. Right now the small text size is a disaster for benchmarking the Common Voice corpus v1, as the same sentences used for training also appear in the test and dev portions of the corpus. This means the "best" performing models are the ones that overfit the most to the 7000 sentences used in the training data.

See also the discussion here: https://github.com/kaldi-asr/kaldi/issues/2141 (Commonvoice results misleading, complete overlap of train/dev/test sentences) and here https://discourse.mozilla.org/t/common-voice-v1-corpus-design-problems-overlapping-train-test-dev-sentences/24288 (Common Voice v1 corpus design problems, overlapping train/test/dev sentences)
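
One way to avoid this kind of overlap is to assign whole sentences, not individual clips, to a split, so every recording of a given sentence lands in the same partition. The Python sketch below does this with a deterministic hash of the normalized sentence text. It is an illustration only, not how the Common Voice release was built; the function names and the 10%/10% dev/test proportions are assumptions.

```python
import hashlib


def split_for(sentence: str, dev_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map a sentence to 'train', 'dev' or 'test'."""
    # Normalize case and whitespace so trivial variants share a split.
    normalized = " ".join(sentence.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"


def split_clips(clips) -> dict:
    """Group (clip_id, sentence) pairs so no sentence spans two splits."""
    splits = {"train": [], "dev": [], "test": []}
    for clip_id, sentence in clips:
        splits[split_for(sentence)].append((clip_id, sentence))
    return splits
```

With only a few thousand distinct sentences the dev and test sets stay small in vocabulary terms, so this fixes the leakage but not the underlying shortage of text.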

The 1 billion word corpus from Google is Apache-licensed; I'm not sure whether that license would fit.
https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark

Transcriptions of the European Parliament are afaik in the public domain (and also cover many more languages than English): http://www.statmt.org/europarl/

Would be nice if some staff from the project looked at discourse: https://discourse.mozilla.org/t/help-wanted-write-some-nice-short-sentences-for-people-to-read/17317/33
There are multiple other speech corpora that have been donated and are still not used.

Just a thought: maybe we could use the sentences from pontoon.mozilla.org that fit the required criteria as a source of text. It is pretty massive, and I'm pretty sure the licensing would allow it.

yeah the engagement strings seem to be suitable :)

I found another possible source:
the ccc subtitles (chaos communication congress and related events)
http://mirror.selfnet.de/c3subtitles/
CC BY 4.0 license
lots of English, some German

https://media.ccc.de/

Maybe they are even accurate enough so that we can use the audio from those videos (their goal is to make very accurate subtitles (even if the speaker repeats words etc.; only stuff like "umm" is left out), so if the timestamps are accurate enough, they could be used for training, maybe)

Hi, I'm not sure whether all (or any) of these ideas are new. I wanted to suggest using:

1 - Transcripts from political speeches, court videos, official announcements, etc. In France we have a transcript of all debates of the house (but cleaned up) [http://www2.assemblee-nationale.fr/feeds/detail/crs]
3 - Today many publicly available podcasts have transcripts of part or all of their content. It could be worth going around radio stations to find editorials that have complete transcripts.
e.g.: [https://www.franceculture.fr/emissions/le-tour-du-monde-des-idees/le-tour-du-monde-des-idees-du-mardi-04-septembre-2018]
4 - It might also be possible to get transcripts from programs subtitled for deaf people.

Since these might not be perfectly clean, you might want to process them through your STT first, then match and correct discrepancies before adding them to the dataset.
It could also be okay to keep only the correctly matched parts (with a length condition, for instance). I'm not sure how that would induce bias though?
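
The "keep only the matched parts" idea can be sketched with Python's standard `difflib`: align the reference transcript with the STT output word by word and keep only runs that agree and meet a minimum length. The function name and the `min_words` threshold are hypothetical, and the sketch ignores punctuation and audio timestamps, which a real pipeline would need in order to cut the matching audio.

```python
from difflib import SequenceMatcher


def matched_runs(transcript: str, stt_output: str, min_words: int = 4) -> list:
    """Return word runs of at least min_words that both texts agree on."""
    ref = transcript.lower().split()
    hyp = stt_output.lower().split()
    matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
    runs = []
    for block in matcher.get_matching_blocks():
        # The final block is a zero-length sentinel and is skipped here.
        if block.size >= min_words:
            runs.append(" ".join(ref[block.a:block.a + block.size]))
    return runs


print(matched_runs(
    "the committee will now vote on the amended budget proposal",
    "the committee will now boat on the amended budget proposal",
))
# → ['the committee will now', 'on the amended budget proposal']
```

On the bias question: a filter like this keeps only speech the existing recognizer already handles, so it would plausibly under-represent accents, vocabulary, and recording conditions the model is weak on.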

Best,

good suggestions!
very welcome :)
keep them coming

More sentences are better, bias will be reduced by adding more over time

We have to be careful with transcriptions, though, because of copyright.
They can't always be used, since we want to publish the whole dataset/corpus (voice and corresponding text) as CC0/public domain.

Regards

Hi,
I wanted to follow up on my previous post: it appears the French Senate actually has good tech initiatives. Not only do they provide cleaned-up transcripts, they also bind sentences to the right video timing. What's more, they make it easy to download the audio file directly.
It appears the license is open. It is very explicit for the transcripts, a bit less so for the media.

Enclosed is a script able to extract all this data by crawling their website (you have to change the extension to that of a Python file). It downloads an "xml" file containing transcripts and timestamps corresponding to sentences, along with the audio. Of course, it should be used with caution so as not to overload their server; it is fully synchronous, so it should be fine, but I wouldn't thread it too much. Also, I wouldn't tag their voices with their names or age or anything of the sort.
I counted 282 sessions so I would expect between 500 and 1000 hours of speech.

If you deem this usable, it could be worth opening a new category here to group international public institutions that provide CC0 transcribed speeches, and to give them credit. And maybe a bunch of people could write a few scripts to fetch the data (or kindly ask for it).

Edit: So I checked out the EU parliament, their site is a bit wacky but on copyright they say :

As a general rule, the reuse (reproduction or use) of textual data and multimedia items which are the property of the European Union (identified by the words '© European Union, [year(s)] – Source: European Parliament' or '© European Union, [year(s)] – EP' ) or of third parties (© External source, [year(s)]), and for which the European Union holds the rights of use, is authorised, for personal use or for further non-commercial or commercial dissemination, provided that the entire item is reproduced and the source is acknowledged. However, the reuse of certain data may be subject to different conditions in some instances; in this case, the item concerned is accompanied by a mention of the specific conditions relating to it .

The 'entire item is reproduced' clause seems to be a hitch, but they actually do some extensive cutting up themselves.
See here. Aside from that, it looks like we may use it as long as we are careful to use originals (not simultaneous translations).

Hello everyone,

In case you missed the announcement, after a few months of intense work, we launched the Sentence Collection Tool site for all Common Voice contributors. We are considering this a first beta version, but fully functional after some weeks of testing. This tool also includes a How to page with ideas on where and how to find corpus (open to improvements).

All sentences submitted, reviewed and validated using this tool will be incorporated into the main Common Voice site. We will point to this as the way to submit sentences to the project moving forward.

What is this tool?

This tool facilitates the task of submitting, reviewing and validating sentences in different locales so they can be incorporated into the main Common Voice site, where people can read them and donate their voice.

Why this tool?

The previous process to gather sentences was a bit unstructured, with too many places to go and unclear guidelines. In order for sentences to be useful for the Deep Speech algorithm, there are certain "hard requirements" this tool enforces to avoid problems in the future.

We also aim to keep improving the tool to make the experience even easier for everyone!

How can I start using it?

Just go to the Sentence Collection tool site and start submitting and reviewing sentences in your locales. Make sure you check the How-to page to understand how to use the tool.

Where do I report issues or ideas?

Our github project page is the best place to report any issues with the site. If you want to discuss with the rest of the community an idea or new feature you can do that in our discourse.

How can I help with the development?

This tool is developed by the Common Voice volunteers. Anyone can be involved in the development; you just need to know React or Kinto and chime in on our github project to learn more.

If you are not technical, don't worry! We usually open conversations on discourse to get everyone the chance to influence the direction of the project.

In order to centralize all discussions about where to find corpus, I would like to suggest we use the Common Voice discourse instead. Here is one topic where the community is talking about this:

https://discourse.mozilla.org/t/problems-finding-public-domain-sentences/34790

Thanks for your contributions!
