Common-voice: Getting the same sentences multiple times

Created on 24 Jun 2017 · 11 Comments · Source: mozilla/common-voice

As a user, when recording, I often get the same sentences multiple times -- sometimes you can get the same sentence twice in one batch of three recordings.

There should be some way to remember which sentences have already been served to a given session, and to not serve them again once they have been recorded.

Enhancement

a3nm

👍3

All 11 comments

One part of it is that the sentence dataset is really small, at 188 sentences (I expected thousands). May I suggest the use of Wikisource? Something speech-heavy, like Agatha Christie, or modern speeches, or maybe a play?

The second part is that it simply uses Math.random(). Independent random draws are not spread evenly across the sentences, so you risk a large disparity between the number of recordings per sentence. One solution is to increment a global counter on the server and serve the sentence at that index, wrapping around at the end of the list. The counter should be initialized to a random value (or persisted), so that server restarts do not cause the same part of the list to be repeated.
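For illustration, the counter idea could look roughly like the sketch below. This is not the project's code: `createSentenceRotation` and its `startIndex` parameter are hypothetical names, and the sentence list is assumed to already be loaded in memory on the server.

```javascript
// Hypothetical helper (not from the Common Voice codebase): serves sentences
// in rotation instead of drawing each one independently at random.
function createSentenceRotation(sentences, startIndex) {
  if (sentences.length === 0) {
    throw new Error("sentence list must not be empty");
  }
  // Start at a random offset unless a persisted index is supplied, so that
  // a server restart does not always replay the beginning of the list.
  let index = startIndex === undefined
    ? Math.floor(Math.random() * sentences.length)
    : startIndex % sentences.length;

  return function nextSentence() {
    const sentence = sentences[index];
    index = (index + 1) % sentences.length; // wrap around at the end
    return sentence;
  };
}

// Usage with a fixed start index, so the order is predictable:
const next = createSentenceRotation(["a", "b", "c"], 1);
const served = [next(), next(), next(), next()];
// served is ["b", "c", "a", "b"]
```

With this approach every sentence is served once before any sentence is served a second time, regardless of which user asks.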

espadrine on 24 Jun 2017

👍1

I think this should be dealt with as a high priority. If you, as an avid and excited user, see and record the same sentences over and over again, it takes away your motivation to contribute.

orschiro on 16 Jul 2017

👍2

Is this still happening? We have increased our sentences to several thousand (with more coming soon).

(Note: this is not a duplicate of #260. That one is about listening to sentences; this one is about recording. They come from entirely different pools.)

mikehenrty on 17 Jul 2017

@mikehenrty I am receiving a lot of new sentences now. Thanks!

orschiro on 18 Jul 2017

🎉1

I'm late to this, but if you're expanding the range of sentences, it might be worth considering phonetic pangrams, as by definition these cover a large chunk of sounds quickly (although they are typically unrealistic).

https://www.quora.com/Is-there-a-text-that-covers-the-entire-English-phonetic-range/answer/Sheetal-Srivastava-1

nmstoker on 18 Jul 2017

I have tried a bit, and it seems like there are now sufficiently many different sentences to avoid getting the same ones multiple times. Thanks for fixing!

a3nm on 18 Jul 2017

I'm reopening because the pool of sentences is not so large after all: you can still get the same sentences occasionally when you record a sufficient number of them (around 100), even in the same session.

I think this could be fixed (within one session) by remembering which sentences have already been recorded, and not asking for these same sentences again.
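As a rough sketch of that idea, assuming the client (or a per-session server object) holds the sentence pool in memory: the name `createSessionPicker` is hypothetical, and for simplicity a sentence is marked as seen when it is served rather than when its recording is submitted.

```javascript
// Hypothetical per-session picker (names are illustrative, not from the
// Common Voice codebase). Each session gets its own picker and its own memory.
function createSessionPicker(sentences, random = Math.random) {
  const seen = new Set(); // sentences already served in this session

  return function pickSentence() {
    // Only consider sentences this session has not been given yet.
    const unseen = sentences.filter((s) => !seen.has(s));
    if (unseen.length === 0) {
      return null; // the pool is exhausted for this session
    }
    const choice = unseen[Math.floor(random() * unseen.length)];
    seen.add(choice);
    return choice;
  };
}
```

Filtering the whole pool on every request is linear in the pool size, which should be acceptable for a few thousand sentences. A larger pool might call for shuffling once per session and walking through the shuffled list instead.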

a3nm on 23 Jul 2017

This will probably be fixed once #304 gets accepted, but until then, remembering sentences in a session could be a temporary fix. Allowing users to skip sentences (#278) would probably be a good enough fix for now as well (and skipping would be useful anyway).

Omniscimus on 23 Jul 2017

👍1

Skipping sentences would help but it's a bit more tedious, and also as a user I'm not always sure whether I have already seen a sentence or not. (Did I see it when recording, or when validating? Was it that sentence, or another sentence from the same novel? etc.) So even if users can skip sentences there would probably be some frustration and some duplicate recordings (but I don't know whether having duplicate recordings of the same sentence by the same speaker pollutes the dataset).

a3nm on 23 Jul 2017

Let's close this bug as we are actively trying to gather new sentences in #341. We are increasing our sentences by the day.

mikehenrty on 25 Jul 2017

I think this should be reopened: while recording some sentences this morning I got the same one multiple times again. ("Gossips are frogs, they drink and talk", and "Where did he get it", if I remember correctly.)

a3nm on 26 May 2018
