Common-voice: Getting the same sentences multiple times

Created on 24 Jun 2017 · 11 Comments · Source: mozilla/common-voice

As a user, when recording, I often get the same sentences multiple times; sometimes the same sentence comes up twice in one batch of three recordings.

There should be some way to remember which sentences have already been served to a given session, and to not serve them again once they have been recorded.

Enhancement

All 11 comments

One part of it is that the sentence dataset is really small, at 188 sentences (I expected thousands.) May I suggest the use of Wikisource? Something speech-heavy, like Agatha Christie, or modern speeches, maybe a play?

The second part is that sentences are simply picked with Math.random(). Independent random draws repeat often in a small pool and do not spread evenly over it, so you risk a large disparity between the number of recordings per sample. One solution is to increment a global variable on the server and yield the sentence at that index (wrapping around, obviously). That variable should be initialized to a random value (or persisted), to prevent one part of the list from being repeated due to server restarts.
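
To make the counter idea concrete, here is a minimal sketch in TypeScript. It is not taken from the common-voice codebase: the name `makeRoundRobin` and the optional `start` parameter are made up for illustration, and the sentence pool is assumed to be an in-memory array.

```typescript
// Hypothetical helper, not taken from the common-voice codebase.
// Returns a function that serves sentences in order, wrapping around at the end.
function makeRoundRobin(sentences: readonly string[], start?: number): () => string {
  if (sentences.length === 0) {
    throw new Error("sentence pool must not be empty");
  }
  // Use a random starting offset (unless one is given) so that a server
  // restart does not always replay the same first sentences.
  let next = Math.floor(start ?? Math.random() * sentences.length);
  // Normalize into the range [0, sentences.length), also for negative input.
  next = ((next % sentences.length) + sentences.length) % sentences.length;

  return () => {
    const sentence = sentences[next];
    next = (next + 1) % sentences.length; // advance and wrap around
    return sentence;
  };
}
```

With this approach every sentence is served once before any sentence is served a second time, so the number of requests per sentence never differs by more than one. Persisting `next` across restarts, as suggested above, would replace the random offset.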

I think this should be dealt with as a high priority. If you, as an avid and excited user, get to see and record the same sentences over and over again, it takes away your motivation to contribute.

Is this still happening? We have increased our sentences to several thousand (with more coming soon).

(Note: this is not a duplicate of #260. That one is about listening to sentences; this one is about recording. They come from entirely different pools.)

@mikehenrty I am receiving a lot of new sentences now. Thanks!

I'm late to this, but if you're expanding the range of sentences, it might be worth considering phonetic pangrams, as by definition these cover a large chunk of sounds quickly (although they are typically unrealistic).

https://www.quora.com/Is-there-a-text-that-covers-the-entire-English-phonetic-range/answer/Sheetal-Srivastava-1

I have tried a bit, and it seems like there are now sufficiently many different sentences to avoid getting the same ones multiple times. Thanks for fixing!

I'm reopening because the pool of sentences is not so large after all: you can still get the same sentences occasionally when you record a sufficient number of them (around 100), even in the same session.

I think this could be fixed (within one session) by remembering which sentences have already been recorded, and not asking for these same sentences again.
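
A rough sketch of that per-session memory, in TypeScript. The names `makeSessionPicker`, `pick` and `markRecorded` are hypothetical and not part of the common-voice code. The sketch assumes the session has the whole sentence list available and that sentences can be compared as plain strings.

```typescript
// Hypothetical session helper, not taken from the common-voice codebase.
function makeSessionPicker(sentences: readonly string[]) {
  // Sentences this session has already recorded.
  const recorded = new Set<string>();

  return {
    // Pick a random sentence that has not been recorded in this session,
    // or undefined once every sentence in the pool has been recorded.
    pick(): string | undefined {
      const remaining = sentences.filter((s) => !recorded.has(s));
      if (remaining.length === 0) {
        return undefined;
      }
      return remaining[Math.floor(Math.random() * remaining.length)];
    },
    // Call after a successful recording so the sentence is not offered again.
    markRecorded(sentence: string): void {
      recorded.add(sentence);
    },
  };
}
```

Filtering the whole pool on every call is fine for a few thousand sentences. What to do when `pick()` returns `undefined` (start over, or ask the server for more sentences) is left open here.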

This will probably be fixed once #304 gets accepted, but before that remembering sentences in a session could be a temporary fix. Allowing users to skip sentences (#278) would probably be a good enough fix as well for now (and skipping would be useful anyway).

Skipping sentences would help but it's a bit more tedious, and also as a user I'm not always sure whether I have already seen a sentence or not. (Did I see it when recording, or when validating? Was it that sentence, or another sentence from the same novel? etc.) So even if users can skip sentences there would probably be some frustration and some duplicate recordings (but I don't know whether having duplicate recordings of the same sentence by the same speaker pollutes the dataset).

Let's close this bug as we are actively trying to gather new sentences in #341. We are increasing our sentences by the day.

I think this should be reopened: while recording some sentences this morning I got the same one multiple times again. ("Gossips are frogs, they drink and talk", and "Where did he get it", if I remember correctly.)
