Common-voice: Too many recordings per user?

Created on 1 Apr 2019 · 11 Comments · Source: mozilla/common-voice

Recently a change was made to the validation section to list sentences with the fewest recordings first in order to add more unique sentences for the DeepSpeech model. This has made me more aware of something that could potentially be an issue.

Some users are recording a LOT of sentences. In fact, over the past few days I have validated around 1500-2000 clips and I would estimate around 70% of them were recorded by the same user, all of which were unique sentences.

I'm sure that the DeepSpeech team makes certain that there aren't too many recordings by a single user, so these sentences will most likely be discarded until there are more recordings available. But if the site shows sentences with the fewest recordings first, it will have to go through the thousands of unrecorded sentences to get to that point again, which may never happen if more sentences keep getting added. And if the validation site shows unique sentences first, it will take even longer to get the new sentence approved.

The DeepSpeech team said they don't want more than a few hundred recordings from any one user. So a user with 5000 recordings may have prevented 4700 sentences from making it into the model.

So I think the solution to this is either:

  1. Put a hard limit on the total number of recordings users can make or have a daily per-user limit.

  2. Change the algorithm so that each sentence has, say, 3 recordings minimum before it's given a lower priority in the record queue.

  3. Deprioritize recordings in the validation queue by users who have x number of validated recordings. So we're prioritizing unique users AND unique sentences.
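Options 2 and 3 could be sketched as sort keys along the following lines. This is only an illustration of the proposal, not the site's actual queue code: the function names, the data shapes, and the two thresholds (`MIN_RECORDINGS`, `USER_CLIP_CAP`) are assumptions made up for the example.

```python
MIN_RECORDINGS = 3    # option 2: hypothetical per-sentence minimum
USER_CLIP_CAP = 300   # option 3: hypothetical value for "x" validated clips


def record_queue(sentences, recording_counts, min_recordings=MIN_RECORDINGS):
    """Option 2: every sentence below the minimum shares the top priority.

    Sentences that already reached the minimum follow, fewest recordings first.
    """
    def key(sentence):
        count = recording_counts.get(sentence, 0)
        return 0 if count < min_recordings else count

    # sorted() is stable, so ties keep their original relative order.
    return sorted(sentences, key=key)


def validation_queue(clips, validated_per_user, user_cap=USER_CLIP_CAP):
    """Option 3: clips from users at or over the cap move to the back.

    clips is a list of (clip_id, user_id, sentence) tuples.
    """
    return sorted(
        clips,
        key=lambda clip: validated_per_user.get(clip[1], 0) >= user_cap,
    )
```

With option 2, a sentence with one or two recordings competes on equal terms with unrecorded sentences instead of waiting behind all of them.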

Investigate

All 11 comments

I agree that there are a variety of problems here. Some more information from the team would be helpful to aid our discussions. What is the DeepSpeech team actually doing with the contributions of users who have made large numbers of recordings? Using them will bias the corpus; not using them results in a lot of volunteer time being wasted. Better guidelines to volunteers as to what is needed and what is not would help considerably.

On the question of unique sentences, it's still not clear to me which recordings are used. If a sentence is recorded three times, once in Indian English, once in American English, and once in British English, what happens? Choosing the first recording, or a recording at random, will tend to exclude those who have lower representation within the volunteers. It will particularly tend to perpetuate the low number of female voices.

@dabinat I'm not sure I understand your statement of the problem. Let me start here...

  • You state "...so these sentences will most likely be discarded until there are more recordings available..." what do you mean by "discarded"? I'm not sure your definition reflects what is actually occurring.
  • You state "...But if the site shows sentences with the fewest recordings first, it will have to go through the thousands of unrecorded sentences to get to that point again..." What do you mean by "that point again"?
  • You state "...And if the validation site shows unique sentences first, it will take even longer to get the new sentence approved." What do you mean by "unique"? I don't think you mean unique. All the sentences are unique.

There are many other questions I have about your issue, but let's start there.

Ok, I'll try to explain it another way.

The recording view displays sentences with the fewest recordings first. So it will always prioritize sentences with 0 recordings first.

Let's say I record a sentence that has no prior recordings and my recording gets approved by validators. The number of recordings for that sentence is now 1.

This is the 2000th recording I have made (we'll assume all 2000 were approved). Obviously you're not going to use all 2000 recordings because that would bias the model too much, so some of my recordings would not be used to train the model.

But some of those clips that are unused are the only recording of that particular sentence, so that sentence doesn't get coverage in the model.

That's ok - we'll just wait for someone else to record it and use their recording instead. But the recording view prioritizes sentences with 0 recordings first and that sentence has 1 recording. So in order for it to appear again, all unrecorded sentences would need to be recorded so that there are no more sentences with 0 recordings.

English has over 30,000 sentences, with more being added all the time, so the chances of the number of unrecorded sentences ever reaching zero seem pretty slim. That sentence will therefore never appear again and never get recorded again, so no recording of it will make it into the model.
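The starvation effect described above can be reproduced with a toy simulation. It assumes a queue that always serves the sentence with the fewest recordings and a steady stream of new sentences. The names `next_sentence` and `simulate` and all the numbers are invented for the illustration and are not taken from the Common Voice code.

```python
def next_sentence(counts):
    """Return the sentence with the fewest recordings (first one on ties)."""
    return min(counts, key=counts.get)


def simulate(steps, new_per_step):
    """Return True if the once-recorded sentence is ever offered again."""
    counts = {"old": 1}  # recorded once, by a heavy contributor
    counts.update({f"s{i}": 0 for i in range(3)})  # unrecorded backlog
    added = 3
    served_old = False
    for _ in range(steps):
        # New sentences arrive with zero recordings.
        for _ in range(new_per_step):
            counts[f"s{added}"] = 0
            added += 1
        pick = next_sentence(counts)
        if pick == "old":
            served_old = True
        counts[pick] += 1
    return served_old


print(simulate(steps=100, new_per_step=1))  # False: "old" is never offered
print(simulate(steps=100, new_per_step=0))  # True: the backlog drains first
```

As long as sentences are added at least as fast as they are recorded, the sentence with one recording never reaches the front of the queue.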

This may seem hypothetical but there are people with over 15,000 recordings and I often hear the same one or two voices again and again when I validate. So my question is: are people with thousands of recordings doing more harm than good?

(Also, even if they're not actively doing harm, they're probably wasting their own time and the time of validators. While there may be some uses for this outside of DeepSpeech it seems like it's not the most optimal use of volunteer time.)

Some of your assumptions are not true.

For example...

> ...Obviously you're not going to use all 2000 recordings...

That is not necessarily true. It depends on how many other people have read the sentences you've read. Some of your validated recordings _may_ not be put in train.tsv, but all of them will be put in validated.tsv. Anyone is free to train on validated.tsv.

> But some of those clips that are unused are the only recording of that particular sentence, so that sentence doesn't get coverage in the model.

This is also incorrect.

To avoid spending too much time explaining all the details, I think it would be best if you took a look at the code that actually creates the various data splits, corpus.py, and then came back with specific questions about that code.

If my understanding of the code is correct, it sorts by total user recordings ascending and then filters out duplicates, meaning that in the case of more than one recording of a sentence it will favor the user with the fewest total recordings.

It then sorts again, this time in descending order. I'm not really sure what the purpose of sorting again is. Maybe to make sure users with more recordings go to test/dev instead of train?

But then it loops through each individual user and makes sure their recordings that haven't already been filtered out fit in the test/dev sets, otherwise they end up in train.

So it seems like I could make thousands and thousands of recordings and as long as the sentences had no other recordings, all of them would end up in train?
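For readers following along, the reading of the split logic given in this comment can be sketched as below. This is a paraphrase of the description in this thread, not the actual corpus.py. The function name `split_clips`, the `(user_id, sentence)` tuples, and the `test_size` and `dev_size` parameters are hypothetical, and the sketch assumes users are visited from fewest to most recordings when filling test and dev.

```python
from collections import Counter, defaultdict


def split_clips(clips, test_size, dev_size):
    """Split validated (user_id, sentence) clips into train, dev and test."""
    per_user = Counter(user for user, _ in clips)

    # 1. Sort by each user's total recordings, ascending, and keep only the
    #    first clip of every sentence. Duplicates are therefore resolved in
    #    favour of the user with the fewest total recordings.
    kept, seen = [], set()
    for user, sentence in sorted(clips, key=lambda c: (per_user[c[0]], c[0])):
        if sentence not in seen:
            seen.add(sentence)
            kept.append((user, sentence))

    # 2. Group the surviving clips by user.
    by_user = defaultdict(list)
    for user, sentence in kept:
        by_user[user].append(sentence)

    # 3. Visit users from fewest to most total recordings (an assumption).
    #    A user's clips go to test or dev only if all of them fit;
    #    otherwise they all go to train.
    train, dev, test = [], [], []
    for user in sorted(by_user, key=lambda u: (per_user[u], u)):
        batch = [(user, sentence) for sentence in by_user[user]]
        if len(test) + len(batch) <= test_size:
            test.extend(batch)
        elif len(dev) + len(batch) <= dev_size:
            dev.extend(batch)
        else:
            train.extend(batch)
    return train, dev, test
```

Under these assumptions, a heavy contributor whose sentences have no other recordings keeps every clip after step 1, and all of them land in train once their batch no longer fits in test or dev.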

> Maybe to make sure users with more recordings go to test/dev instead of train?

Basically, yes. Worded in another way, this sort is made to have users with fewer recordings in the test set.

For smaller data sets which don't have enough data to train a robust STT engine the maximal utility of the data set can be obtained by having a diverse test set, as the test set can become a standard used in research and industry. This is accomplished by having the test set contain a more diverse set of users which is accomplished by this sort.

> So it seems like I could make thousands and thousands of recordings and as long as the sentences had no other recordings, all of them would end up in train?

Yes.

So should users create thousands of recordings or just say 300 per language and focus on verification then?

If so, the leader board makes no sense since it encourages users to create more recordings.

Please communicate what helps most to reach our goal (get robust STT ASAP) on the website and build a workflow to encourage that.

@kdavis-mozilla any update here? are we going in the right direction?

> So should users create thousands of recordings or just say 300 per language and focus on verification then?

Thousands of recordings will be useful for the train data set. Similarly, 300 per language + validation is also useful.

Both are useful, and I don't think there is one "most useful" path other than to contribute to the languages you want to support!

@kdavis-mozilla thanks. that's a clear statement.

since 1 recording needs at least 2 verifications, it makes sense to make the goals accordingly in the web ui.

the daily goal in the statistic already reflects that

[Image: Screenshot from 2020-07-21 12-49-59]

and we might want to communicate it as a common goal and it is fine if someone only records or verifies.

or we also apply it to the personal goal and achievements

[Image: Screenshot from 2020-07-21 12-52-58]

Yeah I agree
