Common-voice: Clips are being repeatedly re-served for reading

Created on 14 Feb 2019 · 19 Comments · Source: mozilla/common-voice

When I'm engaged in a long reading session, it slows me down quite a lot and is rather annoying to be constantly served the same clips to read over and over again. During the course of a one-hour session, I can easily be asked to read the same clip 10 or 15 times, which is not very useful if I've read it already, and simply wastes time if I've already told the system that I want to skip it (for example because I'm unsure of a pronunciation).

Could the algorithm that selects the clips for reading be altered to exclude those that the user has already recorded, or has previously specified should be skipped?

The way the algorithm works does seem strange, as it's not as if there's nothing new available for me to record: among all the clips that I've seen, many, many times before, something new will filter through perhaps one in 20 times.
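The exclusion requested here can be sketched in a few lines. This is only an illustration of the idea, not Common Voice's actual selection code: `UserHistory`, `sentences_to_serve`, and the `limit` parameter are hypothetical names, and the pool is assumed to be a plain in-memory list rather than a database query.

```python
from dataclasses import dataclass, field


@dataclass
class UserHistory:
    """Hypothetical per-user record of sentences already handled."""
    recorded: set[str] = field(default_factory=set)
    skipped: set[str] = field(default_factory=set)


def sentences_to_serve(pool: list[str], history: UserHistory, limit: int = 5) -> list[str]:
    """Return up to `limit` sentences the user has neither recorded nor skipped."""
    seen = history.recorded | history.skipped
    fresh = [s for s in pool if s not in seen]
    return fresh[:limit]


# Example pool (assumed data for illustration only).
pool = [
    "Empty vessels make the most noise.",
    "You are what you eat.",
    "I can do no more.",
    "Ignorance is bliss.",
]
history = UserHistory(
    recorded={"Empty vessels make the most noise."},
    skipped={"You are what you eat."},
)
batch = sentences_to_serve(pool, history)
print(batch)
```

With a filter like this, a sentence that has been read or skipped once never comes back for the same user, and an empty result signals that the pool is exhausted for that user.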

All 19 comments

That's strange indeed, thanks for the report! I couldn't immediately recreate it. Can you post a sentence you've gotten twice here (with your username on Common Voice)?

Many thanks. I'm very pleased to hear that it's not intended behaviour, anyway!

I just completed a short recording session ten minutes ago (my username is MichaelMaggs), and this is what I found:

About 70% of the sentences I've seen many times before, mostly during long sessions yesterday or the day before. Some of the sentences I have previously skipped, and some I have previously recorded (often several times).

Here are three sentences I read today that I've seen frequently before:

_* Empty vessels make the most noise._
_* Whoever lived on the ranch did that._
_* I can do no more._

Here are three more that I skipped today that I've also previously seen often:

_* You are what you eat._
_* The goose was brought straight from the old market._
_* Ignorance is bliss._

There was one sentence I've seen before that actually repeated itself twice during today's session:

_* Empty barrels make the most noise._

I stopped the session when the system presented me with a blank screen that I was not able to skip. After two or three minutes, the screen did eventually populate with a new sentence.

I hope that's helpful. I'm happy to do what I can to help track down the issue.

I seem to have reached some sort of reading limit as I'm now being repeatedly presented with the same eight or so clips; never anything new now.

Hi Michael, sorry for not answering sooner, busy weeks. This is still very much on my radar.. err bookmark bar!

Hm, I looked at the query for your user account and it doesn't seem to return previously recorded sentences for me.

I should nevertheless write a test case for this, to be sure!

Yes please! Also, some of the sentences being returned are ones I've previously skipped multiple times: not all are ones I've read before.

There seems to be a suggestion in this thread that repeated readings may be quite common: https://discourse.mozilla.org/t/why-train-tsv-includes-a-few-files-just-3-of-validated-set/36471

Yup, that's because we have too few sentences. But it's a different case from this one, as in that thread Kelly talks about multiple readings by different users.

Still happening. I'm being asked to read sentences that I know have already been read several times by others (as I've just validated the recordings). These are sentences I uploaded myself a couple of weeks ago. Is there no way of ordering the sentences to be read to give priority to those that haven't already been done?

There is, and that should already be happening. I'm sorry, but I'm having trouble reproducing this. I just did a batch of validations and checked the clips, and all of them had 1 or 0 votes (and were from sentences that had few clips, the change we talked about in another issue).
Could you open the network tab of your browser's dev tools when voting on clips and share a screenshot of it? Maybe something goes wrong there.

Happy to help in whatever way I can, but I'm concerned here about the clips presented for reading, not those for validation. As you say, validation seems much better now.

Oh sorry, not the first time I mix that up 🙈 Let me check again!

I've just been presented with "Oh my fur and whiskers!" to read. I know that's been recorded at least twice already, as I validated it for two different readers this morning.

So the sentences I've gotten were read between 0 and 2 times, which is true for about 1000 sentences. In English we have another ~5k sentences that have been read more often than that. Another thing that comes into play here is the bucket, i.e. users in the smallest bucket atm only have 270 sentences left that have been read fewer than 3 times.

"Oh my fur and whiskers!" has indeed been read three times.
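For readers following along, the prioritisation being discussed (serve sentences with the fewest recordings first, and stop serving a sentence once it has enough) might look roughly like the sketch below. It is an assumption-laden illustration, not the real query: `prioritise` and `max_clips` are hypothetical names, the threshold of 3 is borrowed from the "fewer than 3 times" figure mentioned above, and the counts for the placeholder sentences are made up.

```python
def prioritise(clip_counts: dict[str, int], max_clips: int = 3) -> list[str]:
    """Sentences with fewer than `max_clips` recordings, least-recorded first."""
    eligible = [(count, text) for text, count in clip_counts.items() if count < max_clips]
    # Sorting the (count, text) pairs orders by count, then alphabetically.
    return [text for count, text in sorted(eligible)]


# Assumed counts; only the first entry reflects a figure from this thread.
clip_counts = {
    "Oh my fur and whiskers!": 3,
    "Hypothetical sentence A.": 0,
    "Hypothetical sentence B.": 2,
    "Hypothetical sentence C.": 1,
}
queue = prioritise(clip_counts)
print(queue)
```

Under these assumptions a sentence already read three times drops out of the queue entirely, which is why a small pool runs dry quickly once most sentences reach the threshold.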

I'm afraid I don't know how the bucket system works, but it sounds as though the basic problem is too few English sentences. Would it be more helpful at the moment for me to spend time uploading and validating sentences in the Sentence Collector, rather than validating recordings and reading?

Yes, you got that exactly right. We can do a better job of highlighting that need in Common Voice itself. The Sentence Collector is still at an early stage, which is probably why we don't feel as confident highlighting it more on the site. But in English there's definitely a need for more sentences.

OK, will do!

Great, thanks for your thoughtful efforts and all the feedback!

RE bucketing: don't worry about it, it's likely we'll remove it from Common Voice itself, as the tool that processes and bundles the clips does a better job of bucketing atm.

Given the trend towards one recording per sentence, I'm closing this.
