Common-voice: Adjust volume automatically when listening to multiple recordings

Created on 23 Jul 2017 · 11 Comments · Source: mozilla/common-voice

Many music players can do this, why not use it here too?

Just adjust the audio volume when listening to sentences. I always have to turn the volume up and down, and with such an automatic adjustment I could avoid this.

help wanted

rugk

👍9

Most helpful comment

That's awesome, thanks! I've been reading into this topic today and was a bit overwhelmed by ReplayGain (and surprised by the lack of JS implementations). Happy to review a PR or look into it further myself!

Gregoor on 21 Jul 2018

👍3

All 11 comments

+1, I guess these audio samples are normalized anyway before further processing, so maybe it's a good idea to normalize them before validating. A stable audio volume will make the validation process more pleasant.

azymohliad on 3 Aug 2017

Thinking about implementations for this... I think I agree with @PaulXiCao's comments in #445
Presumably any consumers of the voice database will want to have control over the normalisation method they use (if any). So first and foremost I think this should be solved specifically for the listening use case, and any solution should avoid changing the training corpus.

The standard method of normalising music volume is referred to as ReplayGain, a specification for analysing the RMS and peak values of the audio and using this information to apply a constant gain to a file or album at playback time, so that all your music plays at a comparable volume. We could do something similar, or do something more like what Spotify does: fluctuate the volume over the course of the audio clip to normalise volume within the file itself. That obviously sounds horrible for music but makes perfect sense for speech.
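To make the "analyse RMS and peak, then apply a constant gain" idea concrete, here is a rough Python sketch. It is not the actual ReplayGain algorithm (which also applies loudness filtering and statistical analysis); the function names and the `target_rms` default are assumptions chosen purely for illustration, and samples are assumed to be floats in the range [-1.0, 1.0].

```python
import math
from typing import Sequence, Tuple


def rms_and_peak(samples: Sequence[float]) -> Tuple[float, float]:
    """Return (rms, peak) for float samples in the range [-1.0, 1.0]."""
    if not samples:
        return 0.0, 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak = max(abs(s) for s in samples)
    return rms, peak


def constant_gain(samples: Sequence[float], target_rms: float = 0.1) -> float:
    """Linear gain that moves the clip towards target_rms without clipping.

    target_rms is a hypothetical reference level, not a value from any spec.
    """
    rms, peak = rms_and_peak(samples)
    if rms == 0.0:
        # Digital silence: there is nothing to scale, so leave it alone.
        return 1.0
    wanted = target_rms / rms
    # Never let the loudest sample exceed full scale (1.0).
    return min(wanted, 1.0 / peak)
```

The gain is computed once per clip and applied uniformly at playback, so the stored file never changes.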

I see three good options. Here they are from my favourite to my least favourite:

1. Analyse the audio stream as it's uploaded and save metadata along with the file.
   - This _might_ be possible in one ffmpeg process using -filter:a volumedetect.
   - More likely it will require decoding, analysing in JS, and then encoding (currently this is a single ffmpeg transcoding process).
   - The analysed metadata can then be saved alongside the mp3 (maybe as ID3 tags).
   - The client will then have to be capable of reading these values (hard?) and applying a simple gain (easy).
     - May not be trivial if we use embedded tags in the mp3.
2. Read the audio on the client using the WebAudio API and normalise at time of playback.
   - This avoids any complications on the server.
   - This complicates what is otherwise a very simple section of the client.
   - Would have to normalise constantly over the course of the audio file as it downloads (not necessarily a problem).
3. Normalise the audio in the transcoding step of the upload, and save two files.
   - This can be done with a simple ffmpeg filter: -filter:a loudnorm.
   - It requires double the encoding time and storage space.
   - It means the validated file isn't strictly the same file as might be used for training.
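As a rough illustration of the first option, the sketch below runs ffmpeg's `volumedetect` filter in analysis-only mode and pulls the mean and peak levels out of its log output. The helper names are made up for this example, and the parser assumes the usual `mean_volume: ... dB` / `max_volume: ... dB` lines that ffmpeg writes to stderr; it is written in Python for brevity and is not how voice-web does its transcoding.

```python
import re
import subprocess
from typing import Dict, List

# Matches lines such as "... mean_volume: -27.0 dB" in ffmpeg's stderr log.
_STAT = re.compile(r"(mean_volume|max_volume):\s*(-?\d+(?:\.\d+)?) dB")


def volumedetect_command(path: str) -> List[str]:
    """Build an analysis-only ffmpeg call: decode, measure, discard output."""
    return ["ffmpeg", "-hide_banner", "-i", path,
            "-filter:a", "volumedetect", "-f", "null", "-"]


def parse_volumedetect(stderr_text: str) -> Dict[str, float]:
    """Extract mean_volume and max_volume (in dB) from ffmpeg's log text."""
    return {name: float(value) for name, value in _STAT.findall(stderr_text)}


def measure(path: str) -> Dict[str, float]:
    """Run ffmpeg on a file and return the measured levels (needs ffmpeg)."""
    result = subprocess.run(volumedetect_command(path),
                            capture_output=True, text=True, check=True)
    return parse_volumedetect(result.stderr)
```

The two numbers returned by `measure` are the kind of metadata that could be stored next to the clip and turned into a playback gain on the client.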

voice-web transcodes submitted audio as a stream, using an ffmpeg wrapper. Here are ffmpeg's utilities for audio normalisation: https://trac.ffmpeg.org/wiki/AudioVolume

Any solution needs to be robust, as we're deliberately looking for audio of all kinds and qualities.

For instance, I've come across silent audio files which are just noise, which isn't something music normalisation solutions have to worry about.
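One way to keep a normaliser from blowing up such noise-only clips is to skip anything below a noise floor and to cap the boost. The Python sketch below shows the idea; every threshold in it (`target_db`, `noise_floor_db`, `max_boost_db`, `max_cut_db`) is a hypothetical value picked for illustration, not something the project uses.

```python
def safe_gain_db(rms_db: float,
                 target_db: float = -20.0,
                 noise_floor_db: float = -50.0,
                 max_boost_db: float = 20.0,
                 max_cut_db: float = 20.0) -> float:
    """Gain in dB that moves a clip towards target_db, within safe limits."""
    if rms_db <= noise_floor_db:
        # Probably silence or faint noise: boosting it would only make
        # the noise loud, so leave the clip untouched.
        return 0.0
    wanted = target_db - rms_db
    # Clamp the correction in both directions.
    return max(-max_cut_db, min(max_boost_db, wanted))


def db_to_linear(gain_db: float) -> float:
    """Convert a gain in dB to a linear amplitude multiplier."""
    return 10 ** (gain_db / 20)
```

With these example defaults, a clip measured at -45 dB would be boosted by the 20 dB cap rather than the full 25 dB, and a clip at -60 dB would be left alone.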

Anyway, those are my two entire cents. I'm new here but this issue caught my imagination. Hopefully this helps someone. And if any of these options appeal to core maintainers, perhaps I can investigate implementation details more thoroughly.

willstott101 on 7 Dec 2017

👍1

I could barely hear one sample, so I raised the volume a little. The next sample knocked me out of my chair.

jwdunn1 on 11 Jan 2018

👍2

Generally, I think it's also going to be really bad for anyone who wants to process the data if it's way too loud or too quiet. Either way, loads of data gets lost.

So why don't we give feedback about too much or too little recording volume to the speakers, the same way submissions are not accepted if they're too long? (Which is sometimes a bit too strict, btw.)

I think VoxForge did it this way. I can also think of some use cases where people would be happy to have too loud / too quiet examples, but I think they are rare, and not taking these into account right now would speed up the overall project a little.

paskalito on 10 Jun 2018

So why don't we give feedback about too much or too little recording volume to the speakers?

Great idea! Maybe you can open a new issue for that?

rugk on 10 Jun 2018

While there are errors for that, with the thresholds we chose they only trigger in absolute silence or when you're recording next to a festival speaker. Happy to look at PRs that have better detection, without excluding noise (we want noise!).
For listening we should be able to build something that automatically adjusts the volume, I'll add it to my backlog 😁
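For anyone wanting to experiment with better detection, a level check could look roughly like the Python sketch below. It judges only the overall level and the share of clipped samples, so noisy recordings are not rejected for being noisy. The function name and all thresholds are assumptions for illustration and are not the values the site uses.

```python
import math
from typing import Sequence


def level_feedback(samples: Sequence[float],
                   quiet_rms_db: float = -40.0,
                   clip_level: float = 0.99,
                   max_clipped_fraction: float = 0.01) -> str:
    """Classify a recording as 'too_quiet', 'too_loud' or 'ok'.

    Samples are assumed to be floats in the range [-1.0, 1.0].
    """
    if not samples:
        return "too_quiet"
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    rms_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
    # Share of samples sitting at (or very near) full scale.
    clipped = sum(1 for s in samples if abs(s) >= clip_level) / len(samples)
    if clipped > max_clipped_fraction:
        return "too_loud"
    if rms_db < quiet_rms_db:
        return "too_quiet"
    return "ok"
```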

Gregoor on 13 Jun 2018

I can second that this is really a big deal: there are clips which are barely audible even at maximum volume, and others that are already loud enough at minimum volume. It'd be great if we could have normalization just in the listening / validation process, not on the original data.

lissyx on 9 Jul 2018

👍1

I've built a little normalizer thing: normalizer.zip. It's not doing as many fancy things as ReplayGain does but it is quite good. Try it out! Maybe if I have time tomorrow or such I'll send a PR for integration.

est31 on 21 Jul 2018

👍3

Gregoor on 21 Jul 2018

👍3

@est31

I've built a little normalizer thing: normalizer.zip.

As for your ZIP there, maybe you could release it as a FLOSS (open-source) project somewhere on GitHub or so? It could be useful for others / other projects.

rugk on 6 Sep 2018

@rugk I hope I could help: https://github.com/est31/js-audio-normalizer

est31 on 7 Sep 2018

👍2
