Espnet: About usage of pretrained vocoder like "espnet_model_zoo"

Created on 9 Oct 2020 · 4 Comments · Source: espnet/espnet

Hi, I see that the ESPnet2 TTS Colab uses pretrained TTS models from "espnet_model_zoo", and I also found detailed usage instructions for "espnet_model_zoo". Is there something similar for vocoders? I mean, how can I find out which vocoders ESPnet supports and how to combine them with the TTS models? Is there any tutorial or documentation?

Question TTS

All 4 comments

ESPnet will work with any vocoder that takes an 80-band mel-spectrogram as input. Alternatively, you can rescale the spectrogram output to a linear scale. That covers almost every vocoder in the world. There are lots of options, including:

  • WaveNet - Google's original model. It has the highest fidelity but is slow.
  • WaveGlow - NVIDIA's approach to making WaveNet faster, but it doesn't sound as good.
  • MelGAN and Parallel WaveGAN - You have seen them in the demo.

What you can't use is Mozilla's LPCNet, which uses Bark-scale features with pitch components. You cannot really rescale to that; you can only train your text-to-spectrogram model to produce features on the required scale.
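
In practice, "takes an 80-band mel-spectrogram" also means the vocoder should have been trained on features extracted with the same settings as the TTS model. Here is a minimal sketch of such a check, assuming both sides expose their feature settings as plain dictionaries. The helper name, the key names (`fs`, `n_mels`, and so on), and the values are illustrative assumptions, not taken from any ESPnet or vocoder config:

```python
# Hypothetical helper: key names and example values are illustrative assumptions.
FEATURE_KEYS = ("fs", "n_fft", "hop_length", "n_mels", "fmin", "fmax")


def feature_mismatches(tts_conf, vocoder_conf):
    """Return {key: (tts_value, vocoder_value)} for every setting that differs."""
    return {
        key: (tts_conf.get(key), vocoder_conf.get(key))
        for key in FEATURE_KEYS
        if tts_conf.get(key) != vocoder_conf.get(key)
    }


tts_conf = {"fs": 22050, "n_fft": 1024, "hop_length": 256,
            "n_mels": 80, "fmin": 80, "fmax": 7600}
vocoder_conf = dict(tts_conf, hop_length=300)  # deliberately mismatched hop size

print(feature_mismatches(tts_conf, tts_conf))      # identical settings
print(feature_mismatches(tts_conf, vocoder_conf))  # reports the hop size
```

An empty result suggests the two models can be chained directly; any reported key points at a setting that would need retraining or feature conversion.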

Thank you for your answer, @shigabeev.
As @shigabeev said, you can combine ESPnet with any vocoder that takes a mel-spectrogram as input.
Officially, we use my repository below (the one used in the demo):
https://github.com/kan-bayashi/ParallelWaveGAN

This repository supports the following models:

  • Parallel WaveGAN
  • MelGAN
  • Multi-band MelGAN

You can find how to use the pretrained models here:
https://github.com/kan-bayashi/ParallelWaveGAN#how-to-use-pretrained-models
https://github.com/kan-bayashi/ParallelWaveGAN#decoding-with-espnet-tts-models-features

You can also create your own vocoder recipe:
https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/egs/README.md#how-to-make-the-recipe-for-your-own-dateset
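
Putting the links above together, here is a rough sketch of chaining an ESPnet2 TTS model with a vocoder from this repository. It assumes `espnet`, `espnet_model_zoo`, and `parallel_wavegan` are installed, that `parallel_wavegan.utils` provides `download_pretrained_model` and `load_model` as described in its README, and that `Text2Speech` returns either a tuple whose second item is the mel-spectrogram or a dict with a `feat_gen` entry, depending on the ESPnet version. The function names are hypothetical and the tag arguments are placeholders, so check each project's README for the tags that actually exist:

```python
def load_pipeline(tts_tag, vocoder_tag, device="cpu"):
    """Load a TTS model and a vocoder. Tags are placeholders (see the READMEs)."""
    # Imported lazily so the glue code below can be read and tested on its own.
    from espnet2.bin.tts_inference import Text2Speech
    from espnet_model_zoo.downloader import ModelDownloader
    from parallel_wavegan.utils import download_pretrained_model, load_model

    text2speech = Text2Speech(
        **ModelDownloader().download_and_unpack(tts_tag), device=device
    )
    vocoder = load_model(download_pretrained_model(vocoder_tag)).to(device).eval()
    vocoder.remove_weight_norm()
    return text2speech, vocoder


def extract_mel(tts_output):
    """Pick the generated mel-spectrogram out of the TTS output (version-dependent)."""
    if isinstance(tts_output, dict):
        return tts_output["feat_gen"]  # assumed key in newer ESPnet releases
    return tts_output[1]               # assumed (wav, mel, ...) tuple in older ones


def synthesize(text, text2speech, vocoder):
    """Text -> mel-spectrogram (TTS model) -> waveform (vocoder)."""
    mel = extract_mel(text2speech(text))
    return vocoder.inference(mel)
```

With real models, you would call `load_pipeline(...)` once and then `synthesize(...)` per sentence, ideally inside `torch.no_grad()`.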

Thanks for your reply!

Thanks! I'll try it.
