ESPnet: Usage of pretrained vocoders, similar to "espnet_model_zoo"

Created on 9 Oct 2020 · 4 Comments · Source: espnet/espnet

Hi, I see that the espnet2 TTS Colab uses a pretrained TTS model from "espnet model zoo", and I also found detailed usage instructions for "espnet model zoo". Is there something similar for vocoders? I mean, how can I find out which vocoders ESPnet supports and how to combine them with the TTS models? Is there any tutorial or documentation?

Question TTS

ymzlygw

All 4 comments

ESPnet works with any vocoder that takes an 80-band mel-spectrogram as input. Alternatively, you can rescale the spectrogram output to a linear scale. That covers almost every vocoder out there. There are lots of options, including:

WaveNet - Google's original model. It has the highest fidelity but is slow.
WaveGlow - Nvidia's approach to making WaveNet faster, but it doesn't sound as good.
MelGAN and Parallel WaveGAN - You have seen them in the demo.
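
To make the "80-band mel" versus "linear scale" distinction more concrete, here is a minimal sketch of how mel bands map back to linear frequency in Hz. It assumes the HTK-style mel formula, and the `fmin`/`fmax` values are illustrative only; real toolkits may use a different mel variant and different band limits, so treat the numbers as an approximation rather than ESPnet's actual configuration.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Convert Hz to mel using the HTK-style formula (an assumption here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel: convert mel back to linear frequency in Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels: int = 80,
                     fmin: float = 80.0,
                     fmax: float = 7600.0) -> List[float]:
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel axis.

    fmin and fmax are illustrative values, not taken from any model config.
    """
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


if __name__ == "__main__":
    centers = mel_band_centers()
    # Bands are dense at low frequencies and sparse at high frequencies.
    print(len(centers), centers[1] - centers[0], centers[-1] - centers[-2])
```

The bands are evenly spaced in mel but not in Hz, which is why a mel-spectrogram has to be explicitly mapped if a vocoder expects linear-frequency input.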

What you can't use is Mozilla's LPCNet, which uses Bark-scale features with pitch components. You cannot really rescale a mel-spectrogram to that; you can only train your text-to-spectrogram model to produce the required features.

shigabeev on 9 Oct 2020

Thank you for your answer, @shigabeev.
As @shigabeev said, you can combine ESPnet with any vocoder that uses a mel-spectrogram as input.
Officially, we use my repository below (it is the one used in the demo):
https://github.com/kan-bayashi/ParallelWaveGAN

This repository supports the following models:

Parallel WaveGAN
MelGAN
Multi-band MelGAN

The usage of pretrained models is described here:
https://github.com/kan-bayashi/ParallelWaveGAN#how-to-use-pretrained-models
https://github.com/kan-bayashi/ParallelWaveGAN#decoding-with-espnet-tts-models-features
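
Conceptually, combining a TTS model with a vocoder is a two-stage pipeline: text to mel-spectrogram, then mel-spectrogram to waveform. The sketch below only illustrates that interface. `dummy_text2mel`, `dummy_vocoder`, `synthesize`, and the `HOP_SIZE` value are all hypothetical stand-ins, not the ESPnet or ParallelWaveGAN API; see the links above for the real usage.

```python
from typing import Callable, List

Mel = List[List[float]]  # one inner list per frame: shape (n_frames, n_mels)

N_MELS = 80     # feature size mentioned in this thread
HOP_SIZE = 256  # assumed samples per frame; depends on the actual model config


def dummy_text2mel(text: str) -> Mel:
    """Hypothetical stand-in for a TTS model: emits 5 frames per character."""
    return [[0.0] * N_MELS for _ in range(5 * len(text))]


def dummy_vocoder(mel: Mel) -> List[float]:
    """Hypothetical stand-in for a vocoder: HOP_SIZE silent samples per frame."""
    return [0.0] * (len(mel) * HOP_SIZE)


def synthesize(text: str,
               text2mel: Callable[[str], Mel],
               vocoder: Callable[[Mel], List[float]],
               n_mels: int = N_MELS) -> List[float]:
    """Chain the two stages, checking the feature size the vocoder expects."""
    mel = text2mel(text)
    if any(len(frame) != n_mels for frame in mel):
        raise ValueError(f"vocoder expects {n_mels} mel bands per frame")
    return vocoder(mel)


if __name__ == "__main__":
    wav = synthesize("hello", dummy_text2mel, dummy_vocoder)
    print(len(wav))  # number of output samples = n_frames * HOP_SIZE
```

The check only covers the number of mel bands. In practice, the other feature settings (sampling rate, hop size, normalization) presumably need to match between the two models as well.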

You can also create your own vocoder recipe:
https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/egs/README.md#how-to-make-the-recipe-for-your-own-dateset

kan-bayashi on 9 Oct 2020

Thanks for your reply!

ymzlygw on 10 Oct 2020
