Hi, I see that the ESPnet2 TTS Colab uses pretrained TTS models from the "espnet model zoo", and I also found detailed usage instructions for the "espnet model zoo". Is there something similar for vocoders? I mean, how can I find out which vocoders ESPnet supports and how to combine them with the TTS models? Is there any tutorial document?
ESPnet will support any vocoder that takes an 80-band mel-spectrogram as input. Alternatively, you can rescale the output spectrogram to a linear scale. That covers almost every vocoder in the world. There are lots of options, including:
- WaveNet - Google's original model. It has the highest fidelity but is slow.
- WaveGlow - Nvidia's approach to making WaveNet faster, but it doesn't sound as good.
- MelGAN and Parallel WaveGAN - You have seen them in the demo.
What you can't use is Mozilla's LPCNet, which uses Bark-scale features with pitch components. You cannot really rescale to that; you can only train your text-to-spectrogram model to produce the required scale.
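In practice, "uses an 80-band mel-spectrogram" usually also means the TTS model and the vocoder were trained with the same feature-extraction settings. Below is a minimal sketch of that check in plain Python. The key names (`sampling_rate`, `n_mels`, and so on) and the example values are assumptions for illustration only; they are not read from any real ESPnet or vocoder config file, so map them to whatever your configs actually call these settings.

```python
from typing import Dict, List

# Settings that generally need to agree between the text-to-spectrogram model
# and the vocoder. These key names are hypothetical, chosen for illustration.
KEYS_TO_MATCH = ("sampling_rate", "n_mels", "n_fft", "hop_length", "fmin", "fmax")


def find_mismatches(tts_conf: Dict[str, int], vocoder_conf: Dict[str, int]) -> List[str]:
    """Return one human-readable message per setting that differs."""
    problems = []
    for key in KEYS_TO_MATCH:
        tts_value = tts_conf.get(key)
        vocoder_value = vocoder_conf.get(key)
        if tts_value != vocoder_value:
            problems.append(f"{key}: TTS={tts_value!r}, vocoder={vocoder_value!r}")
    return problems


# Example values only (assumed, not taken from a real recipe).
tts_conf = {"sampling_rate": 22050, "n_mels": 80, "n_fft": 1024,
            "hop_length": 256, "fmin": 80, "fmax": 7600}
# Same settings except for a different hop length.
vocoder_conf = dict(tts_conf, hop_length=300)

for message in find_mismatches(tts_conf, vocoder_conf):
    print("mismatch ->", message)
```

If the list comes back empty, the two models at least agree on the feature layout; it does not check normalization statistics, which may also need to match.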
Thanks for your reply!
Thank you for your answer, @shigabeev.
As @shigabeev said, you can combine ESPnet with any vocoder that uses a mel-spectrogram as input.
Officially, we use my repository below (it is the one used in the demo):
https://github.com/kan-bayashi/ParallelWaveGAN
This repository supports the following models:
- Parallel WaveGAN
- MelGAN
- Multi-band MelGAN
You can find the usage of the pretrained models here:
https://github.com/kan-bayashi/ParallelWaveGAN#how-to-use-pretrained-models
https://github.com/kan-bayashi/ParallelWaveGAN#decoding-with-espnet-tts-models-features
You can also create your own vocoder recipe:
https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/egs/README.md#how-to-make-the-recipe-for-your-own-dateset
Thanks! I'll try it.