Espnet: Reproduce SOTA TTS result on LJspeech

Created on 12 Mar 2020 · 4 Comments · Source: espnet/espnet

I want to reproduce the SOTA TTS result on LJspeech, which is based on Transformer.v3 and has a MOS of 4.25.
Does "train_pytorch_transformer.v3.yaml" correspond to the SOTA model configuration? I notice that there is also "train_pytorch_transformer.v3.single.yaml"; is this config for single-GPU training? How many GPUs should I use, and do I have to modify these yaml files to reproduce the SOTA TTS result on LJspeech? (Also, the provided pretrained model config is a little different from "train_pytorch_transformer.v3.yaml"; which one is closer to the posted result?)
BTW, which vocoder did you use?
Thanks!

Question TTS

All 4 comments

Hi @Syrup274. I will answer your questions.

  • The model used in the paper is this result. You can access the model and samples. This model is trained with trans_type=phn, train_pytorch_transformer.v3.yaml, and full band mel (fmin=0 fmax=11025).
  • The vocoder used in the paper is MoL WaveNet. You can check the other one from here.
  • The difference between train_pytorch_transformer.v3.yaml and train_pytorch_transformer.v3.single.yaml is batch_size. The *.single.yaml config can be run on a single GPU with 12 GB of memory. Since we use gradient accumulation, the results with both configs should theoretically be the same.
  • In current master, we use limited-band mel (89-7600) and we can select phn or char for both Transformer and Tacotron 2. Therefore, the quality of Tacotron 2 and the Transformer is almost the same.
  • You can check the samples and pretrained models of all models from here.
  • If you want to try it online, play with our demo notebook.
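To illustrate why gradient accumulation should give the same result in theory, here is a minimal sketch in plain Python. It is not ESPnet code: the toy model (y ≈ w·x), the data, and the function names are all made up for illustration, and it assumes the batch splits into equal-size micro-batches.

```python
import math


def summed_grad(w, batch):
    """Sum over samples of d/dw (w*x - y)^2 for a toy model y ~ w*x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)


def full_batch_grad(w, data):
    """Mean gradient computed over the whole batch in one pass."""
    return summed_grad(w, data) / len(data)


def accumulated_grad(w, data, accum_grad):
    """Mean gradient built from `accum_grad` equal-size micro-batches.

    Each micro-batch contributes its own mean gradient scaled by
    1 / accum_grad, which is how accumulation is usually normalized.
    """
    assert len(data) % accum_grad == 0, "sketch assumes equal-size micro-batches"
    size = len(data) // accum_grad
    total = 0.0
    for i in range(accum_grad):
        chunk = data[i * size:(i + 1) * size]
        total += (summed_grad(w, chunk) / len(chunk)) / accum_grad
    return total


# Hypothetical toy data: 12 (x, y) pairs.
data = [(float(i), 3.0 * i + 1.0) for i in range(12)]
w = 0.5

g_full = full_batch_grad(w, data)
g_accum = accumulated_grad(w, data, accum_grad=3)
print(math.isclose(g_full, g_accum, rel_tol=1e-9))
```

The two gradients agree up to floating-point rounding here. In real training, micro-batches of unequal size or any batch-dependent layers can introduce small differences, which is presumably why the reply says "theoretically".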

Thanks @kan-bayashi for your reply.

Three more questions:

  1. How many GPUs did you use with "train_pytorch_transformer.v3.yaml"? (In model.json of the pretrained model it seems to be 2, but it seems to be 3 according to the batch size.) Can I use more GPUs?
  2. I didn't find "train_pytorch_transformer.v3.yaml" in tag v0.5.3. Do I have to check out v0.5.4 or just modify the scripts on master?
  3. I notice some differences in "decode.yaml" between master and v0.5.3. Do these extra lines affect the final result?

Thanks again for your patience.

How many GPUs did you use in "train_pytorch_transformer.v3.yaml"? (in model.json of pretrained model it seems to be 2, but it seems to be 3 according to batch size). Can I use more GPUs?

Three GPUs. You can accelerate the training with 6 GPUs by setting accum_grad: 1 in the config.
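The arithmetic behind that suggestion can be sketched as follows. This is only an illustration under stated assumptions: it assumes the effective batch per optimizer step scales as per-GPU batch × number of GPUs × accum_grad, that the 3-GPU setup uses accum_grad: 2 (inferred from the reply, not confirmed), and the per-GPU batch size of 16 is a made-up placeholder rather than the value in the config.

```python
def effective_batch(batch_per_gpu, ngpu, accum_grad):
    """Samples contributing to one optimizer step (assumed scaling rule)."""
    return batch_per_gpu * ngpu * accum_grad


# Hypothetical per-GPU batch size, for illustration only.
BATCH_PER_GPU = 16

three_gpu = effective_batch(BATCH_PER_GPU, ngpu=3, accum_grad=2)  # assumed original setup
six_gpu = effective_batch(BATCH_PER_GPU, ngpu=6, accum_grad=1)    # setup suggested in the reply

print(three_gpu, six_gpu, three_gpu == six_gpu)
```

Under these assumptions both setups see the same number of samples per update, while the 6-GPU run needs half as many forward/backward passes per update.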

I didn't find "train_pytorch_transformer.v3.yaml" in tag v0.5.3. Do I have to checkout to v0.5.4 or just modify the scripts on master?

Use v0.5.4, or modify the parameters fmin and fmax in run.sh on current master.

I notice some differences in "decode.yaml" between master and v0.5.3. Do these extra lines affect the final result?

https://github.com/espnet/espnet/blob/9e2bfc5cdecbb8846f5c6cb26d22010b06e98c40/egs/ljspeech/tts1/conf/decode.yaml#L5-L7
The options above are available only for Tacotron 2, so they have no effect on Transformer-TTS.

Thanks for your reply.
