Espnet: Reproduce SOTA TTS result on LJspeech

Created on 12 Mar 2020 · 4 Comments · Source: espnet/espnet

I want to reproduce the SOTA TTS result on LJSpeech, which is based on Transformer.v3 and has a MOS of 4.25.
Does "train_pytorch_transformer.v3.yaml" correspond to the SOTA model configuration? I notice that there is also "train_pytorch_transformer.v3.single.yaml". Is this config for single-GPU training? How many GPUs should I use, and do I have to modify these YAML files to reproduce the SOTA TTS result on LJSpeech? (Also, the config of the provided pretrained model is a little different from "train_pytorch_transformer.v3.yaml". Which one is closer to the posted result?)
BTW, which vocoder did you use?
Thanks!

Question TTS

Syrup274

All 4 comments

Hi @Syrup274. I will answer your questions.

The model used in the paper is this result. You can access the model and samples. This model was trained with trans_type=phn, train_pytorch_transformer.v3.yaml, and full-band mel (fmin=0, fmax=11025).
The vocoder used in the paper is MoL WaveNet. You can check the other one from here.
The difference between train_pytorch_transformer.v3.yaml and train_pytorch_transformer.v3.single.yaml is batch_size. The *.single.yaml config can be run on a single GPU with 12 GB of memory. Since we use gradient accumulation, the results with both configs should theoretically be the same.
In current master, we use limited-band mel (89-7600), and we can select phn or char for both Transformer and Tacotron 2. Therefore, the quality of Tacotron 2 and the Transformer is almost the same.
You can check the samples and pretrained models of all models from here.
If you want to try it online, play with our demo notebook.
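
To illustrate why gradient accumulation should give the same result in theory, here is a minimal sketch in plain Python. It is not ESPnet code: the toy model (y = w * x), the data, and the helper names `grad_sum_sq_error` and `accumulated_grad` are all made up for illustration. It assumes a sum-reduced loss; with a mean-reduced loss the accumulated gradient would need rescaling, and real training can still differ slightly because of floating-point ordering.

```python
# Toy check: accumulating gradients over micro-batches reproduces the
# gradient of one large batch (sum-reduced squared error, model y = w * x).

def grad_sum_sq_error(w, batch):
    """Gradient of sum((w*x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)


def accumulated_grad(w, data, micro_batch_size):
    """Sum per-micro-batch gradients, as done before a single optimizer step."""
    total = 0.0
    for start in range(0, len(data), micro_batch_size):
        total += grad_sum_sq_error(w, data[start:start + micro_batch_size])
    return total


# Hypothetical (x, y) pairs, chosen only for this illustration.
data = [(1.0, 2.0), (2.0, 3.5), (3.0, 6.5), (4.0, 8.0), (5.0, 9.0), (6.0, 12.5)]
w = 0.5

full = grad_sum_sq_error(w, data)                       # one batch of 6
accum = accumulated_grad(w, data, micro_batch_size=2)   # 3 micro-batches of 2

print(abs(full - accum) < 1e-9)
```

The same reasoning is what lets a small-memory setup stand in for a larger one: several small backward passes are summed before one parameter update.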

kan-bayashi on 13 Mar 2020

Thanks @kan-bayashi for your reply.

Another three questions:

1. How many GPUs did you use with "train_pytorch_transformer.v3.yaml"? (In model.json of the pretrained model it seems to be 2, but according to the batch size it seems to be 3.) Can I use more GPUs?
2. I didn't find "train_pytorch_transformer.v3.yaml" in tag v0.5.3. Do I have to check out v0.5.4, or can I just modify the scripts on master?
3. I notice some differences in "decode.yaml" between master and v0.5.3. Do these extra lines affect the final result?

Thanks again for your patience.

Syrup274 on 13 Mar 2020

How many GPUs did you use in "train_pytorch_transformer.v3.yaml"? (in model.json of pretrained model it seems to be 2, but it seems to be 3 according to batch size). Can I use more GPUs?

Three GPUs. You can accelerate the training with 6 GPUs by setting accum_grad: 1 in the config.
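
As a rough sketch of the arithmetic behind this advice: if the number of mini-batches contributing to one update is the product of GPU count and accum_grad, then keeping that product fixed keeps the effective batch size fixed. The helper names below are hypothetical, and the target value of 6 is only inferred from the statement that 6 GPUs pair with accum_grad: 1. The resulting values are derived, not read from the YAML files, so check the configs for the actual settings.

```python
def effective_batch_factor(ngpu, accum_grad):
    """Number of per-GPU mini-batches that contribute to one optimizer update."""
    return ngpu * accum_grad


def accum_grad_for(target_factor, ngpu):
    """accum_grad that keeps the effective batch factor at target_factor."""
    if ngpu <= 0 or target_factor % ngpu != 0:
        raise ValueError("ngpu must be positive and divide target_factor evenly")
    return target_factor // ngpu


# Assumption: 6 GPUs with accum_grad 1 matches the reference setup.
TARGET = effective_batch_factor(ngpu=6, accum_grad=1)

for ngpu in (1, 2, 3, 6):
    print(ngpu, accum_grad_for(TARGET, ngpu))
```

Under this assumption, more GPUs mean fewer accumulation steps per update, which is where the speed-up comes from.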

I didn't find "train_pytorch_transformer.v3.yaml" in tag v0.5.3. Do I have to check out v0.5.4, or can I just modify the scripts on master?

Use v.0.5.4, or modify the parameters fmin and fmax in run.sh on the current master.

I notice some differences in "decode.yaml" between master and v0.5.3. Do these extra lines affect the final result?

https://github.com/espnet/espnet/blob/9e2bfc5cdecbb8846f5c6cb26d22010b06e98c40/egs/ljspeech/tts1/conf/decode.yaml#L5-L7
The options above are available only for Tacotron 2, so they have no effect on Transformer-TTS.

kan-bayashi on 13 Mar 2020

Thanks for your reply.

Syrup274 on 14 Mar 2020
