Espnet: Is MoL Wavenet able to decode in real-time on GPU?

Created on 11 Jan 2020 · 3 Comments · Source: espnet/espnet

Hi, thank you so much for this fantastic repo, and I apologize if this is an ignorant question. On my GTX 1070 Ti, the pre-trained wavenet.mol.v1 models don't decode mel-spectrograms faster than real time. I may have done something wrong when running it (though I'm fairly certain it is running on the GPU rather than the CPU), but is this performance typical? The only speed benchmarks I could find are those in the original Parallel WaveNet paper (https://arxiv.org/abs/1711.10433). Would you happen to know whether this vocoder inference speed is normal, and do you have any GPU speed benchmarks?

panademo

All 3 comments

Short answer: No.

Decoding with MoL WaveNet involves autoregressive sampling, so it is inevitably slow on both CPU and GPU. The real-time factor (RTF; the time in seconds needed to generate 1 second of audio) for WaveNet is normally around 180 to 300. If you need real-time waveform generation, you might want to try Parallel WaveGAN instead, which is also supported in the repo.
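To make the RTF definition above concrete, here is a minimal Python sketch. The function names `real_time_factor` and `measure_rtf`, and the 22,050 Hz sample rate in the example line, are illustrative assumptions and not part of ESPnet.

```python
import time


def real_time_factor(elapsed_sec, num_samples, sample_rate):
    """Seconds of compute spent per second of generated audio."""
    audio_sec = num_samples / sample_rate
    return elapsed_sec / audio_sec


def measure_rtf(generate, sample_rate):
    """Time one call to `generate`, a hypothetical callable returning a sequence of samples."""
    start = time.perf_counter()
    waveform = generate()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(waveform), sample_rate)


# Example: 5 s of compute for 22,050 samples at 22,050 Hz (1 s of audio).
print(real_time_factor(5.0, 22050, 22050))  # 5.0
```

An RTF below 1 means generation is faster than real time. When timing a PyTorch model on GPU, call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels are launched asynchronously and the measurement would otherwise be too optimistic.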

As for the GPU speed benchmarks, there's a comparison for (single Gaussian) WaveNet, ClariNet, and Parallel WaveGAN in https://arxiv.org/abs/1910.11480. Inference speed of single Gaussian WaveNet and MoL WaveNet is almost the same.

r9y9 on 11 Jan 2020

👍1

Our Parallel WaveGAN implementation inference speed is as follows:

```
on GPU (TITAN V)
2019-11-03 09:07:40,480 (decode:127) INFO: finished generation of 250 utterances (RTF = 0.016).

on CPU (Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz 16 threads)
2019-11-06 09:04:56,697 (decode:129) INFO: finished generation of 250 utterances (RTF = 0.734).
```
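As a rough way to read these RTF values, the speed-up over real time is simply the reciprocal of the RTF. The helper name `speedup_over_real_time` below is hypothetical; the RTF values are the ones quoted in this thread.

```python
def speedup_over_real_time(rtf):
    """Seconds of audio produced per second of compute (reciprocal of RTF)."""
    return 1.0 / rtf


# RTF values quoted in this thread.
for label, rtf in [
    ("Parallel WaveGAN, GPU (TITAN V)", 0.016),
    ("Parallel WaveGAN, CPU (16 threads)", 0.734),
    ("WaveNet, lower estimate", 180.0),
]:
    print(f"{label}: {speedup_over_real_time(rtf):.3f}x real time")
```

By this measure the GPU run is roughly 62 times faster than real time and the CPU run about 1.4 times faster, while a WaveNet RTF of 180 corresponds to well under 1% of real-time speed.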

You can try it with TTS models on Google Colab.
https://colab.research.google.com/github/espnet/notebook/blob/master/tts_realtime_demo.ipynb

kan-bayashi on 11 Jan 2020

👍2

Thank you so much for the fast and super helpful responses! I had previously seen your speed benchmarks in the README for Parallel WaveGAN, and was blown away by its performance in the samples. It really is SOTA. Congrats, and thanks once again for the info!

panademo on 11 Jan 2020

👍1
