ESPnet: Is MoL WaveNet able to decode in real time on GPU?

Created on 11 Jan 2020 · 3 Comments · Source: espnet/espnet

Hi, thank you so much for this fantastic repo, and I apologize if this is an ignorant question. On my GTX 1070 Ti, the pre-trained wavenet.mol.v1 models don't decode mels faster than real time. I may have done something wrong when running it (though I'm fairly certain it is running on GPU rather than CPU), but is it possible that this performance is typical? I couldn't find any speed benchmarks to answer this question, other than those in the original Parallel WaveNet paper (https://arxiv.org/abs/1711.10433). Would you happen to know whether this vocoder inference speed is normal, and do you have any GPU speed benchmarks?

Question

All 3 comments

Short answer: No.

Decoding with MoL WaveNet relies on autoregressive sampling, where each audio sample is generated only after the previous one, so it is inevitably slow on both CPU and GPU. The real-time factor (RTF; seconds of compute needed to generate 1 second of audio) for WaveNet is normally around 180 to 300. If you need real-time waveform generation, you might want to try Parallel WaveGAN instead, which is also supported in the repo.
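
To make the RTF definition concrete, here is a minimal sketch of how it could be computed. The helper names (`real_time_factor`, `timed_rtf`) and the `generate` callable are hypothetical and not part of ESPnet; the 22.05 kHz sample rate and the timings in the example are made-up values for illustration.

```python
import time


def real_time_factor(elapsed_sec, num_samples, sample_rate):
    """Seconds of compute spent per second of generated audio."""
    audio_sec = num_samples / sample_rate
    return elapsed_sec / audio_sec


def timed_rtf(generate, sample_rate):
    """Time a hypothetical generate() callable that returns a sequence of samples."""
    start = time.perf_counter()
    samples = generate()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples), sample_rate)


# Illustrative numbers only: 450 s of compute for 2 s of 22.05 kHz audio.
print(real_time_factor(450.0, 44100, 22050))  # 225.0
```

An RTF below 1 means faster than real time. When timing a PyTorch model on GPU, the device would likely need to be synchronized (for example with `torch.cuda.synchronize()`) before reading the clock, since CUDA kernels run asynchronously.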

As for GPU speed benchmarks, https://arxiv.org/abs/1910.11480 compares (single Gaussian) WaveNet, ClariNet, and Parallel WaveGAN. The inference speed of single Gaussian WaveNet and MoL WaveNet is almost the same.

The inference speed of our Parallel WaveGAN implementation is as follows:

```
on GPU (TITAN V)
2019-11-03 09:07:40,480 (decode:127) INFO: finished generation of 250 utterances (RTF = 0.016).

on the CPU (Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz 16 threads)
2019-11-06 09:04:56,697 (decode:129) INFO: finished generation of 250 utterances (RTF = 0.734).
```

You can try it with TTS models on Google Colab.
https://colab.research.google.com/github/espnet/notebook/blob/master/tts_realtime_demo.ipynb
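
For readers who prefer a speed-up figure, the reported RTFs can be inverted, taking RTF to mean seconds of compute per second of audio. This is only a sketch of the arithmetic; `speedup_over_real_time` is a hypothetical helper, and the two RTF values are copied from the logs above.

```python
def speedup_over_real_time(rtf):
    """How many seconds of audio are produced per second of compute."""
    return 1.0 / rtf


# RTF values taken from the decode logs quoted in this comment.
reported = [
    ("GPU (TITAN V)", 0.016),
    ("CPU (Xeon Gold 6154, 16 threads)", 0.734),
]

for device, rtf in reported:
    print(f"{device}: RTF {rtf} -> {speedup_over_real_time(rtf):.1f}x real time")
```

By this arithmetic the GPU run is roughly 62 times faster than real time, and the 16-thread CPU run is about 1.4 times faster.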

Thank you so much for the fast and super helpful responses! I had previously seen your speed benchmarks in the README for Parallel WaveGAN, and was blown away by its performance in the samples. It really is SOTA. Congrats, and thanks once again for the info!
