ESPnet: Speaker mumbling when synthesizing very long sentences

Created on 7 Nov 2019 · 4 Comments · Source: espnet/espnet

Hello, I trained Transformer, FastSpeech, and Tacotron 2 (NVIDIA) models, and all of them struggle with long-sentence synthesis: the speaker starts mumbling once the length of the synthesized audio exceeds the maximum audio length in the training set. I know that cutting the original sentence into slices helps, but that looks like a workaround to me, not a solution :D
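For reference, the slicing workaround can be sketched roughly as follows. This is only a minimal sketch under assumptions: `split_text` and `synthesize_long` are hypothetical helpers, not part of ESPnet, and `synthesize` stands in for whatever function turns a text string into a sequence of audio samples in your setup. The `max_chars` limit is also an assumption; in practice it would be chosen so that each chunk stays below the longest utterance seen in training.

```python
import re


def split_text(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own (over-long) chunk.
    """
    # Split on whitespace that follows sentence-final punctuation,
    # so the punctuation stays attached to its sentence.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def synthesize_long(text, synthesize, max_chars=200, pause_samples=0):
    """Synthesize each chunk separately and concatenate the audio samples.

    `synthesize` is a hypothetical callable: str -> sequence of float samples.
    `pause_samples` zero-valued samples are inserted between chunks.
    """
    audio = []
    for i, chunk in enumerate(split_text(text, max_chars)):
        if i > 0:
            audio.extend([0.0] * pause_samples)
        audio.extend(synthesize(chunk))
    return audio
```

Each chunk is synthesized independently, so prosody does not carry across chunk boundaries, which is part of why this feels like a workaround rather than a fix for the attention failure itself.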

This is the predicted mel spectrogram: the first 10 seconds are good, but from that point on it's a disaster.

Are there any ideas? I'm really looking forward to hearing some suggestions.
Thanks!

Question Stale TTS Wontfix

enamoria

All 4 comments

Did you try forward attention in Tacotron 2?
I've never tried to generate very long sentences, but it is worth a try.

Recently, researchers at Google proposed dynamic convolution attention to avoid this kind of issue.
This paper might be interesting for you:
https://arxiv.org/abs/1910.10288
We plan to implement this feature in v.0.6.0 (#1329).

kan-bayashi on 7 Nov 2019

👍1

I would like to consider this a fundamental problem we need to address in the future. Can we add a proper label to the issue?

r9y9 on 8 Nov 2019

👍1

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.