Hello, I trained Transformer, FastSpeech, and Tacotron 2 (NVIDIA's implementation). All of them struggle with long-sentence synthesis: the speaker starts mumbling once the length of the synthesized audio exceeds the maximum audio length in the training set. I know that cutting the original sentence into slices helps, but that looks like a workaround to me, not a solution :D
This is the predicted mel-spectrogram: the first 10 seconds are good, but from that point on it's a disaster.
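For reference, the slicing workaround mentioned above could look roughly like the sketch below. It is only an illustration: the function name `split_for_synthesis` and the `max_chars` limit are hypothetical and not part of any toolkit, and a character budget is just a rough proxy for audio length.

```python
import re


def split_for_synthesis(text, max_chars=100):
    """Greedily pack punctuation-delimited pieces into chunks of <= max_chars.

    Hypothetical helper: a single piece longer than max_chars is kept as its
    own (oversized) chunk rather than being cut mid-word.
    """
    # Split on whitespace that follows punctuation, keeping the punctuation.
    pieces = [p.strip() for p in re.split(r"(?<=[.!?;,])\s+", text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if len(candidate) <= max_chars:
            current = candidate  # still fits, keep growing this chunk
        else:
            if current:
                chunks.append(current)
            current = piece  # start a new chunk
    if current:
        chunks.append(current)
    return chunks


chunks = split_for_synthesis(
    "Hello there. This is a test, and it is long. Bye!", max_chars=20
)
```

Each chunk would then be synthesized separately and the resulting waveforms concatenated, possibly with a short silence in between. Prosody across chunk boundaries is lost, which is why this remains a workaround.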

Are there any ideas? I'm really looking forward to hearing some suggestions.
Thanks!
Did you try forward attention in Tacotron 2?
I've never tried generating very long sentences, but it is worth a try.
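To illustrate why forward attention may help here, below is a minimal NumPy sketch of one decoder step, based on my understanding of the recursion in the forward attention paper (Zhang et al., 2018). It is not the code of any toolkit, it omits the optional transition agent, and the names `forward_attention_step`, `prev_alpha`, and `y` are made up for this example.

```python
import numpy as np


def forward_attention_step(prev_alpha, y, eps=1e-8):
    """One step of the forward attention recursion (sketch).

    prev_alpha: forward attention weights from the previous decoder step, shape (N,)
    y:          ordinary (e.g. softmax) attention weights at this step, shape (N,)
    """
    # Mass at position n can only come from position n (stay) or n-1 (advance).
    shifted = np.concatenate(([0.0], prev_alpha[:-1]))
    alpha = (prev_alpha + shifted) * y
    # Renormalize; eps guards against division by zero.
    return alpha / max(alpha.sum(), eps)


# Start aligned to the first input token.
alpha0 = np.array([1.0, 0.0, 0.0, 0.0])
# The raw attention tries to jump to the last token...
y = np.array([0.1, 0.1, 0.1, 0.7])
# ...but the recursion only allows "stay" or "move one step forward".
alpha1 = forward_attention_step(alpha0, y)
```

Because the weights can only stay or advance by one input position per step, the alignment is forced to be monotonic, which is the property that tends to break down on inputs longer than those seen in training.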
Recently, researchers at Google proposed dynamic convolution attention to avoid this kind of issue.
This paper might be interesting for you.
https://arxiv.org/abs/1910.10288
We plan to implement this feature in v.0.6.0 (#1329).
I consider this a fundamental problem that we need to address in the future. Can we add an appropriate label to the issue?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue is closed. Please re-open if needed.