Essentials: Text-To-Speech APIs

Created on 24 Feb 2018 · 12 Comments · Source: xamarin/Essentials

Make devices speak arbitrary text

TextToSpeech

public static class TextToSpeech
{
    public static int MaxSpeechInputLength { get; }

    public static Task SpeakAsync (string text, CancellationToken cancelToken = default(CancellationToken));
    public static Task SpeakAsync (string text, SpeakSettings settings, CancellationToken cancelToken = default(CancellationToken));

    public static Task<IEnumerable<Locale>> GetLocalesAsync ();
}

SpeakSettings (struct)

public struct SpeakSettings
{
    public Locale Locale;
    public float? Pitch;
    public float? SpeakRate;
    public float? Volume;
}

Locale

public class Locale
{
    public string Language { get; }
    public string Country { get; }
    public string Name { get; }
}

in-progress

Redth

All 12 comments

The table above tries to show what each platform does. On Android, "1.0" is normal and other values act as multipliers. iOS has specific ranges and defaults. UWP follows SSML and appears to support an "enum"-based method as well as a flat percentage-based method. We could use the "enum" values and map them to each platform's restrictions, or we could go percentage-based and clip values when they exceed the range for a particular platform.

[Image: tts-table]
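
A rough sketch of the percentage-based option with clipping follows. The `FromPercentage` helper and the numeric range used here are hypothetical placeholders for illustration; they are not taken from the table or from any platform SDK.

```csharp
using System;

// Hypothetical helper: scale a platform's "normal" value by a percentage,
// then clip the result to that platform's allowed range.
static float FromPercentage(float percent, float min, float max, float normal)
{
    float value = normal * (percent / 100f);
    return Math.Max(min, Math.Min(max, value));
}

// Placeholder range (0.5 to 2.0, normal 1.0); real limits differ per platform.
Console.WriteLine(FromPercentage(100f, 0.5f, 2.0f, 1.0f)); // 1 (unchanged)
Console.WriteLine(FromPercentage(400f, 0.5f, 2.0f, 1.0f)); // 2 (clipped to max)
```

The same call with a different `min`, `max` and `normal` would cover each platform, so the public API could stay percentage-based while the clipping stays internal.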

Redth on 24 Feb 2018

Any reason for creating a new class (Locale) instead of using CultureInfo?

alfredmyers on 9 Mar 2018

This is a great point, @alfredmyers. I suspect it came from the old plugin, which may have done it this way for reasons that predate netstandard.

We should change this.

Redth on 9 Mar 2018

👍1

I think the difference is that the codes each engine uses for culture info may not match the .NET codes. However, we would have to validate this.

jamesmontemagno on 9 Mar 2018

If that's the case it's plausible we could write a mapping. That could be a useful API regardless of text to speech.

Redth on 9 Mar 2018

👍1

As an example of what @jamesmontemagno said, even within a single OS, different TTS engines can return locales in different formats.

For instance, on Android, Brazilian Portuguese is returned as:

"pt-BR" on Google's TTS
"por-BRA" on Samsung's TTS

This is especially important if you're going to check whether a language is supported by querying the proposed GetLocalesAsync method.
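
To make the mapping idea concrete, here is a minimal sketch that normalizes both of the formats above to a single tag. The `NormalizeTag` helper and its two lookup tables are hypothetical and deliberately incomplete; a real mapping would need full ISO 639 and ISO 3166 coverage and validation against what each engine actually returns.

```csharp
using System;
using System.Collections.Generic;

// Illustrative, incomplete tables: three-letter to two-letter ISO codes.
var languages = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    ["por"] = "pt", ["eng"] = "en", ["deu"] = "de",
};
var countries = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    ["BRA"] = "BR", ["USA"] = "US", ["DEU"] = "DE",
};

// Hypothetical helper: turn an engine-specific tag such as "por-BRA"
// into the two-letter form "pt-BR". Unknown codes pass through unchanged
// apart from casing.
string NormalizeTag(string engineTag)
{
    var parts = engineTag.Split('-', '_');
    var language = languages.TryGetValue(parts[0], out var l) ? l : parts[0].ToLowerInvariant();
    if (parts.Length < 2) return language;
    var country = countries.TryGetValue(parts[1], out var c) ? c : parts[1].ToUpperInvariant();
    return language + "-" + country;
}

Console.WriteLine(NormalizeTag("pt-BR"));   // pt-BR
Console.WriteLine(NormalizeTag("por-BRA")); // pt-BR
```

The normalized tag could then be handed to `new CultureInfo(tag)` if the API ends up exposing CultureInfo instead of a custom Locale type.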

alfredmyers on 29 Mar 2018

From what I could grasp going through the different TextToSpeech implementations in @jamesmontemagno's TextToSpeechPlugin, the only one that needs a MaxSpeechInputLength property is the implementation for Android.

If that is the case, and we could solve the issue by splitting the text from within the implementation for Android, would exposing MaxSpeechInputLength on the API still be necessary?

alfredmyers on 30 Mar 2018

👍1

Yes I love this idea!

Redth on 30 Mar 2018

While writing the code, we realised that splitting the text internally may be a very hard task. English is easy: we separate words with spaces and sentences with punctuation. But many non-English languages are quite different.

there may be languages that don't use punctuation
there may be languages that are backwards
there may be languages that have multi-byte characters

Android appears to be the only platform with a limit, but that doesn't mean we can't still expose it in a cross-platform way.

If we are to add a property to return the limit, we can have a rule that if there is _no_ limit, we return -1 and if there _is_ a limit, return that.

The Android source appears to have a limit of 4000 characters:
https://github.com/aosp-mirror/platform_frameworks_base/blob/b056324630b8adfeb38393bcab49f3b9c720f4fd/core/java/android/speech/tts/TextToSpeech.java#L2364-L2366

In addition, other speech engines may have other limits as seen here: https://stackoverflow.com/questions/19312536/android-tts-fails-to-speak-large-amount-of-text
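
As a small illustration of the proposed "-1 means no limit" rule, a caller-side check might look like this. `FitsInOneCall` is a hypothetical helper, and the 4000 passed below is only an example value for the limit.

```csharp
using System;

// Hypothetical helper following the proposed rule: a negative
// maxSpeechInputLength (-1) means the platform has no limit.
static bool FitsInOneCall(string text, int maxSpeechInputLength) =>
    maxSpeechInputLength < 0 || text.Length <= maxSpeechInputLength;

Console.WriteLine(FitsInOneCall("hello", -1));                 // True
Console.WriteLine(FitsInOneCall(new string('a', 5000), 4000)); // False
```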

mattleibow on 12 Apr 2018

@mattleibow

there may be languages that don't use punctuation

there may be languages that are backwards

there may be languages that have multi-byte characters

I have a prototype of a method that splits the string on the nearest punctuation mark or white space just before MaxSpeechInputLength.

I really don't have experience with RTL languages, but I still have a copy of Developing International Software lying around. Best case, if it is only a matter of iterating over the string in reverse order, I can take a look into it.

alfredmyers on 13 Apr 2018

@alfredmyers I did not have time to finish and test the whole SplitText method, but I ended up iterating from the buffer end in reverse order, searching for Char.IsPunctuation and Char.IsWhiteSpace. From that position the process is repeated.
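
The approach described here could be sketched roughly as follows. This is not the prototype from the thread: the `SplitText` signature, the hard-cut fallback and the choice to keep the break character in the preceding chunk are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical SplitText: cut `text` into chunks of at most `maxLength`
// chars, breaking right after the last punctuation mark or whitespace
// that fits in the current window.
static List<string> SplitText(string text, int maxLength)
{
    if (maxLength <= 0) throw new ArgumentOutOfRangeException(nameof(maxLength));

    var chunks = new List<string>();
    int start = 0;
    while (text.Length - start > maxLength)
    {
        int end = start + maxLength; // exclusive end of the window
        int cut = end;               // fallback: hard cut if no break is found
        // Scan backwards from the end of the window for a break character.
        for (int i = end - 1; i > start; i--)
        {
            if (char.IsPunctuation(text[i]) || char.IsWhiteSpace(text[i]))
            {
                cut = i + 1;         // keep the break character in this chunk
                break;
            }
        }
        chunks.Add(text.Substring(start, cut - start));
        start = cut;                 // repeat the process from the cut position
    }
    if (start < text.Length) chunks.Add(text.Substring(start));
    return chunks;
}

foreach (var chunk in SplitText("Hello world. How are you?", 12))
    Console.WriteLine($"[{chunk}]");
// [Hello world.]
// [ How are ]
// [you?]
```

Note that the hard-cut fallback works on UTF-16 code units, so it could split a surrogate pair, and nothing here is specific to RTL scripts; both would need checking against the concerns raised earlier in the thread.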

I really don't have experience with RTL languages,

My experience in that area is not huge, but I asked some people who did something with TTS in Persian. I have received an answer, but I am waiting for more info.

moljac on 13 Apr 2018

Hi guys, I'm a Persian speaker. @moljac told me that we have a problem with splitting RTL text.

I don't have experience with TTS, but I can help with issues.

To start, I'd like to point to a Persian TTS project called ParsKhan. This open source project was developed by Iranian programmers (and solved many RTL issues).

ParsKhan's code quality is not great (I refactored it somewhat), but we don't need the project's code, just its algorithms and scenarios.

ParsKhan has documentation, but it is in Persian (I can translate it if necessary).

ParsKhan Repo is here