Azure-docs: Use of SSML tags within text strings

Created on 1 Oct 2019 · 14 Comments · Source: MicrosoftDocs/azure-docs

Hi guys,

I created a Jupyter notebook based on the following documentation:
https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/quickstart-python-text-to-speech

The purpose of this notebook is to batch-convert multiple strings to speech, creating an audio file for each of them. Is it possible to use SSML tags within the text strings? For example, with an input string like this: "Welcome to < prosody pitch="high" >Microsoft Cognitive Services Text-to-Speech API.< /prosody >"

Thanks for your help!
Best, Timm

Document Details

⚠ Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.

ID: 6ba5bbd9-c69d-fbfe-6392-f5cea4ebfb08
Version Independent ID: 617eeeb5-0ad6-f54c-7d36-edeb2e6d9581
Content: Quickstart: Convert text-to-speech, Python - Speech Service - Azure Cognitive Services
Content Source: articles/cognitive-services/Speech-Service/quickstart-python-text-to-speech.md
Service: cognitive-services
Sub-service: speech-service
GitHub Login: @erhopf
Microsoft Alias: erhopf

Pri2 assigned-to-author cognitive-servicesvc product-question speech-servicsubsvc triaged

nonstoptimm

All 14 comments

Hey @nonstoptimm - In your case, you can use tags within your text input to add things like a pause or to adjust pitch. Since your text input is put into the XML and then flattened into a string, it should work just fine. If you run into any issues, let us know and we can help you troubleshoot.

Here's the list of supported SSML tags for the Speech service: https://docs.microsoft.com/azure/cognitive-services/speech-service/speech-synthesis-markup

erhopf on 1 Oct 2019

Hi @erhopf, thanks for your quick response!
Actually, I tried it by just inserting the text with the respective tags as mentioned above. The tags are read aloud in the TTS result, so it does not work with this configuration. As far as I understand, the string variable “self.tts” is XML-encoded when it’s put into the XML. Do you have a hint on how to solve this?
Best, Timm

nonstoptimm on 4 Oct 2019

Moving this up - would it be possible to give some hint or feedback on this :)?

nonstoptimm on 20 Nov 2019

@nonstoptimm - Sorry for the delay. I'm going to get someone to look at this for you ;).

erhopf on 20 Nov 2019

Up - were you able to find somebody who can take a look at this :)?

nonstoptimm on 13 Jan 2020

Hi @nonstoptimm,

What have you done to troubleshoot this thus far? Are there any error responses returned from the REST call? Are you able to capture and output the request body before sending, to verify it for correctness?

Also, I've noticed in your input that you shared here, that there are spaces around the element names... this is actually invalid in XML.

Perhaps you could try this instead (notice that I've removed the spaces around the prosody element name):

Welcome to <prosody pitch="high">Microsoft Cognitive Services Text-to-Speech API.</prosody>

Finally, you might need to escape the double quotes in this string as well. But first, try the initial suggestion above. Only as a last resort, attempt it like the following:

Welcome to <prosody pitch=&quot;high&quot;>
Microsoft Cognitive Services Text-to-Speech API.</prosody>

IEvangelist on 13 Jan 2020

assign IEvangelist

IEvangelist on 13 Jan 2020

Hi IEvangelist,
thanks for your quick response. Within my script I used valid XML of course, without spaces (I just faced some formatting issues within this editor, as I did not use the code formatting ;)).
I will try your suggestions and provide feedback! Best, Timm

nonstoptimm on 13 Jan 2020

Hi again, unfortunately, the suggestions did not fix my problem. The request bodies look as follows (tried both options):
String:
Welcome to <prosody pitch="high">Microsoft Cognitive Services Text-to-Speech API.</prosody>
Body:
b'<speak version="1.0" xml:lang="en-US"><voice name="Microsoft Server Speech Text to Speech Voice (en-US, GuyNeural)">Welcome to &lt;prosody pitch="high"&gt;Microsoft Cognitive Services Text-to-Speech API.&lt;/prosody&gt;</voice></speak>'

String:
Welcome to <prosody pitch=&quot;high&quot;>Microsoft Cognitive Services Text-to-Speech API.</prosody>
Body:
b'<speak version="1.0" xml:lang="en-US"><voice name="Microsoft Server Speech Text to Speech Voice (en-US, GuyNeural)">Welcome to &lt;prosody pitch=&amp;quot;high&amp;quot;&gt;Microsoft Cognitive Services Text-to-Speech API.&lt;/prosody&gt;</voice></speak>'

Do you see anything noticeable here? After checking with the documentation the request bodies look ok! Thanks - really appreciate your help!

nonstoptimm on 13 Jan 2020
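
A note for readers following along: assuming the notebook, like the quickstart, assigns the input string to `voice.text` and serializes the tree with `ElementTree.tostring`, the angle brackets of any tags in the string are XML-escaped, so the service receives them as literal text rather than markup. The sketch below reproduces that step in isolation. The function name `build_body_as_text` and the voice name `ExampleVoice` are placeholders for illustration only.

```python
from xml.etree import ElementTree

# Namespace URI that ElementTree serializes with the built-in "xml" prefix.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"


def build_body_as_text(tts, voice_name):
    """Build the request body by assigning the input as plain text content."""
    speak = ElementTree.Element("speak", version="1.0")
    speak.set(XML_NS + "lang", "en-US")
    voice = ElementTree.SubElement(speak, "voice")
    voice.set("name", voice_name)
    # Text content is data, not markup: "<", ">" and "&" are escaped on output.
    voice.text = tts
    return ElementTree.tostring(speak)


body = build_body_as_text(
    'Welcome to <prosody pitch="high">Text-to-Speech.</prosody>',
    "ExampleVoice",  # placeholder voice name
)
print(body)
```

Printing the body shows `&lt;prosody ...&gt;` where the tag was, which would explain why the tags are read aloud.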

Yes, the escaped < and > (&lt; and &gt;) seem to be a clue in the first example body... It appears that we should be using the fromstring option here if the text has XML in it.

If you replace this line with the following, does it work?

voice.text = ElementTree.fromstring(self.tts)

IEvangelist on 13 Jan 2020

The request itself is successful, but the string part of the body is empty when using the following string:
<prosody pitch='high'>Welcome to Microsoft Cognitive Services Text-to-Speech API.</prosody>
b'<speak version="1.0" xml:lang="en-US"><voice name="Microsoft Server Speech Text to Speech Voice (de-DE, Hedda)" /></speak>'

In case the string does not start with an XML element, I receive a syntax error:

`
Traceback (most recent call last):

File "/anaconda/envs/py36/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3267, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)

File "", line 12, in
app.save_audio()

File "", line 33, in save_audio
voice.text = ElementTree.fromstring(self.tts)

File "/anaconda/envs/py36/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
parser.feed(text)

File "", line unknown
ParseError: syntax error: line 1, column 0
`

So unfortunately it doesn't fix it :/

nonstoptimm on 13 Jan 2020
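
For context on why this attempt fails: `ElementTree.fromstring` returns an `Element`, while `.text` is meant to hold a string, and the parser requires a single root element, so input that starts with plain text is rejected. One possible workaround that keeps ElementTree, sketched here under those assumptions, is to parse the fragment inside a temporary root and move its text and children under `voice`. The names `build_body_from_fragment` and `wrapper`, and the voice name `ExampleVoice`, are hypothetical.

```python
from xml.etree import ElementTree

# Namespace URI that ElementTree serializes with the built-in "xml" prefix.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"


def build_body_from_fragment(fragment, voice_name):
    """Build the request body, keeping SSML tags in the fragment as markup."""
    speak = ElementTree.Element("speak", version="1.0")
    speak.set(XML_NS + "lang", "en-US")
    voice = ElementTree.SubElement(speak, "voice")
    voice.set("name", voice_name)
    # A temporary root lets the parser accept leading text and several tags.
    wrapper = ElementTree.fromstring("<wrapper>" + fragment + "</wrapper>")
    voice.text = wrapper.text      # plain text before the first tag, if any
    voice.extend(list(wrapper))    # child elements, including their tail text
    return ElementTree.tostring(speak)


body = build_body_from_fragment(
    'Welcome to <prosody pitch="high">Text-to-Speech.</prosody> Enjoy.',
    "ExampleVoice",  # placeholder voice name
)
print(body)
```

A malformed fragment still raises `ParseError`, which surfaces mistakes before any request is sent.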

Hi @nonstoptimm,

In discussing this with the engineering team, they were able to get this to work. Here is an example payload for the body of the HTTP POST:

<speak version="1.0" xml:lang="en-us"><voice xml:lang="en-US" name="Microsoft Server Speech Text to Speech Voice (en-US, Guy24KRUS)">Welcome to <prosody pitch="high">Microsoft Cognitive Services Text-to-Speech API.</prosody></voice></speak>

We suggest that you use string concatenation directly instead of an XML object. Below is a sample that generates the expected result (replace {SubscriptionKey} with your own key).

import requests
import time

class TextToSpeech(object):
    def __init__(self, subscription_key):
        self.subscription_key = subscription_key
        self.tts = '<speak version="1.0" xml:lang="en-us"><voice xml:lang="en-US" name="Microsoft Server Speech Text to Speech Voice (en-US, Guy24KRUS)">Welcome to <prosody pitch="high">Microsoft Cognitive Services Text-to-Speech API.</prosody></voice></speak>'
        self.timestr = time.strftime("%Y%m%d-%H%M")
        self.access_token = None

    def get_token(self):
        fetch_token_url = "https://westus.api.cognitive.microsoft.com/sts/v1.0/issueToken"
        headers = {
            'Ocp-Apim-Subscription-Key': self.subscription_key
        }
        response = requests.post(fetch_token_url, headers=headers)
        self.access_token = str(response.text)

    def save_audio(self):
        base_url = 'https://westus.tts.speech.microsoft.com/'
        path = 'cognitiveservices/v1'
        constructed_url = base_url + path
        headers = {
            'Authorization': 'Bearer ' + self.access_token,
            'Content-Type': 'application/ssml+xml',
            'X-Microsoft-OutputFormat': 'riff-24khz-16bit-mono-pcm',
            'User-Agent': 'YOUR_RESOURCE_NAME'
        }

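        # Log the SSML being sent, then post it as UTF-8 bytes so that
        # non-ASCII characters in the text are transmitted correctly.
        print(self.tts)

        response = requests.post(constructed_url, headers=headers, data=self.tts.encode('utf-8'))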
        if response.status_code == 200:
            with open('sample-' + self.timestr + '.wav', 'wb') as audio:
                audio.write(response.content)
                print("\nStatus code: " + str(response.status_code) +
                      "\nYour TTS is ready for playback.\n")
        else:
            print("\nStatus code: " + str(response.status_code) +
                  "\nSomething went wrong. Check your subscription key and headers.\n")

if __name__ == "__main__":
    subscription_key = "{SubscriptionKey}"
    app = TextToSpeech(subscription_key)
    app.get_token()
    app.save_audio()

IEvangelist on 22 Jan 2020
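
For the batch use case from the original question, the string-concatenation approach can be wrapped in a small helper. This is only a sketch: `wrap_ssml`, `VOICE`, and the `texts` dictionary are hypothetical names, and the well-formedness check with `ElementTree.fromstring` is an added assumption rather than part of the sample above. No request is sent here; each resulting string would be posted the same way as `self.tts`.

```python
from xml.etree import ElementTree
from xml.sax.saxutils import quoteattr

# Voice name taken from the sample payload above.
VOICE = "Microsoft Server Speech Text to Speech Voice (en-US, Guy24KRUS)"


def wrap_ssml(fragment, voice_name, lang="en-US"):
    """Wrap an SSML fragment in <speak>/<voice> by plain string formatting."""
    ssml = (
        '<speak version="1.0" xml:lang={lang}>'
        "<voice xml:lang={lang} name={name}>{body}</voice>"
        "</speak>"
    ).format(lang=quoteattr(lang), name=quoteattr(voice_name), body=fragment)
    # Parsing only checks well-formedness; it raises ParseError on bad markup.
    ElementTree.fromstring(ssml)
    return ssml


# Hypothetical batch: output file stem -> text that may contain SSML tags.
texts = {
    "welcome": 'Welcome to <prosody pitch="high">Text-to-Speech.</prosody>',
    "goodbye": 'Goodbye <break time="500ms"/> and thanks.',
}
bodies = {stem: wrap_ssml(text, VOICE) for stem, text in texts.items()}

for stem, ssml in bodies.items():
    print(stem, ssml)
```

Because the fragment is inserted as-is, literal `&` or `<` characters in plain text need to be escaped by the caller, otherwise the check raises `ParseError`.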

Hi @IEvangelist - works like a charm - thank you so much for your support and efforts!

nonstoptimm on 23 Jan 2020
