Hi guys,
I created a Jupyter notebook based on the following documentation:
https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/quickstart-python-text-to-speech
The purpose of this notebook is to batch-synthesize multiple strings, creating an audio file for each of them. Is it possible to use SSML tags within the text strings? Something like this as the input string: "Welcome to < prosody pitch="high" >Microsoft Cognitive Services Text-to-Speech API.< /prosody >"
Thanks for your help!
Best, Timm
Hey @nonstoptimm - In your case, you can use tags within your text input to add things like a pause or to adjust pitch. Since your text input is put into the XML and then flattened into a string, it should work just fine. If you run into any issues, let us know and we can help you troubleshoot.
Here's the list of supported SSML tags for the Speech service: https://docs.microsoft.com/azure/cognitive-services/speech-service/speech-synthesis-markup
Hi @erhopf, thanks for your quick response!
Actually, I tried it by just inserting the text with the respective tags as mentioned above. The tags are read aloud in the TTS result, so it does not work with this configuration. As far as I understand, the string variable “self.tts” is XML-encoded when it is put into the XML. Do you have a hint on how to solve this?
Best, Timm
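For anyone reproducing this, here is a minimal sketch of the behavior described above. It assumes the notebook builds the request body with xml.etree.ElementTree, as the quickstart does; the element layout and the voice name are taken from this thread, and nothing is sent to the service. Markup assigned to `.text` is treated as character data, so the angle brackets are escaped on serialization and the service receives the tags as literal text.

```python
from xml.etree import ElementTree

# Input string containing SSML markup, as in the question above.
tts = ('Welcome to <prosody pitch="high">Microsoft Cognitive Services '
       'Text-to-Speech API.</prosody>')

# Build the body the way the quickstart does (assumed layout).
speak = ElementTree.Element('speak', version='1.0')
voice = ElementTree.SubElement(speak, 'voice')
voice.set('name', 'Microsoft Server Speech Text to Speech Voice (en-US, GuyNeural)')

# Assigning to .text stores the string as plain character data...
voice.text = tts

# ...so serialization escapes "<" and ">" instead of emitting a <prosody> element.
body = ElementTree.tostring(speak)
print(body)
```

Printing the body before sending it makes the escaping visible.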
Moving this up - would it be possible to give some hint or feedback on this :)?
@nonstoptimm - Sorry for the delay. I'm going to get someone to look at this for you ;).
Up - were you able to find somebody who can take a look at this :)?
Hi @nonstoptimm,
What have you done to troubleshoot this thus far? Are there any error responses returned from the REST call? Are you able to capture and output the request body before sending, to verify it for correctness?
Also, I noticed that the input you shared here has spaces around the element names. This is invalid in XML.
Perhaps you could try this instead (notice that I've removed the spaces around the prosody element name):
Welcome to <prosody pitch="high">Microsoft Cognitive Services Text-to-Speech API.</prosody>
Finally, you might need to escape the double quotes in this string as well -- see this to do so. But first, try the initial suggestion above -- as a last resort attempt it like the following:
Welcome to <prosody pitch="high">
Microsoft Cognitive Services Text-to-Speech API.</prosody>
Hi IEvangelist,
thanks for your quick response. Within my script I used valid XML of course, without spaces (I just faced some formatting issues within this editor, as I did not use the code formatting ;)).
Will try your suggestions and provide feedback! Best, Timm
Hi again, unfortunately, the suggestions did not fix my problem. The request bodies look as follows (tried both options):
String:
Welcome to <prosody pitch="high">Microsoft Cognitive Services Text-to-Speech API.</prosody>
Body:
b'<speak version="1.0" xml:lang="en-US"><voice name="Microsoft Server Speech Text to Speech Voice (en-US, GuyNeural)">Welcome to &lt;prosody pitch="high"&gt;Microsoft Cognitive Services Text-to-Speech API.&lt;/prosody&gt;</voice></speak>'
String
Welcome to <prosody pitch="high">
Microsoft Cognitive Services Text-to-Speech API.</prosody>
Body:
b'<speak version="1.0" xml:lang="en-US"><voice name="Microsoft Server Speech Text to Speech Voice (en-US, GuyNeural)">Welcome to <prosody pitch=&quot;high&quot;>Microsoft Cognitive Services Text-to-Speech API.</prosody></voice></speak>'
Do you see anything noticeable here? After checking with the documentation the request bodies look ok! Thanks - really appreciate your help!
Yes, the &lt; and &gt; (the escaped forms of < and >) seem to be a clue in the first example body. It appears that we should be using the fromstring option here if the text has XML in it.
If you replace this line with the following, does it work?
voice.text = ElementTree.fromstring(self.tts)
The request itself is successful, but the string part of the body is empty when using the following string:
<prosody pitch='high'>Welcome to Microsoft Cognitive Services Text-to-Speech API.</prosody>
b'<speak version="1.0" xml:lang="en-US"><voice name="Microsoft Server Speech Text to Speech Voice (de-DE, Hedda)" /></speak>'
In case the string does not start with an XML element, I receive a syntax error:
`
Traceback (most recent call last):
File "/anaconda/envs/py36/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3267, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "
app.save_audio()
File "
voice.text = ElementTree.fromstring(self.tts)
File "/anaconda/envs/py36/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
parser.feed(text)
File "
ParseError: syntax error: line 1, column 0
`
So unfortunately it doesn't fix it :/
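One possible workaround that stays with ElementTree, offered only as a sketch and not as the fix this thread ends up with: `.text` is meant for strings, and `fromstring` needs a single root element, which is why input starting with plain text raises the ParseError above. Wrapping the fragment in a temporary root and then copying its leading text and child elements into the voice element avoids both problems. The helper name `set_ssml_fragment` and the `wrapper` element are hypothetical names chosen for illustration.

```python
from xml.etree import ElementTree


def set_ssml_fragment(parent, fragment):
    """Attach an SSML fragment (text mixed with tags) to `parent` as real XML.

    Hypothetical helper: the fragment is wrapped in a throwaway root element
    so that input starting with plain text still parses.
    """
    wrapper = ElementTree.fromstring('<wrapper>' + fragment + '</wrapper>')
    parent.text = wrapper.text      # text before the first child element
    parent.extend(list(wrapper))    # child elements, including their tail text


voice = ElementTree.Element(
    'voice', name='Microsoft Server Speech Text to Speech Voice (en-US, GuyNeural)')
set_ssml_fragment(
    voice,
    'Welcome to <prosody pitch="high">Microsoft Cognitive Services '
    'Text-to-Speech API.</prosody>')

body = ElementTree.tostring(voice)
print(body)
```

With this approach the prosody element is serialized as markup rather than as escaped text; malformed fragments still raise ParseError at the `fromstring` call.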
Hi @nonstoptimm,
After discussing this with the engineering team, they were able to get this to work. Here is an example payload for the body of the HTTP POST:
<speak version="1.0" xml:lang="en-us"><voice xml:lang="en-US" name="Microsoft Server Speech Text to Speech Voice (en-US, Guy24KRUS)">Welcome to <prosody pitch="high">Microsoft Cognitive Services Text-to-Speech API.</prosody></voice></speak>
We suggest that you use string concatenation directly instead of an XML object. Below is a sample that generates the expected result (you need to replace {SubscriptionKey} with your own key).
import requests
import time


class TextToSpeech(object):
    def __init__(self, subscription_key):
        self.subscription_key = subscription_key
        # The full SSML document is written as a plain string, so the
        # <prosody> tags are sent to the service unescaped.
        self.tts = '<speak version="1.0" xml:lang="en-us"><voice xml:lang="en-US" name="Microsoft Server Speech Text to Speech Voice (en-US, Guy24KRUS)">Welcome to <prosody pitch="high">Microsoft Cognitive Services Text-to-Speech API.</prosody></voice></speak>'
        self.timestr = time.strftime("%Y%m%d-%H%M")
        self.access_token = None

    def get_token(self):
        # Exchange the subscription key for an access token.
        fetch_token_url = "https://westus.api.cognitive.microsoft.com/sts/v1.0/issueToken"
        headers = {
            'Ocp-Apim-Subscription-Key': self.subscription_key
        }
        response = requests.post(fetch_token_url, headers=headers)
        self.access_token = str(response.text)

    def save_audio(self):
        base_url = 'https://westus.tts.speech.microsoft.com/'
        path = 'cognitiveservices/v1'
        constructed_url = base_url + path
        headers = {
            'Authorization': 'Bearer ' + self.access_token,
            'Content-Type': 'application/ssml+xml',
            'X-Microsoft-OutputFormat': 'riff-24khz-16bit-mono-pcm',
            'User-Agent': 'YOUR_RESOURCE_NAME'
        }
        # Print the request body for inspection before sending it.
        print(self.tts)
        response = requests.post(constructed_url, headers=headers, data=self.tts)
        if response.status_code == 200:
            # Write the returned audio to a timestamped .wav file.
            with open('sample-' + self.timestr + '.wav', 'wb') as audio:
                audio.write(response.content)
                print("\nStatus code: " + str(response.status_code) +
                      "\nYour TTS is ready for playback.\n")
        else:
            print("\nStatus code: " + str(response.status_code) +
                  "\nSomething went wrong. Check your subscription key and headers.\n")


if __name__ == "__main__":
    subscription_key = "{SubscriptionKey}"
    app = TextToSpeech(subscription_key)
    app.get_token()
    app.save_audio()
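For the batch case from the original question (one audio file per input string), the same string-concatenation idea could be wrapped in a small helper. This is only a sketch under assumptions: `build_ssml`, `VOICE`, and `texts` are hypothetical names, the voice and language values are copied from the sample above, and the parse step is an optional local check that is not part of the suggested sample. No request is sent here; each resulting body would be posted as `data=` the same way `save_audio` does.

```python
from xml.etree import ElementTree

# Voice name copied from the sample above; adjust to your own resource.
VOICE = "Microsoft Server Speech Text to Speech Voice (en-US, Guy24KRUS)"


def build_ssml(fragment, voice=VOICE, lang="en-US"):
    """Wrap an SSML fragment in <speak>/<voice> using plain string formatting."""
    ssml = (
        '<speak version="1.0" xml:lang="{lang}">'
        '<voice xml:lang="{lang}" name="{voice}">{fragment}</voice>'
        '</speak>'
    ).format(lang=lang, voice=voice, fragment=fragment)
    # Optional local check: parse once so malformed markup raises ParseError
    # here rather than producing a failed request. Note that plain text
    # containing "&" or "<" would need escaping before being passed in.
    ElementTree.fromstring(ssml)
    return ssml


texts = [
    'Welcome to <prosody pitch="high">Microsoft Cognitive Services '
    'Text-to-Speech API.</prosody>',
    'A second string without any tags.',
]

# One request body per input string.
bodies = [build_ssml(text) for text in texts]
for body in bodies:
    print(body)
```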
Hi @IEvangelist - works like a charm - thank you so much for your support and efforts!