Azure-docs: (REST Speech-to-text) How do you differentiate pronScore, Accuracy Score, and Fluency Score?

Created on 8 Jul 2020 · 7 Comments · Source: MicrosoftDocs/azure-docs

The docs are unclear: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/rest-speech-to-text#response-parameters

Fluency is a part of accuracy, and accuracy is a part of fluency. I don't understand what the difference is in the calculation/production of these two scores. The explanations are single lines that essentially say "x is x".

Also the pronScore is based on these two scores and "weighted" - weighted how? weighted towards what?

Forgive me if I've posted this issue in the wrong place!

Document Details

⚠ Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.

ID: 11673bc9-9062-278d-c5b8-84b05f2b81be
Version Independent ID: 29347a63-de63-4ff0-5a62-149ab31d1b7e
Content: Speech-to-text API reference (REST) - Speech service - Azure Cognitive Services
Content Source: articles/cognitive-services/Speech-Service/rest-speech-to-text.md
Service: cognitive-services
Sub-service: speech-service
GitHub Login: @yinhew
Microsoft Alias: yinhew

Labels: Pri2, cognitive-services/svc, cxp, doc-bug, doc-enhancement, speech-service/subsvc, triaged

Source

crevulus

All 7 comments

@crevulus
Thanks for the feedback! We are currently investigating and will update you shortly.

YutongTie-MSFT on 8 Jul 2020

@crevulus Thanks again for the feedback.
We will improve the document so that it is clearer and helps customers understand the API more easily.
Let me answer your questions here.

About the difference between accuracy and fluency:
The accuracy score indicates how closely the pronounced phonemes match a native speaker's pronunciation.
We calculate it at the phoneme level first; the word-level and full-text-level accuracy scores are aggregated from the phoneme-level accuracy scores.
The fluency score indicates how closely the given speech matches a native speaker's naturalness, for example in its breaks and silence durations. It concerns the inter-word part of the speech, which is what makes it different from the accuracy score.

The completeness score is the ratio of words that were not mispronounced to the words in the reference text input.
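
As a rough illustration of that ratio, here is a minimal sketch. It is not the service's implementation: the function name `completeness_score`, the `(word, is_mispronounced)` input shape, and the 0-100 scale are all assumptions made for the example.

```python
def completeness_score(word_results):
    """Share of reference words that were not mispronounced, on a 0-100 scale.

    word_results: one (word, is_mispronounced) pair per word of the
    reference text. This input shape is hypothetical, chosen for the sketch.
    """
    total = len(word_results)
    if total == 0:
        raise ValueError("reference text must contain at least one word")
    # Count the reference words that were NOT flagged as mispronounced.
    not_mispronounced = sum(1 for _, mispronounced in word_results if not mispronounced)
    return 100.0 * not_mispronounced / total


# Reference text "I like it", with "like" flagged as mispronounced:
# two of the three reference words count towards completeness.
example_words = [("I", False), ("like", True), ("it", False)]
example_completeness = completeness_score(example_words)
```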

pronScore is the overall score, aggregated from the accuracy score, fluency score, and completeness score. It is calculated as accuracyScore * X% + fluencyScore * Y% + completenessScore * Z%. The weights may be adjusted over time, so we don't share them here. You can also calculate the overall score with your own customized weights.
In the future we will introduce more dimensions, such as a prosody score, and aggregate them into pronScore.
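
If you want to compute an overall score with your own weights, a minimal sketch of the weighted sum could look like the following. The default weights (50% / 30% / 20%) are placeholders invented for the example, not the values the service uses, and the function name `custom_pron_score` is hypothetical.

```python
def custom_pron_score(accuracy, fluency, completeness, weights=(0.5, 0.3, 0.2)):
    """Weighted sum: accuracy * X% + fluency * Y% + completeness * Z%.

    The default weights are placeholders for illustration only; the
    service's real weights are not published.
    """
    w_accuracy, w_fluency, w_completeness = weights
    # Require the weights to sum to 1 so the result stays on the input scale.
    if abs(w_accuracy + w_fluency + w_completeness - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return (accuracy * w_accuracy
            + fluency * w_fluency
            + completeness * w_completeness)


# 80 * 0.5 + 90 * 0.3 + 100 * 0.2, approximately 87
example_overall = custom_pron_score(80.0, 90.0, 100.0)
```

Requiring the weights to sum to 1 is a design choice for this sketch: it keeps the overall score on the same scale as the three input scores.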

Please let me know if you have further questions.

yinhew on 9 Jul 2020


Thanks for the feedback. It was very thorough.

Prosody score would be very useful for my purposes! Please keep me updated, and you can close this ticket if you wish.

crevulus on 9 Jul 2020

@crevulus
We will now proceed to close this thread. If there are further questions regarding this matter, please respond here and mention @YutongTie-MSFT, and we will gladly continue the discussion.

YutongTie-MSFT on 13 Jul 2020

Revised public-facing content will appear at this address within 24 hours:
https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/rest-speech-to-text

sign-off

v-demjoh on 20 Jul 2020

@yinhew, thanks for the detailed explanation. I wonder if you have time to help me better understand the fluency score:

Like you mentioned:

The fluency score indicates the speech fluency of the given speech towards native speaking naturalness such as break, silence duration. It cares inter-word part. This is different from accuracy score.

Is there a way for you to share more details of how exactly the fluency score is calculated? For example, I have the following alignment result for a recording:

silence 0-0.1s
I       0.1-0.3s
<break> 0.3-0.6s
like    0.7-1.7s # the speaker pronounced the word 'like' longer than most of the native speakers.
it      1.7-2.0s
silence 2.0-2.3s

What is the fluency score for this case, and how is it calculated?
P.S.: a rough idea of the calculation procedure is fine with me; no specific parameters are needed.

Thanks.