The docs are unclear: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/rest-speech-to-text#response-parameters
Fluency is a part of accuracy, and accuracy is a part of fluency. I don't understand what the difference is in the calculation/production of these two scores. The explanations are single lines that essentially say "x is x".
Also the pronScore is based on these two scores and "weighted" - weighted how? weighted towards what?
Forgive me if I've posted this issue in the wrong place!
@crevulus
Thanks for the feedback! We are currently investigating and will update you shortly.
@crevulus Thanks again for the feedback.
We will improve the documentation so that customers can understand the API more easily.
Let me answer your questions here.
About the difference between accuracy and fluency:
The accuracy score indicates how closely the pronounced phonemes match a native speaker's pronunciation.
We calculate it at the phoneme level first; the word-level and full-text-level accuracy scores are aggregated from the phoneme-level accuracy scores.
The fluency score indicates how closely the speech matches a native speaker's naturalness, for example in its breaks and silence durations. It looks at the parts between words, which is what distinguishes it from the accuracy score.
The completeness score is the ratio of words that are not mispronounced to the words in the reference text input.
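As a rough illustration of the completeness ratio just described, here is a minimal Python sketch. The function name, the per-word boolean flags, and the 0-100 scale are assumptions made for the example; the service's actual implementation is not described in this thread.

```python
def completeness_score(mispronounced_flags):
    """Percentage of reference words that were not flagged as mispronounced.

    mispronounced_flags: one bool per word of the reference text
    (hypothetical input; True means the word was flagged as mispronounced).
    """
    if not mispronounced_flags:
        raise ValueError("reference text must contain at least one word")
    ok_words = sum(1 for flagged in mispronounced_flags if not flagged)
    return 100.0 * ok_words / len(mispronounced_flags)


# Reference text of four words, one of which was mispronounced.
print(completeness_score([False, False, True, False]))  # 75.0
```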
pronScore is the overall score, aggregated from the accuracy, fluency, and completeness scores. It is calculated as accuracyScore * X% + fluencyScore * Y% + completenessScore * Z%. The weights may be adjusted over time, so we don't share them here. You can also calculate an overall score yourself with your own weights.
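For anyone who wants to compute an overall score with custom weights, the weighted sum above can be sketched in Python as follows. The function name, the default weights, and the requirement that the weights sum to 1 are assumptions made for illustration; they are not the weights the service uses, which are not published.

```python
def overall_score(accuracy, fluency, completeness, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three sub-scores.

    The default weights are placeholders chosen for illustration only;
    they are NOT the service's weights.
    """
    w_acc, w_flu, w_comp = weights
    # Assumption: X% + Y% + Z% = 100%, so the weights should sum to 1.
    if abs(w_acc + w_flu + w_comp - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    return accuracy * w_acc + fluency * w_flu + completeness * w_comp


# Made-up sub-scores on a 0-100 scale, combined with custom weights.
print(overall_score(80.0, 90.0, 100.0, weights=(0.6, 0.2, 0.2)))
```

Passing your own `weights` tuple lets you emphasise whichever dimension matters most for your application.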
In the future we will introduce more dimensions, such as a prosody score, and aggregate them into pronScore.
Please let me know if you have further questions.
Thanks for the feedback. It was very thorough.
Prosody score would be very useful for my purposes! Please keep me updated, and you can close this ticket if you wish.
@crevulus
We will now proceed to close this thread. If there are further questions regarding this matter, please respond here and tag @YutongTie-MSFT, and we will gladly continue the discussion.
Revised public-facing content will appear at this address within 24 hours:
https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/rest-speech-to-text
@yinhew, thanks for the detailed explanation. I wonder if you have time to help me better understand the fluency score:
Like you mentioned:
The fluency score indicates the speech fluency of the given speech towards native speaking naturalness such as break, silence duration. It cares inter-word part. This is different from accuracy score.
Is there a way for you to share more details of how exactly the fluency score is calculated? For example, I have the following alignment result of a recording:
silence 0-0.1s
I 0.1-0.3s
<break> 0.3-0.6s
like 0.7-1.7s # the speaker pronounced the word 'like' longer than most of the native speakers.
it 1.7-2.0s
silence 2.0-2.3s
What is the fluency score for this case? And how is it calculated?
P.S.: A rough idea of the calculation procedure is enough for me; no specific parameters are needed.
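To make the example concrete, the pause and word durations in the alignment above can be pulled out with a short Python script. This is only a sketch of the kind of inputs a fluency measure might look at; it is not the service's fluency calculation, which this thread does not describe. The tuple layout and the function name are assumptions.

```python
# (label, start_seconds, end_seconds), copied from the alignment above.
ALIGNMENT = [
    ("silence", 0.0, 0.1),
    ("I", 0.1, 0.3),
    ("<break>", 0.3, 0.6),
    ("like", 0.7, 1.7),
    ("it", 1.7, 2.0),
    ("silence", 2.0, 2.3),
]

# Labels treated as non-speech segments in this sketch.
NON_SPEECH = {"silence", "<break>"}


def summarize(alignment):
    """Split segments into word durations and non-speech durations (seconds)."""
    words, pauses = [], []
    for label, start, end in alignment:
        # Round to milliseconds to hide floating-point noise.
        duration = round(end - start, 3)
        if label in NON_SPEECH:
            pauses.append((label, duration))
        else:
            words.append((label, duration))
    return words, pauses


words, pauses = summarize(ALIGNMENT)
print("words:", words)
print("pauses:", pauses)
```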
Thanks.
@weiwchu @YutongTie-MSFT I'm also very interested in your reply to @weiwchu's query. It would be useful to know for our product.