Google-cloud-java: Duplicate/combined word (lists) when using the response 'SpeechRecognitionResult' from longRunningRecognizeAsync

Created on 5 Nov 2018 · 9 Comments · Source: googleapis/google-cloud-java

When looping through the 'SpeechRecognitionResult' objects, I noticed that the transcript attribute and the 'words_' list do not match for (at least) the last result. In our app, we always use the first and only alternative. The word lists match the transcript for the first few results, but for the last result the list contains every word from the whole audio, not just the words of the last transcript. I would expect the last result to contain only the words that belong to that specific transcript.

I assume this is a bug. If not, please advise.

Environment details

  • OS: Windows 10
  • Java version: 1.8.0_102
  • google-cloud-java version(s): google-cloud-speech-0.67.0-beta

Code snippet

In order to clarify this, I added a simplified code snippet below.

List<SpeechRecognitionResult> results = response.getResultsList();

for (SpeechRecognitionResult result : results) {
    SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
    String transcript = alternative.getTranscript();
    for (WordInfo wordInfo : alternative.getWordsList()) {
        String word = wordInfo.getWord();
    }
}

In my current example, we have 3 results. The number of words is correct for the first two, but the third (last) is incorrect and includes all words from the whole text.

Result 0: Only the word Jaguar (both the transcript and the only word)

Result 1: A longer transcript, with 102 (correct) words in total

Result 2: A short transcript with only 23 words. As you can see, the list with words includes all 126 words (1+102+23).

speech question

All 9 comments

Could someone from Google please have a look at this? The behavior is not consistent. The results were good for a few days, but now the duplicate list is back.

Can you share your RecognitionConfig setup code?

I'm wondering if it's related to the fact that when EnableSpeakerDiarization is true, the words list is expected to contain all words from the beginning of the audio, even though the transcript does not.

Related documentation: https://github.com/googleapis/google-cloud-java/blob/master/google-api-grpc/proto-google-cloud-speech-v1p1beta1/src/main/java/com/google/cloud/speech/v1p1beta1/SpeechRecognitionAlternative.java#L187

If EnableSpeakerDiarization is set to true, can you set it to false and see if you experience the same problem?

Thanks for the update Jesse. While experimenting with your suggestions and reading the additional documentation, I can verify that the behavior is now explainable and consistent. The RecognitionConfig will be set by the end user (see the screenshot for an impression).

When Speaker Diarization is set to false, the results are fine. Although the behavior is now explainable when it is set to true, I still think this is not a preferable way of working with the response. We use the first and the last word to determine the start and end of a sentence, because there are no such fields on the 'alternative' level. In addition, we store the sentences and their associated words together in our database, which our front-end application requires. To achieve this now, it feels like we need to build custom logic to determine which words underneath the alternative are really part of that sentence.
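The first-word/last-word approach described above can be sketched roughly as follows. This is only an illustration under assumptions: `SentenceSpan` and `TimedWord` are hypothetical names, and `TimedWord` is a stand-in for `WordInfo` that uses `java.time.Duration` so the example runs without the Speech client library.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of google-cloud-speech.
class SentenceSpan {

    // Stand-in for WordInfo: a word with its start and end offsets in the audio.
    static class TimedWord {
        final String word;
        final Duration start;
        final Duration end;

        TimedWord(String word, Duration start, Duration end) {
            this.word = word;
            this.start = start;
            this.end = end;
        }
    }

    // Returns {start, end} of a sentence: start of its first word, end of its last word.
    static Duration[] span(List<TimedWord> words) {
        if (words.isEmpty()) {
            throw new IllegalArgumentException("a sentence needs at least one word");
        }
        Duration start = words.get(0).start;
        Duration end = words.get(words.size() - 1).end;
        return new Duration[] {start, end};
    }

    public static void main(String[] args) {
        List<TimedWord> words = Arrays.asList(
            new TimedWord("hello", Duration.ofMillis(100), Duration.ofMillis(400)),
            new TimedWord("world", Duration.ofMillis(450), Duration.ofMillis(900)));
        Duration[] span = span(words);
        System.out.println("start=" + span[0].toMillis() + " ms, end=" + span[1].toMillis() + " ms");
    }
}
```

This only gives the right boundaries when the word list holds exactly the words of that sentence, which is why the cumulative list returned with diarization enabled is a problem for this setup.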

Hopefully this explains our setup. Thanks for your clarification so far. We are happy to hear your thoughts on this.

[Screenshot: 2018-12-06 15_08_41]

I agree that the behavior is a little strange and unintuitive. I can start a discussion with the Speech team to see if it can be rethought, or if there's a workaround in place that I'm not aware of.

In the meantime, maybe a workaround could be to keep track of the word counts given in previous iterations, and on the last one get the relevant portion using List.subList()?

You may have tried this, but the code you provided in the original question could be replaced by something like this.

// Assumes `response` and `config` (the RecognitionConfig used for the request) are in scope.
List<SpeechRecognitionResult> results = response.getResultsList();
Iterator<SpeechRecognitionResult> i = results.iterator();
int wordsCounted = 0; // words seen in all results before the last one
while (i.hasNext()) {
  SpeechRecognitionResult result = i.next();
  SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
  String transcript = alternative.getTranscript();
  List<WordInfo> words;
  if (i.hasNext() || !config.getEnableSpeakerDiarization()) {
    // Not the last result, or diarization is off: the word list matches the transcript.
    wordsCounted += alternative.getWordsCount();
    words = alternative.getWordsList();
  } else {
    // Last result with diarization on: skip the words that belong to earlier results.
    words = alternative.getWordsList().subList(wordsCounted, alternative.getWordsCount());
  }
  for (WordInfo wordInfo : words) {
    String word = wordInfo.getWord();
  }
}
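For anyone who wants to try the offset logic without calling the API, here is a minimal self-contained sketch of the same idea. It is an illustration, not library code: `DiarizedWordSlicer` and `wordsPerResult` are hypothetical names, plain strings stand in for `WordInfo`, and it assumes (as reported in this thread) that only the last result carries the cumulative word list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of google-cloud-speech.
class DiarizedWordSlicer {

    // Returns the words that belong to each result. Every list is copied unchanged,
    // except the last one when diarization is enabled: it is assumed to hold all
    // words from the start of the audio, so the words already counted are skipped.
    static List<List<String>> wordsPerResult(List<List<String>> wordLists,
                                             boolean diarizationEnabled) {
        List<List<String>> perResult = new ArrayList<>();
        int wordsCounted = 0;
        for (int i = 0; i < wordLists.size(); i++) {
            List<String> words = wordLists.get(i);
            boolean isLast = i == wordLists.size() - 1;
            // If the last list is shorter than the running count, it cannot be
            // cumulative, so it is left untouched instead of failing in subList.
            if (isLast && diarizationEnabled && wordsCounted <= words.size()) {
                perResult.add(new ArrayList<>(words.subList(wordsCounted, words.size())));
            } else {
                perResult.add(new ArrayList<>(words));
                wordsCounted += words.size();
            }
        }
        return perResult;
    }

    public static void main(String[] args) {
        // Shaped like the report above: short results first, then a cumulative last list.
        List<List<String>> wordLists = Arrays.asList(
            Arrays.asList("Jaguar"),
            Arrays.asList("is", "a", "car"),
            Arrays.asList("Jaguar", "is", "a", "car", "made", "in", "Britain"));
        System.out.println(wordsPerResult(wordLists, true));
    }
}
```

With the sample data this prints `[[Jaguar], [is, a, car], [made, in, Britain]]`. The same arithmetic applies to the original report: 1 + 102 words are skipped, leaving the 23 words of the last transcript.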

Thanks for the update and workaround, @JesseLovelace. For now, it looks like this is working for our implementation. I am now able to set the correct start and end times.

If there are any updates on a final 'solution' or a more intuitive way of working with the responses, I will be happy to hear about that in the future. For now, thanks for your support.

@beccasaurus FYI

Closing this issue, as it seems the original question has been answered. If you would like to file a feature request for the service itself, I would suggest opening a ticket with the GCP Public Issue Tracker.
