Google-cloud-python: Speech v1p1beta1 words do not match transcription when diarization enabled

Created on 15 Aug 2018 · 2 Comments · Source: googleapis/google-cloud-python

The words do not match the transcription in the LongRunningRecognitionResponse when speaker diarization is enabled. The issue goes away when speaker diarization is disabled.

I have attached the WAV file (bug_v1p1beta1.wav.zip) that I used to run the following example:

import os

from google.cloud import storage
from google.cloud import speech_v1p1beta1 as speech
from google.cloud.speech_v1p1beta1 import enums
from google.cloud.speech_v1p1beta1 import types

GOOGLE_BUCKET = os.environ['GOOGLE_BUCKET']

def transcribe(filepath):
    # put audio file into google storage
    storage_client = storage.Client.from_service_account_json('service-account.json')
    google_bucket  = storage_client.get_bucket(GOOGLE_BUCKET)
    blob = storage.Blob(os.path.basename(filepath), google_bucket)
    blob.upload_from_filename(filepath)

    # transcribe file
    gcs_uri = 'gs://{0}/{1}'.format(blob.bucket.name, blob.name)
    speech_client = speech.SpeechClient()
    operation = speech_client.long_running_recognize(
        audio=speech.types.RecognitionAudio(
            uri=gcs_uri
        ),
        config=speech.types.RecognitionConfig(
            encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
            profanity_filter=False,
            language_code='en-US',
            enable_word_time_offsets=True,
            audio_channel_count=2,
            enable_speaker_diarization=True,
            enable_separate_recognition_per_channel=True
        )
    )
    # wait for the long-running operation to finish, then remove the upload
    operation_result = operation.result()
    blob.delete()

    # print results
    for speech_recognition_result in operation_result.results:
        print('Channel: {0}'.format(speech_recognition_result.channel_tag))
        print('\ttranscription: {0}'.format(speech_recognition_result.alternatives[0].transcript))
        print('\twords: {0}'.format(' '.join([w.word for w in speech_recognition_result.alternatives[0].words])))


if __name__ == '__main__':
    transcribe('./bug_v1p1beta1.wav')

This script results in:

Channel: 1
    transcription: this is a test of the Left Channel
    words: this is a test of the Left Channel
Channel: 2
    transcription: this is a test of the right Channel
    words: this is a test of the right Channel
Channel: 1
    transcription:  there seems to be a bug in Google Cloud speech v1p one day that one
    words: this is a test of the Left Channel there seems to be a bug in Google Cloud speech v1p one day that one
Channel: 2
    transcription:  this test is meant to recreate that bug
    words: this is a test of the right Channel this test is meant to recreate that bug
Channel: 1
    transcription:  goodbye
    words: this is a test of the Left Channel there seems to be a bug in Google Cloud speech v1p one day that one goodbye

After the first result on each channel, the words no longer match the transcription: they appear to accumulate every word recognized so far on that channel (or for that speaker).

When speaker diarization is disabled, the issue does not occur and the following output is produced:

Channel: 1
    transcription: this is a test of the Left Channel
    words: this is a test of the Left Channel
Channel: 2
    transcription: this is a test of the right Channel
    words: this is a test of the right Channel
Channel: 1
    transcription:  there seems to be a bug in Google Cloud speech v1p one day that one
    words: there seems to be a bug in Google Cloud speech v1p one day that one
Channel: 2
    transcription:  this test is meant to recreate that bug
    words: this test is meant to recreate that bug
Channel: 1
    transcription:  goodbye
    words: goodbye

Running:

Mac OSX 10.13.6
Python 2.7.14
google-cloud==0.32.0
google-cloud-speech==0.35.0

ppopp

Most helpful comment

Never mind. I see from the docs that this is expected behavior:
https://googlecloudplatform.github.io/google-cloud-python/latest/speech/gapic/v1p1beta1/types.html#google.cloud.speech_v1p1beta1.types.RecognitionConfig.enable_speaker_diarization

When this is true, we send all the words from the beginning of the audio for the top alternative in every consecutive responses. This is done in order to improve our speaker tags as our models learn to identify the speakers in the conversation over time.

I suggest changing this behavior. It requires users to implement different parsing logic depending on whether the diarization flag is set or not, which seems unnecessary.

ppopp on 15 Aug 2018

👍3
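
Until the behavior changes, one way to avoid flag-dependent parsing is to keep only the tail of the word list that corresponds to each result's own transcript. The following is a minimal sketch, not part of the client library. It assumes that every whitespace-separated token in the transcript maps to exactly one entry in words, which holds for the output shown above but is not guaranteed by the API. The namedtuples and the fake_result helper are hypothetical stand-ins so the sketch runs without calling the service; they only mirror the attribute names used in the script above.

```python
from collections import namedtuple

# Hypothetical stand-ins for the response objects. They expose only the
# attributes the script above reads: channel_tag, alternatives, transcript,
# words, and word.
Word = namedtuple('Word', ['word'])
Alternative = namedtuple('Alternative', ['transcript', 'words'])
Result = namedtuple('Result', ['channel_tag', 'alternatives'])


def words_for_result(result):
    """Return only the words that belong to this result's own transcript.

    Assumption: each whitespace-separated token of the transcript matches
    exactly one entry at the end of the (possibly accumulated) word list.
    """
    top = result.alternatives[0]
    token_count = len(top.transcript.split())
    if token_count == 0:
        # Guard: a slice of [-0:] would return the whole list.
        return []
    return list(top.words)[-token_count:]


def fake_result(transcript, accumulated):
    """Build a channel-1 result whose word list holds `accumulated`."""
    words = [Word(w) for w in accumulated.split()]
    return Result(1, [Alternative(transcript, words)])


# Channel 1 data taken from the output in the issue (diarization enabled).
first = 'this is a test of the Left Channel'
second = 'there seems to be a bug in Google Cloud speech v1p one day that one'
third = 'goodbye'

results = [
    fake_result(first, first),
    fake_result(' ' + second, first + ' ' + second),
    fake_result(' ' + third, first + ' ' + second + ' ' + third),
]

for result in results:
    own_words = words_for_result(result)
    print('words: {0}'.format(' '.join(w.word for w in own_words)))
```

Because the helper slices from the end, it returns the same words whether or not the list is accumulated, so the same call works with diarization on or off as long as the token-count assumption holds.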

@ppopp Hi, so are you setting enable_speaker_diarization=False to get the correct results? Also, is the channel identifier the same as the speaker identifier?

I am trying to get the speaker tag for each line, like this:
speaker 1: Hello, I am here
speaker 2: How are you?

Thanks
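
For the per-speaker layout asked about above, a possible approach is to group consecutive words by their speaker tag. This is only a sketch under assumptions: it relies on each word in the v1p1beta1 response carrying a speaker_tag field when diarization is enabled (the docs quoted earlier refer to speaker tags), and on the last result for a channel holding the full accumulated word list. The channel_tag printed by the script above identifies the audio channel and appears to be a separate field from the per-word speaker tag, so it is worth verifying both against the actual response. The Word namedtuple and the sample words are hypothetical stand-ins.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical stand-in for a word entry; the real objects are assumed to
# expose the same two attributes when diarization is enabled.
Word = namedtuple('Word', ['word', 'speaker_tag'])


def lines_by_speaker(words):
    """Join consecutive words that share a speaker_tag into one line each."""
    lines = []
    for tag, group in groupby(words, key=lambda w: w.speaker_tag):
        text = ' '.join(w.word for w in group)
        lines.append('speaker {0}: {1}'.format(tag, text))
    return lines


# Sample data mirroring the desired output in the comment above.
words = [
    Word('Hello,', 1), Word('I', 1), Word('am', 1), Word('here', 1),
    Word('How', 2), Word('are', 2), Word('you?', 2),
]

for line in lines_by_speaker(words):
    print(line)
```

With a real response, the words argument would be something like operation_result.results[-1].alternatives[0].words, taken from the last result for the channel of interest.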