The words do not match the transcription in the LongRunningRecognitionResponse when speaker diarization is enabled. The issue goes away when speaker diarization is disabled.
I attached the WAV file that I used to run the following example:
bug_v1p1beta1.wav.zip
import os

from google.cloud import storage
from google.cloud import speech_v1p1beta1 as speech
from google.cloud.speech_v1p1beta1 import enums
from google.cloud.speech_v1p1beta1 import types

GOOGLE_BUCKET = os.environ['GOOGLE_BUCKET']


def transcribe(filepath):
    # put audio file into google storage
    storage_client = storage.Client.from_service_account_json('service-account.json')
    google_bucket = storage_client.get_bucket(GOOGLE_BUCKET)
    blob = storage.Blob(os.path.basename(filepath), google_bucket)
    blob.upload_from_filename(filepath)

    # transcribe file
    gcs_uri = "gs://" + str(blob.bucket.name) + "/" + blob.name
    speech_client = speech.SpeechClient()
    operation = speech_client.long_running_recognize(
        audio=speech.types.RecognitionAudio(
            uri=gcs_uri
        ),
        config=speech.types.RecognitionConfig(
            encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
            profanity_filter=False,
            language_code='en-US',
            enable_word_time_offsets=True,
            audio_channel_count=2,
            enable_speaker_diarization=True,
            enable_separate_recognition_per_channel=True
        )
    )
    operation_result = operation.result()
    blob.delete()

    # print results
    for speech_recognition_result in operation_result.results:
        print('Channel: {0}'.format(speech_recognition_result.channel_tag))
        print('\ttranscription: {0}'.format(speech_recognition_result.alternatives[0].transcript))
        print('\twords: {0}'.format(' '.join([w.word for w in speech_recognition_result.alternatives[0].words])))


if __name__ == '__main__':
    transcribe('./bug_v1p1beta1.wav')
This script results in:
Channel: 1
	transcription: this is a test of the Left Channel
	words: this is a test of the Left Channel
Channel: 2
	transcription: this is a test of the right Channel
	words: this is a test of the right Channel
Channel: 1
	transcription: there seems to be a bug in Google Cloud speech v1p one day that one
	words: this is a test of the Left Channel there seems to be a bug in Google Cloud speech v1p one day that one
Channel: 2
	transcription: this test is meant to recreate that bug
	words: this is a test of the right Channel this test is meant to recreate that bug
Channel: 1
	transcription: goodbye
	words: this is a test of the Left Channel there seems to be a bug in Google Cloud speech v1p one day that one goodbye
The words do not match the transcription and appear to be accumulating by channel or speaker.
When speaker diarization is disabled, the issue does not occur and the following output is produced:
Channel: 1
	transcription: this is a test of the Left Channel
	words: this is a test of the Left Channel
Channel: 2
	transcription: this is a test of the right Channel
	words: this is a test of the right Channel
Channel: 1
	transcription: there seems to be a bug in Google Cloud speech v1p one day that one
	words: there seems to be a bug in Google Cloud speech v1p one day that one
Channel: 2
	transcription: this test is meant to recreate that bug
	words: this test is meant to recreate that bug
Channel: 1
	transcription: goodbye
	words: goodbye
Running:
Mac OSX 10.13.6
Python 2.7.14
google-cloud==0.32.0
google-cloud-speech==0.35.0
Never mind. I see this is expected behavior from the docs:
https://googlecloudplatform.github.io/google-cloud-python/latest/speech/gapic/v1p1beta1/types.html#google.cloud.speech_v1p1beta1.types.RecognitionConfig.enable_speaker_diarization
When this is true, we send all the words from the beginning of the audio for the top alternative in every consecutive responses. This is done in order to improve our speaker tags as our models learn to identify the speakers in the conversation over time.
I suggest changing this behavior. It requires users to implement different parsing logic depending on whether the diarization flag is set or not, which seems unnecessary.
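In the meantime, here is a rough sketch of the kind of parsing logic this forces on the caller. It assumes the word list accumulates per channel, which is what the output above suggests but is not something I have confirmed beyond this one file. `Result` and `words_for_each_result` are hypothetical names; `Result` is only a stand-in for the real response objects, where the words live under `alternatives[0].words`.

```python
from collections import namedtuple

# Hypothetical stand-in for a recognition result; in the real response the
# words are found under result.alternatives[0].words.
Result = namedtuple("Result", ["channel_tag", "words"])


def words_for_each_result(results, diarization_enabled):
    """Return a list of (channel_tag, words) with only the words of each result.

    Assumption: with diarization enabled, each result carries every word seen
    so far on its channel, so the already-seen prefix is dropped per channel.
    """
    seen = {}  # channel_tag -> number of words already returned
    output = []
    for result in results:
        words = list(result.words)
        if diarization_enabled:
            start = seen.get(result.channel_tag, 0)
            # Never move the counter backwards if a result has fewer words.
            seen[result.channel_tag] = max(start, len(words))
            words = words[start:]
        output.append((result.channel_tag, words))
    return output
```

With diarization disabled the function passes the words through unchanged, so the same printing code can be used in both cases.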
@ppopp Hi, so are you setting enable_speaker_diarization=False to get the correct results? Also, is the channel identifier the same as the speaker identifier?
I am trying to get a speaker tag per line:
speaker 1: Hello, I am here
speaker 2: How are you?
Thanks
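One possible way to get that layout, as a sketch rather than a confirmed answer: in v1p1beta1 the channel is reported on each result as `channel_tag`, while the speaker is reported on each word as `speaker_tag`, so consecutive words with the same `speaker_tag` can be joined into one line. `Word` and `lines_by_speaker` below are hypothetical names, and `Word` is only a stand-in for the word objects in the response.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical stand-in for the word objects in the response; only the two
# fields used here are modeled.
Word = namedtuple("Word", ["word", "speaker_tag"])


def lines_by_speaker(words):
    """Join runs of consecutive words that share a speaker_tag into lines."""
    lines = []
    for tag, group in groupby(words, key=lambda w: w.speaker_tag):
        text = " ".join(w.word for w in group)
        lines.append("speaker {0}: {1}".format(tag, text))
    return lines
```

Given the behavior described in the docs quoted above, the word list to pass in would presumably be the one from the last result, since that one contains all the words from the beginning of the audio.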