Wav2letter: Inference failed with long audio

Created on 16 Dec 2020  ·  15 Comments  ·  Source: flashlight/wav2letter

Bug Description

When I run inference in any mode (simple, multithreaded, or interactive) on a long audio file (30 minutes), it works fine for the first 15 minutes; after that, the output is empty:
773000,774000,làm gì
774000,775000,
775000,776000,nhà tao biệt
776000,777000,lập ở xóm trại
777000,778000,đây
778000,779000,bao năm chẳng
779000,780000,làm ruộng nên trắng
780000,781000,biết làm gì
781000,782000,ngoài
782000,783000,làm cái gì cá ba
783000,784000,
784000,785000,xã
785000,786000,không bắt sao
786000,787000,
787000,788000,gì
788000,789000,nhưng bây
789000,790000,giờ dân ít
790000,791000,chơi rồi
791000,792000,chỉ thỉnh thoảng
792000,793000,vào dịch lễ
793000,794000,tết thôi
794000,795000,nên cũng
795000,796000,chẳng
796000,797000,
797000,798000,
798000,799000,
799000,800000,
800000,801000,
801000,802000,
802000,803000,
803000,804000,
804000,805000,
805000,806000,
806000,807000,
807000,808000,
808000,809000,
809000,810000,
810000,811000,
811000,812000,
812000,813000,
813000,814000,
814000,815000,
815000,816000,
816000,817000,
817000,818000,
818000,819000,
819000,820000,
820000,821000,
821000,822000,
822000,823000,
823000,824000,
824000,825000,
825000,826000,
826000,827000,
827000,828000,
828000,829000,
829000,830000,
830000,831000,
831000,832000,
832000,833000,
833000,834000,
834000,835000,
835000,836000,
836000,837000,
837000,838000,
838000,839000,
839000,840000,
840000,841000,
841000,842000,
842000,843000,
843000,844000,
844000,845000,

Does anyone have the same problem, and how can I fix it?
Thank you

bug

All 15 comments

@hieuhv94: I had the same problem a while ago and I am not quite sure how I fixed it (or whether I fixed it), but are you reading the file while it is being written to (although even this shouldn't cause any problems)?

@abhinavkulkarni Thanks for your reply, but I am sure that I am not reading the file while it is being written; I recorded it before decoding.
If you remember how you fixed it, please tell me; otherwise I'll try to fix it myself :)))
Thanks!

@hieuhv94: Sorry, I meant the transcription file (rather than the audio file). But yeah, that's unlikely to be the cause behind the missing transcription.

Can you please verify that your audio is in either WAV or FLAC format, with a 16 kHz sample rate, 16-bit integer samples, and a single (mono) channel?

Thanks.

Of course. Parameters of the WAV file:
Sample rate: 16 kHz
Bitrate: 256 kbps => 16-bit integer samples
Channels: mono
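
The format check requested above can be scripted. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav_format` and its defaults are illustrative assumptions, not part of wav2letter. It also recomputes the bitrate, which for 16 kHz, 16-bit, mono audio is 16000 x 16 x 1 = 256000 bps, matching the 256 kbps reported here. Note that `wave` reads PCM WAV only, not FLAC.

```python
import wave


def check_wav_format(path, rate=16000, sample_width_bytes=2, channels=1):
    """Return (problems, bitrate_bps) for a PCM WAV file.

    `problems` is a list of human-readable mismatches against the
    expected format; an empty list means the file matches.
    """
    with wave.open(path, "rb") as wf:
        actual_rate = wf.getframerate()
        actual_width = wf.getsampwidth()      # bytes per sample
        actual_channels = wf.getnchannels()

    problems = []
    if actual_rate != rate:
        problems.append(f"sample rate is {actual_rate} Hz, expected {rate} Hz")
    if actual_width != sample_width_bytes:
        problems.append(
            f"sample width is {actual_width * 8} bits, "
            f"expected {sample_width_bytes * 8} bits"
        )
    if actual_channels != channels:
        problems.append(f"{actual_channels} channels, expected {channels}")

    # Uncompressed PCM bitrate: samples/s * bits/sample * channels.
    bitrate_bps = actual_rate * actual_width * 8 * actual_channels
    return problems, bitrate_bps
```

For example, `check_wav_format("audio.wav")` (a hypothetical file name) returns `([], 256000)` when the file matches the expected format.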

Also, I print the output to the console, not to a transcript file.

cc @vineelpratap @xuqiantong

@vineelpratap, @xuqiantong Do you have any idea?

Hi all,
Is there any update on this issue? I have the same problem :((
I think the problem comes from the beam scores during decoding: with lmweight=0 and wordscore=0, streaming works normally on long audio, but with lmweight=0.7 and wordscore=0.8, streaming reaches some chunk after which it produces no output.
Any ideas?

Hi hieuhv94,
can you confirm this by running the same experiment with your audio?

Hi @mlexplore1122
I tried the same experiment

Hi @mlexplore1122, thank you for your idea.
I tried this and the results were as you predicted: the model worked normally with lmweight=0 and wordscore=0.
Did you fix it?

Sorry, but I am still debugging. The code flow in the lexicon decoder is not easy to understand, so I think I need more time. If a Facebook developer could dig into the code with my hypothesis about the LM score in mind, I think it would be resolved faster.
Anyway, I will post an update when I have any news. :)))

Hi all,
I just tested the offline decoder with the same LM used by the streaming inference decoder and got the same result: I can't get the full output for the audio. (In this case the LM is only a 3-gram, pruned and quantized; this version is an attempt to shrink the LM for streaming.)
When I switch to a 4-gram LM that is quantized only (the official version for the decoder), the output for the long audio is good: the whole audio is decoded.
So I think that when the LM is good enough we can avoid the problem (which comes from the LexiconDecoder.cpp logic), as in issue #894, which is not resolved either.
So I'll just wait for @vineelpratap's answer :((((, and I will keep trying.

Hi, I think the problem could be that for very long audio we need to re-normalize the computed alphas (forward probabilities). I'm looking into the best way to fix it. Will get back soon...
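
To illustrate the general idea of re-normalizing forward probabilities, here is a minimal sketch on a toy 2-state HMM. It is not wav2letter's decoder code, and the model numbers and function names are made up for illustration. Without re-normalization, the alphas shrink geometrically with sequence length and eventually underflow to zero; rescaling them at each step and accumulating the log of the scale factors keeps the computation stable.

```python
import math

# Toy 2-state HMM; every number here is an assumption for illustration.
INIT = [0.5, 0.5]
TRANS = [[0.7, 0.3], [0.4, 0.6]]   # TRANS[prev_state][next_state]
EMIT = [[0.9, 0.1], [0.2, 0.8]]    # EMIT[state][symbol]


def _step(alpha, symbol):
    """One forward-recursion step: propagate alphas, then apply the emission."""
    n = len(alpha)
    return [
        sum(alpha[p] * TRANS[p][s] for p in range(n)) * EMIT[s][symbol]
        for s in range(n)
    ]


def forward_naive(obs):
    """Plain forward algorithm; returns the likelihood P(obs)."""
    alpha = [INIT[s] * EMIT[s][obs[0]] for s in range(len(INIT))]
    for symbol in obs[1:]:
        alpha = _step(alpha, symbol)
    return sum(alpha)


def forward_renormalized(obs):
    """Forward algorithm with per-step rescaling; returns log P(obs)."""
    alpha = [INIT[s] * EMIT[s][obs[0]] for s in range(len(INIT))]
    log_likelihood = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = _step(alpha, obs[t])
        scale = sum(alpha)                  # P(obs[t] | obs[:t])
        log_likelihood += math.log(scale)
        alpha = [a / scale for a in alpha]  # alphas now sum to 1 again
    return log_likelihood
```

On short sequences both functions agree (after taking the log of the naive result). On a sequence of 10,000 symbols, `forward_naive` underflows to `0.0`, while `forward_renormalized` still returns a finite log-likelihood.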

I cannot reproduce the issue.

I took a librispeech audio and replicated it 100 times to create a ~30 minute audio and used simple_streaming_asr_example and it transcribed everything correctly...

This is what I did...

> cd data
> for f in acoustic_model.bin tds_streaming.arch decoder_options.json feature_extractor.bin language_model.bin lexicon.txt tokens.txt ; do wget http://dl.fbaipublicfiles.com/wav2letter/inference/examples/model/${f} ; done
> # take any file from Librispeech, e.g. audio.flac
> sox audio.flac audio.wav  # convert to .wav
> cp audio.wav longaudio.wav && for i in {1..100};do sox audio.wav longaudio.wav longaudio.wav; done
> ./$PATH/simple_streaming_asr_example -input_files_base_path data -input_audio_file longaudio.wav

If you can give a way for me to reproduce the issue, it will help me in debugging...

Hi @vineelpratap,
As mentioned in some comments above, the problem only occurs when the language model is not good enough, meaning it does not fit the audio domain. Following this idea, I use a 3-gram LM pruned with prune 0 5 6 in KenLM. I tried to find audio with a large domain gap, but perhaps because the acoustic model (trained with unsupervised data) is so good, it was hard to find audio that reproduces the error easily.
But I finally found an audio file, from the domain of game news, that reproduces the error, as in some experiments in my language. It is 1 hour long, and the error occurs from the 42nd to the 51st minute, not over the whole audio. And of course, when I cut out minutes 42-51 alone and decode that segment, everything works normally. I have included my cut audio so you can reproduce this. Thank you.
All resources for reproduction are in my drive:
drive folder with the audio and lm_3_gram_prune_056 for reproducing the error
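
For anyone repeating the "cut out the failing segment and decode it alone" comparison described above, the following is a minimal sketch using Python's standard `wave` module. The function name `cut_wav` and the file names in the usage note are hypothetical; it assumes an uncompressed PCM WAV input.

```python
import wave


def cut_wav(src_path, dst_path, start_s, end_s):
    """Copy the [start_s, end_s) segment of a PCM WAV file.

    Times are in seconds and are clamped to the file's length.
    Returns the number of frames written.
    """
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        total = src.getnframes()
        start = min(max(int(start_s * rate), 0), total)
        end = min(max(int(end_s * rate), start), total)
        src.setpos(start)                       # seek to the first frame
        frames = src.readframes(end - start)
        channels = src.getnchannels()
        width = src.getsampwidth()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(channels)
        dst.setsampwidth(width)
        dst.setframerate(rate)
        dst.writeframes(frames)                 # header is finalized on close
    return end - start
```

For example, `cut_wav("full.wav", "cut_42_51.wav", 42 * 60, 51 * 60)` would extract minutes 42 to 51 so the segment can be decoded on its own and compared with the output from the full file.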
