Wav2letter: Using Decoder on single audio file

Created on 30 Nov 2020 · 9 Comments · Source: flashlight/wav2letter

How do I use the decoder?

I have downloaded the model files as mentioned in
https://github.com/facebookresearch/wav2letter/wiki/Inference-Run-Examples#download-the-example-trained-models-from-aws-s3

Now I want to use the decoder to decode an audio file.
I have created a decoder.cfg:

--am=path to acoustic_model.bin
--test=path to train.lst
--show
--showletters
--uselexicon=true
--lm=path to language_model.bin
--lmtype=kenlm
--decodertype=wrd
--lmweight=2.5
--wordscore=1
--beamsize=500
--beamthreshold=25
--silweight=-0.5
--nthread_decoder=4
--smearing=max
--show=true

train.lst contains the path to my audio file.

I am new to this framework, so please guide me through this and correct me if I am wrong.


ishan-modi

Most helpful comment

@ishan-modi: Take a look at the instructions about how to prepare data for training (and testing).

Also, if disk space and internet bandwidth are not a problem, try running the data preparation scripts for one of the recipes. That will download the LibriSpeech data and lay it out in the format that the Train/Test/Decode binaries expect (including .wav and .lst files).

Also, you may want to edit the subject title of this post for the benefit of others.

abhinavkulkarni on 1 Dec 2020

👍2

All 9 comments

@ishan-modi: Only Flashlight backend models can be used with Train/Test/Decode binaries. The models that you downloaded are the FBGEMM backend ones. Those are to be used in streaming format with inference example binaries.

abhinavkulkarni on 1 Dec 2020

👍1

Thank you for the response, got it!

OK, so now I am using the Flashlight backend models from this link:

https://github.com/facebookresearch/wav2letter/tree/master/recipes/streaming_convnets/librispeech

I want to reproduce beam search decoding for a single audio file.
How do I generate the .lst file for this audio file so that I can use it as input for decoding?

ishan-modi on 1 Dec 2020

Just a quick answer on the list file: the expected format has 3 or 4 columns, separated by tabs or spaces:

# audio_id (whatever name you want) absolute_audio_path audio_duration (in ms) transcription
1 /home/../1.wav 1234.34 hello world
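
As a rough illustration of the format above, the sketch below builds one list-file line for a single WAV file. It is only a sketch under assumptions: the helper names (`wav_duration_ms`, `make_list_line`, `write_list_file`) are made up for this example and are not part of wav2letter, the duration is read with Python's standard `wave` module (so it only handles PCM WAV input), and columns are joined with single spaces, which would break for paths containing spaces.

```python
import os
import wave


def wav_duration_ms(path):
    """Return the duration of a PCM WAV file in milliseconds."""
    with wave.open(path, "rb") as wav:
        return 1000.0 * wav.getnframes() / wav.getframerate()


def make_list_line(audio_id, path, transcript=""):
    """Build one line: audio_id, absolute path, duration (ms), optional transcript."""
    columns = [str(audio_id), os.path.abspath(path), f"{wav_duration_ms(path):.2f}"]
    if transcript:
        # The transcript is the optional 4th column.
        columns.append(transcript)
    return " ".join(columns)


def write_list_file(list_path, entries):
    """Write (audio_id, wav_path, transcript) tuples to list_path, one per line."""
    with open(list_path, "w", encoding="utf-8") as out:
        for audio_id, wav_path, transcript in entries:
            out.write(make_list_line(audio_id, wav_path, transcript) + "\n")
```

For example, `write_list_file("test.lst", [("1", "1.wav", "hello world")])` would write a single line in the format shown above; both file names here are placeholders.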

tlikhomanenko on 1 Dec 2020

👍1

I have some related questions about the list file format described above:

1. Do you have to provide the transcription text to the decoder so that it can compare results?
2. If you only want to transcribe and don't have the texts, and just want to use the model, how can that be done?

Adportas on 1 Dec 2020

Thank you so much for the response. The issue is resolved!

ishan-modi on 2 Dec 2020

🎉1

Answers:

1. No, you don't need to have transcripts if you want to decode.
2. Check out their inference module to generate transcripts by following the steps in the given link:

https://github.com/facebookresearch/wav2letter/wiki

ishan-modi on 2 Dec 2020

Hi @ishan-modi,
Thanks for answering.

I'm a bit confused about the purposes of the decoder and inference, and the differences between them. To me they look the same, but obviously both exist for a reason. Looking at the structure of the lst file suggested by @tlikhomanenko in the thread above (/home/../1.wav 1234.34 hello world), it seems to me that the decoder compares the text in the lst file with the text the model generates from the listed wav/flac files (is that right?). Otherwise the transcription data (hello world) would be unnecessary and illogical, I think.

Once I had the model, I started with the inference framework tutorial but got stuck, because the default model I generated was not of the streaming type, which is the only type supported by the listed examples: simple_streaming_asr_example and multithreaded_streaming_asr_example.

So I wanted to convert it to the streaming format with the StreamingTDSModelConverter tool, but that only works for TDS-type models, and my default model is not one -> thread

So I started testing with the decoder at the suggestion of @abhinavkulkarni in the following thread, hoping it would help me transcribe some wav files. It's a bit frustrating to read that it requires the transcripts, or that I have to go back to inference, which has TDS and streaming requirements that the model doesn't meet.

If I have a model that is neither streaming nor TDS type, how do I use it to transcribe wav files without the texts? I've been reading the wiki and testing for a few weeks without being able to use it.

Adportas on 2 Dec 2020

@Adportas Inference runs purely on CPU (in a streaming fashion), while decode.cpp works on both CPU and GPU for any network, and then uses the CPU for beam search decoding. Inference currently works only with conv-type networks. The decoder takes a list file and predicts transcriptions, so you don't need to have targets. At the same time, decode.cpp also computes WER. Right now decode.cpp computes WER in any case, so if you provide empty targets you still obtain predictions and WER, and you can simply ignore the WER. (People have reported a bug with empty targets, so please just put fake text there.)

So please just use decode.cpp with some fake transcripts (or even try empty strings there)!
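
Following the suggestion above, one way to prepare untranscribed audio is to pad every list-file entry that lacks a transcript with some fake text. The sketch below is only an illustration: `ensure_transcript` and the default text `placeholder` are hypothetical names chosen here, not part of wav2letter, and the resulting WER should be ignored since the targets are fake.

```python
def ensure_transcript(line, placeholder="placeholder"):
    """Return a list-file line that always has a non-empty transcript column."""
    # Split into at most 4 fields: id, path, duration, transcript (may contain spaces).
    fields = line.strip().split(None, 3)
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 columns, got: {line!r}")
    if len(fields) == 3:
        # No transcript present: add fake text so the decoder has a target.
        fields.append(placeholder)
    return " ".join(fields)
```

Applying this to every line of an existing 3-column list file leaves entries that already have a transcript unchanged, apart from normalizing the column separators to single spaces.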

tlikhomanenko on 11 Dec 2020
