I have downloaded the model files as mentioned in
https://github.com/facebookresearch/wav2letter/wiki/Inference-Run-Examples#download-the-example-trained-models-from-aws-s3
Now I want to use the decoder to decode an audio file.
I have created a decoder.cfg:
--am=path to acoustic_model.bin
--test=path to train.lst
--show
--showletters
--uselexicon=true
--lm=path to language_model.bin
--lmtype=kenlm
--decodertype=wrd
--lmweight=2.5
--wordscore=1
--beamsize=500
--beamthreshold=25
--silweight=-0.5
--nthread_decoder=4
--smearing=max
--show=true
The train.lst contains the path to my audio file.
I am a bit new to this framework, so please guide me through and correct me if I am wrong.
@ishan-modi: Only Flashlight backend models can be used with Train/Test/Decode binaries. The models that you downloaded are the FBGEMM backend ones. Those are to be used in streaming format with inference example binaries.
Thank you for the response, got it!
OK, so now I am using the Flashlight backend models from
https://github.com/facebookresearch/wav2letter/tree/master/recipes/streaming_convnets/librispeech
and I want to recreate beam search decoding for a single audio file.
How do I generate a .lst file for this audio file that I can use as input for decoding?
@ishan-modi: Take a look at the instructions about how to prepare data for training (and testing).
Also, if disk space and internet bandwidth are not a problem, try running the data preparation scripts for one of the recipes. That will download the Librispeech data and lay it out in the format that the Train/Test/Decode binaries expect (including .wav and .lst files).
Also, you may want to edit the subject title of this post for the benefit of others.
Just a quick answer on the list file: the expected format is below (columns are separated by tab or space, and there should be 3 or 4 columns)
# audio_id (whatever name you want) absolute_audio_path audio_duration (in ms) transcription
1 /home/../1.wav 1234.34 hello world
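For a single audio file, a line in that format could be produced with a small script. This is only a sketch: the function name `lst_line` is made up for illustration, it assumes the audio is an uncompressed PCM .wav (which Python's standard `wave` module can read), and it assumes the path contains no spaces, since the columns are whitespace separated.

```python
import os
import wave


def lst_line(audio_id, wav_path, transcript=""):
    """Build one list-file line: id, absolute path, duration in ms, transcription.

    Hypothetical helper, not part of wav2letter.
    """
    abs_path = os.path.abspath(wav_path)
    with wave.open(abs_path, "rb") as wav:
        # duration = number of frames / sample rate, converted to milliseconds
        duration_ms = 1000.0 * wav.getnframes() / wav.getframerate()
    fields = [str(audio_id), abs_path, f"{duration_ms:.2f}"]
    if transcript:
        # the 4th column is optional in the format described above
        fields.append(transcript)
    return " ".join(fields)
```

Writing the returned string (plus a newline) to a file gives a one-line .lst that can be passed as the test list.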
Thank you so much for the response. The issue is resolved!
I have associated doubts with that thread:
1. Do you have to provide the transcription text to the decoder in order to compare results?
2. If you only want to transcribe, don't have the texts, and just want to use the model, how can it be done?
Answers:
1. No, you don't need to have transcripts if you want to decode.
2. Check out their inference module to generate transcripts by following the steps in the given link.
Hi @ishan-modi
Thanks for answering
@Adportas Inference runs purely on CPU (in a streaming fashion), while decode.cpp runs the network on either CPU or GPU, for any network, and then runs beam search decoding on CPU. Inference currently works only with conv-type networks. The decoder takes a list file and predicts transcriptions, so you don't need to have targets. At the same time, decode.cpp always computes WER. People have reported a bug with empty targets, so please just put fake text there; you will still obtain predictions and a WER, and you can simply ignore the WER.
So please just use decode.cpp with some fake transcripts (or even try empty strings there)!
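If you already have a list file with only 3 columns, the fake transcripts can be filled in with a few lines of Python. This is a rough sketch, not something from wav2letter: `fill_missing_transcripts` and the placeholder text are made up for illustration, and it assumes the id, path, and duration columns contain no spaces.

```python
def fill_missing_transcripts(lines, placeholder="placeholder"):
    """Append a fake transcript to list-file lines that have only 3 columns.

    Hypothetical helper; lines that already have a transcript are kept as-is.
    """
    fixed = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        # split into at most 4 fields so a multi-word transcript stays together
        fields = stripped.split(None, 3)
        if len(fields) < 3:
            raise ValueError(f"expected at least 3 columns, got: {line!r}")
        if len(fields) == 3:
            fields.append(placeholder)
        fixed.append(" ".join(fields))
    return fixed
```

The WER reported for lines with the placeholder text is meaningless and can be ignored, as noted above.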