Hello. I have a few questions about speech recognition in CNTK. I would like to adapt the network so that it recognizes just a few words (roughly 15), as in "Switch on the light in the kitchen". I need it to control devices in a smart home environment (it's my Master's thesis).
"id1 = Features/sample1.mfc
id2 = Features/sample2.mfc
.... "
If not, can you tell me how to convert many .mfc files into single .chunk file?
I would really appreciate your help! I owe you a beer if you help me with this, guys :)
Best regards.
This is difficult. CNTK only provides you a way to train the acoustic model. But you need a full Viterbi speech decoder. CNTK does not include one. You could look at HTK or Kaldi.
CNTK can write out the probability vectors to an HTK-format feature file. However, HTK would have to be modified to operate on such files.
For the chunks, you do not need to convert them. Instead, you can add the length information to your SCP, e.g.
id2 = Features/sample2.mfc[0,456]
where 456 is the number of frames minus one. Then CNTK should be able to read this.
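As a rough sketch of how those SCP lines could be generated, the frame count can be read from the 12-byte header at the start of each HTK-format file. The function names `htk_frame_count` and `scp_line` are made up for illustration, the `id = path[0,last]` layout simply copies the example above, and the code assumes the files use HTK's standard big-endian byte order:

```python
import struct


def htk_frame_count(path):
    """Return the number of frames recorded in an HTK feature file header."""
    with open(path, "rb") as f:
        header = f.read(12)
    if len(header) < 12:
        raise ValueError(f"{path}: too short to be an HTK feature file")
    # HTK header layout (assumed big-endian): nSamples (int32),
    # sampPeriod (int32), sampSize (int16), parmKind (int16).
    n_samples, _period, _size, _kind = struct.unpack(">iihh", header)
    return n_samples


def scp_line(utt_id, path):
    """Build one SCP entry whose frame range ends at (number of frames - 1)."""
    last_frame = htk_frame_count(path) - 1
    return f"{utt_id} = {path}[0,{last_frame}]"
```

Calling `scp_line("id2", "Features/sample2.mfc")` on a file whose header reports 457 frames would give the line shown above. If the features were written by a tool that uses little-endian byte order, the format string would need `<` instead of `>`.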
Thank you for your quick answer. It's a pity CNTK doesn't support that. HTK is fine, but in my work I have to use a speech recogniser based on deep learning (that's the requirement). As far as I know, HTK is based on HMMs. Do you think Kaldi, apart from returning the layers' output, will allow me to decode the words?
I don't have to build my own network; I just need a tool that is based on deep learning and will recognize the speech for me. Is there something I could connect to, stream the audio to its server (or send a whole audio file), and get back the answer (a specific sentence)? Or maybe some other way, such as an out-of-the-box ASR that I can integrate into my code? I would really appreciate your help, as you have far more experience than I do.
Kind regards.
There is an integration between Kaldi and CNTK. However, I am not familiar with it, and I don't know whether it can do what you need. @dongyu888, do you know?
Yes. You can write out the network output and then use the Kaldi decoder to generate text.
Thanks,
Dong Yu (俞栋)
Can I do it directly in CNTK? Or do I have to download Kaldi as well? I ask because I see there is an example using Kaldi in \Examples\Speech\Miscellaneous\AMI.
No, you cannot do it directly in CNTK. You can just copy a Kaldi decoder binary from someone to do the decoding. In the past we released our research decoder, argon, which can decode directly with CNTK. I will check whether we can still do that. In either case, you still need to build the decoding graph (HMM model, pronunciation, LM, etc.) using other tools.
Many thanks for your replies, guys. Now I know where I stand. OK @dongyu888, I'm waiting for your answer.
I just checked. We intend to open source argon but it takes time. For now, you might want to get the Kaldi decoder instead.
Ok, thank you again
@dongyu888. I noticed this thread is a bit old but was curious if anything came of the 'research decoder argon' project you mention above? I'm quite interested in using CNTK to directly decode and would love to see a pointer if anything got released.
I talked to MSR folks and there's no plan for releasing the decoder argon. @dongyu888 has left the team.
Closing this issue for now. Thanks!