Cntk: How can I recognize the spoken words from the network's output?

Created on 5 Aug 2016 · 11 Comments · Source: microsoft/CNTK

Hello. I have a few questions about speech recognition in CNTK. I would like to adapt the network so that it can recognize just a few words (about 15), as in "Switch on the light in the kitchen". I need it for controlling devices in a smart home environment (it's my Master's thesis).

First of all, I ran your EndToEnd test located in "..EndToEndTests\Speech\DNN\WriteCommand". It works fine: it trains the network on the An4 data set and then writes the output for the 10 test utterances to the file "Output.ScaledLogLikelihood". This file contains only the values that the output nodes returned, but I would like to know which words are encoded by these values. To put it more simply: what did the speakers say in these 10 sentences? Is there a way to modify the "write [ ]" section in the CNTK files so that the network returns words instead of values? Or is there some other way?
I would like to extend this network so that it can recognize a few words spoken by me. What I want to do is add data to the training and test sets described in the .mlf and .scp files. How can I combine the existing training and test sets with those I create? Should I add another "reader [ features[] labels[] ]" section to the .cntk file? And can my features and labels have different dimensions than the existing files? If yes, what happens to the network's layerSizes? Currently it is layerSizes = 363:512:512:132, because the existing files' feature dimension is 363 and the label dimension is 132. If I add more feature and label files, how do I recalculate layerSizes?
In your examples, all speech data features (MFCC) are stored in one archive file, 000000000.chunk. In the .scp files you just take a fragment of this file and give it an id. For example:
"An4/71/71/cen5-fjam-b.mfc=Features/000000000.chunk[0,367]". My question: is it obligatory to use one .chunk file instead of separate .mfc files? Can I, for example, define the following .scp file:

"id1 = Features/sample1.mfc
id2 = Features/sample2.mfc
.... "
If not, can you tell me how to convert many .mfc files into single .chunk file?

I would really appreciate your help! I owe you a beer if you help me with this, guys :)

Best regards.

TomaszAugustyn

All 11 comments

This is difficult. CNTK only provides you a way to train the acoustic model. But you need a full Viterbi speech decoder. CNTK does not include one. You could look at HTK or Kaldi.

CNTK has a way of writing out the probability vectors to an HTK-format feature file. However, HTK would have to be modified to operate on such files.
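
To illustrate what a decoder adds on top of the per-frame scores, here is a toy Viterbi search in Python. It is only a sketch under strong assumptions: a two-state left-to-right model with made-up transition probabilities and made-up frame scores. A real decoder such as the ones in HTK or Kaldi searches a graph built from the HMM topology, the pronunciation lexicon, and the language model. All names below are hypothetical.

```python
import math

NEG_INF = float("-inf")

def viterbi(loglikes, log_trans, log_init):
    """Return the best state path for per-frame state log-likelihoods.

    loglikes[t][s]  : score of frame t under state s (e.g. network output)
    log_trans[p][s] : log-probability of moving from state p to state s
    log_init[s]     : log-probability of starting in state s
    """
    n_states = len(log_init)
    score = [log_init[s] + loglikes[0][s] for s in range(n_states)]
    backptrs = []
    for frame in loglikes[1:]:
        new_score, ptrs = [], []
        for s in range(n_states):
            # Best predecessor of state s at this frame.
            best = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best] + log_trans[best][s] + frame[s])
            ptrs.append(best)
        score = new_score
        backptrs.append(ptrs)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptrs in reversed(backptrs):
        state = ptrs[state]
        path.append(state)
    path.reverse()
    return path

# Toy model (assumed values): start in state 0, never go back from 1 to 0.
LOG_INIT = [0.0, NEG_INF]
LOG_TRANS = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
# Made-up scores for 4 frames; the frame-wise argmax would be 0, 1, 0, 1.
LOGLIKES = [[-0.1, -3.0], [-2.0, -0.2], [-0.3, -1.5], [-3.0, -0.1]]

if __name__ == "__main__":
    print(viterbi(LOGLIKES, LOG_TRANS, LOG_INIT))
```

Picking the best-scoring output node per frame would give a state sequence that the model does not allow (state 1 back to state 0), whereas the search returns the best sequence that respects the transitions. Mapping such state sequences to words is what the lexicon and language model are for.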

For the chunks, you do not need to convert them. Instead, you can add the length information to your SCP, e.g.

id2 = Features/sample2.mfc[0,456]

where 456 is the number of frames minus one. Then CNTK should be able to read this.
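
A minimal Python sketch of how such SCP lines could be generated for separate .mfc files. It assumes the files are standard HTK feature files, whose 12-byte big-endian header starts with the frame count (nSamples). The helper names are hypothetical, and the `id=path[first,last]` layout follows the An4 example quoted in the question.

```python
import struct

def htk_frame_count(path):
    """Read the number of frames from the 12-byte HTK feature file header."""
    with open(path, "rb") as f:
        header = f.read(12)
    if len(header) < 12:
        raise ValueError(f"{path}: too short to be an HTK feature file")
    # Big-endian: nSamples (int32), sampPeriod (int32),
    # sampSize (int16), parmKind (int16)
    n_samples, _period, _size, _kind = struct.unpack(">iihh", header)
    return n_samples

def scp_line(utt_id, path):
    """Build one SCP entry; the last index is the frame count minus one."""
    last_frame = htk_frame_count(path) - 1
    return f"{utt_id}={path}[0,{last_frame}]"

def write_scp(entries, scp_path):
    """entries: iterable of (utt_id, feature_path) pairs."""
    with open(scp_path, "w") as out:
        for utt_id, path in entries:
            out.write(scp_line(utt_id, path) + "\n")
```

For a file with 457 frames, `scp_line("id2", "Features/sample2.mfc")` would produce `id2=Features/sample2.mfc[0,456]`.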

frankseide on 5 Aug 2016

Thank you for your quick answer. It's a pity CNTK doesn't support that. HTK is fine, but in my work I have to use a speech recogniser based on deep learning (that's the requirement). As far as I know, HTK is based on HMMs. Do you think Kaldi, apart from returning the layers' output, will allow me to decode the words?

I don't have to build my own network; I just need a tool that will recognize the speech for me and that is based on deep learning. Is there something that I could connect to, stream the audio to (or send a whole audio file), and get back the answer (a specific sentence)? Or maybe some other way, such as an out-of-the-box ASR that I can integrate into my code? I would really appreciate your help; you have far more experience than I do.

Kind regards.

TomaszAugustyn on 5 Aug 2016

There is an integration between Kaldi and CNTK. However, I am not familiar with it, and I don't know whether it can do what you need. @dongyu888, do you know?

frankseide on 5 Aug 2016

Yes. You can write out the network output and then use the Kaldi decoder to generate text.

Thanks,
Dong Yu (俞栋)

dongyu888 on 5 Aug 2016

Can I do it directly in CNTK, or do I have to download Kaldi as well? I ask because I see there is an example in \Examples\Speech\Miscellaneous\AMI that uses Kaldi.

TomaszAugustyn on 5 Aug 2016

No, you cannot do it directly using CNTK. You can just copy a Kaldi decoder binary from someone to do the decoding. In the past we have released our research decoder argon, which can decode directly with CNTK. I will check to see whether we can still do that. In either case, you still need to build the decoding graph (HMM model, pronunciation, LM, etc.) using other tools.
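
As a rough illustration of the hand-off to Kaldi, the Python sketch below reformats per-frame scores as a Kaldi text-archive matrix. It assumes that the CNTK output file is plain text with one frame per line and whitespace-separated values; that is an assumption about the file layout, not something confirmed in this thread. The function names are hypothetical. Kaldi decoders that consume precomputed log-likelihoods (for example `latgen-faster-mapped`) read matrices in this form, but check the Kaldi documentation for exact usage, and note that the output node order must match the state inventory of the Kaldi model.

```python
def parse_score_rows(text):
    """Parse one frame per line, whitespace-separated floats (assumed layout)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if fields:
            rows.append([float(x) for x in fields])
    return rows

def to_kaldi_text_ark(utt_id, rows):
    """Format a frames-by-outputs matrix as one Kaldi text-archive entry."""
    lines = [f"{utt_id}  ["]
    for row in rows:
        lines.append("  " + " ".join(f"{v:.6g}" for v in row))
    # Kaldi closes the matrix with " ]" at the end of the last row.
    lines[-1] += " ]"
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # Made-up scores for two frames and two output nodes.
    sample = "-1.5 -0.25\n-2.0 -0.5\n"
    print(to_kaldi_text_ark("utt1", parse_score_rows(sample)), end="")
```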

dongyu888 on 5 Aug 2016

Many thanks for your replies, guys. Now I know where I stand. OK @dongyu888, I'm waiting for your answer.

TomaszAugustyn on 5 Aug 2016

I just checked. We intend to open source argon but it takes time. For now, you might want to get the Kaldi decoder instead.

dongyu888 on 5 Aug 2016

Ok, thank you again

TomaszAugustyn on 5 Aug 2016

@dongyu888. I noticed this thread is a bit old but was curious if anything came of the 'research decoder argon' project you mention above? I'm quite interested in using CNTK to directly decode and would love to see a pointer if anything got released.