Wav2letter: Tokenization related problem

Created on 24 Aug 2020  ·  18 Comments  ·  Source: flashlight/wav2letter

I used the SentencePiece toolkit for tokenization. After encoding, all the spaces were replaced with the special token "▁". Although I set the word separator to '▁' during training, I get 99% WER and 10% CER. I read the predictions: all the words were predicted correctly, but there were issues with the spaces. I found that the test.lst.viterbi.ref file contains spaces, so I changed the test.lst file, replacing the spaces with "▁". However, now all the samples in the test.lst.viterbi.ref file are unknown.

This is the head of my test.lst file (After replacing the spaces):
`test1 /data/ahnaf/speech_dataset/our_dataset/part1/wav1/a0/TNF01_oishi_p_1_S_1-01.wav 12390.0 ▁০১কথোপকথন▁তুই▁নাকি▁আরব▁দেশে▁যাচ্ছিস▁বাদশা▁কিছু▁বলে▁না▁বোন▁তার▁হাত▁ধরে▁আবার▁বলে▁সত্যি▁বাদশাকে▁ভাত▁বেড়ে▁দিয়ে▁বলে▁তলে▁তলে▁কখন▁তুই▁ঠিক▁করলি
test2 /data/ahnaf/speech_dataset/our_dataset/part1/wav1/a1/TNF01_OISHI_p_2_10_S_1-09.wav 13470.0 ▁গত▁মঙ্গলবার▁বিকেলে▁সোনালি▁রোদ্দুরে▁একঝাঁক▁মেয়ের▁একাগ্র▁অনুশীলন▁দেখেই▁বোঝা▁যায়▁এটাই▁সেই▁মাঠ▁যেখানে▁ফুটবলার▁হওয়ার▁প্রথম▁দীক্ষা▁পেয়েছে▁তহুরা▁মারিয়া▁শামসুন্নাহার▁মার্জিয়ারা

test3 /data/ahnaf/speech_dataset/our_dataset/part1/wav1/a1/TNF01_OISHI_p_2_10_S_1-30.wav 6560.0 ▁নাম▁প্রকাশে▁অনিচ্ছুক▁এক▁শিক্ষক▁বললেন▁সবাই▁সুনামের▁ভাগীদার▁হতে▁চায়▁এ▁থেকেই▁তৈরি▁হয়েছে▁বিভেদ▁দ্বন্দ্ব
test4 /data/ahnaf/speech_dataset/our_dataset/part1/wav1/a1/TNF01_OISHI_p_2_10_S_1-31.wav 10880.0 ▁এই▁দ্বন্দ্বে▁যে▁ফুটবলাররা▁ক্ষতিগ্রস্ত▁হচ্ছে▁পিছিয়ে▁পড়ছে▁কলসিন্দুরের▁ফুটবল▁সেটির▁প্রমাণ▁গত▁দুই▁বছরে▁বঙ্গমাতা▁ফজিলাতুন্নেছা▁গোল্ড▁কাপে▁তাদের▁ফলাফল▁ভালো▁ছিল▁না`

This is the test.lst.viterbi.ref file:
<unk> (test6080)
<unk> (test5066)
<unk> (test2505)
<unk> (test5276)

From the terminal (test.lst file has spaces in this case):
|T|:▁ক ালো ▁র ঙ ের ▁সাথে ▁ম ের ুন স হ ▁বিভিন্ন ▁র ঙ ের ▁ড িজ াই নের ▁সাথে ▁ ঐ শ ্ব র িয়ার ▁স াজ ▁স জ ্জ া টি ▁স াজ ানো ▁ছিল ▁ক ান ▁রা ণ ী ▁সহ জ েই ▁রে ড ▁কার ্প ে টে
|P|:▁ক ালো ▁র ঙ ের ▁সাথে ▁ব ের ুন স হ ▁বিভিন্ন ▁র ঙ ের ▁ড িজ াই নের ▁সাথে ▁ ঐ শ ্ব র িয়ার ▁স াজ ▁স জ ্জ া টি ▁স াজ ানো ▁ছিল ▁ক ান ▁রা ণ ী ▁সহ জে ▁রে ড ▁কার ্প ে টে
It shows 100% WER but I see most of the words are correctly predicted.
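
To illustrate how the score can be 100% WER while the pieces look almost identical, here is a minimal sketch. It is a toy illustration of the effect, not wav2letter's scoring code: the helper names and the English word pieces are made up. It rebuilds words from word pieces by treating "▁" as the word boundary, then compares that with scoring the whole utterance as a single "word".

```python
SEP = "\u2581"  # the "▁" word separator produced by SentencePiece


def pieces_to_words(pieces, sep=SEP):
    """Join word pieces and split on the separator to recover words."""
    text = "".join(pieces)
    return [w for w in text.split(sep) if w]


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def wer(ref_words, hyp_words):
    """Word error rate: edit distance divided by the reference length."""
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)


# Toy target and prediction (hypothetical data): one piece differs.
ref_pieces = f"{SEP}the {SEP}c at {SEP}s at".split()
hyp_pieces = f"{SEP}the {SEP}b at {SEP}s at".split()

# Separator handled: one of three words is wrong.
sep_aware = wer(pieces_to_words(ref_pieces), pieces_to_words(hyp_pieces))

# Separator ignored: the utterance is one long "word", so a single
# wrong piece makes the whole utterance count as an error.
naive = wer(["".join(ref_pieces)], ["".join(hyp_pieces)])

print(sep_aware, naive)
```

In this toy case the separator-aware score is one error in three words, while the naive score is a full error for the utterance, which is the same pattern as a low CER next to a WER near 100%.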


@samin9796 Did you also set the flag --usewordpiece=true?

@tlikhomanenko For training? No, I didn't. Do I need to train again with this flag?

When using word pieces you need to set wordseparator as well as usewordpiece=true. Could you first try running test.cpp with usewordpiece=true, and maybe retrain with it too?

@tlikhomanenko Yes, I just tested with usewordpiece=true and got much better WER. I will retrain the model though.
Well, I wanted to see how it performs with OOV words. Could you please tell me how to train a lexicon-free AM? Are any changes to the flags needed?

The thing is that these flags just specify how to process _. Your lexicon defines how words are mapped. So just prepare the mapping of all words into token sequences; in the case of word pieces, you only need the flags I pointed out. Feel free to check our configs from the Librispeech SOTA models: https://github.com/facebookresearch/wav2letter/tree/v0.2/recipes/models/sota/2019
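
As a rough sketch of what preparing the mapping of words into token sequences can look like, the snippet below builds lexicon lines with the word, a tab, and then its space-separated tokens. The greedy longest-match segmenter and the toy token set are stand-ins invented for illustration; in practice the segmentation would come from your SentencePiece model. The word-tab-tokens layout is an assumption about the lexicon format, so compare it with the lexicon files in the recipe before relying on it.

```python
SEP = "\u2581"  # the "▁" word separator produced by SentencePiece


def segment(word, tokens):
    """Greedy longest-match split of SEP+word into known tokens (toy stand-in)."""
    text = SEP + word
    pieces, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in tokens:
                pieces.append(text[start:end])
                start = end
                break
        else:
            # No known token starts at this position.
            raise ValueError(f"cannot segment {word!r} with the given tokens")
    return pieces


def lexicon_lines(words, tokens):
    """One line per word: the word, a tab, then its space-separated tokens."""
    return [f"{w}\t{' '.join(segment(w, tokens))}" for w in words]


# Hypothetical token set and word list.
toy_tokens = {SEP + "the", SEP + "c", SEP + "s", "at"}
lines = lexicon_lines(["the", "cat", "sat"], toy_tokens)
print("\n".join(lines))
```

Every word that should be scored at the word level needs a line like this, and every token on the right-hand side has to exist in the tokens file.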

Closing the issue for now; feel free to reopen if you have problems after retraining.

@tlikhomanenko
Hi! I would like to ask you a quick question: when you trained with 10,000 word pieces, how many input channels did the last linear layer have? I mean, do I need more input channels than output channels (the number of output tokens) in the last FC layer?

We use NLABEL so that the linear layer maps the embedding to the number of tokens in use, as here https://github.com/facebookresearch/wav2letter/blob/v0.2/recipes/models/sota/2019/am_arch/am_tds_ctc.arch#L38 (if I understood your question correctly). You can see that we had 1140, which we map to the 10k. Often you have a bottleneck layer, as in vision and NLP, where you map, say, a 512-dimensional vector into nTokens, which can be several hundred thousand.
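
To make the point about the last layer concrete: a linear layer from d inputs to n outputs has d·n weights plus n biases, and nothing requires d to be larger than n. The sketch below uses plain Python lists and made-up function names. The sizes are illustrative only (512 from the bottleneck example, 10,000 from the question) and are not taken from the actual architecture file.

```python
def linear_param_count(in_features, out_features, bias=True):
    """Weights plus optional biases of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def linear_forward(x, weight, bias):
    """y = W x + b on plain lists; W has shape (out_features, in_features)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


# A 2-dimensional embedding projected onto 5 output tokens:
# fewer input channels than output channels works fine.
weight = [[1, 0], [0, 1], [1, 1], [2, 0], [0, 2]]
bias = [0, 0, 0, 0, 0]
scores = linear_forward([3, 4], weight, bias)
print(scores)

# Illustrative sizes: a 512-dimensional vector mapped to 10,000 tokens.
n_params = linear_param_count(512, 10_000)
print(n_params)
```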

@tlikhomanenko
I retrained a model using word pieces. During training I see 15% WER on the validation set, but when I run test.cpp on the same validation set I get 23% WER. Why does the WER increase so much at test time?

Can you show how you are running Test.cpp? (You should set --uselexicon=false at least.)

I didn't use that flag. Now getting better WER. Thank you!

Can you send your training command + flags too? Also, what do you see in the output of test.cpp for the predictions vs. the target?

I see that your tokens flag for test.cpp is --tokens=v_3000_token.txt, while for training it is --tokens=v_1000_token.txt.
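
A mismatch like this is easy to miss by eye. As a small sketch, assuming both configs are plain text with one --flag=value per line, the two sets of flags can be diffed. The two --tokens values are the ones quoted in this thread; the helper names and the remaining flag lines are only there for illustration.

```python
def parse_flags(text):
    """Parse lines of the form --name=value (or bare --name) into a dict."""
    flags = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("--"):
            continue
        name, eq, value = line[2:].partition("=")
        # A bare boolean flag such as --show is treated as "true" here.
        flags[name] = value if eq else "true"
    return flags


def diff_flags(a, b):
    """Return {name: (value_in_a, value_in_b)} for shared flags that differ."""
    return {k: (a[k], b[k]) for k in sorted(a.keys() & b.keys()) if a[k] != b[k]}


train_cfg = """
--tokens=v_1000_token.txt
--usewordpiece=true
"""
test_cfg = """
--tokens=v_3000_token.txt
--usewordpiece=true
--show
"""

mismatches = diff_flags(parse_flags(train_cfg), parse_flags(test_cfg))
print(mismatches)
```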

Well, it is not a problem. I copied it from another train.cfg file (all the flags are exactly the same). I double-checked that it is not a problem.

Could you try running your 3k-token model with test.cpp using only these flags (the remaining flags will be loaded from the model itself)?

--am=/.../002_model_validation_updated.lst.bin
--datadir=/data/ahnaf/wav2letter/dataset_prep/
--test=validation_updated.lst
--maxload=-1
--show
--uselexicon=false

I have tried it now and got the following error:
what(): Unknown index in dictionary '2999'

@tlikhomanenko
I somehow managed to get it right. At first, I didn't specify the path of the lexicon file when running test.cpp, and I was getting the unknown index in dictionary error. So I mistakenly added that token to the token set. Now that I have removed the token from the token set and specified the lexicon file while testing, things are going right.
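
For anyone hitting a similar error, one cheap sanity check is to confirm that every token used in the lexicon appears in the token set, and to look at how many entries the token set has. Adding or removing a token changes that count, so it may no longer match what the model was trained with. This is a sketch with hypothetical helper names and toy data, not part of wav2letter.

```python
def load_tokens(text):
    """One token per line; blank lines are ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def missing_lexicon_tokens(lexicon_text, tokens):
    """Map each lexicon word to the spelling tokens missing from the token set."""
    known = set(tokens)
    missing = {}
    for line in lexicon_text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        word, spelling = parts[0], parts[1:]
        unknown = [t for t in spelling if t not in known]
        if unknown:
            missing[word] = unknown
    return missing


# Toy token set and lexicon (hypothetical data).
tokens = load_tokens("\u2581the\n\u2581c\n\u2581s\nat\n")
lexicon = "the\t\u2581the\ncat\t\u2581c at\ndog\t\u2581d og\n"

problems = missing_lexicon_tokens(lexicon, tokens)
print(len(tokens), problems)
```

An empty result means the lexicon and the token set agree; any entry in it points at a word whose spelling uses a token the model does not know.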

Glad to hear this!
