Wav2letter: Tokenization related problem

Created on 24 Aug 2020 · 18 Comments · Source: flashlight/wav2letter

I used the SentencePiece toolkit for tokenization. After encoding, every space was replaced with the special token "▁". Although I set the word separator to '▁' for training, I get 99% WER and 10% CER. Reading the predictions, the words themselves are predicted correctly, but there are issues with the spaces. I found that the test.lst.viterbi.ref file contains spaces, so I changed the test.lst file to replace the spaces with "▁". However, every sample in the test.lst.viterbi.ref file is now unknown.

This is the head of my test.lst file (After replacing the spaces):
`test1 /data/ahnaf/speech_dataset/our_dataset/part1/wav1/a0/TNF01_oishi_p_1_S_1-01.wav 12390.0 ▁০১কথোপকথন▁তুই▁নাকি▁আরব▁দেশে▁যাচ্ছিস▁বাদশা▁কিছু▁বলে▁না▁বোন▁তার▁হাত▁ধরে▁আবার▁বলে▁সত্যি▁বাদশাকে▁ভাত▁বেড়ে▁দিয়ে▁বলে▁তলে▁তলে▁কখন▁তুই▁ঠিক▁করলি
test2 /data/ahnaf/speech_dataset/our_dataset/part1/wav1/a1/TNF01_OISHI_p_2_10_S_1-09.wav 13470.0 ▁গত▁মঙ্গলবার▁বিকেলে▁সোনালি▁রোদ্দুরে▁একঝাঁক▁মেয়ের▁একাগ্র▁অনুশীলন▁দেখেই▁বোঝা▁যায়▁এটাই▁সেই▁মাঠ▁যেখানে▁ফুটবলার▁হওয়ার▁প্রথম▁দীক্ষা▁পেয়েছে▁তহুরা▁মারিয়া▁শামসুন্নাহার▁মার্জিয়ারা

test3 /data/ahnaf/speech_dataset/our_dataset/part1/wav1/a1/TNF01_OISHI_p_2_10_S_1-30.wav 6560.0 ▁নাম▁প্রকাশে▁অনিচ্ছুক▁এক▁শিক্ষক▁বললেন▁সবাই▁সুনামের▁ভাগীদার▁হতে▁চায়▁এ▁থেকেই▁তৈরি▁হয়েছে▁বিভেদ▁দ্বন্দ্ব
test4 /data/ahnaf/speech_dataset/our_dataset/part1/wav1/a1/TNF01_OISHI_p_2_10_S_1-31.wav 10880.0 ▁এই▁দ্বন্দ্বে▁যে▁ফুটবলাররা▁ক্ষতিগ্রস্ত▁হচ্ছে▁পিছিয়ে▁পড়ছে▁কলসিন্দুরের▁ফুটবল▁সেটির▁প্রমাণ▁গত▁দুই▁বছরে▁বঙ্গমাতা▁ফজিলাতুন্নেছা▁গোল্ড▁কাপে▁তাদের▁ফলাফল▁ভালো▁ছিল▁না`

This is the test.lst.viterbi.ref file:
<unk> (test6080)
<unk> (test5066)
<unk> (test2505)
<unk> (test5276)

From the terminal (test.lst file has spaces in this case):
|T|:▁ক ালো ▁র ঙ ের ▁সাথে ▁ম ের ুন স হ ▁বিভিন্ন ▁র ঙ ের ▁ড িজ াই নের ▁সাথে ▁ ঐ শ ্ব র িয়ার ▁স াজ ▁স জ ্জ া টি ▁স াজ ানো ▁ছিল ▁ক ান ▁রা ণ ী ▁সহ জ েই ▁রে ড ▁কার ্প ে টে
|P|:▁ক ালো ▁র ঙ ের ▁সাথে ▁ব ের ুন স হ ▁বিভিন্ন ▁র ঙ ের ▁ড িজ াই নের ▁সাথে ▁ ঐ শ ্ব র িয়ার ▁স াজ ▁স জ ্জ া টি ▁স াজ ানো ▁ছিল ▁ক ান ▁রা ণ ী ▁সহ জে ▁রে ড ▁কার ্প ে টে
It shows 100% WER but I see most of the words are correctly predicted.
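
One plausible reading of the |T| and |P| lines above is that the scorer is not treating "▁" as a word boundary, so each utterance is compared as a single unit. The sketch below is not wav2letter's scoring code; every function name in it is hypothetical, and it uses a toy Latin-script example. It only illustrates how the same prediction can score about 33% WER when words are rebuilt from the separator, and 100% when they are not.

```python
# Illustrative only: not wav2letter's scoring code. All names are hypothetical.
SEP = "▁"  # SentencePiece word-boundary marker, as used in this thread


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pieces_to_words(pieces, sep=SEP):
    """Concatenate word pieces, then split on the separator to recover words."""
    return [w for w in "".join(pieces).split(sep) if w]


def wer(ref_words, hyp_words):
    """Word error rate: edit distance divided by the reference length."""
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)


target = [SEP + "the", SEP + "c", "at", SEP + "s", "at"]
predicted = [SEP + "the", SEP + "c", "at", SEP + "m", "at"]

# Separator-aware scoring: one of three words is wrong.
wer_aware = wer(pieces_to_words(target), pieces_to_words(predicted))

# Separator-unaware scoring: the utterance contains no ordinary spaces,
# so it counts as one "word" and a single wrong piece makes it all wrong.
wer_naive = wer(["".join(target)], ["".join(predicted)])

print(f"aware={wer_aware:.2f} naive={wer_naive:.2f}")
```

The character-level error stays small in both cases, which matches the low CER and very high WER reported in the question.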

question

samin9796

All 18 comments

@samin9796 Did you set also flag --usewordpiece=true?

tlikhomanenko on 24 Aug 2020

@tlikhomanenko For training? No, I didn't. Do I need to train again with this flag?

samin9796 on 24 Aug 2020

When using word pieces, you need to set wordseparator as well as usewordpiece=true. Could you first try running test.cpp with usewordpiece=true, and maybe retrain with it too?

tlikhomanenko on 24 Aug 2020

@tlikhomanenko Yes, I just tested with usewordpiece=true and got much better WER. I will retrain the model though.
Well, I wanted to see how it performs with OOV words. Could you please tell me how to train a lexicon-free AM? Are any changes to the flags needed?

samin9796 on 24 Aug 2020

These flags only specify how to process _. Your lexicon defines how words are mapped, so just prepare the mapping of every word into its token sequence. For word pieces, the only flags needed are the ones I pointed out. Feel free to check our configs from the Librispeech SOTA models: https://github.com/facebookresearch/wav2letter/tree/v0.2/recipes/models/sota/2019

tlikhomanenko on 25 Aug 2020
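
The word-to-token mapping described in the comment above can be sketched as follows. This is a minimal illustration, not the recipe's own tooling: a toy greedy longest-match splitter stands in for a trained SentencePiece model, the vocabulary is invented, and the line layout (word, tab, space-separated pieces) is an assumption that should be checked against the lexicon files in the linked recipes.

```python
# Sketch only: greedy_pieces is a hypothetical stand-in for a trained
# SentencePiece model, and the vocabulary below is made up.
SEP = "▁"  # word-boundary marker


def greedy_pieces(word, vocab):
    """Split SEP+word into the longest vocabulary pieces, left to right."""
    text = SEP + word
    pieces, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in vocab:
                pieces.append(text[start:end])
                start = end
                break
        else:
            # No vocabulary piece matches at this position.
            raise ValueError(f"cannot tokenize {word!r} at position {start}")
    return pieces


def lexicon_lines(words, vocab):
    """One line per unique word: the word, a tab, then its pieces."""
    return [f"{w}\t{' '.join(greedy_pieces(w, vocab))}"
            for w in sorted(set(words))]


vocab = {SEP + "the", SEP + "c", SEP + "s", "at"}
lines = lexicon_lines(["the", "cat", "sat", "cat"], vocab)
print("\n".join(lines))
```

In practice the pieces for each word would come from the same tokenizer model that produced the token file, so that every piece in the lexicon also exists in the token set.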

Closing the issue for now; feel free to reopen it if you have problems after retraining.

tlikhomanenko on 25 Aug 2020

@tlikhomanenko
Hi! I would like to ask a quick question: when you trained with 10,000 word pieces, what was the number of input channels of the last linear layer? I mean, do I need more input channels than output channels (the number of output tokens) in the last FC layer?

samin9796 on 26 Aug 2020

We use NLABEL so that the final linear layer maps the embedding to the number of tokens in use, like here https://github.com/facebookresearch/wav2letter/blob/v0.2/recipes/models/sota/2019/am_arch/am_tds_ctc.arch#L38 (if I understood your question correctly). You can see that we had 1140, which we map to the 10k. A bottleneck layer like this is common, as in vision and NLP, where you map, say, a 512-dimensional vector into nTokens, which can be several hundred thousand.

tlikhomanenko on 26 Aug 2020
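
As a rough illustration of the point above, the size of such an output layer follows directly from the two dimensions, and nothing requires the input width to exceed the number of tokens. The helper below is hypothetical and assumes a standard fully connected layer with a bias term; 1140 and 10,000 are the figures quoted in the comment.

```python
# Back-of-the-envelope sizing for a fully connected output layer.
# linear_params is a hypothetical helper; a bias term is assumed.
def linear_params(in_features, out_features, bias=True):
    """Number of trainable parameters: weight matrix plus optional bias."""
    return in_features * out_features + (out_features if bias else 0)


params = linear_params(1140, 10_000)
print(params)
```

With these figures the layer holds 11,410,000 parameters, and the input width (1140) is far smaller than the output width (10,000).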

@tlikhomanenko
I retrained a model using word pieces. During training I see 15% WER on the validation set, but when I run test.cpp on the same validation set, I get 23% WER. Why does WER increase so much at test time?

samin9796 on 26 Aug 2020

Can you show how you are running Test.cpp? (You should set --uselexicon=false at least.)

tlikhomanenko on 27 Aug 2020

I didn't use that flag. Now getting better WER. Thank you!

samin9796 on 27 Aug 2020

Can you send your training command + flags too? Also, what do you see in the output of test.cpp for the predictions vs. the target?

tlikhomanenko on 28 Aug 2020

I see that your tokens flag in test.cpp is --tokens=v_3000_token.txt, while in train it is --tokens=v_1000_token.txt

tlikhomanenko on 28 Aug 2020

Well, that is not the problem. I copied it from another train.cfg file (all the flags are exactly the same). I double-checked that it is not the problem.

samin9796 on 28 Aug 2020

Could you try running your 3k-token model with test.cpp using only these flags (the remaining flags will be loaded from the model itself)?

--am=/.../002_model_validation_updated.lst.bin
--datadir=/data/ahnaf/wav2letter/dataset_prep/
--test=validation_updated.lst
--maxload=-1
--show
--uselexicon=false

tlikhomanenko on 28 Aug 2020

I have tried it now and got the following error:
what(): Unknown index in dictionary '2999'

samin9796 on 28 Aug 2020

@tlikhomanenko
I somehow managed to get it right. At first, I didn't specify the path to the lexicon file when running test.cpp, and I was getting the unknown-index-in-dictionary error, so I mistakenly added that token to the token set. Now that I have removed the token from the token set and specified the lexicon file for testing, things are working correctly.

samin9796 on 28 Aug 2020
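
A mismatch between the token set and the lexicon, as in the fix described above, can be caught before running a test with a small consistency check. The function below is a hypothetical pre-flight helper, not part of wav2letter. It assumes each lexicon line is a word followed by whitespace-separated tokens, and it uses a toy token list.

```python
# Hypothetical pre-flight check; not part of wav2letter.
SEP = "▁"  # word-boundary marker


def missing_tokens(lexicon_lines, tokens):
    """Return lexicon tokens that do not appear in the token set."""
    known = set(tokens)
    missing = set()
    for line in lexicon_lines:
        parts = line.split()
        # parts[0] is the word; the rest is its token spelling.
        missing.update(t for t in parts[1:] if t not in known)
    return sorted(missing)


tokens = [SEP + "the", SEP + "c", "at"]
lexicon = [
    "the " + SEP + "the",
    "cat " + SEP + "c at",
    "sat " + SEP + "s at",  # SEP + "s" is deliberately absent from tokens
]
print(missing_tokens(lexicon, tokens))
```

An empty result means every piece used by the lexicon is present in the token file; any entries that are returned point at the lines to regenerate.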

Glad to hear this!

tlikhomanenko on 29 Aug 2020
