Wav2letter: Custom data preparation/usage error

Created on 18 Sep 2020 · 2 Comments · Source: flashlight/wav2letter

I'm trying to train the conv_glu architecture on my own data. I followed the data preparation instructions in the wiki. When I try to train the model, I get the following error:

I0918 13:02:43.208295 28916 Train.cpp:152] Experiment path: /home/jupyter/wav2letter_model_rundir/nl_cv_conv_glu
I0918 13:02:43.208299 28916 Train.cpp:153] Experiment runidx: 1
I0918 13:02:43.208792 28916 Train.cpp:199] Number of classes (network): 42
terminate called after throwing an instance of 'std::runtime_error'
  what():  [loadWords] Invalid line:     |
*** Aborted at 1600434163 (unix time) try "date -d @1600434163" if you are using GNU date ***
PC: @     0x7fbbd8ad4fff gsignal
*** SIGABRT (@0x3e9000070f4) received by PID 28916 (TID 0x7fbc3e48c980) from PID 28916; stack trace: ***
    @     0x7fbbd9f580e0 (unknown)
    @     0x7fbbd8ad4fff gsignal
    @     0x7fbbd8ad642a abort
    @     0x7fbbff1ce045 __gnu_cxx::__verbose_terminate_handler()
    @     0x7fbbff13e276 __cxxabiv1::__terminate()
    @     0x7fbbff13e2c1 std::terminate()
    @     0x7fbbff132023 __cxa_throw
    @     0x55cfa5f2d66d w2l::loadWords()
    @     0x55cfa5db911c main
    @     0x7fbbd8ac22e1 __libc_start_main
    @     0x55cfa5e20b7a _start
    @                0x0 (unknown)
Aborted

Why is '|' being shown as an invalid line? I have attached both the token file and the lexicon file here for your reference.

Lexicon File: Lexicon File
Token File: Token File

question

nikhilnagaraj

Most helpful comment

Hi @nikhilnagaraj, it looks like you have an empty entry in your lexicon file:

padentomasello@padentomasello-mbp Downloads % sort lexicon.txt | head
     |
a   a |
aalst   a a l s t |
aan a a n |
aanbeland   a a n b e l a n d |
aanbesteding    a a n b e s t e d i n g |
aanbeveling a a n b e v e l i n g |
aanbevelingen   a a n b e v e l i n g e n |
aanbevolen  a a n b e v o l e n |
aanbieden   a a n b i e d e n |
padentomasello@padentomasello-mbp Downloads %

| represents the word separator, so the first entry has no word and no spelling. Can you try removing this line and rerunning?

FWIW, I can see why it isn't obvious that empty words shouldn't be in the lexicon. We assume that any training run will contain silences, and they are sometimes handled differently, so we don't list them in the lexicon, which is only for word spellings.
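
If it helps, here is a minimal Python sketch for finding entries like this before training. It assumes a valid lexicon entry is a word followed by at least one spelling token, separated by whitespace; that is an inference from the error above, not a statement of what loadWords actually checks. The function name find_invalid_lexicon_lines is hypothetical and not part of wav2letter.

```python
def find_invalid_lexicon_lines(lines):
    """Return (line_number, text) pairs for entries lacking a word or a spelling.

    Assumption: a valid entry has at least two whitespace-separated fields,
    the word itself followed by one or more spelling tokens. Blank lines are
    flagged as well, since they also contain no word.
    """
    invalid = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) < 2:
            invalid.append((number, line.rstrip("\n")))
    return invalid


# The first entries from the sorted lexicon shown above.
sample = ["     |\n", "a   a |\n", "aan a a n |\n"]
print(find_invalid_lexicon_lines(sample))  # [(1, '     |')]
```

To check a real file, pass the open file object instead of the sample list, for example `find_invalid_lexicon_lines(open("lexicon.txt", encoding="utf-8"))`.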

Let me know if that works.

padentomasello on 18 Sep 2020

Hi @padentomasello, thanks for the solution! It works after I removed the empty entry. Thanks for the tip about empty words in the lexicon!

nikhilnagaraj on 21 Sep 2020
