I don't understand why I am getting errors opening tokens.txt when I have specified the file path correctly in train.cfg. Maybe it is due to my encoding (UTF-8)?
My tokens.txt is encoded in UTF-8 and its content is a set of Korean tokens.
Path to my tokens.txt:
/home/wav2letter/data/processed_data/wav2letter/korean/tokens.txt
How I specified path to tokens.txt in my train.cfg:
--tokensdir=/home/wav2letter/data/processed_data/wav2letter/korean
--tokens=tokens.txt
My stack trace is
F0225 03:04:26.933456 499 Utils.cpp:237] Unable to open dictionary file 'tokens.txt'
*** Check failure stack trace: ***
@ 0x7fa8f766e5cd google::LogMessage::Fail()
@ 0x7fa8f7670433 google::LogMessage::SendToLog()
@ 0x7fa8f766e15b google::LogMessage::Flush()
@ 0x7fa8f7670e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x52e5a6 w2l::createTokenDict()
@ 0x52e689 w2l::createTokenDict()
@ 0x417d38 main
@ 0x7fa8a2d0b830 __libc_start_main
@ 0x4656f9 _start
@ (nil) (unknown)
Aborted (core dumped)
I had a similar error before when working with French. I'm not sure it's the same case, but my problem was that I created the tokens.txt file in a Python script without specifying UTF-8 as the encoding. Fixing that alone did not resolve the error; another issue was that I executed the script inside the provided Docker image, which does not handle UTF-8 encoding correctly. I solved it by installing the locales in the Docker image (https://stackoverflow.com/questions/28405902/how-to-set-the-locale-inside-a-ubuntu-docker-container) and then regenerating all the text files. Everything worked smoothly after that.
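For reference, here is a minimal sketch of writing the tokens file with an explicit encoding. Python's open() falls back to the locale's preferred encoding when none is given, which is why a container without a UTF-8 locale can produce a badly encoded file. The helper name write_tokens and the sample tokens are only illustrative, not something from wav2letter.

```python
import locale


def write_tokens(tokens, path):
    """Write one token per line, always as UTF-8 regardless of the system locale."""
    # newline="\n" keeps Unix line endings even if the script runs on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for token in tokens:
            f.write(token + "\n")


def default_text_encoding():
    """Return the encoding open() would use if none were specified."""
    return locale.getpreferredencoding(False)
```

Calling write_tokens(["가", "나", "|"], "tokens.txt") would regenerate the file, and default_text_encoding() shows what the environment would have used implicitly; inside a container without a configured locale this is often not UTF-8.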
Hope this helps.
Thanks a lot!
So I need to do two things: 1. change the setting to UTF-8, and 2. don't use Docker.
How do I do (1)?
Once the file is properly encoded as UTF-8 and the system where you run the training handles UTF-8 well (Docker or otherwise), it should work; it worked well for me with French characters. Also check that the file is in the right directory; maybe you just misspecified the tokens.txt location in the config file.
Thanks a lot :) I'll try tomorrow and get back to you.
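A quick way to run both checks before training could look like the sketch below. It assumes the trainer looks for the file at the --tokensdir value joined with the --tokens value; the function name check_tokens_file is hypothetical and not part of wav2letter.

```python
import os


def check_tokens_file(tokensdir, tokens):
    """Return (full_path, status) for the tokens file named by the two flags."""
    # Assumption: the effective path is tokensdir joined with tokens.
    path = os.path.join(tokensdir, tokens)
    if not os.path.isfile(path):
        return path, "missing"
    try:
        # Read raw bytes and decode explicitly so the locale cannot interfere.
        with open(path, "rb") as f:
            f.read().decode("utf-8")
    except UnicodeDecodeError:
        return path, "not valid UTF-8"
    return path, "ok"
```

For the setup in this thread, the call would be check_tokens_file("/home/wav2letter/data/processed_data/wav2letter/korean", "tokens.txt"), run in the same environment (for example the same container) that runs the training, since a path that exists on the host may not be mounted inside it.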
Training with a UTF-8-encoded tokens file is and has been supported. Closing for now.