Tesseract: LSTM: Training - Can't generate ara.lstm for Arabic Language

Created on 23 May 2017  ·  39 Comments  ·  Source: tesseract-ocr/tesseract

Hi guys,

I am new to Tesseract. I followed the Tesseract 3.xx guide and was able to generate ara.traineddata for Arabic, but I later learned that there is no point in training the 3.xx engine any further, so I moved to v4.00 alpha, which is currently the latest version. I am now facing some issues while training.

The first issue is that the training instructions only give Linux commands, with no Windows alternatives. I managed to run some of them, but I have hit a problem that stops me from training any further.

Command I am using:
..\lstmtraining -U ara.unicharset --script_dir ..\langdata --net_spec "[1,36,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]" --model_output ..\training --debug_interval 0 --train_listfile ara.training_text.txt

Here lstmtraining.exe is in the parent directory and I have the ara.unicharset file. The other options are the same as in the sample command given here: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00. I also have ara.training_text.txt, which contains the Arabic text (the same file was used to generate the box and tiff files), but now I am getting the following errors:

" C:\Program Files (x86)\Tesseract-OCR\training>..\lstmtraining -U ara.unicharset --script_dir ..\langdata --net_spec "[1,36,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]" --model_output ..\training --debug_interval 0 --train_listfile ara.training_text.txt
Other case u of U is not in unicharset
Other case n of N is not in unicharset
Other case e of E is not in unicharset
Other case h of H is not in unicharset
Other case C of c is not in unicharset
Other case T of t is not in unicharset
Other case S of s is not in unicharset
Other case V of v is not in unicharset
Other case D of d is not in unicharset
Other case W of w is not in unicharset
Setting unichar properties
Warning: given outputs 105 not equal to unicharset of 96.
Num outputs,weights in serial:
1,36,0,1:1, 0
Num outputs,weights in serial:
C5,5:25, 0
Ft16:16, 416
Total weights = 416
Mp3,3:16, 0
Lfys64:64, 20736
Lfx128:128, 98816
Lrx128:128, 131584
Lfx256:256, 394240
Fc96:96, 24672
Total weights = 670464
Built network:[1,36,0,1[C5,5Ft16]Mp3,3Lfys64Lfx128Lrx128Lfx256Fc96] from request [1,36,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]
Training parameters:
Debug interval = 0, weights = 0.1, learning rate = 0.0001, momentum=0.9
Deserialize header failed: [contents of ara.training_text.txt, rendered as mojibake by the console; omitted]
Load of page 0 failed!
Load of images failed!! "

Ignore the random text; I think it comes from the text file I am using, since I copied this output from the command line. The text file I am using is attached, along with the tiff and box files.

Need help as soon as possible.

Thank You

ara.training_text.txt
ara.nicidprint.exp0.zip
ara.nicidprint.exp0.box.zip


All 39 comments

  1. You can install Git for Windows with the Bash option; you will then be
    able to run the tesstrain.sh script under Windows.

  2. Your lstmtraining command is incorrect:

--train_listfile ara.training_text.txt

The list file must contain a list of .lstmf files, which tesseract creates from the
box/tiff files using the lstm.train config file.

See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune
and follow the tutorial.

For item 1, download Git for Windows from

https://git-scm.com/downloads

and install it, including the option for Bash.


Thank You so much, I will try and will update you.

Hi,

I am still facing issues.

Now I used this command:

../train/lstmtraining -U ara.unicharset --script_dir ../langdata --debug_interval 100 --net_spec '[1,36,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]' --model_output ../ara_train --train_listfile ara.nicidprint.exp0.lstmf --max_iterations 5000 &>../ara_train/basetrain.log

and the result is:

Other case u of U is not in unicharset
Other case n of N is not in unicharset
Other case e of E is not in unicharset
Other case h of H is not in unicharset
Other case C of c is not in unicharset
Other case T of t is not in unicharset
Other case S of s is not in unicharset
Other case V of v is not in unicharset
Other case D of d is not in unicharset
Other case W of w is not in unicharset
Setting unichar properties
Warning: given outputs 105 not equal to unicharset of 96.
Num outputs,weights in serial:
1,36,0,1:1, 0
Num outputs,weights in serial:
C5,5:25, 0
Ft16:16, 416
Total weights = 416
Mp3,3:16, 0
Lfys64:64, 20736
Lfx128:128, 98816
Lrx128:128, 131584
Lfx256:256, 394240
Fc96:96, 24672
Total weights = 670464
Built network:[1,36,0,1[C5,5Ft16]Mp3,3Lfys64Lfx128Lrx128Lfx256Fc96] from request [1,36,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]
Training parameters:
Debug interval = 100, weights = 0.1, learning rate = 0.0001, momentum=0.9
Deserialize header failed: 8
Load of page 0 failed!
Load of images failed!!
Deserialize header failed: [raw binary contents of the .lstmf file, printed repeatedly; omitted]

Please help.

Thank You

--train_listfile ara.nicidprint.exp0.lstmf

is incorrect. --train_listfile expects a text file that contains the .lstmf filename, not the .lstmf file itself.

See the attached zip file for a sample.
ara.zip

EDIT: The ara.zip file is from May 2017. Ray changed the training method around July/August 2017, so files from that archive will no longer work. Please see the wiki page on training for the latest info.
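
A minimal sketch of building such a list file, assuming all the .lstmf files sit in one directory. The helper name make_listfile and the output name ara.training_files.txt are illustrative choices, not part of Tesseract.

```shell
#!/bin/sh
# make_listfile DIR OUT: write the path of every .lstmf file in DIR to OUT,
# one per line. (Hypothetical helper, not a Tesseract tool.)
make_listfile() {
  dir=$1
  out=$2
  : > "$out"                      # start with an empty list file
  for f in "$dir"/*.lstmf; do
    [ -e "$f" ] || continue       # skip the literal pattern when nothing matches
    printf '%s\n' "$f" >> "$out"
  done
}
```

For example, after `make_listfile . ara.training_files.txt`, the training command would take `--train_listfile ara.training_files.txt` rather than the name of the .lstmf file.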

Thank You soooooo much. It worked and now it is running. Once it is finished then I will let you know.

Thanks again friend.

@LatifWirelessMarketer Which fonts did you add?

I am using Nic Id Print; it's a custom font created for some specific data.

@LatifWirelessMarketer
OK, good.
Once your training has finished, tell us about the accuracy you reached.

Thank you

OK, sure, but I am still in the training phase, as I have only just started with v4.0. With the current data I may not reach much accuracy, so I will run it again with a larger amount of training text.

So will keep updating about accuracy improvements.

Thanks.

Btw, can anyone tell me how much text is required to get good accuracy? I've read in the documentation that the original training was done on more than 400000 lines of text. Is it necessary to have that much text for languages like English, Arabic, etc., or not?

Also, please let me know whether I can get the original text that was used to train the freely downloadable language models.

Thank You

Regarding Arabic, the number of lines required to produce a decent model depends on the complexity of the typeface. However, it has been shown that you can create an Arabic model that achieves a recognition rate above 90% using ~800 lines of training data.

Have a look at the research done by the OpenITI team, implemented using the Kraken engine:
1) The research paper

2) Kraken

@LatifWirelessMarketer Can you post your test results, such as recognition rate and time?
Waiting for your updates.

@christophered I was training on a huge text before, but it was a never-ending task, so now I am trying to train on a small text with different fonts such as Arial and Tahoma, using fewer than 30 text lines. The first run took more than 3 hours to get the error rate below 0.9. The problem is that, although I was able to generate the ara.lstm file, which is the final training file, I am now unable to generate ara.traineddata.

Command I used:
..\combine_tessdata -o ..\tessdata\ara.traineddata ara.lstm ara.unicharset ara.number-dawg ara.punc-dawg ara.word-dawg

and I am getting this error:
Failed to create a temporary file ara.traineddata.__tmp__

No matter which directory I choose for the traineddata, the error remains the same.

If anyone knows this problem and its solution, please help.

Thank You

@Shreeshrii

..\combine_tessdata -o ..\tessdata\ara.traineddata ara.lstm
ara.unicharset ara.number-dawg ara.punc-dawg ara.word-dawg

The -o option is for overwrite,

so it looks for an existing ara.traineddata in the specified directory, renames it to
a tmp version, and then creates the new one.

Check that you have an old ara.traineddata.
Check that you have write access to the directory.

You may have to remove old tmp files, if any.

Check the options of combine_tessdata.
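
The checks listed above can be sketched as a small shell function. This is only an illustration of that checklist; the helper name check_overwrite_target is made up, and the temp-file name is taken from the error message quoted earlier in the thread.

```shell
#!/bin/sh
# check_overwrite_target FILE: test the preconditions listed above before
# running "combine_tessdata -o FILE ...". (Hypothetical helper, not a
# Tesseract tool.)
check_overwrite_target() {
  target=$1
  dir=$(dirname "$target")
  if [ ! -d "$dir" ]; then echo "missing directory: $dir"; return 1; fi
  if [ ! -w "$dir" ]; then echo "no write access: $dir"; return 1; fi
  if [ ! -f "$target" ]; then echo "no existing file to overwrite: $target"; return 1; fi
  if [ -e "$target.__tmp__" ]; then echo "stale temp file: $target.__tmp__"; return 1; fi
  echo "ok: $target"
}
```

Running `check_overwrite_target ../tessdata/ara.traineddata` from a Bash shell prints the first failed precondition, or an "ok" line when none of them fails.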


@Shreeshrii I had already checked everything, and I checked again after you explained it. There are no files such as ara.traineddata or ara.traineddata.__tmp__ in any directory. I deleted all the previous ones and am now trying from scratch, but it still gives me the same error.

see attachment.

lstm

@Shreeshrii What percentage recognition rate were you able to get for the Arabic language?

@christophered My char error rate was 0.78, but how do I check the percentage recognition rate? I think we can only check it after generating the trained data, right?

@christophered I uninstalled the Tesseract engine after copying my files and reinstalled it; now it's working.

@Shreeshrii, one question, brother.

I want to know: if no ara.traineddata is already present in the folder, how can we create a new file? I tried several times, and the error I was getting yesterday occurred simply because no ara.traineddata file was present. If I copy any ara.traineddata from the internet into my tessdata folder, it combines the new trained data with that file.

For example, the size of the original data was 11.9 MB, but I used only 20 lines of text and my trained data came to 11.0 MB. How is that possible?

Both fine-tuning and replacing the top layer start training from the .lstm
file in the existing traineddata. Training builds on top of that, so the
files will be about the same size.

If you only want to make a traineddata from your own trained files, do not use

..\combine_tessdata -o ..\tessdata\ara.traineddata ara.lstm ara.unicharset
ara.number-dawg ara.punc-dawg ara.word-dawg

Instead, just combine the files you need:

..\combine_tessdata ara.

run in the directory where you have all the required files, i.e. ara.lstm ara.unicharset
ara.number-dawg ara.punc-dawg ara.word-dawg

combine_tessdata finds the ara.* files (the ones required for a traineddata) and
combines them.

Your file size should be smaller.
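
Before combining, it can help to confirm that every file named above is actually present under the prefix. The following is a sketch only: the helper name missing_components is made up, and the component list is simply the one given in this comment, not a definitive list of what combine_tessdata requires.

```shell
#!/bin/sh
# missing_components PREFIX: print each component file named in this thread
# that does not exist under PREFIX (for example "ara.").
# (Hypothetical helper, not a Tesseract tool.)
missing_components() {
  prefix=$1
  for part in lstm unicharset number-dawg punc-dawg word-dawg; do
    [ -f "$prefix$part" ] || echo "$prefix$part"
  done
}
```

If `missing_components ara.` prints nothing, all five files are in the current directory and `combine_tessdata ara.` can be run there.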

To compare what is included in the ara.traineddata from the tessdata repo,
unpack it with the -u option:

combine_tessdata -u ara.traineddata ara.
Extracting tessdata components from ara.traineddata
Wrote ara.config
Wrote ara.unicharset
Wrote ara.punc-dawg
Wrote ara.word-dawg
Wrote ara.number-dawg
Wrote ara.freq-dawg
Wrote ara.lstm
Wrote ara.lstm-punc-dawg
Wrote ara.lstm-word-dawg
Wrote ara.lstm-number-dawg
0:config:size=545, offset=168
1:unicharset:size=7831, offset=713
6:punc-dawg:size=1066, offset=8544
7:word-dawg:size=6303746, offset=9610
8:number-dawg:size=426, offset=6313356
9:freq-dawg:size=1346, offset=6313782
17:lstm:size=5324626, offset=6315128
18:lstm-punc-dawg:size=1466, offset=11639754
19:lstm-word-dawg:size=895354, offset=11641220
20:lstm-number-dawg:size=658, offset=12536574
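
To see which components account for the file size, the size lines of this listing can be summarised with awk. This is a sketch that assumes the "index:name:size=N, offset=M" line format shown above; the function names are illustrative.

```shell
#!/bin/sh
# component_sizes: read "combine_tessdata -u" output on stdin and print
# "<bytes> <component>" for each size line, largest first.
component_sizes() {
  awk -F'[:=,]' '/^[0-9]+:[^:]+:size=/ { printf "%d %s\n", $4, $2 }' | sort -rn
}

# total_size: read the same output on stdin and print the sum of all sizes.
total_size() {
  awk -F'[:=,]' '/^[0-9]+:[^:]+:size=/ { total += $4 } END { printf "%d\n", total }'
}
```

For the listing above, the two largest entries are word-dawg (6303746 bytes) and lstm (5324626 bytes), and the components sum to 12537064 bytes, which matches the roughly 11.9 MB file mentioned earlier.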

@Shreeshrii Thank You soo much brother.

Btw do you know what is the difference between ara.word-dawg & ara.lstm-word-dawg?

I mean I know how to generate word dawg as it is generated based on word list but what about lstm-word list?

Thank You

Please see
https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain_utils.sh

It is a copy of the word-dawg based on the wordlist, but using the
unicharset which was used for lstm training (this unicharset could be
different from the one used for tesseract or cube training).

Also, please note that currently the lstm training process still has some
'bugs' and you may not get better results than the 4.0 traineddata provided
by Google in the repo.


@LatifWirelessMarketer Send me an email; I want to talk with you about the training process.
Also, please delete the gibberish after "Deserialize header failed:"; it is making your topic hard to read.

@christophered sorry for it man, I removed it.

@LatifWirelessMarketer Are you using Tesseract 4.x on Windows or Linux?

@christophered Tesseract 4.00alpha on windows

@christophered I am facing one problem. After 2 weeks, I have achieved an error rate of 0.61%, which is good, but when I generate the traineddata and test it, joined (bidi) characters come out in the opposite order.

For example, if I test it on my name, SALMAN (سلمان), it detects it almost correctly, but let's say سل is a single character in whichever font I am using. This is what it returns: (لسمان). It detects سل as لس, which is wrong.

Can you help me with this? What could be the reason behind it? It happens with every joined character.

Thank You

Can someone help me?? @christophered ?

@LatifWirelessMarketer An error rate of 0.6 is very bad. My guess is that either:

  • You have some mistakes or inconsistencies in your training data (a problem with your training data)
  • Or there are some characters that are causing confusion (a problem with Tesseract itself)

Please upload your training data and the model you created, so that we can have a closer look.
Also, list the step-by-step commands that you used to create the training data, create and train the LSTM model, and recognize the images.

Arabic has some known issues, like the one you described.
Hopefully, Ray will be able to fix them.

@Shreeshrii Hi Shreeshrii, I was trying to use Tesseract 4.00alpha on macOS to fine-tune with the ara.zip you posted in this thread. However, I still encountered this error:

Deserialize header failed: ~/ara/ara.Arabic_Typesetting.exp0.lstmf

The error is the same when I try to create the lstmf file by myself using:

> tesseract ara.Arabic_Typesetting.exp0.tif ara.Arabic_Typesetting.exp0 lstm.train batch.nochop

Could you help me find out what went wrong? Thanks so much.

The ara.zip file is from May 2017. Ray changed the training method around July/August 2017, so files from that archive will no longer work. Please see the wiki page on training for the latest info.

I will edit the earlier message also.

@Shreeshrii
Thanks for reply Shreeshrii!
I tried to generate a .lstmf file myself with the latest version of tesseract, but I still encountered the same error (Deserialize header failed):

tesseract ara.Arabic_Typesetting.exp0.tif ara.Arabic_Typesetting.exp0 lstm.train nobatch

Could you tell me what went wrong or what I can do? Many thanks!

@zc813 I would suggest that you use tesstrain.sh for creating the training data.

@Shreeshrii Thanks for the reply!
I tried tesstrain.sh on eng, which generated eng.Arial.exp0.lstmf successfully. However, when fine-tuning the model with lstmtraining, using eng.lstm and eng.traineddata taken from tessdata_fast (https://github.com/tesseract-ocr/tessdata_fast), I encountered the following error:

➜  tessdata ~/tesseract/training/lstmtraining --model_output engmodel --continue_from eng.lstm --traineddata eng.traineddata --train_listfile eng.training_files.txt --max_iterations 600 
Loaded file eng.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from eng.lstm
Loaded 72/72 pages (1-72) of document /tmp/tesstrain/tessdata/eng.Arial.exp0.lstmf
!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 238
!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 238
!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 238
!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 238
[1]    53218 segmentation fault  ~/tesseract/training/lstmtraining --model_output engmodel --continue_from   

> when finetuning the model with lstmtraining and extracted eng.lstm and eng.traineddata from tessdata_fast (https://github.com/tesseract-ocr/tessdata_fast),

For training, you have to start with the traineddata from tessdata_best, not tessdata_fast.
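
As a rough sketch, the same fine-tuning run started from a tessdata_best model might look like the two commands printed by the helper below. The helper name and the directory holding the downloaded file are assumptions; the lstmtraining flags are the ones used in the failing command above, and combine_tessdata -e extracts a single component from a traineddata file. The helper only prints the commands, it does not run them.

```shell
#!/bin/sh
# print_finetune_commands LANG BEST_DIR LISTFILE: print (not run) the commands
# for fine-tuning from a tessdata_best model. BEST_DIR is assumed to contain
# LANG.traineddata downloaded from the tessdata_best repository.
# (Hypothetical helper, not a Tesseract tool.)
print_finetune_commands() {
  lang=$1
  best=$2
  list=$3
  # Step 1: extract the LSTM component from the "best" traineddata.
  echo "combine_tessdata -e $best/$lang.traineddata $lang.lstm"
  # Step 2: continue training from that component.
  echo "lstmtraining --model_output ${lang}model --continue_from $lang.lstm --traineddata $best/$lang.traineddata --train_listfile $list --max_iterations 600"
}
```

For example, `print_finetune_commands eng ~/tessdata_best eng.training_files.txt` prints the extraction command followed by the training command, which can then be reviewed and run by hand.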

@zdenop @egorpugin Please merge the following PRs which add info to the README at tessdata_fast and tessdata_best

https://github.com/tesseract-ocr/tessdata_best/pull/16

https://github.com/tesseract-ocr/tessdata_fast/pull/6

@LatifWirelessMarketer Did you ever resolve your Failed to create a temporary file error?
