Tesseract: Unclear Documentation for Training Tesseract LSTM

Created on 28 Apr 2018  ·  20 Comments  ·  Source: tesseract-ocr/tesseract

Environment

Tesseract Version: 4.x
Platform: Ubuntu 64.

I don't understand why the documentation for training Tesseract is so incomplete and so brief.
I don't see anything in the documentation about editing box files and creating lstmf files from the edited box files. Why would anyone want to train Tesseract if the box files it prepares were already perfect? tesstrain_utils.sh, which plays a major role in training, is barely discussed. Maybe I'm just naive; this is only my impression. I would love to get an explanation from someone.

The problem I'm working on is improving accuracy for one font in English (French Script MT). I know that Tesseract was already trained on this font (according to the langdata repo), but it doesn't perform well on it, and I have tried image pre-processing too. The other fonts seem to dominate the output, so I want to bias the network toward my font by training it further and get better accuracy.

My understanding after reading the documentation:

  1. Create training data using tesstrain.sh. (But how do I edit the boxes and characters before it creates the lstmf files?)
  2. Use the lstmtraining command-line program to train the network.
  3. Combine the result with the existing English "best" traineddata.

Thank you in advance !

All 20 comments

There is a lot of information regarding training for 4.0.0. Have you read https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM and all its links?

Info regarding training 4.0 is given in
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

It is very clearly stated that training from custom box/tiff pairs is NOT supported.

If you want to train for French Script MT, follow the instructions for fine-tuning for Impact, using French Script MT instead. If that doesn't give the desired results, use a larger, representative training text and train using the options for plus-minus training, and so on.
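A minimal sketch of the fine-tuning route suggested here, written as Python that only assembles the command lines and does not run them. The flags are the ones shown in the "Fine Tuning for Impact" section of the TrainingTesseract-4.00 wiki; every path, the model prefix, and the helper names are hypothetical placeholders, and the 400 iterations are just the figure mentioned later in this thread.

```python
import shlex
from typing import List

# Hypothetical locations; none of these paths come from the thread.
BEST_ENG = "tessdata_best/eng.traineddata"      # model to continue from
ENG_LSTM = "finetune/eng.lstm"                  # LSTM component extracted from BEST_ENG
LISTFILE = "train_fsmt/eng.training_files.txt"  # list of .lstmf files written by tesstrain.sh
MODEL_PREFIX = "finetune/french_script"         # checkpoints are written with this prefix


def extract_lstm_cmd() -> List[str]:
    """Extract the LSTM model from the traineddata so training can continue from it."""
    return ["combine_tessdata", "-e", BEST_ENG, ENG_LSTM]


def fine_tune_cmd(max_iterations: int) -> List[str]:
    """Continue training the existing model on the new font's line data."""
    return [
        "lstmtraining",
        "--model_output", MODEL_PREFIX,
        "--continue_from", ENG_LSTM,
        "--traineddata", BEST_ENG,
        "--train_listfile", LISTFILE,
        "--max_iterations", str(max_iterations),
    ]


def finalize_cmd(output_traineddata: str) -> List[str]:
    """Convert the last checkpoint into a usable .traineddata file."""
    return [
        "lstmtraining",
        "--stop_training",
        "--continue_from", MODEL_PREFIX + "_checkpoint",
        "--traineddata", BEST_ENG,
        "--model_output", output_traineddata,
    ]


if __name__ == "__main__":
    steps = (extract_lstm_cmd(), fine_tune_cmd(400), finalize_cmd("finetune/eng.traineddata"))
    for cmd in steps:
        # Print shell-quoted commands for review instead of executing them.
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check the paths before running anything; the lists can be passed to `subprocess.run` once they look right.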

I read all those links at least once, and some of them several times. Custom box/tiff pairs are not supported? Are you saying 3.02 is better than 4.0 for training on fonts? Thanks for the help, @Shreeshrii. Looks like that's the only option I'm left with.

One check on what you said about custom box/tiff pairs, @Shreeshrii: the training section of the first link clearly mentions that 3.02 box files can be used to train the LSTM after a few modifications.

Link :- https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#training-tesseract-lstm-engine

> Are you saying 3.02 is better than 4.0 in case of training for fonts?

No. I didn't say that.

When you train with fonts using tesstrain.sh, it creates synthetic training
data (box/tiff pairs) using the text2image program. These box files do NOT need
to be edited as in 3.02; no manual editing is required. If you are interested
in looking at them, see the /tmp directory.
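For reference, a hedged sketch of a tesstrain.sh invocation that produces this synthetic line data for a single font. It only builds the argument list; the flags are the ones used in the TrainingTesseract-4.00 wiki examples, while all directory paths and the function name are assumptions, and the font name must match exactly what your font installation reports.

```python
import shlex
from typing import List


def tesstrain_cmd(font: str, output_dir: str,
                  fonts_dir: str = "/usr/share/fonts",
                  langdata_dir: str = "../langdata",
                  tessdata_dir: str = "./tessdata") -> List[str]:
    """Build a tesstrain.sh command that renders line data for one font.

    All default paths are placeholders, not values taken from the thread.
    """
    return [
        "tesstrain.sh",
        "--fonts_dir", fonts_dir,
        "--lang", "eng",
        "--linedata_only",              # stop after the lstmf line data is created
        "--noextract_font_properties",
        "--langdata_dir", langdata_dir,
        "--tessdata_dir", tessdata_dir,
        "--fontlist", font,             # restrict rendering to this one font
        "--output_dir", output_dir,
    ]


if __name__ == "__main__":
    # Print the shell-quoted command for review instead of executing it.
    print(shlex.join(tesstrain_cmd("French Script MT", "train_fsmt")))
```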

Okay. I know it uses text2image to create synthetic data, but LSTMs need to train on data similar to what they are going to be used on, right? They give better results and much higher accuracy that way. So I thought there might be some way I could do this.

Closing the issue because I see that @Shreeshrii has already discussed the same problem in #issue1065, and that is enough to understand how to train Tesseract for fonts from existing data. Thank you Miss [/mr] Shreedevi.

> Thank you Miss [/mr] Shreedevi.

Mrs. :-)

Attached zip has a bash script for running finetuning for eng. You will have to change paths to match your setup. It also has the finetuned eng.traineddata for French Script, with 400 iterations.

If this is not enough, you can try the plus-minus type training.

eng-french-script.zip
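A rough sketch of what the plus-minus type training command looks like, as far as I understand the "Fine Tuning for ± a few characters" section of the TrainingTesseract-4.00 wiki. The main difference from plain fine-tuning is that `--traineddata` points at the newly generated traineddata (with the updated unicharset), while `--old_traineddata` points at the original model. All paths, the function name, and the iteration count are placeholder assumptions; the code only builds the argument list.

```python
import shlex
from typing import List


def plus_minus_cmd(new_traineddata: str, old_traineddata: str,
                   eng_lstm: str, listfile: str,
                   model_prefix: str, max_iterations: int) -> List[str]:
    """Build an lstmtraining command for plus-minus fine-tuning.

    new_traineddata: starter traineddata produced alongside the new line data
    old_traineddata: the original model the LSTM was extracted from
    """
    return [
        "lstmtraining",
        "--model_output", model_prefix,
        "--continue_from", eng_lstm,
        "--traineddata", new_traineddata,
        "--old_traineddata", old_traineddata,
        "--train_listfile", listfile,
        "--max_iterations", str(max_iterations),
    ]


if __name__ == "__main__":
    cmd = plus_minus_cmd(
        new_traineddata="trainplusminus/eng/eng.traineddata",  # hypothetical path
        old_traineddata="tessdata_best/eng.traineddata",       # hypothetical path
        eng_lstm="trainplusminus/eng.lstm",                    # hypothetical path
        listfile="trainplusminus/eng.training_files.txt",      # hypothetical path
        model_prefix="trainplusminus/plusminus",
        max_iterations=3600,                                   # example value only
    )
    print(shlex.join(cmd))
```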

> Mrs. :-)

Haha. Okie.

Thank you so much for your help, but for my problem I need to bias the network, because I already know the font to be OCRed. So I trained the model (eng from tessdata_best) on a large amount of text last night for around 5000 iterations. Thank you again, @Shreeshrii.

Mrs. @Shreeshrii, I see that you have already mastered the art of training Tesseract from scratch. Can you tell me how many iterations would be ideal for training Tesseract from scratch on one single font? Say I have 150 pages of text taken from a novel, at 12 pt, converted into images using tesstrain.sh for just that one font. How many iterations should I give the network to train well for that font? Or do you think changing the network specification would be a better idea? What should I change it to for this specific problem, given that I am going to use the resulting model on something similar to the training data?

All models have been trained by Ray Smith at Google. I have only had success in fine tuning the models.

I suggest you try to train by replacing a layer. For English, you would use tessdata_best/eng.traineddata to continue from.

See https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast for network specs used for different languages.

You should be able to build a smaller model by using 192 instead of 512/384/256.
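A hedged sketch of a replace-a-layer training command, following the pattern in the "Training Just a Few Layers" section of the TrainingTesseract-4.00 wiki. The `--append_index` and `--net_spec` flags come from that wiki; the index value 5, the class count in the `O1c` term (as I understand it, the real count is taken from the unicharset), and every path are assumptions to verify against your own model. Nothing is executed here.

```python
import shlex
from typing import List


def replace_layer_cmd(eng_lstm: str, traineddata: str, listfile: str,
                      model_prefix: str, max_iterations: int,
                      lstm_width: int = 192, append_index: int = 5) -> List[str]:
    """Build an lstmtraining command that cuts off the top of the network
    and appends a new final LSTM layer plus output layer.

    append_index=5 mirrors the wiki example and is an assumption here;
    check it against the layer list of the model you continue from.
    """
    net_spec = f"[Lfx{lstm_width} O1c1]"
    return [
        "lstmtraining",
        "--continue_from", eng_lstm,
        "--traineddata", traineddata,
        "--append_index", str(append_index),
        "--net_spec", net_spec,
        "--model_output", model_prefix,
        "--train_listfile", listfile,
        "--max_iterations", str(max_iterations),
    ]


if __name__ == "__main__":
    cmd = replace_layer_cmd(
        eng_lstm="trainlayer/eng.lstm",                 # hypothetical path
        traineddata="tessdata_best/eng.traineddata",    # hypothetical path
        listfile="train_fsmt/eng.training_files.txt",   # hypothetical path
        model_prefix="trainlayer/base",
        max_iterations=3000,                            # example value only
    )
    print(shlex.join(cmd))
```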

@gouthampro3 Please share a sample image that you are trying to recognize.

@Shreeshrii I don't get these generalized terms "smaller" and "larger"; they are confusing. From the link you mentioned, it looks like Ray has trained the eng best model for around 6 million iterations, so maybe fine-tuning is the best thing we can do (even if we have a hundred pages of data, I reckon), as you mentioned and as described in the Training Tesseract 4.00 LSTM tutorial: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact

Version string:4.00.00alpha:eng:synth20170629
LSTM training info:Network str:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1], flags=41, iteration=6352400, sample_iteration=6352704, null_char=110, learning_rate=0.001, momentum=0.5, adam_beta=0.999
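To make the network string above easier to read, here is a small sketch that splits it into its input spec and layer tokens. The splitting rule (each layer token starts with an uppercase letter) is my own simplification for this particular fused string, not an official VGSL parser, and it does not handle nested brackets or parallel groups. The layer descriptions in the comments are my reading of the VGSL spec documentation.

```python
import re
from typing import List, Tuple

NETWORK_STR = "[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1]"


def split_net_spec(spec: str) -> Tuple[str, List[str]]:
    """Split a fused network string into (input_spec, layer_tokens)."""
    body = spec.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("expected a spec wrapped in square brackets")
    body = body[1:-1]
    # Leading run of digits and commas: batch, height, width, depth.
    match = re.match(r"[0-9,]+", body)
    if match is None:
        raise ValueError("expected a leading input spec such as 1,36,0,1")
    input_spec = match.group(0)
    # Each layer token starts with an uppercase letter and runs until the next one.
    layers = re.findall(r"[A-Z][a-z0-9,]*", body[match.end():])
    return input_spec, layers


if __name__ == "__main__":
    input_spec, layers = split_net_spec(NETWORK_STR)
    print("input:", input_spec)   # batch 1, height 36, variable width, depth 1
    for position, layer in enumerate(layers):
        # Ct = convolution with tanh, Mp = max pooling, L.. = LSTM layers, O = output
        print(position, layer)
```

The last LSTM layer in this string is `Lfx192`, which is the kind of width the "smaller"/"larger" wording refers to.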

Here is the sample image you asked for: https://drive.google.com/open?id=1ZWRif1xFCP4lK559XjOwirOTcdsyLkTJ

If you don't mind, is there an email address I can reach you at?

Thank you for the sample image. I wanted to see whether you were trying to
recognize scanned page from a book or just synthetic data with French
Script MT font.

The font at 12 point is quite small. You can try to increase the image size
to 200% and change dpi to 300 and see if you get better recognition.

The other thing to try would be to change the text2image invocation command
and use ptsize 32 rather than the default of 12.
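A small sketch of a standalone text2image call with the larger point size, for anyone who wants to try it outside tesstrain.sh. The flag names (`--text`, `--outputbase`, `--font`, `--fonts_dir`, `--ptsize`) are text2image options; the file names, fonts directory, and function name are placeholder assumptions, and the command is only built, not run.

```python
import shlex
from typing import List


def text2image_cmd(text_file: str, output_base: str, font: str,
                   fonts_dir: str = "/usr/share/fonts",
                   ptsize: int = 32) -> List[str]:
    """Build a text2image command that renders text_file in the given font.

    ptsize=32 follows the suggestion in this thread (the default is 12).
    fonts_dir is a placeholder; point it at the directory holding your font.
    """
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={output_base}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
        f"--ptsize={ptsize}",
    ]


if __name__ == "__main__":
    cmd = text2image_cmd("eng.training_text", "eng.French_Script_MT.exp0", "French Script MT")
    # shlex.join quotes the font name because it contains spaces.
    print(shlex.join(cmd))
```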

Trying fine-tuning with plus-minus may be your best bet. You could try
replacing a layer as an alternative. There are no cut-and-dried answers; you
have to experiment to see what works best for your scenario.

I am trying some test trainings with this font. I will post the traineddata
so you can test with your image and compare with its groundtruth to see
which is better.

> Thank you for the sample image. I wanted to see whether you were trying to
> recognize scanned page from a book or just synthetic data with French
> Script MT font.

It's a scanned copy (I guess; I don't really know, I got it from a company). By the way, you can see the font distortion and sharp edges if you zoom in a bit, so it is not synthesized.

> The font at 12 point is quite small. You can try to increase the image size
> to 200% and change dpi to 300 and see if you get better recognition.

Yes, it is. I have already tried this too. The results were pretty decent, and there was not much variance in accuracy. The sample image I uploaded had already been converted using convert (ImageMagick) with depth set to 8, the background stripped, and density set to 300 dpi.
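For anyone repeating this pre-processing, a rough sketch of an ImageMagick command covering the resize, density, and depth settings mentioned above. The exact command I ran is not reproduced here; the option order, file names, and function name are assumptions, and the background-stripping step is left out because it depends on the image. The code only builds the argument list.

```python
import shlex
from typing import List


def upscale_cmd(src: str, dst: str, scale_percent: int = 200,
                dpi: int = 300, depth: int = 8) -> List[str]:
    """Build an ImageMagick convert command that enlarges a scan and
    records its resolution and bit depth."""
    return [
        "convert", src,
        "-resize", f"{scale_percent}%",   # enlarge small 12 pt text
        "-units", "PixelsPerInch",
        "-density", str(dpi),             # resolution stored in the output file
        "-depth", str(depth),
        dst,
    ]


if __name__ == "__main__":
    # Hypothetical file names; print the command for review instead of running it.
    print(shlex.join(upscale_cmd("page.png", "page_300dpi.png")))
```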

> Trying fine-tuning with plus-minus may be your best bet.

Is that so? Will try.

> There are no cut-and-dried answers; you
> have to experiment to see what works best for your scenario.

Obviously.

> You could try
> replacing a layer as an alternative.

I think this will work perfectly, @Shreeshrii.

> I am trying some test trainings with this font. I will post the traineddata
> so you can test with your image and compare with its groundtruth to see
> which is better.

Thanks again. I appreciate that.

> If you don't mind, is there an email address I can reach you at?

??

Hello, I am trying to do the same. I am getting 70% accuracy; please help me get it to 100%.

@BhaskarKN

> I am trying to do the same

Are you doing the LSTM training? By what approach: training a few layers, adding or removing a few characters (plus-minus), training from scratch, or training the existing model to bias it toward your font?

@Shreeshrii any progress, ma'am?

From scratch

@Shreeshrii Hello. I am trying to document the workings and architecture of Tesseract (version 4). Is there any other resource that is a bit more theoretical and explanatory than the wiki (publications, etc.)? I found one, but I think it covers version 2, and I am not sure how much of that version still exists in version 4.

https://github.com/tesseract-ocr/docs/tree/master/das_tutorial2016

I have checked that as well. The content is a bit confusing for me in some places. It would be great if it came with some explanations.
