Tesseract: How to generate *.lstmf files from *.tif and *.box files when fine-tuning with Tesseract 4.0

Created on 18 Oct 2017 · 52 Comments · Source: tesseract-ocr/tesseract

I want to fine-tune Tesseract 4.0, but I only have *.tif and *.box files,
and I don't know how to generate the *.lstmf files from them.

The wiki only says this:
"The training data is provided via .lstmf files, which are serialized DocumentData. They contain an image and the corresponding UTF8 text transcription, and can be generated from tif/box file pairs using Tesseract in a similar manner to the way .tr files were created for the old engine."

But it doesn't give the command for generating the *.lstmf files.
I hope someone can help. Thanks very much! @minly, @CoCa520, @Shreeshrii

question

Most helpful comment

Okay, I generated the .lstmf files! Since I couldn't find a single source explaining how to do this the way I intended, I'm listing the steps here in case someone looks for them.

My original intention was to train Tesseract 4 on handwritten (English) characters and digits and see how it performs at classifying handwriting. I had the .jpg (apparently these work as well), .box and .tr files from earlier training with Tesseract 3. The box file formats for Tesseract 3 and 4 differ significantly, so I manually edited the box files to suit Tesseract 4 (a gruesome task, unless you manage to come up with an automation script). What I did next was the following:

  • Move into the directory where the custom images are and run unicharset_extractor [lang].[fontname].exp[num].box. This generated the unicharset file.
  • Next, run the following:
    img_files=$(ls | grep exp0.jpg) [Note that this was the naming format my files were saved in; yours might be different.]
    and then
    for img_file in ${img_files}; do tesseract ${img_file} ${img_file%.*} 'lstm.train'; done
    in that order. This generated the .lstmf files!
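
A slightly more defensive variant of the loop above, as a minimal sketch rather than an official recipe: it globs instead of parsing `ls` output and quotes file names so that paths containing spaces survive. The function name `make_lstmf` and the `TESSERACT` override variable are hypothetical names introduced here for illustration; the `tesseract <image> <outputbase> lstm.train` invocation is the one from the steps above.

```shell
#!/bin/sh
# make_lstmf DIR [EXT]: run the lstm.train config on every *.exp0.EXT image in DIR.
# TESSERACT is a hypothetical override so the command can be swapped out,
# for example for a dry run. The body runs in a subshell to keep variables local.
make_lstmf() (
    dir=$1
    ext=${2:-jpg}
    for img in "$dir"/*.exp0."$ext"; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$img" ] || continue
        # The output base is the image path without its extension;
        # tesseract appends .lstmf to it.
        "${TESSERACT:-tesseract}" "$img" "${img%.*}" lstm.train
    done
)
```

Called as `make_lstmf . jpg` from the image directory, this should behave like the two commands above, assuming the images follow the same `exp0` naming.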

All 52 comments

The old format box files will not work for LSTM training. AFAIK, currently training is only supported with the synthetic box/tiff pairs generated via tesstrain.sh.

See https://github.com/tesseract-ocr/tesseract/issues/768 for more details.

@Shreeshrii, I have changed the format of the box files according to the requirements of Tesseract 4.0; that is, I added a TAB at the end of each line and spaces to demarcate words.

Could you tell me how to use the new-format box/tiff pairs to generate *.lstmf files? Thanks!

I copy them to my langdata/language directory and then use a modified
tesstrain.sh to copy them to the tmp training directory.

tesstrain.sh changes

mkdir -p ${TRAINING_DIR}
tlog "\n=== Starting training for language '${LANG_CODE}'"

cp  ../langdata/${LANG_CODE}/*.box ${TRAINING_DIR}
cp  ../langdata/${LANG_CODE}/*.tif ${TRAINING_DIR}

ls -l  ${TRAINING_DIR}
source "$(dirname $0)/language-specific.sh"

@Shreeshrii I changed the tesstrain.sh file, and the command now copies the box/tiff pairs to the tmp training directory.
But there was another error, as follows:

$ training/tesstrain.sh --lang eng --linedata_only --langdata_dir ../langdata --tessdata_dir ./tessdata --output_dir ../result

=== Starting training for language 'eng'
total 6068
-rwxrw-r-- 1 penny penny 66188 Oct 18 19:24 eng.num.exp0.box
-rwxrw-r-- 1 penny penny 6136385 Oct 18 19:24 eng.num.exp0.tif
-rw-rw-r-- 1 penny penny 42 Oct 18 19:24 tesstrain.log
[Wed Oct 18 19:24:13 CST 2017] /usr/local/bin/text2image --fonts_dir=/usr/share/fonts/ --font=Arial Bold --outputbase=/tmp/font_tmp.fGYz2L7fuF/sample_text.txt --text=/tmp/font_tmp.fGYz2L7fuF/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.fGYz2L7fuF
Could not find font named Arial Bold.
Pango suggested font FreeSerif Bold.
Please correct --font arg.

=== Phase I: Generating training images ===
ERROR: Could not find training text file ../langdata/eng/eng.training_text

It couldn't find the 'eng.training_text' file. Do you know what the 'eng.training_text' file is?

Additionally, the 'text2image' command is used to generate tiff images from text and fonts, but we already have box/tiff pairs, so why does it execute 'text2image' again?
Should I change other script files to resolve these problems?

I'm looking forward to your reply! Thanks a lot!

Please check the syntax of your command. Langdata is referred to via the script dir.

Training text is in the langdata dir.

Rather than modifying tesstrain.sh too much, you could keep a small dummy
training text in one font to use along with your box tiff pairs.

I have mostly tested training with synthetic images, using precreated box
tiff pairs just as sample.

training/tesstrain.sh \
--fonts_dir /mnt/c/Windows/Fonts \
--lang san \
--noextract_font_properties --linedata_only \
--exposures "0" \
--langdata_dir ../langdata \
--tessdata_dir ../tessdata \
--fontlist \
"Siddhanta" \
--output_dir ../tesstutorial/san

You have to make sure that all directories reflect your setup.

@Shreeshrii, with your help, I generated the *.lstmf files! Thanks a lot!

@Shreeshrii but now there is a new problem like this:

$ tesseract eng.num.exp1.tif eng.num.exp1 lstm.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 395
Empty page!!
Estimating resolution as 395
Empty page!!

Several images produce this error. Do you know why "Empty page!!" appears? Have you ever seen this kind of error?

Please look at your tif files in a viewer. Why do they have 0 dpi?

@Shreeshrii, the tif files are OK, and every tif image produces the warning, for example:

$ tesseract eng.num.exp2.tif eng.num.exp2 lstm.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 369

Although it prints "Warning. Invalid resolution 0 dpi. Using 70 instead.", the *.lstmf file is still generated correctly after the command, so I guess the warning is not the real problem.

But if a tif image gets the "Empty page!!" error, the *.lstmf file is not generated! So I guess the problem is the "Empty page!!" error. I don't know how to resolve it and hope for your help, thanks!

@Shreeshrii, I resolved the problem. I just added " -psm 7 nobatch" to the command, like this:
$ tesseract eng.num.exp2.tif eng.num.exp2 -psm 7 nobatch lstm.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.

Then the "Empty page!!" error disappeared, and the *.lstmf file was finally generated correctly!
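
The workaround can be wrapped in a small retry helper. This is only a sketch under stated assumptions: the first attempt uses the default page segmentation, and the single-line mode (7) is tried only when no .lstmf file appears. `lstmf_with_fallback`, `TESSERACT` and `PSM_FLAG` are hypothetical names introduced for illustration. The commands in this thread spell the option `-psm` (4.00.00alpha); later releases spell it `--psm`, which is the default assumed here.

```shell
#!/bin/sh
# lstmf_with_fallback IMAGE: try the default page segmentation first; if no
# .lstmf file appears (as reported for "Empty page!!"), retry treating the
# image as a single text line (page segmentation mode 7).
# TESSERACT and PSM_FLAG are hypothetical overrides, not tesseract features.
lstmf_with_fallback() (
    img=$1
    base=${img%.*}
    psm_flag=${PSM_FLAG:-"--psm"}   # set PSM_FLAG=-psm for old alpha builds
    "${TESSERACT:-tesseract}" "$img" "$base" lstm.train
    if [ ! -s "$base.lstmf" ]; then
        "${TESSERACT:-tesseract}" "$img" "$base" "$psm_flag" 7 lstm.train
    fi
    # Succeed only if a non-empty .lstmf file now exists.
    [ -s "$base.lstmf" ]
)
```

Forcing mode 7 only makes sense when each image really contains a single text line, which appears to be the case for the images discussed here.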

Great! Thanks for informing what worked.

@Shreeshrii, You're welcome! The command also works without "nobatch", as follows:

$ tesseract eng.num.exp2.tif eng.num.exp2 -psm 7 lstm.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.

Do you know what "nobatch" does? Have you used it in a tesseract command?
Looking forward to your reply! Thanks!

Please look at the config files named
makebox
nobatch
etc

in the
tessdata/configs
tessdata/tessconfigs directory.

They will show what config variables are being set for each.

I may have used the nobatch option with tesseract 3.02 or so - do not
remember details.

@Shreeshrii, OK, Thanks~^_^~

@Shreeshrii At Making Box Files 4.0 there's the following :

The required format for LSTM 4.0alpha is still the tiff/box file pair, except that the boxes only need to cover a textline instead of individual characters.
 'Newline' boxes with tab as the character must be inserted between textlines to indicate the end-of-line.

What does the last line mean? Does it mean that we should add a 'tab' after each textline?

If you use tesstrain.sh, box/tiff pairs are created in correct format.

The textline based box files (WordStr ...) are NOT supported.

If you are modifying old 3.0x format box files, you have to add a space after each word and a tab after each textline.

However, please note that the box files generated using tesseract with makebox need to be manually edited for accuracy (since the boxes are filled with OCRed text). Also, I found that for Devanagari (and probably for other complex scripts), the box generation may not match what is generated by text2image.

You can do a simple test. Create a box/tiff pair using 'text2image'. Then create a box file for that same tif using tesseract with makebox. Compare the two box files.
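
Before comparing two box files by hand, a quick count of the separator boxes can show whether a file has been converted at all. The sketch below assumes the usual box layout of one `symbol left bottom right top page` entry per line, so that a word separator is a line whose symbol is a space and an end-of-line marker is a line whose symbol is a tab. That layout is an assumption worth checking against a file produced by text2image; `box_markers` is a hypothetical helper name.

```shell
#!/bin/sh
# box_markers FILE: print how many word-separator (space) boxes and
# end-of-line (tab) boxes a box file contains.
box_markers() {
    awk '
        /^\t / { tabs++ }      # entry whose symbol is a tab
        /^  /  { spaces++ }    # entry whose symbol is a space
        END { printf "spaces=%d tabs=%d\n", spaces, tabs }
    ' "$1"
}
```

A legacy 3.0x box file would be expected to report zero for both counts, while a file in the 4.0 format should report roughly one tab per text line.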

@Shreeshrii Could you explain the role of each file under langdata?

[root@localhost langdata]# ls chi_sim
chi_sim.config   chi_sim.punc           chi_sim.training_text.bigram_freqs   chi_sim.unicharambigs  desired_characters
chi_sim.numbers  chi_sim.training_text  chi_sim.training_text.unigram_freqs  chi_sim.wordlist       forbidden_characters

I found that the .unigram_freqs and .bigram_freqs files don't seem to suit my needs very well.

Please see https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh

My guess is that .unigram_freqs, .bigram_freqs, desired_characters and forbidden_characters are used at Google for building a representative training_text for doing training from scratch.

They are not used directly in the training process documented publicly by Ray.

Also see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files

NOTE Tesseract 4.00 will now run happily with a traineddata file that contains just lang.lstm, lang.lstm-unicharset and lang.lstm-recoder. The lstm-*-dawgs are optional, and none of the other components are required or used with OEM_LSTM_ONLY as the OCR engine mode. No bigrams, unichar ambigs or any of the other components are needed or even have any effect if present. The only other component that does anything is the lang.config, which can affect layout analysis, and sub-languages.

Under "Langdata files", it says the following:

training_text.unigram_freqs

This is a text file with a list of unigrams (characters) and the frequency with which they appear next to each other in the training_text, one unigram per line.

That does not seem to be the real frequency in the training_text, which confuses me. I have downloaded a Chinese corpus from a national institution, and I think it has higher precision, so I want to generate my own langdata for chi_sim. But the training_text seems to be related to the other files (.unigram_freqs, .bigram_freqs).

@Shreeshrii, when I use Chinese character box/tiff pairs to fine-tune, there is a new problem, as follows:

$ unicharset_extractor chi_sim.black.exp0.box
Extracting unicharset from box file chi_sim.black.exp0.box
Invalid Unicode codepoint: 0xffffffe4
IsValidCodepoint(ch):Error:Assert failed:in file normstrngs.cpp, line 225
Segmentation fault (core dumped)

Have you ever seen this problem?
Looking forward to your reply! Thanks!

You should search for past issues and in forum first.

See https://github.com/tesseract-ocr/tesseract/issues/1114

@Shreeshrii
I want to generate *.lstmf files, so I ran tesseract ./tif_box/eng.exp0.tif ./tif_box/eng.exp0 lstm.train and got eng.exp0.txt and eng.exp0.lstmf, but eng.exp0.txt contains no characters at all.
The box file and tif file are OK.

When I run lstmtraining, I get an error:
Deserialize header failed: ~/tesstutorial/tif_box/eng.exp0.lstmf
Load of page 0 failed!
Load of images failed!!
I guess my *.lstmf file is incorrect.

PS: the box and tif files come from jTessBoxEditor, not from the text2image command.

Please give me a suggestion.

@Shreeshrii
With --sequential_training true,
I then got:
First document cannot be empty!!
num_pages_per_doc_ > 0:Error:Assert failed:in file imagedata.cpp, line 658

Sorry, I don't have any suggestions to try.

I am facing the same problem: the lstmf file does not exist.

I have already read the solution from @Shreeshrii above, and did the following:

  1. In the langdata/language directory, I put 3 files: training_text, .box, and .tif.
    To generate the .box file, I used text2image and made sure that each line has a tab and that there is a space between words.
  2. I modified tesstrain.sh as @Shreeshrii suggested, to copy the .box and .tif files to the tmp folder.
  3. With the following command, I still get the ".lstmf does not exist or is not readable" error:

training/tesstrain.sh --fonts_dir /usr/share/fonts/ --lang my_lang --noextract_font_properties --linedata_only --exposures "0" --langdata_dir ../langdata --tessdata_dir ./tessdata --fontlist "my_font" --output_dir ../tesstutorial/my_lang

any suggestion?

For generating .box file, I used text2image and made sure that each line has tab and also space between words.

If you use training/tesstrain.sh it will automatically call text2image with the training_text and fonts from font list, you do not have to generate them separately. The script creates the box and tif files in tmp directory and uses them to create the lstmf file.

Please review the log file to see what errors are generated and fix.

Thank you for your reply @Shreeshrii,
Now I have placed only 2 files in the langdata/my_lang directory: my training_text file and a .tif file that represents my font.
After I ran training/tesstrain.sh, it generated 6 files in the tmp directory:

  1. .txt file (generally the same as my training_text),
  2. .tif file (my text in training_text with my font)
  3. .box file (represents coordinates for each character in the .tif in this directory)
  4. .unicharset file
  5. .xheights file, and
  6. log file

But still no lstmf file was generated.

My log file shows:

=== Starting training for language 'ind'
[Jum Des 15 15:31:27 WIB 2017] /usr/bin/text2image --fonts_dir=/usr/share/fonts/ --font=ubuntu --outputbase=/tmp/font_tmp.VhBcen4j0y/sample_text.txt --text=/tmp/font_tmp.VhBcen4j0y/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.VhBcen4j0y
Rendered page 0 to file /tmp/font_tmp.VhBcen4j0y/sample_text.txt.tif
=== Phase I: Generating training images ===
Rendering using ubuntu
[Jum Des 15 15:31:44 WIB 2017] /usr/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.VhBcen4j0y --fonts_dir=/usr/share/fonts/ --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.dImW1DMEXe/ind/ind.ubuntu.exp0 --max_pages=3 --font=ubuntu --text=../langdata/ind/ind.training_text
Rendered page 0 to file /tmp/tmp.dImW1DMEXe/ind/ind.ubuntu.exp0.tif
=== Phase UP: Generating unicharset and unichar properties files ===
[Jum Des 15 15:31:44 WIB 2017] /usr/bin/unicharset_extractor --output_unicharset /tmp/tmp.dImW1DMEXe/ind/ind.unicharset --norm_mode 1 /tmp/tmp.dImW1DMEXe/ind/ind.ubuntu.exp0.box
Extracting unicharset from box file /tmp/tmp.dImW1DMEXe/ind/ind.ubuntu.exp0.box
Wrote unicharset file /tmp/tmp.dImW1DMEXe/ind/ind.unicharset
[Jum Des 15 15:31:44 WIB 2017] /usr/bin/set_unicharset_properties -U /tmp/tmp.dImW1DMEXe/ind/ind.unicharset -O /tmp/tmp.dImW1DMEXe/ind/ind.unicharset -X /tmp/tmp.dImW1DMEXe/ind/ind.xheights --script_dir=../langdata
Loaded unicharset of size 51 from file /tmp/tmp.dImW1DMEXe/ind/ind.unicharset
Setting unichar properties
Setting script properties
Writing unicharset to file /tmp/tmp.dImW1DMEXe/ind/ind.unicharset
=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=./tessdata
[Jum Des 15 15:31:44 WIB 2017] /usr/bin/tesseract /tmp/tmp.dImW1DMEXe/ind/ind.ubuntu.exp0.tif /tmp/tmp.dImW1DMEXe/ind/ind.ubuntu.exp0 lstm.train
read_params_file: Can't open lstm.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
ERROR: /tmp/tmp.dImW1DMEXe/ind/ind.ubuntu.exp0.lstmf does not exist or is not readable

tif file (my text in training_text with my font)

This will NOT work. You need the ttf file (FONT) in the fonts directory.
The program will automatically create the tif and box using the training
text and fonts.

read_params_file: Can't open lstm.train

This means that you do NOT have the config file named lstm.train. That is
why the lstmf file is NOT being created.

https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/configs/lstm.train

You need to have this in your tessdata directory.
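
A quick way to confirm the config file is where it is expected, sketched under the assumption that it lives at `configs/lstm.train` inside the tessdata directory (the location the link above points to). `check_lstm_train` is a hypothetical helper name, and the directory to pass depends on how TESSDATA_PREFIX or the --tessdata_dir argument is set in your setup.

```shell
#!/bin/sh
# check_lstm_train TESSDATA_DIR: report whether a readable lstm.train config
# exists under TESSDATA_DIR/configs. Falls back to $TESSDATA_PREFIX when no
# argument is given.
check_lstm_train() (
    tessdata=${1:-$TESSDATA_PREFIX}
    f=$tessdata/configs/lstm.train
    if [ -r "$f" ]; then
        echo "found: $f"
    else
        echo "missing or unreadable: $f" >&2
        exit 1
    fi
)
```

For the log above, `check_lstm_train ./tessdata` run from the same working directory as tesstrain.sh would show whether the relative TESSDATA_PREFIX resolves to a directory that actually contains the config.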

This will NOT work. You need the ttf file (FONT) in the fonts directory.
The program will automatically create the tif and box using the training
text and fonts.

Yes, I have my ttf file in my fonts directory, and the tif file in the tmp dir was automatically created by tesstrain.sh.

This means that you do NOT have the config file named lstm.train. That is
why the lstmf file is NOT being created.

I have copied lstm.train, and made it executable with chmod, into the tessdata directory, tessdata/configs and tessdata/tessconfigs.
I am confused about why it still cannot open lstm.train ...

The config files need to be in the tessdata-dir as defined by TESSDATA_PREFIX or as specified in the command.

Make sure you are giving the correct path for it.

@Shreeshrii thank you for your help.

My problems have now been solved.
The last one occurred because I had modified tesstrain.sh; I simply overwrote it with the original and ran it, and lstm.train could indeed be opened.

@easymavinmind

Hi, my name is June.

I'm studying Tesseract 4.00.

Could I get the tesstrain.sh script you modified?

Thank you

@Shreeshrii
Hi, I also want to generate the .lstmf file directly from the *.tif/*.box pair, but something confuses me after reading the comments above.
I found that the command used for generating .lstmf files always includes --lang eng.
Does that mean it will use the traineddata from eng rather than my *.tif/*.box pair for training?
Thank you!

@tangsanli5201

The following is a workaround that you can try.

Create your box tiff pairs. Make sure they follow the 4.0 format. Name them in the expected way.

Copy them to the langdata/lang folder.

Modify tesstrain.sh to copy these files at the beginning of training process. After the tmp directory is created, copy box and tif to that dir.

You should also give at least one font and training text as input, so that they are used for training along with your box/tiff pairs.

Run the process, look at the log file, console output to verify that all files are being picked up.

Test with one small box/tif pair and training_text to understand the process.

Then try your training.
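
The copy step of this workaround could look like the following sketch, which only stages images that have a matching box file and warns about the rest. `stage_pairs` is a hypothetical helper name; in a modified tesstrain.sh it would be called with the langdata language folder as the source and the tmp training directory as the destination, mirroring the plain `cp` lines shown earlier in the thread.

```shell
#!/bin/sh
# stage_pairs SRC_DIR DEST_DIR: copy every .tif that has a matching .box
# (same base name) into DEST_DIR, and warn about images without one.
stage_pairs() (
    src=$1
    dest=$2
    mkdir -p "$dest"
    for tif in "$src"/*.tif; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$tif" ] || continue
        box=${tif%.tif}.box
        if [ -e "$box" ]; then
            cp "$tif" "$box" "$dest"/
        else
            echo "skipping $tif: no matching .box" >&2
        fi
    done
)
```

Listing the destination directory afterwards, as the earlier snippet does with `ls -l`, is a simple way to verify that the pairs were picked up before training starts.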

@Shreeshrii
Following your suggestion, I did generate an lstmf file, and I found the information about my own .tif/.box pair in the log file.
However, when I remove the following two lines from tesstrain.sh
'cp /xxx/*.box ${TRAINING_DIR}/xxx.box'
'cp /xxx/*.tif ${TRAINING_DIR}/xxx.tif',
and try to regenerate the lstmf file, I find that the sizes of xxx.traineddata and xxx.lstmf in the output directory don't change.
I wonder whether my own data is really used in the generated traineddata.
Thank you!

Please see https://github.com/tesseract-ocr/tesseract/issues/841#issuecomment-297707943

and download https://github.com/tesseract-ocr/tesseract/files/961633/boxtrain.zip
This has an old version of my modified bash scripts. Please compare with the official scripts tesstrain.sh.

Ray has modified the training process since then to use a starter traineddata. So, use the files from this zipfile just as a reference. They won't work as is.

@Shreeshrii

Hi Shreeshrii

I don't understand why they should be copied to the langdata/lang folder.
ㅜㅜ
Why?

@Shreeshrii https://github.com/tesseract-ocr/tesseract/blob/master/src/training/tesstrain_utils.sh#L394 says I can use any language-specific configs, any idea where to find them, or create them from scratch?

Not required for all languages

example for sanskrit

https://github.com/tesseract-ocr/langdata/blob/master/san/san.config

Hi guys I have a question regarding the format of the box file.

The textline based box files (WordStr ...) are NOT supported.

Is this still true? I could not find the statement in the training 4.0 tutorial.
I am confused as to how I should format the box coordinates.

If the textline-based box file is supported,
is the following the correct way of doing it?

Ex.
"I have an apple."

would be

"I have an apple." 100 100 200 200 0
\

if so that would ease the annotation process.

The right place for asking questions is the forum.

@Shreeshrii

Hi Shreeeshrii,

I am trying to train Tesseract 4.0 for the English language, but on specific images. I created 2 .tif and box files of the image using QT editor, and I made sure to include TAB and SPACES. I also created a dummy training_text file in lang_data, but when I execute the command

src/training/tesstrain.sh --fonts_dir /usr/share/fonts --fontlist FreeMono --lang eng --linedata_only --noextract_font_properties --langdata_dir ./langdata --tessdata_dir ./tessdata --output_dir ./output/engtrain --my_boxtiff_dir /home/user1/abc/tesseract-master/Training_dir

It create a single lstm file for exp0 and that too with random coordinates (not in sync with by box and tif file) and when we checked the box file in the tmp/ directory, it was updated with the data from the dummy file of eng.training_text and not from the actual box file of the image that we created using editor.

Please provide your input.

I have created a simple bash script for LSTM training (fine-tuning for impact). Change the file locations to match your setup and let me know if it works for you.

https://github.com/Shreeshrii/tesseract/blob/tess4train/tess4train_impact_from_full.sh

Renamed the script to say it uses box tiff pairs.

https://github.com/Shreeshrii/tesseract/blob/tess4train/tess4train_impact_from_boxtiff.sh

Thank you, Shreeshrii.
I have one more question about lang_dir before executing the shell script. We created the dummy eng.training_text file with only 3 words. Do we need to use the same file for fine-tuning?
Also, there are multiple files for each language, such as eng.number, eng.unicharambigs, eng.punc and eng.wordlist. Do I need to create all of these?

Thank you in advance.

You don't need the dummy training text for this script.

eng.number, eng.unicharambigs, eng.punc and eng.wordlist are NOT needed for fine-tuning LSTM training. unicharambigs is NOT used at all.

This type of fine-tuning uses the files already present in tessdata_best/eng.traineddata.

All you need for this is the box/tiff pairs and tessdata_best/eng.traineddata.
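
To make those two inputs concrete: the .lstmf files generated from the box/tiff pairs are listed in a text file, and the LSTM model is extracted from eng.traineddata and trained further. The sketch below follows the fine-tuning flow from the Tesseract 4 training documentation as I understand it; it is not the script linked above. The helper write_listfile is hypothetical, and every path, the list-file name and the iteration count are placeholders to adjust.

```shell
# Sketch: collect the .lstmf files of a directory into a list file,
# one path per line. write_listfile is an illustrative helper name.
write_listfile() {
    local dir="$1" out="$2" f
    : > "$out"                          # create or truncate the list file
    for f in "$dir"/*.lstmf; do
        [ -e "$f" ] || continue         # the glob matched nothing
        printf '%s\n' "$f" >> "$out"
    done
}

# The remaining steps need the Tesseract training tools, so they are shown
# as comments. Paths and the iteration count are assumptions.
#
#   write_listfile ./train ./train/eng.training_files.txt
#
#   # Extract the LSTM model from the "best" traineddata.
#   combine_tessdata -e tessdata_best/eng.traineddata ./train/eng.lstm
#
#   # Continue training from that model on the new .lstmf files.
#   lstmtraining \
#     --continue_from ./train/eng.lstm \
#     --traineddata tessdata_best/eng.traineddata \
#     --train_listfile ./train/eng.training_files.txt \
#     --model_output ./train/finetuned \
#     --max_iterations 400
```

Keeping the list file separate from the training call makes it easy to inspect which .lstmf files will actually be used before a long run starts.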

https://github.com/tesseract-ocr/langdata

You need the following files in the langdata directory to run this fine-tuning training script:

Common.unicharset
Latin.unicharset
radical-stroke.txt
https://github.com/tesseract-ocr/langdata/blob/master/radical-stroke.txt
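
A quick pre-flight check can confirm that these three files are in place before the training script is started. This is only a sketch: the function name missing_langdata_files is hypothetical, and it assumes the files sit directly in the directory passed to it.

```shell
# Sketch: print the name of each required langdata file that is missing
# from the given directory. Prints nothing when all three are present.
missing_langdata_files() {
    local dir="$1" f
    for f in Common.unicharset Latin.unicharset radical-stroke.txt; do
        [ -f "$dir/$f" ] || echo "$f"
    done
}
```

For example, missing_langdata_files ./langdata prints one file name per line for anything that still has to be downloaded from the langdata repository.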

Hi Shree,

I am getting a "Page not found" error when I click on "https://github.com/Shreeshrii/tesseract/blob/tess4train/tess4train_impact_from_full.sh".
Please help.
