tesseract 🚀 - how to generate *.lstmf files according to *.tif and *.box files while fine tuning with ...

The old format box files will not work for LSTM training. AFAIK, currently training is only supported with the synthetic box/tiff pairs generated via tesstrain.sh.

See https://github.com/tesseract-ocr/tesseract/issues/768 for more details.

Shreeshrii on 18 Oct 2017

@Shreeshrii, I have changed the box files to the format required by tesseract 4.0: I added a TAB at the end of each line and spaces to demarcate words.

Could you tell me how to use the new-format box/tiff pairs to generate *.lstmf files? Thanks!

694376965 on 18 Oct 2017

I copy them to my langdata/language directory and then use a modified
tesstrain.sh to copy them to the tmp training directory.

tesstrain.sh changes

mkdir -p ${TRAINING_DIR}
tlog "\n=== Starting training for language '${LANG_CODE}'"

cp  ../langdata/${LANG_CODE}/*.box ${TRAINING_DIR}
cp  ../langdata/${LANG_CODE}/*.tif ${TRAINING_DIR}

ls -l  ${TRAINING_DIR}
source "$(dirname $0)/language-specific.sh"
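
A minimal sketch of the same copy step as a helper that only copies complete pairs, so a .box file without its image is reported instead of silently copied. The function name copy_boxtiff_pairs and its two arguments are hypothetical and not part of tesstrain.sh.

```shell
# Copy only complete box/tif pairs from a source directory into the
# training directory. A .box file without a matching .tif is reported
# on stderr and makes the function return non-zero.
copy_boxtiff_pairs() {
    src_dir=$1
    dst_dir=$2
    mkdir -p "$dst_dir" || return 1
    status=0
    for box in "$src_dir"/*.box; do
        [ -e "$box" ] || continue          # the glob matched nothing
        tif="${box%.box}.tif"
        if [ -f "$tif" ]; then
            cp "$box" "$tif" "$dst_dir"/ || status=1
        else
            echo "missing image for $box" >&2
            status=1
        fi
    done
    return $status
}
```

Inside a modified tesstrain.sh this could be called as copy_boxtiff_pairs "../langdata/${LANG_CODE}" "${TRAINING_DIR}" in place of the two cp lines.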

Shreeshrii on 18 Oct 2017

@Shreeshrii I changed the tesstrain.sh file, and the command now copies the box/tiff pairs to the tmp training directory.
But there was another error, as follows:

$ training/tesstrain.sh --lang eng --linedata_only --langdata_dir ../langdata --tessdata_dir ./tessdata --output_dir ../result

=== Starting training for language 'eng'
total 6068
-rwxrw-r-- 1 penny penny 66188 Oct 18 19:24 eng.num.exp0.box
-rwxrw-r-- 1 penny penny 6136385 Oct 18 19:24 eng.num.exp0.tif
-rw-rw-r-- 1 penny penny 42 Oct 18 19:24 tesstrain.log
[Wed Oct 18 19:24:13 CST 2017] /usr/local/bin/text2image --fonts_dir=/usr/share/fonts/ --font=Arial Bold --outputbase=/tmp/font_tmp.fGYz2L7fuF/sample_text.txt --text=/tmp/font_tmp.fGYz2L7fuF/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.fGYz2L7fuF
Could not find font named Arial Bold.
Pango suggested font FreeSerif Bold.
Please correct --font arg.

=== Phase I: Generating training images ===
ERROR: Could not find training text file ../langdata/eng/eng.training_text

It couldn't find the 'eng.training_text' file. Do you know what the 'eng.training_text' file is?

Additionally, the 'text2image' command is used to generate tiff images from text and fonts, but we already have box/tiff pairs, so why does it execute 'text2image' again?
Should I change other script files to resolve these problems?

I'm looking forward to your reply! Thanks a lot!

694376965 on 18 Oct 2017

Please check the syntax of your command. Langdata is referred to via script dir.

Training text is in langdata dir.

Rather than modifying tesstrain.sh too much, you could keep a small dummy
training text in one font to use along with your box tiff pairs.

I have mostly tested training with synthetic images, using precreated box
tiff pairs just as sample.

Shreeshrii on 18 Oct 2017

training/tesstrain.sh \
--fonts_dir /mnt/c/Windows/Fonts \
--lang san \
--noextract_font_properties --linedata_only \
--exposures "0" \
--langdata_dir ../langdata \
--tessdata_dir ../tessdata \
--fontlist \
"Siddhanta" \
--output_dir ../tesstutorial/san

You have to make sure that all directories reflect your setup.

Shreeshrii on 18 Oct 2017

@Shreeshrii, with your help, I generated the *.lstmf files! Thanks a lot!

694376965 on 23 Oct 2017

@Shreeshrii but now there is a new problem like this:

$ tesseract eng.num.exp1.tif eng.num.exp1 lstm.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 395
Empty page!!
Estimating resolution as 395
Empty page!!

Several images generate this error. Do you know why "Empty page!!" appears? Have you ever seen this kind of error?

694376965 on 23 Oct 2017

Please look at your tif files in a viewer. Why do they have 0 dpi?

Shreeshrii on 23 Oct 2017

@Shreeshrii, the tif files are OK, and every tif image has the warning, for example:

$ tesseract eng.num.exp2.tif eng.num.exp2 lstm.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 369

Although it prints "Warning. Invalid resolution 0 dpi. Using 70 instead.", the *.lstmf file is still generated correctly after the command. So I guess the warning is not the real problem.

If a tif image has the "Empty page!!" error, however, the *.lstmf file is not generated. So I guess the problem is the "Empty page!!" error, but I don't know how to resolve it. I hope you can help, thanks!

694376965 on 24 Oct 2017

@Shreeshrii, I resolved the problem. I just added " -psm 7 nobatch" to the command, like this:
$ tesseract eng.num.exp2.tif eng.num.exp2 -psm 7 nobatch lstm.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.

Then the "Empty page!!" error disappeared, and the *.lstmf file was finally generated correctly!
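
A minimal sketch of that workaround as a retry: run the plain lstm.train command first and fall back to page segmentation mode 7 only when no .lstmf file appears. The function name make_lstmf and the PSM_FLAG variable are hypothetical; the thread used "-psm", while newer 4.x builds spell the option "--psm", so the flag is left configurable here as an assumption about your build.

```shell
# Option name for the page segmentation mode. Set PSM_FLAG=-psm to match
# the exact command used in this thread.
PSM_FLAG=${PSM_FLAG:---psm}

# Run "tesseract <image> <base> lstm.train"; if no non-empty .lstmf file
# results, retry once treating the image as a single text line (mode 7).
# Returns success only if <base>.lstmf exists and is non-empty.
make_lstmf() {
    img=$1
    base=${img%.*}
    tesseract "$img" "$base" lstm.train
    if [ ! -s "$base.lstmf" ]; then
        echo "no lstmf for $img, retrying with $PSM_FLAG 7" >&2
        tesseract "$img" "$base" "$PSM_FLAG" 7 lstm.train
    fi
    [ -s "$base.lstmf" ]
}
```

For example, make_lstmf eng.num.exp1.tif would leave eng.num.exp1.lstmf next to the image if either attempt succeeds.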

694376965 on 24 Oct 2017

Great! Thanks for informing what worked.

Shreeshrii on 24 Oct 2017

@Shreeshrii, you're welcome! The command also works without "nobatch", as follows:

$ tesseract eng.num.exp2.tif eng.num.exp2 -psm 7 lstm.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.

Do you know what "nobatch" does? Have you used it in a tesseract command?
Looking forward to your reply! Thanks!

694376965 on 24 Oct 2017

Please look at the config files named
makebox
nobatch
etc

in the
tessdata/configs
tessdata/tessconfigs directory.

They will show what config variables are being set for each.

I may have used the nobatch option with tesseract 3.02 or so - do not
remember details.

Shreeshrii on 24 Oct 2017

@Shreeshrii, OK, Thanks~^_^~

694376965 on 25 Oct 2017

@Shreeshrii The "Making Box Files 4.0" page says the following:

The required format for LSTM 4.0alpha is still the tiff/box file pair, except that the boxes only need to cover a textline instead of individual characters.
 'Newline' boxes with tab as the character must be inserted between textlines to indicate the end-of-line.

What does the last line mean? Does it mean that we should add a 'tab' after each textline?

ivanzz1001 on 31 Oct 2017

If you use tesstrain.sh, box/tiff pairs are created in correct format.

The textline based box files (WordStr ...) are NOT supported.

If you are modifying old 3.0x format box files, you have to add a space after each word and a tab after each textline.
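
A quick way to check whether a box file already contains those extra boxes is to count them. This is a sketch under the assumption that each box line has the form "character left bottom right top page", so a word separator is a line whose character is a space and an end-of-line marker is a line whose character is a TAB. The function names are hypothetical.

```shell
# A literal TAB character for use in grep patterns.
TAB=$(printf '\t')

# Count end-of-textline boxes: lines that start with TAB, then a space.
count_newline_boxes() {
    grep -c "^${TAB} " "$1" || :
}

# Count word-separator boxes: lines that start with a space character,
# then the separating space, then the first coordinate.
count_space_boxes() {
    grep -c "^  [0-9]" "$1" || :
}
```

A 3.0x-style file would be expected to report 0 for both counts, while a file in the 4.0 format should report one newline box per textline.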

However, please note that the box files generated using tesseract with makebox need to be manually edited for accuracy (since the boxes are filled with OCRed text). Also, I found that for Devanagari (and probably for other complex scripts), the box generation may not match what is generated by text2image.

You can do a simple test. Create a box/tiff pair using 'text2image'. Then create a box file for that same tif using tesseract with makebox. Compare the two box files.
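
The comparison could be sketched as below, looking only at the character column so that small coordinate differences do not drown out real mismatches. The function names are hypothetical, and the sketch assumes the usual "character left bottom right top page" line layout; how the two box files are produced is left to the commands described above.

```shell
# Print the character column of a box file, one entry per line.
# Space and TAB boxes are printed as [space] and [tab] so they stay visible.
box_chars() {
    TAB=$(printf '\t')
    sed -e "s/^${TAB} .*/[tab]/" -e 's/^  [0-9].*/[space]/' -e 's/ .*//' "$1"
}

# Compare the character sequences of two box files for the same image.
# Prints a diff and returns non-zero if the sequences differ.
compare_box_chars() {
    a=$(mktemp) && b=$(mktemp) || return 2
    box_chars "$1" > "$a"
    box_chars "$2" > "$b"
    diff "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    return $rc
}
```

An empty diff only means the two files agree on which characters appear and in what order; it says nothing about whether the boxes themselves line up.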

Shreeshrii on 31 Oct 2017

@Shreeshrii Could you explain the role of each file under langdata?

[root@localhost langdata]# ls chi_sim
chi_sim.config   chi_sim.punc           chi_sim.training_text.bigram_freqs   chi_sim.unicharambigs  desired_characters
chi_sim.numbers  chi_sim.training_text  chi_sim.training_text.unigram_freqs  chi_sim.wordlist       forbidden_characters

I found that the .unigram_freqs and .bigram_freqs files do not seem to suit my needs very well.

ivanzz1001 on 1 Nov 2017

Please see https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh

My guess is that .unigram_freqs, .bigram_freqs, desired_characters and forbidden_characters are used at Google for building a representative training_text for doing training from scratch.

They are not used directly in the training process documented publicly by Ray.

Shreeshrii on 1 Nov 2017

Also see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files

NOTE Tesseract 4.00 will now run happily with a traineddata file that contains just lang.lstm, lang.lstm-unicharset and lang.lstm-recoder. The lstm-*-dawgs are optional, and none of the other components are required or used with OEM_LSTM_ONLY as the OCR engine mode. No bigrams, unichar ambigs or any of the other components are needed or even have any effect if present. The only other component that does anything is the lang.config, which can affect layout analysis, and sub-languages.

Shreeshrii on 1 Nov 2017

Under "Langdata files", it says the following:

training_text.unigram_freqs

This is a text file with a list of unigrams (characters) and the frequency with which they appear next to each other in the training_text, one unigram per line.

These do not seem to be the real frequencies in training_text, and that confuses me. I have downloaded a Chinese corpus from a national institution, and I think it is more accurate, so I want to generate my own langdata for chi_sim. But the training_text seems to be related to the other files (.unigram_freqs, .bigram_freqs).

ivanzz1001 on 1 Nov 2017

@Shreeshrii, when I used Chinese character box/tiff pairs to fine tune, there was a new problem, as follows:

$ unicharset_extractor chi_sim.black.exp0.box
Extracting unicharset from box file chi_sim.black.exp0.box
Invalid Unicode codepoint: 0xffffffe4
IsValidCodepoint(ch):Error:Assert failed:in file normstrngs.cpp, line 225
Segmentation fault (core dumped)

Have you ever seen this problem?
Looking forward to your reply! Thanks!

694376965 on 3 Nov 2017

You should search for past issues and in forum first.

See https://github.com/tesseract-ocr/tesseract/issues/1114
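
The value 0xffffffe4 looks like a single byte 0xe4 that was sign-extended, which would be consistent with the box file not being read as well-formed UTF-8. That is only an assumption; the linked issue may describe a different cause. A cheap check before running unicharset_extractor is to confirm that the file decodes as UTF-8, sketched here with a hypothetical helper built on iconv.

```shell
# Succeed if the file decodes cleanly as UTF-8, fail otherwise.
# iconv stops with a non-zero status at the first invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

For example: is_valid_utf8 chi_sim.black.exp0.box || echo "box file is not valid UTF-8".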

Shreeshrii on 3 Nov 2017

@Shreeshrii
I want to generate *.lstmf files, so I ran tesseract ./tif_box/eng.exp0.tif ./tif_box/eng.exp0 lstm.train. I got eng.exp0.txt and eng.exp0.lstmf, but eng.exp0.txt contains no characters at all.
The box file and tif file are OK.

When I run lstmtraining, I get an error:
Deserialize header failed: ~/tesstutorial/tif_box/eng.exp0.lstmf
Load of page 0 failed!
Load of images failed!!
I guess my *.lstmf file is incorrect.

PS: the box and tif files come from jTessBoxEditor, not from the text2image command.

Please give me a suggestion.

zhaiyongding on 14 Nov 2017

@Shreeshrii
After adding --sequential_training true, I got:
First document cannot be empty!!
num_pages_per_doc_ > 0:Error:Assert failed:in file imagedata.cpp, line 658

zhaiyongding on 14 Nov 2017

Sorry, I don't have any suggestions to try.

Shreeshrii on 14 Nov 2017

I face the same problem: the lstmf file does not exist.

I have already read the solution from @Shreeshrii above, and did the following:

In the langdata/language directory, I put 3 files: training_text, .box, and .tif.
For generating the .box file, I used text2image and made sure that each line has a tab and also spaces between words.
I modified tesstrain.sh as @Shreeshrii suggested, to copy the .box and .tif files to the tmp folder.
With the following command, I still get the ".lstmf does not exist or is not readable" error:

training/tesstrain.sh --fonts_dir /usr/share/fonts/ --lang my_lang --noextract_font_properties --linedata_only --exposures "0" --langdata_dir ../langdata --tessdata_dir ./tessdata --fontlist "my_font" --output_dir ../tesstutorial/my_lang

Any suggestions?

easymavinmind on 15 Dec 2017

For generating .box file, I used text2image and made sure that each line has tab and also space between words.

If you use training/tesstrain.sh it will automatically call text2image with the training_text and fonts from font list, you do not have to generate them separately. The script creates the box and tif files in tmp directory and uses them to create the lstmf file.

Please review the log file to see what errors are generated and fix.
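
A small sketch for pulling the likely failure lines out of the log. The patterns are simply the messages that appear in this thread, not a complete list, and the function name is hypothetical.

```shell
# Print log lines (with line numbers) that match error messages seen in
# this thread. Returns non-zero if nothing matched.
scan_training_log() {
    grep -n -E "ERROR|Can't open|Empty page|does not exist" "$1"
}
```

For example, scan_training_log tesstrain.log, where the path is whatever log file your run wrote.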

Shreeshrii on 15 Dec 2017

Thank you for your reply, @Shreeshrii.
Now I placed only 2 files in the langdata/my_lang directory: my training_text file and the .tif file that represents my font.
After I ran training/tesstrain.sh, it generated 6 files in the tmp directory:

.txt file (generally the same as my training_text),
.tif file (my text in training_text with my font)
.box file (represents coordinates for each character in the .tif in this directory)
.unicharset file
.xheights file, and
log file

but still no lstmf was generated.

My log file shows:

=== Starting training for language 'ind'
[Jum Des 15 15:31:27 WIB 2017] /usr/bin/text2image --fonts_dir=/usr/share/fonts/ --font=ubuntu --outputbase=/tmp/font_tmp.VhBcen4j0y/sample_text.txt --text=/tmp/font_tmp.VhBcen4j0y/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.VhBcen4j0y
Rendered page 0 to file /tmp/font_tmp.VhBcen4j0y/sample_text.txt.tif
=== Phase I: Generating training images ===
Rendering using ubuntu
[Jum Des 15 15:31:44 WIB 2017] /usr/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.VhBcen4j0y --fonts_dir=/usr/share/fonts/ --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.dImW1DMEXe/ind/ind.ubuntu.exp0 --max_pages=3 --font=ubuntu --text=../langdata/ind/ind.training_text
Rendered page 0 to file /tmp/tmp.dImW1DMEXe/ind/ind.ubuntu.exp0.tif
=== Phase UP: Generating unicharset and unichar properties files ===
[Jum Des 15 15:31:44 WIB 2017] /usr/bin/unicharset_extractor --output_unicharset /tmp/tmp.dImW1DMEXe/ind/ind.unicharset --norm_mode 1 /tmp/tmp.dImW1DMEXe/ind/ind.ubuntu.exp0.box
Extracting unicharset from box file /tmp/tmp.dImW1DMEXe/ind/ind.ubuntu.exp0.box
Wrote unicharset file /tmp/tmp.dImW1DMEXe/ind/ind.unicharset
[Jum Des 15 15:31:44 WIB 2017] /usr/bin/set_unicharset_properties -U /tmp/tmp.dImW1DMEXe/ind/ind.unicharset -O /tmp/tmp.dImW1DMEXe/ind/ind.unicharset -X /tmp/tmp.dImW1DMEXe/ind/ind.xheights --script_dir=../langdata
Loaded unicharset of size 51 from file /tmp/tmp.dImW1DMEXe/ind/ind.unicharset
Setting unichar properties
Setting script properties
Writing unicharset to file /tmp/tmp.dImW1DMEXe/ind/ind.unicharset
=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=./tessdata
[Jum Des 15 15:31:44 WIB 2017] /usr/bin/tesseract /tmp/tmp.dImW1DMEXe/ind/ind.ubuntu.exp0.tif /tmp/tmp.dImW1DMEXe/ind/ind.ubuntu.exp0 lstm.train
read_params_file: Can't open lstm.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
ERROR: /tmp/tmp.dImW1DMEXe/ind/ind.ubuntu.exp0.lstmf does not exist or is not readable

easymavinmind on 15 Dec 2017

tif file (my text in training_text with my font)

This will NOT work. You need the ttf file (FONT) in the fonts directory.
The program will automatically create the tif and box using the training
text and fonts.

read_params_file: Can't open lstm.train

This means that you do NOT have the config file named lstm.train. That is
why the lstmf file is NOT being created.

https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/configs/lstm.train

You need to have this in your tessdata directory.

Shreeshrii on 15 Dec 2017

This will NOT work. You need the ttf file (FONT) in the fonts directory.
The program will automatically create the tif and box using the training
text and fonts.

Yes, I have my ttf file in my fonts directory, and the tif file in the tmp dir was automatically created by tesstrain.sh.

This means that you do NOT have the config file named lstm.train. That is
why the lstmf file is NOT being created.

I have copied lstm.train into the tessdata directory, tessdata/configs and tessdata/tessconfigs, and made it executable with chmod.
I am confused why lstm.train still cannot be opened...

easymavinmind on 15 Dec 2017

The config files need to be in the tessdata-dir as defined by
tessdata_prefix or as specified in the command.

Make sure you are giving the correct path for it.
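
A minimal sketch for checking that path before running the training script. It assumes, as far as I know, that config names such as lstm.train are looked up in the configs and tessconfigs subdirectories of the tessdata directory. The function name is hypothetical.

```shell
# Look for a readable lstm.train config below a tessdata directory.
# Uses the first argument, or TESSDATA_PREFIX when no argument is given.
# Prints the path that was found, or a message on stderr and returns 1.
find_lstm_train_config() {
    tessdata_dir=${1:-${TESSDATA_PREFIX:-}}
    for sub in configs tessconfigs; do
        if [ -r "$tessdata_dir/$sub/lstm.train" ]; then
            echo "$tessdata_dir/$sub/lstm.train"
            return 0
        fi
    done
    echo "lstm.train not found under '$tessdata_dir'" >&2
    return 1
}
```

Calling it with the same directory that is passed as --tessdata_dir shows whether the "Can't open lstm.train" message is a path problem.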

Shreeshrii on 15 Dec 2017

@Shreeshrii thank you for your help.

My problems are now solved.
The last one happened because I had modified tesstrain.sh; I overwrote it with the original one and ran it, and then lstm.train could be opened.

easymavinmind on 18 Dec 2017

@easymavinmind

Hi, my name is June.

I'm studying tesseract 4.00.

Could I get the tesstrain.sh script that you modified?

Thank you

godofcheerup on 23 Mar 2018

@Shreeshrii
Hi, I also want to generate the .lstmf file directly from the *.tif/.box pair, but something confuses me after reading the comments above.
I found that the commands used for generating .lstmf always include an option such as --lang eng.
Does that mean it will use the traineddata from eng rather than my *.tif/.box pair for training?
Thank you!

tangsanli5201 on 26 Mar 2018

@tangsanli5201

The following is a workaround that you can try.

Create your box tiff pairs. Make sure they follow the 4.0 format. Name them in the expected way.

Copy them to the langdata/lang folder.

Modify tesstrain.sh to copy these files at the beginning of training process. After the tmp directory is created, copy box and tif to that dir.

You should also give at least one font and a training text as input, so that they will be used for training along with your box/tiff pairs.

Run the process, look at the log file, console output to verify that all files are being picked up.

Test with one small box/tif pair and training_text to understand the process.

Then try your training.
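
For the "name them in the expected way" step, here is a small sketch that checks file names against the lang.fontname.expN pattern used by the files shown in this thread (for example eng.num.exp0.box). The exact set of characters allowed in each part is an assumption, and the function name is hypothetical.

```shell
# Succeed if the file name looks like <lang>.<fontname>.exp<N>.box or
# <lang>.<fontname>.exp<N>.tif. Any leading directory is ignored.
is_training_name() {
    name=$(basename "$1")
    printf '%s\n' "$name" |
        grep -q -E '^[A-Za-z_]+\.[^.]+\.exp[0-9]+\.(box|tif)$'
}
```

For example: is_training_name ../langdata/eng/eng.num.exp0.tif || echo "unexpected file name".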

Shreeshrii on 26 Mar 2018

@Shreeshrii
Following your suggestion, I did generate an lstmf file, and I found the information about my own .tif/.box pair in the log file.
However, when I remove the following two lines from tesstrain.sh
'cp /xxx/.box ${TRAINING_DIR}/xxx.box'
'cp /xxx/.tif ${TRAINING_DIR}/xxx.tif',
and re-generate the lstmf file, I find that the sizes of xxx.traineddata and xxx.lstmf in the output directory don't change.
I wonder whether my own data are really used in the generated traineddata.
Thank you!

tangsanli5201 on 27 Mar 2018

Please see https://github.com/tesseract-ocr/tesseract/issues/841#issuecomment-297707943

and download https://github.com/tesseract-ocr/tesseract/files/961633/boxtrain.zip
This has an old version of my modified bash scripts. Please compare with the official scripts tesstrain.sh.

Ray has modified the training process since then to use a starter traineddata. So, use the files from this zipfile just as a reference. They won't work as is.

Shreeshrii on 27 Mar 2018

@Shreeshrii

Hi Shreeshrii

I don't understand the step about copying them to the langdata/lang folder. ㅜㅜ
Why is that needed?

godofcheerup on 28 Mar 2018

@Shreeshrii https://github.com/tesseract-ocr/tesseract/blob/master/src/training/tesstrain_utils.sh#L394 says I can use any language-specific configs, any idea where to find them, or create them from scratch?

srdg on 15 Jun 2018

Not required for all languages

example for sanskrit

https://github.com/tesseract-ocr/langdata/blob/master/san/san.config

Shreeshrii on 15 Jun 2018

Okay, I generated the .lstmf files! Since I couldn't find a single source describing how to do this the way I intended, I'm listing the steps here in case someone looks for them.

My original intention is to train Tesseract 4 on handwritten (English) characters and digits and check how it performs in classifying handwritten text. I had the previous .jpg (apparently these work as well), .box and .tr files available from training in tesseract 3. The box file formats for tesseract 3 and 4 are significantly different, and I manually edited them to suit the needs of tesseract 4 (a gruesome task, unless you manage to come up with an automation script). What I did next is the following:

Move into the directory where the custom images are and run unicharset_extractor [lang].[fontname].exp[num].box - this generated the unicharset file.
Next, run the following:
img_files=$(ls | grep exp0.jpg) [Note that this was the format my files were saved in; you might have them in a different format.]
and then
for img_file in ${img_files}; do tesseract ${img_file} ${img_file%.*} 'lstm.train'; done
sequentially. This generated the .lstmf files!
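
The loop above could also be sketched with a glob instead of parsing ls output, plus a check that each image really produced an .lstmf file. The function name make_all_lstmf and its arguments are hypothetical; the tesseract invocation is the same one used in the steps above.

```shell
# Run "tesseract <image> <base> lstm.train" for every *.exp*.<ext> image
# in a directory. Images that leave no non-empty .lstmf are reported on
# stderr. Succeeds only if every image produced one.
make_all_lstmf() {
    dir=$1
    ext=${2:-tif}
    failed=0
    for img in "$dir"/*.exp*."$ext"; do
        [ -e "$img" ] || continue          # the glob matched nothing
        base=${img%.*}
        tesseract "$img" "$base" lstm.train
        if [ ! -s "$base.lstmf" ]; then
            echo "no lstmf produced for $img" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}
```

For the files described above this would be called as make_all_lstmf . jpg from the image directory.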

srdg on 15 Jun 2018

Hi guys, I have a question regarding the format of the box file.

The textline based box files (WordStr ...) are NOT supported.

Is this still true? I could not find that statement in the training 4.0 tutorial.
I am confused about how I should format the box coordinates.

If textline-based box files are supported,
is the following the correct way of doing this?

Ex.
"I have an apple."

would be

"I have an apple." 100 100 200 200 0
\

If so, that would ease the annotation process.

reo911gt3 on 7 Aug 2018

The right place for asking questions is the forum.

amitdo on 15 Oct 2018

@Shreeshrii

Hi Shreeshrii,

I am trying to train tesseract 4.0 for the English language, but on specific images. I created 2 .tif and box files of the image using the QT editor, and I made sure to include the TAB and SPACES. I also created a dummy training_text file in lang_data, but when I execute the command

src/training/tesstrain.sh --fonts_dir /usr/share/fonts --fontlist FreeMono --lang eng --linedata_only --noextract_font_properties --langdata_dir ./langdata --tessdata_dir ./tessdata --output_dir ./output/engtrain --my_boxtiff_dir /home/user1/abc/tesseract-master/Training_dir

it creates a single lstm file for exp0, and that too with random coordinates (not in sync with my box and tif files). When we checked the box file in the tmp/ directory, it had been updated with data from the dummy eng.training_text file and not from the actual box file of the image that we created using the editor.

Please provide your input.

richa912 on 19 Feb 2019

I have created a simple bash script for LSTM training - finetuning for
impact. Change file locations to match your setup and let me know if it
works for you.

https://github.com/Shreeshrii/tesseract/blob/tess4train/tess4train_impact_from_full.sh

Shreeshrii on 19 Feb 2019

Renamed the script to say it uses box tiff pairs.

https://github.com/Shreeshrii/tesseract/blob/tess4train/tess4train_impact_from_boxtiff.sh

Shreeshrii on 19 Feb 2019

Thank you, Shreeshrii.
I have one more question about the langdata directory before executing the shell script. We created the dummy eng.training_text file with only 3 words. Do we need to use the same file for fine-tuning?
Also, there are multiple files required for each language, such as eng.number, eng.unicharambigs, eng.punc, eng.wordlist, etc. Do I need to create all of these?

Thank You in advance.

richa912 on 20 Feb 2019

You don't need the dummy training text for this script.

eng.number, eng.unicharambigs, eng.punc, eng.wordlist are NOT needed for
finetune LSTM training. unicharambigs is NOT used at all.

This type of finetuning uses the files already there in the
tessdata_best/eng.traineddata.

All you need for this is the box/tiff pairs and the
tessdata_best/eng.traineddata.
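
As a rough illustration of what fine-tuning from those two inputs involves, here is a minimal sketch. It is not the script linked above. The function names, all paths, the list file name and the iteration count are placeholders chosen for this example, and it assumes the .lstmf files have already been generated from the box/tiff pairs. `combine_tessdata -e` and the `lstmtraining` options are the ones described in the Tesseract 4 training documentation.

```shell
#!/usr/bin/env bash
# Sketch of fine-tuning from tessdata_best/eng.traineddata plus .lstmf files.
# All paths and the iteration count are placeholders, not values from the thread.

run() {
    # Print the command when DRY_RUN=1, otherwise execute it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

finetune_eng() {
    local best="$1"       # path to tessdata_best/eng.traineddata
    local lstmf_dir="$2"  # directory holding the generated .lstmf files
    local out="$3"        # output directory for the extracted model and checkpoints

    run mkdir -p "$out" || return 1
    # lstmtraining reads its training files from a plain text list.
    if [ "${DRY_RUN:-0}" != "1" ]; then
        ls "$lstmf_dir"/*.lstmf > "$out/eng.training_files.txt" || return 1
    fi
    # Extract the LSTM model from the existing traineddata.
    run combine_tessdata -e "$best" "$out/eng.lstm" || return 1
    # Continue training that model on the new data.
    run lstmtraining \
        --continue_from "$out/eng.lstm" \
        --traineddata "$best" \
        --train_listfile "$out/eng.training_files.txt" \
        --model_output "$out/finetuned" \
        --max_iterations 400
}
```

Running `DRY_RUN=1 finetune_eng tessdata_best/eng.traineddata ./lstmf ./output` only prints the commands, which is a cheap way to check the paths before starting a real training run.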

Shreeshrii on 20 Feb 2019

https://github.com/tesseract-ocr/langdata

You need the following files in the langdata directory to run this finetune
training script:

Common.unicharset
Latin.unicharset
radical-stroke.txt
https://github.com/tesseract-ocr/langdata/blob/master/radical-stroke.txt
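
A quick way to confirm that these three files are in place before starting is a check like the one below. This is a sketch only: `check_langdata` is a made-up helper name, and it assumes the files sit directly in the directory you pass as the langdata directory.

```shell
#!/usr/bin/env bash
# Sketch: verify that the three langdata files listed above are present.
# check_langdata is a hypothetical helper name for this example.

check_langdata() {
    local dir="$1" missing=0 f
    for f in Common.unicharset Latin.unicharset radical-stroke.txt; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    # Exit status 0 when everything is present, 1 otherwise.
    return "$missing"
}
```

For example, `check_langdata ./langdata && echo "langdata looks complete"` prints the message only when all three files were found.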

Shreeshrii on 20 Feb 2019

Hi Shree,

I get a "Page not found" error when I click on https://github.com/Shreeshrii/tesseract/blob/tess4train/tess4train_impact_from_full.sh
Please help.

kamrapooja on 17 Jun 2019

New repo is https://github.com/Shreeshrii/tess4training

Shreeshrii on 17 Jun 2019
