Tesseract: Textline Box Files Tesseract 4.0 bad wording?

Created on 27 Mar 2019 · 9 Comments · Source: tesseract-ocr/tesseract

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 states that:

the required format is still the tiff/box file pair, except that the boxes only need to cover a textline instead of individual characters.

https://github.com/tesseract-ocr/tesseract/wiki/Making-Box-Files---4.0 states pretty much the same:

The required format for LSTM 4.0alpha is still the tiff/box file pair, except that the boxes only need to cover a textline instead of individual characters. 'Newline' boxes with tab as the character must be inserted between textlines to indicate the end-of-line.

But the example box file has individual character bboxes:

T 112 4663 140 4696 0
e 140 4662 160 4686 0
s 163 4662 179 4686 0
s 182 4661 198 4686 0
e 200 4661 220 4685 0
r 221 4662 238 4685 0
a 239 4661 260 4685 0
c 261 4661 281 4685 0
t 281 4661 296 4691 0
  296 4661 311 4696 0
O 311 4661 344 4696 0
C 347 4661 377 4696 0
R 378 4661 414 4695 0
     414 4694 415 4695 0

Does this mean that we need to create character bboxes AND textline bboxes?
If so, the wording should be something like:
"the required format is still the tiff/box file pair, except that the boxes need to cover a textline in addition to individual characters."

Or is the example box file wrong?

banderlog

Most helpful comment

Multiple formats of box files are accepted by Tesseract4 for LSTM training, though they are different from the one used by Tesseract 3.

Box file generated by text2image from font files and training text:

I 114 4655 120 4691 0
n 127 4655 150 4682 0
f 152 4655 169 4692 0
o 168 4654 193 4682 0
r 197 4654 213 4681 0
m 214 4654 250 4681 0
a 255 4654 280 4681 0
t 282 4654 295 4689 0
i 298 4654 304 4690 0
o 308 4654 333 4681 0
n 337 4654 360 4681 0
  360 4653 378 4691 0
G 378 4653 413 4691 0
r 418 4653 434 4680 0
o 434 4653 459 4680 0
u 463 4653 486 4679 0
p 491 4643 515 4680 0
s 517 4653 540 4680 0
  540 4653 555 4690 0

Generated by tesseract using lstmbox config from image files - each char uses coordinates of its entire line.

I 114 4640 1912 4692 0
n 114 4640 1912 4692 0
f 114 4640 1912 4692 0
o 114 4640 1912 4692 0
r 114 4640 1912 4692 0
m 114 4640 1912 4692 0
a 114 4640 1912 4692 0
t 114 4640 1912 4692 0
i 114 4640 1912 4692 0
o 114 4640 1912 4692 0
n 114 4640 1912 4692 0
  114 4640 1912 4692 0
G 114 4640 1912 4692 0
r 114 4640 1912 4692 0
o 114 4640 1912 4692 0
u 114 4640 1912 4692 0
p 114 4640 1912 4692 0
s 114 4640 1912 4692 0
  114 4640 1912 4692 0

Generated by tesseract using wordstrbox config from image files - Uses Wordstr and text for whole line

WordStr 114 4640 1907 4692 0 #Information Groups for public OPTIONAL, jaundice Proterozoic Have LOCATION 
     1908 4640 1912 4692 0
WordStr 112 4544 2015 4592 0 #mixed, Male By TEXT Cove... ¥ INSTABILITY About WERE Crimson THAT HOPKINS 
     2016 4544 2020 4592 0

Please note that box files generated using makebox config file are OK for Tesseract3 but not for Tesseract4 LSTM training.

I 114 4654 120 4691 0
n 127 4654 150 4682 0
f 152 4654 169 4692 0
o 168 4654 193 4682 0
r 197 4654 213 4682 0
m 214 4654 250 4682 0
a 230 4653 270 4692 0
t 255 4654 280 4682 0
i 282 4653 304 4691 0
o 308 4653 333 4681 0
n 337 4653 360 4681 0
G 378 4653 413 4691 0
r 395 4643 435 4691 0
o 418 4653 434 4681 0
u 434 4653 459 4681 0
p 463 4653 486 4680 0
s 491 4643 540 4681 0

The attached zip file has a sample tif file and the different types of box files for it, so it is easy to see the additional line with a TAB character that marks EOL in the box files for Tesseract 4.

boxfiles.zip

Shreeshrii on 27 Mar 2019

👍7 🎉2
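
The TAB end-of-line entry described in the comment above can be used to recover text lines from a box file. The following is a minimal Python sketch, not part of Tesseract: the names `Box`, `parse_box_line`, and `split_text_lines` are hypothetical, and it assumes the per-character layout `<symbol> <left> <bottom> <right> <top> <page>` seen in the samples (it does not handle WordStr lines).

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one '<symbol> <left> <bottom> <right> <top> <page>' entry.

    Splitting from the right keeps a symbol that is itself a space or a tab.
    """
    symbol, *nums = line.rstrip("\r\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in nums)
    return Box(symbol, left, bottom, right, top, page)


def split_text_lines(box_text: str) -> List[str]:
    """Group symbols into text lines, closing a line at each tab entry."""
    lines: List[str] = []
    current: List[str] = []
    for raw in box_text.split("\n"):
        if raw == "":
            continue
        box = parse_box_line(raw)
        if box.symbol == "\t":  # the 'newline' box that marks end-of-line
            lines.append("".join(current))
            current = []
        else:
            current.append(box.symbol)
    if current:  # trailing symbols without an end-of-line entry
        lines.append("".join(current))
    return lines
```

A box file with no tab entries at all comes back as a single line, which is one quick way to spot a file that lacks the end-of-line markers.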

All 9 comments

The lstm training does not really need individual char coordinates.

For each char, you can give coordinates of its entire line.

amitdo on 27 Mar 2019

👍2
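
As an illustration of giving each character the coordinates of its entire line, here is a small Python sketch that rewrites a per-character box file so that every symbol carries the union box of its text line. The function names are hypothetical, and it assumes the input already contains the tab end-of-line entries. The line box is computed from the character boxes, which may differ slightly from what the lstmbox config itself writes.

```python
from typing import List, Tuple

Entry = Tuple[str, List[int]]  # (symbol, [left, bottom, right, top, page])


def _flush(pending: List[Entry]) -> List[str]:
    """Emit the pending symbols, each with the union box of the line."""
    if not pending:
        return []
    left = min(c[0] for _, c in pending)
    bottom = min(c[1] for _, c in pending)
    right = max(c[2] for _, c in pending)
    top = max(c[3] for _, c in pending)
    return [f"{sym} {left} {bottom} {right} {top} {c[4]}" for sym, c in pending]


def to_line_boxes(box_text: str) -> str:
    """Replace per-character boxes with the box of the whole text line."""
    out: List[str] = []
    pending: List[Entry] = []
    for raw in box_text.split("\n"):
        if raw == "":
            continue
        symbol, *nums = raw.rsplit(" ", 5)
        coords = [int(n) for n in nums]
        if symbol == "\t":  # end-of-line entry: keep it unchanged
            out.extend(_flush(pending))
            out.append(raw)
            pending = []
        else:
            pending.append((symbol, coords))
    out.extend(_flush(pending))  # symbols after the last end-of-line entry
    return "\n".join(out) + "\n"
```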

Or is the example box file wrong?

Well, it's not wrong. Tesseract will accept it.

amitdo on 27 Mar 2019

WordStr box files

Create phototest-wordstr.box from phototest.tif:

cd tesseract/test/testing
tesseract phototest.tif phototest-wordstr  -l eng --psm 6 wordstrbox

Review and edit phototest-wordstr.box to have the correct text for each line
Save as phototest.box
Use phototest.tif and phototest.box to create phototest.lstmf files

In case a groundtruth text file is available for the image, you can try to automate the edit process.
This will work if groundtruth textlines match image lines.

Remove the OCRed text from the box file
Delete any blank lines from ground truth file
Add blank line after every textline in ground truth file
paste both files together

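# 1. Remove the OCRed text after "#" from the box file (GNU sed assumed)
sed -i -e 's/\([0-9] \#\).*$/\1/g'  phototest-wordstr.box
# 2. Delete any blank lines from the ground truth file
sed '/^$/d' phototest.gold.txt > phototest-gt.txt
# 3. Add a blank line after every textline, to pair with the tab EOL line
sed -i -e 's/$/\n/g' phototest-gt.txt
# 4. Paste both files together line by line with an empty delimiter (GNU paste)
paste --delimiters="\0"  phototest-wordstr.box  phototest-gt.txt > phototest.box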

Review phototest.box to make sure that the lines match.

Shreeshrii on 27 Mar 2019
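
For readers without GNU sed and paste, the same merge can be sketched in Python. This is an illustration under assumptions rather than an official tool: `merge_ground_truth` is a hypothetical name, the box file is assumed to contain only WordStr lines and tab end-of-line lines, and whitespace-only ground truth lines are dropped along with empty ones. Unlike the shell pipeline, it raises an error when the number of WordStr lines and ground truth lines differ.

```python
from typing import List


def merge_ground_truth(wordstr_box: str, ground_truth: str) -> str:
    """Replace the OCRed text of each WordStr line with ground truth text."""
    gt_lines = [ln for ln in ground_truth.split("\n") if ln.strip()]
    out: List[str] = []
    used = 0
    for raw in wordstr_box.split("\n"):
        if raw == "":
            continue
        if raw.startswith("WordStr "):
            if used >= len(gt_lines):
                raise ValueError("more WordStr lines than ground truth lines")
            # Keep 'WordStr <left> <bottom> <right> <top> <page>', swap the text
            head = raw.split(" #", 1)[0]
            out.append(f"{head} #{gt_lines[used]}")
            used += 1
        else:
            out.append(raw)  # tab end-of-line entry, copied unchanged
    if used != len(gt_lines):
        raise ValueError("more ground truth lines than WordStr lines")
    return "\n".join(out) + "\n"
```

The result still needs the same manual review, since a matching line count does not guarantee that each ground truth line belongs to the image line it was paired with.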

@Shreeshrii many thanks for the great and detailed explanation!

I'll put some changes into the wiki to clarify this question.

banderlog on 27 Mar 2019

The lstmbox and wordstrbox options have been added recently. Please try them out with your image files.

Thank you for changing the wiki to clarify this.

Shreeshrii on 27 Mar 2019

»WordStr« format: the lines beginning with a tab must have 1 space character after the tab, before the first digit.
If not, you get »Encoding of string failed!«

jbarth-ubhd on 24 Apr 2019
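
The spacing rule from the comment above is easy to break when editing a box file by hand, so a quick check may help. The sketch below is a hypothetical helper pair, not part of Tesseract. It assumes the only requirement is "tab, one space, then a digit" and flags or repairs tab lines that deviate from it.

```python
import re
from typing import List


def find_bad_eol_lines(box_text: str) -> List[int]:
    """Return 1-based numbers of tab lines not shaped like '<TAB> <digit>...'."""
    bad: List[int] = []
    for number, raw in enumerate(box_text.split("\n"), start=1):
        if raw.startswith("\t") and not re.match(r"\t \d", raw):
            bad.append(number)
    return bad


def fix_eol_lines(box_text: str) -> str:
    """Rewrite every tab line as a tab, one space, then the coordinates."""
    fixed: List[str] = []
    for raw in box_text.split("\n"):
        if raw.startswith("\t"):
            fixed.append("\t " + raw.lstrip("\t "))
        else:
            fixed.append(raw)
    return "\n".join(fixed)
```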

@Shreeshrii hey, I need a little help, as I have to train Tesseract on my documents.
I have already read some training issues, and these are the steps I plan to perform:

!tesseract "document.png" "document" -l eng --psm 11 wordstrbox
This will give me line-level boxes.
Correct the OCR text. Copy the image file and box file into the tessdata folder.

src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
--noextract_font_properties --langdata_dir ../langdata \
--tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain

. Is this the right approach, or do I have to do anything else?
. Will it also fine-tune detection?
. Can I fine-tune on word level, like with linedata_only?
. I am using Tesseract 5 with tessedit_do_invert=0, but it is still slow. Which steps can I take at training time to improve the speed?