https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 states that:
the required format is still the tiff/box file pair, except that the boxes only need to cover a textline instead of individual characters.
https://github.com/tesseract-ocr/tesseract/wiki/Making-Box-Files---4.0 states pretty much the same:
The required format for LSTM 4.0alpha is still the tiff/box file pair, except that the boxes only need to cover a textline instead of individual characters. 'Newline' boxes with tab as the character must be inserted between textlines to indicate the end-of-line.
But the example box file has individual character bboxes:
T 112 4663 140 4696 0
e 140 4662 160 4686 0
s 163 4662 179 4686 0
s 182 4661 198 4686 0
e 200 4661 220 4685 0
r 221 4662 238 4685 0
a 239 4661 260 4685 0
c 261 4661 281 4685 0
t 281 4661 296 4691 0
296 4661 311 4696 0
O 311 4661 344 4696 0
C 347 4661 377 4696 0
R 378 4661 414 4695 0
414 4694 415 4695 0
Does it mean that we need to create character bboxes AND textline bboxes?
If so, the suggested wording should be something like:
"the required format is still the tiff/box file pair, except that the boxes need to cover a textline in addition to individual characters."
Or is the example box file wrong?
The LSTM training does not really need individual character coordinates.
For each character, you can give the coordinates of its entire line.
Or is the example box file wrong?
Well, it's not wrong. Tesseract will accept it.
Multiple formats of box files are accepted by Tesseract4 for LSTM training, though they are different from the one used by Tesseract 3.
I 114 4655 120 4691 0
n 127 4655 150 4682 0
f 152 4655 169 4692 0
o 168 4654 193 4682 0
r 197 4654 213 4681 0
m 214 4654 250 4681 0
a 255 4654 280 4681 0
t 282 4654 295 4689 0
i 298 4654 304 4690 0
o 308 4654 333 4681 0
n 337 4654 360 4681 0
360 4653 378 4691 0
G 378 4653 413 4691 0
r 418 4653 434 4680 0
o 434 4653 459 4680 0
u 463 4653 486 4679 0
p 491 4643 515 4680 0
s 517 4653 540 4680 0
540 4653 555 4690 0
lstmbox config from image files - each char uses coordinates of its entire line.
I 114 4640 1912 4692 0
n 114 4640 1912 4692 0
f 114 4640 1912 4692 0
o 114 4640 1912 4692 0
r 114 4640 1912 4692 0
m 114 4640 1912 4692 0
a 114 4640 1912 4692 0
t 114 4640 1912 4692 0
i 114 4640 1912 4692 0
o 114 4640 1912 4692 0
n 114 4640 1912 4692 0
114 4640 1912 4692 0
G 114 4640 1912 4692 0
r 114 4640 1912 4692 0
o 114 4640 1912 4692 0
u 114 4640 1912 4692 0
p 114 4640 1912 4692 0
s 114 4640 1912 4692 0
114 4640 1912 4692 0
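To make the difference between the two layouts concrete, here is a minimal Python sketch that turns per-character boxes for a single text line into the "every char carries the line box" layout. It is not Tesseract code: the function name `to_line_boxes` and the tuple layout are made up for illustration, and reusing the line box for the end-of-line row is an assumption taken from the last row of the sample above.

```python
from typing import List, Tuple

# (symbol, left, bottom, right, top, page): hypothetical tuple layout for this sketch
CharBox = Tuple[str, int, int, int, int, int]


def to_line_boxes(chars: List[CharBox]) -> List[str]:
    """Give every character of ONE text line the bounding box of the whole line."""
    left = min(c[1] for c in chars)
    bottom = min(c[2] for c in chars)
    right = max(c[3] for c in chars)
    top = max(c[4] for c in chars)
    page = chars[0][5]
    rows = [f"{sym} {left} {bottom} {right} {top} {page}" for sym, *_ in chars]
    # End-of-line marker: a tab as the "character", then one space before the
    # first digit. Reusing the line box here is an assumption of this sketch.
    rows.append(f"\t {left} {bottom} {right} {top} {page}")
    return rows


if __name__ == "__main__":
    line = [("T", 112, 4663, 140, 4696, 0), ("e", 140, 4662, 160, 4686, 0)]
    print("\n".join(to_line_boxes(line)))
```

The caller has to split the characters into lines first; the sketch does not try to detect line breaks.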
wordstrbox config from image files - Uses WordStr and text for whole line
WordStr 114 4640 1907 4692 0 #Information Groups for public OPTIONAL, jaundice Proterozoic Have LOCATION
1908 4640 1912 4692 0
WordStr 112 4544 2015 4592 0 #mixed, Male By TEXT Cove... ¥ INSTABILITY About WERE Crimson THAT HOPKINS
2016 4544 2020 4592 0
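If you want to write such rows yourself, a small Python sketch of the WordStr layout might look like this. The helper name `wordstr_rows` is hypothetical, and the +1 / +5 offsets for the end-of-line box are simply copied from the sample rows above (1907 becomes 1908 and 1912); I have not verified that Tesseract requires those exact values.

```python
from typing import List


def wordstr_rows(left: int, bottom: int, right: int, top: int,
                 text: str, page: int = 0) -> List[str]:
    """Return the two box-file rows for one text line in WordStr format."""
    text_row = f"WordStr {left} {bottom} {right} {top} {page} #{text}"
    # Tab row marks the end of the line: tab, one space, then the coordinates.
    # The +1 / +5 offsets mirror the sample rows; they are an assumption here.
    eol_row = f"\t {right + 1} {bottom} {right + 5} {top} {page}"
    return [text_row, eol_row]


if __name__ == "__main__":
    print("\n".join(wordstr_rows(114, 4640, 1907, 4692, "Information Groups")))
```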
Please note that box files generated using the makebox config file are OK for Tesseract3 but not for Tesseract4 LSTM training.
I 114 4654 120 4691 0
n 127 4654 150 4682 0
f 152 4654 169 4692 0
o 168 4654 193 4682 0
r 197 4654 213 4682 0
m 214 4654 250 4682 0
a 230 4653 270 4692 0
t 255 4654 280 4682 0
i 282 4653 304 4691 0
o 308 4653 333 4681 0
n 337 4653 360 4681 0
G 378 4653 413 4691 0
r 395 4643 435 4691 0
o 418 4653 434 4681 0
u 434 4653 459 4681 0
p 463 4653 486 4680 0
s 491 4643 540 4681 0
The attached zip file has a sample tif file and the different types of box files for it, so that it is easy to see the additional line with a TAB character used to mark EOL in the box files for Tesseract4.
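For a quick sanity check before training, a Python sketch like the following only looks for rows that start with a tab. The function name `has_eol_markers` is made up here, and the check is necessary rather than sufficient: it tells you that end-of-line rows exist, not that the rest of the file is valid.

```python
def has_eol_markers(box_text: str) -> bool:
    """True if at least one row starts with a tab, i.e. marks an end of line."""
    return any(row.startswith("\t") for row in box_text.splitlines())


if __name__ == "__main__":
    makebox_style = "I 114 4654 120 4691 0\nn 127 4654 150 4682 0\n"
    wordstr_style = "WordStr 114 4640 1907 4692 0 #Information\n\t 1908 4640 1912 4692 0\n"
    print(has_eol_markers(makebox_style), has_eol_markers(wordstr_style))
```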
Create phototest-wordstr.box from phototest.tif:
cd tesseract/test/testing
tesseract phototest.tif phototest-wordstr -l eng --psm 6 wordstrbox
Review and edit phototest-wordstr.box to have the correct text for each line
Save as phototest.box
Use phototest.tif and phototest.box to create phototest.lstmf files
In case a groundtruth text file is available for the image, you can try to automate the edit process.
This will work if groundtruth textlines match image lines.
Remove the OCRed text from the box file
Delete any blank lines from the ground truth file
Add a blank line after every textline in the ground truth file
Paste both files together
sed -i -e 's/\([0-9] \#\).*$/\1/g' phototest-wordstr.box
sed '/^$/d' phototest.gold.txt > phototest-gt.txt
sed -i -e 's/$/\n/g' phototest-gt.txt
paste --delimiters="\0" phototest-wordstr.box phototest-gt.txt > phototest.box
Review phototest.box to make sure that the lines match.
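The same idea can be sketched in Python for those who prefer it over sed and paste. This is not a line-for-line port: instead of relying on alternating rows, it matches rows that start with `WordStr` and raises an error if the number of ground-truth lines differs. The function name `merge_ground_truth` is hypothetical.

```python
import re
from typing import List


def merge_ground_truth(box_text: str, gt_text: str) -> str:
    """Replace the OCR text of each WordStr row with the matching ground-truth line."""
    # Drop blank (or whitespace-only) lines from the ground truth.
    gt_lines = [ln for ln in gt_text.splitlines() if ln.strip()]
    box_rows = box_text.splitlines()
    text_rows = [r for r in box_rows if r.startswith("WordStr ")]
    if len(text_rows) != len(gt_lines):
        raise ValueError(
            f"{len(text_rows)} WordStr rows but {len(gt_lines)} ground-truth lines"
        )
    gt_iter = iter(gt_lines)
    merged: List[str] = []
    for row in box_rows:
        if row.startswith("WordStr "):
            # Keep everything up to and including "<page> #", drop the OCR text.
            prefix = re.sub(r"(\d #).*$", r"\1", row)
            merged.append(prefix + next(gt_iter))
        else:
            merged.append(row)  # tab rows (end-of-line markers) pass through
    return "\n".join(merged) + "\n"
```

The result still needs the manual review mentioned above, since the sketch cannot tell whether a ground-truth line really belongs to the image line it was paired with.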
@Shreeshrii many thanks for the great and detailed explanation!
I'll make some changes to the wiki to clarify this question.
The lstmbox and wordstrbox options have been added recently. Please try them out with your image files.
Thank you for changing the wiki to clarify this.
»WordStr« format: lines beginning with a tab must have exactly one space character before the first digit.
Otherwise you get »Encoding of string failed!«
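If a box file was edited by hand and the spacing got lost, a small Python sketch like this could restore the "tab, one space, digits" layout. The function name `normalize_eol_rows` is made up for this example, and it only touches rows that begin with a tab.

```python
import re


def normalize_eol_rows(box_text: str) -> str:
    """Ensure each tab-led row has exactly one space between the tab and the first digit."""
    # (?m) makes ^ match at the start of every row; " *" swallows zero or more spaces.
    return re.sub(r"(?m)^\t *(?=\d)", "\t ", box_text)


if __name__ == "__main__":
    broken = "WordStr 114 4640 1907 4692 0 #Information\n\t1908 4640 1912 4692 0\n"
    print(repr(normalize_eol_rows(broken)))
```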
@Shreeshrii Hey, I need a little help, as I have to train Tesseract on my documents.
I have already read some training issues and I have the steps that I can perform.
So is this the right approach, or do I have to do anything else?
It will also fine-tune detection, right?
@kbrajwani,
Please use the forum for asking questions.
boxfiles.zip