Hi, I'm trying to train a new Tesseract Chinese dictionary using jTessBoxEditor.
The tool creates all the files necessary to train Tesseract.
I have 273 characters to train. During training I get this error for only two of them:
* Moving generated traineddata file to tessdata folder *
* Training Completed *
* Run Tesseract for Training *
[C:Users\allvilardi\Downloads\jTessBoxEditorFX-2.0-Beta\jTessBoxEditorFX\tesseract-ocr/tesseract, CT_calibri.calibri.exp0.tif, CT_calibri.calibri.exp0, box.train]
Tesseract Open Source OCR Engine v4.0.0-alpha.20170804 with Leptonica
Page 1
FAIL!
APPLY_BOXES: boxfile line 45/四 ((1061,3024),(1124,3082)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 125/盤 ((1092,2680),(1164,2751)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
Boxes read from boxfile: 273
Boxes failed resegmentation: 2
Found 271 good blobs.
Generated training data for 28 words
I've also adjusted the boxes manually for those two characters, but without success. In the box GUI, the boxes seem to be fine.
Does anyone know how to fix this problem?
P.S. I also get this error on Korean characters, for all of the characters.
These are the boxes drawn on those characters:

Could anyone help me? Thanks.
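When a training run reports many of these failures, it can help to pull them out of the log programmatically before inspecting the boxes. The following is a minimal sketch, not part of Tesseract or jTessBoxEditor: the function name `parse_failures` and the regular expression are assumptions based only on the log format shown in this thread, and the coordinate order (left, bottom, right, top) is assumed to match the box-file convention.

```python
import re

# Pattern modelled on the log lines quoted in this thread (an assumption,
# not an official Tesseract log specification).
FAILURE_RE = re.compile(
    r"APPLY_BOXES:\s*boxfile line (\d+)/(.+?) "
    r"\(\((\d+),(\d+)\),\((\d+),(\d+)\)\): FAILURE! (.*)"
)


def parse_failures(log_text):
    """Return one dict per 'FAILURE!' line found in a training log."""
    failures = []
    for line in log_text.splitlines():
        match = FAILURE_RE.search(line)
        if match is None:
            continue
        line_no, char, left, bottom, right, top, reason = match.groups()
        failures.append({
            "line": int(line_no),  # line number as printed in the log
            "char": char,          # character from the box file
            "box": (int(left), int(bottom), int(right), int(top)),
            "reason": reason.strip(),
        })
    return failures


# Log excerpt copied from the report above.
SAMPLE_LOG = """\
FAIL!
APPLY_BOXES: boxfile line 45/四 ((1061,3024),(1124,3082)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 125/盤 ((1092,2680),(1164,2751)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
Boxes read from boxfile: 273
Boxes failed resegmentation: 2
"""

if __name__ == "__main__":
    for failure in parse_failures(SAMPLE_LOG):
        print(failure["line"], failure["char"], failure["box"])
```

The resulting line numbers and characters can then be checked against the box file in jTessBoxEditor.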
I have the same problem with the Arabic language.
* Run Tesseract for Training *
[K:\train tesseract\jTessBoxEditor\tesseract-ocr/tesseract, ara.mylotus.exp0.tif, ara.mylotus.exp0, box.train]
Tesseract Open Source OCR Engine v4.0.0-alpha.20170804 with Leptonica
Page 1
row xheight=23, but median xheight = 30.5
APPLY_BOXES: boxfile line 6/ق ((2324,3143),(2338,3173)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 7/ع ((2303,3119),(2334,3157)): FAILURE! Couldn't find a matching blob
....
..
.
.
APPLY_BOXES:
Boxes read from boxfile: 888
Boxes failed resegmentation: 176
For Arabic, you will get better results using tesseract 4.0alpha.
ShreeDevi
Actually, I'm already using v4, as shown in the training log:
Tesseract Open Source OCR Engine v4.0.0-alpha.20170804 with Leptonica
I have the same problem. In my case I am trying to train on digits from a display.
tesseract day_2_60_0_G3.cfont1.exp0.tif day_2_60_0_G3.cfont1.exp0 -l dianoche2 -psm 7 nobatch box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
FAIL!
APPLY_BOXES: boxfile line 90/. ((1079,11),(1081,17)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 141/. ((1758,2),(1762,16)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
Boxes read from boxfile: 189
Boxes failed resegmentation: 2
Found 187 good blobs.
Leaving 1 unlabelled blobs in 0 words.
TRAINING ... Font name = cfont1
Generated training data for 60 words

day_2_60_0_G3.cfont1.exp0.txt
I attached the file in .txt format because I couldn't attach it in .box format.
Could anyone help me? Thanks.
These errors have existed for a long time. I think it is a problem with how Tesseract segments the page and finds lines. If you only have a couple of these errors, I would say ignore them and proceed to the next step.
Hi, did you solve it?
If I remember correctly, training goes well if you train only the characters that had the box segmentation problem. Then, for the final training, I supply both sets of files separately to create a single dictionary.
@alevillard can you give me a bit more detail? What do you mean by 'with box segmentation'? Do I need to specify some arguments?
Hi, I'll try to answer.
I mean the segmentation problem where Tesseract cannot find a matching blob during training:
Tesseract Open Source OCR Engine v4.0.0-alpha.20170804 with Leptonica
Page 1
FAIL!
APPLY_BOXES:boxfile line 45/四 ((1061,3024),(1124,3082)): FAILURE! Couldn't find a matching blob
FAIL!
..
Boxes failed resegmentation: 2
If you isolate those characters in a separate .box and .tif file, Tesseract sometimes succeeds in training them.
So, when you have file1.box and file1.tif with one set of characters, and file2.box and file2.tif in the same folder, jTessBoxEditor can merge the character sets of both files into a single dictionary.
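The box-file half of this workaround can be sketched as follows. This is only an illustration under stated assumptions: the helper name `split_box_lines` is hypothetical, the line numbers printed in the log are assumed to be 1-based, and the sample entries are made up apart from the coordinates of line 45 quoted above. It does not produce the matching .tif, which still has to be prepared separately (for example in an image editor or in jTessBoxEditor).

```python
def split_box_lines(box_lines, failed_line_numbers):
    """Split box-file entries into (good, failed) lists.

    box_lines: entries of the form "char left bottom right top page".
    failed_line_numbers: line numbers reported in the APPLY_BOXES
    failures (assumed here to be 1-based).
    """
    failed_set = set(failed_line_numbers)
    good, failed = [], []
    for number, line in enumerate(box_lines, start=1):
        if number in failed_set:
            failed.append(line)
        else:
            good.append(line)
    return good, failed


if __name__ == "__main__":
    # Hypothetical three-line box file; only the second entry failed.
    sample = [
        "一 100 3024 160 3082 0",
        "四 1061 3024 1124 3082 0",
        "五 1200 3024 1260 3082 0",
    ]
    good, failed = split_box_lines(sample, [2])
    print("keep in file1.box:", good)
    print("move to file2.box:", failed)
```

The coordinates in the second list refer to the original image, so they would need to be updated to match whatever new .tif is created for the isolated characters.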
@amitdo sir, after tracing the source code, our team found what looks like a logical bug in how the *.tr file is generated by the command "tesseract chi.font.exp0.tif chi.font.exp0 nobatch box.train".
The program flow is main() -> ProcessPages() -> ProcessPageInternal() -> ProcessPage() -> Recognize() -> ApplyBoxes() -> ResegmentCharBox(), and the suspected bug is in the ResegmentCharBox() function.
It calls bounding_box().major_overlap() to judge whether a box from the box file is reasonable. Here is the code:
inline bool TBOX::major_overlap(  // Do boxes overlap more that half.
    const TBOX &box) const {
  int overlap = MIN(box.top_right.x(), top_right.x());
  overlap -= MAX(box.bot_left.x(), bot_left.x());
  overlap += overlap;
  if (overlap < MIN(box.width(), width()))
    return false;
  overlap = MIN(box.top_right.y(), top_right.y());
  overlap -= MAX(box.bot_left.y(), bot_left.y());
  overlap += overlap;
  if (overlap < MIN(box.height(), height()))
    return false;
  return true;
}
Don't you think this step is unnecessary? Since we have already prepared a good *.box file (checked and corrected in jTessBoxEditor), this step filters out useful box information. In addition, we guess that the "blob_box" comes from the third-party Leptonica library, but in our tests it does not reliably give good results.
The attached zip contains the test image and box file.
Run: tesseract temp.tif temp nobatch box.train
You can see that many blobs are missing.
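To see when this check rejects a box, here is a small Python port of the quoted function. It is a sketch for experimenting with coordinates, not Tesseract code: the `Box` class and the example blob rectangles are hypothetical, and only the first box's coordinates come from the log above. In the C++ original, the members without the `box.` prefix belong to the object the method is called on, which according to the description above is the result of bounding_box().

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle: (left, bottom) and (right, top) corners."""
    left: int
    bottom: int
    right: int
    top: int

    @property
    def width(self):
        return self.right - self.left

    @property
    def height(self):
        return self.top - self.bottom


def major_overlap(this, box):
    """Port of the quoted TBOX::major_overlap.

    Returns True only if, in both x and y, twice the overlap is at least
    the smaller of the two extents.
    """
    overlap = min(box.right, this.right) - max(box.left, this.left)
    overlap += overlap
    if overlap < min(box.width, this.width):
        return False
    overlap = min(box.top, this.top) - max(box.bottom, this.bottom)
    overlap += overlap
    if overlap < min(box.height, this.height):
        return False
    return True


if __name__ == "__main__":
    # Box for line 45 from the log; the two blobs are made-up examples.
    file_box = Box(1061, 3024, 1124, 3082)
    close_blob = Box(1065, 3020, 1120, 3085)
    shifted_blob = Box(1101, 3024, 1164, 3082)
    print(major_overlap(close_blob, file_box))
    print(major_overlap(shifted_blob, file_box))
```

With these made-up rectangles, the blob that nearly coincides with the box passes the check, while the one shifted 40 pixels to the right fails it on the x-axis test.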
@GitHubGS,
I'm not a core developer, and I have no answer to your question, sorry.
@amitdo anyway, many thanks to you and your team for your brilliant work!
Duplicates
I suggest that we close the older issues since this has the most discussion.
@GitHubGS: Hi, I have encountered a similar problem while training Tesseract. In the code you posted, I understand that the parameters with the prefix 'box' refer to the box as defined in the box file.
MIN(box.top_right.x(), top_right.x())
For example, here the first argument is the x-coordinate of the box's top-right corner. But what is top_right.x()?
Is it for the detected blob?
Best Regards
same problem
Can anyone tell me what tile width and height should be given when running openalpr-utils-prepcharsfortraining?
same problem
The problem still persists.
same problem 😢
same problem 😢
You could try adding -l chi_tra to resolve it.
same problem