Tesseract: Training tesseract, APPLY_BOXES: ... FAILURE! Couldn't find a matching blob

Created on 9 Oct 2017 · 23 Comments · Source: tesseract-ocr/tesseract

Hi, I'm trying to train a new Tesseract Chinese dictionary using jTessBoxEditor.
The tool creates all the files needed to train Tesseract.
I have 273 characters to train. During training I get this error for only two of them:

* Moving generated traineddata file to tessdata folder *
* Training Completed *
* Run Tesseract for Training *
[C:Users\allvilardi\Downloads\jTessBoxEditorFX-2.0-Beta\jTessBoxEditorFX\tesseract-ocr/tesseract, CT_calibri.calibri.exp0.tif, CT_calibri.calibri.exp0, box.train]
Tesseract Open Source OCR Engine v4.0.0-alpha.20170804 with Leptonica
Page 1
FAIL!
APPLY_BOXES: boxfile line 45/四 ((1061,3024),(1124,3082)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 125/盤 ((1092,2680),(1164,2751)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
Boxes read from boxfile: 273
Boxes failed resegmentation: 2
Found 271 good blobs.
Generated training data for 28 words

I've also adjusted the boxes for those two characters manually, but without success. In the box editor GUI, the boxes seem fine.
Does anyone know how to fix this problem?
P.S. I also get this error with Korean characters, for all of the characters.

These are the boxes drawn on those characters:

Can anyone help me? Thanks.


alevillard

👍12

Most helpful comment

These errors have existed for a long time. I think it is a problem with how tesseract segments the page and finds lines. If you only have a couple of these errors, I would say to ignore them and proceed to next step.

Shreeshrii on 18 Oct 2017

👍3
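
To judge whether a run has "only a couple" of these errors, the APPLY_BOXES output can be summarized automatically. The following is a minimal Python sketch, not part of Tesseract or jTessBoxEditor. It assumes the log format shown in the first post; the function name `summarize` and the regular expressions are made up for illustration.

```python
import re

# Patterns are assumptions based on the log lines quoted in this thread.
FAIL_RE = re.compile(
    r"APPLY_BOXES:\s*boxfile line (\d+)/(\S+) "
    r"\(\((\d+),(\d+)\),\((\d+),(\d+)\)\): FAILURE!"
)
READ_RE = re.compile(r"Boxes read from boxfile:\s*(\d+)")
FAILED_RE = re.compile(r"Boxes failed resegmentation:\s*(\d+)")


def summarize(log_text):
    """Collect failed boxes and the failure ratio from a training log."""
    failures = [
        (int(m.group(1)), m.group(2), tuple(int(m.group(i)) for i in range(3, 7)))
        for m in FAIL_RE.finditer(log_text)
    ]
    read = READ_RE.search(log_text)
    failed = FAILED_RE.search(log_text)
    total = int(read.group(1)) if read else None
    # Fall back to counting FAILURE lines if the summary line is missing.
    n_failed = int(failed.group(1)) if failed else len(failures)
    ratio = n_failed / total if total else None
    return {"failures": failures, "total": total, "failed": n_failed, "ratio": ratio}


LOG = """\
FAIL!
APPLY_BOXES: boxfile line 45/四 ((1061,3024),(1124,3082)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 125/盤 ((1092,2680),(1164,2751)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
Boxes read from boxfile: 273
Boxes failed resegmentation: 2
Found 271 good blobs.
"""

summary = summarize(LOG)
print(summary["failed"], "of", summary["total"], "boxes failed")
```

A low ratio, as in the Chinese run (2 of 273), matches the "ignore and proceed" case. The Arabic run below (176 of 888) is a much larger share and probably deserves a closer look at the box file.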

All 23 comments

I have the same problem with the Arabic language.
* Run Tesseract for Training *
[K:\train tesseract\jTessBoxEditor\tesseract-ocr/tesseract, ara.mylotus.exp0.tif, ara.mylotus.exp0, box.train]
Tesseract Open Source OCR Engine v4.0.0-alpha.20170804 with Leptonica
Page 1
row xheight=23, but median xheight = 30.5
APPLY_BOXES: boxfile line 6/ق ((2324,3143),(2338,3173)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 7/ع ((2303,3119),(2334,3157)): FAILURE! Couldn't find a matching blob
...

APPLY_BOXES:
Boxes read from boxfile: 888
Boxes failed resegmentation: 176

idrisalshikh on 9 Oct 2017

For Arabic, you will get better results using tesseract 4.0alpha.


Shreeshrii on 10 Oct 2017

Actually, I'm already using v4, as the training log shows:
Tesseract Open Source OCR Engine v4.0.0-alpha.20170804 with Leptonica

idrisalshikh on 10 Oct 2017

Same issue as

https://github.com/tesseract-ocr/tesseract/issues/436
https://github.com/tesseract-ocr/tesseract/issues/445
https://github.com/tesseract-ocr/tesseract/issues/1033

Shreeshrii on 10 Oct 2017

I have the same problem. In my case, I am trying to train on digits from a display.

tesseract day_2_60_0_G3.cfont1.exp0.tif day_2_60_0_G3.cfont1.exp0 -l dianoche2 -psm 7 nobatch box.train

Tesseract Open Source OCR Engine v3.02 with Leptonica

FAIL!
APPLY_BOXES: boxfile line 90/. ((1079,11),(1081,17)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 141/. ((1758,2),(1762,16)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
Boxes read from boxfile: 189
Boxes failed resegmentation: 2
Found 187 good blobs.
Leaving 1 unlabelled blobs in 0 words.
TRAINING ... Font name = cfont1
Generated training data for 60 words
day_2_60_0_g3 cfont1 exp0
day_2_60_0_G3.cfont1.exp0.txt

I'm attaching the box file in .txt format because I couldn't attach it as .box.

Can anyone help me? Thanks.

iareizaga on 18 Oct 2017

Shreeshrii on 18 Oct 2017

👍3

Hi, did you solve it?

gbolin on 2 Feb 2018

If I remember correctly, training goes well if you train only the characters that had the box segmentation problem. Then, for the final training, I supply both separate files to create a single dictionary.

alevillard on 5 Feb 2018

@alevillard can you give me a little more detail? How do I train 'with box segmentation'? Do I need to specify some arguments?

gbolin on 5 Feb 2018

Hi, I'll try to answer.

I mean the segmentation problem where Tesseract cannot find a matching blob during training:

Tesseract Open Source OCR Engine v4.0.0-alpha.20170804 with Leptonica
Page 1
FAIL!
APPLY_BOXES:boxfile line 45/四 ((1061,3024),(1124,3082)): FAILURE! Couldn't find a matching blob
FAIL!
..
Boxes failed resegmentation: 2

If you isolate those characters in a separate .box and .tif file, Tesseract sometimes succeeds in training them.
So, when you have file1.box and file1.tif with one set of characters, and file2.box and file2.tif in the same folder, jTessBoxEditor can merge the character sets of both files into a single dictionary.

alevillard on 5 Feb 2018

❤1
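
The first half of that workaround, picking the failed entries out of a box file, can be sketched in Python. This is only an illustration under stated assumptions: each box line is taken to have the form "glyph left bottom right top page", the "boxfile line N" numbers in the log are assumed to be 1-based line numbers of the box file, and `split_box_lines` is a made-up helper name. Only the 四 entry comes from the log above; the other two sample lines are invented. Producing the matching .tif for the isolated boxes is not covered here.

```python
def split_box_lines(box_text, failed_line_numbers):
    """Split box-file lines into (kept, isolated) by 1-based line number."""
    failed = set(failed_line_numbers)
    kept, isolated = [], []
    for number, line in enumerate(box_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines but keep the original numbering
        (isolated if number in failed else kept).append(line)
    return kept, isolated


# Hypothetical three-line box file; only the second line is from the thread.
SAMPLE_BOX = """\
一 100 200 160 260 0
四 1061 3024 1124 3082 0
三 300 200 360 260 0
"""

kept, isolated = split_box_lines(SAMPLE_BOX, [2])
print("kept:", kept)
print("isolated:", isolated)
```

The two lists could then be written out as file1.box and file2.box, each paired with an image that contains the corresponding characters.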

@amitdo sir, after tracing the source code, our team found what looks like a logic bug when generating the *.tr file with the command "tesseract chi.font.exp0.tif chi.font.exp0 nobatch box.train".

The program flow is: main() -> ProcessPages() -> ProcessPageInternal() -> ProcessPage() -> Recognize() -> ApplyBoxes() -> ResegmentCharBox(). We found the "logic bug" in the ResegmentCharBox() function.
It calls bounding_box().major_overlap() to judge whether a box (from the box file) is reasonable. Here is the code:
inline bool TBOX::major_overlap(  // Do boxes overlap more than half.
    const TBOX &box) const {
  int overlap = MIN(box.top_right.x(), top_right.x());
  overlap -= MAX(box.bot_left.x(), bot_left.x());
  overlap += overlap;
  if (overlap < MIN(box.width(), width()))
    return false;
  overlap = MIN(box.top_right.y(), top_right.y());
  overlap -= MAX(box.bot_left.y(), bot_left.y());
  overlap += overlap;
  if (overlap < MIN(box.height(), height()))
    return false;
  return true;
}
Don't you think this step is unnecessary? Since we have already prepared a good *.box file (checked and corrected with jTessBoxEditor), this step filters out useful box information. Moreover, we guess that the "blob_box" is obtained through the third-party Leptonica library, but in our tests it did not guarantee good results.
The attached zip contains the test image and box file.
Run: tesseract temp.tif temp nobatch box.train
You can see that many blobs are missing.

Archive.zip

gbolin on 6 Feb 2018

😕1

@GitHubGS,

I'm not a core developer, and I have no answer to your question, sorry.

amitdo on 6 Feb 2018

I'm not a core developer, and I have no answer to your question, sorry.

@amitdo anyway, many thanks to you and your team for your brilliant work!

gbolin on 11 Feb 2018

Duplicates: #436, #445, #1033

I suggest that we close the older issues since this has the most discussion.

Shreeshrii on 4 Mar 2018

@GitHubGS: Hi, I have encountered a similar problem while training Tesseract. In the code you quoted, I understand that the values prefixed with 'box' belong to the box defined in the box file.
MIN(box.top_right.x(), top_right.x())
For example, here the first argument is the x-coordinate of the box's top-right corner. But what is top_right.x()?
Does it belong to the detected blob?
Best regards
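
On the question above: inside a C++ member function, an unqualified name such as top_right refers to the object the method was called on. Going by the call quoted earlier in the thread, bounding_box().major_overlap(...), that object would be the detected blob's bounding box, and `box` would be the argument passed in (the box-file box). This reading is inferred from the quoted snippet rather than confirmed by a developer. The test itself can be transcribed into a small Python sketch; the `Box` class and the two blob rectangles are made up for illustration, and only the 四 coordinates come from the log in the first post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle with a bottom-left origin (illustrative only)."""
    left: int
    bottom: int
    right: int
    top: int

    @property
    def width(self):
        return self.right - self.left

    @property
    def height(self):
        return self.top - self.bottom

    def major_overlap(self, other):
        """Transcription of the quoted check: twice the overlap along each
        axis must be at least the smaller of the two extents on that axis."""
        overlap = min(other.right, self.right) - max(other.left, self.left)
        if 2 * overlap < min(other.width, self.width):
            return False
        overlap = min(other.top, self.top) - max(other.bottom, self.bottom)
        if 2 * overlap < min(other.height, self.height):
            return False
        return True


file_box = Box(1061, 3024, 1124, 3082)      # line 45/四 from the log
close_blob = Box(1065, 3030, 1120, 3080)    # hypothetical blob inside the box
shifted_blob = Box(1110, 3024, 1200, 3082)  # hypothetical blob, mostly outside

print(close_blob.major_overlap(file_box))
print(shifted_blob.major_overlap(file_box))
```

Because the check uses only min and max of the two boxes' coordinates, it gives the same answer whichever box is the receiver and whichever is the argument. For the pass/fail outcome it therefore does not matter which of the two is the blob.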