Tesseract: wrong coordinates in .box file with LSTM

Created on 15 Jan 2018  ·  41 Comments  ·  Source: tesseract-ocr/tesseract

When I run Tesseract with the LSTM engine (oem=2), the coordinates in the box file look wrong. The same command with oem=0 produces correct coordinates, but the OCR result is less accurate, even though I fully cleaned the images and processed them at high resolution (see images below).

My example command:
"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe" --tessdata-dir "C:\Program Files (x86)\Tesseract-OCR\tessdata" -l pol --oem 2 --psm 6 -c tessedit_create_boxfile=1 -c tessedit_create_hocr=1 -c tessedit_create_tsv=1 -c tessedit_create_txt=1 "D:\x\ClearedText\tesseract\oem0_psm6_20180114221528\fl.txt" "D:\x\ClearedText\tesseract\oem0_psm6_20180114221528\tess"

platform:
W7U x64
tesseract v4.00.00a

111

accuracy feature request help wanted

All 41 comments

Try with traineddata from tessdata_best and tessdata_fast with --oem 1

Also, LSTM mode is a line recognizer. I don't think it is meant to be accurate for character-level boxes.

When I try to use best or fast, I get this error:

lstm_recognizer_->DeSerialize(&fp):Error:Assert failed:in file ../../../../ccmain/tessedit.cpp, line 193

What matters most is the recognition of text from images.

IMHO, accurate location of individual glyphs is not a very important feature.

LSTM mode is a line recognizer. I don't think it is meant to be accurate for character-level boxes.

I believe Shree is right here.

Unlike the lstm engine, the legacy engine works on a glyph level.

So AFAIK this issue is not a bug.

When I try to use best or fast, I get this error:

lstm_recognizer_->DeSerialize(&fp):Error:Assert failed:in file ../../../../ccmain/tessedit.cpp, line 193

Use the latest code in the master.

Does that mean that when I fine-tune Tesseract 4 (LSTM) on scanned images, I should ignore the locations in the box file and only fix the recognised characters?
If LSTM works on a line level, how does it use the "character based" box files?

If LSTM works on a line level, how does it use the "character based" box files?

Basically, what the LSTM engine really needs as input is line bounding boxes and separated graphemes (or grapheme clusters).

Still, currently only the box format is supported :(

Thanks @amitdo. Obviously the Tesseract LSTM engine has been trained successfully, and a box file made of individual characters is one of the main sub-steps. So what currently happens with the box file? Does Tesseract treat every character as having its own “line”, or does it somehow combine all the characters between two EOLs to generate a line bounding box for them?

... or does it somehow combine all the characters between two EOLs to generate a line bounding box for them?

It combines the char boxes that precede each tab (EOL) entry into a line box. The chars themselves are kept separated.
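To make the grouping concrete, here is a minimal Python sketch of the idea described above. It is not Tesseract's own code: the function names are made up for illustration, and it assumes the usual box-file line layout of `symbol left bottom right top page`, with a tab symbol marking the end of each text line. Whether Tesseract includes the tab entry's own coordinates in the line box is not stated in this thread, so the sketch leaves them out.

```python
from typing import List, Tuple

# symbol, left, bottom, right, top (box files use a bottom-left origin)
Box = Tuple[str, int, int, int, int]


def parse_box_line(line: str) -> Box:
    """Split one box-file line; the symbol itself may be a space or a tab."""
    symbol, left, bottom, right, top, _page = line.rstrip("\r\n").rsplit(" ", 5)
    return symbol, int(left), int(bottom), int(right), int(top)


def union(boxes: List[Box]) -> Tuple[str, Tuple[int, int, int, int]]:
    """Return the joined text and the smallest box enclosing all char boxes."""
    text = "".join(b[0] for b in boxes)
    left = min(b[1] for b in boxes)
    bottom = min(b[2] for b in boxes)
    right = max(b[3] for b in boxes)
    top = max(b[4] for b in boxes)
    return text, (left, bottom, right, top)


def merge_into_lines(box_lines: List[str]):
    """Group char boxes up to each tab entry and return (text, line_box) pairs."""
    lines, current = [], []
    for raw in box_lines:
        if not raw.strip("\r\n"):
            continue                      # skip empty lines
        box = parse_box_line(raw)
        if box[0] == "\t":                # a tab entry closes the current text line
            if current:
                lines.append(union(current))
            current = []
        else:
            current.append(box)
    if current:                           # tolerate a missing final tab entry
        lines.append(union(current))
    return lines


# Hypothetical sample: "Hi x" followed by the end-of-line tab entry.
sample = [
    "H 10 20 30 60 0",
    "i 32 20 40 62 0",
    "  41 20 48 60 0",
    "x 50 18 70 45 0",
    "\t 71 18 75 62 0",
]
print(merge_into_lines(sample))
```

In this sketch only the outermost left, right, top and bottom values survive into the line box, which is why inner char boxes can be imprecise without changing the result.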

I’m not sure I understand. If the LSTM trains on the “combined line box”, what do you mean by “the chars themselves are kept separated”?
Does that mean I can ignore the exact character coordinates as long as they seem to form a reasonable line box if combined? (E.g. if a char coordinate does not fully enclose the char.)

Does that mean I can ignore the exact character coordinates as long as they seem to form a reasonable line box if combined? (E.g. if a char coordinate does not fully enclose the char.)

I believe the answer is 'yes', but I didn't try it yet.

Make the first and last box accurate. Also change one char box so its top & bottom coordinates will be used for the whole line.

Please report if this trick works.

I will try and report

You must keep the tab as line separator.
Also, don't drop word separators (one space char).

Reporting back.
As far as I could tell from the code (void Tesseract::TrainFromBoxes), the behaviour is indeed as @amitdo described.
I wrote an ugly Python script to generate a box file from the Tesseract TSV result so I won't need to insert the spaces and tabs by hand. All seems to work fine.
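The script mentioned above was not posted in this thread. As a rough illustration only, a converter along these lines could look like the sketch below. It assumes the standard TSV columns (`level`, `page_num`, `block_num`, `par_num`, `line_num`, `word_num`, `left`, `top`, `width`, `height`, `conf`, `text`), treats level 5 rows as words, and splits each word box evenly between its characters, which is only an approximation. The function name, the `image_height` parameter and the one-pixel-wide tab box are assumptions made for the example.

```python
import csv
import io
from itertools import groupby


def tsv_to_box(tsv_text, image_height, page=0):
    """Convert Tesseract TSV word rows into box-file text (approximate glyph boxes)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    # Level 5 rows are words; skip rows whose text is empty or whitespace.
    words = [r for r in reader if r["level"] == "5" and (r["text"] or "").strip()]

    def line_key(r):
        return r["page_num"], r["block_num"], r["par_num"], r["line_num"]

    out = []
    for _, group in groupby(words, key=line_key):
        group = list(group)
        for i, w in enumerate(group):
            left, top = int(w["left"]), int(w["top"])
            width, height = int(w["width"]), int(w["height"])
            # TSV uses a top-left origin; box files use a bottom-left origin.
            y_bottom = image_height - (top + height)
            y_top = image_height - top
            text = w["text"]
            step = width / len(text)      # even split: an approximation
            for k, ch in enumerate(text):
                x0 = left + round(k * step)
                x1 = left + round((k + 1) * step)
                out.append(f"{ch} {x0} {y_bottom} {x1} {y_top} {page}")
            right = left + width
            if i + 1 < len(group):
                # One space entry between words on the same line.
                next_left = int(group[i + 1]["left"])
                out.append(f"  {right} {y_bottom} {next_left} {y_top} {page}")
            else:
                # Tab entry marks the end of the text line.
                out.append(f"\t {right} {y_bottom} {right + 1} {y_top} {page}")
    return "\n".join(out) + "\n"


# Hypothetical two-word line on an image that is 100 pixels high.
sample_tsv = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext\n"
    "5\t1\t1\t1\t1\t1\t10\t20\t20\t10\t96\tHi\n"
    "5\t1\t1\t1\t1\t2\t40\t20\t10\t10\t95\tx\n"
)
print(tsv_to_box(sample_tsv, image_height=100), end="")
```

Since the line box is built from the outermost coordinates, the even split inside a word should matter little for LSTM training, but that is an inference from the discussion above, not something this sketch verifies.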

The one thing I'm a bit worried about is cases where a word (or a line) has mixed-language characters in it,
e.g.

בתאריך-10.10.2000

It seems that the char order in the box file should be as the chars appear on the page from left to right, i.e. the first char is "1" and the last is "ב".
However, the code appends all the chars in the line to a single string. In most programming languages this results in a string whose first char is "ב" and whose last is "1".
I was unable to figure out from the code whether there is a mismatch here that would cause Tesseract to train on the badly ordered string.
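The storage-order side of this concern can be inspected with Python's standard `unicodedata` module. The sketch below only shows how the example string is stored in memory (logical order) and which bidirectional class each character has; it does not reproduce what Tesseract does during training. The string is written with escapes so that the stored order is unambiguous.

```python
import unicodedata


def bidi_classes(text):
    """Pair every character with its Unicode bidirectional class."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# The example from this comment, in logical (storage) order:
# six Hebrew letters, a hyphen, then the date.
logical = "\u05d1\u05ea\u05d0\u05e8\u05d9\u05da-10.10.2000"

classes = bidi_classes(logical)
print(classes[0])   # first stored character: a Hebrew letter, class 'R'
print(classes[7])   # first digit of the date: class 'EN' (European number)
```

A renderer places the 'R' run on the right and the number on the left, so the leftmost glyph on the page is not the first character of the stored string.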

Python script to generate a box file from the Tesseract TSV result so I won't need to insert the spaces and tabs by hand. All seems to work fine.

@amitm02 You may want to share it, as a number of people would like to use the training-from-images option.

Another trick that can help you is to use text2image with just one font. Take the box file it produces and 'fix' the boxes with your script.

@amitm02 Please see the thread at https://github.com/tesseract-ocr/tesseract/issues/648#issuecomment-271870748 for how Arabic and other RTL languages are handled.

@Shreeshrii, thanks.
I think they made a good call by going strictly LTR in the training. Stuff can get amazingly complex when it comes to mixed-language text: link

I am confused about something here. How is the character segmentation layer trained?
Does it use the overall accuracy of the network?
Isn't that bad for both networks?
Especially when using synthetic data, shouldn't the default option be to use the box coordinates?

It uses a technique called CTC (Connectionist Temporal Classification).

Here is the first paper to describe CTC used for text recognition (OCR):
A Novel Connectionist System for Unconstrained Handwriting Recognition (2009).
http://www.cs.toronto.edu/%7Egraves/tpami_2009.pdf
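As a toy illustration of why CTC-style output gives only approximate character positions, the sketch below applies the usual CTC collapsing rule (merge repeated labels, then drop blanks) to a made-up sequence of per-frame labels. This is not Tesseract's decoder; the frame labels, the `-` blank symbol and the function name are invented for the example.

```python
from itertools import groupby

BLANK = "-"  # stand-in for the CTC blank label


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge repeated labels and drop blanks.

    Returns (char, first_frame, last_frame) for every emitted character.
    """
    out = []
    pos = 0
    for label, run in groupby(frame_labels):
        n = len(list(run))
        if label != blank:
            out.append((label, pos, pos + n - 1))
        pos += n
    return out


# Made-up best label per time step (one step per narrow slice of the line image).
frames = list("--TT-ee--s-tt-")
decoded = ctc_collapse(frames)
text = "".join(ch for ch, _, _ in decoded)
print(text)
print(decoded)
```

The network is trained only to get the collapsed text right, so the frames where a label fires need not cover the full extent of the glyph. Any character box derived from them is an estimate.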

I see, it is actually a nice one to use.
Here is another:
ftp://ftp.idsia.ch/pub/juergen/icml2006.pdf

it is hard to come up with a good nn for segmentation only anyway :)

ftp://ftp.idsia.ch/pub/juergen/icml2006.pdf

Same authors, from 2006, CTC for speech recognition.

Does the above discussion imply that there is no way to get correct coordinates for every word when using LSTM mode?

correct coordinates for every word

That may be possible.

It is not possible to get accurate coordinates for every character.

Try HOCR output.
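For anyone who wants to pull word boxes out of the hOCR file, here is a small sketch using only Python's standard library. It assumes words are written as `span` elements with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1` (top-left origin). The class name `WordBoxParser` and the sample markup are made up for the example.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None    # bbox of the word currently being read
        self._text = []
        self._depth = 0      # nesting depth of tags inside the word span

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if self._bbox is not None:
            self._depth += 1             # e.g. <strong> or <em> inside a word
        elif tag == "span" and "ocrx_word" in (a.get("class") or "").split():
            m = BBOX_RE.search(a.get("title") or "")
            if m:
                self._bbox = tuple(int(v) for v in m.groups())
                self._text = []
                self._depth = 0

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
        else:                            # closing tag of the word span itself
            self.words.append(("".join(self._text), self._bbox))
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)


# Hypothetical hOCR fragment with two words on one line.
sample = (
    "<span class='ocr_line' title='bbox 10 20 90 40'>"
    "<span class='ocrx_word' id='word_1_1' title='bbox 10 20 30 40; x_wconf 96'>Hi</span> "
    "<span class='ocrx_word' id='word_1_2' title='bbox 40 20 90 40; x_wconf 91'>"
    "<strong>there</strong></span></span>"
)
parser = WordBoxParser()
parser.feed(sample)
print(parser.words)
```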

Why is this not a bug? Accurate box files are a must for training. And the ability to train tesseract is one of its major strengths.

Accurate box files are a must for training.

Not for 4.0.0's lstm training.
https://github.com/tesseract-ocr/tesseract/issues/1276#issuecomment-358970736

Tesseract should warn users who request box files when running with LSTM. It currently does not, which has already caused several issue reports, so the missing warning needs to be fixed. Patches are welcome, but I don't think that's a reason to postpone 4.0.0.

Yes, please! And also, please hint at --oem 0 and the corresponding language files. I have used Tesseract in sophisticated ways for many years, and I still missed all this when I got 4.0 via a system upgrade. I just figured out what I had to change in my workflow so that it did not simply crash. But I totally missed that this is a completely re-designed algorithm that behaves differently in many ways.

But I totally missed that this is a completely re-designed algorithm that behaves differently in many ways.

... and that the old ways are still available, but require additional work (like --oem 0 or getting the correct traineddata files).

@stweil Would it be appropriate to add a couple of lines to tesseract --help before Usage to inform users of this?

Tesseract 4.0.0 provides neural net based LSTM engine in addition to the
legacy Tesseract engine.
Users wanting compatibility with Tesseract 3.0x should use --oem 0 with
traineddata files from tessdata repository.


I would not overload that help text, but suggest to enhance the manual page. Is there a better term for legacy Tesseract engine? If we avoid the exact revision number, we don't have to change it each time.

What about this text: _Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on line recognition, but also still supports the legacy Tesseract OCR engine of Tesseract 3 which works by recognizing character patterns. Compatibility with Tesseract 3 is enabled by --oem 0. It also needs traineddata files which support the legacy engine, for example those from the tessdata repository._
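For people who call Tesseract from scripts, the engine choice described in this text comes down to the `--oem` value plus a matching traineddata directory. Here is a small Python sketch of that choice. The helper name, file names and tessdata path are placeholders. `--oem`, `-l` and `--tessdata-dir` are the options used earlier in this thread, and `makebox` is the config name from the `makebox` command shown at the end of it.

```python
import subprocess


def build_tesseract_command(image, output_base, lang, legacy=False,
                            tessdata_dir=None, configs=("makebox",)):
    """Build an argument list; --oem 0 selects the legacy engine, --oem 1 the LSTM engine."""
    cmd = ["tesseract", image, output_base, "-l", lang,
           "--oem", "0" if legacy else "1"]
    if tessdata_dir:
        # The legacy engine needs traineddata that still contains the legacy model,
        # for example the files from the tessdata repository.
        cmd += ["--tessdata-dir", tessdata_dir]
    cmd += list(configs)
    return cmd


def run(cmd):
    """Run the command and raise if Tesseract exits with an error."""
    subprocess.run(cmd, check=True)


# Placeholder paths, shown only to illustrate the resulting command line.
cmd = build_tesseract_command("page.png", "page", "pol", legacy=True,
                              tessdata_dir="/path/to/tessdata")
print(" ".join(cmd))
```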

By the way: the man page currently misses information on the new --dpi n. @zdenop, do we need that option at all, or isn't the config variable sufficient?

I believe that a very small share of our userbase reads man pages.

I read man pages, but only if I know that I am looking for something. So I would need some trigger first.

I liked the idea to throw out a warning if someone runs in NN mode and still requests box files. Or just stop and request to drop the box file request or use some additional override option.

I do not know how typical this behavior is, but I most often run Tesseract from scripts that I have used for ten years, which is why I missed all this altogether. It took me months until it became annoying enough to look for the root cause. It continued to work "kind-of" and there was nothing that directly caught my eye.

@stweil: the dpi warning message is IMO too common (based on testing images from several issue trackers), so the setting needs to be easily accessible to the user. For this reason I decided to implement it as an option of the tesseract app.

BTW: it would be great if a native English speaker could check & improve all docs, including the wiki...

https://github.com/tesseract-ocr/tesseract/issues/1448

@sagimann commented 10 minutes ago

The problem is that, when using oem 0, OCR does not work well with non-solid backgrounds. The point is: if bboxes are not used by the line recognizer, what other kind of data is available to correctly find the symbol on the image in terms of location?

@amitdo is there any way to use Tesseract to find the correct coordinates of characters while using the LSTM engine?

The bboxes are estimated. I don't think there is a way to make it more accurate with lstm.

There's also a known bug that causes the bbox to sometimes be way off from the real coordinates.

The pdf renderer might suffer from both the 'bug' and 'not a bug'.

@smarq8, this should be fixed by pull request #2576. Please test and report your results.

makebox output shows no overlap. Issue can be closed.

tesseract 1276.png - -l eng --tessdata-dir ~/tessdata_fast --oem 1 --psm 6 makebox

P 122 475 155 525 0
r 158 475 182 513 0
z 183 475 211 512 0
e 213 474 245 513 0
p 248 458 283 513 0
r 287 475 311 513 0
a 313 474 343 513 0
s 346 474 372 513 0
z 375 475 402 512 0
a 404 474 435 513 0
m 439 475 491 513 0
y 470 458 504 526 0
! 494 459 542 526 0
P 332 368 365 419 0
r 368 368 393 406 0
o 394 366 430 407 0
s 433 367 459 406 0
z 462 368 489 406 0
e 492 350 524 406 0
. 528 368 542 382 0
N 302 238 357 311 0
- 364 261 387 273 0
N 394 238 449 311 0
i 426 237 463 316 0
e 459 238 478 316 0
. 482 237 553 294 0
N 54 145 96 201 0
i 104 145 118 205 0
e 121 144 158 187 0
m 178 145 237 187 0
a 241 144 275 187 0
s 295 144 324 187 0
p 328 125 367 187 0
r 370 145 399 187 0
a 401 144 433 187 0
w 437 145 496 186 0
y 475 125 512 187 0
. 498 126 551 186 0
W 268 28 351 94 0
o 315 26 367 99 0
l 352 26 398 79 0
n 405 28 419 99 0
a 426 28 469 78 0
! 474 27 536 95 0