Tesseract: Dramatically different results with O1 and O2 optimizations in clang

Created on 15 May 2016 · 10 Comments · Source: tesseract-ocr/tesseract

On this image (attachment not preserved), with the configuration:
"hocr_font_info":"1",
"tessedit_pageseg_mode":"1"

A build with -O1 optimization produces:
"GI 96%:\n\n14:18\n\n \n\nno\" Kyivstar ‘3 '3\n\n< 3aMeTKl/I\n\n27 anpenn 2016 r‘, 14:18\n\nLeben ham.\nnicht von Dritten Uberwacht\n99.! mir attain Md.-\n\ngehbrt und\nLeben haben,\n\n \n\n"

A build with -O2 optimization produces:
"GI 96%:\n\n14:18\n\n \n\nno\" Kyivstar ‘3 '3\n\n< 3aMeTKl/I\n\n27 anpenn 2016 r‘, 14:18\n\nnicht von Dritten Uberwacht\n99.! mir attain Md.-\n\ngehbrt und\nLeben haben,\n\n \n\n"

The -O2 build loses the first line of the German text ("Leben haben,", recognized as "Leben ham." in the -O1 output).

This happens because of a segmentation bug.
The issue appears to be in the ColPartition::VCoreOverlap method. This method can cause a signed integer overflow (which is undefined behavior) if median_top_ or median_bottom_ has not been computed.
That is the case for LeaderPartitions: median_top_ and median_bottom_ are initialized to MAX_INT32 and -MAX_INT32 and never get updated.
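
To illustrate the kind of overflow described above, here is a minimal sketch. It is not Tesseract's actual code: `PartitionSketch`, `WideOverlap`, `FitsInInt32`, `HasMedians` and `GuardedOverlap` are hypothetical names, and the overlap formula (smaller top minus larger bottom) is an assumption about how a vertical overlap is typically computed. The sentinel values are the ones quoted in this report.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

constexpr std::int32_t kMax = std::numeric_limits<std::int32_t>::max();
constexpr std::int32_t kMin = std::numeric_limits<std::int32_t>::min();

// Hypothetical stand-in for the two fields discussed above (not the real class).
struct PartitionSketch {
  std::int32_t median_top = kMax;      // sentinel: median not computed
  std::int32_t median_bottom = -kMax;  // sentinel: median not computed
};

// Assumed overlap formula, evaluated in 64-bit so the subtraction is well defined.
inline std::int64_t WideOverlap(const PartitionSketch& a, const PartitionSketch& b) {
  return static_cast<std::int64_t>(std::min(a.median_top, b.median_top)) -
         static_cast<std::int64_t>(std::max(a.median_bottom, b.median_bottom));
}

// True if the value could be stored in a 32-bit int without overflow.
inline bool FitsInInt32(std::int64_t v) { return v >= kMin && v <= kMax; }

// True if both medians have been replaced by real values.
inline bool HasMedians(const PartitionSketch& p) {
  return p.median_top != kMax && p.median_bottom != -kMax;
}

// Guarded variant: report no overlap when either partition lacks medians,
// and clamp the result so that the narrowing cast is always safe.
inline std::int32_t GuardedOverlap(const PartitionSketch& a, const PartitionSketch& b) {
  if (!HasMedians(a) || !HasMedians(b)) return 0;
  const std::int64_t wide = WideOverlap(a, b);
  return static_cast<std::int32_t>(
      std::max<std::int64_t>(kMin, std::min<std::int64_t>(kMax, wide)));
}
```

With two default-constructed `PartitionSketch` objects, `WideOverlap` yields a value about twice `kMax`, which `FitsInInt32` rejects; the same subtraction done directly in 32-bit arithmetic would overflow. `GuardedOverlap` returns 0 for that pair instead.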

myeyesareblind

All 10 comments

Is this still an issue? Did someone test it with recent Tesseract code? We should know that for Tesseract 4.0.0.

stweil on 17 Sep 2018

I compiled the current Tesseract code with the following options:

CXXFLAGS='-O1 -Wreserved-id-macro' CC=clang CXX=clang++ ../tesseract/configure --prefix=/usr

and then I ran OCR (with tessdata_best):

tesseract i320.jpg i320-O1 -c tessedit_pageseg_mode=1

Then I uninstalled Tesseract and repeated the test with -O2 and -O3. I got identical results.
Tested on openSUSE 15, 64-bit, with clang version 5.0.1.

zdenop on 14 Oct 2018

So there is a 4th different result.

@myeyesareblind, which language model did you use? Which version of clang++? Please also post the full command line you used.

Commit 7f911ac5e027ac8a25d890c55c6c1bc367ea11b4 fixed the integer overflow, so maybe this issue was fixed then, too.

stweil on 14 Oct 2018

Closing as not reproduced and without a response from the original reporter.

zdenop on 29 Jul 2019

Sorry, I probably missed the email with the first mention.

@zdenop, from your output I can see that the bug now appears at all of O1/O2/O3: the first line is missing everywhere.

mstobetskyi on 29 Jul 2019

So it looks like the integer overflow was handled differently depending on the compiler and compiler optimization. Now that overflow no longer occurs, but the result is worse (tesseract ... -l tessdata_best/script/Cyrillic+tessdata_best/script/Latin):

--- /tmp/old    2019-07-29 21:47:08.646069882 +0200
+++ /tmp/new    2019-07-29 21:48:02.981969439 +0200
@@ -4,8 +4,7 @@

 27 апреля 2016 г., 14:18

-Leben haben,
-nicht von Dritten überwacht
+nicht уоп Dritten überwacht
 der mir allẹin wird.

 gehört und

One text line is missing completely, and in the next line there is now a wrong character. That requires a closer look.
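
Signed integer overflow is undefined behavior in C++, so an optimizer may assume it never happens, which is one way the same source can behave differently at different optimization levels. As a rough sketch of how such a subtraction can be made to report overflow instead of triggering it: `CheckedSub` is a hypothetical helper, and `__builtin_sub_overflow` is a compiler builtin available in clang and GCC, not standard C++.

```cpp
#include <cstdint>

// Stores a - b in *out and returns true if the result fits in int32_t.
// Returns false and leaves *out untouched if the subtraction would overflow.
inline bool CheckedSub(std::int32_t a, std::int32_t b, std::int32_t* out) {
  std::int32_t result = 0;
  // Non-standard builtin (clang/GCC): returns true when overflow occurred.
  if (__builtin_sub_overflow(a, b, &result)) return false;
  *out = result;
  return true;
}
```

A caller can then decide explicitly what an overflowing case means (for example, treat it as "no overlap") rather than depending on whatever the compiler happens to generate.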

stweil on 29 Jul 2019

./tesseract -v --tessdata-dir tess4 --dpi 132 -l deu+rus leben_haben.jpg stdout --psm 6

tesseract 4.1.0
leptonica-1.79.0
  libjpeg 9b : libpng 1.6.25 : zlib 1.2.8
00000 Kyivstar 7 > 14:18 © 96 % ш
«. Заметки
27 апреля 2016 г., 14:18

Leben haben,

nicht von Dritten überwacht

der mir allein wird.

gehört und

Leben haben,

C И

With --psm 1, I can reproduce the problem. I believe that for this input, psm=6 (a single uniform block of text, with no automatic page segmentation) is a natural choice, and it's not surprising that allowing page segmentation can produce mistakes.

alexcohn on 4 Aug 2019

@alexcohn
Tesseract is used in very different cases, and I don't want to explain to users which parameters they should set.
Automatic must work.

mstobetskyi on 4 Aug 2019

@mstobetskyi: automatic mode does not work in most cases, simply because there are many possible scenarios (and too few contributors, and a lot of people writing about what Tesseract should do). Tesseract was used and adjusted for the Google Books OCR project (high-quality scans with simple layout and minimal graphics), and that is the scenario where it works best.
In scenarios with tables, special graphics, mixed text sizes, etc., it usually fails.
In your case, if you pass only the text area to Tesseract, you can get a good result (that is, you do the document layout analysis yourself).

zdenop on 4 Aug 2019

Automatic must work.

'Automatic' is probably psm=12. Page segmentation is not for general use; it has been tuned for multi-page books (as @zdenop mentions above). As it happens, for smartphone screenshots you don't even need the full power of sparse-text mode. The text is usually well structured, so psm=6 is enough.

Actually, psm=12 tries to apply meaning to the black dots (presumably, spellchecker underlines), so it needs additional parameters, like textord_min_xheight. For your screenshot, it's around 20.

alexcohn on 4 Aug 2019
