On this image:

with configuration:
"hocr_font_info":"1",
"tessedit_pageseg_mode":"1"
O1 optimization will produce:
"GI 96%:\n\n14:18\n\n \n\nno\" Kyivstar ‘3 '3\n\n< 3aMeTKl/I\n\n27 anpenn 2016 r‘, 14:18\n\nLeben ham.\nnicht von Dritten Uberwacht\n99.! mir attain Md.-\n\ngehbrt und\nLeben haben,\n\n \n\n"
O2 optimization will produce:
"GI 96%:\n\n14:18\n\n \n\nno\" Kyivstar ‘3 '3\n\n< 3aMeTKl/I\n\n27 anpenn 2016 r‘, 14:18\n\nnicht von Dritten Uberwacht\n99.! mir attain Md.-\n\ngehbrt und\nLeben haben,\n\n \n\n"
The O2 build loses the first line of text ("Leben haben").
This happens because of a segmentation bug.
It looks like the issue is in the ColPartition::VCoreOverlap method. This method can cause a signed integer overflow (which is undefined behavior) if median_top_ or median_bottom_ have not been computed.
This happens for LeaderPartitions: median_top_ and median_bottom_ are initialized to MAX_INT32 and -MAX_INT32 and never get updated.
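To make the arithmetic concrete, here is a minimal sketch in Python, which has unbounded integers and can therefore show the exact value a 32-bit computation would have to hold. It assumes the overlap is computed as the smaller top minus the larger bottom, and it uses the sentinel values quoted above; the function names are hypothetical and this is not the Tesseract source.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31


def v_core_overlap_exact(top_a, bottom_a, top_b, bottom_b):
    """Exact value of min(top) - max(bottom), with no 32-bit limit.

    Assumption: this mirrors what the overlap method computes.
    """
    return min(top_a, top_b) - max(bottom_a, bottom_b)


def fits_in_int32(value):
    """True if the value is representable as a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


# Two ordinary partitions: the result is small and safe.
normal = v_core_overlap_exact(120, 100, 110, 90)
print(normal, fits_in_int32(normal))

# Two partitions whose medians were never computed, using the sentinel
# values quoted in the comment above (an assumption about the real code).
unset_top, unset_bottom = INT32_MAX, -INT32_MAX
overlap = v_core_overlap_exact(unset_top, unset_bottom, unset_top, unset_bottom)
print(overlap, fits_in_int32(overlap))
```

In C++ the second case would be a signed overflow, so the compiler is free to produce different results at different optimization levels, which would be consistent with the O1/O2 difference reported here.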
Is this still an issue? Did someone test it with recent Tesseract code? We should know that for Tesseract 4.0.0.
I compiled the current Tesseract code with the following options:
CXXFLAGS='-O1 -Wreserved-id-macro' CC=clang CXX=clang++ ../tesseract/configure --prefix=/usr
and then I ran OCR (with tessdata_best):
tesseract i320.jpg i320-O1 -c tessedit_pageseg_mode=1
Then I uninstalled Tesseract and repeated this with -O2 and -O3. I got identical results...
Tested on openSUSE 15, 64-bit, with clang version 5.0.1.
So there is a 4th different result.
@myeyesareblind, which language model did you use? Which version of clang++? Please also post the full command line used.
Commit 7f911ac5e027ac8a25d890c55c6c1bc367ea11b4 fixed the integer overflow, so maybe this issue was fixed then, too.
Closing as not reproduced and without a response from the original reporter.
Sorry, I probably missed the email with the first mention.
@zdenop from your output I can see that the bug is now present in all of O1/O2/O3: the first line is missing everywhere.
So it looks like the integer overflow was handled differently depending on the compiler and compiler optimization. Now that overflow no longer occurs, but the result is worse (tesseract ... -l tessdata_best/script/Cyrillic+tessdata_best/script/Latin):
--- /tmp/old 2019-07-29 21:47:08.646069882 +0200
+++ /tmp/new 2019-07-29 21:48:02.981969439 +0200
@@ -4,8 +4,7 @@
27 апреля 2016 г., 14:18
-Leben haben,
-nicht von Dritten überwacht
+nicht уоп Dritten überwacht
der mir allẹin wird.
gehört und
One text line is missing completely, and in the next line there is now a wrong character. That requires a closer look.
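A comparison like the one above can be reproduced with Python's standard difflib module. This is only a sketch: the two line lists are copied from the diff in this comment, and the /tmp/old and /tmp/new names are used purely as labels, not read from disk.

```python
import difflib

# Lines copied from the diff shown in this comment.
old = [
    "27 апреля 2016 г., 14:18",
    "Leben haben,",
    "nicht von Dritten überwacht",
    "der mir allẹin wird.",
    "gehört und",
]
new = [
    "27 апреля 2016 г., 14:18",
    "nicht уоп Dritten überwacht",
    "der mir allẹin wird.",
    "gehört und",
]

diff = list(
    difflib.unified_diff(old, new, fromfile="/tmp/old", tofile="/tmp/new", lineterm="")
)
print("\n".join(diff))

# Collect the changed lines, skipping the "---" / "+++" file headers.
removed = [l[1:] for l in diff if l.startswith("-") and not l.startswith("---")]
added = [l[1:] for l in diff if l.startswith("+") and not l.startswith("+++")]
print("removed:", removed)
print("added:", added)
```

Two lines are removed and only one is added, which matches the observation that one text line disappears and the following line comes back with a substituted word.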
./tesseract -v --tessdata-dir tess4 --dpi 132 -l deu+rus leben_haben.jpg stdout --psm 6
tesseract 4.1.0
leptonica-1.79.0
libjpeg 9b : libpng 1.6.25 : zlib 1.2.8
00000 Kyivstar 7 > 14:18 © 96 % ш
«. Заметки
27 апреля 2016 г., 14:18
Leben haben,
nicht von Dritten überwacht
der mir allein wird.
gehört und
Leben haben,
C И
With --psm 1, I can reproduce the problem. I believe that for this input, psm=6 (a single uniform block of text, with no automatic page segmentation) is a natural choice, and it's not surprising that allowing page segmentation can produce mistakes.
@alexcohn
Tesseract is used in very different cases, and I don't want to explain to users which parameters they should set.
Automatic must work.
@mstobetskyi: in most cases automatic does not work, simply because there are too many possible scenarios (and too few contributors, and a lot of people writing what Tesseract should do). Tesseract was used and adjusted for the Google Books OCR project (high-quality scans with simple layout and minimal graphics), and that is the scenario where it works best.
In scenarios such as tables, special graphics, or mixed text sizes, it usually fails.
In your case, if you pass only the text area to Tesseract (i.e. you do the document layout analysis yourself), you can get a good result.
Automatic must work.
'Automatic' is probably psm=12. Page segmentation is not for general use; it has been tuned for multi-page books (as @zdenop mentions above). As it happens, smartphone screenshots don't even need the full power of sparse text: the text is usually well structured, so psm=6 is enough.
Actually, psm=12 tries to assign meaning to the black dots (presumably spellchecker underlines), so it needs additional parameters, such as textord_min_xheight. For your screenshot, the value is around 20.
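The two invocations discussed here can be put side by side with a small helper. This is a sketch only: build_tesseract_command is a hypothetical name, the image name and language string are taken from the command shown earlier in the thread, and the value 20 is the estimate given above rather than a tuned setting. Running the printed commands requires an installed tesseract with the matching language data.

```python
import shlex


def build_tesseract_command(image, psm, languages="deu+rus", config=None):
    """Build an argv list for the tesseract CLI (hypothetical helper).

    Only uses options that appear in this thread: --psm, -l and -c.
    """
    cmd = ["tesseract", image, "stdout", "--psm", str(psm), "-l", languages]
    for name, value in (config or {}).items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


# Single uniform block of text: no extra parameters needed.
block_cmd = build_tesseract_command("leben_haben.jpg", 6)

# Sparse text: raise the minimum x-height so the small dots are ignored.
sparse_cmd = build_tesseract_command(
    "leben_haben.jpg", 12, config={"textord_min_xheight": 20}
)

print(shlex.join(block_cmd))
print(shlex.join(sparse_cmd))
```

The argv lists can be passed to subprocess.run as they are; printing them with shlex.join just gives a copy-pasteable shell line.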