Tesseract: Tesseract is very poor on single line images

Created on 11 Sep 2019 · 11 Comments · Source: tesseract-ocr/tesseract

I am running Tesseract on some images to extract the text in them. Although the image preprocessing seems optimal, Tesseract cannot detect any characters in these images. I have tried different PSM values and resizing options, and I tried to fine-tune the model on this type of image, but that also gave no results.
See the images below:
[Attached images: 0crp, 2crp, 3crp, 4crp, 5crp, 6crp, 7crp]
Is this issue related to inverting the image? The original images looked like this:
[Attached images: 6, IMG_20190215_225818]
Tesseract 4 and newer does not detect white text on a black background, so I crop the black part, invert it, and send it to the engine, but I get no results.

wontfix

All 11 comments

Well, at least some characters are recognized:

$ tesseract 64711738-69782d80-d4ba-11e9-80d2-23184f8c287b.jpg - --psm 7 -l tessdata_best/script/Latin
Warning: Invalid resolution 0 dpi. Using 70 instead.
w Tota e R

@stweil I was thinking that the image is too easy; the application I am working on needs all characters to be extracted correctly.

I am afraid Tesseract works according to the rule "bad in, bad out":
cropped_total

tesseract cropped_total.png - --psm 6
Total Due: 60.

cropped_total2

tesseract cropped_total2.png - --psm 6
Total Due: 297.50

@zdenop
Which Tesseract version and langdata did you use?

I used Tesseract 5.0.0-alpha-329-gc5a50 and tessdata_best.
IMO preprocessing is more useful than training (especially if you are able to do text detection outside of Tesseract in cases like the ones you have shown here).

@zdenop did you preprocess the image after cropping the borders?

The first image worked with nothing more than cropping.
For the second image I needed to dewarp it (fix the slope) first, so that the crop fits a rectangle...
Tesseract has problems with borders: just search for "table" in the issue tracker...
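
As a rough illustration of the border cropping described above, here is a minimal sketch in plain Python. It is not the code used in this thread (the cropping there was done manually). It assumes the image is already loaded as a list of rows of grayscale values (0 = black, 255 = white), and the names `dark_fraction`, `trim_dark_borders`, and the `max_dark` cutoff are hypothetical. In practice an image library would handle loading and slicing.

```python
def dark_fraction(pixels, threshold=128):
    """Return the fraction of values in `pixels` that are darker than `threshold`."""
    pixels = list(pixels)
    return sum(1 for p in pixels if p < threshold) / len(pixels)


def trim_dark_borders(image, max_dark=0.9, threshold=128):
    """Strip edge rows and columns that are almost entirely dark.

    `image` is a list of equal-length rows of grayscale values (0-255).
    An edge row or column counts as border when at least `max_dark`
    of its pixels are darker than `threshold` (both values are assumptions).
    """
    top, bottom = 0, len(image)
    left, right = 0, len(image[0])

    def row_dark(r):
        return dark_fraction(image[r][left:right], threshold)

    def col_dark(c):
        return dark_fraction((image[r][c] for r in range(top, bottom)), threshold)

    # Shrink the window from each side while the outermost line looks like border.
    while top < bottom - 1 and row_dark(top) >= max_dark:
        top += 1
    while bottom - 1 > top and row_dark(bottom - 1) >= max_dark:
        bottom -= 1
    while left < right - 1 and col_dark(left) >= max_dark:
        left += 1
    while right - 1 > left and col_dark(right - 1) >= max_dark:
        right -= 1

    return [row[left:right] for row in image[top:bottom]]
```

This only removes solid frames that touch the image edge. It does not deskew the image or remove lines inside the text area, which is why the second image above still needed dewarping first.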

@zdenop but it is not practical to crop exactly to the text without any borders; even after a lot of preprocessing, some horizontal or vertical lines or blobs may remain in the image.
Do you think downgrading to a version that supports white text on a black background might help?

Then Tesseract is not the right tool for your task. Tesseract needs straight text without borders and graphic elements (and the 4.x versions need black text on a white background after thresholding) to provide a good result. It is not a bug, it is a requirement.
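
The "black text on white background after thresholding" requirement can be sketched as follows. This is an illustration under stated assumptions, not Tesseract's own thresholding: the image is a list of rows of grayscale values, the fixed cutoff of 128 is arbitrary (an adaptive method such as Otsu's would usually be a better choice), and the background is assumed to cover more pixels than the text. The function name `binarize_dark_on_light` is hypothetical.

```python
def binarize_dark_on_light(image, threshold=128):
    """Threshold a grayscale image to 0/255 with dark text on a light background.

    `image` is a list of rows of grayscale values (0-255). If more than half
    of the pixels are dark, the background is assumed to be dark and the
    result is inverted.
    """
    flat = [p for row in image for p in row]
    dark_count = sum(1 for p in flat if p < threshold)
    invert = dark_count > len(flat) / 2  # assumption: background is the majority

    result = []
    for row in image:
        new_row = []
        for p in row:
            is_dark = p < threshold
            if invert:
                is_dark = not is_dark
            new_row.append(0 if is_dark else 255)
        result.append(new_row)
    return result
```

The majority test fails on tightly cropped bold text where the glyphs cover more area than the background, so checking the output by eye before passing it to the engine is still worthwhile.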

@zdenop could you share your preprocessing code? I would like to learn more from it.
Thank you for reading,
Thai Hoc

@NguyenThaiHoc1 I did all the preprocessing myself; he just manually cropped the image to remove the black borders.
