Hello,
When I try to OCR the attached image, digits and text inside the table are ignored. This happens more often when the image contains multiple tables. Tests were done on macOS and Ami Linux with the latest tesseract version.
Thanks in advance.

Ooooh yeah!) Best part in tesseract !))
I wasted a year on this problem !)
Similar issue here.
Using Tesseract v4.0.0-rc3-3-g68a9 with Leptonica.
tesseract 4.0.0-rc3-3-g68a9
leptonica-1.77.0
libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.25 : libtiff 4.0.6 : zlib 1.2.8
Found AVX2
Found AVX
Found SSE
I believe this to be the latest build as I did a pull and built both Tesseract and Leptonica from scratch.
My issue is that the engine is not recognising text within the boxes (see attached image). This also happened for me with v4.0.0Alpha with Leptonica.
The command I am using is tesseract BoxValues.jpg output --oem 1 --psm 1 --tessdata-dir /home/osboxes/build/tesseract/tessdata
I am getting the following text returned.
Previous Payments & New Total Balance
Balance Credits Charges Balance Due
If I run with tesseract BoxValues.jpg output --oem 0 --psm 3 --tessdata-dir /home/osboxes/build/tesseract/tessdata I get the same results.
FYI I am using the best engine as located here : https://github.com/tesseract-ocr/tessdata_best/blob/master/eng.traineddata
I also get the same issue with the standard engine located here : https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata
Also I am using osd.traineddata from here https://github.com/tesseract-ocr/tessdata/blob/master/osd.traineddata
I also get the same issue when I use color images of the same layout. I have also seen issues similar to what mariopinderist is experiencing, where a value is either missed or returned as garbage... in this instance, if I crop the text out and OCR it on its own, it's fine.
Any ideas why this issue is occurring?

If I use the debugger tool, I can see that only the hatching box is detected. I would expect all boxes to at least be detected.

P.S. I don't expect it to get the value in the hatching, but I do expect the values in the boxes to be captured.
I have the same problem with words in boxes. @EynsherKiel did you solve the problem?
I did a rebuild today with the following updated version of Tesseract.
tesseract 4.0.0-1-g2a2b
leptonica-1.77.0
libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.25 : libtiff 4.0.6 : zlib 1.2.8
Found AVX2
Found AVX
Found SSE
I did a rebuild of Leptonica too and noted updates (among others) to these files
prog/boxa1_reg.c | 6 +-
prog/boxa2_reg.c | 5 +-
prog/boxa3_reg.c | 138 +++++++++++++
prog/boxap1.ba | 129 ++++++++++++
prog/boxap2.ba | 303 ++++++++++++++++++++++++++++
prog/boxap3.ba | 15 ++
prog/boxap4.ba | 53 +++++
prog/boxap5.ba | 553 +++++++++++++++++++++++++++++++++++++++++++++++++++
prog/displayboxa.c | 17 +-
I was hopeful they may be the culprit but I noted no change in the results documented above.
What do we need to do to get this issue looked at by someone from the development group? Does anyone know a workaround?
Can a contributor please comment on here? This was posted 26 days ago. Sorry to tag but not sure what else to do.
@stweil
I am not aware of recent or planned activities to improve such layout detection issues.
Thanks @stweil. I think the box detection issue is one problem, but completely missing the text in the boxes is another. Would you class this as a bug/accuracy issue? Can anyone else comment, please?
Sure, nobody wants to miss relevant text. I already added the accuracy label.
I've done one test with a simple table, a synthetically generated "scan":
out.pdf, 600 dpi, b/w (PS: recognition was done from a 600 dpi TIF, but I can't upload that)
Generated from this pdf:
ocr-zahlen.pdf
A few digits are added or changed, and some columns are sometimes missing.
With a 300 dpi gray image there are slightly fewer, and different, errors.
Kind regards, Jochen
output (wdiff style)
Artikel-Nr. [-| Warengruppe|-]ββ{+Warengruppe+}βMenge [-Einzelpreis| MwSt%-]β{+EinzelpreisβMwStβ%+}βGesamtpreis
88.193.554β6β16β168,75β19β3.213,00
46.923.325β1β75β873,01β7β70.059,05
16.636.042β7β2β5,30β19β12,61
71.574.789β1β1β248,47β19β295,68
29.695.829β10β45β13,18β{+7+}β634,62
66.618.400β2β6β146,71β{+7+}β941,88
32.142.244β3β21β3,27β19β81,72
66.467.954β1 [-361,48-]β{+1β561,48+}β7β600,78
24.910.833 [-9-]β{+5+}β3β441,24β19β1.575,23
5.790.076β1β83β3,27β19β322,98
92.072.281β1β1β313,34β7β335,27
95.057.999β1β1β660,13β19 [-785,955-]β{+785,55+}
9.587.128β1β65β972,43β19β75.217,46
2.434.509β1β41β95,50β19β4.659,45
33.372.869β6β83β113,21β{+7+}β10.054,18
77.627.319β1β1β29,79β{+7+}β31,88
85.263.688β1β22β392,84β19β10.284,55
75.611.213β6β21β722,22β19β18.048,28
[-83.332.602-]
{+83.532.602+}β3β{+4+}β10,00β{+7+}β42,80
2.165.747β7β{+8+}β10,20β{+7+}β87,31
90.935.780β1β12β65,37β19β933,48
36.701.968β1β15β47,33β7β759,65
61.464.130β4β35β307,26β19β12.797,38
73.425.868β1β41 [-534,45-]β{+54,45β7+}β2.388,72
45.817.013β4β{+6+}β247,20β{+7+}β1.587,02
20.734.360β4β{+5+}β747,85β19β4.449,71
72.963.876β3β85β175,59β7β15.969,91
38.652.314β1β18 [-383,45-]β{+583,45+}β19β12.497,50
[-33.950.746-]
{+53.950.746+}β3β45β436,37β7β21.011,22
76.379.939β6β1β729,72β19β868,37
63.663.123β3β17β3,06β7 [-35,66-]β{+55,66+}
43.290.183β1β76β13,75β19β1.243,55
62.285.481β9β1β502,80β7 [-338,00-]β{+538,00+}
17.941.150β3β41β0,55β19β26,83
19.994.659β1β1β723,46β19β860,92
50.350.515β6β19β670,41β19β15.157,97
Summeβ288.430,17
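For anyone who wants to produce the same kind of comparison for their own test pages, here is a minimal sketch of a word-level diff in the `[-removed-]` / `{+added+}` notation used above. It relies only on Python's standard `difflib`; the function name `wdiff` is my own, and this is an approximation of the format, not the tool that generated the listing above.

```python
import difflib


def wdiff(first: str, second: str) -> str:
    """Word-level diff: [-...-] marks words only in `first`, {+...+} words only in `second`."""
    a, b = first.split(), second.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if i2 > i1:  # words dropped or replaced from the first text
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j2 > j1:  # words inserted or substituted in the second text
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


print(wdiff("19 785,955", "19 785,55"))
```

Splitting on whitespace discards the original column separators, so this only shows which tokens differ, not where columns went missing.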
The DPI is missing from this image. I tried a config file with the DPI override parameter, "user_defined_dpi 200", and a lot more text came out.
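A rough sketch of how that override could be scripted, in case it helps others reproduce it. It only builds the config text and the command line; the helper names, the image name `BoxValues.jpg` and the output base `output` are placeholders, and passing the variable with `-c user_defined_dpi=...` is assumed to be equivalent to putting it in a config file. The value 200 is just what worked for this image.

```python
def dpi_config_text(dpi: int) -> str:
    """Contents of a Tesseract config file that overrides the image DPI."""
    return f"user_defined_dpi {dpi}\n"


def build_tesseract_cmd(image: str, out_base: str, dpi: int, psm: int = 3) -> list:
    """Command line that sets the same variable via -c instead of a config file."""
    return [
        "tesseract", image, out_base,
        "--psm", str(psm),
        "-c", f"user_defined_dpi={dpi}",
    ]


# Placeholder file names; run the result with subprocess.run(cmd, check=True)
# on a machine where tesseract is installed.
cmd = build_tesseract_cmd("BoxValues.jpg", "output", 200)
print(" ".join(cmd))
```

To use the config-file variant instead, write `dpi_config_text(200)` to a file and pass that file's path as the last argument to tesseract.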
+1 on this problem
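I wrote this to remove the vertical lines from a table (it needs the opencv-contrib build for cv2.ximgproc):

    import sys
    import cv2

    def main(img_n):
        # Detect line segments on a grayscale copy of the image.
        gray = cv2.imread(img_n, cv2.IMREAD_GRAYSCALE)
        if gray is None:
            sys.exit("cannot read " + img_n)
        fld = cv2.ximgproc.createFastLineDetector(20, 5, 300, 450, 3, True)
        lines = fld.detect(gray)
        img = cv2.imread(img_n, cv2.IMREAD_COLOR)
        if lines is not None:
            for line in lines:
                # Endpoints come back as floats; cv2.line needs integers.
                x1, y1, x2, y2 = (int(round(v)) for v in line[0])
                # Paint over segments whose vertical extent exceeds 150 px.
                if abs(y1 - y2) > 150:
                    cv2.line(img, (x1, y1), (x2, y2), (255, 255, 255), 40)
        # Note: this overwrites the input image in place.
        cv2.imwrite(img_n, img)

    if __name__ == "__main__":
        img_n = sys.argv[1]
        print(img_n)
        main(img_n)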
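The underlying idea (erase long vertical strokes, keep everything shorter) can also be tried without OpenCV on a small binarized grid. The sketch below is a different, simpler technique than the line-segment detector above: it scans each column for runs of ink pixels and clears runs of at least `min_len` rows. The function name and the 0/1 grid representation are assumptions for illustration only.

```python
def erase_vertical_runs(grid, min_len):
    """Return a copy of grid (rows of 0/1, 1 = ink) with vertical ink runs
    of at least min_len rows cleared to 0."""
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    out = [list(row) for row in grid]
    for x in range(cols):
        y = 0
        while y < rows:
            if grid[y][x] != 1:
                y += 1
                continue
            start = y
            # Walk to the end of this run of ink pixels.
            while y < rows and grid[y][x] == 1:
                y += 1
            if y - start >= min_len:
                for k in range(start, y):
                    out[k][x] = 0
    return out


page = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
]
for row in erase_vertical_runs(page, 3):
    print(row)
```

The threshold has to be taller than the tallest character, otherwise strokes such as the stem of a "1" or "l" would be erased along with the table rules.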
I usually use https://scandocflow.com for tables and invoice like docs data extraction