Tesseract: Text in tables ignored

Created on 11 Oct 2018  ·  13 Comments  ·  Source: tesseract-ocr/tesseract

Hello,

When I try to OCR the attached image, digits and text inside tables are ignored. This happens more often when the image contains multiple tables. Tests were done on macOS and Ami Linux with the latest Tesseract version.

Thanks in advance.

d21cf3c6-9236-4d46-a160-f5a386b6be9a-sky lakes billing bates 1-136_1 pdf

accuracy

All 13 comments

Oh yeah! The best part of Tesseract!
I wasted a year on this problem!

Similar issue here.

Using Tesseract v4.0.0-rc3-3-g68a9 with Leptonica.

tesseract 4.0.0-rc3-3-g68a9
 leptonica-1.77.0
  libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.25 : libtiff 4.0.6 : zlib 1.2.8
 Found AVX2
 Found AVX
 Found SSE

I believe this to be the latest build as I did a pull and built both Tesseract and Leptonica from scratch.

My issue is that the engine is not recognising text within the boxes (see attached image). This also happened for me with v4.0.0Alpha with Leptonica.

The command I am using is tesseract BoxValues.jpg output --oem 1 --psm 1 --tessdata-dir /home/osboxes/build/tesseract/tessdata

I am getting the following text returned.

Previous Payments & New Total Balance
Balance Credits Charges Balance Due

If I run with tesseract BoxValues.jpg output --oem 0 --psm 3 --tessdata-dir /home/osboxes/build/tesseract/tessdata I get the same results.
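For anyone who wants to compare settings more systematically, here is a minimal sketch that builds one command per engine/segmentation combination, using the same flags as the two invocations above. The helper names `build_commands` and `run_all` are made up for illustration, and the extra `--psm` values 6 (single uniform block) and 11 (sparse text) are an assumption on my part: they are not from the thread and are not confirmed to help with this image.

```python
import itertools
import subprocess


def build_commands(image, tessdata_dir, oems=(0, 1), psms=(1, 3, 6, 11)):
    """Return one tesseract argument list per (oem, psm) combination.

    Hypothetical helper. Note that --oem 0 only works with traineddata
    that includes the legacy engine; the tessdata_best models are LSTM-only.
    """
    commands = []
    for oem, psm in itertools.product(oems, psms):
        # Separate output base per run so results can be compared afterwards.
        output_base = f"output_oem{oem}_psm{psm}"
        commands.append([
            "tesseract", image, output_base,
            "--oem", str(oem), "--psm", str(psm),
            "--tessdata-dir", tessdata_dir,
        ])
    return commands


def run_all(commands):
    """Run each command; tesseract writes <output_base>.txt for every run."""
    for cmd in commands:
        # check=False: keep going even if one combination fails.
        subprocess.run(cmd, check=False)
```

Calling `run_all(build_commands("BoxValues.jpg", "/home/osboxes/build/tesseract/tessdata"))` would then leave one text file per combination to diff against each other.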

FYI I am using the best engine as located here : https://github.com/tesseract-ocr/tessdata_best/blob/master/eng.traineddata

I also get the same issue with the standard engine located here : https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata

Also I am using osd.traineddata from here https://github.com/tesseract-ocr/tessdata/blob/master/osd.traineddata

I also get the same issue when I use color images of the same layout. I have also seen issues similar to what mariopinderist is experiencing, where a value is either missed or returned as garbage. In this instance, if I crop the text out and OCR it on its own, it's fine.

Any ideas why this issue is occurring?

boxvalues

If I use the debugger tool, I can see that only the hatching box is detected. I would expect all boxes to at least be detected.

debug

P.S. I don't expect it to get the value in the hatching, but I do expect the values in the boxes to be captured.

I have the same problem with words in boxes. @EynsherKiel did you solve the problem?

I did a rebuild today with the following updated version of Tesseract.

tesseract 4.0.0-1-g2a2b
 leptonica-1.77.0
  libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.25 : libtiff 4.0.6 : zlib 1.2.8
 Found AVX2
 Found AVX
 Found SSE

I did a rebuild of Leptonica too and noted updates (among others) to these files

prog/boxa1_reg.c | 6 +-
prog/boxa2_reg.c | 5 +-
prog/boxa3_reg.c | 138 +++++++++++++
prog/boxap1.ba | 129 ++++++++++++
prog/boxap2.ba | 303 ++++++++++++++++++++++++++++
prog/boxap3.ba | 15 ++
prog/boxap4.ba | 53 +++++
prog/boxap5.ba | 553 +++++++++++++++++++++++++++++++++++++++++++++++++++
prog/displayboxa.c | 17 +-

I was hopeful they may be the culprit but I noted no change in the results documented above.

What do we need to do to get this issue looked at by someone from the development group? Does anyone know a workaround?

Can a contributor please comment on here? This was posted 26 days ago. Sorry to tag but not sure what else to do.
@stweil

I am not aware of recent or planned activities to improve such layout detection issues.

Thanks @stweil. I think the box detection issue is one problem, but completely missing the text in the boxes is another. Would you class this as a bug/accuracy issue? Can anyone else comment, please?

Sure, nobody wants to miss relevant text. I already added the accuracy label.

I've done one test with a simple table, a synthetically generated "scan":
out.pdf, 600dpi, b/w (PS: Recognition done from 600dpi TIF, but can't upload this)

Generated from this pdf:
ocr-zahlen.pdf

A few figures are added/changed, and some columns are sometimes missing.

When using a 300 dpi grayscale image, there are slightly fewer, and different, errors.

Kind regards, Jochen

output (wdiff style)

Artikel-Nr. [-| Warengruppe|-]→→{+Warengruppe+}→Menge [-Einzelpreis| MwSt%-]→{+Einzelpreis→MwSt→%+}→Gesamtpreis

88.193.554→6→16→168,75→19→3.213,00

46.923.325→1→75→873,01→7→70.059,05

16.636.042→7→2→5,30→19→12,61

71.574.789→1→1→248,47→19→295,68

29.695.829→10→45→13,18→{+7+}→634,62

66.618.400→2→6→146,71→{+7+}→941,88

32.142.244→3→21→3,27→19→81,72

66.467.954→1 [-361,48-]→{+1→561,48+}→7→600,78

24.910.833 [-9-]→{+5+}→3→441,24→19→1.575,23

5.790.076→1→83→3,27→19→322,98

92.072.281→1→1→313,34→7→335,27

95.057.999→1→1→660,13→19 [-785,955-]→{+785,55+}

9.587.128→1→65→972,43→19→75.217,46

2.434.509→1→41→95,50→19→4.659,45

33.372.869→6→83→113,21→{+7+}→10.054,18

77.627.319→1→1→29,79→{+7+}→31,88

85.263.688→1→22→392,84→19→10.284,55

75.611.213→6→21→722,22→19→18.048,28

[-83.332.602-]

{+83.532.602+}→3→{+4+}→10,00→{+7+}→42,80

2.165.747→7→{+8+}→10,20→{+7+}→87,31

90.935.780→1→12→65,37→19→933,48

36.701.968→1→15→47,33→7→759,65

61.464.130→4→35→307,26→19→12.797,38

73.425.868→1→41 [-534,45-]→{+54,45→7+}→2.388,72

45.817.013→4→{+6+}→247,20→{+7+}→1.587,02

20.734.360→4→{+5+}→747,85→19→4.449,71

72.963.876→3→85→175,59→7→15.969,91

38.652.314→1→18 [-383,45-]→{+583,45+}→19→12.497,50

[-33.950.746-]

{+53.950.746+}→3→45→436,37→7→21.011,22

76.379.939→6→1→729,72→19→868,37

63.663.123→3→17→3,06→7 [-35,66-]→{+55,66+}

43.290.183→1→76→13,75→19→1.243,55

62.285.481→9→1→502,80→7 [-338,00-]→{+538,00+}

17.941.150→3→41→0,55→19→26,83

19.994.659→1→1→723,46→19→860,92

50.350.515→6→19→670,41→19→15.157,97

Summe→288.430,17

DPI is missing from this image. I tried using a config file with DPI override parameter and I saw a lot more text come out: "user_defined_dpi 200"
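A minimal sketch of the DPI override described above, for anyone who wants to script it. The function names are made up for illustration; `user_defined_dpi` is the parameter quoted in the comment, passed here either with tesseract's `-c` option or as a one-line config file. The value 200 is taken from the comment and may not suit other images.

```python
def dpi_override_command(image, output_base, dpi=200):
    """Build a tesseract command that sets user_defined_dpi on the command line.

    Hypothetical helper: -c NAME=VALUE sets a single parameter for this run.
    """
    return ["tesseract", image, output_base, "-c", f"user_defined_dpi={dpi}"]


def dpi_config_text(dpi=200):
    """Return the contents of a config file carrying the same override.

    Config files hold one "name value" pair per line.
    """
    return f"user_defined_dpi {dpi}\n"
```

The list from `dpi_override_command` can be handed to `subprocess.run`. Alternatively, the text from `dpi_config_text` can be saved to a file and that file's path passed as the last argument on the tesseract command line.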

+1 on this problem

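I wrote this to remove the vertical lines from a table:

import sys

import cv2


def main(img_n):
    # Detect line segments on a grayscale copy (ximgproc needs opencv-contrib).
    gray = cv2.imread(img_n, cv2.IMREAD_GRAYSCALE)
    fld = cv2.ximgproc.createFastLineDetector(20, 5, 300, 450, 3, True)
    lines = fld.detect(gray)

    img = cv2.imread(img_n, cv2.IMREAD_COLOR)

    # detect() returns None when no segments are found.
    for line in (lines if lines is not None else []):
        # Coordinates come back as floats; cv2.line needs integers.
        x1, y1, x2, y2 = (int(round(v)) for v in line[0])
        # Paint over segments that span more than 150 px vertically.
        if abs(y1 - y2) > 150:
            cv2.line(img, (x1, y1), (x2, y2), (255, 255, 255), 40)
    # Note: this overwrites the input image in place.
    cv2.imwrite(img_n, img)


if __name__ == "__main__":
    img_n = sys.argv[1]
    print(img_n)
    main(img_n)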

I usually use https://scandocflow.com for extracting data from tables and invoice-like documents.
