Tesseract: Resolution information in PNG files is ignored

Created on 28 Oct 2016 · 11 Comments · Source: tesseract-ocr/tesseract

As per @jbreiden's comment in #373, here's a problem I noticed with tesseract 3.04.01 (from the Ubuntu Yakkety package).

This is the original PDF without text as it is created by my scanner: scanned.pdf

I've used pdfsandwich with the -debug flag to get the intermediate files. The image it feeds into tesseract is the tif in this tif.zip, and this works just fine. Here's the identify information from that tif:

Image: extractedtif.tif
  Format: TIFF (Tagged Image File Format)
  Mime type: image/tiff
  Class: DirectClass
  Geometry: 2479x3500+0+0
  Resolution: 300x300
  Print size: 8.26333x11.6667
  Units: Undefined
  Type: Grayscale
  Endianess: LSB
  Colorspace: Gray
  Depth: 8-bit
  Channel depth:
    gray: 8-bit

Running

tesseract extractedtif.tif outputtif -l deu pdf

gives me a perfectly fine PDF in format: A4, Portrait (210 x 296 mm).

outputtif.pdf

I then converted that tif to png with simply:

convert extractedtif.tif pngfromtif.png

png.zip

The new png file shows the same resolution information and print size:

Image: pngfromtif.png
  Format: PNG (Portable Network Graphics)
  Mime type: image/png
  Class: PseudoClass
  Geometry: 2479x3500+0+0
  Resolution: 300x300
  Print size: 8.26333x11.6667
  Units: Undefined
  Type: Grayscale
  Endianess: Undefined
  Colorspace: Gray
  Depth: 8-bit
  Channel depth:
    gray: 8-bit

However, running

tesseract pngfromtif.png outputpng -l deu pdf

gives me a PDF in format: 900 × 1270 mm paper size.

outputpng.pdf

mbirth
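
The two page sizes reported above can be reproduced with simple arithmetic. The sketch below converts pixel dimensions to millimetres for a given DPI. `page_size_mm` is a hypothetical helper, and the 70 dpi fallback is an assumption, chosen because it is the value that reproduces the 900 × 1270 mm figure.

```python
MM_PER_INCH = 25.4


def page_size_mm(width_px, height_px, dpi):
    """Return (width, height) in millimetres for an image rendered at `dpi`."""
    return (width_px / dpi * MM_PER_INCH, height_px / dpi * MM_PER_INCH)


if __name__ == "__main__":
    # 300 dpi is the value stored in the image; 70 dpi is an assumed fallback.
    for dpi in (300, 70):
        w, h = page_size_mm(2479, 3500, dpi)
        print(f"{dpi} dpi: {w:.0f} x {h:.0f} mm")
```

At 300 dpi this gives roughly A4 (210 x 296 mm); at the assumed 70 dpi it gives 900 x 1270 mm, which suggests the PNG's 300x300 density was discarded rather than misread.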

All 11 comments

"Units: Undefined" is not so great. If you set it, things work correctly.
Will have to look more carefully what the unit possibilities are for PDF to
see if we want to make a code change or not.

$ mogrify -units PixelsPerInch -density 300x300 pngfromtif.png
$ identify -verbose pngfromtif.png
  Geometry: 2479x3500+0+0
  Resolution: 118.11x118.11
  Print size: 20.9889x29.6334
  Units: PixelsPerCentimeter

$ tesseract  pngfromtif.png correct pdf
$ pdfinfo correct.pdf
Producer:       Tesseract 3.04.00
Page size:      594.96 x 840 pts

jbreiden on 29 Oct 2016
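
The numbers in this transcript agree with each other once the units are converted. PNG stores density in pixels per metre, which is why identify reports 118.11 pixels per centimetre after 300 pixels per inch was requested, and PDF page sizes are given in points (72 per inch). A minimal sketch of that arithmetic, with hypothetical helper names:

```python
INCH_IN_METRES = 0.0254
POINTS_PER_INCH = 72


def ppi_to_pixels_per_metre(ppi):
    """Integer pixels-per-metre value, as a PNG pHYs chunk would store it."""
    return round(ppi / INCH_IN_METRES)


def page_size_pts(width_px, height_px, ppi):
    """Return (width, height) in PDF points for an image at `ppi`."""
    return (width_px / ppi * POINTS_PER_INCH, height_px / ppi * POINTS_PER_INCH)


if __name__ == "__main__":
    ppm = ppi_to_pixels_per_metre(300)
    print(ppm, "pixels/m =", ppm / 100, "pixels/cm")
    print(page_size_pts(2479, 3500, 300))
```

For the 2479x3500 image at 300 ppi this yields about 594.96 x 840 pts, matching the pdfinfo output.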

But then, why does tesseract behave inconsistently between tif and png when both have Units: Undefined?

mbirth on 29 Oct 2016

Haven't had time to look at TIFF, but the PNG behaviour looks right. Spec says we know nothing about image resolution. Common practice from time immemorial is to default to some hopelessly wrong value. I could go trace code to find out what number was used, but honestly this is a garbage in, garbage out situation. Not sure it is worth spending time on. Are you in contact with the authors of the program that is producing the bad metadata? Fixing that is top priority.

The following values are legal for the unit specifier:
   0: unit is unknown
   1: unit is the meter
When the unit specifier is 0, the pHYs chunk defines pixel aspect ratio only; the actual 
size of the pixels remains unspecified.

jbreiden on 29 Oct 2016
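
To check which unit specifier a given PNG actually carries, the pHYs chunk can be read directly. The following is a rough sketch using only the Python standard library; `read_phys`, `make_chunk` and `describe` are hypothetical helpers, and the PNG built in the demo is a synthetic, header-only file used purely to exercise the parser.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, payload):
    """Build one PNG chunk: length, type, payload, CRC over type + payload."""
    crc = zlib.crc32(chunk_type + payload) & 0xFFFFFFFF
    return (struct.pack(">I", len(payload)) + chunk_type + payload
            + struct.pack(">I", crc))


def read_phys(data):
    """Return (x_ppu, y_ppu, unit) from the pHYs chunk, or None if absent."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        payload = data[pos + 8:pos + 8 + length]
        if chunk_type == b"pHYs":
            return struct.unpack(">IIB", payload)
        if chunk_type in (b"IDAT", b"IEND"):
            return None  # pHYs, if present, must come before the first IDAT
        pos += 12 + length  # 4 length + 4 type + payload + 4 CRC
    return None


def describe(phys):
    """Summarise what a pHYs tuple says about physical pixel size."""
    if phys is None:
        return "no pHYs chunk: resolution unknown"
    x, y, unit = phys
    if unit == 1:  # unit is the metre
        return f"{x * 0.0254:.0f} x {y * 0.0254:.0f} pixels per inch"
    return f"aspect ratio {x}:{y} only, pixel size unspecified"


if __name__ == "__main__":
    ihdr = struct.pack(">IIBBBBB", 2479, 3500, 8, 0, 0, 0, 0)
    for unit in (0, 1):
        phys = struct.pack(">IIB", 11811, 11811, unit)
        png = (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
               + make_chunk(b"pHYs", phys) + make_chunk(b"IEND", b""))
        print(describe(read_phys(png)))
```

With unit 0 the same 11811 values carry only an aspect ratio, which is the "Units: Undefined" case in this issue; with unit 1 they mean 300 pixels per inch.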

I think I know why the units are Undefined. pdfsandwich does a 2-step conversion from a PDF page to tif:

convert -colorspace Gray -colors 256 -depth 8 -background white -flatten +matte -density 300x300 scanned.pdf[0] tmpfile.ppm

Which gives:

Image: tmpfile.ppm
  Format: PPM (Portable pixmap format (color))
  Mime type: image/x-portable-pixmap
  Class: DirectClass
  Geometry: 2479x3500+0+0
  Units: Undefined
  Type: Grayscale
  Endianess: Undefined
  Colorspace: Gray
  Depth: 8-bit
  Channel depth:
    gray: 8-bit

And then:

convert -density 300x300 tmpfile.ppm tmpfile.tif

Which results in:

Image: tmpfile.tif
  Format: TIFF (Tagged Image File Format)
  Mime type: image/tiff
  Class: DirectClass
  Geometry: 2479x3500+0+0
  Resolution: 300x300
  Print size: 8.26333x11.6667
  Units: Undefined
  Type: Grayscale
  Endianess: LSB
  Colorspace: Gray
  Depth: 8-bit
  Channel depth:
    gray: 8-bit

I'll open a ticket with pdfsandwich to add -units PixelsPerInch to the convert command.

_EDIT:_ https://sourceforge.net/p/pdfsandwich/bugs/14/

mbirth on 29 Oct 2016

1) Why use two convert commands instead of just one?
2) I suggest PNG over uncompressed TIFF. Filesize of the PDF should be smaller, and because Tesseract can skip transcoding the image, there will be some CPU savings as well.

jbreiden on 31 Oct 2016

Yup, asked both in the ticket there.

mbirth on 31 Oct 2016

1) Why use two convert commands instead of just one?

Because there's a use of unpaper in between them. Its info file says:

The image-file formats accepted by unpaper are those that libav can handle. In particular it supports the whole PNM-family: PBM, PGM and PPM. This ensures interoperability with the SANE tools under Linux. Support for TIFF and other complex file formats is not guaranteed.

That said, libav says that it handles png and tiff, if I read it correctly.

Jmuccigr on 7 Nov 2016

@mbirth I'm the author of ocrmypdf, which is similar to pdfsandwich. It handles your file without issue.

@jbreiden I think it would be helpful for tesseract to issue a warning when the DPI is nonsense. Lots of programs don't handle this metadata correctly so it's easy for a workflow to discard it. Wrong DPI isn't just a display/printing issue; in the case of say, scanned maps, losing scale information can change the interpretation.

jbarlow83 on 24 Nov 2016

That is a very good idea. Hope I remember once the turkey coma wears off.

jbreiden on 25 Nov 2016

This looks like a spot where we should emit the warning, but it is not executed.

https://github.com/tesseract-ocr/tesseract/blob/a75ab450a8cc9a2b69cf05f5c4f7a39bc44cbacc/ccmain/osdetect.cpp#L167

This spot thinks the resolution is 0.

https://github.com/tesseract-ocr/tesseract/blob/9c7e99b04197fb9900c29be8bb9ac79a7a8b4672/ccmain/thresholder.cpp#L175

Oh, oh, maybe here.

https://github.com/tesseract-ocr/tesseract/blob/7b5b16779ad4980936724e85a548bccb717cc39c/api/baseapi.cpp#L2226

jbreiden on 28 Nov 2016

Looks like we have kMinCredibleResolution defined in two places. Only the
one in baseapi.cpp is active for this test case.

--- tesseract/api/baseapi.cpp   2016-11-07 07:44:03.000000000 -0800
+++ tesseract/api/baseapi.cpp   2016-11-28 11:23:48.000000000 -0800
@@ -2226,6 +2226,8 @@
   if (y_res < kMinCredibleResolution || y_res > kMaxCredibleResolution) {
     // Use the minimum default resolution, as it is safer to under-estimate
     // than over-estimate resolution.
+    tprintf("Warning. Invalid resolution %d dpi. Using %d instead.\n",
+            y_res, kMinCredibleResolution);
     thresholder_->SetSourceYResolution(kMinCredibleResolution);
   }
   PageSegMode pageseg_mode =
--- tesseract/ccmain/osdetect.cpp   2016-11-07 07:44:03.000000000 -0800
+++ tesseract/ccmain/osdetect.cpp   2016-11-28 11:31:13.000000000 -0800
@@ -164,8 +164,14 @@
   int vertical_y = 1;
   tesseract::TabVector_LIST v_lines;
   tesseract::TabVector_LIST h_lines;
-  int resolution = (kMinCredibleResolution > pixGetXRes(pix)) ?
-      kMinCredibleResolution : pixGetXRes(pix);
+  int resolution;
+  if (kMinCredibleResolution > pixGetXRes(pix)) {
+    resolution = kMinCredibleResolution;
+    tprintf("Warning. Invalid resolution %d dpi. Using %d instead.\n",
+            pixGetXRes(pix), resolution);
+  } else {
+    resolution = pixGetXRes(pix);
+  }

   tesseract::LineFinder::FindAndRemoveLines(resolution, false, pix,
                                             &vertical_x, &vertical_y,

jbreiden on 28 Nov 2016
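
For readers who want to try the fallback behaviour outside of Tesseract, the baseapi.cpp branch of the patch can be modelled in a few lines. This is an illustrative sketch, not Tesseract code: `credible_resolution` is a hypothetical name, and the constants 70 and 2400 are assumed values for kMinCredibleResolution and kMaxCredibleResolution.

```python
MIN_CREDIBLE_RESOLUTION = 70    # assumed value of kMinCredibleResolution
MAX_CREDIBLE_RESOLUTION = 2400  # assumed value of kMaxCredibleResolution


def credible_resolution(y_res):
    """Return y_res if plausible, else warn and fall back to the minimum.

    Falling back to the minimum mirrors the comment in the patch: it is
    safer to under-estimate than over-estimate resolution.
    """
    if y_res < MIN_CREDIBLE_RESOLUTION or y_res > MAX_CREDIBLE_RESOLUTION:
        print(f"Warning. Invalid resolution {y_res} dpi. "
              f"Using {MIN_CREDIBLE_RESOLUTION} instead.")
        return MIN_CREDIBLE_RESOLUTION
    return y_res


if __name__ == "__main__":
    for reported in (0, 300, 100000):
        print(reported, "->", credible_resolution(reported))
```

A resolution of 0, as seen for the PNG in this issue, triggers the warning and falls back to the minimum, while 300 passes through unchanged.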
