As per @jbreiden's comment in #373, here's a problem I noticed with tesseract 3.04.01 (from the Ubuntu Yakkety package).
This is the original PDF without text as it is created by my scanner: scanned.pdf
I've used pdfsandwich with the -debug flag to get the intermediate files. The image it uses to feed into tesseract is the tif in this tif.zip. And this works just fine. Here's the identify information from that tif:
Image: extractedtif.tif
Format: TIFF (Tagged Image File Format)
Mime type: image/tiff
Class: DirectClass
Geometry: 2479x3500+0+0
Resolution: 300x300
Print size: 8.26333x11.6667
Units: Undefined
Type: Grayscale
Endianess: LSB
Colorspace: Gray
Depth: 8-bit
Channel depth:
gray: 8-bit
Running
tesseract extractedtif.tif outputtif -l deu pdf
gives me a perfectly fine PDF in format: A4, Portrait (210 x 296 mm).
And now I converted that tif to png with simply:
convert extractedtif.tif pngfromtif.png
The new png file shows the same resolution information and print size:
Image: pngfromtif.png
Format: PNG (Portable Network Graphics)
Mime type: image/png
Class: PseudoClass
Geometry: 2479x3500+0+0
Resolution: 300x300
Print size: 8.26333x11.6667
Units: Undefined
Type: Grayscale
Endianess: Undefined
Colorspace: Gray
Depth: 8-bit
Channel depth:
gray: 8-bit
However, running
tesseract pngfromtif.png outputpng -l deu pdf
gives me a PDF in format: 900 × 1270 mm paper size.
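The two page sizes are consistent with the same 2479x3500 pixels being interpreted at two different resolutions. A quick check of that arithmetic, as a Python sketch: the 70 dpi fallback is an assumption inferred from the reported numbers, not something stated in this thread, and `page_size_mm` is a made-up helper name.

```python
MM_PER_INCH = 25.4

def page_size_mm(width_px, height_px, dpi):
    """Physical page size in millimetres for an image at the given resolution."""
    return (width_px / dpi * MM_PER_INCH, height_px / dpi * MM_PER_INCH)

# 300 dpi, as recorded in the TIFF metadata: close to A4.
print(page_size_mm(2479, 3500, 300))

# 70 dpi is an ASSUMED fallback value; it reproduces the oversized page.
print(page_size_mm(2479, 3500, 70))
```

If the assumption holds, the PNG run is not scaling the image differently; it is only dividing the same pixel counts by a much smaller resolution.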
"Units: Undefined" is not so great. If you set it, things work correctly.
Will have to look more carefully what the unit possibilities are for PDF to
see if we want to make a code change or not.
$ mogrify -units PixelsPerInch --density 300x300 pngfromtif.png
$ identify -verbose pngfromtif.png
Geometry: 2479x3500+0+0
Resolution: 118.11x118.11
Print size: 20.9889x29.6334
Units: PixelsPerCentimeter
$ tesseract pngfromtif.png correct pdf
$ pdfinfo correct.pdf
Producer: Tesseract 3.04.00
Page size: 594.96 x 840 pts
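The figures in this transcript agree with each other: 118.11 pixels per centimetre is 300 pixels per inch, and at 72 PDF points per inch the 2479x3500 image comes out at the page size that pdfinfo reports. A small Python sketch of that conversion (the function names are hypothetical, chosen only for illustration):

```python
CM_PER_INCH = 2.54
POINTS_PER_INCH = 72  # PDF default user-space units per inch

def ppcm_to_dpi(pixels_per_cm):
    """Convert a pixels-per-centimetre resolution to pixels per inch."""
    return pixels_per_cm * CM_PER_INCH

def page_size_pts(width_px, height_px, dpi):
    """Page size in PDF points for an image at the given resolution."""
    return (width_px / dpi * POINTS_PER_INCH, height_px / dpi * POINTS_PER_INCH)

print(ppcm_to_dpi(118.11))             # very close to 300
print(page_size_pts(2479, 3500, 300))  # compare with the pdfinfo "Page size" line
```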
But then, why does tesseract behave inconsistently between tif and png when both have Units: Undefined?
Haven't had time to look at TIFF, but the PNG behaviour looks right. Spec says we know nothing about image resolution. Common practice from time immemorial is to default to some hopelessly wrong value. I could go trace code to find out what number was used, but honestly this is a garbage in, garbage out situation. Not sure it is worth spending time on. Are you in contact with the authors of the program that is producing the bad metadata? Fixing that is top priority.
The following values are legal for the unit specifier:
0: unit is unknown
1: unit is the meter
When the unit specifier is 0, the pHYs chunk defines pixel aspect ratio only; the actual
size of the pixels remains unspecified.
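To see what a given PNG actually carries, the pHYs chunk can be read directly. The following is a minimal Python sketch, not how tesseract or leptonica do it: it walks the chunk list, returns the raw pHYs fields, and reports a dpi value only when the unit specifier is 1 (meter). The helper names `chunk`, `read_phys` and `dpi_from_phys` are made up for this example, and CRCs are not verified.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
METERS_PER_INCH = 0.0254

def chunk(ctype, data):
    """Serialize one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(ctype + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + ctype + data + struct.pack(">I", crc)

def read_phys(png_bytes):
    """Return (x_pixels_per_unit, y_pixels_per_unit, unit) or None if no pHYs."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(png_bytes):
        length, ctype = struct.unpack(">I4s", png_bytes[pos:pos + 8])
        data = png_bytes[pos + 8:pos + 8 + length]
        if ctype == b"pHYs":
            return struct.unpack(">IIB", data)
        if ctype in (b"IDAT", b"IEND"):
            return None  # pHYs, when present, comes before the first IDAT
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
    return None

def dpi_from_phys(phys):
    """Resolution in dpi, or None when the unit specifier is 0 (unknown)."""
    if phys is None or phys[2] != 1:
        return None
    return (phys[0] * METERS_PER_INCH, phys[1] * METERS_PER_INCH)

# Header-only demo stream (no IDAT, so not a decodable image).
ihdr = chunk(b"IHDR", struct.pack(">IIBBBBB", 2479, 3500, 8, 0, 0, 0, 0))
demo = PNG_SIGNATURE + ihdr + chunk(b"pHYs", struct.pack(">IIB", 11811, 11811, 0)) + chunk(b"IEND", b"")
print(read_phys(demo), dpi_from_phys(read_phys(demo)))
```

With the unit specifier set to 0, the numbers are still stored in the file, but per the spec text above they only describe the pixel aspect ratio, so the sketch declines to turn them into a dpi value.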
I think I know why the units are Undefined. pdfsandwich does a 2-step conversion from a PDF page to tif:
convert -colorspace Gray -colors 256 -depth 8 -background white -flatten +matte -density 300x300 scanned.pdf[0] tmpfile.ppm
Which gives:
Image: tmpfile.ppm
Format: PPM (Portable pixmap format (color))
Mime type: image/x-portable-pixmap
Class: DirectClass
Geometry: 2479x3500+0+0
Units: Undefined
Type: Grayscale
Endianess: Undefined
Colorspace: Gray
Depth: 8-bit
Channel depth:
gray: 8-bit
And then:
convert -density 300x300 tmpfile.ppm tmpfile.tif
Which results in:
Image: tmpfile.tif
Format: TIFF (Tagged Image File Format)
Mime type: image/tiff
Class: DirectClass
Geometry: 2479x3500+0+0
Resolution: 300x300
Print size: 8.26333x11.6667
Units: Undefined
Type: Grayscale
Endianess: LSB
Colorspace: Gray
Depth: 8-bit
Channel depth:
gray: 8-bit
I'll open a ticket with pdfsandwich to add -units PixelsPerInch to the convert command.
1) Why use two convert commands instead of just one?
2) I suggest PNG over uncompressed TIFF. Filesize of the PDF should be smaller, and because Tesseract can skip transcoding the image, there will be some CPU savings as well.
Yup, asked both in the ticket there.
1) Why use two convert commands instead of just one?
Because there's a use of unpaper in between them. Its info file says:
The image-file formats accepted by unpaper are those that libav can handle. In particular it supports the whole PNM-family: PBM, PGM and PPM. This ensures interoperability with the SANE tools under Linux. Support for TIFF and other complex file formats is not guaranteed.
That said, libav says that it handles png and tiff, if I read it correctly.
@mbirth I'm the author of ocrmypdf, which is similar to pdfsandwich.
@jbreiden I think it would be helpful for tesseract to issue a warning when the DPI is nonsense. Lots of programs don't handle this metadata correctly so it's easy for a workflow to discard it. Wrong DPI isn't just a display/printing issue; in the case of say, scanned maps, losing scale information can change the interpretation.
That is a very good idea. Hope I remember once the turkey coma wears off.
This looks like a spot where we should emit the warning, but it is not executed.
This spot thinks the resolution is 0.
Oh, oh, maybe here.
Looks like we have kMinCredibleResolution defined in two places. Only the
one in baseapi.cpp is active for this test case.
--- tesseract/api/baseapi.cpp 2016-11-07 07:44:03.000000000 -0800
+++ tesseract/api/baseapi.cpp 2016-11-28 11:23:48.000000000 -0800
@@ -2226,6 +2226,8 @@
if (y_res < kMinCredibleResolution || y_res > kMaxCredibleResolution) {
// Use the minimum default resolution, as it is safer to under-estimate
// than over-estimate resolution.
+ tprintf("Warning. Invalid resolution %d dpi. Using %d instead.\n",
+ y_res, kMinCredibleResolution);
thresholder_->SetSourceYResolution(kMinCredibleResolution);
}
PageSegMode pageseg_mode =
--- tesseract/ccmain/osdetect.cpp 2016-11-07 07:44:03.000000000 -0800
+++ tesseract/ccmain/osdetect.cpp 2016-11-28 11:31:13.000000000 -0800
@@ -164,8 +164,14 @@
int vertical_y = 1;
tesseract::TabVector_LIST v_lines;
tesseract::TabVector_LIST h_lines;
- int resolution = (kMinCredibleResolution > pixGetXRes(pix)) ?
- kMinCredibleResolution : pixGetXRes(pix);
+ int resolution;
+ if (kMinCredibleResolution > pixGetXRes(pix)) {
+ resolution = kMinCredibleResolution;
+ tprintf("Warning. Invalid resolution %d dpi. Using %d instead.\n",
+ pixGetXRes(pix), resolution);
+ } else {
+ resolution = pixGetXRes(pix);
+ }
tesseract::LineFinder::FindAndRemoveLines(resolution, false, pix,
&vertical_x, &vertical_y,
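For readers who want to try the behaviour outside of tesseract, the clamp-and-warn logic of the baseapi.cpp hunk can be sketched in a few lines of Python. This is an illustration, not the patch itself: the constant values 70 and 2400 are assumptions (the thread does not state them), and `credible_resolution` is a hypothetical name.

```python
# ASSUMED values; the thread names the constants but not their numbers.
K_MIN_CREDIBLE_RESOLUTION = 70
K_MAX_CREDIBLE_RESOLUTION = 2400

def credible_resolution(reported_dpi, warn=print):
    """Return reported_dpi if it is plausible, else the minimum default.

    A warning is emitted whenever the reported value is replaced, using the
    same message format as the tprintf call in the patch.
    """
    if (reported_dpi < K_MIN_CREDIBLE_RESOLUTION
            or reported_dpi > K_MAX_CREDIBLE_RESOLUTION):
        warn("Warning. Invalid resolution %d dpi. Using %d instead."
             % (reported_dpi, K_MIN_CREDIBLE_RESOLUTION))
        return K_MIN_CREDIBLE_RESOLUTION
    return reported_dpi

print(credible_resolution(300))  # plausible, returned unchanged
print(credible_resolution(0))    # replaced by the minimum, with a warning
```

Note that the osdetect.cpp hunk only checks the lower bound, whereas this sketch follows the baseapi.cpp hunk and checks both.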