Tesseract: Page size with the new pdf option

Created on 15 Nov 2015 · 13 Comments · Source: tesseract-ocr/tesseract

The new pdf option to directly create a PDF with embedded text is awesome. Unfortunately, I haven't been able to figure out yet how to specify the page size (e.g. A4, letter, ...). Is that possible?

All 13 comments

Sorry, no. If the input image is A4 then the output PDF is A4. The design goal of Tesseract's PDF module is to not change anything about the image. If you want to modify page size, either change the input image or post process the output PDF.

Here's a tip on how to rescale all pages to, e.g., DIN A4:
https://github.com/Wikinaut/utils/wiki#scale_all_pages_in_PDF_to_A4

This issue should be closed. (Working as intended)

@jbreiden, how does Tesseract determine the page size of the input image? The page size depends on the DPI, which Tesseract has no information about. For example, an A5 page (5.83 x 8.27 inch) scanned at 300 DPI is 1748 x 2480 pixels.
The problem is that when the input image is 1748 x 2480 pixels, Tesseract outputs a PDF with a page size of 24.97 × 35.43 inch, not 5.83 x 8.27 inch.
Can you please reopen this issue, or should I create a new one?
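The arithmetic behind those numbers can be sketched as follows. The function name is hypothetical, and the 70 DPI figure is only an inference from the sizes reported above (24.97 × 35.43 inch is what 1748 x 2480 pixels gives at 70 DPI); the thread itself does not state which fallback value Tesseract uses.

```python
def page_size_inches(width_px, height_px, dpi):
    """Physical page size (in inches) implied by a pixel size and a resolution."""
    return width_px / dpi, height_px / dpi


# The A5 example from the comment above, at the intended 300 DPI.
w, h = page_size_inches(1748, 2480, 300)
print(f"{w:.2f} x {h:.2f} inch")  # 5.83 x 8.27 inch

# The same pixels at 70 DPI (an assumed fallback, inferred from the reported size).
w, h = page_size_inches(1748, 2480, 70)
print(f"{w:.2f} x {h:.2f} inch")  # 24.97 x 35.43 inch
```

So the same image yields very different page sizes depending only on which resolution is assumed.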

Resolution (DPI) is extracted from the header of the input image. If missing, then Tesseract has no choice but to make something up. Don't do that! Many tools can be used to inspect and adjust DPI for an input image file. If you want to use ImageMagick, the commands are "identify -verbose" to inspect and "mogrify -density 300x300 -units PixelsPerInch" to set.
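For PNG input, the resolution in the header lives in the optional pHYs chunk. The following is a minimal sketch of reading it with only the Python standard library, as an alternative to "identify -verbose"; the function name `png_dpi` is hypothetical, CRCs are not verified, and other formats (TIFF, JPEG) store resolution differently and are not handled.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dpi(data):
    """Return (x_dpi, y_dpi) from a PNG's pHYs chunk, or None if it is absent."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"pHYs" and length == 9:
            x_ppu, y_ppu, unit = struct.unpack(">IIB", body)
            if unit != 1:
                return None  # unit 0 means aspect ratio only, no physical size
            # unit 1 means pixels per metre; one inch is 0.0254 m.
            return x_ppu * 0.0254, y_ppu * 0.0254
        if chunk_type in (b"IDAT", b"IEND"):
            return None  # pHYs, if present, must come before the image data
        pos += 12 + length
    return None


# Usage (the file name is only an example):
# with open("my_file.png", "rb") as f:
#     print(png_dpi(f.read()))
```

A result of None means the file carries no resolution, which is the case where Tesseract has to guess.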

Is there any way to directly specify the image's DPI to Tesseract?

No, there is not. And I am reluctant to add this capability.

There should be an option to specify the size of the output PDF. The PDF page size is becoming very large. If anyone has found a way to avoid this, please let me know.

I have made related proposals many times, at least to make it possible to combine the original image and the OCRed text afterwards; all were dismissed. I fully support your proposal!

Set the dpi of the input images. Use mogrify from ImageMagick or similar.

I have a file whose original page format is A4.
When I convert it to images and then run OCR with Tesseract, the page size changes, as described here.

I tried setting the dpi using magick convert as below. However, the files kept the same size as the original ones.
What am I doing wrong?

$ identify my_file.png  
my_file.png PNG 3040x4560 3040x4560+0+0 8-bit Gray 2c 29989B 0.000u 0:00.000

$ convert my_file.png -page a4 my_file-1.png

$ identify my_file-1.png                        
my_file-1.png PNG 3040x4560 595x842+0+0 8-bit Gray 2c 35878B 0.000u 0:00.000

$ tesseract -l eng my_file-1.png my_file-1 pdf

$ pdfinfo my_file-1.pdf
Title:          
Producer:       Tesseract 3.04.01
CreationDate:   Wed Aug 19 11:26:53 2020 -03
...
Pages:          1
Page size:      3040 x 4560 pts
Page rot:       0
File size:      11336 bytes
Optimized:      no
PDF version:    1.5

However, if I skip tesseract, and export directly to pdf, everything is fine:

$ convert my_file.png -page A4 my_file-1.pdf && pdfinfo my_file-1.pdf
...
Producer:       https://imagemagick.org
CreationDate:   Wed Aug 19 16:43:17 2020 -03
ModDate:        Wed Aug 19 16:43:17 2020 -03
...
Pages:          1
Page size:      595.165 x 842.234 pts (A4)
Page rot:       0
File size:      45306 bytes
Optimized:      no
PDF version:    1.3

If I use mogrify, I can alter the resolution:

mogrify -density 360x360 -units PixelsPerInch my_file.png

And that indeed comes close to A4, but I have to calculate the resolution for each image accordingly.

If I don't know the original dpi which was used, how can I automatically set the image size using mogrify or tesseract (i.e., without having to calculate it manually for each image separately)?

My images are extracted directly from a PDF to which I want to add an OCR layer.
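One way to avoid working out the resolution by hand is to derive it from the pixel dimensions and the intended page size. This is a rough sketch, assuming the target really is A4 (210 mm x 297 mm); the function name is hypothetical, and the printed mogrify command just reuses the form shown above with the computed density.

```python
MM_PER_INCH = 25.4
A4_MM = (210, 297)  # assumed target page size


def dpi_to_fit(width_px, height_px, page_mm=A4_MM):
    """Smallest DPI at which the whole image fits inside the target page."""
    page_w_in = page_mm[0] / MM_PER_INCH
    page_h_in = page_mm[1] / MM_PER_INCH
    # The tighter dimension decides: a higher DPI means a smaller page.
    return max(width_px / page_w_in, height_px / page_h_in)


dpi = dpi_to_fit(3040, 4560)
print(round(dpi))  # 390
print(f"mogrify -density {dpi:.0f}x{dpi:.0f} -units PixelsPerInch my_file.png")
```

Because a 3040x4560 image has a 2:3 aspect ratio, it cannot match A4 exactly: at about 390 DPI the page is the full A4 height but narrower than A4. Getting an exact A4 page would still require padding the image or post-processing the PDF.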

FYI, I've implemented a utility that fixes the DPI of an image if you know its actual dimensions: Image Density Fixer, available for Linux from the Snap Store.

In my experience, the easiest way to get the PDF to any page size is pdfjam. It keeps the text overlay in place.
