Tesseract: Tesseract fails to process a large image with missing resolution data

Created on 10 Mar 2017 · 29 Comments · Source: tesseract-ocr/tesseract

The original image, in JPEG 2000 format, contains two pages from a newspaper. The latest Tesseract processes this image correctly. With the same image in TIFF, JPEG or PNG format, Tesseract fails and reports two empty pages.

This also happens with older versions of Tesseract.

stweil

👍3

Most helpful comment

Maybe. It is not clear why the dpi information is needed at all. I can read text of any dpi (just have to adapt the reading distance or get some glasses) without knowing the actual dpi value, and ideally OCR software can do that, too.

If the dpi value is important, we need an option to set it for images without (or with wrong) resolution metadata.

stweil on 23 Apr 2017

👍7 🎉1

All 29 comments

checking...

jbreiden on 10 Mar 2017

Quick side note. It's good to see that there is resolution metadata in the JP2. Remember to carry that over to other formats during conversion. It did not make it to the PNG file.

$ jhove ~/Downloads/0604.jp2  | grep -i sampling
      SamplingFrequencyUnit: centimeter
      XSamplingFrequency: 118.11
      YSamplingFrequency: 118.11
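
The jhove output reports the sampling frequency per centimeter, while Tesseract and mogrify talk in dots per inch. A small sketch of the unit conversion follows (the function name is only illustrative):

```python
CM_PER_INCH = 2.54

def per_cm_to_dpi(samples_per_cm: float) -> float:
    """Convert a sampling frequency in pixels per centimeter to pixels per inch."""
    return samples_per_cm * CM_PER_INCH

# The JP2 metadata above says 118.11 samples per centimeter.
dpi = per_cm_to_dpi(118.11)
print(round(dpi))  # 118.11 * 2.54 = 299.9994, which rounds to 300
```

So the JP2 file describes a 300 dpi scan, which matches the density used for the PNG later in this thread.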

jbreiden on 10 Mar 2017

Tesseract is able to find text when resolution metadata is properly set. Result is 54 megabytes, so a little too big to attach. But it works and you should be able to reproduce.

$ mogrify -density 300x300 -units PixelsPerInch 0604.png
$ tesseract -l ger 0604.png 0604 pdf
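
To check whether a converted PNG actually carries resolution metadata, one option is to look for the pHYs chunk, which is where the PNG format stores pixel density (in pixels per meter). The following is a minimal sketch using only the Python standard library; the function name png_dpi is made up for this example, and the code does not verify chunk CRCs.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
METERS_PER_INCH = 0.0254

def png_dpi(data: bytes):
    """Return (x_dpi, y_dpi) from a PNG's pHYs chunk, or None if absent or unitless."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos + 8 <= len(data):
        # Each chunk is: 4-byte length, 4-byte type, body, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if ctype == b"pHYs" and length == 9:
            x_ppu, y_ppu, unit = struct.unpack(">IIB", body)
            if unit != 1:  # 1 = pixels per meter; 0 = aspect ratio only
                return None
            return (x_ppu * METERS_PER_INCH, y_ppu * METERS_PER_INCH)
        if ctype in (b"IDAT", b"IEND"):  # pHYs must come before the image data
            return None
        pos += 12 + length
    return None
```

Usage would look like png_dpi(open("0604.png", "rb").read()). A result of None means the file has no usable resolution metadata; a 300 dpi image corresponds to roughly 11811 pixels per meter in the pHYs chunk.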

jbreiden on 10 Mar 2017

Do you think that Tesseract could handle missing resolution information in a more user-friendly way? I created the test images using convert 0604.jp2 0604.png (or similar for other formats). I could imagine Tesseract trying 300 dpi in addition to the 70 dpi which it claims to use:

tesseract 0604.png /tmp/0604-png
Info in bmfCreate: Generating pixa of bitmap fonts from string
Tesseract Open Source OCR Engine v4.00.00alpha-332-g4c5d0b5 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Empty page!!
Empty page!!

Does 70 dpi as default value make sense at all? And why does the resolution matter? Will Tesseract detect only characters of a certain size?

stweil on 10 Mar 2017

👍1

Maybe 70 was chosen because it was screen resolution back when dinosaurs walked the earth, and Tesseract was first written? Why does resolution matter? I'm guessing there are complicated heuristics somewhere in the code that try to guess at likely font sizes. For example, if I crop out a small piece of the newspaper and set it to 0 dpi, we get results. Sounds like investigation is needed. Or we can ask Ray.

PS. Irrespective of this bug, try to use good hygiene with resolution metadata. Maybe some day later you'll want to know what size the fonts are. Or something where you might regret losing the resolution metadata. I've seen it happen far too many times.

$ tesseract -l ger_old /tmp/foo.png -
Warning. Invalid resolution 0 dpi. Using 70 instead.
Magdeburg. [55996]

In das iit heute
bei der _ unter RNr. 151 verzeichneten
Fort-
fchritt, eingetragene
mit befohräufter Heofipflicht' in Dl.
venftedt eingetragen worden: Die Ge-
nofenfhaft ift durd BVefhluf der Ge-
neralverfammlung vom 16. Uuguft 1920
aufgelöft. Anuguft Üterwedde und Leo
Krötfi, beide in Olvenfiedt, find zu
Liquidatoren bejielt.

Magdeburg, dem 19. AÄuguft 1920.
OVa& IAmtenericht A A

jbreiden on 10 Mar 2017

😄1

Or we can ask Ray.

@theraysmith, the current code includes a hard-coded value of 70 dpi as the minimum resolution and raises any smaller resolution to that value. This is also done for images which don't include any resolution information ("0 dpi"). Maybe it would be better to assume 300 dpi for that special case. Why does the resolution matter at all?
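
To make the difference concrete, here is a rough sketch of the behavior described above next to the suggested alternative. This is not Tesseract's actual code; the function names and the 300 dpi fallback parameter are only illustrative.

```python
MIN_RESOLUTION = 70  # the hard-coded minimum described in this comment

def current_behavior(metadata_dpi: int) -> int:
    """Raise anything below the minimum (including a missing value, 0) to 70."""
    return max(metadata_dpi, MIN_RESOLUTION)

def suggested_behavior(metadata_dpi: int, fallback_dpi: int = 300) -> int:
    """Treat 0 (no metadata) as a special case and assume a typical scan resolution."""
    if metadata_dpi == 0:
        return fallback_dpi
    return max(metadata_dpi, MIN_RESOLUTION)

print(current_behavior(0))      # 70
print(suggested_behavior(0))    # 300
print(suggested_behavior(50))   # 70
print(suggested_behavior(600))  # 600
```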

stweil on 23 Apr 2017

IMO, assuming Tesseract really needs to know the resolution, when the dpi is absent or seems suspicious, the program should not try to guess the dpi and OCR the page. It should just print an error message.

amitdo on 23 Apr 2017

If the dpi value is important, we need an option to set it for images without (or with wrong) resolution metadata.

stweil on 23 Apr 2017

👍7 🎉1

The resolution is only used by layout analysis. It sets the size threshold below which a candidate is considered so ridiculously small that it can't possibly be text, i.e. it helps to distinguish text from noise.
There is also some auto scaling somewhere in the preprocessing to magnify low-resolution text. That scaling is not needed by the LSTM engine, but is needed by the legacy engine.

theraysmith on 25 Apr 2017

Sounds like implementation is disagreeing with intention. Where's the layout analysis resolution code?

jbreiden on 25 Apr 2017

https://github.com/tesseract-ocr/tesseract/search?q=resolution

amitdo on 25 Apr 2017

At the time that 70 minimum was set, it was a question of how to cover the most probable case.
Back in the day when most inputs came from a flatbed scanner, the resolution was provided.
Most images that did not have a resolution were screenshots at ~70ppi. In theory processing a 300ppi image at 70 should be less damaging to accuracy than processing a 70ppi image at 300, but that seems not to be the case in the original post. It would be worth taking a look at why that happens. There might be an easy fix.
Incidentally, most monitors today still give you not much more than 70ppi, (maybe 150) but they give you a bigger screen with even more small text on it. Only phones manage ~300ppi and maybe my new laptop, which has more pixels than my 24" monitors in less than half the area.

Now, when a lot of images come from camera phones, the resolution is largely unknown, and layout analysis requires some more work.

Incidentally, there is an easy way to set the resolution: set it in the Pix before passing it to TessBaseAPI.

theraysmith on 25 Apr 2017

I now have a reasonably general fix for the resolution issue.

There are multiple unsolved problems with the original 0604 image though:

There are large gaps between words, but tiny gaps between columns. That was causing column finding to fail, causing the blank page determination. The problem is that it sees the large gaps between words, which at 70 ppi look huge, and decides that it shouldn't merge them into textlines. Although that should be fixed, it is a highly dangerous thing to try without very careful testing.

The columns aren't straight. The layout analysis is fundamentally broken in such cases. It can't cut a straight line (even at an angle) through the very narrow bent gap between columns.

A general fix for resolution is to estimate the resolution based on the
measured body text size, which is available before the column finder is
constructed. That makes for an easy fix.
On the original 0604 image, it estimates the resolution to be 470 ppi but
still generates a poor layout analysis, due to the above problems.
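
The idea of estimating resolution from the measured body text size can be sketched roughly as follows. This is not the fix itself. It assumes that the heights of text-like blobs (in pixels) have already been measured, and that body text has an x-height of about 5 points, which is a guess made only for this example.

```python
from statistics import median

POINTS_PER_INCH = 72
ASSUMED_XHEIGHT_PT = 5.0  # assumption: typical body text x-height, in points

def estimate_ppi(blob_heights_px, assumed_xheight_pt=ASSUMED_XHEIGHT_PT):
    """Estimate pixels per inch from the median height of text-like blobs."""
    if not blob_heights_px:
        raise ValueError("need at least one measured blob height")
    xheight_px = median(blob_heights_px)  # median resists a few outliers (noise, headlines)
    return xheight_px * POINTS_PER_INCH / assumed_xheight_pt

# Hypothetical measurements: most blobs are about 21 px tall, one headline letter is 60 px.
print(estimate_ppi([20, 21, 21, 22, 60]))  # median is 21, so 21 * 72 / 5 = 302.4
```

The estimate is only as good as the assumed physical text size: newspaper body type is often smaller than book type, so the same pixel height can correspond to quite different true resolutions.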

theraysmith on 26 Apr 2017

Many thanks for this analysis and your efforts.

stweil on 27 Apr 2017

👍1

+1 this is biting me as well. I have a small demo which was working a year ago, but now it is giving:

> text <- ocr("http://jeroen.github.io/files/inlove.png")
Warning. Invalid resolution 0 dpi. Using 70 instead.
Too few characters. Skipping this page

I guess the problem is that the default resolution is too low?

jeroen on 9 Jun 2017

Hello. I'd like to propose adding a command-line option that lets the user manually specify the DPI, as the person who scanned the image usually knows what DPI was used.

Lin-Buo-Ren on 6 Jan 2018

👍1

Why not use a command line tool that modifies the dpi of an image file? Like mogrify -density.

jbreiden on 6 Jan 2018

@jbreiden
Thanks for the pointer. In fact, I had no idea the mogrify command existed. I still feel that it should be viable and simple to directly specify a parameter that the program cannot determine itself, though.

Lin-Buo-Ren on 6 Jan 2018

An override in Tesseract would produce PDF output with inconsistent metadata. The image's embedded resolution would disagree with the PDF image object metadata.

jbreiden on 7 Jan 2018

What if we only allow overriding the DPI when the image doesn't have any embedded resolution info so the inconsistency won't occur?

Lin-Buo-Ren on 7 Jan 2018

... when the image doesn't have any embedded resolution info

a18620c

amitdo on 7 Jan 2018

@amitdo is pointing out we already do exactly this. Which is a pretty good point. I still don't like it though. Somebody somewhere is inevitably going to build a document scanning product with this code, outputting PDF. Then someone else is going to re-OCR that data by extracting the images. And it will all go downhill due to missing or inconsistent resolution metadata. I've had so many problems with this sort of thing in life that I definitely don't want to encourage inconsistency. But I'm just one person with an opinion, and reasonable people can disagree.

jbreiden on 8 Jan 2018

Should Tesseract simply refuse to handle images without resolution metadata (instead of guessing the resolution and potentially producing wrong results)? That would solve my reported problem, too.

stweil on 8 Jan 2018

@stweil It's tempting. Let's think about this. We would lose the ability to OCR certain types of image files that don't support resolution metadata, like pnm. And it's a little hard to predict the chaos this might cause in the 341 packages that now depend on Tesseract. A possible compromise is to make PDF output fail when image metadata is unset, since that's the most problematic scenario. Honestly I'm not sure what is best.

jbreiden on 8 Jan 2018

If you are just searching for a workaround with the Java API of Tesseract, try this:

import static org.bytedeco.javacpp.tesseract.TessBaseAPI;

TessBaseAPI api = init();
tesseract.TessBaseAPISetImage2(api, image);
tesseract.TessBaseAPISetSourceResolution(api, 70);

This will simply set the resolution of the image to 70 dpi.
See https://stackoverflow.com/questions/47268601/suppress-warning-on-console-when-using-tess4j-for-ocring

asmaier on 23 Feb 2018

👍3

IMO there are 2 ways we can easily improve the situation:

1. implement kMinCredibleResolution as a parameter that the user can modify
2. implement an option for the tesseract app to set the dpi (e.g. --dpi 300) of the input image (with pixSetResolution)
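
The two suggestions can be combined in a small sketch of the selection logic. This is only an illustration of the proposal, not the tesseract command-line code: the program name, the --min-credible-resolution flag and the function names are hypothetical, and the default of 70 is the minimum mentioned earlier in this thread.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical command line exposing both proposed settings."""
    parser = argparse.ArgumentParser(prog="ocr-sketch")
    parser.add_argument("image")
    parser.add_argument("--dpi", type=int, default=None,
                        help="resolution to assume for the input image")
    parser.add_argument("--min-credible-resolution", type=int, default=70,
                        help="smallest embedded resolution that is trusted")
    return parser

def choose_resolution(embedded_dpi: int, user_dpi, min_credible: int) -> int:
    """Prefer an explicit user value; otherwise clamp the embedded value."""
    if user_dpi is not None:
        if user_dpi <= 0:
            raise ValueError("--dpi must be positive")
        return user_dpi
    return max(embedded_dpi, min_credible)

# An image without resolution metadata (0 dpi), with the user supplying --dpi 300.
args = build_parser().parse_args(["0604.png", "--dpi", "300"])
print(choose_resolution(0, args.dpi, args.min_credible_resolution))  # 300
```

Whether a user-supplied value should also override resolution metadata that is present in the image is a policy question raised earlier in the thread; this sketch lets the explicit value win in all cases.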