The original image in JPEG 2000 format includes two pages from a newspaper. This image is processed correctly by the latest Tesseract. Tesseract fails with the same image in TIFF, JPEG or PNG format and reports two empty pages.
This also happens with older versions of Tesseract.
checking...
Quick side note. It's good to see that there is resolution metadata in the JP2. Remember to carry that over to other formats during conversion. It did not make it to the PNG file.
$ jhove ~/Downloads/0604.jp2 | grep -i sampling
SamplingFrequencyUnit: centimeter
XSamplingFrequency: 118.11
YSamplingFrequency: 118.11
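For reference, the jhove values above are in pixels per centimeter. A quick conversion sketch (the function name is only illustrative) shows that they correspond to roughly 300 dpi:

```python
CM_PER_INCH = 2.54


def per_cm_to_dpi(samples_per_cm: float) -> float:
    """Convert a sampling frequency in pixels per centimeter to dots per inch."""
    return samples_per_cm * CM_PER_INCH


# 118.11 pixels/cm is the value reported by jhove for 0604.jp2.
print(round(per_cm_to_dpi(118.11)))  # 300
```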
Tesseract is able to find text when resolution metadata is properly set. The result is 54 megabytes, so a little too big to attach. But it works, and you should be able to reproduce it.
$ mogrify -density 300x300 -units PixelsPerInch 0604.png
$ tesseract -l ger 0604.png 0604 pdf
Do you think that Tesseract could handle missing resolution information in a more user friendly way? I created the test images using convert 0604.jp2 0604.png (or similar for other formats). I could imagine Tesseract trying 300 dpi in addition to the 70 dpi which it claims to use:
tesseract 0604.png /tmp/0604-png
Info in bmfCreate: Generating pixa of bitmap fonts from string
Tesseract Open Source OCR Engine v4.00.00alpha-332-g4c5d0b5 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Empty page!!
Empty page!!
Does 70 dpi as default value make sense at all? And why does the resolution matter? Will Tesseract detect only characters of a certain size?
Maybe 70 was chosen because it was screen resolution back when dinosaurs walked the earth, and Tesseract was first written? Why does resolution matter? I'm guessing there are complicated heuristics somewhere in the code that try to guess at likely font sizes. For example, if I crop out a small piece of the newspaper and set it to 0 dpi, we get results. Sounds like investigation is needed. Or we can ask Ray.
PS. Irrespective of this bug, try to use good hygiene with resolution metadata. Maybe some day later you'll want to know what size the fonts are. Or something where you might regret losing the resolution metadata. I've seen it happen far too many times.
$ tesseract -l ger_old /tmp/foo.png -
Warning. Invalid resolution 0 dpi. Using 70 instead.
Magdeburg. [55996]
In das iit heute
bei der _ unter RNr. 151 verzeichneten
Fort-
fchritt, eingetragene
mit befohräufter Heofipflicht' in Dl.
venftedt eingetragen worden: Die Ge-
nofenfhaft ift durd BVefhluf der Ge-
neralverfammlung vom 16. Uuguft 1920
aufgelöft. Anuguft Üterwedde und Leo
Krötfi, beide in Olvenfiedt, find zu
Liquidatoren bejielt.
Magdeburg, dem 19. AÄuguft 1920.
OVa& IAmtenericht A A

Or we can ask Ray.
@theraysmith, the current code includes a hard-coded value of 70 dpi as the minimum resolution and sets any smaller resolution to that value. This is also done for images which don't include any resolution information ("0 dpi"). Maybe it would be better to assume 300 dpi for that special case. Why does the resolution matter at all?
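As a simplified model of the behaviour described above (this is not the actual Tesseract source; the function name is hypothetical, and only the lower bound of 70 and the warning text are taken from this thread):

```python
MIN_CREDIBLE_RESOLUTION = 70  # hard-coded minimum mentioned in this thread


def effective_resolution(reported_dpi: int) -> int:
    """Return the resolution that would be used for a reported dpi value.

    Simplified model: anything below the minimum, including 0 (no
    resolution metadata at all), is replaced by the minimum.
    """
    if reported_dpi < MIN_CREDIBLE_RESOLUTION:
        print(f"Warning. Invalid resolution {reported_dpi} dpi. "
              f"Using {MIN_CREDIBLE_RESOLUTION} instead.")
        return MIN_CREDIBLE_RESOLUTION
    return reported_dpi
```

With this model, an image without metadata (0 dpi) and a real 300 dpi scan whose metadata was lost during conversion are treated identically, which is the case reported here.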
IMO, assuming Tesseract really needs to know the resolution, when the dpi is absent or seems suspicious, the program should not try to guess the dpi and ocr the page. It should just print an error message.
Maybe. It is not clear why the dpi information is needed at all. I can read text of any dpi (just have to adapt the reading distance or get some glasses) without knowing the actual dpi value, and ideally OCR software can do that, too.
If the dpi value is important, we need an option to set it for images without (or with wrong) resolution metadata.
The resolution is only used by layout analysis.
It sets the threshold size at which to call possible text so ridiculously
small that it can't possibly be text. I.e. it helps to distinguish text
from noise.
There is also some auto scaling somewhere in the preprocessing to magnify
low resolution text that is not needed by the LSTM engine, but is needed by
the legacy engine.
--
Ray.
Sounds like implementation is disagreeing with intention. Where's the layout analysis resolution code?
At the time that 70 minimum was set, it was a question of how to cover the most probable case.
Back in the day when most inputs came from a flatbed scanner, the resolution was provided.
Most images that did not have a resolution were screenshots at ~70ppi. In theory processing a 300ppi image at 70 should be less damaging to accuracy than processing a 70ppi image at 300, but that seems not to be the case in the original post. It would be worth taking a look at why that happens. There might be an easy fix.
Incidentally, most monitors today still give you not much more than 70ppi (maybe 150), but they give you a bigger screen with even more small text on it. Only phones manage ~300ppi and maybe my new laptop, which has more pixels than my 24" monitors in less than half the area.
Now, when a lot of images come from camera phones, the resolution is largely unknown, and layout analysis requires some more work.
Incidentally, there is an easy way to set the resolution: set it in the Pix before passing it to TessBaseAPI.
I now have a reasonably general fix for the resolution issue.
There are multiple unsolved problems with the original 0604 image though:
- There are large gaps between words, but tiny gaps between columns. That was causing column finding to fail, causing the blank page determination. The problem is that it sees the large gaps between words, which at 70 ppi look huge, and decides that it shouldn't merge them into textlines. Although that should be fixed, it is a highly dangerous thing to try without very careful testing.
- The columns aren't straight. The layout analysis is fundamentally broken in such cases. It can't cut a straight line (even at an angle) through the very narrow bent gap between columns.
A general fix for resolution is to estimate the resolution based on the measured body text size, which is available before the column finder is constructed. That makes for an easy fix.
On the original 0604 image, it estimates the resolution to be 470 ppi but still generates a poor layout analysis, due to the above problems.
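To illustrate the idea of estimating the resolution from the measured body text size, here is a rough sketch. It is not the actual fix: the assumed 10-point body text and the function name are made up for illustration, and the real code may well use a different rule.

```python
POINTS_PER_INCH = 72
# Assumption for illustration only: body text is typically about 10 points tall.
ASSUMED_BODY_TEXT_POINTS = 10.0


def estimate_resolution(body_text_height_px: float,
                        assumed_points: float = ASSUMED_BODY_TEXT_POINTS) -> int:
    """Estimate pixels per inch from the measured body text height in pixels.

    If text of `assumed_points` points is `body_text_height_px` pixels tall,
    then one inch (72 points) covers height_px * 72 / assumed_points pixels.
    """
    return round(body_text_height_px * POINTS_PER_INCH / assumed_points)


# Hypothetical measurement: body text about 65 pixels tall.
print(estimate_resolution(65))  # 468
```

The estimate is only as good as the assumed text size, so newspapers with unusually small or large type would shift the result.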
On Tue, Apr 25, 2017 at 5:20 AM, Amit D. notifications@github.com wrote:
https://github.com/tesseract-ocr/tesseract/search?q=resolution
--
Ray.
Many thanks for this analysis and your efforts.
+1 this is biting me as well. I have a small demo which was working a year ago, but now it is giving:
> text <- ocr("http://jeroen.github.io/files/inlove.png")
Warning. Invalid resolution 0 dpi. Using 70 instead.
Too few characters. Skipping this page
I guess the problem is that the default resolution is too low?
Hello. I'd like to propose adding a command-line option that lets the user specify the DPI manually, since the person who scanned the image usually knows which DPI was used.
Why not use a command line tool that modifies the dpi of an image file? Like mogrify -density.
@jbreiden
Thanks for the pointer. In fact, I had not heard of the mogrify command before. I still feel that it should be possible and simple to directly specify a parameter that the program cannot determine by itself, though.
An override in Tesseract would produce PDF output with inconsistent metadata. The image's embedded resolution would disagree with the PDF image object metadata.
What if we only allow overriding the DPI when the image doesn't have any embedded resolution info so the inconsistency won't occur?
... when the image doesn't have any embedded resolution info
a18620c
@amitdo is pointing out we already do exactly this. Which is a pretty good point. I still don't like it though. Somebody somewhere is inevitably going to build a document scanning product with this code, outputting PDF. Then someone else is going to re-OCR that data by extracting the images. And it will all go downhill due to missing or inconsistent resolution metadata. I've had so many problems with this sort of thing in life that I definitely don't want to encourage inconsistency. But I'm just one person with an opinion, and reasonable people can disagree.
Should Tesseract simply refuse to handle images without resolution metadata (instead of guessing the resolution and potentially producing wrong results)? That would solve my reported problem, too.
@stweil It's tempting. Let's think about this. We would lose the ability to OCR certain types of image files that don't support resolution metadata, like pnm. And it's a little hard to predict the chaos this might cause in the 341 packages that now depend on Tesseract. A possible compromise is to make PDF output fail when image metadata is unset, since that's the most problematic scenario. Honestly I'm not sure what is best.
If you are just searching for a workaround with the Java API of Tesseract, try this:
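import org.bytedeco.javacpp.tesseract;
import static org.bytedeco.javacpp.tesseract.TessBaseAPI;

// init() is assumed to be your own helper that creates and initializes the handle.
TessBaseAPI api = init();
tesseract.TessBaseAPISetImage2(api, image);
// Set the resolution explicitly after the image has been set.
tesseract.TessBaseAPISetSourceResolution(api, 70);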
This will simply set the resolution of the image to 70 dpi.
See https://stackoverflow.com/questions/47268601/suppress-warning-on-console-when-using-tess4j-for-ocring
IMO there are 2 ways we can easily improve the situation:
- make kMinCredibleResolution a parameter that the user can modify
- apply a user-specified resolution (e.g. --dpi 300) to the input image (with pixSetResolution)
Commit 345e5ee1f3e78d16927a667212623d9507cc4a63 allows the user to specify the dpi, e.g.:
tesseract 0604.jp2 0604_jp2 -l deu --dpi 300 pdf
I assume it is commit a0564fd?
Yes - I copied wrong commit ;-)