Tesseract: Tesseract inserting additional alternative characters

Created on 10 Apr 2018 · 13 Comments · Source: tesseract-ocr/tesseract

Environment

  • Tesseract Version: 3.x stable and 4.0 alpha/beta, for English-language text (using Fast and Best trained data), command line

  • Platform:

Current Behavior:

All of the Tesseract versions mentioned above tend to insert additional alternative characters, probably whenever the engine is not very confident. For example, if there's a "#" in the image file, it often outputs "#H" or "A#" or even "AH": that's 2 characters for 1. Another example: if there's a "$" in the image, it gives "S$" or "$s", etc. This happens very often for other characters such as 0, O, !, %, ^ and so on.
My application is very sensitive to the length of the string, so an extra character throws many things off.
I am currently a command-line user and may later use it from Java whenever a wrapper for 4.0 becomes available.

Expected Behavior:

I expect Tesseract to output exactly one character for each character in the image. I should be able to control this behaviour using command-line parameters (assuming there isn't one already). I have looked into the parameters, but there are hundreds and most are not self-explanatory, hence raising this as an issue. Also is it possible to get a "Character-level" HOCR output - current one is at word level granularity.

Suggested Fix:

accuracy

All 13 comments

For English, please also try the file from tessdata repo with --oem 0.

That will use the legacy tesseract engine. It is possible that it will be a
better fit for your use case.
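
A minimal sketch of how that suggestion could be scripted from Python. The helper names `legacy_command` and `run_legacy` and the file names are hypothetical; only `--oem 0` comes from the comment above. It assumes `tesseract` is on the PATH and that the installed `eng` traineddata comes from the tessdata repo, since the Fast and Best files do not ship the legacy model.

```python
import subprocess


def legacy_command(image_path, output_base, lang="eng"):
    """Build a tesseract command line that selects the legacy engine."""
    # --oem 0 = legacy engine only; the traineddata file must include
    # the legacy model (assumption: installed from the tessdata repo).
    return ["tesseract", image_path, output_base, "-l", lang, "--oem", "0"]


def run_legacy(image_path, output_base, lang="eng"):
    """Run tesseract with the legacy engine; requires tesseract on PATH."""
    return subprocess.run(
        legacy_command(image_path, output_base, lang),
        check=True,
        capture_output=True,
        text=True,
    )
```

Calling `run_legacy("testImage.png", "out")` would write the recognized text to `out.txt`, which can then be compared against the LSTM result for the same image.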

On Tue 10 Apr, 2018, 6:28 PM jghare wrote (issue text quoted in full, as above):
https://github.com/tesseract-ocr/tesseract/issues/1465

Hi Shreeshrii
When I say English I mean the English alphabet and special characters. The words themselves are not dictionary words and are cryptic and long sequences of mixed characters... 20-30 characters long.
The 4.0 alpha and beta give me far superior results on the OCR than legacy on my images. Is there no way to tell tesseract 4.0 to not insert extra alternatives?
It would also be good to be able to give it a whitelist of characters. I see that an issue for that is also open...

Please fix this.. It's a big problem.

If it is a big problem, then provide a use case. A description in words is difficult to test, and developers are forced to spend time working out what your problem is instead of solving it.

#1011 might be related

There are a number of issues regarding this, for different languages etc. Listing them below.

Incorrect recognition of specific words - additional letters inserted #1011

tesseract add similar characters in Japanese text (ambiguity management?) #1063

German - Characters added to result multiple times (aä / AÄ) #1060

Tesseract LSTM 4.0: letters repeat in recognized text #884

Possibly related:

recognizes more characters than present #1362

This is still present in the latest master branch. It seems to happen after retraining (fine-tuning) the original tessdata files (eng, in my case) and appears to be a result of ambiguous output from the LSTM: it provides more than one character for a bounding box (or at least that's how it appears without actually checking), i.e. it outputs its possible or "unconfident" characters as well. More training does seem to balance this out slightly, but it's very hit or miss.

> When I say English I mean the English alphabet and special characters. The words themselves are not dictionary words and are cryptic and long sequences of mixed characters... 20-30 characters long.

In that case try to disable the dictionary.
Also try to fine tune the model.
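
The comment does not say how to disable the dictionary. One commonly documented approach is to set the `load_system_dawg` and `load_freq_dawg` config variables to 0, and the sketch below assumes that is what is meant. The helper name `no_dictionary_command` and the file names are hypothetical, and whether this reduces the inserted characters for a given image is untested here.

```python
def no_dictionary_command(image_path, output_base, lang="eng"):
    """Build a tesseract command line with the word lists switched off."""
    # Assumption: these two config variables control the system and
    # frequent-word dictionaries; the thread itself does not name them.
    config = {"load_system_dawg": "0", "load_freq_dawg": "0"}
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    for name, value in config.items():
        cmd += ["-c", f"{name}={value}"]
    return cmd
```

The returned list can be passed to `subprocess.run(...)`. Printing it with `" ".join(cmd)` gives the equivalent command line to paste into a shell.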

> Also is it possible to get a "Character-level" HOCR output - current one is at word level granularity.

Yes, it is. Add -c hocr_char_boxes=1 hocr to your command line. The output has this format:

<span class='ocrx_word' id='word_1_1' title='bbox 16 18 206 71; x_wconf 42'>
             <span class='ocrx_cinfo' title='x_bboxes 16 19 42 71; x_conf 99.041275'>B</span>
             <span class='ocrx_cinfo' title='x_bboxes 49 20 76 71; x_conf 99.038635'>A</span>
             <span class='ocrx_cinfo' title='x_bboxes 84 19 107 70; x_conf 98.950821'>S</span>
             <span class='ocrx_cinfo' title='x_bboxes 117 19 139 69; x_conf 91.848969'>O</span>
             <span class='ocrx_cinfo' title='x_bboxes 148 19 174 70; x_conf 99.027092'>B</span>
             <span class='ocrx_cinfo' title='x_bboxes 181 18 206 69; x_conf 98.989304'>C</span>
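
For a length-sensitive application, the per-character spans can be read back to count characters and spot low-confidence ones. The following is a rough sketch that assumes the output looks exactly like the excerpt above (single-quoted attributes, `x_bboxes` followed by `x_conf`). The function names and the threshold are hypothetical, and a real hOCR parser would be more robust than a regular expression.

```python
import html
import re

# Matches one character span in the format shown in the excerpt above.
CINFO_RE = re.compile(
    r"<span class='ocrx_cinfo' "
    r"title='x_bboxes (\d+) (\d+) (\d+) (\d+); x_conf ([\d.]+)'>(.*?)</span>"
)


def parse_char_boxes(hocr_text):
    """Return one dict per character: text, bounding box, confidence."""
    chars = []
    for m in CINFO_RE.finditer(hocr_text):
        x0, y0, x1, y1 = (int(m.group(i)) for i in range(1, 5))
        chars.append({
            "char": html.unescape(m.group(6)),  # e.g. "&amp;" becomes "&"
            "bbox": (x0, y0, x1, y1),
            "conf": float(m.group(5)),
        })
    return chars


def low_confidence(chars, min_conf):
    """Return the characters whose confidence is below min_conf."""
    return [c for c in chars if c["conf"] < min_conf]
```

With the excerpt above, `parse_char_boxes` would yield six entries, so `len(...)` gives the character count directly, and `low_confidence(chars, 95.0)` would single out the "O" at confidence 91.848969.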

Hi,
I tried to use it, but it is not working for me. Any idea why?

C:\Program Files (x86)\Tesseract-OCR>tesseract testImage.PNG out -l check -c hocr_char_boxes=1 hocr
Could not set option: hocr_char_boxes=1
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
OSD: Weak margin (0.63), horiz textlines, not CJK: Don't rotate.
Detected 3 diacritics

It looks like this config option is no longer there. I want the output at the character level, but it does not seem possible.

@jghare, can you provide some simple images which show this issue? That would help testing new code which tries to fix it.
