Tesseract: How to specify min confidence "level" below which replacement char should be used?

Created on 24 Nov 2019 · 4 Comments · Source: tesseract-ocr/tesseract

Is there any command-line parameter to set an arbitrary confidence "level" (in text mode) below which the program should not try to "guess" characters but use a replacement character instead? With this I could correct bad words by searching for the replacement characters, which would speed up manual work.

5.0-alpha
Win10 64bit

question

All 4 comments

Please respect the guidelines for posting issues: use the Tesseract user forum for questions and support.

This question is an _issue_

IIRC there is no rejection mechanism for the LSTM models (yet). There used to be plenty of related parameters in the legacy engine (see tesseract --print-parameters | grep rej), but whether any of these will ever be supported again is unclear. (Rejection in the LSTM beam decoder is possible in principle, but would probably need distinct parameters.)

In the meantime, you can emulate this to some degree by looking at the confidences of character outputs yourself:

  • with ALTO-XML output (alto config): WC on the word level
  • with hOCR output (hocr config): x_wconf on the word level, x_conf on the character level
  • with TSV output (tsv config): conf (second-last) column on the word level
  • using the API: ResultIterator.Confidence() (on the word or character level), ChoiceIterator.Confidence() (on the character alternative level)
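As a rough illustration of the TSV route, here is a minimal Python sketch that post-processes TSV output (e.g. produced with the tsv config) and masks every word whose conf falls below a threshold. It assumes the TSV has the usual header row with level, page_num, block_num, par_num, line_num, conf and text columns, and that word rows have level 5. The function name mask_low_confidence, the default threshold of 60 and the choice of U+FFFD as replacement character are illustrative assumptions, not anything Tesseract provides.

```python
import csv
import io

# Hypothetical choice of replacement character (U+FFFD); any marker works.
REPLACEMENT = "\ufffd"


def mask_low_confidence(tsv_text, min_conf=60.0, replacement=REPLACEMENT):
    """Rebuild plain text from Tesseract TSV, masking low-confidence words.

    Assumes a header row naming the columns and word rows at level 5.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    lines = []
    current_key = None
    for row in reader:
        if row["level"] != "5":  # skip page/block/paragraph/line rows
            continue
        word = row["text"] or ""
        if not word.strip():
            continue
        # Words sharing these four indices belong to the same text line.
        key = (row["page_num"], row["block_num"], row["par_num"], row["line_num"])
        if key != current_key:
            lines.append([])
            current_key = key
        conf = float(row["conf"])  # may be an integer or a float string
        if conf >= min_conf:
            lines[-1].append(word)
        else:
            # Keep the word length so the masked text stays roughly aligned.
            lines[-1].append(replacement * len(word))
    return "\n".join(" ".join(words) for words in lines)
```

The masked words can then be found by searching for the replacement character. Note that this works on the word level only; character-level masking would need the hOCR x_conf values or the API iterators instead.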

This question is an _issue_

@spajak, the question may be an issue for you, but I don't think it meets the guidelines for an issue according to the docs of this repository. https://github.com/tesseract-ocr/tesseract/blob/master/CONTRIBUTING.md

Creating an Issue or Using the Forum

If you think you found a bug in Tesseract, please create an issue.

Use the users mailing-list instead of creating an Issue if ...

You have problems using Tesseract and need some help.
You have problems installing the software.
You are not satisfied with the accuracy of the OCR, and want to ask how you can improve it. Note: You should first read the ImproveQuality documentation.
You are trying to train Tesseract and you have a problem and/or want to ask a question about the training process. Note: You should first read the official guides [1] or [2] found in the project documentation.
You have a general question.