Tesseract: "a" and "o" vowels replaced by unicode ordinal indicators for Portuguese language in the output

Created on 16 Aug 2017 · 10 Comments · Source: tesseract-ocr/tesseract

Environment

  • Tesseract Version: 4.0
  • Commit Number: 7afa05a03ed87a95a46da798fd11cae60f390441
  • Platform: Ubuntu 17.04

Current Behavior:

For the following source images, the "a" and "o" vowels are replaced by "ª" and "º" in the output (not the superscript lower-case "a" and "o", but the feminine and masculine ordinal indicators, U+00AA and U+00BA):

tag

tag2

$ tesseract notebooks/tag2.png stdout -l por
Declªrªçãº de Nªscidº Vivº
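
To confirm which code points are actually present in the output, the text can be inspected character by character. The following is a minimal Python sketch, not part of Tesseract; the helper names `find_ordinal_indicators` and `replace_ordinal_indicators` are hypothetical. The replacement helper is only a stopgap for post-processing and does not address the underlying recognition problem.

```python
import unicodedata

# Code points reported in this issue: the ordinal indicators that
# replace the plain vowels in the OCR output.
ORDINALS = {"\u00aa": "a", "\u00ba": "o"}


def find_ordinal_indicators(text):
    """Return (index, char, Unicode name) for each ordinal indicator in text."""
    return [
        (i, ch, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in ORDINALS
    ]


def replace_ordinal_indicators(text):
    """Blindly map U+00AA -> "a" and U+00BA -> "o".

    Stopgap only: this also rewrites legitimate ordinals such as "1º".
    """
    return text.translate(str.maketrans(ORDINALS))


# The faulty output from the report, written with escapes for clarity.
sample = "Decl\u00aar\u00aa\u00e7\u00e3\u00ba de N\u00aascid\u00ba Viv\u00ba"
for index, char, name in find_ordinal_indicators(sample):
    print(index, char, name)
print(replace_ordinal_indicators(sample))
```

Because Portuguese and Spanish use these indicators legitimately (for example in "1º" or "2ª"), a blanket replacement like this is unsafe for general text.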

Expected Behavior:

"a" and "o" verbatim:

$ tesseract notebooks/tag2.png stdout -l por
Declaração de Nascido Vivo

What I have tried:

  • everything in the ImproveQuality wiki page
  • changing page segmentation modes and OCR engine
  • messing with character whitelists and blacklists
  • custom user word and pattern files
  • different fonts

Nothing made any difference.

Previous versions:

With 3.04 the output is as expected.

All 10 comments

Hi!

The same thing happens in Spanish with the letter 'o'.

screen shot 2017-08-16 at 6 06 34 pm

Original image
http://recursostic.educacion.es/observatorio/web/images/upload/1observatorio/iconos_art/texto.jpg

Confirmed, the same happens with the "o" in Spanish.

Can't reproduce this issue.

tesseract por2.png por2 -l best/por

Or

tesseract por2.png por2 -l best/por --oem 1

Output:

Declaração de Nascido Vivo

Nice. I was using pre-built binaries for Ubuntu from Alexander Pozdnyakov.

Problem is fixed in that repository now, new binaries were built from commit 7afa05a03ed87a95a46da798fd11cae60f390441

Please ignore the previous update, the package tesseract-ocr-por is not fixed yet.

I had to download from https://github.com/tesseract-ocr/tessdata/blob/master/best/por.traineddata and copy to /usr/share/tesseract-ocr/4.00/tessdata/por.traineddata

You can also put the traineddata in a sub directory (like 'best') under your tessdata dir and use -l subdir/lang
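
The subdirectory approach can be sketched in Python as follows. This is an illustration under assumptions: the helper names `install_traineddata` and `tesseract_command` are hypothetical, and the tessdata location must be supplied by the caller (the thread mentions /usr/share/tesseract-ocr/4.00/tessdata, which normally requires root to write to).

```python
import shutil
from pathlib import Path

SUFFIX = ".traineddata"


def install_traineddata(src, tessdata_dir, subdir="best"):
    """Copy a .traineddata file into <tessdata_dir>/<subdir>/.

    Returns the value to pass to tesseract's -l option, e.g. "best/por".
    """
    src = Path(src)
    if not src.name.endswith(SUFFIX):
        raise ValueError(f"expected a {SUFFIX} file, got {src.name!r}")
    target_dir = Path(tessdata_dir) / subdir
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, target_dir / src.name)
    # "por.traineddata" -> "por"
    lang = src.name[: -len(SUFFIX)]
    return f"{subdir}/{lang}"


def tesseract_command(image, lang):
    """Build the argument list for the invocation form used in this thread."""
    return ["tesseract", str(image), "stdout", "-l", lang]
```

With a downloaded por.traineddata, `install_traineddata("por.traineddata", tessdata_dir)` returns "best/por", and the list from `tesseract_command("tag2.png", "best/por")` can be passed to `subprocess.run`. Keeping the file in a subdirectory leaves the packaged por.traineddata untouched, so both models stay selectable.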

I got the "new" files from here, but I still get the same result.

@paulaceccon where did you copy them to and how are you calling tesseract? (in order to help you diagnose the problem we must be able to reproduce it)
