Tesseract: "a" and "o" vowels replaced by unicode ordinal indicators for Portuguese language in the output

Created on 16 Aug 2017 · 10 Comments · Source: tesseract-ocr/tesseract

Environment

Tesseract Version: 4.0
Commit Number: 7afa05a03ed87a95a46da798fd11cae60f390441
Platform: Ubuntu 17.04

Current Behavior:

For the following source images, the "a" and "o" vowels are replaced by "ª" and "º" in the output. These are not the superscript lower-case "a" and "o" but the feminine and masculine ordinal indicators, U+00AA and U+00BA:

[Image: tag]

[Image: tag2]

$ tesseract notebooks/tag2.png stdout -l por
Declªrªçãº de Nªscidº Vivº

Expected Behavior:

"a" and "o" verbatim:

$ tesseract notebooks/tag2.png stdout -l por
Declaração de Nascido Vivo
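
As a stopgap while the wrong characters are still being produced, the output can be post-processed. The following is a minimal sketch, not part of Tesseract: `fix_ordinals` is a hypothetical helper name, and it assumes every "ª" or "º" in the text is a misrecognized vowel. That assumption breaks for text with genuine ordinals such as "1º", which would become "1o".

```shell
# Hypothetical helper: map the ordinal indicators U+00AA and U+00BA
# back to plain "a" and "o". Reads stdin, writes stdout.
# Assumption: the text contains no legitimate ordinal indicators.
fix_ordinals() {
  sed -e 's/ª/a/g' -e 's/º/o/g'
}

# Demonstration on the output quoted in this report.
printf 'Declªrªçãº de Nªscidº Vivº\n' | fix_ordinals
# prints: Declaração de Nascido Vivo
```

In use, the helper would sit at the end of the pipeline: `tesseract notebooks/tag2.png stdout -l por | fix_ordinals`.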

What I have tried:

everything in the ImproveQuality wiki page
changing page segmentation modes and OCR engine
messing with character whitelists and blacklists
custom user word and pattern files
different fonts

Nothing made any difference.

Previous versions:

With 3.04 the output is as expected.

Source

scardine

All 10 comments

Hi!

The same thing happens in Spanish with the letter 'o'.

[Image: screen shot 2017-08-16 at 6 06 34 pm]

Original image
http://recursostic.educacion.es/observatorio/web/images/upload/1observatorio/iconos_art/texto.jpg

tomascharad on 16 Aug 2017

Confirmed, the same happens with the "o" in Spanish.

scardine on 16 Aug 2017

👍1

https://en.wikipedia.org/wiki/Ordinal_indicator

amitdo on 17 Aug 2017

Can't reproduce this issue.

tesseract por2.png por2 -l best/por

tesseract por2.png por2 -l best/por --oem 1

Output:

Declaração de Nascido Vivo

amitdo on 17 Aug 2017

https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#updated-data-files-for-version-400

amitdo on 17 Aug 2017

Nice. I was using pre-built binaries for Ubuntu from Alexander Pozdnyakov.

The problem is fixed in that repository now; new binaries were built from commit 7afa05a03ed87a95a46da798fd11cae60f390441.

scardine on 17 Aug 2017

Please ignore the previous update; the package tesseract-ocr-por is not fixed yet.

I had to download the file from https://github.com/tesseract-ocr/tessdata/blob/master/best/por.traineddata and copy it to /usr/share/tesseract-ocr/4.00/tessdata/por.traineddata

scardine on 17 Aug 2017

You can also put the traineddata in a subdirectory (like 'best') under your tessdata dir and use -l subdir/lang.

amitdo on 17 Aug 2017

❤1 👍1
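
The subdirectory approach could be scripted roughly as follows. This is a sketch under assumptions: `install_traineddata` is a hypothetical helper, the model file is assumed to be downloaded already, and the tessdata path in the usage comment is the one quoted earlier in this thread, which may differ on other systems and usually needs root to write to.

```shell
# Hypothetical helper: copy a downloaded model into <tessdata>/<subdir>/
# and print the value to pass to tesseract's -l option.
#   $1 = path to the downloaded .traineddata file
#   $2 = tessdata directory
#   $3 = subdirectory name (defaults to "best")
install_traineddata() {
  it_src="$1"
  it_tessdata="$2"
  it_subdir="${3:-best}"
  it_lang="$(basename "$it_src" .traineddata)"
  mkdir -p "$it_tessdata/$it_subdir" || return 1
  cp "$it_src" "$it_tessdata/$it_subdir/$it_lang.traineddata" || return 1
  printf '%s/%s\n' "$it_subdir" "$it_lang"
}

# Usage (paths are assumptions; run with sufficient permissions):
#   lang_arg="$(install_traineddata ./por.traineddata /usr/share/tesseract-ocr/4.00/tessdata)"
#   tesseract notebooks/tag2.png stdout -l "$lang_arg"
```

Keeping the new model in its own subdirectory leaves the packaged por.traineddata untouched, so both can be compared by switching between `-l por` and `-l best/por`.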

I got the "new" files from here, but I still get the same result.

paulaceccon on 14 Nov 2017

@paulaceccon Where did you copy them to, and how are you calling tesseract? To help you diagnose the problem, we must be able to reproduce it.

scardine on 14 Nov 2017
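
Information like the following would answer both questions. It is only a sketch: `report_traineddata` is a hypothetical helper, and the directory passed to it is whatever location the files were copied to. It relies on the `--version`, `--list-langs` and `--tessdata-dir` options of the tesseract command line, and skips those calls when tesseract is not on the PATH.

```shell
# Hypothetical helper: report whether <dir>/<lang>.traineddata exists,
# its size, and (if tesseract is installed) the version and the
# languages tesseract can see in that directory.
#   $1 = tessdata directory to inspect
#   $2 = language code, e.g. por
report_traineddata() {
  rt_dir="$1"
  rt_lang="$2"
  rt_file="$rt_dir/$rt_lang.traineddata"
  if [ -f "$rt_file" ]; then
    rt_size="$(wc -c < "$rt_file" | tr -d ' ')"
    printf 'found %s (%s bytes)\n' "$rt_file" "$rt_size"
  else
    printf 'missing %s\n' "$rt_file"
    return 1
  fi
  if command -v tesseract >/dev/null 2>&1; then
    tesseract --version 2>&1 | head -n 1 || true
    tesseract --list-langs --tessdata-dir "$rt_dir" 2>&1 || true
  fi
  return 0
}

# Usage (the path is an assumption taken from earlier in the thread):
#   report_traineddata /usr/share/tesseract-ocr/4.00/tessdata por
```

Posting this output together with the exact tesseract command line shows which model file is actually being loaded.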
