Tesseract: Tesseract doesn't recognize multiple languages

Created on 16 May 2018 · 36 Comments · Source: tesseract-ocr/tesseract

If I run tesseract page356.png page356 -l eng+osd+ell pdf

it only recognizes the English characters, and it reports no errors about the other languages.

If I run tesseract page356.png page356greek -l ell

It recognizes the Greek fine, but now there is no English

If I run tesseract page356.png greekandenglish356 -l ell+eng+osd pdf, I get this PDF:
greekandenglish356.pdf

It only recognizes English.

I installed the language data with apt-get install tesseract-ocr-all

and I'm experiencing this issue on multiple Linux distros.

Here is a sample image
1200_page_356

All 36 comments

AFAIK multiple language support is only available in version 4+, which is not necessarily available in the distro's default repo. Version 3 only extracts one language at a time.

Multiple languages are supported on v3.x

Share a sample image please.

Added a sample image and PDF. I have tried 3.x on openSUSE and 4.x on Ubuntu-based distros.

It gives very poor results, but Tesseract 4 is producing a mix of Greek & English. This is also true in PDF output. The problem does not reproduce for me.

$ tesseract -l ell+eng example.png - -

[...]
6- 5- 4- 3- 2- α- β- N- Entry Name
[...]

And I even see some Greek in your attached PDF. There is a β in there.

Also, try with the script trained data

https://github.com/tesseract-ocr/tessdata_best/blob/master/script/Greek.traineddata

It should have both Greek and English.

Could this issue be closed?
There has been no reaction from the original reporter for several months.

zdenop, 30/09/2018 17:19:

There is no reaction from original reporter for several months....

What information is needed? I have the same experience with tesseract
3.05.02, but I don't understand from the responses above whether this is
considered the expected result until version 4.

The reporter claims that only English characters are recognized, but jbreiden sees Greek letters in the output.
Shreeshrii provided several suggestions for testing, and there was no reply.

If the reporter does not care, why should we? We are not paid for this. If a reporter does not value our time and cooperate, then I will close such a report (unless it has some value for this project).

Same problem reported in the forum today, but for Thai and English.

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/YLvnrS-01kI/x0PUNGsGBAAJ

en_th

tesseract 4.0.0-beta.4-179-g57a6
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0

As mentioned by the OP in the forum, if the input image has both languages (eng+thai) on the same line, it will read only in 1 language, but when a line has a single language, it will read it in the correct language.

script/Thai.traineddata seems to give correct result.

ubuntu@tesseract-ocr:~/TEST$ bash ./en_th.sh

 *****  ./en_th.jpg LANG tha+eng TESSDATA tessdata OEM 1 PSM 3 ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 376
1โล!10 สวัสดีจ้า
This is a test.
นี่คือการทดสอบ

 *****  ./en_th.jpg LANG tha+eng TESSDATA tessdata_best OEM 1 PSM 3 ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 376
1โล!1āđ0 สวัสดีจ้า
This is a test.
นี่คือการทดสอบ

 *****  ./en_th.jpg LANG tha+eng TESSDATA tessdata_fast OEM 1 PSM 3 ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 376
1โล!10 สวัสดีจ้า
This is a test.
นี่คือการทดสอบ

 *****  ./en_th.jpg LANG eng+tha TESSDATA tessdata OEM 1 PSM 3 ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 376
1โล!10 สวัสดีจ้า
This is a test.
นี่คือการทดสอบ

 *****  ./en_th.jpg LANG eng+tha TESSDATA tessdata_best OEM 1 PSM 3 ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 376
Hello aaa
This is a test.
นี่คือการทดสอบ

 *****  ./en_th.jpg LANG eng+tha TESSDATA tessdata_fast OEM 1 PSM 3 ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 376
Hello ayaa
This is a test.
นี่คือการทดสอบ

 *****  ./en_th.jpg SCRIPT Thai TESSDATA tessdata_best OEM 1 PSM 3 ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 376
Hello สวัสดีจ้า
This is a test.
นี่คือการทดสอบ

 *****  ./en_th.jpg SCRIPT Thai TESSDATA tessdata_fast OEM 1 PSM 3 ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 376
Hello สวัสดิจ้า
This is a test.
นี่คือการทดสอบ
DONE
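
The en_th.sh script itself is not shown in this thread. A minimal sketch of a script that would produce output in this shape could look like the following. The function names, the TESSERACT_BIN variable, and the directory layout (one subdirectory per tessdata repo under a common root) are assumptions for illustration only.

```shell
# Hypothetical reconstruction of a test matrix like the one logged above.
# The command to invoke; override it to point at a specific build.
TESSERACT_BIN="${TESSERACT_BIN:-tesseract}"

# Print a header in the same shape as the log lines above.
print_header() {
  local image="$1" lang="$2" repo="$3" oem="$4" psm="$5"
  printf ' *****  %s LANG %s TESSDATA %s OEM %s PSM %s ****\n' \
    "$image" "$lang" "$repo" "$oem" "$psm"
}

# Run one image through both language orders and all three tessdata repos.
# Usage: run_matrix IMAGE DATA_ROOT
run_matrix() {
  local image="$1" data_root="$2" lang repo
  for lang in tha+eng eng+tha; do
    for repo in tessdata tessdata_best tessdata_fast; do
      print_header "$image" "$lang" "$repo" 1 3
      # "stdout" as the output base sends the recognized text to stdout.
      "$TESSERACT_BIN" "$image" stdout --tessdata-dir "$data_root/$repo" \
        --oem 1 --psm 3 -l "$lang"
    done
  done
  echo DONE
}
```

It would be called as run_matrix ./en_th.jpg "$HOME" if the three tessdata checkouts live directly under the home directory.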

Last(?) change from Ray regarding multi-language mode seems to be

https://github.com/tesseract-ocr/tesseract/commit/b453f74e0194f2cf08e9251b1846a0132657c4f8

Any updates on this issue? I am using Tesseract v4 to detect the text "Bœuf Stroganoff" using German and French traineddata. Text detection doesn't work when I use traineddata for multiple languages together.

You can try Latin.traineddata from best/fast.

@amitdo Tried that. It doesn't work well either. The command I used:
tesseract <input image> <output file> -l lat

-l lat is the Latin language.

Use script/Latin which is Latin script and has been trained using all
languages using that script.
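
To tell the two apart on a given install, something like the following might help. It is only a sketch: has_model and TESSERACT_BIN are made-up names, and it assumes that --list-langs on your build lists models in subdirectories as script/Latin, which may not hold for older versions.

```shell
# The command to invoke; override it to point at a specific build.
TESSERACT_BIN="${TESSERACT_BIN:-tesseract}"

# Succeeds if MODEL (for example "lat" or "script/Latin") is listed as installed.
has_model() {
  local model="$1"
  # --list-langs prints a heading line followed by one model name per line.
  # -x matches whole lines only, so "lat" does not match "script/Latin".
  "$TESSERACT_BIN" --list-langs 2>&1 | grep -qxF -- "$model"
}

# Example use (not run here):
#   has_model script/Latin && tesseract input.png output -l script/Latin
#   has_model lat          && tesseract input.png output -l lat
```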

@Shreeshrii Even the predictions with tesseract <input image> <output file> -l script\Latin are disappointing. It wrongly predicts the original text as "_Bouf Stroganoff_". Am I missing something? When multiple languages are given, is there no way to make it predict for all of them and not only for the first one listed? I previously also tried several combinations such as tesseract <input image> <output file> -l deu+fra, where it would not predict the French properly. However, the same works properly if the order of the languages is reversed, i.e. tesseract <input image> <output file> -l fra+deu.

Please provide a test image. We need to test whether this is a regression.

test _This image shows the character set I am targeting_
test1 _This image shows the text I was experimenting with, as mentioned in the previous comments_

Any input on these would be highly appreciated.

https://github.com/tesseract-ocr/langdata/issues/83#issuecomment-375027879

theraysmith commented on Mar 21

I did have an idea for a better multi-language implementation that would cleanly use models from multiple languages at once, but that depends on getting rid of the old code, and moving the multi-language functionality into the beam search. Until the old code is gone, that would be very messy. …

I can replicate this.

It seems to me that œ is only trained for French. However, it is not being recognized when French is listed second.

./euro.png OEM 1 PSM 6 LANG deu+fra TESSDATA tessdata ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Boeuf Stroganoff

./euro.png OEM 1 PSM 6 LANG fra+deu TESSDATA tessdata ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff

./euro.png OEM 1 PSM 6 LANG eng+fra TESSDATA tessdata ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Boeuf Stroganoff

./euro.png OEM 1 PSM 6 LANG fra+eng TESSDATA tessdata ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff

./euro.png OEM 1 PSM 6 LANG script/Latin TESSDATA tessdata ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bouf Stroganoff

./euro.png OEM 1 PSM 6 LANG fra TESSDATA tessdata ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff

Results with all three repos, tessdata, tessdata_fast and tessdata_best

./euro.png OEM 1 PSM 6 LANG deu+fra TESSDATA tessdata ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Boeuf Stroganoff

./euro.png OEM 1 PSM 6 LANG deu+fra TESSDATA tessdata_best ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Boeuf Stroganoff

./euro.png OEM 1 PSM 6 LANG deu+fra TESSDATA tessdata_fast ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Boeuf Stroganoff

./euro.png OEM 1 PSM 6 LANG fra+deu TESSDATA tessdata ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff

./euro.png OEM 1 PSM 6 LANG fra+deu TESSDATA tessdata_best ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff

./euro.png OEM 1 PSM 6 LANG fra+deu TESSDATA tessdata_fast ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Boeuf Stroganoff

./euro.png OEM 1 PSM 6 LANG eng+fra TESSDATA tessdata ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Boeuf Stroganoff

./euro.png OEM 1 PSM 6 LANG eng+fra TESSDATA tessdata_best ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Boeuf Stroganoff

./euro.png OEM 1 PSM 6 LANG eng+fra TESSDATA tessdata_fast ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Boeuf Stroganoff

./euro.png OEM 1 PSM 6 LANG fra+eng TESSDATA tessdata ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff

./euro.png OEM 1 PSM 6 LANG fra+eng TESSDATA tessdata_best ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff

./euro.png OEM 1 PSM 6 LANG fra+eng TESSDATA tessdata_fast ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Boeuf Stroganoff

./euro.png OEM 1 PSM 6 LANG script/Latin TESSDATA tessdata ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bouf Stroganoff

./euro.png OEM 1 PSM 6 LANG script/Latin TESSDATA tessdata_best ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bouf Stroganoff

./euro.png OEM 1 PSM 6 LANG script/Latin TESSDATA tessdata_fast ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff

./euro.png OEM 1 PSM 6 LANG fra TESSDATA tessdata ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff

./euro.png OEM 1 PSM 6 LANG fra TESSDATA tessdata_best ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff

./euro.png OEM 1 PSM 6 LANG fra TESSDATA tessdata_fast ***
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff
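
For anyone who wants to check the order dependence on their own images, here is a rough sketch that runs both orders and reports whether the text differs. The helper names and the TESSERACT_BIN variable are invented for this example. The --oem 1 and --psm 6 values are copied from the runs above.

```shell
# The command to invoke; override it to point at a specific build.
TESSERACT_BIN="${TESSERACT_BIN:-tesseract}"

# OCR IMAGE with the given -l value and print the recognized text on stdout.
ocr_text() {
  local image="$1" langs="$2"
  # Warnings such as "Invalid resolution 0 dpi" go to stderr and are dropped.
  "$TESSERACT_BIN" "$image" stdout --oem 1 --psm 6 -l "$langs" 2>/dev/null
}

# Report whether swapping the order of two languages changes the output.
# Usage: compare_orders IMAGE LANG_A LANG_B
compare_orders() {
  local image="$1" first="$2" second="$3" a b
  a="$(ocr_text "$image" "$first+$second")"
  b="$(ocr_text "$image" "$second+$first")"
  if [ "$a" = "$b" ]; then
    echo "same output for both orders"
  else
    echo "output depends on language order"
    printf '%s: %s\n%s: %s\n' "$first+$second" "$a" "$second+$first" "$b"
  fi
}
```

For the image in this comment the call would be compare_orders ./euro.png deu fra.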

You already summarized the behavior when LSTM is activated.

As mentioned by the OP in the forum, if the input image has both languages (eng+thai) on the same line, it will read only in 1 language, but when a line has a single language, it will read it in the correct language.

'in 1 language' -> 'in the first given language'

Also, try with the script trained data

https://github.com/tesseract-ocr/tessdata_best/blob/master/script/Greek.traineddata

It should have both Greek and English.

@Shreeshrii : My OCR even became faster by using Devanagari.traineddata. Is there any reason for this to happen? Also, hin+eng was converting a lot of the Hindi text to English.

@sirius0503 Devanagari was trained with hin+san+mar+nep+eng, so it is better at recognition. In addition, only one traineddata file is used rather than two different ones.

@Shreeshrii : Surprisingly, it is faster than hin.traineddata. Any ideas why this may be so?

That's due to a difference in the net-spec between the two, which makes Devanagari's network smaller than hin's network.

In addition, like Shree said, in the case of hin+eng you add to hin another network, eng. Devanagari has just one network for both languages.

That's due to a difference in the net-spec between the two, which makes Devanagari's network smaller than hin's network.

@amitdo : When I use only -l Devanagari instead of -l hin (not even hin+eng), I get better speed. Can you explain more about the net-spec difference?

https://github.com/tesseract-ocr/tesseract/wiki/VGSLSpecs

https://github.com/tesseract-ocr/tesseract/issues/1404#issuecomment-374680492

Lang | Repo | Height | Lfys | Lfx,Lrx | Lrx
:----------: | :----------: | :------------: | :---------: | :------------: | :----------
Devanagari | best | 48 | 64 | 64 | 512
hin | best | 48 | 64 | 96 | 512
Devanagari | fast | 36 | 48 | 96 | 192
hin | fast | 48 | 64 | 96 | 384

I am struggling to get it to talk to laser. It states cannot find path? Which part am I missing, please?

Shree, can you retest the eng+tha / tha+eng, best/fast, with code from the master branch?

Using the same image used for test in https://github.com/tesseract-ocr/tesseract/issues/1579#issuecomment-426351989 and master code built with disable-legacy:

ubuntu@tesseract-ocr:~/TEST$ bash en_th.sh
tesseract 5.0.0-alpha-473-g6d171
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0

 *****  ./en_th.jpg LANG tha+eng TESSDATA tessdata OEM 1 PSM 3 ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 376
Hello สวัสดีจ้า
This is ล test.
นี่คือการทดสอบ
1.66user 0.02system 0:01.70elapsed 99%CPU (0avgtext+0avgdata 81408maxresident)k
0inputs+0outputs (0major+1568minor)pagefaults 0swaps

 *****  ./en_th.jpg LANG tha+eng TESSDATA tessdata_best OEM 1 PSM 3 ****
Warning: Parameter not found: segsearch_max_futile_classifications
Warning: Parameter not found: language_model_ngram_on
Warning: Parameter not found: language_model_ngram_space_delimited_language
Warning: Parameter not found: chop_enable
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 376
Hello สวัสดีจ้า
This is ล test.
นี่คือการทดสอบ
2.00user 0.03system 0:02.06elapsed 98%CPU (0avgtext+0avgdata 64768maxresident)k
0inputs+0outputs (0major+1936minor)pagefaults 0swaps

 *****  ./en_th.jpg LANG tha+eng TESSDATA tessdata_fast OEM 1 PSM 3 ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 376
Hello สวัสดีจ้า
This is ล test.
นี่คือการทดสอบ
1.23user 0.00system 0:01.24elapsed 99%CPU (0avgtext+0avgdata 25408maxresident)k
0inputs+0outputs (0major+666minor)pagefaults 0swaps

 *****  ./en_th.jpg LANG eng+tha TESSDATA tessdata OEM 1 PSM 3 ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 376
Hello สวัสดีจ้า
This is ล test.
นี่คือการทดสอบ
1.64user 0.02system 0:01.67elapsed 99%CPU (0avgtext+0avgdata 78144maxresident)k
0inputs+0outputs (0major+1527minor)pagefaults 0swaps

 *****  ./en_th.jpg LANG eng+tha TESSDATA tessdata_best OEM 1 PSM 3 ****
Warning: Parameter not found: segsearch_max_futile_classifications
Warning: Parameter not found: language_model_ngram_on
Warning: Parameter not found: language_model_ngram_space_delimited_language
Warning: Parameter not found: chop_enable
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 376
Hello สวัสดีจ้า
This is ล test.
นี่คือการทดสอบ
1.95user 0.02system 0:01.97elapsed 99%CPU (0avgtext+0avgdata 59136maxresident)k
0inputs+0outputs (0major+1667minor)pagefaults 0swaps

 *****  ./en_th.jpg LANG eng+tha TESSDATA tessdata_fast OEM 1 PSM 3 ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 376
Hello สวัสดีจ้า
This is ล test.
นี่คือการทดสอบ
1.30user 0.02system 0:01.33elapsed 99%CPU (0avgtext+0avgdata 25472maxresident)k
0inputs+0outputs (0major+625minor)pagefaults 0swaps
DONE
ubuntu@tesseract-ocr:~/TEST$

Hello is now being recognized correctly.

The a in "This is a test" is now being recognized as a Thai character.

Interesting, thanks.

Results for test case in https://github.com/tesseract-ocr/tesseract/issues/1579#issuecomment-428432287 are also changed.

ubuntu@tesseract-ocr:~/TEST$ bash euro.sh

 *****  ./euro.png OEM 1 PSM 6 LANG deu+fra TESSDATA tessdata ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Boeuf Stroganoff

 *****  ./euro.png OEM 1 PSM 6 LANG deu+fra TESSDATA tessdata_best ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Boeuf Stroganoff

 *****  ./euro.png OEM 1 PSM 6 LANG deu+fra TESSDATA tessdata_fast ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Boeuf Stroganoff

 *****  ./euro.png OEM 1 PSM 6 LANG fra+deu TESSDATA tessdata ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff

 *****  ./euro.png OEM 1 PSM 6 LANG fra+deu TESSDATA tessdata_best ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff

 *****  ./euro.png OEM 1 PSM 6 LANG fra+deu TESSDATA tessdata_fast ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Boeuf Stroganoff

 *****  ./euro.png OEM 1 PSM 6 LANG eng+fra TESSDATA tessdata ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff

 *****  ./euro.png OEM 1 PSM 6 LANG eng+fra TESSDATA tessdata_best ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff

 *****  ./euro.png OEM 1 PSM 6 LANG eng+fra TESSDATA tessdata_fast ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Boeuf Stroganoff

 *****  ./euro.png OEM 1 PSM 6 LANG fra+eng TESSDATA tessdata ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff

 *****  ./euro.png OEM 1 PSM 6 LANG fra+eng TESSDATA tessdata_best ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff

 *****  ./euro.png OEM 1 PSM 6 LANG fra+eng TESSDATA tessdata_fast ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Boeuf Stroganoff

 *****  ./euro.png OEM 1 PSM 6 LANG script/Latin TESSDATA tessdata ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bouf Stroganoff

 *****  ./euro.png OEM 1 PSM 6 LANG script/Latin TESSDATA tessdata_best ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bouf Stroganoff

 *****  ./euro.png OEM 1 PSM 6 LANG script/Latin TESSDATA tessdata_fast ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff

 *****  ./euro.png OEM 1 PSM 6 LANG fra TESSDATA tessdata ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff

 *****  ./euro.png OEM 1 PSM 6 LANG fra TESSDATA tessdata_best ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff

 *****  ./euro.png OEM 1 PSM 6 LANG fra TESSDATA tessdata_fast ****
Warning: Invalid resolution 0 dpi. Using 70 instead.
Bœuf Stroganoff
DONE