Tesseract: "could not find a matching blob" error while training Tesseract 4 on Urdu language data

Created on 30 Aug 2018 · 17 Comments · Source: tesseract-ocr/tesseract

The accuracy of the default trained Urdu model is not good. Here is the OCR output produced by Tesseract with the default Urdu model (urd.traineddata):

خی رص رکااری تنانم کے مطالقی م رکز اور صوبہ تخیرپچشتون خو اویٹش تح یک انصاف نے میید ان مار

لیا۔ :تاب میں ون لیک آ گے آکے سے۔ صصوبہ سندرھ میں ٦ زا ٹین ےکامیالی حاص لکی۔
0-7 ۔ و چتتان می ملا جلارجحان ہے۔ یہ تا وی ٹیں
ج کی ئن الالتوائی میڈیانے بھی پیک یک تھی۔پامتان ٹل بھی زکارم کب رس خے
کیہ اصمل متاللہ تح کیک انصاف اور فون لیگ یل ہو گا اور تح کیک انصا کا بڑ ابعار یر ےگا۔

Here is the image I used for OCR; the recognized text is very different from the original.

training

Source

moeneqbal

Most helpful comment

@Shreeshrii please also let me know why the custom-trained model is not recognizing the spaces between letters. Please check the following image (these are the letters of the Urdu alphabet).
shot
The output for this image should be:
ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک ک ل م ن د ہ ھ ء ی ے
But the custom-trained Urdu model produces the following output (I trained and tested the model on the same image):
ابپتٹثجچح
خدڈذرڑزژسشصض
طظعغفقککل

من دہھءیے

The output is missing the whitespace between characters. If we add the whitespace manually, the output looks like this:

ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک ک ل م ن د ہ ھ ء ی ے
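One way to confirm that the custom model gets the letters right and only loses the spacing is to compare the two strings with all whitespace removed. The following is a minimal Python sketch; the function names and the short sample strings are illustrative and are not taken from the thread.

```python
def strip_whitespace(text: str) -> str:
    """Remove every whitespace character (spaces, tabs, newlines)."""
    return "".join(text.split())


def differs_only_in_whitespace(expected: str, actual: str) -> bool:
    """Return True when the strings match once all whitespace is ignored."""
    return strip_whitespace(expected) == strip_whitespace(actual)


if __name__ == "__main__":
    # Hypothetical short sample, not the full alphabet line from the image.
    expected = "ا ب پ ت"
    actual = "ابپت"
    print(differs_only_in_whitespace(expected, actual))
```

If this check passes on the real ground truth and OCR output, the remaining problem is limited to space detection rather than character recognition.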

Even the default Urdu model included in Tesseract 4.0 produces the following output:

اب پت ٹ ث ىىغً
دڈوڈرڑز مس شی صصضص
اف یقکگلگل

و دو ءگی ١ے

The font used in the image is Nastaliq.

moeneqbal on 3 Sep 2018


All 17 comments

@zdenop, @amitdo, @Shreeshrii can you please provide the tiff and box files used for training urd.traineddata? My email address is moen.[email protected].

moeneqbal on 31 Aug 2018

I don't have those files.

I wonder if Ray used some Urdu fonts with the Nastaliq style for training.

amitdo on 31 Aug 2018

@amitdo thank you for your kind response. Can you please mention Ray here?

moeneqbal on 31 Aug 2018

@theraysmith sir, can you please provide the tiff and box files used for training urd.traineddata? My email address is moen.[email protected].

moeneqbal on 31 Aug 2018

Urdu LSTM training text etc are available at
https://github.com/tesseract-ocr/langdata_lstm/tree/master/urd

The fonts used for 3.04 are listed in
https://github.com/tesseract-ocr/tesseract/blob/master/src/training/language-specific.sh#L552

Ray has not shared the fontlist for 4.0.0 training yet.


Shreeshrii on 31 Aug 2018

here is the image which i used to perform ocr, text is much different from original.

Please also provide the correct text (ground truth) for the image for testing.

Shreeshrii on 31 Aug 2018

@Shreeshrii thank you so much for your response. I am working to improve the accuracy of the model, and a box file would help me understand how to create boxes around the characters manually for training. I am also facing two issues when I try to train my custom model:

The model is not recognizing the spaces between the words.
The model outputs the text in LTR order (Urdu is an RTL language, similar to Arabic).

I asked for help in issue #1832, but @zdenop closed it because the question was a support request rather than an issue. So now I want to go through the training process followed by the Tesseract developers.

Urdu LSTM training text etc are available at
https://github.com/tesseract-ocr/langdata_lstm/tree/master/urd

The link you provided does not contain the box files; it only has the text file with Urdu data that is used to train the model.

Please have a look at the following link; the Tesseract team has provided tiff/box pairs for some languages.
https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-Make-Box-Files#tifbox-pairs-provided.

Note: I am using Tesseract 4.

moeneqbal on 31 Aug 2018

If you use tesstrain.sh it will create the box/tif pairs correctly for RTL languages also.

You can use --save_box_tiff with the command. Please build Tesseract using the latest code (beta.4) from GitHub.
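As a rough illustration of the suggestion above, the following Python sketch assembles a tesstrain.sh command line that includes --save_box_tiff. The other flags are the ones used in the Tesseract 4 LSTM training tutorial as far as I recall, and every path below is a hypothetical placeholder; check both against the script shipped with your build before running anything.

```python
import shlex
from typing import List


def build_tesstrain_command(
    lang: str,
    langdata_dir: str,
    tessdata_dir: str,
    fonts_dir: str,
    output_dir: str,
    script: str = "src/training/tesstrain.sh",  # assumed location in the source tree
) -> List[str]:
    """Build (but do not run) a tesstrain.sh invocation that keeps box/tiff pairs."""
    return [
        script,
        "--lang", lang,
        "--linedata_only",
        "--noextract_font_properties",
        "--langdata_dir", langdata_dir,
        "--tessdata_dir", tessdata_dir,
        "--fonts_dir", fonts_dir,
        "--save_box_tiff",  # keep the generated box/tiff pairs in the output directory
        "--output_dir", output_dir,
    ]


if __name__ == "__main__":
    # All paths are placeholders for illustration only.
    cmd = build_tesstrain_command(
        lang="urd",
        langdata_dir="../langdata_lstm",
        tessdata_dir="./tessdata",
        fonts_dir="/usr/share/fonts",
        output_dir="./urdtrain",
    )
    print(shlex.join(cmd))  # copy the printed line into a shell, or pass cmd to subprocess.run
```

Building the argument list separately from executing it makes it easy to inspect the exact command before a long training-data generation run.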

Shreeshrii on 31 Aug 2018

@Shreeshrii, I've already trained my own model using my own tif/box files, but the result is worse than the original trained model.

This is the image:

and this is the result:

ﻧﺮگﻛریﯾﯾﺎﺘﯿﮯﻮزﻧﺎﻧﺎﺮگﻟاورﺻﻮﺑﺑﻮﺮﻛﻧﺘنﯿواہﯿںﻧﺮﻛﮏاﻧﺼﺎﻑ۔ﮯﻣﮏرانﺎر

ن۔ﯾﯾﯿﺎبﯿںﻧننﮏ ﺁﮯ ﺁﮯﮪ۔ ﺻﻮﺑﻄﺪﮪﯿں ﯿﮯﻟﺎری۔ﮯﻛﻣرﺎیﺣﺎگﺻﻞگی۔
ﺮاﯾیﯿںﺸﮏوگﻮناﻢﻛﺑاﻢﻛﺻﮪﺎﺎﮨﻮﻛگﺎ ﺎﻮﻮﮧﺎںﻣںﺎﻻﺣﻼرﻘﺎںﮪ۔ﺪﯾﺎگﯿ وﺻیﮨں
ﻮںﻛوﺎﯿںاﻟوﻟﻮایﻣﯾرﻑگوﺎﮨﻧﻮیﻛوﺎیوﺎ۔ﻔﺮﺎنﯿںگوﺎﺰﯾﻛرﯿہوﺎﻛﮨﮧرﮪہﮯ
ﻛﮧاﺻﻞﻣﯿاﺎﺎﺼﯾگﺮﮯﮏاﻧﺼﺎﻑاورﻧنﺸﯾﮏﯿںﮨﻮﻛاورﯾگﺮﮯﮏاﻧﺼﺎﻑﻛﺒﻟاﺑﺎریرﮪﺸﻛ۔
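The result quoted above appears to contain characters from the Unicode Arabic Presentation Forms blocks (for example ﻧ, U+FEE7) rather than the base Arabic letters used in ordinary Urdu text. A minimal Python sketch for detecting such characters and folding them back to base letters with NFKC normalization follows. The function names are illustrative, and whether the training box files are the cause is an assumption that the thread does not confirm.

```python
import unicodedata


def is_presentation_form(ch: str) -> bool:
    """True if ch lies in Arabic Presentation Forms-A (U+FB50..U+FDFF) or -B (U+FE70..U+FEFF)."""
    cp = ord(ch)
    return 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFF


def count_presentation_forms(text: str) -> int:
    """Count how many characters in text are Arabic presentation forms."""
    return sum(1 for ch in text if is_presentation_form(ch))


def to_base_letters(text: str) -> str:
    """Map presentation forms to their base letters using NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    sample = "\uFEE7"  # ARABIC LETTER NOON INITIAL FORM
    print(count_presentation_forms(sample))
    print(to_base_letters(sample) == "\u0646")  # compare with base ARABIC LETTER NOON
```

Running the count on a box file or training text before training is a quick way to see whether the transcription uses presentation forms instead of base letters.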

If you can share the tif/box files with me, which are not at the link you provided, it will help me find the problem.

Any thoughts?

Thanks

P.S. Is it possible to contact you privately for help?

moeneqbal on 31 Aug 2018

If you can share the tif/box files with me, which are not at the link you provided, it will help me find the problem.

The box/tiff files have NOT been provided by Google. You can run the scripts on the training text to generate them.

Shreeshrii on 31 Aug 2018

and this is the result:

I asked for the ground truth, i.e. the correct text matching that image, for testing the various urd.traineddata files (tessdata, tessdata_best, tessdata_fast) and also Arabic.traineddata.
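Once the ground truth is available, the different traineddata files can be compared with a character error rate (CER): the edit distance between the OCR output and the ground truth, divided by the ground-truth length. The sketch below is a plain-Python version under that definition; it is not a tool from the Tesseract project, and the function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def character_error_rate(truth: str, ocr: str) -> float:
    """Edit distance divided by ground-truth length, after collapsing whitespace runs."""
    truth = " ".join(truth.split())
    ocr = " ".join(ocr.split())
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(truth, ocr) / len(truth)


if __name__ == "__main__":
    # Hypothetical strings; replace with the ground truth and each model's output.
    print(character_error_rate("abcd efgh", "abcd efgx"))
```

Whitespace runs are collapsed to single spaces rather than removed, so missing word spaces still count as errors.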

Shreeshrii on 31 Aug 2018

@Shreeshrii here is the correct text.

غیر سرکاری نتائج کے مطابق مرکز اور صوبہ خیبر پختون خواہ میں تحریک انصاف نے میدان مار
لیا۔ پنجاب میں نون لیگ آگے آگے ہے۔ صوبہ سندھ میں پیپلز پارٹی نے کامیابی حاصل کی۔ کراچی میں ملک دشمن ایم کیو ایم کا صفایا ہو گیا۔ بلوچستان میں ملا جلار جحان ہے۔ یہ نتائج وہی ہیں جس کی بین الاقوامی میڈیا نے بھی پیشگوئی کی تھی۔ پاکستان میں بھی تجزیہ کار یہی کہہ رہے تھے کہ اصل مقابلہ تحریک انصاف اور نون لیگ میں ہو گا اور تحریک انصاف کا پلڑا بھاری رہے گا۔

moeneqbal on 31 Aug 2018

Thanks! Do you know what font was used for the image?


Shreeshrii on 1 Sep 2018

@Shreeshrii the font used in this image is "Nastaliq", but I have 463 other pages to OCR, and those pages mix the "Nastaliq" and "majalla" font styles. Here is a sample of the pages I want to OCR.

sample

The following is the correct text for this image:
عبدالکریم | اشرف حسین | 9-7220602-42000 | 22 | مکان نمبر 80 محله گلشن ضیاء گلی نمبر 1 لیاقت چوک اورنگی ٹاون، ضلع کراچی غربی

Please look at the following image. The font inside the red box is "majalla", which can be ignored because the text in the red box is a copy of " عبدالکریم | اشرف حسین " written in the majalla font style.

44946620-46cc2700-ae19-11e8-815b-370ef3a375d5