Tesseract: Arabic text (written right to left) is stored left to right in searchable PDF output

Created on 25 Feb 2016  ·  91 Comments  ·  Source: tesseract-ocr/tesseract

I tested the latest release, 3.05, on Windows to OCR an Arabic document to a searchable PDF. When I select text in the output PDF, it appears to be stored in the opposite order (left to right), whereas the letters should be stored right to left.

i.e. original text In Arabic is
مرحبا
Stored in PDF as text as
ابحرم

PDF bug

All 91 comments

Please post your sample file and the command you used for the OCR job.

This is the command:

tesseract c:\temptest_ara.jpg -l ara -psm 3 c:\temptest_ara pdf

Files are attached (source JPG and output PDF)

test_ara
test_ara.pdf

please check original word
أنحاء
output inside PDF is
ءاحنا

The command and samples are now attached in the previous comment.

Which program are you using to view the PDF?

It does not look reversed with Chrome PDF viewer, just not very accurate...

@amitdo
Is there any way to get better accuracy for Arabic before the switch to the new engine?
Right now Tesseract gives me about 100% accuracy in English, but for Arabic the result is about 30-40%.
But, for example, I checked Google Drive OCR for Arabic and I see it has 100 results for the same image.

Can we work on the language data for better results?

I am using Adobe Reader.
But please note that the words are not reversed when viewing the PDF, because it contains the original image with a text layer.
I mean that when you copy the text layer and paste it into any text editor, it is reversed, so you can't search for text inside the PDF because it is stored reversed inside the text layer!

This is a serious issue with the PDF output feature for Arabic and similar languages that are written from right to left.

@roozgar

It seems that Ray is planning to release soon a new version of Tesseract, that will include a new OCR engine based on LSTM.

With LSTM, OCR for printed Arabic (not actual handwriting) can reach 95% character accuracy.

"Offline Printed Urdu Nastaleeq Script Recognition
with Bidirectional LSTM Networks"
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.447.4577&rep=rep1&type=pdf

I checked google drive ocr for Arabic and i see it have 100 results for same image..

Neither you nor I know what programs they are using to do OCR there...

@tbadran

But please note that the words are not reversed when viewing the PDF, because it contains the original image with a text layer.
I mean that when you copy the text layer and paste it into any text editor, it is reversed, so you can't search for text inside the PDF because it is stored reversed inside the text layer!

Yes, I know...

Here is a copy of the invisible text layer (copied & pasted):

مداها ينم همهما
اللغة العريية
لغة جهد مه
مسنره هي انحاء العالم

Using Chromium (Google browser) PDF viewer under Linux.

Your original jpg image:
test_ara

I try hard to make sure Arabic and other right-to-left languages work correctly in Tesseract PDF. As the problem is isolated further I'm happy to look, but I'm not aware of any reason things would have broken.

A quick check shows Chrome gives good results (as per amitdo) and Acroread gives bad results (as per tbadran). This is surprising, I thought we were good with Acroread. I wonder if this is a regression and if so when it occurred.

Regarding recognition accuracy, that's a better topic for the forum. But in short: Don't compare against Google Drive. Don't expect major accuracy improvements unless/until Ray is successful with his ideas. And most importantly, don't trust any predictions about 'soon'. That last one is true for all software everywhere.

@roozgar

You can try training Tesseract using the regular engine. Use the wiki and see #169. I really don't know how good the result will be for Arabic.

Like jbreiden said, the timeline could change...

Please note that I am testing with the Windows binaries downloaded from:
http://domasofan.spdns.eu/tesseract/
and I am using Windows 10 with Acrobat Pro 11 to view the output PDF file.

I have tested several different sample files, not only the sample uploaded above, and every time I get the same issue in the output PDF on Windows 10 + Acrobat Pro 11.

On OS X, I'm seeing the opposite of earlier reports:

  • Acrobat Reader DC 15.10.20056.167417 appears correct when cutting & pasting
  • Google Chrome Version 48.0.2564.116 (64-bit) appears backwards

Adobe Acrobat:

امهمه مني اهادم
ةييرعلا ةغللا
. هم دهج ةغل
ملاعلا ءاحنا يه هرنسم

Google Chrome

مداها ينم همهما
اللغة العريية
لغة جهد مه
مسنره هي انحاء العالم

Tom,

Look at the original jpg.
Lines 2 and 4 in Google Chrome look quite similar to lines 2 and 3 in the original jpg. First word in line 3 in the original jpg became first word in line 3 in Google Chrome.
Clearly, that's the 'good' output...

Again, in Google Chromium.
If I mark the first two lines in the PDF + first word in line 3,
copy the (invisible) text, paste it to a text file,
mark the second to last word in line 3 in the PDF,
copy the (invisible) text, paste it to the text file, I get:

مداها ينم همهما
اللغة العريية
لغة مسنره هي انحاء العالم

I find it a little easier to test with Hebrew because the letters do not connect. Tesseract version 3.03 behaves the same, so this is not a regression. Will need to think about this, because it is not obvious what exactly is going wrong. Lots of PDF files do a crazy 'write it backwards' strategy but that should not be required. Tesseract writes in reading order.

There are two things I can think of doing. One is to give up and write Arabic
backwards (which I really hate!). The other is to put an entry in the PDF
metadata, Catalog/ViewerPreferences/Direction. Will continue thinking about
this, slowly.

@jbreiden
I didn't understand you. In one comment you talk about Hebrew, and in another you refer only to Arabic. Is Hebrew displayed correctly in Adobe Reader?

Please make sure that any change you make does not cause a regression in the Chrome PDF viewer or OS X Preview. Thanks for your work!

@amitdo Hebrew has the exact same problem as Arabic.

Maybe explicitly using unicode bidi control characters can help ?

That's another possibility, thanks for the suggestion.

@jbreiden, any progress? Which approach did you choose?
Personally, I care about our Hebrew support.

I am taking a look at this today. With current code, copy-paste works from Chrome, fails from Adobe Reader. Destination is gEdit. All tests are on Linux. I see no difference in Adobe Reader if I insert U+2067 RIGHT-TO-LEFT ISOLATE (RLI) at the beginning of each word, and U+2069 POP DIRECTIONAL ISOLATE (PDI) at the end of each word. It's possible that my copy of Adobe Reader is too old to understand these control characters. Or that I am using them wrong. Too early to tell.
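
The control-character experiment described above can be sketched roughly as follows. This is only an illustration, not the code that was actually tried: `WrapInRtlIsolate` is a hypothetical helper, and it operates on a UTF-8 string, whereas the PDF renderer writes each word as UTF-16BE hex.

```cpp
#include <string>

// UTF-8 encodings of the two Unicode bidi control characters.
// U+2067 RIGHT-TO-LEFT ISOLATE (RLI) and U+2069 POP DIRECTIONAL ISOLATE (PDI).
const std::string kRLI = "\xE2\x81\xA7";
const std::string kPDI = "\xE2\x81\xA9";

// Hypothetical helper: surround one UTF-8 encoded word with RLI ... PDI so
// that a bidi-aware consumer treats the word as an isolated RTL run.
std::string WrapInRtlIsolate(const std::string& utf8_word) {
  return kRLI + utf8_word + kPDI;
}
```

Whether a viewer honors these characters when extracting text is up to the viewer; as reported above, the copy of Adobe Reader used in the test showed no difference.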

a

b

c

The PDF 1.7 specification suggests using a left-to-right transformation matrix (Tm) while giving each character a negative width. A very crude experiment along these lines gives good results with Adobe Reader, but it messes up cosmetic highlighting in Chrome, and copy-paste is wrong with Evince. Please note that font metrics are inconsistent in this experiment.

In writing systems that are read from right to left (such as Arabic or Hebrew), 
one might expect that the glyphs in a font would have their origins at the lower right
and their widths (rightward horizontal displacements) specified as negative. 
[ .. then continues into a horrendous discussion of writing everything backwards ... ]

--- tesseract/api/pdfrenderer.cpp   2016-07-06 13:19:57.000000000 -0700
+++ tesseract/api/pdfrenderer.cpp   2016-07-06 15:35:12.000000000 -0700
@@ -246,6 +246,7 @@
 void AffineMatrix(int writing_direction,
                   int line_x1, int line_y1, int line_x2, int line_y2,
                   double *a, double *b, double *c, double *d) {
+  writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
   double theta = atan2(static_cast<double>(line_y1 - line_y2),
                        static_cast<double>(line_x2 - line_x1));
   *a = cos(theta);
@@ -527,7 +528,7 @@
                "endobj\n",
                5L,         // CIDToGIDMap
                7L,         // Font descriptor
-               1000 / kCharWidth);
+               - 1000 / kCharWidth);
   if (n >= sizeof(buf)) return false;
   AppendPDFObject(buf);

Chrome is unhappy
f

heb.pdf

@jbreiden
The PDF 1.7 spec refers to:

Unicode Standard Annex #9, The Bidirectional Algorithm, Version 4.0.0

http://www.unicode.org/reports/tr9/tr9-11.html

Support for RLI and PDI has been added in Unicode 6.3.
http://www.unicode.org/reports/tr9/tr9-29.html

I tried the other control characters U+202b RIGHT-TO-LEFT EMBEDDING and U+202e RIGHT-TO-LEFT OVERRIDE. Even when sprinkled all over the place, neither had any effect with Adobe Reader 9. We still get incorrect copy-paste.

--- tesseract/api/pdfrenderer.cpp   2016-07-06 13:19:57.000000000 -0700
+++ tesseract/api/pdfrenderer.cpp   2016-07-07 10:55:41.000000000 -0700
@@ -410,6 +410,9 @@
     bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
     STRING pdf_word("");
     int pdf_word_len = 0;
+    pdf_word += "<202E>";
+    pdf_word_len++;
     do {
       const char *grapheme = res_it->GetUTF8Text(RIL_SYMBOL);
       if (grapheme && grapheme[0] != '\0') {

heb.pdf

Filed feature request with Adobe to recognize -1 0 0 1 X Y Tm. No idea if they will consider it.
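
For readers unfamiliar with the notation, `-1 0 0 1 X Y Tm` is a text matrix that reflects the text coordinate system over the Y-axis for a horizontal line. A minimal sketch of how such coefficients can be derived from a baseline, modeled on the `AffineMatrix` function in pdfrenderer.cpp, is shown below. The `TextMatrix` struct and the `ComputeTextMatrix` name are assumptions made for illustration only.

```cpp
#include <cmath>

// Hypothetical container for the four rotation/reflection coefficients
// that precede the X Y translation in a PDF "Tm" operator.
struct TextMatrix {
  double a, b, c, d;
};

// Rotation of the text line, optionally reflected over the Y-axis for
// right-to-left text. Image coordinates are assumed (y grows downward),
// which is why the angle uses line_y1 - line_y2.
TextMatrix ComputeTextMatrix(bool right_to_left,
                             int line_x1, int line_y1,
                             int line_x2, int line_y2) {
  double theta = std::atan2(static_cast<double>(line_y1 - line_y2),
                            static_cast<double>(line_x2 - line_x1));
  TextMatrix m{std::cos(theta), std::sin(theta),
               -std::sin(theta), std::cos(theta)};
  if (right_to_left) {
    // Multiply by [-1 0; 0 1]: negate the first row of the rotation.
    m.a = -m.a;
    m.b = -m.b;
  }
  return m;
}
```

For a perfectly horizontal baseline and right-to-left text, this yields the coefficients -1 0 0 1 mentioned in the feature request.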

There are a number of issues relating to RTL and Arabic. Can they all be labelled with 'Arabic' for ease of finding, so that duplicate issues are not created?

https://github.com/tesseract-ocr/tesseract/issues?q=Arabic+in%3Atitle%2Cbody
gives a list of the same.

Hi @Shreeshrii!

Let's see...

#169
This is not an Arabic-specific issue, but an RTL issue. The reported issue was solved.

#212
A question, not an issue.

#238
PDF issue related to RTL. Not an Arabic-specific issue.

#294
'Moved' to tesseract-ocr/langdata issue reports.

#302
Seems to be solved.

#325
Original issue was solved.

#361
A broad complaint about bad RTL support.

#410
Not Arabic-specific. Can't be solved.

As said before, once the new LSTM code finally lands in Tesseract's public GitHub repo, the OCR accuracy for Arabic and Persian will be dramatically improved. Cube's code will be removed, so any issue with it will be irrelevant.

My conclusion: #238 is the only one in the list we should monitor.

The big question left is when we will see the Tesseract 4.0 code. Unfortunately, Ray has not yet shared any planned date with the Tesseract community :(

Ray shared that he would like to have public alpha version by the end of September.

That's good news. I promise that we'll give it a try as soon as it is available.

@stweil,

we'll give it a try...

'We'? The @UB-Mannheim team I guess... :)

Thanks.

I'm currently in discussion with some Adobe folks about this topic.

hi, where can i get the arabic tessdata files?
also, where do we get all other language files?
thanks

https://github.com/tesseract-ocr/tessdata

Download all ara.* Files for Arabic

Other language data files are also in same repository

The tesseract/langdata/ara repo has the 3.04 source files for Arabic
language data.

The Arabic traineddata is based on the cube engine and is the 3.02 version.

@jbreiden
Did you find a solution?

Is there any milestone to drop cube completely?

Is there any milestone to drop cube completely?

This issue is not caused by cube.

See https://github.com/tesseract-ocr/tesseract/issues/40#issuecomment-263039665

The Adobe folks suggested a few things to try, none of which worked so far. Still open and (relatively) active.

Okay, this bug has been open forever. As mentioned before, most PDF files deal with right-to-left (RTL) languages like Hebrew and Arabic by laying out the characters from left-to-right (LTR) but doing it backwards. This offends my programming sensibilities on many levels, and I've resisted this approach. But maybe it is time to swallow pride and wallow in the mud. Here are a few examples from the test suite. How is compatibility for search and copy-paste?

Arabic
ara.pdf

Single word Hebrew
simplest.pdf

Hebrew + English
heb_mivne.pdf

Hebrew + English, tilted
heb-tilt.pdf

English (should be no change from what we do now)
2.pdf

--- tesseract/api/pdfrenderer.cpp   2017-03-31 14:35:03.000000000 -0700
+++ tesseract/api/pdfrenderer.cpp   2017-04-21 10:16:23.000000000 -0700
@@ -225,14 +225,10 @@
 // left-to-right no matter what the reading order is. We need the
 // word baseline in reading order, so we do that conversion here. Returns
 // the word's baseline origin and length.
-void GetWordBaseline(int writing_direction, int ppi, int height,
+void GetWordBaseline(int ppi, int height,
                      int word_x1, int word_y1, int word_x2, int word_y2,
                      int line_x1, int line_y1, int line_x2, int line_y2,
                      double *x0, double *y0, double *length) {
-  if (writing_direction == WRITING_DIRECTION_RIGHT_TO_LEFT) {
-    Swap(&word_x1, &word_x2);
-    Swap(&word_y1, &word_y2);
-  }
   double word_length;
   double x, y;
   {
@@ -260,15 +256,12 @@
 }

 // Compute coefficients for an affine matrix describing the rotation
-// of the text. If the text is right-to-left such as Arabic or Hebrew,
-// we reflect over the Y-axis. This matrix will set the coordinate
+// of the text. This matrix will set the coordinate
 // system for placing text in the PDF file.
 //
-//                           RTL
-// [ x' ] = [ a b ][ x ] = [-1 0 ] [ cos sin ][ x ]
-// [ y' ]   [ c d ][ y ]   [ 0 1 ] [-sin cos ][ y ]
-void AffineMatrix(int writing_direction,
-                  int line_x1, int line_y1, int line_x2, int line_y2,
+// [ x' ] = [ a b ][ x ] = [ cos sin ][ x ]
+// [ y' ]   [ c d ][ y ]   [-sin cos ][ y ]
+void AffineMatrix(int line_x1, int line_y1, int line_x2, int line_y2,
                   double *a, double *b, double *c, double *d) {
   double theta = atan2(static_cast<double>(line_y1 - line_y2),
                        static_cast<double>(line_x2 - line_x1));
@@ -276,17 +269,6 @@
   *b = sin(theta);
   *c = -sin(theta);
   *d = cos(theta);
-  switch(writing_direction) {
-    case WRITING_DIRECTION_RIGHT_TO_LEFT:
-      *a = -*a;
-      *b = -*b;
-      break;
-    case WRITING_DIRECTION_TOP_TO_BOTTOM:
-      // TODO(jbreiden) Consider using the vertical PDF writing mode.
-      break;
-    default:
-      break;
-  }
 }

 // There are some really awkward PDF viewers in the wild, such as
@@ -407,15 +389,14 @@
     {
       int word_x1, word_y1, word_x2, word_y2;
       res_it->Baseline(RIL_WORD, &word_x1, &word_y1, &word_x2, &word_y2);
-      GetWordBaseline(writing_direction, ppi, height,
+      GetWordBaseline(ppi, height,
                       word_x1, word_y1, word_x2, word_y2,
                       line_x1, line_y1, line_x2, line_y2,
                       &x, &y, &word_length);
     }

     if (writing_direction != old_writing_direction || new_block) {
-      AffineMatrix(writing_direction,
-                   line_x1, line_y1, line_x2, line_y2, &a, &b, &c, &d);
+      AffineMatrix(line_x1, line_y1, line_x2, line_y2, &a, &b, &c, &d);
       pdf_str.add_str_double(" ", prec(a));  // . This affine matrix
       pdf_str.add_str_double(" ", prec(b));  // . sets the coordinate
       pdf_str.add_str_double(" ", prec(c));  // . system for all
@@ -459,23 +440,34 @@
     bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
     STRING pdf_word("");
     int pdf_word_len = 0;
+    GenericVector<int> unicodes;
+
+    // Gather up unicode codepoints for the word
     do {
       const char *grapheme = res_it->GetUTF8Text(RIL_SYMBOL);
       if (grapheme && grapheme[0] != '\0') {
-        GenericVector<int> unicodes;
         UNICHAR::UTF8ToUnicode(grapheme, &unicodes);
-        char utf16[kMaxBytesPerCodepoint];
-        for (int i = 0; i < unicodes.length(); i++) {
-          int code = unicodes[i];
-          if (CodepointToUtf16be(code, utf16)) {
-            pdf_word += utf16;
-            pdf_word_len++;
-          }
-        }
       }
       delete []grapheme;
       res_it->Next(RIL_SYMBOL);
     } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
+
+
+    // Use primitive "write it backwards" approach for RTL languages
+    if (writing_direction == WRITING_DIRECTION_RIGHT_TO_LEFT) {
+      unicodes.reverse();
+    }
+
+    // Write out the word the way PDF likes it
+    char utf16[kMaxBytesPerCodepoint];
+    for (int i = 0; i < unicodes.length(); i++) {
+      int codepoint = unicodes[i];
+      if (CodepointToUtf16be(codepoint, utf16)) {
+        pdf_word += utf16;
+        pdf_word_len++;
+      }
+    }
+
     if (word_length > 0 && pdf_word_len > 0 && fontsize > 0) {
       double h_stretch =
           kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));
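
The "write it backwards" step in the patch above can be looked at in isolation. The following is a minimal standalone sketch, not Tesseract code: `ToHexUtf16be` and `WordToPdfHex` are hypothetical names, and the input is assumed to be already decoded into Unicode code points. It reverses the code points of an RTL word and emits the UTF-16BE hex digits that would go inside a PDF hex string.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>

// Hypothetical helper: encode code points as UTF-16BE hex digits.
// Invalid code points (lone surrogates, values above U+10FFFF) are skipped.
std::string ToHexUtf16be(const std::u32string& codepoints) {
  std::string out;
  char buf[16];
  for (char32_t cp : codepoints) {
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) continue;
    if (cp < 0x10000) {
      std::snprintf(buf, sizeof(buf), "%04X", static_cast<unsigned>(cp));
    } else {
      // Code points outside the BMP need a surrogate pair.
      unsigned v = static_cast<unsigned>(cp) - 0x10000;
      unsigned high = 0xD800 + (v >> 10);
      unsigned low = 0xDC00 + (v & 0x3FF);
      std::snprintf(buf, sizeof(buf), "%04X%04X", high, low);
    }
    out += buf;
  }
  return out;
}

// Hypothetical helper: reverse the word if it is right-to-left, then encode.
std::string WordToPdfHex(std::u32string word, bool right_to_left) {
  if (right_to_left) {
    std::reverse(word.begin(), word.end());
  }
  return ToHexUtf16be(word);
}
```

For example, the three Hebrew letters U+05D0 U+05D1 U+05D2 come out as `05D205D105D0` when `right_to_left` is true. Note that reversing by code point also moves any combining mark in front of its base character, which is one reason this approach is crude.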

Please tell us which pdf viewer you already tested if any.

So far, just Pdfium/Chrome/Linux, AdobeReader9/Linux, and Preview/MacOSX. I've asked a few different groups of people to take a look.

@christophered Can you please take a look at ara.pdf above and tell me if it works well for you?

@christophered Thanks, that's a very helpful report. I expect it to be equal to or better than what Tesseract does now for producing PDF of Arabic. Note that my change is just about PDF generation and does not touch the recognition process in any way. It is just stock Tesseract 4.0 support for Arabic.

What Tesseract does right now:
control.pdf

@jbreiden Here is a more thorough examination of "ara.pdf" that you posted in your comment

1) Wrong sentence order:
Some sentences overstep their location in the paragraph. I discovered that this was caused by the software used to view the PDF, in my case Chrome; after switching to Windows Reader the problem went away.

ara

2) Repetitive mistakes:
( لا ) is wrongly represented as ( ال ) , which is actually opposite to the correct spelling.
( ، ) is the Arabic Comma, is wrongly represented as ( ء ) or ( , ) or ( . ) or ( » )
( اً ) is represented by only ( ا ) , which is missing ( ً )
some rare cases of multiple combined words, there are 2 separate cases ( مرحامستبشرا ) and( منالناس ) , should be ( مرحاً مستبشرا ) and ( من الناس )

3) Rare Case:
When I copied the text to Microsoft Word, most of the text was in Arial except for a couple of ( . ) full stops, which were in Calibri. A weird thing to see.

Conclusion:
Altogether, except for the mistakes that I stated earlier, the recognition rate was very good in this sample.

Note that ( لا ) is used so frequently in Arabic that, if it continues to be misrepresented as ( ال ), the recognition rate would degrade drastically.

untitled22

Hebrew report:

I highlighted the text in each pdf viewer, and pasted it to gedit.

With Chromium the straight version is mostly fine. There are problems when there is a combination of Hebrew and English/other ltr symbols in the same line.
The skewed version is not fine. The words appear in wrong order in each line.

The pasted text of the straight version does not look good when Evince and pdf.js are used.
pdf.js - there are line breaks after each word.
Evince - total mess. Wrong line breaks and wrong word order. Unusable.

It will be helpful to compare the pasted text of these files to Tesseract's text renderer output, to see if each issue is really caused by the pdf renderer or by the ocr engine itself.

I think we should also compare to the current (4.00 with lstm) pdf output, without your patch.
The original issue, reversed letters, was with the Adobe pdf viewer, not the other viewers.

@christophered

The source of the 3 first mistakes in your 'Repetitive mistakes' section is the ocr engine itself, not the pdf renderer.

#648 is a more suitable place to report them.

( لا ) vs ( ال ) is a known issue. See https://github.com/tesseract-ocr/tesseract/issues/648#issuecomment-285633162 and the comments below it.

some rare cases of multiple combined words

It's not clear if the source for this issue is the pdf renderer or the ocr engine itself.

I discovered that the Sentence disorder is caused by Chrome which I used to view the PDF.
Note that after using Windows Reader the PDF was viewed correctly without the disorder that I mentioned.

@christophered,
We want it to display correctly in all major PDF viewers, not just one.

Note that after using Windows Reader the PDF was viewed correctly

Which Windows version?

If you have Windows 10, try to open the pdf file with the Edge browser, and report how it is displayed there.

I am using Windows 8

Amit, the PDF displays the original image only, so lookswise it will be the
same. It is the text layer, as copied or saved which is different.

I can test on windows10 and post the result. Someone else will have to tell
if it is ok or not.

Even with legacy Devanagari fonts that use Latin range, I have found that
copied text is different between Adobe reader and foxit reader.

  • excuse the brevity, sent from mobile

Me:

If you have Windows 10, try to open the pdf file with the Edge browser, and report how it is displayed there.

@Shreeshrii :

Amit, the PDF displays the original image only, so lookswise it will be the same. It is the text layer, as copied or saved which is different.

Yes. Although I said 'displayed', I meant to refer to the "invisible text layer underneath the visible image, which can be copied and pasted to text editor".

I just made a shortcut, assuming people will understand what I really meant.
Maybe I should add excuse the brevity, sent from my PC to my comments...
:rofl:

Thank you for the reports, and sorry for any and all confusion. I'm going to post what Tesseract does right now (CONTROL) and also repost the proposed modification (EXPERIMENT). I am especially interested in regressions, anywhere EXPERIMENT does worse than CONTROL.

CONTROL
2.pdf
ara.pdf
heb_mivne.pdf
heb-tilt.pdf
simplest.pdf

EXPERIMENT (repost)
2.pdf
ara.pdf
heb_mivne.pdf
heb-tilt.pdf
simplest.pdf

@amitdo funny joke :)

I downloaded ara.pdf (EXPERIMENT) version and opened in Adobe Reader XI and Foxit Reader 8.1 under Windows 10 and copied and pasted the text in Notepad++ under Windows 10.

Both text files look different - with text order being opposite of each other. Please see attached.

ara-error
ARA-ADOBE-XI.txt

ARA-FOXIT-8.1.txt

Additionally, on windows 10, in both Edge as well as the windows 10 pdf reader I am not able to copy text.

Text copied in Chrome version is similar to Foxit, but with additional line breaks. @christophered can check whether the order is also changed.

ara-chrome-error
ara-chrome.txt

Windows 10 - Internet Explorer - text matches output from Adobe Reader XI.

ara-internet-explorer.txt

@Shreeshrii ARA-FOXIT-8.1.txt is the most adequate one in terms of sentence organization.

It seems that Chrome splits the sentence into half, each at a new line.

I only checked the straight version of the Hebrew document.

The control is much worse with Evince and pdf.js.
Hello -> olleH (pdf.js) / o l l e H (Evince)

Chromium: The two look the same, except the two last lines.
Neither is good, but the experiment is worse than the control.
The digits in the zip code and the phone numbers are in the wrong order.
123-4567890 -> 0987654-321
The date on top 21.07.2009 is OK.

I will check the skewed version later.

Skewed version, Chromium:

Both have wrong word order in each line (but not exactly the same).

Experiment:
The second word in the document is missing.
Two separate lines become one. Hebrew line + English line (site address).
Last lines - same issue as the straight version.

@amitdo
Using the latest Tesseract 4.0 alpha and the latest best Arabic model, I created a searchable pdf output:

  • When using Chrome to view the pdf, the text can be selected/copied/pasted correctly (RTL).
  • When using Adobe Acrobat Reader 17.012 (latest to date), the text is displayed correctly, but when selected and pasted it comes out reversed (LTR).

@amitdo
Using the latest ABBYY FineReader 14 to create a searchable pdf:

  • Both Chrome and Adobe Acrobat Reader can select/copy/paste correctly.

Conclusion:
It seems that Tesseract needs tweaking to solve this problem.

Original Image.zip
Tesseract.pdf
Abby Finereader.pdf

Looking at this again. Slowly losing the remainder of my sanity.

It seems that Tesseract needs tweaking to solve this problem.

The patch was not applied yet, so the original issue still exists.

Looking at this again. Slowly losing the remainder of my sanity.

It could be worse if you were rapidly losing your sanity :-)

You are using a simple reverse here. That's not good enough for bidi text.
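
The phone number report earlier in the thread (123-4567890 becoming 0987654-321) shows why. A rough sketch of a slightly less naive reversal is given below. It is NOT an implementation of the Unicode Bidirectional Algorithm (UAX #9). It rests on the assumption that only ASCII digits and letters are left-to-right and that `-`, `.`, `/` and `:` stay with them when surrounded by such characters. All names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Assumption: only ASCII digits and letters are treated as left-to-right.
bool IsLtr(char32_t c) {
  return (c >= U'0' && c <= U'9') || (c >= U'A' && c <= U'Z') ||
         (c >= U'a' && c <= U'z');
}

// Assumption: these separators join an LTR run when LTR characters
// appear on both sides (as in phone numbers and dates).
bool IsSeparator(char32_t c) {
  return c == U'-' || c == U'.' || c == U'/' || c == U':';
}

// Reverse the whole text, then restore the internal order of each LTR run.
std::u32string ReverseKeepingLtrRuns(std::u32string text) {
  std::reverse(text.begin(), text.end());
  std::size_t i = 0;
  while (i < text.size()) {
    if (!IsLtr(text[i])) {
      ++i;
      continue;
    }
    // Extend the run over LTR characters and separators followed by one.
    std::size_t j = i;
    while (j < text.size() &&
           (IsLtr(text[j]) ||
            (IsSeparator(text[j]) && j + 1 < text.size() &&
             IsLtr(text[j + 1])))) {
      ++j;
    }
    std::reverse(text.begin() + i, text.begin() + j);
    i = j;
  }
  return text;
}
```

With this sketch, Hebrew letters are still reversed, while a run such as `123-4567890` or `21.07.2009` keeps its internal order. Real bidi text has many more cases (nested embeddings, mirrored brackets, neutral characters) that only a full UAX #9 implementation handles.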

Any luck fixing this issue yet?!

Don't apply the patch, because it didn't work very well.

wikipedia says:

If Tesseract is used to process right-to-left text such as Arabic or Hebrew, the results are ordered as though it is left-to-right text.[11]

Is this still true for current version?

PDF output in Tesseract is and has always been reading order.

@jbreiden So, should I delete that line on the wikipedia page?

What is the status of this issue? Is there still a problem which should be fixed for Tesseract 4.0.0?

Status is still unsolved, and currently inactive. I swallowed my pride and experimented with writing Arabic backwards in PDF like everyone else does, and it still didn't work nicely.

Correction: * currently inactive *. I still think it is important though.

This sounds as if there will not be a fix in the near future. So we should not require that this bug must be fixed for 4.0.0.

Is this issue resolved?

@amitdo
Using the latest ABBYY FineReader 14 to create a searchable pdf:

* Both Chrome and Adobe Acrobat Reader can select/copy/paste correctly.

Conclusion:
It seems that Tesseract needs tweaking to solve this problem.

Original Image.zip
Tesseract.pdf
Abby Finereader.pdf

@jbreiden I experimented with the files he attached and noticed something that actually makes sense: the two files seem to have different text orientation setups, which shows when I start selecting mid-sentence and drag over a few lines:
Tesseract:
image
ABBYY:
image

Not sure if this is of any value: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
Page 598 (606 / 756) has a table describing writing mode:
image
It seems to be possible to specify the writing direction as RTL using this parameter on a structure element and all its child elements. @jbreiden, can you have a look at it?

Section 14.8.2.3.3 could be related to this issue as well. (I see that you had a look at this section 3 years ago in here https://github.com/tesseract-ocr/tesseract/issues/238#issuecomment-230928137 so it is probably not really what is causing our issue)

I would like to report that the problem still persists in Tesseract 4.1.1.
Tesseract recognizes and displays Arabic text correctly. However, when the results are exported as PDF/A, the text stored in the PDF/A is reversed.

Yes, if you open the PDF in Acrobat it gives you reversed words, while it works fine in the Google Chrome PDF reader. However, when I extracted the stored text from the PDF/A using pdfToText, the words were reversed too, which means the text was stored in the wrong order.

See the following example for more details:

ar

Here is the PDF/A generated by Tesseract
Recognized_PDFA_By_Tesseract.pdf

To summarize:

True Text
مرحبا بكم جميعا
اللغة العربية

Tesseract Text 100% correct
مرحبا بكم جميعا
اللغة العربية

Tesseract PDF/A Text
اعيمج مكب ابحرم
ةيبرعلا ةغللا

As you see in the Tesseract PDF/A text, every word is reversed although the .hOCR file is correct.

hOCR

Actually, the words are not reversed (you still can read every letter) but the "entire line is mirrored". Usually, we face this problem when rendering Arabic text in HTML by setting "text-align:right"

I think, the problem here is that the x-coord of each RTL letter is rendered by measuring x from left rather than right i.e., (x,y) should be (W-x, y) where W is the page width.
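
The coordinate mapping proposed in this comment can be written down as a tiny sketch. This only illustrates the commenter's hypothesis and is not a confirmed diagnosis or a Tesseract function; `Point` and `MirrorHorizontally` are hypothetical names.

```cpp
// Hypothetical 2D point in page coordinates.
struct Point {
  double x;
  double y;
};

// Map (x, y) to (W - x, y), where W is the page width, so that a position
// measured from the left edge becomes the same distance from the right edge.
Point MirrorHorizontally(Point p, double page_width) {
  return Point{page_width - p.x, p.y};
}
```

Applying the mapping twice returns the original point, which makes it easy to check by hand.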
