Please attach your sample file and the command you used for the OCR job.

roozgar on 25 Feb 2016

This is the command:

tesseract c:\temptest_ara.jpg -l ara -psm 3 c:\temptest_ara pdf

Files are attached (source JPG and output PDF)

test_ara.pdf

Please check. The original word is:
أنحاء
The output inside the PDF is:
ءاحنا

tbadran on 25 Feb 2016

The command and samples are now attached in the previous comment.

tbadran on 25 Feb 2016

Which program are you using to view the PDF?

amitdo on 26 Feb 2016

It does not look reversed with the Chrome PDF viewer, just not very accurate...

amitdo on 26 Feb 2016

@amitdo
Is there any way to get better accuracy for Arabic until the switch to the new engine?
With Tesseract I now get about 100% accuracy in English, but only about 30-40% for Arabic.
For comparison, I checked Google Drive OCR for Arabic, and it gives 100% results for the same image.

Can we work on the language data to get better results?

roozgar on 26 Feb 2016

I am using Adobe Reader.
But please note that the words do not look reversed when viewing the PDF, because it shows the original image with the text layer underneath.
I mean that when you copy the text layer and paste it into any text editor, it is reversed. As a result, you can't search for text inside the PDF, because it is stored reversed in the text layer!
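
To make the symptom concrete, here is a minimal sketch (Python standard library only) of how one might test whether text pasted from the PDF text layer is simply the expected word with its code points in reverse order. The function name `is_reversed_copy` is hypothetical and is not part of Tesseract; the sample word is written with escapes so the result does not depend on how an editor displays right-to-left text.

```python
def is_reversed_copy(expected: str, pasted: str) -> bool:
    """Return True when `pasted` is `expected` with its code points reversed.

    Palindromes are excluded, because reversing them changes nothing.
    """
    return pasted != expected and pasted == expected[::-1]


# The word from the report, in logical (reading) order:
# ALEF WITH HAMZA ABOVE, NOON, HAH, ALEF, HAMZA.
word = "\u0623\u0646\u062d\u0627\u0621"

# Simulate what a viewer hands back when the text layer is stored backwards.
pasted_from_viewer = word[::-1]

print(is_reversed_copy(word, pasted_from_viewer))
print(is_reversed_copy(word, word))
```

A check like this only detects a whole-word reversal. It says nothing about recognition errors, which are a separate topic in this thread.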

tbadran on 26 Feb 2016

This is a serious issue with the PDF output feature for Arabic and similar languages that are written from right to left.

tbadran on 26 Feb 2016

@roozgar

It seems that Ray plans to release a new version of Tesseract soon, which will include a new OCR engine based on LSTM.

With LSTM, OCR for printed Arabic (not handwriting) can reach 95% character accuracy.

"Offline Printed Urdu Nastaleeq Script Recognition
with Bidirectional LSTM Networks"
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.447.4577&rep=rep1&type=pdf

amitdo on 26 Feb 2016

I checked Google Drive OCR for Arabic, and it gives 100% results for the same image.

Neither you nor I know what programs they are using to do OCR there...

amitdo on 26 Feb 2016

@tbadran

But please note that the words do not look reversed when viewing the PDF, because it shows the original image with the text layer underneath.
I mean that when you copy the text layer and paste it into any text editor, it is reversed. As a result, you can't search for text inside the PDF, because it is stored reversed in the text layer!

Yes, I know...

Here is a copy of the invisible text layer (copied & pasted):

مداها ينم همهما
اللغة العريية
لغة جهد مه
مسنره هي انحاء العالم

Using Chromium (Google browser) PDF viewer under Linux.

Your original jpg image:
test_ara

amitdo on 26 Feb 2016

I try hard to make sure Arabic and other right-to-left languages work correctly in Tesseract PDF. As the problem is isolated further I'm happy to look, but I'm not aware of any reason things would have broken.

jbreiden on 27 Feb 2016

A quick check shows Chrome gives good results (as per amitdo) and Acroread gives bad results (as per tbadran). This is surprising, I thought we were good with Acroread. I wonder if this is a regression and if so when it occurred.

jbreiden on 27 Feb 2016

Regarding recognition accuracy, that's a better topic for the forum. But in short: Don't compare against Google Drive. Don't expect major accuracy improvements unless/until Ray is successful with his ideas. And most importantly, don't trust any predictions about 'soon'. That last one is true for all software everywhere.

jbreiden on 27 Feb 2016

@roozgar

You can try training Tesseract using the regular engine. Use the wiki and see #169. I really don't know how good the result will be for Arabic.

Like jbreiden said, the timeline could change...

amitdo on 27 Feb 2016

Please note that I am testing with the Windows binaries downloaded from:
http://domasofan.spdns.eu/tesseract/
and I am using Windows 10 with Acrobat Pro 11 to view the output PDF file.

tbadran on 27 Feb 2016

I have tested several different sample files, not only the sample uploaded above, and every time I get the same issue in the output PDF on Windows 10 + Acrobat Pro 11.

tbadran on 27 Feb 2016

On OS X, I'm seeing the opposite of earlier reports:

Acrobat Reader DC 15.10.20056.167417 appears correct when cutting & pasting
Google Chrome Version 48.0.2564.116 (64-bit) appears backwards

tfmorris on 29 Feb 2016

Adobe Acrobat:

امهمه مني اهادم
ةييرعلا ةغللا
. هم دهج ةغل
ملاعلا ءاحنا يه هرنسم

Google Chrome

مداها ينم همهما
اللغة العريية
لغة جهد مه
مسنره هي انحاء العالم

tfmorris on 29 Feb 2016

Tom,

Look at the original jpg.
Lines 2 and 4 in Google Chrome look quite similar to lines 2 and 3 in the original jpg. First word in line 3 in the original jpg became first word in line 3 in Google Chrome.
Clearly, that's the 'good' output...

amitdo on 29 Feb 2016

Again, in Google Chromium.
If I mark the first two lines in the PDF + first word in line 3,
copy the (invisible) text, paste it to a text file,
mark the second to last word in line 3 in the PDF,
copy the (invisible) text, paste it to the text file, I get:

مداها ينم همهما
اللغة العريية
لغة مسنره هي انحاء العالم

amitdo on 29 Feb 2016

I find it a little easier to test with Hebrew because the letters do not connect. Tesseract version 3.03 behaves the same, so this is not a regression. Will need to think about this, because it is not obvious what exactly is going wrong. Lots of PDF files do a crazy 'write it backwards' strategy but that should not be required. Tesseract writes in reading order.

jbreiden on 1 Mar 2016

There are two things I can think of doing. One is to give up and write Arabic backwards (which I really hate!). The other is to put an entry in the PDF metadata, Catalog/ViewerPreferences/Direction. Will continue thinking about this, slowly.

jbreiden on 9 Mar 2016

@jbreiden
I didn't understand you. In one comment you talk about Hebrew, and in another you refer only to Arabic. Is Hebrew displayed correctly with Adobe Reader?

amitdo on 9 Mar 2016

Please make sure that any change you make does not cause a regression with the Chrome PDF viewer or OS X Preview. Thanks for your work!

amitdo on 9 Mar 2016

@amitdo Hebrew has the exact same problem as Arabic.

jbreiden on 9 Mar 2016

Maybe explicitly using unicode bidi control characters can help ?

amitdo on 10 Mar 2016

That's another possibility, thanks for the suggestion.

jbreiden on 18 Mar 2016

@jbreiden, any progress? Which approach did you choose?
Personally, I care about our Hebrew support.

amitdo on 2 Jun 2016

I am taking a look at this today. With current code, copy-paste works from Chrome, fails from Adobe Reader. Destination is gEdit. All tests are on Linux. I see no difference in Adobe Reader if I insert U+2067 RIGHT-TO-LEFT ISOLATE (RLI) at the beginning of each word, and U+2069 POP DIRECTIONAL ISOLATE (PDI) at the end of each word. It's possible that my copy of Adobe Reader is too old to understand these control characters. Or that I am using them wrong. Too early to tell.
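
The experiment described above can be sketched as follows. This is an illustration in Python, not the change that was made to pdfrenderer.cpp, and the helper names are hypothetical. Only the two code points, U+2067 and U+2069, come from the comment.

```python
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate_rtl(word: str) -> str:
    """Wrap one word in an RLI ... PDI pair."""
    return f"{RLI}{word}{PDI}"


def isolate_each_word(line: str) -> str:
    """Apply the wrapping to every whitespace-separated word of a line."""
    return " ".join(isolate_rtl(w) for w in line.split())


# Hebrew word "shalom" in logical order, written with escapes.
shalom = "\u05e9\u05dc\u05d5\u05dd"
wrapped = isolate_rtl(shalom)

# Show the code points so the invisible controls are easy to see.
print([f"U+{ord(ch):04X}" for ch in wrapped])
```

Whether a given PDF viewer honours these controls during copy-paste is a separate question, and the comment reports no effect in the Adobe Reader version that was tested.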

jbreiden on 6 Jul 2016

The PDF 1.7 specification suggests using a left-to-right transformation matrix (Tm) while giving each character a negative width. A very crude experiment along these lines gives good results with Adobe Reader, but it messes up cosmetic highlighting in Chrome, and copy-paste is wrong with Evince. Please note that font metrics are inconsistent in this experiment.

In writing systems that are read from right to left (such as Arabic or Hebrew), 
one might expect that the glyphs in a font would have their origins at the lower right
and their widths (rightward horizontal displacements) specified as negative. 
[ .. then continues into a horrendous discussion of writing everything backwards ... ]

--- tesseract/api/pdfrenderer.cpp   2016-07-06 13:19:57.000000000 -0700
+++ tesseract/api/pdfrenderer.cpp   2016-07-06 15:35:12.000000000 -0700
@@ -246,6 +246,7 @@
 void AffineMatrix(int writing_direction,
                   int line_x1, int line_y1, int line_x2, int line_y2,
                   double *a, double *b, double *c, double *d) {
+  writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
   double theta = atan2(static_cast<double>(line_y1 - line_y2),
                        static_cast<double>(line_x2 - line_x1));
   *a = cos(theta);
@@ -527,7 +528,7 @@
                "endobj\n",
                5L,         // CIDToGIDMap
                7L,         // Font descriptor
-               1000 / kCharWidth);
+               - 1000 / kCharWidth);
   if (n >= sizeof(buf)) return false;
   AppendPDFObject(buf);

Chrome is unhappy

heb.pdf

jbreiden on 7 Jul 2016

@jbreiden
The PDF 1.7 spec refers to:

Unicode Standard Annex #9, The Bidirectional Algorithm, Version 4.0.0

http://www.unicode.org/reports/tr9/tr9-11.html

Support for RLI and PDI was added in Unicode 6.3.
http://www.unicode.org/reports/tr9/tr9-29.html

amitdo on 7 Jul 2016

I tried the other control characters U+202b RIGHT-TO-LEFT EMBEDDING and U+202e RIGHT-TO-LEFT OVERRIDE. Even when sprinkled all over the place, neither had any effect with Adobe Reader 9. We still get incorrect copy-paste.

--- tesseract/api/pdfrenderer.cpp   2016-07-06 13:19:57.000000000 -0700
+++ tesseract/api/pdfrenderer.cpp   2016-07-07 10:55:41.000000000 -0700
@@ -410,6 +410,9 @@
     bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
     STRING pdf_word("");
     int pdf_word_len = 0;
+    pdf_word += "<202E>";
+    pdf_word_len++;
     do {
       const char *grapheme = res_it->GetUTF8Text(RIL_SYMBOL);
       if (grapheme && grapheme[0] != '\0') {

heb.pdf

jbreiden on 7 Jul 2016

Filed feature request with Adobe to recognize -1 0 0 1 X Y Tm. No idea if they will consider it.
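
For context on where a matrix like `-1 0 0 1 X Y Tm` comes from: the renderer comments quoted in this thread describe a rotation matrix [cos sin; -sin cos] that is reflected over the Y-axis for right-to-left text. Below is a rough Python sketch of that arithmetic, assuming the same angle convention as the `AffineMatrix` function shown in the diffs. The names `text_matrix` and `tm_operator` are made up for illustration and do not exist in Tesseract.

```python
import math


def text_matrix(line_x1, line_y1, line_x2, line_y2, rtl=False):
    """Return (a, b, c, d) for a text line running from (x1, y1) to (x2, y2)."""
    # Angle convention taken from the AffineMatrix diff quoted in this thread.
    theta = math.atan2(line_y1 - line_y2, line_x2 - line_x1)
    a, b = math.cos(theta), math.sin(theta)
    c, d = -math.sin(theta), math.cos(theta)
    if rtl:
        # Reflection over the Y-axis: negate the first row.
        a, b = -a, -b
    return a, b, c, d


def tm_operator(a, b, c, d, x, y):
    """Format the six numbers as a PDF text-matrix operator string."""
    # Adding 0.0 turns a negative zero into a plain zero before formatting.
    nums = [round(v, 3) + 0.0 for v in (a, b, c, d, x, y)]
    return " ".join(f"{v:g}" for v in nums) + " Tm"


# A perfectly horizontal line, placed at an arbitrary origin (72, 700).
print(tm_operator(*text_matrix(0, 0, 100, 0, rtl=False), 72, 700))
print(tm_operator(*text_matrix(0, 0, 100, 0, rtl=True), 72, 700))
```

For a horizontal line the rotation is the identity, so the reflected case reduces to the `-1 0 0 1 X Y` form mentioned in the feature request.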

jbreiden on 21 Jul 2016

There are a number of issues relating to RTL and Arabic. Can they all be labelled 'Arabic' so that they are easy to find and duplicate issues are not created?

https://github.com/tesseract-ocr/tesseract/issues?q=Arabic+in%3Atitle%2Cbody
gives a list of the same.

Shreeshrii on 14 Sep 2016

Hi @Shreeshrii!

Let's see...

#169
This is not an Arabic-specific issue, but an RTL issue. The reported issue was solved.

#212
A question, not an issue.

#238
PDF issue related to RTL. Not an Arabic-specific issue.

#294
'Moved' to the tesseract-ocr/langdata issue reports.

#302
Seems to be solved.

#325
Original issue was solved.

#361
A broad complaint about bad RTL support.

#410
Not Arabic-specific. Can't be solved.

As said before, once the new LSTM code finally lands in Tesseract's public GitHub repo, the OCR accuracy for Arabic and Persian will be dramatically improved. Cube's code will be removed, so any issue with it will be irrelevant.

My conclusion: #238 is the only one in the list we should monitor.

The big question left is when we will see the Tesseract 4.0 code. Unfortunately, Ray has not yet shared any planned date with the Tesseract community :(

amitdo on 14 Sep 2016

Ray shared that he would like to have public alpha version by the end of September.

zdenop on 14 Sep 2016

That's good news. I promise that we'll give it a try as soon as it is available.

stweil on 14 Sep 2016

@stweil,

we'll give it a try...

'We'? The @UB-Mannheim team I guess... :)

amitdo on 14 Sep 2016

Thanks.

Shreeshrii on 15 Sep 2016

I'm currently in discussion with some Adobe folks about this topic.

jbreiden on 15 Sep 2016

hi, where can i get the arabic tessdata files?
also, where do we get all other language files?
thanks

mehmetaltuntas on 21 Oct 2016

ara.* from https://github.com/tesseract-ocr/tessdata (Version 3.02)

https://github.com/tesseract-ocr/langdata/tree/master/ara (Version 3.04)

Shreeshrii on 23 Oct 2016

https://github.com/tesseract-ocr/tessdata

Download all ara.* Files for Arabic

Other language data files are also in same repository

Shreeshrii on 25 Oct 2016

The tesseract/langdata/ara repo has the 3.04 source files for Arabic language data.

The Arabic traineddata is based on the cube engine and is the 3.02 version.

Shreeshrii on 25 Oct 2016

@jbreiden
Did you find a solution?

amitdo on 28 Nov 2016

is there any milestone to drop cube completely!?

roozgar on 28 Nov 2016

is there any milestone to drop cube completely!?

This issue is not caused by cube.

See https://github.com/tesseract-ocr/tesseract/issues/40#issuecomment-263039665

amitdo on 28 Nov 2016

The Adobe folks suggested a few things to try, none of which worked so far. Still open and (relatively) active.

jbreiden on 28 Nov 2016

Okay, this bug has been open forever. As mentioned before, most PDF files deal with right-to-left (RTL) languages like Hebrew and Arabic by laying out the characters from left-to-right (LTR) but doing it backwards. This offends my programming sensibilities on many levels, and I've resisted this approach. But maybe it is time to swallow pride and wallow in the mud. Here are a few examples from the test suite. How is compatibility for search and copy-paste?

Arabic
ara.pdf

Single word Hebrew
simplest.pdf

Hebrew + English
heb_mivne.pdf

Hebrew + English, tilted
heb-tilt.pdf

English (should be no change from what we do now)
2.pdf

--- tesseract/api/pdfrenderer.cpp   2017-03-31 14:35:03.000000000 -0700
+++ tesseract/api/pdfrenderer.cpp   2017-04-21 10:16:23.000000000 -0700
@@ -225,14 +225,10 @@
 // left-to-right no matter what the reading order is. We need the
 // word baseline in reading order, so we do that conversion here. Returns
 // the word's baseline origin and length.
-void GetWordBaseline(int writing_direction, int ppi, int height,
+void GetWordBaseline(int ppi, int height,
                      int word_x1, int word_y1, int word_x2, int word_y2,
                      int line_x1, int line_y1, int line_x2, int line_y2,
                      double *x0, double *y0, double *length) {
-  if (writing_direction == WRITING_DIRECTION_RIGHT_TO_LEFT) {
-    Swap(&word_x1, &word_x2);
-    Swap(&word_y1, &word_y2);
-  }
   double word_length;
   double x, y;
   {
@@ -260,15 +256,12 @@
 }

 // Compute coefficients for an affine matrix describing the rotation
-// of the text. If the text is right-to-left such as Arabic or Hebrew,
-// we reflect over the Y-axis. This matrix will set the coordinate
+// of the text. This matrix will set the coordinate
 // system for placing text in the PDF file.
 //
-//                           RTL
-// [ x' ] = [ a b ][ x ] = [-1 0 ] [ cos sin ][ x ]
-// [ y' ]   [ c d ][ y ]   [ 0 1 ] [-sin cos ][ y ]
-void AffineMatrix(int writing_direction,
-                  int line_x1, int line_y1, int line_x2, int line_y2,
+// [ x' ] = [ a b ][ x ] = [ cos sin ][ x ]
+// [ y' ]   [ c d ][ y ]   [-sin cos ][ y ]
+void AffineMatrix(int line_x1, int line_y1, int line_x2, int line_y2,
                   double *a, double *b, double *c, double *d) {
   double theta = atan2(static_cast<double>(line_y1 - line_y2),
                        static_cast<double>(line_x2 - line_x1));
@@ -276,17 +269,6 @@
   *b = sin(theta);
   *c = -sin(theta);
   *d = cos(theta);
-  switch(writing_direction) {
-    case WRITING_DIRECTION_RIGHT_TO_LEFT:
-      *a = -*a;
-      *b = -*b;
-      break;
-    case WRITING_DIRECTION_TOP_TO_BOTTOM:
-      // TODO(jbreiden) Consider using the vertical PDF writing mode.
-      break;
-    default:
-      break;
-  }
 }

 // There are some really awkward PDF viewers in the wild, such as
@@ -407,15 +389,14 @@
     {
       int word_x1, word_y1, word_x2, word_y2;
       res_it->Baseline(RIL_WORD, &word_x1, &word_y1, &word_x2, &word_y2);
-      GetWordBaseline(writing_direction, ppi, height,
+      GetWordBaseline(ppi, height,
                       word_x1, word_y1, word_x2, word_y2,
                       line_x1, line_y1, line_x2, line_y2,
                       &x, &y, &word_length);
     }

     if (writing_direction != old_writing_direction || new_block) {
-      AffineMatrix(writing_direction,
-                   line_x1, line_y1, line_x2, line_y2, &a, &b, &c, &d);
+      AffineMatrix(line_x1, line_y1, line_x2, line_y2, &a, &b, &c, &d);
       pdf_str.add_str_double(" ", prec(a));  // . This affine matrix
       pdf_str.add_str_double(" ", prec(b));  // . sets the coordinate
       pdf_str.add_str_double(" ", prec(c));  // . system for all
@@ -459,23 +440,34 @@
     bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
     STRING pdf_word("");
     int pdf_word_len = 0;
+    GenericVector<int> unicodes;
+
+    // Gather up unicode codepoints for the word
     do {
       const char *grapheme = res_it->GetUTF8Text(RIL_SYMBOL);
       if (grapheme && grapheme[0] != '\0') {
-        GenericVector<int> unicodes;
         UNICHAR::UTF8ToUnicode(grapheme, &unicodes);
-        char utf16[kMaxBytesPerCodepoint];
-        for (int i = 0; i < unicodes.length(); i++) {
-          int code = unicodes[i];
-          if (CodepointToUtf16be(code, utf16)) {
-            pdf_word += utf16;
-            pdf_word_len++;
-          }
-        }
       }
       delete []grapheme;
       res_it->Next(RIL_SYMBOL);
     } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
+
+
+    // Use primitive "write it backwards" approach for RTL languages
+    if (writing_direction == WRITING_DIRECTION_RIGHT_TO_LEFT) {
+      unicodes.reverse();
+    }
+
+    // Write out the word the way PDF likes it
+    char utf16[kMaxBytesPerCodepoint];
+    for (int i = 0; i < unicodes.length(); i++) {
+      int codepoint = unicodes[i];
+      if (CodepointToUtf16be(codepoint, utf16)) {
+        pdf_word += utf16;
+        pdf_word_len++;
+      }
+    }
+
     if (word_length > 0 && pdf_word_len > 0 && fontsize > 0) {
       double h_stretch =
           kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));

jbreiden on 21 Apr 2017

Please tell us which pdf viewer you already tested if any.

amitdo on 21 Apr 2017

So far, just Pdfium/Chrome on Linux, Adobe Reader 9 on Linux, and Preview on Mac OS X. I've asked a few different groups of people to take a look.

jbreiden on 21 Apr 2017

@christophered Can you please take a look at ara.pdf above and tell me if it works well for you?

jbreiden on 24 Apr 2017

@christophered Thanks, that's a very helpful report. I expect it to be equal to or better than what Tesseract does now for producing PDF of Arabic. Note that my change is just about PDF generation and does not touch the recognition process in any way. It is just stock Tesseract 4.0 support for Arabic.

What Tesseract does right now:
control.pdf

jbreiden on 25 Apr 2017

@jbreiden Here is a more thorough examination of "ara.pdf" that you posted in your comment

1) Wrong sentence order:
Some sentences overstep their location in the paragraph. I discovered that this was caused by the software used to view the PDF, in my case Chrome; after switching to Windows Reader the problem went away.

ara

2) Repetitive mistakes:
( لا ) is wrongly represented as ( ال ) , which is actually opposite to the correct spelling.
( ، ) is the Arabic Comma, is wrongly represented as ( ء ) or ( , ) or ( . ) or ( » )
( اً ) is represented by only ( ا ) , which is missing ( ً )
There are rare cases of combined words. The 2 separate cases are ( مرحامستبشرا ) and ( منالناس ), which should be ( مرحاً مستبشرا ) and ( من الناس ).

3) Rare Case:
When I copied the text to Microsoft Word, most of the text was in Arial, except for a couple of ( . ) full stops which were in Calibri, a weird thing to see.

Conclusion:
Altogether, except for the mistakes that I stated earlier, the recognition rate was very good in this sample.

Note that in Arabic ( لا ) is used very frequently, so much so that if it continues to be misrepresented as ( ال ), the recognition rate will degrade drastically.

untitled22

ghost on 25 Apr 2017

Hebrew report:

I highlighted the text in each pdf viewer, and pasted it to gedit.

With Chromium the straight version is mostly fine. There are problems when there is a combination of Hebrew and English/other ltr symbols in the same line.
The skewed version is not fine. The words appear in wrong order in each line.

The pasted text of the straight version does not look good when Evince and pdf.js are used.
pdf.js - there are line breaks after each word.
Evince - total mess. Wrong line breaks and wrong word order. Unusable.

amitdo on 25 Apr 2017

It will be helpful to compare the pasted text of these files to Tesseract's text renderer output, to see if each issue is really caused by the pdf renderer or by the ocr engine itself.

amitdo on 25 Apr 2017

I think we should also compare to the current (4.00 with lstm) pdf output, without your patch.
The original issue, reversed letters, was with the Adobe pdf viewer, not the other viewers.

amitdo on 25 Apr 2017

@christophered

The source of the 3 first mistakes in your 'Repetitive mistakes' section is the ocr engine itself, not the pdf renderer.

#648 is a more suitable place to report them.

( لا ) vs ( ال ) is a known issue. See https://github.com/tesseract-ocr/tesseract/issues/648#issuecomment-285633162 and the comments below it.

amitdo on 25 Apr 2017

some rare cases of multiple combined words

It's not clear if the source for this issue is the pdf renderer or the ocr engine itself.

amitdo on 25 Apr 2017

I discovered that the Sentence disorder is caused by Chrome which I used to view the PDF.
Note that after using Windows Reader the PDF was viewed correctly without the disorder that I mentioned.

ghost on 25 Apr 2017

@christophered,
We want it to display correctly in all major PDF viewers, not just one.

amitdo on 25 Apr 2017

Note that after using Windows Reader the PDF was viewed correctly

Which Windows version?

If you have Windows 10, try to open the pdf file with the Edge browser, and report how it is displayed there.

amitdo on 25 Apr 2017

I am using Windows 8

ghost on 25 Apr 2017

Amit, the PDF displays the original image only, so lookswise it will be the same. It is the text layer, as copied or saved, which is different.

I can test on Windows 10 and post the result. Someone else will have to tell if it is OK or not.

Even with legacy Devanagari fonts that use the Latin range, I have found that copied text is different between Adobe Reader and Foxit Reader.

excuse the brevity, sent from mobile

Shreeshrii on 25 Apr 2017

Me:

If you have Windows 10, try to open the pdf file with the Edge browser, and report how it is displayed there.

@Shreeshrii :

Amit, the PDF displays the original image only, so lookswise it will be the same. It is the text layer, as copied or saved which is different.

Yes. Although I said 'displayed', I meant to refer to the "invisible text layer underneath the visible image, which can be copied and pasted to text editor".

I just made a shortcut, assuming people will understand what I really meant.
Maybe I should add "excuse the brevity, sent from my PC" to my comments...
:rofl:

amitdo on 25 Apr 2017

Thank you for the reports, and sorry for any and all confusion. I'm going to post what Tesseract does right now (CONTROL) and also repost the proposed modification (EXPERIMENT). I am especially interested in regressions, anywhere EXPERIMENT does worse than CONTROL.

CONTROL
2.pdf
ara.pdf
heb_mivne.pdf
heb-tilt.pdf
simplest.pdf

EXPERIMENT (repost)
2.pdf
ara.pdf
heb_mivne.pdf
heb-tilt.pdf
simplest.pdf

jbreiden on 25 Apr 2017

@amitdo funny joke :)

ghost on 26 Apr 2017

I downloaded ara.pdf (EXPERIMENT) version and opened in Adobe Reader XI and Foxit Reader 8.1 under Windows 10 and copied and pasted the text in Notepad++ under Windows 10.

Both text files look different - with text order being opposite of each other. Please see attached.

ara-error
ARA-ADOBE-XI.txt

ARA-FOXIT-8.1.txt

Shreeshrii on 26 Apr 2017

Additionally, on windows 10, in both Edge as well as the windows 10 pdf reader I am not able to copy text.

Text copied in Chrome version is similar to Foxit, but with additional line breaks. @christophered can check whether the order is also changed.

ara-chrome-error
ara-chrome.txt

Shreeshrii on 26 Apr 2017

Windows 10 - Internet Explorer - text matches output from Adobe Reader XI.

ara-internet-explorer.txt

Shreeshrii on 26 Apr 2017

@Shreeshrii ARA-FOXIT-8.1.txt is the most adequate one in terms of sentence organization.

It seems that Chrome splits the sentence into half, each at a new line.

ghost on 26 Apr 2017

I only checked the straight version of the Hebrew document.

The control is much worse with Evince and pdf.js.
Hello -> olleH (pdf.js) / o l l e H (Evince)

Chromium: The two look the same, except the two last lines.
Neither is good, but the experiment is worse than the control.
The digits in the zip code and the phone numbers are in the wrong order.
123-4567890 -> 0987654-321
The date on top 21.07.2009 is OK.

I will check the skewed version later.

amitdo on 26 Apr 2017

Skewed version, Chromium:

Both have wrong word order in each line (but not exactly the same).

Experiment:
The second word in the document is missing.
Two separate lines become one. Hebrew line + English line (site address).
Last lines - same issue as the straight version.

amitdo on 26 Apr 2017

@amitdo
Using the latest Tesseract 4.0 alpha and the latest best Arabic model, I created a searchable pdf output:

When using Chrome to view the pdf, the text can be selected/copied/pasted correctly (RTL).
When using Adobe Acrobat Reader 17.012 (latest to date), the text is displayed correctly, but when selected and pasted it comes out reversed (LTR).

ghost on 28 Aug 2017

@amitdo
Using the latest ABBYY FineReader 14 to create a searchable pdf:

Both Chrome and Adobe Acrobat Reader can select/copy/paste correctly.

Conclusion:
It seems that Tesseract needs tweaking to solve this problem.

Original Image.zip
Tesseract.pdf
Abby Finereader.pdf

ghost on 28 Aug 2017

Looking at this again. Slowly losing the remainder of my sanity.

jbreiden on 22 Sep 2017

It seems that Tesseract needs tweaking to solve this problem.

The patch was not applied yet, so the original issue still exists.

Looking at this again. Slowly losing the remainder of my sanity.

It could be worse if you were rapidly losing your sanity :-)

You are using a simple reverse here. That's not good enough for bidi text.
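
The digit problem reported earlier (123-4567890 becoming 0987654-321) shows why. Below is a deliberately simplified Python sketch: after reversing a line, it flips runs of digits and Latin letters back, so that numbers stay readable. It is not an implementation of the Unicode Bidirectional Algorithm (UAX #9), and the function names and the regular expression are assumptions made for this illustration only.

```python
import re

# A run of digits/Latin letters, optionally joined by common separators.
# This is a rough stand-in for the "left-to-right run" concept in UAX #9.
_LTR_RUN = re.compile(r"[0-9A-Za-z]+(?:[-./:][0-9A-Za-z]+)*")


def naive_reverse(text: str) -> str:
    """The simple reverse: every character changes place, digits included."""
    return text[::-1]


def reverse_keep_ltr_runs(text: str) -> str:
    """Reverse the line, then restore the internal order of each LTR run."""
    return _LTR_RUN.sub(lambda m: m.group(0)[::-1], text[::-1])


# Two Hebrew letters (TET, LAMED) followed by a phone-like number,
# in logical order and written with escapes.
line = "\u05d8\u05dc 123-4567890"

print(naive_reverse("123-4567890"))
print(reverse_keep_ltr_runs("123-4567890"))
print(reverse_keep_ltr_runs(line) == "123-4567890 \u05dc\u05d8")
```

A real fix would need proper bidi run detection, since mixed Hebrew and English lines involve more cases than this pattern covers.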

amitdo on 22 Sep 2017

Any luck fixing this issue yet?!

mesaleh on 18 Dec 2017

Don't apply the patch, because it didn't work very well.

jbreiden on 6 Feb 2018

wikipedia says:

If Tesseract is used to process right-to-left text such as Arabic or Hebrew, the results are ordered as though it is left-to-right text.[11]

Is this still true for current version?

Shreeshrii on 15 Mar 2018

PDF output in Tesseract is and has always been reading order.

jbreiden on 15 Mar 2018

@jbreiden So, should I delete that line on the wikipedia page?

Shreeshrii on 15 Mar 2018

What is the status of this issue? Is there still a problem which should be fixed for Tesseract 4.0.0?

stweil on 17 Sep 2018

Status is still unsolved, and currently inactive. I swallowed my pride and experimented with writing Arabic backwards in PDF like everyone else does, and it still didn't work nicely.

jbreiden on 17 Sep 2018

Correction: * currently inactive *. I still think it is important though.

jbreiden on 17 Sep 2018

This sounds as if there will not be a fix in the near future. So we should not require that this bug must be fixed for 4.0.0.

stweil on 17 Sep 2018

Is this issue resolved?

MalekBadi on 21 Apr 2019

@amitdo
Using the latest ABBYY FineReader 14 to create a searchable pdf:
* Both Chrome and Adobe Acrobat Reader can select/copy/paste correctly.
Conclusion:
It seems that Tesseract needs tweaking to solve this problem.

Original Image.zip
Tesseract.pdf
Abby Finereader.pdf

@jbreiden I have experimented with the files he attached, and I noticed something that actually makes sense: the two files seem to have different text orientation settings, which shows when I start selecting mid-sentence and drag over a few lines:
Tesseract:

ABBYY:

yregaieg on 30 May 2019

Not sure if this is of any value: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
Page 598 (606 / 756) has a table describing writing mode:

It seems to be possible to specify the writing direction as RTL using this parameter on a structure element and all its child elements. @jbreiden can you have a look at it?

Section 14.8.2.3.3 could be related to this issue as well. (I see that you had a look at this section 3 years ago in https://github.com/tesseract-ocr/tesseract/issues/238#issuecomment-230928137 so it is probably not really what is causing our issue.)

yregaieg on 30 May 2019

I would like to report that the problem still persists in Tesseract 4.1.1.
Tesseract recognizes and displays Arabic text correctly. However, when the results are exported as PDF/A, the text stored in the PDF/A is reversed.

Yes, if you open the PDF in Acrobat, it will give you reversed words, and it will work fine in the Google Chrome PDF reader. However, when I extracted the stored text from the PDF/A using pdfToText, the words were reversed too, which means the text was stored in the wrong order.

See the following example for more details:

Here is the PDF/A generated by Tesseract
Recognized_PDFA_By_Tesseract.pdf

To summarize:

True Text
مرحبا بكم جميعا
اللغة العربية

Tesseract Text 100% correct
مرحبا بكم جميعا
اللغة العربية

Tesseract PDF/A Text
اعيمج مكب ابحرم
ةيبرعلا ةغللا

As you see in the Tesseract PDF/A text, every word is reversed although the .hOCR file is correct.

hOCR

Actually, the words are not reversed (you still can read every letter) but the "entire line is mirrored". Usually, we face this problem when rendering Arabic text in HTML by setting "text-align:right"

I think, the problem here is that the x-coord of each RTL letter is rendered by measuring x from left rather than right i.e., (x,y) should be (W-x, y) where W is the page width.
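
The coordinate change suggested above can be written down in a few lines. This Python sketch only illustrates the commenter's hypothesis, which the thread does not confirm as the cause; `mirror_box` is a hypothetical helper, and the numbers are arbitrary.

```python
def mirror_x(x: float, page_width: float) -> float:
    """Measure a horizontal position from the right edge instead of the left."""
    return page_width - x


def mirror_box(x1: float, x2: float, page_width: float) -> tuple:
    """Mirror a horizontal extent; the old right edge becomes the new left edge."""
    return mirror_x(x2, page_width), mirror_x(x1, page_width)


# A word spanning x = 10..40 on a page that is 100 units wide.
print(mirror_box(10, 40, 100))
```

Note that mirroring swaps which edge comes first, so the width of the box is preserved while its position flips.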

ReactNativeFan on 10 Jun 2020

Tesseract: Arabic language (right to left in writing) stored (left to right) after create PDF Searchable
