Tesseract: preserve_interword_spaces option not working on 4.00alpha

Created on 23 Mar 2017 · 14 Comments · Source: tesseract-ocr/tesseract

I am running the same command line as for the 3.x versions, with

-c preserve_interword_spaces=1

as an option. The resulting text does not preserve the whitespace, which the 3.x version did correctly.

Thanks
Andre

abieler

All 14 comments

I am also facing the same problem. With "preserve_interword_spaces=1" set, the output shows no significant difference from normal OCR.
Is there any solution?

MoinulJoje on 29 Mar 2017

Same problem here. "preserve_interword_spaces" has no effect.
Many other parameters do not work in v4.0 either :/
It would be nice to get some feedback from the developers on which parameters work and which do not.

mm-manu on 16 May 2017

I have the same problem in version 4.0. I have tried with version 3.0.2 - It does not have this option.

s-asish on 24 Jun 2017

I also noticed this problem, and it seems related to the trained data: if I use tesseract 4 with the trained data of 3.05, I do get the interword spaces.

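#!/bin/sh
# Wrapper that runs a locally built tesseract against a chosen tessdata directory.
export PATH="$HOME/local/tesseract/bin:$PATH"
export LD_LIBRARY_PATH="$HOME/local/tesseract/lib:$LD_LIBRARY_PATH"
# Ubuntu tesseract 3 trained data
export TESSDATA_PREFIX=/usr/share/tesseract-ocr
# Trained data installed with the local build (disabled)
#export TESSDATA_PREFIX="$HOME/local/tesseract/share/tessdata"
# "$@" forwards every argument unchanged, even if it contains spaces.
tesseract "$@"

# Example invocations:
#tesseract -l spa -psm 4 $1 scanned
#mytesseract -c preserve_interword_spaces=1 -l spa  $1 scanned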

mingodad on 14 Aug 2017

> I also noticed this problem and it seems related with the trained data, because if I use tesseract 4 with the trained data of 3.05 I do get the interword spaces.

With 4.00, if you don't use the --oem option, the default OEM is used:

3 Default, based on what is available.

The traineddata files for 3.05 do not contain LSTM data, so OEM 3 in your case is equivalent to:

0 Original Tesseract only.
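
Because the default engine depends on what the traineddata file contains, one way to rule out this ambiguity when testing is to pass --oem explicitly. The following is a minimal sketch that only assembles the command line. The helper name build_tesseract_command and its defaults are made up for illustration; the flags are the ones quoted in this thread, and the output base argument follows the usual "tesseract imagename outputbase [options]" form.

```python
import shlex


def build_tesseract_command(image, output_base, lang="eng", oem=1, psm=6,
                            preserve_spaces=True):
    """Return an argv list for the tesseract CLI with an explicit engine mode.

    Hypothetical helper: it only builds the argument list and does not
    run tesseract or check that the requested engine data is installed.
    """
    cmd = ["tesseract", image, output_base,
           "--oem", str(oem), "--psm", str(psm), "-l", lang]
    if preserve_spaces:
        # Config variables are passed as "-c name=value".
        cmd += ["-c", "preserve_interword_spaces=1"]
    return cmd


if __name__ == "__main__":
    # Print a shell-quoted command line that can be copied into a terminal.
    print(shlex.join(build_tesseract_command("image.jpg", "out")))
```

The resulting list can be handed to subprocess.run unchanged, which avoids shell quoting problems with file names that contain spaces.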

amitdo on 14 Aug 2017

Thanks for pointing that out!
I dived into the tesseract source code and found a near solution to the preserve_interword_spaces problem: it seems that when the words are transferred in ccstruct/pageres.cpp, the spaces are not transferred; see the patch below.
With this patch the output is almost identical to the 3.05 version, except for the missing spaces in the first column (more research is needed to see where the first/second word is transferred and why the spaces/blanks are not).

@@ -1329,11 +1329,11 @@ void PAGE_RES_IT::ReplaceCurrentWord(
   WERD_RES* input_word = word();
   // Set the BOL/EOL flags on the words from the input word.
   if (input_word->word->flag(W_BOL)) {
     (*words)[0]->word->set_flag(W_BOL, true);
   } else {
-    (*words)[0]->word->set_blanks(1);
+    (*words)[0]->word->set_blanks(input_word->word->space());
   }
   words->back()->word->set_flag(W_EOL, input_word->word->flag(W_EOL));

   // Move the blobs from the input word to the new set of words.
   // If the input word_res is a combination, then the replacements will also be
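
To make the effect of the one-line change easier to see, here is a toy model in Python. It is not Tesseract code: the Word class, replace_word, and render are invented for illustration, and the handling of line-initial words is a deliberate simplification. It only shows the difference between resetting the blank count to 1 and carrying over the count from the word being replaced.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    blanks: int = 1     # number of spaces preceding this word
    bol: bool = False   # True if the word starts a line


def replace_word(input_word, new_texts, carry_blanks):
    """Replace one word by one or more new words.

    carry_blanks=False models the old behaviour (blank count reset to 1);
    carry_blanks=True models the patched behaviour (blank count copied
    from the word being replaced).
    """
    words = [Word(t) for t in new_texts]
    if input_word.bol:
        words[0].bol = True
    else:
        words[0].blanks = input_word.blanks if carry_blanks else 1
    return words


def render(words):
    """Join words into a line; in this toy model a line-initial word gets no leading blanks."""
    return "".join(("" if w.bol else " " * w.blanks) + w.text for w in words)


if __name__ == "__main__":
    first, page = Word("Chapter", bol=True), Word("3", blanks=10)
    print(repr(render([first] + replace_word(page, ["3"], carry_blanks=False))))
    print(repr(render([first] + replace_word(page, ["3"], carry_blanks=True))))
```

With carry_blanks=False the wide gap before the page number collapses to a single space; with carry_blanks=True it survives the replacement.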

mingodad on 18 Aug 2017

Please create a PR.


Shreeshrii on 21 Aug 2017

Checked with the proposed PR.

toc-eng.txt
toc-eng-space.txt
toc

Spaces are preserved now when using tesseract with --oem 1 --psm 6 -l best/eng -c preserve_interword_spaces=1

Here is the output:

1 First chapter                                   3
1.1 Section One                                 3
1.2 Section Two                                   3
1.3 Section Three                                  3

2 Last chapter                                     5
2.1 Section One                                 5
22 Section Two                                   5
2.3 Section Three                                  5
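
One practical reason to want the preserved spacing is that column-like output such as the above can be split back into fields. A minimal sketch, assuming that a run of two or more spaces marks a column boundary (split_columns and the min_gap threshold are illustrative choices, not part of Tesseract):

```python
import re


def split_columns(line, min_gap=2):
    """Split a line into columns wherever at least min_gap consecutive spaces occur."""
    pattern = r" {%d,}" % min_gap
    return [col for col in re.split(pattern, line.strip()) if col]


if __name__ == "__main__":
    print(split_columns("1.1 Section One                                 3"))
```

Without preserve_interword_spaces the same line would contain only single spaces, and the title and page number could not be separated this way.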

Shreeshrii on 11 Sep 2017

Can't reproduce @Shreeshrii :/
I tried it with the python tesserocr wrapper like so:

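from PIL import Image
from tesserocr import OEM, PSM, PyTessBaseAPI

# Create the API uninitialized so the variable can be passed at init time.
tess = PyTessBaseAPI(init=False, lang="eng", psm=PSM.SINGLE_BLOCK)
settings = {"preserve_interword_spaces": "1"}
tess.InitFull(lang="eng", oem=OEM.LSTM_ONLY, variables=settings)
tess.SetPageSegMode(PSM.SINGLE_BLOCK)
img = Image.open("image.jpg")
tess.SetImage(img)
# Set the resolution after SetImage; it has no effect before an image is set.
tess.SetSourceResolution(300)
print(tess.GetUTF8Text())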

prints the text without preserved spaces :/

Also tried it through the command line as follows:

tesseract image.jpg --tessdata_dir /usr/local/share/tessdata --oem 1 --psm 6 -l eng -c preserve_interword_spaces=1

Gives me this error:

tesseract: genericvector.h:713: T& GenericVector<T>::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.

I use the .traineddata files from https://github.com/tesseract-ocr/tessdata/tree/master/best
and latest leptonica & tesseract
Any advice please? :)

mm-manu on 14 Sep 2017

The fix was applied in https://github.com/tesseract-ocr/tesseract/commit/e62e8f5f802c0d8f3dd67da993327cdafaee9763

Build the new version and check.

Shreeshrii on 15 Sep 2017

Thanks, it works!

mm-manu on 15 Sep 2017

@zdenop You can close this. Thanks!

Shreeshrii on 15 Sep 2017

It seems to me that you have to upgrade to tesseract 5.0 to get this solution to work. But I am on a Mac and installed with brew install tesseract, and I cannot upgrade to tesseract 5.0.

kylefoley76 on 19 Dec 2020

@kylefoley76 The spaces at the beginning of a line are still not being recognized. See the image and result in https://github.com/tesseract-ocr/tesseract/issues/781#issuecomment-328490156

Shreeshrii on 25 Dec 2020
