Tesseract: preserve_interword_spaces option not working on 4.00alpha

Created on 23 Mar 2017 · 14 Comments · Source: tesseract-ocr/tesseract

I am running the same command line as for the 3.x versions with

-c preserve_interword_spaces=1

as option. The resulting text does not preserve the white spaces, which it correctly did for the 3.x version.

Thanks
Andre

All 14 comments

I am also facing the same problem. After using "preserve_interword_spaces", there is actually no significant difference noticed between the normal OCR and "preserve_interword_spaces=1" parameterized OCR.
Is there any solution?

Same problem here. "preserve_interword_spaces" has no effect.
Also many other parameters do not work in v4.0 :/
Would be nice to get some feedback from the developers (which parameter works/which not)

I have the same problem in version 4.0. I have tried with version 3.0.2 - It does not have this option.

I also noticed this problem, and it seems related to the trained data: if I use Tesseract 4 with the 3.05 trained data, I do get the interword spaces.

#!/bin/sh
export PATH=$HOME/local/tesseract/bin:$PATH
export LD_LIBRARY_PATH=$HOME/local/tesseract/lib:$LD_LIBRARY_PATH
#ubuntu tesseract 3 trained data
export TESSDATA_PREFIX=/usr/share/tesseract-ocr
#export TESSDATA_PREFIX=$HOME/local/tesseract/share/tessdata
# "$@" forwards each argument intact, even if a file name contains spaces
tesseract "$@"

#tesseract -l spa -psm 4 $1 scanned
#mytesseract -c preserve_interword_spaces=1 -l spa  $1 scanned

I also noticed this problem, and it seems related to the trained data: if I use Tesseract 4 with the 3.05 trained data, I do get the interword spaces.

With 4.00, if you don't pass the --oem option, the default OEM is used:

3 Default, based on what is available.

The traineddata files for 3.05 do not contain LSTM data, so OEM 3 in your case is equivalent to:

0 Original Tesseract only.
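Since the default engine depends on what the traineddata file contains, one way to avoid the ambiguity is to pass --oem explicitly. Here is a minimal Python sketch that assembles such a command line. The helper name build_tesseract_args and the file names page.png and out are hypothetical; the flags are the ones mentioned in this thread.

```python
def build_tesseract_args(image, output_base, lang="eng", oem=1, psm=6,
                         preserve_spaces=True):
    """Return an argv list for the tesseract CLI with an explicit engine mode.

    Hypothetical helper for illustration; it only builds the argument list.
    """
    args = ["tesseract", image, output_base,
            "--oem", str(oem),   # 1 = LSTM only, 0 = original Tesseract only
            "--psm", str(psm),   # 6 = assume a single uniform block of text
            "-l", lang]
    if preserve_spaces:
        args += ["-c", "preserve_interword_spaces=1"]
    return args


if __name__ == "__main__":
    # page.png and out are placeholder names, not files from this issue.
    print(" ".join(build_tesseract_args("page.png", "out")))
```

The returned list can be handed to subprocess.run as-is, which avoids shell quoting problems with file names.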

Thanks for pointing that out!
I dug into the Tesseract source code and found a partial fix for the preserve_interword_spaces problem. It seems that when the words are transferred in ccstruct/pageres.cpp, the spaces are not transferred with them; see the patch below.
With this patch the output is almost identical to that of version 3.05, except for the missing spaces in the first column (more research is needed to see where the first/second word is transferred and why the spaces/blanks are not).

@@ -1329,11 +1329,11 @@ void PAGE_RES_IT::ReplaceCurrentWord(
   WERD_RES* input_word = word();
   // Set the BOL/EOL flags on the words from the input word.
   if (input_word->word->flag(W_BOL)) {
     (*words)[0]->word->set_flag(W_BOL, true);
   } else {
-    (*words)[0]->word->set_blanks(1);
+    (*words)[0]->word->set_blanks(input_word->word->space());
   }
   words->back()->word->set_flag(W_EOL, input_word->word->flag(W_EOL));

   // Move the blobs from the input word to the new set of words.
   // If the input word_res is a combination, then the replacements will also be
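To make the effect of the one-line change easier to see, here is a toy Python model of the logic in the patch. It is not Tesseract code: the Word class and replace_current_word function are hypothetical stand-ins for WERD and PAGE_RES_IT::ReplaceCurrentWord, and only the blank-count handling is modelled.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Toy stand-in for a Tesseract word (hypothetical, for illustration)."""
    text: str
    blanks: int = 1    # number of spaces preceding the word
    bol: bool = False  # True if the word begins a line


def replace_current_word(input_word, replacements, patched=True):
    """Mimic how the first replacement word inherits spacing."""
    first = replacements[0]
    if input_word.bol:
        # Begin-of-line branch: the patch leaves this branch unchanged.
        first.bol = True
    else:
        # Before the patch: always one blank. After: copy the original count.
        first.blanks = input_word.blanks if patched else 1
    return replacements


if __name__ == "__main__":
    original = Word("3", blanks=35)
    print(replace_current_word(original, [Word("3")], patched=False)[0].blanks)
    print(replace_current_word(original, [Word("3")], patched=True)[0].blanks)
```

In this model the unpatched path collapses any gap to a single blank, which matches the symptom reported above; the patched path carries the original gap over.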

Please create a PR.


Checked with the proposed PR.

toc-eng.txt
toc-eng-space.txt
toc

Spaces are preserved now when using tesseract with --oem 1 --psm 6 -l best/eng -c preserve_interword_spaces=1

Here is the output:

1 First chapter                                   3
1.1 Section One                                 3
1.2 Section Two                                   3
1.3 Section Three                                  3

2 Last chapter                                     5
2.1 Section One                                 5
22 Section Two                                   5
2.3 Section Three                                  5
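With the gaps preserved, the columns can be separated again downstream. A minimal Python sketch, assuming that a run of two or more spaces marks a column boundary (the split_columns helper and the min_gap threshold are hypothetical choices, not anything Tesseract provides):

```python
import re


def split_columns(line, min_gap=2):
    """Split an OCR text line into columns on runs of >= min_gap spaces."""
    pattern = r" {%d,}" % min_gap
    return [cell for cell in re.split(pattern, line.strip()) if cell]


if __name__ == "__main__":
    line = "1 First chapter                                   3"
    print(split_columns(line))
```

Without preserve_interword_spaces=1 the same line would contain only single spaces, so the title and the page number could not be told apart this way.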

Can't reproduce @Shreeshrii :/
I tried it with the python tesserocr wrapper like so:

from PIL import Image
from tesserocr import OEM, PSM, PyTessBaseAPI

# init=False defers initialization so that InitFull can receive the variables
tess = PyTessBaseAPI(init=False, lang="eng", psm=PSM.SINGLE_BLOCK)
settings = {"preserve_interword_spaces": "1"}
tess.InitFull(lang="eng", oem=OEM.LSTM_ONLY, variables=settings)
tess.SetSourceResolution(300)
tess.SetPageSegMode(PSM.SINGLE_BLOCK)
img = Image.open("image.jpg")
tess.SetImage(img)
print(tess.GetUTF8Text())

prints the text without preserved spaces :/

Also tried it through the command line as follows:

tesseract image.jpg --tessdata_dir /usr/local/share/tessdata --oem 1 --psm 6 -l eng -c preserve_interword_spaces=1

Gives me this error:

tesseract: genericvector.h:713: T& GenericVector<T>::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.

I use the .traineddata files from https://github.com/tesseract-ocr/tessdata/tree/master/best
and latest leptonica & tesseract
Any advice please? :)

Thanks, it works!

@zdenop You can close this. Thanks!

It seems to me that you have to upgrade to Tesseract 5.0 to get this fix. But I am on a Mac, and with brew install tesseract I cannot upgrade to Tesseract 5.0.

@kylefoley76 The spaces at beginning of line are still not being recognized. See the image and result in https://github.com/tesseract-ocr/tesseract/issues/781#issuecomment-328490156
