Tesseract: TESSDATA_PREFIX and tessdata-dir

Created on 15 Sep 2017 · 17 Comments · Source: tesseract-ocr/tesseract

Since there are now three possible locations of tessdata files,

https://github.com/tesseract-ocr/tessdata_best
https://github.com/tesseract-ocr/tessdata_fast
and
https://github.com/tesseract-ocr/tessdata

clarify the usage of TESSDATA_PREFIX and tessdata-dir.

Related: https://github.com/tesseract-ocr/tesseract/commit/e66d43390782f056b9be6e4aee4bf35c214a2f2d#diff-c2f87d92d6aa4f0f542b36a6e5c41161

- @param argv0 - paths to the directory with language files and config files.
- - An actual value of argv0 is used if not NULL, otherwise TESSDATA_PREFIX is
- - used if not NULL, next try to use compiled in -DTESSDATA_PREFIX. If previous
- - is not successful - use current directory.
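
The lookup order described in that comment can be sketched as follows. This is only an illustration of the stated order, not Tesseract's actual code: the function name `resolve_tessdata_dir` and the parameter `compiled_in_prefix` are hypothetical, the latter standing in for whatever value was baked in at build time with -DTESSDATA_PREFIX.

```python
import os


def resolve_tessdata_dir(argv0=None, environ=None, compiled_in_prefix=None):
    """Return the first usable location, in the order the comment lists.

    All names here are illustrative assumptions, not Tesseract identifiers.
    """
    env = os.environ if environ is None else environ

    # 1. An explicit value passed by the caller wins.
    if argv0:
        return argv0

    # 2. Otherwise fall back to the TESSDATA_PREFIX environment variable.
    env_prefix = env.get("TESSDATA_PREFIX")
    if env_prefix:
        return env_prefix

    # 3. Otherwise use the default compiled in at build time, if any.
    if compiled_in_prefix:
        return compiled_in_prefix

    # 4. Last resort: the current directory.
    return "."
```

Passing the environment as a plain dictionary keeps the sketch easy to try without touching the real process environment.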

Labels: awaiting feedback, documentation, question

Shreeshrii

Most helpful comment

Should tessdata_best and tessdata_fast be Git submodules of tessdata to support language options like -l eng (old model), -l best/eng (best LSTM model) or -l fast/eng (fast LSTM model)? Then only a single tessdata directory is needed for installations, and it would be easier to document the relationship between all three repositories (think of new versions). Locally trained models could easily be added in additional subdirectories and used like -l local/eng or -l user/eng.

stweil on 15 Sep 2017
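
To make the proposal above concrete, here is a minimal sketch of how a language option with a subdirectory prefix could map onto a single tessdata directory. It illustrates the suggestion only; Tesseract is not claimed to resolve `-l best/eng` this way, and the helper name `traineddata_path` is hypothetical.

```python
from pathlib import PurePosixPath


def traineddata_path(tessdata_dir, lang):
    """Map a language spec such as 'eng' or 'best/eng' to a traineddata path.

    Hypothetical helper: it assumes the proposed layout where best/, fast/
    and local/ are subdirectories of one tessdata directory.
    """
    return PurePosixPath(tessdata_dir) / f"{lang}.traineddata"


for spec in ("eng", "best/eng", "fast/eng", "local/eng"):
    print(spec, "->", traineddata_path("/usr/share/tessdata", spec))
```

The directory `/usr/share/tessdata` is just an example location.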

All 17 comments

stweil on 15 Sep 2017

Should tessdata_best and tessdata_fast be Git submodules of tessdata

I remember reading a comment from @theraysmith explaining why using best/eng etc. will not work; it had something to do with sublanguages being invoked via the config file.

Shreeshrii on 15 Sep 2017

"config file" is a good keyword:

Will Tesseract continue to use the same configuration files for standard, fast and best traineddata? Then having a single tessdata directory would be better. The handling of "sublanguages" that are invoked from config files could be fixed in the code or in the config files.

If Tesseract needs different configuration files for standard, fast and best traineddata, separate tessdata directories will be required.

stweil on 15 Sep 2017

I wonder whether the current approach with "best" and "fast" traineddata is reasonable: both contain basically the same data, only the LSTM model in the traineddata files differs. Some numbers for best/Latin.traineddata:

# Component size / MiB
12      Latin.lstm
1       Latin.lstm-number-dawg
1       Latin.lstm-punc-dawg
1       Latin.lstm-recoder
1       Latin.lstm-unicharset
85      Latin.lstm-word-dawg
1       Latin.version
97      total

fast/Latin.traineddata is identical with one exception:

# Component size / MiB
1       Latin.lstm

So it would also be possible to modify Tesseract to get both kinds of Latin.lstm from the same traineddata file (using a new component name like Latin.lstm-fast) and select the desired one with a new command line option. That would avoid the duplication of the other data and simplify the handling while increasing the size of the traineddata only by a small amount.

97      best/Latin.traineddata
86      fast/Latin.traineddata

A combined traineddata file that includes both the best and the fast model could be zipped and would then be much smaller:

51      best/Latin.zip

stweil on 15 Sep 2017
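
The size argument above can be checked with a few lines of arithmetic. The sketch below uses only the MiB figures quoted in the comment, which appear to be rounded, so the results are approximate; the function name `combined_savings` is hypothetical.

```python
def combined_savings(best_total, fast_total, fast_lstm):
    """Compare shipping two traineddata files against one combined file.

    The combined file is assumed to be the "best" file plus the fast LSTM
    model as one extra component, as suggested in the comment.
    """
    separate = best_total + fast_total
    combined = best_total + fast_lstm
    return separate, combined, separate - combined


# Figures for Latin.traineddata in MiB, taken from the comment above.
separate, combined, saved = combined_savings(best_total=97, fast_total=86, fast_lstm=1)
print(f"separate: {separate} MiB, combined: {combined} MiB, saved: {saved} MiB")
# prints "separate: 183 MiB, combined: 98 MiB, saved: 85 MiB"
```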

Just quoting Ray about 'best' and 'fast':

2 parallel sets of tessdata. "best" and "fast". "Fast" will exceed the speed of legacy Tesseract in real time, provided you have the required parallelism components, and in total CPU only slightly slower for English. Way faster for most non-latin languages, while being <5% worse than "best". Only "best" will be retrainable, as "fast" will be integer.

amitdo on 15 Sep 2017

I now did a complete comparison of the extracted "best" and "fast" traineddata files. Besides the lstm and version parts, they are identical, but "best" includes these additional files:

ara.config
ben.config
chi_sim.config
chi_sim_vert.config
chi_tra.config
chi_tra_vert.config
deu.config
ell.config
hin.config
ita.config
jpn.config
jpn_vert.config
kan.config
kor.config
mal.config
mar.config
nep.config
srp.config
tam.config
tel.config
tha.config
vie.config

It's not clear why these parts exist in "best", and even the first one, ara.config, looks wrong:

# We do not yet have Tesseract for Arabic, so use OEM_CUBE_ONLY
# (see OcrEngineMode enum in third_party/tesseract/ccmain/tesseractclass.h).
tessedit_ocr_engine_mode        1
[...]

stweil on 15 Sep 2017
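
A comparison like the one described above could be scripted roughly as follows. This is a sketch under assumptions: it expects the components of both traineddata sets to have been unpacked beforehand into two directories (for example with the combine_tessdata tool), and the function names `file_digests` and `compare_components` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digests(directory):
    """Map each file name in a directory to the SHA-256 digest of its content."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in Path(directory).iterdir()
        if path.is_file()
    }


def compare_components(best_dir, fast_dir):
    """Report components that exist on one side only or differ in content."""
    best = file_digests(best_dir)
    fast = file_digests(fast_dir)
    shared = best.keys() & fast.keys()
    return {
        "only_in_best": sorted(best.keys() - fast.keys()),
        "only_in_fast": sorted(fast.keys() - best.keys()),
        "different": sorted(name for name in shared if best[name] != fast[name]),
    }
```

With the findings reported in the comment, the config files would show up under "only_in_best" and the lstm and version parts under "different".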

# We do not yet have Tesseract for Arabic, so use OEM_CUBE_ONLY
# (see OcrEngineMode enum in third_party/tesseract/ccmain/tesseractclass.h).

This comment should be updated

tessedit_ocr_engine_mode 1

This is still ok for ara.

amitdo on 15 Sep 2017
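
For readers unfamiliar with these config parts, the excerpt above is a plain list of "name value" lines with "#" comment lines. A minimal parser for such an excerpt, assuming only that layout, might look like this; `parse_config` is a hypothetical helper, not part of Tesseract.

```python
def parse_config(text):
    """Parse 'name value' lines into a dict, skipping blank and '#' lines.

    Assumes only the simple layout visible in the ara.config excerpt.
    """
    params = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        params[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return params


excerpt = """\
# We do not yet have Tesseract for Arabic, so use OEM_CUBE_ONLY
tessedit_ocr_engine_mode        1
"""
print(parse_config(excerpt))
# prints "{'tessedit_ocr_engine_mode': '1'}"
```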

Shouldn't all fast and best traineddata files have this:

tessedit_ocr_engine_mode 1

in their config?

amitdo on 15 Sep 2017

Tesseract chooses the correct mode automatically.

There are 161 tessdata files in tessdata_best, but only 28 of them contain a config part.
deu.traineddata loads frk.traineddata. That looks wrong to me. Maybe other config files should not be there, either.

stweil on 15 Sep 2017

Tesseract chooses the correct mode automatically.

Yeah, I know, but oem 1 is more explicit.

amitdo on 15 Sep 2017

@stweil You raise some very good points, both about data redundancy and about git submodules. Ray told me he's going to think about it over the weekend.