Tesseract: TESSDATA_PREFIX and tessdata-dir

Created on 15 Sep 2017 · 17 Comments · Source: tesseract-ocr/tesseract

Since there are now three possible locations of tessdata files,

https://github.com/tesseract-ocr/tessdata_best
https://github.com/tesseract-ocr/tessdata_fast
and
https://github.com/tesseract-ocr/tessdata

clarify the usage of TESSDATA_PREFIX and tessdata-dir.

Related: https://github.com/tesseract-ocr/tesseract/commit/e66d43390782f056b9be6e4aee4bf35c214a2f2d#diff-c2f87d92d6aa4f0f542b36a6e5c41161


    * @param argv0 - paths to the directory with language files and config files.
    * An actual value of argv0 is used if not NULL, otherwise TESSDATA_PREFIX is
    * used if not NULL, next try to use compiled in -DTESSDATA_PREFIX. If previous
    * is not successful - use current directory.
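The lookup order described in that comment can be sketched as follows. This is only an illustration in Python, not Tesseract's actual C++ code: the function name `resolve_datadir`, the `env` and `compiled_in` parameters, and the `COMPILED_IN_PREFIX` constant are all hypothetical, and the sketch treats an empty value as "not set" without checking whether the directory really exists.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Hypothetical stand-in for the value baked in at build time via -DTESSDATA_PREFIX.
COMPILED_IN_PREFIX: Optional[str] = None


def resolve_datadir(argv0: Optional[str],
                    env: Optional[Mapping[str, str]] = None,
                    compiled_in: Optional[str] = COMPILED_IN_PREFIX) -> Path:
    """Return the first candidate that is set, in the order the comment gives."""
    env = os.environ if env is None else env
    if argv0:                          # 1. explicit value passed by the caller
        return Path(argv0)
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:                         # 2. TESSDATA_PREFIX environment variable
        return Path(prefix)
    if compiled_in:                    # 3. compile-time default
        return Path(compiled_in)
    return Path(".")                   # 4. fall back to the current directory
```

With this sketch, `resolve_datadir(None, env={"TESSDATA_PREFIX": "/opt/tessdata_best"})` picks the environment variable, while any non-empty `argv0` overrides it.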



Labels: awaiting feedback, documentation, question

All 17 comments

Should tessdata_best and tessdata_fast be Git submodules of tessdata to support language options like -l eng (old model), -l best/eng (best LSTM model) or -l fast/eng (fast LSTM model)? Then only a single tessdata directory is needed for installations, and it would be easier to document the relationship between all three repositories (think of new versions). Locally trained models could easily be added in additional subdirectories and used like -l local/eng or -l user/eng.
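To make the proposal concrete, here is a minimal sketch of how a language option such as `best/eng` could map onto a single tessdata directory with subdirectories. It is an assumption about the proposed layout, not a description of what Tesseract does; the helper names `traineddata_path` and `traineddata_paths` are hypothetical.

```python
from pathlib import Path
from typing import List


def traineddata_path(tessdata_dir: str, lang: str) -> Path:
    """Map one language spec ('eng', 'best/eng', 'local/eng') to a file path."""
    spec = Path(lang)
    # Reject specs that would escape the tessdata directory, e.g. '../eng'.
    if spec.is_absolute() or ".." in spec.parts:
        raise ValueError(f"invalid language spec: {lang!r}")
    return Path(tessdata_dir) / f"{lang}.traineddata"


def traineddata_paths(tessdata_dir: str, langs: str) -> List[Path]:
    """Resolve a '+'-separated list of language specs, e.g. 'best/eng+fast/deu'."""
    return [traineddata_path(tessdata_dir, lang) for lang in langs.split("+")]
```

Under this layout, `traineddata_path("/usr/share/tessdata", "fast/eng")` would point at `/usr/share/tessdata/fast/eng.traineddata`, and a plain `eng` would still resolve to the top-level file.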

Should tessdata_best and tessdata_fast be Git submodules of tessdata

I remember reading a comment from @theraysmith explaining why using best/eng etc. will not work; it had something to do with sublanguages being invoked via the config file.

"config file" is a good keyword:

Will Tesseract continue to use the same configuration files for standard, fast and best traineddata? Then having a single tessdata directory would be better. Handling of "sublanguages" which are invoked could be fixed in the code or in the config files.

If Tesseract needs different configuration files for standard, fast and best traineddata, separate tessdata directories will be required.

I wonder whether the current approach with "best" and "fast" traineddata is reasonable: both contain basically the same data, only the LSTM model in the traineddata files differs. Some numbers for best/Latin.traineddata:

# Component size / MiB
12      Latin.lstm
1       Latin.lstm-number-dawg
1       Latin.lstm-punc-dawg
1       Latin.lstm-recoder
1       Latin.lstm-unicharset
85      Latin.lstm-word-dawg
1       Latin.version
97      total

fast/Latin.traineddata is identical with one exception:

# Component size / MiB
1       Latin.lstm

So it would also be possible to modify Tesseract to get both kinds of Latin.lstm from the same traineddata file (using a new component name like Latin.lstm-fast) and select the desired one with a new command line option. That would avoid the duplication of the other data and simplify the handling while increasing the size of the traineddata only by a small amount.

97      best/Latin.traineddata
86      fast/Latin.traineddata
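As a rough check of the "small amount" claim, the arithmetic can be spelled out with the rounded MiB figures quoted above. Because the inputs are rounded, the results are approximate; the variable names are only illustrative.

```python
# Sizes in MiB, rounded, as reported above for Latin.
best_total = 97   # best/Latin.traineddata
fast_total = 86   # fast/Latin.traineddata
fast_lstm = 1     # Latin.lstm from "fast", the only differing model component

separate = best_total + fast_total   # two independent traineddata files
combined = best_total + fast_lstm    # one file carrying both LSTM models
saved = separate - combined

print(separate, combined, saved)     # 183 98 85
```

So a combined file would be about 1 MiB larger than the "best" file alone, while shipping both files separately costs roughly 85 MiB more.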

A combined traineddata file that includes both the best and the fast model could also be zipped, which would make it much smaller:

51      best/Latin.zip

Just quoting Ray about 'best' and 'fast':

2 parallel sets of tessdata. "best" and "fast". "Fast" will exceed the speed of legacy Tesseract in real time, provided you have the required parallelism components, and in total CPU only slightly slower for English. Way faster for most non-Latin languages, while being <5% worse than "best". Only "best" will be retrainable, as "fast" will be integer.

I have now done a complete comparison of the extracted "best" and "fast" traineddata files. Apart from the lstm and version parts, they are identical, except that "best" includes these additional files:

ara.config
ben.config
chi_sim.config
chi_sim_vert.config
chi_tra.config
chi_tra_vert.config
deu.config
ell.config
hin.config
ita.config
jpn.config
jpn_vert.config
kan.config
kor.config
mal.config
mar.config
nep.config
srp.config
tam.config
tel.config
tha.config
vie.config
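A comparison like this can be reproduced with a short script. The sketch below assumes the components of both sets have already been unpacked into two directories (for instance with `combine_tessdata -u`); the function names `digest_tree` and `compare_trees` are hypothetical, and this is not necessarily how the comparison above was done.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def digest_tree(root: str) -> Dict[str, str]:
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    base = Path(root)
    return {
        p.relative_to(base).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in base.rglob("*") if p.is_file()
    }


def compare_trees(best_dir: str, fast_dir: str) -> Tuple[List[str], List[str], List[str]]:
    """Return (files only in best, files only in fast, files whose contents differ)."""
    best, fast = digest_tree(best_dir), digest_tree(fast_dir)
    only_best = sorted(best.keys() - fast.keys())
    only_fast = sorted(fast.keys() - best.keys())
    differing = sorted(n for n in best.keys() & fast.keys() if best[n] != fast[n])
    return only_best, only_fast, differing
```

If the observation above holds, the first list would contain the `.config` files, the second would be empty, and the third would contain only the `.lstm` and `.version` components.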

It's not clear why these parts exist in "best", and even the first one, ara.config, looks wrong:

# We do not yet have Tesseract for Arabic, so use OEM_CUBE_ONLY
# (see OcrEngineMode enum in third_party/tesseract/ccmain/tesseractclass.h).
tessedit_ocr_engine_mode        1
[...]

# We do not yet have Tesseract for Arabic, so use OEM_CUBE_ONLY
# (see OcrEngineMode enum in third_party/tesseract/ccmain/tesseractclass.h).

This comment should be updated

tessedit_ocr_engine_mode 1

This is still ok for ara.

Shouldn't all fast and best traineddata files have this:

tessedit_ocr_engine_mode 1

in their config?

Tesseract chooses the correct mode automatically.

There are 161 tessdata files in tessdata_best, but only 28 of them contain a config part.
deu.traineddata loads frk.traineddata. That looks wrong to me. Maybe some of the other config files should not be there either.

Tesseract chooses the correct mode automatically.

Yeah, I know, but oem 1 is more explicit.

@stweil You raise some very good points, both about data redundancy and about git submodules. Ray told me he's going to think about it over the weekend.

My Tesseract application always downloads the eng.traineddata.gz file from the internet. I want to make it work offline. How is that possible? Please help me.

@sabirhusssain, the right place to ask questions is our forum.

@stweil You have addressed tessdata related issues with the recent commits. Should this issue be closed now?

I suggest keeping it open.

You raise some very good points, both about data redundancy and about git submodules. Ray told me he's going to think about it over the weekend.

@jbreiden, @theraysmith, what was the result of your thoughts after that weekend?

@jbreiden, @theraysmith: any updates on this?

This would be a really good feature to have. Are there any updates?
