Since there are now three possible locations of tessdata files,
https://github.com/tesseract-ocr/tessdata_best
https://github.com/tesseract-ocr/tessdata_fast
and
https://github.com/tesseract-ocr/tessdata
please clarify the usage of TESSDATA_PREFIX and --tessdata-dir.
Related: https://github.com/tesseract-ocr/tesseract/commit/e66d43390782f056b9be6e4aee4bf35c214a2f2d#diff-c2f87d92d6aa4f0f542b36a6e5c41161
Should tessdata_best and tessdata_fast be Git submodules of tessdata to support language options like -l eng (old model), -l best/eng (best LSTM model) or -l fast/eng (fast LSTM model)? Then only a single tessdata directory is needed for installations, and it would be easier to document the relationship between all three repositories (think of new versions). Locally trained models could easily be added in additional subdirectories and used like -l local/eng or -l user/eng.
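To make the proposed layout concrete, here is a minimal Python sketch of how a language option such as best/eng or fast/eng could map to files below a single tessdata directory. The helper name resolve_traineddata and the directory layout are assumptions made for illustration; this is not how Tesseract itself resolves the -l option.

```python
from pathlib import Path


def resolve_traineddata(tessdata_dir, lang_spec):
    """Map a spec like 'best/eng+fast/deu' to traineddata paths (hypothetical helper)."""
    root = Path(tessdata_dir)
    paths = []
    # Several languages can be combined with '+', as in '-l eng+deu'.
    for lang in lang_spec.split("+"):
        parts = Path(lang).parts
        # Reject empty names, absolute paths and attempts to leave the tessdata directory.
        if not parts or Path(lang).is_absolute() or ".." in parts:
            raise ValueError(f"invalid language name: {lang!r}")
        # 'best/eng' becomes <tessdata_dir>/best/eng.traineddata.
        paths.append(root / f"{lang}.traineddata")
    return paths


if __name__ == "__main__":
    for path in resolve_traineddata("/usr/share/tessdata", "eng+best/eng+local/eng"):
        print(path)
```

With such a mapping, locally trained models in a local/ or user/ subdirectory need no special handling; they are just another subdirectory.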
Should tessdata_best and tessdata_fast be Git submodules of tessdata
I remember reading a comment from @theraysmith explaining why using best/eng etc. will not work: it had something to do with sublanguages being invoked via the config file.
"config file" is a good keyword:
Will Tesseract continue to use the same configuration files for standard, fast and best traineddata? Then having a single tessdata directory would be better. The handling of "sublanguages" that are invoked via the config file could be fixed in the code or in the config files.
If Tesseract needs different configuration files for standard, fast and best traineddata, separate tessdata directories will be required.
I wonder whether the current approach with "best" and "fast" traineddata is reasonable: both contain basically the same data, only the LSTM model in the traineddata files differs. Some numbers for best/Latin.traineddata:
Size (MiB)  Component
        12  Latin.lstm
         1  Latin.lstm-number-dawg
         1  Latin.lstm-punc-dawg
         1  Latin.lstm-recoder
         1  Latin.lstm-unicharset
        85  Latin.lstm-word-dawg
         1  Latin.version
        97  total
fast/Latin.traineddata is identical with one exception:
Size (MiB)  Component
         1  Latin.lstm
So it would also be possible to modify Tesseract to get both kinds of Latin.lstm from the same traineddata file (using a new component name like Latin.lstm-fast) and select the desired one with a new command line option. That would avoid the duplication of the other data and simplify the handling while increasing the size of the traineddata only by a small amount.
97 best/Latin.traineddata
86 fast/Latin.traineddata
A combined traineddata file that includes both the best and the fast model could be zipped and would then be much smaller:
51 best/Latin.zip
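As a rough check of the size argument, the following Python sketch uses only the rounded MiB figures quoted above, so the results are approximate. It compares shipping two separate traineddata files with one combined file that carries the fast model as an extra component; the component name Latin.lstm-fast is the hypothetical one suggested in this thread.

```python
# Rounded sizes in MiB, taken from the figures quoted in this thread.
best_total = 97   # best/Latin.traineddata
fast_total = 86   # fast/Latin.traineddata
fast_lstm = 1     # Latin.lstm inside the fast traineddata

# Two separate files duplicate everything except the LSTM model.
separate = best_total + fast_total

# One combined file: the best traineddata plus the fast model
# stored as an additional (hypothetical) Latin.lstm-fast component.
combined = best_total + fast_lstm

saved = separate - combined

print(f"separate files: {separate} MiB")
print(f"combined file:  {combined} MiB")
print(f"saved:          {saved} MiB")
```

For Latin this gives about 183 MiB for two files against about 98 MiB for a combined file, before any compression.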
Just quoting Ray about 'best' and 'fast':
2 parallel sets of tessdata. "best" and "fast". "Fast" will exceed the speed of legacy Tesseract in real time, provided you have the required parallelism components, and in total CPU only slightly slower for English. Way faster for most non-latin languages, while being <5% worse than "best". Only "best" will be retrainable, as "fast" will be integer.
I now did a complete comparison of the extracted "best" and "fast" traineddata files. Besides the lstm and version parts, they are identical, but "best" includes these additional files:
ara.config
ben.config
chi_sim.config
chi_sim_vert.config
chi_tra.config
chi_tra_vert.config
deu.config
ell.config
hin.config
ita.config
jpn.config
jpn_vert.config
kan.config
kor.config
mal.config
mar.config
nep.config
srp.config
tam.config
tel.config
tha.config
vie.config
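A comparison like this can be reproduced with a short script. The sketch below is one possible approach, not the procedure actually used here: it assumes the components of each traineddata file have already been unpacked into two directories (for example with combine_tessdata -u), and the function names are made up for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def compare_component_dirs(best_dir, fast_dir):
    """Compare two directories of unpacked traineddata components.

    Returns three sorted lists of file names: present only in best_dir,
    present only in fast_dir, and present in both but with different content.
    """
    best = {p.name: file_digest(p) for p in Path(best_dir).iterdir() if p.is_file()}
    fast = {p.name: file_digest(p) for p in Path(fast_dir).iterdir() if p.is_file()}
    only_best = sorted(best.keys() - fast.keys())
    only_fast = sorted(fast.keys() - best.keys())
    differing = sorted(n for n in best.keys() & fast.keys() if best[n] != fast[n])
    return only_best, only_fast, differing


if __name__ == "__main__":
    import sys

    only_best, only_fast, differing = compare_component_dirs(sys.argv[1], sys.argv[2])
    print("only in best:", *only_best)
    print("only in fast:", *only_fast)
    print("differing:", *differing)
```

Run against unpacked "best" and "fast" data, the config files would be expected in the first list and the lstm and version parts in the third.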
It's not clear why these parts exist in "best", and even the first one, ara.config, looks wrong:
# We do not yet have Tesseract for Arabic, so use OEM_CUBE_ONLY
# (see OcrEngineMode enum in third_party/tesseract/ccmain/tesseractclass.h).
tessedit_ocr_engine_mode 1
[...]
# We do not yet have Tesseract for Arabic, so use OEM_CUBE_ONLY
# (see OcrEngineMode enum in third_party/tesseract/ccmain/tesseractclass.h).
This comment should be updated
tessedit_ocr_engine_mode 1
This is still ok for ara.
Shouldn't all fast and best traineddata files have this:
tessedit_ocr_engine_mode 1
in their config?
Tesseract chooses the correct mode automatically.
There are 161 tessdata files in tessdata_best, but only 28 of them contain a config part.
deu.traineddata loads frk.traineddata. That looks wrong to me. Maybe other config files should not be there either.
Tesseract chooses the correct mode automatically.
Yeah, I know, but oem 1 is more explicit.
@stweil You raise some very good points, both about data redundancy and about git submodules. Ray told me he's going to think about it over the weekend.
My Tesseract application always downloads the eng.traineddata.gz file from the internet. I want to make it work offline. How is that possible? Please help me.
@sabirhusssain, the right place to ask questions is our forum.
@stweil You have addressed tessdata related issues with the recent commits. Should this issue be closed now?
I suggest keeping it open.
You raise some very good points, both about data redundancy and about git submodules. Ray told me he's going to think about it over the weekend.
@jbreiden, @theraysmith, what was the result of your thoughts after that weekend?
@jbreiden, @theraysmith: any updates on this?
This is a really good feature to have, any new updates?