Tesseract: user pattern/dict does not work at all

Created on 30 May 2017  ·  19 Comments  ·  Source: tesseract-ocr/tesseract

They do not work for me. I have tried versions 3.05.00 and 4.00.00alpha.
My file date.user-patterns contains one line:
2014-\d\d-\d\d
The picture is a single line containing a date, like: 2014-03-19
I run: tesseract img.jpg stdout --user-patterns date.user-patterns -psm 8
and the output is "mum-w", which obviously does not match the pattern.
Character whitelisting helps a bit, but the format from the pattern is not preserved and accuracy is poor.
I also tried some other examples; they do not work either.
Many people have the same problem; links are aggregated under this question:
https://stackoverflow.com/questions/34560697/tesseract-ocr-user-patterns
See also #403.
Should we assume that this feature does not work at all? Is there any official comment on this?
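
For readers who need the date format enforced regardless of what the engine does, the output can be checked against the same pattern after recognition. The following is a minimal Python sketch, not how Tesseract applies user patterns internally. It assumes the escape classes as I understand them from the comment in dict/trie.h (\c alphabetic, \d digit, \n alphanumeric, \p punctuation, \a lower case, \A upper case, \* repeat the previous item); the helper names pattern_to_regex and matches are hypothetical.

```python
import re

# Assumed mapping from user-pattern escapes to regex classes
# (based on the description in dict/trie.h; an approximation).
_CLASSES = {
    "c": r"[^\W\d_]",  # alphabetic character
    "d": r"\d",        # digit
    "n": r"[^\W_]",    # alphabetic character or digit
    "p": r"[^\w\s]",   # punctuation (rough approximation)
    "a": r"[a-z]",     # lower-case letter (ASCII only here)
    "A": r"[A-Z]",     # upper-case letter (ASCII only here)
}


def pattern_to_regex(pattern):
    """Translate a user pattern such as 2014-\\d\\d-\\d\\d into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt == "*":
                # \* repeats the previous character or class any number of times.
                if parts:
                    parts[-1] += "*"
            elif nxt in _CLASSES:
                parts.append(_CLASSES[nxt])
            else:
                # Any other escaped character is treated as a literal.
                parts.append(re.escape(nxt))
            i += 2
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


def matches(pattern, text):
    """Return True if the whole OCR result conforms to the user pattern."""
    return pattern_to_regex(pattern).fullmatch(text) is not None
```

With this helper, matches(r"2014-\d\d-\d\d", "2014-03-19") is true and matches(r"2014-\d\d-\d\d", "mum-w") is false, so a result like the one above can at least be rejected or retried.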

duplicate

All 19 comments

Same problem with user dictionary:
tesseract H3.png stdout --user-patterns date.user-patterns --psm 8 --user-words date.user-words -c language_model_penalty_non_dict_word=9999999999999999999 --oem 0
I tried different language_model_penalty_non_dict_word values with no luck.
Related: #297, which is closed, so I assume the feature doesn't work. I think it would be better for users if those flags were removed from the command line and configuration, because they are misleading as long as they don't affect the engine.

I tested the user-words option with 3.05.01 on Windows (using binaries by @stweil).

It works OK. See the attached test image.

bazaar config file as used (uses system dictionary + user words)

load_system_dawg     T
load_freq_dawg       T
user_words_suffix    user-words
user_patterns_suffix user-patterns

eng.user-words as used

the
quick
brown
fox
jumped

image used for recognition: [image: test]

Output without user-words (notice "Dnline" instead of "Online")

tesseract test.png stdout
shared Guruvayoor Dnline Friends's post

Output with user-words - Online recognized correctly

tesseract test.png stdout  bazaar
shared Guruvayoor Online Friends's post

So Online from eng.user-words was used, when using the bazaar config file, and led to improved accuracy.
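
The file layout used in this test can be set up programmatically. This is a minimal Python sketch under the assumption, taken from other comments in this thread, that the config file lives in tessdata/configs/ and the word list in tessdata/<lang>.user-words. The function name write_user_words_setup and its parameters are hypothetical and not part of Tesseract.

```python
from pathlib import Path

# Config lines copied from the bazaar file shown above.
CONFIG_LINES = [
    "load_system_dawg     T",
    "load_freq_dawg       T",
    "user_words_suffix    user-words",
    "user_patterns_suffix user-patterns",
]


def write_user_words_setup(tessdata_dir, lang, words, config_name="bazaar"):
    """Write a config file and a <lang>.user-words file under tessdata_dir.

    Returns the two paths that were written.
    """
    tessdata = Path(tessdata_dir)
    configs = tessdata / "configs"
    configs.mkdir(parents=True, exist_ok=True)

    config_path = configs / config_name
    config_path.write_text("\n".join(CONFIG_LINES) + "\n", encoding="utf-8")

    # One word per line, as in the eng.user-words listing above.
    words_path = tessdata / f"{lang}.user-words"
    words_path.write_text("\n".join(words) + "\n", encoding="utf-8")
    return config_path, words_path
```

After running it, the config is selected by passing its name as the last command-line argument, as in the "tesseract test.png stdout bazaar" call above.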

I tested for user-patterns just now with versions 3.02 and 3.05.01, both for windows so that I didn't have to worry about correct versions of leptonica. The test image is attached.

The user-patterns option makes no difference to the output in either version. So if this feature ever worked, it must have been before 3.02.

However, just by resizing the image to 200%, the dates are correctly recognized.

[images: date, date-small]

@zdenop @amitdo @stweil Have you used user-patterns option? If so, with which version?

No, sorry, I never used that option. Nevertheless I also have a scenario where working user patterns would help.

No, sorry, I never used that option.

Same answer.

I also have a scenario where working user patterns would help.

@stweil Interesting project :-)

https://groups.google.com/forum/#!searchin/tesseract-ocr/user$20patterns$20%7Csort:date/tesseract-ocr/S9CIK3jOMWw/u7dnVDASFLgJ

The ability to use user patterns was added in Tesseract 3.01, and now has a little documentation. See the comment in dict/trie.h:

http://code.google.com/p/tesseract-ocr/source/browse/tags/release-3.01/dict/trie.h

So it broke somewhere between 3.01 and 3.02...

I did not use it either.
But as far as I understand, "user patterns" just help to extend the Tesseract dictionary.
And, as is well known, putting a word in the dictionary does not mean Tesseract will recognize it (nor, conversely, will disabling dictionaries cause 0% recognition). So I do not know whether the feature works at all, but I would not expect it to have a significant effect on the result.

User patterns are documented in doc/tesseract.1.asc and in dict/trie.h.

With 4.0 the problem might be that the Dict class is instantiated twice

tesseract::Dict::Dict(tesseract::CCUtil * ccutil)
tesseract::Classify::Classify()
tesseract::Wordrec::Wordrec()
tesseract::Tesseract::Tesseract()
tesseract::TessBaseAPI::Init(...)

and then here

tesseract::Dict::Dict(tesseract::CCUtil * ccutil)
tesseract::LSTMRecognizer::LoadDictionary(const char * lang, tesseract::TessdataManager * mgr)
tesseract::LSTMRecognizer::Load(const char * lang, tesseract::TessdataManager * mgr)
tesseract::Tesseract::init_tesseract_lang_data(...)
tesseract::Tesseract::init_tesseract_internal(...)
tesseract::Tesseract::init_tesseract(...)
tesseract::TessBaseAPI::Init(...)

and both initialise
https://github.com/tesseract-ocr/tesseract/blob/master/dict/dict.cpp#L43

The real problem is that variables are set between these calls, so the LSTM Dict does not get the values of the user-defined variables.

Does this issue only happen with the command-line executable? I mean, can I work around it by writing a C++ source file that calls the API directly? Thanks.

@asmwarrior Answering your question: Both command line and API are affected.
Character whitelisting works in 3.05 but does not work at all in LSTM mode (version 4).
@vidiecan Have you tried fixing the whitelisting issue for 4.0 LSTM? Your previous comment on this sounds reasonable.

Please also see comment by Ray at https://github.com/tesseract-ocr/tesseract/issues/403#issuecomment-265579471

Don't think it has been addressed yet.

@stweil Is this something you can fix?

@vidiecan you mentioned earlier that 'With 4.0 the problem might be that the Dict class is instantiated twice'.

Do you have a suggested patch to fix this issue?

#1127, #1128

Any update on this issue?
I am running Tesseract 4.00.00 Alpha on Linux via Tess4J 3.3.1.
I am using the following Java code in Tess4J to try to load the bazaar file and, through it, the user_patterns_suffix setting:

TessAPI1.TessBaseAPIReadConfigFile(handle, tessdatafolder+"/configs/bazaar", 0)

I am sure it is finding this file, because if I change the name 'bazaar' it throws a warning saying the file is not found.

The contents of the bazaar file are the standard ones:

load_system_dawg     F
load_freq_dawg       F
user_words_suffix    user-words
user_patterns_suffix user-patterns

I populate the eng.user-patterns file in the tessdata folder with the standard default values and also add my own to account for values I need to capture correctly from a page:

1-\d\d\d-GOOG-411
www.\n\\\*.com
\A\A\d\d\A\A\A
ML\d\d\A\A\A
\A\A\d\d\d\d\d\d\d\d

However, I do not see any change in the results. I know the patterns are supposed to influence the results rather than force them, but the text is so clearly incorrect that there must be an issue.

The last time I did a build from source was around a month ago.

Any help is greatly appreciated.

We tried using strace on tesseract 4.0.0-beta.4-26-gfd49 and it seems that the user-patterns and user-words files only get opened in legacy mode (using --oem 0).

So does this work when Tesseract 4 is used with --oem 0? Then it's not a regression (Tesseract 4 can then replace Tesseract 3), but a missing feature for LSTM mode.
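
Given the strace observation above, a workaround sketch is to force legacy mode whenever user words or patterns are supplied. The Python helper below only assembles the command line; the function name build_tesseract_cmd and its parameters are hypothetical. It assumes the installed traineddata contains the legacy model, without which --oem 0 cannot work.

```python
import shlex


def build_tesseract_cmd(image, user_words=None, user_patterns=None,
                        psm=8, legacy=True):
    """Assemble a tesseract command line that prints the result to stdout."""
    cmd = ["tesseract", image, "stdout"]
    if user_words:
        cmd += ["--user-words", user_words]
    if user_patterns:
        cmd += ["--user-patterns", user_patterns]
    cmd += ["--psm", str(psm)]
    if legacy:
        # Per the strace finding in this thread, the user files are only
        # opened when the legacy engine is selected.
        cmd += ["--oem", "0"]
    return cmd


if __name__ == "__main__":
    cmd = build_tesseract_cmd("img.jpg", user_patterns="date.user-patterns")
    # Print the command; pass `cmd` to subprocess.run(...) to execute it.
    print(shlex.join(cmd))
```

Building the argument list separately from running it keeps the flags easy to inspect and compare between legacy and LSTM runs.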

Closed as duplicate of #403
