They do not work for me. I've been trying versions: 3.05.00 and 4.00.00alpha.
My file date.user-patterns contains one line:
2014-\d\d-\d\d
Picture is one line with date, like: 2014-03-19
I run: tesseract img.jpg stdout --user-patterns date.user-patterns -psm 8
and output: "mum-w" which obviously does not match the pattern.
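To make "does not match the pattern" checkable, here is a minimal Python sketch that turns a user-pattern line into a regular expression and tests OCR output against it. It is only a post-hoc check, not something Tesseract does internally. The escape table is my reading of the comment in dict/trie.h (`\c` alphabetic, `\d` digit, `\n` alphanumeric, `\p` punctuation, `\a` lowercase, `\A` uppercase, `\*` repetition of the previous class); the regex classes are approximations, and `pattern_to_regex` is a hypothetical helper name.

```python
import re

# Assumed mapping from user-pattern escapes (per the dict/trie.h comment)
# to approximate regex character classes.
ESCAPES = {
    "c": r"[^\W\d_]",  # alphabetic character
    "d": r"\d",        # digit
    "n": r"[^\W_]",    # digit or alphabetic character
    "p": r"[^\w\s]",   # punctuation (approximation)
    "a": r"[a-z]",     # lowercase (ASCII approximation)
    "A": r"[A-Z]",     # uppercase (ASCII approximation)
}


def pattern_to_regex(pattern):
    """Convert one user-pattern line into a compiled regex (hypothetical helper)."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt == "*" and parts:
                parts[-1] += "*"              # zero or more of the previous class
            elif nxt in ESCAPES:
                parts.append(ESCAPES[nxt])
            else:
                parts.append(re.escape(nxt))  # escaped literal, e.g. a backslash
            i += 2
        else:
            parts.append(re.escape(ch))       # plain literal character
            i += 1
    return re.compile("".join(parts))


date_re = pattern_to_regex(r"2014-\d\d-\d\d")
print(bool(date_re.fullmatch("2014-03-19")))  # expected text on the image
print(bool(date_re.fullmatch("mum-w")))       # what Tesseract actually returned
```

A check like this can at least flag results that violate the expected format, so they can be retried with different preprocessing or rejected.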
Character whitelisting helps a bit, but the format from the pattern is not preserved and accuracy is poor.
I also tried some other examples; they do not work either.
Many people have the same problem. Links are aggregated under this question:
https://stackoverflow.com/questions/34560697/tesseract-ocr-user-patterns
also #403
Should we assume that this feature does not work at all? Is there any official comment on this?
I have the same problem with the user dictionary:
tesseract H3.png stdout --user-patterns date.user-patterns --psm 8 --user-words date.user-words -c language_model_penalty_non_dict_word=9999999999999999999 --oem 0
I tried different language_model_penalty_non_dict_word values, with no luck.
Related: #297, which is closed, so I assume the feature doesn't work. I think it would be better for users if those flags were removed from the command line and the configurations, because they are misleading as long as they don't affect the engine.
I tested the user-words option with 3.05.01 on Windows (using binaries by @stweil).
It works OK. See the attached test image.
The bazaar config file as used (system dictionary plus user words):
load_system_dawg T
load_freq_dawg T
user_words_suffix user-words
user_patterns_suffix user-patterns
eng.user-words as used:
the
quick
brown
fox
jumped
image used for recognition

Output without user-words (notice Dnline instead of Online):
tesseract test.png stdout
shared Guruvayoor Dnline Friends's post
Output with user-words (Online recognized correctly):
tesseract test.png stdout bazaar
shared Guruvayoor Online Friends's post
So Online from eng.user-words was used, when using the bazaar config file, and led to improved accuracy.
I just tested user-patterns with versions 3.02 and 3.05.01, both on Windows so that I didn't have to worry about matching versions of leptonica. The test image is attached.
The output does not change with the user-patterns option in either version. So if this feature ever worked, it was before 3.02.
However, just by resizing the image to 200%, the dates are correctly recognized.


@zdenop @amitdo @stweil Have you used the user-patterns option? If so, with which version?
No, sorry, I never used that option. Nevertheless I also have a scenario where working user patterns would help.
No, sorry, I never used that option.
Same answer.
I also have a scenario where working user patterns would help.
@stweil Interesting project :-)
https://groups.google.com/forum/#!searchin/tesseract-ocr/user$20patterns$20%7Csort:date/tesseract-ocr/S9CIK3jOMWw/u7dnVDASFLgJ
The ability to use user patterns was added in Tesseract 3.01, and now has a little documentation. See the comment in dict/trie.h:
http://code.google.com/p/tesseract-ocr/source/browse/tags/release-3.01/dict/trie.h
So it broke somewhere between 3.01 and 3.02...
I did not use it either.
But as far as I understand, "user patterns" just help to extend the Tesseract dictionary.
And as is well known, putting a word into the dictionary does not mean Tesseract will recognize it (and conversely, disabling the dictionaries does not drop recognition to 0%). So I do not know whether the feature works at all, but I would not expect it to have a significant effect on the result.
User patterns are documented in doc/tesseract.1.asc and in dict/trie.h.
With 4.0 the problem might be that the Dict class is instantiated twice. First here:
tesseract::Dict::Dict(tesseract::CCUtil * ccutil)
tesseract::Classify::Classify()
tesseract::Wordrec::Wordrec()
tesseract::Tesseract::Tesseract()
tesseract::TessBaseAPI::Init(...)
and then here
tesseract::Dict::Dict(tesseract::CCUtil * ccutil)
tesseract::LSTMRecognizer::LoadDictionary(const char * lang, tesseract::TessdataManager * mgr)
tesseract::LSTMRecognizer::Load(const char * lang, tesseract::TessdataManager * mgr)
tesseract::Tesseract::init_tesseract_lang_data(...)
tesseract::Tesseract::init_tesseract_internal(...)
tesseract::Tesseract::init_tesseract(...)
tesseract::TessBaseAPI::Init(...)
and both initialise
https://github.com/tesseract-ocr/tesseract/blob/master/dict/dict.cpp#L43
The real problem is that variables are set between these calls so LSTM dict does not get the value from user defined variables.
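To illustrate the ordering problem described above, here is a toy Python model. It is not Tesseract code: `Param`, `ToyDict`, and `set_variable` are hypothetical stand-ins, and the assumption is simply that each Dict instance creates its own parameter with a default value, while user variables are applied once, between the two constructions.

```python
class Param:
    """Stand-in for a configuration parameter holding a default value."""

    def __init__(self, name, default):
        self.name = name
        self.value = default


class ToyDict:
    """Stand-in for a Dict instance that creates its own parameter on construction."""

    def __init__(self, registry):
        self.user_patterns_suffix = Param("user_patterns_suffix", "")
        registry.append(self.user_patterns_suffix)


def set_variable(registry, name, value):
    """Apply a user-defined variable to every parameter registered so far."""
    for param in registry:
        if param.name == name:
            param.value = value


registry = []
legacy_dict = ToyDict(registry)   # first instantiation (Classify/Wordrec path)
set_variable(registry, "user_patterns_suffix", "user-patterns")  # user config applied here
lstm_dict = ToyDict(registry)     # second instantiation (LSTMRecognizer path)

print(repr(legacy_dict.user_patterns_suffix.value))
print(repr(lstm_dict.user_patterns_suffix.value))
```

In this model the first instance sees the user value and the second keeps its default, which is the behaviour the stack traces above would explain if the assumption holds.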
Does this issue only happen with the command line executable? I mean, can I work around it by writing a C++ source file that calls the API directly? Thanks.
@asmwarrior Answering your question: Both command line and API are affected.
Character whitelisting works for 3.05 but does not work for LSTM mode (version 4) at all.
@vidiecan Have you tried fixing the whitelisting issue for 4.0 LSTM? Your previous comment on this sounds reasonable.
Please also see comment by Ray at https://github.com/tesseract-ocr/tesseract/issues/403#issuecomment-265579471
Don't think it has been addressed yet.
@stweil Is this something you can fix?
@vidiecan you mentioned earlier that 'With 4.0 the problem might be that the Dict class is instantiated twice'.
Do you have a suggested patch to fix this issue?
Any update to this issue?
I am running Tesseract 4.00.00 Alpha on Linux via Tess4J 3.3.1
I am using the following Java code in Tess4J to load the bazaar config file and, through it, user_patterns_suffix:
TessAPI1.TessBaseAPIReadConfigFile(handle, tessdatafolder+"/configs/bazaar", 0)
I am sure it is finding this file because if I change the name of 'bazaar' it throws a warning saying file is not found.
The contents of the bazaar file are the standard ones:
load_system_dawg F
load_freq_dawg F
user_words_suffix user-words
user_patterns_suffix user-patterns
I populate the eng.user-patterns file in the tessdata folder with the standard default values and also add my own patterns for the values I need to capture correctly from a page:
1-\d\d\d-GOOG-411
www.\n\\\*.com
\A\A\d\d\A\A\A
ML\d\d\A\A\A
\A\A\d\d\d\d\d\d\d\d
However, I do not see any change in the results. I know the patterns are supposed to influence the results rather than force them, but the text is so clearly incorrect that there must be an issue.
The last time I did a build from source was around a month ago.
Any help is greatly appreciated.
We tried using strace on tesseract 4.0.0-beta.4-26-gfd49 and it seems that the user-patterns and user-words files only get opened in legacy mode (using --oem 0).
So does this work when Tesseract 4 is used with --oem 0? Then it's not a regression (Tesseract 4 can then replace Tesseract 3), but a missing feature for LSTM mode.
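For anyone who wants to try the legacy-mode route, here is a small Python sketch that assembles the command line using only flags already shown in this thread (`--oem`, `--psm`, `--user-patterns`, `--user-words`). `build_legacy_command` is a hypothetical helper, the file names are the ones from the original report, and whether the patterns then actually change the output is exactly what remains to be confirmed. Legacy mode also assumes the installed traineddata still contains the legacy model.

```python
def build_legacy_command(image, patterns_file=None, words_file=None, psm=8):
    """Build a tesseract command that selects the legacy engine (--oem 0)."""
    cmd = ["tesseract", image, "stdout", "--oem", "0", "--psm", str(psm)]
    if patterns_file:
        cmd += ["--user-patterns", patterns_file]
    if words_file:
        cmd += ["--user-words", words_file]
    return cmd


cmd = build_legacy_command("img.jpg", patterns_file="date.user-patterns")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` to run it and collect the recognized text.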
Closed as duplicate of #403