ref: https://groups.google.com/forum/#!msg/tesseract-ocr/S9CIK3jOMWw/vVBZULrJ9xcJ
I tried using the bazaar config for user patterns suggested in the above post (\A\A\d\d\d\A\A) with the latest Windows binary. It does not seem to work. Does the functionality work on Linux?
Input, output and config files are attached. I added a .txt extension to bazaar and eng.user-patterns in order to upload them here.

OUTPUT
OX345PT
PT7895M
BA409QT
OMOOKM
WE4321M
OOLI9T7
OX345PT
PT789SM
BA409QT
OMOOKMI
WE432LM
OOLI9T7
OX345PT
PT7898M
BA409QT
OMOOKMI
WE432LM
Some other reports of user-patterns and user-words not working
https://groups.google.com/forum/#!topic/tesseract-ocr/5vFqVcJmHnM
http://stackoverflow.com/questions/17209919/tesseract-user-patterns
Has anyone tried this? Does it work?
Question:
There are 2 ways these things could work:
BTW, this could behave differently for base tesseract vs LSTM.
I can tell you that in the Tesseract forum many users ask about these files. They are disappointed that there is no effect on accuracy when using them with their input.
The input is usually not a document but something like receipt, passport, car license plate, with a small set of known words/patterns.
In addition to the cases mentioned by Amit, there are users who would like to use the user_words dictionary in addition to Tesseract's wordlist. Examples of user words could be client names or industry-specific terminology, e.g. medical or pharmaceutical.
Is it possible to allow for both kinds of scenarios, based on some config/variable?
@theraysmith Ray, please also see
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/IUtQfIGZVdA/dm0-2n4DCQAJ
for discussion regarding a user looking for encrypted user words list to use with tesseract.
Handle the pattern in code. It is the best way, and it is easy to customize.
Hint: check your OCR output against a regular expression on an online regex testing page. It will be a great help.
Hope you solve this
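A minimal sketch of what handling the pattern in code could look like, assuming the two-letters, three-digits, two-letters layout from the \A\A\d\d\d\A\A pattern above. The function name and the confusion tables are hypothetical and only illustrate the idea; they are not part of Tesseract and would need tuning for real data.

```python
import re

# Regex equivalent of the user pattern \A\A\d\d\d\A\A (upper-case letters assumed).
PLATE_RE = re.compile(r"[A-Z]{2}\d{3}[A-Z]{2}")

# Hypothetical look-alike tables; adjust them to the confusions seen in your own output.
TO_DIGIT = {"O": "0", "I": "1", "L": "1", "S": "5", "B": "8"}
TO_LETTER = {"0": "O", "1": "I", "5": "S", "8": "B"}


def repair(line):
    """Return the line coerced to the expected layout, or None if that is not possible."""
    text = line.strip().upper()
    if len(text) != 7:
        return None
    chars = []
    for i, ch in enumerate(text):
        if 2 <= i <= 4:
            # Positions 2..4 must be digits: swap look-alike letters for digits.
            chars.append(TO_DIGIT.get(ch, ch))
        else:
            # All other positions must be letters: swap look-alike digits for letters.
            chars.append(TO_LETTER.get(ch, ch))
    candidate = "".join(chars)
    return candidate if PLATE_RE.fullmatch(candidate) else None


if __name__ == "__main__":
    for raw in ["OX345PT", "PT7895M", "OMOOKM", "OOLI9T7"]:
        print(raw, "->", repair(raw))
```

Lines that cannot be coerced into the layout come back as None, so they can be flagged for manual review instead of being silently accepted.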
@theraysmith
Please see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/p80qyGvVvP4/Rd1hlof3CAAJ
regarding "recognize only from user word list".
Please also see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/wnlJcF4zIvU/4cIt9f2iCgAJ
regarding the need to recognize medication names (rare words that are most likely not included in the training data).
Any updates?
This is really needed. How can one fix this? Where to start?
When working on a fix for char whitelisting, @Shreeshrii and I discussed how user words/patterns could be reactivated in Tesseract 4 with LSTM models, too. This prompted me to work on a solution – see #2324. (Please review!)
Here is the relevant discussion leading up to it:
Do these changes also fix #960 ?
It does not seem so. Results for your pattern example from #403 are still unaffected, regardless of whether I use a config file or the --user-patterns option. Stracing confirms that Tesseract never attempts to open the pattern file, it just goes straight after the output file, once the traineddata itself is loaded.
Looking into this, it appears that LSTMRecognizer::Load is responsible, and it does not call LoadDictionary unless its first option (lang) is non-null, which in turn Tesseract::init_tesseract_lang_data will not give unless lstm_use_matrix=1. But that does not help either!
Looking deeper, only Dict::LoadLSTM is in the current callgraph, but we would need Dict::Load to read the user words and user patterns, build a trie from them and add to the other dawgs. This can only come from Tesseract::init_tesseract_lm, and that from TessBaseAPI::InitLangMod, which has a nice disclaimer comment above:
//TODO(amit): Adapt to lstm
I now believe TessBaseAPI::InitLangMod / Tesseract::init_tesseract_lm are actually dead ends and should be removed. As to LSTMRecognizer::LoadDictionary, it would simply be a matter of replacing Dict::LoadLSTM with the old Dict::Load, but there is one _missing link_: The LSTMRecognizer never gets to see the runtime variables of the Tesseract instance, and CCUtil has no interface to set or initialize its params_ member.
Perhaps it would be best to pass tesseract_->params() to the constructor of LSTMRecognizer, and add a (delegating) constructor to both LSTMRecognizer and CCUtil which takes a ParamsVectors*.
Or is there some reason to keep the member params of Tesseract and LSTMRecognizer different? If so, which params besides user_patterns_file and user_words_file should be copied?
@theraysmith : Ray, can you reply to @bertsky ? This is an important fix for tesseract 4.x... (cc: @jbreiden )
Ok, #2324 failed, but here comes #2328.
Is there some way to use user patterns in 4.0, or will we have to wait for a new version?
@KilianSillero Yes there is, just check out the recent master and you will have the user words and user patterns facilities (as documented in the manpage) at your disposal.
The above mentioned fix is not quite satisfactory yet, in that the effect might be small, but these are larger issues to be dealt with in general terms.
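For readers scripting this, here is a hedged sketch of how a command line using these facilities might be assembled. It assumes a build from recent master as described above, with the tesseract binary on the PATH. The helper name and the file names in the usage note are hypothetical; only the --user-words and --user-patterns options come from the manpage.

```python
def tesseract_command(image, out_base, lang="eng",
                      user_words=None, user_patterns=None):
    """Build an argument list of the form: tesseract IMAGE OUTPUTBASE [options...]."""
    cmd = ["tesseract", image, out_base, "-l", lang]
    if user_words:
        # Plain-text word list, one word per line.
        cmd += ["--user-words", user_words]
    if user_patterns:
        # Plain-text pattern list, one pattern per line, e.g. \A\A\d\d\d\A\A
        cmd += ["--user-patterns", user_patterns]
    return cmd


if __name__ == "__main__":
    # Printing only; pass the list to subprocess.run() to actually invoke Tesseract.
    print(tesseract_command("plates.png", "plates",
                            user_patterns="eng.user-patterns"))
```

Whether the patterns visibly change the result still depends on the caveats mentioned in this comment, so it is worth comparing the output with and without the extra options.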
@zdenop please close! (If users still have problems with beam narrowness or want to make patterns exclusive, those should be discussed as separate issues.)