Tesseract: LSTM: User patterns do not work

Created on 30 Aug 2016 · 16 Comments · Source: tesseract-ocr/tesseract

ref: https://groups.google.com/forum/#!msg/tesseract-ocr/S9CIK3jOMWw/vVBZULrJ9xcJ

I tried using the bazaar config for user patterns suggested in the above post (\A\A\d\d\d\A\A) with the latest Windows binary. It does not seem to work. Does the functionality work on Linux?

Input, output and config files are attached. I added a .txt extension to bazaar and eng.user-patterns in order to upload them here.

patterntest

OUTPUT

OX345PT
PT7895M
BA409QT
OMOOKM
WE4321M

OOLI9T7
OX345PT
PT789SM
BA409QT
OMOOKMI
WE432LM

OOLI9T7
OX345PT
PT7898M
BA409QT
OMOOKMI
WE432LM

patternbazaar.txt

bazaar.txt
eng.user-patterns.txt

Shreeshrii

👍7

All 16 comments

Some other reports of user-patterns and user-words not working:

https://groups.google.com/forum/#!topic/tesseract-ocr/5vFqVcJmHnM

http://stackoverflow.com/questions/17209919/tesseract-user-patterns

Has anyone tried this? Does it work?

Shreeshrii on 8 Sep 2016

Question:
There are 2 ways these things could work:

1. FORCE the output to match the provided pattern(s) and/or word(s). With this option, you can't get anything else out, whatever is in the image.
2. Use the user-patterns and user-words as a hint. Other things could be output, if it thinks it is more likely. The hint can be made stronger, but there will always be inputs that produce something outside of the patterns supplied.

Which is it to be?
Can someone familiar with the above discussions please summarize for me? If the consensus is 1 above, then it could be made to happen, or else it might be possible to increase the strength of the hint.

BTW, this could behave differently for base tesseract vs LSTM.

theraysmith on 7 Dec 2016
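
Editorial illustration of the two options in the question above. This is a toy sketch, not Tesseract code: the candidate list, its scores and the `bonus` value are made up, and the regular expression is a hand translation of the \A\A\d\d\d\A\A pattern from the original report (two uppercase letters, three digits, two uppercase letters).

```python
import re

# Hand-written equivalent of the user pattern \A\A\d\d\d\A\A (an assumption,
# not something Tesseract produces).
PATTERN = re.compile(r"[A-Z]{2}[0-9]{3}[A-Z]{2}")


def pick_forced(candidates):
    """Option 1: only candidates that match the pattern may be returned."""
    allowed = [c for c in candidates if PATTERN.fullmatch(c[0])]
    if not allowed:
        return None  # nothing in the image fits the pattern
    return max(allowed, key=lambda c: c[1])[0]


def pick_hinted(candidates, bonus):
    """Option 2: matching candidates get a score bonus but can still lose."""
    def score(candidate):
        text, log_prob = candidate
        return log_prob + (bonus if PATTERN.fullmatch(text) else 0.0)
    return max(candidates, key=score)[0]


# Hypothetical recognizer hypotheses as (text, log-probability) pairs.
candidates = [("PT7895M", -0.5), ("PT789SM", -1.2), ("PT7898M", -2.0)]

print(pick_forced(candidates))        # PT789SM
print(pick_hinted(candidates, 0.5))   # PT7895M (weak hint loses)
print(pick_hinted(candidates, 1.0))   # PT789SM (stronger hint wins)
```

With the weak hint the non-matching reading still wins, which mirrors the remark that a hint can be made stronger but never guarantees output inside the pattern.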

I can tell you that in the Tesseract forum many users ask about these files. They are disappointed that there is no effect on accuracy when using them with their input.

The input is usually not a document but something like receipt, passport, car license plate, with a small set of known words/patterns.

amitdo on 7 Dec 2016

👍9 ❤1

In addition to the cases mentioned by Amit, there are users who would like to use the user_words dictionary in addition to Tesseract's wordlist.

Some examples of user words could be client names or industry-specific terminology, e.g. medical or pharmaceutical.

Is it possible to allow for both kinds of scenarios, based on some config/variable?

Shreeshrii on 8 Dec 2016

👍4

@theraysmith Ray, please also see
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/IUtQfIGZVdA/dm0-2n4DCQAJ

for a discussion regarding a user looking for an encrypted user words list to use with Tesseract.

Shreeshrii on 8 Dec 2016

Handle the pattern in your own code. It is the best way and is easy to customize.

Tesseract has to process the whole image first anyway; we cannot do anything about that.
Then it applies the patterns in its own code (I assume this works badly). We bypass this step.
Get the full result and handle it with regular expressions in your code. All the input is text or digits, so it will be fast, don't worry.

Hint: Test your regular expression against your OCR result on an online regular expression testing page. It will be a great help.

Hope you solve this.

quocpt on 26 Jan 2017
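
Editorial illustration of the post-processing approach suggested above, as a minimal sketch rather than a recommended implementation. It assumes the user-pattern classes mean \A = uppercase letter, \a = lowercase letter, \d = digit, \c = letter and \n = letter or digit, and it narrows them to ASCII for simplicity. The pattern cannot be handed to Python's `re` directly, because there `\A` means "start of string". The look-alike tables and the function names are hypothetical, and a repair is only a guess at what the image contains.

```python
import re

# Assumed meaning of the user-pattern classes, narrowed to ASCII.
CLASS_MAP = {
    "A": "[A-Z]",
    "a": "[a-z]",
    "d": "[0-9]",
    "c": "[A-Za-z]",
    "n": "[A-Za-z0-9]",
}

# Illustrative look-alike substitutions; extend or change for real data.
TO_LETTER = {"0": "O", "5": "S", "8": "B"}
TO_DIGIT = {"O": "0", "S": "5", "B": "8"}


def pattern_slots(pattern):
    """Split a user pattern into one regex character class per output slot."""
    slots = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\" and i + 1 < len(pattern):
            code = pattern[i + 1]
            if code not in CLASS_MAP:
                raise ValueError(f"unsupported pattern class: \\{code}")
            slots.append(CLASS_MAP[code])
            i += 2
        else:
            slots.append(re.escape(pattern[i]))  # literal character
            i += 1
    return slots


def repair(text, pattern):
    """Return text adjusted to fit the pattern, or None if it cannot fit."""
    slots = pattern_slots(pattern)
    if len(text) != len(slots):
        return None
    fixed = []
    for ch, slot in zip(text, slots):
        if re.fullmatch(slot, ch):
            fixed.append(ch)
            continue
        # Try a look-alike from the table that suits this slot.
        swap = TO_DIGIT.get(ch) if slot == CLASS_MAP["d"] else TO_LETTER.get(ch)
        if swap is None or not re.fullmatch(slot, swap):
            return None
        fixed.append(swap)
    return "".join(fixed)


PLATE = r"\A\A\d\d\d\A\A"
for raw in ["OX345PT", "PT7895M", "OMOOKM"]:
    print(raw, "->", repair(raw, PLATE))
```

On the three sample strings taken from the OUTPUT listing in the report, the first passes unchanged, the second becomes PT789SM, and the third is rejected because its length does not fit the pattern.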

@theraysmith

Please see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/p80qyGvVvP4/Rd1hlof3CAAJ

regarding "recognize only from user word list"

Shreeshrii on 13 Feb 2017

Please see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/wnlJcF4zIvU/4cIt9f2iCgAJ

The user needs to recognize names of medications (rare words that are most likely not included in the training data).

Shreeshrii on 18 Feb 2017

Also see: https://groups.google.com/d/msgid/tesseract-ocr/ab28b50f-d592-4f48-b813-c03451c4dbb0%40googlegroups.com?utm_medium=email&utm_source=footer

Shreeshrii on 22 Feb 2017

Any updates?

galharth on 30 Dec 2017

👍4

This is really needed. How can one fix this? Where to start?

msklvsk on 26 Nov 2018

When working on a fix for char whitelisting, @Shreeshrii and I discussed how user words/patterns could be reactivated in Tesseract 4 with LSTM models, too. This prompted me to work on a solution – see #2324. (Please review!)

Here is the relevant discussion leading up to it:

Do these changes also fix #960 ?

It does not seem so. Results for your pattern example from #403 are still unaffected, regardless of whether I use a config file or the --user-patterns option. Stracing confirms that Tesseract never attempts to open the pattern file, it just goes straight after the output file, once the traineddata itself is loaded.

Looking into this, it appears that LSTMRecognizer::Load is responsible, and it does not call LoadDictionary unless its first option (lang) is non-null, which in turn Tesseract::init_tesseract_lang_data will not give unless lstm_use_matrix=1. But that does not help either!

Looking deeper, only Dict::LoadLSTM is in the current callgraph, but we would need Dict::Load to read the user words and user patterns, build a trie from them and add to the other dawgs. This can only come from Tesseract::init_tesseract_lm, and that from TessBaseAPI::InitLangMod, which has a nice disclaimer comment above:

//TODO(amit): Adapt to lstm

I now believe TessBaseAPI::InitLangMod / Tesseract::init_tesseract_lm are actually dead ends and should be removed. As to LSTMRecognizer::LoadDictionary, it would simply be a matter of replacing Dict::LoadLSTM with the old Dict::Load, but there is one missing link: The LSTMRecognizer never gets to see the runtime variables of the Tesseract instance, and CCUtil has no interface to set or initialize its params_ member.

Perhaps it would be best to pass tesseract_->params() to the constructor of LSTMRecognizer, and add a (delegating) constructor to both LSTMRecognizer and CCUtil which takes a ParamsVectors*.

Or is there some reason to keep the member params of Tesseract and LSTMRecognizer different? If so, which params besides user_patterns_file and user_words_file should be copied?

@theraysmith: Ray, can you reply to @bertsky? This is an important fix for Tesseract 4.x... (cc: @jbreiden)

bertsky on 14 Mar 2019

Ok, #2324 failed, but here comes #2328.

bertsky on 15 Mar 2019

Is there some way to use user patterns in 4.0, or will we have to wait for a new version?

KilianSillero on 12 Apr 2019

@KilianSillero Yes there is, just check out the recent master and you will have the user words and user patterns facilities (as documented in the manpage) at your disposal.

The above mentioned fix is not quite satisfactory yet, in that the effect might be small, but these are larger issues to be dealt with in general terms.

bertsky on 15 Apr 2019
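
Editorial illustration of calling a build that includes the fix, using the --user-patterns option mentioned earlier in this thread. This is a sketch under assumptions: the file names are placeholders, the page segmentation mode is an arbitrary choice, and the helper names are hypothetical. `build_command` only assembles the argument list; `run_ocr` needs a `tesseract` executable on the PATH.

```python
import shutil
import subprocess


def build_command(image, patterns_file, lang="eng", psm=6):
    """Assemble a tesseract command line that passes a user patterns file."""
    return [
        "tesseract", image, "stdout",      # "stdout" prints the text instead of writing a file
        "-l", lang,
        "--psm", str(psm),                 # assumed page segmentation mode
        "--user-patterns", patterns_file,  # option named earlier in this thread
    ]


def run_ocr(image, patterns_file):
    """Run tesseract and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_command(image, patterns_file),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Placeholder file names; nothing is executed here.
print(build_command("patterntest.png", "eng.user-patterns"))
```

As noted in the comment above, the effect of the patterns may be small, so comparing the output with and without the option on your own images is the practical check.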

@zdenop please close! (If users still have problems with beam narrowness or want to make patterns exclusive, those should be discussed as separate issues.)

bertsky on 15 Apr 2019
