Tesseract: user pattern/dict does not work at all

Created on 30 May 2017 · 19 Comments · Source: tesseract-ocr/tesseract

User patterns and user dictionaries do not work for me. I have tried versions 3.05.00 and 4.00.00alpha.
My file date.user-patterns contains one line:
2014-\d\d-\d\d
The picture is a single line containing a date, like: 2014-03-19
I run: tesseract img.jpg stdout --user-patterns date.user-patterns -psm 8
and the output is "mum-w", which obviously does not match the pattern.
Character whitelisting helps a bit, but the format from the pattern is not preserved and accuracy is poor.
I also tried some other examples; they do not work either.
Many people have the same problem, aggregated links under this one:
https://stackoverflow.com/questions/34560697/tesseract-ocr-user-patterns
also #403
Should we assume that this feature does not work at all? Is there any official comment on this?

duplicate

wosiu

👍13

All 19 comments

Same problem with user dictionary:
tesseract H3.png stdout --user-patterns date.user-patterns --psm 8 --user-words date.user-words -c language_model_penalty_non_dict_word=9999999999999999999 --oem 0
I tried different language_model_penalty_non_dict_word values with no luck.
Related: #297, which is closed, so I assume the feature doesn't work. I think it would be better for users if those flags were removed from the command line and configuration, because they are misleading as long as they don't affect the engine.

wosiu on 30 May 2017

Tested user-words option with 3.05.01 on windows (using binaries by @stweil)

Works ok. See attached test image.

bazaar config file as used (uses system dictionary + user words)

load_system_dawg     T
load_freq_dawg       T
user_words_suffix    user-words
user_patterns_suffix user-patterns

eng.user-words as used

the
quick
brown
fox
jumped

image used for recognition

[attached image: test]

Output without user-words (notice Dnline instead of Online)

tesseract test.png stdout
shared Guruvayoor Dnline Friends's post

Output with user-words - Online recognized correctly

tesseract test.png stdout  bazaar
shared Guruvayoor Online Friends's post

So Online from eng.user-words was used, when using the bazaar config file, and led to improved accuracy.

Shreeshrii on 3 Jun 2017
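
For anyone wanting to reproduce the user-words test above, the two files can be generated with a short script. This is only a minimal sketch: the function name `write_user_words_setup` is made up for illustration, and it assumes the config file lives in a `configs` subfolder of the tessdata directory with the word list next to the traineddata file, as described elsewhere in this thread.

```python
from pathlib import Path
import tempfile


def write_user_words_setup(tessdata_dir, lang, words, config_name="bazaar"):
    """Write a bazaar-style config and a <lang>.user-words file.

    Hypothetical helper: the file layout follows what is described in
    this thread (configs/<name> plus <lang>.user-words in tessdata).
    """
    tessdata = Path(tessdata_dir)
    configs = tessdata / "configs"
    configs.mkdir(parents=True, exist_ok=True)

    # Same four settings as the config file quoted in the comment above.
    config_lines = [
        "load_system_dawg     T",
        "load_freq_dawg       T",
        "user_words_suffix    user-words",
        "user_patterns_suffix user-patterns",
    ]
    config_path = configs / config_name
    config_path.write_text("\n".join(config_lines) + "\n", encoding="utf-8")

    # One word per line, as in the eng.user-words file shown above.
    words_path = tessdata / f"{lang}.user-words"
    words_path.write_text("\n".join(words) + "\n", encoding="utf-8")
    return config_path, words_path


if __name__ == "__main__":
    # Write into a throwaway directory so a real tessdata folder is untouched.
    with tempfile.TemporaryDirectory() as tmp:
        cfg, wordlist = write_user_words_setup(
            tmp, "eng", ["the", "quick", "brown", "fox", "jumped"]
        )
        print(cfg.read_text(encoding="utf-8"))
        print(wordlist.read_text(encoding="utf-8"))
```

With the files in place, the command from the comment (tesseract test.png stdout bazaar) picks up the config by name.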

I tested for user-patterns just now with versions 3.02 and 3.05.01, both for windows so that I didn't have to worry about correct versions of leptonica. The test image is attached.

The user-patterns option makes no difference to the output in either version. So, if this feature ever worked, it must have been before 3.02.

However, just by resizing the image to 200%, the dates are correctly recognized.

[attached images: date, date-small]

Shreeshrii on 3 Jun 2017

@zdenop @amitdo @stweil Have you used user-patterns option? If so, with which version?

Shreeshrii on 3 Jun 2017

👍1

No, sorry, I never used that option. Nevertheless I also have a scenario where working user patterns would help.

stweil on 3 Jun 2017

No, sorry, I never used that option.

Same answer.

amitdo on 3 Jun 2017

😄1

I also have a scenario where working user patterns would help.

@stweil Interesting project :-)

https://groups.google.com/forum/#!searchin/tesseract-ocr/user$20patterns$20%7Csort:date/tesseract-ocr/S9CIK3jOMWw/u7dnVDASFLgJ

The ability to use user patterns was added by Tesseract 3.01, and now has a little documentation. See the comment in dict/trie.h:

http://code.google.com/p/tesseract-ocr/source/browse/tags/release-3.01/dict/trie.h

So it broke somewhere between 3.01 and 3.02...

Shreeshrii on 3 Jun 2017

I did not use it either.
But as far as I understand, "user patterns" just help to extend the Tesseract dictionary.
And as is known, putting a word in the dictionary does not mean Tesseract will recognize it (and conversely, disabling dictionaries will not cause 0% recognition). So I do not know whether the feature works at all, but I would not expect it to have a significant effect on the result.

zdenop on 4 Jun 2017

User patterns are documented in doc/tesseract.1.asc and in dict/trie.h.

stweil on 4 Jun 2017
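
Since the engine may ignore the patterns, one way to at least detect a bad result is to check the OCR output against the same pattern after the fact. The sketch below translates the escape sequences listed in the dict/trie.h comment (\c, \d, \n, \p, \a, \A, \*) into a Python regular expression. The function names are invented for this example, and the ASCII character classes are an assumption: Tesseract itself consults the unicharset of the loaded language, so this is an approximation and not a reimplementation.

```python
import re

# ASCII approximations of the character classes documented in dict/trie.h.
# Assumption of this sketch: the real engine uses the language's unicharset.
_CLASSES = {
    "c": "[A-Za-z]",          # alphabetic character
    "d": "[0-9]",             # digit
    "n": "[A-Za-z0-9]",       # digit or alphabetic character
    "p": r"[!-/:-@\[-`{-~]",  # punctuation
    "a": "[a-z]",             # lower-case letter
    "A": "[A-Z]",             # upper-case letter
}


def user_pattern_to_regex(pattern):
    """Translate one user-patterns line into a compiled regex (sketch)."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt == "*":
                # \* lets the preceding element repeat any number of times.
                if not parts:
                    raise ValueError("\\* must follow a character or class")
                parts[-1] += "*"
            elif nxt in _CLASSES:
                parts.append(_CLASSES[nxt])
            else:
                # Any other escaped character is treated as a literal.
                parts.append(re.escape(nxt))
            i += 2
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


def matches(pattern, text):
    """Return True if the whole text fits the user pattern."""
    return user_pattern_to_regex(pattern).fullmatch(text) is not None


if __name__ == "__main__":
    date_pattern = r"2014-\d\d-\d\d"
    print(matches(date_pattern, "2014-03-19"))  # True
    print(matches(date_pattern, "mum-w"))       # False
```

A check like this does not improve recognition; it only flags output such as "mum-w" that cannot be right for the given pattern.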

With 4.0 the problem might be that the Dict class is instantiated twice

tesseract::Dict::Dict(tesseract::CCUtil * ccutil)
tesseract::Classify::Classify()
tesseract::Wordrec::Wordrec()
tesseract::Tesseract::Tesseract()
tesseract::TessBaseAPI::Init(...)

and then here

tesseract::Dict::Dict(tesseract::CCUtil * ccutil)
tesseract::LSTMRecognizer::LoadDictionary(const char * lang, tesseract::TessdataManager * mgr)
tesseract::LSTMRecognizer::Load(const char * lang, tesseract::TessdataManager * mgr)
tesseract::Tesseract::init_tesseract_lang_data(...)
tesseract::Tesseract::init_tesseract_internal(...)
tesseract::Tesseract::init_tesseract(...)
tesseract::TessBaseAPI::Init(...)

and both initialise
https://github.com/tesseract-ocr/tesseract/blob/master/dict/dict.cpp#L43

The real problem is that variables are set between these calls, so the LSTM dict does not get the values of the user-defined variables.

vidiecan on 17 Aug 2017

😕4 👍3
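
The effect described in the comment above can be pictured with a toy model. This is not Tesseract code: the classes `Dict` and `Engine` and their methods are stand-ins invented for illustration, and the model encodes just one possible reading of the comment, namely that each dictionary object starts from built-in defaults and that user settings are applied only to the instance that already exists.

```python
class Dict:
    """Toy stand-in for a dictionary object that owns its own parameters."""

    def __init__(self):
        # Every new instance starts from built-in defaults (empty here).
        self.params = {"user_patterns_file": "", "user_words_file": ""}


class Engine:
    """Toy stand-in for an engine holding a legacy and an LSTM dictionary."""

    def __init__(self):
        self.legacy_dict = Dict()  # first instantiation, at construction
        self.lstm_dict = None

    def set_variable(self, name, value):
        # In this model, user settings reach only the instance that exists.
        self.legacy_dict.params[name] = value

    def load_lstm(self):
        # Second instantiation: a fresh object with default parameters,
        # created after the user variables were already applied.
        self.lstm_dict = Dict()


if __name__ == "__main__":
    engine = Engine()
    engine.set_variable("user_patterns_file", "date.user-patterns")
    engine.load_lstm()
    print(repr(engine.legacy_dict.params["user_patterns_file"]))
    print(repr(engine.lstm_dict.params["user_patterns_file"]))
```

In this model the legacy dictionary sees the user's file name while the LSTM dictionary keeps the empty default, which would match the symptom that the options are silently ignored in LSTM mode.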

Does this issue only happen with the command-line executable? In other words, can I work around it by writing a C++ program that calls the API directly? Thanks.

asmwarrior on 28 Sep 2017

@asmwarrior Answering your question: Both command line and API are affected.
Character whitelisting works for 3.05 but does not work for LSTM mode (version 4) at all.
@vidiecan Have you tried fixing the issue with whitelisting for 4.0 lstm? Your previous comment on this sounds reasonable.

wosiu on 18 Jan 2018

Please also see comment by Ray at https://github.com/tesseract-ocr/tesseract/issues/403#issuecomment-265579471

Don't think it has been addressed yet.

@stweil Is this something you can fix?

Shreeshrii on 25 Mar 2018

👍2

@vidiecan you mentioned earlier that 'With 4.0 the problem might be that the Dict class is instantiated twice'.

Do you have a suggested patch to fix this issue?

Shreeshrii on 2 May 2018

#1127, #1128

amitdo on 2 May 2018

Any update on this issue?
I am running Tesseract 4.00.00 alpha on Linux via Tess4J 3.3.1.
I am using the following Java code in Tess4J to load the bazaar config file and, through it, the user_patterns_suffix setting:

TessAPI1.TessBaseAPIReadConfigFile(handle, tessdatafolder+"/configs/bazaar", 0)

I am sure it is finding this file because if I change the name of 'bazaar' it throws a warning saying file is not found.

The contents of the bazaar file are the standard ones:

load_system_dawg     F
load_freq_dawg       F
user_words_suffix    user-words
user_patterns_suffix user-patterns

I populate the eng.user-patterns file in the tessdata folder with the standard default values and also add my own to account for values I need to capture correctly from a page:

1-\d\d\d-GOOG-411
www.\n\\\*.com
\A\A\d\d\A\A\A
ML\d\d\A\A\A
\A\A\d\d\d\d\d\d\d\d

However, I do not see any change in the results. I know the patterns are supposed to influence the results rather than force them, but the text is so clearly incorrect that there must be an issue.

The last time I did a build from source was around a month ago.

Any help is greatly appreciated.

nusynergi on 17 May 2018

👍1

We tried using strace on tesseract 4.0.0-beta.4-26-gfd49 and it seems that the user-patterns and user-words files only get opened in legacy mode (using --oem 0).

Necklaces on 14 Aug 2018

👍4
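
To compare the two modes on the same image, the invocation can be scripted. The sketch below only uses flags that already appear in this thread (--user-patterns, --psm, --oem); the helper names `build_command` and `run_both_modes` are made up for illustration, and it assumes a tesseract binary on the PATH and a traineddata file that still contains the legacy model, which --oem 0 requires.

```python
import subprocess


def build_command(image, patterns_file, oem=None, psm=8):
    """Build a tesseract command line like the ones quoted in this thread."""
    cmd = [
        "tesseract", image, "stdout",
        "--user-patterns", patterns_file,
        "--psm", str(psm),
    ]
    if oem is not None:
        # --oem 0 selects the legacy engine; omitting it keeps the default.
        cmd += ["--oem", str(oem)]
    return cmd


def run_both_modes(image, patterns_file):
    """Run default and legacy mode and return the recognized text of each."""
    results = {}
    for label, oem in (("default", None), ("legacy", 0)):
        cmd = build_command(image, patterns_file, oem=oem)
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[label] = proc.stdout.strip()
    return results


if __name__ == "__main__":
    # Print the command lines only; call run_both_modes() on a real image.
    print(" ".join(build_command("img.jpg", "date.user-patterns")))
    print(" ".join(build_command("img.jpg", "date.user-patterns", oem=0)))
```

If the legacy output changes with the patterns file and the default output does not, that would be consistent with the strace observation above.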

So does this work when Tesseract 4 is used with --oem 0? Then it's not a regression (Tesseract 4 can then replace Tesseract 3), but a missing feature for LSTM mode.

stweil on 14 Sep 2018

Closed as duplicate of #403

zdenop on 13 Oct 2018
