tesseract 🚀 - Blacklist and whitelist unsupported with LSTM (4.0)

Same problem for me with 4.00alpha. I tried to set tessedit_char_whitelist using:

the CLI with the option -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyz
the CLI with a config file
the tesserocr Python module

But I keep getting non-letter results.

I can provide Dockerfile + python script + images if needed

Cryspart on 9 Mar 2017

Same problem for me. Still getting symbols and letters despite setting tessedit_char_whitelist="0123456789".

yshean on 14 Mar 2017

I encountered the same issue today when using --oem 1,2,3. It works fine for --oem 0 (Original Tesseract).

atefm on 28 Mar 2017

👍13

I am encountering the same issue. Is there a solution for it yet?

DanielRieske on 12 Apr 2017

I am encountering the same issue. Is there a solution for it yet?

No.

amitdo on 12 Apr 2017

I am facing the same issue. Is it really a bug or is it just not supported for LSTM?

RimacV on 13 Apr 2017

It's currently not supported for LSTM.

People, please do not add another "I have the same issue" comment.

amitdo on 13 Apr 2017

Are there plans to support whitelisting on LSTM in the future?

Adrian-at-CrimsonAuzre on 9 Jun 2017

👍29

I also have this problem when using Tesseract 4 from C++

tess->SetVariable("tessedit_char_whitelist", "01234567890abcdefg");

has no effect on the output. The same with blacklist.

Tesseract returns not only ASCII and language-specific characters but also some other strange Unicode characters.

Is there a way to get a full list of all possible characters, language-specific or not? Based on such a list, one could build a workaround that maps the wrong characters to the best-fitting expected ones (EM DASH to a plain ASCII dash, etc.) and removes those without any sensible fit. It would be useful for me in my current circumstances, and maybe for others in need of whitelisting.

Htarlov on 4 Aug 2017

@theraysmith Are there plans to support this for LSTM?

Shreeshrii on 4 Aug 2017

👍32

In response to https://groups.google.com/forum/#!topic/tesseract-ocr/-oeCTcojYfw

You can try the plus-minus type of training if you just want a digits type of traineddata.

Your training_text can contain numbers in the format you need and you can train with a font matching your images.

For proof of concept you can try my experimental version at

https://github.com/Shreeshrii/tessdata_shreetest

Shreeshrii on 3 Oct 2017

👍12

I would like to exclude everything except letters and digits from the result. I started from eng.traineddata and trained my font from graphical images (@shreeshrii: thanks!!). Is there a way to get rid of all the other symbols, especially !"=)() ... ?

I am using --oem 1.

Thank you very much,
Ernst

ErnstTmp on 29 Oct 2017

👍2

Duplicate issue? "user pattern/dict does not work at all"
https://github.com/tesseract-ocr/tesseract/issues/960

I'm on 3.04.01 (from ubuntu 16.04 repos) and it doesn't work in that version either.

lope on 18 Jan 2018

Has this been resolved, or has anyone found a workaround?

samFredLumley on 22 Mar 2018

has this been resolved

No.

amitdo on 22 Mar 2018

👍4

Not really - sort-of workaround only.

I ended up iterating through the symbols found by Tesseract and doing some post-processing. By analyzing many cases, I found out which OCR errors are usual for my type of documents and take the output outside the chosen set. I then used a mapping from those mistaken characters to the proper ones (plus filtering of everything that is outside the set). So in the end I get only the chosen character set in the output, but it is a suboptimal solution.
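The mapping-plus-filtering approach described above can be sketched roughly as follows. This is not the commenter's actual code: the confusion table and the function name are hypothetical, and real entries would have to come from analyzing your own documents.

```python
# Hypothetical confusion table: characters outside the wanted set, mapped to
# the character they most often stand for in one kind of document.
CONFUSIONS = {
    "\u2014": "-",  # EM DASH -> ASCII hyphen
    "\u2013": "-",  # EN DASH -> ASCII hyphen
    "O": "0",
    "l": "1",
    "S": "5",
}


def restrict_to_charset(text, allowed, confusions=CONFUSIONS):
    """Map known confusions into `allowed`, then drop what is still outside it."""
    out = []
    for ch in text:
        if ch not in allowed:
            # Try to repair the character before deciding to discard it.
            ch = confusions.get(ch, ch)
        if ch in allowed:
            out.append(ch)
    return "".join(out)


ALLOWED = set("0123456789-")
print(restrict_to_charset("2O18\u201403-l5!", ALLOWED))  # prints 2018-03-15
```

Characters that are already in the allowed set are never remapped, so the table only affects output that would otherwise be discarded.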

Htarlov on 22 Mar 2018

Another experiment with finetuning - minuschar - i.e. removing characters from an existing traineddata.

In my sample I have used upper and lower case alphabet and digits only.

Please see the attached zip file. It has the bash script used, the training text and the resulting traineddata file. You will get better results if you use a font similar to the one you want to recognize and a training text similar to what you need.

I have removed all the wordlists/dawgs so tesseract will give a warning message when doing OCR.

alphanum.zip

Shreeshrii on 23 Mar 2018

@Htarlov @Shreeshrii thanks, interesting thoughts. I hadn't run much post-processing or done any training yet, so these should improve things considerably.

samFredLumley on 23 Mar 2018

https://github.com/tesseract-ocr/tesseract/blob/8f7be2e72c4d933c23e50cfb30e4317bd608e166/lstm/recodebeam.cpp#L258

I think that RecodeBeamSearch() is the method that should be modified to make the whitelist/blacklist feature work. get_enabled() should be used.

https://github.com/tesseract-ocr/tesseract/blob/023e1b340e6a4dd35909dde82f928790caec3ea5/ccutil/unicharset.h#L877

amitdo on 15 Apr 2018

👍3

tesseract 4.0.0-beta.1 still has this problem.

teamcoltra on 27 Apr 2018

👍5

Rebuilt from source -- whitelist still doesn't work.

vivanov879 on 21 May 2018

AFAIK, this will not be addressed for 4.0.0.

Shreeshrii on 21 May 2018

I've posted a bounty to have this resolved: https://www.bountysource.com/issues/42806964-blacklist-and-whitelist-broken-in-4-00alpha

williape on 23 May 2018

👍14

Does any trained data for digits exist? I would use it if you have some. The @Shreeshrii links are broken ATM.

Ungaminga on 4 Jul 2018

Use --oem 0 or -oem 0 and it works

ghost on 20 Jul 2018

Thanks Ben!

Can you please give a full command line example that you have found to work?

On Fri, 20 Jul 2018, 01:04 BenBaltz, notifications@github.com wrote:

Use --oem 0 or -oem 0 and it works


lope on 20 Jul 2018

With --oem 0 tesseract will use the old engine for ocr.

amitdo on 20 Jul 2018

👍3

Is this problem still an issue, or has anyone been able to solve it?

scorrea92 on 20 Sep 2018

This issue is still open, so up to now nobody solved it. Feel free to fix it.

@nguyenq, @zdenop, I suggest to change the title to _Blacklist and whitelist unsupported with LSTM_, because nothing is broken. I also suggest to add the labels _enhancement_, _help wanted_ and _user request_.

stweil on 20 Sep 2018

👍1

https://groups.google.com/d/msgid/tesseract-ocr/e402f05a-9a27-4816-958f-7b3da601c3a9%40googlegroups.com?utm_medium=email&utm_source=footer

Another recent request in forum

Shreeshrii on 21 Sep 2018

Possible workarounds:
1) Using the --oem 0 option (the legacy engine will be used)
2) Retraining (fine tuning)
https://github.com/tesseract-ocr/tesseract/issues/751#issuecomment-333904808
3) Post-processing
https://github.com/tesseract-ocr/tesseract/issues/751#issuecomment-375408508
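
For the first workaround, a full command line might look like the one built below. This is a sketch, not a tested recipe: the file names are placeholders and the helper function is hypothetical. It also assumes that the traineddata file for the chosen language contains a legacy model, which the LSTM-only tessdata_best and tessdata_fast files do not.

```python
import shlex


def legacy_whitelist_command(image, output_base, whitelist, lang="eng"):
    """Build a tesseract command line that uses the legacy engine (--oem 0)."""
    return [
        "tesseract", image, output_base,
        "-l", lang,
        "--oem", "0",
        "-c", "tessedit_char_whitelist=" + whitelist,
    ]


# "digits.png" and "out" are placeholder names for the input image and output base.
cmd = legacy_whitelist_command("digits.png", "out", "0123456789")
print(" ".join(shlex.quote(part) for part in cmd))
# To actually run it: subprocess.run(cmd, check=True)
```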

amitdo on 21 Sep 2018

Does any trained data for digits exist? I would use it if you have some. The @Shreeshrii links are broken ATM.

Sorry about that. I have uploaded a new set of traineddata for digits only at https://github.com/Shreeshrii/tessdata_shreetest

Shreeshrii on 25 Sep 2018

👍2

Does any trained data for digits exist? I would use it if you have some. The @Shreeshrii links are broken ATM.

Sorry about that. I have uploaded a new set of traineddata for digits only at https://github.com/Shreeshrii/tessdata_shreetest

Are . and , also in?

bommo1 on 25 Sep 2018

Added another finetuned traineddata file with 0-9 . , - at https://github.com/Shreeshrii/tessdata_shreetest

Shreeshrii on 26 Sep 2018

👍6 🎉3

Thank you!

bommo1 on 26 Sep 2018

Hi Shreeshrii,

I'm trying to train ara with just alpha characters. I am trying your plus-minus method using the tesstrain_minuschars.sh script that you posted above. I just took the existing ara.training_text from the lstm langdata and removed all of the punctuation and numbers.

It looks like the first part of training worked correctly (text2image and box extraction), and it was able to generate a new ara.unicharset. But then, it failed to load Latin.unicharset and Arabic.unicharset. I don't understand why it's looking for those when I'm just trying to train ara?

Then it started Phase E Generating lstmf files, and I got another error - Error opening data file ./tessdata_best/eng.traineddata. Again I don't understand why it's looking for the english traineddata?

Ultimately my training failed with this error:

rebuild starter traineddata

Failed to load unicharset from ./trained_minuschars_ara/ara/ara.unicharset

training from ./tessdata_best/ara.traineddata

mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110
Segmentation fault (core dumped)

Tara-E on 1 Oct 2018

But then, it failed to load Latin.unicharset and Arabic.unicharset. I don't understand why it's looking for those when I'm just trying to train ara?

You need to give the correct paths to the script unicharsets that are available in the root directory of langdata and langdata_lstm repositories. These are used for setting the unichar properties.

Shreeshrii on 1 Oct 2018

I am trying your plus-minus method using the tesstrain_minuschars.sh script that you posted above.

Please modify it appropriately for ara instead of eng and the paths matching your environment.

Shreeshrii on 1 Oct 2018

Hi Shreeshrii,

Thank you for your help!

I am trying your plus-minus method using the tesstrain_minuschars.sh script that you posted above.

Please modify it appropriately for ara instead of eng and the paths matching your environment.

I had modified the script for ara and my environment and also set the path to the langdata. However, I had only downloaded the langdata_lstm/ara. I didn't realize that I would also need Latin and Arabic unicharsets. After adding that, the "failure to load Latin/Arabic.unicharset" errors went away.

However, training is still failing and is still looking for eng.traineddata for some reason even though 'eng' never appears anywhere in the script.

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=./tessdata_best
[Mon Oct 1 14:03:13 PDT 2018] /usr/local/bin/tesseract /tmp/ara-2018-10-01.jtr/ara.Helvetica_Neue_LT_Arabic.exp0.tif /tmp/ara-2018-10-01.jtr/ara.Helvetica_Neue_LT_Arabic.exp0 --psm 6 lstm.train ./langdata/ara/ara.config
Error opening data file ./tessdata_best/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!

Thanks again,
-Tara

Tara-E on 1 Oct 2018

Tesseract seems to use eng.traineddata by default. Please download and try further training.

Shreeshrii on 2 Oct 2018

Tesseract seems to use eng.traineddata by default. Please download and try further training.

After adding eng.traineddata, I got a different error "read_params_file: Can't open lstm.train". lstm.train is usually in the ./tessdata directory, but the script sets the tessdata directory to ./tessdata_best where I had put only ara.traineddata and eng.traineddata. So I moved those files to the default tessdata folder and used that instead. Now the training has succeeded!

Thanks!
-Tara

Tara-E on 2 Oct 2018

I was having the same issue in 4.00beta. It works fine for me in 3.04.

eklam on 26 Oct 2018

See https://github.com/tesseract-ocr/tesseract/wiki/Planning#features-from-30x-which-are-missing-for-lstm.

stweil on 26 Oct 2018

OK, so we can all agree that fine-tuning a model for a specific set of characters is a proper workaround for the "whitelist" feature from 3.0. However, bringing back the whitelist feature as it was would be nice.
@theraysmith if I understand correctly, there is a CTC layer at the output of the NN. So at some point we have a probability distribution over all available labels at each time step (?). Why not just filter out (e.g. set to 0) the probabilities of blacklisted characters (labels)? Or mask the values before the softmax (if one is used), so that the probabilities still sum to 1 and the argmax considers only whitelisted labels? Am I on the right track at all?
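
The masking idea in this question can be illustrated with a toy greedy CTC decoder. This is only a sketch of the concept: Tesseract's real LSTM decoder is a beam search over recoded labels (RecodeBeamSearch), and the label set, scores and function name below are made up for the example.

```python
import math


def masked_greedy_ctc(logits, labels, allowed, blank=0):
    """Greedy CTC decode that only considers the blank and whitelisted labels.

    logits: one list of raw scores per time step, one score per label
    labels: the character for each label index (labels[blank] is a placeholder)
    """
    keep = [i == blank or labels[i] in allowed for i in range(len(labels))]
    decoded, previous = [], blank
    for step in logits:
        # Disabled labels get -inf, so a softmax would give them probability 0
        # while the remaining probabilities still sum to 1.
        masked = [s if k else -math.inf for s, k in zip(step, keep)]
        best = max(range(len(masked)), key=masked.__getitem__)
        # Standard CTC collapse: drop blanks and repeated labels.
        if best != blank and best != previous:
            decoded.append(labels[best])
        previous = best
    return "".join(decoded)


LABELS = ["<blank>", "0", "O", "1", "l"]
LOGITS = [
    [0.1, 1.0, 2.0, 0.0, 0.0],  # "O" narrowly beats "0"
    [3.0, 0.0, 0.0, 0.0, 0.0],  # blank
    [0.0, 0.0, 0.0, 1.5, 2.5],  # "l" beats "1"
]
print(masked_greedy_ctc(LOGITS, LABELS, allowed=set("0O1l")))  # prints Ol
print(masked_greedy_ctc(LOGITS, LABELS, allowed=set("01")))    # prints 01
```

Because softmax is monotonic, taking the argmax of the masked scores gives the same label as taking the argmax of the renormalized probabilities, so the sketch skips the softmax itself.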

wosiu on 30 Oct 2018

@noahmetzger has worked on that and added a config option which outputs all symbols with their probabilities, so you could add that feature in a post processor. Maybe Noah can have a look whether there is a simple solution for a built in filter.

stweil on 30 Oct 2018

Any updates on how to implement a whitelist without using --oem 0 in version 4?

Thanks!

carmonasl on 18 Dec 2018

👍6

Any updates on how to implement a whitelist without using --oem 0 in version 4?

Thanks!

Is there any update on this? Or should I drop versions?

thekevshow on 26 Jan 2019

I think that RecodeBeamSearch() is the method that should be modified to make the whitelist/blacklist feature work. get_enabled() should be used.

@amitdo was right. I was able to get the old behaviour (whitelist, blacklist, unblacklist) back with the LSTM decoder by querying the unicharset's get_enabled for each output in ComputeTopN, ignoring it if disabled.

But it was not so easy (for me) to get the UnicharCompress (recoder) and RecodedCharID (label mapping) right – so that might be the wrong way to do it. Also, one important ingredient was that the unicharset member of the Tesseract class (which SetBlackAndWhiteList operates on) is _not_ the same as lstm_recognizer_->GetUnicharset(). The latter seems to be a stripped down version, so I'll have the whitelisting operate on both. See #2294.

bertsky on 7 Mar 2019

🎉6

Any updates on how to implement a whitelist without using --oem 0 in version 4?

Can we use ChoiceIterator to iterate through all possibilities, keeping/rejecting based on whitelist/blacklist, and using the top result left over, if it exists?
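
A rough sketch of that selection logic, assuming the alternatives for each symbol have already been collected as (text, confidence) pairs. The data and function names are hypothetical. With a wrapper such as tesserocr the pairs would presumably come from the choice iterator at symbol level, and whether the LSTM engine exposes useful alternatives there is not confirmed in this thread.

```python
def pick_allowed(choices, whitelist=None, blacklist=frozenset()):
    """Return the highest-confidence candidate that passes the lists, or None.

    choices: iterable of (text, confidence) pairs for one symbol position
    whitelist: set of permitted strings, or None to permit everything
    blacklist: set of forbidden strings
    """
    best = None
    for text, confidence in choices:
        if whitelist is not None and text not in whitelist:
            continue
        if text in blacklist:
            continue
        if best is None or confidence > best[1]:
            best = (text, confidence)
    return best[0] if best else None


def filter_symbols(symbol_choices, whitelist=None, blacklist=frozenset()):
    """Apply pick_allowed to every symbol; symbols with no survivor are dropped."""
    picked = (pick_allowed(c, whitelist, blacklist) for c in symbol_choices)
    return "".join(p for p in picked if p is not None)


# Made-up alternatives for three symbol positions.
SYMBOLS = [
    [("O", 91.0), ("0", 88.5), ("Q", 40.2)],
    [("7", 97.3), ("/", 12.0)],
    [("!", 55.0), ("|", 31.0)],
]
print(filter_symbols(SYMBOLS, whitelist=set("0123456789")))  # prints 07
```

Dropping a symbol when nothing survives is one possible policy; emitting a placeholder character instead would keep the string length stable.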

jxu on 24 May 2019

Has this been fixed in 5.0.0-alpha?

sinall on 18 Jun 2019

Has anyone tested the 4.1 Version? It's supposed to be fixed now? release notes

axhagemann on 13 Aug 2019

👀2 👍1

Has anyone tested the 4.1 Version? It's supposed to be fixed now? release notes

@axhagemann

This appears to be working for me, just upgraded my 4.0 version to 4.1. FINALLY! lol been waiting on this.

thekevshow on 15 Aug 2019

Has anyone tested the 4.1 Version? It's supposed to be fixed now? release notes

Really excited! It works!

Jinnrry on 16 Aug 2019

@nguyenq, can we close this issue?

stweil on 17 Aug 2019

Didn't realize I'd opened this issue. It's been so long ago. :)

nguyenq on 17 Aug 2019

https://github.com/tesseract-ocr/tesseract/wiki#tesseract-4-packages-with-lstm-engine-and-related-traineddata

Install from Alex's ppa

On Fri, 30 Aug 2019, 21:18 OllieD3711, notifications@github.com wrote:

This appears to be working for me, just upgraded my 4.0 version to 4.1.
FINALLY! lol been waiting on this.

@kev2316 https://github.com/kev2316

I'm having trouble upgrading to 4.1.0. I'm sure I'm doing something stupid
(using sudo apt-upgrade, and also tried sudo apt install), but when I try
upgrade, I'm told I already have the latest version,
4.00~git2288-10f4998a-2.

How can I upgrade to 4.1.0?


Shreeshrii on 30 Aug 2019

👍1

Just to conclude and in addition:

To solve the black- and whitelist problem in version 4.0, two solutions have already been mentioned:

Update Tesseract to version 4.1 (the future-oriented approach)
Use the legacy mode with the --oem 0 flag

The Ubuntu package sources only contain Tesseract version 4.0.0-beta.1. If you can't upgrade and don't want to use the legacy mode, and you are using Tesseract through a wrapper in another programming language, try building a simple black- or whitelist function yourself. One example with PyTesserocr can be found in this blog article: return2 – Python Tesseract 4.0 OCR: Recognize only numbers / digits and exclude all other characters.

mhellmeier on 20 Mar 2020

Alex's ppa can be used on Ubuntu for the latest versions.
Please see https://github.com/tesseract-ocr/tesseract/wiki

Shreeshrii on 20 Mar 2020

Just to conclude and in addition:

To solve the black- and whitelist problem in version 4.0, two solutions have already been mentioned:

Update Tesseract to version 4.1 (the future-oriented approach)

Use the legacy mode with the --oem 0 flag

The Ubuntu package sources only contain Tesseract version 4.0.0-beta.1. If you can't upgrade and don't want to use the legacy mode, and you are using Tesseract through a wrapper in another programming language, try building a simple black- or whitelist function yourself. One example with PyTesserocr can be found in this blog article: return2 – Python Tesseract 4.0 OCR: Recognize only numbers / digits and exclude all other characters.

How do I upgrade from version 4.0 to 4.1, sir?
I don't know how to install it on my Ubuntu.

tammarut on 10 May 2020

@tammarut, please use our forum for asking questions.

amitdo on 10 May 2020
