Blacklist and whitelist no longer work in 4.00alpha. They used to work in 3.04.
https://groups.google.com/forum/#!topic/tesseract-ocr/cpcJHTE2xMo
Same problem for me with 4.00alpha. I tried to set tessedit_char_whitelist by using:
-c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyz
But I keep getting non-letter results.
I can provide Dockerfile + python script + images if needed
Same problem for me. Still getting symbols and alphabets despite setting tessedit_char_whitelist="0123456789".
I encountered the same issue today when using --oem 1,2,3. It works fine for --oem 0 (Original Tesseract).
I am encountering the same issue. Is there a solution for it yet?
No.
I am facing the same issue. Is it really a bug or is it just not supported for LSTM?
It's currently not supported for LSTM.
People, please do not add another "I have the same issue" comment.
Are there plans to support whitelisting on LSTM in the future?
I also have this problem when using Tesseract 4 from C++
tess->SetVariable("tessedit_char_whitelist", "01234567890abcdefg");
has no effect on the output. The same with blacklist.
Tesseract returns not only ASCII and language-specific characters but also some other, unexpected UTF-8 characters.
Is there a way to get a full list of all possible characters, whether language-specific or not? Based on such a list, one could build a workaround that maps the wrong characters to the best-fitting expected ones (such as EM DASH to a plain ASCII dash) and removes those without any sensible fit. That would be useful for me in my current circumstances, and maybe for others who need whitelisting.
@theraysmith Are there plans to support this for LSTM?
In response to https://groups.google.com/forum/#!topic/tesseract-ocr/-oeCTcojYfw
You can try the plus-minus type of training if you just want a digits type of traineddata.
Your training_text can contain numbers in the format you need and you can train with a font matching your images.
For proof of concept you can try my experimental version at
I would like to exclude everything except letters and digits from the result. I started from eng.traineddata and trained my font from graphical images (@shreeshrii: thanks!!). Is there a way to get rid of all the other symbols, especially !"=)() ... ?
I am using --oem 1.
Thank you very much,
Ernst
Duplicate issue? "user pattern/dict does not work at all"
https://github.com/tesseract-ocr/tesseract/issues/960
I'm on 3.04.01 (from ubuntu 16.04 repos) and it doesn't work in that version either.
Has this been resolved, or has anyone found a workaround?
has this been resolved
No.
Not really - sort-of workaround only.
I ended up iterating through the symbols found by Tesseract and doing some post-processing. By analysing many cases, I worked out the usual OCR errors for my type of documents that take the output outside the chosen set, then used a mapping from those mistaken characters to the proper ones (plus filtering out everything else that is outside the set). So in the end I get only the chosen character set in the output, but it is a suboptimal solution.
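A minimal sketch of that kind of post-processing, in Python. The confusion table and the function name are hypothetical and only illustrate the idea; a real table has to come from analysing your own documents.

```python
# Hypothetical confusion table: characters the OCR tends to emit by mistake,
# mapped to the allowed character they most likely stand for.
# These entries are illustrative assumptions, not measured data.
CONFUSIONS = {
    "\u2014": "-",  # EM DASH -> ASCII hyphen
    "\u2013": "-",  # EN DASH -> ASCII hyphen
    "O": "0",       # only sensible for digits-only documents
    "l": "1",       # only sensible for digits-only documents
}


def restrict_to_charset(text, allowed, confusions=CONFUSIONS):
    """Map known confusions into the allowed set, then drop everything else.

    Whitespace is always kept so that word and line breaks survive.
    """
    kept = []
    for ch in text:
        if ch not in allowed:
            # Try to repair the character before deciding to drop it.
            ch = confusions.get(ch, ch)
        if ch in allowed or ch.isspace():
            kept.append(ch)
    return "".join(kept)
```

Characters that are already in the allowed set are never remapped, so the same table can be reused with a wider whitelist without turning a legitimate "l" into "1".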
Another experiment with finetuning - minuschar - i.e. removing characters from an existing traineddata.
In my sample I have used upper and lower case alphabet and digits only.
Please see attached zip file. It has the bash script used, training text and resulting traineddata file. You will get better results if you use a font similar to the one you want to recognize and training text similar to what you need.
I have removed all the wordlists/dawgs so tesseract will give a warning message when doing OCR.
@Htarlov @Shreeshrii thanks, interesting thoughts. I hadn't run much post-processing or done any training yet, so these should improve things considerably.
I think that RecodeBeamSearch() is the method that should be modified to make the whitelist/blacklist feature work. get_enabled() should be used.
tesseract 4.0.0-beta.1 still has this problem.
Rebuilt from source; whitelist still doesn't work.
AFAIK, this will not be addressed for 4.0.0.
I've posted a bounty to have this resolved: https://www.bountysource.com/issues/42806964-blacklist-and-whitelist-broken-in-4-00alpha
Is there any existing trained data for digits? I would use it if you have some. The @Shreeshrii links are broken ATM.
Use --oem 0 or -oem 0 and it works
Thanks Ben!
Can you please give a full command line example that you have found to work?
With --oem 0 tesseract will use the old engine for ocr.
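As a rough illustration of a full command line, here is a small Python helper that assembles the arguments. The helper name and the file name digits.png are made up for the example. It also assumes that the installed traineddata file contains the legacy model, which is not the case for the LSTM-only tessdata_best and tessdata_fast files.

```python
def legacy_whitelist_command(image_path, whitelist, lang="eng"):
    """Build a tesseract argument list that uses the legacy engine (--oem 0)
    together with a character whitelist. Output goes to stdout."""
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,
        "--oem", "0",  # 0 = legacy engine only
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


# Hypothetical input file; pass the list to subprocess.run() to execute it.
cmd = legacy_whitelist_command("digits.png", "0123456789")
```

Passing the arguments as a list (for example to subprocess.run(cmd, capture_output=True, text=True)) avoids shell quoting problems when the whitelist contains characters such as quotes or parentheses.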
Is this problem still an issue? Or has anyone been able to solve it?
This issue is still open, so up to now nobody solved it. Feel free to fix it.
@nguyenq, @zdenop, I suggest to change the title to _Blacklist and whitelist unsupported with LSTM_, because nothing is broken. I also suggest to add the labels _enhancement_, _help wanted_ and _user request_.
Possible workarounds:
1) Using the --oem 0 option (the legacy engine will be used)
2) Retraining (fine tuning)
https://github.com/tesseract-ocr/tesseract/issues/751#issuecomment-333904808
3) Post-processing
https://github.com/tesseract-ocr/tesseract/issues/751#issuecomment-375408508
Is there any existing trained data for digits? I would use it if you have some. The @Shreeshrii links are broken ATM.
Sorry about that. I have uploaded a new set of traineddata for digits only at https://github.com/Shreeshrii/tessdata_shreetest
Are . and , also in?
Added another finetuned traineddata file with 0-9 . , - at https://github.com/Shreeshrii/tessdata_shreetest
Thank you!
Hi Shreeshrii,
I'm trying to train ara with just alpha characters. I am trying your plus-minus method using the tesstrain_minuschars.sh script that you posted above. I just took the existing ara.training_text from the lstm langdata and removed all of the punctuation and numbers.
It looks like the first part of training worked correctly (text2image and box extraction), and it was able to generate a new ara.unicharset. But then, it failed to load Latin.unicharset and Arabic.unicharset. I don't understand why it's looking for those when I'm just trying to train ara?
Then it started Phase E Generating lstmf files, and I got another error - Error opening data file ./tessdata_best/eng.traineddata. Again I don't understand why it's looking for the english traineddata?
Ultimately my training failed with this error:
rebuild starter traineddata
Failed to load unicharset from ./trained_minuschars_ara/ara/ara.unicharset
training from ./tessdata_best/ara.traineddata
mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110
Segmentation fault (core dumped)
But then, it failed to load Latin.unicharset and Arabic.unicharset. I don't understand why it's looking for those when I'm just trying to train ara?
You need to give the correct paths to the script unicharsets that are available in the root directory of langdata and langdata_lstm repositories. These are used for setting the unichar properties.
I am trying your plus-minus method using the tesstrain_minuschars.sh script that you posted above.
Please modify it appropriately for ara instead of eng and the paths matching your environment.
Hi Shreeshrii,
Thank you for your help!
I am trying your plus-minus method using the tesstrain_minuschars.sh script that you posted above.
Please modify it appropriately for ara instead of eng and the paths matching your environment.
I had modified the script for ara and my environment and also set the path to the langdata. However, I had only downloaded the langdata_lstm/ara. I didn't realize that I would also need Latin and Arabic unicharsets. After adding that, the "failure to load Latin/Arabic.unicharset" errors went away.
However, training is still failing and is still looking for eng.traineddata for some reason even though 'eng' never appears anywhere in the script.
=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=./tessdata_best
[Mon Oct 1 14:03:13 PDT 2018] /usr/local/bin/tesseract /tmp/ara-2018-10-01.jtr/ara.Helvetica_Neue_LT_Arabic.exp0.tif /tmp/ara-2018-10-01.jtr/ara.Helvetica_Neue_LT_Arabic.exp0 --psm 6 lstm.train ./langdata/ara/ara.config
Error opening data file ./tessdata_best/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Thanks again,
-Tara
Tesseract seems to use eng.traineddata by default. Please download and try further training.
After adding eng.traineddata, I got a different error "read_params_file: Can't open lstm.train". lstm.train is usually in the ./tessdata directory, but the script sets the tessdata directory to ./tessdata_best where I had put only ara.traineddata and eng.traineddata. So I moved those files to the default tessdata folder and used that instead. Now the training has succeeded!
Thanks!
-Tara
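The failures in this exchange all came from files missing in the directories the script pointed at. A small pre-flight check along these lines might catch them before training starts. This is only a sketch: the function name is hypothetical, and the expected layout (eng.traineddata and the language's traineddata in the tessdata directory, lstm.train under its configs subdirectory, script unicharsets in the langdata root) is an assumption based on the errors reported above and on a default install.

```python
from pathlib import Path


def missing_training_files(tessdata_dir, langdata_dir, lang, scripts):
    """Return the paths (as strings) of expected files that do not exist.

    The list of expected files is an assumption drawn from the errors in
    this thread; adjust it to match your own training script.
    """
    tessdata = Path(tessdata_dir)
    langdata = Path(langdata_dir)
    required = [
        tessdata / "eng.traineddata",         # loaded by default
        tessdata / f"{lang}.traineddata",     # model to fine-tune from
        tessdata / "configs" / "lstm.train",  # config named on the command line
    ]
    # Script-level unicharsets, e.g. Latin.unicharset and Arabic.unicharset.
    required += [langdata / f"{script}.unicharset" for script in scripts]
    return [str(path) for path in required if not path.is_file()]
```

For the ara case above, calling missing_training_files("./tessdata", "./langdata", "ara", ["Latin", "Arabic"]) and checking for an empty list before launching the script would be one way to use it.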
I was having the same issue in 4.00beta. It works fine for me in 3.04.
See https://github.com/tesseract-ocr/tesseract/wiki/Planning#features-from-30x-which-are-missing-for-lstm.
OK, so we can all agree that fine-tuning a model for a specific set of characters is a proper workaround for the "whitelist" feature from 3.0. However, bringing back the whitelist feature as it was would be nice.
@theraysmith if I understand correctly, the NN output goes through CTC, so at some point we have a probability distribution over all available labels at each time step (?). Why not just filter out the blacklisted characters (labels), e.g. by setting their probabilities to 0? Or mask the values before the softmax (if one is used), so that the probabilities still sum to 1. That way argmax would consider only whitelisted labels. Am I anywhere near the right track?
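To make the idea concrete, here is a toy Python sketch of masking before the softmax, followed by a greedy CTC decode. It is not Tesseract's decoder (which uses a beam search), and the label table, the blank index and the function names are all assumptions made for the illustration.

```python
import math


def masked_softmax(logits, allowed):
    """Softmax over one time step with disallowed labels forced to probability 0.

    Masked entries are set to -inf before normalising, so the remaining
    probabilities still sum to 1.
    """
    masked = [x if ok else -math.inf for x, ok in zip(logits, allowed)]
    peak = max(masked)  # finite as long as at least one label is allowed
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_ctc_decode(logit_steps, labels, whitelist, blank=0):
    """Greedy CTC decoding that only ever considers whitelisted labels.

    labels[i] is the character for output index i; the blank label is
    always allowed so that repeated characters can still be separated.
    """
    allowed = [i == blank or labels[i] in whitelist for i in range(len(labels))]
    decoded, previous = [], blank
    for logits in logit_steps:
        probs = masked_softmax(logits, allowed)
        best = max(range(len(probs)), key=probs.__getitem__)
        # Standard CTC collapse: skip blanks and immediate repeats.
        if best != blank and best != previous:
            decoded.append(labels[best])
        previous = best
    return "".join(decoded)
```

With a per-step argmax the renormalisation does not change which label wins; it matters once scores are combined across time steps, as a beam search does.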
@noahmetzger has worked on that and added a config option which outputs all symbols with their probabilities, so you could add that feature in a post processor. Maybe Noah can have a look whether there is a simple solution for a built in filter.
Any updates on how to implement a whitelist without using --oem 0 in version 4?
Thanks!
Is there any update on this? Or should I drop versions?
I think that RecodeBeamSearch() is the method that should be modified to make the whitelist/blacklist feature work. get_enabled() should be used.
@amitdo was right. I was able to get the old behaviour (whitelist, blacklist, unblacklist) back with the LSTM decoder by querying the unicharset's get_enabled for each output in ComputeTopN, ignoring it if disabled.
But it was not so easy (for me) to get the UnicharCompress (recoder) and RecodedCharID (label mapping) right – so that might be the wrong way to do it. Also, one important ingredient was that the unicharset member of the Tesseract class (which SetBlackAndWhiteList operates on) is _not_ the same as lstm_recognizer_->GetUnicharset(). The latter seems to be a stripped down version, so I'll have the whitelisting operate on both. See #2294.
Any updates on how to implement a whitelist without using --oem 0 in version 4?
Can we use ChoiceIterator to iterate through all possibilities, keeping/rejecting based on whitelist/blacklist, and using the top result left over, if it exists?
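A sketch of that selection step in Python, assuming the alternatives for each symbol have already been collected (for example from a ChoiceIterator) into plain (text, confidence) pairs. The function name and the data layout are hypothetical.

```python
def pick_allowed(symbol_choices, whitelist=None, blacklist=""):
    """For each symbol, keep the highest-confidence alternative that passes
    the whitelist/blacklist; drop the symbol if no alternative passes.

    symbol_choices: one list per recognised symbol, each holding
    (text, confidence) pairs for that symbol's alternatives.
    """
    allowed = None if whitelist is None else set(whitelist)
    banned = set(blacklist)
    result = []
    for choices in symbol_choices:
        candidates = [
            (text, conf)
            for text, conf in choices
            if (allowed is None or text in allowed) and text not in banned
        ]
        if candidates:
            best_text, _ = max(candidates, key=lambda pair: pair[1])
            result.append(best_text)
    return "".join(result)
```

This can only recover a character that the engine actually offered as an alternative, so a symbol is dropped when none of its alternatives is allowed.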
Has this been fixed in 5.0.0-alpha?
Has anyone tested the 4.1 Version? It's supposed to be fixed now? release notes
@axhagemann
This appears to be working for me, just upgraded my 4.0 version to 4.1. FINALLY! lol been waiting on this.
Really excited ! It works!
@nguyenq, can we close this issue?
Didn't realize I'd opened this issue. It's been so long ago. :)
https://github.com/tesseract-ocr/tesseract/wiki#tesseract-4-packages-with-lstm-engine-and-related-traineddata
Install from Alex's ppa
On Fri, 30 Aug 2019, 21:18 OllieD3711, notifications@github.com wrote:
This appears to be working for me, just upgraded my 4.0 version to 4.1. FINALLY! lol been waiting on this.
@kev2316 I'm having trouble upgrading to 4.1.0. I'm sure I'm doing something stupid (using sudo apt-upgrade, and also tried sudo apt install), but when I try upgrade, I'm told I already have the latest version, 4.00~git2288-10f4998a-2. How can I upgrade to 4.1.0?
Just to conclude and in addition:
To solve the black- and whitelist problem in version 4.0, two solutions have already been mentioned:
- Update Tesseract to version 4.1 (the future-oriented approach)
- Use the legacy mode with the --oem flag
The Ubuntu package sources only contain tesseract version 4.0.0-beta.1. If you can't upgrade and don't want to use the legacy mode, try to build a simple black- or whitelist function if you are using tesseract with a wrapper in another programming language. One example with PyTesserocr can be found in this blog article: return2 – Python Tesseract 4.0 OCR: Recognize only numbers / digits and exclude all other characters.
Alex's ppa can be used on Ubuntu for the latest versions.
Please see https://github.com/tesseract-ocr/tesseract/wiki
Just to conclude and in addition:
To solve the black- and whitelist problem in version 4.0, two solutions have already been mentioned:
- Update Tesseract to version 4.1 (the future-oriented approach)
- Use the legacy mode with the --oem flag
The Ubuntu package sources only contain tesseract version 4.0.0-beta.1. If you can't upgrade and don't want to use the legacy mode, try to build a simple black- or whitelist function if you are using tesseract with a wrapper in another programming language. One example with PyTesserocr can be found in this blog article: return2 – Python Tesseract 4.0 OCR: Recognize only numbers / digits and exclude all other characters.
How do I upgrade from version 4.0 to 4.1, sir?
I don't know how to install it on my Ubuntu.
@tammarut, please use our forum for asking questions.