Tesseract: Removing fi/fl ligatures from eng traindata

Created on 3 Sep 2018  ·  8 Comments  ·  Source: tesseract-ocr/tesseract


Environment

  • Tesseract Version: 4.00.00alpha
  • Platform: Win7 x64

Current Behavior:

fi/fl are output as ligature characters (ﬁ/ﬂ) in gImageReader.

Expected Behavior:

While this is great for viewability, it is unhelpful to spellcheckers (hunspell), which do not recognise these Unicode ligature characters.

Is there a way to remove these from the ready-made english training models without having to retrain or any long-winded solution?

Thank you.

All 8 comments

This person thinks post-process after OCR is the right way to go. You might want to ask the gImageReader author if it makes sense to have ligature removal as a built in feature.

https://stb-tester.com/blog/2014/04/14/improving-ocr-accuracy
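For anyone taking the post-processing route, here is a minimal sketch in Python of expanding the ligatures after OCR, before the text reaches a spellchecker. The helper name `expand_ligatures` and the mapping table are illustrative assumptions, not part of Tesseract or gImageReader; the table covers only the Latin f-ligatures U+FB00 to U+FB04.

```python
# Hypothetical post-processing helper; not part of Tesseract or gImageReader.
# Maps the Latin f-ligature code points to their plain ASCII spellings.
LIGATURES = {
    "\ufb00": "ff",   # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",   # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",   # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\ufb04": "ffl",  # LATIN SMALL LIGATURE FFL
}

# Build the translation table once so repeated calls stay cheap.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Return text with f-ligatures replaced by their ASCII letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(expand_ligatures("\ufb01nal \ufb02ight"))  # prints "final flight"
```

An explicit table is used here rather than `unicodedata.normalize("NFKC", text)` because NFKC would also rewrite unrelated characters (superscripts, full-width forms and so on), which may not be wanted in OCR output.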

I think Ray changed this 'feature' in a later version. Try the code from master/beta.4 with the best/fast traineddata.

gimagereader's hub doesn't appear to be that active and I still believe it to be an upstream "issue", though I will definitely drop an issue over there (I think I might have already done so).

Thanks Amit, I'll try that branch out and see how it plays out 👍

A year ago?! Okay there must be a bug in whatever gimagereader's using then, or itself. Did try to find a code change in tesseract, so thanks for the diff link.

Strangely the default install of gimagereader doesn't seem to do this in Linux, so maybe it's Windows-specific. Either way I'll get onto them, it's clearly not an issue with tess.

Thanks again buddy!

https://github.com/manisandro/gImageReader/releases

As with previous releases, the Windows builds using tesseract 4 are to be considered experimental.

Tesseract Version: 4.00.00alpha

https://github.com/tesseract-ocr/tesseract/releases/tag/4.0.0-alpha
Nov 8, 2016

Prior to the ligature change I'm guessing almost 2 years ago! Wow. Using newer traindata seems to work fine. Thanks for all the help, really appreciate it
