Tesseract: Tag a new version for LSTM 4.0

Created on 17 Jun 2017  ·  108 Comments  ·  Source: tesseract-ocr/tesseract

Many fixes have been made to master branch for 4.0 since the 4.00.00alpha release in November 2016. A number of assertions have been fixed.

@zdenop Please add a new tag, e.g. 4.0.0alpha-1 / 2 (numbering as you consider appropriate). Thanks!

All 108 comments

It would be good to decide about using semantic versioning soon. Maybe it can be used for the next tag.

I have not seen any comments against semver.

It may be good to set up some kind of auto-update that increments the PATCH
version based on commit numbers, to reduce manual administrative updates.

@stweil From what I have read about semver, if you were to implement the
zipped traineddata and related changes, it should cause a change in MINOR
version.

So, with that, should it be 4.1.0alpha?

Given a version number MAJOR.MINOR.PATCH, increment the:

  • MAJOR version when you make incompatible API changes,
  • MINOR version when you add functionality in a backwards-compatible manner, and
  • PATCH version when you make backwards-compatible bug fixes.

Additional labels for pre-release and build metadata are available as
extensions to the MAJOR.MINOR.PATCH format.

The first 4.x version will be 4.0.0. What 4.1.0alpha are you talking about? We don't care about changes in dev branches.

We could tag the current release as a pre-release or as a release candidate. According to semver.org, it could be called something like 4.0.0-rc.1 (that's how semver.org named its own releases), 4.0.0-beta.1 or 4.0.0-beta.20170619.
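To make the ordering of these candidate names concrete, here is a minimal Python sketch of the precedence rules from semver.org (numeric core first, a pre-release ranks below the plain release, pre-release identifiers compared field by field). The function names are made up for this illustration, the regular expression is a simplified form of the semver grammar, and none of it is part of Tesseract.

```python
import re
from functools import cmp_to_key

# Simplified SemVer pattern: MAJOR.MINOR.PATCH, optional -prerelease, optional +build.
_SEMVER = re.compile(
    r"^(\d+)\.(\d+)\.(\d+)"
    r"(?:-([0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*))?"
    r"(?:\+[0-9A-Za-z.-]+)?$"
)


def parse(version):
    """Split a version into ((major, minor, patch), [pre-release identifiers])."""
    m = _SEMVER.match(version)
    if m is None:
        raise ValueError(f"not a semver string: {version!r}")
    major, minor, patch, pre = m.groups()
    return (int(major), int(minor), int(patch)), (pre.split(".") if pre else [])


def _cmp(a, b):
    return (a > b) - (a < b)


def _cmp_identifier(a, b):
    # Numeric identifiers compare as numbers and rank below alphanumeric ones.
    if a.isdigit() and b.isdigit():
        return _cmp(int(a), int(b))
    if a.isdigit() != b.isdigit():
        return -1 if a.isdigit() else 1
    return _cmp(a, b)


def compare(x, y):
    """Return -1, 0 or 1 according to semver precedence (build metadata ignored)."""
    (core_x, pre_x), (core_y, pre_y) = parse(x), parse(y)
    if core_x != core_y:
        return _cmp(core_x, core_y)
    if not pre_x or not pre_y:
        # A version without a pre-release label outranks one with a label.
        return _cmp(not pre_x, not pre_y)
    for a, b in zip(pre_x, pre_y):
        c = _cmp_identifier(a, b)
        if c:
            return c
    # All shared fields equal: the longer identifier list ranks higher.
    return _cmp(len(pre_x), len(pre_y))


candidates = ["4.0.0", "4.0.0-rc.1", "4.0.0-alpha.20170711",
              "4.0.0-beta.20170619", "4.0.0-beta.1"]
ordered = sorted(candidates, key=cmp_to_key(compare))
```

With these rules, `ordered` runs from the alpha tag through the beta and rc tags to the plain 4.0.0. The existing tag name 4.00.00alpha is rejected by `parse`, because it has no hyphen before the label.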

We don't care about changes in dev branches.

OK.

Still, it will be good to have new tags when changes are substantial enough from previous commits. For example,

  • change of LSTM mode from --oem 4 to --oem 1 after removal of cube
  • change in .lstmf and .lstm file formats after update regarding endianness
  • proposed change in traineddata files to zipped format

That said, I have only done some cursory reading regarding semver. So, I am happy with whatever tag/version is used, as long as there is some demarcation.

The reason for asking is that people are using, or trying to use, the master branch (4.0/LSTM) and asking questions. The version info only says -alpha or -dev, so it is difficult to figure out what the issue is without knowing exactly which version is being used.

I vote for this format which includes date - easy to identify which version is more recent.

4.0.0-beta.20170619
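As a small illustration of why the date-stamped format is easy to compare, the Python sketch below builds such a tag name from a date. The helper name `dated_prerelease_tag` is hypothetical and only shows the naming scheme; it does not create a git tag.

```python
from datetime import date


def dated_prerelease_tag(base, label, day):
    """Build a tag name such as 4.0.0-beta.20170619 from its parts."""
    # %Y%m%d is zero-padded, so tags with the same base and label
    # sort chronologically even as plain strings.
    return f"{base}-{label}.{day:%Y%m%d}"


older = dated_prerelease_tag("4.0.0", "beta", date(2017, 6, 19))
newer = dated_prerelease_tag("4.0.0", "beta", date(2017, 7, 11))
```

Here `older` is "4.0.0-beta.20170619", and `older < newer` holds as an ordinary string comparison.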

Please see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/V1tyGHIenbI/SUVuXheJAwAJ

An example of how 4.00.00alpha is NOT compatible with the current master branch, e.g. the --oem options.

@theraysmith, can you give us an update on your work? When are we going to see it?

Hi, same: can you give us an update on your work? When are we going to see 4.0 released?

+1 for a new tag.

Since Ray does not reply, I suggest we still use 'alpha'.

4.0.0-alpha.YYYYMMDD

@zdenop, can you do it, or at least add your comment here?

I'm about ready to update the traineddatas. I have a training run almost
complete, and with accuracy that meets with my satisfaction.
There are a few regressions, but not too serious.
First though, I have to get some code reviewed in Google, and then make
some commits to github to match the new traineddatas.
Before that, there is the matter of a major pull...

Here's what's coming:

  • Fix to issue 653: New components in traineddata file for the
    unicharset, recoder and version string. Backwards compatible change, so the
    LSTM component can still read older files.
  • Change in training system. The above change makes open source training
    impossible. Will add a new program to build a starter traineddata from a
    unicharset and optional word lists.
  • New "normalization" code to clean corpus text in all languages. That
    was a big part of the work.
  • Improvements to the trained networks to improve accuracy on single
    characters and single words.
  • 2 parallel sets of tessdata. "best" and "fast". "Fast" will exceed the
    speed of legacy Tesseract in real time, provided you have the required
    parallelism components, and in total CPU only slightly slower for English.
    Way faster for most non-latin languages, while being <5% worse than "best".
    Only "best" will be retrainable, as "fast" will be integer.

I have other stuff that is still incomplete, but that is a good list for
now.

BTW, in case you hadn't noticed, there was a breaking change that made old
lstmf files unusable. That was needed to fix LSTM for OSD. It has to know
the language of each training sample.
The new traineddatas will mostly be smaller than the older ones, as they
won't contain the legacy components, and no bigram dawgs are needed.

On Tue, Jul 11, 2017 at 4:49 AM, Amit D. notifications@github.com wrote:

@zdenop https://github.com/zdenop, can you do it, or at least add your
comment here?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/tesseract-ocr/tesseract/issues/995#issuecomment-314419211,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AL056SvL5FeeE09JYW01xQ-dQyILyU8Wks5sM2ExgaJpZM4N9Nel
.

--
Ray.

Superb. Is there anything we could do to help you? Cheers.

@theraysmith Thanks for the update. Look forward to it. Any estimate of expected date?

@zdenop I think this is a good reason to freeze the 'alpha' state by tagging the repo with the current version as 4.0.0-alpha.YYYYMMDD, since Ray is going to be making major changes.

I'm about ready to update the traineddatas.

That's good news.

The above change makes open source training impossible.

If I got that right, it would be horrible. Being able to create new traineddata is essential for me.

@Shreeshrii: I do not understand what you want. A tag will not freeze anything. A tag is just a specific point in history that marks something important (e.g. a new version). Tagging should be driven by a developer who knows the roadmap, not by users...

@zdenop

A tag is just a specific point in history that marks something important (e.g. a new version).

Exactly my point :-)

When Ray makes his next set of commits, that will change the codebase as well as traineddata substantially. I am sure it will be tagged by Ray at that time, probably as a beta or release candidate.

My request to you to tag current commit (as an example) is to mark a point in history where a lot of development has taken place since the original 4.00.00alpha tag. In fact, that original tag just marked the start of the 4.00.00alpha development and many bugs in that original tag (missing lstm.train file etc.) have been fixed later.

Also, if the new changes by Ray will not allow for open source training :-( then the current github version will be the one which allows users to do their own training. So, it is certainly deserving of a tag in my opinion :-)

Open source training:
OK, I overstated it a bit.
One of my commits will temporarily break the training process. After doing
so, I will correct the documentation and add the new tool (which I have
already written) as quickly as possible after.

To help:
No more breaking commits! If it doesn't produce perfect results on
phototest, it broke something!
Cutting down on the code cleanup while I am working on it will also help.
When I have committed the new corpus cleanup code, it would be useful to
have any experts in any of the following scripts review the code and make
comments:
Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada,
Malayalam, Sinhala, Thai, Myanmar, Khmer.
There are script-specific cleanup rules in there.
Since I plan to commit new copies of the training data (unicharsets,
wordlists, training text etc) then at that point they will match

Dates:
I was going to get started this week, but now I have to debug my pull from
github, which has broken tests (of the legacy engine), so that will take
time to fix. I'm hoping it's simple, but it is bizarre.
Even when it is fixed, there are 1500 lines of change from github for
someone here to review.
I really want to get 4.00 finished (in beta) in the next 5-6 weeks.

--
Ray.

When I have committed the new corpus cleanup code, it would be useful to
have any experts in any of the following scripts review the code and make
comments:
Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada,
Malayalam, Sinhala, Thai, Myanmar, Khmer.
There are script-specific cleanup rules in there.

What kind of expertise do you need regarding the Indic scripts?

ShreeDevi


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

The code determines what makes a valid/invalid sequence of unicodes in the
script, for instance, is it allowed to have two matras in a row? It gets
more difficult with questions over what category the additional characters
are.

--
Ray.

No, it is not valid to have any two matras in a row - Devanagari 093E-094C.

However, these can be followed by Anusvar, Chandrabindu or Visarga i.e. 0901-0903

In case of Vedic Sanskrit, these can be followed by the Vedic accents eg. 0951, 0952, 1CDA etc

However, I have seen samples in legacy fonts where a number of separate matras are used to create another one, e.g. (using Unicode code points) 093E followed by 0947 to create 094B: ा े to make ो

Similarly, in legacy fonts, half letters (letter followed by virama) may be followed by aa maatraa to create the complete letter in cases such as ga, sha etc., i.e. 0936 + 094D + 093E to create 0936 for sha

It is possible that some converters from legacy font to unicode retain these errors.

Also, in case of Vedic Sanskrit, the valid order should be matra, combining mark (anusvar, visarga), vedic accent . Some fonts incorrectly use matra, vedic accent and combining mark which will lead to dotted circle. eg. अंशाः॑ vs अंशा॑ः
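The ordering rules described above can be sketched as a small Python check. This is only an illustration of the rules as stated in this comment, not the validator code Ray refers to: the names are hypothetical, the Vedic set contains just the three accents mentioned here, and treating a matra that follows an anusvara or visarga as out of order is an assumption derived from the stated "matra, combining mark, vedic accent" order.

```python
# Rank of each mark class in the expected order: matra < sign < Vedic accent.
MATRA = 1   # dependent vowel signs, U+093E..U+094C
SIGN = 2    # chandrabindu, anusvara, visarga, U+0901..U+0903
VEDIC = 3   # Vedic accents (illustrative subset only)

VEDIC_ACCENTS = {0x0951, 0x0952, 0x1CDA}


def mark_rank(ch):
    """Return the rank of a Devanagari mark, or 0 for any other character."""
    cp = ord(ch)
    if 0x093E <= cp <= 0x094C:
        return MATRA
    if 0x0901 <= cp <= 0x0903:
        return SIGN
    if cp in VEDIC_ACCENTS:
        return VEDIC
    return 0


def find_violations(text):
    """Return (index, reason) for each mark that breaks the stated order."""
    problems = []
    prev = 0
    for i, ch in enumerate(text):
        rank = mark_rank(ch)
        if rank == MATRA and prev == MATRA:
            problems.append((i, "two matras in a row"))
        elif rank and prev > rank:
            problems.append((i, "mark out of order"))
        prev = rank
    return problems


# matra, visarga, Vedic accent: the order described as valid
valid = "\u0905\u0902\u0936\u093E\u0903\u0951"
# matra, Vedic accent, visarga: the order described as incorrect
swapped = "\u0905\u0902\u0936\u093E\u0951\u0903"
# ka + 093E + 0947: the legacy-font way of faking 094B
legacy = "\u0915\u093E\u0947"
```

`find_violations(valid)` returns an empty list, while the other two samples each report one problem at the offending mark.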

For a sample of Vedic Sanskrit and its ground truth, see
https://github.com/Shreeshrii/tess4training/blob/master/BRH-test.tif
https://github.com/Shreeshrii/tess4training/blob/master/BRH-test.txt

Will your new sanskrit traineddata be able to OCR this?

The new traineddatas will mostly be smaller than the older ones, as they
won't contain the legacy components, and no bigram dawgs are needed.

Will you remove the code of the legacy engine in this round?

On Wed, Jul 12, 2017 at 9:39 PM, Shreeshrii notifications@github.com
wrote:

No, it is not valid to have any two matras in a row - Devanagari 093E-094C.

However, these can be followed by Anusvar, Chandrabindu or Visarga i.e.
0901-0903

It seems that Malayalam is unique in allowing multiple 0d02 (Anusvara)?

In case of Vedic Sanskrit, these can be followed by the Vedic accents eg.
0951, 0952, 1CDA etc

However, I have seen samples in legacy fonts where a number of separate
matras are used to create another one eg. 093E followed by 0947 to create
094b

These are specifically dis-allowed by unicode, but the rules seem to be
very script-specific, and not very consistently documented in the unicode
standard. I don't think the rules are addressed properly for all scripts.

Similarly in legacy fonts, half letters (letter followed by virama) may be
followed by aa maatraa to create the complete letter in cases such as ga,
sha etc. i.e. 0936 + 094D + 093E to create 0936 for sha

It is possible that some converters from legacy font to unicode retain
these errors.

Also, in case of Vedic Sanskrit, the valid order should be matra,
combining mark (anusvar, visarga), vedic accent . Some fonts incorrectly
use matra, vedic accent and combining mark which will lead to dotted
circle. eg. अंशाः॑ vs अंशा॑ः

The code aims to dis-allow text designed for such legacy fonts.
The documentation that I have found is very good for Devanagari, but
lacking for some of the other scripts.
For instance, there is a big table in the unicode standard for Myanmar, (
http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf) but it doesn't cover
any of the extension Myanmar characters, and isn't explicit about whether
the table represents a specific valid order or not. The existence of a lot
of legacy Myanmar text on the web that is designed for non-compliant fonts
doesn't help make it easier to determine whether the filter is correct.


--
Ray.

Whether the legacy engine code will be removed is still an open question.
I have limited time to spend on it, so I am resistant to delaying tactics
such as changing types in the dead code to POSIX.
Whether enough uses of Tesseract can be covered by the new engine is still
being debated, and the new models that I have need to be evaluated before
enough of the community is convinced.
I accept the requirement to add one or more new characters without the need
for full retraining, and will not delete the legacy code until that need is
addressed. (I think it can be done).
The legacy code is used by the OSD model, so deletion of the legacy code is
also blocked until there is a good enough replacement.

--
Ray.

It seems that Malayalam is unique in allowing multiple 0d02 (Anusvara)?

That does not sound right. Please see
https://en.wikipedia.org/wiki/Malayalam_script#Anusvaram

I did a search on ംം (two anusvarams in malayalam script) and most of them show up in the search result in pdfs.

FYI, pdfs created with documents having text in unicode fonts for complex scripts do not save the unicode text correctly. Devanagari text copied from these pdf is not correct, I assume similarly for malayalam and other Indian scripts, and that might be causing this double anusvar problem.

Newer PDFs created in a special manner, e.g. with 'actual text' using xelatex, are OK (e.g. http://sanskritdocuments.org/doc_devii/annapurna.pdf), but those created by various other software are not (http://www.sanskritweb.net/sansdocs/nala-d.pdf).

@jbreiden can give you the technical reasoning for this.

Google search does show pdfs as part of the search results, so there is some internal OCR (is it tesseract???) being done on the pdfs, books etc as part of the search process. But it may not be fully correct.

So for the corpus for training, I would suggest to avoid text taken from pdfs (in case it is being used).

@theraysmith Regarding Malayalam, double anusvara

Please see
http://unicode.org/charts/PDF/U0D00.pdf
http://www.alanwood.net/unicode/malayalam.html
http://www.omniglot.com/language/numbers/malayalam.htm

zero in Malayalam script - pujyam looks very much like the sign for anusvar.

Also, there are different anusvars shown in unicode chart--

0D00 $ഀ MALAYALAM SIGN COMBINING ANUSVARA ABOVE
0D02 $ം MALAYALAM SIGN ANUSVARA
• used in Prakrit language texts to indicate gemination of the following consonant

0D3B $഻ MALAYALAM SIGN VERTICAL BAR VIRAMA
0D3C $഼ MALAYALAM SIGN CIRCULAR VIRAMA

I will look up more info and post under an issue in langdata

Direct from the unicode standard:
Anusvara. The anusvara can be seen multiple times after vowels, whether
independent letters or dependent vowel signs, as in vxxxx <0D08, 0D02,
0D02, 0D02, 0D02>. Vowel signs can also be seen after digits, as in 355wx
<0033, 0035, 0035, 0D3E, 0D02>. More generally, rendering engines should be
prepared to handle Malayalam letters (including vowel letters), digits
(both European and Malayalam), dashes, U+00A0 no-break space and U+25CC
dotted circle as base characters for the Malayalam vowel signs, U+0D4D
malayalam sign virama, U+0D02 malayalam sign anusvara, and U+0D03 malayalam
sign visarga. They should also be prepared to handle multiple combining
marks on those bases.

Is it wrong?
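For anyone who wants to check a corpus for this case, the following Python sketch locates runs of repeated U+0D02 so they can be reviewed by hand. It cannot tell a legitimate repeated anusvara from a PDF extraction artifact; it only reports where the runs are. The function name is hypothetical, and the sample is the code point sequence quoted from the standard above.

```python
import re
import unicodedata

ANUSVARA = "\u0D02"  # MALAYALAM SIGN ANUSVARA
_REPEATED = re.compile(f"{ANUSVARA}{{2,}}")


def repeated_anusvara_runs(text):
    """Return (start_index, run_length) for each run of two or more U+0D02."""
    return [(m.start(), len(m.group())) for m in _REPEATED.finditer(text)]


# <0D08, 0D02, 0D02, 0D02, 0D02> from the quoted passage
sample = "\u0D08\u0D02\u0D02\u0D02\u0D02"
runs = repeated_anusvara_runs(sample)
names = [unicodedata.name(ch) for ch in sample[:2]]
```

For the sample, `runs` contains a single run of length 4 starting at index 1, and `names` confirms the code points are MALAYALAM LETTER II followed by MALAYALAM SIGN ANUSVARA.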

--
Ray.

OK, I have pushed this week's changes:

  • Fixes to pull from github. There were bugs introduced and required code
    deleted. Also reformatted/modified according to Google code standards.
  • Major new normalization/text cleanup code in training/validat*. The best
    help with this would be expertise in the various scripts, as previously
    discussed.
  • Deleted some code from the LSTM recognizer that was old and unused.
    (Backwards compatible change).
  • Part 1 of the changes required to move the unicharset and recoder so they
    are stored in the traineddata and therefore accessible.

I have not searched through my emails to find the relevant issues to update
them yet.
The traineddatas and training source data are not yet updated. That is
probably a while away yet, so the issue about the unicharset and recoder
is not yet fully resolved anyway.
The training process shouldn't be broken by these changes yet, I hope, but
the documentation is no longer accurate.

If you run a new training or incremental/fine tuning training, the new
output files will be a traineddata directly, not an LSTM traineddata
component.

That output traineddata should contain some version string and separate
lstm unicharset/recoder.

The next step is to change the lstmtraining program to accept a traineddata
instead of a unicharset, and add a tool to generate the traineddata, then
update the documentation to match.

--
Ray.

Actually, I take that back. I don't think the output from --stop_training
is different to what it was before. It is still an LSTM traineddata
component.

--
Ray.

Ray,
You are right. Looks like Malayalam does have different rules, including repeated vowels.

Please see section 8.4.3 in http://thottingal.in/documents/Fontbook.pdf by @santhoshtr.

In samvruthokaram - ◌ു് virama is applied to a vowel sign

Another exception is у. This combination of a long vowel sign and anusvara is used to denote "nth" like, 16у or 16-у meaning 16th.

Repeated vowel signs are used to denote elongation of a vowel pronunciation

Request Santhosh Thottingal @santhoshtr to comment regarding multiple anusvars.

See https://github.com/tesseract-ocr/langdata/issues/35#issuecomment-320330996

for Ray's comments about next set of changes

See https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-dev/_s0TOmDlEAs/uRJ-Ozi8AAAJ for updates (messages from Jeff Breidenbach).


Aug 28, 2017

Alexander Pozdnyakov has done a really good job packing Tesseract in his
Personal Package Archive (PPA). I think it is getting to be time for wider usage,
so I'm working with him to promote these to official packages. First step is
Debian Experimental. That's a good place to work out problems, and hopefully
something can be ready for real users within a few weeks.


Sep 7, 2017

we will have three sets of .traineddata
files on GitHub in three separate repositories. Most users
will want LSTM Fast and that is what will be shipped as
part of Linux distributions. LSTM Best is for people willing
to trade a lot of speed for slightly better accuracy. It is also
better for certain retraining scenarios for advanced users.
The third set is for the legacy recognizer.


Sep 15, 2017

Populated the new repositories, and removed the LSTM files from tessdata.
I'm sure documentation needs updating.

2 parallel sets of tessdata: "best" and "fast". "Fast" will exceed the speed of legacy Tesseract in real time, provided you have the required parallelism components, and in total CPU will be only slightly slower for English. It is way faster for most non-Latin languages, while being <5% worse than "best". Only "best" will be retrainable, as "fast" will be integer.

@theraysmith, thanks for providing "fast" now. Are you planning to release free documentation / tools for everybody to produce "fast" data? I noticed that apart from the LSTM model the rest of the traineddata files for "best" and "fast" are identical. Wouldn't it save space and make the handling easier if both variants were in the same traineddata container (this requires an option to select the desired one, of course) instead of having two parallel sets?

Sorry, I don't follow. Which parts are identical?

$ du -sh best fast
1.7G    best
657M    fast

1.7G best
657M fast

Jeff, playing with the numbers?
:-)
[He changed the numbers in his comment]

But the sizes are very different and don't follow a consistent pattern.

tessdata_fast/eng.traineddata 3.9 MB
tessdata_best/eng.traineddata 14.7 MB

tessdata_fast/ara.traineddata 1.4 MB
tessdata_best/ara.traineddata 12 MB

What can affect the size of the fast traineddata?

@roozgar, it is possible to extract the parts of a traineddata file using combine_tessdata -u traineddata_file output_path_prefix. Usually the largest parts are the LSTM model and the word list, but not all languages have a huge word list like eng.traineddata or Latin.traineddata.

@amitdo I can't seem to write a single comment without editing it three times to fix mistakes.
@roozgar Models using integer arithmetic (traineddata_fast) are smaller than ones using floating point.

@amitdo I can't seem to write a single comment without editing it three times to fix mistakes.

LOL. It happens to me too. I keep discovering mistakes after I post a comment.

@stweil You had asked somewhere about tools for converting to fast/integer models... Can't find that comment to reply to. The training wiki has the answer ...

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#lstmtraining-command-line

Flag | Type | Default | Explanation
-- | -- | -- | --
stop_training | bool | false | Convert the training checkpoint in --continue_from to a recognition model.
convert_to_int | bool | false | With stop_training, convert to 8-bit integer for greater speed, with slightly less accuracy.
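For anyone wanting to script this, here is a minimal sketch that only assembles the lstmtraining command line for the two flags quoted above. The checkpoint, traineddata and output paths are hypothetical placeholders, and it assumes `--traineddata` and `--model_output` are passed alongside `--continue_from` as in the training wiki examples. Nothing is executed here.

```python
def build_convert_command(checkpoint, traineddata, output, to_int=True):
    """Build (but do not run) an lstmtraining call that stops training
    and writes a recognition model from a checkpoint."""
    cmd = [
        "lstmtraining",
        "--stop_training",
        "--continue_from", checkpoint,
        "--traineddata", traineddata,
        "--model_output", output,
    ]
    if to_int:
        # Request the 8-bit integer ("fast"-style) model.
        cmd.append("--convert_to_int")
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_convert_command(
    "output/base_checkpoint",
    "tessdata_best/eng.traineddata",
    "output/eng_int.traineddata",
)
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run` once the paths point at real files.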

Thank you!

Hi everybody, so what are now the remaining tasks in order to release Tesseract 4.0.0 ?

Please see https://groups.google.com/d/msgid/tesseract-dev/2703d7a2-44e4-493c-a2fe-86891e2f0933%40googlegroups.com for comments from Jeff regarding debian and ubuntu release

Copied part of msg from @jbreiden

"To give a small update, a Dec 15 git snapshot is now shipping as
part of Debian Unstable and Debian Testing. I expect it to be part
of Ubuntu 18.04 (releasing in April 2018) but has not yet been
integrated there. Thank you again to Alexander for doing 99%
of the work with his PPA.

If I am reading these survey numbers right, Tesseract is installed on
8% of Debian systems, and executed recently on 2% of them. There
are now 347 packages that depend on Tesseract, with 6 of them being
direct dependencies.

https://qa.debian.org/popcon.php?package=tesseract

If anyone notices any problems with any of these packages, this is
a very good time to speak up."

$ tesseract --oem 0 phototest.tif - -
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

The right usage is:
tesseract --oem 0 -l osd phototest.tif -

Edit: I confused 'oem' with 'psm'.

@amitdo The 'fast' and 'best' traineddata files do not contain legacy model. Hence --oem 0 and --oem 2 will not work.

#!/bin/bash
# Time OCR of each capture with the best, fast and legacy tessdata sets.
for img_file in ./Cap*.png; do
  echo "**************************** ${img_file} oem 2**********************************"
  time tesseract --tessdata-dir /mnt/c/Users/User/shree/tessdata_best/ "${img_file}" "${img_file%.*}-eng-best" --oem 2 --psm 6 -l eng
  time tesseract --tessdata-dir /mnt/c/Users/User/shree/tessdata_fast/ "${img_file}" "${img_file%.*}-eng-fast" --oem 2 --psm 6 -l eng
  time tesseract --tessdata-dir /mnt/c/Users/User/shree/tessdata/ "${img_file}" "${img_file%.*}-eng" --oem 2 --psm 6 -l eng
done
root@All-in-1-Touch:/mnt/c/Users/User/shree# bash ./tess.sh
**************************** ./Capture.png oem 1**********************************
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica

real    0m4.469s
user    0m11.375s
sys     0m0.406s
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica

real    0m2.209s
user    0m3.797s
sys     0m0.234s
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica

real    0m3.785s
user    0m8.219s
sys     0m0.531s


root@All-in-1-Touch:/mnt/c/Users/User/shree# bash ./tess.sh
**************************** ./Capture.png oem 2**********************************
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

real    0m0.621s
user    0m0.078s
sys     0m0.297s
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

real    0m0.425s
user    0m0.031s
sys     0m0.125s
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica

real    0m3.772s
user    0m7.969s
sys     0m0.578s

@jbreiden

I was suggesting an error/warning message when --oem 0 or 2 are used with 'best' or 'fast' traineddata, which is converse of the following:

https://github.com/tesseract-ocr/tesseract/blob/dc8745e6fd4c6c070076c44565924faa0d0643a7/ccmain/tessedit.cpp#L187

https://github.com/tesseract-ocr/tesseract/blob/dc8745e6fd4c6c070076c44565924faa0d0643a7/ccmain/tessedit.cpp#L196

      tprintf("Error: LSTM requested, but not present!! Loading tesseract.\n");
      tessedit_ocr_engine_mode.set_value(OEM_TESSERACT_ONLY);
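To make the suggestion concrete, here is a rough sketch of the converse check, written in Python rather than in Tesseract's C++. It assumes the legacy engine needs the `inttemp`, `pffmtable` and `normproto` components and the LSTM engine needs `lstm`, going by the `combine_tessdata -d` listings in this thread. The function name and the legacy warning text are made up for illustration.

```python
LEGACY_PARTS = {"inttemp", "pffmtable", "normproto"}  # assumed legacy components
LSTM_PARTS = {"lstm"}                                 # assumed LSTM component


def check_engine_mode(oem, components):
    """Return None if the requested --oem value can be served by the given
    traineddata components, otherwise a warning string.

    oem: 0 = legacy only, 1 = LSTM only, 2 = legacy + LSTM.
    Any other value (such as the default mode) is not checked here.
    """
    present = set(components)
    if oem in (0, 2) and not LEGACY_PARTS <= present:
        return "Error: legacy engine requested, but not present!"
    if oem in (1, 2) and not LSTM_PARTS <= present:
        return "Error: LSTM requested, but not present!"
    return None


# Component names as listed for a tessdata_fast file.
fast_parts = ["config", "lstm", "lstm-punc-dawg", "lstm-word-dawg",
              "lstm-number-dawg", "lstm-unicharset", "lstm-recoder", "version"]
print(check_engine_mode(0, fast_parts))
print(check_engine_mode(1, fast_parts))
```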

The 'fast' and 'best' traineddata files do not contain legacy model. Hence --oem 0 and --oem 2 will not work.

I know, I even added this info to the wiki a while ago.

Still,

tesseract --psm 0 -l osd phototest.tif -

should work.

Hi everybody
Looks like the dec 15 release is/was a good milestone and at least a good test on Ubuntu 18.
What about now creating a "4.0.0beta" tag ?
Kind

The package version number is 4.00~git2188-cdc35338-2 so that's commit cdc35338. Maybe give it a little time to settle? We had a critical bug the other day, but that turned out to be in Leptonica.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=885704

Hi everybody
Looks like the dec 15 release is/was a good milestone and at least a good test on Ubuntu 18.
What about now creating a "4.0.0beta" tag ?
Kind

https://wiki.ubuntu.com/BionicBeaver/ReleaseSchedule

January 11th | Alpha 1
February 1st | Alpha 2
March 1st | FeatureFreeze, Debian Import Freeze
March 8th | Beta 1 Freeze
April 5th | Final Beta Freeze, Final Beta
April 19th | FinalFreeze, ReleaseCandidate
April 26th | FinalRelease, Ubuntu 18.04

The suggested tag would be 4.0.0-beta.20180105 (for today), see the discussion above.

Well, I don't really care about the name of the tag as long as one is created soon.
Reminder: a tag is 'just' a tag (it's not a branch), just super convenient to compare between different milestones.

@jbreiden

Jeff, Would it be possible for you to update the langdata repository to match the 4.00alpha tessdata files, on behalf of @theraysmith ? It would help out those who are trying to finetune traineddata for their specific languages. Thanks!

edit:
It will also address the requirement of debian regarding language source files.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=699609

Updating the langdata files would also help to identify and fix systematic bugs for future trainings.

@Shreeshrii @stweil Okay, I'll investigate. Will probably take some time.

@jbreiden

How are you going to support SIMD with the debian/ubuntu binary package?
https://lists.debian.org/debian-mentors/2017/03/msg00163.html

I didn't do anything special with packaging for SIMD. I basically shipped the code as packaged by Alexander and am waiting for the bug reports to start rolling in. Can someone remind me which processors are going to have trouble, and what that will look like from a user perspective at runtime? (I forgot that my Pentium G4560 chip was useful for testing such things, so I gave it to a young child to play with.)

Normally all kinds of processors should work, because Tesseract tests at runtime whether the CPU supports AVX2 or SSE and chooses the right code automatically.

Problems occurred in the past with virtual machines which claimed to support AVX2 but did not do so. That case needs an improved runtime test (which is still missing) to work.
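As a loose illustration of the runtime idea (not Tesseract's actual C++ code, which queries the CPU directly), the following Python sketch picks a code path from the feature flags a Linux kernel reports in /proc/cpuinfo. The function names are hypothetical. It has the same weakness described above: it trusts whatever the (virtual) CPU advertises.

```python
def cpu_flags(cpuinfo_text):
    """Collect the feature flags listed on 'flags' lines of /proc/cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def pick_simd(flags):
    """Prefer AVX2, then AVX, then SSE4.1, else fall back to generic code."""
    for name in ("avx2", "avx", "sse4_1"):
        if name in flags:
            return name
    return "generic"


# Shortened sample of what a CPU without AVX2 might report.
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 sse4_1 sse4_2 avx\n"
print(pick_simd(cpu_flags(sample)))
```

On a real Linux system the text would come from `open("/proc/cpuinfo").read()`.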

There is also a build time detection which adds -msse4.2 / -mavx / -mavx2 flags.

The build time detection is just about the compiler, right? I built the X86_64 package on an Intel Xeon E5-1650 which does not have AVX2. But that's fine and doesn't hurt anyone. Right? Right? It's got to be right.

checking whether C++ compiler accepts -mavx... yes
checking whether C++ compiler accepts -mavx2... yes
checking whether C++ compiler accepts -msse4.1... yes

Yes, that's perfect. You can then run tesseract -v to see which SIMD instructions were detected for your CPU.

Here is the actual script it uses:
https://www.gnu.org/software/autoconf-archive/ax_check_compile_flag.html

You can then run tesseract -v to see which SIMD instructions were detected for your CPU.

This is done by the runtime detection.

I think that when you use a flag like -msse4.2 the compiler can automatically use sse4.2 instructions anywhere in the code. The sse4.2 code will cause SIGILL in machines that lack sse4.2 instructions.

... but Tesseract does not use these flags globally. It uses them only in arch/Makefile.am.

I hope that this approach is enough to save you from the above issue.

@jbreiden

https://packages.debian.org/sid/tesseract-ocr

Tesseract command line OCR tool

The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google and is probably one of the most accurate open source OCR engines available. It can read a wide variety of image formats and convert them to text in over 40 languages. This package includes the command line tool.

I suggest to change it to something like this:

The Tesseract OCR engine was originally developed by HP between 1985 and 1998. Since 2006 it has been developed as an open source project by Google.
It can read a wide variety of image formats and convert them to text. It supports over 120 languages.

This package includes the command line tool.

How about I copy the description in the Wiki or README file? (And should they be synchronized?)

Tesseract is an open source Optical Character Recognition (OCR) Engine, available under the Apache 2.0 license. It can be used directly, or (for programmers) using an API to extract typed, handwritten or printed text from images. It supports a wide variety of languages.

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. Since 2006 it is developed by Google.

No problem. I mainly dislike the '23 years ago it was one of the best'. WHO CARES? (Sorry Ray).

That's why I removed it from the README.

Why is the README talking about handwriting? Tesseract is terrible at handwriting.

You mean the wiki Home, not the README.

I fixed it.

WHO CARES?

Does anybody care that the Wiki (on GitHub, but also the English Wikipedia) still says _Optical Character Recognition_ although the old Tesseract detects glyphs and the new Tesseract detects lines of text? Would the following text be better?

_Tesseract is an open source text recognition ("Optical Character Recognition" = OCR) engine [...]_

The term for what Tesseract is doing is OCR. Even if not accurate, that's the term people recognize. It's not our job to invent new terms.

Sure. OCR is also part of the GitHub repository name. I do not want to change that. That's why it is still part of the new text which I suggested.

https://www.google.co.il/search?q=%22text+recognition%22

It seems that the term 'text recognition' is commonly used as replacement for 'OCR' :)

Yes. It is quite common that abbreviations live much longer than their original meaning, so that original meaning remains only relevant for encyclopedias. Example: search for 'machines' on the IBM website. You won't find that word, although the 'M' is still part of the name.

I still sometimes use the term 'machine' to refer to a computer. Maybe I'm too old (or just a geek?).
:-)

Here's what will ship with Ubuntu 18.04. Tag (or don't tag) as you see fit.

 Tesseract is an open source Optical Character Recognition (OCR)
 Engine. It can be used directly, or (for programmers) using an API to
 extract printed text from images. It supports a wide variety of
 languages. This package includes the command line tool.
$ dpkg -l tesseract-ocr tesseract-ocr-eng
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                     Version                   Architecture              Description
+++-========================================-=========================-=========================-===============================================================
ii  tesseract-ocr                            4.00~git2219-40f43111-1.2 amd64                     Tesseract command line OCR tool
ii  tesseract-ocr-eng                        4.00~git24-0e00fe6-1.2    all                       tesseract-ocr language files for English
$ tesseract --version
tesseract 4.00.00alpha
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.30 : libtiff 4.0.8 : zlib 1.2.8 : libwebp 0.6.0 : libopenjp2 2.1.2

 Found AVX
 Found SSE
$ tesseract
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.

Tag (or don't tag) as you see fit.

@zdenop I think you should tag a release so that other distros can also be updated.

I also suggest updating the version string to match 4.00~git2219-40f43111-1.2 or a similar format. There is a lot of confusion with tesseract 4.00.00alpha, which applies to hundreds of commits.

Very Sorry! I misread the dashboards. Looks like the slightly older code 4.00~git2207-766b7bd6-3.1 will ship, which is missing some of the last minute improvements. I believe it is no longer possible to change the version string (or anything else about Tesseract) for Ubuntu 18.04.

  1. Tagging repo will cause release in github and AFAIR it causes problems for some people.
  2. Other distributions will take either:
     • the latest github master (to include all additional fixes), or
     • the latest stable release.
     Nobody would care what other distributions did...

I would prefer Ray give clear statement about next step for 4.0 release.

Tagging repo will cause release in github [...]

That's desired. GitHub also allows marking such releases as _pre-release_ – just edit the release information of the new release. That should minimize problems for other people.

The release of today would be _4.0.0-alpha.20180302_.
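A date-stamped pre-release tag of this shape is easy to generate. This is only a small sketch; the helper name `prerelease_tag` is made up for illustration.

```python
from datetime import date


def prerelease_tag(version, label, day):
    """Format a tag such as 4.0.0-alpha.YYYYMMDD for the given date."""
    return f"{version}-{label}.{day:%Y%m%d}"


print(prerelease_tag("4.0.0", "alpha", date(2018, 3, 2)))
```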

ok. but do we expect more code/fixes to come for 4.0 release?

On Fri, 2 Mar 2018, 7:23, Stefan Weil notifications@github.com wrote:

Tagging repo will cause release in github [...]

That's desired. GitHub also allows marking such releases as pre-release
– just edit the release information of the new release.

The release of today would be 4.0.0-alpha.20180302.



Yes, why not? I don't plan to stop sending code / fixes :-), and other people will continue sending fixes, too. So either we'll have a _4.0.0-alpha.20180401_, or a _4.0.0_ without alpha, or a _4.0.1_, or Ray sends a bunch of code which justifies a _4.1.0_, ...

I would prefer Ray give clear statement about next step for 4.0 release.

@jbreiden Please check with Ray. Thanks!

I would prefer Ray to speak for himself, too! However, I don't think there will be large Tesseract changes from him in either short or medium term.

Zdenko, I also think we should finally release 4.0.0. It's time to get rid of the alpha status.

If you decide to release it soon, don't forget to first update ccutil/version.h

Ha! Looks like they took 40f43111 after all, one day after deadline.

https://launchpad.net/ubuntu/+source/tesseract

Jeff,

is there any info from Ray about 4.00 release? Or at least how to tag
"Ubuntu" release (4.00RC1, 4.00beta?...)?

Zdenko

2018-03-03 5:21 GMT+01:00 jbreiden notifications@github.com:

Ha! Looks like they took 40f43111 after all, one day after deadline.

https://launchpad.net/ubuntu/+source/tesseract



We should make the decision ourselves.

What about this proposal:

  • Tag commit 40f43111 as 4.00-alpha.2+git.2219.40f43111.
  • 3-4 weeks from now, tag the latest commit in master as 4.0.0-beta.1.
  • Release 4.0.0 30-60 days after beta1 (maybe with one more beta and one rc).
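The alpha, beta, rc, final sequence in this proposal relies on semver's precedence rules. A minimal sketch of those rules as a Python sort key follows, assuming plain MAJOR.MINOR.PATCH tags with an optional pre-release label. The function name is hypothetical and build metadata after `+` is ignored, as semver.org specifies.

```python
def semver_key(tag):
    """Sort key following the precedence rules on semver.org."""
    tag = tag.split("+", 1)[0]          # build metadata does not affect order
    core, _, pre = tag.partition("-")
    numbers = tuple(int(part) for part in core.split("."))
    if not pre:
        # A release without a pre-release label outranks its pre-releases.
        return (numbers, 1, ())
    # Numeric identifiers sort below alphanumeric ones and compare as numbers.
    ids = tuple(
        (0, int(part), "") if part.isdigit() else (1, 0, part)
        for part in pre.split(".")
    )
    return (numbers, 0, ids)


tags = ["4.0.0", "4.0.0-rc.1", "4.0.0-alpha.20180302", "4.0.0-beta.1"]
print(sorted(tags, key=semver_key))
```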

https://semver.org/
https://packages.ubuntu.com/bionic/tesseract-ocr

Mark any non final 4.0.0 as 'pre-release'.

https://help.github.com/articles/creating-releases/

  1. If the release is unstable, select This is a pre-release to notify users that it's not ready for production.

For each (pre-)release, update ccutil/version.h.
https://github.com/tesseract-ocr/tesseract/blob/master/ccutil/version.h

is there any info from Ray about 4.00 release?

No info.

Or at least how to tag "Ubuntu" release (4.00RC1, 4.00beta?...)?

Millions of people will use commit 40f4311 because of Ubuntu, and I think the main benefit of a tag is to help understand bug reports coming from these users. There have been many good tag proposals in this thread from @amitdo and @stweil and @zdenop and @WilliamTambellini. I don't have a strong opinion about which one is best. If I was forced to choose, I'd probably tag commit 40f4311 with 4.0.0-beta.1. If that feels like too much commitment, then use a very specific tag like ubuntu18.04. Whatever is chosen, I think it makes sense to apply the same tag to the fast training data at commit 0e00fe6.

Jeff, the traineddata files have a version string of 4.00.00alpha with a date (062917 if I remember correctly). tesseract also reports version of 4.00.00alpha. Will it be possible to change these in the Ubuntu 18.04 packages now?

No more changes possible. Everything will look exactly as described here: https://github.com/tesseract-ocr/tesseract/issues/995#issuecomment-369704920

done.

Zdenko

2018-03-09 19:25 GMT+01:00 jbreiden notifications@github.com:

is there any info from Ray about 4.00 release?

No info. Ray is very busy with other work, so I don't expect major changes
from him in short or medium term.

Or at least how to tag "Ubuntu" release (4.00RC1, 4.00beta?...)?

Millions of people will use commit 40f4311
https://github.com/tesseract-ocr/tesseract/commit/40f43111e05b3dd2f2f8aeae3aba33016523c881
because of Ubuntu, and I think the main benefit of a tag is to help
understand bug reports coming from these users. There have been many good
tag proposals in this thread from @amitdo https://github.com/amitdo and
@stweil https://github.com/stweil and @zdenop
https://github.com/zdenop and @WilliamTambellini
https://github.com/williamtambellini. I don't have a strong opinion
about which one is best. If I was forced to choose, I'd probably tag commit
40f4311
https://github.com/tesseract-ocr/tesseract/commit/40f43111e05b3dd2f2f8aeae3aba33016523c881
with 4.0.0-beta.1 If that feels like too much commitment, then use a very
specific tag like ubuntu18.04. Whatever is chosen, I think it makes sense
to apply the same tag to the fast training data at commit 0e00fe6.



Great!!!!

On Sat 10 Mar, 2018, 1:12 PM zdenop, notifications@github.com wrote:

done.

Zdenko

I suggest that we release 4.0.0 (final) by the end of April. About 2 weeks before this release, we should release 4.0.0-rc.1.

April 2018 :-)

For the final release should the files in tessdata repo be updated?

These have models for legacy tesseract that we need to keep.

However the LSTM models in those were improved in tessdata_best and then further improved / made faster in tessdata_fast.

I suggest that we update all the lstm related files in tessdata with files from tessdata_fast.

eg. for Hindi.

# combine_tessdata -d ./tessdata/hin.traineddata
Version string:Pre-4.0.0
0:config:size=739, offset=192
1:unicharset:size=180616, offset=931
2:unicharambigs:size=90293, offset=181547
3:inttemp:size=12791027, offset=271840
4:pffmtable:size=24823, offset=13062867
5:normproto:size=225187, offset=13087690
6:punc-dawg:size=426, offset=13312877
7:word-dawg:size=837458, offset=13313303
8:number-dawg:size=410, offset=14150761
9:freq-dawg:size=1242, offset=14151171
17:lstm:size=8874565, offset=14152413
18:lstm-punc-dawg:size=4322, offset=23026978
19:lstm-word-dawg:size=2726578, offset=23031300
20:lstm-number-dawg:size=122, offset=25757878
23:version:size=9, offset=25758000

# combine_tessdata -d ./tessdata_best/hin.traineddata
Version string:4.00.00alpha:hin:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
0:config:size=633, offset=192
17:lstm:size=11738347, offset=825
18:lstm-punc-dawg:size=3154, offset=11739172
19:lstm-word-dawg:size=143834, offset=11742326
20:lstm-number-dawg:size=234, offset=11886160
21:lstm-unicharset:size=7975, offset=11886394
22:lstm-recoder:size=1111, offset=11894369
23:version:size=80, offset=11895480

# combine_tessdata -d ./tessdata_fast/hin.traineddata
Version string:4.00.00alpha:hin:synth20170629
0:config:size=633, offset=192
17:lstm:size=965584, offset=825
18:lstm-punc-dawg:size=3154, offset=966409
19:lstm-word-dawg:size=143834, offset=969563
20:lstm-number-dawg:size=234, offset=1113397
21:lstm-unicharset:size=7975, offset=1113631
22:lstm-recoder:size=1111, offset=1121606
23:version:size=30, offset=1122717
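Listings like these are easy to compare programmatically. The sketch below parses the `index:name:size=…, offset=…` lines into a dictionary and compares the size of the `lstm` component between the best and fast Hindi files, using lines copied from the output above. The function name is made up for illustration.

```python
import re

# Matches lines such as "17:lstm:size=965584, offset=825".
ENTRY = re.compile(r"^(\d+):([\w-]+):size=(\d+), offset=(\d+)$")


def parse_listing(text):
    """Map component name -> size in bytes from `combine_tessdata -d` output."""
    sizes = {}
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            sizes[match.group(2)] = int(match.group(3))
    return sizes


best = parse_listing("""0:config:size=633, offset=192
17:lstm:size=11738347, offset=825
19:lstm-word-dawg:size=143834, offset=11742326
""")
fast = parse_listing("""Version string:4.00.00alpha:hin:synth20170629
0:config:size=633, offset=192
17:lstm:size=965584, offset=825
19:lstm-word-dawg:size=143834, offset=969563
""")
ratio = best["lstm"] / fast["lstm"]
print(f"best lstm model is {ratio:.1f}x the size of the fast one")
```

In these listings the word list and config sizes are identical, so the difference between the two files comes almost entirely from the `lstm` component.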

For Hindi, the following files in the traineddata in the tessdata repo

0:config:size=739, offset=192
17:lstm:size=8874565, offset=14152413
18:lstm-punc-dawg:size=4322, offset=23026978
19:lstm-word-dawg:size=2726578, offset=23031300
20:lstm-number-dawg:size=122, offset=25757878
23:version:size=9, offset=25758000

should be replaced by the following from tessdata_fast

0:config:size=633, offset=192
17:lstm:size=965584, offset=825
18:lstm-punc-dawg:size=3154, offset=966409
19:lstm-word-dawg:size=143834, offset=969563
20:lstm-number-dawg:size=234, offset=1113397
21:lstm-unicharset:size=7975, offset=1113631
22:lstm-recoder:size=1111, offset=1121606

Also, the version string should be updated appropriately to reflect the combo.

This will also make the size of traineddata files in tessdata repo smaller.

Good idea, but it should not delay the final 4.0.0 release.

Thinking about this some more, I think a better alternative will be to remove the lstm files from the traineddata in tessdata.

This will ensure there is no conflict in different config files needed for legacy and LSTM models.

The traineddata file will become smaller.

There will be no need to update the lstm models in tessdata in future.

It will be easier for users:

tessdata for --oem 0
tessdata_fast for --oem 1
tessdata_best for LSTM training

@stweil could implement a check that --oem 0 is only being used with traineddata files that have a version string of Version string:Pre-4.0.0.

However, this misses the case where default --oem mode was set to 2 or 1 in the config files in tessdata. I will look to see how many such cases are there.

$grep engine_mode *.config
ara.config:tessedit_ocr_engine_mode 1
hin.config:tessedit_ocr_engine_mode 2

Only two languages come up. For Hindi, OEM 1 with tessdata_fast is much better than 2 in tessdata.

I propose to replace these two traineddata files in tessdata with their counterparts from tessdata_fast.

Since their version string will not be Version string:Pre-4.0.0, the program should not crash, if the check is implemented.

We can document this in readme in tessdata repo.

@stweil Since you probably use --oem 0 in your projects, what do you think of this idea?

remove the lstm files from the traineddata in tessdata.

Those ocr_engine_mode settings may be due to the historical presence of cube, and
may not be optimal for the current implementation.

On Mon, Mar 19, 2018 at 1:24 AM Shreeshrii notifications@github.com wrote:

$grep engine_mode *.config
ara.config:tessedit_ocr_engine_mode 1
hin.config:tessedit_ocr_engine_mode 2

Only two languages come up. For Hindi, OEM 1 with tessdata_fast is much
better than 2 in tessdata.

I propose, to replace these two traineddata files by their counterparts in
tessdata_fast. Since their version string will not be Version
string:Pre-4.0.0, the program should not crash, if the check is implemented.



--
Ray.

Ray,

Since you mentioned that best can be integerized to make it faster, and there are already three repos with traineddata files, I thought of updating the lstm files in the traineddata in tessdata with the integerized best with a Version string such as:

Version string:Pre-4.0.0+4.00.00alpha:nld:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1]:best2int20180321

i.e. Version string from tessdata+ Version string from tessdata_best appended by best2int20180321
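Spelled out as code, the proposed naming scheme would look roughly like this. The helper name is hypothetical and the `best2int` suffix with a date is only the convention suggested above, not an existing Tesseract feature.

```python
from datetime import date


def combined_version(legacy_version, best_version, converted_on):
    """Join the legacy and best version strings and append a best2int date stamp."""
    return f"{legacy_version}+{best_version}:best2int{converted_on:%Y%m%d}"


print(combined_version(
    "Pre-4.0.0",
    "4.00.00alpha:nld:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1]",
    date(2018, 3, 21),
))
```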
