Add an option to select the implementation for the dot product (CPU, SSE, AVX, ...).

SSE and AVX are also done on CPU :)

amitdo on 24 Mar 2018

Remove deprecated code. This does not include OpenCL or the old Tesseract engine.

Adding a compile option NO_LEGACY_OCR_ENGINE would be nice.

amitdo on 24 Mar 2018

Fix the autotools build so that the debug mode uses -O0 as intended (instead of -O2).
It can probably be adapted from #974.

I'll do it.

amitdo on 25 Mar 2018

❤1

Enhance --list-langs to show additional information for scripts and languages like legacy / LSTM, version. This will make the command slower, because each file must be opened and parsed.

My suggestion would be to leave --list-langs as is,

and add this as --list-langs-details

or as --list-lang-details for one language file based on lang-code.

Shreeshrii on 25 Mar 2018

--list-langs should also display the directory it is using. This is useful when tessdata files are installed in multiple directories, e.g. by a PPA or Linux distribution package versus a local build directory.

Shreeshrii on 25 Mar 2018

Re: tessdata,
The configs and tessconfigs directories and pdf.ttf are needed in the directory which is being used via TESSDATA_PREFIX or --tessdata-dir.

E.g. when doing LSTM training, the lstm.train config file is not found if one uses tessdata_best as the continue_from dir.

My workaround has been to copy these to both tessdata_fast and tessdata_best repos.
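
A minimal sketch of that workaround, assuming the shared files live in a full tessdata checkout. The helper name `copy_tess_configs` and the example paths are hypothetical and need adjusting to the local layout.

```shell
#!/bin/bash
# Copy the shared config files from one tessdata directory into another.
# Both directory arguments are placeholders; adjust them to your checkouts.
copy_tess_configs() {
  local src=$1 dst=$2 item
  for item in configs tessconfigs pdf.ttf; do
    if [ -e "$src/$item" ]; then
      # -R copies the configs/ and tessconfigs/ directories recursively.
      cp -R "$src/$item" "$dst/"
    else
      echo "not found, skipped: $src/$item" >&2
    fi
  done
}

# Example (hypothetical paths):
# copy_tess_configs ../tessdata ../tessdata_best
```

Missing items are reported on stderr and skipped, so the helper can be pointed at a partial checkout without aborting.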

Shreeshrii on 25 Mar 2018

Add/implement install-langs.

Shreeshrii on 25 Mar 2018

A week with no API changes.

jbreiden on 25 Mar 2018

👍2

Add a simple bash script for building tesseract.

I use the following; it should probably also offer to download the osd and eng traineddata files for first-time users.

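#!/bin/bash
# Stop at the first failing command.
set -e

# Configure, build and install Tesseract.
./autogen.sh
./configure --disable-openmp --disable-graphics --disable-opencl
make
sudo make install
sudo ldconfig

# Build and install the training tools.
make training
sudo make training-install

# Fetch a fresh googletest submodule and regenerate the build files.
rm -rf ./googletest
git submodule update --init
autoreconf -fiv

# Point the unit tests at a tessdata directory, then run them.
#export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
export TESSDATA_PREFIX=../tessdata_fast
make check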

Shreeshrii on 25 Mar 2018

👍2

I would add this:

missing TESSERACT_VERSION and some explanation.

zdenop on 25 Mar 2018

A week with no API changes.

Mission impossible.

Edit: That was a joke.

amitdo on 25 Mar 2018

👎2 ❤1 😄1

There was an online tool that monitors API changes for Tesseract, but I cannot find a link to it. Does somebody have it? Can somebody show the changes between 4.0.0-beta.1 and the current code?

zdenop on 26 Mar 2018

Please see https://github.com/tesseract-ocr/tesseract/issues/793

The tracker is at https://abi-laboratory.pro/tracker/timeline/tesseract/
Currently it is tracking stable release 3.05.01

@zdenop Please tag another release for 3.05 branch since 3.05.01 had a couple of problems which have been fixed in later commits.

Shreeshrii on 26 Mar 2018

~The good news is that the latest Debian / Ubuntu tesseract-ocr does not include the development files, so there will not be any API between that version and the future 4.0.0 which we have to take care of.~

Sorry, I was wrong: there is libtesseract-dev.

stweil on 26 Mar 2018

@zdenop I suggest adding labels to issues with the following proposed list of keywords, so that it is easy to see related issues and see if there are any critical pending issues.

4.0.0 for the final release
4.0x for 4.00.00alpha and 4.0.0-beta.1
3.0x for 3.05/3.04

LSTM training
training for 3.0x legacy tesseract training

Accuracy for reports of incorrect recognition
Performance for questions related to speed
Crashes for asserts and program crashes

Build related to compile and build from source

This is a suggested list.

Shreeshrii on 27 Mar 2018

IMO, our final 4.0.0 should not significantly diverge from the version that will be shipped in Ubuntu 18.04.

No ABI & API changes.
No changes to user interface (command line).

A new branch should be created for 4.0.0.
Only commits that follow the above rules should be backported from master.
4.0.0 should have at least rc.1 before final release.

We can decide that 4.1.0 will be released 2-3 months after 4.0.0 (still with legacy?).

amitdo on 27 Mar 2018

How do you define "significantly"? There are some changes with the latest Git master:

Trained data for scripts was moved.
Some deprecated functions, parameters and command line options were removed.
The Tesseract specific integer data types (inT32, ...) and macros (MIN_INT32, ...) were removed.

Would you suggest reverting these changes? They are major changes which require a step of the major version, so I think 4.0.0 is a good candidate to include those changes. Otherwise we would have to wait for 5.0.0.

I would even go further and fix potential name space problems with the 58 include files which are part of the Tesseract programming API in 4.0.0-beta.1, although that is a significant change, too.

stweil on 27 Mar 2018

How do you define "significantly"?

Basically, any bug fix is OK as long as it follows the two conditions I specified; no new features.

amitdo on 27 Mar 2018

What was shipped for Ubuntu 18.04 reports as tesseract 4.00.00alpha. C

I think our aim should be to get all significant changes included in final
4.0.0 and get it ready in time for Ubuntu 18.10. What are the deadlines
for that?

Shreeshrii on 27 Mar 2018

18.04 is much more significant because it's LTS - supported for 5 years.
18.10 will be supported for only 9 months. We should not care about it.

amitdo on 27 Mar 2018

What was shipped for Ubuntu 18.04 reports as tesseract 4.00.00alpha. C

We tagged it as 4.0.0-beta.1.

amitdo on 27 Mar 2018

Another option is to skip final 4.0.0 and go straight to 5.0.0.

amitdo on 27 Mar 2018

As per Jeff, we can't make any changes to what is shipped for 18.04.

But we still have time to do another beta, rc-1 and final 4.0.0 release in
time for 18.10.

I do not really know much about Linux releases, but my hope would be that
users would be able to install/upgrade to the 4.0.0 final version shipped
with 18.10 on 18.04.

@AlexanderP please explain whether the above is possible.

Shreeshrii on 27 Mar 2018

@zdenop, your thoughts about these two options?

amitdo on 27 Mar 2018

What was shipped for Ubuntu 18.04 reports as tesseract 4.00.00alpha. C

We tagged it as 4.0.0-beta.1.

Yes, that tag is within github.

Please see the post by Jeff, where he has shown what tesseract -v will
report for 18.04.

Shreeshrii on 27 Mar 2018

What was shipped for Ubuntu 18.04 reports as tesseract 4.00.00alpha. C
We tagged it as 4.0.0-beta.1.

Yes, that tag is within github.

Please see the post by Jeff, where he has shown what tesseract -v will
report for 18.04.

Here is the link:

https://github.com/tesseract-ocr/tesseract/issues/995#comment-369704920

Shreeshrii on 27 Mar 2018

Jeff just said that the version in Ubuntu won't change in final 18.04.

We are talking about what we want to do in Tesseract's official GitHub repo.
We are the upstream, not Ubuntu!

amitdo on 27 Mar 2018

😄1

IMO, our final 4.0.0 should not significantly diverge from the version
that will be shipped in Ubuntu 18.04.

I am trying to understand how 4.0.0 final release on github relates to
Ubuntu 18.04, in light of the above.

I am missing your reasoning for why it should not significantly diverge.

Shreeshrii on 27 Mar 2018

I want to hear @zdenop's and @jbreiden's opinions.

I think that as maintainers, they will understand (but not necessarily agree with) my proposal.

amitdo on 27 Mar 2018

First of all I would like to know if final 4.0 release will be included in updates of Ubuntu (18.04)/Debian... If yes, then we should release 4.0 ASAP (e.g. fixes for issues will be accepted, but no other code changes).

Next, I would like to see a report like this to better understand the latest changes.

Then we can decide how 4.0 will be released:

as branch started from 4.0.0-beta.1 tag (no changes in master branch - only fixes will be ported to 4.0 release branch)
or from master (we accept all applied commits for now.)

I do not expect to revert any commit in master.

zdenop on 27 Mar 2018

as branch started from 4.0.0-beta.1 tag (no changes in master branch - only fixes will be ported to 4.0 release branch)

I do not expect to revert any commit in master.

Yes, what you wrote here is what I meant.

amitdo on 27 Mar 2018

As per Jeff, we can't make any changes to what is shipped for 18.04.

But we still have time to do another beta, rc-1 and final 4.0.0 release in
time for 18.10.

I do not really know much about Linux releases, but my hope would be that
users would be able to install/upgrade to the 4.0.0 final version shipped
with 18.10 on 18.04.

@AlexanderP please explain whether the above is possible.

@Shreeshrii The update should complete without problems.

AlexanderP on 27 Mar 2018

Please don't worry too much about Ubuntu, everything is going to be fine. I've had a crazy day today, but will have time tomorrow to discuss.

jbreiden on 28 Mar 2018

First of all I would like to know if final 4.0 release will be included in updates of Ubuntu (18.04)/Debian...

The version of Tesseract that ships with Ubuntu 18.04 will not change, unless there is a major security issue. See this chart for shipping Tesseract versions for different Ubuntu releases. https://launchpad.net/ubuntu/+source/tesseract

my hope would be that users would be able to install/upgrade to the 4.0.0 final version shipped with 18.10 on 18.04.

Ubuntu users have many choices if they want a newer Tesseract. They can build from source. They can install from Alexander's PPA. There's something called a "snap" which I don't know too much about. Maybe other ways too.

Shipping alpha/beta software in final LTS was/is a really bad idea. I bet it's against Ubuntu's policies.

This decision belongs to the Debian/Ubuntu package maintainers, which is Alexander and myself. I am a member of the Debian Project, and sponsored Alexander's excellent packaging work as official. I thought users would significantly benefit from the improved accuracy of LSTM Tesseract. I think (and hope) most developers will understand that the Tesseract API is still changing, and not have too much trouble.

We are the upstream, not Ubuntu!

That's right! Don't feel constrained. It is perfectly okay for Tesseract to change API before final release. If the API changes, Ubuntu and other Linux distributions will deal with it, and it won't be too hard. For example, in Ubuntu, the only direct dependencies on libtesseract4 are gimagereader libavfilter-extra6 libopenalpr2 libopencv-contrib3.2 and libsikulixapi-jni. These programs use just a tiny fraction of Tesseract's API. It will be up to Alexander and myself to make sure everything continues to work well together in Debian/Ubuntu both now and in the future.

jbreiden on 28 Mar 2018

Alexander and Jeff, I'll support you where needed, too, of course.

stweil on 28 Mar 2018

Jeff, Alexander,
I’m sorry that I caused offense.

amitdo on 28 Mar 2018

👍1

@amitdo No offense taken. We are all on the same team.

jbreiden on 28 Mar 2018

👍1

@stweil : Are you interested in warnings from VS2017? I was able to build tesseract with cmake, cppan and VS2017.

zdenop on 3 Apr 2018

Are those warnings the same as the warnings from the Appveyor CI build? And did you compile using Visual Studio Community? One of my colleagues might be interested, as he does more programming with Tesseract on Windows. I'm more focused on Linux and only look at macOS and Windows from time to time.

stweil on 3 Apr 2018

I just checked them and they seem to be the same.

zdenop on 3 Apr 2018

4.00-alpha was 'released' in November 2016.

I think we should release a final 4.0.0 soon.

@stweil, is it fine with you if we decide on releasing 4.0.0-rc.1 on May 15?
After rc-1, no new features should go to 4.0.x branch, only bug fixes.

4.0.0 (final) will be released 2-6 weeks after rc.1.

amitdo on 10 Apr 2018

@jbreiden A number of training-related issues are caused by the lack of updated langdata. Ray mentioned a few days back that the files are available in the Google repo and could be transferred after deleting extra files.

Is there any update regarding that?

I think the final release should include updated langdata also.

Shreeshrii on 10 Apr 2018

@Shreeshrii Can you point me at Ray's comment please?

jbreiden on 11 Apr 2018

https://github.com/tesseract-ocr/langdata/issues/83#comment-374460335

Shreeshrii on 11 Apr 2018

theraysmith
commented 23 days ago
Hmm. Sorry. I thought I had done this in September.
The Google repo is up-to-date apart from the redundant files that need to be deleted.
I'll work with Jeff to get this done.

Shreeshrii on 11 Apr 2018

This issue is fine for discussions, but the overview gets a little bit lost. Therefore I just started a new page for the release planning (https://github.com/tesseract-ocr/tesseract/wiki/Planning) in the Tesseract wiki. Comments and contributions are welcome!

stweil on 11 Apr 2018

👍1

@stweil Thanks for adding the planning page. It is much easier to see the open tasks and plans on it.

Shreeshrii on 12 Apr 2018

Adding some more issues below which could be fixed for 4.0.0

combine_lang_model does not print correct usage help #1375
Insufficient error message when output file cannot be created #1424
Segfault on using -psm 0 when using fast eng.traineddata #1167

Shreeshrii on 30 Apr 2018

Not to forget the endianness issue (see #518, #1525). For Linux distributions, the current status (big endian Tesseract 4.0 crashes) is not acceptable.

Update: The endianness issue is fixed now.

stweil on 30 Apr 2018

👍1

@stweil, what should be our next step?

4.0.0-beta.2
4.0.0-rc.1
final 4.0.0

What about a timeline?

amitdo on 1 May 2018

I think the FAQ in the wiki needs to be streamlined.

I suggest renaming the current page to FAQ-old and creating a new FAQ page with a link to the old one.

The new FAQ page should only have items relevant to the 4.0.0 release and common info such as a link to ImproveQuality etc.

Items from FAQ-old which are relevant to 4.0.0 should be moved/copied to FAQ.

Shreeshrii on 1 May 2018

I have made changes for 4.0.0 to https://github.com/tesseract-ocr/tesseract/wiki/FAQ

Older version is at https://github.com/tesseract-ocr/tesseract/wiki/FAQ-Old

Please review / change / add to the FAQ for 4.0.0.

Shreeshrii on 1 May 2018

👍1

@stweil,

Trying again to get your answer... :-)
https://github.com/tesseract-ocr/tesseract/issues/1423#issuecomment-385665057

amitdo on 3 May 2018

I think that should be a community decision: What do we consider as important for 4.0.0, what can be done later. For example I considered it important that Tesseract must at least work basically on all platforms which are supported by the major Linux distributions, so the breakage for big endian hosts kept me busy for the last days.
The current list of open tasks for 4.0.0 is still rather lengthy. We could postpone some tasks to later versions, but maybe it would be good to have some of them done for 4.0.0. Therefore I suggest to make a 4.0.0-rc.1 next at end of this week, followed by two more release candidates in the following weeks. 4.0.0 could then be tagged by end of May.

stweil on 3 May 2018

@stweil You are the one making most of the changes and bug fixes, so you should prioritize the open tasks list.

@jbreiden There are couple of issues that should be resolved by you and Ray.

One is the update of the langdata repo; we get a lot of training-related questions and it will be good to have the correct data to finetune/test with.

Second is the issue related to user-words, which don't seem to work with current code.

Ray has indicated in the past that it could be fixed via a small change in code. I can look up those comments for you later.

It will be good if Ray can implement it so that if a user-words list is given then the result will be ONLY from that. If users want to include user-words along with the rest of dictionary words, then they can update the word-dawg file with their words.

Shreeshrii on 3 May 2018

@stweil You are the one making most of the changes and bug fixes, so you should prioritize the open tasks list.

:+1:

Shree, while you wrote this I was drafting my response, which included this sentence:

Stefan, since you are the leading community developer, I think it's a good idea to follow your wishes and timeline :-)

amitdo on 3 May 2018

https://github.com/tesseract-ocr/tesseract/issues/1423#issuecomment-376004375

A week with no API changes.

Mission impossible.

@Wikinaut, I just now noticed that you :-1: me.

That was a joke...

amitdo on 3 May 2018

@jbreiden @theraysmith

https://github.com/tesseract-ocr/langdata/issues/59#issuecomment-290533931

Allow for whitelisting/blacklisting to ensure only numeric results.

A simple code change not related to training.

https://github.com/tesseract-ocr/tesseract/issues/403#issuecomment-265579471

FORCE the output to match the provided pattern(s) and/or word(s). With this option, you can't get anything else out, whatever is in the image.

Shreeshrii on 3 May 2018

Right now there are 229 open issues. It will be helpful if we can identify which ones refer to 4.0.0.

@zdenop @egorpugin It will be great if you can search and label the issues as 4.0.0 or 3.0x. It will help in testing and closing the ones related to 4.0.0 before the release. Thanks.

Shreeshrii on 3 May 2018

Yesterday while training eng for a cursive font for a test, I tried also to use the latest code to create a legacy tesseract model with it using tesstrain.sh.

The model got created (though the shapetable was not built; it seems blocked in tesstrain.sh).

I used combine_tessdata to create a traineddata file with both this newly created legacy and LSTM model. When trying to use it to recognise texts, it crashes with an assert.

I then used just the legacy model traineddata. Using it to OCR, there is no crash, but the text is totally unlike the original.

I will retest with a regular serif or sans-serif font and file an issue with more details. Meanwhile I just wanted to mention it here.

Shreeshrii on 7 May 2018

Thank you very much for the information.
You are the best.

ry4nguera on 7 May 2018

Therefore I suggest to make a 4.0.0-rc.1 next at end of this week, followed by two more release candidates in the following weeks. 4.0.0 could then be tagged by end of May.

@stweil Is this still the plan?

Shreeshrii on 28 May 2018

FYI

@jbreiden had mentioned in another thread about the possibility of access to a big-endian machine at http://osuosl.org/ for testing. I applied and have access to a VM Ubuntu (Xenial) on Power8 (little-endian). It has made it easier/faster for me to build/test tesseract, try to finetune models etc.

Thanks Jeff for the info regarding this option. Thanks to @AlexanderP for adding the platform for his PPA for leptonica and other libs also.
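
Since the endianness of a test machine matters for the issue discussed above, the byte order of a host can be checked from the shell before running any tests. This is a rough sketch that relies only on POSIX `od`; the function name `byte_order` is a hypothetical helper, not part of Tesseract.

```shell
#!/bin/bash
# Report the byte order of the machine this script runs on.
byte_order() {
  local n
  # od -tu2 interprets the two bytes 0x01 0x00 as one unsigned 16-bit
  # integer in host byte order: 1 on little-endian, 256 on big-endian.
  n=$(printf '\001\000' | od -An -tu2 | tr -d ' \n')
  case $n in
    1)   echo little ;;
    256) echo big ;;
    *)   echo unknown ;;
  esac
}

byte_order
```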

Shreeshrii on 29 May 2018

I'm afraid that I underestimated the number of open issues which I think should be solved for 4.0.0.

stweil on 31 May 2018

Maybe we should release beta.2 ?

amitdo on 31 May 2018

@zdenop , @AlexanderP, I suggest to tag new pre-releases to match the latest Debian / Ubuntu packages:

4.0.0-beta.1 (already available): https://github.com/tesseract-ocr/tesseract/commit/40f43111e05b3dd2f2f8aeae3aba33016523c881
4.0.0-beta.2 (new): https://github.com/tesseract-ocr/tesseract/commit/10f4998aee3ccc68e9c4931ce744dd292ad6ff19
4.0.0-beta.3 (new): https://github.com/tesseract-ocr/tesseract/commit/c3ed6f036064e54e34f75275f66c70dd924527bf
4.0.0-beta.4 (new): https://github.com/tesseract-ocr/tesseract/commit/555f6ffc0191fdd481e792be5afacc5644012bb9

The latest changes refactored the code but contain no fixes, so I currently see no need for a newer release.

PS. The three new beta releases can share the same description: Beta release for Ubuntu 18.04.

stweil on 31 May 2018

What about PR 1614 ?

zdenop on 31 May 2018

Please see this forum post regarding a possible solution for the "can't find matching blob" problem, which is caused by an integer overflow:
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/qjpVWmdP9GE/f9lsXWKhAAAJ

Shreeshrii on 31 May 2018

Also see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/qjpVWmdP9GE/y7NlZS3uAAAJ regarding possible solution for buffer read overrun inside the call to ReadMemBoxes().

Shreeshrii on 4 Jun 2018

Pull request #1630 should fix the buffer overrun for Tesseract 3.05. Git master already has that fix.

stweil on 4 Jun 2018

@zdenop , @AlexanderP, I suggest to tag new pre-releases to match the latest Debian / Ubuntu packages:
4.0.0-beta.1 (already available): 40f4311
4.0.0-beta.2 (new): 10f4998
4.0.0-beta.3 (new): c3ed6f0
4.0.0-beta.4 (new): 555f6ff

I think this is a good idea.

I will also suggest that the important bug fix commits be marked for tagging at time of PR by the committer.

Shreeshrii on 4 Jun 2018

I wouldn't worry about c3ed6f0 and 555f6ff, because they are only in Linux distributions that change continuously. @AlexanderP has been keeping them pretty close to HEAD and I suspect he will continue to do so.

Despite all my predictions to the contrary, Ubuntu somehow managed to ship 10f4998 in their 18.04 long term release. So that one will probably get a bunch of use over the next 5 years. It's great to see good OCR becoming more and more accessible to people.

jbreiden on 4 Jun 2018

4.0.0-beta.2 (new): 10f4998 Ubuntu 18.04 long term release

Shreeshrii on 4 Jun 2018

Next item -

It would be nice if we fix public include files.
I've just encountered errors with conflicting file names on include path.

File ccmain/pageiterator.h has two includes:

#include "publictypes.h"
#include "platform.h"

My project also has platform.h, and another third-party library has publictypes.h, so I had to rename the two Tesseract includes to

#include "../ccstruct/publictypes.h"
#include "../ccutil/platform.h"

Possible solutions:

Move public headers into a tesseract/ dir, so they'll always be included as tesseract/file.h (my preference)
Explicitly state the dir - #include "ccutil/platform.h" - more verbose, but the probability that another third-party dependency has ccutil/platform.h is much smaller.

egorpugin on 13 Jun 2018

I also prefer the variant #include "tesseract/file.h" (and dropping the prefix for tesscallback.h and tess_version.h). In addition we should review whether there are more public API headers which should not be there. But that's an API change of course, and all third parties which use the C or C++ API will have to do a (trivial) update. So we have to do it now before releasing 4.0.0 or much later. @amitdo, would that be an acceptable API change?

stweil on 13 Jun 2018

No objection.

amitdo on 13 Jun 2018

Good news. Debian + Ubuntu have always shipped Tesseract headers in their own subdirectory, so there should be no compatibility headaches there.

https://packages.debian.org/sid/amd64/libtesseract-dev/filelist

jbreiden on 15 Jun 2018

👍1

Pull request #1678 changes the external API. Projects using Tesseract must now write #include "tesseract/...".

stweil on 18 Jun 2018

Can we add the new tags, please, when merging this new PR?

4.0.0-beta.2 (new): 10f4998 Ubuntu 18.04 long term release

4.0.0-beta.3 (new): for Pull request #1678 which changes the external API

Shreeshrii on 18 Jun 2018

I'd appreciate new tags, too. @zdenop?

stweil on 18 Jun 2018

Done.
I also released bugfix version 3.05.02.
The top tags (in GitHub) are still "4.00.00dev" and "4.00.00alpha". What about renaming them to 4.0.0-dev (maybe this one could just be removed) and 4.0.0-alpha?

zdenop on 19 Jun 2018

@zdenop Thanks for adding the tags.

Currently 4.0.0-beta.2 is showing up as the tag during builds - I think it is going by the date/time when tagging was done.

tesseract -v
tesseract 4.0.0-beta.2-313-g29f28
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0

Please change so that 4.0.0-beta.3 is dated after that.

What about renaming them to 4.0.0-dev (maybe this one could just be removed)

I agree that it can be removed.

Shreeshrii on 20 Jun 2018

Also, please tag them as pre-releases (similar to beta.1). Thanks!

Shreeshrii on 20 Jun 2018

I made them at the same time, but GitHub lists them on different days... I tried to delete and recreate the 4.0.0-beta.3 tag, but it does not help. GitHub still reports it as 2 days old...

zdenop on 20 Jun 2018

Finally, I put the 4.0.0-beta.3 tag on a different commit, so it is listed as expected ;-). Please do not forget to run locally:
git pull --prune --tags
otherwise your local tags can differ from the tags on the remote master.

zdenop on 20 Jun 2018

Thank you, especially for giving the command to get them right locally :-)

git pull --prune --tags

Shreeshrii on 20 Jun 2018

@zdenop I used git pull --prune --tags; however, the version still shows beta.2 only.

tesseract -v
tesseract 4.0.0-beta.2-359-ga936
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0

Shreeshrii on 26 Jun 2018

git describe expects an annotated tag. beta-2 is annotated, but beta-3 is not, so git handles this as a local or temporary tag, not as an official release tag. I am afraid we will have to wait until this is fixed with beta-4.

stweil on 26 Jun 2018

@zdenop, git tag -a -f 4.0.0-beta.3 4.0.0-beta.3 replaces the lightweight tag by an annotated tag. You could run that on a fresh clean clone and use git push --tags to push the updated tag.

Please read https://git-scm.com/docs/git-tag#_on_re_tagging before doing that.
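
Whether a given tag is annotated or lightweight can be checked locally before retagging. A minimal sketch, assuming `git` is installed; `tag_kind` is a hypothetical helper name, not a Git command. `git cat-file -t` prints `tag` for an annotated tag object and `commit` for a lightweight tag that points straight at a commit.

```shell
#!/bin/bash
# Print whether a tag in the given repository is annotated or lightweight.
# Usage: tag_kind <repo-dir> <tag-name>
tag_kind() {
  local repo=$1 tag=$2 type
  # Ask git for the object type that the tag ref points at.
  type=$(git -C "$repo" cat-file -t "refs/tags/$tag") || return 1
  case $type in
    tag)    echo annotated ;;    # a real tag object with its own message
    commit) echo lightweight ;;  # a plain ref pointing at a commit
    *)      echo "other ($type)" ;;
  esac
}

# Example, run inside a Tesseract clone:
# tag_kind . 4.0.0-beta.3
```

By default `git describe` only considers annotated tags; `git describe --tags` also considers lightweight ones.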

stweil on 26 Jun 2018

try now.

zdenop on 26 Jun 2018

tesseract -v
tesseract 4.0.0-beta.3-54-g6f23
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0

Soon, it will be time for beta.4 :-)
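
The version string reported above follows the `git describe` layout: the most recent tag, the number of commits since that tag, and the abbreviated commit hash prefixed with `g`. As an illustration only, it can be split with plain shell parameter expansion; `parse_describe` is a hypothetical helper name.

```shell
#!/bin/bash
# Split a git-describe style version such as 4.0.0-beta.3-54-g6f23
# into tag, commit count and abbreviated hash.
parse_describe() {
  local v=$1
  local hash=${v##*-g}     # text after the last "-g"
  local rest=${v%-g*}      # text before the last "-g"
  local count=${rest##*-}  # commits since the tag
  local tag=${rest%-*}     # the tag name itself
  printf 'tag=%s commits=%s hash=%s\n' "$tag" "$count" "$hash"
}

parse_describe "4.0.0-beta.3-54-g6f23"
```

The sketch assumes the string really has the count and hash suffix; a build made exactly on a tagged commit has neither.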

Shreeshrii on 26 Jun 2018

FYI - Interesting visualization and comparison of OCR results for Arabic with different traineddata files.
http://kanz.pw/ocr/
See related discussion at
https://github.com/tesseract-ocr/tessdata_best/issues/11#issuecomment-400550328

Shreeshrii on 27 Jun 2018

Thanks everyone for your work!!!

Shreeshrii on 5 Jul 2018

@stweil,
What about a new schedule for 4.0.0?

amitdo on 8 Jul 2018

We get a lot of questions regarding the tesseract training tutorial, so I decided to go through the same, modify for current file structure etc.

Ray has mentioned in https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#training-from-scratch

The character error rate falls below 50% just after 3700 iterations, and by 5000 to about 13%, where it will terminate. (In about 20 minutes on a current high-end machine with AVX.)

Running on Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-122-generic ppc64le) I am finding that it is taking much longer.

Edit: Now that it has reached 3700 iterations after a few hours, I see that error rate has gone back up to 100% instead of the expected 50%. The only difference is that the fonts installed on my system are different from the ones Ray used.

File /tmp/tmp.8c63LAbUJ6/eng/eng.TeXGyreHerosCondensed_Bold_Italic.exp0.lstmf page 1 :
Mean rms=5.856%, delta=49.178%, train=100.173%(100%), skip ratio=0.6%
Iteration 3699: ALIGNED TRUTH : questions 8 this?) Other because 1 has & character; NCBI was back - SEARCH
Iteration 3699: BEST OCR TEXT :
File /tmp/tmp.8c63LAbUJ6/eng/eng.TeXGyreHerosCondensed.exp0.lstmf page 24 :
Mean rms=5.856%, delta=49.184%, train=100.173%(100%), skip ratio=0.6%
At iteration 3694/3700/3722, Mean rms=5.856%, delta=49.184%, char train=100.173%, word train=100%, skip ratio=0.6%,  wrote checkpoint.
Iteration 3700: ALIGNED TRUTH : much - 4. -» used through € between NEW % J. should when High when We it
Iteration 3700: BEST OCR TEXT :
File /tmp/tmp.8c63LAbUJ6/eng/eng.TeXGyreHerosCondensed_Italic.exp0.lstmf page 21 :
Mean rms=5.856%, delta=49.182%, train=100.173%(100%), skip ratio=0.6%
Iteration 3701: ALIGNED TRUTH : much - 4. -» used through € between NEW % J. should when High when We it
Iteration 3701: BEST OCR TEXT :

Has anyone else in the group run the tutorials? What times do you get? Should we set up a Travis CI job to test the tutorial process once in a while?

Shreeshrii on 17 Jul 2018

Also, is anything additional required for this architecture?

https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Intel%20SSE%20to%20PowerPC%20AltiVec%20migration

Shreeshrii on 17 Jul 2018

I didn't run the tutorial.

Tesseract lacks SIMD code for ppc64, so the speed will be slower than x86/x86-64.

I don't know what the situation is with OpenMP.

amitdo on 17 Jul 2018

@amitdo Even if the speed is slower, the results should be the same or similar with the same training data. I am getting a wide variance...

#At iteration 4988/5000/5000, Mean rms=3.855%, delta=17.627%, char train=75.487%, word train=98.179%, skip ratio=0%,  wrote checkpoint.
#Finished! Error rate = 54.122

vs

#At iteration 4856/5000/5000, Mean rms=1.191%, delta=2.276%, char train=7.534%, word train=16.772%, skip ratio=0%,  New best char error = 7.534 wrote best model:../tesstutorial/engoutput/base7.534_4856.checkpoint wrote checkpoint.
#Finished! Error rate = 7.534

I am trying to figure out if it is a recent change in code or a difference in configure options while compiling which is causing this.

Shreeshrii on 18 Jul 2018

I am trying to figure out if it is a recent change in code or a difference in configure options while compiling which is causing this.

It seems to be related to configure options.

For commit - tesseract 4.0.0-beta.3-180-gab1f

./configure --enable-openmp --disable-debug --disable-opencl
I get
At iteration 4856/5000/5000, Mean rms=1.191%, delta=2.276%, char train=7.534%, word train=16.772%, skip ratio=0%,  New best char error = 7.534 wrote best model:../tesstutorial/engoutput/base7.534_4856.checkpoint wrote checkpoint.
Finished! Error rate = 7.534

./configure --enable-debug --disable-shared --disable-static CXXFLAGS="-Wall -Wextra -g -O0"
I get
At iteration 4992/5000/5000, Mean rms=5.853%, delta=49.475%, char train=100%, word train=100%, skip ratio=0%,  wrote checkpoint.
Finished! Error rate = 25.515
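
To compare such runs side by side, the final error rate can be pulled out of the training output. A small sketch, assuming the output has been captured as text; the helper name final_error_rate is made up for illustration, and the sample lines are copied from the two runs above.

```shell
#!/bin/sh
# Sketch only: extract the last "Finished! Error rate = N" value from training output.
final_error_rate() {
  # Print only the text after the marker, and keep the last match.
  sed -n 's/^Finished! Error rate = //p' | tail -n 1
}

# Closing lines of the two runs quoted above (optimized build, then -O0 debug build).
optimized_log='Finished! Error rate = 7.534'
debug_log='At iteration 4992/5000/5000, Mean rms=5.853%, delta=49.475%, char train=100%, word train=100%, skip ratio=0%,  wrote checkpoint.
Finished! Error rate = 25.515'

rate_optimized=$(printf '%s\n' "$optimized_log" | final_error_rate)
rate_debug=$(printf '%s\n' "$debug_log" | final_error_rate)

echo "optimized build: $rate_optimized, debug build: $rate_debug"
```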

Shreeshrii on 18 Jul 2018

Adding -O0 will surely make the speed super slow.

I don't know the reason for the difference in accuracy.

amitdo on 18 Jul 2018

1. Is there a list of configure options available for tesseract?
2. What are the options set by default if just ./configure is used?
3. Is some error/warning displayed when incompatible options are chosen?

Shreeshrii on 19 Jul 2018

For (1) and (2) use ./configure --help

amitdo on 19 Jul 2018

For (3) see https://github.com/tesseract-ocr/tesseract/issues/1739#issuecomment-402705969

amitdo on 19 Jul 2018

@Shreeshrii, in pull request #1790 I try to improve the output from ./configure --help, so hopefully (1) and (2) should then be better answered.

stweil on 19 Jul 2018

Maybe we should disable the Java based graphic debugger by default?

amitdo on 19 Jul 2018

@stweil Thanks!

Shreeshrii on 19 Jul 2018

There is a regression in plusminus training compared to the 2017 code.

Ray - training 4.0 wiki - 3600 iterations - 0.041% 0.185%
alpha - Dec 2017/Jan 2018 - 3600 - char train=0.031%, word train=0.069%
beta.3 - latest code - 3600 - char train=0.107%, word train=0.297%,

@stweil @AlexanderP Is there a way we can set up automatic testing of the training process?

Shreeshrii on 23 Jul 2018

@Shreeshrii It is possible, but the size of the source archive would increase greatly,
because it would be necessary to include all language packs in it.

AlexanderP on 23 Jul 2018

@AlexanderP

I was not clear in my question. I meant the tesseract LSTM training tutorial process.

We only have accuracy numbers from Ray's tutorial for English. To replicate that tutorial will only require data related to English.

I have a bash script to run the required commands sequentially. However, it takes quite long for me to build tesseract, create training data and then run the tutorial commands.

So, I was hoping for two things.

A process to run the lstm training tutorial on an ongoing basis, eg. before a new tag, or every 50 commits or every month, to catch any regression.
A similar process to find out where the current regression happ
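
One possible shape for such a check, as a sketch only: compare the char error rate of a fresh tutorial run against a stored baseline and flag it when it drifts too far. The function name rate_exceeds and the tolerance value are hypothetical; the two rates are the 3600-iteration figures quoted earlier in this thread (taking the first wiki figure to be the char error).

```shell
#!/bin/sh
# Sketch only: flag a training regression when the error rate exceeds baseline + tolerance.

# Succeeds (exit status 0) when RATE is strictly greater than LIMIT.
# awk does the comparison because the shell has no floating-point arithmetic.
rate_exceeds() {
  awk -v rate="$1" -v limit="$2" 'BEGIN { exit !(rate > limit) }'
}

baseline=0.041    # wiki figure at 3600 iterations (assumed to be the char error)
current=0.107     # char train error reported for beta.3 at 3600 iterations
tolerance=0.02    # hypothetical allowed drift, not a project setting

limit=$(awk -v b="$baseline" -v t="$tolerance" 'BEGIN { printf "%.3f", b + t }')

if rate_exceeds "$current" "$limit"; then
  status=regression
else
  status=ok
fi
echo "limit=$limit current=$current status=$status"
```

Wrapped in a script that exits non-zero when status is regression, the same check could serve as the test command for git bisect run when looking for the commit that introduced a regression.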

Shreeshrii on 23 Jul 2018

Please see https://github.com/tesseract-ocr/tesseract/issues/1798 regarding regression in plusminus training.

Shreeshrii on 23 Jul 2018

@Shreeshrii The easiest way to do this is with Travis. Would it fit into 50 minutes, including the build?

AlexanderP on 23 Jul 2018

Thanks, Alex. I will give it a try.

Shreeshrii on 25 Jul 2018

@zdenop Please tag beta.4, we have many changes and fixes. Thanks!

Shreeshrii on 29 Jul 2018

Yes, I think tagging a new beta would be good. An unsigned annotated tag can be created like this:

git tag -a -m "4.0.0-beta.4 release" 4.0.0-beta.4 18787ea12b2ea9368c8e1c0128d1f8aef2beebc8
git push --tags

Replace -a by -s to create a signed annotated tag.

Then draft a new pre-release based on 4.0.0-beta.4 in GitHub.

stweil on 30 Jul 2018

Done. Soon we will have more 4.0 betas than releases ;-)

zdenop on 30 Jul 2018

@zdenop Thanks! Please also mark 4.0.0-beta.4 as a pre-release in GitHub.

@stweil There still are a number of issues to be fixed/looked at. Please review to prioritize what needs to be in 4.0.0 and what can be pushed to next.

Shreeshrii on 1 Aug 2018

Maybe we should disable the Java based graphic debugger by default?

@amitdo I agree, as I do not use it. However, Ray has included it as part of tutorial process. The training wiki page will need a change, if this is disabled.

Shreeshrii on 1 Aug 2018

It is a beta tag, not an RC. IMO only an RC should be marked as pre-release.

zdenop on 1 Aug 2018

https://help.github.com/articles/creating-releases/

If the release is unstable, select This is a pre-release to notify users that it's not ready for production.

amitdo on 1 Aug 2018

@amitdo: what does it imply? Should all tags be marked as pre-release/release candidates?

zdenop on 1 Aug 2018

Technically this is not necessary. It is only additional information for people who look at the list of releases on GitHub.

stweil on 1 Aug 2018

@zdenop, it's your choice :-)

amitdo on 1 Aug 2018

The advantage that I see from a tag also being marked as pre-release is that it displays the number of commits made after it in master.

Eg. 4.0.0-beta.3 Release
@zdenop zdenop released this on 26 Jun · 277 commits to master since this release

This kind of info is not displayed for beta.4.

This is in no way necessary, but it is info that is useful to have.

There may be other ways of getting the info that I do not know about.

Shreeshrii on 2 Aug 2018

Ok. So just to "highlight" 4.0 release I will keep "one" (the latest) 4.0 pre-release.

zdenop on 2 Aug 2018

Langdata for 4.0 added to new repo.

jbreiden on 10 Aug 2018

🎉2

Thank you, @jbreiden, for this upload.

stweil on 10 Aug 2018

Thanks, @jbreiden. This will be very helpful.

Please also upload the font list used for each language, updated training scripts and any other info required by those who want to replicate LSTM training.

Shreeshrii on 11 Aug 2018

4.00-alpha was 'released' on Nov 8, 2016.

When are we going to finally release 4.0.0?

@zdenop? @stweil?

amitdo on 6 Sep 2018

It depends on what else should be fixed.
@jbreiden: do you have any info about Ray's next steps?

zdenop on 7 Sep 2018

@zdenop Ray is busy working on something else, and barely has time to say hello to me. He definitely doesn't have time for significant Tesseract work.

jbreiden on 7 Sep 2018

Hello Jeff :-)

Thanks for the info.

amitdo on 7 Sep 2018

@amitdo, I'd prefer if we could at least fix the most common known bugs for 4.0.0. You know that we already get a lot of duplicates for those bugs, and I'm afraid that would increase as soon as an official 4.0.0 is released. But in the end, I think it's the community which has to decide when Tesseract is "good enough" to leave beta state (as long as Ray is busy with other work).

stweil on 7 Sep 2018

👍1

Are you aware of the lgtm project? It was recently implemented for leptonica.

zdenop on 7 Sep 2018

https://github.com/tesseract-ocr/tesseract/search?q=lgtm&type=Commits

@stweil has been fixing based on lgtm alerts also.

Shreeshrii on 7 Sep 2018

Yes, I saw it at Leptonica and noticed that they already had alerts for Tesseract, too: https://lgtm.com/projects/g/tesseract-ocr/tesseract/alerts/.

stweil on 7 Sep 2018

That's the real question, at what stage it will be 'good enough'.

Personally, I prefer the 'release early, release often' paradigm.

I believe many people still use 3.05 and won't use 4.0.0 because beta implies 'unstable, buggy', but generally 4.0.0 is much better and I want more users to move to 4.0.0.

https://github.com/tesseract-ocr/tesseract/wiki/Planning
We need a clearer roadmap, otherwise we won't get to the target in a reasonable time.

We can publish a page titled

Known issues

for some issues that will be still unresolved when 4.0.0 is out.

amitdo on 7 Sep 2018

@stweil: Do you plan to fix some issues within a short time?
I can mark the current code as RC1 to give package managers some time for testing, and we can go for a release on 2018-09-30 if nothing special is found. Any objections or ideas?

zdenop on 12 Sep 2018

To be quite honest, I don't consider the current state ready to be released, there are just too many obvious bugs that cannot be explained with just accuracy.
Despite tesseract 3 having lower accuracy, there are no such obvious bugs.
Depending on the use case, I'd say tesseract 3 is often still the better choice.

troplin on 14 Sep 2018

@troplin, thank you for this feedback. I thought that Tesseract 4 still works well with the old recognizer and can be used as a full (better) replacement for Tesseract 3. Which bugs do you get with Tesseract 4 when it is used with the old OCR engine? Or is this list of regressions complete? Which other bugs do you consider as release stoppers?

stweil on 14 Sep 2018

IMO, there are no critical/blocking bugs in 4.0.0.

Despite tesseract 3 having lower accuracy, there are no such obvious bugs.
Depending on the use case, I'd say tesseract 3 is often still the better choice.

You still have an option to make 4.0.0 operate like 3.05. Just use it with --oem 0.

amitdo on 14 Sep 2018

You still have an option to make 4.0.0 operate like 3.05. Just use it with --oem 0.

@amitdo Do the user patterns, blacklist, whitelist etc work then?

Shreeshrii on 14 Sep 2018

@amitdo Do the user patterns, blacklist, whitelist etc work then?

Are these critical features for OCR software?
IMO, the answer is 'No'. Others may disagree.

How much time should we wait until someone fixes these issues? Another two years?

amitdo on 14 Sep 2018

@stweil LSTM is the default in tesseract 4 and I thought the old engine is to be removed. If LSTM is not working properly, what's the point of releasing 4.0?

What's most annoying is that there seems to be a general internal issue with string length / word boundaries, or an off-by-one error or similar, that causes various bugs, e.g. #1712.

The effects are sometimes characters missing that cannot be explained plausibly:
zero_ero

Sometimes a single character word that has a huge bounding box, like the "a" here:
ausschnitt.out.tesseract4.pdf

Word bounding boxes that are curiously wrong (always the same pattern as in #1712 ):
numberof

And those are not rare occurrences, it happens in almost all documents I tried.

If you like I can create more issues, but I think it's all somehow related and if one problem is fixed, the others are too.

troplin on 14 Sep 2018

👍1

@amitdo What's an OCR software? There are many different use cases how an OCR component can be integrated in software, and for us such issues are not acceptable.

troplin on 14 Sep 2018

The main job of an OCR software is to output the text in the image.

Tesseract 4.0.0 does this job better in most cases (at least for books/magazines/newspapers, which were and still are its most important use cases).

All other features, like hocr & pdf output for example, are 'nice to have', but should not block a release after two years in alpha.

Tesseract is a command line OCR. Like it or not, being a good library with a nice API for developers is a secondary goal.

- All that is my personal opinion.

The truth is that 3.05(.xx) is not really supported anymore. We work only on 4.0.0.

amitdo on 14 Sep 2018

Yes, there are things which are more important than others. Personally I think that most users will only run OCR, but not do training, so severe bugs for the former are not acceptable for a stable release while bugs in the training part can be fixed after release of 4.0.0.

PDF is nice to have, but there are alternate solutions which can create PDF from hOCR.

hOCR output is essential for my (and other scientific) work, because it contains not only the text, but also coordinates and other essential information. It's also the only format which can be converted to ALTO format.

stweil on 14 Sep 2018

👍2

@amitdo I think you are overestimating the usage and usefulness of plain text output. Most interesting use cases need more information than just the plain text. Also I think you might be underestimating the usage of tesseract as a component in bigger systems/applications. I don't have any reliable data though.

As a matter of fact, I don't even use the PDF output and for me that feature is not important. But the problems that I encounter are not specific to PDF output, they also appear in the hOCR output and (e.g. the missing z in zero) even in the plain text. It's just easier to see with the colored PDFs that I generate.

If these were just minor accuracy problems, e.g. caused by a lack of training or similar, then I'd probably agree that this can be solved later. But these problems hint at inconsistencies in internal data structures, errors in the program logic. It's not clear what other problems these could cause. They could even be exploitable.

troplin on 14 Sep 2018

This project tries to be a 'Swiss Army knife' that supports:

Regular users (command line)
Developers (C++ and C API)
2 OCR engines
>120 languages (traineddata)
- 3 current variants (in different repos) - lstm-best/lstm-fast/int-lstm-and-legacy
Windows
- MSVC (2015, 2017)
- MinGW-(w64)
- Cygwin
Linux (all of them...)
- GCC & Clang
macOS
- mainly with Clang
*BSD
iOS? Android?
VM / Docker (there was at least one bug report related to SIMD in a VM)
64, 32 bit
Intel/AMD, ARM, POWER
Little and big endian
For each platform, there are many ways to install Tesseract
- From source
- From a package manager
- We also support 'snap' for Ubuntu
text, tsv, hocr, pdf, unlv outputs
Training, which is too complicated for most users (we get a ton of questions about training)
OpenMP
SIMD
OpenCL
Visual 'Debugger' with Java GUI

This is Great! :-)

BUT...

The problem is that we have too little resources (no. of developers and time they have to contribute to the project, and don't forget the support side - questions and issues).

amitdo on 14 Sep 2018

@amitdo

This project tries to be a 'Swiss Army knife' that supports: [...]

I understand that, but I draw different conclusions. Many of those features don't necessarily have to be in tesseract itself, that's true. But tesseract must provide the necessary data to enable those features.
The "heart" of tesseract is the API, almost everything else builds on that. That's the part that's most important. If there are errors in the core code, it affects everything.
If the API works correctly, other programmers can use it to build those missing features.

The problem is that we have too little resources (no. of developers and time they have to contribute to the project, and don't forget the support side - questions and issues).

If you release buggy software, you'll get more support, not less.
Even now there is already a handful of issues probably related to those problems: #1192, #1906, #1883, #1146 (maybe even #1015 and #1810) and there will probably be more if you release it like this.

EDIT:
It's actually the same with "2 OCR engines": The only reason you need 2 engines is because the LSTM engine is not yet able to completely replace the old one. Once the LSTM engine is at that point, you can drop the old engine and no one will complain.

troplin on 14 Sep 2018

👍1

@troplin,

Obviously, we have different opinions regarding this subject.

Releasing 'Something' is better than releasing 'Nothing' (for years).

This 'Something' works quite well now.

Buggy?
- Yes, it has bugs...

How much time should we wait until some hero comes and saves us by fixing the bugs you mentioned? Surely, other people have other 'favorite' bugs. Should we wait forever to fix all bugs to satisfy everyone?

Search for the term 'Release early, release often'. I believe that this is the right approach for open source projects.

amitdo on 14 Sep 2018

It's actually the same with "2 OCR engines": The only reason you need 2 engines is because the LSTM engine is not yet able to completely replace the old one. Once the LSTM engine is at that point, you can drop the old engine and no one will complain.

Ray said several times that keeping the 'legacy' engine blocks fixing and improving the lstm engine. The legacy engine has ~37K LOC, and is much more complex than the lstm engine.

The legacy engine is 'Good' for languages written in the Latin script. It's not that good for other scripts.

amitdo on 14 Sep 2018

I would kindly ask everyone to consider being slightly more constructive in his/her comments in order to avoid any deadlocks with releasing.

It should be solely on the shoulders of the maintainers to decide when and what to release. If the maintainers ask for comments, the responses should be either "wait for this feature because I am going to provide a PR soon" or "it would be nice to have this feature in the next release but if no one is going to take it in the near future I will live happily without it".

I would also love to have all the bugs fixed but if I were a maintainer, I would release asap. Tesseract 4 is in the core a completely different project than Tesseract 3 and it clearly is superior in many aspects. From the maintenance point of view, I think it is better to move with versions up than waiting in alpha/beta state for years together with waiting for new contributors. Do not let a perfect backward compatible release stand in the way of a good new one :).

Moreover, based on our experience (we use version 3 and version 4 in parallel), Tesseract 4 is not and very likely never will be a full replacement of Tesseract 3 (for example if your priority is CPU speed but also precision with very specific corner cases etc.) so blindly trying to go for this goal is, in my opinion, a waste of time.

vidiecan on 17 Sep 2018

@vidiecan, which use cases of Tesseract 3 don't work similarly with Tesseract 4 (using the old OCR engine)? My expectation is that Tesseract 4.0.0 is a full replacement. If that's not the case, I'd like to have a description here.

stweil on 17 Sep 2018

https://groups.google.com/forum/?hl=en#!topic/tesseract-dev/YPXxGmDudHk

Zdenko Podobný | Sep 22

Hello,

I would like to thank all who shared their thoughts about releasing a new version of tesseract [1]. I took my time and I decided we should make the release in the middle of October 2018 (14-21...).

This means that no new features will be applied to the current code. There is no time for testing. Anyway, please feel free to send your patch/PR - it will be included after the 4.0 release.

There are several ways, how people can contribute to this process:

Developers: go through open issues and try to fix them. Please make a comment when you start to deal with an issue, so we can use our capacity efficiently.

Packagers: please test whether the building and packaging process is working fine. If something is broken, try to fix and submit it fast. Please send a note to the forum or to me directly about where users can find your "product", so we can put information about supported systems into the release notes.

"Wrappers": if you are producing a wrapper for tesseract, please send a note to the forum or to me directly if you support tesseract 4: I would like to promote your work.

"No code" developers:

check open issues, test them with the latest code to see whether the report is still valid, prepare a test case if missing, report duplicates, suggest labels, etc.

Improve documentation, release notes, man pages etc...

English native speaker: check documentation, release notes etc.

Thanks to all who help us to get to this point. I really appreciate all ways of support.

[1] https://github.com/tesseract-ocr/tesseract/issues/1423

Zdenko

amitdo on 25 Sep 2018

@zdenop,

Do you plan to release 4.0.0-rc.1 before the final 4.0.0?

amitdo on 29 Sep 2018

yes. maybe tomorrow.

zdenop on 29 Sep 2018

It looks like tag 4.0.0-rc1 was not created as an annotated tag. Therefore it won't be used for the Tesseract version. This is nothing to worry about too much, but of course we must make sure that the final 4.0.0 gets a correct tag.
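
The effect can be reproduced in a scratch repository. This sketch assumes the version string comes from plain git describe, which only considers annotated tags unless --tags is given; light-tag and the identity settings are placeholders.

```shell
#!/bin/sh
# Sketch only: plain "git describe" ignores lightweight tags, "--tags" does not.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git -c user.name="Example" -c user.email="example@example.invalid" \
    commit -q --allow-empty -m "initial commit"

git tag light-tag                      # lightweight tag, no tag object

# Without annotated tags, plain git describe exits with an error.
if git describe >/dev/null 2>&1; then
  plain=found
else
  plain=none
fi

with_tags=$(git describe --tags)       # lightweight tags are considered here

echo "plain: $plain, --tags: $with_tags"   # prints "plain: none, --tags: light-tag"
```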

stweil on 1 Oct 2018

:-( Can you try git fetch --tags --all --prune ?

zdenop on 2 Oct 2018

Many projects have a document that lists all the required steps for preparing a release (final/rc/beta).

It would be nice to have such a document for Tesseract.

amitdo on 2 Oct 2018

Thanks, it's fine now.

stweil on 2 Oct 2018

@zdenop, what about releasing a 4.0.0-rc2 this weekend?

stweil on 7 Oct 2018

Yes, today evening (European time zone).

zdenop on 7 Oct 2018

RC3 was released.

Let's see how many RCs we will have :-)

Is it possible to remove old 4.0.0 beta/rc releases, but keep the tags?

amitdo on 15 Oct 2018

I always did it that way: https://github.com/tesseract-ocr/tesseract/releases
Or am I missing something?

zdenop on 15 Oct 2018

The old ones are still listed there...

amitdo on 15 Oct 2018

Actually I deleted them, otherwise they would be as emphasized as Release candidate 3 is at the moment. I am not sure if I can do more without deleting the tags...

zdenop on 15 Oct 2018

You can't. Tags are automatically shown in the list of releases. And deleting the tags would be a really bad idea.

stweil on 15 Oct 2018

@amitdo, @Shreeshrii, @zdenop (and who else is waiting for 4.0.0), what are the most urgent things still missing for the final 4.0.0? I know that there remains much work to be done for 4.1.0 in any case.

stweil on 15 Oct 2018

#1192 is the most important one. I still think we should release 4.0.0 even if a properly tested fix is not found.

There will always be more work to be done after releases :-)

amitdo on 15 Oct 2018

@stweil Thank you for all your work in getting 4.0.0 ready for release.

One thing that would be useful, IMO, is if the version info from traineddata files could also be displayed when using tesseract for OCR. It might require updating the version strings to include the repo name also.

It would be useful when people report issues.

However, this is only a nice to have feature, and could wait for 4.1.0.

Shreeshrii on 16 Oct 2018

https://github.com/tesseract-ocr/tesseract/milestones/4.0.0 shows only one open topic. ;-)

It would be great if following issues are solved:

related to build process - excluding #911
#1990 - maybe we could leave it for 4.0.1 (test Otsu or other binarization from leptonica and check speed and OCR quality impact.)
#1093 (xstarts[1] = xstarts[segments])
better error message/explanation for ASSERT #1075, #1781
#1036 - fix compiler warnings.
#516 (memory leak in commontraining.cpp?), #99....

zdenop on 16 Oct 2018

@zdenop, are you planning an rc4 before the final 4.0.0? Maybe rc4 today, 4.0.0 next weekend?

I'm afraid that we won't be able to solve the issues in your list for 4.0.0.

stweil on 23 Oct 2018

Don't hurry. Do as many betas and rcs as needed.

egorpugin on 23 Oct 2018

@stweil: rc4 could be tagged, if issue #736 is solved/tested...

zdenop on 23 Oct 2018

rc4 released.
BTW: for the final release I want to omit the git SHA info (autotools build), so the version will be just plain "4.0.0". After the release, git-rev will be restored. Any objections?

zdenop on 24 Oct 2018

That works automatically, also for the release candidates:

$ git describe 
4.0.0-rc4

It's not necessary to omit and restore something. Just update VERSION and ChangeLog.
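
For reference, the two shapes of git describe output seen in this thread can be split apart with plain shell parameter expansion. The function name parse_describe is hypothetical; the inputs are the 4.0.0-rc4 string above and the 4.0.0-beta.3-54-g6f23 string from an earlier tesseract -v output.

```shell
#!/bin/sh
# Sketch only: split "git describe" output into tag, commit count and short hash.
parse_describe() {
  desc=$1
  case $desc in
    *-[0-9]*-g[0-9a-f]*)
      # Form TAG-N-gHASH: HEAD is N commits past the tag.
      hash=${desc##*-g}      # text after the last "-g"
      rest=${desc%-g*}       # drop the "-gHASH" suffix
      commits=${rest##*-}    # text after the last remaining "-"
      tag=${rest%-*}         # everything before the commit count
      ;;
    *)
      # HEAD is exactly at the tag, so only the tag name is printed.
      tag=$desc
      commits=0
      hash=
      ;;
  esac
  echo "tag=$tag commits=$commits hash=$hash"
}

parse_describe "4.0.0-rc4"               # prints "tag=4.0.0-rc4 commits=0 hash="
parse_describe "4.0.0-beta.3-54-g6f23"   # prints "tag=4.0.0-beta.3 commits=54 hash=6f23"
```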

stweil on 24 Oct 2018

What about replacing ChangeLog by a very short file which just links to the release notes in the Tesseract Wiki?

stweil on 24 Oct 2018

+1

You can add:

To get the git changelog, run this command:
git log 3.04.01..4.0.0

amitdo on 24 Oct 2018

https://github.com/tesseract-ocr/tesseract/commits/4.0.0-rc4 shows the commit list for rc4, so users who don't have a git command line can look at https://github.com/tesseract-ocr/tesseract/commits/4.0.0 for the commits of 4.0.0. Such information can be added to the Wiki, so it would be sufficient to refer to the Wiki in the ChangeLog file.

stweil on 24 Oct 2018

Congratulations on the release of 4.0.0 🎉

Thanks to everyone who contributed: developers, testers, documentation writers, bug reporters.

amitdo on 30 Oct 2018

🎉1

Closing because 4.0.0 was released.

zdenop on 31 Oct 2018

@zdenop Any plans for a bug fix release?

@stweil Should another issue be opened to discuss plans for next release?

Thanks!

Shreeshrii on 9 Feb 2019

Well, we broke API/ABI compatibility, so a bug-fix release is not easy (we would have to remove some fixes/improvements to keep it).

Maybe we should think about next release (4.1.0) or do not care about compatibility (release 4.0.1) which is IMO not right, but in line with tesseract history ;-)

zdenop on 10 Feb 2019

We decided to use semantic versioning (which I think is good), so a new release which is based on Git master would have to be 4.1.0. @AlexanderP, is that a problem for the Debian tesseract-ocr packages? Maybe /usr/share/tesseract-ocr/4.00/tessdata would have to be renamed (I suggest to use /usr/share/tesseract-ocr/4/tessdata).

stweil on 10 Feb 2019

February 21st is the FeatureFreeze (https://wiki.ubuntu.com/FeatureFreeze) and Debian Import Freeze for Ubuntu 19.04 Disco Dingo.

Shreeshrii on 11 Feb 2019

Debian will start with a new stable release in a few days, and as far as I see that new release will include Tesseract 4.0 for the next few years. Should we backport important fixes to the 4.0 branch? What does that mean for Tesseract 4.1? Are there still interested parties who need it? Or should we focus on Tesseract 5 which may drop or replace old code? @AlexanderP, what upgrade path do you see for Debian?

stweil on 3 Jul 2019

This project has limited resources, so I suggest to release 4.1 soon (1-6 weeks), and then concentrate on 5.0 and abandon 4.x.

amitdo on 3 Jul 2019

I planned to release 4.1 on the first of July. Unfortunately I found out there are problems with backwards API compatibility...

zdenop on 3 Jul 2019

I think it is necessary to upload version 4.1, and to upgrade to version 5.0 once it is closer to release.

AlexanderP on 4 Jul 2019

@AlexanderP : Does it mean that if we make 4.1 backwards compatible, you can get it to Debian?

zdenop on 5 Jul 2019

@zdenop I think it can get into Debian Backports.

AlexanderP on 6 Jul 2019

So Debian Buster will keep using Tesseract 4.0 for the next years? Then a 4.0.1 with carefully selected bug fixes will be required.

stweil on 7 Jul 2019

So Debian Buster will keep using Tesseract 4.0 for the next years?

Yes, but it is necessary to ask @jbreiden.
I think 4.1.0 can enter Debian buster-backports.

AlexanderP on 8 Jul 2019

In general, Debian only accepts security fixes for their stable releases. And that's fine. People who want fresher software will often do something else (such as run Debian Testing). I'm not sure how many people use buster-backports, but if Alexander wants to make them, I'm happy to keep signing. (Someday I imagine he will get his own keys, in my opinion he has more than earned them!)

Reminder of versions in Debian:
https://packages.qa.debian.org/t/tesseract.html

jbreiden on 9 Jul 2019

Tesseract: RFC: Tesseract 4.0.0 – open tasks
