I'd like to collect open tasks which should be addressed before tagging the official release 4.0.0.
These tasks are on my own list. We should discuss whether we consider them important for the new release or not:
--version parameter for all command line commands.
Enhance --list-langs to show additional information for scripts and languages like legacy / LSTM, version. This will make the command slower, because each file must be opened and parsed.
Add option to select the implementation for the dot product (CPU, SSE, AVX, ...).
SSE and AVX are also done on CPU :)
Remove deprecated code. This does not include OpenCL or the old Tesseract engine.
Adding a compile option NO_LEGACY_OCR_ENGINE would be nice.
I'll do it.
Enhance --list-langs to show additional information for scripts and languages like legacy / LSTM, version. This will make the command slower, because each file must be opened and parsed.
My suggestion would be to leave --list-langs as is,
and add this as --list-langs-details
or as --list-lang-details for one language file based on lang-code.
--list-langs should also display the directory it is using. This is useful when tessdata files are installed in multiple directories, e.g. by a PPA or Linux distribution package vs. in a locally built directory.
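A rough shell sketch of that idea, for illustration only: it prints the directory being searched, followed by the language codes found there. The function name list_langs_with_dir is hypothetical, and this is not how tesseract implements --list-langs. It only assumes that models are stored as <lang>.traineddata files directly in the given directory.

```shell
# Hypothetical helper, not part of tesseract: print the tessdata directory
# in use and the language codes of the *.traineddata files found in it.
# Falls back to $TESSDATA_PREFIX, then to the current directory.
list_langs_with_dir() {
  dir="${1:-${TESSDATA_PREFIX:-.}}"
  echo "tessdata directory: $dir"
  for f in "$dir"/*.traineddata; do
    # Skip the unexpanded glob when no model files are present.
    [ -e "$f" ] || continue
    basename "$f" .traineddata
  done
}
```

Calling it once per candidate directory makes it obvious which installation a given model comes from.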
Re: tessdata,
The configs and tessconfigs directories and pdf.ttf are needed in the directory which is being used via TESSDATA_PREFIX or --tessdata-dir.
E.g. when doing LSTM training, the lstm.train config file is not found if one uses tessdata_best as the continue_from dir.
My workaround has been to copy these to both the tessdata_fast and tessdata_best repos.
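The copy workaround could be scripted roughly as follows. This is a sketch under assumptions: the function name copy_tessdata_support is hypothetical, and the item names (configs, tessconfigs, pdf.ttf) are taken from the description above.

```shell
# Hypothetical helper: copy the support files a tessdata directory needs
# (configs/, tessconfigs/, pdf.ttf) from one directory into another.
copy_tessdata_support() {
  src="$1"
  dst="$2"
  for item in configs tessconfigs pdf.ttf; do
    if [ -e "$src/$item" ]; then
      # -R copies directories recursively; plain files are copied as-is.
      cp -R "$src/$item" "$dst/"
    else
      echo "warning: $src/$item not found" >&2
    fi
  done
}
```

A call might look like `copy_tessdata_support ../tessdata ../tessdata_best`, where both paths are placeholders for local clones.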
Add/implement install-langs.
A week with no API changes.
Add a simple bash script for building tesseract.
I use the following. It should probably also offer to download the osd and eng traineddata files for first-time users.
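#!/bin/bash
# Build and install tesseract.
./autogen.sh
./configure --disable-openmp --disable-graphics --disable-opencl
make
sudo make install
sudo ldconfig
# Build and install the training tools.
make training
sudo make training-install
# Refresh the googletest submodule and regenerate the build files for the unit tests.
rm -rf ./googletest
git submodule update --init
autoreconf -fiv
# Point the tests at the traineddata files, then run them.
#export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
export TESSDATA_PREFIX=../tessdata_fast
make check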
I would add this:
A week with no API changes.
Mission impossible.
Edit: That was a joke.
There was an online tool that monitors API changes for tesseract, but I cannot find a link for it. Does somebody have it? Can somebody show the changes between 4.0.0-beta.1 and the current code?
Please see https://github.com/tesseract-ocr/tesseract/issues/793
The tracker is at https://abi-laboratory.pro/tracker/timeline/tesseract/
Currently it is tracking stable release 3.05.01
@zdenop Please tag another release for 3.05 branch since 3.05.01 had a couple of problems which have been fixed in later commits.
~The good news is that the latest Debian / Ubuntu tesseract-ocr does not include the development files, so there will not be any API between that version and the future 4.0.0 which we have to take care of.~
Sorry, I was wrong: there is libtesseract-dev.
@zdenop I suggest adding labels to issues with the following proposed list of keywords, so that it is easy to see related issues and see if there are any critical pending issues.
4.0.0 for the final release
4.0x for 4.00.00alpha and 4.0.0-beta.1
3.0x for 3.05/3.04
LSTM training
training for 3.0x legacy tesseract training
Accuracy for reports of incorrect recognition
Performance for questions related to speed
Crashes for asserts and program crashes
Build related to compile and build from source
This is a suggested list.
IMO, our final 4.0.0 should not significantly diverge from the version that will be shipped in Ubuntu 18.04.
A new branch should be created for 4.0.0.
Only commits that follow the above rules should be backported from master.
4.0.0 should have at least rc.1 before final release.
We can decide that 4.1.0 will be released 2-3 months after 4.0.0 (still with legacy?).
How do you define "significantly"? There are some changes with the latest Git master:
inT32, ...) and macros (MIN_INT32, ...) were removed.
Would you suggest reverting these changes? They are major changes which require a step of the major version, so I think 4.0.0 is a good candidate to include those changes. Otherwise we would have to wait for 5.0.0.
I would even go further and fix potential name space problems with the 58 include files which are part of the Tesseract programming API in 4.0.0-beta.1, although that is a significant change, too.
How do you define "significantly"?
Basically, any bug fix is OK as long as it follows the 2 conditions I specified. No new features.
What was shipped for Ubuntu 18.04 reports as tesseract 4.00.00alpha. C
I think our aim should be to get all significant changes included in final
4.0.0 and get it ready in time for Ubuntu 18.10. What are the deadlines
for that?
18.04 is much more significant because it's LTS - supported for 5 years.
18.10 will be supported for only 9 months. We should not care about it.
What was shipped for Ubuntu 18.04 reports as tesseract 4.00.00alpha. C
We tagged it as 4.0.0-beta.1.
Another option is to skip final 4.0.0 and go straight to 5.0.0.
As per Jeff, we can't make any changes to what is shipped for 18.04.
But we still have time to do another beta, rc-1 and final 4.0.0 release in
time for 18.10.
I do not really know much about Linux releases, but my hope would be that
users would be able to install/upgrade to the 4.0.0 final version shipped
with 18.10 on 18.04.
@AlexanderP please explain whether the above is possible.
@zdenop, your thoughts about these two options?
What was shipped for Ubuntu 18.04 reports as tesseract 4.00.00alpha. C
We tagged it as 4.0.0-beta.1.
Yes, that tag is on GitHub.
Please see the post by Jeff, where he has shown what tesseract -v will report for 18.04.
Here is the link:
https://github.com/tesseract-ocr/tesseract/issues/995#comment-369704920
Jeff just said that the version in Ubuntu won't change in final 18.04.
We are talking about what we want to do in Tesseract's official GitHub repo.
We are the upstream, not Ubuntu!
IMO, our final 4.0.0 should not significantly diverge from the version
that will be shipped in Ubuntu 18.04.
I am trying to understand how 4.0.0 final release on github relates to
Ubuntu 18.04, in light of the above.
I am missing your reasoning for why it should not significantly diverge.
I want to hear @zdenop's and @jbreiden's opinions.
I think that as maintainers, they will understand (but not necessarily agree with) my proposal.
First of all I would like to know if final 4.0 release will be included in updates of Ubuntu (18.04)/Debian... If yes, then we should release 4.0 ASAP (e.g. fixes for issues will be accepted, but no other code changes).
Next, I would like to see a report like this to better understand the latest changes.
Then we can decide how 4.0 will be released:
I do not expect to revert any commit in master.
as a branch started from the 4.0.0-beta.1 tag (no changes in the master branch; only fixes will be ported to the 4.0 release branch)
I do not expect to revert any commit in master.
Yes, what you wrote here is what I meant.
As per Jeff, we can't make any changes to what is shipped for 18.04.
But we still have time to do another beta, rc-1 and final 4.0.0 release in time for 18.10.
I do not really know much about Linux releases, but my hope would be that users would be able to install/upgrade to the 4.0.0 final version shipped with 18.10 on 18.04.
@AlexanderP please explain whether the above is possible.
@Shreeshrii The update should go through without problems.
Please don't worry too much about Ubuntu, everything is going to be fine. I've had a crazy day today, but will have time tomorrow to discuss.
First of all I would like to know if final 4.0 release will be included in updates of Ubuntu (18.04)/Debian...
The version of Tesseract that ships with Ubuntu 18.04 will not change, unless there is a major security issue. See this chart for shipping Tesseract versions for different Ubuntu releases. https://launchpad.net/ubuntu/+source/tesseract
my hope would be that users would be able to install/upgrade to the 4.0.0 final version shipped with 18.10 on 18.04.
Ubuntu users have many choices if they want a newer Tesseract. They can build from source. They can install from Alexander's PPA. There's something called a "snap" which I don't know too much about. Maybe other ways too.
Shipping alpha/beta software in final LTS was/is a really bad idea. I bet it's against Ubuntu's policies.
This decision belongs to the Debian/Ubuntu package maintainers, which is Alexander and myself. I am a member of the Debian Project, and sponsored Alexander's excellent packaging work as official. I thought users would significantly benefit from the improved accuracy of LSTM Tesseract. I think (and hope) most developers will understand that the Tesseract API is still changing, and not have too much trouble.
We are the upstream, not Ubuntu!
That's right! Don't feel constrained. It is perfectly okay for Tesseract to change API before final release. If the API changes, Ubuntu and other Linux distributions will deal with it, and it won't be too hard. For example, in Ubuntu, the only direct dependencies on libtesseract4 are gimagereader libavfilter-extra6 libopenalpr2 libopencv-contrib3.2 and libsikulixapi-jni. These programs use just a tiny fraction of Tesseract's API. It will be up to Alexander and myself to make sure everything continues to work well together in Debian/Ubuntu both now and in the future.
Alexander and Jeff, I'll support you where needed, too, of course.
Jeff, Alexander,
I’m sorry that I caused offense.
@amitdo No offense taken. We are all on the same team.
@stweil: Are you interested in warnings from VS2017? I was able to build tesseract with cmake, cppan and VS2017.
Are those warnings the same as the warnings from the Appveyor CI build? And did you compile using Visual Studio Community? One of my colleagues might be interested, as he does more programming with Tesseract on Windows. I'm more focused on Linux and only look at macOS and Windows from time to time.
I just checked them and they seem to be the same.
4.00-alpha was 'released' in November 2016.
I think we should release a final 4.0.0 soon.
@stweil, is it fine with you if we decide on releasing 4.0.0-rc.1 on May 15?
After rc-1, no new features should go to 4.0.x branch, only bug fixes.
4.0.0 (final) will be released 2-6 weeks after rc.1.
@jbreiden A number of training related issues are because of lack of updated langdata. Ray had mentioned a few days back that the files are available in google repo and could be transferred after deleting extra files.
Is there any update regarding that?
I think the final release should include updated langdata also.
@Shreeshrii Can you point me at Ray's comment please?
theraysmith
commented 23 days ago
Hmm. Sorry. I thought I had done this in September.
The Google repo is up-to-date apart from the redundant files that need to be deleted.
I'll work with Jeff to get this done.
This issue is fine for discussions, but the overview gets a little bit lost. Therefore I just started a new page for the release planning (https://github.com/tesseract-ocr/tesseract/wiki/Planning) in the Tesseract wiki. Comments and contributions are welcome!
@stweil Thanks for adding the planning page. It is much easier to see the
open tasks and plans on it
Adding some more issues below which could be fixed for 4.0.0
Not to forget the endianness issue (see #518, #1525). For Linux distributions, the current status (big endian Tesseract 4.0 crashes) is not acceptable.
Update: The endianness issue is fixed now.
@stweil, what should be our next step?
What about a timeline?
I think the FAQ in the wiki needs to be streamlined.
I suggest renaming the current page to FAQ-old and creating a new FAQ page with a link to the old one.
The new FAQ page should only have items relevant to the 4.0.0 release and common info such as the link to ImproveQuality etc.
Items from FAQ-old which are relevant to 4.0.0 should be moved/copied to FAQ.
I have made changes for 4.0.0 to https://github.com/tesseract-ocr/tesseract/wiki/FAQ
Older version is at https://github.com/tesseract-ocr/tesseract/wiki/FAQ-Old
Please review / change / add to the FAQ for 4.0.0.
@stweil,
Trying again to get your answer... :-)
https://github.com/tesseract-ocr/tesseract/issues/1423#issuecomment-385665057
I think that should be a community decision: What do we consider as important for 4.0.0, what can be done later. For example I considered it important that Tesseract must at least work basically on all platforms which are supported by the major Linux distributions, so the breakage for big endian hosts kept me busy for the last days.
The current list of open tasks for 4.0.0 is still rather lengthy. We could postpone some tasks to later versions, but maybe it would be good to have some of them done for 4.0.0. Therefore I suggest to make a 4.0.0-rc.1 next at end of this week, followed by two more release candidates in the following weeks. 4.0.0 could then be tagged by end of May.
@stweil You are the one making most of the changes and bug fixes, so you should prioritize the open tasks list.
@jbreiden There are a couple of issues that should be resolved by you and Ray.
One is updating the langdata repo. We get a lot of training-related questions, and it will be good to have the correct data to finetune/test with.
Second is the issue related to user-words, which don't seem to work with current code.
Ray has indicated in the past that it could be fixed via a small change in code. I can look up those comments for you later.
It will be good if Ray can implement it so that if a user-words list is given then the result will be ONLY from that. If users want to include user-words along with the rest of dictionary words, then they can update the word-dawg file with their words.
@stweil You are the one making most of the changes and bug fixes, so you should prioritize the open tasks list.
:+1:
Shree, while you wrote this I was drafting my response, which included this sentence:
Stefan, since you are the leading community developer, I think it's a good idea to follow your wishes and timeline :-)
https://github.com/tesseract-ocr/tesseract/issues/1423#issuecomment-376004375
A week with no API changes.
Mission impossible.
@Wikinaut, I just now noticed that you :-1: me.
That was a joke...
@jbreiden @theraysmith
- Allow for whitelisting/blacklisting to ensure only numeric results.
A simple code change not related to training.
FORCE the output to match the provided pattern(s) and/or word(s). With this option, you can't get anything else out, whatever is in the image.
Right now there are 229 open issues. It will be helpful if we can identify which ones refer to 4.0.0.
@zdenop @egorpugin It will be great if you can search and label the issues as 4.0.0 or 3.0x. It will help in testing and closing the ones related to 4.0.0 before the release. Thanks.
Yesterday, while training eng for a cursive font as a test, I also tried to use the latest code to create a legacy tesseract model with tesstrain.sh.
The model got created (though the shapetable was not built; it seems to be blocked in tesstrain.sh).
I used combine_tessdata to create a traineddata file with both this newly created legacy model and the LSTM model. When trying to use it to recognise text, it crashes with an assert.
I then used just the legacy model traineddata. Using it for OCR, there is no crash, but the text is totally unlike the original.
I will retest with a regular serif or sans-serif font and file an issue with more details. Meanwhile I just wanted to mention it here.
I am very grateful for the information.
You are the best.
Therefore I suggest to make a 4.0.0-rc.1 next at end of this week, followed by two more release candidates in the following weeks. 4.0.0 could then be tagged by end of May.
@stweil Is this still the plan?
FYI
@jbreiden had mentioned in another thread about the possibility of access to a big-endian machine at http://osuosl.org/ for testing. I applied and have access to a VM Ubuntu (Xenial) on Power8 (little-endian). It has made it easier/faster for me to build/test tesseract, try to finetune models etc.
Thanks Jeff for the info regarding this option. Thanks to @AlexanderP for adding the platform for his PPA for leptonica and other libs also.
I'm afraid that I underestimated the amount of open issues which I think should be solved for 4.0.0.
Maybe we should release beta.2?
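@zdenop, @AlexanderP, I suggest to tag new pre-releases to match the latest Debian / Ubuntu packages:
4.0.0-beta.1 (already available): 40f4311
4.0.0-beta.2 (new): 10f4998
4.0.0-beta.3 (new): c3ed6f0
4.0.0-beta.4 (new): 555f6ff
The latest changes refactored the code but contain no fixes, so I currently see no need for a newer release.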
PS. The three new beta releases can share the same description: Beta release for Ubuntu 18.04.
What about PR 1614 (https://github.com/tesseract-ocr/tesseract/pull/1614)?
Please see the forum post
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/qjpVWmdP9GE/f9lsXWKhAAAJ
regarding a possible solution for "can't find matching blob", caused by integer overflow.
Also see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/qjpVWmdP9GE/y7NlZS3uAAAJ regarding possible solution for buffer read overrun inside the call to ReadMemBoxes().
Pull request #1630 should fix the buffer overrun for Tesseract 3.05. Git master already has that fix.
@zdenop , @AlexanderP, I suggest to tag new pre-releases to match the latest Debian / Ubuntu packages:
4.0.0-beta.1 (already available): 40f4311
4.0.0-beta.2 (new): 10f4998
4.0.0-beta.3 (new): c3ed6f0
4.0.0-beta.4 (new): 555f6ff
I think this is a good idea.
I will also suggest that the important bug fix commits be marked for tagging at time of PR by the committer.
I wouldn't worry about c3ed6f0 and 555f6ff, because they are only in Linux distributions that change continuously. @AlexanderP has been keeping them pretty close to HEAD and I suspect he will continue to do so.
Despite all my predictions to the contrary, Ubuntu somehow managed to ship 10f4998 in their 18.04 long term release. So that one will probably get a bunch of use over the next 5 years. It's great to see good OCR becoming more and more accessible to people.
4.0.0-beta.2 (new): 10f4998 Ubuntu 18.04 long term release
Next item -
It would be nice if we fix public include files.
I've just encountered errors with conflicting file names on include path.
File ccmain/pageiterator.h has two includes:
#include "publictypes.h"
#include "platform.h"
My project also has platform.h and at the same time other 3rd party library has publictypes.h, so I had to rename two tesseract includes to
#include "../ccstruct/publictypes.h"
#include "../ccutil/platform.h"
Possible solutions:
tesseract/ dir, so they'll be always included as tesseract/file.h (my preference)
ccutil/platform.h is much smaller.
I also prefer the variant #include "tesseract/file.h" (and dropping the prefix for tesscallback.h and tess_version.h). In addition we should review whether there are more public API headers which should not be there. But that's an API change of course, and all third parties which use the C or C++ API will have to do a (trivial) update. So we have to do it now before releasing 4.0.0 or much later. @amitdo, would that be an acceptable API change?
No objection.
Good news. Debian + Ubuntu have always shipped Tesseract headers
in their own subdirectory. So should be no compatibility headaches there.
https://packages.debian.org/sid/amd64/libtesseract-dev/filelist
Pull request #1678 changes the external API. Projects using Tesseract must now write #include "tesseract/...".
Can we add the new tags, please, when merging this new PR?
4.0.0-beta.2 (new): 10f4998 Ubuntu 18.04 long term release
4.0.0-beta.3 (new): for Pull request #1678 which changes the external API
I'd appreciate new tags, too. @zdenop?
Done.
Also, I released the bugfix version 3.05.02.
The top tags (on GitHub) are still "4.00.00dev" and "4.00.00alpha". What about renaming them to 4.0.0-dev (maybe this one could just be removed) and 4.0.0-alpha?
@zdenop Thanks for adding the tags.
Currently 4.0.0-beta.2 is showing up as the tag during builds - I think it is going by the date/time when tagging was done.
tesseract -v
tesseract 4.0.0-beta.2-313-g29f28
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
Please change so that 4.0.0-beta.3 is dated after that.
What about renaming them to 4.0.0-dev (maybe this one could just be removed)
I agree that it can be removed.
Also, please tag them as pre-releases (similar to beta.1). Thanks!
I created them at the same time, but GitHub lists them on different days... I tried to delete and recreate the 4.0.0-beta.3 tag, but it did not help. GitHub still reports it as 2 days old...
Finally, I put the 4.0.0-beta.3 tag on a different commit, so it is listed as expected ;-). Please do not forget to run locally:
git pull --prune --tags
otherwise your local tags may differ from the remote master tags.
Thank you, especially for giving the command to get them right locally :-)
git pull --prune --tags
@zdenop I used git pull --prune --tags. However, the version still shows beta-2 only.
tesseract -v
tesseract 4.0.0-beta.2-359-ga936
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
git describe expects an annotated tag. beta-2 is annotated, but beta-3 is not, so git handles this as a local or temporary tag, not as an official release tag. I am afraid we will have to wait until this is fixed with beta-4.
@zdenop, git tag -a -f 4.0.0-beta.3 4.0.0-beta.3 replaces the lightweight tag by an annotated tag. You could run that on a fresh clean clone and use git push --tags to push the updated tag.
Please read https://git-scm.com/docs/git-tag#_on_re_tagging before doing that.
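For reading the version strings shown in this thread: git describe prints <tag>-<commits since tag>-g<abbreviated hash>. The sketch below splits such a string into its parts. The function name parse_describe is hypothetical, and it assumes the long form (when HEAD sits exactly on a tag, git describe prints only the tag name, which this helper does not handle).

```shell
# Hypothetical helper: split a `git describe` style string of the form
# <tag>-<commits since tag>-g<abbreviated hash> into its three parts.
parse_describe() {
  desc="$1"
  hash="${desc##*-g}"   # text after the last "-g"
  rest="${desc%-g*}"    # everything before the last "-g"
  count="${rest##*-}"   # number of commits since the tag
  tag="${rest%-*}"      # the tag name itself
  echo "tag=$tag commits=$count hash=$hash"
}
```

To check what kind of tag a name refers to, `git cat-file -t 4.0.0-beta.3` prints `tag` for an annotated tag and `commit` for a lightweight one. `git describe --tags` also considers lightweight tags.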
try now.
tesseract -v
tesseract 4.0.0-beta.3-54-g6f23
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
Soon, it will be time for beta.4 :-)
FYI - Interesting visualization and comparison of OCR results for Arabic with different traineddata files.
http://kanz.pw/ocr/
See related discussion at
https://github.com/tesseract-ocr/tessdata_best/issues/11#issuecomment-400550328
Thanks everyone for your work!!!
@stweil,
What about a new schedule for 4.0.0?
We get a lot of questions regarding the tesseract training tutorial, so I decided to go through it and update it for the current file structure etc.
Ray has mentioned in https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#training-from-scratch
The character error rate falls below 50% just after 3700 iterations, and by 5000 to about 13%, where it will terminate. (In about 20 minutes on a current high-end machine with AVX.)
Running on Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-122-generic ppc64le) I am finding that it is taking much longer.
Edit: Now that it has reached 3700 iterations after a few hours, I see that error rate has gone back up to 100% instead of the expected 50%. The only difference is that the fonts installed on my system are different from the ones Ray used.
File /tmp/tmp.8c63LAbUJ6/eng/eng.TeXGyreHerosCondensed_Bold_Italic.exp0.lstmf page 1 :
Mean rms=5.856%, delta=49.178%, train=100.173%(100%), skip ratio=0.6%
Iteration 3699: ALIGNED TRUTH : questions 8 this?) Other because 1 has & character; NCBI was back - SEARCH
Iteration 3699: BEST OCR TEXT :
File /tmp/tmp.8c63LAbUJ6/eng/eng.TeXGyreHerosCondensed.exp0.lstmf page 24 :
Mean rms=5.856%, delta=49.184%, train=100.173%(100%), skip ratio=0.6%
At iteration 3694/3700/3722, Mean rms=5.856%, delta=49.184%, char train=100.173%, word train=100%, skip ratio=0.6%, wrote checkpoint.
Iteration 3700: ALIGNED TRUTH : much - 4. -» used through € between NEW % J. should when High when We it
Iteration 3700: BEST OCR TEXT :
File /tmp/tmp.8c63LAbUJ6/eng/eng.TeXGyreHerosCondensed_Italic.exp0.lstmf page 21 :
Mean rms=5.856%, delta=49.182%, train=100.173%(100%), skip ratio=0.6%
Iteration 3701: ALIGNED TRUTH : much - 4. -» used through € between NEW % J. should when High when We it
Iteration 3701: BEST OCR TEXT :
Has anyone else in the group run the tutorials? What times do you get? Should we set up a Travis CI job to test the tutorial process once in a while?
Also, is there anything additional that needs to be added for this architecture?
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/Intel%20SSE%20to%20PowerPC%20AltiVec%20migration
I didn't run the tutorial.
Tesseract lacks SIMD code for ppc64, so the speed will be slower than x86/x86-64.
I don't know what's the situation with OpenMP.
@amitdo Even if speed is slower, the results should be same or similar with same training data. I am getting a wide variance...
#At iteration 4988/5000/5000, Mean rms=3.855%, delta=17.627%, char train=75.487%, word train=98.179%, skip ratio=0%, wrote checkpoint.
#Finished! Error rate = 54.122
vs
#At iteration 4856/5000/5000, Mean rms=1.191%, delta=2.276%, char train=7.534%, word train=16.772%, skip ratio=0%, New best char error = 7.534 wrote best model:../tesstutorial/engoutput/base7.534_4856.checkpoint wrote checkpoint.
#Finished! Error rate = 7.534
I am trying to figure out if it is a recent change in code or a difference in configure options while compiling which is causing this.
I am trying to figure out if it is a recent change in code or a difference in configure options while compiling which is causing this.
It seems to be related to configure options.
For commit - tesseract 4.0.0-beta.3-180-gab1f
./configure --enable-openmp --disable-debug --disable-opencl
I get
At iteration 4856/5000/5000, Mean rms=1.191%, delta=2.276%, char train=7.534%, word train=16.772%, skip ratio=0%, New best char error = 7.534 wrote best model:../tesstutorial/engoutput/base7.534_4856.checkpoint wrote checkpoint.
Finished! Error rate = 7.534
./configure --enable-debug --disable-shared --disable-static CXXFLAGS="-Wall -Wextra -g -O0"
I get
At iteration 4992/5000/5000, Mean rms=5.853%, delta=49.475%, char train=100%, word train=100%, skip ratio=0%, wrote checkpoint.
Finished! Error rate = 25.515
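To compare runs like the two above, the final error rate can be pulled out of a saved training log. This is a minimal sketch: the function name final_error_rate is hypothetical, and it assumes the log ends with a "Finished! Error rate = N" line as in the output quoted here (optionally prefixed with "#", as in the earlier paste).

```shell
# Hypothetical helper: read a training log on stdin and print the value from
# the last "Finished! Error rate = N" line (an optional leading "#" is allowed).
final_error_rate() {
  sed -n 's/^#\{0,1\}Finished! Error rate = \([0-9.]*\).*$/\1/p' | tail -n 1
}
```

Running it over the logs of two builds, e.g. `final_error_rate < run_openmp.log` and `final_error_rate < run_debug.log` (file names are placeholders), gives two numbers that can be compared directly.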
Adding -O0 will surely make the speed super slow.
I don't know what's the reason for the difference in accuracy.
1. Is there a list of configure options available for tesseract?
2. What are the options set by default if just ./configure is used?
3. Is some error/warning displayed when incompatible options are chosen?
For (1) and (2) use ./configure --help
For (3) see https://github.com/tesseract-ocr/tesseract/issues/1739#issuecomment-402705969
@Shreeshrii, in pull request #1790 I try to improve the output from ./configure --help, so hopefully (1) and (2) should then be better answered.
Maybe we should disable the Java based graphic debugger by default?
@stweil Thanks!
There is a regression in plusminus training compared to the 2017 code.
Ray - training 4.0 wiki - 3600 iterations - 0.041% 0.185%
alpha - Dec 2017/Jan 2018 - 3600 - char train=0.031%, word train=0.069%
beta.3 - latest code - 3600 - char train=0.107%, word train=0.297%
@stweil @AlexanderP Is there a way we can set up automatic testing of the training process?
@Shreeshrii It is possible, but the size of the source archive would increase greatly, as it would be necessary to include all language packs in it.
@AlexanderP
I was not clear in my question. I meant the tesseract LSTM training tutorial process.
We only have accuracy numbers from Ray's tutorial for English. To replicate that tutorial will only require data related to English.
I have a bash script to run the required commands sequentially. However, it takes quite long for me to build tesseract, create training data and then run the tutorial commands.
So, I was hoping for two things.
A process to run the lstm training tutorial on an ongoing basis, eg. before a new tag, or every 50 commits or every month, to catch any regression.
A similar process to find out where the current regression happened.
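A minimal sketch of what such a periodic check could look like, assuming the training run writes its output to a log file. Only the `Finished! Error rate = N` line format is taken from the logs quoted in this thread; the helper names, the log file name and the 8.0 threshold are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for a regression gate around the LSTM training tutorial.

# Print the final error rate reported in a training log read from stdin.
extract_error_rate() {
  sed -n 's/^Finished! Error rate = \([0-9.]*\).*$/\1/p' | tail -n 1
}

# Succeed when the rate ($1) does not exceed the allowed maximum ($2).
# An empty rate (no "Finished!" line found) is treated as a failure.
rate_within_limit() {
  [ -n "$1" ] || return 2
  awk -v rate="$1" -v max="$2" 'BEGIN { exit !(rate + 0 <= max + 0) }'
}

# Usage (assumed file name and threshold):
#   rate=$(extract_error_rate < training.log)
#   rate_within_limit "$rate" 8.0 || { echo "regression: $rate" >&2; exit 1; }
```

Running the same check on a series of commits (for example with `git bisect run`) would be one way to narrow down where a regression was introduced.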
Please see https://github.com/tesseract-ocr/tesseract/issues/1798 regarding regression in plusminus training.
@Shreeshrii The easiest way to do this is with Travis. Can we fit into 50 minutes, including the build?
Thanks, Alex. I will give it a try.
@zdenop Please tag beta.4, we have many changes and fixes. Thanks!
Yes, I think tagging a new beta would be good. An unsigned annotated tag can be created like this:
git tag -a -m "4.0.0-beta.4 release" 4.0.0-beta.4 18787ea12b2ea9368c8e1c0128d1f8aef2beebc8
git push --tags
Replace -a by -s to create a signed annotated tag.
Then draft a new pre-release based on 4.0.0-beta.4 in GitHub.
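Since `git describe` only considers annotated tags by default, it can be worth checking the tag type before pushing. The following is a small sketch; `tag_kind` is a hypothetical helper name, while `git cat-file -t` is the standard command that prints the type of the object a ref points to.

```shell
#!/bin/sh
# Hypothetical helper: report whether a tag is annotated.
# `git cat-file -t` prints "tag" for an annotated or signed tag object
# and "commit" for a lightweight tag that points directly at a commit.
tag_kind() {
  case "$(git cat-file -t "refs/tags/$1" 2>/dev/null)" in
    tag)    echo annotated ;;
    commit) echo lightweight ;;
    *)      echo missing ;;
  esac
}

# Usage inside a clone of the repository:
#   tag_kind 4.0.0-beta.4
```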
Done. Soon we will have more 4.0 betas than releases ;-)
@zdenop Thanks! Please also mark 4.0.0-beta.4 as a pre-release in GitHub.
@stweil There still are a number of issues to be fixed/looked at. Please review to prioritize what needs to be in 4.0.0 and what can be pushed to next.
Maybe we should disable the Java based graphic debugger by default?
@amitdo I agree, as I do not use it. However, Ray has included it as part of tutorial process. The training wiki page will need a change, if this is disabled.
It is beta tag not RC. IMO only RC should be marked as pre-release.
https://help.github.com/articles/creating-releases/
If the release is unstable, select This is a pre-release to notify users that it's not ready for production.
@amitdo : what does it imply? Should all tags be marked as pre-release/release candidates?
Technically this is not necessary. It is only additional information for people who look at the list of releases on GitHub.
@zdenop, it's your choice :-)
The advantage that I see from a tag also being marked as pre-release is that it displays the number of commits made after it in master.
Eg. 4.0.0-beta.3 Release
@zdenop zdenop released this on 26 Jun · 277 commits to master since this release
This kind of info is not displayed for beta.4.
This is in no way necessary, but it is info that is useful to have.
There may be other ways of getting the info that I do not know about.
Ok. So just to "highlight" 4.0 release I will keep "one" (the latest) 4.0 pre-release.
Langdata for 4.0 added to new repo.
Thank you, @jbreiden, for this upload.
Thanks, @jbreiden. This will be very helpful.
Please also upload the font list used for each language, updated training scripts and any other info required by those who want to replicate LSTM training.
4.00-alpha was 'released' on Nov 8, 2016.
When are we going to finally release 4.0.0?
@zdenop? @stweil?
It depends on what else should be fixed.
@jbreiden : do you have any info about Ray's next steps?
@zdenop Ray is busy working on something else, and barely has time to say hello to me. He definitely doesn't have time for significant Tesseract work.
Hello Jeff :-)
Thanks for the info.
@amitdo, I'd prefer if we could at least fix the most common known bugs for 4.0.0. You know that we already get a lot of duplicates for those bugs, and I'm afraid that would increase as soon as an official 4.0.0 is released. But in the end, I think it's the community which has to decide when Tesseract is "good enough" to leave beta state (as long as Ray is busy with other work).
Are you aware about lgtm project? It was recently implemented for leptonica.
https://github.com/tesseract-ocr/tesseract/search?q=lgtm&type=Commits
@stweil has been fixing based on lgtm alerts also.
Yes, I saw it at Leptonica and noticed that they already had alerts for Tesseract, too: https://lgtm.com/projects/g/tesseract-ocr/tesseract/alerts/.
That's the real question, at what stage it will be 'good enough'.
Personally, I prefer the 'release early, release often' paradigm.
I believe many people still use 3.05 and won't use 4.0.0 because beta implies 'unstable, buggy', but generally 4.0.0 is much better and I want more users to move to 4.0.0.
https://github.com/tesseract-ocr/tesseract/wiki/Planning
We need a more clear roadmap, otherwise we won't get to the target in a reasonable time.
We can publish a page titled
for some issues that will be still unresolved when 4.0.0 is out.
@stweil : Do you plan to fix some issues within short time?
I can mark the current code as RC1 to give package managers some time for testing, and we can go for a release on 2018-09-30 if nothing special is found. Any objections or ideas?
To be quite honest, I don't consider the current state ready to be released, there are just too many obvious bugs that cannot be explained with just accuracy.
Despite tesseract 3 having lower accuracy, there are no such obvious bugs.
Depending on the use case, I'd say tesseract 3 is often still the better choice.
@troplin, thank you for this feedback. I thought that Tesseract 4 still works good with the old recognizer and can be used as a full (better) replacement for Tesseract 3. Which bugs do you get with Tesseract 4 when it is used with the old OCR engine? Or is this list of regressions complete? Which other bugs do you consider as release stoppers?
IMO, there's no critical/blocking bugs in 4.0.0.
Despite tesseract 3 having lower accuracy, there are no such obvious bugs.
Depending on the use case, I'd say tesseract 3 is often still the better choice.
You still have an option to make 4.0.0 operate like 3.05. Just use it with --oem 0.
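For illustration, a call using the old engine could be wrapped like this. The wrapper name and the file names are hypothetical; `--oem 0` and `-l` are the existing command line options. Note that this only works with traineddata files that contain the legacy model (the `tessdata` repository), not with the LSTM-only `tessdata_fast` or `tessdata_best` files.

```shell
#!/bin/sh
# Hypothetical wrapper: run Tesseract 4 with the legacy engine only.
# $1 = input image, $2 = output base name, $3 = language (default: eng)
run_legacy_ocr() {
  tesseract "$1" "$2" --oem 0 -l "${3:-eng}"
}

# Usage (assumed file names): writes the recognized text to page.txt
#   run_legacy_ocr page.png page
```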
@amitdo Do the user patterns, blacklist, whitelist etc work then?
Are these critical feature for an OCR software?
IMO, the answer is 'No'. Others may disagree.
How much time we should wait until someone will fix these issues? Another two years?
@stweil LSTM is the default in tesseract 4 and I thought the old engine is to be removed. If LSTM is not working properly, what's the point of releasing 4.0?
What's most annoying is that there seems to be a general internal issue with string length / word boundaries, or an off-by-one error or similar, that causes various bugs, e.g. #1712.
The effects are sometimes characters missing that cannot be explained plausibly:

Sometimes a single character word that has a huge bounding box, like the "a" here:
ausschnitt.out.tesseract4.pdf
Word bounding boxes that are curiously wrong (always the same pattern as in #1712 ):

And those are not rare occurrences, it happens in almost all documents I tried.
If you like I can create more issues, but I think it's all somehow related, and if one problem is fixed, the others are too.
@amitdo What's an OCR software? There are many different use cases how an OCR component can be integrated in software, and for us such issues are not acceptable.
The main job of an OCR software is to output the text in the image.
Tesseract 4.0.0 does this job better in most cases (at least for books/magazines/newspapers, which were and still are its most important use cases).
All other features, like hocr & pdf output for example, are 'nice to have', but should not block a release after two years in alpha.
Tesseract is a command line OCR. Like it or not, being a good library with a nice API for developers is a secondary goal.
- All that is my personal opinion.
The truth is that 3.05(.xx) is not really supported anymore. We work only on 4.0.0.
Yes, there are things which are more important than others. Personally I think that most users will only run OCR, but not do training, so severe bugs for the former are not acceptable for a stable release while bugs in the training part can be fixed after release of 4.0.0.
PDF is nice to have, but there are alternate solutions which can create PDF from hOCR.
hOCR output is essential for my (and other scientific) work, because it contains not only the text, but also coordinates and other essential information. It's also the only format which can be converted to ALTO format.
@amitdo I think you are overestimating the usage and usefulness of plain text output. Most interesting use cases need more information than just the plain text. Also I think you might be underestimating the usage of tesseract as a component in bigger systems/applications. I don't have any reliable data though.
As a matter of fact, I don't even use the PDF output and for me that feature is not important. But the problems that I encounter are not specific to PDF output, they also appear in the hOCR output and (e.g. the missing z in zero) even in the plain text. It's just easier to see with the colored PDFs that I generate.
If these were just minor accuracy problems, e.g. caused by a lack of training or similar, then I'd probably agree that this can be solved later. But these problems hint at inconsistencies in internal data structures, errors in the program logic. It's not clear what other problems these could cause. They could even be exploitable.
This project tries to be a 'Swiss Army knife' that supports:
This is Great! :-)
BUT...
The problem is that we have too little resources (no. of developers and time they have to contribute to the project, and don't forget the support side - questions and issues).
@amitdo
This project tries to be a 'Swiss Army knife' that supports: [...]
I understand that, but I draw different conclusions. Many of those features don't necessarily have to be in tesseract itself, that's true. But tesseract must provide the necessary data to enable those features.
The "heart" of tesseract is the API, almost everything else builds on that. That's the part that's most important. If there are errors in the core code, it affects everything.
If the API works correctly, other programmers can use it to build those missing features.
The problem is that we have too little resources (no. of developers and time they have to contribute to the project, and don't forget the support side - questions and issues).
If you release buggy software, you'll get more support, not less.
Even now there is already a handful of issues probably related to those problems: #1192, #1906, #1883, #1146 (maybe even #1015 and #1810) and there will probably be more if you release it like this.
EDIT:
It's actually the same with "2 OCR engines": The only reason you need 2 engines is because the LSTM engine is not yet able to completely replace the old one. Once the LSTM engine is at that point, you can drop the old engine and no one will complain.
@troplin,
Obviously, we have different opinion regarding this subject.
Releasing 'Something' is better than releasing 'Nothing' (for years).
This 'Something' works quite well now.
Buggy?
- Yes, it has bugs...
How much time should we wait until some hero will come and save us by fixing the bugs you mentioned? Surely, other people have other 'favorite' bugs. Should we wait forever to fix all bugs to satisfy everyone?
Search for the term 'Release early, release often'. I believe that this is the right approach for open source projects.
It's actually the same with "2 OCR engines": The only reason you need 2 engines is because the LSTM engine is not yet able to completely replace the old one. Once the LSTM engine is at that point, you can drop the old engine and no one will complain.
Ray said several times that keeping the 'legacy' engine blocks fixing and improving the lstm engine. The legacy engine has ~37K LOC, and is much more complex than the lstm engine.
The legacy is 'Good' for languages written in the Latin script. It's not that good for other scripts.
I would kindly ask everyone to consider being slightly more constructive in their comments in order to avoid any deadlocks with releasing.
It should be solely on the shoulders of the maintainers to decide when and what to release. If the maintainers ask for comments, the responses should be either "wait for this feature because I am going to provide a PR soon" or "it would be nice to have this feature in the next release but if no one is going to take it in the near future I will live happily without it".
I would also love to have all the bugs fixed but if I were a maintainer, I would release asap. Tesseract 4 is in the core a completely different project than Tesseract 3 and it clearly is superior in many aspects. From the maintenance point of view, I think it is better to move with versions up than waiting in alpha/beta state for years together with waiting for new contributors. Do not let a perfect backward compatible release stand in the way of a good new one :).
Moreover, based on our experience (we use version 3 and version 4 in parallel), Tesseract 4 is not and very likely never will be a full replacement of Tesseract 3 (for example if your priority is CPU speed but also precision with very specific corner cases etc.) so blindly trying to go for this goal is, in my opinion, a waste of time.
@vidiecan, which use cases of Tesseract 3 don't work similarly with Tesseract 4 (using the old OCR engine)? My expectation is that Tesseract 4.0.0 is a full replacement. If that's not the case, I'd like to have a description here.
https://groups.google.com/forum/?hl=en#!topic/tesseract-dev/YPXxGmDudHk
Zdenko Podobný | Sep 22
Hello,
I would like to thank all who shared their thoughts about releasing a new version of tesseract [1]. I took my time and decided we should make the release in mid-October 2018 (14-21...).
This means that no new features will be applied to the current code. There is no time for testing. Anyway, please feel free to send your patch/PR; it will be included after the 4.0 release.
There are several ways, how people can contribute to this process:
- Developers: go through open issues and try to fix them. Please make a comment when you start to deal with an issue, so we can use our capacity efficiently.
- Packagers: please test if building and packaging process is working fine. If something is broken, try to fix&submit it fast. Please give a note to forum or me directly, where users can find your "product", so we can put information about supported systems to release notes.
- "Wrappers": if you are producing wrapper for tesseract, please give a note to forum or me directly if you support tesseract 4: I would like to promote your work.
- "No code" developers:
- check open issues, test them with the latest code to see whether the report is still valid, prepare a test case if missing, report duplicates, suggest labels etc.
- Improve documentation, release notes, man pages etc...
- English native speaker: check documentation, release notes etc.
Thanks to all who help us to get to this point. I really appreciate all ways of support.
[1] https://github.com/tesseract-ocr/tesseract/issues/1423
Zdenko
@zdenop,
Do you plan to release 4.0.0-rc.1 before the final 4.0.0?
yes. maybe tomorrow.
It looks like tag 4.0.0-rc1 was not created as an annotated tag. Therefore it won't be used for the Tesseract version. This is nothing to worry about too much, but of course we must make sure that the final 4.0.0 gets a correct tag.
:-( Can you try git fetch --tags --all --prune ?
Many projects have a document that list all the required steps for preparing a release (final/rc/beta).
It will be nice to have such a document for Tesseract.
Thanks, it's fine now.
@zdenop, what about releasing a 4.0.0-rc2 this weekend?
Yes, today evening (European time zone).
RC3 was released.
Let's see how many RCs we will have :-)
Is it possible to remove old 4.0.0 beta/rc releases, but keep the tags?
I always did it that way: https://github.com/tesseract-ocr/tesseract/releases
Or am I missing something?
The old ones are still listed there...
Actually I deleted them, otherwise they will be so emphasized as Release candidate 3 at the moment. I am not sure if I can do more without deleting tag...
You can't. Tags are automatically shown in the list of releases. And deleting the tags would be a really bad idea.
@amitdo, @Shreeshrii, @zdenop (and who else is waiting for 4.0.0), what are the most urgent things still missing for the final 4.0.0? I know that there remains much work to be done for 4.1.0 in any case.
There will always be more work to be done after releases :-)
@stweil Thank you for all your work in getting 4.0.0 ready for release.
One thing that would be useful, IMO, is if the version info from traineddata files could also be displayed when using tesseract for OCR. It might require updating the version strings to include the repo name also.
It would be useful when people report issues.
However, this is only a nice to have feature, and could wait for 4.1.0.
https://github.com/tesseract-ocr/tesseract/milestones/4.0.0 shows only one open topic. ;-)
It would be great if following issues are solved:
@zdenop, are you planning a rc4 before the final 4.0.0? Maybe rc4 today, 4.0.0 next weekend?
I'm afraid that we won't be able to solve the issues in your list for 4.0.0.
Don't hurry. Do as many betas and rcs as needed.
@stweil: rc4 could be tagged, if issue #736 is solved/tested...
rc4 released.
BTW: for the final release I want to omit the git sha info (autotools build), so the version will be just plain "4.0.0". After the release, git-rev will be restored. Any objections?
That works automatically, also for the release candidates:
$ git describe
4.0.0-rc4
It's not necessary to omit and restore something. Just update VERSION and ChangeLog.
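To make the difference between a tagged and an untagged build visible, the `git describe` output can be split into its parts. This is only a sketch with a hypothetical helper name; the input format (`<tag>-<commits>-g<sha>`, as in the 4.0.0-beta.3-180-gab1f string quoted earlier in this thread) is the standard `git describe` format.

```shell
#!/bin/sh
# Hypothetical helper: split `git describe` output into tag, number of
# commits since that tag, and abbreviated commit id. When HEAD is exactly
# on an annotated tag, git prints only the tag name, so there is no suffix.
describe_parts() {
  printf '%s\n' "$1" | sed -E \
    -e 's/^(.+)-([0-9]+)-g([0-9a-f]+)$/tag=\1 commits=\2 sha=\3/' \
    -e t \
    -e 's/^(.*)$/tag=\1 commits=0 sha=/'
}

# Usage inside a clone of the repository:
#   describe_parts "$(git describe)"
```

For the release commit itself the suffix is absent, which is why nothing has to be omitted and restored by hand.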
What about replacing ChangeLog by a very short file which just links to the release notes in the Tesseract Wiki?
+1
You can add:
To get the git changelog, run this command:
git log 3.04.01..4.0.0
https://github.com/tesseract-ocr/tesseract/commits/4.0.0-rc4 shows the commit list for rc4, so users who don't have a git command line can look at https://github.com/tesseract-ocr/tesseract/commits/4.0.0 for the commits of 4.0.0. Such information can be added to the Wiki, so it would be sufficient to refer to the Wiki in the ChangeLog file.
Congratulations on the release of 4.0.0 :tada:
Thanks to every one who contributed: developers, testers, documentation writers, bug reporters.
Closing because 4.0.0 was released.
@zdenop Any plans for a bug fix release?
@stweil Should another issue be opened to discuss plans for next release?
Thanks!
Well, we broke API/ABI compatibility, so a bug fix release is not easy (we would have to remove some fixes/improvements to keep it).
Maybe we should think about next release (4.1.0) or do not care about compatibility (release 4.0.1) which is IMO not right, but in line with tesseract history ;-)
We decided to use semantic versioning (which I think is good), so a new release which is based on Git master would have to be 4.1.0. @AlexanderP, is that a problem for the Debian tesseract-ocr packages? Maybe /usr/share/tesseract-ocr/4.00/tessdata would have to be renamed (I suggest to use /usr/share/tesseract-ocr/4/tessdata).
February 21st: FeatureFreeze (https://wiki.ubuntu.com/FeatureFreeze) and Debian Import Freeze for Ubuntu 19.04 Disco Dingo.
Debian will start with a new stable release in a few days, and as far as I see that new release will include Tesseract 4.0 for the next few years. Should we backport important fixes to the 4.0 branch? What does that mean for Tesseract 4.1? Are there still interested parties who need it? Or should we focus on Tesseract 5 which may drop or replace old code? @AlexanderP, what upgrade path do you see for Debian?
This project has limited resources, so I suggest to release 4.1 soon (1-6 weeks), and then concentrate on 5.0 and abandon 4.x.
I planned to release 4.1 on the first of July. Unfortunately I found out there are problems with backwards API compatibility...
I think it is necessary to upload version 4.1, and to upgrade to version 5.0 once it is closer to release.
@AlexanderP : Does it mean that if we make 4.1 backwards compatible, you can get it to Debian?
@zdenop I think he can get into the Debian Backports.
So Debian Buster will keep using Tesseract 4.0 for the next years? Then a 4.0.1 with carefully selected bug fixes will be required.
So Debian Buster will keep using Tesseract 4.0 for the next years?
Yes, but it is necessary to ask @jbreiden
I think 4.1.0, can enter Debian buster-backports.
In general, Debian only accepts security fixes for their stable releases.
And that's fine.
People who want fresher software will often do something else (such as run Debian Testing).
I'm not sure how many people use buster-backports, but if Alexander wants to make them, I'm happy to keep signing. (Someday I imagine he will get his own keys, in my opinion he has more than earned them!)
Reminder of versions in Debian:
https://packages.qa.debian.org/t/tesseract.html