Tesseract: RFC: Tesseract 4.1.0 Planning

Created on 16 Feb 2019  Â·  37Comments  Â·  Source: tesseract-ocr/tesseract

Planning page on wiki and RFC: Tesseract 4.0.0 – open tasks were used for discussion regarding 4.0.0 release.

Please prioritize and discuss here the tasks to be tackled for next release - 4.1.0.

Most helpful comment

An alternative approach could be to include ocrd-train project as a submodule in tesseract for LSTM training.

@wrznr and me like the idea, we're also open to directly including this in tesseract and/or changing the name of the project to better reflect that it's about tesseract training.

All 37 comments

Here is the link to issues marked for 4.1.0 milestone.

This next link is to a longer list of issues marked as bugs.

Suggestions:

  1. Add a new repo for tesstutorial

People have a hard time trying to follow Ray's tutorial page on wiki for training LSTM engine with all its different options. A new repo which has a minimal set of files for running the tutorial and bash scripts with commands for the same will be helpful. It could also be extended to show a sample of finetuning for a different language.

  1. Add bash scripts (similar to tesstrain.sh) for creating 'strarter traineddataforLSTM trainingusingbox/tiff pairs`.

An alternative approach could be to include ocrd-train project as a submodule in tesseract for LSTM training. @kba @wrznr @stweil @zdenop - What do you think of the idea?

I tagged current code and 4.1.0-rc1 so we can start preparing release. It would be great if somebody could update Changelog, release notes etc...

An alternative approach could be to include ocrd-train project as a submodule in tesseract for LSTM training.

@wrznr and me like the idea, we're also open to directly including this in tesseract and/or changing the name of the project to better reflect that it's about tesseract training.

@kba and @wrznr Thank you for your response.

Currently there are three different options for LSTM training:

  1. ocrd-train - using line images and matching ground truth.

  2. tesstrain.sh - using box/tiff pairs generated from training text and fonts.

  3. The new options of lstmbox and wordstrbox generate box files suitable LSTM training from scanned page images using tesseract. These are suitable for training from scanned images where no ground truth is initially available. For RTL languages, however, they are not directly editable. See comment regarding needed helper scripts by @amito - https://github.com/tesseract-ocr/tesseract/pull/2231#issuecomment-462177671. The downside with this approach is that the box files need to be reviewed and corrected before being used for further training. Currently there is no defined path for using these for LSTM training - modified versions of ocrd-train or tesstrain.sh need to be used.

There were earlier suggestions of changing from bash based tesstrain.sh to make based solution.

It would be great if there can be unified and simpler approach to tesseract LSTM training. Would you be open to enhancing ocrd-train to support all?

You will have more independence and control with a separate project. But I am sure it could be directly included in tesseract, if that is your preference. The decision though will be up to the project owners and maintainers.

I already created tesstrain.py which is a conversion of the bash script to Python with many improvements and better progress and error reporting. It's in master now. It only supports LSTM, however.

A make frontend could call portions of it if someone really wanted to go with make.

@jbarlow83 I haven't tried the python script yet. Which version of python do I need for it?

3.5 or newer I believe. Later edit: this is not accurate; use 3.6

@jbarlow83 I have Python 3.5 but am not able to run the script.

python3 --version
Python 3.5.2

What is the command to use?

 python3 ./src/training/tesstrain.py
python3 ./src/training/tesstrain.py
  File "./src/training/tesstrain.py", line 66
    log.info(f"=== Starting training for language {ctx.lang_code}")

Python 3.6 or newer is required. I thought this was a reasonable
requirement since the baseline Linux system for Tesseract 4 is Ubuntu 18.04.

On Fri, Mar 15, 2019 at 2:27 AM Shreeshrii notifications@github.com wrote:

@jbarlow83 https://github.com/jbarlow83 I have Python 3.5 but am not
able to run the script.

python3 --version
Python 3.5.2

What is the command to use?

python3 ./src/training/tesstrain.py
python3 ./src/training/tesstrain.py
File "./src/training/tesstrain.py", line 66
log.info(f"=== Starting training for language {ctx.lang_code}")

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/tesseract-ocr/tesseract/issues/2249#issuecomment-473215675,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABvcMxDnGZw69YWNrKFFp89ckZODidxpks5vW2edgaJpZM4a-x1v
.

I am using
NAME="Ubuntu"
VERSION="16.04.5 LTS (Xenial Xerus)"

Let me see if I can upgrade python.

Debian stable also has 3.5.

Please see https://github.com/Shreeshrii/tess4training which has the start of a new sub-repo for tesseract-ocr for running the Tesstutorials.

Please give it a try and let me know if there are any issues with it.

Changed the topic to say 5.0.0. Thanks @amitdo.

Also noticed just now that @stweil had setup 'Tesseract Next' as a project at https://github.com/tesseract-ocr/tesseract/projects/1

Why? IMO major number should be increase if we do backwards incompatible changes of public API. Excluding adding new renderers (that actually could be handled by provided external programs) changes are minor - usually related to code improvement.

@zdenop: as @amitdo pointed out, the public API already has serious backward incompatible changes. Serious in the sense that the API is no longer compatible and ABI linkage will fail or crash.

Can somebody elaborate on this (serious API change) "a little bit"?

My understanding is that: we modified API => we need co change MINOR version if we are backward compatible with previous version. So:

  1. Extending of API is backward compatible.
  2. Removing of symbols could be problem, but I really doubt in current case it is issue (if yes, than please provide real life example for testing)
  3. Problem with data types I am not able to evaluate... Because first change here was releated to "monitor" I just made simple test: I took example from wiki and build it against 4.0.0 tesseract library (libtesseract.so.4). Then I uninstall tesseract and installed the latest code and run program again (without recompiling). It runs without problem and any error message (no linking or missing symbol).

So can somebody produce real life code where linkage will fail or crash? Without that I really do not see reason why to use 5.0.0. instead of 4.1.0 (=>we are backward compatible)

It's definitely not backward compatible.

Regarding removal of symbols, any symbol exported by libtesseract.so is fair game for linking and if symbols are removed linking will fail. A linker can't link to a symbol that isn't there anymore. It does seem unlikely anyone would use Tesseract's API for reversing a singly linked list, but it may be that header-only/inlined functions use exercise it. The removal of the exported symbols is still a backward compatibility breach of the "API contract".

One that shows the problem most clearly is ResultIterator. For example this code:

#include <tesseract/baseapi.h>

int main()
{
    tesseract::TessBaseAPI api;
        // (add some boilerplate here to OCR an image if desired)
        api.GetIterator()->GetBestLSTMSymbolChoices();
    return 0;
}

Disassembles to

0000000100000ecf        callq   0x100000f4e ## symbol stub for: __ZN9tesseract11TessBaseAPI11GetIter
atorEv
0000000100000ed4        movq    %rax, -0xc0(%rbp)
0000000100000edb        jmp     0x100000ee0
0000000100000ee0        movq    -0xc0(%rbp), %rax
0000000100000ee7        movq    (%rax), %rcx
0000000100000eea        movq    0x48(%rcx), %rcx
0000000100000eee        movq    %rax, %rdi
0000000100000ef1        callq   *%rcx   ## relative function call

Since this is a vtable, libtesseract will contain a symbol for the vtable itself. The method names are not in the vtable, only offsets. If the offsets change, we're not backward compatible.

The above disassembly basically says:

*(ResultIterator->vtable[0x48])(this);

and between master and v4.0, the function at address vtable[0x48] changed from being the address of GetBestLSTMSymbolChoices to the address of GetRawLSTMTimesteps. I think it should be pretty clear this will crash.

For your example from the wiki, ResultIterator::GetUTF8Text is not affected by the change in vtable; it's still at the same offset. So yes, it will work.

In the case of ChoiceIterator, several variables were added such that the size of the structure allocated will change. If caller C++ makes a copy of a ChoiceIterator (including an implicit copy), it will allocate the wrong size if linked against libtesseract 4, and this will cause data corruption.

Doesn't the vtable contain only the virtual functions? Then all functions which are not virtual will still be found. Changed sizes of classes are a problem for compatibility if the API users can use new or inline functions which access member variables.

The compatibility report lists few problems, and only 5 are classified as high.

Two of those are for removed symbols. I don't remember that we removed insert or reverse functions, so that still needs to be examined.

The other three are related to @noahmetzger's changes for the choice iterator. Maybe Noah can find a solution which is API compatible, so that could be solved.

There is one problem classified as medium for class ETEXT_DESC. This can be fixed, too.

So I see no reason why it should not be possible to make a version 4.1 which is backward compatible.

Hi @jbarlow83,

Are any of the changes considered API breakage or only ABI breakage?

if I'm not mistaken, the semver is about API, and the soversion (libname.so.6) is about ABI.

Yes the class vtable only contains virtual methods, but it's still a problem because vtable is a part of the public API and ABI. Classes sizes changes are a problem for just about any class that accesses its member variables, and for classes that have implicit constructors (because they will be created and inlined on the fly) or header-defined constructors (inlined on the fly). The workaround is to use the private class data pattern and ensure the implementation is not part of the API.

I agree these issues are probably fixable. My point was to try to explain why it's not something to ignore, either the version number gets a major bump or the issues get fixed. ResultIterator can be fixed by reordering the methods such that new virtual methods are appended. ChoicesIterator would be trickier - it might work to use private class data and 8 bytes of padding.

The removed functions from oldlist.cpp are the result of my commit f76d8a14cd49b952bc30d21c0fab14553c8de3a7. I think that nobody will miss those functions and don't think that 5.0.0 is needed because of them. On the other hand it would be easy to restore them.

@amitdo As they say: "ABI=library API + compiler ABI".

Removing the list_rec methods is API breakage. The other problems are compiler ABI breakage, but it's all ABI breakage.

semver was written to be language agnostic. I'd argue that in the case of C++ libraries, backward compatibility means ABI compatibility, because without that, applications break.

As far as I can tell we ignore soversion in tesseract and just set it to the application version. We could start managing soversion separately and that would allow bumping the ABI version separately, if we want to make that distinction.

Leptonica is an example for a (C-based) library that makes this distinction.

I would like to release 4.1.0 mainly as bugfix release + new features (extension of API with ALTO, LSTMBox and WordStrBox renderes).
This can happen in dedicated branch "4.x" where we can revert all necessary changes in API. So we can continue breaking API in master :-) (master will be tagged as 5.0.0alpha)
Is this proposal fine for you or is there other proposal how to move forward?

Yes, that's quite fine.

Practically, after my first foray in the 3.0 days, I avoid any attempts to work with Tesseract at the API level because it's been too unstable. I really appreciate that we're paying more attention to it.

OK. Now the master is tagged as 5.0.0-alpha and we have 4.1 branch for making API backward compatible.
Can @noahmetzger revert changes related API?
@stweil: What is your suggestion to fix ETEXT_DESC? Also your original commit changes size of end_time
Also we need to find out minimum public headers...

Can we replace STRING and GenericVector by standard C++ types in the public API for 5.0.0? It's a new major version, so we are free to clean as much as possible and reasonable...

Yes, this the way. First we need fix API compatibility for 4.1 because than back-porting changes will be difficult.

@stweil @noahmetzger: any comments/update on 4.1.0?

Now we reach 100%
image

There still some (strange) report about Problems with Symbols (Low Severity), but I were and are protected:
image

What's the timeline for final 5.0.0?

There is no timelime for 5.0.0

IMO 5.0 (or 5.1) should be ready to be released no later than mid-February next year, so it can be packaged and shipped for Ubuntu 20.04 LTS.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

clarkk picture clarkk  Â·  3Comments

YeisonVelez11 picture YeisonVelez11  Â·  5Comments

clarkk picture clarkk  Â·  6Comments

spajak picture spajak  Â·  4Comments

Shreeshrii picture Shreeshrii  Â·  4Comments