Tesseract: RFC: Reorganize source tree

Created on 19 Dec 2016  ·  63Comments  ·  Source: tesseract-ocr/tesseract

I'd like to propose changes to the tesseract source tree structure.
The common convention today is to have a src folder with all the program code and an include folder with the public headers. Right now we have a lot of dirs in the root, which is very annoying.
As a first stage I propose:

  1. move all sources into src
  2. move training tools from training to tools/training

Later we can try to move public headers to include directory.

The new look will be like:
[picture of the proposed source tree layout, not preserved]

If there are no objections, I'll commit changes.


All 63 comments

Make sure to get positive confirmation from Ray before doing something like this. He's considering a different, very disruptive change involving the non-LSTM recognizer. It would collide very badly with source reorganization.

Yes, I'll act only in case of positive reply from Ray.

cc: @stweil @zdenop @amitdo
What about this topic?
Recent changes are very large in terms of code style, warning cleanups.
Maybe we could hide dirs a little bit?

@jbreiden @theraysmith : can you please? It would be nice to make this reorganization as part of 4.0 release...

Related: I suggest removing some large binaries (mainly in testdata and testing) which are used for tests and adding them to a new repository, test. Like googletest, this new repository can then be added as a submodule.

cc: @Shreeshrii

Maybe we could hide dirs a little bit?

I think this is a good time to do the reorg.

I suggest removing some large binaries (mainly in testdata and testing) which are used for tests and adding them to a new repository, test. Like googletest, this new repository can then be added as a submodule.

Yes. I think it is a good idea. It would also be useful to streamline the various other test related folders at the same time.

However this may require @theraysmith / @jbreiden to create the new repo for test.

done. enjoy ;-)

How about the following first stage:
Move all the source to src.
Create pointer headers in include that #include the real header for the public headers.
Try to avoid creating any APIs that make use of the old classifier.
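
The "pointer header" idea can be sketched roughly as follows. This is only an illustration under assumptions: the layout (a forwarding file in include/tesseract/ that points at the real header under src/api/) and the helper name ForwardingHeaderText are hypothetical and not taken from the Tesseract code base.

```cpp
#include <string>

// Builds the text of a forwarding ("pointer") header.
// `real_header` is the path of the real header, relative to the directory
// that holds the forwarding header. Any concrete path passed in here is an
// assumption about the layout, not Tesseract's actual structure.
std::string ForwardingHeaderText(const std::string& real_header) {
  return "// Forwarding header: the declarations live in the file below.\n"
         "#include \"" + real_header + "\"\n";
}
```

Calling ForwardingHeaderText("../../src/api/baseapi.h") would give the contents for a hypothetical include/tesseract/baseapi.h. The public include path stays stable while the real header can keep its place in src.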

@egorpugin, will you send a pull request for the first step, moving all source to src?

Sure.

Changes are in master. Please check the build scripts. I made small updates; cmake and cppan seem OK, but autotools is untested.

The training folder should also go under src.

Thanks.

On Wed 25 Apr, 2018, 1:35 PM Egor Pugin, notifications@github.com wrote:

Changes are in master. Please, check build scripts. I did small updates,
cmake & cppan seems ok, but autotools is untested.



Done.

I would also suggest renaming the VS2010 directory, because its name is misleading.

Getting rid of vs2010 would be even better. I have a local patch for gettimeofday, so at least that part could be eliminated.

That is part of the game, but the content of the directory is needed, and it is platform (compiler?) dependent...
There is already an android directory. I have no clue whether something special is needed for iOS or Mac, but it would make sense to organize it better...

I'd appreciate it if we could reduce the number of API includes (those which get installed, currently 58).

Several of them have a high probability of a naming conflict with other include files, for example capi.h, errcode.h, functions.h, helpers.h, input.h, network.h and more. We could avoid such conflicts by requiring include statements like #include "tesseract/capi.h" instead of #include "capi.h". That's an API change of course. Can we do it nevertheless?

Are there any guidelines or standards for include statements of header files from the project?

BTW: I suggested this change some time ago for allheaders.h, but there were problems with it...

From my pov, public headers must go into the include/tesseract folder.
Then they'll be included like #include <tesseract/...>,
instead of the not very informative #include <allheaders.h>.

May I also suggest an examples folder where sample programs using the API are provided? Currently these are included as part of the wiki.

Should that examples folder be added to https://github.com/tesseract-ocr/tesseract? Or maybe to https://github.com/tesseract-ocr/test?

See https://github.com/google/googletest/tree/master/googletest
for an example of project directory structure

ShreeDevi



On Thu, Apr 26, 2018 at 4:26 PM, Stefan Weil notifications@github.com
wrote:

Should that examples folder be added to https://github.com/tesseract-ocr/tesseract? Or maybe to https://github.com/tesseract-ocr/test?



It's better to put examples inside tesseract-ocr/tesseract.

From my pov public headers must go into include/tesseract folder.
Then they'll be included like #include <tesseract/...>.

+1

Create pointer headers in include that #include the real header for the public headers.

What's the benefit here?

Probably something to do with merging the code in Google's development branch.

@theraysmith Please clarify regarding this.

A side effect of source reorganization:

The history of the source files is not carried through to the new file location, e.g. see

https://github.com/tesseract-ocr/tesseract/commits/master/src/ccmain/pageiterator.h

It may be useful to tag https://github.com/tesseract-ocr/tesseract/tree/e95ff1159e652d9b8ae6bc4aafdb196981942e6a so that it is easy to locate older versions of the source files.

Yes, that's due to the way git works. The same issue came up with the reorganization of the wiki.

Only the view of individual files' history is lost. Git's blame still works. The general history is still there.

Only the view of individual files' history is lost

That's the part I was looking at.

Please see

https://github.com/tesseract-ocr/tesseract/blob/e95ff1159e652d9b8ae6bc4aafdb196981942e6a/src/ccmain/pageiterator.h#L205-210

I thought there was duplication there and was trying to locate when the change was made. I think having a tag marking the source reorg may be helpful.

But I am not very familiar with the workings of GitHub, so maybe the experts don't need it.

bool BoundingBox(PageIteratorLevel level,
                   int* left, int* top, int* right, int* bottom) const;
bool BoundingBox(PageIteratorLevel level, const int padding,
                   int* left, int* top, int* right, int* bottom) const;

It's not a duplication. It's a feature in C++ (and other programming languages) called 'function overloading'.
https://en.wikipedia.org/wiki/Function_overloading
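
A minimal, self-contained sketch of the same pattern, for illustration only. ExampleIterator and Box are hypothetical stand-ins, not Tesseract's PageIterator, and having the shorter overload forward to the padded one with a padding of 0 is an assumption made for this example.

```cpp
#include <cassert>

// Hypothetical box type used only for this illustration.
struct Box {
  int left, top, right, bottom;
};

// Hypothetical stand-in class; not Tesseract's PageIterator.
class ExampleIterator {
 public:
  explicit ExampleIterator(Box box) : box_(box) {}

  // Overload 1: no padding argument. Forwards to overload 2 with padding 0.
  bool BoundingBox(int* left, int* top, int* right, int* bottom) const {
    return BoundingBox(0, left, top, right, bottom);
  }

  // Overload 2: same name, but a different parameter list (extra padding).
  // The compiler picks the overload from the arguments at the call site.
  bool BoundingBox(int padding, int* left, int* top, int* right,
                   int* bottom) const {
    assert(left && top && right && bottom);
    *left = box_.left - padding;
    *top = box_.top - padding;
    *right = box_.right + padding;
    *bottom = box_.bottom + padding;
    return true;
  }

 private:
  Box box_;
};
```

A call with four pointer arguments resolves to the first declaration, and a call with a leading int resolves to the second, so the two declarations coexist without conflict.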

@amitdo Got it now. Thank you for explaining.

The current code implements what was discussed above: third party code now includes the Tesseract API with code like #include "tesseract/capi.h", so the problem with potential name conflicts is fixed.

If we want to write the include statements in the Tesseract code similarly, more code would have to be moved, and Ray's suggestion would be one possible solution. I currently see no need for that.

Correct me if I'm wrong, but you only deal with use of the C API, not the C++ API.

As I said elsewhere, some bindings use the C++ API directly, and of course there are users that use the C++ API in their C++ code.

The latest changes include header files for C and for C++.

Do you know of bindings or other third-party applications which use more than these files (from the Python binding):

#include <tesseract/baseapi.h>
#include <tesseract/publictypes.h>
#include <tesseract/resultiterator.h>

The R binding also uses

#include <params.h>

These are the only bindings I know that use the C++ API directly.

What do you think about moving tesseractmain.cpp outside of api?

Either to the /src, or to a new dir under it (bin / prog / exec).

And maybe we should also rename the file to tesseract.cpp.

@amitdo: I agree. It is "just" an example of how to use the tesseract API ;-), but it is not part of the API.

We could put it under /tools, or even better under /examples, as it really is an example.

Should something else (e.g. renaming / moving tesseractmain somewhere) be done for the 4.0 release, or can we close the issue?

@egorpugin commented on Dec 19, 2016

Later we can try to move public headers to include directory.

What about this part of reorganizing the source tree?

The proposition is still there.

Upd.:
Moreover, I have found the following organization of includes (files) very useful:

include/project name here/...
src/any structure you like/...

for tess:
include/tesseract/...

Later, when installing headers, the public include dir points inside the include dir, so users write #include <project name here/header.h>, or #include <tesseract/header.h> in our case.

I mean, sometimes the installation process takes your headers and puts them under a newly created project-name dir. But already having that layout in the project's own source tree imposes discipline on the other parts of the project (sources etc.).
Project sources begin to use #include <tesseract/header.h> as well.

+1

Can we agree on moving all programs to a directory named tools?

I am fine with it, but IMO tesseractmain should be separated from training tools...

The training tools should go to tools/training.

Agree?

I agree.
Tessmain could be put into tools/tess..main.cpp.

Let this be open for some time.

One thing is that maybe singular form will be better.
tool/...

Because we have include/src/test/unittest/googletest/...

One thing is that maybe singular form will be better.
tool/...

@zdenop, @stweil
Do you agree?

If the training tools will be part of this directory, then the plural is the right form IMO.

@egorpugin,

Let's split this to 2 stages.

Can you please send a PR with the first stage:

  • Create a new dir named tools.
  • Move tesseractmain.cpp to '/tools'
  • tesseractmain.cpp -> tesseract.cpp
  • Fix CMake, sw

Second stage (separate PR) will handle the training tools.

Alternate proposal:

  • Move all sources for libtesseract to new lib/.../... (directory structure under lib similar to the current one under src, or split for legacy, lstm and maybe common)
  • Sources for tesseract and training tools stay at their current place
  • Move Python training scripts to tesstrain project

@zdenop, @egorpugin

I don't like lib dir.

As I mentioned elsewhere: I would prefer that we be able to provide the tesseract library (libtesseract?) as an individual repository (e.g. without the training tools and the tesseract executable).

Background: several interesting projects have integrated tesseract (OpenCV, BluePrism, Apache Tika), and I would like to see more such integrations, with the hope of getting contributions from such integrators (or at least they could provide results from their internal testing). For example, one integrator suggested allowing a custom OCR engine to be plugged in to tesseract. The reason is that they found tesseract not efficient for amount fields, so they use a custom OCR engine for such fields.

How are separate repos related to other projects & integrations?

If training tools (TTs) are needed for everyday usage, they must be near.
If they are used rarely, it's ok to move them out.
And even with that it is questionable to me to split things apart. TTs look good here. We just need to split them from sources a bit more (in separate folder etc.).

Also, TTs depend on internal tess headers; that's a +1 for not moving them out (at least for now).

Separate repositories might be considered as soon as we have a code base which no longer requires major API changes. I don't think we have reached that goal yet.

Changes like the removal of proprietary data types or the addition of a logging API would be complicated with several repositories which have to be synchronized.

We already have that situation with heavily used third party code like the Python wrapper tesserocr which currently cannot be used with our Git master because of commit 90bcff3732db2b732b4e329848c4a89677e339d2.

Also, it would be very useful to drop the EXPORT_ALL_SYMBOLS stuff and export the needed symbols manually using TESS_API.
As I understand it, without EXPORT_ALL_SYMBOLS there will be link errors on Windows.
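
For readers unfamiliar with manual symbol export, here is a rough sketch of the general technique. EXAMPLE_API, EXAMPLE_SHARED, EXAMPLE_BUILDING_DLL and ExampleAdd are hypothetical names chosen for this illustration; this is not Tesseract's actual TESS_API definition.

```cpp
// Hypothetical export macro, for illustration only.
// EXAMPLE_SHARED:       defined when the library is built or used as a DLL/.so.
// EXAMPLE_BUILDING_DLL: defined only while compiling the library itself.
#if defined(EXAMPLE_SHARED) && defined(_WIN32)
#  if defined(EXAMPLE_BUILDING_DLL)
#    define EXAMPLE_API __declspec(dllexport)  // library side: export symbol
#  else
#    define EXAMPLE_API __declspec(dllimport)  // client side: import symbol
#  endif
#elif defined(EXAMPLE_SHARED) && defined(__GNUC__)
#  define EXAMPLE_API __attribute__((visibility("default")))
#else
#  define EXAMPLE_API  // static build or single-file build: expands to nothing
#endif

// Marked as part of the public interface, so it is exported explicitly.
EXAMPLE_API int ExampleAdd(int a, int b) { return a + b; }

// Not marked: an internal helper that a shared build need not export.
int ExampleInternalDouble(int x) { return ExampleAdd(x, x); }
```

With this approach every public function or class has to carry the macro. Any symbol a client needs that is left unmarked shows up as a link error on Windows, which matches the concern raised above.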

About separating between the library and the tools.

We have multiple suggestions here, so how do we move forward?
