Tesseract: good accuracy but too slow, how to improve Tesseract speed

Created on 10 Mar 2016 · 78 Comments · Source: tesseract-ocr/tesseract

I integrated Tesseract C/C++, version 3.x, to read English OCR on images.

It’s working pretty good, but very slow. It takes close to 1000ms (1 second) to read the attached image (00060.jpg) on my quad-core laptop.

I’m not using the Cube engine, and I’m feeding only binary images to the OCR reader.

Is there any way to make it faster? Any ideas on how to make Tesseract read faster?
Thanks.

OpenMP SIMD performance

All 78 comments

You can already run 4 parallel instances of Tesseract on your quad core, then it will read 4 images in about the same time. Introducing multi threading would not help to reduce the time needed for an OCR of many images. I am working on a project where OCR with Tesseract would take nearly 7 years on a single core, but luckily I can try to get many computers and use their cores, so the time can be reduced to a few days.
Using compiler settings which are optimized for your CPU helps to gain a few percent, but I am afraid that for a larger gain different algorithms in Tesseract and its libraries would be needed.

Besides the OCR, we have other things that need to run on the other cores.
I believe the main issue that's slowing down Tesseract is the way memory is managed.
Too many memory allocations (new) and releases (delete or delete[]) slow down the reader.
In the past, I used a different OCR engine that allocated large buffers up front to store all the needed data (a large buffer of blobs, a large buffer of lines, a large buffer of words and their corresponding data). The buffers were simply indexed as data was read from an image. The large buffers were allocated only once, upon OCR engine initialization, and released only once, upon OCR engine shutdown. This memory management scheme was very efficient in terms of computation time.
Are there any settings for Tesseract that are known to be computationally intensive?
Any tricks to speed up Tesseract?

What evidence is your memory management speculation based on?

I'm not speculating anything. The reality is that Tesseract takes more than 3 seconds to read the image that I initially attached above (I use VS2010). When I use the console test application that comes with Tesseract, it takes about the same time (more than 3 seconds).

Anyone would speculate a lot in 3 seconds

I have more than 20 years in machine vision. I used several OCR engines in the past. Actually I have one -in house- that reads the same image in less than 100ms, but our engine is designed more for reading a single line of text (i.e. it returns a single line of text).

Tesseract's database is not that large. Most of the techniques used by Tesseract are quite standard in the OCR area (page layout, line extraction, possible character extraction, word forming, and then several phases of classification). However, Tesseract manages memory usage very badly. Why do I say that? It takes more than 3 seconds to read a typical image of text.

Please, if you're not bringing any meaningful ideas to my post, just spare me your comment.

@ychtioui, as you have spent many years in machine vision, you know quite well that there are many reasons why programs can be slow. Memory management is just one of them. Even with a lot of experience, I'd start by running performance analyzers to investigate performance issues. Of course I can guess what the possible reasons might be and try to improve the software based on those guesses, but improvements based on evidence (like the result of a performance analysis) are more efficient. Don't you think so, too? Do you have a chance to run a performance analysis?

You can try to use the 3.02 version if you need only English. AFAIR it was significantly faster on my (old) computer.

Zdenko

I'm running version 3.02
I'm going through different sections of the reader, and checking which section is taking the most time.

Is it typical for images such as the one I attached above to take a few seconds to read?

thanks for your comments.

... 3.02 version ... AFAIR it was significantly faster on my (old) computer.

3.02.02 is compiled with '-O3' by default.
https://github.com/tesseract-ocr/tesseract/blob/3.02.02/configure.ac#L161

3.03 and 3.04 are compiled with '-O2' by default.
https://github.com/tesseract-ocr/tesseract/blob/3.03-rc1/configure.ac#L201
https://github.com/tesseract-ocr/tesseract/blob/3.04.01/configure.ac#L300

2.04 and 3.01 are compiled with '-O2' by default.
https://github.com/tesseract-ocr/tesseract/blob/2.04/configure.ac
https://github.com/tesseract-ocr/tesseract/blob/3.01/configure.ac
The 'configure.ac' script in these versions does not explicitly set the '-O' level, so autotools will use '-O2' as default.

thanks amitdo.
I'm using 3.02 but the C/C++ version of Tesseract.
I couldn't find the setting -O3 in the source files. where is it?

What I linked to was actually 3.02.02

I think this is 3.02:
https://github.com/tesseract-ocr/tesseract/blob/d581ab7e12a2fac4a73ac0af4ce7ec522b8f3e42/configure.ac

You are right. It does not contain any '-On' flag, so if you are using autotools to build Tesseract, it will instruct the compiler to use '-O2'.

I assume you are using Tesseract on Linux / FreeBSD / Mac. On Windows + MS Visual C++ the configure.ac file is irrelevant.

@ychtioui said in a post above "I use VS2010" so using Windows.

Thanks Shree.

I don't know which optimization level is used for Visual C++.

I use VS2010 on a Windows 7 PC.
Project settings or build options won't change the read speed much.
Tesseract was designed in research labs. Most of the key sections of the reader were written without regard for speed.
I used some performance tools to analyze where most of the computation time is spent.
In the page layout section, the blob analyzer does a lot of new/delete. This is very time consuming. The attached image above has more than 3600 blobs. In addition, several processing steps are run on each blob (distance transform, finding the enclosing rectangle, measuring blob parameters, etc.). Allocating (new) and releasing (delete) all these blobs is very time consuming.
Instead, we could allocate a global array of blobs (BLOBNBOX objects, to be exact) up front and, whenever we need a blob, just take the next index from the array. The array would be released once, when we shut down the engine.
I used this concept in another single-line OCR reader and it's super fast.

VS2010 uses the optimization flag /O2 (Maximize speed); other flags are set to their defaults.
In the past there were warnings in the forum against using compiler optimization flags, because they can also affect OCR results. This is the reason why the standard optimization flags are used (-O2 in autotools and /O2 in VS).

I tried to run perf tool on linux:
perf record tesseract eurotext.tif eurotext
and I got this report (perf report):

  39,77%  tesseract  libtesseract.so.3.0.4  [.] tesseract::SquishedDawg::edge_char_of
  13,98%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeCharNormArrays
  13,09%  tesseract  libtesseract.so.3.0.4  [.] IntegerMatcher::UpdateTablesForFeature
   4,22%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::PruneClasses
   2,66%  tesseract  libtesseract.so.3.0.4  [.] ScratchEvidence::UpdateSumOfProtoEvidences
   1,48%  tesseract  libtesseract.so.3.0.4  [.] ELIST_ITERATOR::forward
   1,16%  tesseract  libc-2.19.so           [.] _int_malloc
   1,15%  tesseract  libtesseract.so.3.0.4  [.] tesseract::ShapeTable::MaxNumUnichars
   1,01%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ExpandShapesAndApplyCorrections
   0,87%  tesseract  liblept.so.5.0.0       [.] rasteropLow
   0,79%  tesseract  libm-2.19.so           [.] __mul
   0,72%  tesseract  libtesseract.so.3.0.4  [.] FPCUTPT::assign
   0,71%  tesseract  libc-2.19.so           [.] _int_free
   0,71%  tesseract  libtesseract.so.3.0.4  [.] ELIST::add_sorted_and_find
   0,61%  tesseract  libtesseract.so.3.0.4  [.] tesseract::AmbigSpec::compare_ambig_specs
   0,57%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeNormMatch
   0,52%  tesseract  libc-2.19.so           [.] memset
   0,49%  tesseract  libc-2.19.so           [.] vfprintf
   0,45%  tesseract  libc-2.19.so           [.] malloc
   0,36%  tesseract  libtesseract.so.3.0.4  [.] SegmentLLSQ
   0,31%  tesseract  libm-2.19.so           [.] __ieee754_atan2_sse2
   0,31%  tesseract  libc-2.19.so           [.] malloc_consolidate
   0,30%  tesseract  libtesseract.so.3.0.4  [.] LLSQ::add
   0,29%  tesseract  libtesseract.so.3.0.4  [.] GenericVector<tesseract::ScoredFont>::operator+=
   0,29%  tesseract  libtesseract.so.3.0.4  [.] _ZN14ELIST_ITERATOR7forwardEv@plt
   0,28%  tesseract  libtesseract.so.3.0.4  [.] tesseract::ComputeFeatures
   0,25%  tesseract  liblept.so.5.0.0       [.] pixScanForForeground
   0,24%  tesseract  libtesseract.so.3.0.4  [.] GenericVector<tesseract::ScoredFont>::reserve
   0,20%  tesseract  libtesseract.so.3.0.4  [.] C_OUTLINE::increment_step
   0,20%  tesseract  [kernel.kallsyms]      [k] clear_page

According to this report, the top 3 functions consumed nearly 67% of the "time".
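The share of the top three functions can be re-derived from the report itself. The following is only a small sketch of that arithmetic; the helper name `parse_percent` is made up for illustration, and the values are copied by hand from the eurotext.tif report above (perf printed them with a decimal comma).

```python
# Top three entries copied from the eurotext.tif perf report above.
top_entries = {
    "tesseract::SquishedDawg::edge_char_of": "39,77%",
    "tesseract::Classify::ComputeCharNormArrays": "13,98%",
    "IntegerMatcher::UpdateTablesForFeature": "13,09%",
}


def parse_percent(text):
    """Convert a perf overhead string such as '39,77%' to a float."""
    return float(text.rstrip("%").replace(",", "."))


top3_share = sum(parse_percent(value) for value in top_entries.values())
print(f"Top 3 functions: {top3_share:.2f}% of samples")
```

None of the three hot functions is an allocator; `_int_malloc`, `_int_free`, `malloc` and `malloc_consolidate` together account for under 3% in the same report.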

Then I tried a 4-page A4 TIFF (G4 compressed):

  52,24%  tesseract  libtesseract.so.3.0.4  [.] tesseract::SquishedDawg::edge_char_of
  12,06%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeCharNormArrays
  10,06%  tesseract  libtesseract.so.3.0.4  [.] IntegerMatcher::UpdateTablesForFeature
   3,57%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::PruneClasses
   1,90%  tesseract  libtesseract.so.3.0.4  [.] ScratchEvidence::UpdateSumOfProtoEvidences
...

Then I tried a non-English image (perf record tesseract hebrew.png hebrew -l heb):

  27,79%  tesseract  libtesseract.so.3.0.4  [.] IntegerMatcher::UpdateTablesForFeature
  27,34%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeCharNormArrays
   4,40%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::PruneClasses
   3,98%  tesseract  libtesseract.so.3.0.4  [.] ScratchEvidence::UpdateSumOfProtoEvidences
   3,05%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeNormMatch
   2,36%  tesseract  libtesseract.so.3.0.4  [.] tesseract::ShapeTable::MaxNumUnichars
   2,05%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ExpandShapesAndApplyCorrections
...

Just for the record, as a possible improvement for this issue: there was interesting information posted in the Scan Tailor project: OpenCL alone only brings a ~2x speed-up. Another ~6x speed-up comes from multi-threaded processing.

Hi @ychtioui, I am a newbie and saw in your first comment that you are able to get pretty accurate results from Tesseract. For your image, I am not able to get any results; it just says: Can't recognize image. Can you please provide a code snippet showing how you are processing the image?
Thanks - Anant.

@theraysmith
What do you use in the internal Google build, -O2 or -O3?

I'm interested in the same answer, @amitdo . Can you answer the question, @theraysmith ? It really can help us :)

Don't expect much difference between -O2 and -O3. I tried different optimizations, and they only have small effects on the time needed for OCR of a page. Higher optimization levels can even result in slower code because the code gets larger (because of loop unrolling), so CPU caches become less effective. It is much more important to write good code.

That is a surprisingly hard question to answer in the Google environment!

I use 'opt' mode, which, after some digging, I found maps to -O2.
In addition, the following are explicitly added:

  • -fopenmp, which will deliver a major improvement (3x faster) if you do not have it, and a corresponding -lgomp for the linker
  • arch/dotproductavx.cpp is compiled with -mavx
  • arch/dotproductsse.cpp (and actually all the rest of the code) is compiled with -msse4.1

I thought all this stuff was in the autotools files already, or are you looking to convert these to Windows?

--
Ray.

The improvement by using -fopenmp is useful when you want "realtime" OCR – running OCR for a single page and waiting for the result. Then it is fast because it uses more than one CPU core for some time consuming parts of the OCR process.

For mass OCR, it does not help. If many pages have to be processed, it is better to use single threaded Tesseract and run several Tesseract processes in parallel.
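For mass OCR, the approach described above could be scripted roughly as follows. This is only a sketch under a few assumptions: the `tesseract` executable is on the PATH, each input is a single image file, and `OMP_THREAD_LIMIT=1` (the variable reported to work later in this thread) is used to keep each process single-threaded. The helper names (`single_thread_env`, `ocr_command`, `ocr_one`, `ocr_many`) are invented for this example.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def single_thread_env():
    """Copy of the current environment with OpenMP limited to one thread."""
    env = dict(os.environ)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def ocr_command(image_path):
    """Build 'tesseract <image> <outputbase>'; text goes to <outputbase>.txt."""
    output_base = os.path.splitext(image_path)[0]
    return ["tesseract", image_path, output_base]


def ocr_one(image_path):
    """Run one Tesseract process and return its exit code."""
    result = subprocess.run(ocr_command(image_path), env=single_thread_env())
    return result.returncode


def ocr_many(image_paths, workers=None):
    """Run up to `workers` Tesseract processes at once (default: one per core)."""
    workers = workers or os.cpu_count() or 1
    # Threads are sufficient here: each one only waits for its child process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_one, image_paths))


if __name__ == "__main__":
    # Usage: python parallel_ocr.py page1.tif page2.tif ...
    print(ocr_many(sys.argv[1:]))
```

Each image is handled by its own process, so total throughput scales with the number of cores while no single page gets faster.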

Stefan, what about using OpenMP for training?

Yes, for training a single new model OpenMP could perhaps speed up the training process. Up to now, OpenMP is only used in ccmain/ and in lstm/. I don't know how much that part is used during training, and I have never run a performance evaluation for the training process (in fact I have only run LSTM training once for Fraktur, and as I already said, it was not really successful).

OpenMP speeds up training by about 3.5x, since it runs 4 threads (one for
each part of the LSTM) and spends >90% of CPU time computing the LSTM
forward/backward.

--
Ray.

Can I set more than 4 threads for training the LSTM?

No, it doesn't help. The parallelism is limited by the implementation of
the LSTM as 4 matrix-vector products.
When I experimented with more threads for some of the other operations (eg
the output softmax), it slowed down because the cache coherency was lost.
I also experimented with breaking the matrix-vector products up further (eg
splitting the input from the recurrent part), but openMP doesn't seem too
good at allocating the threads in a way that keeps the cache coherency.
Each thread needs to run the same part of the weights matrix for each
timestep, and that is difficult to achieve with the recurrent nature of the
LSTM.

--
Ray.

What about machines that have only 2 cores?
Shouldn't 'num_threads' be lowered to 2 in that case?

It still works. It just takes longer.

--
Ray.

@theraysmith I want to train Tesseract 4 for the Arabic language. Do you mean that there is no way to speed up the training process?

"I have more than 20 years in machine vision. I used several OCR engines in the past. Actually I have one -in house- that reads the same image in less than 100ms, but our engine is designed more for reading a single line of text (i.e. it returns a single line of text)."

In the context of your message, I have to read a single line and it still takes 1 second to process. How did you minimise the processing time?

Thanks

@zdenop Please label

Performance

@ShounakCy
Our in-house OCR reader is super fast at reading single lines in multiple fonts. It's proprietary (not open source).
Tesseract 4.x is much more accurate than 3.x since it uses neural networks.
I believe the key to improving Tesseract's speed is to use OpenCL.

What is the cost of your Tesseract 4.x? I would like to integrate it into our Python or C# code.

Thanks

Hi, sorry if this is the wrong place to ask, but how are some users achieving very fast speeds compared to what I am getting? It takes me close to 4 seconds to run the OP's image. This user seems to run a 6-page PDF through Tesseract in a matter of seconds, whereas it takes me minutes to run through that many pages of similar text. I have a Ryzen 3 1200 and 8 GB RAM. I have installed versions 3.02, 3.04, 3.05, and 4.00, all with the same results.

Yes, this is the wrong place to post questions. As you can see, that user is using the version provided by his distribution. His speed is related to:

  • the power of his computer
  • the complexity of the input document.

I'm using Tesseract 3.04 with ara.traineddata, and of course I also use the Cube files. Initialization takes too much time: it takes 15 minutes just to initialize. Any idea how to improve that?

I'm using Visual Studio 2013.

Please try with latest 4.0 beta with tessdata_fast files.

I have tested 4.0; it's very good and fast.
The reason why I'm using 3.04 is that I have many other libraries built with Visual Studio 2013, and Tesseract 4 does not support VS 2013. That means if I want to use 4.0, I have to rebuild all the libraries again.

If you have any suggestions, please let me know.

I am also having a similar issue. I have more than 50K files; I ran OCR and it took 12 hours to process only 1000 PDFs. How can I make Tesseract faster? Can using Hadoop make it faster?

Have you tried OMP_NUM_THREADS=1 tesseract ... as described in https://github.com/tesseract-ocr/tesseract/issues/898 ?

How do I use "OMP_NUM_THREADS=1 tesseract" in R?

Have you tried OMP_NUM_THREADS=1 tesseract ... as described in #898 ?

OMP_NUM_THREADS=1 will have no impact.

https://github.com/tesseract-ocr/tesseract/issues/898#issuecomment-300549643

Something that DOES work:
https://github.com/tesseract-ocr/tesseract/issues/898#issuecomment-315202167

@amitdo oops I copied from the wrong comment. Indeed OMP_THREAD_LIMIT=1 tesseract... is what worked for me

I am still not clear how to improve the speed .... my code is in R and I used ocr function .... where should I use "OMP_THREAD_LIMIT=1 tesseract..."

@SandeepShaw2017 I'm not sure I can help. I don't know much about R, so I can only give some general advice: if you are calling Tesseract's functions directly from your R code, then maybe you have to set it when running your own app, e.g. from the command line with OMP_THREAD_LIMIT=1 ./my-R-script or via some System.setEnv("OMP_THREAD_LIMIT", 1); If you use Tesseract as an application that your R code executes (e.g. via System.exec() or something), then you need to set the environment variable OMP_THREAD_LIMIT=1 for that process in whatever way R does it, or maybe via the same method as in the former case if the child process inherits the environment variables. You should do your own googling; this seems to be an R-specific issue rather than one of Tesseract's.
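The two general options mentioned above (set the variable in your own process, or pass it to the child process) look roughly like this in Python; R has its own equivalents. This is only an illustration of environment inheritance: a tiny Python child process stands in for `tesseract` so that the sketch runs without Tesseract installed.

```python
import os
import subprocess
import sys

# Option 1: set the variable in the parent; children started later inherit it.
os.environ["OMP_THREAD_LIMIT"] = "1"

# Option 2: pass an explicit environment to one specific child process.
child_env = dict(os.environ, OMP_THREAD_LIMIT="1")

# A small Python child stands in for `tesseract` here and reports what it sees.
child_code = "import os; print(os.environ.get('OMP_THREAD_LIMIT'))"
seen_by_child = subprocess.run(
    [sys.executable, "-c", child_code],
    env=child_env,
    capture_output=True,
    text=True,
    check=True,
).stdout.strip()

print("Child process sees OMP_THREAD_LIMIT =", seen_by_child)
```

If the OCR library is loaded into your own process rather than started as a separate program, the variable most likely has to be set before that library is initialized.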

Even after setting Sys.setenv(OMP_THREAD_LIMIT = 1), it still takes more than 20 seconds. Can processing with R on Hadoop (rmr2) help to reduce the processing time?

Use multi-threading in your application. Initialize N instances of TessBaseAPI. N should be the number of
CPU cores. Each instance should handle a different image.

Dear Amit .... i am having 4 cores ... so does that mean I will be using the ocr tool in 4 consoles of RStudio ....

I don't know R. Just try and see.

Adding more consoles which run R with Tesseract until all CPU cores are fully used is one way how you can get maximum throughput.

Hi, I have a similar but slightly different problem here.
I am using Python 3.7 with Tesseract 3.02. And I am new to Tesseract.
I used pytesseract.image_to_string function, and it took me a long duration on the "first run".

result for first run:

'Cuz my associate professor 
at college advises the club. 

Duration: 259.72785544395447

result for second run:

'Cuz my associate professor
at college advises the club.

Duration: 0.9130520820617676

Can anyone please explain to me why this happens? This is only my 2nd day using Tesseract.
Thank you.


complete python code:

```python
import pytesseract
from PIL import Image
import time
pytesseract.pytesseract.tesseract_cmd = "C:/Program Files (x86)/Tesseract-OCR/tesseract.exe"

start_time = time.time()

img = Image.open("Pic5.png")
print("start")
result = pytesseract.image_to_string(img, config='--tessdata-dir "C:/Program Files (x86)/Tesseract-OCR/tessdata"')
print(result)

duration = time.time() - start_time
print("\nDuration:", duration)
```

@WaltPeter, you are obviously running on Windows, so anything can happen in the background and delay your test, for example AV scans, disk defragmentation or software updates. Try running your test many times to see how times vary.

A Python program name.py is compiled at the first run into a name.pyc, but that should not take more than a second. You can remove all *.pyc files to force a new compilation.
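To see how much the timings vary between runs, a small harness like the one below can be used. It is a generic sketch, not tied to Tesseract: `time_runs` and `summarize` are names made up for this example, and the workload is a stand-in that you would replace with your own OCR call.

```python
import statistics
import time


def time_runs(func, repeats=5):
    """Call func() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return the first-run time plus the minimum and median of all runs."""
    return {
        "first": durations[0],
        "min": min(durations),
        "median": statistics.median(durations),
    }


if __name__ == "__main__":
    # Stand-in workload; replace it with e.g.
    # lambda: pytesseract.image_to_string(img) to time real OCR calls.
    print(summarize(time_runs(lambda: sum(range(100_000)))))
```

A first run that is far slower than the median points to one-time costs (disk caching, antivirus scans, start-up work) rather than to the OCR itself.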

So, guys... How speed things up? Any practical ideas?

I get the same issue with Tesseract 4.0.0 beta on my CentOS 7.3 setup.
It takes 0.91 seconds to detect one character.
Are there any updates on this issue?

Just a detail, but I recommend using OMP_THREAD_LIMIT=1 so that tesseract runs in single thread mode.

By default, tesseract runs in multithread mode but apparently this just burns out CPU cycles without benefits. Here is an example on a 4 cores machine:

root@ubuntu-16gb-nbg1-1:/fv# export OMP_THREAD_LIMIT=1
root@ubuntu-16gb-nbg1-1:/fv# time tesseract 2.tif 2.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
...
Page 12

real    0m34.300s
user    0m33.682s
sys     0m0.617s
root@ubuntu-16gb-nbg1-1:/fv# export OMP_THREAD_LIMIT=4
root@ubuntu-16gb-nbg1-1:/fv# time tesseract 2.tif 2.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
...
Page 12

real    0m31.943s
user    1m19.374s
sys     0m1.346s

It consumes about 2.4 times as much CPU time (user + sys) while not even being 10% faster.
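The cost/benefit of the two runs can be worked out from the `time` output above. The sketch below only redoes that arithmetic; the values are copied by hand from the two runs, and the helper names are invented for illustration.

```python
import re


def to_seconds(stamp):
    """Convert a bash `time` value such as '1m19.374s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", stamp).groups()
    return int(minutes) * 60 + float(seconds)


# Values copied from the two runs above.
one_thread = {"real": "0m34.300s", "user": "0m33.682s", "sys": "0m0.617s"}
four_threads = {"real": "0m31.943s", "user": "1m19.374s", "sys": "0m1.346s"}


def cpu_seconds(run):
    """Total CPU time is user time plus system time."""
    return to_seconds(run["user"]) + to_seconds(run["sys"])


cpu_ratio = cpu_seconds(four_threads) / cpu_seconds(one_thread)
wall_saving = 1 - to_seconds(four_threads["real"]) / to_seconds(one_thread["real"])

print(f"CPU time ratio (4 threads / 1 thread): {cpu_ratio:.2f}")
print(f"Wall-clock time saved with 4 threads: {wall_saving:.1%}")
```

For these numbers the four-thread run uses roughly 2.35 times the CPU time and saves roughly 7% of wall-clock time.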

Yes, I can confirm that. For mass production I'd even build an executable without OpenMP support (configure --disable-openmp ...) to remove the remaining overhead.

Sounds like we should change the build defaults if OpenMP is providing no real benefit.

I have no idea how the multithreading takes place but I have a feeling it's too low level, resulting in more overhead than gains. If the document's pages as a whole would be processed in parallel, that would probably be a real boost!

I turned off the default OpenMP usage for cmake.
A patch for autotools is welcome (I will not be able to get to my Linux machine soon).

Hi @ychtioui ,

I am in the same situation as you. I have many single-line text images, and I would like to know whether you can suggest a fast and good OCR engine like the one you mention in your example.

Thanks in advance.

Use multi-threading in your application. Initialize N instances of TessBaseAPI. N should be the number of
CPU cores. Each instance should handle a different image.

Can you explain this @amitdo

Yes, I can confirm that. For mass production I'd even build an executable without OpenMP support (configure --disable-openmp ...) to remove the remaining overhead.

@stweil Apart from disabling OpenMP, would you suggest any other changes to increase speed?

We noticed some time ago that the Linux kernel version can have a huge effect on the OCR performance, namely more than 20 % slower in some versions because of workarounds for SPECTRE and MELTDOWN. Those workarounds can be disabled using kernel parameters. I expect similar effects for other operating systems, too.

@stweil: I'd like to ask whether that is viable in a cloud environment, since I'd be deploying to the cloud and I don't know whether it can be done there. Also, referring to zdenko's comment on OpenCL for more speed: when running ./configure --help, I noticed the option --enable-opencl ("enable opencl build [default=no]"). Do you think that would help too?

My personal experience is that Tesseract runs best on real hardware, virtual machines / cloud environments are often slower. There is initial experimental support for OpenCL in the Tesseract code, but as it is only initial and experimental, I cannot recommend it unless you want to work on improving it. You won't see better performance with the current code.

I didn't include this in the last comment, but there is also the optional package --with-tensorflow ("support TensorFlow [default=check]"). I am guessing that it has to do with the LSTM network, but is it for CUDA-based GPU usage?

Indeed that would be another way to get faster OCR, but it requires special traineddata model files for Tensorflow. As far as I know nobody has ever created such a file and used it with Tesseract + Tensorflow.

I just ran some simple speed tests in Python with the current code (Intel Core i7-6600U CPU @ 2.60GHz, 2801 MHz, Cores: 2, Threads: 4; Windows 64 bit). Here are the results (durations in seconds):

|Optimization | tessdata_best | tessdata_fast | tessdata |
| --- | ---: | ---: | ---: |
| None | 48.9555 | 8.0645 | 13.3477 |
|AVX, AVX2, FMA, SSE| 19.0863 | 3.3139 | 4.9020 |
| Improvement None/AVX | 156% | 143% | 172%|
| Additional: | | | |
| None + no_invert | 35.4278 | 5.0341 | 11.4808 |
|AVX, AVX2, FMA, SSE + no_invert | 13.8921 | 2.7461 | 3.6696 |
| Improvement AVX/AVX no_invert | 37% | 21% | 34%|
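The "Improvement" rows appear to be computed as (slower time / faster time − 1) × 100. That formula is an inference from the numbers, not something stated in the comment; the sketch below checks it against the durations copied from the table.

```python
# Mean durations in seconds, copied from the table above.
timings = {
    "tessdata_best": {"none": 48.9555, "avx": 19.0863, "avx_no_invert": 13.8921},
    "tessdata_fast": {"none": 8.0645, "avx": 3.3139, "avx_no_invert": 2.7461},
    "tessdata": {"none": 13.3477, "avx": 4.9020, "avx_no_invert": 3.6696},
}


def improvement_percent(slow, fast):
    """How much longer the slow run takes than the fast one, in percent."""
    return (slow / fast - 1) * 100


for model, t in timings.items():
    simd = improvement_percent(t["none"], t["avx"])
    no_invert = improvement_percent(t["avx"], t["avx_no_invert"])
    print(f"{model}: None/AVX {simd:.0f}%, AVX/AVX no_invert {no_invert:.0f}%")
```

Read this way, "156%" means the build without SIMD takes about 2.56 times as long as the AVX build for tessdata_best.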

UPDATE 2019-10-06: recent Tesseract code allows the option "-c tessedit_do_invert=0", which brings extra speed.

I used the image from this issue, the eng language, no OpenMP, and no additional parameters (i.e. default oem, psm, ...). Each duration is the arithmetic mean of 5 runs of the test code.

Interestingly, there is no big difference in OCR quality between the tessdata_fast, tessdata and tessdata_best models (for this image).

import os
import timeit

# Paths are specific to the author's machine.
tess_exe = r"f:\Project-Personal\tesseract\build.clang_no_avx\bin\tesseract.exe"
test_image = r"f:\Project-Personal\tesseract.test\i263_speed.jpg"
os.environ['TESSDATA_PREFIX'] = r"f:\Project-Personal\tessdata_best\tessdata"

# Statement timed by timeit; the two {} placeholders are filled in below.
code_to_test = """
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"{}"
pytesseract.pytesseract.image_to_string(r"{}", lang='eng')
"""

# Average over 5 runs; both placeholders (executable and image) must be supplied.
elapsed_time = timeit.timeit(code_to_test.format(tess_exe, test_image), number=5) / 5
print("\nDuration:", elapsed_time)

The Linux kernel and kernel parameters also have a significant effect on the performance of Tesseract (both for recognition and training). In particular, the first kernels which tried to fix Spectre and similar CPU bugs make it really slow. I recently noticed that Tesseract with Debian GNU/Linux (testing / bullseye) is faster when running in the Windows Subsystem for Linux. Running on a Linux kernel with the default settings is slightly slower than running on the Windows kernel.

With the kernel parameters from https://make-linux-fast-again.com/ Tesseract gets faster by about 10 to 20 % and is then faster than in the Windows Subsystem for Linux.
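Before changing kernel parameters, it can help to check which mitigations are currently active. On recent Linux kernels they are listed under `/sys/devices/system/cpu/vulnerabilities`; the sketch below just reads that directory. The function name `read_mitigations` is made up for this example, and it returns an empty result on systems without that interface.

```python
from pathlib import Path

# sysfs directory where recent Linux kernels report CPU vulnerability status.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigations(directory=VULN_DIR):
    """Map each CPU vulnerability name to the kernel's status string."""
    if not directory.is_dir():
        return {}  # not Linux, or a kernel without this interface
    status = {}
    for entry in sorted(directory.iterdir()):
        try:
            status[entry.name] = entry.read_text().strip()
        except OSError:
            status[entry.name] = "unreadable"
    return status


if __name__ == "__main__":
    for name, state in read_mitigations().items():
        print(f"{name}: {state}")
```

Keep in mind that disabling these mitigations trades security for speed, which matters especially on shared or internet-facing machines.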

@zdenop How can I get the AVX, AVX2, FMA or SSE optimizations?

It is used automatically if your computer provides them.

For texts without inverted text, significantly faster OCR is possible when tesseract is called with -c tessedit_do_invert=0; see the timing results above.

Is it possible to set -c tessedit_do_invert=0 at runtime, or do we need to build Tesseract with this option?

It's a runtime option:

tesseract in.png out -c tessedit_do_invert=0

Are you aware of whether or not the pytesseract has that option available?

I'm not familiar with pytesseract.

Are you aware of whether or not the pytesseract has that option available?

The answer is on the pytesseract homepage:

config String - Any additional custom configuration flags that are not available via the pytesseract function. For example: config='--psm 6'
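Based on that description of the `config` argument, passing the option through pytesseract could look like the sketch below. It assumes pytesseract and the tesseract executable are installed; the helper names `invert_off_config` and `ocr_without_inversion` are invented for this example, and whether the option helps depends on your Tesseract version and on your images containing no inverted text.

```python
def invert_off_config(extra=""):
    """Build a pytesseract `config` string that sets tessedit_do_invert=0."""
    return ("-c tessedit_do_invert=0 " + extra).strip()


def ocr_without_inversion(image_path, lang="eng"):
    """OCR one image with tessedit_do_invert=0 (needs pytesseract and tesseract)."""
    import pytesseract  # imported lazily so the helper above works without it

    return pytesseract.image_to_string(
        image_path, lang=lang, config=invert_off_config()
    )


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        # Usage: python no_invert.py in.png
        print(ocr_without_inversion(sys.argv[1]))
    else:
        print("config string:", invert_off_config("--psm 6"))
```

The `config` string is handed to the tesseract command line, so it corresponds to running `tesseract in.png out -c tessedit_do_invert=0` as shown earlier.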
