Tesseract: Multiprocess 4.00.00alpha way slower than 3.03

Created on 9 May 2017 · 14 Comments · Source: tesseract-ocr/tesseract

Hi,

I need to do OCR on a lot of multipage TIF documents. After reading https://github.com/tesseract-ocr/tesseract/issues/263#issuecomment-294275453 I decided to run several Tesseract processes in parallel.

With Tesseract 3.03, OCR speed increases roughly linearly with the number of processes. However, with 4.00.00alpha all processes are blocked at the first page, and that page seems to take an infinitely long time to process. If I manually pause one process, the others are able to resume processing.

The problem seems to be caused by the fact that v4.00 uses up to 4 CPUs to process a multipage TIF (one is saturated and the other 3 are used at about 25%). So if you run 4 processes in parallel on a 4-CPU machine, they get stuck. That is also why launching two processes in parallel on an 8-CPU machine is OK but launching 8 is infinitely slow.
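
The arithmetic behind this can be sketched roughly as follows. This is only an illustration, not a measurement: the function name is hypothetical, and the figure of 4 threads per process is taken from the observation above. Since three of the four threads were only partly busy, the ratio is an upper bound on the real load.

```python
def oversubscription(processes: int, threads_per_process: int, cpus: int) -> float:
    """Ratio of requested threads to available CPUs (above 1.0 means contention)."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return processes * threads_per_process / cpus


# Scenarios from the report, assuming each 4.00.00alpha process uses up to 4 threads.
for processes, cpus in [(4, 4), (2, 8), (8, 8)]:
    ratio = oversubscription(processes, threads_per_process=4, cpus=cpus)
    print(f"{processes} processes on {cpus} CPUs: {ratio:.1f} threads per CPU")
```

Under that assumption, the two slow cases (4 processes on 4 CPUs, 8 on 8) both ask for 4 threads per CPU, while the case that works (2 processes on 8 CPUs) asks for exactly 1.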

I got the same problem on Ubuntu 14.04.5 LTS and Amazon Linux AMI 2016.09.

Is it a bug in the alpha version? Or is it a feature meant to speed up the processing of multipage TIFF images?

Thanks for any help you can provide.


tesseract 3.05.00 ( 2ca5d0a ) is OK

OpenMP

All 14 comments

The behavior which you describe was expected, see this previous discussion.

You can build Tesseract 4.x without multithreading by running configure --disable-openmp. That will improve your case, but I expect that it will still be slower than 3.x because Tesseract 4.x needs more processing time.

It would be good to have a runtime option to disable multithreading (or set the number of threads).

Thanks a lot, it worked! I'll benchmark 3.x vs 4.x in my case to see whether 4.x is worth using.

I don't need multiprocessing all the time, so is it possible to build 4.x with multi threading and to specify a flag to disable OpenMP just before running Tesseract?

is it possible to build 4.x with multi threading and to specify a flag to disable OpenMP just before running Tesseract?

No.

It would be good to have a runtime option to disable multithreading (or set the number of threads).

is it possible to build 4.x with multi threading and to specify a flag to disable OpenMP just before running Tesseract?

"No" is the correct answer, but the whole story is a little bit more complicated. Here is the related Tesseract code:

ccmain/par_control.cpp:#pragma omp parallel for num_threads(10)
lstm/fullyconnected.cpp:#pragma omp parallel for num_threads(kNumThreads)
lstm/fullyconnected.cpp:#pragma omp parallel for num_threads(kNumThreads)
lstm/lstm.cpp:#pragma omp parallel for num_threads(GFS) if (!Is2D())
lstm/parallel.cpp:#pragma omp parallel for num_threads(stack_size)
lstm/parallel.cpp:#pragma omp parallel for num_threads(stack_size)
lstm/weightmatrix.cpp:#pragma omp parallel for num_threads(4) if (in_parallel)

Some of those statements use a fixed number of threads (10, kNumThreads = 4, 4), while others use a calculated value. In addition, there is code which generates the threads conditionally. There is also a Tesseract parameter named tessedit_parallelize which controls use of multithreading. By default it is set to 0 which means no multithreading for those parts of the code. So the more complete answer would be: No, you cannot disable OpenMP just before running Tesseract, but you can enable additional use of OpenMP by setting the parameter tessedit_parallelize.

The parameter tessedit_parallelize is used only with the legacy engine*. The new LSTM engine does not use it.

* Ray now calls the legacy engine "dead code".

Thanks, I'm closing this issue.

@a455bcd9, it would be nice if you could publish your final benchmark results here as soon as they are available.

@stweil OK!

By the way, I thought OMP_NUM_THREADS=1 tesseract ... would disable multithreading, but it doesn't seem to change anything. Is that expected?

OMP_NUM_THREADS specifies the default number of threads. The Tesseract code never uses that default because every omp parallel statement sets its own thread count with a num_threads clause.
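
A simplified model of that precedence, as a sketch only: the function and parameter names are hypothetical, and a real OpenMP runtime may hand out fewer threads than requested. It assumes the rule from the OpenMP specification that a num_threads clause overrides the OMP_NUM_THREADS default, and that the thread limit (environment variable OMP_THREAD_LIMIT) caps both.

```python
from typing import Optional


def effective_threads(
    clause_num_threads: Optional[int],
    omp_num_threads: int,
    omp_thread_limit: Optional[int] = None,
) -> int:
    """Simplified upper bound on the team size of one parallel region."""
    # An explicit num_threads clause wins over the OMP_NUM_THREADS default.
    if clause_num_threads is not None:
        requested = clause_num_threads
    else:
        requested = omp_num_threads
    # The thread limit, if set, caps whatever was requested.
    if omp_thread_limit is not None:
        requested = min(requested, omp_thread_limit)
    return max(1, requested)


# A region written as num_threads(4), run with OMP_NUM_THREADS=1:
print(effective_threads(4, omp_num_threads=1))
# The same region with a thread limit of 1:
print(effective_threads(4, omp_num_threads=1, omp_thread_limit=1))
```

In this model the first call still yields 4, which matches the observation that OMP_NUM_THREADS=1 changes nothing for a region that hard-codes its thread count.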

In the meantime I compared Tesseract 4 with and without OpenMP. My test result suggests that mass production should not use OpenMP:

# tesseract 0604.jp2 /tmp/0604 # default = with OpenMP
real    1m44,390s
user    4m57,656s
sys     0m1,352s

# tesseract 0604.jp2 /tmp/0604 # without OpenMP
real    2m54,469s
user    2m54,160s
sys     0m0,304s

While the elapsed (real) time is shorter with multithreaded code, the CPU (user) time is much higher.
Therefore I'd expect that it is better to run large OCR jobs with one non-threaded
Tesseract process per CPU.
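
To put numbers on that trade-off, here is a small sketch that parses the time figures above and compares them. The helper name is hypothetical, and the result comes from the single test image quoted above, so it is an illustration rather than a general benchmark.

```python
import re


def parse_time(value: str) -> float:
    """Convert a `time` figure such as '1m44,390s' (comma or dot decimals) to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)[,.](\d+)s", value)
    if match is None:
        raise ValueError(f"unrecognised time value: {value!r}")
    minutes, seconds, fraction = match.groups()
    return int(minutes) * 60 + float(f"{seconds}.{fraction}")


with_openmp = {"real": parse_time("1m44,390s"), "user": parse_time("4m57,656s")}
without_openmp = {"real": parse_time("2m54,469s"), "user": parse_time("2m54,160s")}

# How much sooner one page finishes versus how much more CPU time it consumes.
speedup = without_openmp["real"] / with_openmp["real"]
cpu_cost = with_openmp["user"] / without_openmp["user"]
print(f"wall-clock speedup with OpenMP: {speedup:.2f}x")
print(f"CPU time cost with OpenMP: {cpu_cost:.2f}x")
```

For this one image, OpenMP finishes about 1.67 times sooner but burns about 1.71 times the CPU time. If independent single-threaded processes scale close to linearly, as the original report observed for 3.03, the same CPU budget should process more pages without OpenMP.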

@a455bcd9,

is it possible to build 4.x with multi threading and to specify a flag to disable OpenMP just before running Tesseract?

Actually, the answer is yes :-)

OMP_THREAD_LIMIT=1 tesseract...
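
For batch jobs, the same idea can be scripted. The following is a minimal sketch, assuming a tesseract binary on the PATH and the basic `tesseract <image> <outputbase>` invocation; the function names and the one-worker-per-CPU default are illustrative choices, not part of Tesseract.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional, Sequence, Tuple


def single_thread_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and cap OpenMP at one thread per process."""
    env = dict(os.environ if base is None else base)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def ocr_command(image: str, output_base: str) -> List[str]:
    """Build the basic command line: tesseract <image> <outputbase>."""
    return ["tesseract", image, output_base]


def run_batch(jobs: Sequence[Tuple[str, str]], workers: Optional[int] = None) -> List[int]:
    """Run one Tesseract process per job, at most `workers` at a time; return exit codes."""
    workers = workers or os.cpu_count() or 1
    env = single_thread_env()

    def run_one(job: Tuple[str, str]) -> int:
        image, output_base = job
        return subprocess.run(ocr_command(image, output_base), env=env).returncode

    # Threads here only wait on child processes; the OCR work runs in the children.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, jobs))
```

A call such as run_batch([("scan1.tif", "out/scan1"), ("scan2.tif", "out/scan2")]) would then keep up to one single-threaded Tesseract process busy per CPU (the file names are placeholders).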

@stweil, can you elaborate on your conclusion that mass production should not use OpenMP?

Simply don't use Tesseract 4 with OpenMP unless you are sure that it helps in your case.

@stweil: Using OMP_THREAD_LIMIT=1, as suggested by @amitdo, seems to be the solution.
