I compiled the tesseract 4.0 on Itel I5-8400 CPU with Debian 9.6.0 amd64.
tesseract --version output this:
tesseract 4.0.0
leptonica-1.74.2
libjpeg 6b (libjpeg-turbo 1.5.1) : libpng 1.6.28 : libtiff 4.0.8 : zlib 1.2.8
Found AVX2
Found AVX
Found SSE
When I call tesseract several times, crash happens and PC is reboot.
I have a Intel G4650 CPU and this CPU not suport AVX2 / AVX and everything works fine! Never crash happens!
How to make tesseract work fine on Intel I5-8400 with AVX/AVX2/SSE.
When I call tesseract several times, crash happens and PC is reboot.
From the C/C++ API? Paste the code you are using.
Do you run this code inside a VM?
Does running tesseract command line tool work fine?
Did you test rc/beta versions of tesseract 4.0.0?
"From the C/C++ API? Paste the code you are using."
I'm not use the API, I use command line of linux:
tesseract img.jpeg img.jpeg.txt -l fra+eng hocr;
tesseract img2.jpeg img2.jpeg.txt -l fra+eng hocr;
....
"Do you run this code inside a VM?"
Yes, and works fine! because virtual box not expose AVX/AVX2/SSE instructions.
But, when I run on real SO with I5 8400 CPU tesseract crashed and system reboot.
On real SO with G4650 CPU works fine.
"Does running tesseract command line tool work fine?"
Not, when I run several times this line:
tesseract img.jpeg img.jpeg.txt -l fra+eng hocr;
The system reboot.
"Did you test rc/beta versions of tesseract 4.0.0?"
not yet, just tested the version 4.0.0 release.
What version you recommend to test?
As you compile Tesseract yourself, you could locally modify the code to pretend that your CPU does not support AVX2 by setting avx2_available_ = false. Does that help? If yes, we must examine that closer. If not, you also have to set avx_available_ = false.
But there is another more fundamental problem: user code like Tesseract should never cause a Linux system crash. Can you reproduce the crash with the official Debian tesseract-ocr package 4.0.0 (from stretch-backports or newer)? Then I'd report an issue to Debian.
I assume that Tesseract does not crash when you just run tesseract -v several times, but crashes while doing real OCR. Does it crash with any image? How often do you have to run it until it crashes with a small image? Does it crash faster when doing OCR of a large image? Do you get any crash information, either a short message on the screen or messages in /var/log/syslog?
Recently I noticed that the Tesseract code uses AVX(2) with a wrong alignment of data (4-byte aligned instead of 8-byte aligned). That might explain performance problems (especially observed on Windows, but also on Linux). I could not examine it closer up to now, but perhaps such misaligned data causes a kernel trap causing a system reboot in rare cases (that's pure speculation up to now).
"As you compile Tesseract yourself, you could locally modify the code to pretend that your CPU does not support AVX2 by setting avx2_available_ = false. Does that help? If yes, we must examine that closer. If not, you also have to set avx_available_ = false."
But will this make tesseract slow? I need to process some gygabytes of images.
But will this make tesseract slow? I need to process some gygabytes of images.
In your opinion, how much in performance do I lose out on it?
"Can you reproduce the crash with the official Debian tesseract-ocr package 4.0.0 (from stretch-backports or newer)?"
Yes, in the backport version the same thing happens.
"I assume that Tesseract does not crash when you just run tesseract -v several times, but crashes while doing real OCR. Does it crash with any image? How often do you have to run it until it crashes with a small image? Does it crash faster when doing OCR of a large image? Do you get any crash information, either a short message on the screen or messages in /var/log/syslog?"
I have a script with:
tesseract img.jpeg img.jpeg.txt -l fra+eng hocr;
There are 5.000 imagens to process on my script.
when the execution is in the fifth line, the system reboot, no error show me, no log in the /var/log/syslog.
The system simply restarts.
"Recently I noticed that the Tesseract code uses AVX(2) with a wrong alignment of data (4-byte aligned instead of 8-byte aligned). That might explain performance problems (especially observed on Windows, but also on Linux). I could not examine it closer up to now, but perhaps such misaligned data causes a kernel trap causing a system reboot in rare cases (that's pure speculation up to now)."
Okay, I understand, thanks for the support.
I will try to compile in ubuntu and with other versions and return to comment here.
Using AVX instead of AVX2 won't cost much performance, and even SSE still has acceptable performance. An unexpected system reboot will cost much more, I expect.
On Windows @noahmetzger recently reported that not using SSE or AVX is faster than using AVX.
"Using AVX instead of AVX2 won't cost much performance, and even SSE still has acceptable performance. An unexpected system reboot will cost much more, I expect."
Can I disable AVX2 on tesseract-ocr package? or I'll have to compile again.
Can I disable AVX2 on tesseract-ocr package? or I'll have to compile again.
No, it cannot be disabled by the user (that's a desired feature), so you have to compile again. I don't think that tests of older revisions of Tesseract are needed, because the problem seems to be related to the AVX or AVX2 code.
when the execution is in the fifth line, the system reboot,
What means "fifth line" here? Does the system reboot while Tesseract is processing the 5th image? Would it also reboot immediately if you start with that image? If yes, then that image is special, and it would be good if you could share it somehow, so it is possible to examine it closer.
yes, it's the fifth image.
I also thought it was the image, but with tesseract 3 it works.
And without AVX2 too.
I'll do more testing and return here.
Can you attach the image to the issue report? Or send it to me per e-mail if the content is sensitive?
Yes I can, I am currently not having access to the image, but I am sending it to you as soon as possible.
Thanks a lot!
More news,
The problem in the image is discarded, because I tested with other images and touched the same problem.
I also realized that if I use tesseract in only 1 image (with the image that failed) everything works.
I disabled AVX2 (avx2_available_= false) and it worked, but I'm not sure if that's the problem.
I tested the tesseract with AVX2:
Thank you for reporting those new test results. That's very interesting and adds more hints that there might be a problem with Tesseract's implementation of the dot product calculation on AVX2. There is only a AVX2 implementation for integer models, so I assume the reported problem can only be seen with models from tessdata_fast (the models provided by the major Linux distributions), but not with models from tessdata_best or tessdata.
You could try to install a newer Linux kernel on Debian (for example from backports) and test whether that improves the performance and fixes the crashes. Which exact kernel versions do you currently use on Debian / on Ubuntu?
How does Tesseract performance with AVX (AVX2 disabled) compare to Tesseract with AVX2 on Ubuntu? Do you get performance differences with AVX on Debian compared to Ubuntu?
I also realized that if I use tesseract in only 1 image (with the image that failed) everything works.
I don't understand what you mean here. Can you clarify?
Hello guys,
Sorry for the delay in answering.
"You could try to install a newer Linux kernel on Debian (for example from backports) and test whether that improves the performance and fixes the crashes. Which exact kernel versions do you currently use on Debian / on Ubuntu?"
Sorry, but I can't Install a newer Linux kernel on Debian because I replaced Debian by Ubuntu server...(formatting partitions)
I need to finish processing some images with the tesseract after this I can install Debian again and test the kernel backports.
Debian 9.6, kernel 4.9.0-8-amd64
Ubuntu 18.04.1 LTS , kernel 4.15.0-39-generic x86_64
"How does Tesseract performance with AVX (AVX2 disabled) compare to Tesseract with AVX2 on Ubuntu? Do you get performance differences with AVX on Debian compared to Ubuntu?"
It's hard to say because ubuntu server has no GUI and there were no apps running in the background.
Already in Debian had graphical interface and other apps that had installed.
When I install Debian again I can say exactly.
"I also realized that if I use tesseract in only 1 image (with the image that failed) everything works."
I had a script with 5,000 lines, 5mil calls to the tesseract and in the fifth line of the script (fifth image) the system crashes.
So I took this fifth line and just run it (just a picture, just a call to the tesseract) and all works fine.
So I came to the conclusion that there was no problem with the image.
I now have run a benchmark using tessdata_best/script/Latin.traineddata on one of our Debian GNU Linux servers with different Linux kernels:
linux-image-4.9.0-8-amd64 4.9.130-2:
real 3m52.251s
user 12m16.976s
sys 0m7.152s
linux-image-4.18.0-0.bpo.1-amd64 4.18.6-1~bpo9+1:
real 2m49.634s
user 8m33.314s
sys 0m4.478s
So the newer kernel greatly increases the speed of Tesseract!
bad news,
Ubuntu 18.04.1 crashed too.
For now, please disregard the tests I've done.
Because what is making a difference in this bug is the amount of images processed.
In Debian after the fifth image the system crashes and in Ubuntu after 1200 images the system crashes.
I will redo all the tests I did, this time with 10000 images or more.
How to run tesseract 4 many times in parallel? In tesseract 3 I used parallel (gnu.org/software/parallel) but it does not work with tesseract 4.
How to do this? Does tesseract 4 work in an HPC cluster?
"So the newer kernel greatly increases the speed of Tesseract!"
How was your test? Did you use AVX / AVX2? how many images have been processed?
Did you use tesseract in parallel?
The difference in speed might be related to spectra and meltdown patches.
The difference in speed might be related to spectra and meltdown patches.
Yes, I think so, too. Obviously newer kernels have a better implementation to handle those CPU bugs. Tests without any such patches (they can be disabled via kernel parameters) are still to be done.
How was your test?
I have run OCR for a single image using time tesseract 0604.png output -l script/Latin, so Tesseract used its internal multithreading.
For your use case with thousands of images, it is much better to run Tesseract single-threaded (export OMP_THREAD_LIMIT=1 in the environment) and use _parallel_ or other scripts to run as many Tesseract instances as there are CPU cores.
@edilinux, did you run Tesseract as root or as a normal user when it crashed your system? I'd like to reproduce it.
@edilinux, did you run Tesseract as root or as a normal user when it crashed your system? I'd like to reproduce it.
as a normal user
@edilinux, did you run Tesseract as root or as a normal user when it crashed your system? I'd like to reproduce it.
You should process more than 2mil images for the system to crash.
I'm redoing the tests, but I believe if your CPU does not have AVX / AVX2 then the system will not crash.
My performance test was done on a server with AVX2 support. I'll repeat it with smaller images running in a loop. Did you run several instances of Tesseract in parallel when you got a system reboot?
My performance test was done on a server with AVX2 support. I'll repeat it with smaller images running in a loop. Did you run several instances of Tesseract in parallel when you got a system reboot?
I performed both ways, in parallel and with only one instance.
I have now run OCR for phototest.tif 60000 times on the server with AVX2 without a crash:
# using eng.traineddata from tessdata_fast
tesseract test/testing/phototest.tif /tmp/out -l eng
@stweil @amitdo Thanks a lot for your help!
I am convinced that this is some problem with my equipment / installation.
I'm still testing but I have not had problems so far with an I5 7400 CPU + Debian + 4.9.0-8-amd64 + AVX + AVX2.
I don't believe, I found the problem.
One of my partitions was with btrfs and I can not explain why the system crashes when using tesseract.
It's strange because the images were in a partition with Ext4.
@stweil @amitdo sorry for bothering and wasting your time, all the time I thought it was something with AVX / AVX2 because on the CPU that was working that was the only difference I had noticed.
And one more time thanks a lot for your help!
So the result of the OCR was written to a BTRFS partition? That still should not be a reason for a kernel reboot.
But your issue report helped me very much, as I was not aware of the huge impact of the Linux kernel version on the performance of Tesseract OCR.
Yes, was written to a btrfs partition, and the /root directory was on btrfs partition too.
I know it's strange, but after changing btrfs for ext4 everything is working fine.
the CPU I5 8400 is on for more than 20 hours and has not crashed.
But your issue report helped me very much, as I was not aware of the huge impact of the Linux kernel version on the performance of Tesseract OCR.
Yes, the difference in performance is impressive, tesseract 4 + newer kernel is extremely powerful and fast!
My setup for better performance:
Best regards
Seems like those who search for any speed improvement should test different
linux kernels:
https://www.phoronix.com/scan.php?page=article&item=linux-410-420&num=1
@edilinux : if the issue is solved, please close it.
@stweil Thanks for your effort, I had same problem on my environment, and stuck to this problem several days. I don't have same problem any more by disabling AVX features like you advised. The environment of mine for now is like followings.
Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
Ubuntu 16.04 LTS
tesseract 4.1.0-rc1
leptonica-1.77.0
libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libopenjp2 2.1.2
Found SSE
Thanks!
@stweil Thanks for your effort, I had same problem on my environment, and stuck to this problem several days. I don't have same problem any more by disabling AVX features like you advised. The environment of mine for now is like followings.
Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
Ubuntu 16.04 LTS
tesseract 4.1.0-rc1
leptonica-1.77.0
libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libopenjp2 2.1.2
Found SSEThanks!
The solution for me.
https://github.com/tesseract-ocr/tesseract/issues/2064#issuecomment-441012762
@edilinux Thanks for sharing your information. You put all OCR stuffs in btrfs and the problem has gone? I will give it a try if my understanding is right. Please confirm it. Thanks!
I have the similar crush problem.
tesseract --version
tesseract 4.0.0
leptonica-1.74.1 libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.1) : libpng 1.6.28 : libtiff 4.0.8 : zlib 1.2.8 :
libwebp 0.5.2 : libopenjp2 2.1.2
Found AVX2
Found AVX
Found SSE
And I tested both on centos 7.5 and debian 9.8
@jeodai, please give more details. Are you running on real hardware or in a virtual machine? Does Tesseract crash with any image? Does your PC reboot?
I ran into this problem using Fedora 29, Linux 5.0.17, tesseract 4.1.0rc2, exact same CPU Intel I5-8400. Disabling AVX2 works around the crashes:
~
diff -ur tesseract-4.1.0-rc2.orig/src/arch/simddetect.cpp tesseract-4.1.0-rc2/src/arch/simddetect.cpp
--- tesseract-4.1.0-rc2.orig/src/arch/simddetect.cpp 2019-05-28 16:05:22.915125087 +0200
+++ tesseract-4.1.0-rc2/src/arch/simddetect.cpp 2019-05-28 16:09:51.314124063 +0200
@@ -99,7 +99,7 @@
// there is in my cpuid.h. It is a macro for an asm statement and cannot
// be used inside an if.
__cpuid_count(7, 0, eax, ebx, ecx, edx);
- avx2_available_ = (ebx & 0x00000020) != 0;
+ avx2_available_ = false;
avx512F_available_ = (ebx & 0x00010000) != 0;
avx512BW_available_ = (ebx & 0x40000000) != 0;
}
~
@stweil I only tested this with one picture which I can send to you privately (it's from a data set and I cannot publish the image.) If it helps I can also test with other pictures but for now I'm happy that my computer doesn't crash anymore.
I expect that it will crash with any picture. Can you confirm that? Are you running on real hardware or in a virtual machine? Does your PC reboot?
It also crashes with http://www.mattmahoney.net/ocr/stock_gs200.jpg on the second attempt. I use real hardware and the computer hard resets (goes off and on again) with no error message visible.
I also tested this in a chroot running the GRML 2018.12 kernel (Debian based rescue system) with the same hard reset effect.
Instead of modifying the code, you can also tell Tesseract to avoid the AXV2 code: tesseract IMAGE OUTPUTBASE [...] -c dotproduct=sse.
What happens if you run your test with gdb supervision? gdb --args tesseract [...]
With -c dotproduct=sse I do not get the hard resets.
If I run it using gdb and with the AVX2 code, the system also hard resets after a few attempts.
FWIW, I run everything on ext4.
I don't think that it is related to the filesystem.
Summary of the currently known facts:
We now have at least two cases with Intel Core i5-8400, a CPU which claims to support AVX2 but not only crashes when running Tesseract with AVX2 code but even reboots Linux.
That could be a bug in the Intel microcode. @edilinux and @mikegerber, which microcode package do you have installed? Is it current?
Using SSE 4.2 instead of AVX2 by calling Tesseract with -c dotproduct=sse is a working workaround.
~
~ dmesg | grep -i microcode
[ 0.000000] microcode: microcode updated early to revision 0xb4, date = 2019-04-01
[ 1.258506] microcode: sig=0x906ea, pf=0x2, revision=0xb4
[ 1.258691] microcode: Microcode Update Driver: v2.2.
~
If I interpret this message correctly, the microcode seems to be up-to-date.
My last idea is running Tesseract in a pure Linux console (Alt-Ctrl-F1, no X environment) and hope that the Kernel shows a last piece of information on that console before it reboots. Maybe you need a video camera to film the very short moment when this information is on the display.
On Windows @noahmetzger recently reported that not using SSE or AVX is faster than using AVX.
Is there a separate issue for this?
I am not sure whether this is still true and will ask @noahmetzger on Monday or test it myself.
Apparently I do not have this problem anymore.
馃
~
% bin/tesseract --version
tesseract 4.1.0
leptonica-1.78.0
libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 2.0.2) : libpng 1.6.34 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.3
Found AVX2
Found AVX
Found SSE
Found libarchive 3.3.3 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6 liblz4/1.8.3 libzstd/1.3.8
~