Tesseract: LSTM: ARM SIMD support

Created on 1 Dec 2016 · 9 Comments · Source: tesseract-ocr/tesseract

https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00#for-open-source-contributors

There is a C++ implementation if the hardware does not have SSE and/or AVX, but the code could benefit from SIMD implementations for other hardware, such as ARM. See the new arch directory for where to insert the code.
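As a rough illustration of what such an ARM kernel could look like, here is a minimal sketch of a single-precision dot product with a NEON path and a portable fallback. The function name `DotProductSketch` and the choice of `float` are assumptions made for illustration; neither is taken from Tesseract's arch directory.

```cpp
#include <cstddef>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper (not Tesseract's API): dot product of two float arrays.
float DotProductSketch(const float* u, const float* v, std::size_t n) {
  std::size_t i = 0;
  float total = 0.0f;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
  // Accumulate four lanes at a time in a 128-bit NEON register.
  float32x4_t acc = vdupq_n_f32(0.0f);
  for (; i + 4 <= n; i += 4) {
    acc = vmlaq_f32(acc, vld1q_f32(u + i), vld1q_f32(v + i));
  }
  // Horizontal sum of the four partial sums.
  total = vgetq_lane_f32(acc, 0) + vgetq_lane_f32(acc, 1) +
          vgetq_lane_f32(acc, 2) + vgetq_lane_f32(acc, 3);
#endif
  // Scalar tail, or the whole loop when NEON is not available.
  for (; i < n; ++i) total += u[i] * v[i];
  return total;
}
```

The same source compiles on hardware without NEON, where only the scalar loop remains.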

SIMD

amitdo

Most helpful comment

Yes, that's correct. It is already possible to do it by providing additional compiler flags when running configure (CXXFLAGS=...). But of course that should happen automatically, and we must take care that the resulting binary can also be used on different hardware. This still has to be implemented.
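One way to keep a single binary usable on different hardware is to pick the kernel at run time. The following is only a sketch under assumptions: `CpuHasNeon`, `SelectDotProduct` and the stand-in kernel are hypothetical names, and the Linux `getauxval` probe is just one possible detection method, not necessarily what Tesseract ends up using.

```cpp
#include <cstddef>
#if defined(__linux__) && defined(__arm__)
#include <asm/hwcap.h>
#include <sys/auxv.h>
#endif

// Portable fallback that every build contains.
double DotProductGeneric(const double* u, const double* v, std::size_t n) {
  double total = 0.0;
  for (std::size_t i = 0; i < n; ++i) total += u[i] * v[i];
  return total;
}

// Stand-in for a NEON kernel. A real build would compile it in its own
// translation unit with NEON flags; here it just forwards to the fallback.
double DotProductNeonStandIn(const double* u, const double* v, std::size_t n) {
  return DotProductGeneric(u, v, n);
}

// Hypothetical probe: does the CPU we are running on support NEON?
bool CpuHasNeon() {
#if defined(__aarch64__)
  return true;  // Assumption: Advanced SIMD is present on 64-bit ARM targets.
#elif defined(__linux__) && defined(__arm__)
  return (getauxval(AT_HWCAP) & HWCAP_NEON) != 0;
#else
  return false;
#endif
}

using DotProductFn = double (*)(const double*, const double*, std::size_t);

// Choose the kernel once, at startup, based on the running CPU.
DotProductFn SelectDotProduct() {
  return CpuHasNeon() ? DotProductNeonStandIn : DotProductGeneric;
}
```

With this layout only the NEON translation unit needs the extra compiler flags, so the rest of the binary still runs on CPUs without NEON.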

PS. I have recently ordered a small ARM based cluster for Tesseract OCR, so I'm highly motivated to work on that issue. :-)

stweil on 3 Jun 2018

👍2 🎉1

All 9 comments

Should we write this code manually nowadays? Modern compilers can generate SIMD instructions very well without any manual work with intrinsics. Users should just compile with -O2 or -O3 and -march=

I think writing a lot of manual assembler or intrinsics code isn't a good idea.

ZaMaZaN4iK on 3 Jun 2018

Enabling NEON optimisations does result in vectorised NEON instructions for WeightMatrix::DotProduct: https://godbolt.org/z/YCUgcb

I'm not sure about IntSimdMatrix::MatrixDotVector -- the code (and assembly) is much harder to follow.

On my ARM device (NVidia Tegra K1) compiling tesseract with NEON optimisations (-mfpu=neon-vfpv4 -mfloat-abi=hard -mcpu=cortex-a15) gave a 10-15% speedup, but the LSTM engine is still 3-10 times slower than the legacy engine: 3-30 seconds (depending on the image size) compared to 1-4 seconds for the legacy engine.

These compiler flags had no measurable effect on the legacy engine.

Adding -O3 (versus the default -O2) resulted in a further 0-20% speedup (depending on image size). In other words, a total speedup of 10-30% over -O2 without NEON. (Still many times slower than the legacy engine.)

For the legacy engine, -O3 gave me a 1-8% speedup.

I used Ubuntu's tesseract package version 4.00~git2288-10f4998a-2 + the English data files from https://github.com/tesseract-ocr/tessdata/tree/590567f2

How I built it, in case it helps anyone:

sudo apt install build-essential devscripts
sudo apt build-dep tesseract-ocr
mkdir /tmp/tesseract
cd /tmp/tesseract
apt source tesseract-ocr
cd tesseract-4.00~git2288-10f4998a
debchange -R "Rebuild with NEON optimisations"
export DEB_CFLAGS_APPEND="-mfpu=neon-vfpv4 -mfloat-abi=hard -mcpu=cortex-a15"
debuild -i -us -uc -b  # creates ../*.deb

drothlis on 12 Feb 2019

I suggest using data files from tessdata_fast instead of those from tessdata. In addition, you could try -c dotproduct=native which should use Neon if you compiled on a Neon machine.

stweil on 12 Feb 2019

You can find below code that adds ARM NEON integer support. It is a native implementation in intsimdmatrixneon.cpp, along with the changes in other files needed to support it. Once I get my hands on a 64-bit ARM platform, I will work on ARM NEON float support (for dotproductneon.cpp). There is about a 20% improvement in performance. Please review the code and let me know your comments.

https://github.com/s6ch13/tesseract/tree/arm_neon_support
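For readers unfamiliar with integer NEON kernels, here is a simplified sketch of the general idea: an int8 matrix-vector product with widening multiplies, accumulated in 32-bit lanes. It is not the code from the branch above. The names `Int8DotSketch` and `MatrixDotVectorSketch`, the row-major layout and the per-row scale are assumptions for illustration, and bias handling is left out.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: dot product of two int8 vectors, accumulated in int32.
std::int32_t Int8DotSketch(const std::int8_t* w, const std::int8_t* u,
                           std::size_t n) {
  std::size_t i = 0;
  std::int32_t total = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
  int32x4_t acc = vdupq_n_s32(0);
  for (; i + 8 <= n; i += 8) {
    // Eight int8 x int8 products, widened to int16 so they cannot overflow.
    int16x8_t prod = vmull_s8(vld1_s8(w + i), vld1_s8(u + i));
    // Pairwise add the int16 products into the four int32 accumulators.
    acc = vpadalq_s16(acc, prod);
  }
  total = vgetq_lane_s32(acc, 0) + vgetq_lane_s32(acc, 1) +
          vgetq_lane_s32(acc, 2) + vgetq_lane_s32(acc, 3);
#endif
  // Scalar tail, or the whole loop when NEON is not available.
  for (; i < n; ++i) total += static_cast<std::int32_t>(w[i]) * u[i];
  return total;
}

// Hypothetical helper: v = scales[r] * (row r of W) . u, with W stored
// row-major as rows x cols int8 values.
void MatrixDotVectorSketch(const std::int8_t* w, const double* scales,
                           const std::int8_t* u, std::size_t rows,
                           std::size_t cols, double* v) {
  for (std::size_t r = 0; r < rows; ++r) {
    v[r] = Int8DotSketch(w + r * cols, u, cols) * scales[r];
  }
}
```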

Cheers, Sriram

s6ch13 on 9 Jan 2020

Dot product acceleration using Neon was implemented in f79e52a7ccc06e.

amitdo on 27 May 2020

I'll try to compare the performance of both implementations later. This is an interesting example because the one here simply relies on the compiler while the other one uses handwritten NEON code.
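A comparison like this can be set up with a small timing harness. The sketch below is only an assumption about how one might measure it: `SecondsPerCall` and `PlainDot` are hypothetical names, and a serious benchmark would also need warm-up runs, realistic vector sizes and several repetitions.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Plain loop that relies on the compiler for any vectorisation.
double PlainDot(const double* u, const double* v, std::size_t n) {
  double total = 0.0;
  for (std::size_t i = 0; i < n; ++i) total += u[i] * v[i];
  return total;
}

// Hypothetical harness: average wall-clock seconds per call of fn.
// The accumulated result is written to *sink so the calls stay observable.
template <typename Fn>
double SecondsPerCall(Fn fn, const std::vector<double>& u,
                      const std::vector<double>& v, int repeats,
                      double* sink) {
  const auto start = std::chrono::steady_clock::now();
  double sum = 0.0;
  for (int r = 0; r < repeats; ++r) sum += fn(u.data(), v.data(), u.size());
  const auto stop = std::chrono::steady_clock::now();
  *sink = sum;
  return std::chrono::duration<double>(stop - start).count() / repeats;
}
```

Calling `SecondsPerCall` once with the compiler-vectorised function and once with the handwritten kernel, on the same inputs, gives two per-call times that can be compared directly.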

stweil on 27 May 2020

@stweil Do you have a result for the comparison?
What are the recommended settings to use for Neon?