This is the step-by-step process I used to install Tesseract 4.0 on Mac OS X, along with the fixes and workarounds I needed to make it work.
I'm sharing this "guide" with the intention of helping other people who may have the same problems I had.
Special thanks to Shree, who helped me on the Google group.
Project and more details: https://github.com/tesseract-ocr/tesseract
Where to get help?
Google group: https://groups.google.com/forum/#!forum/tesseract-ocr
GitHub issues: https://github.com/tesseract-ocr/tesseract/issues
Platform: MAC OS X 10.13.3
Tesseract: 4.0.0-beta.1-69-g10f4
leptonica-1.75.3
libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
Found AVX2
Found AVX
Found SSE
Compiling Tesseract - tesseract 4.0
Reference: https://github.com/tesseract-ocr/tesseract/wiki/Compiling#macos
Warning: Don't install tesseract using brew, since you can't generate the ScrollView.jar from it! (At least I wasn't able to generate it)
1 - Install these libs
brew install automake autoconf autoconf-archive libtool
brew install pkgconfig
brew install icu4c
brew install leptonica
brew install gcc
2 - Run the code
ln -hfs /usr/local/Cellar/icu4c/60.2 /usr/local/opt/icu4c
Note: text2image is set to use icu4c/60.2, but the version currently installed is icu4c/61.1
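Since the version directory under the Cellar depends on what Homebrew has installed, the hard-coded 60.2 may not exist on every machine. A minimal sketch that checks for it before linking, assuming Homebrew's default /usr/local/Cellar layout (check_icu_version is a hypothetical helper, not part of Homebrew or Tesseract). It only prints the ln command so it can be reviewed first:

```shell
# Hypothetical helper: succeed if the requested icu4c version directory exists
# under the given Cellar path; otherwise list what is installed there.
check_icu_version() {
  if [ -d "$1/$2" ]; then
    echo "found $1/$2"
  else
    echo "icu4c $2 not found under $1; installed versions:" >&2
    if [ -d "$1" ]; then ls "$1" >&2; fi
    return 1
  fi
}

# Only print the link command when the target really exists.
if check_icu_version /usr/local/Cellar/icu4c 60.2 2>/dev/null; then
  echo "ln -hfs /usr/local/Cellar/icu4c/60.2 /usr/local/opt/icu4c"
fi
```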
3 - Clone tesseract repo
git clone https://github.com/tesseract-ocr/tesseract/
4 - Enter the folder
cd tesseract
5 - Run the script
./autogen.sh
6 - Run the code, and copy the CPPFLAGS and LDFLAGS
brew info icu4c
7 - Update the CPPFLAGS and LDFLAGS and execute the code
./configure \
CPPFLAGS=-I/usr/local/opt/icu4c/include \
LDFLAGS=-L/usr/local/opt/icu4c/lib
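Instead of copying the flags by hand from brew info, the prefix can be asked for directly. A minimal sketch, assuming brew --prefix icu4c reports the same location that brew info shows (icu_configure_args is a hypothetical helper). It prints the configure line rather than running it:

```shell
# Hypothetical helper: print the configure arguments for a given icu4c prefix.
icu_configure_args() {
  printf 'CPPFLAGS=-I%s/include LDFLAGS=-L%s/lib\n' "$1" "$1"
}

# Use Homebrew's reported prefix when brew is available; otherwise fall back
# to the path used in this guide.
if command -v brew >/dev/null 2>&1; then
  icu_prefix=$(brew --prefix icu4c)
else
  icu_prefix=/usr/local/opt/icu4c
fi

echo "./configure $(icu_configure_args "$icu_prefix")"
```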
8 - Run the code
make -j
9 - Run the code
sudo make install
10 - Run the code
sudo update_dyld_shared_cache
Note: this is the Mac OS X equivalent of sudo ldconfig
11 - Run the code
make training
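Step 8 runs make -j without a job count, which lets make start as many parallel jobs as it likes and can exhaust memory on smaller machines. A minimal sketch for bounding the job count by the number of CPUs (cpu_count is a hypothetical helper; hw.ncpu is the macOS sysctl key, with getconf as a fallback elsewhere):

```shell
# Hypothetical helper: print the number of CPUs, or 1 if it cannot be determined.
cpu_count() {
  if command -v sysctl >/dev/null 2>&1 && sysctl -n hw.ncpu >/dev/null 2>&1; then
    sysctl -n hw.ncpu
  elif command -v getconf >/dev/null 2>&1; then
    getconf _NPROCESSORS_ONLN
  else
    echo 1
  fi
}

# Print the bounded make command instead of running it.
echo "make -j$(cpu_count)"
```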
Creating ScrollView.jar - tesseract 4.0
Reference:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#lstmtraining-command-line
https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging
Important: Use JDK 8 to build; otherwise the build fails with an error
1 - Download the files piccolo2d-core-3.0.jar and piccolo2d-extras-3.0.jar
http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-core/3.0/piccolo2d-core-3.0.jar
http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-extras/3.0/piccolo2d-extras-3.0.jar
2 - Move the files piccolo2d-core-3.0.jar and piccolo2d-extras-3.0.jar to tesseract/java
3 - Enter the tesseract/java folder
cd java
4 - Set the var SCROLLVIEW_PATH to your tesseract/java folder and run the code
SCROLLVIEW_PATH=~/projects/tesseract/java make ScrollView.jar
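Because the build depends on JDK 8, it can help to check the active Java version before running make. A minimal sketch, assuming java -version prints the version as the first quoted token on its first line (java_major is a hypothetical helper). Java 8 reports itself as 1.8.x, while later releases start with the major number:

```shell
# Hypothetical helper: extract the major Java version from a version string
# such as "1.8.0_161" (Java 8) or "11.0.2" (Java 11).
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;
    *)   echo "$1" | cut -d. -f1 ;;
  esac
}

if command -v java >/dev/null 2>&1; then
  # java -version writes to stderr; the version is the first quoted token.
  raw=$(java -version 2>&1 | head -n 1 | cut -d'"' -f2)
  if [ "$(java_major "$raw")" != "8" ]; then
    echo "Warning: Java $raw found; this guide was built with JDK 8" >&2
  fi
fi
```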
Training Font - tesseract 4.0
Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#user-content-using-tesstrain
1 - Clone the langdata dir from git
git clone https://github.com/tesseract-ocr/langdata
2 - Enter the tesseract folder
cd ..
3 - Execute this code and select one font from the list (I recommend "Verdana")
text2image --list_available_fonts --fonts_dir=/Library/Fonts
Font directories on Mac can be:
~/Library/Fonts
/Library/Fonts/
/Network/Library/Fonts/
/System/Library/Fonts/
/System Folder/Fonts/
More details here: https://support.apple.com/en-us/HT201722
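To see which of the font directories listed above actually exist, and what each one offers, the listing can be run in a loop. A minimal sketch (existing_dirs is a hypothetical helper; note the quoting needed for the directory whose name contains a space):

```shell
# Hypothetical helper: print only those arguments that are existing directories.
existing_dirs() {
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      echo "$dir"
    fi
  done
}

existing_dirs "$HOME/Library/Fonts" /Library/Fonts /Network/Library/Fonts \
    /System/Library/Fonts "/System Folder/Fonts" |
while IFS= read -r dir; do
  if command -v text2image >/dev/null 2>&1; then
    text2image --list_available_fonts --fonts_dir="$dir"
  else
    echo "text2image not found; would scan $dir"
  fi
done
```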
4 - In the file tesseract/training/tesstrain_utils.sh, replace line 195 as follows:
- export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XXXXXXXXXX)
+ export FONT_CONFIG_CACHE=$(mktemp -d -t font_tmp.XXXXXXXXXX)
Note: this fixes the following error:
mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
mktemp [-d] [-q] [-u] -t prefix
/Users/username/projects/tesseract/training/tesstrain_utils.sh: line 197: /sample_text.txt: Permission denied
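The edit in this step swaps GNU's --tmpdir option for BSD's -t. An alternative that does not depend on either option is to pass the template as a full path, which both GNU and BSD mktemp accept. A minimal sketch, assuming TMPDIR may be unset (in which case /tmp is used); this is not the change applied upstream:

```shell
# A template given as a full path works with both BSD (macOS) and GNU mktemp.
font_cache=$(mktemp -d "${TMPDIR:-/tmp}/font_tmp.XXXXXXXXXX")
echo "$font_cache"
```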
5 - Clone the tessdata repo from git (I recommend "tessdata_best" since it is the most accurate; "tessdata_fast" is just faster)
git clone https://github.com/tesseract-ocr/tessdata_best
or
git clone https://github.com/tesseract-ocr/tessdata_fast
6 - Copy tessdata_best/eng.traineddata (for English training) from the tessdata repo you just cloned and paste it into tesseract/tessdata/
7 - Create the training data
PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
--fonts_dir /Library/Fonts \
--lang eng \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--langdata_dir ~/projects/langdata \
--tessdata_dir ~/projects/tesseract/tessdata \
--fontlist "Verdana" \
--output_dir ~/tesstutorial/engtrain
Add the prefix PANGOCAIRO_BACKEND=fc if using MAC OSX
8 - Create evaluation data using another font, for comparison
PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
--fonts_dir /Library/Fonts \
--lang eng \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--langdata_dir ~/projects/langdata \
--tessdata_dir ~/projects/tesseract/tessdata \
--fontlist "Times New Roman," \
--output_dir ~/tesstutorial/engeval
Add the prefix PANGOCAIRO_BACKEND=fc if using MAC OSX
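Steps 7 and 8 differ only in the font and the output directory, so the command can be built from two parameters. A minimal sketch that prints the command for review (tesstrain_cmd, TESS_HOME and LANGDATA_DIR are hypothetical names introduced here; the defaults mirror the ~/projects paths used in this guide):

```shell
# Assumed locations; override them in the environment if your checkout differs.
TESS_HOME=${TESS_HOME:-$HOME/projects/tesseract}
LANGDATA_DIR=${LANGDATA_DIR:-$HOME/projects/langdata}

# Hypothetical helper: print the tesstrain.sh command line.
# $1: font name, $2: output directory
tesstrain_cmd() {
  printf 'PANGOCAIRO_BACKEND=fc %s/training/tesstrain.sh' "$TESS_HOME"
  printf ' --fonts_dir /Library/Fonts --lang eng --linedata_only'
  printf ' --noextract_font_properties --exposures "0"'
  printf ' --langdata_dir %s --tessdata_dir %s/tessdata' "$LANGDATA_DIR" "$TESS_HOME"
  printf ' --fontlist "%s" --output_dir %s\n' "$1" "$2"
}

tesstrain_cmd "Verdana" "$HOME/tesstutorial/engtrain"
tesstrain_cmd "Times New Roman," "$HOME/tesstutorial/engeval"
```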
9 - Create the needed folder
mkdir -p ~/tesstutorial/engoutput
10 - Start the training
SCROLLVIEW_PATH=~/projects/tesseract/java \
~/projects/tesseract/training/lstmtraining \
--debug_interval 100 \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
--model_output ~/tesstutorial/engoutput/base \
--learning_rate 20e-4 \
--train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log
If you failed to build ScrollView.jar, set debug_interval to -1 (--debug_interval -1)
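The choice between the two debug_interval values can be made automatically by checking whether ScrollView.jar was built. A minimal sketch (debug_interval_for is a hypothetical helper; 100 and -1 are the values used in this guide):

```shell
# Hypothetical helper: print 100 if ScrollView.jar exists in the given
# directory, otherwise -1.
debug_interval_for() {
  if [ -f "$1/ScrollView.jar" ]; then
    printf '%s\n' 100
  else
    printf '%s\n' -1
  fi
}

echo "--debug_interval $(debug_interval_for "${SCROLLVIEW_PATH:-$HOME/projects/tesseract/java}")"
```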
11 - Monitor the log on another console
tail -f ~/tesstutorial/engoutput/basetrain.log
12 - Test Accuracy with other font
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/engoutput/base_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
13 - Test Accuracy with best traindata
~/projects/tesseract/training/lstmeval \
--model ~/projects/tessdata_best/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
14 - Test Accuracy with actual traindata (in this case the same as step 13)
~/projects/tesseract/training/lstmeval \
--model ~/projects/tesseract/tessdata/eng.traineddata \
--eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt
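The training and evaluation commands in steps 10 to 14 all depend on files produced by steps 7 and 8. A minimal sketch of a pre-flight check that reports anything missing before a long run is started (require_files is a hypothetical helper; the paths are the ones used in this guide):

```shell
# Hypothetical helper: report every missing file and return non-zero if any
# of them is absent.
require_files() {
  missing=0
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

require_files \
  "$HOME/tesstutorial/engtrain/eng.training_files.txt" \
  "$HOME/tesstutorial/engeval/eng.training_files.txt" \
  "$HOME/tesstutorial/engtrain/eng/eng.traineddata" \
  || echo "Run steps 7 and 8 first"
```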
Fine tuning - tesseract 4.0
Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact
1 - Create the necessary folder
mkdir -p ~/tesstutorial/verdana_from_small
2 - Start fine-tuning
~/projects/tesseract/training/lstmtraining \
--model_output ~/tesstutorial/verdana_from_small/verdana \
--continue_from ~/tesstutorial/engoutput/base_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 1200
3 - Validate the progress
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/verdana_from_small/verdana_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
4 - Create the necessary folder
mkdir -p ~/tesstutorial/verdana_from_full
5 - Combine the trained data
~/projects/tesseract/training/combine_tessdata \
-e ~/projects/tesseract/tessdata/eng.traineddata \
~/tesstutorial/verdana_from_full/eng.lstm
6 - Train merged data
~/projects/tesseract/training/lstmtraining \
--model_output ~/tesstutorial/verdana_from_full/verdana \
--continue_from ~/tesstutorial/verdana_from_full/eng.lstm \
--traineddata ~/projects/tesseract/tessdata/eng.traineddata \
--train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 400
7 - Validate the results on the main training file
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/verdana_from_full/verdana_checkpoint \
--traineddata ~/projects/tesseract/tessdata/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
8 - Validate the results on our training file
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/verdana_from_full/verdana_checkpoint \
--traineddata ~/projects/tesseract/tessdata/eng.traineddata \
--eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt
Fine tuning add ± character - tesseract 4.0
Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters
1 - Modify langdata/eng/eng.training_text and include these lines:
alkoxy of LEAVES ±1.84% by Buying curved RESISTANCE MARKED Your (Vol. SPANIEL
TRAVELED ±85¢ , reliable Events THOUSANDS TRADITIONS. ANTI-US Bedroom Leadership
Inc. with DESIGNS self; ball changed. MANHATTAN Harvey's ±1.31 POPSET Os—C(11)
VOLVO abdomen, ±65°C, AEROMEXICO SUMMONER = (1961) About WASHING Missouri
PATENTSCOPE® # © HOME SECOND HAI Business most COLETTI, ±14¢ Flujo Gilbert
Dresdner Yesterday's Dilated SYSTEMS Your FOUR ±90° Gogol PARTIALLY BOARDS firm
Email ACTUAL QUEENSLAND Carl's Unruly ±8.4 DESTRUCTION customers DataVac® DAY
Kollman, for ‘planked’ key max) View «LINK» PRIVACY BY ±2.96% Ask! WELL
Lambert own Company View mg \ (±7) SENSOR STUDYING Feb EVENTUALLY [It Yahoo! Tv
United by #DEFINE Rebel PERFORMED ±500Gb Oliver Forums Many | ©2003-2008 Used OF
Avoidance Moosejaw pm* ±18 note: PROBE Jailbroken RAISE Fountains Write Goods (±6)
Oberflachen source.” CULTURED CUTTING Home 06-13-2008, § ±44.01189673355 €
netting Bookmark of WE MORE) STRENGTH IDENTICAL ±2? activity PROPERTY MAINTAINED
2 - Generate the training file
PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
--fonts_dir /Library/Fonts \
--lang eng \
--linedata_only \
--noextract_font_properties \
--langdata_dir ~/projects/langdata \
--tessdata_dir ~/projects/tesseract/tessdata \
--fontlist "Times New Roman," \
"Times New Roman, Bold" \
"Times New Roman, Bold Italic" \
"Times New Roman, Italic" \
"Courier New" \
"Courier New Bold" \
"Courier New Bold Italic" \
"Courier New Italic" \
--output_dir ~/tesstutorial/trainplusminus
3 - Generate the eval data
PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
--fonts_dir /Library/Fonts \
--lang eng \
--linedata_only \
--noextract_font_properties \
--langdata_dir ~/projects/langdata \
--tessdata_dir ~/projects/tesseract/tessdata \
--fontlist "Verdana" \
--output_dir ~/tesstutorial/evalplusminus
4 - Combine trained data files
~/projects/tesseract/training/combine_tessdata \
-e ~/projects/tesseract/tessdata/eng.traineddata \
~/tesstutorial/trainplusminus/eng.lstm
5 - Fine tuning
~/projects/tesseract/training/lstmtraining \
--model_output ~/tesstutorial/trainplusminus/plusminus \
--continue_from ~/tesstutorial/trainplusminus/eng.lstm \
--traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
--old_traineddata ~/projects/tesseract/tessdata/eng.traineddata \
--train_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt \
--max_iterations 3600
6 - Test the result on other fonts
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
--traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
--eval_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt
7 - Test the result on the main font
~/projects/tesseract/training/lstmeval \
--model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
--traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
--eval_listfile ~/tesstutorial/evalplusminus/eng.training_files.txt
Thank you for the step-by-step info. This should probably be added to the wiki.
One correction:
When doing fine-tune training, ONLY traineddata files from tessdata_best can be used as a base traineddata to continue from.
Models from tessdata_fast as well as tessdata will NOT work.
@FernandoGOT Thank you. As @Shreeshrii mentioned, there is a problem with the fine-tune training part. I hope this page will be updated to reflect that soon. Thank you.
This is a great resource! It would be even more amazing if it were in the form of a pull request of changes to the existing documentation so that it could be improved to avoid these problems for other OS X users.
I followed @FernandoGOT's steps but I am getting: read_params_file: parameter not found: enable_new_segsearch when running tesseract --list-langs. This is the first time I've tried to build tesseract, so I have no idea what's going on. Any ideas on where to look?
@kas84 please post results of
tesseract -v
Version info.
Are you using latest source from Github ?
@Shreeshrii I cloned the repo like so git clone https://github.com/tesseract-ocr/tesseract/, so if latest version is in master, yes I am.
tesseract -v
Yeah, I forgot, sorry!
```
tesseract 4.0.0-beta.1-232-g45a6
leptonica-1.76.0
libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
Found AVX2
Found AVX
Found SSE
```
Usually tesseract -v should also show the tesseract version.
Is the error only with --list-langs
Are you able to recognize any test images?
My bad:
tesseract 4.0.0-beta.1-232-g45a6
leptonica-1.76.0
libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
Found AVX2
Found AVX
Found SSE
It also happens when trying to recognize an image, yes.
What commands are you using?
What tessdata-dir are you using? Eg. Where is eng.traineddata installed?
What output do you get with the following? Use ./tessdata if you have copied eng.traineddata there.
cd tesseract
tesseract ./testing/phototest.tif - --tessdata-dir ../tessdata -c page_separator=''
Page 1
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.
The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.

page _seperator
The space here confuses the command line options parser.
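The effect of the stray space can be seen with a small shell experiment: the shell splits arguments on whitespace before tesseract ever sees them, so a single name=value pair arrives as two separate arguments. A minimal sketch (count_args is a hypothetical helper that only reports how many arguments it received):

```shell
# Hypothetical helper: print the number of arguments received.
count_args() {
  echo "$#"
}

count_args -c page_separator=''     # prints 2: "-c" and one name=value pair
count_args -c page _separator=''    # prints 3: the space splits the pair in two
```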
Has anyone built a Dockerfile out of this?

It works now! I am guessing it had something to do with my TESSDATA env
I am guessing it had something to do with my TESSDATA env
No.
It was due to wrong command line usage.
I am a newbie with tesseract and this has nothing to do with my bug, but... is it supposed to recognize images like this?

Or do I need to treat the image first to remove everything but white so that tesseract can handle it?
Please use the forum for asking questions.
Okay, sorry!
@FernandoGOT Thank you very much for such a detailed explanation but I can't make it work. When I say "make training" it gives me "Need to reconfigure project, so there are no errors" error. Also, I couldn't create ScrollView.jar. Is it possible to update this post? Thank you.
@ysnnzlcn I'm short on time these days (working too much), but when I get some free time I'm going to write a better step-by-step guide on how to use tesseract and send a pull request to the docs
@FernandoGOT That would be great, looking forward to it. Thanks
Under Training Font -- Tesseract 4.0, Step 7, I get a failure:
```
Hadil-Sabbaghs-MacBook-Pro:tesseract hadilsabbagh$ PANGCAIRO=fc ~/tesseract/src/training/tesstrain.sh --fonts_dir /Library/Fonts --lang eng --linedata_only --noextract_font_properties --exposures "0" --langdata_dir ~/tesseract/java/langdata --tessdata_dir ~/tesseract/tessdata --fontlist "Verdana" --output_dir ~/tesstutorial/engtrain
=== Starting training for language 'eng'
[Sat Sep 22 16:56:06 MST 2018] /usr/local/bin/text2image --fonts_dir=/Library/Fonts --font=Verdana --outputbase=/var/folders/8x/69qlvhl16n56q28vy__yp10r0000gn/T/font_tmp.XXXXXXXXXX.I4GMoIqG/sample_text.txt --text=/var/folders/8x/69qlvhl16n56q28vy__yp10r0000gn/T/font_tmp.XXXXXXXXXX.I4GMoIqG/sample_text.txt --fontconfig_tmpdir=/var/folders/8x/69qlvhl16n56q28vy__yp10r0000gn/T/font_tmp.XXXXXXXXXX.I4GMoIqG
=== Phase I: Generating training images ===
Rendering using Verdana
[Sat Sep 22 16:56:09 MST 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/var/folders/8x/69qlvhl16n56q28vy__yp10r0000gn/T/font_tmp.XXXXXXXXXX.I4GMoIqG --fonts_dir=/Library/Fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/var/folders/8x/69qlvhl16n56q28vy__yp10r0000gn/T/eng-2018-09-22.XXX.rxeEXrp0/eng.Verdana.exp0 --max_pages=0 --font=Verdana --text=/Users/hadilsabbagh/tesseract/java/langdata/eng/eng.training_text
ERROR: /var/folders/8x/69qlvhl16n56q28vy__yp10r0000gn/T/eng-2018-09-22.XXX.rxeEXrp0/eng.Verdana.exp0.box does not exist or is not readable
ERROR: /var/folders/8x/69qlvhl16n56q28vy__yp10r0000gn/T/eng-2018-09-22.XXX.rxeEXrp0/eng.Verdana.exp0.box does not exist or is not readable
I have:
Hadil-Sabbaghs-MacBook-Pro:tesseract hadilsabbagh$ tesseract -v
tesseract 4.0.0-beta.4-158-g02f9d
leptonica-1.76.0
libjpeg 9c : libpng 1.6.35 : libtiff 4.0.9 : zlib 1.2.11
Found AVX2
Found AVX
Found SSE
```
My user is allowed to create files in that directory, and the directory itself is present.
Please advise.
Hadil G. Sabbagh, Ph. D.
Hi, when I try installing this it breaks here:
[Wed Sep 26-19:00:26][MEPMBP2017][(👨💻)markphillips](~/Documents/Development/Tesseract/tesseract) =>>sudo update_dyld_shared_cache
Password:
update_dyld_shared_cache: warning: x86_64h skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-1.dat
update_dyld_shared_cache: warning: x86_64h skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-2.dat
update_dyld_shared_cache: warning: x86_64h skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-3.dat
update_dyld_shared_cache: warning: i386 skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-1.dat
update_dyld_shared_cache: warning: i386 skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-2.dat
update_dyld_shared_cache: warning: i386 skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-3.dat
update_dyld_shared_cache: warning: x86_64h rejected from cached dylibs: /System/Library/PrivateFrameworks/CreateML.framework/Versions/A/CreateML (("Could not find dependency '/System/Library/PrivateFrameworks/TuriCore.framework/Versions/A/TuriCore'"))
[Wed Sep 26-19:00:48][MEPMBP2017][(👨💻)markphillips](~/Documents/Development/Tesseract/tesseract) =>>
I really would like to get this working - I've spent a lot of time getting something running...any help or pointers to instructions would be greatly appreciated..
@FernandoGOT @Shreeshrii: can you add these instructions to the wiki? I would like to close this issue (related to the build process). It is too long, and other people have mixed other topics (training) into it.
@FernandoGOT: can you test the recent code?
I do not have a Mac. Would prefer if someone can test with current code and then post required instructions to wiki.
'make training' returns the following error:
combine_tessdata.cpp:100:9: error: use of undeclared identifier 'errno'
errno = 0;
^
combine_tessdata.cpp:103:20: error: use of undeclared identifier 'errno'
} else if (errno == 0) {
^
combine_tessdata.cpp:109:36: error: use of undeclared identifier 'errno'
argv[i], strerror(errno));
^
combine_tessdata.cpp:120:9: error: use of undeclared identifier 'errno'
errno = 0;
^
combine_tessdata.cpp:123:20: error: use of undeclared identifier 'errno'
} else if (errno != 0) {
^
combine_tessdata.cpp:125:46: error: use of undeclared identifier 'errno'
filename.string(), strerror(errno));
^
6 errors generated.
make[1]: *** [combine_tessdata.o] Error 1
make: *** [training] Error 2
Any fix to this issue??
Thanks
@FernandoGOT Thank you very much for such a detailed explanation but I can't make it work. When I say "make training" it gives me "Need to reconfigure project, so there are no errors" error. Also, I couldn't create ScrollView.jar. Is it possible to update this post? Thank you.
Please check your output after running this code:
./configure \
CPPFLAGS=-I/usr/local/opt/icu4c/include \
LDFLAGS=-L/usr/local/opt/icu4c/lib
I came across the same error and the log showed me an issue with _icu4c_ and also asked to install _pango_.
Once done, run the above code again and hopefully your error will be solved.
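One way to follow this advice is to save the configure output to a file and search it for the two dependencies mentioned. A minimal sketch (scan_configure_log and the file name configure.out are hypothetical; the exact wording of the configure messages is not reproduced here, so the search only looks for the words icu and pango):

```shell
# Hypothetical helper: show numbered lines of a saved configure log that
# mention icu or pango (case-insensitive).
scan_configure_log() {
  grep -i -n -E 'icu|pango' "$1"
}

# Save the output once, for example:
#   ./configure CPPFLAGS=... LDFLAGS=... 2>&1 | tee configure.out
# and then scan it.
if [ -f configure.out ]; then
  scan_configure_log configure.out || echo "no icu/pango messages found"
fi
```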
@escapist21 : is your compile problem with combine_tessdata still valid?
@zdenop The errno problem exists in the current version. I'll have a look at it.
I created a bug report (#1986) and patch (#1987) for the problem reported by @escapist21.
With that bug fix and following the instructions on the wiki for MacPorts (https://github.com/tesseract-ocr/tesseract/wiki/Compiling#macos-with-macports), I was able to build both Tess and the training tools. This was not a clean install from scratch, so it's possible that I had a necessary dependency already installed, but I think this issue can be closed and folks can open new issues if they find additional problems.
One thing I noticed is that there's a small issue with linking the OpenMP version that I haven't looked into, but the standard non-OpenMP build works fine.
@tfmorris: Can you please check a clean install from scratch, so we can be sure 4.0.0 is ready for Mac?
I don't usually have completely unused machines with none of the dependencies installed, but I've got a new work computer that I was able to use.
I made a minor edit to the homebrew instructions on the wiki page, but with that I was able to successfully build both the main program and the training tools using both MacPorts and Homebrew using current head of master.
@tfmorris,
Please share your minor edits.
With OpenMP you can get a major speedup, so I suggest investigating how to make it work on macOS with Clang + LLVM's OpenMP runtime.
Hi, when I try installing this it breaks here:
[Wed Sep 26-19:00:26][MEPMBP2017][(👨💻)markphillips](~/Documents/Development/Tesseract/tesseract) =>>sudo update_dyld_shared_cache
Password:
update_dyld_shared_cache: warning: x86_64h skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-1.dat
update_dyld_shared_cache: warning: x86_64h skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-2.dat
update_dyld_shared_cache: warning: x86_64h skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-3.dat
update_dyld_shared_cache: warning: i386 skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-1.dat
update_dyld_shared_cache: warning: i386 skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-2.dat
update_dyld_shared_cache: warning: i386 skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-3.dat
update_dyld_shared_cache: warning: x86_64h rejected from cached dylibs: /System/Library/PrivateFrameworks/CreateML.framework/Versions/A/CreateML (("Could not find dependency '/System/Library/PrivateFrameworks/TuriCore.framework/Versions/A/TuriCore'"))
[Wed Sep 26-19:00:48][MEPMBP2017][(👨💻)markphillips](~/Documents/Development/Tesseract/tesseract) =>>
I really would like to get this working - I've spent a lot of time getting something running. Any help or pointers to instructions would be greatly appreciated.
I am having this issue too, has this been resolved here or somewhere else??
@FernandoGOT Thank you very much for such a detailed explanation, but I can't make it work. When I run "make training" it gives me a "Need to reconfigure project, so there are no errors" error. Also, I couldn't create ScrollView.jar. Is it possible to update this post? Thank you.
@jamesoneill54 https://stackoverflow.com/questions/33259191/installing-libicu-dev-on-mac/33352241 worked for me.
I suggest to close this issue. Part of the information given here is no longer up to date.
I made a minor edit to the homebrew instructions on the wiki page,
Please share your minor edits.
@amitdo You can find my edits in the history for the wiki page.
With OpenMP you can get a major speedup, so I suggest investigating how to make it work on macOS with Clang + LLVM's OpenMP runtime.
That's not something I have time to tackle.
I suggest to close this issue. Part of the information given here is no longer up to date.
@stweil I suggested exactly that back in Oct 2018, so obviously agree. :) If people run into new problems, they can open new issues (or just update the wiki with the necessary corrections).
Did anyone manage to overcome the following error:
make training
Need to reconfigure project, so there are no errors
And if so how?
make training is disabled because some requirements are missing.
@stweil How do I diagnose which requirements are missing and why make training is disabled?
nvm,
configure: WARNING: pango 1.22.0 or higher is required, but was not found.
configure: WARNING: Training tools WILL NOT be built.
configure: WARNING: Try to install libpango1.0-dev package.
checking for cairo... no
configure: WARNING: Training tools WILL NOT be built because of missing cairo library.
configure: WARNING: Try to install libcairo-dev?? package.
checking that generated files are newer than configure... done
@stweil How do I diagnose which requirements are missing and why make training is disabled?
Obviously you found the answer yourself: configure says that pango 1.22.0 or higher is required, but was not found.
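One way to see which requirements configure could not find is to save its output and filter the warning lines, as in the sketch below. The function name `training_blockers` and the file name `configure.out` are made up for illustration; the only thing taken from this thread is the "configure: WARNING:" prefix shown in the output quoted above.

```shell
# Print the configure warnings that explain why the training tools are skipped.
# Usage: training_blockers <file containing saved ./configure output>
# Exit status: 0 if warnings were found, 1 if none, 2 if the file is unreadable.
training_blockers() {
  if [ ! -r "$1" ]; then
    echo "cannot read $1" >&2
    return 2
  fi
  # configure prefixes these messages with "configure: WARNING:".
  grep 'WARNING:' "$1"
}

# Typical use from the tesseract source directory (not executed here):
#   ./configure 2>&1 | tee configure.out
#   training_blockers configure.out
```

If the function prints nothing and returns 1, configure did not report any missing requirement in that log.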
I am getting an error when I run 'text2image --list_available_fonts --fonts_dir=/Library/Fonts'.
Error: 'text2image: not found'.
Can you please point me in a direction for tackling this issue?
MacOS : 10.14.6
@khalajink, I suggest to ask for help at the user forum.
@khalajink Did you install the training tools (including text2image)?
If so, where are they? Make sure you've included them on your $PATH.
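A quick way to check this from the shell is sketched below. `locate_tool` is a hypothetical helper name, and `/usr/local/bin` in the usage comment is only an assumption based on the default install prefix; use whatever directory the tools were actually installed into.

```shell
# Report where a command lives, or explain that it is not on PATH.
locate_tool() {
  tool="$1"
  if path=$(command -v "$tool"); then
    echo "$tool found at $path"
  else
    echo "$tool is not on PATH" >&2
    return 1
  fi
}

# Example (not executed here):
#   locate_tool text2image
# If the tool exists in a directory that is missing from PATH, add that
# directory for the current shell session, for example:
#   export PATH="/usr/local/bin:$PATH"
```

If the file does not exist anywhere, changing `$PATH` will not help; in that case the training tools were most likely never built or installed.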
@jtlz2 I followed @FernandoGOT's comment. I do not see an installation step for text2image there; I supposed it comes along with icu4c. How do I include it in $PATH?
When I try to run 'text2image --list_available_fonts --fonts_dir=/Library/Fonts',
the error is '-bash: /usr/local/bin/text2image: No such file or directory'.
Also, I see that you had an issue related to the pango version 3 days ago. I am facing it too, although I already have pango 1.44.6 installed. How did you solve it?
Solved the pango issue by following https://stackoverflow.com/questions/55361379/osx-compiling-training-tools-for-tesseract-4-0-pango-libraries-not-found
Also, I see that you had an issue related to the pango version 3 days ago. I am facing it too, although I already have pango 1.44.6 installed. How did you solve it?
@khalajink Yes, see my answer in that SO thread https://stackoverflow.com/a/57968945/1021819
@jtlz2 Yes, I followed your answer and got the pango issue fixed, but the text2image issue still exists. Any idea about it?
When I try to run 'text2image --list_available_fonts --fonts_dir=/Library/Fonts',
the error is '-bash: /usr/local/bin/text2image: No such file or directory'.
@khalajink Yes, see my answer in that SO thread https://stackoverflow.com/a/57968945/1021819
Thanks for the answer. The commands you shared didn't work for me but the instruction on how to diagnose the issue helped a lot. It turns out that I do not have zlib installed so I installed it and now I can finally build the training tools.
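To find out which library is missing before re-running configure, each dependency can be queried through pkg-config, roughly as sketched below. The helper name `check_modules` is hypothetical, and the module names in the usage comment (`pango`, `cairo`, `icu-uc`, `zlib`) are assumptions about what the libraries call their pkg-config files, so adjust them to your system.

```shell
# Report which pkg-config modules can be found.
# Honors $PKG_CONFIG if set, otherwise uses pkg-config from PATH.
# Returns 1 if at least one module is missing.
check_modules() {
  missing=0
  for mod in "$@"; do
    if "${PKG_CONFIG:-pkg-config}" --exists "$mod"; then
      echo "ok: $mod"
    else
      echo "MISSING: $mod"
      missing=1
    fi
  done
  return "$missing"
}

# Example (not executed here; module names are assumptions):
#   check_modules pango cairo icu-uc zlib
# Homebrew installs some libraries as keg-only, so their .pc files may only
# be visible after extending PKG_CONFIG_PATH, for example:
#   export PKG_CONFIG_PATH="/usr/local/opt/icu4c/lib/pkgconfig:$PKG_CONFIG_PATH"
```

A module reported as MISSING is either not installed or installed somewhere pkg-config does not search.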
I still have a different but somewhat similar problem in 2020.
I've successfully installed the latest Tesseract (master branch) on the latest macOS (11.1 Big Sur).
tesseract 5.0.0-alpha-855-g6d86
leptonica-1.80.0
libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.6 liblz4/1.9.2 libzstd/1.4.5
Found libcurl/7.64.1 SecureTransport (LibreSSL/2.8.3) zlib/1.2.11 nghttp2/1.41.0
However, my training tools (even though they have been installed) could not find the actual files.
For example, if I call a text2image I see the following error message
/usr/local/bin/text2image: error: '/usr/local/bin/.libs/text2image' does not exist
This script is just a wrapper for text2image.
See the libtool documentation for more information.
ERROR: Program text2image failed. Abort.
If I enable debug output for the bash script, I see the following problem:
❯ text2image --list_available_fonts --fonts_dir=~/Library/Fonts
(set -o) 2>/dev/null in
Basically, all the training tools can't find their actual executable files, which are located under `tesseract/.libs/`.
Did I miss something during the configuration?
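The error text above is printed by a libtool wrapper script rather than by the real program. A rough check for whether an installed tool is such a wrapper is sketched below; `is_libtool_wrapper` is a hypothetical helper, and it assumes the wrapper file contains the literal message it prints, which seems likely but is not confirmed in this thread.

```shell
# Return 0 if the file looks like a libtool wrapper script, 1 otherwise.
# Assumption: the wrapper contains the message text it prints on failure.
is_libtool_wrapper() {
  [ -f "$1" ] && grep -q 'This script is just a wrapper' "$1"
}

# Example (not executed here; the install directory is an assumption):
#   for tool in text2image lstmtraining combine_tessdata; do
#     if is_libtool_wrapper "/usr/local/bin/$tool"; then
#       echo "$tool is a wrapper, not the real binary"
#     fi
#   done
```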
@nnnikolay, I am sorry, that was my fault. It is now fixed with commit 421ebf0418f415c2ca270521243d4edc36dd44bf.
wow, @stweil thank you for your swift reaction. it seems that this step works now!