I am following the tutorial on fine-tuning an LSTM model at https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact
Training works when I follow the tutorial and fine-tune from the .lstm extracted from tessdata/best/eng.traineddata. However, training fails when I extract the .lstm from tessdata/eng.traineddata instead.
Tesseract Version: tesseract 4.0.0-beta.1-232-g45a6
Platform:
The code I am trying to execute:
training/lstmtraining --model_output ~/tesstutorial/impact_from_full/impact --continue_from ~/tesstutorial/impact_from_full/eng.lstm --traineddata tessdata/eng.traineddata --train_listfile ~/tesstutorial/engeval/eng.training_files.txt --max_iterations 400
The eng.lstm is extracted by "training/combine_tessdata -e tessdata/eng.traineddata ~/tesstutorial/impact_from_full/eng.lstm"
The same command works if I use tessdata/best/eng.traineddata instead.
The error that I got:
Loaded file /home/dlai/tesstutorial/impact_from_full/eng.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from /home/dlai/tesstutorial/impact_from_full/eng.lstm
Loaded 72/72 pages (1-72) of document /home/dlai/tesstutorial/engeval/eng.FreeSans.exp0.lstmf
!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
Segmentation fault (core dumped)
Thanks very much
Dihui
> However the training failed when I try to extract .lstm from tessdata/eng.traineddata
Both tessdata and tessdata_fast contain integer models, which cannot be used for lstmtraining.
Only the float models in tessdata_best can be used for it.
Of course, it should give an appropriate error message and not crash.
@stweil Is it possible to add an error msg for 4.0.0?
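For reference, the working variant of the commands in the report differs only in which model the LSTM component is extracted from. The following is a minimal sketch, not an official recipe: the function name `finetune_from_best` and its three arguments are hypothetical, the flags are the ones used in the report, and it assumes `combine_tessdata` and `lstmtraining` are on the PATH.

```shell
# Sketch: fine-tune starting from a float ("best") model.
# finetune_from_best is a hypothetical helper; its arguments are:
#   $1  path to the float traineddata (e.g. tessdata/best/eng.traineddata)
#   $2  output directory for the extracted .lstm and the checkpoints
#   $3  training list file (e.g. eng.training_files.txt)
finetune_from_best() {
  best="$1"
  out="$2"
  listfile="$3"

  mkdir -p "$out" || return 1

  # Extract the LSTM component from the float model.
  combine_tessdata -e "$best" "$out/eng.lstm" || return 1

  # Continue training from the extracted float network,
  # using the same flags as in the original report.
  lstmtraining --model_output "$out/impact" \
    --continue_from "$out/eng.lstm" \
    --traineddata "$best" \
    --train_listfile "$listfile" \
    --max_iterations 400
}
```

It could be called as `finetune_from_best tessdata/best/eng.traineddata ~/tesstutorial/impact_from_best ~/tesstutorial/engeval/eng.training_files.txt`, where the output directory name is only an example.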
Thanks for your response, Shreeshrii,
I did read some comments about integer models in your documentation and should have guessed this.
Still, is there a way to convert a model fine-tuned from tessdata_best to integer? The tessdata_best model is too slow for our application.
Dihui
The best files can be converted to integer with the following command.
Usage for compacting LSTM component to int:
combine_tessdata -c traineddata_file
The tessdata repo has the integer version of best models plus the old legacy model also.
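A minimal sketch of how the `-c` option above might be applied to a fine-tuned model. It assumes the fine-tuned model has already been written out as a .traineddata file, and that `combine_tessdata -c` rewrites the file it is given in place, which is why the sketch works on a copy so the float model stays available for further training. The function name `compact_to_int` and the file names in the usage line are hypothetical.

```shell
# Sketch: produce an integer copy of a float traineddata file.
# compact_to_int is a hypothetical helper; its arguments are:
#   $1  float traineddata file (left untouched)
#   $2  destination for the integer copy
compact_to_int() {
  src="$1"
  dst="$2"

  # Work on a copy: -c is assumed to modify its argument in place.
  cp "$src" "$dst" || return 1

  # Compact the LSTM component of the copy to integer.
  combine_tessdata -c "$dst"
}
```

For example, `compact_to_int impact_float.traineddata impact_int.traineddata` would leave the float file usable for lstmtraining and give a faster integer file for recognition.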
@DihuiLai Please change issue title to
Segmentation fault when using integer models for LSTM training
Segmentation fault when using integer models for LSTM trining
s/trining/training/
> @stweil Is it possible to add an error msg for 4.0.0?
Yes, I think so. I added the issue to the planning list.
@zdenop, please add the "bug" label to this issue.
@stweil Thanks for fixing the typo :-) Good to know that it can be fixed for 4.0.0.
Changed @Shreeshrii
The problem is solved and I am closing the issue.
AFAIK this issue was not solved.
It was only clarified that it was caused by training from an integer model, which is not supported.
So that's an error which can easily be avoided. Of course, the error handling needs to be improved here. @zdenop or @DihuiLai, please reopen this issue.
Although this is a bug, I think it can be fixed after 4.0.0, as training won't be done by most users of Tesseract.
@stweil: can you send a PR so we can fix this for the 4.0 release?