A sample is attached. The file is created by the tesstrain.sh process. On a Unix-like system, you can try the following to create the file:

ls -1 *.lstmf > lang.training_files.txt

You may need to put the path before *.lstmf, or the next step will not find the files.
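A minimal sketch of the same idea with absolute paths, so that the list works regardless of the directory the next step runs from. The helper name make_listfile and the example paths are assumptions for illustration; they are not part of Tesseract.

```shell
#!/bin/sh
# make_listfile is a hypothetical helper, not a Tesseract tool.
# It writes the absolute path of every .lstmf file in a directory
# to a list file, one path per line, with LF line endings.
make_listfile() {
    src_dir=$1
    out_file=$2
    # Resolve the directory to an absolute path.
    abs_dir=$(cd "$src_dir" && pwd) || return 1
    # Start with an empty list file.
    : > "$out_file"
    for f in "$abs_dir"/*.lstmf; do
        # Skip the unexpanded pattern when no .lstmf files exist.
        [ -e "$f" ] || continue
        printf '%s\n' "$f" >> "$out_file"
    done
}

# Usage (example paths, adjust to your setup):
#   make_listfile ~/tesstutorial/engtest ~/tesstutorial/engtest/eng.training_files.txt
```

Using printf with '\n' keeps the line endings as plain LF.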

Shreeshrii on 26 Apr 2017

The line break must be \n. This is what is inserted automatically when you hit the Enter key on the keyboard in Linux/macOS. In Windows the default is \r\n, which will confuse Tesseract.

http://stackoverflow.com/questions/8195839/choose-newline-character-in-notepad
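If the list file was already saved with Windows line endings, a small sketch like the following can detect and strip the carriage returns. The helper names has_cr and to_unix_newlines are made up for this example. Note that tr removes every \r in the file, not only those directly before \n, which is normally what you want for a list of paths.

```shell
#!/bin/sh
# Hypothetical helpers for checking and fixing line endings in a list file.

# Succeeds (exit status 0) if the file contains at least one carriage return.
has_cr() {
    # tr -cd deletes everything except \r; a non-empty result means CRs exist.
    [ -n "$(tr -cd '\r' < "$1")" ]
}

# Rewrites the file in place with all carriage returns removed.
to_unix_newlines() {
    in_file=$1
    tmp_file="$in_file.tmp.$$"
    tr -d '\r' < "$in_file" > "$tmp_file" && mv "$tmp_file" "$in_file"
}

# Usage (example file name):
#   has_cr lang.training_files.txt && to_unix_newlines lang.training_files.txt
```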

amitdo on 26 Apr 2017

Thanks @Shreeshrii, thanks @amitdo!

This raises a new question: how do I generate .lstmf files? I'm trying to train Tesseract on New York city directories (https://digitalcollections.nypl.org/items/b42866fb-b877-e4fc-e040-e00a1806275e); I have box files and TIFs. (Another question: can I already use WordStr box files? Some parts of the documentation say I can, others say I can't.)

ZIP file with one TIF and box file I'm trying to use: Wilson1852_0.zip (https://github.com/tesseract-ocr/tesseract/files/958420/Wilson1852_0.zip). Out of the box, Tesseract already performs pretty well, but 150 years ago, house numbers in New York sometimes included ½, so I have to include this character in the desired_characters file:

bertspaan on 26 Apr 2017

WordStr box files are not yet supported (AFAIK).

If you have box files in 3.0 format, you can use jTessBoxEditor to add the
end-of-line tab character and use them.

When I want to test using box/tiff pairs, I copy the files to the training
directory - by modifying tesstrain.sh.

mkdir -p ${TRAINING_DIR}
tlog "\n=== Starting training for language '${LANG_CODE}'"

cp /home/shree/tesstutorial/larmbig/*.tif "${TRAINING_DIR}/"

cp /home/shree/tesstutorial/larmbig/*.box "${TRAINING_DIR}/"

Then use a command similar to the following (adjusted for the location of your files)
and use just one font similar to the one used in your box/tiff pairs.

You may need to modify tesstrain_utils.sh to make sure that all your
box/tiff pairs are selected (based on the naming).

training/tesstrain.sh \
--fonts_dir /mnt/c/Windows/Fonts \
--training_text ../langdata/eng/eng.training_text \
--langdata_dir ../langdata \
--tessdata_dir ./tessdata \
--lang eng \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--fontlist "Arial" \
--output_dir ~/tesstutorial/engtest

Shreeshrii on 26 Apr 2017

What's the output for the 388½ in this example and in other places in this book?

amitdo on 26 Apr 2017

@amitdo 388½ becomes 3884.

bertspaan on 26 Apr 2017

@Shreeshrii : but I have no fonts_dir, fontlist, etc, since I am only training from images.

bertspaan on 26 Apr 2017

I'm afraid this way of training is not well documented right now.

I have not yet tried training with 4.00.

amitdo on 26 Apr 2017

@amitdo Is it not well documented, or not yet possible at all?

bertspaan on 26 Apr 2017

@Shreeshrii Do you have examples of this process?

bertspaan on 26 Apr 2017

Your box file is in WordStr format. That cannot be used with the existing
process.

If you had a box file in the older 3.04 format, then the hacked version of
the script would work.

On 26-Apr-2017 9:02 PM, "ShreeDevi Kumar" shreeshrii@gmail.com wrote:

I will post my modified versions of the scripts tomorrow, don't have
access to my pc right now.

Shreeshrii on 26 Apr 2017

I can provide a box file in 3.04 format tomorrow; I'll post the file here.

bertspaan on 26 Apr 2017

As said, the WordStr format is not really supported right now.

You can still train with the regular box format + tab lines to signal line breaks.

Training from 'real' images, as opposed to synthetic ones (made with text2image), is what's not well documented.
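As a quick sanity check for the regular box format with tab lines mentioned above, the sketch below counts how many lines in a box file start with a literal tab character. It assumes, as the comments in this thread describe, that each end-of-line marker is a box line whose symbol is a tab; the helper name count_eol_markers is hypothetical.

```shell
#!/bin/sh
# count_eol_markers is a hypothetical helper, not a Tesseract tool.
# It prints the number of lines in a box file whose first character is a tab,
# i.e. the lines assumed here to mark the end of a text line.
count_eol_markers() {
    awk 'substr($0, 1, 1) == "\t" { n++ } END { print n + 0 }' "$1"
}

# Usage (example file name):
#   count_eol_markers Wilson1852_0.box
```

A result of 0 suggests the file is still in the plain 3.0x format and has no line-break markers yet.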

amitdo on 26 Apr 2017

Out of the box, Tesseract already performs pretty well, but 150 years ago, house numbers in New York sometimes included ½, so I have to include this character in the desired_characters file:

@bertspaan the desired_characters file is not directly used for training. It is used at Google for building the large training text required for LSTM training.

I couldn't find any font which has 1/2 the way it is printed here, so it may be difficult to create a synthetic image for it.

Shreeshrii on 27 Apr 2017

Update: The LSTM training process has been modified since this post was written. These will not work as is. You can use them as reference.

Here are the modified scripts:
boxtrain.zip
You will need to copy your box/tiff pairs to the
../langdata/eng/ directory
for them to be used.

You cannot use the fine-tune process because 1/2 is not included in the unicharset of the current LSTM traineddata for English. @theraysmith, will this change with your next update?

The following commands outline the process you may need to follow to do the LSTM training - top layer.

training/boxtrain.sh \
  --fonts_dir  /mnt/c/Windows/Fonts \
  --training_text ../langdata/eng/nyd.training_text \
  --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --lang eng  \
  --exposures "-2 -1 0" \
  --fontlist "Century Schoolbook" "Dejavu Serif" "Garamond" "Liberation Serif" "Times New Roman," "FreeSerif" "Georgia" \
  --output_dir ~/tesstutorial/nydlegacy

cp ~/tesstutorial/nydlegacy/eng.traineddata ./tessdata/nydlegacy.traineddata

training/boxtrain.sh \
  --fonts_dir  /mnt/c/Windows/Fonts \
  --training_text ../langdata/eng/nyd.training_text \
  --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --exposures "-2 -1" \
  --fontlist "Bookman Old Style Semi-Light"  \
  --output_dir ~/tesstutorial/nyd

rm -rf ~/tesstutorial/eng_from_nyd
mkdir -p ~/tesstutorial/eng_from_nyd

combine_tessdata -e ../tessdata/eng.traineddata \
   ~/tesstutorial/eng_from_nyd/eng.lstm

lstmtraining  \
   -U ~/tesstutorial/nyd/eng.unicharset \
  --train_listfile ~/tesstutorial/nyd/eng.training_files.txt \
  --script_dir ../langdata   \
  --append_index 5 --net_spec '[Lfx256 O1c105]' \
  --continue_from ~/tesstutorial/eng_from_nyd/eng.lstm \
  --model_output ~/tesstutorial/eng_from_nyd/nyd \
  --debug_interval -1 \
  --target_error_rate 0.01

lstmtraining \
  --continue_from ~/tesstutorial/eng_from_nyd/nyd_checkpoint \
  --model_output ~/tesstutorial/eng_from_nyd/nyd.lstm \
  --stop_training

cp ../tessdata/eng.traineddata ~/tesstutorial/eng_from_nyd/nyd.traineddata

combine_tessdata -o ~/tesstutorial/eng_from_nyd/nyd.traineddata \
  ~/tesstutorial/eng_from_nyd/nyd.lstm \
  ~/tesstutorial/nyd/eng.lstm-number-dawg \
  ~/tesstutorial/nyd/eng.lstm-punc-dawg \
  ~/tesstutorial/nyd/eng.lstm-word-dawg 

cp ~/tesstutorial/eng_from_nyd/nyd.traineddata ./tessdata/nyd.traineddata

Shreeshrii on 27 Apr 2017

Thanks so much, I will try all this next week!

bertspaan on 29 Apr 2017

Also see https://github.com/nypl-spacetime/ocr-scripts

Shreeshrii on 30 Apr 2017

@Shreeshrii: ha, that's my repository!

bertspaan on 30 Apr 2017

@bertspaan

I see that you have trained models for ocropy.

Is there anything you want to share about ocropy vs. Tesseract 4.00, accuracy wise, with your dataset?

amitdo on 30 Apr 2017

@bertspaan :-)

Since you already have an OCR process working, I suggest you wait for Ray to update code for training from scanned images and improve traineddata to support 1/2.

My hacked training is only a proof of concept (I trained till about 2% accuracy), so while it recognizes 1/2 as %, other letters may not be as accurate as with the traineddata from the repo.

Shreeshrii on 1 May 2017

@amitdo: yes, we've trained ocropy on a very small number of sentences, and the results are already pretty good. See 1854-55.lines.ndjson.zip; this file contains all bounding boxes with ocropy output. However, ocropy sometimes crashes and its documentation is not too good, which is why we started experimenting with Tesseract 4 last week. I haven't compared the out-of-the-box output of Tesseract 4 with our trained ocropy model in detail.

@Shreeshrii: ok, I'll try some of the commands you've posted here, but I'm not going to spend much time on trying to train Tesseract, I'll wait until training from scanned images is improved.

We are also building dictionaries of possible names, streets and professions, so we should be able to fix many OCR errors afterwards.

Thank you both so much for your help!

bertspaan on 1 May 2017

@Shreeshrii
I am also trying to fine-tune Tesseract 4.0 with images. I am confused by several of the parameters above.
First, what is the training_text (nyd.training_text) file? Do I need to create it? If so, how do I create it?
Second, do I just need to specify --training_text and --output_dir and leave the other parameters unchanged?

minly on 15 Aug 2017

Please see the wiki page on training; changes have been made to the LSTM training process.

Shreeshrii on 15 Aug 2017

combine_lang_model which takes as input an input_unicharset and script_dir (script_dir points to the langdata directory) and optional word list files...
I have the input_unicharset, but I don't know how to get the script_dir.

CoCa520 on 16 Aug 2017

https://github.com/tesseract-ocr/langdata is the script_dir.
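A rough sketch of how the pieces fit together, assuming a local clone of the langdata repository is passed as the script_dir. The helper name build_starter_traineddata is made up, and the flag names are given as I understand them from the training wiki of that period; check them against combine_lang_model --help for your build before relying on them.

```shell
#!/bin/sh
# build_starter_traineddata is a hypothetical wrapper around combine_lang_model.
# Arguments: <unicharset> <langdata_dir> <output_dir> <lang>
build_starter_traineddata() {
    unicharset=$1
    langdata_dir=$2
    out_dir=$3
    lang=$4
    # The script_dir must be an existing checkout of the langdata repository.
    if [ ! -d "$langdata_dir" ]; then
        echo "langdata directory not found: $langdata_dir" >&2
        return 1
    fi
    mkdir -p "$out_dir"
    # Flag names are an assumption based on the training wiki; verify locally.
    combine_lang_model \
        --input_unicharset "$unicharset" \
        --script_dir "$langdata_dir" \
        --output_dir "$out_dir" \
        --lang "$lang"
}

# Usage (example paths):
#   git clone https://github.com/tesseract-ocr/langdata ../langdata
#   build_starter_traineddata ./eng.unicharset ../langdata ./starter eng
```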

Shreeshrii on 16 Aug 2017

@CoCa520 Also see https://github.com/tesseract-ocr/tesseract/issues/590#issuecomment-322685025

Shreeshrii on 16 Aug 2017

@Shreeshrii Thank you!
But I really don't understand how to create lstm files.
Can you show me the code?

I have tried:
tesseract eng.font.exp0.tif eng.font.exp0.box.lstm.train
But it gives:
Error during processing.
ObjectCache(0x7f098f0849a0)::~ObjectCache(): WARNING! LEAK! object 0x29173e0 still has count 1 (id /usr/local/tesseract/share/tessdata/eng.traineddatapunc-dawg)
ObjectCache(0x7f098f0849a0)::~ObjectCache(): WARNING! LEAK! object 0x2916420 still has count 1 (id /usr/local/tesseract/share/tessdata/eng.traineddataword-dawg)
ObjectCache(0x7f098f0849a0)::~ObjectCache(): WARNING! LEAK! object 0x2916240 still has count 1 (id /usr/local/tesseract/share/tessdata/eng.traineddatanumber-dawg)

CoCa520 on 30 Aug 2017

tesseract eng.font.exp0.tif eng.font.exp0.box lstm.train

You need a space after box; lstm.train is the name of the config file and must be a separate argument.

The best method is to follow the training tutorial. If you want more pages,
change max_page in tesstrain_utils.sh.
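To run the corrected command over a whole directory of tif/box pairs, a loop along these lines could be used. This is only a sketch: the helper name make_lstmf_for_pairs is hypothetical, it keeps the same argument order as the command above (image, output base, config), and, as noted elsewhere in this thread, training from tif/box pairs was not officially supported in 4.0 at the time.

```shell
#!/bin/sh
# make_lstmf_for_pairs is a hypothetical helper.
# It runs tesseract with the lstm.train config for every .tif in a directory
# that has a matching .box file next to it.
make_lstmf_for_pairs() {
    dir=$1
    for tif in "$dir"/*.tif; do
        # Skip the unexpanded pattern when there are no .tif files.
        [ -e "$tif" ] || continue
        base=${tif%.tif}
        # Skip images without a matching box file.
        if [ ! -f "$base.box" ]; then
            echo "skipping $tif: no matching box file" >&2
            continue
        fi
        # Same form as the command above: image, output base, config name.
        tesseract "$tif" "$base.box" lstm.train
    done
}

# Usage (example directory):
#   make_lstmf_for_pairs ~/tesstutorial/pairs
```

The second argument is only the output base name, so a different base can be substituted if you prefer other output file names.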

Shreeshrii on 30 Aug 2017

Training tutorial?
Do you mean https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00?
I only have tif/box pairs, so I came here for more information.

CoCa520 on 30 Aug 2017

4.0 training with tif/box pairs is not yet supported.

Shreeshrii on 30 Aug 2017

@Shreeshrii
I want to use Tesseract 4.00 to recognize the model names of machines. Each model name is a combination of letters and numbers located somewhere on the nameplate, so I have collected many pictures containing the various nameplates of each machine.
After a series of processing steps, I have many pictures of model names, such as these:

https://user-images.githubusercontent.com/22894599/29905526-d8207088-8e41-11e7-8a94-60661df186c8.png
https://user-images.githubusercontent.com/22894599/29905561-ff00a722-8e41-11e7-934d-8e87c61433df.png

I then ran all the model pictures through Tesseract, but the accuracy is not very good, so I am trying to train Tesseract 4.00 with the model pictures.
The Tesseract 4.0 training tutorial says that there are two ways to create training data, and I use the first option: each line in the box file matches a 'character' (glyph) in the tiff image.

If 4.0 training with tif/box pairs is not yet supported, how can I raise the accuracy?

CoCa520 on 31 Aug 2017

Others who have done license plate recognition may be able to give you
better tips.

For your use case, I think using an older version of Tesseract, especially
one which supports the 'digits' config file for limiting output to numbers,
may be a better choice than using 4.0alpha.
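For illustration, the two wrappers below sketch how the output could be restricted with an older Tesseract. The first uses the 'digits' config file mentioned above. The second is my own assumption rather than something suggested in this thread: since the model names mix letters and numbers, it sets the tessedit_char_whitelist variable instead. The function names and file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrappers; the image and output names are placeholders.

# Limit output to numbers using the 'digits' config file.
# Arguments: <image> <output_base>
ocr_digits() {
    tesseract "$1" "$2" digits
}

# Limit output to an explicit set of characters (assumed alternative).
# Arguments: <image> <output_base> <allowed_characters>
ocr_whitelist() {
    tesseract "$1" "$2" -c "tessedit_char_whitelist=$3"
}

# Usage (example file names):
#   ocr_digits nameplate.png result
#   ocr_whitelist nameplate.png result ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-
```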

Shreeshrii on 31 Aug 2017

also see https://github.com/openalpr/openalpr which uses tesseract-ocr

Shreeshrii on 31 Aug 2017

@minly, hello, I have the same problems as you. Have you resolved them?

694376965 on 18 Oct 2017

@CoCa520, hello, did you manage to generate the lstm files in the end? I want to know how to generate lstm files from *.tif and *.box files.

694376965 on 18 Oct 2017

Bumping this..

I tried running the steps mentioned here: https://github.com/tesseract-ocr/tesseract/issues/841#issuecomment-298174100

I'm getting this error:

ERROR: Non-existent flag -D
ERROR: /var/folders/vz/yqbfrgj91hqdj76mmpl2vjmw0000gn/T/tmp.W8q07ZtQ/eng/unicharset does not exist or is not readable

It does not create a .traineddata, which is what I expected from using the --linedata_only parameter.

I'll try to compare what edits @Shreeshrii put in place, but any guidance would be appreciated.

Edit:

boxtrain.zip

I've diffed the three files to a version in April, grabbed the "intent" of @Shreeshrii 's edit, and applied it to the newest versions of the three files.

boxtrain/boxtrain.sh --fonts_dir ~/Library/Fonts/ --training_text ../langdata/eng/eng.training_text --langdata_dir ../langdata --tessdata_dir ./tessdata/ --lang eng --fontlist "Calibri" --output_dir ./lstm1

While editing, I saw that @Shreeshrii was creating a folder named "${LANG_DATA_DIR}/GT", and copying TIF/BOX files in and out of it. I tried placing my TIF/BOX pairs in the "GT" folder, and to be honest I am not clear on what the .traineddata contains.

dzjin on 13 Nov 2017

@Shreeshrii was creating a folder named "${LANG_DATA_DIR}/GT", and copying TIF/BOX in/out from it. I tried placing my TIF/BOX pairs in the "GT" folder

GT was my groundtruth folder, in which I also copied the box/tiff pairs for future reference.

They are NOT used in further training, as LSTM training now uses the generated lstmf files and starter traineddata.

Shreeshrii on 27 Mar 2018

@zdenop Please close this issue.

Shreeshrii on 25 Feb 2019

Tesseract: Format of train_listfile
