Tesseract: Format of train_listfile

Created on 25 Apr 2017  ·  38 Comments  ·  Source: tesseract-ocr/tesseract

The documentation does not seem to specify the format of the text files required by the train_listfile option. Are there examples available of eng.training_files.txt?

eng.training_files.txt

Sample attached. The file is created by the tesstrain.sh process. On a Unix-like system you can also create it yourself with:

ls -1 *.lstmf > lang.training_files.txt

You may need to give the path before *.lstmf or the next step will not find the files.
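As a rough sketch of the path advice above, the list can be written with absolute paths so that the next step finds the files regardless of the working directory. The helper name make_listfile and its two arguments are made up for illustration; this is not something tesstrain.sh provides.

```shell
# make_listfile is a hypothetical helper, not part of Tesseract.
# It writes one absolute *.lstmf path per line, each terminated by "\n".
make_listfile() {
  data_dir=$1   # directory that holds the *.lstmf files
  out_file=$2   # e.g. lang.training_files.txt
  abs_dir=$(cd "$data_dir" && pwd) || return 1
  find "$abs_dir" -maxdepth 1 -type f -name '*.lstmf' | sort > "$out_file"
}

# Usage (paths are examples only):
#   make_listfile ~/tesstutorial/engtest lang.training_files.txt
```

Sorting is only there to make the output reproducible; it is an assumption of this sketch that the order of the entries does not matter to you.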

The line break must be \n. This is what is inserted automatically when you press the Enter key on Linux/macOS. On Windows the default is \r\n, which will confuse Tesseract.

http://stackoverflow.com/questions/8195839/choose-newline-character-in-notepad
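If the list file was created or edited on Windows, one way to strip the carriage returns from the command line is sketched below. strip_cr is a hypothetical helper name, and it rewrites the file in place, so keep a copy if you are unsure.

```shell
# strip_cr is a hypothetical helper, not part of Tesseract.
# It removes every carriage return so that lines end in "\n" only.
strip_cr() {
  tr -d '\r' < "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Usage (file name is an example only):
#   strip_cr lang.training_files.txt
```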

Thanks @Shreeshrii, thanks @amitdo!

This raises a new question: how do I generate .lstmf files? I'm trying to train Tesseract on New York city directories (https://digitalcollections.nypl.org/items/b42866fb-b877-e4fc-e040-e00a1806275e); I have box files and TIFs. (Another question: can I already use WordStr box files? Some parts of the documentation say I can, others say I can't.)

ZIP file with one TIF and box file I'm trying to use: Wilson1852_0.zip (https://github.com/tesseract-ocr/tesseract/files/958420/Wilson1852_0.zip). Out of the box, Tesseract already performs pretty well, but 150 years ago, house numbers in New York sometimes included ½, so I have to include this character in the desired_characters file:

image

WordStr box files are not yet supported (AFAIK).

If you have box files in 3.0 format, you can use jTessBoxEditor to add the end-of-line tab character and use them.

When I want to test using box/tiff pairs, I copy the files to the training directory by modifying tesstrain.sh:

mkdir -p ${TRAINING_DIR}
tlog "\n=== Starting training for language '${LANG_CODE}'"

cp /home/shree/tesstutorial/larmbig/*.tif "${TRAINING_DIR}/"

cp /home/shree/tesstutorial/larmbig/*.box "${TRAINING_DIR}/"

Then use a command similar to the following (based on the location of your files) and use just one font, similar to the one used in your box/tiff pairs.

You may need to modify tesstrain_utils.sh to make sure that all your box/tiff pairs are selected (based on the naming).

training/tesstrain.sh \
--fonts_dir /mnt/c/Windows/Fonts \
--training_text ../langdata/eng/eng.training_text \
--langdata_dir ../langdata \
--tessdata_dir ./tessdata \
--lang eng \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--fontlist "Arial" \
--output_dir ~/tesstutorial/engtest
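Because the pairs are selected by name, it may help to check that every image has a box file with the same base name before copying them. The following is only a sketch; list_unpaired_tifs is a hypothetical helper, and it assumes the pairs are named foo.tif / foo.box in a single directory.

```shell
# list_unpaired_tifs is a hypothetical helper, not part of Tesseract.
# It prints every *.tif in the given directory that has no matching *.box.
list_unpaired_tifs() {
  for tif in "$1"/*.tif; do
    [ -e "$tif" ] || continue        # the glob matched nothing
    box="${tif%.tif}.box"            # same base name, .box extension
    [ -f "$box" ] || printf '%s\n' "$tif"
  done
}

# Usage (directory is an example only):
#   list_unpaired_tifs /home/shree/tesstutorial/larmbig
```

An empty output means that every .tif found has a .box file next to it.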

What's the output for the 388½ in this example and in other places in this book?

@amitdo 388½ becomes 3884.

@Shreeshrii : but I have no fonts_dir, fontlist, etc, since I am only training from images.

I'm afraid this way of training is not well documented right now.

I have not yet tried training with 4.00.

@amitdo Is it not well documented, or not yet possible at all?

@Shreeshrii Do you have examples of this process?

Your box file is in WordStr format. That cannot be used with the existing process.

If you had a box file in the older 3.04 format, then the hacked version of the script would work.

I will post my modified versions of the scripts tomorrow; I don't have access to my PC right now.

I can provide a box file in 3.04 format tomorrow, I'll post the file here.

As said, the WordStr format is not really supported right now.

You can still train with the regular box format + tab lines to signal line breaks.
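To see whether a box file already contains such tab lines, a quick count can help. This sketch assumes that an end-of-line marker is a box line whose first character is a tab; count_eol_markers is a hypothetical helper name.

```shell
# count_eol_markers is a hypothetical helper, not part of Tesseract.
# It prints the number of lines in a box file that start with a tab,
# which this sketch assumes are the end-of-line markers.
count_eol_markers() {
  tab=$(printf '\t')
  grep -c "^$tab" "$1"
}

# Usage (file name is an example only):
#   count_eol_markers Wilson1852_0.box
```

A count of 0 would suggest that the file is still in the older format and needs the tab lines added, for example with jTessBoxEditor as mentioned earlier in the thread.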

Training from 'real' images, as opposed to synthetic ones (generated with text2image), is what's not well documented.

Out of the box, Tesseract already performs pretty well, but 150 years ago, house numbers in New York sometimes included ½, so I have to include this character in the desired_characters file:

@bertspaan the desired_characters file is not directly used for training. It is used at Google for building the large training text required for LSTM training.

I couldn't find any font which has ½ the way it is printed here, so it may be difficult to create a synthetic image for it.

Update: The LSTM training process has been modified since this post was written. The scripts and commands below will not work as is, but you can use them as a reference.

Here are the modified scripts: boxtrain.zip

You will need to copy your box/tiff pairs to the ../langdata/eng/ directory for them to be used.

You cannot use the fine-tuning process because ½ is not included in the unicharset of the current LSTM traineddata for English. @theraysmith, will this change with your next update?
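One way to check a unicharset for a given character is sketched below. It assumes you already have the unicharset as a plain text file, that its first line is an entry count, and that each following line starts with the character as its first space-separated field; in_unicharset is a hypothetical helper name.

```shell
# in_unicharset is a hypothetical helper, not part of Tesseract.
# It succeeds (exit status 0) if the character is the first field of any
# line after the count line, and fails (exit status 1) otherwise.
in_unicharset() {
  awk -v c="$1" 'NR > 1 && $1 == c { found = 1 } END { exit !found }' "$2"
}

# Usage (file name is an example only):
#   in_unicharset '½' eng.unicharset && echo "present" || echo "missing"
```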

The following commands outline the process you may need to follow to do the LSTM training - top layer.

training/boxtrain.sh \
  --fonts_dir  /mnt/c/Windows/Fonts \
  --training_text ../langdata/eng/nyd.training_text \
  --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --lang eng  \
  --exposures "-2 -1 0" \
  --fontlist "Century Schoolbook" "Dejavu Serif" "Garamond" "Liberation Serif" "Times New Roman," "FreeSerif" "Georgia" \
  --output_dir ~/tesstutorial/nydlegacy

cp ~/tesstutorial/nydlegacy/eng.traineddata ./tessdata/nydlegacy.traineddata

training/boxtrain.sh \
  --fonts_dir  /mnt/c/Windows/Fonts \
  --training_text ../langdata/eng/nyd.training_text \
  --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --exposures "-2 -1" \
  --fontlist "Bookman Old Style Semi-Light"  \
  --output_dir ~/tesstutorial/nyd

rm -rf ~/tesstutorial/eng_from_nyd
mkdir -p ~/tesstutorial/eng_from_nyd

combine_tessdata -e ../tessdata/eng.traineddata \
   ~/tesstutorial/eng_from_nyd/eng.lstm

lstmtraining  \
   -U ~/tesstutorial/nyd/eng.unicharset \
  --train_listfile ~/tesstutorial/nyd/eng.training_files.txt \
  --script_dir ../langdata   \
  --append_index 5 --net_spec '[Lfx256 O1c105]' \
  --continue_from ~/tesstutorial/eng_from_nyd/eng.lstm \
  --model_output ~/tesstutorial/eng_from_nyd/nyd \
  --debug_interval -1 \
  --target_error_rate 0.01

lstmtraining \
  --continue_from ~/tesstutorial/eng_from_nyd/nyd_checkpoint \
  --model_output ~/tesstutorial/eng_from_nyd/nyd.lstm \
  --stop_training

cp ../tessdata/eng.traineddata ~/tesstutorial/eng_from_nyd/nyd.traineddata

combine_tessdata -o ~/tesstutorial/eng_from_nyd/nyd.traineddata \
  ~/tesstutorial/eng_from_nyd/nyd.lstm \
  ~/tesstutorial/nyd/eng.lstm-number-dawg \
  ~/tesstutorial/nyd/eng.lstm-punc-dawg \
  ~/tesstutorial/nyd/eng.lstm-word-dawg 

cp ~/tesstutorial/eng_from_nyd/nyd.traineddata ./tessdata/nyd.traineddata

Thanks so much, I will try all this next week!

@Shreeshrii: ha, that's my repository!

@bertspaan

I see that you have trained models for ocropy.

Is there anything you want to share about ocropy vs. Tesseract 4.00, accuracy wise, with your dataset?

@bertspaan :-)

Since you already have an OCR process working, I suggest you wait for Ray to update code for training from scanned images and improve traineddata to support 1/2.

My hacked training is only a proof of concept (I trained till about 2% accuracy), so while it recognizes 1/2 as %, other letters may not be as accurate as with the traineddata from the repo.

@amitdo: yes, we've trained ocropy on a very small number of sentences, and the results are already pretty good. See 1854-55.lines.ndjson.zip; this file contains all bounding boxes with ocropy output. However, ocropy sometimes crashes and its documentation is not too good, which is why we started experimenting with Tesseract 4 last week. I haven't compared the out-of-the-box output of Tesseract 4 with our trained ocropy model in detail.

@Shreeshrii: ok, I'll try some of the commands you've posted here, but I'm not going to spend much time on trying to train Tesseract, I'll wait until training from scanned images is improved.

We are also building dictionaries of possible names, streets and professions, so we should be able to fix many OCR errors afterwards.

Thank you both so much for your help!

@Shreeshrii
I am also trying to fine-tune Tesseract 4.0 with images, and I am confused by several of the parameters below.
First, what is the training_text file (nyd.training_text)? Do I need to create it, and if so, how?
Second, do I just need to specify --training_text and --output_dir and leave the other parameters unchanged?

image

Please see the wiki page on training, there have been changes made to LSTM training process.

"combine_lang_model which takes as input an input_unicharset and script_dir (script_dir points to the langdata directory) and optional word list files..."
I have an input_unicharset, but I don't know how to get the script_dir.

https://github.com/tesseract-ocr/langdata is the script_dir.

@Shreeshrii Thank you!
But I really can't work out how to create the .lstmf files.
Can you show me the code?

I have tried:
tesseract eng.font.exp0.tif eng.font.exp0.box.lstm.train
But it gives:
Error during processing.
ObjectCache(0x7f098f0849a0)::~ObjectCache(): WARNING! LEAK! object 0x29173e0 still has count 1 (id /usr/local/tesseract/share/tessdata/eng.traineddatapunc-dawg)
ObjectCache(0x7f098f0849a0)::~ObjectCache(): WARNING! LEAK! object 0x2916420 still has count 1 (id /usr/local/tesseract/share/tessdata/eng.traineddataword-dawg)
ObjectCache(0x7f098f0849a0)::~ObjectCache(): WARNING! LEAK! object 0x2916240 still has count 1 (id /usr/local/tesseract/share/tessdata/eng.traineddatanumber-dawg)

tesseract eng.font.exp0.tif eng.font.exp0.box lstm.train

You need a space after box so that lstm.train is given as the name of the config file.

The best method is to follow the training tutorial. If you want more pages, change max_page in tesstrain_utils.sh.

The training tutorial?
Do you mean https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 ?
I only have tif/box pairs, so I came here for more information.

4.0 training with tif/box pairs is not yet supported.

@Shreeshrii
I want to use Tesseract 4.00 to recognize the models of machines. The model information is a combination of characters and numbers located somewhere on the nameplate, so I have collected lots of pictures containing the various nameplates of each machine.
After a series of processing steps, I have lots of pictures of models like the following:
image (https://user-images.githubusercontent.com/22894599/29905526-d8207088-8e41-11e7-8a94-60661df186c8.png)
image (https://user-images.githubusercontent.com/22894599/29905561-ff00a722-8e41-11e7-934d-8e87c61433df.png)
I then ran all the model pictures through Tesseract, but the accuracy is not very good, so I am trying to train Tesseract 4.00 with the model pictures.
The Tesseract 4.0 training tutorial says that there are two ways to create training data, and I use the first option: each line in the box file matches a 'character' (glyph) in the tiff image.

If 4.0 training with tif/box pairs is not yet supported, what can I do to raise the accuracy?

Others who have done license plate recognition may be able to give you better tips.

For your use case, I think an older version of Tesseract, especially one which supports the 'digits' config file for limiting output to numbers, may be a better choice than 4.0alpha.

also see https://github.com/openalpr/openalpr which uses tesseract-ocr

@minly, hello, I have the same problems as you. Have you resolved them?

@CoCa520, hello, did you manage to generate the lstm files in the end? I would like to know how to generate them from *.tif and *.box files.

Bumping this..

I tried running the steps mentioned here: https://github.com/tesseract-ocr/tesseract/issues/841#issuecomment-298174100

I'm getting this error:

ERROR: Non-existent flag -D
ERROR: /var/folders/vz/yqbfrgj91hqdj76mmpl2vjmw0000gn/T/tmp.W8q07ZtQ/eng/unicharset does not exist or is not readable

It does not create a .traineddata, which is what I expected from using the --linedata_only parameter.

I'll try to compare what edits @Shreeshrii put in place, but any guidance would be appreciated.

Edit:

boxtrain.zip

I've diffed the three files to a version in April, grabbed the "intent" of @Shreeshrii 's edit, and applied it to the newest versions of the three files.

boxtrain/boxtrain.sh --fonts_dir ~/Library/Fonts/ --training_text ../langdata/eng/eng.training_text --langdata_dir ../langdata --tessdata_dir ./tessdata/ --lang eng --fontlist "Calibri" --output_dir ./lstm1

While editing, I saw that @Shreeshrii was creating a folder named "${LANG_DATA_DIR}/GT", and copying TIF/BOX in/out from it. I tried placing my TIF/BOX pairs in the "GT" folder, and to be honest I am not clear on what the resulting .traineddata contains.

@Shreeshrii was creating a folder named "${LANG_DATA_DIR}/GT", and copying TIF/BOX in/out from it. I tried placing my TIF/BOX pairs in the "GT" folder

GT was my groundtruth folder, in which I also copied the box/tiff pairs for future reference.

They are NOT used in further training, as LSTM training now uses the generated lstmf files and starter traineddata.

@zdenop Please close this issue.
