Tesseract: LSTM: Training: Can lstmtraining make full use of CPU?

Created on 17 Mar 2018 · 13 Comments · Source: tesseract-ocr/tesseract

Hi.

I have installed Tesseract 4.00.00alpha with Leptonica 1.74.2 (including the training tools) on Ubuntu Server 16.04 LTS.

I am training a chi_sim model from scratch. However, lstmtraining does not make full use of the CPU: it uses only about 500%~600% out of 2800%. The command line is as follows:

tesseract-master/training/lstmtraining \
  --traineddata $dir/traineddata \
  --net_spec '[1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c111]' \
  --model_output $checkpoint \
  --learning_rate 20e-4 \
  --train_listfile $train_listfile \
  --eval_listfile $eval_listfile \
  --max_image_MB 30000 \
  --max_iterations 600000


Thank you!

performance training


530154436


All 13 comments

Please try the latest code (4.0.0-beta.1) and check whether you still see the same level of resource usage.


Shreeshrii on 17 Mar 2018

@Shreeshrii I have updated the source code to Tesseract 4.0.0-beta.1, but I still see the same level of resource usage.

530154436 on 18 Mar 2018

@stweil Any suggestions?

I remember there being another similar issue.

Shreeshrii on 18 Mar 2018

Tesseract 4 uses a fixed number of threads for parts of the training process. Therefore no, it won't make full use of a CPU with 28 cores, at least not with the current code. It would need a more detailed analysis to check whether using more cores could speed up the training. Maybe memory bandwidth is the limiting factor, then only reducing the data size (float instead of double) would help.
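
To put the figures from this thread in perspective: tools such as top usually count each fully busy logical CPU as 100%, so 2800% presumably corresponds to 28 logical CPUs. The following is only a small sketch of that arithmetic; cpu_share is a hypothetical helper name and is not part of Tesseract.

```shell
# Hypothetical helper (not part of Tesseract): share of the machine's total
# CPU capacity in use, given top-style percentages.
cpu_share() {
  # $1 = observed process CPU% (e.g. 500), $2 = total capacity% (e.g. 2800)
  awk -v used="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * used / total }'
}

cpu_share 500 2800   # lower figure reported in this thread
cpu_share 600 2800   # upper figure reported in this thread
```

With the numbers reported above, lstmtraining is using roughly 18% to 21% of the machine, which fits the explanation that only a small, fixed number of threads is doing work.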

stweil on 18 Mar 2018

@Shreeshrii @stweil Okay, thank you very much!

530154436 on 18 Mar 2018

@zdenop Please label this issue with: 4.0x, LSTM training, Performance

Shreeshrii on 27 Mar 2018

@530154436 @Shreeshrii @stweil Hi, I have a question. I typed --net_spec '[1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c111]' like that, but it doesn't work and I receive this error:

Invalid network spec:01c111]
Missing ] at end of [Series]!
Failed to create network from spec: [1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c111]

However, if I delete the [ ], it does work. Why does this happen? Does it matter?

godofcheerup on 28 Mar 2018

@godofcheerup The first character of 'O1c111' should be the capital letter 'O', not the digit zero '0'.
Your error message, 'Invalid network spec:01c111]', shows that you typed the digit zero '0'.
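
A quick way to catch this kind of typo before starting a long training run is to inspect the last token of the spec. This is a minimal sketch, assuming the spec ends with an output layer followed by the closing bracket, as in the specs quoted in this thread; check_output_layer is a hypothetical helper, not a Tesseract tool.

```shell
# Hypothetical helper: verify that the last layer of a net_spec starts with
# the capital letter O rather than the digit zero.
check_output_layer() {
  spec=$1
  last=${spec##* }   # last space-separated token, e.g. "O1c111]"
  case $last in
    O*"]") return 0 ;;
    0*) echo "output layer starts with digit zero: $last" >&2; return 1 ;;
    *) echo "unexpected output layer: $last" >&2; return 1 ;;
  esac
}

check_output_layer '[1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c111]' && echo "spec looks fine"
```

The check is purely textual; it does not validate the rest of the spec, which only lstmtraining itself can do.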

yuxing007 on 26 Jun 2018

@Shreeshrii @stweil
I am sorry, I cannot open https://groups.google.com/d/forum/tesseract-ocr
in China, so I am asking my question here:
Were all the data files for Tesseract 4 created with the same net_spec
'[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' given in TrainingTesseract 4.00?
If not, could you write a wiki document listing the net_spec used to train each language's data files?
I am training a minority language, Tai Le, which is similar to Thai and Lao, but I don't know how to change the net_spec to get the best traineddata.
Thank you very much!

yuxing007 on 26 Jun 2018

Please see
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast#version-string--40000alpha--network-specification
for the network spec used for tessdata_fast files.

You can specify a complete network spec if you train from scratch.

You can also specify a layer to replace if you follow
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#training-just-a-few-layers


Shreeshrii on 26 Jun 2018


Sorry, I hadn't found them in the wiki; I just searched for 'net_spec' on GitHub and ended up here. Thanks for your quick answer! @Shreeshrii

yuxing007 on 26 Jun 2018

@Shreeshrii Hello sir, sorry for disturbing you again. I tried your suggestion and found that the command below lists the version string, which includes the net_spec that was used for training:
training/combine_tessdata -d tessdata/fast/heb.traineddata
That is, combine_tessdata can list the net_spec from a tessdata_fast file, but not from a tessdata_best file.
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast
The wiki page above says:

tessdata_best (Sep 2017) best results on the eval data, slower, Float models, can be used as base for finetune training

But the net_spec listed there appears only under "Information specific to tessdata_fast".
So if I use the net_spec from the tessdata_best file for Thai or Lao to train a new language similar to Thai and Lao, will I get the best possible traineddata?
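
For files whose version string does include the spec, the bracketed part can be pulled out of the listing with a small filter. This is only a sketch: extract_net_spec is a hypothetical helper, and the sample line is an illustrative placeholder, since the exact output format of combine_tessdata -d is assumed here rather than taken from the tool.

```shell
# Hypothetical helper: print the bracketed network spec found in text on stdin.
extract_net_spec() {
  sed -n 's/.*\(\[[^]]*]\).*/\1/p'
}

# Illustrative input only; the real format of the version string line is an assumption.
sample='Version string:<version>:<lang>:<id>:[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]'
printf '%s\n' "$sample" | extract_net_spec

# Against a real file (command taken from the comment above):
# training/combine_tessdata -d tessdata/fast/heb.traineddata 2>&1 | extract_net_spec
```

If the filter prints nothing, the file most likely does not carry a network spec in its version string.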

yuxing007 on 29 Jun 2018

The network spec is copied into the version string in most, but not all,
traineddata files.

You can use the files from tessdata_best as the base for continue_from
training, whether or not the version string shows the network spec. When you
run training, the network spec might be displayed.
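
As a rough illustration of continue_from training, the sketch below prints (but does not run) the two commands typically involved: extracting the LSTM component from a base traineddata file, then continuing training from it. All paths, the helper name print_finetune_commands, and the iteration count are hypothetical placeholders. It also assumes the character set stays the same as in the base model; training a new language with a different character set needs the additional steps described in the TrainingTesseract 4.00 wiki.

```shell
# Dry-run sketch: prints the commands instead of running them.
# All paths and the iteration count are hypothetical placeholders.
print_finetune_commands() {
  base=$1      # base model, e.g. tessdata/best/tha.traineddata
  workdir=$2   # scratch directory, e.g. work
  # Step 1: extract the LSTM model component from the base traineddata file.
  echo "training/combine_tessdata -e $base $workdir/base.lstm"
  # Step 2: continue training from the extracted model.
  echo "training/lstmtraining --continue_from $workdir/base.lstm --traineddata $base --model_output $workdir/finetuned --train_listfile $workdir/train.list --max_iterations 400"
}

print_finetune_commands tessdata/best/tha.traineddata work
```

Printing the commands first makes it easy to review the paths before committing to a long training run.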
