Tesseract: LSTM: Training: Can lstmtraining make full use of CPU?

Created on 17 Mar 2018  ·  13 Comments  ·  Source: tesseract-ocr/tesseract

Hi.

  • I have installed Tesseract 4.00.00alpha with Leptonica 1.74.2 (including the training tools) on Ubuntu Server 16.04 LTS.

I am training a chi_sim model from scratch. However, lstmtraining does not make full use of the CPU: it only uses about 500% to 600% of the available 2800%. The command line is as follows:

tesseract-master/training/lstmtraining \
  --traineddata $dir/traineddata \
  --net_spec '[1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c111]' \
  --model_output $checkpoint \
  --learning_rate 20e-4 \
  --train_listfile $train_listfile \
  --eval_listfile $eval_listfile \
  --max_image_MB 30000 \
  --max_iterations 600000

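[image: question]
https://user-images.githubusercontent.com/24240399/37558626-742711fc-2a51-11e8-96d0-c9140c1836b0.png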

Thank you!

performance training

All 13 comments

Please use the latest code (beta.1) and check whether you get the same amount of resource usage.

@Shreeshrii I have updated the source code to Tesseract 4.0.0-beta.1, but I still get the same amount of resource usage.

@stweil Any suggestions?

I remember there being another similar issue.

Tesseract 4 uses a fixed number of threads for parts of the training process. Therefore no, it won't make full use of a CPU with 28 cores, at least not with the current code. It would need a more detailed analysis to check whether using more cores could speed up the training. Memory bandwidth may be the limiting factor; in that case only reducing the data size (float instead of double) would help.
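To put the figures from the original report in perspective, here is a small arithmetic sketch. The helper name `cpu_share` is made up for this example; it converts a top-style CPU percentage (where 100% means one fully busy core) into busy cores and a share of the whole machine.

```shell
#!/bin/sh
# Hypothetical helper, not part of Tesseract: convert a top-style CPU
# percentage (100% = one core) into busy cores and share of the machine.
cpu_share() {
    percent=$1   # e.g. 500 for 500%
    cores=$2     # total number of cores, e.g. the output of `nproc`
    awk -v p="$percent" -v c="$cores" \
        'BEGIN { printf "%.1f cores busy, %.0f%% of machine\n", p / 100, p / c }'
}

# The range reported in this thread, on a 28-core machine.
cpu_share 500 28   # prints "5.0 cores busy, 18% of machine"
cpu_share 600 28   # prints "6.0 cores busy, 21% of machine"
```

So the reported load corresponds to roughly five or six busy cores out of 28, which fits a training process that runs parts of its work on a fixed, small number of threads.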

@Shreeshrii @stweil Okay, thank you very much!

@zdenop Please label with: 4.0x, LSTM training, Performance

@530154436 @Shreeshrii @stweil Hi, I have a question. I typed --net_spec '[1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c111]' as shown above, but it does not work and I get this error:

Invalid network spec:01c111]
Missing ] at end of [Series]!
Failed to create network from spec: [1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c111]

But if I delete the [ ], it works. Why does this happen? Does it matter?

@godofcheerup The first character of 'O1c111' should be the capital letter 'O', not the digit zero '0'.
The spec echoed in your error message, 'Invalid network spec:01c111]', starts with the digit zero.
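A typo like this can be caught before training starts. The sketch below is only an illustration: `check_net_spec` is a made-up helper, and the pattern assumes the output layer has the form O, a digit 0 to 2, one of the letters l, s or c, then the class count, as in the 'O1c111' used in this thread.

```shell
#!/bin/sh
# Hypothetical pre-flight check, not part of Tesseract: flag a token that
# starts with the digit 0 where an output layer such as O1c111 (capital
# letter O) was probably meant.
check_net_spec() {
    if printf '%s\n' "$1" | grep -Eq '(^| |\[)0[0-2][lsc][0-9]'; then
        echo "suspicious: digit 0 where the letter O was probably meant"
        return 1
    fi
    echo "ok"
}

# Correct spec (letter O): prints "ok".
check_net_spec '[1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c111]'
# Mistyped spec (digit 0): prints the warning and returns a non-zero status.
check_net_spec '[1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 01c111]' || true
```

The legitimate zero in the input spec ('1,48,0,1') is not flagged, because it is preceded by a comma rather than a space or an opening bracket.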

@Shreeshrii @stweil
I am sorry, I cannot open https://groups.google.com/d/forum/tesseract-ocr
in China, so I am asking my question here:
Were all the data files for Tesseract version 4 created with the same net_spec,
'[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]', given in TrainingTesseract 4.00?
If not, could you write a wiki document listing the net_spec used to train each language's data file?
I am training a minority language, Tai Le, which is similar to Thai and Lao, but I don't know how to change the net_spec to get the best traineddata.
Thank you very much!

Please see
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast#version-string--40000alpha--network-specification
for the network spec used for the tessdata_fast files.

You can specify a complete network spec if you train from scratch.

You can also specify only the layers to replace if you follow
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#training-just-a-few-layers
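As a rough sketch of the second option, the function below follows the pattern of the linked "training just a few layers" section: --append_index keeps the existing model's layers up to the given index, and --net_spec describes only the layers appended on top. The wrapper `replace_top_layers`, the `LSTMTRAINING` variable and all arguments are placeholders invented for this example; the index 5, the appended spec and the iteration count are illustrative values, not a recommendation for any particular language.

```shell
#!/bin/sh
# Allow the binary to be overridden, e.g. LSTMTRAINING=echo for a dry run.
: "${LSTMTRAINING:=lstmtraining}"

# Hypothetical wrapper: keep the lower layers of an existing model and
# train only the layers given in --net_spec.
#   $1 = extracted .lstm model to continue from
#   $2 = traineddata for the language being trained
#   $3 = output prefix for checkpoints
#   $4 = file listing the training .lstmf files
replace_top_layers() {
    "$LSTMTRAINING" \
        --continue_from "$1" \
        --traineddata "$2" \
        --append_index 5 --net_spec '[Lfx256 O1c111]' \
        --model_output "$3" \
        --train_listfile "$4" \
        --max_iterations 3000
}
```

Setting LSTMTRAINING=echo before calling the function prints the assembled command line instead of starting a training run, which is a cheap way to check the arguments.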

Sorry, I had not found them in the wiki; I just searched for 'net_spec' on GitHub and ended up here. Thanks for answering so quickly! @Shreeshrii

@Shreeshrii Hello sir, sorry for disturbing you again. I tried your suggestion and found that the command below lists the version string, which includes the net_spec that was used for training:
training/combine_tessdata -d tessdata/fast/heb.traineddata
What I mean is that combine_tessdata can list the net_spec from a tessdata_fast file, but not from a tessdata_best file.
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast
The wiki page above says:

tessdata_best (Sep 2017) best results on the eval data, slower, Float models, can be used as base for finetune training

But the net_spec values listed there apply only to tessdata_fast.
So if I use the net_spec of the tessdata_best file for Thai or Lao to train a new language similar to Thai and Lao, can I expect to get a better traineddata?
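When the version string does contain a network spec, the bracketed part can be pulled out with a one-line filter. This is a minimal sketch: `extract_net_spec` is a made-up helper, and the sample line is illustrative only (it reuses the spec quoted in this thread with a placeholder language code), so the exact output format of combine_tessdata -d is an assumption here.

```shell
#!/bin/sh
# Hypothetical helper: print the bracketed network spec found on a line
# read from standard input; prints nothing if no [...] part is present.
extract_net_spec() {
    sed -n 's/.*\(\[[^]]*\]\).*/\1/p'
}

# Illustrative sample line, not copied from a real traineddata file.
sample='Version string:4.00.00alpha:xyz:[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]'
printf '%s\n' "$sample" | extract_net_spec
```

With a real file, the filter would be fed the tool's output, for example `training/combine_tessdata -d tessdata/fast/heb.traineddata 2>&1 | extract_net_spec`; the `2>&1` is included in case the listing is written to standard error. For files whose version string carries no spec, the filter prints nothing.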

The network spec is copied into the version string in MOST but NOT ALL
traineddata files.

You can use the files from tessdata_best as the 'base' for continue_from
training, whether or not the version string shows the network spec. When you
run training, the network spec might be displayed.
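A minimal sketch of that workflow, assuming the usual two steps of extracting the LSTM model with combine_tessdata -e and then continuing training from it. The wrapper `finetune_from_best`, the two override variables, the file names and the iteration count are placeholders made up for this example. Note that no --net_spec is passed, since the network comes from the base model.

```shell
#!/bin/sh
# Allow the binaries to be overridden, e.g. with echo for a dry run.
: "${COMBINE_TESSDATA:=combine_tessdata}"
: "${LSTMTRAINING:=lstmtraining}"

# Hypothetical wrapper: fine-tune starting from a tessdata_best model.
#   $1 = base traineddata from tessdata_best (e.g. tha.traineddata)
#   $2 = traineddata built for the new language
#   $3 = output prefix for checkpoints
#   $4 = file listing the training .lstmf files
finetune_from_best() {
    base_lstm="${1%.traineddata}.lstm"
    # Step 1: extract the LSTM model from the base traineddata.
    "$COMBINE_TESSDATA" -e "$1" "$base_lstm" || return 1
    # Step 2: continue training from it. --old_traineddata points at the
    # base file so that a changed character set can be mapped.
    "$LSTMTRAINING" \
        --continue_from "$base_lstm" \
        --old_traineddata "$1" \
        --traineddata "$2" \
        --model_output "$3" \
        --train_listfile "$4" \
        --max_iterations 3600
}
```

A call might look like `finetune_from_best tessdata/best/tha.traineddata tdd/tdd.traineddata out/tdd train_files.txt`, where every path is hypothetical.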
