Tesseract: Tesseract wordlist2dawg dawg is empty

Created on 20 Nov 2016 · 14 Comments · Source: tesseract-ocr/tesseract

I am trying to train Tesseract with my own data and want to generate the frequent-words dawg file from a word list with wordlist2dawg. See https://github.com/tesseract-ocr/tesseract/blob/master/doc/wordlist2dawg.1.asc

Running this command:

wordlist2dawg data/freq_file_list.txt eng1.freq-dawg eng1.unicharset

This is the output:

Loading unicharset from 'eng1.unicharset'
Reading word list from 'data/freq_file_list.txt'
Reducing Trie to SquishedDawg
Dawg is empty, skip producing the output file

The wordlist looks like this:

Akstiletto
Ankyros
Ash
Bo
Boar
Boltor
Braton
Bronco
Burston
Carrier
Dakra
Dual
Ember
Fang
Fragor
....

It is not generating the dawg file. Any suggestions as to what is wrong?

sjaanus

All 14 comments

Please provide all input files.

zdenop on 22 Nov 2016

I have added both files.
I had to add a .txt extension to the unicharset file just to upload it here.

freq_file_list.txt

eng1.unicharset.txt

sjaanus on 22 Nov 2016

Please also provide the Tesseract version and OS.

zdenop on 22 Nov 2016

I am using Tesseract 3.05-dev on Windows 10.

sjaanus on 22 Nov 2016

I got the same result. I think there are two possible causes:
1. The encoding of the input files freq_file_list.txt and eng1.unicharset.txt may not meet the requirements.
2. The unicharset you provided may not be the correct one. The unicharset file must be regenerated whenever inttemp, normproto and pffmtable are generated. Use:
mftraining -F font_properties -U unicharset -O regenerated.unicharset *.tr
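
One way to test whether the word list and the unicharset agree is to look for words that use characters the unicharset does not contain. The following is only a rough sketch in Python, not part of Tesseract. It assumes that both files are UTF-8, that the unicharset has a count on its first line followed by one entry per line whose first space-separated field is the character, and that words which cannot be encoded with the unicharset are dropped by wordlist2dawg (which would explain an empty dawg if no word survives). The function names are made up for illustration.

```python
from pathlib import Path
from typing import List, Set


def unicharset_chars(unicharset_text: str) -> Set[str]:
    """Collect every character that appears in any unicharset entry.

    Assumed layout: entry count on line 1, then one entry per line whose
    first space-separated field is the unichar itself.
    """
    chars: Set[str] = set()
    for line in unicharset_text.splitlines()[1:]:
        unichar = line.split(" ")[0]
        # Skip blank lines and the "NULL" placeholder entry (assumed to
        # stand for the space character).
        if not unichar or unichar == "NULL":
            continue
        chars.update(unichar)
    return chars


def uncovered_words(wordlist_text: str, chars: Set[str]) -> List[str]:
    """Return the words that contain a character missing from chars."""
    bad = []
    for word in wordlist_text.splitlines():
        word = word.strip()
        if word and not set(word) <= chars:
            bad.append(word)
    return bad


def check_files(wordlist_path: str, unicharset_path: str) -> List[str]:
    """Convenience wrapper that reads both files as UTF-8."""
    chars = unicharset_chars(Path(unicharset_path).read_text(encoding="utf-8"))
    words = Path(wordlist_path).read_text(encoding="utf-8")
    return uncovered_words(words, chars)
```

Calling check_files("data/freq_file_list.txt", "eng1.unicharset") would list the words that this approximation considers unencodable. If every word is listed, the two files probably do not belong together or are in different encodings. If reading a file raises a UnicodeDecodeError, that file is not UTF-8.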

blacklong28 on 6 Apr 2017

Hi,
what if the language is Arabic, which has no upper or lower case, and the same problem occurs? What could be the issue?

ibr123 on 18 Apr 2017

@blacklong617 @ibr123

Please state the Tesseract version, OS, and commit number if known.

Also share the input files.

Shreeshrii on 18 Apr 2017

ara_frequent.txt
ara.unicharset.txt
These are the input files. The Tesseract version is 4.00.00alpha and the OS is Windows 10.
Thanks for your response.

ibr123 on 18 Apr 2017

Your ara_frequent.txt is encoded in ANSI with Windows-style end-of-line
markers. The words show up as follows instead of in Arabic.
Just a few words from the top of the file are pasted below:

íÊæÞÚ
ÇáÚáãÇÁ
Ãä
ÊÕÈÍ
ÝÇßåÉ
ÇáßÑÒ
æÇÍÏÉ
ãä
æÓÇÆá
ÚáÇÌ
ÇáÏÇÁ
ÇáÓßÑí
ÝÇáãÇÏÉ
ÇáÓßÑíÉ
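
A possible way to repair such a file is to re-encode it as UTF-8 with Unix line endings before running wordlist2dawg. The sketch below is an illustration in Python, not a Tesseract tool. It assumes the ANSI code page is Windows-1256 (Arabic), which fits the sample above because those bytes decode to Arabic words under that code page, and it assumes wordlist2dawg wants UTF-8 input. The function names and the output file name are hypothetical.

```python
from pathlib import Path


def ansi_to_utf8(raw: bytes, codepage: str = "cp1256") -> bytes:
    """Decode ANSI bytes, normalise line endings to LF, re-encode as UTF-8."""
    text = raw.decode(codepage)
    # Convert Windows (CRLF) and stray CR line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")  # written without a byte order mark


def convert_file(src: str, dst: str, codepage: str = "cp1256") -> None:
    """Read src as raw bytes and write the converted bytes to dst."""
    Path(dst).write_bytes(ansi_to_utf8(Path(src).read_bytes(), codepage))
```

For example, convert_file("ara_frequent.txt", "ara_frequent.utf8.txt") would write a converted copy and leave the original untouched. A different source code page would need a different codepage argument.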

Shreeshrii on 18 Apr 2017

Also see https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain_utils.sh#L339

    # -r arguments to wordlist2dawg denote RTL reverse policy
    # (see Trie::RTLReversePolicy enum in third_party/tesseract/dict/trie.h).
    # We specify 0/RRP_DO_NO_REVERSE when generating number DAWG,
    # 1/RRP_REVERSE_IF_HAS_RTL for freq and word DAWGS,
    # 2/RRP_FORCE_REVERSE for the punctuation DAWG.
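
To make the policy numbers in that comment concrete, here is a small Python sketch that builds the wordlist2dawg command line for each kind of list. It is only an illustration: the helper name and the dictionary are made up, the mapping is copied from the comment above, and it assumes -r takes the numeric policy followed by the word list, output dawg and unicharset, as in the command at the top of this thread.

```python
from typing import List

# Policy numbers as described in the tesstrain_utils.sh comment.
REVERSE_POLICY = {
    "number": 0,  # RRP_DO_NO_REVERSE
    "freq": 1,    # RRP_REVERSE_IF_HAS_RTL
    "word": 1,    # RRP_REVERSE_IF_HAS_RTL
    "punc": 2,    # RRP_FORCE_REVERSE
}


def wordlist2dawg_command(kind: str, wordlist: str, dawg: str,
                          unicharset: str) -> List[str]:
    """Build the argument list for one wordlist2dawg run (hypothetical helper)."""
    policy = REVERSE_POLICY[kind]  # raises KeyError for an unknown kind
    return ["wordlist2dawg", "-r", str(policy), wordlist, dawg, unicharset]
```

For the Arabic frequent-words list in this thread, wordlist2dawg_command("freq", "ara_frequent.txt", "ara.freq-dawg", "ara.unicharset") gives the arguments for a run with -r 1, which could then be passed to subprocess.run. The output name ara.freq-dawg is just a guess modelled on eng1.freq-dawg.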

Shreeshrii on 18 Apr 2017

@sjaanus Is your issue resolved? Please mention the solution and close the issue.

Shreeshrii on 27 Jun 2017

Is this issue still there? If not, please close the issue.

Shreeshrii on 18 May 2018

Closing because of missing input from reporter.

zdenop on 27 Sep 2018

I am also facing the same issue: wordlist2dawg is not creating the word-dawg file. I am using Windows 10 and Tesseract version 3.0.2. Kindly assist.