I am trying to train Tesseract with my own data and want to generate a frequent-word DAWG file with wordlist2dawg: https://github.com/tesseract-ocr/tesseract/blob/master/doc/wordlist2dawg.1.asc
I ran this command:
wordlist2dawg data/freq_file_list.txt eng1.freq-dawg eng1.unicharset
This is the output:
Loading unicharset from 'eng1.unicharset'
Reading word list from 'data/freq_file_list.txt'
Reducing Trie to SquishedDawg
Dawg is empty, skip producing the output file
The word list looks like this:
Akstiletto
Ankyros
Ash
Bo
Boar
Boltor
Braton
Bronco
Burston
Carrier
Dakra
Dual
Ember
Fang
Fragor
....
It is not generating the DAWG file. Any suggestions as to what is wrong?
Please provide all input files
Added both files
I had to add txt extension to unicharset just to upload it here.
Please also provide the Tesseract version and OS.
I am using Tesseract 3.05-dev on Windows 10.
I got the same result. I think there are two possible causes.
1. The encoding of the input files freq_file_list.txt and eng1.unicharset.txt may not meet the requirements.
2. The unicharset you provided may not be the correct one. The unicharset file must be regenerated whenever inttemp, normproto and pffmtable are generated. Use:
mftraining -F font_properties -U unicharset -O regenerated.unicharset *.tr
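One way to test the second possibility is to check whether every character in the word list also appears in the unicharset. The following is a minimal Python sketch, not part of Tesseract. It assumes the unicharset is a text file whose first line is the entry count and whose remaining lines each start with the unichar itself. The function names are hypothetical. Because unicharset entries can span several characters, a per-character check is only an approximation.

```python
def load_unichars(unicharset_text):
    """Collect the first field of every entry line in a unicharset file.

    Assumption: line 1 holds the entry count and each later line starts
    with the unichar, followed by its properties.
    """
    unichars = set()
    for line in unicharset_text.splitlines()[1:]:
        fields = line.split()
        if fields:
            unichars.add(fields[0])
    return unichars


def words_with_unknown_chars(words, unichars):
    """Map each word to the sorted characters that are absent from unichars."""
    report = {}
    for word in words:
        missing = sorted({ch for ch in word if ch not in unichars})
        if missing:
            report[word] = missing
    return report


if __name__ == "__main__":
    # File names are the ones used in this thread.
    with open("eng1.unicharset", encoding="utf-8") as f:
        unichars = load_unichars(f.read())
    with open("data/freq_file_list.txt", encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]
    for word, missing in words_with_unknown_chars(words, unichars).items():
        print(word, "uses characters not in the unicharset:", missing)
```

If most or all words are reported, that would be consistent with the empty DAWG, since words that cannot be encoded with the unicharset are likely to be skipped.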
Hi,
What if the language is Arabic, which has no upper or lower case, and the same problem occurs? What could the issue be?
@blacklong617 @ibr123
Please note tesseract version, o/s, commit number if known.
Also share the input files.
ara_frequent.txt
ara.unicharset.txt
These are the input files. The Tesseract version is 4.00.00alpha and the OS is Windows 10.
Thanks for your response.
Your ara_frequent.txt is encoded in ANSI with Windows-style end-of-line markers. The words show up as follows instead of in Arabic.
A few words from the top of the file are pasted below:
íÊæÞÚ
ÇáÚáãÇÁ
Ãä
ÊÕÈÍ
ÝÇßåÉ
ÇáßÑÒ
æÇÍÏÉ
ãä
æÓÇÆá
ÚáÇÌ
ÇáÏÇÁ
ÇáÓßÑí
ÝÇáãÇÏÉ
ÇáÓßÑíÉ
ShreeDevi
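A possible fix for the encoding problem is to re-encode the word list as UTF-8 with Unix line endings before running wordlist2dawg. The Python sketch below assumes the "ANSI" file is Windows-1256 (the usual Arabic Windows code page). That guess fits the sample above: the bytes behind the first garbled word decode to Arabic letters under cp1256. The function name and the output file name are hypothetical.

```python
def reencode_wordlist(raw, source_encoding="cp1256"):
    """Return the word list as UTF-8 bytes, one word per line, LF endings.

    Assumption: the input bytes are in source_encoding (cp1256 here).
    """
    text = raw.decode(source_encoding)
    # splitlines() handles both CRLF and LF; blank lines are dropped.
    words = [line.strip() for line in text.splitlines() if line.strip()]
    return ("\n".join(words) + "\n").encode("utf-8")


if __name__ == "__main__":
    with open("ara_frequent.txt", "rb") as f:
        converted = reencode_wordlist(f.read())
    # Hypothetical output name; written in binary mode to keep LF endings.
    with open("ara_frequent.utf8.txt", "wb") as f:
        f.write(converted)
```

If the file turns out to use a different code page, pass that name as source_encoding instead.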
Also see https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain_utils.sh#L339
# -r arguments to wordlist2dawg denote RTL reverse policy
# (see Trie::RTLReversePolicy enum in third_party/tesseract/dict/trie.h).
# We specify 0/RRP_DO_NO_REVERSE when generating number DAWG,
# 1/RRP_REVERSE_IF_HAS_RTL for freq and word DAWGS,
# 2/RRP_FORCE_REVERSE for the punctuation DAWG.
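For a right-to-left language such as Arabic, the comment above suggests passing -r 1 when building the freq and word DAWGs. As a rough sketch, the command line could be assembled in Python like this. The helper name is hypothetical and ara.freq-dawg is an assumed output name. The argument order follows the command shown at the top of this thread.

```python
def build_wordlist2dawg_cmd(wordlist, dawg, unicharset, rtl_policy=None):
    """Build the argument list for wordlist2dawg.

    rtl_policy follows the comment in tesstrain_utils.sh:
    0 = do not reverse, 1 = reverse if the word has RTL, 2 = force reverse.
    """
    cmd = ["wordlist2dawg"]
    if rtl_policy is not None:
        if rtl_policy not in (0, 1, 2):
            raise ValueError("rtl_policy must be 0, 1 or 2")
        cmd += ["-r", str(rtl_policy)]
    cmd += [wordlist, dawg, unicharset]
    return cmd


if __name__ == "__main__":
    # Prints the command instead of running it, so it can be checked first.
    print(" ".join(build_wordlist2dawg_cmd(
        "ara_frequent.txt", "ara.freq-dawg", "ara.unicharset", rtl_policy=1)))
```

The resulting list can be handed to subprocess.run once the input files are confirmed to be UTF-8.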
@sjaanus Is your issue resolved? Please mention the solution and close the issue.
Is this issue still there? If not, please close the issue.
Closing because of missing input from reporter.
I am also facing the same issue: wordlist2dawg is not creating the word-dawg file. I am using Windows 10 and Tesseract version 3.0.2. Kindly assist.