I'm trying to use the text2image utility to train tesseract. Unfortunately it keeps crashing every time I try to use it :(
text2image --text=training_text.txt --outputbase=test.MenloMedium.exp0 --font='Menlo Medium' --fonts_dir=/Library/Fonts/
(lldb) run
Process 49926 launched: '/usr/local/bin/text2image' (x86_64)
Process 49926 stopped
* thread #1: tid = 0x1d2b8cb, 0x0000000100b74358 libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
frame #0: 0x0000000100b74358 libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25
libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph:
-> 0x100b74358 <+25>: movq (%rcx), %rdi
0x100b7435b <+28>: testq %rdi, %rdi
0x100b7435e <+31>: je 0x100b74369 ; <+42>
0x100b74360 <+33>: movq %rax, %rsi
(lldb) bt
* thread #1: tid = 0x1d2b8cb, 0x0000000100b74358 libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
* frame #0: 0x0000000100b74358 libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25
frame #1: 0x000000010000edc1 text2image`tesseract::PangoFontInfo::CanRenderString(char const*, int, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >*) const + 321
frame #2: 0x000000010000ec57 text2image`tesseract::PangoFontInfo::CanRenderString(char const*, int) const + 33
frame #3: 0x0000000100015227 text2image`tesseract::StringRenderer::StripUnrenderableWords(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*) const + 193
frame #4: 0x00000001000154aa text2image`tesseract::StringRenderer::RenderToImage(char const*, int, Pix**) + 418
frame #5: 0x0000000100005748 text2image`main + 2891
frame #6: 0x00007fff8a2645ad libdyld.dylib`start + 1
frame #7: 0x00007fff8a2645ad libdyld.dylib`start + 1
My guess, and it's a very uneducated one, is that run->item->analysis.font points to a plain PangoFont rather than a PangoFcFont. Since it is reinterpret_cast to PangoFcFont* anyway, pango_fc_font_get_glyph later hits a null pointer somewhere.
I have checked run->item->analysis.font itself, and it is not a null pointer.
My very crude workaround for now:
diff --git a/training/pango_font_info.cpp b/training/pango_font_info.cpp
index b542591..86c108e 100644
--- a/training/pango_font_info.cpp
+++ b/training/pango_font_info.cpp
@@ -416,10 +416,13 @@ bool PangoFontInfo::CanRenderString(const char* utf8_word, int len,
tlog(2, "Found end of line NULL run marker\n");
continue;
}
- PangoGlyph dotted_circle_glyph;
+ // PangoGlyph dotted_circle_glyph;
PangoFont* font = run->item->analysis.font;
- dotted_circle_glyph = pango_fc_font_get_glyph(
- reinterpret_cast<PangoFcFont*>(font), kDottedCircleGlyph);
+
+ // printf("The pointer: %p\n", (void *) font);
+
+ // dotted_circle_glyph = pango_fc_font_get_glyph(
+ // reinterpret_cast<PangoFcFont*>(font), kDottedCircleGlyph);
if (TLOG_IS_ON(2)) {
PangoFontDescription* desc = pango_font_describe(font);
char* desc_str = pango_font_description_to_string(desc);
@@ -456,9 +459,9 @@ bool PangoFontInfo::CanRenderString(const char* utf8_word, int len,
const bool unknown_glyph =
(cluster_iter.glyph_item->glyphs->glyphs[i].glyph &
PANGO_GLYPH_UNKNOWN_FLAG);
- const bool illegal_glyph =
- (cluster_iter.glyph_item->glyphs->glyphs[i].glyph ==
- dotted_circle_glyph);
+ const bool illegal_glyph = false;
+ // (cluster_iter.glyph_item->glyphs->glyphs[i].glyph ==
+ // dotted_circle_glyph);
bad_glyph = unknown_glyph || illegal_glyph;
if (TLOG_IS_ON(2)) {
printf("(%d=%d)", cluster_iter.glyph_item->glyphs->glyphs[i].glyph,
First thing - remove your patch.
Now, run this:
text2image --list_available_fonts --fonts_dir=/Library/Fonts
Do you get a list of fonts? If the answer is 'yes', then proceed.
Use this file:
https://raw.githubusercontent.com/tesseract-ocr/langdata/master/eng/eng.training_text
text2image --text=eng.training_text --outputbase=eng.MenloRegular.exp0 --font='Menlo Regular' --fonts_dir=/Library/Fonts
If this doesn't work, try other fonts, but only regular ones, not bold/italic/medium.
If this doesn't work either, download and install the DejaVu fonts.
Also, please provide any error message you get.
http://ryanfb.github.io/etc/2014/11/19/installing_tesseract_training_tools_on_mac_os_x.html
Note that fontconfig font locations and caching are a whole other nightmare, and I seem unable to get text2image to respect/use the --fonts_dir argument on OS X. Your best bet seems to be to install things as system/user fonts (e.g. copy into ~/Library/Fonts) and optionally run fc-cache -frv to force a cache update.
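A minimal sketch of that workaround, assuming a POSIX shell. The helper name install_user_font and its optional second argument are hypothetical and exist only for illustration; the default destination and the fc-cache flags are the ones mentioned above.

```shell
# Hypothetical helper: copy one font file into a user font directory and
# refresh the fontconfig cache if fc-cache is installed.
install_user_font() {
  font_file=$1
  dest_dir=${2:-"$HOME/Library/Fonts"}   # default: the macOS per-user font folder
  mkdir -p "$dest_dir" || return 1
  cp "$font_file" "$dest_dir/" || return 1
  if command -v fc-cache >/dev/null 2>&1; then
    # -f force, -r erase existing caches first, -v verbose
    fc-cache -frv >/dev/null 2>&1 || echo "warning: fc-cache failed" >&2
  fi
}
```

After something like install_user_font ./SomeFont.ttf, the font should show up in text2image --list_available_fonts if fontconfig picked it up.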
A note for @ryanfb:
text2image --list_available_fonts
Running this command on Ubuntu 14.04 does not produce any output, but running this one does:
text2image --list_available_fonts --fonts_dir=
cc: @behdad
Behdad, maybe you can help here?
@ryanfb, did you try running tesstrain.sh on a Mac?
Maybe it has some tricks that make text2image actually work on a Mac.
Indeed, looks like the font is not a PangoFcFont. Try using PANGO_FC_FONT(...) instead of the reinterpret_cast<...>, and you should get a warning. You can use PANGO_IS_FC_FONT() to test at runtime.
@LinusU ?
Sorry, I haven't had time to investigate this further. Hopefully I'll get some work done on this in the near future...
@jbarlow83, anyone with a mac...
Could you help test and debug this issue?
@amitdo
I have the same problem. I cannot use tesstrain.sh on a Mac (so I use Ubuntu on VirtualBox for training).
I tried the following.
$ brew install tesseract --with-training-tools --HEAD
$ text2image --list_available_fonts --fonts_dir=/Library/Fonts
<skip>
There are 1221 fonts installed in total, but no 'Regular' style appears in text2image's output,
even for fonts that do have "Regular" style glyphs.
So I cannot try text2image with a 'Regular' style font. What should I do about this?
$ text2image --text=eng.training_text --outputbase=eng.MenloRegular.exp0 --font='Menlo Regular' --fonts_dir=/Library/Fonts
Could not find font named Menlo Regular. Pango suggested font Menlo Medium
Please correct --font arg.:Error:Assert failed:in file text2image.cpp, line 437
Abort trap: 6
Example with a DejaVu font (possibly not a regular style):
$ text2image --text=eng.training_text --outputbase=eng.MenloRegular.exp0 --font='DejaVu Sans Thin' --fonts_dir=~/Library/Fonts
Segmentation fault: 11
Hi @atuyosi!
I don't have 'DejaVu Sans Thin' on my Ubuntu system. I have 'DejaVu Sans'.
Please copy the output of
text2image --list_available_fonts --fonts_dir=/Library/Fonts
to a new file 'fontlist.txt' and attach this file here. I want to find a font we both have.
Please try:
text2image --text=eng.training_text --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman, Bold' --fonts_dir=/Library/Fonts
@amitdo
The text2image result is below:
$ text2image --text=eng.training_text --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman, Bold' --fonts_dir=/Library/Fonts
Segmentation fault: 11
The detailed backtrace is below:
$ lldb /usr/local/bin/text2image
(lldb) target create "/usr/local/bin/text2image"
Current executable set to '/usr/local/bin/text2image' (x86_64).
(lldb) run --text=eng.training_text --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman, Bold' --fonts_dir=/Library/Fonts
Process 53735 launched: '/usr/local/bin/text2image' (x86_64)
Process 53735 stopped
* thread #1: tid = 0xf3978, 0x0000000100b98358 libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
frame #0: 0x0000000100b98358 libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25
libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph:
-> 0x100b98358 <+25>: movq (%rcx), %rdi
0x100b9835b <+28>: testq %rdi, %rdi
0x100b9835e <+31>: je 0x100b98369 ; <+42>
0x100b98360 <+33>: movq %rax, %rsi
(lldb) bt
* thread #1: tid = 0xf3978, 0x0000000100b98358 libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
* frame #0: 0x0000000100b98358 libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25
frame #1: 0x000000010000ea31 text2image`tesseract::PangoFontInfo::CanRenderString(char const*, int, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >*) const + 321
frame #2: 0x000000010000e8c7 text2image`tesseract::PangoFontInfo::CanRenderString(char const*, int) const + 35
frame #3: 0x0000000100015047 text2image`tesseract::StringRenderer::StripUnrenderableWords(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*) const + 195
frame #4: 0x00000001000152d0 text2image`tesseract::StringRenderer::RenderToImage(char const*, int, Pix**) + 418
frame #5: 0x0000000100005541 text2image`main + 2895
frame #6: 0x00007fff9c3ec5ad libdyld.dylib`start + 1
frame #7: 0x00007fff9c3ec5ad libdyld.dylib`start + 1
(lldb)
Another test...
text2image --text=eng.training.txt --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman, Bold' --fonts_dir=/Library/Fonts --tlog_level=3
Hi @amitdo,
I have the exact same issue with text2image (HEAD revision) running on OSX.
Here's the log output when running your last command:
$ text2image --text=eng.training.txt --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman, Bold' --fonts_dir=/Library/Fonts --tlog_level=3
query weight = 700 selected weight =700
query_desc: 'Times New Roman, Bold' Selected: 's'
Render string of size 84
Starting page 0
max_width = 3400, max_height = 4600
len = 84 buf_len = 84
Segmentation fault: 11
And the corresponding debug trace:
$ lldb /usr/local/bin/text2image
(lldb) target create "/usr/local/bin/text2image"
Current executable set to '/usr/local/bin/text2image' (x86_64).
(lldb) run --text=eng.training.txt --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman, Bold' --fonts_dir=/Library/Fonts --tlog_level=3
Process 43961 launched: '/usr/local/bin/text2image' (x86_64)
query weight = 700 selected weight =700
query_desc: 'Times New Roman, Bold' Selected: 's'
Render string of size 84
Starting page 0
max_width = 3400, max_height = 4600
len = 84 buf_len = 84
Process 43961 stopped
* thread #1: tid = 0x1e1688, 0x0000000100c7e36e libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
frame #0: 0x0000000100c7e36e libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25
libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25:
-> 0x100c7e36e: movq (%rcx), %rdi
0x100c7e371: testq %rdi, %rdi
0x100c7e374: je 0x100c7e37f ; pango_fc_font_get_glyph + 42
0x100c7e376: movq %rax, %rsi
Hope this helps.
Could someone with a Mac retest this with the latest commit in the repo?
@amitdo
I tried the HEAD version (5610738), and the exit status is 0.
It looks good with the Times fonts.
Could you check the logs?
$ uname -a
Darwin sakura.local 15.6.0 Darwin Kernel Version 15.6.0: Thu Jun 23 18:25:34 PDT 2016; root:xnu-3248.60.10~1/RELEASE_X86_64 x86_64
$ brew install tesseract --HEAD --with-training-tools
<skip>
$ tesseract --version
tesseract 3.05.00dev
leptonica-1.73
libjpeg 8d : libpng 1.6.23 : libtiff 4.0.6 : zlib 1.2.5
$ ls -l /usr/local/bin/text2image
lrwxr-xr-x 1 atuyosi admin 49 8 11 01:55 /usr/local/bin/text2image -> ../Cellar/tesseract/HEAD-5610738_2/bin/text2image
$ text2image --text=eng.training.txt --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman, Bold' --fonts_dir=/Library/Fonts --tlog_level=3
Log file is below:
The exit status and output files:
$ echo $?
0
$ ls
eng.TimesNewRomanBold.exp0.box eng.training.txt
eng.TimesNewRomanBold.exp0.tif
eng.TimesNewRomanBold.exp0.box.txt

I renamed the .box file to .txt and converted the .tiff to .png.
Thanks.
@atuyosi, thanks for testing. The output files look okay.
The issue seems to be solved.
Hi guys,
I'm having trouble with text2image. I'm trying to make the font list described in the main article about Tesseract OCR. I can create the box file and tif file normally for a single font, but building the list fails. I'm using Tesseract OCR 3.03 on Windows 10, the language is English, and the command is:
text2image --text=training_text.txt --outputbase=eng.fontlist.txt --fonts_dir=C:\Windows\Fonts --find_fonts=true --min_coverage=1.0 --render_per_font=false --fontconfig_tmpdir=C:\Tesseract\Tesseract-OCR
and I get a warning:
WARNING: Could not find a font to render image title with!
It also reports a failure for every font, such as:
Font Aldhabi failed with 62 hits = 21.60%
It also prints '%' (U+25) not covered by font, but I don't know what that means.
Any idea how to solve this error?
Thanks in advance.
With --min_coverage=1.0 you are requiring each font to cover 100% of the training text.
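A rough way to see the effect of the threshold, as a sketch only: it assumes a font is accepted when hits >= total chars * min_coverage, which is inferred from the log lines in this thread and not taken from the text2image source. The helper name coverage_verdict is hypothetical; the figures come from a log posted further down.

```shell
# Hypothetical helper: prints "pass" if HITS covered characters out of TOTAL
# reach the MIN_COVERAGE fraction (0 to 1), otherwise "fail".
# Assumed rule (not confirmed from the source code): hits >= total * min.
coverage_verdict() {
  awk -v hits="$1" -v total="$2" -v min="$3" \
    'BEGIN { if (hits >= total * min) print "pass"; else print "fail" }'
}

coverage_verdict 34995 34998 0.99   # a font at 99.99% coverage
coverage_verdict 34995 34998 1.0    # the same font with --min_coverage=1.0
coverage_verdict 6378 34998 0.99    # a font at 18.22% coverage
```

Under that assumption, even a font missing only a handful of characters is rejected at 1.0, which would explain every font being reported as failed.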
I use the following in bash on Windows (MobaXterm).
text2image --find_fonts \
--fonts_dir /mnt/c/Windows/Fonts \
--text ./langdata/eng/eng.training_text \
--min_coverage .95 \
--outputbase ./langdata/eng/eng \
|& grep raw | sed -e 's/ :.*/" \\/g' | sed -e 's/^/ "/' > ../langdata/eng/fontslist-windows.txt
Thanks a lot for your response, but I didn't understand the last part of the command you are using:
|& grep raw | sed -e 's/ :.*/" \\/g' | sed -e 's/^/ "/' > ../langdata/eng/fontslist-windows.txt
I don't remember the text2image binary having such arguments. I also tried changing the coverage in my command from 100 to 95, yet I still have the same problem.
Keep in mind I'm not familiar with the terminal environment :)
The text2image --find_fonts command prints its results to the terminal.
The \ at the end of each line is a continuation mark for the command.
| pipes the output of the previous command into the next one; |& (bash) pipes the error output along with it.
grep raw keeps only the lines containing 'raw', which are the lines that start with a font name.
The first sed command replaces the ' :' and everything after it with a closing quote mark and a trailing backslash.
The second sed command adds a quote mark to the beginning of each line.
The resulting output is saved to the file named after >.
So, basically, it deletes all extraneous output and creates a text file with each font name in quotes, which can be used as part of a fonts list or plugged into language-specific.sh. Example output:
"WenQuanYi Zen Hei Medium" \
"WenQuanYi Zen Hei Mono Medium" \
"WenQuanYi Zen Hei Sharp Medium" \
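The filtering part of the pipeline can be tried without running text2image at all. The sketch below feeds it a few sample lines in the format shown elsewhere in this thread; the variable names sample_output and font_list are made up for the illustration.

```shell
# Sample lines in the format text2image --find_fonts prints in this thread.
sample_output='Total chars = 6694
DejaVu Sans : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 0 to file ../langdata/eng/eng.DejaVu_Sans.tif
Font Dingbats failed with 5723 hits = 16.35%
DejaVu Sans Bold : 6694 hits = 100.00%, raw = 112 = 100.00%'

# 1. grep keeps only the lines that mention "raw" (the accepted fonts).
# 2. The first sed replaces " :" and the rest of the line with a closing
#    quote and a trailing backslash.
# 3. The second sed prefixes each line with a space and an opening quote.
font_list=$(printf '%s\n' "$sample_output" \
  | grep raw \
  | sed -e 's/ :.*/" \\/g' \
  | sed -e 's/^/ "/')

printf '%s\n' "$font_list"
```

Only the two accepted fonts survive, each wrapped in quotes and followed by a backslash, ready to paste into a shell array or list.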
I notice just now that you say
using tesseract OCR 3.03
That could be the problem. text2image segfaults have been fixed in recent code.
Please use the latest windows binaries eg. from
@ibr123 Please note that if you are using the Windows Command Prompt rather than bash under Windows, commands such as grep and sed may not be available.
"text2image --find_fonts command displays the output on the terminal": does that mean no file will be generated, and it only prints to the terminal?
text2image --find_fonts \
--fonts_dir /usr/share/fonts/truetype/dejavu/ \
--text ../langdata/eng/eng.training_text \
--min_coverage .99 \
--outputbase ../langdata/eng/eng
Total chars = 6694
DejaVu Sans : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 0 to file ../langdata/eng/eng.DejaVu_Sans.tif
DejaVu Sans Bold : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 1 to file ../langdata/eng/eng.DejaVu_Sans_Bold.tif
DejaVu Sans Mono : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 2 to file ../langdata/eng/eng.DejaVu_Sans_Mono.tif
DejaVu Sans Mono Bold : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 3 to file ../langdata/eng/eng.DejaVu_Sans_Mono_Bold.tif
DejaVu Serif : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 4 to file ../langdata/eng/eng.DejaVu_Serif.tif
DejaVu Serif Bold : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 5 to file ../langdata/eng/eng.DejaVu_Serif_Bold.tif
You can redirect the output to a file with:
text2image --find_fonts \
--fonts_dir /usr/share/fonts/truetype/dejavu/ \
--text ../langdata/eng/eng.training_text \
--min_coverage .99 \
--outputbase ../langdata/eng/eng &>./test.txt
test.txt then contains:
Total chars = 6694
DejaVu Sans : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 0 to file ../langdata/eng/eng.DejaVu_Sans.tif
DejaVu Sans Bold : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 1 to file ../langdata/eng/eng.DejaVu_Sans_Bold.tif
DejaVu Sans Mono : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 2 to file ../langdata/eng/eng.DejaVu_Sans_Mono.tif
DejaVu Sans Mono Bold : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 3 to file ../langdata/eng/eng.DejaVu_Sans_Mono_Bold.tif
DejaVu Serif : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 4 to file ../langdata/eng/eng.DejaVu_Serif.tif
DejaVu Serif Bold : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 5 to file ../langdata/eng/eng.DejaVu_Serif_Bold.tif
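A note on &>: it is a bash shortcut that sends both standard output and standard error to the file. The use of |& and &> in this thread suggests that text2image writes its report to standard error, so a plain > would probably leave the file empty. The portable spelling is > file 2>&1, sketched here with a stand-in function (fake_tool is hypothetical, not part of Tesseract):

```shell
# Stand-in for text2image: one line on stdout, one on stderr.
fake_tool() {
  echo "to stdout"
  echo "to stderr" >&2
}

log_file=$(mktemp)

# Portable equivalent of bash's "&> file": redirect stdout to the file,
# then point stderr at the same place.
fake_tool > "$log_file" 2>&1

captured=$(cat "$log_file")
rm -f "$log_file"
printf '%s\n' "$captured"
```

Both lines end up in the log file, in the order they were written.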
If I use fonts that do not provide adequate coverage of the training text, the output reports failures.
For example, the following command tries to render Hindi text in Devanagari script using regular Latin-script fonts. Since the digits and punctuation are shared, those fonts show a coverage of about 18%.
text2image --find_fonts \
--fonts_dir /usr/share/fonts/ \
--text ../langdata/hin/hin.training_text \
--min_coverage .99 \
--outputbase ../langdata/hin/hin
Total chars = 34998
Font DejaVu Serif failed with 6378 hits = 18.22%
Font DejaVu Serif Bold failed with 6378 hits = 18.22%
Font Dingbats failed with 5723 hits = 16.35%
Stripped 2 unrenderable words
FreeMono : 34995 hits = 99.99%, raw = 135 = 99.26%
Font FreeMono Bold failed with 6378 hits = 18.22%
Font FreeMono Bold Italic failed with 6378 hits = 18.22%
Font FreeMono Italic failed with 6378 hits = 18.22%
Stripped 2 unrenderable words
FreeSans : 34995 hits = 99.99%, raw = 135 = 99.26%
Font FreeSans Italic failed with 6527 hits = 18.65%
Stripped 2 unrenderable words
FreeSans Semi-Bold : 34993 hits = 99.99%, raw = 134 = 98.53%
Font FreeSans Semi-Bold Italic failed with 6378 hits = 18.22%
Stripped 2 unrenderable words
FreeSerif : 34995 hits = 99.99%, raw = 135 = 99.26%
Stripped 2 unrenderable words
FreeSerif Bold : 34995 hits = 99.99%, raw = 135 = 99.26%
Font FreeSerif Bold Italic failed with 6380 hits = 18.23%
Font FreeSerif Italic failed with 6527 hits = 18.65%
Note the lines above that contain 'raw'. Those are the only fonts that meet the coverage criterion.
I appreciate the answers. Thanks.
@ibr123,
Please use the forum for asking questions.
I'm also having this issue, using the latest Homebrew version:
tesseract: stable 3.05.00 (bottled), HEAD
The training text file I'm using is the first one posted by amitdo, but it happens with any text.
Running $ text2image --list_available_fonts --fonts_dir=/Library/Fonts does list the font I want, Lucida Grande.
I also ran fc-cache -frv.
Result:
$ text2image --text=eng.training_text --outputbase=eng.LucidaGrande.exp0 --font='Lucida Grande' --fonts_dir=/Library/Fonts
[1] 72778 segmentation fault text2image --text=eng.training_text --outputbase=eng.LucidaGrande.exp0
I don't know how to get more error information for you. Please help.
@amitdo Could it be that some required commit fixing text2image has not been backported for 3.05?
@Tjorriemorrie Did you build tesseract from source? Please also try the 4.0 alpha version (latest source from GitHub) and check whether the same error occurs.
Ray made some changes in 4.00 that caused this problem to reappear. These changes were also backported to 3.05.
Here is the source for the regression:
https://github.com/tesseract-ocr/tesseract/commit/709935851061#diff-b37dca9f063c3727f62c496e514177a9L440
Here is a (temporary) solution:
https://github.com/tesseract-ocr/tesseract/issues/736#issuecomment-282685898