Tesseract: text2image segfault

Created on 21 Jan 2016 · 36 Comments · Source: tesseract-ocr/tesseract

I'm trying to use the text2image utility to train tesseract. Unfortunately it keeps crashing every time I try to use it :(

text2image --text=training_text.txt --outputbase=test.MenloMedium.exp0 --font='Menlo Medium' --fonts_dir=/Library/Fonts/
(lldb) run
Process 49926 launched: '/usr/local/bin/text2image' (x86_64)
Process 49926 stopped
* thread #1: tid = 0x1d2b8cb, 0x0000000100b74358 libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x0000000100b74358 libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25
libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph:
->  0x100b74358 <+25>: movq   (%rcx), %rdi
    0x100b7435b <+28>: testq  %rdi, %rdi
    0x100b7435e <+31>: je     0x100b74369               ; <+42>
    0x100b74360 <+33>: movq   %rax, %rsi
(lldb) bt
* thread #1: tid = 0x1d2b8cb, 0x0000000100b74358 libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x0000000100b74358 libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25
    frame #1: 0x000000010000edc1 text2image`tesseract::PangoFontInfo::CanRenderString(char const*, int, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >*) const + 321
    frame #2: 0x000000010000ec57 text2image`tesseract::PangoFontInfo::CanRenderString(char const*, int) const + 33
    frame #3: 0x0000000100015227 text2image`tesseract::StringRenderer::StripUnrenderableWords(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*) const + 193
    frame #4: 0x00000001000154aa text2image`tesseract::StringRenderer::RenderToImage(char const*, int, Pix**) + 418
    frame #5: 0x0000000100005748 text2image`main + 2891
    frame #6: 0x00007fff8a2645ad libdyld.dylib`start + 1
    frame #7: 0x00007fff8a2645ad libdyld.dylib`start + 1

All 36 comments

My guess, and it's a very uneducated one, is that the pointer in run->item->analysis.font only points to a PangoFont and not a PangoFcFont. Since it is passed through reinterpret_cast, the later call to pango_fc_font_get_glyph hits a null pointer somewhere.

I have checked whether run->item->analysis.font itself is a null pointer; it isn't.

My very crude workaround for now:

diff --git a/training/pango_font_info.cpp b/training/pango_font_info.cpp
index b542591..86c108e 100644
--- a/training/pango_font_info.cpp
+++ b/training/pango_font_info.cpp
@@ -416,10 +416,13 @@ bool PangoFontInfo::CanRenderString(const char* utf8_word, int len,
       tlog(2, "Found end of line NULL run marker\n");
       continue;
     }
-    PangoGlyph dotted_circle_glyph;
+    // PangoGlyph dotted_circle_glyph;
     PangoFont* font = run->item->analysis.font;
-    dotted_circle_glyph = pango_fc_font_get_glyph(
-        reinterpret_cast<PangoFcFont*>(font), kDottedCircleGlyph);
+
+    // printf("The pointer: %p\n", (void *) font);
+
+    // dotted_circle_glyph = pango_fc_font_get_glyph(
+    //     reinterpret_cast<PangoFcFont*>(font), kDottedCircleGlyph);
     if (TLOG_IS_ON(2)) {
       PangoFontDescription* desc = pango_font_describe(font);
       char* desc_str = pango_font_description_to_string(desc);
@@ -456,9 +459,9 @@ bool PangoFontInfo::CanRenderString(const char* utf8_word, int len,
         const bool unknown_glyph =
             (cluster_iter.glyph_item->glyphs->glyphs[i].glyph &
              PANGO_GLYPH_UNKNOWN_FLAG);
-        const bool illegal_glyph =
-            (cluster_iter.glyph_item->glyphs->glyphs[i].glyph ==
-             dotted_circle_glyph);
+        const bool illegal_glyph = false;
+            // (cluster_iter.glyph_item->glyphs->glyphs[i].glyph ==
+            //  dotted_circle_glyph);
         bad_glyph = unknown_glyph || illegal_glyph;
         if (TLOG_IS_ON(2)) {
           printf("(%d=%d)", cluster_iter.glyph_item->glyphs->glyphs[i].glyph,

First thing - remove your patch.

Now, run this:

text2image --list_available_fonts --fonts_dir=/Library/Fonts

Do you get a list of fonts? If the answer is 'yes', then proceed.

Use this file:
https://raw.githubusercontent.com/tesseract-ocr/langdata/master/eng/eng.training_text

text2image --text=eng.training_text --outputbase=eng.MenloRegular.exp0 --font='Menlo Regular' --fonts_dir=/Library/Fonts

If this doesn't work try with other fonts, but only regular ones, not bold/italic/medium.

If this doesn't work either, download and install the DejaVu fonts.

Also, please provide any error message you get.

http://ryanfb.github.io/etc/2014/11/19/installing_tesseract_training_tools_on_mac_os_x.html

Note that fontconfig font locations and caching are a whole other nightmare, and I seem unable to get text2image to respect/use the --fonts_dir argument on OS X. Your best bet seems to be to install things as system/user fonts (e.g. copy into ~/Library/Fonts) and optionally run fc-cache -frv to force a cache update.

A note for @ryanfb:
text2image --list_available_fonts
Running this command on Ubuntu 14.04 does not produce any output, but running this one does:
text2image --list_available_fonts --fonts_dir=

cc: @behdad
Behdad, maybe you can help here?

@ryanfb, did you try running tesstrain.sh on a Mac?
Maybe it has some tricks that make text2image actually work on a Mac.

Indeed, looks like the font is not a PangoFcFont. Try using PANGO_FC_FONT(...) instead of the reinterpret_cast<...>, and you should get a warning. You can use PANGO_IS_FC_FONT() to test at runtime.

@LinusU ?

Sorry, I haven't had time to investigate this further. Hopefully I'll get some work done on this in the near future...

@jbarlow83, anyone with a mac...
Could you help test and debug this issue?

@amitdo

I have the same problem. I cannot use tesstrain.sh on a Mac (so I use Ubuntu on VirtualBox for training).

I tried the following.

$ brew install  tesseract --with-training-tools --HEAD 
$ text2image --list_available_fonts --fonts_dir=/Library/Fonts
<skip>

There are 1221 fonts installed in total, but no 'Regular' style appears in text2image's output,
even for fonts that do have "Regular" style glyphs.

So I cannot try text2image with a 'Regular' style font. What should I do about this?

$ text2image --text=eng.training_text --outputbase=eng.MenloRegular.exp0 --font='Menlo Regular' --fonts_dir=/Library/Fonts
Could not find font named Menlo Regular. Pango suggested font Menlo Medium
Please correct --font arg.:Error:Assert failed:in file text2image.cpp, line 437
Abort trap: 6

For example, with a DejaVu font (possibly not a regular style):

$ text2image --text=eng.training_text --outputbase=eng.MenloRegular.exp0 --font='DejaVu Sans Thin' --fonts_dir=~/Library/Fonts
Segmentation fault: 11

Hi @atuyosi!

I don't have 'DejaVu Sans Thin' on my Ubuntu system. I have 'DejaVu Sans'.

Please copy the output of
text2image --list_available_fonts --fonts_dir=/Library/Fonts
to a new file 'fontlist.txt' and attach this file here. I want to find a font we both have.

Hi @amitdo, my font list is here.

fontlist.txt

Thanks.

Please try:
text2image --text=eng.training_text --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman, Bold' --fonts_dir=/Library/Fonts

@amitdo

text2image's result is below:

$ text2image --text=eng.training_text --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman, Bold' --fonts_dir=/Library/Fonts
Segmentation fault: 11

The detailed backtrace is below:

$ lldb /usr/local/bin/text2image
(lldb) target create "/usr/local/bin/text2image"
Current executable set to '/usr/local/bin/text2image' (x86_64).
(lldb) run --text=eng.training_text --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman, Bold' --fonts_dir=/Library/Fonts
Process 53735 launched: '/usr/local/bin/text2image' (x86_64)
Process 53735 stopped
* thread #1: tid = 0xf3978, 0x0000000100b98358 libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x0000000100b98358 libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25
libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph:
->  0x100b98358 <+25>: movq   (%rcx), %rdi
    0x100b9835b <+28>: testq  %rdi, %rdi
    0x100b9835e <+31>: je     0x100b98369               ; <+42>
    0x100b98360 <+33>: movq   %rax, %rsi
(lldb) bt
* thread #1: tid = 0xf3978, 0x0000000100b98358 libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x0000000100b98358 libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25
    frame #1: 0x000000010000ea31 text2image`tesseract::PangoFontInfo::CanRenderString(char const*, int, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >*) const + 321
    frame #2: 0x000000010000e8c7 text2image`tesseract::PangoFontInfo::CanRenderString(char const*, int) const + 35
    frame #3: 0x0000000100015047 text2image`tesseract::StringRenderer::StripUnrenderableWords(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*) const + 195
    frame #4: 0x00000001000152d0 text2image`tesseract::StringRenderer::RenderToImage(char const*, int, Pix**) + 418
    frame #5: 0x0000000100005541 text2image`main + 2895
    frame #6: 0x00007fff9c3ec5ad libdyld.dylib`start + 1
    frame #7: 0x00007fff9c3ec5ad libdyld.dylib`start + 1
(lldb)

Another test...

eng.training.txt

text2image --text=eng.training.txt --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman, Bold' --fonts_dir=/Library/Fonts --tlog_level=3

Hi @amitdo,
I have the exact same issue with text2image (HEAD revision) running on OSX.

Here's the log output when running your last command:

$ text2image --text=eng.training.txt --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman, Bold' --fonts_dir=/Library/Fonts --tlog_level=3
query weight = 700   selected weight =700
query_desc: 'Times New Roman, Bold' Selected: 's'
Render string of size 84
Starting page 0
max_width = 3400, max_height = 4600
len = 84  buf_len = 84
Segmentation fault: 11

And the corresponding debug trace:

$ lldb  /usr/local/bin/text2image
(lldb) target create "/usr/local/bin/text2image"
Current executable set to '/usr/local/bin/text2image' (x86_64).
(lldb) run --text=eng.training.txt --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman, Bold' --fonts_dir=/Library/Fonts --tlog_level=3
Process 43961 launched: '/usr/local/bin/text2image' (x86_64)
query weight = 700   selected weight =700
query_desc: 'Times New Roman, Bold' Selected: 's'
Render string of size 84
Starting page 0
max_width = 3400, max_height = 4600
len = 84  buf_len = 84
Process 43961 stopped
* thread #1: tid = 0x1e1688, 0x0000000100c7e36e libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x0000000100c7e36e libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25
libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25:
-> 0x100c7e36e:  movq   (%rcx), %rdi
   0x100c7e371:  testq  %rdi, %rdi
   0x100c7e374:  je     0x100c7e37f               ; pango_fc_font_get_glyph + 42
   0x100c7e376:  movq   %rax, %rsi

Hope this helps.

Could someone with a Mac retest this with the latest commit in the repo?

@amitdo

I tried the HEAD version (5610738), and the exit status is 0.
It looks good with the Times fonts.

Could you check the logs?

$ uname -a
Darwin sakura.local 15.6.0 Darwin Kernel Version 15.6.0: Thu Jun 23 18:25:34 PDT 2016; root:xnu-3248.60.10~1/RELEASE_X86_64 x86_64

$ brew install tesseract --HEAD --with-training-tools
<skip>
$ tesseract --version
tesseract 3.05.00dev
 leptonica-1.73
  libjpeg 8d : libpng 1.6.23 : libtiff 4.0.6 : zlib 1.2.5

$ ls -l /usr/local/bin/text2image
lrwxr-xr-x  1 atuyosi  admin  49  8 11 01:55 /usr/local/bin/text2image -> ../Cellar/tesseract/HEAD-5610738_2/bin/text2image

$ text2image --text=eng.training.txt --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman, Bold' --fonts_dir=/Library/Fonts --tlog_level=3

Log file is below:

issue_195_HEAD-5610738.txt

The exit status and output files:

$ echo $?
0
$ ls
eng.TimesNewRomanBold.exp0.box  eng.training.txt
eng.TimesNewRomanBold.exp0.tif

eng.TimesNewRomanBold.exp0.box.txt

eng timesnewromanbold exp0

  • The .box file was renamed to .txt, and the .tif was converted to .png.

Thanks.

@atuyosi, thanks for testing. The output files look okay.
The issue seems to be solved.

Hi guys,
I'm having trouble with text2image. I'm trying to generate the font list described in the main article about Tesseract OCR. I can create the box file and tif file normally for a single font, but generating the list fails. I'm using Tesseract OCR 3.03 on Windows 10, and the language is English. The command is:
text2image --text=training_text.txt --outputbase=eng.fontlist.txt --fonts_dir=C:\Windows\Fonts --find_fonts=true --min_coverage=1.0 --render_per_font=false --fontconfig_tmpdir=C:\Tesseract\Tesseract-OCR
and I get a warning:
WARNING: Could not find a font to render image title with!
It also reports a failure for every font, such as:
Font Aldhabi failed with 62 hits = 21.60%
It also prints '%' (U+25) not covered by font, but I don't know what that means.
Any idea how to solve this error?

Thanks in advance

You are looking for 100% coverage of training text in the fonts with --min_coverage=1.0
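
To make the threshold concrete, here is a minimal Python sketch of the comparison. It assumes text2image accepts a font when its hit count is at least the total character count times --min_coverage; that is inferred from the percentages it reports, not confirmed here. The function name meets_coverage is hypothetical, and the numbers are taken from runs quoted in this thread.

```python
def meets_coverage(hits, total_chars, min_coverage):
    """Return True if the covered share of the text reaches min_coverage.

    Assumption (not confirmed against the text2image source in this thread):
    a font passes when hits >= total_chars * min_coverage.
    """
    return hits > 0 and hits >= total_chars * min_coverage


# Figures reported in this thread: 34995 of 34998 characters covered.
print(meets_coverage(34995, 34998, 0.99))  # threshold of 99%
print(meets_coverage(34995, 34998, 1.0))   # threshold of 100%
```

Under this reading, with --min_coverage=1.0 a single uncovered character, such as the '%' (U+25) mentioned above, is enough to reject a font.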

I use the following in bash on Windows (MobaXterm).

text2image --find_fonts \
--fonts_dir /mnt/c/Windows/Fonts \
--text ./langdata/eng/eng.training_text \
--min_coverage .95  \
--outputbase ./langdata/eng/eng \
|& grep raw | sed -e 's/ :.*/" \\/g'  | sed -e 's/^/  "/' > ../langdata/eng/fontslist-windows.txt

Thanks a lot for your response, but I didn't understand the last part of the command you are using:
|& grep raw | sed -e 's/ :.*/" \\/g'  | sed -e 's/^/  "/' > ../langdata/eng/fontslist-windows.txt
I don't recall the text2image binary having such arguments. I also tried changing the coverage in my command from 100 to 95, yet I still have the same problem.

Keep in mind I'm not familiar with the terminal environment :)

The text2image --find_fonts command prints its output to the terminal.

The \ at the end of each line is a continuation mark for the command.

| pipes the output of the earlier command into the next command (|& pipes the error output as well).

grep selects all the lines that contain 'raw', which are the lines that carry the font names.

The first sed command replaces everything from the ' :' onward with a closing quote mark and a trailing backslash.

The second sed command adds an opening quote mark to the beginning of each line.

The resulting output is saved in the file named after >

So, basically, it deletes all extraneous output and creates a text file with each font name in quotes, which can be used as part of a fontslist or plugged into language-specific.sh. Example of the output below:

  "WenQuanYi Zen Hei Medium" \
  "WenQuanYi Zen Hei Mono Medium" \
  "WenQuanYi Zen Hei Sharp Medium" \
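
For environments where grep and sed are not available, the same filtering can be sketched in Python. This is an illustration only and not part of Tesseract: the function name quote_passing_fonts is hypothetical, and it assumes the lines of interest look like `Name : N hits = P%, raw = ...`, as in the output quoted in this thread.

```python
def quote_passing_fonts(output_lines):
    """Mimic: grep raw | sed -e 's/ :.*/" \\/g' | sed -e 's/^/  "/'."""
    quoted = []
    for line in output_lines:
        line = line.rstrip("\n")
        # grep raw: keep only lines containing the substring "raw".
        if "raw" not in line:
            continue
        # First sed: replace everything from the first " :" onward
        # with a closing quote and a trailing backslash.
        name, sep, _rest = line.partition(" :")
        suffix = '" \\' if sep else ""
        # Second sed: prefix the line with two spaces and an opening quote.
        quoted.append('  "' + name + suffix)
    return quoted


# Sample lines copied from the text2image output shown in this thread.
sample = [
    "Total chars = 6694",
    "DejaVu Sans : 6694 hits = 100.00%, raw = 112 = 100.00%",
    "Rendered page 0 to file ../langdata/eng/eng.DejaVu_Sans.tif",
]
for entry in quote_passing_fonts(sample):
    print(entry)
```

Like the grep it imitates, this matches the substring "raw" anywhere on the line, so a line that contains "raw" for some other reason would also be kept.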

I notice just now that you say

using tesseract OCR 3.03

That could be the problem. text2image segfaults have been fixed in recent code.

Please use the latest Windows binaries, e.g. from

https://github.com/UB-Mannheim/tesseract/wiki

@ibr123 Please note that if you are using the Windows command prompt rather than bash under Windows, commands such as grep and sed may not be available.

"text2image --find_fonts command displays the output on the terminal": does that mean no file will be generated, and the result is only printed on the terminal?

text2image --find_fonts \
 --fonts_dir  /usr/share/fonts/truetype/dejavu/ \
 --text ../langdata/eng/eng.training_text \
 --min_coverage .99  \
 --outputbase ../langdata/eng/eng

Total chars = 6694
DejaVu Sans : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 0 to file ../langdata/eng/eng.DejaVu_Sans.tif
DejaVu Sans Bold : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 1 to file ../langdata/eng/eng.DejaVu_Sans_Bold.tif
DejaVu Sans Mono : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 2 to file ../langdata/eng/eng.DejaVu_Sans_Mono.tif
DejaVu Sans Mono Bold : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 3 to file ../langdata/eng/eng.DejaVu_Sans_Mono_Bold.tif
DejaVu Serif : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 4 to file ../langdata/eng/eng.DejaVu_Serif.tif
DejaVu Serif Bold : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 5 to file ../langdata/eng/eng.DejaVu_Serif_Bold.tif

You can redirect the output to a file with:

text2image --find_fonts \
--fonts_dir  /usr/share/fonts/truetype/dejavu/ \
--text ../langdata/eng/eng.training_text \
--min_coverage .99  \
--outputbase ../langdata/eng/eng &>./test.txt

test.txt then contains:

Total chars = 6694
DejaVu Sans : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 0 to file ../langdata/eng/eng.DejaVu_Sans.tif
DejaVu Sans Bold : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 1 to file ../langdata/eng/eng.DejaVu_Sans_Bold.tif
DejaVu Sans Mono : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 2 to file ../langdata/eng/eng.DejaVu_Sans_Mono.tif
DejaVu Sans Mono Bold : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 3 to file ../langdata/eng/eng.DejaVu_Sans_Mono_Bold.tif
DejaVu Serif : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 4 to file ../langdata/eng/eng.DejaVu_Serif.tif
DejaVu Serif Bold : 6694 hits = 100.00%, raw = 112 = 100.00%
Rendered page 5 to file ../langdata/eng/eng.DejaVu_Serif_Bold.tif

If I use fonts that do not provide adequate coverage for the training text, the output reports a failure for them.

For example, the following command tries to render Hindi text in Devanagari script using regular Latin-script fonts. Since numbers and punctuation are the same, those fonts show coverage of about 18%.

text2image --find_fonts \
 --fonts_dir  /usr/share/fonts/ \
 --text ../langdata/hin/hin.training_text \
 --min_coverage .99  \
 --outputbase ../langdata/hin/hin

Total chars = 34998
Font DejaVu Serif failed with 6378 hits = 18.22%
Font DejaVu Serif Bold failed with 6378 hits = 18.22%
Font Dingbats failed with 5723 hits = 16.35%
Stripped 2 unrenderable words
FreeMono : 34995 hits = 99.99%, raw = 135 = 99.26%
Font FreeMono Bold failed with 6378 hits = 18.22%
Font FreeMono Bold Italic failed with 6378 hits = 18.22%
Font FreeMono Italic failed with 6378 hits = 18.22%
Stripped 2 unrenderable words
FreeSans : 34995 hits = 99.99%, raw = 135 = 99.26%
Font FreeSans Italic failed with 6527 hits = 18.65%
Stripped 2 unrenderable words
FreeSans Semi-Bold : 34993 hits = 99.99%, raw = 134 = 98.53%
Font FreeSans Semi-Bold Italic failed with 6378 hits = 18.22%
Stripped 2 unrenderable words
FreeSerif : 34995 hits = 99.99%, raw = 135 = 99.26%
Stripped 2 unrenderable words
FreeSerif Bold : 34995 hits = 99.99%, raw = 135 = 99.26%
Font FreeSerif Bold Italic failed with 6380 hits = 18.23%
Font FreeSerif Italic failed with 6527 hits = 18.65%

Note above the lines that have raw in them. Those are the only fonts that meet the coverage criteria.
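
As a rough sketch of how that output could be sorted into passing and failing fonts programmatically, the Python below parses the two line shapes visible in the output above. It is not part of Tesseract; the function name split_by_coverage and both regular expressions are assumptions based only on the lines quoted in this thread.

```python
import re

# Line shapes assumed from the output quoted in this thread:
#   "FreeMono : 34995 hits = 99.99%, raw = 135 = 99.26%"
#   "Font FreeMono Bold failed with 6378 hits = 18.22%"
PASS_RE = re.compile(r"^(?P<name>.+?) : (?P<hits>\d+) hits = (?P<pct>[\d.]+)%, raw = ")
FAIL_RE = re.compile(r"^Font (?P<name>.+) failed with (?P<hits>\d+) hits = (?P<pct>[\d.]+)%")


def split_by_coverage(output_lines):
    """Return (passed, failed) dicts mapping font name to coverage percent."""
    passed, failed = {}, {}
    for line in output_lines:
        line = line.strip()
        match = PASS_RE.match(line)
        if match:
            passed[match.group("name")] = float(match.group("pct"))
            continue
        match = FAIL_RE.match(line)
        if match:
            failed[match.group("name")] = float(match.group("pct"))
    return passed, failed


sample = [
    "Total chars = 34998",
    "Font DejaVu Serif failed with 6378 hits = 18.22%",
    "Stripped 2 unrenderable words",
    "FreeMono : 34995 hits = 99.99%, raw = 135 = 99.26%",
]
passed, failed = split_by_coverage(sample)
print("passed:", passed)
print("failed:", failed)
```

Lines that match neither shape, such as "Total chars" or "Stripped 2 unrenderable words", are ignored.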

I appreciate the answers, thanks.

@ibr123,

Please use the forum for asking questions.

I'm also having this issue, using the latest Homebrew version:
tesseract: stable 3.05.00 (bottled), HEAD

The training text file I'm using is the first one posted by amitdo, but it happens with any text.

Running $ text2image --list_available_fonts --fonts_dir=/Library/Fonts does list the font I want, Lucida Grande.

I also ran fc-cache -frv.

Result:
$ text2image --text=eng.training_text --outputbase=eng.LucidaGrande.exp0 --font='Lucida Grande' --fonts_dir=/Library/Fonts

[1] 72778 segmentation fault text2image --text=eng.training_text --outputbase=eng.LucidaGrande.exp0

I don't know how to get you more error information. Please help.

@amitdo Could it be that some required commit fixing text2image has not been backported for 3.05?

@Tjorriemorrie Did you build tesseract from source? Please also try the 4.0 alpha version (latest source from GitHub) and check whether the same error occurs.

Ray made some changes in 4.00 that made this problem reappear. These changes were also backported to 3.05.

Here is the source for the regression:
https://github.com/tesseract-ocr/tesseract/commit/709935851061#diff-b37dca9f063c3727f62c496e514177a9L440

Here is a (temporary) solution:
https://github.com/tesseract-ocr/tesseract/issues/736#issuecomment-282685898
