Tesseract: text2image segmentation fault on macOS ( regression #195?)

Created on 25 Feb 2017  路  41Comments  路  Source: tesseract-ocr/tesseract

There seems to be a regression bug with text2image of 3.05 release.
The text2image utility caused segfault, so I couldn' use tesstrain.sh on macOS (10.12.3).

$ text2image --text eng.training_text --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman, Bold' --fonts_dir=/Library/Fonts --tlog_level=3
query weight = 700   selected weight =700
query_desc: 'Times New Roman, Bold' Selected: 'Times New Roman, Bold'
Render string of size 6801
Starting page 0
max_width = 3400, max_height = 4600
len = 6801  buf_len = 6801
Found offset = 6421
Segmentation fault: 11

ref. https://github.com/tesseract-ocr/tesseract/issues/195

for detail:

$ lldb /usr/local/bin/text2image
(lldb) target create "/usr/local/bin/text2image"
Current executable set to '/usr/local/bin/text2image' (x86_64).
(lldb) run --text eng.training_text --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman, Bold' --fonts_dir=/Library/Fonts
Process 40801 launched: '/usr/local/bin/text2image' (x86_64)
Process 40801 stopped
* thread #1: tid = 0xffe117, 0x0000000100bdd5a7 libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x0000000100bdd5a7 libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25
libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph:
->  0x100bdd5a7 <+25>: movq   (%rcx), %rdi
    0x100bdd5aa <+28>: testq  %rdi, %rdi
    0x100bdd5ad <+31>: je     0x100bdd5b8               ; <+42>
    0x100bdd5af <+33>: movq   %rax, %rsi
(lldb)
$ tesseract -v
tesseract 3.05.00
 leptonica-1.74.1
  libjpeg 8d : libpng 1.6.28 : libtiff 4.0.7 : zlib 1.2.8
bug

Most helpful comment

No error occurred. It looks work. It seems to be working.

$ PANGOCAIRO_BACKEND=fc text2image --text eng.training.txt --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman, Bold' --fonts_dir=/Library/Fonts --tlog_level=3  2&> detail_log_736_1.txt

detail_log_736_1.txt

But macOS's Preview.app can't open this tif file (although Gimp can open this file ).
I'll try to run tesstrain.sh later.

All 41 comments

Here is the source for the regression:
https://github.com/tesseract-ocr/tesseract/commit/709935851061#diff-b37dca9f063c3727f62c496e514177a9L440

It also affects 4.00.

Hi @amitdo

Thanks for your quick response.

I tried your patch, but same result.

$ wget https://github.com/amitdo/tesseract/commit/da7a7fa2f23f.patch
$ patch -p1 < da7a7fa2f23f.patch
patching file training/pango_font_info.cpp

try patched binary:

$ text2image --text eng.training.txt --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman, Bold' --fonts_dir=/Library/Fonts --tlog_level=3
query weight = 700   selected weight =700
query_desc: 'Times New Roman, Bold' Selected: 'Times New Roman, Bold'
Render string of size 84
Starting page 0
max_width = 3400, max_height = 4600
len = 84  buf_len = 84
Segmentation fault: 11
$ lldb /usr/local/bin/text2image
(lldb) target create "/usr/local/bin/text2image"
Current executable set to '/usr/local/bin/text2image' (x86_64).
(lldb) run --text eng.training.txt --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman, Bold' --fonts_dir=/Library/Fonts --tlog_level=3
Process 88534 launched: '/usr/local/bin/text2image' (x86_64)
query weight = 700   selected weight =700
query_desc: 'Times New Roman, Bold' Selected: 'Times New Roman, Bold'
Render string of size 84
Starting page 0
max_width = 3400, max_height = 4600
len = 84  buf_len = 84
Process 88534 stopped
* thread #1: tid = 0x122eeb9, 0x0000000100cfb5a7 libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x0000000100cfb5a7 libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph + 25
libpangoft2-1.0.0.dylib`pango_fc_font_get_glyph:
->  0x100cfb5a7 <+25>: movq   (%rcx), %rdi
    0x100cfb5aa <+28>: testq  %rdi, %rdi
    0x100cfb5ad <+31>: je     0x100cfb5b8               ; <+42>
    0x100cfb5af <+33>: movq   %rax, %rsi
(lldb)

Try to manually set the environment variable PANGOCAIRO_BACKEND to fc and then call text2image.

No error occurred. It looks work. It seems to be working.

$ PANGOCAIRO_BACKEND=fc text2image --text eng.training.txt --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman, Bold' --fonts_dir=/Library/Fonts --tlog_level=3  2&> detail_log_736_1.txt

detail_log_736_1.txt

But macOS's Preview.app can't open this tif file (although Gimp can open this file ).
I'll try to run tesstrain.sh later.

I tried to run tesstrain.sh with above patch and workaround . It works fine. Successfully generated jpn.traineddata.

$ TESSDATA_PREFIX=/usr/local/share/  PANGOCAIRO_BACKEND=fc ./training/tesstrain.sh --lang jpn --langdata_dir ../github_tesseract/langdata/  --training_text ../github_tesseract/langdata/jpn/jpn.training_text --fonts_dir /Library/Fonts/

Note: I replaced mktemp with gmktemp (coreutils ) on macOS

It should also work without my patch.

The patch was supposed to programmatically set the said environment variable. Since you reported it did not work, I asked you to do it manually. The call to the setenv function in my patch needs to be moved to another place in the code.

thank you amitdo. it fixed my problem too~~

Got the same problem on MAC OSX 10.13.3 using tessercat 4.0.0-beta.1, fixed it using the PANGOCAIRO_BACKEND=fc before executing the tesstrain.sh file

@FernandoGOT,

Can you please check my new patch?
https://github.com/amitdo/tesseract/commit/ee2d6739cf0782

Don't set the environment variable PANGOCAIRO_BACKEND when you test this patch.

Instead of setting PANGOCAIRO_BACKEND=fc, it should also be possible to replace pango_cairo_font_map_get_default() by pango_cairo_font_map_new_for_font_type(CAIRO_FONT_TYPE_FT).

@stweil, your suggested method looks simpler. Please make a PR.

Is this issue still valid? If yes can we get this modification to 4.0 finale release?

I don't have a Mac, but we didn't done anything to fix this issue.

@stweil, https://github.com/tesseract-ocr/tesseract/issues/736#issuecomment-381436973

@zdenop It has not fixed (4.0.0-beta.4-164-g5dfce).

It is still necessary to set environment variables PANGOCAIRO_BACKEND=fc .

Can you please check if 345e5ee1f3e78d16927a667212623d9507cc4a63 solve the problem?

@zdenop ,

I tested 345e5ee. No error occurred.

It's OK.

$ text2image --fonts_dir /System/Library/Fonts/  --list_available_fonts
<skip>
$ echo $?
0
$ text2image --text jpn_test.txt  --outputbase=jpn.HiraginoMinchoProLicht.exp0 --font='Hiragino Mincho Pro Light' --fonts_dir=/System/Library/Fonts --tlog_level=3 --ptsize 32
<skip>
$ echo $?
0

Software Version:

$ text2image --version
4.0.0-beta.4-167-g345e
$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.14
BuildVersion:   18A391

I have to revert this patch because of side effects. Other solution should be found or implemented...

@atuyosi: can you please try get the latest code and post output of text2image -v

I have ready something like this:

diff --git a/src/training/text2image.cpp b/src/training/text2image.cpp
index 1142bb68..3e2b8a23 100644
--- a/src/training/text2image.cpp
+++ b/src/training/text2image.cpp
@@ -50,6 +50,9 @@
 #include "tlog.h"
 #include "unicharset.h"
 #include "util.h"
+#ifdef __APPLE__
+#  include "TargetConditionals.h"
+#endif

 // A number with which to initialize the random number generator.
 const int kRandomSeed = 0x18273645;
@@ -689,5 +692,19 @@ int main(int argc, char** argv) {
     }
   }
   tesseract::ParseCommandLineFlags(argv[0], &argc, &argv, true);
+#ifdef TARGET_OS_MAC
+  // Respect enviroment variable. could be:
+  // fc (fontconfig), win32, and coretext
+  // If not set force fontconfig for Mac OS.
+  // See https://github.com/tesseract-ocr/tesseract/issues/736
+  char* backend;
+  backend = getenv("PANGOCAIRO_BACKEND");
+  if (backend == NULL) {
+    putenv("PANGOCAIRO_BACKEND=fc");
+  } else {
+    printf("Using '%s' as pango cairo backend based on enviroment "
+           "variable.\n", backend);
+  }
+#endif
   return Main();
 }

Maybe we can implement it as option --pangocairo_backend for text2image, so user will be aware about this possibility to change backend.

    } else {
      printf("Using '%s' as pango cairo backend based on enviroment "
             "variable.\n", backend);
   }

text2image's code needs the 'fc' backend on linux and mac.
Without the 'fc' backend there will be a crash.

@amitdo : code you quoted means that program will accept what user will set (intentionally). This will give user option to choose backend (in case of new backend or fix for actual)

I have possibility to test it (just sort text - not whole training) on linux and windows:

  • On linux (opensuse 15) pango choose CAIRO_FONT_TYPE_FT automatically and I did not recognized any crash. I see no reason to modify code for linux.
  • On windows (10) pango choose CAIRO_FONT_TYPE_WIN32 automatically and I did not recognized any crash. Setting it to CAIRO_FONT_TYPE_FT (with PANGOCAIRO_BACKEND=fc). Anyway I prefer to use native solution if it works.

@tfmorris : Can you please test this ('text2image -v' and patch above?)?

Linux always uses 'fc'.

Originally Windows had the same issue:
https://github.com/tesseract-ocr/tesseract/issues/380

This patch fixed the crash on windows & mac:
https://github.com/tesseract-ocr/tesseract/commit/8d2d94e4ed32c

but Ray partly reverted it, so it is only compiled for win32 and not on linux or mac.
https://github.com/tesseract-ocr/tesseract/commit/709935851061#diff-b37dca9f063c3727f62c496e514177a9L440

Although the workaround is still there for win32, it has this remark:

// Fixme! Leaks memory and breaks unittests.

@amitdo: thanks for analysis. Thank I will not limit this change to Mac.

The new commit will still crash on mac...

Why?

Because the default bacend on mac is 'coretext', but text2image is not designed to work with it

You should force 'fc' as backend, at least for mac.

... and the patch DOES NOT care about default backend... It cares about environment variable definition. So it should work.

@amitdo: do you have Mac or did you test it on Mac? Because you are claiming that Mac has set PANGOCAIRO_BACKEND variable by default to some value...

@atuyosi: can you please test current code?

...and patch, that reflect above logic, will do:

  1. if env var is set = do nothing (because this PANGOCAIRO_BACKEND must be set by user and he should know what he is doing)
  2. if env var is not set = set it fc (everything should works).

So crash should happened only in case 1. when user set PANGOCAIRO_BACKEND to wrong backend.

@amitdo : can you please read my comments? You miss the point. Or try the code ;-)

I will reread the code, maybe I missed something.

It looks okay :-)

Please pardon me.

You changed the code a few times in the last week, and I was reading the diffs, so I didn't see the whole picture. That was the reason for my confusion.

I have to revert this patch because of side effects. Other solution should be found or implemented...

@zdenop, do you remember which side effects your initial patch had? It fixes pango_font_info_test for MacOS (see discussion for PR #3189).

See #1999 and #2009

@stweil : I do not remember, but pango was always strange for me. ;-) After pango changed build system, I lost interest to to play with as I am mainly on windows at this time...

After pango changed build system, I lost interest to play with as I am mainly on windows at this time...

Meson is great and it does work on windows.

Thank you for your answers. The main problem was obviously unreleased memory. I now have a different solution for #3189.

Was this page helpful?
0 / 5 - 0 ratings