Tesseract: Running example hocr command from wiki does not work as expected

Created on 4 Oct 2019 · 10 Comments · Source: tesseract-ocr/tesseract

I'm having trouble running the example command from the wiki page on hOCR output (https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage#hocr-output).
Command:
tesseract --tessdata-dir ./ ./testing/eurotext.png ./testing/eurotext-eng -l eng hocr

Environment

  • Tesseract Version: 4.0.0-beta.1 with Leptonica
  • Platform: Ubuntu
  • Directory: The cwd ./ contains eng.traineddata from tessdata_best

Current Behavior:

eurotext-eng.txt is generated with text and the terminal says Can't open hocr:

read_params_file: Can't open hocr

Expected Behavior:

Should generate eurotext-eng.hocr
(Note: this works as expected if I exclude --tessdata-dir ./)

Suggested Fix:

Perhaps this is an argument parse bug, or maybe there's a new syntax and the wiki needs updating. Or maybe I'm missing something.

Thank you!

question

All 10 comments

Thank you.

The example was wrong for your case. You already found the right solution. I fixed the Wiki.

Thanks @stweil. So does this mean the hocr option cannot be combined with the --tessdata-dir option? Is there a workaround to use these two options together?

The Tesseract command line syntax is a bit confusing. hocr is not an option, but a configfile. --tessdata-dir is an option. Both can be used together, but options must come before configfiles. In your case the given directory did not contain the expected tessdata files.
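To make the ordering rule concrete, here is a minimal sketch that prints a command line with the options first and the config file names last. `build_cmd` is a hypothetical helper written for this illustration, not part of Tesseract, and it only prints the command rather than running it. It also does no quoting, so it assumes paths without spaces.

```shell
# Sketch: print a tesseract command line with options placed before config files.
# build_cmd is a hypothetical helper, not part of Tesseract.
build_cmd() {
  image=$1
  outbase=$2
  datadir=$3
  lang=$4
  shift 4
  # Input image and output base first, then the options.
  printf 'tesseract %s %s --tessdata-dir %s -l %s' "$image" "$outbase" "$datadir" "$lang"
  # Whatever remains is treated as config file names (such as hocr) and goes last.
  for cfg in "$@"; do
    printf ' %s' "$cfg"
  done
  printf '\n'
}

build_cmd ./testing/eurotext.png ./testing/eurotext-eng ./ eng hocr
```

Even with the right ordering, the config file still has to be found under the directory given to --tessdata-dir, which is the second half of the explanation above.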

"The Tesseract command line syntax is a bit confusing"

It's terrible...

Ah I see. The wiki does call it a config. So I just need to copy tessdata/configs/hocr into my tessdata directory and it should work. I'll give this a try.
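A rough sketch of that layout, assuming Tesseract looks for a config name under a configs/ subdirectory of the tessdata directory, as in the stock tessdata tree. The directory name my-tessdata is hypothetical, and the two parameters are the hocr config contents quoted later in this thread.

```shell
# Sketch: build a self-contained tessdata directory that includes the hocr config.
# "my-tessdata" is a hypothetical directory name used only for illustration.
datadir=./my-tessdata
mkdir -p "$datadir/configs"

# Write the two-line hocr config (contents as quoted later in this thread).
printf '%s\n' 'tessedit_create_hocr 1' 'hocr_font_info 0' > "$datadir/configs/hocr"

# eng.traineddata would be copied next to configs/, for example:
#   cp /path/to/eng.traineddata "$datadir/"
ls "$datadir/configs"
```

With eng.traineddata in place, the directory would then be passed as --tessdata-dir ./my-tessdata.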

Hmm didn't work. Got this error: read_params_file: parameter not found: enable_new_segsearch

It worked. That's only a warning. You copied an old hocr file. Remove enable_new_segsearch from that file.

Actually, the hocr config file does not contain enable_new_segsearch. And no files are generated, neither txt nor hocr/html.

Maybe someone else can reproduce what I'm experiencing:

  • Using 4.0.0-beta.1 leptonica-1.75.3
  • Create directory with just these 3 files:

    • eng.traineddata (from tessdata_best)

    • hocr

    • test-image.png

  • run tesseract --tessdata-dir ./ ./test-image.png ./extract-image-output hocr

The hocr file contains:

tessedit_create_hocr 1
hocr_font_info 0

The output from this is:
read_params_file: parameter not found: enable_new_segsearch
and no new files created in the cwd.
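One way to rule out the config file lookup while debugging is to pass the same parameters with Tesseract's -c VAR=VALUE option instead of naming a config file. The sketch below converts a config file like the one shown above into such flags. `config_to_flags` and the scratch file name are hypothetical, and the helper assumes one "name value" pair per line with no spaces in the value.

```shell
# Sketch: turn "name value" lines from a config file into -c name=value flags.
# config_to_flags is a hypothetical helper, not part of Tesseract.
config_to_flags() {
  # Skip blank lines and lines starting with '#'; join the pairs on one line.
  awk 'NF >= 2 && $1 !~ /^#/ { printf "%s-c %s=%s", sep, $1, $2; sep = " " }
       END { print "" }' "$1"
}

# Recreate the two-line hocr config shown above in a scratch file.
printf '%s\n' 'tessedit_create_hocr 1' 'hocr_font_info 0' > ./hocr-example.cfg

config_to_flags ./hocr-example.cfg
```

The printed flags could then replace the trailing hocr argument, for example: tesseract ./test-image.png ./extract-image-output --tessdata-dir ./ -c tessedit_create_hocr=1 -c hocr_font_info=0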

If you really get the error read_params_file: parameter not found: enable_new_segsearch, there are two possibilities:

  1. You are running a different command than the one you posted.
  2. Your tesseract installation does not use eng.traineddata (from tessdata_best) as you stated.

The config parameter enable_new_segsearch must be specified somewhere (e.g. in the traineddata or in a config file).
BTW, please use a recent tesseract version (and data) when you report an issue.
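Both possibilities can be checked from the shell. The following is a small sketch, not an official diagnostic: `report_tesseract` is a hypothetical helper that prints which binary would run, its version banner, and the languages it finds in a given tessdata directory. It assumes the standard --version and --list-langs options.

```shell
# Sketch: report which tesseract binary runs and which languages it can load.
# report_tesseract is a hypothetical helper, not part of Tesseract.
report_tesseract() {
  datadir=${1:-./}
  if ! command -v tesseract >/dev/null 2>&1; then
    echo "tesseract not found on PATH"
    return 0
  fi
  # Path of the binary the shell would actually execute.
  command -v tesseract
  # Version banner; it may be printed on stderr, so merge the streams.
  tesseract --version 2>&1 | head -n 1
  # Languages visible in the given tessdata directory; do not abort on failure.
  tesseract --tessdata-dir "$datadir" --list-langs 2>&1 || true
}

report_tesseract ./
```

If the listed languages do not include eng, or the version is older than expected, the installation is probably not using the files you think it is.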

Thanks @zdenop for the help! I was able to get things working normally by doing the following:

  • I uninstalled 4.0.0 (this was the latest version for Ubuntu using apt)
  • I downloaded and built 4.1.0 from source and installed it
  • I re-downloaded eng.traineddata from tessdata_best

The result: hocr file was generated and no warning message.

Thanks again for the help.
