Tesseract: Upgrading from 3.04 to 4.00

Created on 20 Mar 2017 · 7 Comments · Source: tesseract-ocr/tesseract

After upgrading to 4.00, my program that uses Tesseract is broken.

Installing tesseract 4.00

cd /var/bin && wget https://github.com/tesseract-ocr/tesseract/archive/4.00.00alpha.tar.gz -O tesseract-4.00.00alpha.tar.gz && tar -xvf tesseract-4.00.00alpha.tar.gz
cd tesseract-4.00.00alpha && ./autogen.sh && ./configure && make && make install

Compiling program

g++ -std=c++11 txtocr.cpp -o txtocr -llept -ltesseract

Running program

# ./txtocr
./txtocr: error while loading shared libraries: libtesseract.so.4: cannot open shared object file: No such file or directory

All 7 comments

txtocr.cpp

/*
 *  Compile
 *  # g++ -std=c++11 txtocr.cpp -o txtocr -llept -ltesseract
 *
 *  Get tesseract version
 *  # pkg-config --modversion tesseract
*/

#include "txtocr.hpp"

void usage(const std::string VERSION){
    // Version() is a static member, so no TessBaseAPI instance has to be allocated (and leaked) here
    std::cerr << "txtocr version: " << VERSION << "\nTesseract version: " << tesseract::TessBaseAPI::Version() << "\n\nUsage: txtocr input [options]\n"
        "Options:\n"
        "\t-l <string>          -- Set iterator level\n"
        "\t                        (Values: block | para | line | word | symbol)\n"
        "\t-d                   -- Debug (Verbose)\n" << std::endl;
}

int main(int argc, char* argv[]){
    Txtocr a;

    try{
        a.set_level("");

        //  Parse arguments
        for(int i = 1; i < argc; i++){
            std::string arg = std::string(argv[i]);

            if(i == 1){
                a.set_input(arg);
            }
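            else if(arg == "-l"){
                //  Guard against "-l" being the last argument
                if(i + 1 >= argc){
                    throw std::runtime_error("Option '-l' requires a value");
                }
                a.set_level(argv[++i]);
            }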
            else if(arg == "-d"){
                a.set_debug(true);
            }
            else{
                throw std::runtime_error("Argument '"+arg+"' is invalid");
            }
        }

        std::cout << a.run() << std::endl;
    }
    catch(std::exception& e){
        std::cerr << "Error: " << e.what() << "\n" << std::endl;
        usage(a.VERSION);
        return 1;
    }

    return 0;
}

txtocr.h

class Txtocr{
    private:
        std::string input                   = "";
        tesseract::PageIteratorLevel level;
        bool is_debug                       = false;
        void error                          (const std::string& s);
        std::string utf8_to_latin           (const char * in);

    public:
        Txtocr();
        const std::string VERSION           = "0.1";
        void set_input                      (const std::string& s);
        void set_level                      (const std::string& s);
        void set_debug                      (bool d);
        std::string run                     ();
};

txtocr.hpp

#include <iostream>
#include <stdexcept>
#include <fstream>
#include <chrono>
#include <string>
#include <vector>
#include <math.h>
#include <cstdio>   // printf
#include <sstream>  // std::ostringstream
#include <boost/algorithm/string.hpp>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/json_parser.hpp>
#include "txtocr.h"

Txtocr::Txtocr(){}

void Txtocr::set_input(const std::string& s){
    input = s;
}

void Txtocr::set_level(const std::string& s){
    if(s == ""){
        level = tesseract::RIL_TEXTLINE;
    }
    else if(s == "block"){
        level = tesseract::RIL_BLOCK;
    }
    else if(s == "para"){
        level = tesseract::RIL_PARA;
    }
    else if(s == "line"){
        level = tesseract::RIL_TEXTLINE;
    }
    else if(s == "word"){
        level = tesseract::RIL_WORD;
    }
    else if(s == "symbol"){
        level = tesseract::RIL_SYMBOL;
    }
    else{
        error("Invalid iterator level");
    }
}

void Txtocr::set_debug(bool d){
    is_debug = d;
}

void Txtocr::error(const std::string& s){
    throw std::runtime_error(s);
}

std::string Txtocr::utf8_to_latin(const char* in){
    std::string out;

    if(in == NULL){
        return out;
    }

    //  Start from zero so a stray leading continuation byte never reads an uninitialised value
    unsigned int codepoint = 0;
    while (*in != 0){
        unsigned char ch = static_cast<unsigned char>(*in);
        if(ch <= 0x7f){
            codepoint = ch;
        }
        else if(ch <= 0xbf){
            codepoint = (codepoint << 6) | (ch & 0x3f);
        }
        else if(ch <= 0xdf){
            codepoint = ch & 0x1f;
        }
        else if(ch <= 0xef){
            codepoint = ch & 0x0f;
        }
        else{
            codepoint = ch & 0x07;
        }

        ++in;

        if(((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff)){
            if(codepoint <= 255){
                out.append(1, static_cast<char>(codepoint));
            }
            else{
                // do whatever you want for out-of-bounds characters
            }
        }
    }

    return out;
}

std::string Txtocr::run(){
    //  Return error if input file is not defined
    if(input == ""){
        error("Input file not defined");
    }
    //  Return error if input file is not TIFF
    else{
        std::string input_lc = input;
        boost::to_lower(input_lc);
        if(input_lc.substr(input_lc.find_last_of(".") + 1) != "tif"){
            error("Input file must be TIFF");
        }
    }

    auto start = std::chrono::high_resolution_clock::now();

    // Open input image with leptonica library
    Pix *image = pixRead(input.c_str());
    if(image == NULL){
        error("Could not read input image");
    }

    boost::property_tree::ptree root;
    boost::property_tree::ptree children;

    root.put("height", pixGetHeight(image));
    root.put("width", pixGetWidth(image));

    // Initialize tesseract-ocr, without specifying tessdata path
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    if(api->Init(NULL, "dan+eng")){
        // Release resources before error() throws
        delete api;
        pixDestroy(&image);
        error("Could not initialize tesseract");
    }
    api->SetImage(image);
    api->Recognize(0);

    tesseract::ResultIterator* ri = api->GetIterator();

    if(ri != 0){
        do{
            boost::property_tree::ptree child;

            const char* seg = ri->GetUTF8Text(level);
            int x1, y1, x2, y2, height, width;
            ri->BoundingBox(level, &x1, &y1, &x2, &y2);
            height = y2 - y1;
            width = x2 - x1;

            if(is_debug){
                printf("seg: '%s'; BoundingBox: %d,%d,%d,%d;\n", seg, x1, y1, x2, y2);
            }

            child.put("top", y1);
            child.put("left", x1);
            child.put("height", height);
            child.put("width", width);
            child.put("bottom", y1 + height);
            child.put("right", x1 + width);
            child.put("html", utf8_to_latin(seg));
            child.put("conf", roundf(ri->Confidence(level) * 100) / 100);

            children.push_back(std::make_pair("", child));

            delete[] seg;
        }
        while(ri->Next(level));

        root.add_child("elms", children);
    }

    // Destroy used objects and release memory
    delete ri;      // deleting a null iterator is a no-op
    api->End();
    delete api;
    pixDestroy(&image);

    auto elapsed = std::chrono::high_resolution_clock::now() - start;
    root.put("exec_time", ((float)std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count() / 1000) / 1000);

    std::ostringstream oss;
    write_json(oss, root, false);
    return oss.str();
}

Try uninstalling 3.0x before installing 4.00.
Don't forget to run ldconfig afterwards.

ldconfig solved the error.

But what about those shared objects/files?

Aren't all required files compiled into the file txtocr?

When the following code prints Tesseract version: 4.00.00alpha, can I then be 100% sure that everything is running 4.00?

tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();

std::cerr << "Tesseract version: " << api->Version() << std::endl;

I am compiling my program like this: g++ -std=c++11 txtocr.cpp -o txtocr -llept -ltesseract

The 4.0.0alpha tagged archive is from Nov 2016 (I think, please check).

If you want the latest 4.0.0alpha code, please clone from master on GitHub.
There have been quite a few commits since the original tag for 4.0.0.

Run configure with debug enabled (--enable-debug) to see the git revision of the code you are running.

Use the tesseract user forum for support questions.
When installing from source, you have to remove (uninstall) the previous version first (this applies not only to tesseract).

Where is the user forum? Where can I ask questions?

I need to build a static library of tesseract so I can compile everything into one binary.
