Tesseract: C# Tesseract 3.02: how do I access each character of a word in an image?

Created on 12 Jan 2014 · 3 Comments · Source: charlesw/tesseract

Hi, I'm a newbie here.
First, I need to draw a rectangle around each character of a word in an image.
In the old version of Tesseract (tessnet2), I found that we can access each character with:

foreach (tessnet2.Character c in word.CharList)
e.Graphics.DrawRectangle..........

But now I'm working on a C# WinForms application with Tesseract 3.02:

TesseractEngine a = new TesseractEngine(@"./tessdata", "eng", EngineMode.TesseractAndCube);
Tesseract.Page page1 = a.Process(image);
foreach ( ....... in page1)
{
// draw rectangle from (bounding box of each character)
}

Question 1: How do I access each character of page1?

I tried many approaches, such as PageIteratorLevel, and could get parts of the page like the first line, first word, or first block, but I can't get the first character of any of them.
I also noticed that in the hOCR text output for page1, each element (word, line, block) has a bounding box value.

Question 2: How do I get the bounding box values of each element? (I found only one method, "TryGetBoundingBox", which returns only a boolean.)

Thank you.

ominouse

Most helpful comment

Answer for Q1:

Check out the console sample provided, as it gives an example of how to iterate through the results. Something like the following should work:

using (var iter = page.GetIterator()) {
    do {
        do {
            do {
                do {
                    do {
                        if (iter.IsAtBeginningOf(PageIteratorLevel.Block)) {
                            // do whatever you need to do when a block (topmost level result) is encountered.
                        }
                        if (iter.IsAtBeginningOf(PageIteratorLevel.Para)) {
                            // do whatever you need to do when a paragraph is encountered.
                        }
                        if (iter.IsAtBeginningOf(PageIteratorLevel.TextLine)) {
                            // do whatever you need to do when a line of text is encountered.
                        }
                        if (iter.IsAtBeginningOf(PageIteratorLevel.Word)) {
                            // do whatever you need to do when a word is encountered.
                        }

                        // get bounding box for symbol
                        Rect symbolBounds;
                        if (iter.TryGetBoundingBox(PageIteratorLevel.Symbol, out symbolBounds)) {
                            // do whatever you want with bounding box for the symbol
                        }
                    } while (iter.Next(PageIteratorLevel.Word, PageIteratorLevel.Symbol));   // next symbol in the current word
                } while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));     // next word in the current line
            } while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));         // next line in the current paragraph
        } while (iter.Next(PageIteratorLevel.Block, PageIteratorLevel.Para));                // next paragraph in the current block
    } while (iter.Next(PageIteratorLevel.Block));                                            // next block on the page
}

Note that the general result hierarchy is as follows:

Block -> Para -> TextLine -> Word -> Symbol

I.e. the result set can contain many Blocks, which can in turn contain many Paragraphs and so on.
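
Because symbols sit at the bottom of this hierarchy, a flat walk at the Symbol level should be enough when only per-character rectangles are needed, as in the original question. The following is a minimal sketch rather than a definitive implementation. It assumes the wrapper's single-argument Next(PageIteratorLevel) overload and a Tesseract.Rect that exposes X1, Y1, Width and Height; the SymbolBoxes class and its Collect and Draw methods are hypothetical names used only for illustration.

```csharp
using System.Collections.Generic;
using System.Drawing;
using Tesseract;

static class SymbolBoxes
{
    // Hypothetical helper: collects one rectangle per recognised symbol (character).
    public static List<Rectangle> Collect(Page page)
    {
        var boxes = new List<Rectangle>();
        using (var iter = page.GetIterator())
        {
            do
            {
                Rect bounds; // Tesseract.Rect, filled in through the out parameter
                if (iter.TryGetBoundingBox(PageIteratorLevel.Symbol, out bounds))
                {
                    // Assumption: Rect exposes X1, Y1, Width and Height.
                    boxes.Add(new Rectangle(bounds.X1, bounds.Y1, bounds.Width, bounds.Height));
                }
            } while (iter.Next(PageIteratorLevel.Symbol)); // flat walk over every symbol on the page
        }
        return boxes;
    }

    // Hypothetical helper: call from a Paint handler, e.g. SymbolBoxes.Draw(e.Graphics, boxes).
    public static void Draw(Graphics g, IEnumerable<Rectangle> boxes)
    {
        using (var pen = new Pen(Color.Red, 1f))
        {
            foreach (var box in boxes)
            {
                g.DrawRectangle(pen, box);
            }
        }
    }
}
```

The boxes are expressed in the pixel coordinates of the image passed to Process, so if the picture is shown scaled in a control, the rectangles need the same scaling before drawing.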

Answer for Question 2:

As shown above, the TryGetBoundingBox method returns the bounds through an out parameter, much like Dictionary.TryGetValue does. The boolean return value only indicates whether a bounding box was available.
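
For readers unfamiliar with that convention, here is a small standalone sketch of the same "Try" pattern using only Dictionary.TryGetValue from the .NET base class library. The dictionary contents and the TryPatternDemo class are invented for illustration; nothing here touches Tesseract.

```csharp
using System;
using System.Collections.Generic;

static class TryPatternDemo
{
    static void Main()
    {
        var ages = new Dictionary<string, int> { { "alice", 30 } };

        int age; // receives the value when the lookup succeeds
        if (ages.TryGetValue("alice", out age))
        {
            Console.WriteLine("found: " + age); // prints "found: 30"
        }

        if (!ages.TryGetValue("bob", out age))
        {
            // On failure the out parameter is set to default(int), which is 0.
            Console.WriteLine("not found, age = " + age); // prints "not found, age = 0"
        }
    }
}
```

TryGetBoundingBox is used the same way: check the returned boolean first, and read the out parameter only when it is true.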

charlesw on 13 Jan 2014

Hi Charles,

Hope you're doing great.

I am new to this. I can get the required text from a small picture or a test picture, but not from the actual picture:

How do I extract a bib number (BIB#) from a photograph?
How do I recognize the bib number area within the whole photograph?

Thanks.

kndnath on 1 Sep 2019

Use OpenCV to find and crop the region. There is someone with demos written in Python that aren't too hard to translate to .NET.

tdhintz on 1 Sep 2019
