Hi, I'm new here.
First, I need to draw a rectangle around each character of a word in an image.
In an old version of Tesseract I found that we could access each character with:
foreach (tessnet2.Character c in word.CharList)
e.Graphics.DrawRectangle..........
But now I'm working on a C# WinForms application with Tesseract 3.02:
TesseractEngine a = new TesseractEngine(@"./tessdata", "eng", EngineMode.TesseractAndCube);
Tesseract.Page page1 = a.Process(image);
foreach ( ....... in page1)
{
// draw rectangle from (bounding box of each character)
}
Question 1: How do I access each character of page1?
I tried several approaches, such as using PageIteratorLevel to get parts of the page (the first line, first word, or first block), but I can't get the first character of any of them.
I did notice that in the hOCR text output for page1, each element (word, line, block) has a bounding box value.
Question 2: How do I get the bounding box of each element? I found only one method, TryGetBoundingBox, and it returns only a boolean.
Thank you.
Answer for Question 1:
Check out the console sample provided, as it gives an example of how to iterate through the results. Something like the following should work:
using (var iter = page.GetIterator()) {
    iter.Begin();
    do {                    // blocks
        do {                // paragraphs in the current block
            do {            // lines in the current paragraph
                do {        // words in the current line
                    do {    // symbols in the current word
                        if (iter.IsAtBeginningOf(PageIteratorLevel.Block)) {
                            // do whatever you need to do when a block (top-most level result) is encountered.
                        }
                        if (iter.IsAtBeginningOf(PageIteratorLevel.Para)) {
                            // do whatever you need to do when a paragraph is encountered.
                        }
                        if (iter.IsAtBeginningOf(PageIteratorLevel.TextLine)) {
                            // do whatever you need to do when a line of text is encountered.
                        }
                        if (iter.IsAtBeginningOf(PageIteratorLevel.Word)) {
                            // do whatever you need to do when a word is encountered.
                        }
                        // get bounding box for symbol
                        Rect symbolBounds;
                        if (iter.TryGetBoundingBox(PageIteratorLevel.Symbol, out symbolBounds)) {
                            // do whatever you want with the bounding box for the symbol
                        }
                    // Next(parentLevel, elementLevel) advances to the next element within the current parent
                    } while (iter.Next(PageIteratorLevel.Word, PageIteratorLevel.Symbol));
                } while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));
            } while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));
        } while (iter.Next(PageIteratorLevel.Block, PageIteratorLevel.Para));
    } while (iter.Next(PageIteratorLevel.Block));
}
Note that the general result hierarchy is as follows:
Block -> Para -> TextLine -> Word -> Symbol
I.e. the result set can contain many Blocks, which can in turn contain many Paragraphs and so on.
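If only the per-character rectangles are needed, the hierarchy can be skipped and the iterator advanced one symbol at a time. The following is a minimal sketch, assuming the .NET Tesseract wrapper used in this thread and System.Drawing; the class and method names (SymbolBoxes, Collect, Draw) are made up for illustration and are not part of the library.

```csharp
using System.Collections.Generic;
using System.Drawing;
using Tesseract;

// Hypothetical helper, not part of the Tesseract wrapper.
static class SymbolBoxes
{
    // Collects one rectangle per recognized symbol (character) on the page.
    public static List<Rectangle> Collect(Page page)
    {
        var boxes = new List<Rectangle>();
        using (var iter = page.GetIterator())
        {
            iter.Begin();
            do
            {
                Tesseract.Rect bounds;
                // Returns false when no box is available for the current symbol.
                if (iter.TryGetBoundingBox(PageIteratorLevel.Symbol, out bounds))
                {
                    boxes.Add(new Rectangle(bounds.X1, bounds.Y1, bounds.Width, bounds.Height));
                }
            } while (iter.Next(PageIteratorLevel.Symbol));
        }
        return boxes;
    }

    // Draws the collected boxes, e.g. from a WinForms Paint handler.
    public static void Draw(Graphics g, IEnumerable<Rectangle> boxes)
    {
        using (var pen = new Pen(Color.Red, 1f))
        {
            foreach (var box in boxes)
            {
                g.DrawRectangle(pen, box);
            }
        }
    }
}
```

In a Paint handler this could be called as SymbolBoxes.Draw(e.Graphics, boxes). The coordinates are in image pixels, so if the image is displayed scaled, the rectangles need the same scaling.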
Answer for Question 2:
As shown above, the TryGetBoundingBox method returns the bounds in an out parameter, much like Dictionary.TryGetValue does.
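For anyone unfamiliar with this "Try" pattern, here is a small standalone sketch using Dictionary.TryGetValue itself: the boolean return value only says whether the call succeeded, and the actual result arrives through the out argument. The dictionary contents are made up for the example.

```csharp
using System;
using System.Collections.Generic;

class TryPatternDemo
{
    static void Main()
    {
        var ages = new Dictionary<string, int> { { "alice", 30 } };

        int age;
        // Key exists: returns true and writes the stored value into 'age'.
        if (ages.TryGetValue("alice", out age))
        {
            Console.WriteLine("found: " + age);              // prints "found: 30"
        }
        // Key missing: returns false and 'age' is set to default(int).
        if (!ages.TryGetValue("bob", out age))
        {
            Console.WriteLine("not found, age is " + age);   // prints "not found, age is 0"
        }
    }
}
```

By analogy, it seems safest to use the Rect from TryGetBoundingBox only when the method returns true.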
Hi Charles,
Hope you're doing great.
I am new to this. I can get the required text from a small or test picture, but not from the actual picture:
How do I extract a BIB# from a photograph?
How do I recognize the BIB# area within the whole photograph?
Thanks.
Use OpenCV to find and crop the region. There is a guy with demos written in Python that aren't too hard to translate to .NET.