[Repost] Tesseract-OCR Learning Series (4): API

Reposted 2017-02-24 10:57

Original post: http://www.jianshu.com/p/3df039e42986

2016.09.20

Other API Examples

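Reference: https://github.com/tesseract-ocr/tesseract/wiki/APIExample

In the previous post we worked through the first example in the reference documentation: we built the project with CMake and looked at the API calls the example makes. This post continues with the remaining examples. I won't repeat how to set up the CMake project. The sample programs below are my own; if anything is unclear, see "Tesseract-OCR Learning Series (3): A Simple Example" and "A Brief CMake Tutorial".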

GetComponentImages example

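    #include <cstdio>
    #include <tesseract/baseapi.h>
    #include <leptonica/allheaders.h>

    int main()
    {
        Pix *image = pixRead("D:\\open_source\\tesseract-3.04.01\\tesseract\\testing\\phototest.tif");
        tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
        api->Init(NULL, "eng");
        api->SetImage(image);

        // Find the text lines on the page (text regions only, no images).
        Boxa* boxes = api->GetComponentImages(tesseract::RIL_TEXTLINE, true, NULL, NULL);
        printf("Found %d textline image components.\n", boxes->n);

        for (int i = 0; i < boxes->n; i++) {
            BOX* box = boxaGetBox(boxes, i, L_CLONE);
            // Restrict recognition to this line, then read its text and confidence.
            api->SetRectangle(box->x, box->y, box->w, box->h);
            char* ocrResult = api->GetUTF8Text();
            int conf = api->MeanTextConf();
            fprintf(stdout, "Box[%d]: x=%d, y=%d, w=%d, h=%d, confidence: %d, text: %s",
                    i, box->x, box->y, box->w, box->h, conf, ocrResult);
            delete[] ocrResult;  // GetUTF8Text() returns a string the caller must free
            boxDestroy(&box);    // release the clone taken by boxaGetBox
        }

        // Release everything the example allocated.
        boxaDestroy(&boxes);
        api->End();
        delete api;
        pixDestroy(&image);
        return 0;
    }

To recognize characters, we first have to locate the regions of the image that contain text. This example finds each such region and runs recognition on it. Let's walk through the code, skipping what earlier posts already covered: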

    Pix *image = pixRead("D:\\open_source\\tesseract-3.04.01\\tesseract\\testing\\phototest.tif");
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    api->Init(NULL, "eng");
    api->SetImage(image);

Nothing new here, so we move on.

    Boxa* boxes = api->GetComponentImages(tesseract::RIL_TEXTLINE, true, NULL, NULL);

This is the key line. GetComponentImages segments the image into components and returns their bounding boxes in a Boxa structure.

  Boxa* GetComponentImages(const PageIteratorLevel level, const bool text_only, Pixa** pixa, int** blockids)

How finely is the image segmented? The first parameter controls that:

    enum PageIteratorLevel {
        RIL_BLOCK,     // Block of text/image/separator line.
        RIL_PARA,      // Paragraph within a block.
        RIL_TEXTLINE,  // Line within a paragraph.
        RIL_WORD,      // Word within a textline.
        RIL_SYMBOL     // Symbol/character within a word.
    };

In other words, we can segment down to a block, a paragraph, a line, a word, or a single character. This is well suited to document OCR: a document may mix images with text of many different sizes, and this function lets us pull each piece out and process it separately.

The second parameter, text_only, when true, returns only the coordinates of text regions and leaves out image regions. pixa receives the segmented images; passing NULL means we don't need them. blockids receives the block IDs; we don't need those either, so it is also NULL. The return value is the array of rectangles that were found.

    struct Box
    {
        l_int32   x;
        l_int32   y;
        l_int32   w;
        l_int32   h;
        l_uint32  refcount;   /* reference count (1 if no clones) */
    };
    typedef struct Box BOX;

    struct Boxa
    {
        l_int32       n;         /* number of box in ptr array */
        l_int32       nalloc;    /* number of box ptrs allocated */
        l_uint32      refcount;  /* reference count (1 if no clones) */
        struct Box  **box;       /* box ptr array */
    };
    typedef struct Boxa BOXA;

Next comes the loop.

    for (int i = 0; i < boxes->n; i++) {
        BOX* box = boxaGetBox(boxes, i, L_CLONE);
        api->SetRectangle(box->x, box->y, box->w, box->h);
        char* ocrResult = api->GetUTF8Text();
        int conf = api->MeanTextConf();
        fprintf(stdout, "Box[%d]: x=%d, y=%d, w=%d, h=%d, confidence: %d, text: %s",
                i, box->x, box->y, box->w, box->h, conf, ocrResult);
    }

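Only two calls here are new: boxaGetBox and MeanTextConf.

boxaGetBox: fetches one rectangle from the array. The first two parameters are self-explanatory. The third is either L_CLONE or L_COPY. L_CLONE is a shallow copy that only increments the reference count; L_COPY is a deep copy that duplicates the data.
MeanTextConf: returns the mean confidence of the OCR result, from 0 (lowest) to 100 (highest).

Finally, the program's output (shown as a screenshot in the original post).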
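
The difference between the two flags can be modeled in a few lines. The sketch below is not Leptonica code: DemoBox, demoClone, demoCopy and demoDestroy are hypothetical stand-ins that only imitate the reference-counting idea described above.

```cpp
#include <cassert>

// Hypothetical stand-in for Leptonica's Box: the four geometry fields
// plus a reference count, as in the struct listed earlier.
struct DemoBox {
    int x, y, w, h;
    unsigned refcount;
};

// Deep copy (the L_COPY idea): allocate a new box with its own count.
DemoBox* demoCopy(const DemoBox* src) {
    assert(src != nullptr);
    return new DemoBox{src->x, src->y, src->w, src->h, 1};
}

// Shallow copy (the L_CLONE idea): hand back the same object and
// record that one more holder now exists.
DemoBox* demoClone(DemoBox* src) {
    assert(src != nullptr);
    ++src->refcount;
    return src;
}

// Drop one reference and free the box once nobody holds it.
// The caller's pointer is reset to null, similar in spirit to
// Leptonica's boxDestroy(&box).
void demoDestroy(DemoBox** pbox) {
    assert(pbox != nullptr);
    DemoBox* box = *pbox;
    if (box == nullptr) return;
    if (--box->refcount == 0) delete box;
    *pbox = nullptr;
}
```

The practical consequence is that every handle obtained with a clone or a copy needs its own destroy call; with a clone the data is only freed when the last handle is released.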

Result iterator example

This example shows that we can walk through the OCR result word by word, reading each word's text, its confidence, and its position in the original image.

    #include <cstdio>
    #include <tesseract/baseapi.h>
    #include <leptonica/allheaders.h>

    int main()
    {
        Pix *image = pixRead("D:\\open_source\\tesseract-3.04.01\\tesseract\\testing\\phototest.tif");
        tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
        api->Init(NULL, "eng");
        api->SetImage(image);
        api->Recognize(0);

        // Walk the recognized text one word at a time.
        tesseract::ResultIterator* ri = api->GetIterator();
        tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
        if (ri != 0) {
            do {
                const char* word = ri->GetUTF8Text(level);
                float conf = ri->Confidence(level);
                int x1, y1, x2, y2;
                ri->BoundingBox(level, &x1, &y1, &x2, &y2);
                printf("word: '%s'; \tconf: %.2f; BoundingBox: %d,%d,%d,%d;\n",
                       word, conf, x1, y1, x2, y2);
                delete[] word;
            } while (ri->Next(level));
            delete ri;  // the caller owns the iterator
        }

        api->End();
        delete api;
        pixDestroy(&image);
        return 0;
    }

The two key lines are:

    tesseract::ResultIterator* ri = api->GetIterator();
    tesseract::PageIteratorLevel level = tesseract::RIL_WORD;

The first line obtains an iterator over the OCR results in reading order. The second sets the unit of iteration. The available units are:

    enum PageIteratorLevel {
        RIL_BLOCK,     // Block of text/image/separator line.
        RIL_PARA,      // Paragraph within a block.
        RIL_TEXTLINE,  // Line within a paragraph.
        RIL_WORD,      // Word within a textline.
        RIL_SYMBOL     // Symbol/character within a word.
    };

Part of the output (shown as a screenshot in the original post).
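
One detail is worth keeping in mind when combining this example with SetRectangle from the previous one: BoundingBox reports two corners (x1, y1, x2, y2), whereas SetRectangle takes left, top, width, height. A minimal conversion sketch follows. Rect and rectFromCorners are hypothetical helper names, and treating the width as x2 - x1 is an assumption about how the corners are meant to be read.

```cpp
#include <cassert>

// Rectangle in the form SetRectangle() takes: left, top, width, height.
struct Rect {
    int left, top, width, height;
};

// Convert the corner form reported by BoundingBox() (x1, y1, x2, y2)
// into left/top/width/height.
// Assumption: (x2, y2) is the right/bottom edge, so width = x2 - x1.
Rect rectFromCorners(int x1, int y1, int x2, int y2) {
    assert(x2 >= x1 && y2 >= y1);
    return Rect{x1, y1, x2 - x1, y2 - y1};
}
```

For instance, the corners (37, 228, 585, 259) become the rectangle (37, 228, 548, 31), which is the region passed to SetRectangle in the last example of this post.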

Orientation and script detection (OSD) example

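This example shows how to detect the orientation of the page and the direction of the text. You may have the same question I did: once the page orientation is known, why detect the text direction at all? Isn't the text simply upright? But that is not what text direction means here. It refers to the reading direction. An English line runs horizontally and is read from left to right, one line after the one above it. Classical Chinese, by contrast, is written vertically and read from top to bottom, with the lines proceeding from right to left. That is what text direction detection determines.

First, the code:

    #include <cstdio>
    #include <tesseract/baseapi.h>
    #include <leptonica/allheaders.h>

    int main()
    {
        const char* inputfile = "D:\\open_source\\tesseract-3.04.01\\tesseract\\testing\\phototest.tif";
        tesseract::Orientation orientation;
        tesseract::WritingDirection direction;
        tesseract::TextlineOrder order;
        float deskew_angle;

        PIX *image = pixRead(inputfile);
        tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
        api->Init(NULL, "eng");
        api->SetPageSegMode(tesseract::PSM_AUTO_OSD);
        api->SetImage(image);
        api->Recognize(0);

        // Run layout analysis and query the orientation of the first block.
        tesseract::PageIterator* it = api->AnalyseLayout();
        it->Orientation(&orientation, &direction, &order, &deskew_angle);
        printf("Orientation: %d;\nWritingDirection: %d\nTextlineOrder: %d\n"
               "Deskew angle: %.4f\n",
               orientation, direction, order, deskew_angle);

        delete it;  // the caller owns the iterator
        api->End();
        delete api;
        pixDestroy(&image);
        return 0;
    }

We only look at what hasn't appeared before.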

    api->SetPageSegMode(tesseract::PSM_AUTO_OSD);

This is the important line. It sets the page segmentation mode. The available modes are:

    enum PageSegMode {
        PSM_OSD_ONLY,       ///< Orientation and script detection only.
        PSM_AUTO_OSD,       ///< Automatic page segmentation with orientation and
                            ///< script detection. (OSD)
        PSM_AUTO_ONLY,      ///< Automatic page segmentation, but no OSD, or OCR.
        PSM_AUTO,           ///< Fully automatic page segmentation, but no OSD.
        PSM_SINGLE_COLUMN,  ///< Assume a single column of text of variable sizes.
        PSM_SINGLE_BLOCK_VERT_TEXT,  ///< Assume a single uniform block of vertically
                                     ///< aligned text.
        PSM_SINGLE_BLOCK,   ///< Assume a single uniform block of text. (Default.)
        PSM_SINGLE_LINE,    ///< Treat the image as a single text line.
        PSM_SINGLE_WORD,    ///< Treat the image as a single word.
        PSM_CIRCLE_WORD,    ///< Treat the image as a single word in a circle.
        PSM_SINGLE_CHAR,    ///< Treat the image as a single character.
        PSM_SPARSE_TEXT,    ///< Find as much text as possible in no particular order.
        PSM_SPARSE_TEXT_OSD,  ///< Sparse text with orientation and script det.
        PSM_RAW_LINE,       ///< Treat the image as a single text line, bypassing
                            ///< hacks that are Tesseract-specific.
        PSM_COUNT           ///< Number of enum entries.
    };

One more note: to use the OSD feature you need to download osd.traineddata.
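
Which modes does that note apply to? Going by the comments in the enum, three of them mention orientation and script detection. The sketch below encodes that reading as a small check. DemoPageSegMode is a local copy of the listing above and modeUsesOsd is a hypothetical helper, so the code compiles without Tesseract; it is not a Tesseract API.

```cpp
#include <cassert>

// Local copy of the PageSegMode listing above, in the same order
// (assumption: the values are the implicit 0, 1, 2, ... of that listing).
enum DemoPageSegMode {
    PSM_OSD_ONLY, PSM_AUTO_OSD, PSM_AUTO_ONLY, PSM_AUTO, PSM_SINGLE_COLUMN,
    PSM_SINGLE_BLOCK_VERT_TEXT, PSM_SINGLE_BLOCK, PSM_SINGLE_LINE,
    PSM_SINGLE_WORD, PSM_CIRCLE_WORD, PSM_SINGLE_CHAR, PSM_SPARSE_TEXT,
    PSM_SPARSE_TEXT_OSD, PSM_RAW_LINE, PSM_COUNT
};

// True for the modes whose description mentions orientation and
// script detection, i.e. the ones that rely on osd.traineddata.
bool modeUsesOsd(DemoPageSegMode mode) {
    assert(mode >= PSM_OSD_ONLY && mode < PSM_COUNT);
    return mode == PSM_OSD_ONLY ||
           mode == PSM_AUTO_OSD ||
           mode == PSM_SPARSE_TEXT_OSD;
}
```

A check like this could run before Init so that a missing osd.traineddata is reported early rather than surfacing as a failed layout analysis.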

    api->Recognize(0);

This line runs recognition using the settings made so far.

    tesseract::PageIterator* it = api->AnalyseLayout();

This line runs page layout analysis according to the mode chosen with SetPageSegMode. It could equally be called before Recognize.

    it->Orientation(&orientation, &direction, &order, &deskew_angle);

This function retrieves the orientation of the page and the direction of the text. Its signature is:

void Orientation(tesseract::Orientation *orientation, tesseract::WritingDirection *writing_direction, tesseract::TextlineOrder *textline_order, float *deskew_angle) const;

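where:

tesseract::Orientation is the orientation of the page.
tesseract::WritingDirection is the writing direction (as discussed above, left to right for English, top to bottom for vertically written Chinese).
tesseract::TextlineOrder is the order in which lines follow one another (as discussed above, top to bottom for English, right to left for vertically written Chinese).
deskew_angle is the skew angle. A captured image is never perfectly straight, and this value gives the angle by which it is rotated.

The program's output (shown as a screenshot in the original post).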
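
Because the program prints these values with %d, the output is a row of bare integers. The helpers below turn them back into words. This is a sketch: the function names are made up for illustration, and the numbering is my recollection of the enums in Tesseract's publictypes.h, so check it against the header of the version you build against. It also assumes that deskew_angle is expressed in radians, which is how I read the comment in pageiterator.h.

```cpp
#include <cassert>
#include <string>

// Assumed numbering: ORIENTATION_PAGE_UP = 0, PAGE_RIGHT = 1,
// PAGE_DOWN = 2, PAGE_LEFT = 3.
std::string orientationName(int value) {
    static const char* const names[] = {
        "page up", "page right", "page down", "page left"};
    assert(value >= 0 && value < 4);
    return names[value];
}

// Assumed numbering: LEFT_TO_RIGHT = 0, RIGHT_TO_LEFT = 1, TOP_TO_BOTTOM = 2.
std::string writingDirectionName(int value) {
    static const char* const names[] = {
        "left to right", "right to left", "top to bottom"};
    assert(value >= 0 && value < 3);
    return names[value];
}

// Assumed numbering: LEFT_TO_RIGHT = 0, RIGHT_TO_LEFT = 1, TOP_TO_BOTTOM = 2.
std::string textlineOrderName(int value) {
    static const char* const names[] = {
        "left to right", "right to left", "top to bottom"};
    assert(value >= 0 && value < 3);
    return names[value];
}

// Convert an angle from radians to degrees for easier reading.
double radiansToDegrees(double radians) {
    const double kPi = 3.14159265358979323846;
    return radians * 180.0 / kPi;
}
```

Under these assumptions, an upright English page would decode as "page up", writing direction "left to right", and textline order "top to bottom".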

Example of iterator over the classifier choices for a single symbol

This example shows how to retrieve the alternative candidates for a recognized symbol.

    #include <cstdio>
    #include <tesseract/baseapi.h>
    #include <leptonica/allheaders.h>

    int main()
    {
        Pix *image = pixRead("D:\\open_source\\tesseract-3.04.01\\tesseract\\testing\\phototest.tif");
        tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
        api->Init(NULL, "eng");
        api->SetImage(image);
        api->SetVariable("save_blob_choices", "T");
        api->SetRectangle(37, 228, 548, 31);
        api->Recognize(NULL);

        tesseract::ResultIterator* ri = api->GetIterator();
        tesseract::PageIteratorLevel level = tesseract::RIL_SYMBOL;
        if (ri != 0) {
            do {
                const char* symbol = ri->GetUTF8Text(level);
                float conf = ri->Confidence(level);
                if (symbol != 0) {
                    printf("symbol %s, conf: %f", symbol, conf);
                    bool indent = false;
                    // List every candidate the classifier kept for this symbol.
                    tesseract::ChoiceIterator ci(*ri);
                    do {
                        if (indent) printf("\t\t ");
                        printf("\t- ");
                        const char* choice = ci.GetUTF8Text();
                        printf("%s conf: %f\n", choice, ci.Confidence());
                        indent = true;
                    } while (ci.Next());
                }
                printf("------------------------------------------\n");
                delete[] symbol;
            } while (ri->Next(level));
            delete ri;  // the caller owns the iterator
        }

        api->End();
        delete api;
        pixDestroy(&image);
        return 0;
    }

Compiling this ran into a problem. The culprit is:

                tesseract::ChoiceIterator ci(*ri);

This class is missing from the DLL, because it is simply never exported! To export it, we have to modify the Tesseract source code and rebuild.

The fix: in the header file ltrresultiterator.h, change:

class ChoiceIterator {

to:

class TESS_API ChoiceIterator {

Then remember to rebuild, and overwrite the original dynamic library with the newly built one. The API calls we haven't met yet are:

    api->SetVariable("save_blob_choices", "T");

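This function sets an internal parameter. (Though how am I supposed to know which parameters exist and what they mean?) Setting "save_blob_choices" keeps all of the candidate choices.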

                tesseract::ChoiceIterator ci(*ri);

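This is an iterator through which every candidate result can be printed.

Part of the output (shown as a screenshot in the original post).

That's it for now. As you can see, Tesseract can be used very flexibly. Next, I hope to gradually work out the algorithms behind Tesseract-OCR. That is no easy task, so this series may pause for a while.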
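
Both the ResultIterator and the ChoiceIterator loops in this post use a do/while rather than a for loop. The reason, as far as I can tell, is that the iterator already points at the first element when it is created, and Next() only reports whether it managed to advance. The sketch below imitates that protocol with a hypothetical stand-in class; DemoChoice, DemoChoiceIterator, Text and collectTexts are illustration names, not Tesseract's.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One candidate: its text and the classifier's confidence.
struct DemoChoice {
    std::string text;
    float confidence;
};

// Hypothetical stand-in for the iteration protocol: the iterator starts
// on the first element, and Next() returns false once there is nothing
// left to advance to.
class DemoChoiceIterator {
public:
    explicit DemoChoiceIterator(std::vector<DemoChoice> choices)
        : choices_(std::move(choices)), index_(0) {
        assert(!choices_.empty());  // this model requires at least one candidate
    }
    bool Next() {
        if (index_ + 1 >= choices_.size()) return false;
        ++index_;
        return true;
    }
    const std::string& Text() const { return choices_[index_].text; }
    float Confidence() const { return choices_[index_].confidence; }

private:
    std::vector<DemoChoice> choices_;
    std::size_t index_;
};

// Visit every candidate with the same do/while shape as the example:
// use the current element first, then try to advance.
std::vector<std::string> collectTexts(DemoChoiceIterator& it) {
    std::vector<std::string> texts;
    do {
        texts.push_back(it.Text());
    } while (it.Next());
    return texts;
}
```

With a plain while (it.Next()) loop the first candidate would be skipped, which is why the body runs once before the first call to Next().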
