I. Introduction:
Tesseract is probably the most accurate open source OCR engine available. Combined with the Leptonica Image Processing Library it can read a wide variety of image formats and convert them to text in over 60 languages. It was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google. It is released under the Apache License 2.0.
II. Usage:
Follow the instructions on the project home page wiki to download and build Tesseract.
Sample code: http://code.google.com/p/tesseract-ocr/source/browse/trunk/api/tesseractmain.cpp
VS2005 project (including third-party libraries): http://pan.baidu.com/s/13ROuA
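III. Exploring how it works:
1. Tesseract is an open-source, cross-platform OCR library;
2. Tesseract consists of two main parts: training and prediction (recognition);
3. Training:
a. Tesseract can be trained to support additional languages or to improve OCR accuracy. Details: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
b. etc.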
4. Prediction:
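a. The basic input is the PIX data structure; video data or data in other formats can be converted to Leptonica's PIX format by surrounding code;
b. Once the input PIX is obtained: PIX -> ProcessPage() -> Recognize() ->
b.1: Search for text blocks;
b.2: Baseline fitting;
b.3: Character chopping: split the text into individual characters;
b.4: Chop apart characters that are joined together and reconnect broken strokes;
b.5: Feature extraction: early versions of Tesseract used topological features of characters. This kind of matching is insensitive to font variation, but its recognition rate is not robust on characters as they appear in real-world images;
etc.
To be continued…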
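To make two of the steps above more concrete, here is a minimal, self-contained sketch. It is not Tesseract's or Leptonica's actual code, and both function names (RgbToGray, SplitColumns) are hypothetical. RgbToGray illustrates the kind of surrounding conversion mentioned in item a: turning an interleaved RGB frame into an 8-bit grayscale buffer (wrapping that buffer in a PIX is not shown, since it requires Leptonica). SplitColumns illustrates the simplest possible form of step b.3: it assumes a single text line with dark ink on a light background and cuts wherever a column contains no ink. Joined characters (step b.4) are deliberately not handled, and the real engine is considerably more sophisticated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Tesseract or Leptonica): converts an
// interleaved 8-bit RGB frame to an 8-bit grayscale buffer using integer
// BT.601 luma weights (77 + 150 + 29 = 256).
std::vector<std::uint8_t> RgbToGray(const std::vector<std::uint8_t>& rgb,
                                    int width, int height) {
  assert(width > 0 && height > 0);
  assert(rgb.size() == static_cast<std::size_t>(width) * height * 3);
  std::vector<std::uint8_t> gray(static_cast<std::size_t>(width) * height);
  for (std::size_t i = 0; i < gray.size(); ++i) {
    const unsigned r = rgb[3 * i], g = rgb[3 * i + 1], b = rgb[3 * i + 2];
    gray[i] = static_cast<std::uint8_t>((77 * r + 150 * g + 29 * b) >> 8);
  }
  return gray;
}

// Hypothetical helper: splits one text line into character candidates by
// finding runs of adjacent columns that contain at least one dark pixel.
// Returns inclusive [first, last] column index pairs, left to right.
std::vector<std::pair<int, int>> SplitColumns(
    const std::vector<std::uint8_t>& gray, int width, int height,
    std::uint8_t threshold = 128) {
  assert(width > 0 && height > 0);
  assert(gray.size() == static_cast<std::size_t>(width) * height);
  std::vector<std::pair<int, int>> segments;
  int start = -1;  // first column of the current ink run, or -1 if none
  for (int x = 0; x < width; ++x) {
    bool has_ink = false;
    for (int y = 0; y < height && !has_ink; ++y) {
      has_ink = gray[static_cast<std::size_t>(y) * width + x] < threshold;
    }
    if (has_ink && start < 0) start = x;  // a new run begins
    if (!has_ink && start >= 0) {         // the current run just ended
      segments.emplace_back(start, x - 1);
      start = -1;
    }
  }
  if (start >= 0) segments.emplace_back(start, width - 1);
  return segments;
}
```

Calling RgbToGray on a frame and passing the result to SplitColumns yields one column range per candidate character. Two glyphs that touch come back as a single range, which is exactly the case step b.4 exists to handle.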