Exploring How the Tesseract OCR Library Works


I. Introduction:

Tesseract is probably the most accurate open source OCR engine available. Combined with the Leptonica Image Processing Library it can read a wide variety of image formats and convert them to text in over 60 languages. It was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google. It is released under the Apache License 2.0.

Project home page: http://code.google.com/p/tesseract-ocr/

II. Usage:

Follow the instructions on the home page wiki to download and build Tesseract.

Sample code: http://code.google.com/p/tesseract-ocr/source/browse/trunk/api/tesseractmain.cpp

VS2005 project (including third-party libraries): http://pan.baidu.com/s/13ROuA

III. How it works:

1. Tesseract is an open-source, cross-platform OCR library;

2. Tesseract has two main parts: training and prediction (recognition);

3. Training:

a. Tesseract can be trained to support additional languages or to improve OCR accuracy. Details: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

b. etc.

4. Prediction:

a. The basic input is a PIX data structure. Code outside the engine can convert video data or data in other formats into Leptonica's PIX format;

b. Given the input PIX: ProcessPage() -> Recognize() -> the following steps:

b.1: Find the text blocks;

b.2: Baseline fitting;

b.3: Character chopping: split the text into individual characters;

b.4: Split characters that are joined together and repair broken strokes;

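b.5: Feature extraction: early versions of Tesseract used topological features of the characters. This kind of matching is insensitive to font changes, but its recognition is not robust on characters as they appear in real-world images;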
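To make steps b.3 and b.5 more concrete, here is a minimal Python sketch on a toy binary image. It is a simplified stand-in and not Tesseract's actual code: characters are split with a plain vertical projection, and the only "topological feature" is the number of enclosed holes in each glyph. The names IMAGE, split_characters and count_holes are hypothetical and exist only for this illustration.

```python
from collections import deque

# Toy binary image: '#' is ink, '.' is background. Two glyphs, "0" and "1".
IMAGE = [
    ".###...#.",
    ".#.#..##.",
    ".#.#...#.",
    ".#.#...#.",
    ".###..###",
]


def split_characters(image):
    """Return (start, end) column ranges that contain ink; end is exclusive."""
    width = len(image[0])
    # Vertical projection: number of ink pixels in each column.
    profile = [sum(row[x] == "#" for row in image) for x in range(width)]
    spans, start = [], None
    for x, count in enumerate(profile):
        if count and start is None:
            start = x                      # a run of inked columns begins
        elif not count and start is not None:
            spans.append((start, x))       # an empty column ends the run
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


def count_holes(glyph):
    """Count background regions fully enclosed by ink (4-connected background)."""
    # Pad with one background pixel on every side so that all outer
    # background forms a single region reachable from the corner (0, 0).
    width = len(glyph[0]) + 2
    grid = ["." * width] + ["." + row + "." for row in glyph] + ["." * width]
    height = len(grid)
    seen = set()

    def flood(sy, sx):
        # Breadth-first flood fill over background pixels.
        queue = deque([(sy, sx)])
        seen.add((sy, sx))
        while queue:
            y, x = queue.popleft()
            for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                if (0 <= ny < height and 0 <= nx < width
                        and grid[ny][nx] == "." and (ny, nx) not in seen):
                    seen.add((ny, nx))
                    queue.append((ny, nx))

    flood(0, 0)  # mark the outer background
    holes = 0
    for y in range(height):
        for x in range(width):
            if grid[y][x] == "." and (y, x) not in seen:
                holes += 1                 # unvisited background is a new hole
                flood(y, x)
    return holes


spans = split_characters(IMAGE)
glyphs = [[row[a:b] for row in IMAGE] for a, b in spans]
features = [count_holes(g) for g in glyphs]
print(spans, features)
```

On this toy input the script finds the column ranges (1, 4) and (6, 9), with hole counts 1 and 0. The hole count does not depend on the exact shape of the strokes, which shows why such features tolerate font changes. A single broken stroke that opens the ring of the "0" would change the count, which hints at why they are fragile on real-world images.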


etc.

To be continued…

