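I have been using Python's pytesser module. My original goal was to recognize Chinese text in images. After working on it for a while, Chinese recognition still has many problems, so I am recording what I have found so far and sharing it here.
pytesser describes itself as "OCR in Python using the Tesseract engine from Google". It is a module built on Google's open-source OCR project and converts the text in an image into plain text (mainly English).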
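1. Installing pytesser
Machine used: Windows 8, 64-bit
PyTesser uses the Tesseract OCR engine: it converts the image to a format Tesseract accepts and then runs tesseract to extract the text. With PyTesser you do not need to install the Tesseract OCR engine separately (the download ships with its own tesseract.exe), but you must first install PIL (Python Imaging Library, Python's image library).
pytesser download: http://code.google.com/p/pytesser/ If that page does not open, it can also be downloaded from Baidu Netdisk: http://pan.baidu.com/s/1o69LL8Y
PIL official download: http://www.pythonware.com/products/pil/
PIL can be installed by simply running its exe installer. pytesser needs no installation: unzip it, put the folder under \Lib\site-packages\ in the Python installation directory, and use it directly (you need to add a pytesser.pth file).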
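The .pth step matters because pytesser.py imports its helpers as top-level modules (import util, import errors), so the pytesser folder itself has to be on sys.path. A minimal sketch of creating that file follows; the folder name "pytesser" and the helper name write_pth are assumptions for illustration, so adjust them to whatever the unzipped folder is actually called.

```python
import os
import tempfile


def write_pth(site_packages_dir, package_dir="pytesser", pth_name="pytesser.pth"):
    """Create a .pth file that adds site-packages/<package_dir> to sys.path.

    package_dir and pth_name are illustrative defaults, not fixed names.
    """
    pth_path = os.path.join(site_packages_dir, pth_name)
    with open(pth_path, "w") as f:
        # One directory per line; a relative entry is resolved against
        # the directory that contains the .pth file.
        f.write(package_dir + "\n")
    return pth_path


if __name__ == "__main__":
    # Demonstrated in a temporary directory so nothing real is modified;
    # in practice site_packages_dir would be <python>\Lib\site-packages.
    demo_dir = tempfile.mkdtemp()
    print(write_pth(demo_dir))
```

Python reads .pth files from site-packages at startup, so the change takes effect the next time the interpreter is launched.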
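Installing on Ubuntu
sudo pip install pytesseract
sudo apt-get install tesseract-ocr
Note that these commands install pytesseract, a separate wrapper with its own API, rather than pytesser, plus the Tesseract engine itself.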
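After the two Ubuntu commands, it can help to confirm that both pieces are actually visible to Python. The following is a small sketch using only the standard library; the function name check_ocr_setup is made up for this example.

```python
import importlib.util
import shutil


def check_ocr_setup(exe_name="tesseract", module_name="pytesseract"):
    """Report whether the Tesseract executable and the Python wrapper are available."""
    return {
        # Full path to the executable, or None if it is not on PATH
        "executable": shutil.which(exe_name),
        # True if the wrapper module can be imported
        "wrapper_installed": importlib.util.find_spec(module_name) is not None,
    }


if __name__ == "__main__":
    print(check_ocr_setup())
```

If "executable" is None, the apt-get step did not succeed (or tesseract is not on PATH); if "wrapper_installed" is False, the pip step installed into a different Python than the one being run.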
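2. pytesser source code
Looking at the source of pytesser.py, there are three main functions:
(1) call_tesseract(input_filename, output_filename)
This function calls the external tesseract executable to extract the text from the image.
(2) image_to_string(im, cleanup = cleanup_scratch_flag)
This function works on an Image object, so first open the file with im = Image.open(filename), which returns an Image object. Internally it calls util.image_to_scratch(im, scratch_image_name) to save the in-memory image as a .bmp file so that the tesseract program can process it.
(3) image_file_to_string(filename, cleanup = cleanup_scratch_flag, graceful_errors=True)
This function has Tesseract read the image file directly. If the image format is incompatible and graceful_errors is True, it first converts the image to a compatible format and then extracts the text.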
"""OCR in Python using the Tesseract engine from Google
http://code.google.com/p/pytesser/
by Michael J.T. O'Kelly
V 0.0.1, 3/10/07"""

import Image
import subprocess

import util
import errors

tesseract_exe_name = 'tesseract' # Name of executable to be called at command line
scratch_image_name = "temp.bmp" # This file must be .bmp or other Tesseract-compatible format
scratch_text_name_root = "temp" # Leave out the .txt extension
cleanup_scratch_flag = False # Temporary files cleaned up after OCR operation

def call_tesseract(input_filename, output_filename):
    """Calls external tesseract.exe on input file (restrictions on types),
    outputting output_filename+'txt'"""
    args = [tesseract_exe_name, input_filename, output_filename]
    proc = subprocess.Popen(args)
    retcode = proc.wait()
    if retcode!=0:
        errors.check_for_errors()

def image_to_string(im, cleanup = cleanup_scratch_flag):
    """Converts im to file, applies tesseract, and fetches resulting text.
    If cleanup=True, delete scratch files after operation."""
    try:
        util.image_to_scratch(im, scratch_image_name)
        call_tesseract(scratch_image_name, scratch_text_name_root)
        text = util.retrieve_text(scratch_text_name_root)
    finally:
        if cleanup:
            util.perform_cleanup(scratch_image_name, scratch_text_name_root)
    return text

def image_file_to_string(filename, cleanup = cleanup_scratch_flag, graceful_errors=True):
    """Applies tesseract to filename; or, if image is incompatible and graceful_errors=True,
    converts to compatible format and then applies tesseract. Fetches resulting text.
    If cleanup=True, delete scratch files after operation."""
    try:
        try:
            call_tesseract(filename, scratch_text_name_root)
            text = util.retrieve_text(scratch_text_name_root)
        except errors.Tesser_General_Exception:
            if graceful_errors:
                im = Image.open(filename)
                text = image_to_string(im, cleanup)
            else:
                raise
    finally:
        if cleanup:
            util.perform_cleanup(scratch_image_name, scratch_text_name_root)
    return text

if __name__=='__main__':
    im = Image.open('phototest.tif')
    text = image_to_string(im)
    print text
    try:
        text = image_file_to_string('fnord.tif', graceful_errors=False)
    except errors.Tesser_General_Exception, value:
        print "fnord.tif is incompatible filetype. Try graceful_errors=True"
        print value
    text = image_file_to_string('fnord.tif', graceful_errors=True)
    print "fnord.tif contents:", text
    text = image_file_to_string('fonts_test.png', graceful_errors=True)
    print text
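The pytesser source is Python 2 and boils down to running the tesseract command line and reading back the .txt file it writes. Below is a rough Python 3 sketch of the same idea, not pytesser's own code: the function names build_tesseract_args and ocr_file are made up here, and the optional lang argument (passed through tesseract's -l option, for example "chi_sim" for Simplified Chinese) is an addition that pytesser itself does not have. It assumes tesseract is on PATH and, for any non-English language, that the matching language data is installed.

```python
import os
import subprocess
import tempfile


def build_tesseract_args(input_filename, output_root, lang=None, exe="tesseract"):
    """Build the command line: tesseract <image> <output root> [-l <lang>]."""
    args = [exe, input_filename, output_root]
    if lang:
        # e.g. "chi_sim"; the language data must be installed separately
        args += ["-l", lang]
    return args


def ocr_file(input_filename, lang=None):
    """Run tesseract on an image file and return the recognized text."""
    with tempfile.TemporaryDirectory() as tmp:
        output_root = os.path.join(tmp, "out")  # tesseract appends ".txt"
        # check=True raises CalledProcessError on a non-zero exit code
        subprocess.run(build_tesseract_args(input_filename, output_root, lang), check=True)
        with open(output_root + ".txt", encoding="utf-8") as f:
            return f.read()
```

Writing the output into a temporary directory plays the role of pytesser's cleanup flag: the scratch file disappears when the with block exits, whether or not tesseract succeeded.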
3. Using pytesser
Load the pytesser module in your code. A simple test script looks like this:
from pytesser import *

im = Image.open('fonts_test.png')
text = image_to_string(im)
print "Using image_to_string(): "
print text
text = image_file_to_string('fonts_test.png', graceful_errors=True)
print "Using image_file_to_string():"
print text
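For the pytesseract package from the Ubuntu install commands, the equivalent call looks slightly different. The sketch below is Python 3 and assumes Pillow (the maintained fork of PIL, imported as from PIL import Image) and pytesseract are installed; the helper names clean_ocr_text and ocr_image are made up for this example.

```python
def clean_ocr_text(text):
    """Strip trailing spaces from each line and surrounding whitespace from the whole text."""
    lines = [line.rstrip() for line in text.splitlines()]
    return "\n".join(lines).strip()


def ocr_image(path, lang="eng"):
    """OCR one image with pytesseract and return the cleaned text."""
    # Imported here so clean_ocr_text works even without these packages installed
    from PIL import Image
    import pytesseract

    with Image.open(path) as im:
        return clean_ocr_text(pytesseract.image_to_string(im, lang=lang))
```

A call such as ocr_image('fonts_test.png') would then correspond to the image_to_string() test above; passing lang="chi_sim" would be the starting point for Chinese, provided that language data is installed.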

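The results: on this test image the English characters are basically extracted correctly. On more complex images, however, such as the pages of English papers I tried, the results were far from satisfactory.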
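To put a number on "far from satisfactory", one option is to compare the OCR output with a hand-typed reference of the same page. This is only a sketch using the standard library's difflib, not something pytesser provides; the function name char_similarity is made up here.

```python
import difflib


def char_similarity(ocr_text, reference_text):
    """Similarity ratio in [0, 1] between OCR output and a hand-typed reference."""
    # Collapse whitespace runs so differences in line wrapping are not counted as errors
    a = " ".join(ocr_text.split())
    b = " ".join(reference_text.split())
    return difflib.SequenceMatcher(None, a, b).ratio()
```

A value near 1.0 means the output is close to the reference; tracking this figure across images gives a rough way to see which kinds of pages the engine struggles with.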

Since Chinese recognition still has many problems, I will look into it further and share the results later.
Reference: HK_JH's column http://blog.csdn.net/hk_jh/article/details/8961449
