Python: Recognizing Text in Images with the pytesser Module

Reposted article. 2014-12-05 23:44 · Python

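　　I used Python's pytesser module. My original goal was to recognize Chinese text in images. After working on it for a while there are still many problems with Chinese recognition, so I am recording and sharing what I have so far.

　　pytesser ("OCR in Python using the Tesseract engine from Google") is a Python module built on Tesseract, Google's open-source OCR project. It converts the text in an image into plain text (mainly English).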

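　　1. Installing pytesser

　　Environment: Windows 8, 64-bit

　　PyTesser uses the Tesseract OCR engine: it converts the image to a format Tesseract accepts, then runs tesseract to extract the text. With PyTesser you do not need to install the Tesseract OCR engine yourself, but you must first install PIL (Python Imaging Library, Python's image library).

　　pytesser download: http://code.google.com/p/pytesser/ If that page does not open, it can be downloaded from Baidu Pan: http://pan.baidu.com/s/1o69LL8Y

　　PIL official download: http://www.pythonware.com/products/pil/

　　PIL can be installed by running its exe installer. pytesser needs no installation: unzip it and place the folder under \Lib\site-packages\ in the Python installation directory, where it can be used directly (you also need to add a pytesser.pth file).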
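
　　A .pth file is a plain text file in site-packages. Each line names a directory (relative to the .pth file's own location) that Python appends to sys.path at startup. The following is a minimal sketch of creating one, assuming the unzipped folder is named pytesser. The helper name write_pth and the example path in the usage note are hypothetical, not part of pytesser.

```python
import os


def write_pth(site_packages_dir, package_dir_name="pytesser"):
    """Create <package_dir_name>.pth inside site_packages_dir.

    The file holds one line: the folder name, which Python resolves
    relative to the directory containing the .pth file.
    """
    pth_path = os.path.join(site_packages_dir, package_dir_name + ".pth")
    with open(pth_path, "w") as f:
        f.write(package_dir_name + "\n")
    return pth_path
```

　　For example, write_pth(r"C:\Python27\Lib\site-packages") would create pytesser.pth there (the install path is an assumption; use your own). Creating the same one-line file by hand in a text editor works just as well.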

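　　Installing on Ubuntu (note: these commands install pytesseract, a separate Python wrapper around Tesseract, rather than pytesser itself)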

sudo pip install pytesseract
sudo apt-get install tesseract-ocr
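
　　The two commands above install the pytesseract package (a separate wrapper from pytesser) and the Tesseract engine itself. Below is a minimal Python 3 sketch of calling it, assuming both are installed along with Pillow, which pytesseract depends on. The function names tesseract_available and ocr_with_pytesseract are hypothetical helpers; pytesseract.image_to_string is the library's own call.

```python
import shutil


def tesseract_available():
    """Return True if a 'tesseract' executable is found on PATH."""
    return shutil.which("tesseract") is not None


def ocr_with_pytesseract(image_path, lang="eng"):
    """Return the text pytesseract extracts from the image at image_path."""
    # Imported here so this module still loads where the packages are missing.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as im:
        return pytesseract.image_to_string(im, lang=lang)
```

　　Usage would look like print(ocr_with_pytesseract("fonts_test.png")), after checking tesseract_available() first.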

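　　2. pytesser source code

　　Reading the source of pytesser.py shows a few main functions:

　　(1) call_tesseract(input_filename, output_filename)

　　This function calls the external tesseract executable to extract the text from the image.

　　(2) image_to_string(im, cleanup = cleanup_scratch_flag)

　　This function works on an image object, so you first need to open the file with im = Image.open(filename), which returns an image object. It calls util.image_to_scratch(im, scratch_image_name) to save the in-memory image as a bmp file so that the tesseract program can process it.

　　(3) image_file_to_string(filename, cleanup = cleanup_scratch_flag, graceful_errors=True)

　　This function has Tesseract read the image file directly. If the image format is incompatible, it first converts the image to a compatible format and then extracts the text.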

"""OCR in Python using the Tesseract engine from Google
http://code.google.com/p/pytesser/
by Michael J.T. O'Kelly
V 0.0.1, 3/10/07"""

import Image
import subprocess

import util
import errors

tesseract_exe_name = 'tesseract' # Name of executable to be called at command line
scratch_image_name = "temp.bmp" # This file must be .bmp or other Tesseract-compatible format
scratch_text_name_root = "temp" # Leave out the .txt extension
cleanup_scratch_flag = False  # Temporary files cleaned up after OCR operation

def call_tesseract(input_filename, output_filename):
    """Calls external tesseract.exe on input file (restrictions on types),
    outputting output_filename+'txt'"""
    args = [tesseract_exe_name, input_filename, output_filename]
    proc = subprocess.Popen(args)
    retcode = proc.wait()
    if retcode!=0:
        errors.check_for_errors()

def image_to_string(im, cleanup = cleanup_scratch_flag):
    """Converts im to file, applies tesseract, and fetches resulting text.
    If cleanup=True, delete scratch files after operation."""
    try:
        util.image_to_scratch(im, scratch_image_name)
        call_tesseract(scratch_image_name, scratch_text_name_root)
        text = util.retrieve_text(scratch_text_name_root)
    finally:
        if cleanup:
            util.perform_cleanup(scratch_image_name, scratch_text_name_root)
    return text

def image_file_to_string(filename, cleanup = cleanup_scratch_flag, graceful_errors=True):
    """Applies tesseract to filename; or, if image is incompatible and graceful_errors=True,
    converts to compatible format and then applies tesseract.  Fetches resulting text.
    If cleanup=True, delete scratch files after operation."""
    try:
        try:
            call_tesseract(filename, scratch_text_name_root)
            text = util.retrieve_text(scratch_text_name_root)
        except errors.Tesser_General_Exception:
            if graceful_errors:
                im = Image.open(filename)
                text = image_to_string(im, cleanup)
            else:
                raise
    finally:
        if cleanup:
            util.perform_cleanup(scratch_image_name, scratch_text_name_root)
    return text
    

if __name__=='__main__':
    im = Image.open('phototest.tif')
    text = image_to_string(im)
    print text
    try:
        text = image_file_to_string('fnord.tif', graceful_errors=False)
    except errors.Tesser_General_Exception, value:
        print "fnord.tif is incompatible filetype.  Try graceful_errors=True"
        print value
    text = image_file_to_string('fnord.tif', graceful_errors=True)
    print "fnord.tif contents:", text
    text = image_file_to_string('fonts_test.png', graceful_errors=True)
    print text
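
　　For readers on Python 3, the flow in the listing above (run the tesseract executable on an image, then read back the text file it writes) can be sketched with the standard library alone. This is a minimal sketch, not pytesser's code: the names build_tesseract_command and ocr_file are hypothetical, and it assumes a tesseract executable is on PATH and that the image is already in a format Tesseract accepts. The optional lang value is passed through Tesseract's -l flag, for example chi_sim for Simplified Chinese, provided that language data is installed.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang=None, exe="tesseract"):
    """Build the argument list: tesseract <image> <output_base> [-l <lang>]."""
    cmd = [exe, str(image_path), str(output_base)]
    if lang:
        cmd += ["-l", lang]
    return cmd


def ocr_file(image_path, output_base="temp", lang=None, exe="tesseract", cleanup=True):
    """Run tesseract on image_path and return the recognized text.

    Tesseract writes its result to <output_base>.txt; that file is
    removed afterwards when cleanup is True.
    """
    out_file = Path(str(output_base) + ".txt")
    try:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(
            build_tesseract_command(image_path, output_base, lang, exe),
            check=True,
        )
        return out_file.read_text(encoding="utf-8")
    finally:
        if cleanup and out_file.exists():
            out_file.unlink()
```

　　Usage would look like print(ocr_file("fonts_test.png")), or ocr_file("page.png", lang="chi_sim") for a Chinese image (page.png is a placeholder file name).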

　　3. Using pytesser
　　Load the pytesser module in your code. A simple test:

from pytesser import *
im = Image.open('fonts_test.png')
text = image_to_string(im)
print "Using image_to_string(): "
print text
text = image_file_to_string('fonts_test.png', graceful_errors=True)
print "Using image_file_to_string():"
print text

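　　As for the results: English characters are mostly extracted correctly, but more complex images fare badly. For example, I tried it on images of pages from English papers and the output was far from satisfactory.

　　Chinese recognition still has many problems, so I will study it further and share more later.

Reference: HK_JH's column http://blog.csdn.net/hk_jh/article/details/8961449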
