Image text recognition in wxPython with the pytesser module

This article is a repost. 2013-05-29 20:07

Pytesser: OCR in Python using the Tesseract engine from Google

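pytesser is a Python wrapper module for Tesseract, the open-source OCR engine sponsored by Google. Importing it in Python lets you convert the text in an image into a string.

Link: https://code.google.com/p/pytesser/

The call chain is simple: your Python code calls the pytesser module, and pytesser in turn runs tesseract to recognize the text in the image.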

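The steps for the whole process are as follows:

1. First, download pytesser from code.google.com: https://code.google.com/p/pytesser/downloads/detail?name=pytesser_v0.0.1.zip

No installation is required; you can place it under \Lib\site-packages\ in the Python installation folder and use it directly.

pytesser includes tesseract.exe and the English data pack (by default only English is recognized), along with some sample images, so it is ready to use once unzipped.

You can test it with the following code: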

>>> from pytesser import *
>>> image = Image.open('fnord.tif')  # Open image object using PIL
>>> print image_to_string(image)     # Run tesseract.exe on image
fnord
>>> print image_file_to_string('fnord.tif')
fnord

from pytesser import * 
#im = Image.open('fnord.tif') 
#im = Image.open('phototest.tif') 
#im = Image.open('eurotext.tif')
im = Image.open('fonts_test.png')
text = image_to_string(im) 
print text

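Note: this module requires the PIL library.

2. Fixing a low recognition rate

You can enhance the image, for example by raising its contrast, or convert it to black and white; either can improve the recognition rate considerably: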

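import ImageEnhance  # classic PIL layout, matching pytesser's own "import Image"
# image1 is an Image object that was already opened with Image.open(...)
enhancer = ImageEnhance.Contrast(image1)
image2 = enhancer.enhance(4)  # 1.0 returns the original image; larger factors raise contrast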

You can then call image_to_string on image2 to run the recognition.
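The two suggestions above, raising the contrast and converting to black and white, can be combined into one preprocessing step. The following is only a minimal sketch, written for Python 3 and assuming Pillow (the maintained fork of PIL) is installed. The function names build_threshold_table and preprocess_for_ocr and the threshold of 128 are illustrative choices, not part of pytesser; the contrast factor of 4 is taken from the snippet above.

```python
def build_threshold_table(threshold=128):
    """Return a 256-entry lookup table mapping grey levels to pure black or white."""
    return [0 if level < threshold else 255 for level in range(256)]


def preprocess_for_ocr(image, contrast=4.0, threshold=128):
    """Convert a PIL image to greyscale, boost contrast, then binarize it.

    Pillow is imported lazily so the table helper also works without it.
    """
    from PIL import ImageEnhance  # assumption: Pillow is installed

    grey = image.convert("L")  # 8-bit greyscale
    boosted = ImageEnhance.Contrast(grey).enhance(contrast)
    # Apply the lookup table so every pixel becomes either 0 or 255.
    return boosted.point(build_threshold_table(threshold))
```

The returned image can be passed to image_to_string in place of image2. The best threshold depends on the image, so 128 is only a starting point.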

3. Recognizing other languages

tesseract is a command-line program that takes the following arguments:

tesseract imagename outbase [-l lang] [-psm N] [configfile...]

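imagename is the name of the input image.

outbase is the base name of the output text file; tesseract writes the result to outbase.txt.

-l lang specifies the language to recognize; the default is English.

For details, see http://tesseract-ocr.googlecode.com/svn-history/r725/trunk/doc/tesseract.1.html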
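To make the synopsis above concrete, here is a minimal sketch, written for Python 3, of how a wrapper might turn these parameters into a command line. The helper names build_tesseract_args and run_tesseract are hypothetical and are not pytesser functions. The -psm spelling follows the synopsis above; later Tesseract releases may expect --psm instead, so check the version you have.

```python
import subprocess


def build_tesseract_args(imagename, outbase, lang=None, psm=None, exe="tesseract"):
    """Build the list for: tesseract imagename outbase [-l lang] [-psm N]."""
    args = [exe, imagename, outbase]
    if lang:
        args += ["-l", lang]          # omitted: tesseract falls back to English
    if psm is not None:
        args += ["-psm", str(psm)]    # page segmentation mode, as in the synopsis
    return args


def run_tesseract(imagename, outbase, lang=None, psm=None):
    """Run tesseract and return the text it wrote to outbase.txt."""
    # check_call raises CalledProcessError if tesseract exits with a non-zero code.
    subprocess.check_call(build_tesseract_args(imagename, outbase, lang, psm))
    with open(outbase + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

Passing the arguments as a list rather than one shell string avoids quoting problems with file names that contain spaces.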

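Other languages can be recognized with the following steps:

(1) Download the data packs for the other languages:

https://code.google.com/p/tesseract-ocr/downloads/list

Put the language pack in pytesser's tessdata folder.

Next, modify the parameters in pytesser.py. Here is an example: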

"""OCR in Python using the Tesseract engine from Google
http://code.google.com/p/pytesser/
by Michael J.T. O'Kelly
V 0.0.2, 5/26/08"""

import Image
import subprocess
import os
import StringIO

import util
import errors


tesseract_exe_name = 'dlltest' # Name of executable to be called at command line
scratch_image_name = "temp.bmp" # This file must be .bmp or other Tesseract-compatible format
scratch_text_name_root = "temp" # Leave out the .txt extension
_cleanup_scratch_flag = True  # Temporary files cleaned up after OCR operation
_language = "" # Tesseract uses English if language is not given
_pagesegmode = "" # Tesseract uses fully automatic page segmentation if psm is not given (psm is available in v3.01)

_working_dir = os.getcwd()

def call_tesseract(input_filename, output_filename, language, pagesegmode):
        """Calls external tesseract.exe on input file (restrictions on types),
        outputting output_filename+'txt'"""
        current_dir = os.getcwd()
        error_stream = StringIO.StringIO()
        try:
                os.chdir(_working_dir)
                args = [tesseract_exe_name, input_filename, output_filename]
                if len(language) > 0:
                        args.append("-l")
                        args.append(language)
                if len(str(pagesegmode)) > 0:
                        args.append("-psm")
                        args.append(str(pagesegmode))
                try:
                        proc = subprocess.Popen(args)
                except (TypeError, AttributeError):
                        proc = subprocess.Popen(args, shell=True)
                retcode = proc.wait()
                if retcode!=0:
                        error_text = error_stream.getvalue()
                        errors.check_for_errors(error_stream_text = error_text)
        finally:  # Guarantee that we return to the original directory
                error_stream.close()
                os.chdir(current_dir)

def image_to_string(im, lang = _language, psm = _pagesegmode, cleanup = _cleanup_scratch_flag):
        """Converts im to file, applies tesseract, and fetches resulting text.
        If cleanup=True, delete scratch files after operation."""
        try:
                util.image_to_scratch(im, scratch_image_name)
                call_tesseract(scratch_image_name, scratch_text_name_root, lang, psm)
                result = util.retrieve_result(scratch_text_name_root)
        finally:
                if cleanup:
                        util.perform_cleanup(scratch_image_name, scratch_text_name_root)
        return result

def image_file_to_string(filename, lang = _language, psm = _pagesegmode, cleanup = _cleanup_scratch_flag, graceful_errors=True):
        """Applies tesseract to filename; or, if image is incompatible and graceful_errors=True,
        converts to compatible format and then applies tesseract.  Fetches resulting text.
        If cleanup=True, delete scratch files after operation. Parameter lang specifies used language.
        If lang is empty, English is used. Page segmentation mode parameter psm is available in Tesseract 3.01.
        psm values are:
        0 = Orientation and script detection (OSD) only.
        1 = Automatic page segmentation with OSD.
        2 = Automatic page segmentation, but no OSD, or OCR
        3 = Fully automatic page segmentation, but no OSD. (Default)
        4 = Assume a single column of text of variable sizes.
        5 = Assume a single uniform block of vertically aligned text.
        6 = Assume a single uniform block of text.
        7 = Treat the image as a single text line.
        8 = Treat the image as a single word.
        9 = Treat the image as a single word in a circle.
        10 = Treat the image as a single character."""
        try:
                try:
                        call_tesseract(filename, scratch_text_name_root, lang, psm)
                        result = util.retrieve_result(scratch_text_name_root)
                except errors.Tesser_General_Exception:
                        if graceful_errors:
                                im = Image.open(filename)
                                result = image_to_string(im, lang, psm, cleanup)  # forward lang and psm; passing cleanup positionally would land in lang
                        else:
                                raise
        finally:
                if cleanup:
                        util.perform_cleanup(scratch_image_name, scratch_text_name_root)
        return result
        

if __name__=='__main__':
        im = Image.open('phototest.tif')
        text = image_to_string(im, cleanup=False)
        print text
        text = image_to_string(im, psm=2, cleanup=False)
        print text
        try:
                text = image_file_to_string('fnord.tif', graceful_errors=False)
        except errors.Tesser_General_Exception, value:
                print "fnord.tif is incompatible filetype.  Try graceful_errors=True"
                #print value
        text = image_file_to_string('fnord.tif', graceful_errors=True, cleanup=False)
        print "fnord.tif contents:", text
        text = image_file_to_string('fonts_test.png', graceful_errors=True)
        print text
        text = image_file_to_string('fonts_test.png', lang="eng", psm=4, graceful_errors=True)
        print text

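This is the version provided in the source. If you only need to recognize other languages, though, adding a single language parameter is enough. Here is my version: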

"""OCR in Python using the Tesseract engine from Google
http://code.google.com/p/pytesser/
by Michael J.T. O'Kelly
V 0.0.1, 3/10/07"""

import Image
import subprocess
import util
import errors

tesseract_exe_name = 'tesseract' # Name of executable to be called at command line
scratch_image_name = "temp.bmp" # This file must be .bmp or other Tesseract-compatible format
scratch_text_name_root = "temp" # Leave out the .txt extension
cleanup_scratch_flag = True  # Temporary files cleaned up after OCR operation

def call_tesseract(input_filename, output_filename, language):
	"""Calls external tesseract.exe on input file (restrictions on types),
	outputting output_filename+'txt'"""
	args = [tesseract_exe_name, input_filename, output_filename, "-l", language]
	proc = subprocess.Popen(args)
	retcode = proc.wait()
	if retcode!=0:
		errors.check_for_errors()

def image_to_string(im, cleanup = cleanup_scratch_flag, language = "eng"):
	"""Converts im to file, applies tesseract, and fetches resulting text.
	If cleanup=True, delete scratch files after operation."""
	try:
		util.image_to_scratch(im, scratch_image_name)
		call_tesseract(scratch_image_name, scratch_text_name_root,language)
		text = util.retrieve_text(scratch_text_name_root)
	finally:
		if cleanup:
			util.perform_cleanup(scratch_image_name, scratch_text_name_root)
	return text

def image_file_to_string(filename, cleanup = cleanup_scratch_flag, graceful_errors=True, language = "eng"):
	"""Applies tesseract to filename; or, if image is incompatible and graceful_errors=True,
	converts to compatible format and then applies tesseract.  Fetches resulting text.
	If cleanup=True, delete scratch files after operation."""
	try:
		try:
			call_tesseract(filename, scratch_text_name_root, language)
			text = util.retrieve_text(scratch_text_name_root)
		except errors.Tesser_General_Exception:
			if graceful_errors:
				im = Image.open(filename)
				text = image_to_string(im, cleanup, language)  # keep the requested language in the fallback path
			else:
				raise
	finally:
		if cleanup:
			util.perform_cleanup(scratch_image_name, scratch_text_name_root)
	return text
	

if __name__=='__main__':
	im = Image.open('phototest.tif')
	text = image_to_string(im)
	print text
	try:
		text = image_file_to_string('fnord.tif', graceful_errors=False)
	except errors.Tesser_General_Exception, value:
		print "fnord.tif is incompatible filetype.  Try graceful_errors=True"
		print value
	text = image_file_to_string('fnord.tif', graceful_errors=True)
	print "fnord.tif contents:", text
	text = image_file_to_string('fonts_test.png', graceful_errors=True)
	print text

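When calling image_to_string, just add the appropriate language argument: chi_sim for Simplified Chinese, chi_tra for Traditional Chinese. The value is the XXX part of the downloaded XXX.traineddata file name. For example, if the downloaded Chinese pack is chi_sim.traineddata, the argument is chi_sim: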

text = image_to_string(self.im, language = 'chi_sim')
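Because the language argument is just the file name without its .traineddata extension, the valid values can be read straight from the tessdata folder. The following is a minimal sketch, written for Python 3; available_languages and require_language are hypothetical helpers, not part of pytesser, and the folder path is whatever location you copied the language packs to.

```python
from pathlib import Path


def available_languages(tessdata_dir):
    """Return the sorted language codes found as *.traineddata files."""
    return sorted(path.stem for path in Path(tessdata_dir).glob("*.traineddata"))


def require_language(tessdata_dir, language):
    """Return language unchanged, or raise if its data pack is not installed."""
    if language not in available_languages(tessdata_dir):
        raise ValueError("no %s.traineddata in %s" % (language, tessdata_dir))
    return language
```

Checking the language before calling image_to_string gives a clear error message when a pack is missing, rather than a failure inside the tesseract subprocess.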

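At this point, the image recognition is complete.

One more note: the Chinese text may be recognized but come out garbled. In that case, convert text to the Chinese encoding you are using, for example:

text.decode("utf8")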
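The decode call above can be sketched a little more fully. This is an illustration for Python 3 (the article's own code is Python 2), assuming, as that call implies, that the recognized text arrives as UTF-8 bytes. The helper names decode_ocr_output and encode_for_console are hypothetical, and the gbk default is only an assumption about a console that does not use UTF-8.

```python
def decode_ocr_output(raw, encoding="utf-8"):
    """Decode raw OCR bytes to text; already-decoded text is returned unchanged."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


def encode_for_console(text, console_encoding="gbk"):
    """Re-encode text for a console, replacing characters it cannot represent."""
    return text.encode(console_encoding, errors="replace")
```

Decoding first and re-encoding only at the point of output keeps the rest of the program working with ordinary text rather than bytes in a particular encoding.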
