Recognizing Chinese text in images with Python--OCR about Chinese text--tesseract

Reposted article. 2016-08-12 17:27. Tags: OCR, python, tesseract

0. My environment:

Windows 7, 32-bit

Python 3.5

PyCharm 5.0

1. Related libraries

Install pillow:

pip install pillow

Install tesseract:

tesseract-ocr-setup-3.02.02.exe

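The installer ships with the English language pack. If you need the Chinese language pack, it is listed further down.

Alternatively, during installation, select chi-sim under the lang option.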
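To confirm that the Chinese language pack is actually in place, one option is to look for its data file. The sketch below assumes Tesseract keeps each language as `<lang>.traineddata` inside a `tessdata` folder under the install directory. The folder path shown is an assumption based on a default install location and may differ on your machine, and `has_language` is a hypothetical helper, not part of any library.

```python
import os


def has_language(tessdata_dir, lang):
    """Return True if <lang>.traineddata exists in tessdata_dir."""
    return os.path.isfile(os.path.join(tessdata_dir, lang + ".traineddata"))


# Assumed location; adjust to wherever Tesseract-OCR was installed.
TESSDATA_DIR = "C:/Program Files (x86)/Tesseract-OCR/tessdata"

for lang in ("eng", "chi_sim"):
    status = "found" if has_language(TESSDATA_DIR, lang) else "missing"
    print(lang, status)
```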

Once installation finishes, tesseract is added to the system environment variables automatically.
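Whether the installer really did update the environment can be checked from Python before going further. This is a minimal sketch using only the standard library; `on_path` is a hypothetical helper name, and a freshly opened terminal may be needed before a changed PATH becomes visible.

```python
import shutil


def on_path(command="tesseract"):
    """Return the full path of command if it is on PATH, else None."""
    return shutil.which(command)


location = on_path()
if location:
    print("tesseract found at", location)
else:
    print("tesseract is not on PATH; its full path has to be given explicitly")
```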

Install pytesseract:

pip install pytesseract

2. Modify the pytesseract.py source file

# tesseract_cmd = 'tesseract'

tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract.exe'

# Without this change, an error is raised: FileNotFoundError: [WinError 2] The system cannot find the file specified.
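An alternative to editing the installed file is to set the same variable from your own script, since `tesseract_cmd` is a module-level name in `pytesseract.pytesseract`. The sketch below is one way to do that. `resolve_tesseract` and `configure_pytesseract` are hypothetical helpers, and the fallback path is the one used above, which is an assumption about where Tesseract was installed.

```python
import os
import shutil

# Assumed install location, taken from the modification above.
DEFAULT_CANDIDATES = ("C:/Program Files (x86)/Tesseract-OCR/tesseract.exe",)


def resolve_tesseract(command="tesseract", candidates=DEFAULT_CANDIDATES):
    """Return a usable path: PATH lookup first, then the candidate files."""
    found = shutil.which(command)
    if found:
        return found
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def configure_pytesseract():
    """Point pytesseract at the resolved executable without editing its source."""
    import pytesseract  # imported here so resolve_tesseract works without it

    path = resolve_tesseract()
    if path is None:
        raise FileNotFoundError("tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Calling `configure_pytesseract()` once before `image_to_string` should have the same effect as the edit to `tesseract_cmd`, and it is not lost when the package is reinstalled.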

#f = open(output_file_name)

f = open(output_file_name, encoding='utf-8')

# Without this change, an error is raised: UnicodeDecodeError: 'gbk' codec can't decode byte 0xyy in position xxx: illegal multibyte sequence
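The reason for this error is that `open()` without an `encoding` argument uses the system's locale encoding, which is GBK on a Chinese-language Windows install, while the modification above treats the output file as UTF-8. The following sketch reproduces the mismatch with a temporary file. The sample string and the file are made up for illustration, and `read_as` is a hypothetical helper.

```python
import os
import tempfile

sample = "中文識別"

# Write the sample as UTF-8, standing in for the OCR output file.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name


def read_as(encoding):
    """Read the file with the given encoding; return the text or the error."""
    try:
        with open(path, encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError as exc:
        return exc


utf8_result = read_as("utf-8")  # round-trips the original text
gbk_result = read_as("gbk")     # either raises or yields garbled text
os.remove(path)

print("utf-8 read matches:", utf8_result == sample)
print("gbk read matches:", gbk_result == sample)
```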

3. A small program to test it

# coding: utf-8
# Test one page
import pytesseract
from PIL import Image


def processImage():
    image = Image.open('test.png')

    # Background cleanup (optional). The image is converted to grayscale
    # first so the threshold applies to a single band: values below 143
    # become black, everything else white.
    image = image.convert('L').point(lambda x: 0 if x < 143 else 255)
    newFilePath = 'raw-test.png'
    image.save(newFilePath)

    # For images containing Chinese text, use lang='chi_sim' instead
    content = pytesseract.image_to_string(Image.open(newFilePath), lang='eng')
    print(content)


processImage()
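The thresholding step can be looked at in isolation. When `Image.point` is given a function, Pillow calls it once for each possible pixel value and applies the resulting lookup table, so for an 8-bit image the lambda above amounts to the 256-entry table built here. This sketch uses plain lists instead of an image so that it runs without Pillow. The pixel values are made up, and `binarize` is a hypothetical helper.

```python
THRESHOLD = 143  # same cutoff as in the test program

# One entry per possible 8-bit pixel value: dark values become 0, the rest 255.
table = [0 if value < THRESHOLD else 255 for value in range(256)]


def binarize(pixels):
    """Map each grayscale value through the lookup table."""
    return [table[p] for p in pixels]


row = [12, 142, 143, 200, 255]
print(binarize(row))  # [0, 0, 255, 255, 255]
```

If the text in your image is lighter or the background darker, the cutoff of 143 will probably need adjusting.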
