A Detailed Guide to pytesseract, the Python OCR Tool

Reposted article, published 2021-12-21 20:28. Tags: pytesseract, Python, OCR

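pytesseract is a Python OCR tool built on Google's Tesseract-OCR engine. It recognizes text in images and supports image formats such as jpeg, png, gif, bmp, and tiff. This article shows how to use pytesseract to recognize text in images.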

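Introduction
Environment Setup
- 1. Install Google Tesseract
- 2. Install pytesseract
A Simple Text Recognition Example
Getting Text Position Information
Multi-Language Recognition
- Usage
- Training Data
OCR Options
- Page Segmentation Modes (PSM)
- OCR Engine Modes (OEM)
Orientation and Script Detection (OSD)
Extracting Digits
Character Whitelist
Character Blacklist
Format Conversion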

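Introduction

OCR (optical character recognition) is a technique for converting handwritten or printed text in images into machine-encoded text. Text stored digitally is easier to preserve and edit, and it is very compact: a plain-text book is typically on the order of a megabyte, so a 1 GB disk can hold roughly a thousand books.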

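OCR converts the text in images and paper documents into digital text. The process generally consists of the following steps:

Image preprocessing
Text localization
Character segmentation
Character recognition
Post-processing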
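
To make the preprocessing step more concrete, here is a minimal pure-Python sketch of Otsu binarization, a common way to separate dark text from a light background before recognition. It is an illustration only: the function names otsu_threshold and binarize are made up for this example, and real code would normally work on an image object from an image library rather than on a flat list of gray levels.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that best splits pixels into <= t and > t."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of the values of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # no pixels in the dark class yet
        if w1 == 0:
            break     # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels):
    """Map each pixel to 0 (at or below the threshold) or 1 (above it)."""
    t = otsu_threshold(pixels)
    return [1 if p > t else 0 for p in pixels]
```

The threshold is chosen to maximize the between-class variance of the two groups, so no manual tuning is needed for images with a clear text/background contrast.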

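Tesseract, an open-source OCR engine originally developed by HP and later sponsored by Google, provides a fairly accurate text recognition API. The Python library pytesseract introduced in this article is built on the Tesseract-OCR engine.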

Environment Setup

Requirements:

Python 3.6+
Pillow (PIL) library
Google Tesseract OCR installed
Operating system: Windows, macOS, or Linux (this article uses Windows 10)

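1. Install Google Tesseract

Tesseract OCR on GitHub: https://github.com/tesseract-ocr/tesseract

Windows installers for Tesseract: https://digi.bib.uni-mannheim.de/tesseract/

Installation instructions for Mac and Linux: https://tesseract-ocr.github.io/tessdoc/Installation.html

During installation you can select the language packs you need.

After installation, add the install directory to the PATH environment variable. In this article the install path is C:\Program Files\Tesseract-OCR.

Run tesseract in a command prompt to check that the installation succeeded: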

$ tesseract
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.

2. Install pytesseract

pytesseract on GitHub: https://github.com/madmaze/pytesseract

Install pytesseract with pip:

pip install pytesseract

The Pillow library is also needed for image handling:

pip install Pillow
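
If the tesseract executable is not on PATH, pytesseract can be pointed at it explicitly through pytesseract.pytesseract.tesseract_cmd. The sketch below is one possible way to do that; the helper names and the assumption that the executable is tesseract.exe inside the install directory mentioned above are illustrative, so adjust the path for your system.

```python
import shutil

# Assumed location: tesseract.exe inside the article's Windows install directory.
DEFAULT_TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(fallback=DEFAULT_TESSERACT):
    """Return the tesseract executable found on PATH, or the fallback path."""
    return shutil.which("tesseract") or fallback


def configure_pytesseract():
    """Point pytesseract at the executable and return the Tesseract version."""
    import pytesseract  # imported here so find_tesseract() works without it

    pytesseract.pytesseract.tesseract_cmd = find_tesseract()
    return pytesseract.get_tesseract_version()
```

Calling configure_pytesseract() once at start-up both sets the path and confirms that pytesseract can actually run Tesseract.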

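A Simple Text Recognition Example

Prepare an image that contains Chinese and English text. The code below extracts the Chinese and English characters from the image and returns them as a string: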

import pytesseract
try:
    from PIL import Image
except ImportError:
    import Image

# List the languages supported by the local Tesseract installation
print(pytesseract.get_languages(config=''))

print(pytesseract.image_to_string(Image.open('test.png'), lang='chi_sim+eng'))

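The text in the following image (test.png) is recognized:

Output (the first line is the language list; the rest is the recognized text, in which the leading # of the comment was misread as 拳):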

['chi_sim', 'eng', 'osd']
拳 列出支持的語言
print(pytesseract.get_languages (config=”))

print(pytesseract.image_to_string(Image.open('test.png'), lang='chi_sim+eng'))

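Getting Text Position Information

image_to_boxes() returns the recognized characters together with their bounding boxes. image_to_data() returns words together with their positions. The code below runs both methods on an image containing Chinese text (testimg2.png):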

import pytesseract
from pytesseract import Output
from PIL import Image

img = Image.open('testimg2.png')
print(pytesseract.image_to_boxes(img, output_type=Output.STRING, lang='chi_sim'))
print("#"*30)
print(pytesseract.image_to_data(img, output_type=Output.STRING, lang='chi_sim'))

Output:

生 63 211 80 227 0
存 81 209 118 227 0
是 122 211 139 226 0
文 126 200 154 231 0
明 142 210 157 226 0
的 162 209 197 227 0
第 200 217 218 219 0
一 221 209 236 226 0
需 217 200 253 231 0
要 239 209 259 226 0
。 260 211 266 216 0
猜 325 64 364 82 0
疑 364 64 481 82 0
鏈 373 54 393 86 0
和 383 54 403 86 0
技 403 54 435 86 0
術 419 54 451 86 0
爆 441 54 477 86 0
炸 469 54 485 86 0

##############################
level	page_num	block_num	par_num	line_num	word_num	left	top	width	height	conf	text
1	1	0	0	0	0	0	0	566	279	-1	
2	1	1	0	0	0	63	52	203	18	-1	
3	1	1	1	0	0	63	52	203	18	-1	
4	1	1	1	1	0	63	52	203	18	-1	
5	1	1	1	1	1	63	52	55	18	96	生存
5	1	1	1	1	2	122	53	17	15	96	是
5	1	1	1	1	3	126	48	31	31	96	文明
5	1	1	1	1	4	162	52	35	18	96	的
5	1	1	1	1	5	200	60	18	2	91	第
5	1	1	1	1	6	221	53	15	17	93	一
5	1	1	1	1	7	217	48	42	31	93	需要
5	1	1	1	1	8	260	63	6	5	91	。
2	1	2	0	0	0	325	197	156	18	-1	
3	1	2	1	0	0	325	197	156	18	-1	
4	1	2	1	1	0	325	197	156	18	-1	
5	1	2	1	1	1	325	197	156	18	94	猜疑
5	1	2	1	1	2	373	193	20	32	77	鏈
5	1	2	1	1	3	383	193	20	32	92	和
5	1	2	1	1	4	403	193	48	32	96	技術
5	1	2	1	1	5	441	193	44	32	94	爆炸
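
Note that the two outputs use different coordinate conventions. Each image_to_boxes() line is "char left bottom right top page" with the origin at the bottom-left corner of the image, while image_to_data() reports left/top/width/height from the top-left corner. The sketch below converts box lines to the second convention; parse_boxes is a hypothetical helper name, and the image height of 279 is taken from the level-1 row of the image_to_data() output above.

```python
def parse_boxes(box_text, image_height):
    """Convert image_to_boxes() text into (char, left, top, width, height) tuples.

    The returned coordinates are measured from the top-left corner of the image.
    """
    boxes = []
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char = parts[0]
        left, bottom, right, top = map(int, parts[1:5])
        boxes.append((char, left, image_height - top, right - left, top - bottom))
    return boxes
```

For the first character above, 生 63 211 80 227 0, this gives a top of 279 - 227 = 52, which matches the top value image_to_data() reports for the same line of text.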

Using the position information returned by image_to_data(), the code below marks the location of each recognized word.

import cv2
import numpy as np
import pytesseract
from pytesseract import Output
from PIL import Image, ImageDraw, ImageFont

img = cv2.imread('testimg2.png')

# Estimate the typical character width from the contours of the binarized image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
cnts = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]  # return shape differs between OpenCV versions

width_list = []
for c in cnts:
    _, _, w, _ = cv2.boundingRect(c)
    width_list.append(w)
wm = np.median(width_list)

# Load a font that contains Chinese glyphs once, outside the loop
font = ImageFont.truetype(font="simsun.ttc", size=18, encoding="utf-8")

tess_text = pytesseract.image_to_data(img, output_type=Output.DICT, lang='chi_sim')
for i in range(len(tess_text['text'])):
    word_len = len(tess_text['text'][i])
    if word_len > 1:  # only mark words of two or more characters
        word_w = int(wm * word_len)
        (x, y, w, h) = (tess_text['left'][i], tess_text['top'][i], tess_text['width'][i], tess_text['height'][i])
        cv2.rectangle(img, (x, y), (x + word_w, y + h), (255, 0, 0), 1)  # blue box (BGR)
        # OpenCV stores BGR and Pillow expects RGB, so convert before and after drawing the label
        im = Image.fromarray(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
        draw = ImageDraw.Draw(im)
        draw.text((x, y - 20), tess_text['text'][i], (255, 0, 0), font=font)  # red label (RGB)
        img = cv2.cvtColor(np.array(im), cv2.COLOR_RGB2BGR)

cv2.imshow("TextBoundingBoxes", img)
cv2.waitKey(0)

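Result:

A note on ImageFont.truetype(font="simsun.ttc", size=18, encoding="utf-8"): this call sets the font and encoding. The default font used by draw.text() only covers ISO-8859-1 (Latin-1) characters, so Chinese text must be drawn with a font that contains Chinese glyphs. On Windows, fonts are usually stored in C:\Windows\Fonts, which Pillow also searches, so the font file name alone is enough. simsun.ttc is the SimSun (宋體) font.

If you do not know a font's file name, look it up in the registry: type regedit in the Run dialog or a command prompt to open the Registry Editor, then go to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Fonts, which lists the file name for each font.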
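
The conf column returned by image_to_data() can also be used to discard uncertain results before drawing or further processing. The following is a minimal sketch that works on the Output.DICT form of the result; confident_words is a hypothetical helper, and the threshold of 90 is an arbitrary choice for illustration.

```python
def confident_words(data, min_conf=90):
    """Return (text, (left, top, width, height)) for word rows with conf >= min_conf."""
    words = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # may arrive as str, int, or float
        is_word_row = int(data["level"][i]) == 5  # level 5 rows are words
        if is_word_row and text.strip() and conf >= min_conf:
            box = (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
            words.append((text, box))
    return words
```

With the sample output shown earlier and min_conf=90, the word 鏈 (confidence 77) would be dropped while the other words are kept.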

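Multi-Language Recognition

Usage

An image may contain several languages. In the earlier example the image contained Chinese and English, and lang='chi_sim+eng' tells Tesseract to recognize Simplified Chinese and English.

Simplified Chinese (chi_sim) was selected when installing Tesseract. get_languages() lists the supported languages; you can also run tesseract --list-langs on the command line: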

$ tesseract --list-langs
List of available languages (3):
chi_sim
eng
osd

Besides lang='chi_sim+eng', the languages can also be specified through the config parameter, as config='-l chi_sim+eng':

img = Image.open('test.png')
config = r'-l chi_sim+eng --psm 6'
print(pytesseract.image_to_string(img, config=config))

The result is the same as before.
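
Requesting a language whose data is not installed makes Tesseract fail at recognition time, so it can help to validate the language list first. The sketch below is one way to do that; build_lang is a hypothetical helper, and the available list is expected to come from pytesseract.get_languages() or tesseract --list-langs.

```python
def build_lang(requested, available):
    """Join language codes with '+', failing early if any is not installed."""
    missing = [code for code in requested if code not in available]
    if missing:
        raise ValueError(f"languages not installed: {', '.join(missing)}")
    return "+".join(requested)
```

For example, build_lang(['chi_sim', 'eng'], pytesseract.get_languages(config='')) yields the string that is passed as lang in the examples above.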

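Training Data

Additional language packs can be downloaded from https://tesseract-ocr.github.io/tessdoc/Data-Files

Tesseract provides three sets of trained data:

Trained data	Model	Speed	Accuracy
tessdata_fast	LSTM	Fastest	Lowest
tessdata_best	LSTM	Slowest	Highest
tessdata	Legacy + LSTM	Medium	Slightly lower than tessdata_best

Download the model files you need and put the traineddata files in C:\Program Files\Tesseract-OCR\tessdata (the tessdata folder under the Tesseract install directory).

tessdata_best can also be used as the starting point for further training. For the training procedure, see https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html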
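
A quick way to confirm that the downloaded files ended up in the right folder is to check for each <language code>.traineddata file. This is a small sketch using only the standard library; missing_traineddata is a hypothetical helper name, and the tessdata path in the usage note is the one used in this article.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, langs):
    """Return the language codes whose <code>.traineddata file is absent."""
    folder = Path(tessdata_dir)
    return [code for code in langs if not (folder / f"{code}.traineddata").is_file()]
```

For example, missing_traineddata(r"C:\Program Files\Tesseract-OCR\tessdata", ["chi_sim", "eng"]) returns an empty list when both models are in place.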

OCR Options

The multi-language example used the -l and --psm options. tesseract supports further OCR options.

OCR options:

--tessdata-dir PATH: Specify the location of tessdata path.
--user-words PATH: Specify the location of user words file.
--user-patterns PATH: Specify the location of user patterns file.
--dpi VALUE: Specify DPI for input image.
-l LANG[+LANG]: Specify language(s) used for OCR.
-c VAR=VALUE: Set value for config variables. Multiple -c arguments are allowed.
--psm NUM: Specify page segmentation mode.
--oem NUM: Specify OCR Engine mode.

In pytesseract these options are passed through the config parameter, for example: config='--psm 0 -c min_characters_to_try=5'

The psm and oem options are described below.

Page Segmentation Modes (PSM)

tesseract has 14 page segmentation modes (PSM), numbered 0 to 13:

0 -- Orientation and script detection (OSD) only.
1 -- Automatic page segmentation with OSD.
2 -- Automatic page segmentation, but no OSD, or OCR.
3 -- Fully automatic page segmentation, but no OSD. (Default)
4 -- Assume a single column of text of variable sizes.
5 -- Assume a single uniform block of vertically aligned text.
6 -- Assume a single uniform block of text.
7 -- Treat the image as a single text line.
8 -- Treat the image as a single word.
9 -- Treat the image as a single word in a circle.
10 -- Treat the image as a single character.
11 -- Sparse text. Find as much text as possible in no particular order.
12 -- Sparse text with OSD.
13 -- Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

OCR Engine Modes (OEM)

There are 4 OCR engine modes:

0 -- Legacy engine only.
1 -- Neural nets LSTM engine only.
2 -- Legacy + LSTM engines.
3 -- Default, based on what is available.
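
Since every option ends up in a single config string, a small builder can keep the string consistent and catch out-of-range mode numbers early. The sketch below is only one possible approach; build_config and its parameter names are hypothetical and not part of pytesseract.

```python
def build_config(psm=None, oem=None, lang=None, variables=None):
    """Assemble a Tesseract option string for pytesseract's config parameter."""
    parts = []
    if lang:
        parts += ["-l", lang]
    if oem is not None:
        if oem not in range(4):
            raise ValueError("oem must be between 0 and 3")
        parts += ["--oem", str(oem)]
    if psm is not None:
        if psm not in range(14):
            raise ValueError("psm must be between 0 and 13")
        parts += ["--psm", str(psm)]
    # Each config variable becomes its own "-c NAME=VALUE" pair
    for name, value in (variables or {}).items():
        parts += ["-c", f"{name}={value}"]
    return " ".join(parts)
```

The result can be passed directly, for example pytesseract.image_to_string(img, config=build_config(psm=6, oem=3)).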

Orientation and Script Detection (OSD)

Tesseract supports orientation and script detection (OSD). For example, to run it on the image osd-example.png:

osd = pytesseract.image_to_osd('osd-example.png',config='--psm 0 -c min_characters_to_try=5')
print(osd)

Here min_characters_to_try sets the minimum number of characters to try; the default is 50.

Output:

Page number: 0
Orientation in degrees: 90
Rotate: 270
Orientation confidence: 0.74
Script: Han
Script confidence: 0.83

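The result reports a detected orientation of 90 degrees with a suggested rotation of 270 degrees, and the detected script is Han (Chinese characters).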
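
When image_to_osd() returns plain text as above, the values can be pulled out with a few lines of parsing. This is a minimal sketch under the assumption that every line has the "Key: value" shape shown in the output; parse_osd is a hypothetical helper name.

```python
def parse_osd(osd_text):
    """Turn 'Key: value' lines from image_to_osd() into a dict with typed values."""
    result = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'Key: value' line
        key, value = key.strip(), value.strip()
        try:
            result[key] = int(value)
        except ValueError:
            try:
                result[key] = float(value)
            except ValueError:
                result[key] = value  # keep non-numeric values such as the script name
    return result
```

With the output above, parse_osd(osd)["Rotate"] gives the integer 270 and parse_osd(osd)["Script"] gives the string "Han".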

Extracting Digits

Extract only the digits from the image below (number-example.png):

img = Image.open('number-example.png')
# The built-in "digits" config file restricts recognition to digit characters
config = r'--oem 3 --psm 6 outputbase digits'
text = pytesseract.image_to_string(img, config=config)
print(text)

Output:

1200-.41194-.
4-.

12000000

11994933.
-119940218

119932207

1199251

119915241

119907238

-119853209
1119450495
.-11941637
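
As the output shows, stray characters such as - and . can remain next to the digits. A simple post-processing pass with a regular expression can keep only the digit runs. This is a sketch, not part of pytesseract; extract_numbers and its min_digits parameter are hypothetical names.

```python
import re


def extract_numbers(ocr_text, min_digits=1):
    """Return the runs of consecutive digits found in OCR output, in order."""
    return [m for m in re.findall(r"\d+", ocr_text) if len(m) >= min_digits]
```

Raising min_digits filters out short fragments, such as the lone 4 in the third line of the output above.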

Character Whitelist

Recognize only specific characters, in this case only digits:

img = Image.open('number-example.png')
config = r'-c tessedit_char_whitelist=0123456789 --psm 6'
print(pytesseract.image_to_string(img, config=config))

Output:

In this test the recognition was more accurate than with the outputbase digits approach.

Character Blacklist

Exclude digits from recognition:

img = Image.open('number-example.png')
config = r'-c tessedit_char_blacklist=0123456789 --psm 6'
print(pytesseract.image_to_string(img, config=config, lang='chi_sim'))

Output:

膠片很快沖出來了，他開始查看哪張值得放大洗成照片，
在第一張就發現了一件離奇的事。一個倒計時。倒計時從
 小時開始，到現在還剩余 小時。

這張拍的是一個大商場外的一小片草地，他看到底片正中
有一行白色的東西，

細看是一排數字:  :  :

第二張底片上也有數字: l]:  :  -。

第三張: l : : lg，

第四張:  :  :  ，

第五張: ] :  : l;

第六張:  : : l，

第七張: l : o : g ;

第八張: lg :  :  ;

第三十四張: : :

第三十六張，也是最后一張:  :  :

Format Conversion

pytesseract can convert an image to PDF, hOCR, and ALTO XML.

pdf = pytesseract.image_to_pdf_or_hocr('testimg2.png', extension='pdf')
with open('test.pdf', 'w+b') as f:
    f.write(pdf)
    
hocr = pytesseract.image_to_pdf_or_hocr('testimg2.png', extension='hocr')
xml = pytesseract.image_to_alto_xml('testimg2.png')
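
The hOCR result is an HTML document (returned as bytes) in which each word carries its bounding box and confidence. The sketch below pulls those out with a regular expression. It assumes word elements of the form <span class='ocrx_word' ... title='bbox x0 y0 x1 y1; x_wconf N'>text</span>; that layout is an assumption about the markup, so check it against your own output. hocr_words is a hypothetical helper name, and a proper HTML parser would be more robust for real use.

```python
import re

# Assumed word markup:
# <span class='ocrx_word' ... title='bbox x0 y0 x1 y1; x_wconf N'>text</span>
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+); x_wconf (\d+)['\"][^>]*>(.*?)</span>",
    re.S,
)


def hocr_words(hocr):
    """Return (text, (x0, y0, x1, y1), confidence) for each word in an hOCR document."""
    if isinstance(hocr, bytes):
        hocr = hocr.decode("utf-8")
    words = []
    for x0, y0, x1, y1, conf, inner in WORD_RE.findall(hocr):
        text = re.sub(r"<[^>]+>", "", inner).strip()  # drop nested tags such as <strong>
        words.append((text, (int(x0), int(y0), int(x1), int(y1)), int(conf)))
    return words
```

Passing the hocr value from the snippet above to hocr_words() would return one tuple per recognized word, provided the markup matches the assumed form.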
