Python CAPTCHA Recognition

Reposted article, 2020-04-02 11:12, Python

Setting up the CAPTCHA recognition environment

Installing Tesseract

Tesserocr is an OCR library for Python, but it is really a Python API wrapper around Tesseract. Its core is Tesseract, so Tesseract has to be installed before Tesserocr.

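Download page (Windows installers hosted by UB Mannheim): https://digi.bib.uni-mannheim.de/tesseract/

Choose a version:

Version 4.0.0 is used here because, as of this writing (2020-02-28), it is the newest version supported by the corresponding Python library.

For details, check the versions listed at https://github.com/simonflueckiger/tesserocr-windows_build/releases; the Tesseract version that each tesserocr build supports is shown in parentheses.

During installation you can tick the additional language data, but this makes the whole process much slower.

After installation, set the environment variable: add C:\Program Files\Tesseract-OCR to Path (use your own install path).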

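To confirm that the variable is set correctly, open a new command prompt and run tesseract --version; it should print the installed version.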
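The same check can be scripted. The sketch below uses only the standard library to look for the executable on PATH and print its version line. The function names are hypothetical, and it assumes the executable is called tesseract (tesseract.exe on Windows).

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable on PATH, or None."""
    return shutil.which("tesseract")


def tesseract_version(executable: str) -> str:
    """Run `tesseract --version` and return the first line it prints."""
    completed = subprocess.run(
        [executable, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some builds print the version to stderr
        universal_newlines=True,
    )
    lines = completed.stdout.splitlines()
    return lines[0] if lines else ""


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH; re-check the environment variable")
else:
    print(path, "->", tesseract_version(path))
```

If the script reports that nothing was found, the Path entry most likely points at the wrong folder or the terminal was opened before the variable was changed.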

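Installing Tesserocr

Install directly with pip:

pip install tesserocr pillow

If that fails, try the following:

Download a tesserocr wheel (.whl) file and install it.

A .whl file is essentially a zip archive that contains .py files along with compiled .pyd files.

URL: https://github.com/simonflueckiger/tesserocr-windows_build/releases

Check which tags the local Python supports:

Create a file test2.py and run it: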

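# pip._internal is a private API and may change between pip releases;
# this call worked with the pip version used here.
import pip._internal.pep425tags

print(pip._internal.pep425tags.get_supported())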

Output:

[('cp37', 'cp37m', 'win_amd64'), ('cp37', 'none', 'win_amd64'), ('py3', 'none', 'win_amd64'), ('cp37', 'none', 'any'), ('cp3', 'none', 'any'), ('py37', 'none', 'any'), ('py3', 'none', 'any'), ('py36', 'none', 'any'), ('py35', 'none', 'any'), ('py34', 'none', 'any'), ('py33', 'none', 'any'), ('py32', 'none', 'any'), ('py31', 'none', 'any'), ('py30', 'none', 'any')]

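The first entry is the most specific match, so the tags to look for are 'cp37', 'cp37m', 'win_amd64': CPython 3.7, the cp37m ABI, and 64-bit Windows.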
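The interpreter and platform parts of that tag can also be approximated from the standard library, without touching pip's internals. This is a minimal sketch that assumes a CPython interpreter; the helper names are hypothetical, and it does not reproduce pip's full list of supported tags.

```python
import sys
import sysconfig


def interpreter_tag() -> str:
    """Build the CPython interpreter tag, e.g. 'cp37' for CPython 3.7."""
    return "cp{}{}".format(sys.version_info.major, sys.version_info.minor)


def platform_tag() -> str:
    """Normalize sysconfig's platform string the way wheel filenames do."""
    return sysconfig.get_platform().replace("-", "_").replace(".", "_")


print(interpreter_tag(), platform_tag())
```

On 64-bit Windows with CPython 3.7 this is expected to print cp37 win_amd64, which matches the first and third fields of the tuple above.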

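On the releases page, find the wheel whose filename matches these tags.

After downloading, install the .whl file with pip (use your own path):

pip install C:\tesserocr-2.4.0-cp37-cp37m-win_amd64.whl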
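To see how the filename encodes those tags, the sketch below splits a wheel filename into its Python, ABI, and platform fields. The names WheelTags and wheel_tags are hypothetical; the code relies only on the standard wheel naming layout, in which the last three dash-separated fields are the tags.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    python: str
    abi: str
    platform: str


def wheel_tags(filename: str) -> WheelTags:
    """Extract the last three dash-separated fields of a wheel filename."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[: -len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError("unexpected wheel filename: " + filename)
    return WheelTags(*parts[-3:])


tags = wheel_tags("tesserocr-2.4.0-cp37-cp37m-win_amd64.whl")
print(tags)  # WheelTags(python='cp37', abi='cp37m', platform='win_amd64')
```

Comparing this result with the tuple printed by test2.py is a quick way to tell whether a downloaded wheel fits the local interpreter before running pip.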

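Writing the code

Parsing CAPTCHAs

First install the dependency:

pip install pillow

If the installation fails, upgrade pip:

python -m pip install --upgrade pip

Then run the install command again.

Recognizing a CAPTCHA with tesserocr

Find a CAPTCHA image (test.jpg):

Parse the CAPTCHA (test3.py):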

import tesserocr
from PIL import Image

image = Image.open('test.jpg')
image.show()  # opens the image in the default viewer for a preview
print(tesserocr.image_to_text(image))

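If the following error is raised at runtime:

Failed to init API, possibly an invalid tessdata path: C:\Users\XXXXX\AppData\Local\Programs\Python\Python37\/tessdata/

copy the tessdata folder from the Tesseract installation directory into Python's root directory, that is, the directory shown in the error message.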
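An alternative to copying the folder is to tell tesserocr where tessdata already lives. The sketch below is hedged: find_tessdata and recognize are hypothetical helpers, the candidate directory is only the install path assumed earlier, and the path keyword of tesserocr.image_to_text is used as documented in the tesserocr README, so check it against the installed version.

```python
import os
from typing import Iterable, Optional


def find_tessdata(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate directory that holds *.traineddata files."""
    for directory in candidates:
        if not os.path.isdir(directory):
            continue
        if any(name.endswith(".traineddata") for name in os.listdir(directory)):
            return directory
    return None


def recognize(image_path: str, tessdata_dir: str) -> str:
    """OCR an image with tesserocr, passing the tessdata location explicitly."""
    # Imported lazily so the lookup helper works without tesserocr installed.
    import tesserocr
    from PIL import Image

    return tesserocr.image_to_text(Image.open(image_path), path=tessdata_dir)


# Assumed install location from the Path step; adjust to your machine.
tessdata = find_tessdata([r"C:\Program Files\Tesseract-OCR\tessdata"])
print(tessdata)
```

When a directory is found, recognize('test.jpg', tessdata) would run the same recognition as test3.py without duplicating the language data.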

Recognizing a CAPTCHA with pytesseract

The example above uses tesserocr.image_to_text(), but its recognition rate was low, so pytesseract is recommended instead. pytesseract is a wrapper library built on Tesseract-OCR that gives better recognition results.

Official description: Python-tesseract is a wrapper for Google's Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others.

First install pytesseract:

pip install pytesseract

Use pytesseract's image_to_string() function:

from PIL import Image
import pytesseract

# Digits only; --psm 10 treats the image as a single character
config = '--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789'
result = pytesseract.image_to_string(Image.open("test.jpg"), lang='eng', config=config)
print(result)

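lang is the language to recognize.
psm (page segmentation mode) is an important setting for CAPTCHA recognition: it tells Tesseract how the text is laid out in the image, and choosing the right value can noticeably raise the success rate (the allowed values from the official documentation are listed below).
oem selects the OCR engine mode. The official examples use 3, which is the default and picks the engine based on what is available.
tessedit_char_whitelist is a whitelist that restricts the recognized result to the listed characters (in testing, its effect was limited).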
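Because the config string is easy to mistype, the options can be assembled by a small helper. This is a minimal sketch; build_config is a hypothetical name, and the range checks assume the 0 to 13 psm values listed in this article and the 0 to 3 oem values shown by tesseract's help output.

```python
from typing import Optional


def build_config(psm: int = 10, oem: int = 3, whitelist: Optional[str] = None) -> str:
    """Assemble the Tesseract options that pytesseract passes via `config`."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    options = ["--psm {}".format(psm), "--oem {}".format(oem)]
    if whitelist:
        # Restrict the output alphabet, e.g. to digits for numeric CAPTCHAs.
        options.append("-c tessedit_char_whitelist={}".format(whitelist))
    return " ".join(options)


print(build_config(whitelist="0123456789"))
# --psm 10 --oem 3 -c tessedit_char_whitelist=0123456789
```

The returned string can be passed directly as the config argument of image_to_string().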

psm values:

Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
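Since the best mode depends on the image, one practical approach is to run the same CAPTCHA through a few modes and compare the results. The sketch below is an illustration, not part of the original workflow: both function names are hypothetical, and the choice of modes 7, 8, and 10 (single line, single word, single character) is an assumption about what suits short CAPTCHAs.

```python
from typing import Callable, Dict, Iterable


def try_psm_modes(
    recognize: Callable[[str], str],
    modes: Iterable[int] = (7, 8, 10),
) -> Dict[int, str]:
    """Call `recognize` once per psm value and collect the stripped results."""
    results = {}
    for psm in modes:
        results[psm] = recognize("--psm {}".format(psm)).strip()
    return results


def recognize_test_jpg(config: str) -> str:
    """Run pytesseract on test.jpg with the given config string."""
    # Imported lazily so try_psm_modes can be exercised without pytesseract.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open("test.jpg"), lang="eng", config=config)
```

Calling try_psm_modes(recognize_test_jpg) returns a dictionary keyed by psm value, which makes it easy to see which mode reads a given CAPTCHA style most reliably.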
