Python crawler: basic use of Baidu OCR to read CAPTCHAs

Reposted article · 2018-07-24 18:51 · python crawler

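First, register a Baidu Cloud account:

On the home page, find Image Recognition, create an application, select the features you need, and create it.

Install the SDK module: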

pip install baidu-aip

Simple recognition, method 1:

A simple image CAPTCHA:

Image:

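from aip import AipOcr

# Your APP_ID, API key (AK) and secret key (SK)
APP_ID = 'your APP_ID'
API_KEY = 'your AK'
SECRET_KEY = 'your SK'

client = AipOcr(APP_ID, API_KEY, SECRET_KEY)

# Read the image file as bytes
def get_file_content(filePath):
    with open(filePath, 'rb') as fp:
        return fp.read()

# The test file can also be given as a full path
image = get_file_content('test.jpg')

# Optional parameters
options = {
    # Detect the image orientation
    'detect_direction': 'true',
    # Recognition language; the default is 'CHN_ENG' (mixed Chinese and English)
    'language_type': 'CHN_ENG',
}

# Call the general text recognition API on the local image
# (a single call; the original also made a redundant call without options)
results = client.basicGeneral(image, options)
print(results)
# To print every recognized line instead:
# for word in results['words_result']:
#     print(word['words'])
try:
    code = results['words_result'][0]['words']
except (KeyError, IndexError):
    # No 'words_result' key (error response) or an empty result list
    code = 'CAPTCHA recognition failed'

print(code)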

Result:

{'log_id': **************, 'direction': 0, 'words_result_num': 1, 'words_result': [{'words': '526'}]}
526

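Response fields in detail:

The fields in the output mean the following:

log_id : unique log id, used to track down problems
direction : image orientation, reported when detect_direction is passed as true; 0 means upright, 1 means rotated 90 degrees counterclockwise, 2 means 180 degrees counterclockwise, 3 means 270 degrees counterclockwise, -1 means undefined
words_result_num : number of recognition results, i.e. the number of elements in words_result
words_result : array of recognized elements
words : the recognized string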
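
To make the field list concrete, here is a minimal sketch that reads those fields from a response dict shaped like the output above. The helper name `summarize_ocr_response` and the `DIRECTION_LABELS` table are illustrative names and not part of the baidu-aip SDK; the sample dict uses a placeholder `log_id` because the real one is masked in the output.

```python
# Direction codes as described in the field list above.
DIRECTION_LABELS = {
    -1: "undefined",
    0: "upright",
    1: "rotated 90 degrees counterclockwise",
    2: "rotated 180 degrees counterclockwise",
    3: "rotated 270 degrees counterclockwise",
}


def summarize_ocr_response(response):
    """Return (text, direction_label) from a response dict (hypothetical helper)."""
    # Join the 'words' of every element in 'words_result'; a missing key gives ''.
    words = [item.get("words", "") for item in response.get("words_result", [])]
    text = "".join(words)
    # 'direction' is only present when orientation detection was requested.
    direction = DIRECTION_LABELS.get(response.get("direction", -1), "undefined")
    return text, direction


sample = {
    "log_id": 0,  # placeholder value, not a real log id
    "direction": 0,
    "words_result_num": 1,
    "words_result": [{"words": "526"}],
}
print(summarize_ocr_response(sample))
```

Using `.get` with defaults means an error response without `words_result` yields an empty string rather than raising.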

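2. The result above is clearly not quite right. (Some of the code may differ from the official site, because some modules were replaced in Python 3.) So I kept searching Baidu's technical documentation and found another approach (the other features are called the same way).

The code is as follows: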

# Step 1: obtain a Baidu access_token
from urllib import request

# client_id is the AK from the Baidu console; client_secret is the SK
host = 'https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id=【your API_KEY】&client_secret=【your SECRET_KEY】'
req = request.Request(host)
req.add_header('Content-Type', 'application/json; charset=UTF-8')
response = request.urlopen(req)
content = response.read()
# Print the returned data, which contains the access_token used in step 2
if content:
    print(content)

# Step 2: recognize with General Text Recognition (high-accuracy version)
# Differs from parts of the code in Baidu's documentation: in Python 3, urllib2 and urllib were merged into urllib
import base64
from urllib import request, parse
import json

access_token = 'the token obtained in step 1'
url = 'https://aip.baidubce.com/rest/2.0/ocr/v1/accurate_basic?access_token=' + access_token
# Open the image file in binary mode; the image parameter is the base64-encoded image
with open(r'test.jpg', 'rb') as f:
    img = base64.b64encode(f.read())
params = {"image": img}
# URL-encode the image so it can be sent as form data
params = parse.urlencode(params)
# Build the request object
req = request.Request(url, params.encode('utf-8'))
# Add the request header
req.add_header('Content-Type', 'application/x-www-form-urlencoded')
# Send the request
response = request.urlopen(req)
# Read the response body and decode it
content = response.read().decode('utf-8')
# Convert the data to a dict
res = json.loads(content)

try:
    # Try to get the recognized text; if it is missing, recognition did not succeed
    code = res['words_result'][0]['words']
except (KeyError, IndexError):
    code = 'CAPTCHA recognition failed'

print(code)
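
Step 1 above only prints the raw token response, and step 2 expects the token to be pasted in by hand. As a sketch of how the two could be connected, the snippet below builds the token URL with `urllib.parse.urlencode` and pulls the token out of a JSON body. The helper names are illustrative, the `access_token` field name is assumed from the way step 2 uses the token, and `sample_body` is a made-up response, not real output.

```python
import json
from urllib import parse

# Same token endpoint as in step 1.
TOKEN_ENDPOINT = 'https://aip.baidubce.com/oauth/2.0/token'


def build_token_url(api_key, secret_key):
    """Build the step-1 URL, letting urlencode escape the key values."""
    query = parse.urlencode({
        'grant_type': 'client_credentials',
        'client_id': api_key,
        'client_secret': secret_key,
    })
    return TOKEN_ENDPOINT + '?' + query


def extract_access_token(raw_body):
    """Return the 'access_token' field of a JSON body, or None if it is absent."""
    data = json.loads(raw_body.decode('utf-8'))
    return data.get('access_token')


# Made-up body for illustration only; a real one comes from response.read() in step 1.
sample_body = b'{"access_token": "EXAMPLE_TOKEN", "expires_in": 2592000}'
print(build_token_url('my_ak', 'my_sk'))
print(extract_access_token(sample_body))
```

In the script above, `extract_access_token(content)` could then replace the hard-coded `access_token` string in step 2.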

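Result of the high-accuracy recognition script:

For especially complex images, the image has to be preprocessed before it can be recognized. This uses PIL and Tesseract-OCR: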

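from PIL import Image
import subprocess

def cleanFile(filePath, newFilePath):
    image = Image.open(filePath)
    # Convert to grayscale so the threshold sees one brightness value per pixel
    # (added here; without it the filter runs on each colour band separately)
    image = image.convert('L')
    # Apply a threshold filter to the image, then save it
    image = image.point(lambda x: 0 if x < 143 else 255)
    image.save(newFilePath)
    # Call the system tesseract command to OCR the image.
    # Use an absolute path, or add the program to the system PATH.
    subprocess.call(['C:\\Program Files (x86)\\Tesseract-OCR\\tesseract', newFilePath, "output"])
    # Open the output file and read the result
    with open("output.txt", 'r') as f:
        print(f.read())

cleanFile("text.jpg", "text2clean.png")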

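However, for reasons I have not worked out, the recognition rate is very low. Advice from anyone more experienced is welcome.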
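
When tuning the preprocessing, it can help to see exactly what the threshold filter does to pixel values. The sketch below applies the same rule as the lambda in `cleanFile` (below 143 becomes 0, everything else becomes 255) to a small grid of made-up grayscale values, without needing Pillow. The names `binarize` and `binarize_grid` are illustrative, and the cut-off 143 is simply the value used above; a different CAPTCHA may well need a different one.

```python
THRESHOLD = 143  # same cut-off as the lambda in cleanFile


def binarize(value, threshold=THRESHOLD):
    """Map one grayscale value to black (0) or white (255)."""
    return 0 if value < threshold else 255


def binarize_grid(grid, threshold=THRESHOLD):
    """Apply the threshold to every value in a 2-D list of grayscale values."""
    return [[binarize(v, threshold) for v in row] for row in grid]


# Made-up pixel values around the cut-off, for illustration only.
sample = [
    [12, 140, 143],
    [200, 142, 255],
]
print(binarize_grid(sample))
# Lowering the threshold keeps fewer pixels black.
print(binarize_grid(sample, threshold=100))
```

Printing the grid for a few thresholds shows which faint strokes survive as black pixels and which are washed out to white.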

Recognition through Yundama (雲打碼) gives a much higher recognition rate, but at a price: it is a paid service, though a fairly cheap one.

　　https://www.cnblogs.com/mswei/p/9392530.html
