Python 3 web scraping and captcha recognition: using selenium to recognize a click captcha and submit it automatically (source code included)

Reposted article, dated 2019-08-19 20:45. Category: Python

https://aq.yy.com/p/reg/account.do?appid=&url=&fromadv=udbclsd_r

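This is the YY Voice registration page. Filling in the account, password, and repeat-password fields and clicking the submit button is easy with selenium, so those parts are not explained here

This article only covers how to recognize the characters in the image inside the green box and click them correctly with the mouse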

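Approach:

1. Download the image in the green box to the local machine with a crawler

2. Use a third-party service (this article uses Chaojiying) to recognize the characters in the image and return the coordinates of each character

3. Click at those coordinates with the mouse

Put that way it sounds simple. Let's go through it step by step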

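1. Get the image. There is a pitfall here!

The image the crawler downloads looks like this: it appears to have been cut into slices and shuffled, and presumably some algorithm can put the slices back in order

But that algorithm is hard to find, and it would not necessarily carry over to other kinds of sites

So we take a different approach: since the original image cannot be obtained, take a screenshot instead!

How do we take the screenshot? First locate the element and find the coordinates of its four corners

Locating the element is simple: use selenium to find it by class name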

image_element = browser.find_element(By.CLASS_NAME, 'pw_subBg')  # By is selenium.webdriver.common.by.By

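The next problem is getting the (x, y) coordinates of the element's four corners, or other values that pin down its position

The location attribute gives the coordinates of the element's top-left corner; the size attribute gives its width and height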

location = image_element.location
size = image_element.size
print(location,size)

Output: {'x': 108.5625, 'y': 295} {'height': 128, 'width': 272}

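How are the crop edges calculated? A coordinate diagram makes it clear

The four values needed are top, bottom, left, and right: top = y, bottom = y + height, left = x, right = x + width, where (x, y) is the top-left corner from location and width and height come from size. If that is still unclear, there is not much more I can add...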
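The crop box follows directly from location and size. A minimal sketch, using the values printed earlier; the helper name `crop_box` is hypothetical, and the tuple is ordered (left, top, right, bottom) because that is the order PIL's `Image.crop` takes:

```python
def crop_box(location, size):
    """Return (left, top, right, bottom) for an element's rectangle."""
    left = location['x']
    top = location['y']
    right = left + size['width']
    bottom = top + size['height']
    return (left, top, right, bottom)


# Values printed earlier in the article
box = crop_box({'x': 108.5625, 'y': 295}, {'height': 128, 'width': 272})
print(box)  # (108.5625, 295, 380.5625, 423)
```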

With the coordinates in hand, take the screenshot, crop it, and save the image locally

from io import BytesIO
from PIL import Image

screenshot = browser.get_screenshot_as_png()  # PNG bytes of the whole window
screenshot = Image.open(BytesIO(screenshot))
captcha = screenshot.crop((left, top, right, bottom))
captcha.save('captcha1.png')

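Checking the saved image shows that the cropped region is off, so an offset has to be set by hand

I found these offsets roughly by trial and error and have not yet looked into a good way to measure them automatically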

# Offsets found by manual testing
top, bottom = location['y'] + 128, location['y'] + size['height'] + 128
left, right = location['x'] + 181, location['x'] + size['width'] + 181

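Corrected result: 423 551 289.5625 561.5625

According to a Baidu search, the size of this offset depends on the screen resolution, the browser, and whether headless mode is used, so the offsets may be different for everyone who runs the program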
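One possible explanation, offered as an assumption rather than something this article verified: the captcha sits inside an iframe (the full script switches into one), so location may be relative to the frame rather than to the full-window screenshot, and on high-DPI displays the screenshot may be larger than the CSS pixel size. Under that assumption the offset could be derived instead of guessed. In the sketch below `screenshot_box`, `frame_origin`, and `pixel_ratio` are hypothetical names; in practice the two inputs might come from the iframe element's own location (read before switching into it) and from `window.devicePixelRatio`.

```python
def screenshot_box(location, size, frame_origin=(0, 0), pixel_ratio=1.0):
    """Map an element's rectangle to screenshot pixels.

    frame_origin: (x, y) of the enclosing iframe in the page (assumption).
    pixel_ratio: screenshot pixels per CSS pixel (assumption).
    """
    left = (location['x'] + frame_origin[0]) * pixel_ratio
    top = (location['y'] + frame_origin[1]) * pixel_ratio
    right = left + size['width'] * pixel_ratio
    bottom = top + size['height'] * pixel_ratio
    return (left, top, right, bottom)


# Treating the hand-tuned offsets (181, 128) as the frame origin
# reproduces the corrected values from the article.
box = screenshot_box({'x': 108.5625, 'y': 295},
                     {'height': 128, 'width': 272},
                     frame_origin=(181, 128))
print(box)  # (289.5625, 423.0, 561.5625, 551.0)
```

Whether this matches a given machine still has to be checked by looking at the cropped image.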

The cropped result is shown below (the page is refreshed on every run, so the captcha in this example may differ):

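2. Call a third-party platform to recognize the Chinese characters and return their coordinates

This assumes you have already downloaded the Chaojiying demo and got it working. See my article 《》 for details; they are not repeated here

One thing does need to change, the recognition type: set it to a 910x type, which returns the coordinates of the Chinese characters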

bytes_array = BytesIO()
captcha.save(bytes_array, format='PNG')  # write the cropped image into memory as PNG
chaojiying = Chaojiying('account', 'password', 'software id')
result = chaojiying.PostPic(bytes_array.getvalue(), 9103)
print(result)

Result: {'err_no': 0, 'err_str': 'OK', 'pic_id': '2077320412830000020', 'pic_str': '60,19|109,16|187,92', 'md5': 'c3d41675003cd44058347e591cf405e7'}

Look at the pic_str field: it contains 3 coordinate pairs separated by |, so the string needs to be parsed

locations = result.get('pic_str').split('|')
for i in locations:
    location = i.split(',')
    print(location)
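The same parsing can be wrapped in a small function that returns integer pairs, which is easier to reuse for clicking. A sketch, with `parse_points` as a hypothetical helper name and the input format taken from the pic_str value shown above:

```python
def parse_points(pic_str):
    """Split 'x1,y1|x2,y2|...' into a list of (x, y) integer tuples."""
    if not pic_str:
        return []  # nothing recognized
    points = []
    for pair in pic_str.split('|'):
        x, y = pair.split(',')
        points.append((int(x), int(y)))
    return points


print(parse_points('60,19|109,16|187,92'))
# [(60, 19), (109, 16), (187, 92)]
```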

3. Use the coordinates to automate the clicks

ActionChains(browser).move_to_element_with_offset(image_element,int(location[0]),int(location[1])).click().perform()
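The call above belongs inside the loop over locations so that every point gets clicked. It also passes the coordinates straight through, which assumes offsets are measured from the image's top-left corner, as in the Selenium 3 releases current when this was written. To my understanding, newer Selenium 4 releases document the offset of `move_to_element_with_offset` as relative to the element's center; if that applies to your version, the coordinates would need converting first. A small sketch of that conversion (`to_center_offset` is a hypothetical helper):

```python
def to_center_offset(x, y, width, height):
    """Convert a top-left-based point to an offset from the element's center."""
    return (x - width // 2, y - height // 2)


# First point returned for the 272x128 captcha image
print(to_center_offset(60, 19, 272, 128))  # (-76, -45)
```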

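Of course, that alone does not finish the job

Registration also requires recognizing the characters in the small image at the top right and matching them against the characters in the large image: click the ones that match and leave the others alone

The catch is that Chaojiying returns either the recognized text or the coordinates, never both at once. So do the two results line up one to one? Let's check (to save points, using a.jpg):

First run: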

r1 = chaojiying.PostPic(im, 1902)  # returns the recognized text
r2 = chaojiying.PostPic(im, 9004)  # returns coordinates

Result:

r1 = {'err_no': 0, 'err_str': 'OK', 'pic_id': '3077320552830000021', 'pic_str': '7261', 'md5': '265c70b7f6d88426fa2a77a06f450972'}
r2 = {'err_no': 0, 'err_str': 'OK', 'pic_id': '2077320552830000022', 'pic_str': '11,22|28,20|47,21|71,21', 'md5': '82948c8bf04e521

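If the digits or Chinese characters to click are also shown as an image, they are recognized the same way as the large image; if they are given as plain text, the crawler can read them directly. I skip that step here and assume the digits to click are 2 and 1 (in any order)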

def vs(list0, dict1, dict2):
    # list0: characters to click; dict1: text result; dict2: coordinate result
    list1 = list(dict1.get('pic_str'))
    list2 = dict2.get('pic_str').split('|')
    print(list1)
    print(list2)
    # Pair the n-th recognized character with the n-th coordinate
    dd = dict(zip(list1, list2))
    print(dd)
    for i in list0:
        if i in dd:
            x, y = dd.get(i).split(',')
            print(int(x), int(y))
            # call the click routine here

Output of the call:

28 20
71 21

That gives exactly the coordinates of the digits to click
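The function above only prints the coordinates. Here is a variant that returns them, as a sketch under the same assumption that the n-th recognized character corresponds to the n-th coordinate (`match_targets` is a hypothetical name). It pairs characters and coordinates by position instead of through a dict, so when a character appears twice the first occurrence is used rather than the last:

```python
def match_targets(targets, chars, coords):
    """Return the (x, y) of each target character found in chars.

    chars:  recognized text, e.g. '7261'
    coords: coordinate string, e.g. '11,22|28,20|47,21|71,21'
    """
    pairs = list(zip(chars, coords.split('|')))
    hits = []
    for target in targets:
        for char, xy in pairs:
            if char == target:
                x, y = xy.split(',')
                hits.append((int(x), int(y)))
                break  # use the first occurrence only
    return hits


print(match_targets(['2', '1'], '7261', '11,22|28,20|47,21|71,21'))
# [(28, 20), (71, 21)]
```

With the r1 and r2 values above, this reproduces the two coordinate pairs printed by the article's function.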

Full source code

import time
from io import BytesIO
from PIL import Image
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from chaojiying import Chaojiying


# Initial values

EMAIL = 'diaongaodsing'
PASSWORD = 'mindomg301415'
REPASSWORD = PASSWORD

# Chaojiying user name, password, software ID, and captcha types to recognize
CHAOJIYING_USERNAME = 'username'
CHAOJIYING_PASSWORD = 'password'
CHAOJIYING_SOFT_ID = 'software ID'
CHAOJIYING_KIND_XY = 9103  # returns coordinates
CHAOJIYING_KIND = 2003  # returns digits, letters, or Chinese characters


class CrackTouClick():

    def __init__(self):
        """
        Set up the target URL, browser, explicit wait, form values, and Chaojiying client
        """
        self.url = 'https://aq.yy.com/p/reg/account.do?appid=&url=&fromadv=udbclsd_r'  # page to crawl
        self.browser = webdriver.Chrome()
        self.wait = WebDriverWait(self.browser, 20)
        self.input = EMAIL
        self.password = PASSWORD
        self.repassword = REPASSWORD
        self.chaojiying = Chaojiying(CHAOJIYING_USERNAME, CHAOJIYING_PASSWORD, CHAOJIYING_SOFT_ID)

    # def __del__(self):
    #     """
    #     Destructor: close the browser
    #     """
    #     self.browser.close()

    def login(self):
        """
        Open the page and enter the user name, password, and repeated password
        :return: None
        """
        self.browser.get(self.url)
        # The registration form lives inside the first iframe on the page
        iframe = self.browser.find_elements(By.TAG_NAME, "iframe")[0]
        self.browser.switch_to.frame(iframe)

        input = self.browser.find_element(By.XPATH, '//*[@id="m_mainForm"]/div[2]/div[1]/span[2]/input')
        password = self.browser.find_element(By.XPATH, '//*[@id="m_mainForm"]/div[2]/div[2]/span[2]/input')
        repassword = self.browser.find_element(By.XPATH, '//*[@id="m_mainForm"]/div[2]/div[3]/span[2]/input')

        input.send_keys(self.input)
        password.send_keys(self.password)
        repassword.send_keys(self.repassword)
        # Click somewhere else on the page, otherwise the inputs are not validated
        self.browser.find_element(By.CLASS_NAME, 'field_title').click()

    def get_image_element(self):
        """
        Get the captcha image elements
        :return: (small image element, large image element)
        """
        image_element_b = self.browser.find_element(By.CLASS_NAME, 'pw_subBg')  # large image
        image_element_s = self.browser.find_element(By.CLASS_NAME, 'pw_expic')  # small image
        return image_element_s, image_element_b

    def get_position(self, image_element):
        """
        Get the position of the captcha
        :return: captcha position as (top, bottom, left, right)
        """
        time.sleep(2)
        location = image_element.location
        size = image_element.size
        # 128 and 181 are hand-tuned offsets; adjust them for your own setup
        top = location['y'] + 128
        bottom = location['y'] + size['height'] + 128
        left = location['x'] + 181
        right = location['x'] + size['width'] + 181
        return (top, bottom, left, right)

    def get_screen_image(self, image_element, name):
        """
        Take a screenshot and crop out the captcha
        :return: image object
        """
        top, bottom, left, right = self.get_position(image_element)
        # print('image position', top, bottom, left, right)
        screenshot = self.browser.get_screenshot_as_png()
        screenshot = Image.open(BytesIO(screenshot))
        captcha = screenshot.crop((left, top, right, bottom))
        captcha.save(str(name) + '.png')
        return captcha

    def get_recognation_result(self, image_element, chaojiying_kind):
        """
        Recognize the image with the third-party platform Chaojiying and return the result
        :return: <dict> recognition result (the field needed is 'pic_str')
        """
        image = self.get_screen_image(image_element, chaojiying_kind)
        bytes_array = BytesIO()
        image.save(bytes_array, format='PNG')
        recognation_result = self.chaojiying.PostPic(bytes_array.getvalue(), chaojiying_kind)
        print('Recognition result:', recognation_result)
        return recognation_result

    # def click_points(self, image_element, recognation_result):
    #     """
    #     Parse the recognition result and click each point
    #     :param recognation_result: <dict> third-party recognition result
    #     :return: None
    #     """
    #     locations = recognation_result.get('pic_str').split('|')
    #     for i in locations:
    #         location = i.split(',')
    #         # print(location)
    #         ActionChains(self.browser).move_to_element_with_offset(image_element, int(location[0]), int(location[1])).click().perform()
    #         # print('ok')

    def vs(self, image_element, res_s, res_b, res_b_xy):
        """
        Compare the results, pick the characters to click and their coordinates, and click them
        :param image_element: the large image element to click on
        :param res_s: <dict> small image recognition result (characters)
        :param res_b: <dict> large image recognition result (characters)
        :param res_b_xy: <dict> large image recognition result (coordinates)
        :return: None
        """
        list_res_s = list(res_s.get('pic_str'))
        list_res_b = list(res_b.get('pic_str'))
        list_res_b_xy = res_b_xy.get('pic_str').split('|')
        # print(list_res_s)
        # print(list_res_b)
        # print(list_res_b_xy)
        # Pair the n-th character of the large image with the n-th coordinate
        dic_res_b = dict(zip(list_res_b, list_res_b_xy))
        print('Character-to-coordinate map:', dic_res_b)
        for i in list_res_s:
            if i in dic_res_b:
                x, y = dic_res_b.get(i).split(',')
                # print(int(x), int(y))
                # x, y are measured from the top-left corner of the cropped image
                ActionChains(self.browser).move_to_element_with_offset(image_element, int(x), int(y)).click().perform()

    def verify_info(self):
        """
        Check whether the user name, password, and repeated password pass validation
        :return: <bool> True if all three pass, otherwise False
        """
        try:
            # Each valid field shows a success icon; fewer than three raises IndexError
            input_v = self.browser.find_elements(By.CLASS_NAME, 'icon_suc')[0]
            password_v = self.browser.find_elements(By.CLASS_NAME, 'icon_suc')[1]
            repassword_v = self.browser.find_elements(By.CLASS_NAME, 'icon_suc')[2]
            print('Registration details are valid!')
            return True
        except IndexError:
            print('Registration details are invalid!')
            return False

    def verify_recognation(self):
        """
        Check whether the captcha was solved correctly
        :return: <bool> True if correct, otherwise False
        """
        try:
            # '驗證成功' is the success text shown on the page ("verification succeeded")
            self.wait.until(EC.text_to_be_present_in_element((By.CLASS_NAME, 'done_text'), '驗證成功'))
            # Use text_to_be_present_in_element rather than find_element, because the element is always present
            print('Captcha correct!')
            return True
        except TimeoutException:
            print('Captcha wrong!')
            return False

    def get_verify_button(self):
        """
        Get the captcha "submit" button and click it
        :return: None
        """
        verify_button = self.wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'pw_submit')))
        verify_button.click()

    def get_login_button(self):
        """
        Get the "agree and register account" button and click it
        :return: None
        """
        submit_button = self.wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="m_mainForm"]/div[2]/div[7]/a/span')))
        submit_button.click()
        print('Login succeeded')

    def crack(self):
        """
        Entry point
        :return: None
        """
        self.login()  # open the page and fill in the form

        image_element_s = self.get_image_element()[0]  # small image
        res_s = self.get_recognation_result(image_element_s, 2003)  # third-party recognition result

        image_element_b = self.get_image_element()[1]  # large image
        res_b = self.get_recognation_result(image_element_b, 2006)  # third-party recognition result

        res_b_xy = self.get_recognation_result(image_element_b, 9008)  # third-party recognition result (coordinates)

        self.vs(image_element_b, res_s, res_b, res_b_xy)

        self.get_verify_button()  # click the "verify" button
        if self.verify_info() and self.verify_recognation():
            # If the details are valid and the captcha is correct, click the "register" button
            time.sleep(2)
            self.get_login_button()


if __name__ == '__main__':
    crack = CrackTouClick()
    crack.crack()

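After a few test runs my points ran out... so I hurried to top up. Fortunately 1 yuan buys 1000 points

Only one of several attempts was completely correct, even though recognizing any single image on its own works fine. Was I refreshing too fast? Because the characters and their coordinates cannot be fetched in one request, the two responses may well contain different numbers of items!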
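Since the character result and the coordinate result come from separate requests, one cautious option is to check that both succeeded and returned the same number of items before clicking, and to refresh the captcha otherwise. A sketch (`results_consistent` is a hypothetical helper; the field names are the ones shown in the result dictionaries above):

```python
def results_consistent(res_chars, res_xy):
    """True if both recognitions succeeded and returned equally many items."""
    if res_chars.get('err_no') != 0 or res_xy.get('err_no') != 0:
        return False
    chars = res_chars.get('pic_str', '')
    coords = res_xy.get('pic_str', '')
    if not chars or not coords:
        return False
    return len(chars) == len(coords.split('|'))


r1 = {'err_no': 0, 'pic_str': '7261'}
r2 = {'err_no': 0, 'pic_str': '11,22|28,20|47,21|71,21'}
print(results_consistent(r1, r2))  # True
print(results_consistent(r1, {'err_no': 0, 'pic_str': '11,22|28,20'}))  # False
```

A matching count does not prove the order matches, so this only filters out the obviously inconsistent cases.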
