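Automated data scraping with selenium
For more on what the selenium automation module can do, see the posts in the Testing category of my blog.
selenium
- Concept: a module for browser automation
- Automation: you describe a sequence of actions in code, and they are then carried out in the browser.
- Install: pip install selenium
- How selenium relates to crawlers
  - 1. It conveniently captures data that is loaded dynamically in any form (what you see is what you get)
  - 2. It can simulate a login
- Chrome driver download: http://chromedriver.storage.googleapis.com/index.html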
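# Note: executable_path and find_element_by_* are Selenium 3 APIs
from time import sleep
from selenium import webdriver

# 1. Instantiate a browser object from the browser's driver program
bro = webdriver.Chrome(executable_path='./chromedriver')
# Send a request to the target site
bro.get('https://www.jd.com/')
# Locate elements
search_text = bro.find_element_by_xpath('//*[@id="key"]')
search_text.send_keys('iphoneX')  # type text into the element
btn = bro.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
btn.click()
sleep(2)
# Scroll to the bottom of the results page (run JavaScript: JS injection)
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
sleep(2)
bro.quit()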
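- Example: the drug regulator's licensing site (National Medical Products Administration): http://125.35.6.84:81/xk/
  - Scrape every company name on the first three pages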
from time import sleep
from lxml import etree
from selenium import webdriver

url = 'http://125.35.6.84:81/xk/'
bro = webdriver.Chrome(executable_path='./chromedriver')
bro.get(url)
page_text_list = []  # page source of each page
sleep(1)
# Capture the page source of the current page
page_text = bro.page_source  # everything present once the page has finished loading
page_text_list.append(page_text)
# Click "next page" twice to reach pages 2 and 3
for i in range(2):
    next_page = bro.find_element_by_xpath('//*[@id="pageIto_next"]')
    next_page.click()
    sleep(1)
    page_text_list.append(bro.page_source)
# Parse the company names out of each page's source
for page_text in page_text_list:
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="gzlist"]/li')
    for li in li_list:
        name = li.xpath('./dl/@title')[0]
        print(name)
sleep(2)
bro.quit()
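The paging loop above can be factored into a small helper. The sketch below is an illustration, not the code used in this post: `collect_page_sources` is a hypothetical name, and it assumes only that the driver object exposes `page_source` and `find_element(by, value)`, as Selenium drivers do (the string "xpath" is the value of `By.XPATH`). A stand-in `FakeDriver` replaces the real browser so the logic can be tried offline.

```python
from time import sleep


def collect_page_sources(driver, next_xpath, total_pages, pause=1.0):
    """Return the page source of the first `total_pages` pages.

    Hypothetical helper: assumes `driver` has a `page_source` attribute and a
    `find_element(by, value)` method, as Selenium WebDriver objects do.
    """
    sources = [driver.page_source]      # page 1 is already loaded
    for _ in range(total_pages - 1):    # one click per remaining page
        driver.find_element("xpath", next_xpath).click()
        sleep(pause)                    # crude wait for the next page to render
        sources.append(driver.page_source)
    return sources


class FakeDriver:
    """Stand-in for a browser, used only to exercise the helper offline."""

    def __init__(self):
        self.page = 1

    @property
    def page_source(self):
        return "<html>page %d</html>" % self.page

    def find_element(self, by, value):
        return self                     # the fake acts as its own "next" button

    def click(self):
        self.page += 1


pages = collect_page_sources(FakeDriver(), '//*[@id="pageIto_next"]', 3, pause=0)
print(pages)
```

With a real driver, `bro` would be passed in place of `FakeDriver()` and the returned list fed to the parsing loop.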
Action chains
from time import sleep
from selenium import webdriver
from selenium.webdriver import ActionChains

url = 'https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
bro = webdriver.Chrome(executable_path='./chromedriver')
bro.get(url)
sleep(1)
# An element that lives inside an iframe cannot be located by the find_* functions
# Fix: switch into the frame first with switch_to
bro.switch_to.frame('iframeResult')
div_tag = bro.find_element_by_xpath('//*[@id="draggable"]')
# Drag div_tag
action = ActionChains(bro)
action.click_and_hold(div_tag)  # press and hold
for i in range(6):
    # perform() makes the action chain execute immediately
    action.move_by_offset(10,15).perform()
    sleep(0.5)
bro.quit()
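How to keep selenium from being detected
- Some sites check whether a request was issued by selenium and, if so, make that request fail.
- How to avoid detection:
  - Have selenium take over an already running Chrome browser
Steps
- 1. Find the directory where the Chrome program (chrome.exe) is installed on your machine and add that directory to the PATH environment variable.
- 2. Open cmd and run this command:
  - chrome.exe --remote-debugging-port=9222 --user-data-dir="path to an empty folder"
  - Once the command has run, the Chrome installed on your machine opens.
- 3. Run the following code; it takes over the real browser opened in step 2.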
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# Attach to the Chrome instance listening on the debugging port from step 2
chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
# Path to the Chrome driver installed on this machine (raw string keeps the backslashes literal)
chrome_driver = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"
driver = webdriver.Chrome(executable_path=chrome_driver, options=chrome_options)
print(driver.title)
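Step 2 can also be scripted instead of typed into cmd. This is a minimal sketch, assuming `chrome.exe` is on PATH as set up in step 1. The helper names `build_chrome_command` and `launch_chrome` and the folder `C:\chrome_profile` are hypothetical and only serve the illustration. The example assembles the command and prints it; actually launching the browser is left to `launch_chrome`, which is defined but not called here.

```python
import subprocess


def build_chrome_command(port, user_data_dir, chrome="chrome.exe"):
    """Assemble the argument list for the command shown in step 2."""
    return [
        chrome,
        "--remote-debugging-port=%d" % port,
        "--user-data-dir=%s" % user_data_dir,
    ]


def launch_chrome(port, user_data_dir):
    """Start Chrome with remote debugging enabled (needs Chrome on PATH)."""
    return subprocess.Popen(build_chrome_command(port, user_data_dir))


# Hypothetical empty profile folder; any empty directory would do
cmd = build_chrome_command(9222, r"C:\chrome_profile")
print(cmd)
```

The port passed here has to match the one given as `debuggerAddress` in the takeover code.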
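12306 simulated login
Login URL: https://kyfw.12306.cn/otn/login/init
This relies mainly on ActionChains for chained actions and on Chaojiying (超級鷹) to solve the captcha. The captcha image also has to be cropped out of a screenshot.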
import requests
from hashlib import md5

class Chaojiying_Client(object):
    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()  # MD5 hex digest, sent as 'pass2'
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: captcha type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: image ID of the captcha being reported as wrongly solved
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()

def tranformImgCode(imgPath,imgType):
    # Chaojiying username, password and software ID
    chaojiying = Chaojiying_Client('929235569', 'lyz19960415', '904189')
    with open(imgPath, 'rb') as f:
        im = f.read()
    return chaojiying.PostPic(im,imgType)['pic_str']
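The client above never sends the plain password: it posts the MD5 hex digest under the key 'pass2'. The sketch below reproduces only that form-building step, without any network call. `build_form` is a hypothetical helper name, and the account values are made-up placeholders.

```python
from hashlib import md5


def build_form(username, password, soft_id, codetype):
    """Build the form fields that the client posts alongside the image."""
    return {
        "user": username,
        "pass2": md5(password.encode("utf8")).hexdigest(),  # 32 hex characters
        "softid": soft_id,
        "codetype": codetype,
    }


# Placeholder credentials, for illustration only
form = build_form("demo_user", "hello", "000000", 9004)
print(form)
```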
from selenium import webdriver
from selenium.webdriver import ActionChains
from time import sleep
from PIL import Image

url = "https://kyfw.12306.cn/otn/login/init"
chrome = webdriver.Chrome(executable_path="./chromedriver")
# Open the 12306 login page in the browser
chrome.get(url=url)
sleep(2)
# Take a screenshot of the whole page
chrome.save_screenshot("main.png")
# Locate the captcha image element
img_tag = chrome.find_element_by_class_name("touclick-image")
# Crop the captcha out of the screenshot
location = img_tag.location  # coordinates of the captcha's top-left corner
size = img_tag.size  # width and height of the captcha
# Crop box built from position and size: (left, upper, right, lower)
img_range = (int(location["x"]),int(location["y"]),int(location["x"]+size["width"]),int(location["y"]+size["height"]))
# Crop the screenshot to the box given by img_range
i = Image.open("./main.png")
image = i.crop(img_range)
image.save("./code.png")
# Ask Chaojiying for the coordinates to click
result = tranformImgCode('./code.png',9004)
all_list = []  # one [x, y] pair per point to click
if "|" in result:  # more than one image matches
    result1 = result.split("|")  # ['174,71', '272,60']
    count = len(result1)
    for w in range(count):
        lst = []  # a fresh pair for each point, so points do not pile up in one list
        x = int(result1[w].split(",")[0])
        y = int(result1[w].split(",")[1])
        lst.append(x)
        lst.append(y)
        all_list.append(lst)
else:  # only one image matches
    lst = []
    x = int(result.split(",")[0])
    y = int(result.split(",")[1])
    lst.append(x)
    lst.append(y)
    all_list.append(lst)
for xy in all_list:
    x = xy[0]
    y = xy[1]
    # Click at offset (x, y) relative to the captcha image
    ActionChains(chrome).move_to_element_with_offset(img_tag,x,y).click().perform()
    sleep(1)
chrome.find_element_by_id("username").send_keys("洲神再次")
chrome.find_element_by_id("password").send_keys("1234567")
chrome.find_element_by_id("loginSub").click()
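Two steps of the script above are plain arithmetic and string handling, so they can be pulled out and checked without a browser. This is a minimal sketch: `crop_box` and `parse_points` are hypothetical helper names, and the sample location and size values are made up. It assumes the Chaojiying result has the form 'x,y' or 'x1,y1|x2,y2', as in the comment in the script.

```python
def crop_box(location, size):
    """Turn an element's location and size into a (left, upper, right, lower)
    box, the tuple format that PIL's Image.crop expects."""
    left = int(location["x"])
    upper = int(location["y"])
    right = int(location["x"] + size["width"])
    lower = int(location["y"] + size["height"])
    return (left, upper, right, lower)


def parse_points(result):
    """Split a result such as '174,71|272,60' into [[174, 71], [272, 60]]."""
    points = []
    for pair in result.split("|"):  # a lone 'x,y' yields a single pair
        x, y = pair.split(",")
        points.append([int(x), int(y)])
    return points


# Made-up element geometry, for illustration only
print(crop_box({"x": 256, "y": 282}, {"width": 293, "height": 190}))
print(parse_points("174,71|272,60"))
print(parse_points("174,71"))
```

Splitting on "|" first handles the single-point and multi-point cases in one loop, so no separate branch is needed.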