Automated Data Scraping with Selenium

Reposted article, originally published 2020-04-12 21:55. Category: web scraping

For more on what the selenium automation module can do, see the posts in the Testing category of my blog.

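selenium

Concept: a module for browser automation.
Automation: a sequence of actions is specified in code and then carried out in the browser.
Install: pip install selenium
How selenium relates to scraping
- 1. It conveniently captures dynamically loaded data of any kind (whatever is visible in the browser can be scraped).
- 2. It can simulate logging in.
ChromeDriver download (pick the build that matches the installed Chrome version): http://chromedriver.storage.googleapis.com/index.html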

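from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Instantiate a browser object from the browser's driver program
# (Selenium 4 style; Selenium 3 used webdriver.Chrome(executable_path=...))
bro = webdriver.Chrome(service=Service('./chromedriver'))
# Send a request to the target site
bro.get('https://www.jd.com/')
# Locate the search box
search_text = bro.find_element(By.XPATH, '//*[@id="key"]')
search_text.send_keys('iphoneX')  # type text into the element

# Locate and click the search button
btn = bro.find_element(By.XPATH, '//*[@id="search"]/div/div[2]/button')
btn.click()

sleep(2)

# Scroll to the bottom of the results page by injecting JavaScript
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
sleep(2)
bro.quit()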
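
The fixed sleep(2) calls above simply hope the page has finished loading by then. One alternative is a polling wait: keep checking a condition until it holds or a timeout expires. The sketch below is a plain-Python illustration of that idea; `wait_until` and its parameters are hypothetical names chosen for this example and are not part of selenium (selenium ships its own `WebDriverWait` class for the same purpose).

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success. (Hypothetical helper, not a selenium API.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

With a browser object `bro` and `By` imported from selenium.webdriver.common.by, a call such as `wait_until(lambda: bro.find_elements(By.XPATH, '//*[@id="key"]'))` would return as soon as the search box exists, because `find_elements` returns an empty (falsy) list until then.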

Example: China's drug administration site, http://125.35.6.84:81/xk/
Goal: scrape the names of all companies listed on the first three pages

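from time import sleep

from lxml import etree
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

url = 'http://125.35.6.84:81/xk/'
bro = webdriver.Chrome(service=Service('./chromedriver'))  # Selenium 4 style
bro.get(url)
page_text_list = []  # page source of each page
sleep(1)
# Capture the source of the current page, i.e. all data present once it has finished loading
page_text = bro.page_source
page_text_list.append(page_text)

# Click "next page" twice to reach pages 2 and 3
for i in range(2):
    next_page = bro.find_element(By.XPATH, '//*[@id="pageIto_next"]')
    next_page.click()
    sleep(1)
    page_text_list.append(bro.page_source)
# Parse every saved page and print the company names
for page_text in page_text_list:
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="gzlist"]/li')
    for li in li_list:
        name = li.xpath('./dl/@title')[0]
        print(name)
sleep(2)
bro.quit()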

Action chains

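from time import sleep

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

url = 'https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
bro = webdriver.Chrome(service=Service('./chromedriver'))  # Selenium 4 style
bro.get(url)
sleep(1)
# Locating an element with find_element fails if the element lives inside an iframe.
# Fix: switch into the iframe with switch_to first.
bro.switch_to.frame('iframeResult')
div_tag = bro.find_element(By.XPATH, '//*[@id="draggable"]')

# Drag div_tag
action = ActionChains(bro)
action.click_and_hold(div_tag)  # press and hold the mouse button on the element

for i in range(6):
    # perform() executes the queued actions immediately
    action.move_by_offset(10,15).perform()
    sleep(0.5)
action.release().perform()  # let go of the mouse button
bro.quit()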
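
The loop above moves the mouse six times by (10, 15), which is 60 px to the right and 90 px down in total. When the drag target is known as a total distance instead, the per-step offsets for move_by_offset can be computed. A minimal sketch; `split_offset` is a hypothetical helper name for this example, not a selenium function.

```python
def split_offset(total_x, total_y, steps):
    """Split a total drag offset into `steps` integer moves that sum exactly to the total."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    moves = []
    done_x = done_y = 0
    for i in range(1, steps + 1):
        # Position the mouse should have reached after i steps, in whole pixels
        target_x = round(total_x * i / steps)
        target_y = round(total_y * i / steps)
        # The move for this step is the difference to what has already been covered
        moves.append((target_x - done_x, target_y - done_y))
        done_x, done_y = target_x, target_y
    return moves


print(split_offset(60, 90, 6))
```

Each (dx, dy) pair can then be passed to action.move_by_offset(dx, dy). Working from cumulative targets keeps rounding errors from adding up when the total does not divide evenly.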

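How to keep selenium from being detected
- Some sites check whether a request was issued by selenium and, if so, make that request fail.
- A way around the check:
  - have selenium take over an already running Chrome browser
Steps
- 1. Find the directory of the Chrome installation on your computer (in this setup it also holds the driver program) and add it to the PATH environment variable, so that chrome.exe can be run from the command line.
- 2. Open cmd and run:
  - chrome.exe --remote-debugging-port=9222 --user-data-dir="path to an empty folder"
  - Once the command has run, the Chrome browser installed on your machine opens.
- 3. Run the following code; it takes over the real browser opened in step 2.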

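from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
# Attach to the browser started in step 2 instead of launching a new one
chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
# Path to the Chrome driver program on this machine (raw string, so the backslashes are taken literally)
chrome_driver = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"

# Selenium 4 style; Selenium 3 used executable_path=... and chrome_options=...
driver = webdriver.Chrome(service=Service(chrome_driver), options=chrome_options)
print(driver.title)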
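
Step 2 can also be done from Python rather than by hand in cmd. The sketch below uses the standard library's subprocess module with the same two flags as the command in step 2. The helper names and the default profile folder C:\selenium_profile are assumptions made for illustration.

```python
import subprocess


def chrome_debug_command(chrome_path, port, user_data_dir):
    """Build the argument list for starting Chrome with remote debugging enabled."""
    return [
        chrome_path,
        f"--remote-debugging-port={port}",
        f"--user-data-dir={user_data_dir}",
    ]


def launch_chrome(chrome_path="chrome.exe", port=9222,
                  user_data_dir=r"C:\selenium_profile"):
    """Start Chrome in the background and return the process handle.

    The default folder is a placeholder; point it at an empty directory.
    """
    return subprocess.Popen(chrome_debug_command(chrome_path, port, user_data_dir))
```

After launch_chrome() returns, Chrome may need a moment to start before the takeover code can connect to 127.0.0.1:9222.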

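12306 simulated login

URL: https://kyfw.12306.cn/otn/login/init

This mainly uses ActionChains to chain the mouse actions and Chaojiying (超級鷹) to solve the captcha. The captcha image also has to be cropped out of a screenshot.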

import requests
from hashlib import md5

class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password =  password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: captcha type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: ID of the image whose result is being reported as wrong
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


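def tranformImgCode(imgPath,imgType):
    # Replace the placeholders with your own Chaojiying username, password and software ID
    chaojiying = Chaojiying_Client('your_username', 'your_password', 'your_soft_id')
    with open(imgPath, 'rb') as f:
        im = f.read()
    # 'pic_str' holds the recognised result (for type 9004, the coordinates to click)
    return chaojiying.PostPic(im,imgType)['pic_str']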


from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from time import sleep
from PIL import Image

url = "https://kyfw.12306.cn/otn/login/init"
chrome = webdriver.Chrome(service=Service("./chromedriver"))  # Selenium 4 style
# Open the 12306 login page
chrome.get(url=url)
sleep(2)
# Take a screenshot of the page
chrome.save_screenshot("main.png")
# Locate the captcha image element
img_tag = chrome.find_element(By.CLASS_NAME, "touclick-image")
# Crop the captcha out of the screenshot
location = img_tag.location  # coordinates of the captcha's top-left corner
size = img_tag.size  # width and height of the captcha
# Crop box (left, upper, right, lower) derived from the position and size
img_range = (int(location["x"]),int(location["y"]),int(location["x"]+size["width"]),int(location["y"]+size["height"]))
# Crop the screenshot to img_range
i = Image.open("./main.png")
image = i.crop(img_range)
image.save("./code.png")
# Ask Chaojiying for the coordinates to click
result = tranformImgCode('./code.png',9004)

# result looks like '174,71' or, when several pictures match, '174,71|272,60'.
# Split it into one [x, y] pair per point.
all_list = []
for point in result.split("|"):
    x, y = point.split(",")
    all_list.append([int(x), int(y)])
for x, y in all_list:
    # The coordinates are measured from the captcha's top-left corner.
    # Selenium 4.3+ measures the offset from the element's centre, so shift by half the size;
    # on older versions pass x and y directly.
    ActionChains(chrome).move_to_element_with_offset(img_tag, x - size["width"] // 2, y - size["height"] // 2).click().perform()
sleep(1)
chrome.find_element(By.ID, "username").send_keys("洲神再次")
chrome.find_element(By.ID, "password").send_keys("1234567")
chrome.find_element(By.ID, "loginSub").click()
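
The crop in the script above assumes that one CSS pixel equals one screenshot pixel. On displays with scaling enabled the screenshot can be larger than the page's CSS dimensions, and the crop then lands in the wrong place. The sketch below scales the crop box accordingly; `crop_box` and its `scale` parameter are hypothetical names for this example, and the ratio is assumed to be the browser's window.devicePixelRatio.

```python
def crop_box(location, size, scale=1.0):
    """Return (left, upper, right, lower) in screenshot pixels for an element.

    location and size are dicts shaped like selenium's element.location and
    element.size (CSS pixels); scale is the display's device pixel ratio.
    """
    left = location["x"] * scale
    upper = location["y"] * scale
    right = (location["x"] + size["width"]) * scale
    lower = (location["y"] + size["height"]) * scale
    # PIL's Image.crop expects whole-pixel coordinates
    return tuple(int(v) for v in (left, upper, right, lower))


# Same element, without scaling and with 125% display scaling
print(crop_box({"x": 100, "y": 50}, {"width": 300, "height": 190}))
print(crop_box({"x": 100, "y": 50}, {"width": 300, "height": 190}, 1.25))
```

In the script above the ratio could be read with chrome.execute_script('return window.devicePixelRatio') and passed as scale. The coordinates returned for the cropped image would then be in screenshot pixels and would need dividing by the same ratio before clicking.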
