Scraping Dynamic Content with a Web Crawler

Reposted article. 2017-10-30 21:23. Tag: web crawlers

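I. Scraping a simple dynamic page

All the scraping we have done so far was on static pages. Many pages today are dynamic, however, and about 70 percent of those dynamic pages are written in JavaScript, so knowing how to scrape information from JavaScript pages is very important.

Before looking at a concrete case, we need to know what AJAX is. AJAX stands for Asynchronous JavaScript and XML: a technique in which JavaScript requests data from the server asynchronously, without reloading the page. A page can request its data through AJAX, and the response is often JSON, as it is for the page used here.

We can then scrape the data by following the format of the page's AJAX request. Here is a simple example.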

import csv
import json

from Chapter3 import download


def simpletest():
    '''
    Write the data to country.csv.

    The JSON data has a "records" attribute, and each record has
    area, country and capital values.
    :return:
    '''
    fields = ('area', 'country', 'capital')
    # the with block closes the file even if something below fails
    with open("country.csv", "w") as f:
        writer = csv.writer(f)
        writer.writerow(fields)
        d = download.Downloader()
        html = d("http://example.webscraping.com/ajax/search.json?page=0&page_size=10&search_term=A")
        try:
            ajax = json.loads(html)
        except Exception as e:
            # the response was not valid JSON
            print(str(e))
        else:
            for record in ajax['records']:
                row = [record[field] for field in fields]
                writer.writerow(row)


if __name__ == "__main__":
    simpletest()

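I am not sure whether the website is at fault, but data can no longer be downloaded from the URL above. Running the program above gives this result:

[Screenshot: output of the program]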
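Because the endpoint may be unreachable, it can help to separate building the request URL from parsing the response, so that the parsing can be tried without the network. The following is only a minimal sketch under assumptions: the helper names (build_search_url, records_to_rows, rows_to_csv) and the sample payload are hypothetical, made up to mirror the record shape described in the docstring above, and the code is written for Python 3.

```python
import csv
import io
import json
from urllib.parse import urlencode

FIELDS = ("area", "country", "capital")
# Endpoint taken from the example above; it may no longer respond.
BASE_URL = "http://example.webscraping.com/ajax/search.json"


def build_search_url(term, page=0, page_size=10):
    """Return the AJAX search URL for one page of results."""
    query = urlencode([("page", page), ("page_size", page_size), ("search_term", term)])
    return BASE_URL + "?" + query


def records_to_rows(payload, fields=FIELDS):
    """Parse a JSON string and return one list of field values per record."""
    data = json.loads(payload)
    return [[record[name] for name in fields] for record in data.get("records", [])]


def rows_to_csv(rows, fields=FIELDS):
    """Render a header line plus the rows as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(fields)
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical payload for illustration, not real data from the site.
SAMPLE = '{"records": [{"area": 10, "country": "Examplia", "capital": "Sampleville"}]}'

if __name__ == "__main__":
    print(build_search_url("A"))
    print(rows_to_csv(records_to_rows(SAMPLE)))
```

Keeping the parsing in its own function means a wrong URL and a wrong record format show up as two different failures, which makes problems like the one above easier to narrow down.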

II. Rendering dynamic pages

Before starting, install PySide; running pip install pyside on the command line is enough.

We can then use PySide to scrape the data.

from PySide.QtCore import QEventLoop, QUrl
from PySide.QtGui import QApplication
from PySide.QtWebKit import QWebView
import lxml.html


def simpletest():
    '''
    Get the content of div#result in http://example.webscraping.com/places/default/dynamic
    :return: content
    '''
    app = QApplication([])
    webview = QWebView()
    loop = QEventLoop()
    # quit the event loop once the page has finished loading
    webview.loadFinished.connect(loop.quit)
    webview.load(QUrl("http://example.webscraping.com/places/default/dynamic"))
    loop.exec_()
    htmled = webview.page().mainFrame().toHtml()
    # extract the element we are interested in
    tree = lxml.html.fromstring(htmled)
    return tree.cssselect('#result')[0].text_content()


content = simpletest()
print(content)

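Looking back at the simple dynamic page scraping in Part I: that approach did not succeed, and I think the main reason is that I got the URL wrong. Having learned PySide, we can now scrape the data in a completely new way. Here is the code: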

def getallcountry():
    '''
    Open the search page, set search term = b and page_size = 10,
    and then click the search button automatically with JavaScript
    :return:
    '''
    app = QApplication([])
    webview = QWebView()
    loop = QEventLoop()
    # quit the event loop once the page has finished loading
    webview.loadFinished.connect(loop.quit)
    webview.load(QUrl("http://example.webscraping.com/places/default/search"))
    loop.exec_()
    # show the webview
    webview.show()
    frame = webview.page().mainFrame()
    # set the search text to b
    frame.findFirstElement('#search_term').setAttribute('value', 'b')
    # set page_size to 10
    frame.findFirstElement('#page_size option:checked').setPlainText('10')
    # click the search button automatically
    frame.findFirstElement('#search').evaluateJavaScript('this.click()')
    # keep the application running so the window stays open
    app.exec_()

This is the result:

[Screenshot: the search page showing the results]

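The process above only uses PySide to get the results displayed on the page; it has not yet scraped the data. Because the AJAX response arrives after some delay, there are three ways to go about scraping the data:

1. Wait a fixed amount of time (inefficient)

2. Override Qt's network manager to track when URL requests complete (does not work when the problem is on the client side)

3. Poll the page and wait for specific content to appear (the checking wastes CPU time)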

Overall, the third method is the most reliable and convenient. Its core idea is a while loop: as long as the elements have not been found, keep trying.
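The polling idea can be sketched independently of any browser. This is an illustrative sketch, not the original concept code: the names wait_for, pump and fake_find are hypothetical. With PySide, find would query the page frame for a pattern and pump would be the application's processEvents, so that the page keeps loading while we wait.

```python
import time


def wait_for(find, timeout=5.0, interval=0.05, pump=None):
    """Call find() until it returns something truthy or the timeout passes.

    pump is an optional callable that is run before every attempt.
    Returns the matches, or None if the timeout passed first.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if pump is not None:
            pump()
        matches = find()
        if matches:
            return matches
        time.sleep(interval)  # pause briefly to reduce CPU use between checks
    return None


# Stand-in for a page whose results only appear on the third check.
attempts = {"count": 0}


def fake_find():
    attempts["count"] += 1
    return ["row"] if attempts["count"] >= 3 else []


result = wait_for(fake_find, timeout=2.0, interval=0.01)
print(result)  # ['row']
```

The short sleep between checks trades a little latency for less wasted CPU time, which is the drawback noted for this method in the list above.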

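To make these methods more general, we can put them in one class. The class supports: downloading, getting the HTML, finding matching elements, setting attribute values, setting text values, clicking, polling the page, and waiting for a download.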

from PySide.QtCore import QEventLoop, QTimer, QUrl
from PySide.QtGui import QApplication
from PySide.QtWebKit import QWebView
import time
import sys

class BrowserRender(QWebView):
    def __init__(self, show=True):
        '''
        if show is true, the webview window is displayed
        :param show: whether to display the window
        '''
        self.app = QApplication(sys.argv)
        QWebView.__init__(self)
        if show:
            self.show()

    def download(self, url, timeout=60):
        '''
        download the url, waiting until it has loaded or the timeout expires
        :param url: the url to download
        :param timeout: the timeout in seconds
        :return: the html, or None if the request timed out
        '''
        loop = QEventLoop()
        timer = QTimer()
        timer.setSingleShot(True)
        timer.timeout.connect(loop.quit)
        self.loadFinished.connect(loop.quit)
        self.load(QUrl(url))
        timer.start(timeout*1000)
        loop.exec_()
        if timer.isActive():
            # the page finished loading before the timer fired
            timer.stop()
            return self.html()
        else:
            print("Request timed out " + url)

    def html(self):
        '''
        shortcut to return the current html
        :return:
        '''
        return self.page().mainFrame().toHtml()

    def find(self, pattern):
        '''
        find all elements that match the pattern
        :param pattern:
        :return:
        '''
        return self.page().mainFrame().findAllElements(pattern)

    def attr(self, pattern, name, value):
        '''
        set attribute for matching pattern
        :param pattern:
        :param name:
        :param value:
        :return:
        '''
        for e in self.find(pattern):
            e.setAttribute(name, value)

    def text(self, pattern, value):
        '''
        set plaintext for matching pattern
        :param pattern:
        :param value:
        :return:
        '''
        for e in self.find(pattern):
            e.setPlainText(value)

    def click(self, pattern):
        '''
        click matching pattern
        :param pattern:
        :return:
        '''
        for e in self.find(pattern):
            e.evaluateJavaScript("this.click()")

    def wait_load(self, pattern, timeout=60):
        '''
        wait until pattern is found and return matches
        :param pattern:
        :param timeout: the timeout in seconds
        :return: the matches, or None if the wait timed out
        '''
        deadline = time.time() + timeout
        while time.time() < deadline:
            # let Qt process pending events so the page can keep loading
            self.app.processEvents()
            matches = self.find(pattern)
            if matches:
                return matches
        print("wait load timed out")

br = BrowserRender()
br.download("http://example.webscraping.com/places/default/search")
br.attr('#search_term', 'value', '.')
br.text('#page_size option:checked', '1000')
br.click('#search')
elements = br.wait_load('#results a')
countries = [e.toPlainText().strip() for e in elements]
print(countries)

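When calling it, make sure the pattern is written correctly. I wrote #result a instead of #results a, which made the call time out every time.

III. Selenium

Selenium is a simple interface for interacting with pages; it provides an API for automating a browser. It is very easy to use: the functions we want are already wrapped up, and all we need to do is call them.

Below, we use Selenium to do what BrowserRender did.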

from selenium import webdriver


def simpleuse():
    driver = webdriver.Chrome()
    driver.get("http://example.webscraping.com/places/default/search")
    driver.find_element_by_id("search_term").send_keys('.')
    # change the text of the second page_size option to 1000
    js = "document.getElementById('page_size').options[1].text='1000'"
    driver.execute_script(js)
    driver.find_element_by_id('search').click()
    # wait up to 30 seconds for elements to appear
    driver.implicitly_wait(30)
    # find_elements (plural) returns a list of every matching link
    links = driver.find_elements_by_css_selector("#results a")
    countries = [link.text for link in links]
    print(countries)
    driver.close()


if __name__ == "__main__":
    simpleuse()

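Even though chromedriver was configured, it kept reporting that the executable could not be found in PATH:

[Screenshot: the error message]

This problem is not solved yet; more to follow.

The problem is now solved: download the matching version of chromedriver.exe from the official site, then pass the absolute path where it is saved to webdriver.Chrome(absolute path). The code now looks like this: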

from selenium import webdriver
def simpleuse():
    driver = webdriver.Chrome(r'E:\chromedriver\chromedriver.exe')
    driver.get("http://example.webscraping.com/places/default/search")
    driver.find_element_by_id("search_term").send_keys('.')
    js = "document.getElementById('page_size').options[1].text='1000'"
    driver.execute_script(js)
    driver.find_element_by_id('search').click()
    driver.implicitly_wait(30)
    links = driver.find_elements_by_css_selector("#results a")
    countries = [link.text for link in links]
    print(countries)
    driver.close()

if __name__ == "__main__":
    simpleuse()
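A small check before creating the driver can make the "executable not found in PATH" failure easier to diagnose. The sketch below is an illustration under assumptions, not part of Selenium: locate_driver is a hypothetical helper that uses only the Python 3 standard library to look for the executable on PATH first and then at an explicit fallback path.

```python
import os
import shutil


def locate_driver(name="chromedriver", fallback=None):
    """Return a path to the driver executable, or None if none is found.

    Looks on PATH first, then at an explicit fallback path.
    """
    found = shutil.which(name)
    if found:
        return found
    if fallback and os.path.isfile(fallback):
        return fallback
    return None


if __name__ == "__main__":
    # The fallback is the path used in the code above; adjust it for your machine.
    path = locate_driver(fallback=r"E:\chromedriver\chromedriver.exe")
    if path is None:
        print("chromedriver not found on PATH or at the fallback path")
    else:
        print("using driver at " + path)
```

The path it returns can then be passed to webdriver.Chrome in the same way as the absolute path in the code above.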

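IV. Summary

First, we reverse engineered the page and parsed its data with json. Next, we used PySide to render dynamic pages. Finally, for a simpler way to write the same thing, we used Selenium.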
