Crawler Series (13): Scraping JD.com Products with selenium

Reposted article. 2018-08-29 20:00. Tags: web crawler / selenium / Python

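In this article we use selenium to simulate how a user drives a browser and scrape product information from JD.com. As usual, the final result comes first:

[Figure: screenshot of the final result]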

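1. Page analysis

(1) Initial analysis

Originally I planned to write a crawler that could scrape every kind of product, but during the analysis I found that the pages for different product categories have different structures.

So I dropped that idea and scrape only laptop products instead.

To scrape other product categories, you only need to change the data extraction rules. Feel free to try it yourself.

Now let's get started!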

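First, open the laptop product listing page in Chrome. It is easy to see that the page is loaded dynamically:

when the page first opens it shows only 30 products, and the remaining 30 are loaded once we scroll down.

We can use selenium to simulate scrolling to the bottom of the page, so that every product on the page is loaded before we read it: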

>>> browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
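A single scroll may not be enough if the page loads slowly. The following is a minimal sketch, not the article's own method, that keeps scrolling until the page height stops growing. The function name `scroll_until_stable` and its `pause` and `max_rounds` parameters are assumptions made for illustration; `browser` only needs selenium's `execute_script()` method.

```python
import time


def scroll_until_stable(browser, pause=1.0, max_rounds=10):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `browser` is any object that offers selenium's execute_script() method.
    Returns the number of scrolls performed.
    """
    last_height = browser.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        time.sleep(pause)  # give the lazily loaded items time to arrive
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # this scroll loaded nothing new, so we are done
        last_height = new_height
    return max_rounds  # hypothetical safety cap so the loop always ends
```

Capping the number of rounds keeps the loop from running forever on a page that grows without end.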

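(2) Simulating pagination

We also find that the search results span 100 pages in total.

We could fetch each page by constructing its URL, but here we choose to simulate the browser's page turning with selenium instead.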
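For comparison, the URL-construction route could look roughly like the sketch below. The `keyword` and `enc` parameters come from the search URL used later in this article; the `page` parameter and the helper name `build_search_url` are assumptions for illustration only, so check the real address bar while paging before relying on them.

```python
from urllib.parse import urlencode

BASE_URL = 'https://search.jd.com/Search'


def build_search_url(keyword, page=None):
    """Build a search URL; percent-encodes the keyword as UTF-8."""
    params = {'keyword': keyword, 'enc': 'utf-8'}
    if page is not None:
        # ASSUMPTION: 'page' is a hypothetical parameter name, not confirmed by the article
        params['page'] = page
    return BASE_URL + '?' + urlencode(params)


print(build_search_url('笔记本'))
# https://search.jd.com/Search?keyword=%E7%AC%94%E8%AE%B0%E6%9C%AC&enc=utf-8
```

Without a page number, the helper reproduces the start URL that the crawler in this article opens.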

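At the bottom of the page there is a "next page" (下一頁) button; locating that element and clicking it turns the page. (find_element_by_xpath and find_elements_by_xpath, as used in this article, are the Selenium 3 spellings; Selenium 4 writes them as find_element(By.XPATH, ...) and find_elements(By.XPATH, ...).)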

>>> browser.find_element_by_xpath('//a[@class="pn-next" and @onclick]').click()

(3) Extracting the data

Next, we parse each page to extract the data we need. The elements can be selected with selenium:

Product ID: browser.find_elements_by_xpath('//li[@data-sku]'), used to build the product link
Price: browser.find_elements_by_xpath('//div[@class="gl-i-wrap"]/div[2]/strong/i')
Product name: browser.find_elements_by_xpath('//div[@class="gl-i-wrap"]/div[3]/a/em')
Number of reviews: browser.find_elements_by_xpath('//div[@class="gl-i-wrap"]/div[4]/strong')
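The four selectors above each return a separate list, so the values have to be paired up by position to form one record per product. Here is a minimal sketch of that step on plain Python lists; the helper name `build_records` and the sample values are made up for illustration, and the link template is the one used in the code later in this article.

```python
def build_records(skus, prices, names, comments):
    """Combine the four parallel lists into one dict per product."""
    records = []
    for sku, price, name, comment in zip(skus, prices, names, comments):
        records.append({
            'link': 'https://item.jd.com/{sku}.html'.format(sku=sku),
            'price': price,
            'name': name,
            'comment': comment,
        })
    return records


# Made-up sample values standing in for the text scraped from one page
records = build_records(['100', '200'],
                        ['4999.00', '5999.00'],
                        ['Laptop A', 'Laptop B'],
                        ['1000+', '2000+'])
print(records[0]['link'])
```

Note that zip() stops at the shortest list, so if one selector matches fewer elements than the others, trailing products are dropped silently and the remaining rows may be misaligned. Comparing the list lengths first is a cheap sanity check.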

2. Implementation

The analysis is straightforward: the basic idea is to use selenium to simulate browser behavior. The code is below.

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import selenium.common.exceptions
import json
import csv
import time

class JdSpider():
    def open_file(self):
        self.fm = input('Enter the output file format (txt, json, csv): ')
        while self.fm!='txt' and self.fm!='json' and self.fm!='csv':
            self.fm = input('Invalid input, please enter the output file format again (txt, json, csv): ')
        if self.fm=='txt' :
            self.fd = open('Jd.txt','w',encoding='utf-8')
        elif self.fm=='json' :
            self.fd = open('Jd.json','w',encoding='utf-8')
        elif self.fm=='csv' :
            self.fd = open('Jd.csv','w',encoding='utf-8',newline='')

    def open_browser(self):
        self.browser = webdriver.Chrome()
        self.browser.implicitly_wait(10)
        self.wait = WebDriverWait(self.browser,10)

    def init_variable(self):
        self.data = zip()
        self.isLast = False

    def parse_page(self, retries=3):
        try:
            # Each product is an <li class="gl-item"> element that carries a data-sku attribute
            skus = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//li[@class="gl-item"]')))
            skus = [item.get_attribute('data-sku') for item in skus]
            links = ['https://item.jd.com/{sku}.html'.format(sku=item) for item in skus]
            prices = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="gl-i-wrap"]/div[2]/strong/i')))
            prices = [item.text for item in prices]
            names = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="gl-i-wrap"]/div[3]/a/em')))
            names = [item.text for item in names]
            comments = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="gl-i-wrap"]/div[4]/strong')))
            comments = [item.text for item in comments]
            self.data = zip(links,prices,names,comments)
        except selenium.common.exceptions.TimeoutException:
            print('parse_page: TimeoutException')
            # Retry a bounded number of times; unbounded recursion would never end on a page without products
            if retries > 0:
                self.parse_page(retries - 1)
        except selenium.common.exceptions.StaleElementReferenceException:
            print('parse_page: StaleElementReferenceException')
            # The page was re-rendered while we were reading it, so reload it
            self.browser.refresh()
    def turn_page(self):
        try:
            self.wait.until(EC.element_to_be_clickable((By.XPATH,'//a[@class="pn-next"]'))).click()
            time.sleep(1)
            self.browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
            time.sleep(2)
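        except (selenium.common.exceptions.NoSuchElementException,
                selenium.common.exceptions.TimeoutException):
            # wait.until() raises TimeoutException when no clickable "next page" link shows up,
            # which is what happens on the last page, so stop here instead of retrying forever
            self.isLast = True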
        except selenium.common.exceptions.StaleElementReferenceException:
            print('turn_page: StaleElementReferenceException')
            self.browser.refresh()

    def write_to_file(self):
        if self.fm == 'txt':
            for item in self.data:
                self.fd.write('----------------------------------------\n')
                self.fd.write('link: ' + str(item[0]) + '\n')
                self.fd.write('price: ' + str(item[1]) + '\n')
                self.fd.write('name: ' + str(item[2]) + '\n')
                self.fd.write('comment: ' + str(item[3]) + '\n')
        if self.fm == 'json':
            temp = ('link','price','name','comment')
            for item in self.data:
                # Write one JSON object per line; without the newline the objects run together
                json.dump(dict(zip(temp,item)),self.fd,ensure_ascii=False)
                self.fd.write('\n')
        if self.fm == 'csv':
            writer = csv.writer(self.fd)
            for item in self.data:
                writer.writerow(item)

    def close_file(self):
        self.fd.close()

    def close_browser(self):
        self.browser.quit()

    def crawl(self):
        self.open_file()
        self.open_browser()
        self.init_variable()
        print('Crawl started')
        self.browser.get('https://search.jd.com/Search?keyword=%E7%AC%94%E8%AE%B0%E6%9C%AC&enc=utf-8')
        time.sleep(1)
        self.browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        time.sleep(2)
        count = 0
        while not self.isLast:
            count += 1
            print('Crawling page ' + str(count) + ' ......')
            self.parse_page()
            self.write_to_file()
            self.turn_page()
        self.close_file()
        self.close_browser()
        print('Crawl finished')

if __name__ == '__main__':
    spider = JdSpider()
    spider.crawl()

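A few points in the code are worth writing down for future reference:

1. self.fd = open('Jd.csv','w',encoding='utf-8',newline='')

When opening a csv file for writing, pass newline=''. Otherwise, on platforms that translate line endings (Windows), the '\r\n' that the csv module writes becomes '\r\r\n' and the file shows a blank line after every row, which gets in the way of later processing.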
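The effect is easy to check by looking at the raw bytes of the file. The sketch below writes one made-up row to a temporary file (the path and the row are illustrative, not part of the crawler) and reads it back in binary mode.

```python
import csv
import os
import tempfile

# Made-up sample row in the same column order the crawler uses
rows = [('https://item.jd.com/1.html', '4999.00', 'Example laptop', '1000+')]

path = os.path.join(tempfile.mkdtemp(), 'Jd.csv')
with open(path, 'w', encoding='utf-8', newline='') as fd:
    csv.writer(fd).writerows(rows)

# Read the bytes back: with newline='' each row ends in exactly one '\r\n'
with open(path, 'rb') as fd:
    raw = fd.read()
print(raw)
```

With newline='' the row terminator reaches the file exactly as the csv module produced it, on every platform.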

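2. self.browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")

While we simulate scrolling, the page keeps updating its DOM as new items load. Elements located earlier may be replaced in the meantime, and using them then raises StaleElementReferenceException, which happens quite often.

There are generally two ways to deal with this:

Call time.sleep() after the operation to give the browser enough time to finish loading
Catch the exception and handle it accordingly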
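The two approaches can also be combined: catch the exception, pause briefly, then try again a limited number of times. Below is a minimal generic sketch; the helper name `retry_on` and its parameters are assumptions for illustration, and the exception type is passed in so that the helper itself does not depend on selenium.

```python
import time


def retry_on(exc_type, action, attempts=3, pause=1.0):
    """Call action() and retry when it raises exc_type, for at most `attempts` calls."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exc_type:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(pause)  # let the page settle before trying again
```

With selenium, one would pass selenium.common.exceptions.StaleElementReferenceException as exc_type and an action that locates the elements again from scratch before reading them, because a stale reference stays stale no matter how long we wait.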

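3. skus = [item.get_attribute('data-sku') for item in skus]

When selecting with XPath in selenium, the expression must resolve to elements, so an attribute value cannot be selected directly (for example with a trailing /@data-sku). Call get_attribute() on the element instead.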

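4. Running the browser headless can speed up crawling; just set the headless option when launching the browser

opt = webdriver.ChromeOptions()
opt.add_argument('--headless')  # replaces set_headless(), which was deprecated and later removed from selenium
browser = webdriver.Chrome(options=opt)  # older selenium releases took chrome_options=opt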
