Scraping Dynamically Rendered Websites with Scrapy + Selenium

Reposted article. 2020-09-17 14:56. Tags: crawler, python, operations and development

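I. Overview

When to use it

When scraping certain sites with the Scrapy framework, you will often run into pages whose data is loaded dynamically. If Scrapy sends a request straight to the URL, the response contains only the initial HTML, and the dynamically loaded values are missing. A browser requesting the same URL does load them. To get that data in Scrapy, one way is to create a browser object with Selenium, send the request through that browser, and read the dynamically loaded values from the rendered page.

Workflow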

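1. Override the spider's __init__() constructor and instantiate a browser object with Selenium inside it.

2. Override the spider's closed(self, reason) method and quit the browser object inside it. Scrapy calls this method when the spider finishes.

3. Enable a downloader middleware in the settings file. The middleware loads each URL in the browser and returns the rendered page source as the response.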

II. Case study

This example uses Fang.com (房天下) and scrapes new-housing listings from the following link:

https://sh.newhouse.fang.com/house/s/a75-b91/?ctm=1.sh.xf_search.page.1

Page analysis

Get the list of listings

//*[@id="newhouse_loupai_list"]/ul/li//div[@class="nlc_details"]

This returns 20 entries.

Get the name

//*[@id="newhouse_loupai_list"]/ul/li//div[@class="nlc_details"]//div[@class="nlcd_name"]/a/text()


Get the price

//*[@id="newhouse_loupai_list"]/ul/li//div[@class="nlc_details"]//div[@class="nhouse_price"]/span/text()

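Note: this query returns only 18 values instead of 20, because two listings have not published a price and so nothing matches for them. The code later gives them a default value: 價格待定 (price to be determined).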
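
The missing prices are also why the spider queries each listing node separately rather than zipping flat result lists together. A small sketch of the difference follows, using the standard library's `xml.etree.ElementTree` on a made-up, well-formed fragment. The real spider uses Scrapy selectors, and the estate names and price here are invented.

```python
import xml.etree.ElementTree as ET

# Made-up fragment: the second listing has no price element
SAMPLE = """
<ul>
  <li><div class="nlc_details">
    <div class="nlcd_name"><a> Estate A </a></div>
    <div class="nhouse_price"><span>52000</span></div>
  </div></li>
  <li><div class="nlc_details">
    <div class="nlcd_name"><a> Estate B </a></div>
  </div></li>
</ul>
"""

root = ET.fromstring(SAMPLE)

# A flat query finds one price for two listings, so positions no longer line up
flat_prices = [span.text for span in root.findall(".//div[@class='nhouse_price']/span")]

# Querying relative to each listing keeps name and price together
rows = []
for node in root.findall(".//div[@class='nlc_details']"):
    name = node.findtext("div[@class='nlcd_name']/a", default="").strip()
    price = node.findtext("div[@class='nhouse_price']/span", default="").strip()
    rows.append((name, price or "價格待定"))  # default: price to be determined

print(flat_prices)
print(rows)
```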

Get the region

//*[@id="newhouse_loupai_list"]/ul/li//div[@class="relative_message clearfix"]//a/span/text()

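Note: this returns only 17 regions, because the other 3 listings are outside China (Thailand, Laos and so on). The code later gives them a default value: 國外 (overseas).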

Get the address

//*[@id="newhouse_loupai_list"]/ul/li//div[@class="relative_message clearfix"]/div/a/text()

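Note: this returns 17 more entries than the 20 listings. Why? Some addresses contain large runs of blank lines, and some also include the region. The code later handles this by stripping the extra line breaks and using a regular expression to pull out the address.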
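
A minimal sketch of that clean-up, assuming a region appears in square brackets in front of the address. The helper name `split_region_and_address` and the sample strings are invented for illustration. Unlike the spider below, which keeps only the second whitespace-separated token, this variant keeps everything after the bracket.

```python
import re

BRACKETED = re.compile(r"\[(.*?)\]")  # non-greedy: text inside the first [...]


def split_region_and_address(raw):
    """Return (region, address); region is None when there is no [...] part."""
    # Collapse line breaks, tabs and repeated spaces into single spaces
    text = " ".join(raw.split())
    match = BRACKETED.search(text)
    if match is None:
        return None, text
    return match.group(1), text[match.end():].strip()


print(split_region_and_address("\n\t[松江]   Example Road 100\n\n"))
print(split_region_and_address("  Example Road 1,\n Bangkok  "))
```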

Get the status

//*[@id="newhouse_loupai_list"]/ul/li//div[@class="nlc_details"]//span[@class="inSale"]/text()

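Note: 4 entries are missing because those listings have the status 待售 (not yet on sale) and do not match this query. The code later gives unmatched listings a default value.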

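Project code

Once the page analysis above gives the results we want, we can write the code.

Create the project

Open PyCharm, open its Terminal, and run the following commands: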

scrapy startproject fang
cd fang
scrapy genspider newhouse sh.newhouse.fang.com

In the same directory as scrapy.cfg, create bin.py to launch the Scrapy project, with the following content:

#!/usr/bin/python3
# -*- coding: utf-8 -*-
# Create this file, bin.py, in the project root
from scrapy.cmdline import execute
# The third argument is the spider name
execute(['scrapy', 'crawl', 'newhouse', "--nolog"])

The resulting project tree looks like this:

./
├── bin.py
├── fang
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── newhouse.py
└── scrapy.cfg

Edit newhouse.py

# -*- coding: utf-8 -*-
import math
import re

import scrapy
from fang.items import FangItem
from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions


class NewhouseSpider(scrapy.Spider):
    name = 'newhouse'
    allowed_domains = ['sh.newhouse.fang.com']
    base_url = "https://sh.newhouse.fang.com/house/s/a75-b91/?ctm=1.sh.xf_search.page."
    # start_urls = [base_url+str(1)]

    # Instantiate a browser object
    def __init__(self, *args, **kwargs):
        # Keep the site from detecting Selenium
        options = ChromeOptions()
        options.add_argument("--headless")  # run Chrome in headless mode
        options.add_experimental_option('excludeSwitches', ['enable-automation'])
        options.add_experimental_option('useAutomationExtension', False)

        self.browser = Chrome(options=options)
        # Hide navigator.webdriver before any page script runs
        self.browser.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
            "source": """
                            Object.defineProperty(navigator, 'webdriver', {
                              get: () => undefined
                            })
                          """
        })

        # Pass Scrapy's own arguments through to the base Spider
        super().__init__(*args, **kwargs)

    def start_requests(self):
        print("Starting the crawl")
        url = self.base_url + str(1)
        print("url", url)
        # Request the first page; parse_index works out the pagination
        yield scrapy.Request(url, callback=self.parse_index)

    # Quit the browser once the whole crawl has finished
    def closed(self, reason):
        print("Closing the spider")
        self.browser.quit()

    # Visit the first page and work out how many result pages there are
    def parse_index(self, response):
        print("Visiting the first page")
        # Total number of results
        ret_num = response.xpath('//*[@id="sjina_C01_47"]/ul/li[1]/b/text()').extract_first()
        # 20 results per page, rounded up
        page_num = math.ceil(int(ret_num) / 20)

        # The first page is already downloaded, so parse its listings directly
        yield from self.parse_details(response)

        # Request the remaining pages (2 .. page_num)
        for n in range(2, page_num + 1):
            url = self.base_url + str(n)
            print("url", url)
            # When the page comes back, self.parse_details handles it
            yield scrapy.Request(url=url, callback=self.parse_details)

    def parse_details(self, response):
        # Nodes that hold the information to scrape, one per listing
        node_list = response.xpath('//*[@id="newhouse_loupai_list"]/ul/li//div[@class="nlc_details"]')
        # Matches the text inside square brackets, e.g. [松江] (non-greedy)
        bracket_re = re.compile(r'\[(.*?)\]', re.S)

        count = 0
        # Walk the nodes and extract each field
        for node in node_list:
            count += 1
            try:
                # Name
                nlcd_name = node.xpath('.//div[@class="nlcd_name"]/a/text()').extract_first(default="").strip()
                print("nlcd_name", nlcd_name)

                # Price; default "價格待定" (price to be determined) when not published
                price = node.xpath('.//div[@class="nhouse_price"]/span/text()').extract_first(default="").strip()
                if not price:
                    price = "價格待定"
                print("price", price)

                # Region
                region_ret = node.xpath('.//div[@class="relative_message clearfix"]//a/span/text()').extract()
                region = ""
                if region_ret:
                    # Take the text inside the square brackets
                    found = bracket_re.findall(region_ret[0].strip())
                    if found:
                        region = found[0]

                # Address
                address_list = node.xpath('.//div[@class="relative_message clearfix"]/div/a/text()').extract()
                address_str = ""
                # Pick the text node that holds the address
                if len(address_list) >= 2:
                    address_str = address_list[1].strip()
                elif address_list:
                    address_str = address_list[0].strip()

                # Does the address itself carry a region, e.g. [松江]?
                address_ret = bracket_re.findall(address_str)
                if address_ret:
                    # Region taken from the address
                    region = address_ret[0]
                    # Split on whitespace; the address follows the region
                    add_cut_str = address_str.split()
                    address = add_cut_str[1] if len(add_cut_str) > 1 else ""
                else:
                    address = address_str
                    # No region element means the listing is overseas ("國外")
                    if not region_ret:
                        region = "國外"

                print("region", region)
                print("address", address)

                # Status; default "待售" (not yet on sale) when there is no inSale tag
                status = node.xpath('.//span[@class="inSale"]/text()').extract_first()
                if not status:
                    status = "待售"

                print("status", status)

                # Build the item
                item = FangItem()
                item['nlcd_name'] = nlcd_name
                item['price'] = price
                item['region'] = region
                item['address'] = address
                item['status'] = status
                yield item
            except Exception as e:
                print(e)

        print("Scraped %s entries from this page" % count)


Edit items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class FangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    nlcd_name = scrapy.Field()
    price = scrapy.Field()
    region = scrapy.Field()
    address = scrapy.Field()
    status = scrapy.Field()


Edit pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json

class FangPipeline(object):
    def __init__(self):
        # Open the output file as UTF-8 text
        self.f = open("fang_pipline.json", 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Serialize the item and write it on its own line, followed by a comma
        content = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.f.write(content)

        return item

    def close_spider(self, spider):
        # Close the file
        self.f.close()


Note: for convenience the results are saved to a JSON file here. They could of course be saved to a database instead.
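
As a rough sketch of the database option, here is a pipeline that writes to SQLite through the standard library's `sqlite3` module. The class name `SQLitePipeline`, the file name `fang.db` and the table name `newhouse` are hypothetical. Only the five field names come from FangItem.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical pipeline that stores each listing in a SQLite table."""

    def __init__(self, db_path="fang.db"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Open the database and make sure the table exists
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS newhouse ("
            "nlcd_name TEXT, price TEXT, region TEXT, address TEXT, status TEXT)"
        )

    def process_item(self, item, spider):
        # Parameterized insert: values are never spliced into the SQL text
        self.conn.execute(
            "INSERT INTO newhouse VALUES (?, ?, ?, ?, ?)",
            (item["nlcd_name"], item["price"], item["region"],
             item["address"], item["status"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

To use something like this, it would be registered in ITEM_PIPELINES in the same way as FangPipeline, under whatever module path it is saved in.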

Edit settings.py to enable the pipeline

ITEM_PIPELINES = {
   'fang.pipelines.FangPipeline': 300,
}

Run bin.py to start the crawler, then open fang_pipline.json to check the output.

Note: the search used here has only 6 pages of 20 results each, so 120 entries can be collected.
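
The page count follows from the total result count by rounding up. A small sketch of that arithmetic, using the base URL from the spider; the helper name `page_urls` is made up for illustration.

```python
import math

BASE_URL = "https://sh.newhouse.fang.com/house/s/a75-b91/?ctm=1.sh.xf_search.page."


def page_urls(total_results, per_page=20, base_url=BASE_URL):
    """Build one URL per result page; the last page may be only partly filled."""
    pages = math.ceil(total_results / per_page)
    return [base_url + str(n) for n in range(1, pages + 1)]


urls = page_urls(120)
print(len(urls))  # 6
print(urls[-1])
```

With 121 results the same function yields 7 URLs, since the extra listing needs a page of its own.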

Reference for this article:

https://www.cnblogs.com/bk9527/p/10504883.html
