Scraping JD with Splash


1. Overview

In the previous article (https://www.cnblogs.com/xiao987334176/p/13656055.html),

we covered how to use Splash to scrape JavaScript-rendered pages.

This post puts that to work on a real project: crawling ice cream (冰淇淋) listings on JD.com.

Environment

OS: CentOS 7.6

Docker version: 19.03.12

IP address: 192.168.0.10

Note: the Splash service is installed with Docker.

 

OS: Windows 10

Python version: 3.7.9

IP address: 192.168.0.9

Note: local development is done with PyCharm.

 

How to use Splash itself was covered in the previous article and is not repeated here.

 

2. Analyzing the Page

Open JD.com, search for the keyword 冰淇淋 (ice cream), and drag the scrollbar down. More and more product entries are loaded as you scroll, which shows that part of the page is loaded via Ajax.

 

Note: each product entry sits inside a <div class="gl-i-wrap"></div> element.

 

Let's open scrapy shell and fetch the page:

scrapy shell "https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8"

Output:

...
[s]   view(response)    View response in a browser
>>>

Note: wrap the URL in double quotes rather than single quotes, otherwise the command errors out (the & in the query string gets interpreted by the shell).

 

Next, run the following command to query the page with a CSS selector:

>>> response.css('div.gl-i-wrap')
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '
), ' gl-i-wrap ')]"
...

It returns a long list of Selector objects.

 

Count the number of product entries:

>>> len(response.css('div.gl-i-wrap'))
30

Only 30 ice cream entries come back, yet the page clearly showed 60. Why?

Answer: the page initially contains only 30 entries; when we drag the scrollbar, JavaScript runs and sends an Ajax request to the backend, and the browser renders the other 30 entries from the returned data.

We can also confirm this in the Network tab of the browser's developer tools.

 

 

 

Given this, the solution is clear: use JavaScript to simulate the user scrolling to the bottom of the page, and run that script through the execute endpoint of the Splash service. Let's put it into practice.

 

Note: <div id="footer-2017"></div> is the footer of the page, so scrolling it into view is enough to reach the bottom.

 

First: simulate the user behavior.

In the browser console, run the following commands:

e = document.getElementById("footer-2017")
e.scrollIntoView(true)

The page scrolls straight to the bottom.

Explanation of the argument:

scrollIntoView is a DOM API for scrolling an element into view within the page (or its scroll container); the boolean form of the argument is well supported across browsers (Firefox 36+, etc.).

When called with true, the page (or container) scrolls so that the top of the element aligns with the top of the viewport (container).

 

Using SplashRequest

Above, fetching the page directly returned only 30 entries, because the remaining products are loaded dynamically.

So here we send the request with scrapy_splash.SplashRequest and let the execute endpoint render the page, which solves the problem.

Open the spider project dynamic_page from the previous article in PyCharm, and open the Terminal.

Run dir to make sure the current directory is dynamic_page:

(crawler) E:\python_script\爬蟲\dynamic_page>dir
 驅動器 E 中的卷是 file
 卷的序列號是 1607-A400

 E:\python_script\爬蟲\dynamic_page 的目錄

2020/09/12  10:37    <DIR>          .
2020/09/12  10:37    <DIR>          ..
2020/09/12  10:20               211 bin.py
2020/09/12  14:30                 0 dynamicpage_pipline.json
2020/09/12  10:36    <DIR>          dynamic_page
2020/09/12  10:33                 0 result.csv
2020/09/12  10:18               267 scrapy.cfg
               4 個文件            478 字節
               3 個目錄 260,445,159,424 可用字節

 

Next, open scrapy shell by running:

scrapy shell 

Output:

...
[s]   view(response)    View response in a browser
>>>

 

Finally, paste the following code:

from scrapy_splash import SplashRequest  # SplashRequest sends the request through the Splash service

url = "https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8"
lua = '''
function main(splash)
    splash:go(splash.args.url)
    splash:wait(3)
    splash:runjs("document.getElementById('footer-2017').scrollIntoView(true)")
    splash:wait(3)
    return splash:html()
end
'''
fetch(SplashRequest(url,endpoint = 'execute',args= {'lua_source':lua})) # fetch again: this time the JS runs through the Splash execute endpoint on port 8050 and the rendered result comes back
len(response.css('div.gl-i-wrap'))

 

The result:

[s]   view(response)    View response in a browser
>>> from scrapy_splash import SplashRequest  # SplashRequest sends the request through the Splash service
>>> url = 'https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8'
>>> lua = '''
... function main(splash)
...     splash:go(splash.args.url)
...     splash:wait(3)
...     splash:runjs("document.getElementById('footer-2017').scrollIntoView(true)")
...     splash:wait(3)
...     return splash:html()
... end
... '''
>>> fetch(SplashRequest(url,endpoint = 'execute',args= {'lua_source':lua})) # fetch again: this time the JS runs through the Splash execute endpoint on port 8050 and the rendered result comes back
2020-09-12 14:30:54 [scrapy.core.engine] INFO: Spider opened
2020-09-12 14:30:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://w
ww.jd.com/error.aspx> from <GET https://search.jd.com/robots.txt>
2020-09-12 14:30:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.jd.com/error.aspx> (ref
erer: None)
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 27 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 83 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 92 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 202 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 351 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 375 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 376 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 385 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 386 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 387 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 388 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 389 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 397 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 400 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 403 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 404 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 405 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 406 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 407 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 408 without any user agent to enforce it on.
2020-09-12 14:30:55 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.0.10:8050/robots.txt
> (referer: None)
2020-09-12 14:31:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://search.jd.com/Search?keywor
d=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8 via http://192.168.0.10:8050/execute> (referer: None)
>>> len(response.css('div.gl-i-wrap'))
60

Note: fetch() is only available inside scrapy shell; it cannot be called from ordinary Python code, so this test has to be done in the shell.
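If you want to try the same Lua script outside scrapy shell, one option (my own sketch, not from the original article) is to POST it straight to Splash's /execute HTTP endpoint with the requests library; any extra parameter, such as url, is exposed to the script as splash.args:

import requests

lua = '''
function main(splash)
    splash:go(splash.args.url)
    splash:wait(3)
    splash:runjs("document.getElementById('footer-2017').scrollIntoView(true)")
    splash:wait(3)
    return splash:html()
end
'''

# send the script to the Splash container (192.168.0.10:8050 in this setup)
resp = requests.post('http://192.168.0.10:8050/execute', json={
    'lua_source': lua,
    'url': 'https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8',
})
html = resp.text  # the rendered HTML; it should now contain all 60 product blocks
print(html.count('gl-i-wrap'))  # rough count of product blocks in the rendered page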

 

That leaves only the extraction stage. Let's put the whole project together, still using scrapy shell to test the selectors along the way.

3. Code Implementation

Create the project

這里對目錄就沒有什么要求了,找個空目錄就行。

打開Pycharm,並打開Terminal,執行以下命令

scrapy startproject ice_cream
cd ice_cream
scrapy genspider jd search.jd.com

 

In the same directory as scrapy.cfg, create bin.py to launch the Scrapy project:

# bin.py in the project root: starts the spider so it can be run directly from PyCharm
from scrapy.cmdline import execute
# the third argument is the spider name
execute(['scrapy', 'crawl', 'jd', '--nolog'])

 

The project tree now looks like this:

./
├── bin.py
├── ice_cream
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── jd.py
└── scrapy.cfg

 

Modify settings.py

The settings below follow the scrapy-splash documentation on GitHub (https://github.com/scrapy-plugins/scrapy-splash), which explains each of them in detail.

Append the following at the end of the file:

# Splash server address
SPLASH_URL = 'http://192.168.0.10:8050'
# enable the two Splash downloader middlewares and put HttpCompressionMiddleware after them
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Splash-aware dedup filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# needed to support cache_args (optional)
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# Splash-aware HTTP cache storage
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
ITEM_PIPELINES = {
   'ice_cream.pipelines.IceCreamPipeline': 100,
}

 

Note: change the Splash server address to match your own environment; nothing else needs to be adjusted.
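Before running the spider, it can also be worth checking that the Splash instance configured in SPLASH_URL is reachable from the development machine; a minimal check of my own (not from the article):

import requests

SPLASH_URL = 'http://192.168.0.10:8050'
# the root URL serves the Splash web UI, so a 200 response means the service is up
resp = requests.get(SPLASH_URL, timeout=5)
print(resp.status_code)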

 

Modify jd.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from ice_cream.items import IceCreamItem

# custom Lua script: load the page, scroll the footer into view so the Ajax content loads, then return the HTML
lua = '''
function main(splash)
    splash:go(splash.args.url)
    splash:wait(3)
    splash:runjs("document.getElementById('footer-2017').scrollIntoView(true)")
    splash:wait(3)
    return splash:html()
end
'''


class JdSpider(scrapy.Spider):
    name = 'jd'
    allowed_domains = ['search.jd.com']
    start_urls = ['https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8']
    base_url = 'https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8'
    # per-spider settings; the variable name must be custom_settings
    # REQUEST_HEADERS is a custom key (not a built-in Scrapy setting); it is read back below via self.settings.get()
    custom_settings = {
        'REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
        }
    }

    def parse(self, response):
        # the pager shows the total page count (about 100 pages)
        page_num = int(response.css('span.fp-text i::text').extract_first())
        # 100 pages is too many for a demo, so cap it at 2
        page_num = 2
        # print("page_num", page_num)
        for i in range(page_num):
            # JD uses odd values of the page parameter (1, 3, 5, ...) for successive result pages;
            # base_url already has a query string, so append with & rather than ?
            url = '%s&page=%s' % (self.base_url, 2 * i + 1)
            # print("url", url)
            yield SplashRequest(url, headers=self.settings.get('REQUEST_HEADERS'), endpoint='execute',
                                args={'lua_source': lua}, callback=self.parse_item)

    def parse_item(self, response):  # page parsing callback
        for sel in response.css('div.gl-i-wrap'):
            # create a fresh item for every product
            item = IceCreamItem()
            name = sel.css('div.p-name em').extract_first()
            price = sel.css('div.p-price i::text').extract_first()
            # print("name", name)
            # print("price", price)

            item['name'] = name
            item['price'] = price
            yield item
            # yield {
            #     'name': name,
            #     'price': price,
            # }
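Note that sel.css('div.p-name em').extract_first() returns the raw HTML of the <em> element, tags included (you can see them in the JSON output at the end). If you would rather store plain text, a drop-in variant of parse_item could join the text nodes instead; this is my own sketch, not part of the original spider:

    def parse_item(self, response):
        for sel in response.css('div.gl-i-wrap'):
            item = IceCreamItem()
            # join every descendant text node of the <em> so the HTML tags are dropped
            item['name'] = ''.join(sel.css('div.p-name em ::text').getall()).strip()
            item['price'] = sel.css('div.p-price i::text').get()
            yield item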

 

Modify items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class IceCreamItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # one field for each value extracted in jd.py
    name = scrapy.Field()
    price = scrapy.Field()

 

Modify pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json

class IceCreamPipeline(object):
    def __init__(self):
        # open in binary mode ('wb') and write JSON-encoded bytes
        self.f = open("ice_cream_pipline.json", 'wb')

    def process_item(self, item, spider):
        # serialize the item to JSON and write it on its own line
        content = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.f.write(content.encode('utf-8'))

        return item

    def close_spider(self, spider):
        # close the file when the spider finishes
        self.f.close()
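One caveat about this pipeline: the file it produces is a sequence of JSON objects, each on its own line and followed by a comma, not a single valid JSON document. A small sketch of my own (not from the article) for loading it back into Python by parsing line by line:

import json

items = []
with open("ice_cream_pipline.json", encoding="utf-8") as f:
    for line in f:
        line = line.strip().rstrip(',')  # drop the trailing comma written by the pipeline
        if line:
            items.append(json.loads(line))

print(len(items))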

 

執行bin.py,等待1分鍾,就會生成文件ice_cream_pipline.json

打開json文件,內容如下:

{"name": "<em><span class=\"p-tag\" style=\"background-color:#c81623\">京東超市</span>\t\n明治(meiji)草莓白巧克力雪糕 245g(6支)彩盒 <font class=\"skcolor_ljg\">冰淇淋</font></em>", "price": "46.80"},
{"name": "<em><span class=\"p-tag\" style=\"background-color:#c81623\">京東超市</span>\t\n伊利 巧樂茲香草巧克力口味脆皮甜筒雪糕<font class=\"skcolor_ljg\">冰淇淋</font>冰激凌冷飲 73g*6/盒</em>", "price": "32.80"},
...

 

 

Reference for this article:

https://www.cnblogs.com/518894-lu/p/9067208.html

 

