Scraping JD with Splash


1. Overview

In the previous article (https://www.cnblogs.com/xiao987334176/p/13656055.html),

we covered how to use Splash to scrape JavaScript-rendered pages.

This post puts that to work on a real project: crawling ice cream (冰淇淋) listings on JD.com.

Environment

OS: CentOS 7.6

Docker version: 19.03.12

IP address: 192.168.0.10

Note: the Splash service is installed with Docker.

 

OS: Windows 10

Python version: 3.7.9

IP address: 192.168.0.9

Note: local development is done with PyCharm.

 

How to use Splash itself was covered in the previous article and is not repeated here.

 

2. Analyzing the Page

Open JD.com, search for the keyword 冰淇淋 (ice cream), and drag the scrollbar down. More and more product entries are loaded as you scroll, which shows that part of the page is loaded via Ajax.

 

Note: each product entry sits inside a <div class="gl-i-wrap"></div> element.

 

Let's open scrapy shell and fetch the page:

scrapy shell "https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8"

Output:

...
[s]   view(response)    View response in a browser
>>>

Note: wrap the URL in double quotes rather than single quotes, otherwise the command errors out (the & in the query string gets interpreted by the shell).

 

Next, run the following command to query the page with a CSS selector:

>>> response.css('div.gl-i-wrap')
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '
), ' gl-i-wrap ')]"
...

It returns a long list of Selector objects.

 

Count the number of product entries:

>>> len(response.css('div.gl-i-wrap'))
30

Only 30 ice cream entries come back, yet the page clearly showed 60. Why?

Answer: the page initially contains only 30 entries; when we drag the scrollbar, JavaScript runs and sends an Ajax request to the backend, and the browser renders the other 30 entries from the returned data.

We can also confirm this in the Network tab of the browser's developer tools.

 

 

 

Given this, the solution is clear: use JavaScript to simulate the user scrolling to the bottom of the page, and run that script through the execute endpoint of the Splash service. Let's put it into practice.

 

Note: <div id="footer-2017"></div> is the footer of the page, so scrolling it into view is enough to reach the bottom.

 

First: simulate the user behavior.

In the browser console, run the following commands:

e = document.getElementById("footer-2017")
e.scrollIntoView(true)

The page scrolls straight to the bottom.

Explanation of the argument:

scrollIntoView is a DOM API for scrolling an element into view within the page (or its scroll container); the boolean form of the argument is well supported across browsers (Firefox 36+, etc.).

When called with true, the page (or container) scrolls so that the top of the element aligns with the top of the viewport (container).

 

Using SplashRequest

Above, fetching the page directly returned only 30 entries, because the remaining products are loaded dynamically.

So here we send the request with scrapy_splash.SplashRequest and let the execute endpoint render the page, which solves the problem.

Open the spider project dynamic_page from the previous article in PyCharm, and open the Terminal.

Run dir to make sure the current directory is dynamic_page:

(crawler) E:\python_script\爬蟲\dynamic_page>dir
 驅動器 E 中的卷是 file
 卷的序列號是 1607-A400

 E:\python_script\爬蟲\dynamic_page 的目錄

2020/09/12  10:37    <DIR>          .
2020/09/12  10:37    <DIR>          ..
2020/09/12  10:20               211 bin.py
2020/09/12  14:30                 0 dynamicpage_pipline.json
2020/09/12  10:36    <DIR>          dynamic_page
2020/09/12  10:33                 0 result.csv
2020/09/12  10:18               267 scrapy.cfg
               4 個文件            478 字節
               3 個目錄 260,445,159,424 可用字節

 

Next, open scrapy shell by running:

scrapy shell 

Output:

...
[s]   view(response)    View response in a browser
>>>

 

Finally, paste the following code:

from scrapy_splash import SplashRequest  # SplashRequest sends the request through the Splash service

url = "https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8"
lua = '''
function main(splash)
    splash:go(splash.args.url)
    splash:wait(3)
    splash:runjs("document.getElementById('footer-2017').scrollIntoView(true)")
    splash:wait(3)
    return splash:html()
end
'''
fetch(SplashRequest(url,endpoint = 'execute',args= {'lua_source':lua})) # fetch again: this time the JS runs through the Splash execute endpoint on port 8050 and the rendered result comes back
len(response.css('div.gl-i-wrap'))

 

The result:

[s]   view(response)    View response in a browser
>>> from scrapy_splash import SplashRequest  # SplashRequest sends the request through the Splash service
>>> url = 'https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8'
>>> lua = '''
... function main(splash)
...     splash:go(splash.args.url)
...     splash:wait(3)
...     splash:runjs("document.getElementById('footer-2017').scrollIntoView(true)")
...     splash:wait(3)
...     return splash:html()
... end
... '''
>>> fetch(SplashRequest(url,endpoint = 'execute',args= {'lua_source':lua})) # fetch again: this time the JS runs through the Splash execute endpoint on port 8050 and the rendered result comes back
2020-09-12 14:30:54 [scrapy.core.engine] INFO: Spider opened
2020-09-12 14:30:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://w
ww.jd.com/error.aspx> from <GET https://search.jd.com/robots.txt>
2020-09-12 14:30:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.jd.com/error.aspx> (ref
erer: None)
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 27 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 83 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 92 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 202 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 351 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 375 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 376 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 385 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 386 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 387 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 388 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 389 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 397 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 400 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 403 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 404 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 405 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 406 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 407 without any user agent to enforce it on.
2020-09-12 14:30:55 [protego] DEBUG: Rule at line 408 without any user agent to enforce it on.
2020-09-12 14:30:55 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.0.10:8050/robots.txt
> (referer: None)
2020-09-12 14:31:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://search.jd.com/Search?keywor
d=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8 via http://192.168.0.10:8050/execute> (referer: None)
>>> len(response.css('div.gl-i-wrap'))
60

Note: fetch() is only available inside scrapy shell; it cannot be called from ordinary Python code, so this test has to be done in the shell.
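If you want to try the same Lua script outside scrapy shell, one option (my own sketch, not from the original article) is to POST it straight to Splash's /execute HTTP endpoint with the requests library; any extra parameter, such as url, is exposed to the script as splash.args:

import requests

lua = '''
function main(splash)
    splash:go(splash.args.url)
    splash:wait(3)
    splash:runjs("document.getElementById('footer-2017').scrollIntoView(true)")
    splash:wait(3)
    return splash:html()
end
'''

# send the script to the Splash container (192.168.0.10:8050 in this setup)
resp = requests.post('http://192.168.0.10:8050/execute', json={
    'lua_source': lua,
    'url': 'https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8',
})
html = resp.text  # the rendered HTML; it should now contain all 60 product blocks
print(html.count('gl-i-wrap'))  # rough count of product blocks in the rendered page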

 

That leaves only the extraction stage. Let's put the whole project together, still using scrapy shell to test the selectors along the way.

3. Code Implementation

Create the project

這里對目錄就沒有什么要求了,找個空目錄就行。

打開Pycharm,並打開Terminal,執行以下命令

scrapy startproject ice_cream
cd ice_cream
scrapy genspider jd search.jd.com

 

In the same directory as scrapy.cfg, create bin.py to launch the Scrapy project:

# bin.py in the project root: starts the spider so it can be run directly from PyCharm
from scrapy.cmdline import execute
# the third argument is the spider name
execute(['scrapy', 'crawl', 'jd', '--nolog'])

 

The project tree now looks like this:

./
├── bin.py
├── ice_cream
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── jd.py
└── scrapy.cfg

 

Modify settings.py

The settings below follow the scrapy-splash documentation on GitHub (https://github.com/scrapy-plugins/scrapy-splash), which explains each of them in detail.

Append the following at the end of the file:

# Splash server address
SPLASH_URL = 'http://192.168.0.10:8050'
# enable the two Splash downloader middlewares and put HttpCompressionMiddleware after them
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Splash-aware dedup filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# needed to support cache_args (optional)
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# Splash-aware HTTP cache storage
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
ITEM_PIPELINES = {
   'ice_cream.pipelines.IceCreamPipeline': 100,
}

 

Note: change the Splash server address to match your own environment; nothing else needs to be adjusted.
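Before running the spider, it can also be worth checking that the Splash instance configured in SPLASH_URL is reachable from the development machine; a minimal check of my own (not from the article):

import requests

SPLASH_URL = 'http://192.168.0.10:8050'
# the root URL serves the Splash web UI, so a 200 response means the service is up
resp = requests.get(SPLASH_URL, timeout=5)
print(resp.status_code)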

 

Modify jd.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from ice_cream.items import IceCreamItem

# custom Lua script: load the page, scroll the footer into view so the Ajax content loads, then return the HTML
lua = '''
function main(splash)
    splash:go(splash.args.url)
    splash:wait(3)
    splash:runjs("document.getElementById('footer-2017').scrollIntoView(true)")
    splash:wait(3)
    return splash:html()
end
'''


class JdSpider(scrapy.Spider):
    name = 'jd'
    allowed_domains = ['search.jd.com']
    start_urls = ['https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8']
    base_url = 'https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8'
    # per-spider settings; the variable name must be custom_settings
    # REQUEST_HEADERS is a custom key (not a built-in Scrapy setting); it is read back below via self.settings.get()
    custom_settings = {
        'REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
        }
    }

    def parse(self, response):
        # the pager shows the total page count (about 100 pages)
        page_num = int(response.css('span.fp-text i::text').extract_first())
        # 100 pages is too many for a demo, so cap it at 2
        page_num = 2
        # print("page_num", page_num)
        for i in range(page_num):
            # JD uses odd values of the page parameter (1, 3, 5, ...) for successive result pages;
            # base_url already has a query string, so append with & rather than ?
            url = '%s&page=%s' % (self.base_url, 2 * i + 1)
            # print("url", url)
            yield SplashRequest(url, headers=self.settings.get('REQUEST_HEADERS'), endpoint='execute',
                                args={'lua_source': lua}, callback=self.parse_item)

    def parse_item(self, response):  # page parsing callback
        for sel in response.css('div.gl-i-wrap'):
            # create a fresh item for every product
            item = IceCreamItem()
            name = sel.css('div.p-name em').extract_first()
            price = sel.css('div.p-price i::text').extract_first()
            # print("name", name)
            # print("price", price)

            item['name'] = name
            item['price'] = price
            yield item
            # yield {
            #     'name': name,
            #     'price': price,
            # }
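Note that sel.css('div.p-name em').extract_first() returns the raw HTML of the <em> element, tags included (you can see them in the JSON output at the end). If you would rather store plain text, a drop-in variant of parse_item could join the text nodes instead; this is my own sketch, not part of the original spider:

    def parse_item(self, response):
        for sel in response.css('div.gl-i-wrap'):
            item = IceCreamItem()
            # join every descendant text node of the <em> so the HTML tags are dropped
            item['name'] = ''.join(sel.css('div.p-name em ::text').getall()).strip()
            item['price'] = sel.css('div.p-price i::text').get()
            yield item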

 

Modify items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class IceCreamItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # one field for each value extracted in jd.py
    name = scrapy.Field()
    price = scrapy.Field()

 

Modify pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json

class IceCreamPipeline(object):
    def __init__(self):
        # open in binary mode ('wb') and write JSON-encoded bytes
        self.f = open("ice_cream_pipline.json", 'wb')

    def process_item(self, item, spider):
        # serialize the item to JSON and write it on its own line
        content = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.f.write(content.encode('utf-8'))

        return item

    def close_spider(self, spider):
        # close the file when the spider finishes
        self.f.close()
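One caveat about this pipeline: the file it produces is a sequence of JSON objects, each on its own line and followed by a comma, not a single valid JSON document. A small sketch of my own (not from the article) for loading it back into Python by parsing line by line:

import json

items = []
with open("ice_cream_pipline.json", encoding="utf-8") as f:
    for line in f:
        line = line.strip().rstrip(',')  # drop the trailing comma written by the pipeline
        if line:
            items.append(json.loads(line))

print(len(items))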

 

執行bin.py,等待1分鍾,就會生成文件ice_cream_pipline.json

打開json文件,內容如下:

{"name": "<em><span class=\"p-tag\" style=\"background-color:#c81623\">京東超市</span>\t\n明治(meiji)草莓白巧克力雪糕 245g(6支)彩盒 <font class=\"skcolor_ljg\">冰淇淋</font></em>", "price": "46.80"},
{"name": "<em><span class=\"p-tag\" style=\"background-color:#c81623\">京東超市</span>\t\n伊利 巧樂茲香草巧克力口味脆皮甜筒雪糕<font class=\"skcolor_ljg\">冰淇淋</font>冰激凌冷飲 73g*6/盒</em>", "price": "32.80"},
...

 

 

Reference for this article:

https://www.cnblogs.com/518894-lu/p/9067208.html

 

