Getting Started with the Python Crawler Library Scrapy, Part 1: Scraping Product Data from Dangdang

This article is a repost. 2016-12-13 22:21 1910 Python

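1. For an introduction to the Scrapy library, see its documentation (Chinese edition): http://scrapy-chs.readthedocs.io/zh_CN/latest/

2. Installation: pip install scrapy. Note that running the library on Windows requires pywin32, so install that as well. A suitable version can be downloaded from: https://sourceforge.net/projects/pywin32/files/pywin32/

3. Scraping product data from Dangdang:

First, create a crawler project named dangdang. In PowerShell, go to the directory where your projects live and run: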

D:\Py\myweb>scrapy startproject dangdang

New Scrapy project 'dangdang', using template directory 'd:\\python35\\lib\\site-packages\\scrapy\\templates\\project', created in:

    D:\Python35\myweb\dangdang

You can start your first spider with:

    cd dangdang

    scrapy genspider example example.com

Once the project has been created, change into it and create a spider inside it, as follows:

D:\Py\myweb>cd .\dangdang\

D:\Py\myweb\dangdang>scrapy genspider -t basic dangspd dangdang.com

Created spider 'dangspd' using template 'basic' in module:

  Dangdang.spiders.dangspd

Next, edit items.py, which defines the data to be scraped. Change items.py to the following:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Product title
    title = scrapy.Field()
    # Product review count
    num = scrapy.Field()

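Next, edit pipelines.py, which usually holds the code that processes the data after it has been scraped. Here we need to print each scraped record to the screen and also save it to a local text file, so change pipelines.py to the following: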

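# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class DangdangPipeline(object):
    def process_item(self, item, spider):
        # item["title"] and item["num"] are parallel lists, one entry per product on the page.
        # Open the file once per item; utf-8 keeps the Chinese text intact on any platform.
        with open("result.txt", "a", encoding="utf-8") as f:
            # zip stops at the shorter list, so a length mismatch cannot raise IndexError
            for j, (title, num) in enumerate(zip(item["title"], item["num"])):
                print(j)
                # "商品名" = product name, "商品評論數" = review count
                print("商品名：" + title)
                print("商品評論數：" + num)
                print("--------")
                f.write(title + "\t" + num + "\n")
        return item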
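
The file-writing step can also be tried outside Scrapy. The following is a minimal sketch, assuming a plain dict stands in for the item; write_rows and the sample values are hypothetical names introduced only for illustration, and the output goes to a temporary directory rather than the project folder.

```python
import os
import tempfile


def write_rows(titles, nums, path):
    """Append one tab-separated line per (title, review count) pair to path."""
    with open(path, "a", encoding="utf-8") as f:
        for title, num in zip(titles, nums):
            f.write(title + "\t" + num + "\n")


# Hypothetical sample in the same shape as item["title"] and item["num"].
sample = {"title": ["Product A", "Product B"], "num": ["434條評論", "0條評論"]}

out_path = os.path.join(tempfile.mkdtemp(), "result.txt")
write_rows(sample["title"], sample["num"], out_path)

# Read the file back to confirm what was written.
with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Each product becomes one line, with the title and the review text separated by a tab.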

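Next, edit the configuration file settings.py. There are two reasons for doing so:

1) Enable the pipeline just written, because pipelines are disabled by default.

2) Tell the crawler not to obey the robots protocol, because that protocol restricts crawlers and obeying it may leave you with no results.

Change the robots section of settings.py as shown below. Setting the value to False makes the crawler ignore Dangdang's robots.txt. Of course, do not use these techniques for anything illegal.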

# Obey robots.txt rules

ROBOTSTXT_OBEY = False

Then set the pipelines section of settings.py as follows to enable the pipeline:

# Configure item pipelines

# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html

ITEM_PIPELINES = {

    'dangdang.pipelines.DangdangPipeline': 300,

}

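Next, analyze the structure of Dangdang's pages to work out the rules for extracting the information and the pattern for crawling automatically.

Open one channel page; its successive pages have the following URLs: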

http://category.dangdang.com/pg1-cid4002644.html

http://category.dangdang.com/pg2-cid4002644.html

http://category.dangdang.com/pg3-cid4002644.html

……

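The URLs clearly follow the pattern http://category.dangdang.com/pg[page number]-cid4002644.html

With this pattern, the page number can be treated as a variable, and a for loop can construct every product page in the channel. That is how the crawl is automated.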
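
The URL construction can be sketched in a few lines of plain Python. This is only an illustration of the pattern; URL_TEMPLATE and build_page_urls are hypothetical names introduced here, not part of the project files.

```python
# Page-URL pattern observed above; the page number is the only part that changes.
URL_TEMPLATE = "http://category.dangdang.com/pg{page}-cid4002644.html"


def build_page_urls(first_page, last_page):
    """Return the channel-page URLs for first_page..last_page, inclusive."""
    return [URL_TEMPLATE.format(page=n) for n in range(first_page, last_page + 1)]


urls = build_page_urls(1, 3)
for url in urls:
    print(url)
```

Calling build_page_urls(1, 3) reproduces the three URLs listed earlier.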

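Then work out the rule for extracting the product information.

Open any channel page, for example http://category.dangdang.com/pg1-cid4002644.html, to see the product listing.

We want to extract the titles and review counts of all products on the page and filter out everything else. To do that, view the page source, analyze the first product as an example, and then generalize to a rule that covers all products. Right-click and choose View Source, then use Ctrl+F to jump to that product's markup.

The corresponding source, copied out, looks like this: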

……

<a title=" [當當自營]EGISOO御姬秀橙花潤唇膏3g 無色護唇膏淡化唇紋水潤晶瑩保濕潤唇膏 " class="pic" href="http://product.dangdang.com/60629118.html#ddclick?act=click&pos=60629118_0_2_m&cat=4002644&key=&qinfo=&pinfo=&minfo=14215_1_48&ninfo=&custid=&permid=20160906025129757347420307757891648&ref=&rcount=&type=&t=1476452492000&searchapi_version=test_ori" target="_blank" ><img src='http://img3x8.ddimg.cn/33/30/60629118-1_b_2.jpg' alt=' [當當自營]EGISOO御姬秀橙花潤唇膏3g 無色護唇膏淡化唇紋水潤晶瑩保濕潤唇膏 ' /></a>
¥9.90
<a title=" [當當自營]EGISOO御姬秀橙花潤唇膏3g 無色護唇膏淡化唇紋水潤晶瑩保濕潤唇膏 " href="http://product.dangdang.com/60629118.html#ddclick?act=click&pos=60629118_0_2_m&cat=4002644&key=&qinfo=&pinfo=&minfo=14215_1_48&ninfo=&custid=&permid=20160906025129757347420307757891648&ref=&rcount=&type=&t=1476452492000&searchapi_version=test_ori" target="_blank" > [當當自營]EGISOO御姬秀橙花潤唇膏3g 無色護唇膏淡化唇紋水潤晶瑩保濕潤唇膏 </a>
明星都在用水潤護唇秋冬換季必備呵護你的雙唇晶瑩剔透明媚動人正品保證貨到付款
<a href="http://comm.dangdang.com/review/reviewlist.php?pid=60629118#ddclick?act=sort_total_review_count_desc&pos=60629118_0_2_m&cat=4002644&key=&qinfo=&pinfo=&minfo=14215_1_48&ninfo=&custid=&permid=20160906025129757347420307757891648&ref=&rcount=&type=&t=1476452492000&searchapi_version=test_ori" target="_blank" name="P_pl">434條評論</a> </div>

……

From this we get the XPath expressions for the product title and the review count:

# Extract the product title

"//a[@class='pic']/@title"

# Extract the product review count

"//a[@name='P_pl']/text()"
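
To see what these two rules select, here is a rough sketch using only the standard library. It is a stand-in, not the Scrapy selector itself: xml.etree.ElementTree supports only a subset of XPath and needs well-formed markup, so SAMPLE is a simplified, hypothetical version of one product entry, and the attribute and text are read in a second step instead of with /@title and /text().

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for one product entry (hrefs and extra
# attributes from the real page are omitted; the values are illustrative).
SAMPLE = """
<div>
  <a title="Product A" class="pic" href="#"><img src="a.jpg" alt="Product A" /></a>
  <a title="Product A" href="#">Product A</a>
  <a href="#" name="P_pl">434條評論</a>
</div>
"""

root = ET.fromstring(SAMPLE)

# Equivalent of //a[@class='pic']/@title: find the elements, then read the attribute.
titles = [a.get("title") for a in root.findall(".//a[@class='pic']")]

# Equivalent of //a[@name='P_pl']/text(): find the elements, then read their text.
review_counts = [a.text for a in root.findall(".//a[@name='P_pl']")]

print(titles)
print(review_counts)
```

Only the link with class="pic" contributes a title, so the second link with the same title attribute is not counted twice.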

With the XPath expressions worked out, we can now write dangspd.py, the spider file created at the start. Change it to the following:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request

from dangdang.items import DangdangItem


class DangspdSpider(scrapy.Spider):
    name = "dangspd"
    allowed_domains = ["dangdang.com"]
    start_urls = (
        'http://category.dangdang.com/pg1-cid4002644.html',
    )

    def parse(self, response):
        # Collect every product title and review count on the current page
        item = DangdangItem()
        item["title"] = response.xpath("//a[@class='pic']/@title").extract()
        item["num"] = response.xpath("//a[@name='P_pl']/text()").extract()
        yield item
        # Queue pages 2-100 of the channel. Every page re-yields these requests;
        # Scrapy's duplicate filter drops URLs that were already scheduled.
        for i in range(2, 101):
            url = "http://category.dangdang.com/pg" + str(i) + "-cid4002644.html"
            yield Request(url, callback=self.parse)

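That completes the spider.

Now we can move on to debugging and running it.

Open a command prompt and run the spider. The output looks like the following; it is long, so part of it has been omitted: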

D:\Py\myweb\dangdang>scrapy crawl dangspd --nolog

……

43

商品名： WIS水潤面膜套裝24片祛痘控油補水保濕淡痘印收縮毛孔面膜貼男女

商品評論數：255條評論

--------

44

商品名：歐詩漫水活奇跡系列【水活奇跡珍珠水(清潤型)+珍珠水活奇跡保濕凝乳】

商品評論數：0條評論

--------

45

商品名：【法國進口】雅漾（Avene）活泉恆潤保濕精華乳30ml 0064

商品評論數：0條評論

--------

46

商品名：【法國進口】Avene雅漾敏感肌膚護理凈柔潔面摩絲150ml溫和泡沫潔面乳洗面奶0655

商品評論數：0條評論

--------

47

商品名：珍視明中老年護眼貼2盒裝 30對60貼針對中老年用眼問題緩解眼疲勞

商品評論數：226條評論

A result.txt file also appears in the local directory, containing the scraped data.
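
Because the review count is saved as text such as "434條評論", a later analysis step would probably want it as a number. The following is a small sketch of one way to do that, assuming the tab-separated layout written by the pipeline; parse_result_line is a hypothetical helper, and treating a missing number as 0 is an assumption made for this sketch.

```python
import re


def parse_result_line(line):
    """Split one 'title<TAB>N條評論' line into (title, review_count_as_int)."""
    # Split on the last tab so a tab inside the title cannot break the parse.
    title, num_text = line.rstrip("\n").rsplit("\t", 1)
    match = re.search(r"\d+", num_text)
    # Fall back to 0 when no digits are present (an assumption for this sketch).
    count = int(match.group()) if match else 0
    return title.strip(), count


print(parse_result_line("Product A\t434條評論\n"))
```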
