Python Web Crawling with Scrapy: Debugging and Scraping a Web Page

This article is a repost. Posted 2017-06-20 21:16 under: Python web crawling

Debugging with the shell:

Change into the project directory and run scrapy shell "URL".

For example:

scrapy shell http://www.w3school.com.cn/xml/xml_syntax.asp

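The interactive terminal session that opens lets you run code against the downloaded page and check the results against the relevant page source.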

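Now let's use Scrapy to crawl a real website, taking the 迅讀 site (www.xunsee.com) as the example.

From its home page, I want to extract the list of articles and the corresponding author names.

First, define title and author in items.py. Test1Item plays a role similar to a model in Django. You can think of Test1Item as a container, and this container inherits from scrapy.Item.

In the Scrapy version used here, Item in turn inherits from DictItem, so Test1Item can be treated as behaving like a dictionary. title and author are the two fields of the item, that is, the keys of the dictionary.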

# Declaration in Scrapy's own source:
#     class Item(DictItem):

import scrapy


class Test1Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
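To make the dictionary analogy concrete, here is a stdlib-only stand-in. It is not Scrapy's implementation, and the names FieldsDict and Test1Record are hypothetical. It behaves like a dict restricted to declared keys, which is roughly how a scrapy.Item treats its fields: assigning to an undeclared key raises KeyError.

```python
class FieldsDict(dict):
    """Hypothetical stand-in for an item: a dict that only accepts declared keys."""

    fields = ()

    def __setitem__(self, key, value):
        # Reject keys that were not declared, as a scrapy.Item does for its fields
        if key not in self.fields:
            raise KeyError(
                "%s does not support field: %s" % (type(self).__name__, key)
            )
        super().__setitem__(key, value)


class Test1Record(FieldsDict):
    # Mirrors the two fields declared on Test1Item
    fields = ("title", "author")


record = Test1Record()
record["title"] = ["Some title"]
record["author"] = ["Some author"]

# Converts to a plain dict, the same way dict(item) is used on a real item
as_dict = dict(record)

try:
    record["price"] = 10  # not a declared field
    rejected = False
except KeyError:
    rejected = True
```

The point of declaring fields up front is that a typo in a key name fails immediately instead of silently creating a new key.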

Next, write the page-parsing code in test_spider.py.

from scrapy.spiders import Spider

from test1.items import Test1Item


class testSpider(Spider):
    name = "test1"  # spider name; this is the name passed to "scrapy crawl"
    allowed_domains = ["www.xunsee.com"]  # domain names only, not URLs
    start_urls = ["http://www.xunsee.com/"]

    def parse(self, response):
        items = []
        # Entry point for all the data: each div below holds one article and its author
        sites = response.xpath('//*[@id="content_1"]/div')
        for site in sites:
            item = Test1Item()
            title = site.xpath('span[@class="title"]/a/text()').extract()
            # Article link; extracted here but not stored in the item
            href = site.xpath('span[@class="title"]/a/@href').extract()
            author = site.xpath('span[@class="author"]/a/text()').extract()
            # extract() returns lists of text strings, so they are stored as-is
            item['title'] = title
            item['author'] = author
            items.append(item)
        return items

Once the title and author have been extracted, they are stored in an item, and every item is appended to the items list.
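The row-by-row pattern (select the container rows first, then run relative paths inside each row) can be tried without a network connection. The sketch below uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors. The markup is hypothetical, inferred only from the XPath expressions in the spider. ElementTree supports just a subset of XPath and needs well-formed markup, so this is an illustration of the pattern rather than a replacement for the spider.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped like what the spider's XPath expressions expect
PAGE = """
<html><body>
  <div id="content_1">
    <div>
      <span class="title"><a href="/book/1">First Book</a></span>
      <span class="author"><a href="/author/a">Author A</a></span>
    </div>
    <div>
      <span class="title"><a href="/book/2">Second Book</a></span>
      <span class="author"><a href="/author/b">Author B</a></span>
    </div>
  </div>
</body></html>
"""


def parse_rows(markup):
    """Return one dict per row found under the element with id 'content_1'."""
    root = ET.fromstring(markup.strip())
    # Step 1: select every row container
    rows = root.findall(".//*[@id='content_1']/div")
    items = []
    for row in rows:
        # Step 2: paths are relative to the current row
        title_link = row.find("span[@class='title']/a")
        author_link = row.find("span[@class='author']/a")
        items.append({
            "title": title_link.text,
            "href": title_link.get("href"),
            "author": author_link.text,
        })
    return items


items = parse_rows(PAGE)
```

Because the inner paths are relative to each row, a title is always paired with the author from the same row.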

In pipelines.py, modify Test1Pipeline as follows. This class processes the items returned by testSpider; it is where the data gets stored. Here the items are written to a JSON file.

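import json


class Test1Pipeline(object):
    def __init__(self):
        # Open the output file once, in text mode with UTF-8 encoding
        self.file = open('xundu.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps Chinese text readable
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes
        self.file.close()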

After the project has run, an xundu.json file appears in the directory. The run log can be checked in the log file.
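The JSON-writing step can be tried in isolation. This is a minimal sketch, assuming Python 3. It uses json.dumps with ensure_ascii=False so that Chinese text is written as-is rather than as \uXXXX escapes. The helper write_json_lines and the sample records are hypothetical.

```python
import io
import json


def write_json_lines(records, stream):
    """Write one JSON object per line, keeping non-ASCII characters readable."""
    for record in records:
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical records shaped like the items the spider returns
records = [
    {"title": ["迅讀"], "author": ["Author A"]},
    {"title": ["Second title"], "author": ["Author B"]},
]

# An in-memory text buffer stands in for the output file
buffer = io.StringIO()
write_json_lines(records, buffer)
lines = buffer.getvalue().splitlines()
```

Writing one object per line means the file can be read back line by line with json.loads, without loading everything at once.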

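As this crawler shows, Scrapy's structure is fairly simple. There are three main steps:

1 items.py defines the fields (keys) under which the content is stored

2 The custom test_spider.py crawls the page data and returns it

3 pipelines.py stores the content returned by test_spider.py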
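One step the list leaves implicit: Scrapy only calls a pipeline that is enabled in the project's settings.py. Below is a sketch of that setting, assuming the project package is named test1 (as the import from test1.items suggests). The value 300 is an arbitrary priority in the conventional 0 to 1000 range; lower values run first.

```python
# settings.py (fragment)
# Maps the pipeline's import path to its priority; the package name
# "test1" is an assumption based on the imports used in the spider.
ITEM_PIPELINES = {
    "test1.pipelines.Test1Pipeline": 300,
}
```

With that in place, the spider is typically started from the project directory with scrapy crawl test1, where test1 is the name attribute of the spider.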
