Python Scrapy Crawler Framework Example (Part 1)

Reposted article, 2018-11-13 15:48. Category: web crawlers -- Python

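Earlier posts introduced the basics of Scrapy but did not include a worked example. This post walks through a small one for reference and study.

Note: from here on the Python version is not called out; Python 3.x is assumed.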

Crawl target

We pick a simple image site and scrape some information about its images.

Site URL: http://www.58pic.com/c/

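Creating the project

Run the following command in a terminal:

scrapy startproject AdilCrawler

The command generates the project skeleton and prints a hint about the next step.

As that hint suggests, cd into the project directory and run scrapy genspider example example.com to create a spider file named example for the domain example.com.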

Writing items.py

To start with, we only scrape the image author's name and the image theme.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class AdilcrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    author = scrapy.Field()   # author name

    theme = scrapy.Field()    # image theme

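Writing the spider file

Change into the AdilCrawler directory and create a basic spider class with this command:

scrapy genspider thousandPic www.58pic.com

# thousandPic is the spider name; www.58pic.com is the domain the spider is allowed to crawl

The command creates a file named thousandPic.py in the spiders folder. Now edit it: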

# -*- coding: utf-8 -*-
import scrapy
# A first try at a spider


class ThousandpicSpider(scrapy.Spider):
    name = 'thousandPic'
    allowed_domains = ['www.58pic.com']
    start_urls = ['http://www.58pic.com/c/']

    def parse(self, response):
        '''
        XPath taken from inspecting the page elements:
            /html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()
        The page holds many images, addressed as /html/body/div[4]/div[3]/div[i],
        where the index i tells them apart. Leaving i out makes the expression
        match every image under that path on the current page.
        '''
        # author: author name
        # theme:  image theme
        author = response.xpath('/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()').extract()
        theme = response.xpath('/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()').extract()
        # Use the spider's log method to print the scraped lists to the console.
        self.log(author)
        self.log(theme)
        # Print the scraped values one image at a time. The page showed 20 images
        # when this was written; zip() avoids an IndexError if fewer come back.
        for i, (t, a) in enumerate(zip(theme, author), start=1):
            print(i, ' **** ', t, ': ', a)

Run the command and check the printed output:

scrapy crawl thousandPic

In the output, the lines tagged DEBUG come from the log calls.
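
The docstring in parse() says that leaving the index off div makes the XPath match every image block instead of one. Here is a minimal sketch of that behaviour using the standard library's xml.etree.ElementTree, which supports only a subset of XPath; Scrapy's own selectors are what the spider really uses. The markup and the placeholder values in SAMPLE are hypothetical and much simpler than the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the listing page markup.
SAMPLE = """
<body>
  <div><a><p>theme-1</p><p>author-1</p></a></div>
  <div><a><p>theme-2</p><p>author-2</p></a></div>
  <div><a><p>theme-3</p><p>author-3</p></a></div>
</body>
"""

root = ET.fromstring(SAMPLE)

# With an index on <div>, only the second block is matched.
one = [p.text for p in root.findall("./div[2]/a/p[2]")]

# Without an index on <div>, every sibling <div> is matched.
all_authors = [p.text for p in root.findall("./div/a/p[2]")]

print(one)          # ['author-2']
print(all_authors)  # ['author-1', 'author-2', 'author-3']
```

The same idea explains why a single response.xpath() call in the spider returns a list with one entry per image.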

Improving the code

Bring in the item class AdilcrawlerItem:

# -*- coding: utf-8 -*-
import scrapy
# Either this import or the from-import below works. Which one resolves depends on
# whether PyCharm has AdilCrawler opened as a project. This form is the recommended one.
import AdilCrawler.items as items

# The from-import form requires AdilCrawler to be opened as a project.
# from AdilCrawler.items import AdilcrawlerItem


class ThousandpicSpider(scrapy.Spider):
    name = 'thousandPic'
    allowed_domains = ['www.58pic.com']
    start_urls = ['http://www.58pic.com/c/']

    def parse(self, response):
        '''
        XPath taken from inspecting the page elements:
            /html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()
        The page holds many images, addressed as /html/body/div[4]/div[3]/div[i],
        where the index i tells them apart. Leaving i out makes the expression
        match every image under that path on the current page.
        '''

        item = items.AdilcrawlerItem()

        # author: author name
        # theme:  image theme
        author = response.xpath('/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()').extract()
        theme = response.xpath('/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()').extract()

        # Each field holds the full list of values found on the page.
        item['author'] = author
        item['theme'] = theme

        return item

Run the spider again and the returned item appears in the crawl output.
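
Note that the item returned by parse() carries two parallel lists covering the whole page. If one record per image is easier to work with, the lists can be paired up first. The sketch below shows one way to do that in plain Python; the helper name to_records and the placeholder values are assumptions made for illustration, not part of the original project.

```python
def to_records(authors, themes):
    """Pair parallel author/theme lists into one dict per image."""
    # zip() stops at the shorter list, so a missing value cannot raise IndexError.
    return [
        {"author": a.strip(), "theme": t.strip()}
        for a, t in zip(authors, themes)
    ]


# Placeholder values standing in for the lists returned by extract().
authors = ["author-1 ", "author-2"]
themes = ["theme-1", " theme-2"]

records = to_records(authors, themes)
print(records)
# [{'author': 'author-1', 'theme': 'theme-1'}, {'author': 'author-2', 'theme': 'theme-2'}]
```

Inside a spider, each of these dicts could be yielded on its own, so that every image becomes a separate item in the output.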

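Saving the results to a file

Run the following command:

scrapy crawl thousandPic -o items.json

This generates an items.json file in the directory where the command was run.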

Further improvement: the ItemLoader helper class

ItemLoader can replace the scattered extract() and xpath() calls.

The code is as follows:

# -*- coding: utf-8 -*-
import scrapy
from AdilCrawler.items import AdilcrawlerItem

# Import the ItemLoader helper class
from scrapy.loader import ItemLoader

# "Optimize" variant of the spider


class ThousandpicoptimizeSpider(scrapy.Spider):
    name = 'thousandPicOptimize'
    allowed_domains = ['www.58pic.com']
    start_urls = ['http://www.58pic.com/c/']

    def parse(self, response):
        '''
        XPath taken from inspecting the page elements:
            /html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()
        The page holds many images, addressed as /html/body/div[4]/div[3]/div[i],
        where the index i tells them apart. Leaving i out makes the expression
        match every image under that path on the current page.
        '''

        # ItemLoader replaces the separate xpath() and extract() calls.
        loader = ItemLoader(item=AdilcrawlerItem(), response=response)
        # author: author name
        # theme:  image theme
        loader.add_xpath('author', '/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()')
        loader.add_xpath('theme', '/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()')
        return loader.load_item()

Writing the pipelines file

The default pipelines.py file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class Adilcrawler1Pipeline(object):
    def process_item(self, item, spider):
        return item

The revised code is as follows:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class AdilcrawlerPipeline(object):
    '''
        Save item data to a file.
    '''

    def __init__(self):
        # Open as UTF-8 explicitly, because non-ASCII text is written unescaped below.
        self.filename = open('thousandPic.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII text readable in the file
        # instead of writing \uXXXX escapes.
        # Items are written one dict at a time; the trailing ',\n'
        # separates them and starts a new line.
        text = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.filename.write(text)

        return item

    def close_spider(self, spider):
        self.filename.close()
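
One thing to be aware of: because every line ends with a comma, thousandPic.json as a whole is not a valid JSON document. A common alternative is the JSON Lines layout, with one JSON object per line and no separator. The sketch below is a hypothetical variant of the pipeline, not the original author's code; the class name JsonLinesPipeline and the .jl file name are assumptions. It uses only the standard library, so it can be tried without Scrapy by calling the methods by hand.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical variant of the pipeline above: one JSON object per line."""

    def __init__(self, path="thousandPic.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Opening here instead of __init__ ties the file to the crawl lifetime.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()


# Drive the pipeline by hand with a placeholder item, writing to a temp directory.
demo_path = os.path.join(tempfile.mkdtemp(), "thousandPic.jl")
pipeline = JsonLinesPipeline(demo_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"author": ["作者"], "theme": ["主題"]}, spider=None)
pipeline.close_spider(spider=None)

# Every line parses on its own with json.loads().
with open(demo_path, encoding="utf-8") as fh:
    rows = [json.loads(line) for line in fh]
print(rows)
```

To use a class like this in the project, it would have to be registered in ITEM_PIPELINES in the same way as AdilcrawlerPipeline.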

Configuring settings

Edit the settings.py configuration file.

Find the pipelines setting and change it:

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
#    'AdilCrawler.pipelines.AdilcrawlerPipeline': 300,
# }

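# To enable a pipeline, it must be added to the ITEM_PIPELINES setting.
# AdilCrawler is the project package, pipelines is the pipeline module, and AdilcrawlerPipeline is the class name.
ITEM_PIPELINES = {
    'AdilCrawler.pipelines.AdilcrawlerPipeline': 300,
}

# Once listed here, the pipeline is enabled: when the spider runs, Scrapy calls the methods of this class for each item, which here means saving the data as shown above.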

Run the command:

scrapy crawl thousandPicOptimize

After the run, the file thousandPic.json is created and holds the saved data.
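
Since the pipeline ends each line with a comma, json.load() cannot read thousandPic.json in one go. Below is a minimal sketch of reading it back line by line. The helper name parse_comma_terminated is an assumption, and the sample text uses placeholder values rather than real scraped data.

```python
import json


def parse_comma_terminated(text):
    """Parse lines of the form '{...},' into a list of dicts."""
    rows = []
    for line in text.splitlines():
        # Drop surrounding whitespace and the trailing separator comma.
        line = line.strip().rstrip(",")
        if line:
            rows.append(json.loads(line))
    return rows


# Placeholder content in the same layout the pipeline writes.
sample = (
    '{"author": ["author-1"], "theme": ["theme-1"]},\n'
    '{"author": ["author-2"], "theme": ["theme-2"]},\n'
)

rows = parse_comma_terminated(sample)
print(len(rows))          # 2
print(rows[0]["author"])  # ['author-1']
```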

Paginated crawling with the CrawlSpider class

Create a CrawlSpider from the crawl template by running:

scrapy genspider -t crawl thousandPicPaging www.58pic.com

items.py stays unchanged. The generated spider file thousandPicPaging.py looks like this:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ThousandpicpagingSpider(CrawlSpider):
    name = 'thousandPicPaging'
    allowed_domains = ['www.58pic.com']
    start_urls = ['http://www.58pic.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

After modification:

# -*- coding: utf-8 -*-
import scrapy
# LinkExtractor pulls out the links that match a given rule
from scrapy.linkextractors import LinkExtractor
# CrawlSpider base class and Rule
from scrapy.spiders import CrawlSpider, Rule
import AdilCrawler.items as items


class ThousandpicpagingSpider(CrawlSpider):
    name = 'thousandPicPaging'
    allowed_domains = ['www.58pic.com']
    # Changed start URL
    start_urls = ['http://www.58pic.com/c/']

    # Link-extraction rule applied to each response; it yields the list of link objects that match.
    # Pagination URLs look like http://www.58pic.com/c/1-0-0-03.html, so the regex is
    # derived from that shape: 1-0-0-03 -> \S-\S-\S-\S\S. The pattern is passed via allow.
    # The author reports that the regex stopped taking effect when restrict_xpaths was used instead.
    # A raw string with escaped dots keeps the backslashes and the literal '.' intact.
    page_link = LinkExtractor(allow=r'http://www\.58pic\.com/c/\S-\S-\S-\S\S\.html', allow_domains='www.58pic.com')

    rules = (
        # Request every extracted link, keep following links from those pages,
        # and hand each response to the named callback.
        Rule(page_link, callback='parse_item', follow=True),  # the trailing ',' makes this a tuple; without it an error is raised
    )

    # Rule callbacks are not invoked for the start URL, so parse_item() never sees the first page.
    # Overriding parse_start_url(), a CrawlSpider method, covers that page.
    def parse_start_url(self, response):
        i = items.AdilcrawlerItem()
        author = response.xpath('/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()').extract()
        theme = response.xpath('/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()').extract()
        i['author'] = author
        i['theme'] = theme

        yield i

    # The callback named in the rule
    def parse_item(self, response):
        i = items.AdilcrawlerItem()
        author = response.xpath('/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()').extract()
        theme = response.xpath('/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()').extract()
        i['author'] = author
        i['theme'] = theme
        yield i

Run it again:

scrapy crawl thousandPicPaging

The output shows that four pages of content were scraped.
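
Before wiring a pattern into LinkExtractor, it can help to check it with the standard re module. This sketch assumes LinkExtractor applies allow patterns as a regex search over each URL. It tests the pagination pattern, written as a raw string with the dots escaped, against the example URL from the spider comments and against the start URL.

```python
import re

# Pagination pattern as a raw string, with literal dots escaped.
PAGE_LINK = re.compile(r"http://www\.58pic\.com/c/\S-\S-\S-\S\S\.html")

urls = [
    "http://www.58pic.com/c/1-0-0-03.html",  # pagination URL quoted in the spider comments
    "http://www.58pic.com/c/",               # start URL, no page suffix
]

matches = [bool(PAGE_LINK.search(u)) for u in urls]
print(matches)  # [True, False]
```

The start URL does not match the pattern, which fits the spider handling the first page separately in parse_start_url().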

Improving it again with the ItemLoader class

# -*- coding: utf-8 -*-
import scrapy
# LinkExtractor pulls out the links that match a given rule
from scrapy.linkextractors import LinkExtractor
# ItemLoader helper class
from scrapy.loader import ItemLoader
# CrawlSpider base class and Rule
from scrapy.spiders import CrawlSpider, Rule
import AdilCrawler.items as items


class ThousandpicpagingopSpider(CrawlSpider):
    name = 'thousandPicPagingOp'
    allowed_domains = ['www.58pic.com']
    # Changed start URL
    start_urls = ['http://www.58pic.com/c/']

    # Link-extraction rule applied to each response; it yields the list of link objects that match.
    # Pagination URLs look like http://www.58pic.com/c/1-0-0-03.html, so the regex is
    # derived from that shape: 1-0-0-03 -> \S-\S-\S-\S\S. The pattern is passed via allow.
    # The author reports that the regex stopped taking effect when restrict_xpaths was used instead.
    # A raw string with escaped dots keeps the backslashes and the literal '.' intact.
    page_link = LinkExtractor(allow=r'http://www\.58pic\.com/c/\S-\S-\S-\S\S\.html', allow_domains='www.58pic.com')

    rules = (
        # Request every extracted link, keep following links from those pages,
        # and hand each response to the named callback.
        Rule(page_link, callback='parse_item', follow=True),  # the trailing ',' makes this a tuple; without it an error is raised
    )

    # Rule callbacks are not invoked for the start URL, so parse_item() never sees the first page.
    # Overriding parse_start_url(), a CrawlSpider method, covers that page.
    def parse_start_url(self, response):
        loader = ItemLoader(item=items.AdilcrawlerItem(), response=response)
        loader.add_xpath('author', '/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()')
        loader.add_xpath('theme', '/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()')

        yield loader.load_item()

    # The callback named in the rule
    def parse_item(self, response):
        loader = ItemLoader(item=items.AdilcrawlerItem(), response=response)
        loader.add_xpath('author', '/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()')
        loader.add_xpath('theme', '/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()')

        yield loader.load_item()

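The result is the same as before.

Finally, a quick plug for an online regular expression tester: http://tool.oschina.net/regex/

It is handy for checking a pattern such as the pagination regex used above.

That completes a simple crawl of basic information from one site. More topics will follow in later posts.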
