Python3 Web Crawling with Scrapy: Implementing the Core Features (Part 2)

Reposted from the original article. 2018-01-14 00:39. Tags: Scrapy / Python

Blog: http://www.moonxy.com

This post uses the Scrapy crawler framework on Python 3.6.2. For setting up Scrapy, see my other post: Python3 Web Crawling with Scrapy: Installation and Configuration (Part 1).

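1. Creating the crawler project

Before scraping anything, create a new Scrapy project. Change to a directory where you want to keep the code, for example G:\projects, and run: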

scrapy startproject SinanewsSpider

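This command creates a new directory named SinanewsSpider under the current directory. SinanewsSpider is also the project name, which is used again later.

Once the project skeleton has been created, run tree /f to view the file hierarchy.

The main files are:
scrapy.cfg: the project's configuration file
SinanewsSpider/: the project's Python module; your code is imported from here
SinanewsSpider/items.py: the project's items file
SinanewsSpider/pipelines.py: the project's pipelines file
SinanewsSpider/settings.py: the project's settings file
SinanewsSpider/spiders: the directory that holds the spiders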

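2. Defining the item

Edit items.py. An item is the container that holds the scraped data. It works like a Python dictionary but adds some protection: assigning to a field that was never declared raises an error, which catches misspelled field names. In items.py, Scrapy expects us to define such a container by declaring a class that subclasses scrapy.Item and defining its attributes as scrapy.Field objects, much as you would with an object-relational mapper (ORM). Modelling the item this way controls which news data we collect from the site. Here we want each article's title, content, publication time, image URLs, and page URL, so we declare those five fields. Scrapy already provides the base item; our own item only needs to inherit from scrapy.Item.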

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy
 
class SinanewsspiderItem(scrapy.Item):  # item class, inherits from scrapy.Item
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()     # article title
    content = scrapy.Field()   # article body
    pubtime = scrapy.Field()   # publication time
    imageUrl = scrapy.Field()  # image URLs
    Url = scrapy.Field()       # page URL
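
To make the "dictionary with extra protection" idea concrete, here is a minimal standard-library sketch. It is not Scrapy's implementation: StrictItem and NewsItem are hypothetical names, and the class only imitates the one behaviour described above, namely rejecting keys that were never declared.

```python
class StrictItem(dict):
    """Toy stand-in for scrapy.Item (hypothetical): accepts only declared keys."""

    fields = ()  # subclasses list the allowed field names here

    def __setitem__(self, key, value):
        if key not in self.fields:
            # Reject undeclared keys so that typos fail loudly
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


class NewsItem(StrictItem):
    # Same five field names as SinanewsspiderItem
    fields = ("title", "content", "pubtime", "imageUrl", "Url")


item = NewsItem()
item["title"] = "Example headline"        # declared field: accepted
try:
    item["titel"] = "Example headline"    # misspelled field: rejected
    typo_rejected = False
except KeyError:
    typo_rejected = True

print(dict(item), typo_rejected)
```

The real SinanewsspiderItem behaves the same way in this respect: assigning item['titel'] raises KeyError because titel was not declared as a scrapy.Field.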

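3. Writing the Spider

Create SinanewsSpider.py in the spiders directory. Scrapy already provides a base spider, so we only need to subclass scrapy.Spider and override the parse callback. The spider uses XPath to locate elements in the page. XPath is a path language for XML: path expressions select a node or a set of nodes by following a path made of steps. HTML is not strictly a subset of XML, but Scrapy's selectors parse HTML pages into a node tree, so the same expressions apply. Interested readers can consult the XPath documentation for details.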

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from SinanewsSpider.items import SinanewsspiderItem
# Imports the spider depends on
class SinanewsSpider(scrapy.Spider):
    name = "SinanewsSpider"
    # Listing page where the crawl starts
    start_urls = ["http://roll.news.sina.com.cn/news/gnxw/gdxw1/index.shtml"]

    def parse(self, response):
        # Follow every article link found on the listing page
        for url in response.xpath('//ul/li/a/@href').extract():
            yield scrapy.Request(response.urljoin(url), callback=self.parse_detail)

        # The link inside the second-to-last <span> of the page box is taken as the next page
        nextLink = response.xpath('//div[@class="pagebox"]/span[last()-1]/a/@href').extract_first()
        if nextLink:
            # The href is relative ("./..."), so resolve it against the current page URL
            yield scrapy.Request(response.urljoin(nextLink), callback=self.parse)

    def parse_detail(self, response):
        item = SinanewsspiderItem()
        # extract_first() returns the default instead of raising IndexError when nothing matches
        item['title'] = response.xpath('//h1[@class="main-title"]/text()').extract_first(default='')
        # Concatenate the text of every paragraph in the article body
        item['content'] = ''.join(response.xpath('//div[@id="article"]/p/text()').extract())
        item['pubtime'] = response.xpath('//span[@class="date"]/text()').extract_first(default='')
        # Image URLs are stored in one string, each URL followed by '|'
        item['imageUrl'] = ''.join(
            img + '|'
            for img in response.xpath('//div[@id="article"]/div[@class="img_wrapper"]/img/@src').extract()
        )
        item['Url'] = response.url
        yield item
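
The path expressions in the spider can be tried out without running a crawl. The sketch below is only an approximation: it uses the standard library's xml.etree.ElementTree, which supports a subset of XPath, instead of Scrapy's selectors, and the PAGE markup and LISTING_URL value are made up for illustration. ElementTree cannot select an attribute with /@href, so the attribute is read with .get("href") instead.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Made-up, well-formed listing page (assumption: the real page is laid out similarly)
PAGE = """
<html><body>
  <ul>
    <li><a href="http://news.example.com/a.shtml">Story A</a></li>
    <li><a href="http://news.example.com/b.shtml">Story B</a></li>
  </ul>
  <div class="pagebox">
    <span><a href="./index_1.shtml">previous</a></span>
    <span><a href="./index_3.shtml">next</a></span>
    <span><a href="./index_9.shtml">last</a></span>
  </div>
</body></html>
"""
LISTING_URL = "http://roll.news.sina.com.cn/news/gnxw/gdxw1/index_2.shtml"

root = ET.fromstring(PAGE.strip())

# Step by step: any <ul>, its <li> children, their <a> children
article_links = [a.get("href") for a in root.findall(".//ul/li/a")]

# [@class="pagebox"] filters by attribute; [last()-1] picks the second-to-last <span>
next_anchor = root.find('.//div[@class="pagebox"]/span[last()-1]/a')
next_href = next_anchor.get("href")

# Resolve the relative href against the URL of the page it came from
next_url = urljoin(LISTING_URL, next_href)

print(article_links)
print(next_url)
```

Resolving the relative link with urljoin gives the same kind of absolute URL that the spider needs for its next request, without hard-coding the directory part of the address.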

4. Storing the data

Edit pipelines.py, which stores the data carried by the items in a database.

First create the sinanews database and, in it, the SinaLocalNews table that will hold the scraped news:

mysql> create database sinanews;
mysql> use sinanews;
mysql> CREATE TABLE SinaLocalNews (
    ->   id int(11) NOT NULL AUTO_INCREMENT,
    ->   title VARCHAR(100),
    ->   content  TEXT,
    ->   imageUrl       VARCHAR(2000),
    ->   Url    VARCHAR(1000),
    ->   pubtime  DATETIME,
    ->   PRIMARY KEY (id)
    -> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.96 sec)

Creating the database: sinanews

Creating the data table: SinaLocalNews

Then put the database code in the process_item method. Once the pipeline is enabled, Scrapy calls process_item for every item the spider yields:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
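import MySQLdb  # on Python 3 this module is provided by the mysqlclient package

class SinanewsspiderPipeline(object):
    def open_spider(self, spider):
        # Open one connection when the spider starts
        self.con = MySQLdb.connect(host='localhost', port=3306, user='root', passwd='123456', db='sinanews', charset='utf8')
        self.cur = self.con.cursor()

    def process_item(self, item, spider):
        # Values are passed separately so the driver escapes quotes in the scraped text.
        # The nested replace() calls rewrite the first 16 characters of pubtime from the
        # 年/月/日 form into a string for the DATETIME column.
        sql = ("INSERT INTO SinaLocalNews(title, content, imageUrl, Url, pubtime) "
               "VALUES (%s, %s, %s, %s, "
               "trim(replace(replace(replace(left(%s,16),'年','-'),'月','-'),'日',' ')))")
        self.cur.execute(sql, (item['title'], item['content'], item['imageUrl'], item['Url'], item['pubtime']))
        self.con.commit()
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        self.cur.close()
        self.con.close()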
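
Scraped titles and article text often contain quote characters, which is why the way values reach the INSERT statement matters. The sketch below shows the difference between formatting values into the SQL string and passing them as parameters. It is an illustration only: it uses the standard library's sqlite3 with an in-memory database so that it runs without a MySQL server, the table is cut down to two columns, and the sample title and URL are invented. sqlite3 uses ? as its placeholder, whereas MySQLdb uses %s.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory database
con.execute("CREATE TABLE SinaLocalNews (title TEXT, Url TEXT)")

title = "Mayor says 'no comment'"  # scraped text may contain quotes
url = "http://example.com/news/1.shtml"

# String formatting puts the quote straight into the SQL text and breaks the statement
broken_sql = "INSERT INTO SinaLocalNews(title, Url) VALUES ('%s', '%s')" % (title, url)
try:
    con.execute(broken_sql)
    formatting_failed = False
except sqlite3.OperationalError:
    formatting_failed = True

# Placeholders keep the values out of the SQL text; the driver handles quoting
con.execute("INSERT INTO SinaLocalNews(title, Url) VALUES (?, ?)", (title, url))
con.commit()

rows = con.execute("SELECT title, Url FROM SinaLocalNews").fetchall()
print(formatting_failed, rows)
```

Besides avoiding syntax errors, parameters also prevent text from a scraped page from being interpreted as SQL.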

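In the MySQLdb.connect() call, host is the address of the database server, port is the port the server listens on, user is the database user name, passwd is that user's password, db is the name of the database to connect to, and charset specifies the character set of the target database.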
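
The INSERT statement in the pipeline reshapes pubtime inside SQL with nested replace() calls. The same conversion could instead be done in Python before the insert, as in the sketch below. parse_pubtime is a hypothetical helper that is not part of the original project, and the input format is an assumption about what the page's date element contains.

```python
import re
from datetime import datetime

# Assumed input format: "2018年01月14日 00:39" (whitespace before the time is optional)
PUBTIME_RE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日\s*(\d{1,2}):(\d{2})")


def parse_pubtime(text):
    """Return a datetime for a 年/月/日 timestamp, or None if the text does not match."""
    match = PUBTIME_RE.search(text)
    if match is None:
        return None
    year, month, day, hour, minute = (int(part) for part in match.groups())
    return datetime(year, month, day, hour, minute)


print(parse_pubtime("2018年01月14日 00:39"))
```

A datetime returned by such a helper could be passed to the cursor as the pubtime value, which would keep the SQL statement a plain INSERT. Returning None for unmatched text lets the caller decide whether to skip the item or store a NULL.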

5. Activating the pipeline

Edit settings.py and add the following:

BOT_NAME = 'SinanewsSpider'

SPIDER_MODULES = ['SinanewsSpider.spiders']
NEWSPIDER_MODULE = 'SinanewsSpider.spiders'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'SinanewsSpider.pipelines.SinanewsspiderPipeline': 300,
}
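
The number 300 in ITEM_PIPELINES is an order value: items pass through the enabled pipelines from the lowest value to the highest, and values are customarily chosen in the 0 to 1000 range. The sketch below imitates that ordering in plain Python. It is not how Scrapy loads the setting, and DropEmptyTitlePipeline is a hypothetical second pipeline added only to have something to order.

```python
# Hypothetical settings: the second entry is invented for illustration only
ITEM_PIPELINES = {
    'SinanewsSpider.pipelines.SinanewsspiderPipeline': 300,
    'SinanewsSpider.pipelines.DropEmptyTitlePipeline': 100,  # hypothetical class
}

# Sort by the order value, lowest first, to see the sequence items pass through
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for position, path in enumerate(run_order, start=1):
    print(position, path, ITEM_PIPELINES[path])
```

With these values a filtering pipeline at 100 would see each item before the database pipeline at 300 does, so rejected items would never reach the INSERT.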