Scrapy crawler practice: saving data as TXT, JSON, and MySQL

Reposted article, 2017-07-24 19:13. Tag: crawlers

The site we will crawl today is Huxiu (虎嗅網, huxiu.com).

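We will go through the following steps:

Create a new Scrapy project
Define the Item object you want to extract
Write a spider that crawls a site and extracts all the Item objects
Write an Item Pipeline that stores the extracted Item objects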

Creating the Scrapy project

Run the following commands in any directory:

scrapy startproject coolscrapy

cd coolscrapy 
scrapy genspider huxiu huxiu.com

This creates the project directory structure; news.json and news.txt, which appear there later, are the saved results of the crawl.

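Defining the Item

We create a scrapy.Item subclass and declare its attributes as scrapy.Field. We will scrape the title, link, and summary of each entry in Huxiu's news list, plus the publication time.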

import scrapy


class CoolscrapyItem(scrapy.Item):
    # Fields collected for each news entry
    title = scrapy.Field()     # headline
    link = scrapy.Field()      # article URL
    desc = scrapy.Field()      # short summary
    posttime = scrapy.Field()  # publication time

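Writing the Spider

Spiders are classes you define that Scrapy uses to scrape information from a domain (or a group of domains). A spider class defines the initial list of URLs to download, how to follow links, and how to parse page content to extract Items.

To define a spider, subclass scrapy.Spider and define a few attributes:

name: the spider's name, which must be unique within the project
start_urls: the URLs the crawl starts from
parse(): parses the downloaded Response object, which is the method's only argument. It extracts Items from the page (returning Item objects) and any further URLs to follow (returning Request objects).

Open huxiu.py in the coolscrapy/spiders folder and edit it as follows: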

# -*- coding: utf-8 -*-
import scrapy
from coolscrapy.items import CoolscrapyItem


class HuxiuSpider(scrapy.Spider):
    name = "huxiu"
    allowed_domains = ["huxiu.com"]
    start_urls = ['http://huxiu.com/index.php']

    def parse(self, response):
        items = []
        # Each news entry sits in a div with class "mob-ctt"
        data = response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]')
        for sel in data:
            item = CoolscrapyItem()
            # extract_first() returns the default when nothing matches
            item['title'] = sel.xpath('./h2/a/text()').extract_first(default='No title')
            item['link'] = sel.xpath('./h2/a/@href').extract_first(default='No link')
            item['desc'] = sel.xpath('./div[@class="mob-sub"]/text()').extract_first(default='No description')
            print(item['title'], item['link'], item['desc'])
            items.append(item)
        return items

You can now run the spider from a terminal:

scrapy crawl huxiu

If everything works, it prints every news entry.
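
The "first match or a default" logic in parse() can be tried without network access. The sketch below uses Python's built-in xml.etree.ElementTree as a stand-in for Scrapy selectors, run on made-up markup that only mimics the class names from the spider; the real Huxiu page may look different, and the helper name extract_entries is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample markup that mimics the class names used in the spider;
# the real Huxiu page structure may differ.
SAMPLE = """
<html><body>
<div class="mod-info-flow">
  <div>
    <div class="mob-ctt">
      <h2><a href="/article/1.html">First headline</a></h2>
      <div class="mob-sub">Short summary</div>
    </div>
  </div>
  <div>
    <div class="mob-ctt">
      <div class="mob-sub">Entry without a headline</div>
    </div>
  </div>
</div>
</body></html>
"""


def extract_entries(markup):
    """Return one dict per news block, using defaults for missing parts."""
    root = ET.fromstring(markup)
    entries = []
    for sel in root.findall(".//div[@class='mod-info-flow']/div/div[@class='mob-ctt']"):
        anchor = sel.find("h2/a")  # None when the block has no headline link
        entries.append({
            "title": anchor.text if anchor is not None else "No title",
            "link": anchor.get("href") if anchor is not None else None,
            "desc": sel.findtext("div[@class='mob-sub']", default="No description"),
        })
    return entries


for entry in extract_entries(SAMPLE):
    print(entry)
```

ElementTree only supports a small XPath subset and needs well-formed markup, so it is suitable for this kind of quick check rather than for real pages.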

Following links

To follow each news link and look at the article's details, return a Request object from parse() and register a callback that parses the article page.

Continue editing huxiu.py:

# -*- coding: utf-8 -*-
import scrapy
from coolscrapy.items import CoolscrapyItem


class HuxiuSpider(scrapy.Spider):
    name = "huxiu"
    allowed_domains = ["huxiu.com"]
    start_urls = ['http://huxiu.com/index.php']

    def parse(self, response):
        # Collect the article links from the list page and follow each one
        data = response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]')
        for sel in data:
            link = sel.xpath('./h2/a/@href').extract_first()
            if link is None:
                continue  # entry without a link: nothing to follow
            # Resolve relative links against the page URL
            url = response.urljoin(link)
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        # Parse the article detail page into an Item
        detail = response.xpath('//div[@class="article-wrap"]')
        item = CoolscrapyItem()
        item['title'] = detail.xpath('./h1/text()').extract_first(default='No title').strip()
        item['link'] = response.url
        item['posttime'] = detail.xpath(
            './div/div[@class="column-link-box"]/span[1]/text()').extract_first(default='')
        print(item['title'], item['link'], item['posttime'])
        yield item

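Now parse() only extracts the links of interest and hands the parsing of each linked page to another method. You can build more complex crawlers on this pattern.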
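
The spider calls response.urljoin() because hrefs on a list page are often relative. Scrapy's method resolves them against the page URL much like the standard library's urljoin, so the behavior can be sketched on its own. The article paths below are hypothetical values chosen only for illustration.

```python
from urllib.parse import urljoin

# Hypothetical values for illustration only.
page_url = "http://huxiu.com/index.php"
hrefs = [
    "/article/123.html",                  # root-relative
    "article/456.html",                   # relative to the current directory
    "http://huxiu.com/article/789.html",  # already absolute
]

# Resolve every href against the URL of the page it was found on.
absolute = [urljoin(page_url, href) for href in hrefs]
for url in absolute:
    print(url)
```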

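Exporting the scraped data

The simplest way to save the scraped data is to a local JSON file, by running:

scrapy crawl huxiu -o items.json

When building a real crawler system, however, it is usually better to write your own Item Pipeline.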

Saving data as TXT/JSON/MySQL

1. Save as TXT

Open pipelines.py:

import codecs
import os
import json
import pymysql


# Enable in settings.py with 'coolscrapy.pipelines.CoolscrapyPipeline': 300
class CoolscrapyPipeline(object):
    def process_item(self, item, spider):
        # Write news.txt into the current working directory
        filename = os.path.join(os.getcwd(), 'news.txt')
        # Open the file in append mode and write the item's fields
        with open(filename, 'a', encoding='utf-8') as f:
            f.write(item['title'] + '\n')
            f.write(item['link'] + '\n')
            f.write(item['posttime'] + '\n\n')
        return item
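
The pipeline above reopens news.txt for every item. A possible alternative, sketched here, opens the file once per crawl using the open_spider() and close_spider() hooks that Scrapy calls on item pipelines. The class name TxtPipelineSketch and its path parameter are made up for this example, and a plain dict stands in for the Item so the class can be exercised by hand.

```python
import os
import tempfile


class TxtPipelineSketch:
    """Hypothetical variant: open news.txt once per crawl, not once per item."""

    def __init__(self, path="news.txt"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called by Scrapy when the spider starts.
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Same three fields as the original pipeline, one per line.
        self.file.write("{}\n{}\n{}\n\n".format(
            item["title"], item["link"], item.get("posttime", "")))
        return item


# Exercise the class by hand; a temporary directory keeps the demo isolated.
demo_path = os.path.join(tempfile.mkdtemp(), "news.txt")
pipeline = TxtPipelineSketch(demo_path)
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"title": "Sample", "link": "http://huxiu.com/", "posttime": "2017-07-24"},
    spider=None)
pipeline.close_spider(spider=None)

with open(demo_path, encoding="utf-8") as f:
    demo_text = f.read()
print(demo_text)
```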

2. Save as JSON

Add a new class to pipelines.py:

# Two ways to save items as JSON; enable with
# 'coolscrapy.pipelines.JsonPipeline': 200 in settings.py.
# Both classes share the name JsonPipeline, so keep only one of them in
# pipelines.py; if both are present, the second definition replaces the first.

# Variant 1: open logs.json once and close it when the spider finishes
class JsonPipeline(object):
    def __init__(self):
        self.file = codecs.open('logs.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider() on pipelines when the crawl ends
        self.file.close()


# Variant 2: reopen news.json in append mode for every item
class JsonPipeline(object):
    def process_item(self, item, spider):
        filename = os.path.join(os.getcwd(), 'news.json')
        # ensure_ascii=False keeps non-ASCII text readable; without it,
        # such characters are written as \uXXXX escape sequences
        with codecs.open(filename, 'a', encoding='utf-8') as f:
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            f.write(line)
        return item

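Both variants write one JSON object per line. The first opens its file once and overwrites it on each run; the second reopens its file for every item and appends to it.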
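
The effect of ensure_ascii, and the one-object-per-line layout these pipelines produce, can be checked with the json module alone. This is a small sketch with a made-up item, not output from a real crawl.

```python
import json

# Made-up item for illustration.
item = {"title": "虎嗅網", "link": "http://huxiu.com/"}

escaped = json.dumps(item)                       # default: ensure_ascii=True
readable = json.dumps(item, ensure_ascii=False)  # keeps the original characters

print(escaped)
print(readable)

# The pipelines write one JSON object per line, so the file is read back
# line by line rather than with a single json.load() call.
lines = readable + "\n" + readable + "\n"
items = [json.loads(line) for line in lines.splitlines() if line.strip()]
print(items)
```

Both forms decode to the same data; the difference is only in how readable the file is when opened in an editor.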

3. Save to MySQL

Saving to the database requires a database named newsDB that contains a NEWS table; for details see http://www.cnblogs.com/freeman818/p/7223161.html

Add another class to pipelines.py:

class mysqlPipeline(object):
    def process_item(self, item, spider):
        '''
        Save the scraped item to MySQL.
        '''
        # Connect to the local newsDB database
        db = pymysql.connect(
            host='localhost',    # local database server
            user='root',         # your MySQL user name
            password='123456',   # your MySQL password
            database='newsDB',   # database name
            charset='utf8mb4',   # connection character set
            cursorclass=pymysql.cursors.DictCursor)
        try:
            # The with-block closes the cursor when done
            with db.cursor() as cursor:
                # Parameterized insert: the driver escapes the values
                sql = "INSERT INTO NEWS (title, link, posttime) VALUES (%s, %s, %s)"
                cursor.execute(sql, (item['title'], item['link'], item['posttime']))
            # Commit the change
            db.commit()
        finally:
            # Always close the connection
            db.close()
        return item
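
Scraped titles often contain quote characters, so the insert is safest written as a parameterized query, where the driver handles the escaping. The sketch below shows the idea with the standard library's sqlite3 module, swapped in for MySQL only so it runs without a server. The table columns are an assumed schema, and sqlite3 uses "?" placeholders where PyMySQL uses "%s".

```python
import sqlite3

# sqlite3 is used here only so the sketch runs without a MySQL server.
conn = sqlite3.connect(":memory:")
# Assumed schema: three text columns matching the item fields.
conn.execute("CREATE TABLE NEWS (title TEXT, link TEXT, posttime TEXT)")

# Made-up item whose title contains both kinds of quotes.
item = {
    "title": "It's a \"quoted\" title",
    "link": "http://huxiu.com/",
    "posttime": "2017-07-24",
}

# Values are passed separately from the SQL text, so no manual quoting is needed.
conn.execute(
    "INSERT INTO NEWS (title, link, posttime) VALUES (?, ?, ?)",
    (item["title"], item["link"], item["posttime"]),
)
conn.commit()

rows = conn.execute("SELECT title, link, posttime FROM NEWS").fetchall()
print(rows)
conn.close()
```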

Editing settings.py

Scrapy only runs the pipelines registered in settings.py, so add the ones we wrote to the ITEM_PIPELINES dict. The integer values are up to you; pipelines with lower values run first.

ITEM_PIPELINES = {
    'coolscrapy.pipelines.CoolscrapyPipeline': 300,
    'coolscrapy.pipelines.JsonPipeline': 200,
    'coolscrapy.pipelines.mysqlPipeline': 100,
}
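
To see which order these values imply, the dict can be sorted by its integer values. This is only an illustration of the "lower runs first" rule, not how Scrapy itself loads the setting.

```python
ITEM_PIPELINES = {
    'coolscrapy.pipelines.CoolscrapyPipeline': 300,
    'coolscrapy.pipelines.JsonPipeline': 200,
    'coolscrapy.pipelines.mysqlPipeline': 100,
}

# Lower values run first, so sort the class paths by their integer value.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

With the values above, each item goes to the MySQL pipeline first, then the JSON pipeline, then the TXT pipeline.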

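Now run the crawler:

scrapy crawl huxiu

Then check the results in news.txt, news.json, and the NEWS table.

That's all for this time. Typing the code out yourself is how it gradually becomes familiar.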
