Scrapy Crawler Tutorial by Example (Part 2): Storing Data in MySQL

Reposted article; original published 2016-06-14 08:57. Tags: scrapy, tutorial by example, crawler, Linux, Python

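This article describes in detail how to use Scrapy to crawl every article on 左岸讀書 (zreading.cn) and store the results in a local MySQL database. All steps assume that Scrapy is already set up and that MySQL is installed on the system, with an account that has permission to work with the database.

To avoid confusion, tutorial is used as the Scrapy project name here as well (readers can pick any project name they like).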

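1. Create the tutorial project

scrapy startproject tutorial

When the command finishes, it leaves a tutorial directory (or one with the name you chose). The tree command displays its directory structure; the original post shows the result as a screenshot.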

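2. Examine the structure of the 左岸 articles

左岸讀書 offers its readers a collection of well-written articles; readers who enjoy them can subscribe (a little free advertising for the site owner, you're welcome ^_^).

All articles on the site are presented in lists. Each list entry links to an article and shows its summary and related information (author, publication date, category, view count, and so on). The bottom of each list page links to the next list page; the original post illustrates this with a screenshot.

Clicking an article title opens the page with the full article. Readers can try this for themselves, so it is not covered further here.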

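3. Create the MySQL database

Create a MySQL database named crawed:

create database crawed;
use crawed;

Then create the zreading table in that database. We want to scrape each article's title, author, publication date, category, tags, view count, and content, so the table is defined as follows: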

CREATE TABLE `zreading` (
  `title` varchar(100) NOT NULL,
  `author` varchar(50) NOT NULL,
  `pub_date` varchar(30) DEFAULT NULL,
  `types` varchar(50) DEFAULT NULL,
  `tags` varchar(50) DEFAULT NULL,
  `view_counts` varchar(20) DEFAULT '0',
  `content` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
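
Most of these columns are fixed-width varchar fields, and depending on the server's SQL mode MySQL will either reject or truncate values that are too long. It can therefore help to check a row against the widths before inserting it. The following is a minimal sketch that assumes the widths from the table above; the helper name oversized_fields and the sample row are hypothetical and not part of the original project.

```python
# Column widths copied from the CREATE TABLE statement above.
VARCHAR_LIMITS = {
    'title': 100,
    'author': 50,
    'pub_date': 30,
    'types': 50,
    'tags': 50,
    'view_counts': 20,
}


def oversized_fields(row):
    """Return the names of fields whose text is longer than the column width."""
    return [
        name
        for name, limit in VARCHAR_LIMITS.items()
        if len(str(row.get(name, ''))) > limit
    ]


# Hypothetical sample row, for illustration only.
sample = {'title': 'x' * 120, 'author': 'someone', 'view_counts': 42}
print(oversized_fields(sample))
```

A check like this could run in the pipeline before the insert, so that over-long titles or tag lists are noticed rather than silently cut off.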

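4. Define the content to scrape in items.py

items.py is where you declare what the crawler should collect. You define one or more classes to match your needs, and when the crawler parses a page it builds objects of those item classes according to the parsing rules.

Based on the columns chosen in step 3, we define the following class: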

import scrapy


class TutorialItem(scrapy.Item):
    # One field per column of the zreading table. Note that the item field
    # is view_count while the table column is named view_counts.
    title = scrapy.Field()
    author = scrapy.Field()
    pub_date = scrapy.Field()
    types = scrapy.Field()
    tags = scrapy.Field()
    view_count = scrapy.Field()
    content = scrapy.Field()
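
A Scrapy item behaves much like a dict that only accepts its declared fields; assigning a key that was not declared raises KeyError. This matters here because the item field is view_count while the table column is view_counts. The sketch below imitates that behaviour with a plain dict subclass so it runs without Scrapy installed; the class DeclaredFieldsDict is a hypothetical stand-in, not Scrapy's implementation.

```python
class DeclaredFieldsDict(dict):
    """Stand-in for scrapy.Item: only declared field names may be set."""

    fields = ('title', 'author', 'pub_date', 'types', 'tags',
              'view_count', 'content')

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('unsupported field: %s' % key)
        super().__setitem__(key, value)


item = DeclaredFieldsDict()
item['view_count'] = 42          # declared field, so this works

try:
    item['view_counts'] = 42     # the column name, not an item field
    rejected = False
except KeyError:
    rejected = True

print(dict(item), rejected)
```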

5. Edit pipelines.py

pipelines.py defines how scraped content is stored, for example in MySQL or in a JSON file. Readers can choose whichever fits their needs; this example stores the data in MySQL.

from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors


class TutorialPipeline(object):

    def __init__(self):
        dbargs = dict(
            host='your host',
            db='crawed',
            user='user_name',        # replace with your user name
            passwd='user_password',  # replace with your password
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        # Twisted connection pool: queries run in a thread pool so they
        # do not block the crawler.
        self.dbpool = adbapi.ConnectionPool('MySQLdb', **dbargs)

    def process_item(self, item, spider):
        """Default pipeline hook, called once for every scraped item."""
        deferred = self.dbpool.runInteraction(self.insert_into_table, item)
        # Log failed inserts instead of losing them silently.
        deferred.addErrback(
            lambda failure: spider.logger.error(
                'Insert failed: %s', failure.getErrorMessage()))
        return item

    def insert_into_table(self, conn, item):
        # conn is the cursor-like transaction object supplied by runInteraction.
        conn.execute(
            'insert into zreading(title,author,pub_date,types,tags,view_counts,content) '
            'values(%s,%s,%s,%s,%s,%s,%s)',
            (item['title'], item['author'], item['pub_date'], item['types'],
             item['tags'], item['view_count'], item['content']))
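
The insert relies on placeholders rather than string formatting, so quotes inside a title cannot break the statement. The sketch below shows the same idea using the standard library's sqlite3 module as a stand-in for MySQL, which means it runs without a database server. Note the assumptions: SQLite uses ? as its placeholder where MySQLdb uses %s, the column types are simplified, and the helper insert_item and the sample item are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL server.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE zreading ('
    'title TEXT NOT NULL, author TEXT NOT NULL, pub_date TEXT, '
    'types TEXT, tags TEXT, view_counts TEXT, content TEXT)'
)

# Table columns and the item keys that feed them, in the same order.
# Only view_counts / view_count differ in name.
COLUMNS = ('title', 'author', 'pub_date', 'types', 'tags',
           'view_counts', 'content')
ITEM_KEYS = ('title', 'author', 'pub_date', 'types', 'tags',
             'view_count', 'content')


def insert_item(connection, item):
    """Insert one item using placeholders rather than string formatting."""
    placeholders = ', '.join('?' for _ in COLUMNS)   # MySQLdb would use %s
    sql = 'INSERT INTO zreading (%s) VALUES (%s)' % (
        ', '.join(COLUMNS), placeholders)
    connection.execute(sql, tuple(item[key] for key in ITEM_KEYS))


# Hypothetical item, for illustration only.
item = {'title': "It's a title", 'author': 'someone',
        'pub_date': '2016-06-14', 'types': 'essay', 'tags': 'a+b',
        'view_count': 42, 'content': 'text'}
insert_item(conn, item)
rows = conn.execute('SELECT title, view_counts FROM zreading').fetchall()
print(rows)
```

Only the column names are interpolated into the SQL text, and they come from a fixed tuple in the code; every scraped value travels through a placeholder.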

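6. Enable the pipeline in settings.py

To save scraped content through a pipeline, the pipeline class has to be registered so that Scrapy knows how the items should be stored. Add the following to settings.py: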

ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
}
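
The number attached to each pipeline sets its position in the chain: items pass through pipelines from the lowest value to the highest, and values are conventionally chosen between 0 and 1000. The short sketch below illustrates that ordering with plain Python; the second entry, CleanupPipeline, is made up for comparison and does not exist in this project.

```python
# Hypothetical setting with a second, made-up pipeline added for comparison.
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
    'tutorial.pipelines.CleanupPipeline': 100,   # hypothetical
}

# Items pass through pipelines from the lowest value to the highest.
run_order = [path for path, priority in
             sorted(ITEM_PIPELINES.items(), key=lambda entry: entry[1])]
print(run_order)
```

With a single pipeline, as in this tutorial, the exact value makes no difference.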

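7. Parse the pages and extract the content

With the six steps above, all configuration is done. The focus now is how to pull the content we need out of the pages. This calls for some developer tooling, such as Firebug for Firefox or the developer tools in Chrome; this example uses Chrome's.

In this step we write the actual parsing logic: how to process a page to get the content we need. In the spiders directory, create a file named zreading.py and define the zreadingCrawl spider in it (inheriting from Scrapy's BaseSpider, which newer Scrapy releases call scrapy.Spider):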

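import re
import scrapy
from tutorial.items import TutorialItem


class zreadingCrawl(scrapy.Spider):  # scrapy.Spider is the current name of BaseSpider
    name = "zreading"  # the name of the spider
    allowed_domains = ['zreading.cn']  # domains the spider may visit
    start_urls = [
        'http://www.zreading.cn',  # the start URL, the spider's entry point
    ]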

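The parsing works as follows:

a. Start with the article list. In Chrome's developer tools, right-click an article title, choose Inspect, and copy the element's XPath. That path is what the parser uses to locate the content you need. Here we only want the article links, so it is enough to extract the href attribute of each title link. A list page contains two kinds of links we care about: links to article pages and the link to the next list page. For an article link, we request the URL directly and parse the content; for a list link, we start over (request the list page and parse it in turn).

As this shows, the process is a loop that repeats until there is no next page.

b. Everything extracted during parsing needs some cleanup. Titles, for example, come back with a lot of whitespace before and after the text, and to avoid garbled characters in the database all data is stored as UTF-8 (the original Python 2 code does this with explicit encode calls). This cleanup has to be written into the parsing code.

c. The code is as follows (add these functions to zreadingCrawl):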

    def parse(self, response):
        # Requires `import scrapy` at the top of zreading.py.
        if response.url.endswith('html'):
            # Article page: build one item from it.
            yield self.parsePaperContent(response)
        else:
            # List page: follow every article link on it.
            links = response.xpath('//*[@id="content"]/article/header/h2/a/@href').extract()
            for link in links:
                yield scrapy.Request(response.urljoin(link), callback=self.parse)

            # Then follow the link to the next list page, if there is one.
            next_pages = response.xpath('//*[@id="content"]/div/a[@class="next"]/@href').extract()
            if next_pages:
                yield scrapy.Request(response.urljoin(next_pages[0]), callback=self.parse)
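
The control flow of parse can be tried out without any network access. The sketch below walks a small in-memory site the same way: URLs ending in html are treated as articles, every other URL is a list page whose article links and next link are queued. The site contents, URLs, and the function name crawl are all hypothetical; Scrapy itself schedules requests asynchronously and filters duplicate requests on its own, which the seen set only imitates.

```python
from urllib.parse import urljoin

# Hypothetical in-memory site: each list page holds article links and an
# optional link to the next list page. All URLs are made up.
LIST_PAGES = {
    'http://example.com/': {'articles': ['/1.html', '/2.html'],
                            'next': '/page/2'},
    'http://example.com/page/2': {'articles': ['/3.html'], 'next': None},
}


def crawl(start_url):
    """Follow list pages until there is no next link; collect article URLs."""
    pending = [start_url]
    seen = set()
    articles = []
    while pending:
        url = pending.pop(0)
        if url in seen:
            continue
        seen.add(url)
        if url.endswith('html'):
            # Article page: the spider would turn this into an item.
            articles.append(url)
            continue
        page = LIST_PAGES[url]
        # urljoin resolves each href against the page it was found on.
        pending.extend(urljoin(url, href) for href in page['articles'])
        if page['next']:
            pending.append(urljoin(url, page['next']))
    return articles


print(crawl('http://example.com/'))
```

The loop ends on its own once a list page has no next link, which is the stopping condition described in step a.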

    def parsePaperContent(self, response):
        # Requires `import re` and `from tutorial.items import TutorialItem`
        # at the top of zreading.py.
        self.logger.debug("In parsePaperContent: %s", response.url)
        # The last URL segment looks like '5412.html'; its digits are the
        # page id, which is also the id of the article element.
        page_id = re.match(r'\d+', response.url.split('/')[-1]).group()
        meta = '//*[@id="' + page_id + '"]/div[2]'

        # Instantiate the item.
        zding = TutorialItem()

        # Title. On Python 3 the extracted strings are already Unicode, and a
        # connection opened with charset='utf8' takes care of the encoding,
        # so no explicit .encode() call is made here.
        title = response.xpath("//div[@id='content']/article/header/h2/text()").extract()[0]
        zding['title'] = title.strip()

        # Publication date.
        pub_date = response.xpath(meta + '/span[1]/text()').extract()[0]
        zding['pub_date'] = pub_date.strip()

        # Author.
        author = response.xpath(meta + '/span[2]/a/text()').extract()[0]
        zding['author'] = author.strip()

        # The first link is the category; the remaining links are tags.
        tags = response.xpath(meta + '/a/text()').extract()
        zding['types'] = tags[0]
        zding['tags'] = "+".join(tags[1:])

        # View count: the first run of digits in the text.
        views = response.xpath(meta + '/span[3]/text()').extract()[0]
        zding['view_count'] = int(re.search(r'\d+', views).group())

        # Content: the text of every paragraph, one per line.
        content = response.xpath('//*[@id="' + page_id + '"]/div[3]/p/text()').extract()
        zding['content'] = "\n".join(content)

        return zding
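
The string handling inside parsePaperContent can be checked in isolation, without Scrapy or a live page. The helpers below are a sketch of the same three steps (page id from the URL, view count from a text fragment, category and tags from a list of labels). The function names, the example URL, and the sample strings are hypothetical; unlike the spider code, these helpers return a fallback value instead of raising when nothing matches.

```python
import re


def page_id_from_url(url):
    """Return the leading digits of the last URL segment, or None."""
    match = re.match(r'\d+', url.rsplit('/', 1)[-1])
    return match.group() if match else None


def view_count_from_text(text):
    """Return the first integer found in the text, or 0 if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else 0


def split_tags(labels):
    """Treat the first label as the category; join the rest with '+'."""
    labels = [label.strip() for label in labels]
    category = labels[0] if labels else ''
    return category, '+'.join(labels[1:])


# Made-up inputs, for illustration only.
print(page_id_from_url('http://example.com/5412.html'))
print(view_count_from_text(' Views: 128 '))
print(split_tags([' essay ', 'life', 'reading']))
```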

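8. Run from the command line

scrapy crawl zreading

The parsed pages and the resulting items scroll past on screen. Once the run finishes, check the contents of the zreading table in the database. The articles are long, so no screenshot of the result is included here.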

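*****Statement: this post was written purely out of personal interest, with no ill intent of any kind. I enjoy reading the articles on 左岸 and happened to be learning Scrapy, so I used the site as my example. To be clear, this post is only a technical walkthrough and does not repost any of the site's content.*****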

