Crawling my own blog posts with the Scrapy framework (2)

Reposted from the original post · 2014-05-05 15:14 · Category: Python and web basics

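I previously wrote a post about crawling my own blog posts with the Scrapy framework, but later found that Chinese text was never handled properly - -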

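When printed, a title came out as [u'python\u4e0b\u722c\u67d0\u4e2a\u7f51\u9875\u7684\u56fe\u7247 - huhuuu - \u535a\u5ba2\u56ed'] instead of python下爬某个网页的图片 - huhuuu - 博客园 ("Downloading the images of a web page with Python - huhuuu - cnblogs"). That is obviously not the result we want.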

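So how do we get from a string inside a list to a plain string? Calling str on the list clearly won't do, because printing a list shows the repr of each element, and that is where the \u escapes come from. Instead, walk through the element in the list and pull the text out.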

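def change_word(s):  # print the string held in the list as readable Chinese
    print s  # the list itself: shows the \uXXXX escapes
    ss2 = u''
    for i in range(len(s[0])):  # copy the first element character by character
        ss2 += s[0][i]

    s = ss2
    print s  # the plain string: shows the Chinese characters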
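To make the difference between the list and its element concrete, here is a minimal sketch. The `titles` list is a stand-in for what `extract()` returned above, and the snippet is written so that it also runs on Python 3, which is an assumption on my part since the spider itself targets Python 2.

```python
# Stand-in for the list returned by extract() (sample data copied from the output above).
titles = [u'python\u4e0b\u722c\u67d0\u4e2a\u7f51\u9875\u7684\u56fe\u7247 - huhuuu - \u535a\u5ba2\u56ed']

# Printing `titles` prints the repr of each element (escapes on Python 2).
# Taking the element out of the list gives the text itself.
title = titles[0]

# The character-by-character loop from change_word builds an equal string...
rebuilt = u''
for ch in title:
    rebuilt += ch

# ...and so does joining the list, which also copes with several fragments.
joined = u''.join(titles)
```

So the loop is not strictly needed: indexing with `titles[0]` or joining the list yields the same string.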

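Running it, this seemed to work, but some titles still were not shown as Chinese text. The error message pointed at \u2014 (the em dash character), which does not seem to be well supported, so let's just drop that character.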

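At first I did not understand what counts as one character, and wrote the condition as below. Naturally it had no effect at all:

        # Wrong: looks for the six literal characters \ u 2 0 1 4
        if (s[0][i] == '\\') and (s[0][i+1] == 'u'):
            if (s[0][i+2] == '2') and (s[0][i+3] == '0') and (s[0][i+4] == '1') and (s[0][i+5] == '4'):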

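It turns out that in a Python unicode string a Chinese character and an English character each count as one unit, and \u2014 is a single character as well, so: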

        if (s[0][i] == u'\u2014'):
            continue
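A small sketch of the point above: the escape `\u2014` inside a `u''` literal is one character, while the same text with a doubled backslash is six. The helper name `strip_char` and the sample string are made up for illustration.

```python
dash = u'\u2014'   # one character, written with an escape sequence
raw = u'\\u2014'   # six characters: backslash, 'u', '2', '0', '1', '4'


def strip_char(text, unwanted=u'\u2014'):
    """Return text with every occurrence of the unwanted character removed."""
    return text.replace(unwanted, u'')


sample = u'scrapy \u2014 \u722c\u866b'  # assumed sample title containing U+2014
cleaned = strip_char(sample)
```

`unicode.replace` does the same job as the `continue` inside the loop, without indexing by hand.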

Finally, all the Chinese characters display correctly.
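Dropping the character works, but it throws information away. If the complaint came from the output encoding (my guess, the post does not say), another option is to pick the encoding explicitly. The following is only a sketch of that alternative, not what the spider in this post does. The sample text and the temporary file are made up, and ASCII stands in for whatever limited codec was involved.

```python
import io
import os
import tempfile

title = u'scrapy \u2014 \u722c\u866b'  # assumed sample text containing U+2014

# Option 1: encode for a limited codec and substitute '?' for anything
# it cannot represent, instead of raising UnicodeEncodeError.
ascii_safe = title.encode('ascii', 'replace')

# Option 2: write the text to a file opened with an explicit UTF-8 encoding,
# so every character survives.
path = os.path.join(tempfile.mkdtemp(), 'output.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(title + u'\n')

with io.open(path, 'r', encoding='utf-8') as f:
    read_back = f.read()
```

Option 2 keeps the dash in the saved output. Option 1 is closer in spirit to stripping it, but marks the spot where something was lost.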

The complete spider code:

#!/usr/bin/env python
# coding=utf-8
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from dirbot.items import Website
import sys

sys.stdout = open('output.txt', 'w')  # send everything that is printed to output.txt

add = 0  # number of posts parsed so far


def change_word(s):  # print the string held in the list as readable Chinese
    print s
    ss2 = u''
    for i in range(len(s[0])):
        # skip \u2014, which could not be displayed
        if (s[0][i] == u'\u2014'):
            continue
        ss2 += s[0][i]

    s = ss2
    print s


class DmozSpider(CrawlSpider):

    name = "huhu"
    allowed_domains = ["cnblogs.com"]
    start_urls = [
        "http://www.cnblogs.com/huhuuu",
    ]

    rules = (
        # Extract links matching huhuuu/default.html\?page\=([\w]+) and follow them
        # (no callback means follow defaults to True)
        Rule(SgmlLinkExtractor(allow=(r'huhuuu/default.html\?page\=([\w]+)', ),)),

        # Extract links matching 'huhuuu/p/' and parse them with the spider's parse_item method
        Rule(SgmlLinkExtractor(allow=('huhuuu/p/', )), callback='parse_item'),
        # Some older posts use archive/ URLs, so match those as well
        Rule(SgmlLinkExtractor(allow=('huhuuu/archive/', )), callback='parse_item'),
    )

    def parse_item(self, response):
        global add  # used to count the parsed posts
        print add
        add += 1

        sel = Selector(response)
        items = []

        item = Website()

        # XPath worked out by looking at the page's HTML source
        temp = sel.xpath('/html/head/title/text()').extract()

        item['headTitle'] = temp
        item['url'] = response.url  # store the URL string rather than the response object

        print item['url']
        change_word(temp)

        items.append(item)
        return items
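To check which URLs the three rules pick up without running a crawl, the allow patterns can be tried with the standard `re` module. This is a rough sketch only. It assumes the link extractor applies each pattern with a regex search on the URL, and it ignores `allowed_domains` and everything else the crawler does. The helper `classify` and the example URLs are made up.

```python
import re

# The three allow patterns from the rules above, in the same order.
FOLLOW = re.compile(r'huhuuu/default.html\?page\=([\w]+)')
POST = re.compile(r'huhuuu/p/')
ARCHIVE = re.compile(r'huhuuu/archive/')


def classify(url):
    """Return 'follow', 'parse_item' or None for a URL (hypothetical helper)."""
    if FOLLOW.search(url):
        return 'follow'        # listing page: links are followed, no callback
    if POST.search(url) or ARCHIVE.search(url):
        return 'parse_item'    # blog post: handed to parse_item
    return None                # matched by no rule


# Made-up example URLs in the shape the patterns expect.
examples = [
    'http://www.cnblogs.com/huhuuu/default.html?page=2',
    'http://www.cnblogs.com/huhuuu/p/0000000.html',
    'http://www.cnblogs.com/huhuuu/archive/2013/01/01/0000000.html',
    'http://www.cnblogs.com/huhuuu/',
]
results = [classify(u) for u in examples]
```

Trying patterns like this first makes it easier to spot a URL shape that no rule covers, such as the older archive/ posts.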

Crawl result:

Nearly four hundred blog posts.
