I spent today working through the Scrapy documentation and got the itch to try it out, so I'm writing these notes down as a keepsake and a reference for later. They're casual notes; if you spot something off, go easy on me.
Create the Scrapy project

cd C:\Spider_dev\app\scrapyprojects
scrapy startproject renren
Create the spider

cd renren
scrapy genspider Person renren.com
View the directory structure
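For reference, a freshly generated project normally looks something like this (the Scrapy 1.x layout; the original post showed a screenshot here):

```
renren/
    scrapy.cfg          # deploy configuration
    renren/             # the project's Python package
        __init__.py
        items.py        # item definitions go here
        pipelines.py
        settings.py
        spiders/
            __init__.py
            Person.py   # created by "scrapy genspider Person renren.com"
```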

Define the items

import scrapy

class RenrenItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    sex = scrapy.Field()       # gender
    birthday = scrapy.Field()  # birthday
    addr = scrapy.Field()      # hometown
Write the spider

# -*- coding: gbk -*-
import scrapy
# import the item definitions from items.py
from renren.items import RenrenItem

class PersonSpider(scrapy.Spider):
    name = "Person"
    allowed_domains = ['renren.com']
    start_urls = ['http://www.renren.com/913043576/profile?v=info_timeline']

    def start_requests(self):
        return [scrapy.FormRequest('http://www.renren.com/PLogin.do',
                                   formdata={'email': '15201417639',
                                             'password': 'kongzhagen.com'},
                                   callback=self.login)]

    def login(self, response):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def parse(self, response):
        item = RenrenItem()
        basicInfo = response.xpath('//div[@id="basicInfo"]')
        sex = basicInfo.xpath('div[2]/dl[1]/dd/text()').extract()[0]
        birthday = basicInfo.xpath('div[2]/dl[2]/dd/a/text()').extract()
        birthday = ''.join(birthday)
        addr = basicInfo.xpath('div[2]/dl[3]/dd/text()').extract()[0]
        item['sex'] = sex
        item['addr'] = addr
        item['birthday'] = birthday
        return item
Notes:
allowed_domains: the domains the spider is allowed to crawl.
start_urls: the URLs to visit after logging in to Renren.
start_requests: the spider's entry point. The FormRequest tells Scrapy to POST the form data, the return value is a list (an iterator also works) of requests, and login is the callback.
login: the callback run after logging in. make_requests_from_url builds a request for each URL in start_urls; its default callback is parse. (Newer Scrapy versions deprecate make_requests_from_url; yielding scrapy.Request(url) directly does the same thing.)
parse: processes the responses produced by make_requests_from_url.
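To make the login step more concrete, here is a rough standard-library sketch of the POST that FormRequest issues. This is not how you would do it inside Scrapy; it only builds the request object (without sending it) to show the mechanics, using the same form fields as the spider above:

```python
from urllib.parse import urlencode
from urllib.request import Request

# the form fields FormRequest posts in the spider above
formdata = {'email': '15201417639', 'password': 'kongzhagen.com'}

# attaching a body makes urllib use POST, which mirrors what
# scrapy.FormRequest does with its formdata argument
req = Request('http://www.renren.com/PLogin.do',
              data=urlencode(formdata).encode('utf-8'))

print(req.get_method())  # POST
print(req.data)          # the url-encoded body
```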
Run the spider
scrapy crawl Person -o person.csv
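The -o person.csv flag tells Scrapy to export each returned item as a CSV row. A quick way to inspect the output afterwards is the standard csv module; the sample content below is made up for illustration (the real columns come from RenrenItem, but their order and values depend on the crawl):

```python
import csv
import io

# stand-in for open('person.csv', encoding='utf-8'); values are invented
sample = "sex,birthday,addr\r\n男,1月1日,北京\r\n"

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    print(row['sex'], row['birthday'], row['addr'])
```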
View the results (the original post showed a screenshot of the exported data here):