To study machine learning, deep learning, and text mining, a certain amount of data is needed, and the large volume of data on Sina Weibo makes it a good subject for this project.
I. Environment Setup

1. Scrapy architecture
Components
Scrapy Engine
The engine controls the data flow between all the components of the system and triggers events when certain actions occur.
Scheduler
The scheduler receives requests from the engine and enqueues them, so that it can hand them back when the engine asks for them later.
Downloader
The downloader fetches the page data and feeds it to the engine, which in turn passes it on to the spiders.
Spiders
Spiders are the classes written by Scrapy users to parse responses and extract items (the scraped data) or additional URLs to follow. Each spider handles one specific site (or a few sites).
Item Pipeline
The item pipeline processes the items extracted by the spiders. Typical tasks are cleaning, validation, and persistence (for example, saving them to a database).
Downloader middlewares
Downloader middlewares are specific hooks between the engine and the downloader; they process the responses the downloader passes to the engine. They provide a simple mechanism for extending Scrapy by plugging in custom code. See the Downloader Middleware documentation for details.
Spider middlewares
Spider middlewares are specific hooks between the engine and the spiders; they process the spiders' input (responses) and output (items and requests). They also provide a simple mechanism for extending Scrapy by plugging in custom code. See the Spider Middleware documentation for details.
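To make the hook idea concrete, here is a minimal downloader-middleware sketch: it stamps every outgoing request with a custom header before it reaches the downloader. The class name and header value are invented for this illustration; the project's real middlewares appear in section 5 below.
- class RequestTaggingMiddleware(object):
-     """Illustrative downloader middleware (not part of this project)."""
-     def process_request(self, request, spider):
-         # called for every request on its way from the engine to the downloader
-         request.headers["X-Demo-Header"] = "tagged-by-middleware"
-         # returning None tells Scrapy to keep processing the request as usual
-         return None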
2. Data flow
The data flow in Scrapy is driven by the execution engine and proceeds as follows (a minimal spider sketch follows the list):
1. The engine opens a website (open a domain), finds the spider that handles that site, and asks the spider for the first URL(s) to crawl.
2. The engine receives the first URL to crawl from the spider and schedules it in the Scheduler as a Request.
3. The engine asks the scheduler for the next URL to crawl.
4. The scheduler returns the next URL to crawl, and the engine forwards it to the Downloader through the downloader middlewares (request direction).
5. Once the page has been downloaded, the downloader generates a Response for it and sends it back to the engine through the downloader middlewares (response direction).
6. The engine receives the Response from the downloader and sends it to the spider for processing through the spider middlewares (input direction).
7. The spider processes the Response and returns the scraped Items and new Requests (to follow) to the engine.
8. The engine passes the scraped Items (returned by the spider) to the Item Pipeline and the Requests (returned by the spider) to the scheduler.
9. The process repeats (from step 2) until there are no more requests in the scheduler, and then the engine closes the website.
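To see the cycle in code, here is a minimal self-contained spider sketch. It targets the public practice site quotes.toscrape.com rather than Weibo, and its selectors and item fields are chosen purely for illustration:
- import scrapy
- from scrapy.http import Request
- class QuotesDemoSpider(scrapy.Spider):
-     name = "quotes_demo"
-     start_urls = ["http://quotes.toscrape.com/"]  # step 1: the first URL(s) handed to the engine
-     def parse(self, response):
-         # step 7: turn the Response into items and follow-up Requests
-         for quote in response.css("div.quote"):
-             yield {"text": quote.css("span.text::text").extract_first()}
-         next_page = response.css("li.next a::attr(href)").extract_first()
-         if next_page:
-             # the new Request goes back through the engine to the scheduler (step 8)
-             yield Request(response.urljoin(next_page), callback=self.parse)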
III. Scrapy Project Files

1. Contents of items.py:
- # encoding=utf-8
- from scrapy.item import Item, Field
- class InformationItem(Item):
-     # profile information of the followed users
-     _id = Field()           # user ID
-     Info = Field()          # basic profile information
-     Num_Tweets = Field()    # number of weibo posts
-     Num_Follows = Field()   # number of accounts followed
-     Num_Fans = Field()      # number of fans
-     HomePage = Field()      # home page of the followed user
- class TweetsItem(Item):
-     # information about the weibo posts themselves
-     _id = Field()            # user ID
-     Content = Field()        # post content
-     Time_Location = Field()  # time and location
-     Pic_Url = Field()        # link to the original picture
-     Like = Field()           # number of likes
-     Transfer = Field()       # number of reposts
-     Comment = Field()        # number of comments
Two classes are defined: InformationItem holds the profile information of each user in the follow list, and TweetsItem holds the content of their weibo posts. Items are used like dictionaries, as the short sketch below shows.
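A minimal usage sketch (sample values only; 1761179351 is just an example uid):
- # encoding=utf-8
- from weibo.items import InformationItem, TweetsItem
- info = InformationItem()
- info['_id'] = '1761179351'      # sample user ID
- info['Num_Fans'] = '100'        # sample value
- print dict(info)                # {'Num_Fans': '100', '_id': '1761179351'}
- tweet = TweetsItem()
- tweet['_id'] = '1761179351'
- tweet['Content'] = u'example post text'
- print tweet['Content']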
2. Contents of weibo_spider.py:
- # coding=utf-8
- from scrapy.spider import Spider
- from scrapy.http import Request
- from scrapy.selector import Selector
- from weibo.items import InformationItem, TweetsItem
- import re
- import requests
- from bs4 import BeautifulSoup
- class Weibo(Spider):
-     name = "weibospider"
-     redis_key = 'weibospider:start_urls'
-     # the follow lists of several users can be crawled to collect the followed users' profiles and their weibo posts
-     start_urls = ['http://weibo.cn/0123456789/follow', 'http://weibo.cn/0123456789/follow']
-     # if the follow list is fetched via a user group, parameters such as id and nextlink in parse() need to be adjusted
-     #start_urls = ['http://weibo.cn/attgroup/show?cat=user&currentPage=2&rl=3&next_cursor=20&previous_cursor=10&type=opening&uid=1771329897&gid=201104290187632788&page=1']
-     url = 'http://weibo.cn'
-     #group_url = 'http://weibo.cn/attgroup/show'
-     # IDs that have already been crawled go into Follow_ID beforehand to avoid duplicates
-     Follow_ID = ['0123456789']
-     TweetsID = []
-     def parse(self, response):
-         # profile information of the followed users
-         informationItems = InformationItem()
-         selector = Selector(response)
-         print selector
-         Followlist = selector.xpath('//tr/td[2]/a[2]/@href').extract()
-         print "printing the IDs of the followed users"
-         print len(Followlist)
-         for each in Followlist:
-             # cut the id out of the href string
-             followId = each[(each.index("uid") + 4):(each.index("rl") - 1)]
-             print followId
-             follow_url = "http://weibo.cn/%s" % followId
-             # filter for the posts we need; here: original posts with pictures
-             needed_url = "http://weibo.cn/%s/profile?hasori=1&haspic=1&endtime=20160822&advancedfilter=1&page=1" % followId
-             print follow_url
-             print needed_url
-             # users that have already been crawled are not crawled again:
-             if followId not in self.Follow_ID:
-                 yield Request(url=follow_url, meta={"item": informationItems, "ID": followId, "URL": follow_url}, callback=self.parse1)
-                 yield Request(url=needed_url, callback=self.parse2)
-                 self.Follow_ID.append(followId)
-         nextLink = selector.xpath('//div[@class="pa"]/form/div/a/@href').extract()
-         # look for the next page and keep going while there is one
-         if nextLink:
-             nextLink = nextLink[0]
-             print nextLink
-             yield Request(self.url + nextLink, callback=self.parse)
-         else:
-             # no next page: the whole follow list has been fetched, so print all the IDs
-             print self.Follow_ID
-             #yield informationItems
-     def parse1(self, response):
-         """ fetch a followed user's profile via the ID """
-         # the objects built in parse() are passed in through meta
-         informationItems = response.meta["item"]
-         informationItems['_id'] = response.meta["ID"]
-         informationItems['HomePage'] = response.meta["URL"]
-         selector = Selector(response)
-         #info = ";".join(selector.xpath('//div[@class="ut"]/text()').extract())  # all the text() inside the tag
-         info = selector.xpath('//div[@class="ut"]/span[@class="ctt"]/text()').extract()
-         # join the list elements with '/' to keep the different pieces of information apart
-         allinfo = ' / '.join(info)
-         try:
-             # exceptions.TypeError: expected string or buffer
-             informationItems['Info'] = allinfo
-         except:
-             pass
-         #text2 = selector.xpath('body/div[@class="u"]/div[@class="tip2"]').extract()
-         num_tweets = selector.xpath('body/div[@class="u"]/div[@class="tip2"]/span/text()').extract()   # number of posts
-         num_follows = selector.xpath('body/div[@class="u"]/div[@class="tip2"]/a[1]/text()').extract()  # number followed
-         num_fans = selector.xpath('body/div[@class="u"]/div[@class="tip2"]/a[2]/text()').extract()     # number of fans
-         # take the content between '[' and ']'
-         if num_tweets:
-             informationItems["Num_Tweets"] = (num_tweets[0])[((num_tweets[0]).index("[") + 1):((num_tweets[0]).index("]"))]
-         if num_follows:
-             informationItems["Num_Follows"] = (num_follows[0])[((num_follows[0]).index("[") + 1):((num_follows[0]).index("]"))]
-         if num_fans:
-             informationItems["Num_Fans"] = (num_fans[0])[((num_fans[0]).index("[") + 1):((num_fans[0]).index("]"))]
-         yield informationItems
-     # fetch the weibo posts of the followed users
-     def parse2(self, response):
-         selector = Selector(response)
-         tweetitems = TweetsItem()
-         # passing the ID in through request meta would be more convenient
-         IDhref = selector.xpath('//div[@class="u"]/div[@class="tip2"]/a[1]/@href').extract()
-         ID = (IDhref[0])[1:11]
-         Tweets = selector.xpath('//div[@class="c"]')
-         # slightly different from parse1: walk over the candidate nodes with a for loop
-         for eachtweet in Tweets:
-             # the unique id of each weibo post
-             mark_id = eachtweet.xpath('@id').extract()
-             print mark_id
-             # only posts with a non-empty id are collected
-             if mark_id:
-                 # de-duplication: posts that have already been fetched are skipped
-                 if mark_id not in self.TweetsID:
-                     content = eachtweet.xpath('div/span[@class="ctt"]/text()').extract()
-                     timelocation = eachtweet.xpath('div[2]/span[@class="ct"]/text()').extract()
-                     pic_url = eachtweet.xpath('div[2]/a[2]/@href').extract()
-                     like = eachtweet.xpath('div[2]/a[3]/text()').extract()
-                     transfer = eachtweet.xpath('div[2]/a[4]/text()').extract()
-                     comment = eachtweet.xpath('div[2]/a[5]/text()').extract()
-                     tweetitems['_id'] = ID
-                     # join the list elements into a single string
-                     allcontents = ''.join(content)
-                     # the content may be empty, so check it first
-                     if allcontents:
-                         tweetitems['Content'] = allcontents
-                     else:
-                         pass
-                     if timelocation:
-                         tweetitems['Time_Location'] = timelocation[0]
-                     if pic_url:
-                         tweetitems['Pic_Url'] = pic_url[0]
-                     # take the content between '[' and ']'
-                     if like:
-                         tweetitems['Like'] = (like[0])[((like[0]).index("[") + 1):((like[0]).index("]"))]
-                     if transfer:
-                         tweetitems['Transfer'] = (transfer[0])[((transfer[0]).index("[") + 1):((transfer[0]).index("]"))]
-                     if comment:
-                         tweetitems['Comment'] = (comment[0])[((comment[0]).index("[") + 1):((comment[0]).index("]"))]
-                     # remember the ids of posts that have already been fetched
-                     self.TweetsID.append(mark_id)
-                     yield tweetitems
-             else:
-                 # if the selector cannot find an id, print the current node to see what is going on
-                 print eachtweet
-         tweet_nextLink = selector.xpath('//div[@class="pa"]/form/div/a/@href').extract()
-         if tweet_nextLink:
-             tweet_nextLink = tweet_nextLink[0]
-             print tweet_nextLink
-             yield Request(self.url + tweet_nextLink, callback=self.parse2)
Every Weibo user has a unique identifier (uid), and this uid is the key to fetching a target user's data. Change the ID in start_urls (0123456789) to the user you are interested in, for example Liujishou's ID (1761179351); in other words, point the addresses at the follow-list pages of the users whose data you want. Several users' follow lists can also be crawled in a distributed fashion through redis_key, as sketched below. There is too much content to walk through line by line, so feel free to leave a comment if any of the code is unclear. I wrote this by imitating other people's frameworks and I am not formally trained, so the code is rather rough; pointers from more experienced developers are very welcome.
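For the distributed variant, the spider already defines redis_key = 'weibospider:start_urls'. Assuming the scrapy-redis extension is installed and the spider inherits from its RedisSpider, start URLs can be seeded into that Redis list roughly like this (a sketch of the idea, not the project's tested setup; 1761179351 is only an example uid):
- # seed start URLs for scrapy-redis (sketch; assumes a Redis server on localhost)
- import redis
- r = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)
- # one follow-list URL per target user
- r.lpush('weibospider:start_urls', 'http://weibo.cn/1761179351/follow')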
3. Getting cookies to simulate a Weibo login:
- # encoding=utf-8
- import requests
- from selenium import webdriver
- import time
- from PIL import Image
- import urllib2
- from bs4 import BeautifulSoup
- import re
- import urllib
- # use several accounts to reduce the risk of being blocked
- myAccount = [
-     {'no': 'XXXXXXXXXX', 'psw': 'XXXXXXXXX'},
-     {'no': 'XXXXXXXX', 'psw': 'XXXXXXX'},
-     {'no': 'XXXXXX', 'psw': 'XXXXXXX'}
- ]
- headers = {
-     "Host": "login.weibo.cn",
-     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
-     "Accept": 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
-     "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
-     "Accept-Encoding": "gzip, deflate",
-     "Connection": "keep-alive"
- }
- # fetch the captcha and the other fields needed for the login form
- def get_captchainfo(loginURL):
-     html = requests.get(loginURL).content
-     bs = BeautifulSoup(html)
-     #print bs
-     # note: bs.select returns a list of matching elements
-     password_name = (bs.select('input[type="password"]'))[0].get('name')
-     vk = (bs.select('input[name="vk"]'))[0].get('value')
-     capId = (bs.select('input[name="capId"]'))[0].get('value')
-     print password_name, vk, capId
-     try:
-         captcha_img = bs.find("img", src=re.compile('http://weibo.cn/interface/f/ttt/captcha/')).get('src')
-         print captcha_img
-         # the captcha id could also be cut straight out of the captcha image URL
-         urllib.urlretrieve(captcha_img, 'captcha.jpg')
-         print "captcha download success!"
-         captcha_input = raw_input("please input the captcha\n>")
-     except:
-         return None
-     return (captcha_input, password_name, vk, capId)
- def getCookies(weibo):
-     """ fetch the cookies for each account """
-     cookies = []
-     loginURL = 'http://login.weibo.cn/login/'
-     for elem in weibo:
-         account = elem['no']
-         password = elem['psw']
-         captcha = get_captchainfo(loginURL)
-         if captcha is None:
-             # form used when no captcha is required; the mobile site always asks for one, so this branch rarely matters
-             postData = {
-                 "source": "None",
-                 "redir": "http://weibo.cn/",
-                 "mobile": account,
-                 "password": password,
-                 "login": "登錄",  # literal form value expected by the site
-             }
-         else:
-             # form used when a captcha is required
-             print "submitting the login form"
-             postData = {
-                 "mobile": account,
-                 captcha[1]: password,
-                 "code": captcha[0],
-                 "remember": "on",
-                 "backurl": "http://weibo.cn/",
-                 "backtitle": u'微博',
-                 "tryCount": "",
-                 "vk": captcha[2],
-                 "capId": captcha[3],
-                 "submit": u'登錄',  # literal form value expected by the site
-             }
-         print postData
-         session = requests.Session()
-         r = session.post(loginURL, data=postData, headers=headers)
-         # check whether the post redirected to the logged-in page
-         #time.sleep(2)
-         print r.url
-         if r.url in ('http://weibo.cn/?PHPSESSID=&vt=1', 'http://weibo.cn/?PHPSESSID=&vt=4'):
-             ceshihtml = requests.get(r.url).content
-             print ceshihtml
-             print 'Login successfully!!!'
-             cookie = session.cookies.get_dict()
-             cookies.append(cookie)
-         else:
-             print "login failed!"
-     return cookies
- '''
- # get the cookies with a selenium webdriver instead
- def getcookieByDriver(weibo):
-     driver = webdriver.Firefox()
-     driver.maximize_window()
-     cookies = []
-     for elem in weibo:
-         account = elem['no']
-         password = elem['psw']
-         driver.get("http://login.weibo.cn/login/")
-         elem_user = driver.find_element_by_name("mobile")
-         elem_user.send_keys(account)  # user name
-         # the name of Weibo's password field carries a suffix
-         elem_pwd = driver.find_element_by_name("password_XXXX")
-         elem_pwd.send_keys(password)  # password
-         time.sleep(10)
-         # time to type the captcha by hand
-         elem_sub = driver.find_element_by_name("submit")
-         elem_sub.click()  # click the login button
-         time.sleep(2)
-         weibo_cookies = driver.get_cookies()
-         #cookie = [item["name"] + "=" + item["value"] for item in douban_cookies]
-         #cookiestr = '; '.join(item for item in cookie)
-         cookies.append(weibo_cookies)
-     return cookies
- '''
- cookies = getCookies(myAccount)
- #cookies = getcookieByDriver(myAccount)
- print "Get Cookies Finish!( Num:%d)" % len(cookies)
Enter your own Weibo account numbers and passwords in myAccount and you can log in to Weibo programmatically. The cookies collected here are consumed by the CookiesMiddleware in section 5; one way to hand them over is sketched below.
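The middleware does `from cookies import cookies`, so the cookie dictionaries returned by getCookies need to end up in an importable `cookies` list. A simple approach (my assumption; the file name weibo_cookies.json is made up and this is not necessarily how the original project did it) is to add `json.dump(cookies, open('weibo_cookies.json', 'w'))` at the end of the login script and then load the result in a small cookies.py module:
- # cookies.py (sketch): load the cookie dicts saved by the login script
- import json
- with open('weibo_cookies.json') as f:
-     cookies = json.load(f)  # a list of cookie dicts, one per logged-in account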
4. Storing the data in MySQL with an item pipeline:
- # -*- coding: utf-8 -*-
- import MySQLdb
- from items import InformationItem, TweetsItem
- DEBUG = True
- if DEBUG:
-     dbuser = 'root'
-     dbpass = '123456'
-     dbname = 'tweetinfo'
-     dbhost = '127.0.0.1'
-     dbport = '3306'
- else:
-     dbuser = 'XXXXXXXX'
-     dbpass = 'XXXXXXX'
-     dbname = 'tweetinfo'
-     dbhost = '127.0.0.1'
-     dbport = '3306'
- class MySQLStorePipeline(object):
-     def __init__(self):
-         self.conn = MySQLdb.connect(user=dbuser, passwd=dbpass, db=dbname, host=dbhost, charset="utf8",
-                                     use_unicode=True)
-         self.cursor = self.conn.cursor()
-         # the tables that hold the data must already exist
-         # empty the tables (testing stage only):
-         self.cursor.execute("truncate table followinfo;")
-         self.conn.commit()
-         self.cursor.execute("truncate table tweets;")
-         self.conn.commit()
-     def process_item(self, item, spider):
-         #curTime = datetime.datetime.now()
-         if isinstance(item, InformationItem):
-             print "writing follower information"
-             try:
-                 self.cursor.execute("""INSERT INTO followinfo (id, Info, Num_Tweets, Num_Follows, Num_Fans, HomePage)
-                                     VALUES (%s, %s, %s, %s, %s, %s)""",
-                                     (
-                                         item['_id'].encode('utf-8'),
-                                         item['Info'].encode('utf-8'),
-                                         item['Num_Tweets'].encode('utf-8'),
-                                         item['Num_Follows'].encode('utf-8'),
-                                         item['Num_Fans'].encode('utf-8'),
-                                         item['HomePage'].encode('utf-8'),
-                                     ))
-                 self.conn.commit()
-             except MySQLdb.Error, e:
-                 print "Error %d: %s" % (e.args[0], e.args[1])
-         elif isinstance(item, TweetsItem):
-             print "writing weibo posts"
-             try:
-                 self.cursor.execute("""INSERT INTO tweets (id, Contents, Time_Location, Pic_Url, Zan, Transfer, Comment)
-                                     VALUES (%s, %s, %s, %s, %s, %s, %s)""",
-                                     (
-                                         item['_id'].encode('utf-8'),
-                                         item['Content'].encode('utf-8'),
-                                         item['Time_Location'].encode('utf-8'),
-                                         item['Pic_Url'].encode('utf-8'),
-                                         item['Like'].encode('utf-8'),
-                                         item['Transfer'].encode('utf-8'),
-                                         item['Comment'].encode('utf-8')
-                                     ))
-                 self.conn.commit()
-             except MySQLdb.Error, e:
-                 print "an error occurred"
-                 print "Error %d: %s" % (e.args[0], e.args[1])
-         return item
Once MySQL is set up, fill in your own user name and password and the scraped data will be written to the database. First create the database and the table followinfo:
- CREATE DATABASE tweetinfo DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

- CREATE TABLE `followinfo` (
- `No` INT(11) NOT NULL AUTO_INCREMENT,
- `id` VARCHAR(50) NULL DEFAULT NULL,
- `Info` VARCHAR(100) NOT NULL,
- `Num_Tweets` INT(10) NOT NULL,
- `Num_Follows` INT(10) NOT NULL,
- `Num_Fans` INT(10) NOT NULL,
- `HomePage` VARCHAR(50) NOT NULL,
- PRIMARY KEY (`No`)
- )
- COLLATE='utf8_general_ci'
- ENGINE=MyISAM
- AUTO_INCREMENT=5
- ;

Create the table tweets:
- CREATE TABLE `tweets` (
- `No` INT(11) NOT NULL AUTO_INCREMENT,
- `id` VARCHAR(20) NOT NULL,
- `Contents` VARCHAR(300) NULL DEFAULT NULL,
- `Time_Location` VARCHAR(50) NOT NULL,
- `Pic_Url` VARCHAR(100) NULL DEFAULT NULL,
- `Zan` INT(10) NOT NULL,
- `Transfer` INT(10) NOT NULL,
- `Comment` INT(10) NOT NULL,
- PRIMARY KEY (`No`)
- )
- COLLATE='utf8_general_ci'
- ENGINE=MyISAM
- AUTO_INCREMENT=944
- ;
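After a crawl finishes, a quick way to confirm that the pipeline actually wrote rows is a small MySQLdb query, sketched here with the DEBUG credentials from the pipeline above:
- # sanity check of the stored data (sketch; uses the DEBUG credentials above)
- import MySQLdb
- conn = MySQLdb.connect(user='root', passwd='123456', db='tweetinfo', host='127.0.0.1', charset='utf8')
- cur = conn.cursor()
- cur.execute("SELECT COUNT(*) FROM followinfo")
- print "followinfo rows:", cur.fetchone()[0]
- cur.execute("SELECT COUNT(*) FROM tweets")
- print "tweets rows:", cur.fetchone()[0]
- conn.close()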
5. Middleware components:
- # encoding=utf-8
- import random
- from cookies import cookies
- from user_agents import agents
- class UserAgentMiddleware(object):
-     """ rotate the User-Agent """
-     def process_request(self, request, spider):
-         agent = random.choice(agents)
-         request.headers["User-Agent"] = agent
- class CookiesMiddleware(object):
-     """ rotate the cookies """
-     def process_request(self, request, spider):
-         cookie = random.choice(cookies)
-         request.cookies = cookie
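The middleware imports `agents` from a user_agents module that is not shown in the post. Judging from how it is used, it is simply a list of User-Agent strings; a minimal stand-in (my assumption) could reuse the two User-Agent strings that already appear in this project:
- # user_agents.py (sketch): a plain list of User-Agent strings for UserAgentMiddleware
- agents = [
-     "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
-     "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5",
- ]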
6. Related settings:
- # coding=utf-8
- BOT_NAME = 'weibo'
- SPIDER_MODULES = ['weibo.spiders']
- NEWSPIDER_MODULE = 'weibo.spiders'
- USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
- '''
- # export the data to a CSV file at the given path instead
- FEED_URI = u'file:///G:/MovieData/followinfo.csv'
- FEED_FORMAT = 'CSV'
- '''
- DOWNLOADER_MIDDLEWARES = {
-     "weibo.middleware.UserAgentMiddleware": 401,
-     "weibo.middleware.CookiesMiddleware": 402,
- }
- ITEM_PIPELINES = {
-     #'weather2.pipelines.Weather2Pipeline': 300,
-     'weibo.pipelines.MySQLStorePipeline': 300,
- }
- DOWNLOAD_DELAY = 2  # delay between downloads
- # Crawl responsibly by identifying yourself (and your website) on the user-agent
- #USER_AGENT = 'doubanmovie (+http://www.yourdomain.com)'
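With these settings in place, the crawl is started from the project root with the standard command `scrapy crawl weibospider` (the name defined in weibo_spider.py); requests then pass through the two middlewares above and the scraped items end up in MySQL via MySQLStorePipeline.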
