Scraping Weibo Post Content

Reposted article, 2018-01-31 16:23. Tags: crawler / Python crawler / Weibo / Python

Having obtained the list of Weibo users, we can now scrape the content of each user's home page.

Environment

Tools

1. Chrome and its Developer Tools

2. Python 3.6

3. PyCharm

Libraries used (Python 3.6)

import urllib.error
import urllib.request
import urllib.parse
import urllib
import json
import pandas as pd
import time
import random
import re
from datetime import datetime
from lxml import etree

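Deciding which fields to scrape

First, we simply browse a user's home page, click "All posts" (全部微博), and note the information that is available:

User id
Post id
Post time
Post content
Posting client (platform)
Comment count
Like count
Repost count
Original post id
Original post user id
Original post user name
Original post content
Original post comment count
Original post like count
Original post repost count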

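Next, we use Chrome's Developer Tools to inspect what the user's home page actually exposes. When a reposted post is too long, its content cannot be scraped directly from the home page; we have to follow the link to that post and scrape the original.

We therefore scrape the URL of the original post, and obtain its full content by parsing the page behind that URL.

Taking all of this into account, the final set of fields is: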

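User id: uid
Post id: mid
Post time: time
Posting client: app_source
Post content: content
Comment, like, and repost counts: others
Post URL: url
Whether the post is a repost: is_repost
Original post id: rootmid
Original post user id: rootuid
Original post user name: rootname
Original post URL: rooturl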
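The field list above can be written down as a small record type. This is only an illustrative sketch: the class name WeiboPost and the sample values are hypothetical, and the three counters grouped under "others" are split into repost_num, comment_num, and like_num, which is how the full script later in this article stores them.

```python
from dataclasses import dataclass, asdict


@dataclass
class WeiboPost:
    """One scraped post; field names follow the list above (class name is hypothetical)."""
    uid: str                 # user id
    mid: str                 # post id
    time: str                # post time, e.g. "2018-01-31 16:23"
    content: str             # post text
    url: str                 # post URL
    app_source: str = ""     # posting client; some posts do not show it
    repost_num: str = ""     # the three "others" counters
    comment_num: str = ""
    like_num: str = ""
    is_repost: int = 0       # 1 if the post is a repost
    rootmid: str = ""        # original post id (reposts only)
    rootuid: str = ""        # original post user id
    rootname: str = ""       # original post user name
    rooturl: str = ""        # original post URL


# Made-up sample values, purely for illustration.
example = WeiboPost(
    uid="1956890840",
    mid="0000000000000000",
    time="2018-01-31 16:23",
    content="sample text",
    url="https://weibo.com/",
)
row = asdict(example)  # a plain dict, convenient for building a DataFrame row
```

Giving the repost-only fields empty defaults mirrors the way the script fills in missing values with empty strings.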

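Capturing the lazy-load requests

The hardest part of scraping a user's posts is dealing with how the page loads. Weibo needs two additional loads before the full content of a page is shown and the link to the next page appears, so capturing the requests behind those loads is the most important part of the work.

Here we rely on Chrome's Developer Tools to capture the requests issued while the page loads.

During the load, a request of type xhr shows up that looks most like the one we need: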

https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100505&profile_ftype=1&is_all=1&pagebar=0&pl_name=Pl_Official_MyProfileFeed__22&id=1005051956890840&script_uri=/p/1005051956890840/home&feed_type=0&page=1&pre_page=1&domain_op=100505&__rnd=1517384223025

Triggering the load once more as a test, the request shows up again:

https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100505&profile_ftype=1&is_all=1&pagebar=1&pl_name=Pl_Official_MyProfileFeed__22&id=1005051956890840&script_uri=/p/1005051956890840/home&feed_type=0&page=1&pre_page=1&domain_op=100505&__rnd=1517384677638

Seeing pl_name=Pl_Official_MyProfileFeed__22 makes it almost certain. To be safe, we open the link ourselves, and it does indeed return the familiar feed content.

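Looking at this URL closely, we find:

is_all is a page attribute meaning "all posts"
page and pre_page both denote the page number
id is the concatenation of domain (100505) and the user id (uid)
script_uri is the path of the current user's home page
pagebar most plausibly identifies the lazy-loaded part: 0 for the first, 1 for the second
__rnd is a timestamp and can be omitted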
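The breakdown above can be checked by splitting the captured URL with the standard library. This is a minimal sketch: the helper name describe_load_url is hypothetical, and the meaning attached to each field is the interpretation given in this article, not documented Weibo behaviour.

```python
from urllib.parse import urlsplit, parse_qs

# The first captured load request from above, split only for readability.
LOAD_URL = (
    "https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100505"
    "&profile_ftype=1&is_all=1&pagebar=0&pl_name=Pl_Official_MyProfileFeed__22"
    "&id=1005051956890840&script_uri=/p/1005051956890840/home&feed_type=0"
    "&page=1&pre_page=1&domain_op=100505&__rnd=1517384223025"
)


def describe_load_url(url):
    """Return the query fields discussed above as a flat dict (hypothetical helper)."""
    query = parse_qs(urlsplit(url).query)
    params = {key: values[0] for key, values in query.items()}
    domain = params["domain"]
    page_id = params["id"]
    # id is read here as domain + uid, so strip the domain prefix to recover the uid.
    uid = page_id[len(domain):] if page_id.startswith(domain) else None
    return {
        "uid": uid,
        "page": params["page"],
        "pre_page": params.get("pre_page"),
        "pagebar": params.get("pagebar"),
        "script_uri": params["script_uri"],
    }


info = describe_load_url(LOAD_URL)
print(info)
```

Running the same helper on the second captured URL should differ only in pagebar, which is what suggests that pagebar selects the lazy-loaded part.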

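The initial page request carries neither pre_page nor pagebar. If we remove these two fields and request the URL, the response contains the user's posts as shown before any lazy loading.

Each page of a user's home page can therefore be split into three parts and fetched through three URLs, which together give the content of the whole page.

The code is as follows: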

# Build the base URL (everything up to the page number)
def getBeginURL(self):
    begin_url = 'https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100505&is_all=1&id=100505' + str(self.uid) + \
                '&script_uri=/u/' + str(self.uid) + '&domain_op=100505&page='
    return begin_url

# Build the URL for one part of a page and return its HTML content
def getHTML(self, page_num, extend_part=''):
    # extend_part carries the extra fields needed for the lazy-loaded parts
    url = self.getBeginURL() + str(page_num) + extend_part
    data = urllib.request.urlopen(url).read().decode('utf-8')
    html = json.loads(data)['data']
    return html

# The loop below runs inside another method of the same class,
# with i holding the current page number
for x in range(3):
    if x == 0:  # initial part of the page
        extend_part = ''
    else:  # x == 1 and x == 2: first and second lazy-loaded parts (pagebar 0 and 1)
        extend_part = '&pre_page=' + str(i) + '&pagebar=' + str(x - 1)
    html = self.getHTML(i, extend_part)
    page = etree.HTML(html)
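The three-part split can also be expressed as a pure function that only builds the URLs, which makes it easy to inspect them before sending any request. This is a sketch under assumptions: build_page_urls is a hypothetical helper, and the set of query fields is the reduced one used in the code above, which may not be accepted by Weibo indefinitely.

```python
from urllib.parse import urlencode

BASE = "https://weibo.com/p/aj/v6/mblog/mbloglist"


def build_page_urls(uid, page, domain="100505"):
    """Return the three URLs that together cover one page of a user's posts."""
    common = {
        "ajwvr": "6",
        "domain": domain,
        "is_all": "1",
        "id": domain + str(uid),
        "script_uri": "/u/" + str(uid),
        "domain_op": domain,
        "page": str(page),
    }
    # Part 1: the initial content, without pre_page and pagebar.
    urls = [BASE + "?" + urlencode(common, safe="/")]
    # Parts 2 and 3: the two lazy-loaded parts, pagebar = 0 and pagebar = 1.
    for pagebar in (0, 1):
        extra = dict(common, pre_page=str(page), pagebar=str(pagebar))
        urls.append(BASE + "?" + urlencode(extra, safe="/"))
    return urls


urls = build_page_urls("1956890840", 3)
for u in urls:
    print(u)
```

Keeping URL construction separate from fetching means the request logic can be tested without network access.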

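With that, the biggest headache is solved.

I struggled with the lazy-load requests for several days while writing this code, only to find that Chrome's Developer Tools are far more powerful than I had assumed. Once the pattern behind the load requests was clear, everything else fell into place and the rest of the scraping code came together quickly.

My code is below for reference.

So that the data can be updated incrementally, I added a start date for the posts to scrape. This avoids re-scraping the same posts, which is what happens when every run crawls a fixed number of pages.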
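The start-date cutoff can be illustrated in isolation. This is a minimal sketch: the function names are hypothetical, the time format "%Y-%m-%d %H:%M" is the one the script below parses from each post's title attribute, and posts are assumed to arrive newest first, so that crawling can stop at the first post older than the start date.

```python
from datetime import datetime


def is_older_than_start(post_time, begin_date):
    """True if the post was published before begin_date (None or '' means no cutoff)."""
    if not begin_date:
        return False
    post_day = datetime.strptime(post_time, "%Y-%m-%d %H:%M").date()
    begin_day = datetime.strptime(begin_date, "%Y-%m-%d").date()
    return post_day < begin_day


def take_until_cutoff(post_times, begin_date):
    """Keep posts in order until the first one older than begin_date, then stop."""
    kept = []
    for post_time in post_times:
        if is_older_than_start(post_time, begin_date):
            break
        kept.append(post_time)
    return kept


times = ["2018-02-01 10:00", "2018-01-31 09:00", "2018-01-29 08:00", "2018-02-02 00:00"]
kept = take_until_cutoff(times, "2018-01-31")
print(kept)
```

One caveat of stopping at the first older post is that a pinned old post at the top of a profile would end the crawl early; the sketch does not handle that case.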

There is still plenty of room for improvement in the code, and feedback is welcome.

import urllib.error
import urllib.request
import urllib.parse
import json
import pandas as pd
import time
import random
import re
from datetime import datetime
from lxml import etree

class getWeiboContent():
    """
    Scrapes Weibo post content:
    mid
    time
    app_source
    content
    url
    others(repost, like, comment)
    is_repost
    rootmid
    rootname
    rootuid
    rooturl
    """
    def __init__(self, uid, begin_date=None, begin_page=1, interval=None, flag=True):
        self.uid = uid  # Weibo user ID
        self.begin_page = begin_page  # first page to crawl
        self.interval = interval  # number of pages to crawl, None by default
        self.begin_date = begin_date  # earliest publication date to crawl, None by default
        self.flag = flag  # set to False to stop crawling

    # Build the base URL (everything up to the page number)
    def getBeginURL(self):
        begin_url = 'https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100505&is_all=1&id=100505' + str(self.uid) + \
                    '&script_uri=/u/' + str(self.uid) + '&domain_op=100505&page='
        return begin_url

    # Build the URL for one part of a page and return its HTML content
    def getHTML(self, page_num, extend_part=''):
        url = self.getBeginURL() + str(page_num) + extend_part
        print(url)
        data = urllib.request.urlopen(url).read().decode('utf-8')
        html = json.loads(data)['data']
        return html

    # Extract the fields of a single post and return them as a dict
    def getContent(self, node):
        dic = {}
        dic['mid'] = node.xpath('./@mid')[0]
        print('mid:' + dic['mid'])
        dic['time'] = node.xpath('.//div[@class="WB_from S_txt2"]/a[1]/@title')[0]
        app_source = node.xpath('.//div[@class="WB_from S_txt2"]/a[2]/text()')
        if len(app_source) != 0:  # some posts do not show the client
            dic['app_source'] = app_source[0]
        content = node.xpath('./*/*/div[@class="WB_text W_f14"]')[0].xpath('string(.)')
        dic['content'] = re.compile(r'\n\s*(.*)').findall(content)[0]
        others = node.xpath('.//ul[@class="WB_row_line WB_row_r4 clearfix S_line2"]//span[@class="line S_line1"]/span/em[2]/text()')
        dic['repost_num'] = others[1]
        dic['comment_num'] = others[2]
        dic['like_num'] = others[3]
        detail_info = node.xpath('./div[@class="WB_feed_handle"]/div/ul/li[2]/a/@action-data')[0]
        dic['url'] = re.compile('&url=(.*?)&').findall(detail_info)[0]
        rootmid = node.xpath('./@omid')
        # An omid attribute means the post is a repost
        if len(rootmid) != 0:
            dic['is_repost'] = 1
            dic['rootmid'] = rootmid[0]
            weibo_expend = node.xpath('./*/*/div[@class="WB_feed_expand"]')[0]
            rootname = weibo_expend.xpath('./*/*/a[@class="W_fb S_txt1"]/@nick-name')
            # If the original post has been deleted, there is no author name
            if len(rootname) != 0:
                dic['rootuid'] = re.compile('rootuid=(.*?)&').findall(detail_info)[0]
                dic['rootname'] = re.compile('rootname=(.*?)&').findall(detail_info)[0]
                dic['rooturl'] = re.compile('rooturl=(.*?)&').findall(detail_info)[0]

        return dic

    # Crawl the posts and return them as a DataFrame
    def getWeiboInfo(self):
        i = self.begin_page
        # Check whether a page count was given
        if self.interval is None:
            # No page count: keep following the "next page" link
            hasMore = True
            end_page = None
        else:
            # A page count takes priority; end_page is the last page to crawl
            end_page = self.begin_page + self.interval - 1
            hasMore = False
        # Parse the start date once; None or '' means no date limit
        begin_dt = None
        if self.begin_date:
            begin_dt = datetime.strptime(self.begin_date, '%Y-%m-%d').date()
        columns = ['mid', 'time', 'content', 'app_source', 'url',
                   'repost_num', 'comment_num', 'like_num', 'is_repost',
                   'rootmid', 'rootname', 'rootuid', 'rooturl']
        # DataFrame that accumulates the results
        weibo_df = pd.DataFrame()
        while (hasMore or i <= end_page) and self.flag:
            for x in range(3):
                if x == 0:  # initial part of the page
                    extend_part = ''
                else:  # first and second lazy-loaded parts (pagebar 0 and 1)
                    extend_part = '&pre_page=' + str(i) + '&pagebar=' + str(x - 1)
                html = self.getHTML(i, extend_part)
                page = etree.HTML(html)
                if page is None:
                    break
                detail = page.xpath('//div[@class="WB_cardwrap WB_feed_type S_bg2 WB_feed_like "]')
                # No post nodes: the user has not posted anything
                if len(detail) == 0:
                    print('This user has not posted anything')
                    break
                weibo = {key: [] for key in columns}
                for w in detail:
                    all_info = self.getContent(w)
                    if begin_dt is not None:
                        weibo_dt = datetime.strptime(all_info['time'], '%Y-%m-%d %H:%M').date()
                        # Stop crawling once a post is older than the start date
                        if begin_dt > weibo_dt:
                            self.flag = False
                            break
                    for key in columns:
                        # is_repost defaults to 0, every other field to ''
                        weibo[key].append(all_info.get(key, 0 if key == 'is_repost' else ''))
                weibo = pd.DataFrame(weibo)
                weibo['uid'] = self.uid
                # DataFrame.append was removed in pandas 2.0, so use pd.concat
                weibo_df = pd.concat([weibo_df, weibo], ignore_index=True)
                if not self.flag:  # date limit reached, skip the remaining parts
                    break
            # Extract the link to the next page
            if page is None:
                break
            next_page = page.xpath('//a[@class="page next S_txt1 S_line1"]/@href')
            if len(next_page) == 0:  # no next page
                self.flag = False
                print('Reached the last page')
            else:
                page_num = re.compile(r'page=(\d*)').findall(next_page[0])[0]
                i = int(page_num)
            time.sleep(random.randint(5, 10))  # pause between pages
        return weibo_df

if __name__ == '__main__':
    uid = input('Enter uid: ')
    begin_date = input('Enter the start date as YYYY-MM-DD (leave blank for none): ') or None
    begin_page = input('Enter the start page (default 1): ') or 1
    weibo_df = getWeiboContent(uid, begin_date, int(begin_page)).getWeiboInfo()

