Python Web Scraping (Fetching Sina News)

Reposted from the original article, 2017-07-29 21:52. Category: Python

Before you start scraping:

Install BeautifulSoup: pip install beautifulsoup4

Install requests: pip install requests

Install Jupyter Notebook: pip install jupyter notebook

Install Python and set up the environment (Anaconda is an option; it ships with many Python modules)

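JSON

Definition: a text format used for data interchange.

JavaScript object

Definition: a reference type in JavaScript.

Chinese pages are not always encoded as 'utf-8': 'GBK' and 'GB2312' are also common, and requests may report 'ISO-8859-1' when the server does not declare a charset.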
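As a rough illustration of the two ideas above, the sketch below parses a JSON string into Python objects and shows that the same text becomes different bytes under different encodings. The payload is a made-up sample shaped loosely like an API response, not real Sina data.

```python
import json

# Hypothetical payload for illustration only (not real Sina data).
raw = '{"result": {"count": {"total": 42}, "data": [{"url": "http://example.com/a"}]}}'

parsed = json.loads(raw)                       # JSON text -> Python dict / list / str / int
total = parsed["result"]["count"]["total"]
first_url = parsed["result"]["data"][0]["url"]
print(total, first_url)

# The same text is stored as different bytes depending on the encoding.
text = "新闻"
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")
print(len(utf8_bytes), len(gbk_bytes))

# Decoding with the encoding that produced the bytes restores the text.
round_trip = gbk_bytes.decode("gbk")
print(round_trip == text)
```

If a fetched page shows garbled characters, setting res.encoding to the encoding the page actually uses before reading res.text is usually the first thing to try.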

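requests fetches the content of a web page.

BeautifulSoup turns that content into an object you can query.

    from bs4 import BeautifulSoup

    # res is the response returned by requests.get(...)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Wrap the page text fetched by requests in a BeautifulSoup object stored in soup, and name 'html.parser' as the parser explicitly; otherwise a warning is raised.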

The select method of BeautifulSoup returns the matching elements as a list, which a for loop can walk through one by one.

    alink = soup.select('h1')

    for link in alink:
        print(link.text)

Once you have a tag, ['href'] reads the value of its href attribute, for example:

    for link in soup.select('a'):
        print(link['href'])
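The select and ['href'] steps can be tried without any network access. The sketch below runs them against a small inline HTML fragment, which is a made-up sample and not real Sina markup. It also shows one assumption worth checking on real pages: an a tag without an href attribute makes link['href'] raise KeyError, so the sketch either selects 'a[href]' or uses .get('href').

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment for illustration only (not real Sina markup).
html = """
<html><body>
  <h1>Sample headline</h1>
  <a href="http://example.com/a">First</a>
  <a name="anchor-only">No link here</a>
  <a href="http://example.com/b">Second</a>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# select returns a list of matching tags; .text gives the text inside a tag.
headlines = [h1.text for h1 in soup.select('h1')]

# Only anchors that carry an href attribute are selected here.
hrefs = [a['href'] for a in soup.select('a[href]')]

# .get returns None instead of raising KeyError when the attribute is missing.
maybe_hrefs = [a.get('href') for a in soup.select('a')]

print(headlines)
print(hrefs)
print(maybe_hrefs)
```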

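Getting the news ID:

* strip() removes leading and trailing whitespace. Given a string argument, it removes any of the characters in that string from both ends (the argument is treated as a set of characters, not as a fixed prefix or suffix). rstrip() works on the right end only and lstrip() on the left end only.

* split('/') cuts a string into pieces at the given separator.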
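One way to put split to work on the news ID is sketched below. It takes the last path segment of the article URL and slices off the 'doc-i' prefix and '.shtml' suffix. The helper name extract_news_id is hypothetical, and slicing is used in place of lstrip/rstrip because those treat their argument as a character set.

```python
def extract_news_id(newsurl):
    """Return the ID between 'doc-i' and '.shtml' in the last path segment, or None."""
    last_segment = newsurl.strip().split('/')[-1]   # e.g. 'doc-ifyihrmf3191202.shtml'
    prefix, suffix = 'doc-i', '.shtml'
    if not (last_segment.startswith(prefix) and last_segment.endswith(suffix)):
        return None
    return last_segment[len(prefix):-len(suffix)]

newsurl = 'http://news.sina.com.cn/c/nd/2017-07-22/doc-ifyihrmf3191202.shtml'
news_id = extract_news_id(newsurl)
print(news_id)

# rstrip removes any trailing characters found in its argument, so it can eat
# part of the ID when the ID itself ends in one of those characters.
over_stripped = 'doc-ixyzhtml.shtml'.rstrip('.shtml')
print(over_stripped)
```

The last line prints 'doc-ixyz' rather than 'doc-ixyzhtml', which is why the sketch avoids rstrip('.shtml').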

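Using regular expressions (the re module):

    import re

    newsurl = 'http://news.sina.com.cn/c/nd/2017-07-22/doc-ifyihrmf3191202.shtml'
    m = re.search(r'doc-i(.*)\.shtml', newsurl)  # find the first match of the pattern in newsurl
    print(m.group(1))  # group(0) is the whole match; group(1) is only the part captured by the parentheses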

Using a for loop to build the links for multiple pages of news

    url = 'http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&format=json&page={}&callback=newsloadercallback&_=1501000415111'

    for i in range(0, 10):
        print(url.format(i))
    # format replaces the braces in url (we deleted the part that changes and put braces in its place) with the value we pass in (i in the code above)

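Getting the publication time:

The selected element may contain more than we want, such as the name of the publisher. contents splits the element's children into a list, and contents[0] picks out the first child, which here is the time text.

    # Get the publication time
    import requests
    from bs4 import BeautifulSoup
    from datetime import datetime

    res = requests.get('http://news.sina.com.cn/c/nd/2017-07-22/doc-ifyihrmf3191202.shtml')
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    timesource = soup.select('.time-source')
    print(timesource[0].contents[0])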
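To see what contents does without fetching a page, the sketch below applies it to an inline fragment. The markup is an assumption modeled on the '.time-source' selector used above (a time string followed by a nested source element); the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a time string followed by a nested source element.
html = '<span class="time-source">2017年07月22日15:47 <span>Sample Source</span></span>'

soup = BeautifulSoup(html, 'html.parser')
node = soup.select('.time-source')[0]

full_text = node.text                    # text of the element and everything nested in it
children = node.contents                 # direct children as a list
time_text = node.contents[0].strip()     # first child only: the time string

print(full_text)
print(len(children))
print(time_text)
```

Here .text mixes the time with the source name, while contents[0] isolates the time string.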

Converting time strings

    # String to datetime: strptime
    dt = datetime.strptime(timesource[0].contents[0].strip(), '%Y年%m月%d日%H:%M')

    # datetime to string: strftime
    dt.strftime('%Y-%m-%d')
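The two conversions can be checked offline with a sample string. In the sketch below the input value and the helper name parse_sina_time are assumptions for illustration. The input is stripped first because strptime rejects trailing text, including whitespace, that the format does not account for.

```python
from datetime import datetime

TIME_FORMAT = '%Y年%m月%d日%H:%M'

def parse_sina_time(raw):
    """Parse text like '2017年07月22日15:47'; return None if it does not fit the format."""
    try:
        return datetime.strptime(raw.strip(), TIME_FORMAT)
    except ValueError:
        return None

raw_time = '2017年07月22日15:47 '        # sample value with a trailing space
dt = parse_sina_time(raw_time)           # string -> datetime
formatted = dt.strftime('%Y-%m-%d %H:%M')  # datetime -> string
print(formatted)

print(parse_sina_time('not a timestamp'))  # None: the text does not match the format
```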

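Getting the article body:

After checking which class the body belongs to, fetch it with select as above. The result is a list, so a for loop can strip the tags from each item and append the text to a list of your own (such as article = []).

* '\n'.join(article) joins the items of the article list with the newline character '\n' between them.

    # Get the content of a single news article
    article = []
    for p in soup.select('.article p'):
        article.append(p.text.strip())
    print('\n'.join(article))

The code above can be written in one line:

    # The same extraction in one line
    print('\n'.join([p.text.strip() for p in soup.select('.article p')]))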

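Getting the comment count: the comments are delivered to the browser as JavaScript, so the response has to be reduced to its JSON part and then loaded into a Python dict.

    # Get the number of comments
    import requests
    import json
    comment = requests.get('http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=gn'
                           '&newsid=comos-fyihrmf3218511&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20')  # fetch the data from the comment address
    comment.encoding = 'utf-8'
    # Keep what follows the first '=' (the response starts with 'var data='); strip('var data=') would treat its argument as a character set
    jd = json.loads(comment.text.partition('=')[2])
    jd['result']['count']['total']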
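Both wrappers that appear in this article, the 'var data=' assignment and the newsloadercallback(...) call, can be removed with plain string operations before json.loads. The sketch below shows one possible approach on made-up sample strings. The helper names are hypothetical, and real responses may carry extra whitespace or differ in shape.

```python
import json

def unwrap_js_assignment(text):
    """Return the JSON text after the first '=' in 'var name={...}'."""
    return text.partition('=')[2].strip().rstrip(';')

def unwrap_jsonp(text):
    """Return the JSON text between the first '(' and the last ')'."""
    start = text.index('(') + 1
    end = text.rindex(')')
    return text[start:end]

# Hypothetical samples shaped like the two kinds of response.
sample_js = 'var data={"result": {"count": {"total": 7}}}'
sample_jsonp = '  newsloadercallback({"result": {"data": [{"url": "http://example.com/a"}]}});'

comment_total = json.loads(unwrap_js_assignment(sample_js))['result']['count']['total']
page_urls = [item['url'] for item in json.loads(unwrap_jsonp(sample_jsonp))['result']['data']]

print(comment_total)
print(page_urls)
```

Cutting at explicit delimiters keeps the result independent of which characters the payload happens to start or end with.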

Complete code (using Sina News as the example):

    # Get each article's title, body, time, and comment count
    import requests
    from bs4 import BeautifulSoup
    from datetime import datetime
    import re
    import json
    import pandas

    def getNewsdetial(newsurl):
        # Fetch one article page and return (title, time, paragraphs, editor)
        res = requests.get(newsurl)
        res.encoding = 'utf-8'
        soup = BeautifulSoup(res.text, 'html.parser')
        newsTitle = soup.select('.page-header h1')[0].text.strip()
        nt = datetime.strptime(soup.select('.time-source')[0].contents[0].strip(), '%Y年%m月%d日%H:%M')
        newsTime = datetime.strftime(nt, '%Y-%m-%d %H:%M')
        newsArticle = getnewsArticle(soup.select('.article p'))
        newsAuthor = newsArticle[-1]  # the last paragraph is taken to be the editor credit
        return newsTitle, newsTime, newsArticle, newsAuthor

    def getnewsArticle(news):
        # Collect the stripped text of every paragraph tag
        newsArticle = []
        for p in news:
            newsArticle.append(p.text.strip())
        return newsArticle

    # Get the comment count
    def getCommentCount(newsurl):
        m = re.search(r'doc-i(.+)\.shtml', newsurl)
        newsid = m.group(1)
        commenturl = 'http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=gn&newsid=comos-{}&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20'
        comment = requests.get(commenturl.format(newsid))  # the part that changes is replaced by braces; format puts newsid there
        jd = json.loads(comment.text.partition('=')[2])  # keep what follows 'var data='
        return jd['result']['count']['total']

    def getNewsLinkUrl():
        # Get the asynchronously loaded news addresses (the article links from every listing page)
        urlFormat = 'http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&format=json&page={}&callback=newsloadercallback&_=1501000415111'
        url = []
        for i in range(1, 10):
            res = requests.get(urlFormat.format(i))
            text = res.text
            jd = json.loads(text[text.index('(') + 1:text.rindex(')')])  # unwrap newsloadercallback(...);
            url.extend(getUrl(jd))  # extend adds each URL to the list; append would add the whole sub-list as one item
        return url

    def getUrl(jd):
        # Get the news addresses on one listing page
        url = []
        for i in jd['result']['data']:
            url.append(i['url'])
        return url

    # Collect time, editor, body, title, and comment count for every article into total_2
    def getNewsDetial():
        title_all = []
        author_all = []
        commentCount_all = []
        article_all = []
        time_all = []
        url_all = getNewsLinkUrl()
        for url in url_all:
            title, time, article, author = getNewsdetial(url)  # fetch each page once
            title_all.append(title)
            time_all.append(time)
            article_all.append(article)
            author_all.append(author)
            commentCount_all.append(getCommentCount(url))
        total_2 = {'a_title': title_all, 'b_article': article_all, 'c_commentCount': commentCount_all, 'd_time': time_all, 'e_editor': author_all}
        return total_2

    # (Entry point) Use pandas to organize the data and write it to an Excel file
    # Writing .xlsx needs an Excel engine such as openpyxl to be installed
    df = pandas.DataFrame(getNewsDetial())
    df.to_excel('news2.xlsx')
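The complete code relies on the difference between extend and append when it merges the URLs from each listing page. A minimal sketch of that difference, using made-up example.com URLs in place of real results:

```python
# Hypothetical per-page results (stand-ins for what getUrl returns).
page1 = ['http://example.com/a', 'http://example.com/b']
page2 = ['http://example.com/c']

flat = []
nested = []
for page in (page1, page2):
    flat.extend(page)     # adds each URL individually -> one flat list of URLs
    nested.append(page)   # adds the page list itself -> a list of lists

print(flat)
print(nested)
```

A flat list is what the later loop over url_all expects, since each item must be a single URL string.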

The saved Excel file: [screenshot not reproduced]

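TIPS:

Problem: importing pandas in Jupyter Notebook may fail with an import error.

Solution: do not start Jupyter Notebook from the command line; open the application directly or launch it from Anaconda Navigator.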
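The article does not say why the import fails. One possible cause, offered here only as an assumption, is that the notebook started from the command line runs under a different Python installation than the one where pandas was installed. The sketch below, run in a notebook cell, prints which interpreter the kernel uses and whether that interpreter can locate pandas. The helper name module_available is hypothetical.

```python
import sys
import importlib.util

def module_available(name):
    """Return True if the current interpreter can locate the named top-level module."""
    return importlib.util.find_spec(name) is not None

print(sys.executable)               # path of the interpreter behind this kernel
print(module_available('pandas'))   # False suggests pandas is not installed for this interpreter
```

If the path is not the Anaconda interpreter, that would be consistent with the fix described above.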

2017-07-29 21:49:37
