Scraping Tencent News with Python

Reposted article. 2012-08-14 09:56. Tag: python

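Approach:

1. Fetch the Tencent News list page: http://news.qq.com/

2. Extract the URLs of the article pages, for example: http://news.qq.com/a/20120814/000070.htm

3. Extract the news title and body from each article page

4. Strip the HTML tags from the extracted content and write the result to a txt file

Code: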

#coding=utf-8
# Python 2 script (uses urllib2 and the print statement)
import re
import urllib2

def extract_url(info):
    # Match article URLs such as http://news.qq.com/a/20120814/000070.htm
    # The dots are escaped so that they only match a literal "."
    rege = r"http://news\.qq\.com/a/\d{8}/\d{6}\.htm"
    re_url = re.findall(rege, info)
    return re_url

def extract_sub_web_title(sub_web):
    # Non-greedy match, so only a single <title> element is captured
    re_key = r"<title>.+?</title>"
    title = re.findall(re_key, sub_web)
    return title

def extract_sub_web_content(sub_web):
    # "." does not match newlines, so this only captures the article
    # container when it sits on a single line of the page source
    re_key = r"<div id=\"Cnt-Main-Article-QQ\".*</div>"
    content = re.findall(re_key, sub_web)
    return content

def filter_tags(htmlstr):
    re_cdata = re.compile(r'//<!\[CDATA\[[^>]*//\]\]>', re.I)  # CDATA sections
    re_script = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)  # <script> blocks
    re_style = re.compile(r'<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', re.I)  # <style> blocks
    re_p = re.compile(r'<P\s*?/?>', re.I)  # paragraph tags, used as line breaks
    re_h = re.compile(r'</?\w+[^>]*>')  # any remaining HTML tag
    re_comment = re.compile(r'<!--[^>]*-->')  # HTML comments
    s = re_cdata.sub('', htmlstr)  # remove CDATA
    s = re_script.sub('', s)  # remove scripts
    s = re_style.sub('', s)  # remove styles
    s = re_p.sub('\r\n', s)  # turn <p> into a line break
    s = re_h.sub('', s)  # remove HTML tags
    s = re_comment.sub('', s)  # remove HTML comments
    # Collapse runs of blank lines; the line breaks inserted above are \r\n
    blank_line = re.compile(r'(?:\r?\n)+')
    s = blank_line.sub('\r\n', s)
    return s

# Fetch the news list page
content = urllib2.urlopen('http://news.qq.com').read()

# Collect the article URLs
get_url = extract_url(content)

# Write the results to a text file
f = open('result.txt', 'w')
i = 15     # index of the first article; the first few entries use a different format
flag = 30  # index of the last article; moved back by one for each page that cannot be parsed
while i < len(get_url):  # stop early if the list page has fewer links than expected
    f.write(str(i - 14) + "\r\n")

    # Fetch the article page and extract its title and body
    sub_web = urllib2.urlopen(get_url[i]).read()
    sub_title = extract_sub_web_title(sub_web)
    sub_content = extract_sub_web_content(sub_web)

    # Strip the HTML tags, then convert the text from GB2312 to UTF-8
    if sub_title and sub_content:
        re_content = filter_tags(sub_title[0] + "\r\n" + sub_content[0])
        f.write(re_content.decode("gb2312").encode("utf-8"))
        f.write("\r\n")
    else:
        flag = flag + 1

    if i == flag:
        break

    i = i + 1
    print "Have finished %d news" % (i - 15)
f.close()
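
The extraction steps can be tried offline before touching the network. The following is a minimal sketch in Python 3 syntax (the script above is Python 2). The sample HTML and the helper names extract_urls and extract_title are made up for illustration and are not taken from the real page; the third link is deliberately malformed to show why the dots in the pattern should be escaped.

```python
import re

# Hypothetical stand-in for the list page; the real page is fetched over HTTP.
SAMPLE_LIST_PAGE = (
    '<a href="http://news.qq.com/a/20120814/000070.htm">A</a>'
    '<a href="http://news.qq.com/a/20120814/000071.htm">B</a>'
    '<a href="http://news.qq.com/a/20120814/000072xhtm">C</a>'
)

# Escaped dots only match a literal ".", so "000072xhtm" is rejected.
ARTICLE_URL = re.compile(r"http://news\.qq\.com/a/\d{8}/\d{6}\.htm")


def extract_urls(page):
    """Return every article URL found in the page, in document order."""
    return ARTICLE_URL.findall(page)


def extract_title(page):
    """Return the text of the first <title> element, or None if absent."""
    match = re.search(r"<title>(.+?)</title>", page)
    return match.group(1) if match else None


urls = extract_urls(SAMPLE_LIST_PAGE)
title = extract_title("<title>Example headline</title><title>other</title>")
print(urls)
print(title)
```

With an unescaped pattern the malformed third link would also be returned, because "." matches any character.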

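Notes:

urllib2 module: fetches the web page content
re module: extracts data with regular expressions
decode("gb2312").encode("utf-8"): the fetched pages are encoded in GB2312, so the text has to be decoded first and then re-encoded as UTF-8 for output
filter_tags: strips the HTML tags from the extracted content; the function can be found through a Baidu search and was slightly modified here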
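
The decode-then-encode step in the notes can be illustrated in isolation. This is a small sketch in Python 3 syntax, where bytes and text are separate types; the sample string is an assumption standing in for real page data.

```python
# Sample text standing in for a page body (hypothetical, not real page data).
text = "腾讯新闻"

# Bytes as they would arrive from a GB2312-encoded page.
raw = text.encode("gb2312")

# Decode the bytes to text, then re-encode the text as UTF-8 for the output file.
decoded = raw.decode("gb2312")
utf8_bytes = decoded.encode("utf-8")

# GB2312 uses 2 bytes per Chinese character, UTF-8 uses 3 for these characters.
print(len(raw), len(utf8_bytes))
```

If a page contains characters outside GB2312, the decode call raises UnicodeDecodeError; passing errors="replace" to decode is one way to keep going, at the cost of losing those characters.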

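Problems encountered while debugging:

1.Table 'polls.django_admin_log' doesn't exist
I ran into this error today while trying out the official example for the Django framework, so I am recording it here.

Cause: the database has not been synchronized

Fix: python manage.py syncdb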

2.IndentationError: unexpected indent

Cause: incorrect indentation

Fix: remove the stray indentation and indent consistently with Tab, with the tab width set to 4 spaces

3.[Errno 9] Bad file descriptor

Cause: the file to be read was opened with open(filename, "w")

Fix: open(filename, "r")
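
The mode mismatch can be reproduced in a few lines. This sketch uses Python 3, where reading from a write-only handle raises io.UnsupportedOperation rather than the Python 2 error quoted above; the temporary file exists only for the demonstration.

```python
import io
import os
import tempfile

# Temporary file used only for this demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

with open(path, "w") as f:
    f.write("hello\n")
    try:
        f.read()  # reading from a handle opened in "w" mode fails
        read_failed = False
    except io.UnsupportedOperation:
        read_failed = True

# Reopening the file in "r" mode works as expected.
with open(path, "r") as f:
    data = f.read()

os.remove(path)
print(read_failed, repr(data))
```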

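4.IndexError: list index out of range

Cause: for i in range(len(List))

del List[i]

Deleting elements from List while looping over its indices runs past the end of the list

Fix: do not delete while iterating; work with two lists instead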
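
A short sketch of both the failing pattern and the two-list fix, written in Python 3 syntax. The sample values are arbitrary and only serve to trigger the error.

```python
items = [1, 2, 2, 3, 2, 4]

# Failing pattern: range(len(...)) is computed once, but the list shrinks
# as elements are deleted, so a later index no longer exists.
broken = list(items)
try:
    for i in range(len(broken)):
        if broken[i] == 2:
            del broken[i]
    raised = False
except IndexError:
    raised = True

# Two-list approach: read from the original list and build a new one.
kept = [x for x in items if x != 2]
print(raised, kept)
```

Besides the IndexError, the failing loop also skips the element that slides into the deleted position, so some values that should be removed can survive.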

5.TypeError: expected string or buffer
Cause: re_h=re.compile('</?\w+[^>]*>')
s=re_h.sub('',str)

The str argument passed in was actually a list, which caused the error

Fix: pass in a value of type str
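
This error typically appears when the result of re.findall, which is always a list, is passed straight to sub. The sketch below uses Python 3, where the message reads "expected string or bytes-like object" but the exception type is the same; the sample HTML is made up.

```python
import re

re_h = re.compile(r"</?\w+[^>]*>")

# re.findall returns a list, even when there is only one match.
matches = re.findall(r"<title>.+?</title>", "<title>Example</title>")

try:
    re_h.sub("", matches)  # passing the list itself raises TypeError
    raised = False
except TypeError:
    raised = True

# Pass a single element (a str) instead of the whole list.
clean = re_h.sub("", matches[0]) if matches else ""
print(raised, clean)
```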

Appendix: my vim settings

Create .vimrc in ~ (the user's home directory) so that other users are not affected

syntax on
set fileencodings=utf-8,cp936,big5,euc-jp,euc-kr,latin1,ucs-bom 
set fileencodings=utf-8,gbk 
set ambiwidth=double
set langmenu=zh_CN.UTF-8
set mouse=a
set nu
set foldmethod=indent
set sw=4
set ts=4
set smarttab
set spell
set tw=78
set lbr
set fo+=mB
" Use 256 colors; the default rendering looks poor
set t_Co=256
" Color scheme
colorscheme default
