python: Extracting and Analyzing Information from the cnblogs (博客園) Front Page

Reposted article. 2013-08-20 20:42 · 1754 · python / text processing / thinking and practice / data mining

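Preface

Two days ago I wrote a blog post and published it to the cnblogs front page. Watching the view count climb bit by bit felt a little strange.

That made me curious: how many people publish posts to the front page? How many articles appear on the cnblogs front page each day? Who publishes the most? How do comment counts relate to view counts?

Once I was curious, the next question was how to find the answers.

1. First steps

Browsing cnblogs shows that the front-page listing goes back at most 200 pages, so the first step is to download those 200 pages. I previously wrote a post on batch-downloading images, and a similar approach works for downloading these pages.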

from html.parser import HTMLParser
import os
import sys
import urllib.request

# Following the cnblogs "Next" button gives the address of the next page;
# repeating this in a loop downloads all 200 pages.

# Step 1. Parse a page and collect the address of the next page.
class LinkParser(HTMLParser):
    def __init__(self, domain=''):
        # HTMLParser no longer accepts the 'strict' argument (removed in Python 3.5).
        super().__init__()
        self.value = ''      # href of the most recent <a> tag
        self.domain = domain
        self.next = []       # addresses of the "Next" links found so far

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.value = value

    def handle_data(self, data):
        # The text "Next" belongs to the <a> tag whose href was just recorded.
        if data.startswith('Next'):
            if self.domain != '' and '://' not in self.value:
                self.next.append(self.domain + self.value)
            else:
                self.next.append(self.value)

# Step 2. Download the current page, then queue the next page found by the parser.
def getLinks(url, domain):
    doing = [url]
    done = []
    cnt = 0
    # Create the data folder in the current directory if it does not exist yet.
    data_dir = os.path.join(os.getcwd(), 'data')
    os.makedirs(data_dir, exist_ok=True)
    while doing:
        x = doing.pop()
        done.append(x)
        cnt += 1
        print('start:', x)
        try:
            with urllib.request.urlopen(x, timeout=120) as f:
                s = f.read()
            with open(os.path.join(data_dir, '{0}.html'.format(cnt)), 'wb') as fx:
                fx.write(s)
            parser = LinkParser(domain=domain)
            parser.feed(s.decode())
            for i in parser.next:
                # Skip pages already downloaded or already queued.
                if i not in done and i not in doing:
                    doing.insert(0, i)
            print('ok:', x)
        except Exception:
            print('error:', x)
            print(sys.exc_info())
            continue
    return done

if __name__ == '__main__':
    getLinks('http://www.cnblogs.com/', 'http://www.cnblogs.com/')
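
The parser above builds absolute addresses by concatenating `domain` and the `href` value, which can yield a doubled slash if the `href` itself starts with `/`. A minimal sketch of an alternative, assuming the standard library's `urllib.parse.urljoin` is acceptable here. The helper name `resolve_link` and the path `/sitehome/p/2` are hypothetical and only serve the illustration; they are not taken from the original program.

```python
from urllib.parse import urljoin

def resolve_link(base_url, href):
    """Return an absolute URL for href, resolved against base_url (hypothetical helper)."""
    return urljoin(base_url, href)

# A relative path is resolved against the base address.
# '/sitehome/p/2' is an assumed example path, not a confirmed cnblogs URL.
print(resolve_link('http://www.cnblogs.com/', '/sitehome/p/2'))

# An address that is already absolute passes through unchanged.
print(resolve_link('http://www.cnblogs.com/', 'http://example.com/a'))
```

Inside `LinkParser.handle_data`, a call like this could stand in for the manual `'://'` check, since `urljoin` handles both relative and absolute values.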

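2. Extracting information from the pages

The pages are now downloaded; the next step is to extract the information from them.

Inspection shows that each page lists 20 records, and each record contains a title, author, publication time, recommendation count, and so on.

How can these be extracted?

First, write a small program to see how Python parses this data:

Data: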

<html>
<head></head>
<body>
<div class="post_item">
<div class="digg">
    <div class="diggit" onclick="DiggIt(3266366,130739,1)"> 
    <span class="diggnum" id="digg_count_3266366">10</span>
    </div>
    <div class="clear"></div>    
    <div id="digg_tip_3266366" class="digg_tip"></div>
</div>      
<div class="post_item_body">
    <h3><a class="titlelnk" href="http://www.cnblogs.com/ola2010/p/3266366.html" target="_blank">python——常用功能之文本處理</a></h3>                   
    <p class="post_item_summary">
    前言在生活、工作中，python一直都是一個好幫手。在python的眾多功能中，我覺得文本處理是最常用的。下面是平常使用中的一些總結。環境是python 3.30. 基礎在python中，使用str對象來保存字符串。str對象的建立很簡單，使用單引號或雙引號或3個單引號即可。例如：s='nice' ... 
    </p>              
    <div class="post_item_foot">                    
    <a href="http://www.cnblogs.com/ola2010/" class="lightblue">ola2010</a> 
    發布於 2013-08-18 21:27 
    <span class="article_comment"><a href="http://www.cnblogs.com/ola2010/p/3266366.html#commentform" title="2013-08-20 17:45" class="gray">
        評論(4)</a></span><span class="article_view"><a href="http://www.cnblogs.com/ola2010/p/3266366.html" class="gray">閱讀(1640)</a></span></div>
</div>
<div class="clear"></div>
</div>
</body>
</html>


Code:

from html.parser import HTMLParser
import os

# A simple HTML parser, used mainly to see the order in which Python processes the HTML.
class TestParser(HTMLParser):
    def __init__(self):
        # HTMLParser no longer accepts the 'strict' argument (removed in Python 3.5).
        super().__init__()
        self.current = 0   # running index of the data chunks seen so far

    def handle_starttag(self, tag, attrs):
        print(tag, ':', attrs)

    def handle_data(self, data):
        print(self.current, 'data:', data.strip())
        self.current += 1

if __name__ == '__main__':
    parser = TestParser()
    # test.txt holds the sample data listed above.
    with open(os.path.join(os.getcwd(), 'test.txt'), encoding='utf-8') as f:
        s = f.read()
    parser.feed(s)
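
To make the callback order concrete without needing a file, here is a small sketch that records the events instead of printing them. The class name `EventRecorder` and the one-line HTML snippet are illustrative assumptions, modeled loosely on the title link in the sample data.

```python
from html.parser import HTMLParser

class EventRecorder(HTMLParser):
    """Collects parser callbacks in the order they fire (illustrative class)."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag, attrs))

    def handle_data(self, data):
        # Skip whitespace-only text between tags to keep the log short.
        if data.strip():
            self.events.append(('data', data.strip()))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

recorder = EventRecorder()
recorder.feed('<h3><a class="titlelnk" href="#">Some title</a></h3>')
for event in recorder.events:
    print(event)
```

The start tag of the link arrives before its text, which is why a parser can remember "the last tag seen" and then interpret the following text accordingly.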

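The small program shows the order in which the parser handles tags and text. With that order established, the data can be extracted step by step. An earlier post of mine on finite state machines in Python describes how to extract information this way.

Code: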

from html.parser import HTMLParser
import os

# Parser for the post records, written as a finite state machine.
# States: 0 = looking for the title link, 1 = reading the title,
# 2 = looking for the post_item_foot div, 3 = looking for the author link,
# 4 = reading the author, 5 = reading the time,
# 6 = reading the comment count, 7 = reading the view count.
class ContentParser(HTMLParser):
    def __init__(self):
        # HTMLParser no longer accepts the 'strict' argument (removed in Python 3.5).
        super().__init__()
        self.state = 0
        self.title = ''
        self.author = ''
        self.time = ''
        self.comment = ''
        self.view = ''
        self.result = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get('class')
        if self.state == 0:
            if tag == 'a' and css_class == 'titlelnk':
                self.state = 1   # title
        elif self.state == 2:
            if tag == 'div' and css_class == 'post_item_foot':
                self.state = 3
        elif self.state == 3:
            if tag == 'a':
                self.state = 4   # author
        elif self.state == 5:
            if tag == 'span' and css_class == 'article_comment':
                self.state = 6
        elif self.state == 6:
            if tag == 'span' and css_class == 'article_view':
                self.state = 7

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return   # ignore whitespace-only chunks between tags
        if self.state == 1:
            self.title = text
            self.state = 2
        elif self.state == 4:
            self.author = text
            self.state = 5
        elif self.state == 5:
            self.time = text[-16:]      # last 16 characters: 'YYYY-MM-DD HH:MM'
        elif self.state == 6:
            self.comment = text[3:-1]   # drop the 3-character prefix and the closing ')'
        elif self.state == 7:
            self.view = text[3:-1]
            self.result.append((self.title, self.author, self.time, self.comment, self.view))
            self.state = 0

def getContent(file_name):
    parser = ContentParser()
    with open(os.path.join(os.getcwd(), 'data', file_name), encoding='utf-8') as f:
        parser.feed(f.read())
    # Append one tab-separated line per post to result.txt.
    with open(os.path.join(os.getcwd(), 'result.txt'), 'a', encoding='utf-8') as f:
        for record in parser.result:
            f.write('\t'.join(record) + '\n')

if __name__ == '__main__':
    for i in os.listdir(os.path.join(os.getcwd(), 'data')):
        print(i)
        getContent(i)

With that, the results are extracted.
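
The parser cuts the numbers out of strings such as `評論(4)` and `閱讀(1640)` with the fixed slice `[3:-1]`, which depends on the label being exactly two characters long. As a sketch of a slightly more tolerant alternative (an assumption on my part, not the method used in the program), a regular expression can pick out the digits between the parentheses. The helper name `extract_count` is hypothetical.

```python
import re

def extract_count(text):
    """Return the integer inside the first '(digits)' group, or None if there is none."""
    match = re.search(r'\((\d+)\)', text)
    return int(match.group(1)) if match else None

# The two labels below are copied from the sample data.
print(extract_count('評論(4)'))
print(extract_count('閱讀(1640)'))
```

Returning `None` for text without a count lets the caller decide how to treat a malformed record instead of silently storing a wrong slice.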

3. Analyzing the data

Because the fields are separated by tabs, the file can be imported directly into Excel.
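
For readers who prefer to stay in Python, the same tab-separated rows can also be loaded with the standard `csv` module. This is a minimal sketch, assuming the column order written by `getContent` (title, author, time, comments, views). The names `Post` and `load_posts` and the sample title are hypothetical; the other sample values come from the sample data.

```python
import csv
import io
from collections import namedtuple

# Hypothetical record type mirroring the five columns of result.txt.
Post = namedtuple('Post', 'title author time comments views')

def load_posts(lines):
    """Parse tab-separated rows into Post tuples, converting the counts to int."""
    reader = csv.reader(lines, delimiter='\t', quoting=csv.QUOTE_NONE)
    posts = []
    for row in reader:
        if len(row) != 5:
            continue  # skip malformed rows
        title, author, time, comments, views = row
        posts.append(Post(title, author, time, int(comments), int(views)))
    return posts

# An in-memory stand-in for one line of result.txt ('Example title' is made up).
sample = io.StringIO('Example title\tola2010\t2013-08-18 21:27\t4\t1640\n')
posts = load_posts(sample)
print(posts[0].author, posts[0].views)
```

With the real file, the same function could be called on an object opened as `open('result.txt', encoding='utf-8', newline='')`.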

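Statistics:

Over the nearly three months from 2013-05-22 16:22 to 2013-08-20 19:57:
1356 people published 4000 posts to the cnblogs front page, an average of 44.4 posts per day and about 3 posts per person;
the most prolific author published 55 posts;
in total the articles were viewed 4661643 times and received 35210 comments, roughly one comment per 132 views.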
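
The averages above follow directly from the totals. As a quick worked check, the sketch below recomputes them from the figures quoted in this section; the variable names are my own.

```python
from datetime import datetime

# Time span covered by the 200 pages, as quoted above.
first = datetime.strptime('2013-05-22 16:22', '%Y-%m-%d %H:%M')
last = datetime.strptime('2013-08-20 19:57', '%Y-%m-%d %H:%M')
days = (last - first).total_seconds() / 86400  # length of the span in days

# Totals quoted above.
posts, authors = 4000, 1356
views, comments = 4661643, 35210

posts_per_day = posts / days
posts_per_author = posts / authors
views_per_comment = views / comments
print(round(posts_per_day, 1), round(posts_per_author, 2), round(views_per_comment, 1))
```

The post count also matches the download step: 200 pages with 20 records each give 4000 posts.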

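Food for thought

1. Beyond the statistics above, is it possible to find which day of the week has the most posts, and which has the fewest? Who receives the most comments? Which topics attract the most attention?

2. The internet holds a great deal of data; anyone willing to get hands-on can obtain the information they want. This applies not only to these cnblogs statistics but to other sites as well.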
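
As a starting point for the weekday question in item 1, here is a minimal sketch that counts posts per day of the week from the extracted time strings. It assumes the `YYYY-MM-DD HH:MM` format seen in the sample data. The function name `posts_per_weekday` is hypothetical, and the three timestamps are illustrative values taken from dates that appear in this post, not real results.

```python
from collections import Counter
from datetime import datetime

# Fixed English names so the output does not depend on the system locale.
WEEKDAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
            'Friday', 'Saturday', 'Sunday']

def posts_per_weekday(timestamps):
    """Count posts per weekday name from 'YYYY-MM-DD HH:MM' strings."""
    counts = Counter()
    for stamp in timestamps:
        day = datetime.strptime(stamp, '%Y-%m-%d %H:%M')
        counts[WEEKDAYS[day.weekday()]] += 1
    return counts

# Illustrative input only; the real input would be the time column of result.txt.
sample = ['2013-08-18 21:27', '2013-08-20 19:57', '2013-08-20 20:42']
print(posts_per_weekday(sample).most_common())
```

The same `Counter` pattern, keyed by author instead of weekday, would answer the question of who posts or gets commented on the most.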
