Breadth-First and Depth-First Algorithms for Web Crawlers

Reposted article. Published 2017-04-19 17:20. Tag: crawlers

Introduction to the Breadth-First Algorithm

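A breadth-first crawl starts from a set of seed nodes, extracts the "child nodes" (that is, the hyperlinks) from those pages, and puts them into a queue to be fetched in order. Links that have been processed are recorded in a table, usually called the Visited table. Before processing a new link, the crawler checks whether the link is already in the Visited table. If it is, the link has already been processed and is skipped; otherwise the crawler goes on to process it.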
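The queue-plus-Visited-table idea can be sketched on a small in-memory link graph. This is only an illustration: LINK_GRAPH, its page names, and the function bfs_order are hypothetical and stand in for real pages and real fetching.

```python
from collections import deque

# Hypothetical link graph: each key is a page, each value lists the pages it links to.
LINK_GRAPH = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["c", "d"],
    "c": [],
    "d": ["a"],
}


def bfs_order(graph, seeds):
    """Return pages in the order a breadth-first crawl would process them."""
    visited = set()        # plays the role of the Visited table
    queue = deque(seeds)   # links waiting to be processed, oldest first
    order = []
    while queue:
        url = queue.popleft()
        if url in visited:
            # Already processed: skip it.
            continue
        visited.add(url)
        order.append(url)
        # Enqueue the child nodes (hyperlinks) found on this page.
        queue.extend(graph.get(url, []))
    return order


print(bfs_order(LINK_GRAPH, ["seed"]))
```

With this graph the pages come out level by level: the seed first, then the pages it links to, then the pages those link to. Links that point back to an already processed page are dropped by the Visited check.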

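The initial URLs are the seed URLs supplied to the crawler system (usually specified in its configuration file). Parsing the pages these seed URLs point to yields new URLs (for example, the link http://www.admin.com is extracted from <a href="http://www.admin.com"> in a page). The crawler then does the following: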

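1. Compare each parsed link with the links in the Visited table. If the Visited table does not contain the link, it has not been visited yet.
2. Put the link into the TODO table.
3. When that is done, take the next link from the TODO table and put it straight into the Visited table.
4. Repeat the process above for the page this link points to, and keep looping.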
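The steps above can be sketched with the standard-library HTML parser. To keep the example self-contained, the hypothetical PAGES dictionary stands in for real HTTP responses; LinkExtractor and crawl are illustrative names, not part of any library.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Strip stray spaces such as those in href= "http://... "
                    self.links.append(value.strip())


# Hypothetical pages standing in for fetched HTML (assumption for this sketch).
PAGES = {
    "http://www.admin.com": '<a href="http://www.admin.com/a">A</a> <a href="http://www.admin.com/b">B</a>',
    "http://www.admin.com/a": '<a href="http://www.admin.com">home</a>',
    "http://www.admin.com/b": '<a href= "http://www.admin.com/a ">A again</a>',
}


def crawl(seeds):
    """Run the TODO/Visited loop and return the Visited table."""
    todo = list(seeds)   # TODO table: links waiting to be processed
    visited = []         # Visited table: links already taken from TODO
    while todo:
        # Step 3: take a link from TODO and put it straight into Visited.
        url = todo.pop(0)
        visited.append(url)
        # Step 4: parse the page this link points to.
        parser = LinkExtractor()
        parser.feed(PAGES.get(url, ""))
        for link in parser.links:
            # Steps 1 and 2: only links not seen before go into TODO.
            if link not in visited and link not in todo:
                todo.append(link)
    return visited


print(crawl(["http://www.admin.com"]))
```

Checking the TODO table as well as the Visited table avoids queuing the same link twice while it is still waiting to be processed.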

Breadth-first traversal is the most widely used crawling strategy. There are three main reasons for choosing it:

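1. Important pages tend to be close to the seeds. For example, what we see first on opening a news site is usually the hottest news; as we browse deeper, the pages we reach become less and less important.
2. The World Wide Web can be as deep as 17 levels in practice, yet there is always a very short path to any given page, and breadth-first traversal reaches that page by the quickest route.
3. Breadth-first crawling lends itself to cooperation among multiple crawlers. Cooperating crawlers usually fetch in-site links first, so each crawl stays largely self-contained.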
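The "short path" argument can be illustrated by recording the depth at which a breadth-first traversal first meets each page. The SITE graph and the function link_depths below are hypothetical, chosen only to show that a page reachable through a long chain of links is still reached at its smallest depth.

```python
from collections import deque

# Hypothetical site structure, for illustration only.
SITE = {
    "home": ["news", "sports"],
    "news": ["story1", "story2"],
    "sports": ["story2", "archive"],
    "story1": ["archive"],
    "story2": [],
    "archive": ["deep_page"],
    "deep_page": [],
}


def link_depths(graph, seed):
    """Map every reachable page to its minimum number of clicks from the seed."""
    depths = {seed: 0}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for child in graph.get(page, []):
            # The first time breadth-first search meets a page is via a shortest path.
            if child not in depths:
                depths[child] = depths[page] + 1
                queue.append(child)
    return depths


for page, depth in link_depths(SITE, "home").items():
    print(depth, page)
```

Here "archive" is linked from both "sports" (two clicks from home) and "story1" (three clicks), and the traversal records it at depth 2.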

Depth-First Search for Crawlers

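Depth-first search was widely used in the early days of crawler development. Its goal is to reach the leaf nodes of the structure being searched (that is, HTML files that contain no hyperlinks). When a hyperlink in an HTML file is selected, the linked HTML file is searched depth-first in turn, so one chain of links must be searched completely before the remaining hyperlinks are examined. The search follows hyperlinks from file to file until it can go no deeper, then returns to an earlier HTML file and picks another of its hyperlinks. When no hyperlinks remain to choose from, the search is finished. The advantage is that it can traverse an entire Web site or a deeply nested collection of documents. The disadvantage is that, because the Web's link structure is very deep, a crawler that goes in may never come back out.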
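A depth-first traversal can be sketched by replacing the queue with a stack. The max_depth cut-off below is not part of the description above; it is one common way to keep the crawler from disappearing down an endless chain of links, added here as an assumption. LINK_GRAPH and dfs_order are hypothetical names.

```python
# Hypothetical link graph containing a cycle (c links back to a), for illustration only.
LINK_GRAPH = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": ["a", "e"],
    "d": [],
    "e": [],
}


def dfs_order(graph, seed, max_depth):
    """Return pages in depth-first order, ignoring links deeper than max_depth."""
    visited = set()
    order = []
    stack = [(seed, 0)]   # (url, depth) pairs; the most recently added is taken first
    while stack:
        url, depth = stack.pop()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        order.append(url)
        # Push children in reverse so the first link on the page is followed first.
        for child in reversed(graph.get(url, [])):
            stack.append((child, depth + 1))
    return order


print(dfs_order(LINK_GRAPH, "seed", max_depth=3))
print(dfs_order(LINK_GRAPH, "seed", max_depth=2))
```

The first call follows seed, a, c, e to the end of that chain before returning to b; the second call stops before e because it lies beyond the depth limit. The visited set is what prevents the cycle between a and c from looping forever.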

# encoding=utf-8
# Breadth-first crawler: URLs leave the queue in the order they were added.
# Requires Python 3 and the beautifulsoup4 package.
from bs4 import BeautifulSoup
import socket
import urllib.request
import re
import zlib


class MyCrawler:
    def __init__(self, seeds):
        # Initialize the current crawl depth
        self.current_deepth = 1
        # Initialize the URL queue with the seeds
        self.linkQuence = linkQuence()
        if isinstance(seeds, str):
            self.linkQuence.addUnvisitedUrl(seeds)
        if isinstance(seeds, list):
            for i in seeds:
                self.linkQuence.addUnvisitedUrl(i)
        print("Add the seeds url \"%s\" to the unvisited url list" % str(self.linkQuence.unVisited))

    # Main crawl loop
    def crawling(self, seeds, crawl_deepth):
        # Loop condition: the crawl depth does not exceed crawl_deepth
        while self.current_deepth <= crawl_deepth:
            # Links found on the pages of the current depth level
            links = []
            # Loop condition: the queue of links to fetch is not empty
            while not self.linkQuence.unVisitedUrlsEnmpy():
                # Dequeue the URL at the head of the queue
                visitUrl = self.linkQuence.unVisitedUrlDeQuence()
                print("Pop out one url \"%s\" from unvisited url list" % visitUrl)
                if visitUrl is None or visitUrl == "":
                    continue
                # Extract the hyperlinks of this page
                newLinks = self.getHyperLinks(visitUrl)
                print("Get %d new links" % len(newLinks))
                links.extend(newLinks)
                # Record the URL as visited
                self.linkQuence.addVisitedUrl(visitUrl)
                print("Visited url count: " + str(self.linkQuence.getVisitedUrlCount()))
                print("Visited deepth: " + str(self.current_deepth))
            # Enqueue the links that have not been visited yet
            for link in links:
                self.linkQuence.addUnvisitedUrl(link)
            print("%d unvisited links:" % len(self.linkQuence.getUnvisitedUrl()))
            self.current_deepth += 1

    # Extract the hyperlinks from the page source
    def getHyperLinks(self, url):
        links = []
        data = self.getPageSource(url)
        if data[0] == "200":
            soup = BeautifulSoup(data[1], "html.parser")
            a = soup.find_all("a", {"href": re.compile('^http|^/')})
            for i in a:
                # Keep absolute links only; site-relative links (starting with "/") are skipped
                if i["href"].startswith(("http://", "https://")):
                    links.append(i["href"])
        return links

    # Fetch the page source
    def getPageSource(self, url, timeout=100, coding=None):
        try:
            socket.setdefaulttimeout(timeout)
            req = urllib.request.Request(url)
            req.add_header('User-agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)')
            response = urllib.request.urlopen(req)
            # Read the body first, then decompress it if the server sent gzip
            page = response.read()
            if response.headers.get('Content-Encoding') == 'gzip':
                page = zlib.decompress(page, 16 + zlib.MAX_WBITS)
            if coding is None:
                coding = response.headers.get_content_charset()
            # If the site declares an encoding, decode the bytes to text;
            # otherwise pass the raw bytes on and let BeautifulSoup detect it
            if coding is not None:
                page = page.decode(coding)
            return ["200", page]
        except Exception as e:
            print(str(e))
            return [str(e), None]


class linkQuence:
    def __init__(self):
        # URLs already visited
        self.visted = []
        # URLs waiting to be visited
        self.unVisited = []

    # Return the list of visited URLs
    def getVisitedUrl(self):
        return self.visted

    # Return the queue of unvisited URLs
    def getUnvisitedUrl(self):
        return self.unVisited

    # Add a URL to the visited list
    def addVisitedUrl(self, url):
        self.visted.append(url)

    # Remove a URL from the visited list
    def removeVisitedUrl(self, url):
        self.visted.remove(url)

    # Dequeue an unvisited URL (None if the queue is empty)
    def unVisitedUrlDeQuence(self):
        try:
            return self.unVisited.pop()
        except IndexError:
            return None

    # Enqueue a URL, making sure each URL is visited only once
    def addUnvisitedUrl(self, url):
        if url != "" and url not in self.visted and url not in self.unVisited:
            self.unVisited.insert(0, url)

    # Number of visited URLs
    def getVisitedUrlCount(self):
        return len(self.visted)

    # Number of unvisited URLs
    def getUnvistedUrlCount(self):
        return len(self.unVisited)

    # Whether the queue of unvisited URLs is empty
    def unVisitedUrlsEnmpy(self):
        return len(self.unVisited) == 0


def main(seeds, crawl_deepth):
    craw = MyCrawler(seeds)
    craw.crawling(seeds, crawl_deepth)


if __name__ == "__main__":
    main(["http://www.baidu.com", "http://www.google.com.hk", "http://www.sina.com.cn"], 10)
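The getHyperLinks method in the listing keeps only absolute links, so site-relative hrefs such as /news never enter the queue. One possible refinement, sketched here with the standard library's urllib.parse and not taken from the original code, is to resolve every href against the URL of the page it was found on. The function name normalize_links and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin, urldefrag


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep unique http(s) URLs in order."""
    seen = set()
    result = []
    for href in hrefs:
        # urljoin turns relative links into absolute ones;
        # urldefrag drops the "#fragment" part, which does not identify a new page.
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        if absolute.startswith(("http://", "https://")) and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


print(normalize_links(
    "http://www.sina.com.cn/news/index.html",
    ["/sports", "http://www.baidu.com", "detail.html#top", "mailto:someone@example.com"],
))
```

In this example the two relative links are resolved against the sina.com.cn page, the absolute link is kept unchanged, and the mailto: link is discarded because it is not an http(s) URL.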
