The following are my reading notes for Web Scraping with Python (《用python寫網絡爬蟲》):
I. Background research
1. Check the robots.txt file. Appending /robots.txt to the root URL of the site you want to crawl shows what restrictions the site places on crawlers.
Below is a typical robots.txt file, the one served by http://example.webscraping.com/robots.txt:

# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
The Sitemap entry here points to the site map, which we can open and inspect; it provides links to all of the site's pages.
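As a quick check, here is a minimal sketch (the example site's URL is assumed) that simply fetches and prints a robots.txt with urllib2:

import urllib2

# fetch and print the robots.txt of the example site (URL assumed)
robots_url = "http://example.webscraping.com/robots.txt"
print urllib2.urlopen(robots_url).read()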
2. Identify the technologies the site is built with
If we know how a site is built before crawling it, we can crawl its data faster and more effectively.
(1) Open cmd and change to the directory where Python is installed
(2) Run pip install builtwith
(3) Test it
import builtwith

# test the builtwith module
builtwith.parse("http://www.zhipin.com")
3. Find the website's owner
We can use the whois module to look up the registrant of a domain.
(1) Open cmd and change to the directory where Python is installed
(2) Run pip install python-whois
(3) Test it
import whois

# test the whois module
whois.whois("www.zhipin.com")
4. Writing the first web crawler
(1) Three common ways to crawl a website:
a. Crawl the sitemap
b. Iterate over each page's database ID
c. Follow the links between pages
(2) Retrying downloads
import urllib2

def download(url, times=2):
    '''
    download the html page from the url and then return the html
    :param url:
    :param times:
    :return: html
    '''
    print "download : ", url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print "download error : ", e.reason
        html = None
        if times > 0:
            if hasattr(e, "code") and 500 <= e.code < 600:
                # if the error is a server error (5xx), try again
                return download(url, times - 1)
    return html
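As a usage sketch (the book's example site is assumed as the target URL):

# download one page and peek at the start of its html (target URL assumed)
html = download("http://example.webscraping.com")
if html:
    print html[:100]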
(3) Setting a user agent
By default, urllib2 identifies itself with the user agent Python-urllib/2.7. To reduce the chance of being blocked, we should set our own user agent.
import urllib2
import re

def download(url, user_agent="wuyanjing", times=2):
    '''
    download the html page from the url and then return the html
    it catches the download error, sets the user-agent and tries to
    download again when a 5xx error occurs
    :param url: the url which you want to download
    :param user_agent: set the user-agent
    :param times: set the repeat times when a 5xx error occurs
    :return html: the html of the url page
    '''
    print "download : ", url
    headers = {"User-agent": user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print "download error : ", e.reason
        html = None
        if times > 0:
            if hasattr(e, "code") and 500 <= e.code < 600:
                # if the error is a server error (5xx), try again
                # (pass user_agent through so it is not lost on retry)
                return download(url, user_agent, times - 1)
    return html
(4) Crawling with the sitemap
The loc tags in the sitemap list the URLs of every crawlable page on the site, so we only need to parse the sitemap and pass each link to the download function written earlier to fetch the html of the pages we want.
def crawl_sitemap(url):
    '''
    download all html pages listed in the sitemap's loc labels
    :param url:
    :return:
    '''
    # download the sitemap
    sitemap = download(url)
    # find all page links in the sitemap
    links = re.findall("<loc>(.*?)</loc>", sitemap)
    for link in links:
        html = download(link)
The test site I used was Meituan's sitemap.
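A usage sketch, assuming the sitemap URL from the example robots.txt shown earlier:

# crawl every page listed in the example site's sitemap (sitemap URL assumed)
crawl_sitemap("http://example.webscraping.com/sitemap.xml")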
(5) Crawling by iterating over IDs
Because the loc tags in the sitemap may not include every URL we need, crawling by ID gives better coverage of the pages we are after. This approach exploits the fact that many sites store pages under a numeric database ID instead of a page alias, which makes the crawl very simple.
import itertools

def crawl_dataid(url, max_error=5):
    '''
    download the pages whose links look like url/-<id>
    stop after max_error consecutive ids that are not in the database
    :param url: the general url
    :return htmls:
    '''
    numerror = 0
    htmls = []
    for page in itertools.count(1):
        currentpage = url + "/-%d" % page
        html = download(currentpage)
        if html is None:
            numerror += 1
            if numerror == max_error:
                break
        else:
            numerror = 0
            htmls.append(html)
    return htmls
The test URL I used was the one given in the book.
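A usage sketch; the base URL below is my guess at the book's listing URL, and crawl_dataid appends the /-<id> suffix itself:

# iterate over the numeric page ids of the example site (base URL assumed)
htmls = crawl_dataid("http://example.webscraping.com/view")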
(6) Crawling by following links
Whichever of the approaches above we use, we must pause a few seconds between requests before fetching the next page; otherwise we may be banned, or the server may answer with a "too many requests" error and stop serving us. I ran into this while crawling the book's test site: it refused to serve more pages after about ten requests in quick succession.
import re
import time
import urlparse

def crawl_link(url, link_regx):
    '''
    find all unique links in the url and its child pages
    every link is downloaded only once
    :param url: the url which you want to search
    :param link_regx: re match pattern
    :return:
    '''
    links = [url]
    linksset = set(links)
    while links:
        currentpage = links.pop()
        html = download(currentpage)
        # pay attention: sleep one second before the next request
        time.sleep(1)
        for linkson in find_html(html):
            if re.match(link_regx, linkson):
                linkson = urlparse.urljoin(url, linkson)
                # make sure we don't queue the same link twice
                if linkson not in linksset:
                    linksset.add(linkson)
                    links.append(linkson)

def find_html(html):
    '''
    find all links in the html
    :param html:
    :return: all href values
    '''
    pattern = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return pattern.findall(html)

# the test code
crawl_link("http://example.webscraping.com", '.*/(view|index)/')
(7) Advanced features
1. Parsing robots.txt
robots.txt specifies which user agents are disallowed from which paths, so we can use the robotparser module to read it and check whether a given user agent is allowed to fetch a URL before downloading it.
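A minimal sketch of the idea, using the example robots.txt shown earlier (the expected results follow from its rules):

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.webscraping.com/robots.txt")
rp.read()
# BadCrawler is disallowed everywhere; other agents may fetch most paths
print rp.can_fetch("BadCrawler", "http://example.webscraping.com")  # False
print rp.can_fetch("wuyanjing", "http://example.webscraping.com")   # True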
import re
import time
import urlparse
import robotparser

def crawl_link(url, link_regx, url_robots, user_agent="wuyanjing"):
    '''
    find all unique links in the url and its child pages
    every link is downloaded only once and only if robots.txt allows it
    :param url: the url which you want to search
    :param link_regx: re match pattern
    :param url_robots: the url of robots.txt
    :param user_agent: user-agent
    :return:
    '''
    links = [url]
    linksset = set(links)
    while links:
        currentpage = links.pop()
        rp = robot_parse(url_robots)
        if rp.can_fetch(user_agent, currentpage):
            html = download(currentpage)
            # pay attention: sleep one second before the next request
            time.sleep(1)
            for linkson in find_html(html):
                if re.match(link_regx, linkson):
                    linkson = urlparse.urljoin(url, linkson)
                    # make sure we don't queue the same link twice
                    if linkson not in linksset:
                        linksset.add(linkson)
                        links.append(linkson)
        else:
            print 'Blocked by robots.txt', currentpage

def find_html(html):
    '''
    find all links in the html
    :param html:
    :return: all href values
    '''
    pattern = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return pattern.findall(html)

def robot_parse(url):
    '''
    parse robots.txt and read its rules
    :param url: the url of robots.txt
    :return rp: robotparser
    '''
    rp = robotparser.RobotFileParser()
    rp.set_url(url)
    rp.read()
    return rp
2. Proxy support
Sometimes our own IP address gets blocked; in that case we need to use a proxy, which in effect gives our requests a different source IP address. See the linked reference on proxies for details.
The general way to add proxy support with urllib2 is as follows:
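A minimal sketch of that pattern (the proxy address is a placeholder):

import urllib2

# route http requests through a proxy by installing a ProxyHandler (proxy address is a placeholder)
proxy = "127.0.0.1:8080"
opener = urllib2.build_opener(urllib2.ProxyHandler({"http": proxy}))
html = opener.open("http://example.webscraping.com").read()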
We can fold proxy support into the download function, giving the following version:
import urllib2
import urlparse

def download(url, user_agent="wuyanjing", proxy=None, num_retrie=2):
    '''
    download the html of the url, optionally through a proxy
    :param url:
    :param user_agent: user-agent name
    :param proxy: string, ip:port
    :param num_retrie: retry times on 5xx errors
    :return: the html of the url
    '''
    print "download: ", url
    headers = {"User-agent": user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        html = opener.open(request).read()
    except urllib2.URLError as e:
        print "download error: ", e.reason
        html = None
        if num_retrie > 0:
            if hasattr(e, "code") and 500 <= e.code < 600:
                # if the error is a server error (5xx), try again
                return download(url, user_agent, proxy, num_retrie - 1)
    return html
3. Throttling downloads
First we need to understand what the urlparse module is for: urlparse.urlparse(url).netloc returns the host (server address) of the given url.
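For example (the URL is assumed):

import urlparse

# netloc is the host part of the url
print urlparse.urlparse("http://example.webscraping.com/view/1").netloc
# -> example.webscraping.com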
We first create a class that records the delay time and the time of the most recent request to each server:
import urlparse
import datetime
import time

class Throttle:
    '''
    make sure the break between two downloads from the same domain is at least delay seconds
    :param self.delay: the delay time
    :param self.domains: the most recent time we requested each server
    '''
    def __init__(self, delay):
        self.delay = delay
        self.domains = {}

    def wait(self, url):
        '''
        sleep if the last request to this url's domain was less than delay seconds ago
        :param url:
        :return:
        '''
        domain = urlparse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_seconds = self.delay - (datetime.datetime.now() - last_accessed).seconds
            if sleep_seconds > 0:
                time.sleep(sleep_seconds)
        self.domains[domain] = datetime.datetime.now()
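A quick usage sketch with a 5-second delay (the URL is assumed):

throttle = Throttle(5)
throttle.wait("http://example.webscraping.com")  # first request to this domain: no sleep
throttle.wait("http://example.webscraping.com")  # same domain again: sleeps about 5 seconds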
When modifying crawl_link, be careful to initialize the Throttle class only once, i.e. outside the loop; otherwise its domains dict stays empty forever and the throttle never waits.
def crawl_link(url, link_regx, url_robots, user_agent="wuyanjing"):
    '''
    find all unique links in the url and its child pages
    every link is downloaded only once and only if robots.txt allows it
    :param url: the url which you want to search
    :param link_regx: re match pattern
    :param url_robots: the url of robots.txt
    :param user_agent: user-agent
    :return:
    '''
    links = [url]
    linksset = set(links)
    # the Throttle class lives in Throttle.py; initialize it once, outside the loop
    throttle = Throttle.Throttle(5)
    while links:
        currentpage = links.pop()
        rp = robot_parse(url_robots)
        if rp.can_fetch(user_agent, currentpage):
            # wait between requests to the same domain instead of a fixed sleep
            throttle.wait(currentpage)
            html = download(currentpage)
            for linkson in find_html(html):
                if re.match(link_regx, linkson):
                    linkson = urlparse.urljoin(url, linkson)
                    # make sure we don't queue the same link twice
                    if linkson not in linksset:
                        linksset.add(linkson)
                        links.append(linkson)
        else:
            print 'Blocked by robots.txt', currentpage
4. Avoiding crawler traps
A crawler trap occurs with dynamically generated pages: because our crawler records every link it has not seen before and keeps following them, a site that keeps generating new links produces an unbounded number of URLs and the crawl never ends. The way out is to record how many links were followed to reach the current page, called its depth; once the depth exceeds a maximum, we stop queueing that page's links.
To implement this, we only need to change the linkset variable in crawl_link from a set into a dict and record each url's depth in it.
def crawl_link(url, link_regx, url_robots, user_agent="wuyanjing", max_depth=2):
    '''
    find all unique links in the url and its child pages
    every link is downloaded only once, only if robots.txt allows it,
    and only up to max_depth links away from the start url
    :param url: the url which you want to search
    :param link_regx: re match pattern
    :param url_robots: the url of robots.txt
    :param user_agent: user-agent
    :param max_depth: the max depth of every url
    :return:
    '''
    links = [url]
    linkset = {url: 0}
    throttle = Throttle.Throttle(5)
    while links:
        currentpage = links.pop()
        rp = robot_parse(url_robots)
        if rp.can_fetch(user_agent, currentpage):
            throttle.wait(currentpage)
            html = download(currentpage)
            for linkson in find_html(html):
                if re.match(link_regx, linkson):
                    linkson = urlparse.urljoin(url, linkson)
                    # depth of the page we just downloaded
                    depth = linkset[currentpage]
                    if depth < max_depth:
                        # make sure we don't queue the same link twice
                        if linkson not in linkset:
                            linkset[linkson] = depth + 1
                            links.append(linkson)
        else:
            print 'Blocked by robots.txt', currentpage
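Finally, a test call along the lines of the earlier one; the robots.txt URL is assumed from the example site:

# crawl the example site, obeying robots.txt, throttling requests and limiting depth to 2
crawl_link("http://example.webscraping.com", '.*/(view|index)/',
           "http://example.webscraping.com/robots.txt", max_depth=2)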