Book Introduction
Background Research
Checking robots.txt
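Most websites define a robots.txt file so that crawlers know which restrictions apply when crawling the site. These restrictions are only advisory, but good web citizens should follow them.
More information: https://www.robotstxt.org
Example:
Requesting http://example.python-scraping.com/robots.txt returns the following: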
# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Disallow: /trap
Crawl-delay: 5

# section 3
Sitemap: http://example.python-scraping.com/sitemap.xml
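In section 1, robots.txt forbids crawlers whose user agent is BadCrawler from crawling the site. This rule is unlikely to have much effect, though, because a malicious crawler will simply ignore robots.txt.
Section 2 specifies that, whatever the user agent, a crawler should leave a 5-second delay between successive download requests. We should follow this suggestion to avoid overloading the server. There is also a /trap link, which is used to ban malicious crawlers that follow disallowed links. If you visit this link, the server bans your IP for one minute! A real website might ban your IP for much longer, perhaps permanently.
Section 3 defines a Sitemap file.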
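The three sections can also be checked offline. The following is a minimal sketch that feeds the robots.txt text shown above to the standard library's robotparser and queries it. The user agent name wswp and the /index path are only assumptions for illustration, and crawl_delay requires Python 3.6 or later.

```python
from urllib import robotparser

# The robots.txt content shown above, embedded so that no network access is needed.
ROBOTS_TXT = """\
# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Disallow: /trap
Crawl-delay: 5

# section 3
Sitemap: http://example.python-scraping.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

base = 'http://example.python-scraping.com'
# section 1: BadCrawler is disallowed everywhere
bad_allowed = rp.can_fetch('BadCrawler', base + '/')
# section 2: every other agent must avoid /trap but may fetch other paths
trap_allowed = rp.can_fetch('wswp', base + '/trap')
index_allowed = rp.can_fetch('wswp', base + '/index')
# section 2: suggested delay (in seconds) between requests
delay = rp.crawl_delay('wswp')

print(bad_allowed, trap_allowed, index_allowed, delay)
```

A polite crawler would consult can_fetch before every request and sleep for at least the reported delay between requests.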
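Checking the Sitemap
The Sitemap file that a website provides helps crawlers locate the site's latest content without crawling every page. For more details, the sitemap standard is defined at https://www.sitemaps.org/protocol.html. Many web publishing platforms can generate a sitemap automatically. Below is the content of the Sitemap file referenced in robots.txt: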
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.python-scraping.com/places/default/view/Afghanistan-1</loc></url>
  <url><loc>http://example.python-scraping.com/places/default/view/Aland-Islands-2</loc></url>
  <url><loc>http://example.python-scraping.com/places/default/view/Albania-3</loc></url>
  ...
</urlset>
The sitemap provides links to all of the site's pages.
Writing Your First Web Crawler
Downloading a Web Page
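As an illustration, the links can be pulled out of a sitemap with the standard library's XML parser. This is only a sketch: the helper name sitemap_links is hypothetical, and the embedded XML is a three-entry excerpt of the file shown above rather than the full sitemap.

```python
import xml.etree.ElementTree as ET

# Three-entry excerpt of the sitemap shown above.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.python-scraping.com/places/default/view/Afghanistan-1</loc></url>
  <url><loc>http://example.python-scraping.com/places/default/view/Aland-Islands-2</loc></url>
  <url><loc>http://example.python-scraping.com/places/default/view/Albania-3</loc></url>
</urlset>
"""

# Sitemap elements live in this XML namespace, so lookups must be qualified.
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def sitemap_links(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text.encode('utf-8'))
    return [loc.text.strip() for loc in root.findall('sm:url/sm:loc', NS)]


links = sitemap_links(SITEMAP_XML)
for link in links:
    print(link)
```

Compared with matching `<loc>` tags by regular expression, an XML parser copes with whitespace and attribute variations, at the cost of failing outright on malformed documents.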
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url):
    print('Downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        # return None rather than raising, so callers can test for failure
        print('Download error:', e.reason)
        html = None
    return html
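Retrying Downloads
The code below makes the download function retry when it encounters a 5xx error. You can try it on http://httpstat.us/500, which always returns a 500 status code.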
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2):
    print('Downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html
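The retry idea can also be written as a loop that pauses a little longer before each new attempt. The sketch below is a variation and not the code from these notes: download_with_backoff, make_flaky_fetch, and the base_delay parameter are hypothetical names, and the fetch argument stands in for the real network call so the behaviour can be tried without a server.

```python
import time
from urllib.error import HTTPError


def download_with_backoff(url, fetch, num_retries=2, base_delay=0.1):
    """Call fetch(url), retrying 5xx HTTPError responses with growing pauses."""
    for attempt in range(num_retries + 1):
        try:
            return fetch(url)
        except HTTPError as e:
            retryable = 500 <= e.code < 600
            if not retryable or attempt == num_retries:
                print('Download error:', e.reason)
                return None
            # wait base_delay, then 2x, then 4x ... before the next attempt
            time.sleep(base_delay * 2 ** attempt)
    return None


def make_flaky_fetch(failures):
    """Hypothetical stand-in for a server that fails `failures` times, then succeeds."""
    state = {'calls': 0}

    def fetch(url):
        state['calls'] += 1
        if state['calls'] <= failures:
            raise HTTPError(url, 500, 'Internal Server Error', None, None)
        return b'<html>ok</html>'

    return fetch, state


fetch, state = make_flaky_fetch(failures=2)
print(download_with_backoff('http://httpstat.us/500', fetch, base_delay=0))
print('attempts:', state['calls'])
```

With num_retries=2 the function makes at most three attempts, which matches the recursive version: one initial request plus two retries.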
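Setting the User Agent
By default, urllib downloads content with the user agent Python-urllib/3.x, where 3.x is the version of Python in use. Some websites block this default user agent, perhaps after experiencing server overload caused by poorly written Python crawlers.
To make downloads more reliable, we need control over the user agent. The code below parameterizes the download function with a default user agent of 'wswp' (short for Web Scraping with Python).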
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2, user_agent='wswp'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors, keeping the same user agent
                return download(url, num_retries=num_retries - 1,
                                user_agent=user_agent)
    return html
Sitemap Crawler
import re
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        # fall back to the given charset if the response headers name none
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries=num_retries - 1,
                                user_agent=user_agent, charset=charset)
    return html


def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    if sitemap is None:
        # nothing to crawl if the sitemap itself could not be downloaded
        return
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here
ID Iteration Crawler
The code below iterates over IDs and stops as soon as a download error occurs.
import itertools
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2):
    print('Downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            # retry on any 5xx status code, not only on 500
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries - 1)
    return html


def crawl_site(url):
    for page in itertools.count(1):
        pg_url = '{0}{1}'.format(url, page)
        html = download(pg_url)
        if html is None:
            # stop at the first page that fails to download
            break
This implementation has a weakness: some records may have been deleted, so database IDs are not necessarily consecutive. As soon as the crawler hits one of these gaps, it exits.
The version below improves on this by exiting only after several consecutive download errors.
import itertools
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries=num_retries - 1,
                                user_agent=user_agent, charset=charset)
    return html


def crawl_site(url, max_errors=5):
    num_errors = 0
    for page in itertools.count(1):
        pg_url = '{}{}'.format(url, page)
        html = download(pg_url)
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                # reached max number of consecutive errors, so exit
                break
        else:
            # success - reset the error count; can scrape the result
            num_errors = 0
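To see the effect of counting consecutive errors without touching a real server, the loop can be run against a simulated site. In this sketch EXISTING_IDS, fake_download, and crawl_ids are hypothetical names, and the set of surviving IDs is made up for the demonstration.

```python
import itertools

# Hypothetical site: records 3 and 4 were deleted, and nothing exists past ID 6.
EXISTING_IDS = {1, 2, 5, 6}


def fake_download(url):
    """Stand-in for download(): return HTML for existing IDs, None otherwise."""
    page_id = int(url.rsplit('-', 1)[1])
    if page_id in EXISTING_IDS:
        return '<html>record {}</html>'.format(page_id)
    return None


def crawl_ids(url, download, max_errors=5):
    """Return the IDs fetched before max_errors consecutive failures occur."""
    found = []
    num_errors = 0
    for page in itertools.count(1):
        html = download('{}{}'.format(url, page))
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                break
        else:
            num_errors = 0
            found.append(page)
    return found


base = 'http://example.python-scraping.com/view/-'
stops_at_gap = crawl_ids(base, fake_download, max_errors=1)
skips_gap = crawl_ids(base, fake_download, max_errors=3)
print(stops_at_gap, skips_gap)
```

With max_errors=1 the crawl behaves like the first version and never reaches the records after the gap. With max_errors=3 it rides over the two missing IDs and stops only after three misses in a row.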
Link Crawler
The code below downloads pages, follows their links, converts relative links to absolute ones, and skips links it has already seen.
import re
import urllib.request
from urllib.parse import urljoin
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries=num_retries - 1,
                                user_agent=user_agent, charset=charset)
    return html


def link_crawler(start_url, link_regex):
    """ Crawl from the given start URL following links matched by link_regex """
    crawl_queue = [start_url]
    # keep track of which URLs have been seen before
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if not html:
            continue
        # filter for links matching our regular expression
        for link in get_links(html):
            if re.match(link_regex, link):
                # convert the (possibly relative) link to an absolute URL
                abs_link = urljoin(start_url, link)
                if abs_link not in seen:
                    seen.add(abs_link)
                    crawl_queue.append(abs_link)


def get_links(html):
    """ Return a list of links from html """
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)
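The link handling can be tried on a small HTML fragment without any download. In the sketch below the HTML, the start URL, and the link_regex value are made-up inputs chosen for illustration. get_links uses the same regular expression as the crawler.

```python
import re
from urllib.parse import urljoin


def get_links(html):
    """Return the href value of every <a> tag found by a simple regex."""
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    return webpage_regex.findall(html)


# Hypothetical page content: one duplicate link and one link we do not want.
SAMPLE_HTML = """
<a href="/places/default/index/1">Next</a>
<a href='/places/default/view/Afghanistan-1'>Afghanistan</a>
<a href="/places/default/index/1">Next again</a>
<a href="/places/default/user/login">Log in</a>
"""

start_url = 'http://example.python-scraping.com/index'
link_regex = '/places/default/(index|view)/'

seen = set()
queue = []
for link in get_links(SAMPLE_HTML):
    if re.match(link_regex, link):           # keep only index and view pages
        abs_link = urljoin(start_url, link)  # relative -> absolute
        if abs_link not in seen:             # drop duplicates
            seen.add(abs_link)
            queue.append(abs_link)

print(queue)
```

The login link is filtered out by the regular expression, and the repeated index link is queued only once.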
Parsing robots.txt
First, we need to parse robots.txt so that we avoid downloading URLs the site disallows. The robotparser module in Python's urllib package makes this easy, as the following code shows:
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.python-scraping.com/robots.txt')
rp.read()
url = 'http://example.python-scraping.com/robots.txt'
user_agent = 'BadCrawler'
print(rp.can_fetch(user_agent, url))  # False
user_agent = 'GoodCrawler'
print(rp.can_fetch(user_agent, url))  # True
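To integrate robotparser into the link crawler, we first create a new function that returns the robotparser object.
from urllib import robotparser


def get_robots_parser(robots_url):
    """ Return the robots parser object for the given robots_url """
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp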
We need a reliable way to set robots_url. We can pass it to the function as an extra keyword argument, with a default value in case the caller does not supply it. We also need to define user_agent.
def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp'):
    ...
    if not robots_url:
        robots_url = '{}/robots.txt'.format(start_url)
    rp = get_robots_parser(robots_url)
    # finally, add the parser check inside the crawl loop
    ...
    while crawl_queue:
        url = crawl_queue.pop()
        if rp.can_fetch(user_agent, url):
            html = download(url, user_agent=user_agent)
            ...
        else:
            print('Blocked by robots.txt:', url)
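One caveat about the default: building the robots.txt address by appending to start_url gives the right result only when start_url is the site root without a trailing slash. A possible alternative, shown here as a sketch with the hypothetical helper name default_robots_url, is to resolve /robots.txt against the start URL with urljoin.

```python
from urllib.parse import urljoin


def default_robots_url(start_url):
    """Return the robots.txt URL at the root of the site that hosts start_url."""
    # a leading slash makes urljoin replace the whole path of start_url
    return urljoin(start_url, '/robots.txt')


root = default_robots_url('http://example.python-scraping.com')
nested = default_robots_url('http://example.python-scraping.com/places/default/index')
print(root)
print(nested)
```

Both calls yield the same address, so the crawler can be started from any page of the site.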
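Supporting Proxies
Below is the code for proxy support with urllib:
import urllib.request

proxy = 'http://myproxy.net:1234'
proxy_support = urllib.request.ProxyHandler({'http': proxy})
opener = urllib.request.build_opener(proxy_support)
# make every later urlopen() call go through this opener
urllib.request.install_opener(opener)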
Below is a new version of the download function with this feature integrated.
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, user_agent='wswp', num_retries=2, charset='utf-8', proxy=None):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        if proxy:
            # route requests through the proxy passed in by the caller
            proxy_support = urllib.request.ProxyHandler({'http': proxy})
            opener = urllib.request.build_opener(proxy_support)
            urllib.request.install_opener(opener)
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5xx HTTP errors with one fewer retry remaining
                return download(url, user_agent=user_agent,
                                num_retries=num_retries - 1,
                                charset=charset, proxy=proxy)
    return html
At the time of writing (Python 3.5), the urllib module does not support https proxies by default.
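install_opener changes the behaviour of every later urlopen call in the process. If that global effect is unwanted, one option is to keep the opener object and call its open method directly. The sketch below only builds the opener and inspects it, so it runs without a proxy being reachable. build_proxy_opener is a hypothetical helper name, and the proxy address is the placeholder used above.

```python
import urllib.request


def build_proxy_opener(proxy):
    """Return an opener that sends http requests through the given proxy."""
    proxy_support = urllib.request.ProxyHandler({'http': proxy})
    return urllib.request.build_opener(proxy_support)


opener = build_proxy_opener('http://myproxy.net:1234')

# Inspect the configured handlers instead of making a real request.
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)

# With a reachable proxy, a request would then be sent like this:
# request = urllib.request.Request('http://example.python-scraping.com')
# html = opener.open(request).read()
```

Because nothing is installed globally, different downloads in the same program can use different proxies.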
Throttling Downloads
If we crawl a website too quickly, we risk being banned or overloading the server. To reduce these risks, we can throttle the crawler by adding a delay between downloads. The class below implements this.
from urllib.parse import urlparse
import time


class Throttle:
    """ Add a delay between downloads to the same domain """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain has been accessed recently
                # so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()
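The Throttle class records when each domain was last accessed and sleeps if less than the specified delay has passed since then. We can throttle the crawler by calling throttle.wait before each download.
throttle = Throttle(delay)
throttle.wait(url)
html = download(url, user_agent=user_agent, num_retries=num_retries,
                charset=charset, proxy=proxy)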
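The per-domain behaviour is easy to observe by timing a few calls. The sketch below repeats the Throttle class so that it runs on its own. The 0.2-second delay and the second host name are arbitrary values chosen for the demonstration.

```python
import time
from urllib.parse import urlparse


class Throttle:
    """Add a delay between downloads to the same domain."""
    def __init__(self, delay):
        self.delay = delay   # seconds between requests to one domain
        self.domains = {}    # domain -> timestamp of last access

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = time.time()


throttle = Throttle(delay=0.2)
start = time.time()

throttle.wait('http://example.python-scraping.com/a')  # first visit: no sleep
first = time.time() - start

throttle.wait('http://example.python-scraping.com/b')  # same domain: sleeps
second = time.time() - start

throttle.wait('http://other.example.com/')             # new domain: no sleep
third = time.time() - start

print('elapsed: %.2f %.2f %.2f' % (first, second, third))
```

Only the second call pauses, because it is the only one that revisits a domain within the delay window.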
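Avoiding Crawler Traps
At present, our crawler follows every link it has not seen before. However, some websites generate page content dynamically, which can produce an unlimited number of pages. For example, if a site has an online calendar with links to the next month and the next year, the next month's page will in turn link to the month after it, and so on until the widget's maximum date is reached (which could be a very long time away). A site may offer the same thing through simple pagination, where the crawler keeps requesting empty search result pages until the maximum page number is reached. This situation is known as a crawler trap.
A simple way to avoid getting stuck in a crawler trap is to track how many links were followed to reach the current page, which we call its depth. Once the maximum depth is reached, the crawler stops adding links from that page to the queue. To implement this, we change the seen variable, which previously only recorded the links already visited, into a dictionary that also records the depth at which each link was found.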
def link_crawler(..., max_depth=4):
    seen = {}
    ...
    if rp.can_fetch(user_agent, url):
        depth = seen.get(url, 0)
        if depth == max_depth:
            print('Skipping %s due to depth' % url)
            continue
        ...
        for link in get_links(html):
            if re.match(link_regex, link):
                abs_link = urljoin(start_url, link)
                if abs_link not in seen:
                    seen[abs_link] = depth + 1
                    crawl_queue.append(abs_link)
With this feature in place, we can be confident that the crawl will eventually finish. To disable it, set max_depth to a negative number, so that the current depth never equals it.
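The depth limit can be demonstrated on a simulated trap. In the sketch below, calendar_links is a hypothetical stand-in for a calendar widget in which every page links to the following month, and crawl is a stripped-down version of the crawl loop with no downloading or robots.txt check.

```python
def calendar_links(url):
    """Hypothetical trap: every /calendar/N page links to /calendar/N+1."""
    month = int(url.rsplit('/', 1)[1])
    return ['/calendar/{}'.format(month + 1)]


def crawl(start_url, get_links, max_depth=4):
    """Follow links from start_url, skipping pages found at max_depth."""
    crawl_queue = [start_url]
    seen = {start_url: 0}   # URL -> depth at which it was discovered
    visited = []
    while crawl_queue:
        url = crawl_queue.pop()
        depth = seen[url]
        if depth == max_depth:
            print('Skipping %s due to depth' % url)
            continue
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen[link] = depth + 1
                crawl_queue.append(link)
    return visited


visited = crawl('/calendar/0', calendar_links, max_depth=4)
print(visited)
```

Without the depth check this loop would never end, because each page always yields one new unseen link.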
Complete Code
import re
import time
import urllib.request
from urllib import robotparser
from urllib.parse import urljoin, urlparse
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2, user_agent='wswp', charset='utf-8', proxy=None):
    """ Download a given URL and return the page content
        args:
            url (str): URL
        kwargs:
            user_agent (str): user agent (default: wswp)
            charset (str): charset if website does not include one in headers
            proxy (str): proxy url, ex 'http://IP' (default: None)
            num_retries (int): number of retries if a 5xx error is seen (default: 2)
    """
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        if proxy:
            proxy_support = urllib.request.ProxyHandler({'http': proxy})
            opener = urllib.request.build_opener(proxy_support)
            urllib.request.install_opener(opener)
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors with the same settings
                return download(url, num_retries=num_retries - 1,
                                user_agent=user_agent, charset=charset,
                                proxy=proxy)
    return html


def get_robots_parser(robots_url):
    """ Return the robots parser object using the robots_url """
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp


def get_links(html):
    """ Return a list of links (using simple regex matching) from the html content """
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


class Throttle:
    """ Add a delay between downloads to the same domain """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain has been accessed recently
                # so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()


def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp',
                 proxy=None, delay=3, max_depth=4):
    """ Crawl from the given start URL following links matched by link_regex.
        In the current implementation, we do not actually scrape any information.
        args:
            start_url (str): web site to start crawl
            link_regex (str): regex to match for links
        kwargs:
            robots_url (str): url of the site's robots.txt (default: start_url + /robots.txt)
            user_agent (str): user agent (default: wswp)
            proxy (str): proxy url, ex 'http://IP' (default: None)
            delay (int): seconds to throttle between requests to one domain (default: 3)
            max_depth (int): maximum crawl depth (to avoid traps) (default: 4)
    """
    crawl_queue = [start_url]
    # keep track of which URLs have been seen before, and at what depth
    seen = {}
    if not robots_url:
        robots_url = '{}/robots.txt'.format(start_url)
    rp = get_robots_parser(robots_url)
    throttle = Throttle(delay)
    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            depth = seen.get(url, 0)
            if depth == max_depth:
                print('Skipping %s due to depth' % url)
                continue
            throttle.wait(url)
            html = download(url, user_agent=user_agent, proxy=proxy)
            if not html:
                continue
            # TODO: add actual data scraping here
            # filter for links matching our regular expression
            for link in get_links(html):
                if re.match(link_regex, link):
                    abs_link = urljoin(start_url, link)
                    if abs_link not in seen:
                        seen[abs_link] = depth + 1
                        crawl_queue.append(abs_link)
        else:
            print('Blocked by robots.txt:', url)
requests version:
import re
import time
import requests
from urllib import robotparser
from urllib.parse import urljoin, urlparse


class Throttle:
    """ Add a delay between downloads to the same domain """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain has been accessed recently
                # so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()


def download(url, num_retries=2, user_agent='wswp', proxies=None):
    """ Download a given URL and return the page content
        args:
            url (str): URL
        kwargs:
            user_agent (str): user agent (default: wswp)
            proxies (dict): proxy dict w/ keys 'http' and 'https', values
                            are strs (i.e. 'http(s)://IP') (default: None)
            num_retries (int): # of retries if a 5xx error is seen (default: 2)
    """
    print('Downloading:', url)
    headers = {'User-Agent': user_agent}
    try:
        resp = requests.get(url, headers=headers, proxies=proxies)
        html = resp.text
        if resp.status_code >= 400:
            print('Download error:', resp.text)
            html = None
            if num_retries and 500 <= resp.status_code < 600:
                # recursively retry 5xx HTTP errors with the same settings
                return download(url, num_retries=num_retries - 1,
                                user_agent=user_agent, proxies=proxies)
    except requests.exceptions.RequestException as e:
        print('Download error:', e)
        html = None
    return html


def get_robots_parser(robots_url):
    """ Return the robots parser object using the robots_url """
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp


def get_links(html):
    """ Return a list of links (using simple regex matching) from the html content """
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp',
                 proxies=None, delay=3, max_depth=4):
    """ Crawl from the given start URL following links matched by link_regex.
        In the current implementation, we do not actually scrape any information.
        args:
            start_url (str): web site to start crawl
            link_regex (str): regex to match for links
        kwargs:
            robots_url (str): url of the site's robots.txt (default: start_url + /robots.txt)
            user_agent (str): user agent (default: wswp)
            proxies (dict): proxy dict w/ keys 'http' and 'https', values
                            are strs (i.e. 'http(s)://IP') (default: None)
            delay (int): seconds to throttle between requests to one domain (default: 3)
            max_depth (int): maximum crawl depth (to avoid traps) (default: 4)
    """
    crawl_queue = [start_url]
    # keep track of which URLs have been seen before, and at what depth
    seen = {}
    if not robots_url:
        robots_url = '{}/robots.txt'.format(start_url)
    rp = get_robots_parser(robots_url)
    throttle = Throttle(delay)
    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            depth = seen.get(url, 0)
            if depth == max_depth:
                print('Skipping %s due to depth' % url)
                continue
            throttle.wait(url)
            html = download(url, user_agent=user_agent, proxies=proxies)
            if not html:
                continue
            # TODO: add actual data scraping here
            # filter for links matching our regular expression
            for link in get_links(html):
                if re.match(link_regex, link):
                    abs_link = urljoin(start_url, link)
                    if abs_link not in seen:
                        seen[abs_link] = depth + 1
                        crawl_queue.append(abs_link)
        else:
            print('Blocked by robots.txt:', url)
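Scraping Data
Having built a crawler that downloads web pages, we now want it to extract data from each page and do something with it. This practice is known as scraping.
Regular Expressions
Official documentation: https://docs.python.org/3/howto/regex.html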
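As a small illustration of scraping with a regular expression, the sketch below extracts a table cell from an HTML fragment. The fragment, its id and class names, and the helper scrape_field are all assumptions made up for this example and are not taken from a real page.

```python
import re

# Hypothetical HTML fragment: each row has a label cell and a value cell.
SAMPLE_HTML = '''
<tr id="places_area__row">
  <td class="w2p_fl"><label>Area: </label></td>
  <td class="w2p_fw">647,500 square kilometres</td>
</tr>
<tr id="places_population__row">
  <td class="w2p_fl"><label>Population: </label></td>
  <td class="w2p_fw">29,121,286</td>
</tr>
'''


def scrape_field(html, field):
    """Return the value cell of the row for `field`, or None if it is absent."""
    pattern = (r'<tr id="places_' + re.escape(field) + r'__row">.*?'
               r'<td\s+class="w2p_fw">(.*?)</td>')
    # DOTALL lets .*? run across the line breaks inside a row
    match = re.search(pattern, html, re.DOTALL)
    return match.group(1) if match else None


area = scrape_field(SAMPLE_HTML, 'area')
population = scrape_field(SAMPLE_HTML, 'population')
capital = scrape_field(SAMPLE_HTML, 'capital')
print(area, population, capital)
```

Anchoring the pattern on the row id keeps it from matching the wrong cell, but patterns like this still break easily when the page layout changes, which is the usual argument for a real HTML parser.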
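Beautiful Soup
Documentation (Chinese): https://beautifulsoup.readthedocs.io/zh_CN/latest/
Installation:
pip install beautifulsoup4
Install the html5lib parser:
pip install html5lib
When used with html5lib, BeautifulSoup correctly handles missing attribute quotes and missing closing tags, turning the input into a complete HTML document.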
lxml
lxml is a Python library built on the libxml2 XML parsing library. It is written in C and parses faster than BeautifulSoup.