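Writing Web Crawlers with Python, 2nd Edition

Book title: 用 Python 寫網絡爬蟲（第2版）

Summary: The book covers what a web crawler is and how to crawl a website; how to extract data from web pages with several libraries; how to cache results to avoid downloading the same page twice; how to speed up scraping with parallel downloads; several ways to extract data from dynamic websites; how to work with forms (inputs and navigation) to search and log in; how to access data protected by CAPTCHA images; how to use the Scrapy framework for fast parallel crawling; and how to build a crawler through Portia's web interface.

Douban: https://book.douban.com/subject/30275479/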

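Background research

Checking robots.txt

Most websites define a robots.txt file to tell crawlers what restrictions apply when crawling the site. These restrictions are only suggestions, but good web citizens should follow them.

More information: https://www.robotstxt.org

Example:

Fetching http://example.python-scraping.com/robots.txt returns the following: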

# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Disallow: /trap 
Crawl-delay: 5

# section 3
Sitemap: http://example.python-scraping.com/sitemap.xml

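In section 1, robots.txt forbids any crawler whose user agent is BadCrawler from crawling the site. This rule is unlikely to achieve much, though, because a malicious crawler will simply ignore robots.txt.

Section 2 says that, whatever the user agent, a crawler should wait 5 seconds between download requests. We should follow this suggestion to avoid overloading the server. It also lists a /trap link, which is used to ban malicious crawlers that follow disallowed links. If you visit that link, the server bans your IP for one minute! A real website might ban your IP for much longer, or even permanently.

Section 3 defines a Sitemap file.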
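The three sections can be checked offline with the standard library's urllib.robotparser. The following is a minimal sketch: the robots.txt text is inlined rather than fetched, 'wswp' stands in for any well-behaved user agent, and crawl_delay() assumes Python 3.6 or later.

```python
from urllib import robotparser

# The sample robots.txt from above, inlined so that no network access is needed.
ROBOTS_TXT = """\
# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Disallow: /trap
Crawl-delay: 5

# section 3
Sitemap: http://example.python-scraping.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
# parse() accepts the file's lines directly instead of fetching a URL
rp.parse(ROBOTS_TXT.splitlines())

base = 'http://example.python-scraping.com'
# section 1: BadCrawler is blocked everywhere
bad_allowed = rp.can_fetch('BadCrawler', base + '/')
# section 2: any other agent may crawl, except for /trap
trap_allowed = rp.can_fetch('wswp', base + '/trap')
index_allowed = rp.can_fetch('wswp', base + '/places/default/index')
# Crawl-delay that applies to agents matched by the wildcard section
delay = rp.crawl_delay('wswp')

print(bad_allowed, trap_allowed, index_allowed, delay)
```

Here only the ordinary page is allowed, and the delay reported for 'wswp' is the 5 seconds from section 2.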

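Checking the sitemap

A site's Sitemap file helps crawlers locate its latest content without crawling every page. For more detail, the sitemap standard is defined at https://www.sitemaps.org/protocol.html. Many web publishing platforms can generate a sitemap automatically. Here is the content of the Sitemap file referenced in robots.txt: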

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://example.python-scraping.com/places/default/view/Afghanistan-1</loc></url>
<url><loc>http://example.python-scraping.com/places/default/view/Aland-Islands-2</loc></url>
<url><loc>http://example.python-scraping.com/places/default/view/Albania-3</loc></url>
...
</urlset>

The sitemap provides links to all of the site's pages.
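As a rough illustration of how those links can be pulled out, the sketch below parses a shortened, inlined copy of the sitemap with the standard library's xml.etree.ElementTree. The helper name sitemap_links is made up for this example, and an XML parser is only one option (a regular expression would also work).

```python
import xml.etree.ElementTree as ET

# A shortened copy of the sitemap above, inlined so the example runs offline.
SITEMAP_XML = """\
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://example.python-scraping.com/places/default/view/Afghanistan-1</loc></url>
<url><loc>http://example.python-scraping.com/places/default/view/Aland-Islands-2</loc></url>
<url><loc>http://example.python-scraping.com/places/default/view/Albania-3</loc></url>
</urlset>
"""

# Sitemap elements live in this XML namespace, so lookups must be qualified.
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def sitemap_links(xml_text):
    """Return the URL inside every <loc> element (hypothetical helper)."""
    # Encode to bytes so the parser honours the encoding declaration itself.
    root = ET.fromstring(xml_text.encode('utf-8'))
    return [loc.text.strip() for loc in root.findall('sm:url/sm:loc', NS)]


links = sitemap_links(SITEMAP_XML)
for link in links:
    print(link)
```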

Writing your first web crawler

Downloading a web page

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url):
    print('Downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
    return html

Retrying downloads

The code below makes the download function retry when a 5xx error occurs. You can try it with http://httpstat.us/500, which always returns a 500 status code.

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2):
    print('Downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html
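To see the retry behaviour without touching the network, one option is to replace urlopen with a stub that always fails. The sketch below repeats the download function (with the two nested conditions merged) and uses unittest.mock for the stub. The count_attempts helper and the example.invalid URL are made up for this illustration.

```python
import urllib.request
from unittest import mock
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2):
    """Same retry logic as above, repeated so the example is self-contained."""
    try:
        html = urllib.request.urlopen(url).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            # recursively retry 5xx HTTP errors
            return download(url, num_retries - 1)
    return html


def count_attempts(status_code, num_retries=2):
    """Return (result, number of urlopen calls) when every request fails
    with the given HTTP status code (hypothetical helper)."""
    url = 'http://example.invalid/'
    error = HTTPError(url, status_code, 'simulated failure', None, None)
    with mock.patch('urllib.request.urlopen', side_effect=error) as fake_urlopen:
        result = download(url, num_retries=num_retries)
    return result, fake_urlopen.call_count


print(count_attempts(500))  # server error: the first attempt plus two retries
print(count_attempts(404))  # client error: no retry
```

A 5xx response is retried until num_retries is used up, while a 4xx response is attempted only once, since repeating the same request is unlikely to help.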

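Setting the user agent

By default, urllib downloads pages with the user agent Python-urllib/3.x, where 3.x is the version of Python in use. Some websites block this default user agent, perhaps because badly written Python crawlers have overloaded their servers in the past.

To make downloads more reliable, we need control over the user agent. The code below parameterizes the download function with a default user agent of 'wswp' (short for Web Scraping with Python):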

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2, user_agent='wswp'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries=num_retries - 1, user_agent=user_agent)
    return html
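A quick way to confirm which header would be sent, without making a request, is to inspect the Request object. This is only a small sketch; the URL is the example site used throughout these notes.

```python
import urllib.request

request = urllib.request.Request('http://example.python-scraping.com/')
request.add_header('User-agent', 'wswp')

# Request stores header names capitalized, so look it up as 'User-agent'
user_agent_header = request.get_header('User-agent')
print(user_agent_header)

# header_items() lists the headers attached to this request so far
print(request.header_items())
```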

Sitemap crawler

import urllib.request
import re

from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries=num_retries - 1,
                                user_agent=user_agent, charset=charset)
    return html


def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    if sitemap is None:
        # the sitemap could not be downloaded, so there is nothing to crawl
        return
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here

ID iteration crawler

The code below iterates over IDs and stops at the first download error.

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError
import itertools


def download(url, num_retries=2):
    print('Downloading:', url)
    try:
        html = urllib.request.urlopen(url).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries - 1)
    return html


def crawl_site(url):
    for page in itertools.count(1):
        pg_url = '{0}{1}'.format(url, page)
        html = download(pg_url)
        if html is None:
            break

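One weakness of this implementation is that records may have been deleted, so database IDs are not necessarily consecutive. As soon as the crawler hits one of these gaps, it exits.

The version below improves on this: it exits only after several consecutive download errors.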

import itertools
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries=num_retries - 1,
                                user_agent=user_agent, charset=charset)
    return html


def crawl_site(url, max_errors=5):
    num_errors = 0
    for page in itertools.count(1):
        pg_url = '{}{}'.format(url, page)
        html = download(pg_url)
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                # reached max number of errors, so exit
                break
        else:
            num_errors = 0
            # success - can scrape the result

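Link crawler

The code below downloads pages, follows links, converts relative links to absolute ones, and skips links it has already seen.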

import re
import urllib.request
from urllib.parse import urljoin
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries=num_retries - 1,
                                user_agent=user_agent, charset=charset)
    return html


def link_crawler(start_url, link_regex):
    " Crawl from the given start URL following links matched by link_regex "
    crawl_queue = [start_url]
    # keep track of which URLs have been seen before
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if not html:
            continue
        # filter for links matching our regular expression
        for link in get_links(html):
            if re.match(link_regex, link):
                abs_link = urljoin(start_url, link)
                if abs_link not in seen:
                    seen.add(abs_link)
                    crawl_queue.append(abs_link)


def get_links(html):
    " Return a list of links from html "
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)
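The three steps (extract, make absolute, deduplicate) can be tried on a small inlined page. The HTML fragment and the link_regex value below are assumptions chosen for illustration. The link-extraction regular expression is the one from get_links.

```python
import re
from urllib.parse import urljoin

# Hypothetical page fragment with a repeated link and one unwanted link.
SAMPLE_HTML = """
<a href="/places/default/index/1">Next</a>
<A HREF='/places/default/view/Afghanistan-1'>Afghanistan</A>
<a class="nav" href="/places/default/index/1">Next (again)</a>
<a href="/places/default/user/login">Log In</a>
"""

webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
start_url = 'http://example.python-scraping.com/places/default/index'
link_regex = '/places/default/(index|view)/'

seen = set()
queue = []
for link in webpage_regex.findall(SAMPLE_HTML):
    # keep only links that match the pattern we want to follow
    if re.match(link_regex, link):
        # resolve the relative link against the start URL
        abs_link = urljoin(start_url, link)
        # skip links that were already queued
        if abs_link not in seen:
            seen.add(abs_link)
            queue.append(abs_link)

for url in queue:
    print(url)
```

The login link is filtered out by link_regex and the repeated index link is queued only once, leaving two absolute URLs.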

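Parsing robots.txt

First, we need to parse robots.txt so that we avoid downloading URLs that are disallowed. The robotparser module in Python's urllib package makes this easy, as the following code shows: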

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.python-scraping.com/robots.txt')
rp.read()
url = 'http://example.python-scraping.com/robots.txt'
user_agent = 'BadCrawler'
print(rp.can_fetch(user_agent, url))  # False
user_agent = 'GoodCrawler'
print(rp.can_fetch(user_agent, url))  # True

To integrate robotparser into the link crawler, we first create a new function that returns the robotparser object.

from urllib import robotparser


def get_robots_parser(robots_url):
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp

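We need a reliable way to set robots_url. We can pass it to the function as an extra keyword argument, with a default value in case the caller does not supply it. We also need to define user_agent.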

def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp'):
    ...
    if not robots_url:
        robots_url = '{}/robots.txt'.format(start_url)
    rp = get_robots_parser(robots_url)

    # finally, add the parser check inside the crawl loop
    ...
    while crawl_queue:
        url = crawl_queue.pop()
        if rp.can_fetch(user_agent, url):
            html = download(url, user_agent=user_agent)
            ...
        else:
            print('Blocked by robots.txt:', url)

Proxy support

The code below adds proxy support using urllib:

import urllib.request

proxy = 'http://myproxy.net:1234'
proxy_support = urllib.request.ProxyHandler({'http': proxy})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)

Here is a new version of the download function with this feature integrated:

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, user_agent='wswp', num_retries=2, charset='utf-8', proxy=None):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)

    try:
        if proxy:
            proxy_support = urllib.request.ProxyHandler({'http': proxy})
            opener = urllib.request.build_opener(proxy_support)
            urllib.request.install_opener(opener)
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, user_agent=user_agent, num_retries=num_retries - 1,
                                charset=charset, proxy=proxy)
    return html

Currently (as of Python 3.5), the urllib module does not support https proxies by default.
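One thing to be aware of is that install_opener() changes the behaviour of urllib.request.urlopen() for the whole program. A possible alternative, sketched below, is to keep the opener as a local object and call its open() method. The build_proxy_opener helper is a name made up for this example, and the proxy address is the same placeholder as above.

```python
import urllib.request


def build_proxy_opener(proxy):
    """Return an opener that routes http requests through `proxy`.

    Unlike install_opener(), this does not change what the module-level
    urllib.request.urlopen() does for the rest of the program.
    """
    proxy_support = urllib.request.ProxyHandler({'http': proxy})
    return urllib.request.build_opener(proxy_support)


# 'http://myproxy.net:1234' is a placeholder, not a working proxy
opener = build_proxy_opener('http://myproxy.net:1234')

# Inspect the opener to confirm that our proxy settings are in place
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)

# A request would then be sent with opener.open(request)
# instead of urllib.request.urlopen(request).
```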

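Throttling downloads

If we crawl a website too quickly, we risk being banned or overloading the server. To reduce these risks, we can throttle the crawler by waiting for a set delay between downloads. The following class implements this.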

from urllib.parse import urlparse
import time


class Throttle:
    """ Add a delay between downloads to the same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)

        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain has been accessed recently
                # so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()

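The Throttle class records when each domain was last accessed and sleeps if less than the specified delay has passed since then. Calling throttle before every download rate-limits the crawler.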

throttle = Throttle(delay)
throttle.wait(url)
html = download(url, user_agent=user_agent, num_retries=num_retries, charset=charset, proxy=proxy)
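The per-domain behaviour can be observed by timing a few wait() calls. The sketch below repeats the Throttle class so that it runs on its own, uses a short 0.2 second delay to keep the demo quick, and makes no actual downloads. The URLs are placeholders.

```python
import time
from urllib.parse import urlparse


class Throttle:
    """Add a delay between downloads to the same domain (as defined above)."""
    def __init__(self, delay):
        self.delay = delay
        # timestamp of when each domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = time.time()


throttle = Throttle(delay=0.2)
urls = [
    'http://example.python-scraping.com/a',  # first visit: no wait
    'http://example.python-scraping.com/b',  # same domain: waits about 0.2 s
    'http://example.org/c',                  # different domain: no wait
]

start = time.monotonic()
for url in urls:
    throttle.wait(url)
    # download(url) would go here
elapsed = time.monotonic() - start
print('elapsed: {:.2f}s'.format(elapsed))
```

Only the second request is delayed, because it is the only one that hits a domain accessed less than delay seconds earlier.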

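Avoiding crawler traps

So far our crawler follows every link it has not seen before. Some websites generate page content dynamically, however, and can produce an unlimited number of pages. For example, a site with an online calendar may link to the next month and the next year. The next month's page links to the month after that, and so on until the widget's maximum date is reached (which may be a very long way off). A site may show the same behaviour in simple pagination, where the crawler keeps requesting empty search result pages until it hits the maximum page number. This situation is called a crawler trap.

A simple way to avoid getting stuck in a crawler trap is to track how many links were followed to reach the current page, which we call its depth. Once the maximum depth is reached, the crawler stops adding links from that page to the queue. To implement this, we change the seen variable, which previously only recorded visited links, into a dictionary that also records the depth at which each link was found.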

def link_crawler(..., max_depth=4):
    seen = {}
    ...
    if rp.can_fetch(user_agent, url):
        depth = seen.get(url, 0)
        if depth == max_depth:
            print('Skipping %s due to depth' % url)
            continue
        ...
        for link in get_links(html):
            if re.match(link_regex, link):
                abs_link = urljoin(start_url, link)
                if abs_link not in seen:
                    seen[abs_link] = depth + 1
                    crawl_queue.append(abs_link)

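With this feature in place, we can be confident that the crawl will eventually finish. To disable it, set max_depth to a negative number, so that the current depth can never equal it.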
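The depth limit can be exercised against a simulated trap. In this sketch the calendar_links function stands in for downloading a page and extracting its links, the example.invalid URLs are invented, and the start URL is recorded in seen at depth 0 (a small addition that keeps it from being queued again).

```python
def crawl_with_depth(start_url, get_links, max_depth=4):
    """Depth-limited traversal using the same `seen` dict idea as above.
    `get_links` stands in for downloading a page and extracting its links."""
    crawl_queue = [start_url]
    # maps each discovered URL to the depth at which it was found
    seen = {start_url: 0}
    visited = []
    while crawl_queue:
        url = crawl_queue.pop()
        depth = seen.get(url, 0)
        if depth == max_depth:
            print('Skipping %s due to depth' % url)
            continue
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen[link] = depth + 1
                crawl_queue.append(link)
    return visited


def calendar_links(url):
    """Hypothetical endless calendar: each month links to the next one."""
    month = int(url.rsplit('/', 1)[1])
    return ['http://example.invalid/calendar/{}'.format(month + 1)]


visited = crawl_with_depth('http://example.invalid/calendar/0',
                           calendar_links, max_depth=4)
print(len(visited))
```

Without the depth check this loop would never end. With max_depth=4 it visits months 0 to 3 and skips the page found at depth 4.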

Complete code

import re
import time
import urllib.request
from urllib import robotparser
from urllib.parse import urljoin,urlparse
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2, user_agent='wswp', charset='utf-8', proxy=None):
    """ Download a given URL and return the page content
        args:
            url (str): URL
        kwargs:
            user_agent (str): user agent (default: wswp)
            charset (str): charset if website does not include one in headers
            proxy (str): proxy url, ex 'http://IP' (default: None)
            num_retries (int): number of retries if a 5xx error is seen (default: 2)
    """
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        if proxy:
            proxy_support = urllib.request.ProxyHandler({'http': proxy})
            opener = urllib.request.build_opener(proxy_support)
            urllib.request.install_opener(opener)
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries=num_retries - 1, user_agent=user_agent,
                                charset=charset, proxy=proxy)
    return html


def get_robots_parser(robots_url):
    " Return the robots parser object using the robots_url "
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp


def get_links(html):
    " Return a list of links (using simple regex matching) from the html content "
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)

class Throttle:
    """ Add a delay between downloads to the same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)

        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain has been accessed recently
                # so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()

def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp',
                 proxy=None, delay=3, max_depth=4):
    """ Crawl from the given start URL following links matched by link_regex. In the current
        implementation, we do not actually scrape any information.

        args:
            start_url (str): web site to start crawl
            link_regex (str): regex to match for links
        kwargs:
            robots_url (str): url of the site's robots.txt (default: start_url + /robots.txt)
            user_agent (str): user agent (default: wswp)
            proxy (str): proxy url, ex 'http://IP' (default: None)
            delay (int): seconds to throttle between requests to one domain (default: 3)
            max_depth (int): maximum crawl depth (to avoid traps) (default: 4)
    """
    crawl_queue = [start_url]
    # keep track of which URLs have been seen before
    seen = {}
    if not robots_url:
        robots_url = '{}/robots.txt'.format(start_url)
    rp = get_robots_parser(robots_url)
    throttle = Throttle(delay)
    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            depth = seen.get(url, 0)
            if depth == max_depth:
                print('Skipping %s due to depth' % url)
                continue
            throttle.wait(url)
            html = download(url, user_agent=user_agent, proxy=proxy)
            if not html:
                continue
            # TODO: add actual data scraping here
            # filter for links matching our regular expression
            for link in get_links(html):
                if re.match(link_regex, link):
                    abs_link = urljoin(start_url, link)
                    if abs_link not in seen:
                        seen[abs_link] = depth + 1
                        crawl_queue.append(abs_link)
        else:
            print('Blocked by robots.txt:', url)

requests version:

import re
import time
import requests
from urllib import robotparser
from urllib.parse import urljoin,urlparse

class Throttle:
    """ Add a delay between downloads to the same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)

        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain has been accessed recently
                # so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()

def download(url, num_retries=2, user_agent='wswp', proxies=None):
    """ Download a given URL and return the page content
        args:
            url (str): URL
        kwargs:
            user_agent (str): user agent (default: wswp)
            proxies (dict): proxy dict w/ keys 'http' and 'https', values
                            are strs (i.e. 'http(s)://IP') (default: None)
            num_retries (int): # of retries if a 5xx error is seen (default: 2)
    """
    print('Downloading:', url)
    headers = {'User-Agent': user_agent}
    try:
        resp = requests.get(url, headers=headers, proxies=proxies)
        html = resp.text
        if resp.status_code >= 400:
            print('Download error:', resp.text)
            html = None
            if num_retries and 500 <= resp.status_code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries=num_retries - 1,
                                user_agent=user_agent, proxies=proxies)
    except requests.exceptions.RequestException as e:
        print('Download error:', e)
        html = None
    return html


def get_robots_parser(robots_url):
    " Return the robots parser object using the robots_url "
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp


def get_links(html):
    """ Return a list of links (using simple regex matching)
        from the html content """
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp',
                 proxies=None, delay=3, max_depth=4):
    """ Crawl from the given start URL following links matched by link_regex.
    In the current implementation, we do not actually scrape any information.

        args:
            start_url (str): web site to start crawl
            link_regex (str): regex to match for links
        kwargs:
            robots_url (str): url of the site's robots.txt
                              (default: start_url + /robots.txt)
            user_agent (str): user agent (default: wswp)
            proxies (dict): proxy dict w/ keys 'http' and 'https', values
                            are strs (i.e. 'http(s)://IP') (default: None)
            delay (int): seconds to throttle between requests
                         to one domain (default: 3)
            max_depth (int): maximum crawl depth (to avoid traps) (default: 4)
    """
    crawl_queue = [start_url]
    # keep track of which URLs have been seen before
    seen = {}
    if not robots_url:
        robots_url = '{}/robots.txt'.format(start_url)
    rp = get_robots_parser(robots_url)
    throttle = Throttle(delay)
    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            depth = seen.get(url, 0)
            if depth == max_depth:
                print('Skipping %s due to depth' % url)
                continue
            throttle.wait(url)
            html = download(url, user_agent=user_agent, proxies=proxies)
            if not html:
                continue
            # TODO: add actual data scraping here
            # filter for links matching our regular expression
            for link in get_links(html):
                if re.match(link_regex, link):
                    abs_link = urljoin(start_url, link)
                    if abs_link not in seen:
                        seen[abs_link] = depth + 1
                        crawl_queue.append(abs_link)
        else:
            print('Blocked by robots.txt:', url)

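Scraping data

So far we have built a crawler that downloads web pages. Now we want it to extract data from each page and do something with it, a practice known as scraping.

Regular expressions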

Official documentation: https://docs.python.org/3/howto/regex.html
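As a minimal sketch of regex-based scraping, the example below pulls one value out of an inlined HTML table. The markup (row ids and the w2p_fl / w2p_fw class names) is an assumption made for illustration, not taken from a downloaded page.

```python
import re

# Hypothetical fragment of a country page; the markup is an assumption.
SAMPLE_HTML = """
<table>
<tr id="places_area__row"><td class="w2p_fl">Area: </td><td class="w2p_fw">647,500 square kilometres</td></tr>
<tr id="places_population__row"><td class="w2p_fl">Population: </td><td class="w2p_fw">29,121,286</td></tr>
</table>
"""

# Anchor on the row id, then capture the contents of the value cell.
# \s* and ["\'] make the pattern tolerate extra spaces and either quote style.
area_regex = re.compile(
    r'<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>')

match = area_regex.search(SAMPLE_HTML)
area = match.group(1) if match else None
print(area)
```

Anchoring on the row id rather than on the cell's position makes the pattern less likely to break if other rows are added, although any regular expression remains sensitive to changes in the page layout.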

Beautiful Soup

Chinese documentation: https://beautifulsoup.readthedocs.io/zh_CN/latest/

Install command:

pip install beautifulsoup4

Install the html5lib parser:

pip install html5lib

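With html5lib as its parser, BeautifulSoup correctly handles missing attribute quotes and unclosed tags, turning the input into a complete HTML document.

Lxml

lxml is a Python library built on libxml2, an XML parsing library written in C, and it parses faster than BeautifulSoup.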