第一章：網絡爬蟲簡介

1.1 網絡爬蟲何時會有用？

以結構化的格式，獲取網上的批量數據（理論上可以手工，但是自動化可以省時省力）

1.2 網絡爬蟲是否合法？

被抓取的數據用於個人用途，且在合理使用版權法的條件下，通常沒有問題

1.3 python3

工具：
- anaconda
- virtual environment wrapper （https://virtuallenvwrapper.readthedocs.io/en/latest）
- conda (https://conda.io/docs/intro.html)
python 版本：python3.4+

1.4 背景調研

調研工具：
- robots.txt
- sitemap
- google -> WHQIS

1.4.1 檢查robots.txt

了解當前網站的爬取限制
可以發現和網站結構相關的線索
詳見：http://robotstxt.org

1.4.2 檢查網站地圖(sitemap)

幫助爬蟲定位網站最新的內容，無需爬取每一個網頁
網站地圖標准定義：http://www.sitemap.org/protocol.html

1.4.3 估算網站大小

目標網站大小會影響我們爬取方式：效率問題
工具：https://www.google.com/advanced_search
- 在域名后面添加url路徑，可以對結果過濾，僅顯示網站的某些部分

1.4.4 識別網站所有技術

detectem模塊 (pip install detectem)
工具：
- 安裝 Docker （http://www.docker.com/products/overview）
- bash:$docker pull scrapinghub/splash
- bash:$pip install detectem
- python 虛擬環境（https://docs.python.org/3/library/venv.html）
- conda 環境（https://conda.io/docs/using/envs.com）
- 查看項目的README（https://github.com/spectresearch/detectem）

$ det http://example.python-scraping.com

'''
[{'name': 'jquery', 'version': '1.11.0'},
 {'name': 'modernizr', 'version': '2.7.1'},
 {'name': 'nginx', 'version': '1.12.2'}]
'''
$ docker pull wappalyzer/cli
$ docker run wappalyzer/cli http://example.python-scraping.com

1.4.5 尋找網站所有者

尋找網站所有者：使用WHOIS協議查詢網站域名注冊所有者
- python 中有針對該協議封裝的庫（https://pypi.python.org/pypi/python-whois）
- 安裝：pip install python-whois

import whois
print(whois.whois('url'))

1.5 編寫第一個網絡爬蟲

爬取：下載包涵感興趣數據的網頁
爬取所用的方法有很多，選取哪種更合適：取決於目標網站的結構
三種爬取網站的常見方法：
- 爬取網站地圖
- 使用數據庫ID便歷每一個網頁
- 跟蹤網頁鏈接

1.5.1 抓取與爬取的對比

抓取：針對特定網站，並在站點上獲取指定信息
爬取：通用的方式構建，目標是一系列頂級域名的網站或是整個網絡。可以用來收集更具體的信息，更常見的是爬取整個網絡。從不同站點或頁面獲取的小而通用的信息，然后跟蹤連接到其他頁面中。

1.5.2 下載網頁

1.5.2.1 下載網頁

下載時經常遇到臨時錯誤：
- 服務器過載（503 Service Unavailable）
  - 短暫等待后繼續嘗試重新下載
- 網頁不存在（404 Not Found）
- 請求時發生問題（4XX）-重新下載無效果
- 服務端存在問題（5XX）-可重新下載

1.5.2.2 設置代理

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError

# user_agent='wswp' 設置用戶代理
def download(url, num_retries=2, user_agent='wswp'):
    print('Downloading:', url)
    # 設置用戶代理
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    
    try:
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html

1.5.3 網站地圖爬蟲

使用正則表達式將robots.txt的url從標簽中取出

# 導入url解析庫
import urllib.request
# 導入正則庫
import re
# 導入解析錯誤庫
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html


def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here

test_url = 'http://example.python-scraping.com/sitemap.xml'
crawl_sitemap(test_url)

'''
Downloading: http://example.python-scraping.com/sitemap.xml
Downloading: http://example.python-scraping.com/places/default/view/Afghanistan-1
Downloading: http://example.python-scraping.com/places/default/view/Aland-Islands-2
Downloading: http://example.python-scraping.com/places/default/view/Albania-3
Downloading: http://example.python-scraping.com/places/default/view/Algeria-4
Downloading: http://example.python-scraping.com/places/default/view/American-Samoa-5
Downloading: http://example.python-scraping.com/places/default/view/Andorra-6
Downloading: http://example.python-scraping.com/places/default/view/Angola-7
Downloading: http://example.python-scraping.com/places/default/view/Anguilla-8
Downloading: http://example.python-scraping.com/places/default/view/Antarctica-9
Downloading: http://example.python-scraping.com/places/default/view/Antigua-and-Barbuda-10
Downloading: http://example.python-scraping.com/places/default/view/Argentina-11
...
'''

1.5.4 ID便歷爬蟲

import itertools
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html


def crawl_site(url, max_errors=5):
    num_errors = 0
    for page in itertools.count(1):
        pg_url = '{}{}'.format(url, page)
        html = download(pg_url)
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                # reached max number of errors, so exit
                break
        else:
            num_errors = 0
            # success - can scrape the result
test_url2 = 'http://example.python-scraping.com/view/-'
# 暫時存在問題，待調
crawl_sitemap(test_url2)

1.5.5 鏈接爬蟲

使用正則表達式確定應當下載哪些頁面

# 正則表達式
import re
# 發送請求
import urllib.request
# 解析+鏈接拼接
from urllib.parse import urljoin
# 導入錯誤類型
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html


def link_crawler(start_url, link_regex):
    " Crawl from the given start URL following links matched by link_regex "
    crawl_queue = [start_url]
    # keep track which URL's have seen before
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if not html:
            continue
        # filter for links matching our regular expression
        for link in get_links(html):
            if re.match(link_regex, link):
                abs_link = urljoin(start_url, link)
                if abs_link not in seen:
                    seen.add(abs_link)
                    crawl_queue.append(abs_link)


def get_links(html):
    " Return a list of links from html "
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)

高級功能
- 1.解析robots.txt文件，避免下載禁止爬取的URL，使用python的urllib庫中的robotparser模塊，就可以輕松完成這項工作
- 2.支持代理：有時候需要使用代理訪問某個網站，，使用python urllib支持代理
- 3.下載限速：降低被封號的風險，在兩次下載之間添加一組延時，對爬蟲進行限速
- 4.避免爬蟲陷阱：下載無限的網頁，避免爬蟲陷阱，記錄當前爬取深度
最終版本

# 最終版本
from urllib.parse import urlparse
import time


class Throttle:
    """ Add a delay between downloads to the same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)

        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.time() - last_accessed)
            if sleep_secs > 0:
                # domain has been accessed recently
                # so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = time.time()


import re
import urllib.request
from urllib import robotparser
from urllib.parse import urljoin
from urllib.error import URLError, HTTPError, ContentTooShortError
# from throttle import Throttle
# from throtte import Throttle


def download(url, num_retries=2, user_agent='wswp', charset='utf-8', proxy=None):
    """ Download a given URL and return the page content
        args:
            url (str): URL
        kwargs:
            user_agent (str): user agent (default: wswp)
            charset (str): charset if website does not include one in headers
            proxy (str): proxy url, ex 'http://IP' (default: None)
            num_retries (int): number of retries if a 5xx error is seen (default: 2)
    """
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        if proxy:
            proxy_support = urllib.request.ProxyHandler({'http': proxy})
            opener = urllib.request.build_opener(proxy_support)
            urllib.request.install_opener(opener)
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html


def get_robots_parser(robots_url):
    " Return the robots parser object using the robots_url "
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp


def get_links(html):
    " Return a list of links (using simple regex matching) from the html content "
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp',
                 proxy=None, delay=3, max_depth=4):
    """ Crawl from the given start URL following links matched by link_regex. In the current
        implementation, we do not actually scrapy any information.

        args:
            start_url (str): web site to start crawl
            link_regex (str): regex to match for links
        kwargs:
            robots_url (str): url of the site's robots.txt (default: start_url + /robots.txt)
            user_agent (str): user agent (default: wswp)
            proxy (str): proxy url, ex 'http://IP' (default: None)
            delay (int): seconds to throttle between requests to one domain (default: 3)
            max_depth (int): maximum crawl depth (to avoid traps) (default: 4)
    """
    crawl_queue = [start_url]
    # keep track which URL's have seen before
    seen = {}
    if not robots_url:
        robots_url = '{}/robots.txt'.format(start_url)
    rp = get_robots_parser(robots_url)
    throttle = Throttle(delay)
    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            depth = seen.get(url, 0)
            if depth == max_depth:
                print('Skipping %s due to depth' % url)
                continue
            throttle.wait(url)
            html = download(url, user_agent=user_agent, proxy=proxy)
            if not html:
                continue
            # TODO: add actual data scraping here
            # filter for links matching our regular expression
            for link in get_links(html):
                if re.match(link_regex, link):
                    abs_link = urljoin(start_url, link)
                    if abs_link not in seen:
                        seen[abs_link] = depth + 1
                        crawl_queue.append(abs_link)
        else:
            print('Blocked by robots.txt:', url)

link_regex = '/(index|view)/'
link_crawler('http://example.python-scraping.com/index',link_regex,max_depth = 1)

1.5.6 使用 request庫

python主流爬蟲一般都會使用requests庫來管理復雜的HTTP請求
足夠簡單且易於使用
安裝 $pip install requests

# 使用requests庫的高級鏈接爬蟲
import re
from urllib import robotparser
from urllib.parse import urljoin

import requests
from chp1.throttle import Throttle


def download(url, num_retries=2, user_agent='wswp', proxies=None):
    """ Download a given URL and return the page content
        args:
            url (str): URL
        kwargs:
            user_agent (str): user agent (default: wswp)
            proxies (dict): proxy dict w/ keys 'http' and 'https', values
                            are strs (i.e. 'http(s)://IP') (default: None)
            num_retries (int): # of retries if a 5xx error is seen (default: 2)
    """
    print('Downloading:', url)
    headers = {'User-Agent': user_agent}
    try:
        resp = requests.get(url, headers=headers, proxies=proxies)
        html = resp.text
        if resp.status_code >= 400:
            print('Download error:', resp.text)
            html = None
            if num_retries and 500 <= resp.status_code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    except requests.exceptions.RequestException as e:
        print('Download error:', e)
        html = None
    return html


def get_robots_parser(robots_url):
    " Return the robots parser object using the robots_url "
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp


def get_links(html):
    """ Return a list of links (using simple regex matching)
        from the html content """
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile("""<a[^>]+href=["'](.*?)["']""", re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp',
                 proxies=None, delay=3, max_depth=4):
    """ Crawl from the given start URL following links matched by link_regex.
    In the current implementation, we do not actually scrape any information.

        args:
            start_url (str): web site to start crawl
            link_regex (str): regex to match for links
        kwargs:
            robots_url (str): url of the site's robots.txt
                              (default: start_url + /robots.txt)
            user_agent (str): user agent (default: wswp)
            proxies (dict): proxy dict w/ keys 'http' and 'https', values
                            are strs (i.e. 'http(s)://IP') (default: None)
            delay (int): seconds to throttle between requests
                         to one domain (default: 3)
            max_depth (int): maximum crawl depth (to avoid traps) (default: 4)
    """
    crawl_queue = [start_url]
    # keep track which URL's have seen before
    seen = {}
    if not robots_url:
        robots_url = '{}/robots.txt'.format(start_url)
    rp = get_robots_parser(robots_url)
    throttle = Throttle(delay)
    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            depth = seen.get(url, 0)
            if depth == max_depth:
                print('Skipping %s due to depth' % url)
                continue
            throttle.wait(url)
            html = download(url, user_agent=user_agent, proxies=proxies)
            if not html:
                continue
            # TODO: add actual data scraping here
            # filter for links matching our regular expression
            for link in get_links(html):
                if re.match(link_regex, link):
                    abs_link = urljoin(start_url, link)
                    if abs_link not in seen:
                        seen[abs_link] = depth + 1
                        crawl_queue.append(abs_link)
        else:
            print('Blocked by robots.txt:', url)

1.6 本章小結

1.介紹網絡爬蟲
2.給出了一個成熟的爬蟲-可復用
3.介紹一些外部工具和模塊的使用方法（了解網站、用戶代理、網站地圖、爬蟲延時及其他高級爬取技術）

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 用Python寫網絡爬蟲第二版 Python核心編程(第二版)PDF selenium webdriver (python) 第二版 python基礎教程（第二版）零基礎學python》（第二版） python音樂播放器第二版 Python核心編程第二版(中文).pdf 目錄整理 Python核心編程第二版(中文).pdf 目錄整理《Python核心編程》第二版第五章答案什么是網絡爬蟲？為什么要選擇Python寫網絡爬蟲？