Python Web Crawler 009 (Programming): Using a regular expression to extract all the URL links in a web page and download the source of those URLs

This article is a repost. 2016-09-19 14:48

System: Windows 10, 64-bit
Python version: Python 2.7.10
IDE used for Python programming: PyCharm 2016 04
urllib version used: urllib2

Note: this post uses Python 2, not Python 3.

I. Preface

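In two earlier sections (a crawler that fetches a single web page, and fixing garbled characters when displaying a fetched page), we finally completed the final download() function.
Then, two sections ago, we crawled every page of the target site by parsing the URLs in its sitemap. In the previous section, we introduced a method for crawling all the pages linked from a web page. In this section, we use a regular expression to extract all the URL links in a web page and download the source of those URLs.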

II. Introduction

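So far, we have implemented two simple crawlers that take advantage of the structure of the target website. Whenever these two techniques are available, they should be used, because they minimize the number of web pages that need to be downloaded. For some websites, however, we need the crawler to behave more like an ordinary user: following links to reach the content of interest.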

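By following every link, we can easily download the pages of an entire website. However, this approach downloads a large number of pages we do not need. For example, if we want to scrape user account detail pages from an online forum, we only need to download the account pages, not the discussion thread pages. The link crawler in this post uses a regular expression to decide which pages to download.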
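
As a rough illustration of this idea, the sketch below filters a list of links with a regular expression. The forum paths and the pattern are hypothetical, made up for this example only.

```python
import re

# Hypothetical forum links, invented for this illustration.
links = [
    '/user/alice',
    '/thread/1024',
    '/user/bob',
    '/thread/2048',
]

# Keep only the account pages. re.match anchors the pattern at the
# start of each link, so '/thread/...' links are rejected.
account_regex = '/user/'
wanted = [link for link in links if re.match(account_regex, link)]
print(wanted)  # ['/user/alice', '/user/bob']
```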

III. Initial code

import re

def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex """
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        # filter for links matching our regular expression
        for link in get_links(html):
            if re.match(link_regex, link):
                crawl_queue.append(link)

def get_links(html):
    """Return a list of links from html """
    # a regular expression to extract all links from the webpage 
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)

IV. Explaining the initial code

1 .

def link_crawler(seed_url, link_regex):

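This is the function we call from outside. It first downloads the source of the seed_url page and extracts every link URL in it, then tests each extracted link against link_regex. Because re.match only matches at the beginning of a string, a link is kept when it starts with text that matches link_regex. Such a link is put into the queue, and the next iteration of while crawl_queue: performs the same steps on it. This repeats until crawl_queue is empty, at which point the function returns.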

2 .

The get_links(html) function returns all the link URLs found in the html page.

3 .

webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)

This builds a matching template and stores it in the webpage_regex object. It matches strings such as <a href="xxx"> and extracts the xxx part, which is the URL.

4 .

return webpage_regex.findall(html)

This applies the webpage_regex template to the html page source, finds every string of the form <a href="xxx">, and extracts the xxx part of each.

For details on regular expressions, see:
http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
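
To make items 3 and 4 concrete, here is a small sketch that runs the same pattern against an HTML fragment. The fragment is hypothetical and was written only for this example. Since the pattern has a single capture group, findall returns just the captured URLs rather than the whole <a ...> tags.

```python
import re

webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)

# Hypothetical HTML fragment, not taken from the target site.
html = '''
<a href="/index/1">Next</a>
<A class="c" HREF='/view/Zambia-251'>Zambia</A>
<link rel="stylesheet" href="/static/site.css">
'''

# Both quote styles and upper-case tags match; the <link> tag does not,
# because the pattern requires the tag to start with "<a".
links = webpage_regex.findall(html)
print(links)  # ['/index/1', '/view/Zambia-251']
```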
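
One detail of link_crawler() worth checking by hand is the re.match call from item 1. The sketch below compares re.match with re.search on a few example links. The login link is a made-up example, and re.search is shown only for contrast; it is not used in this post's crawler.

```python
import re

link_regex = '/(index|view)'

# re.match succeeds only if the pattern matches at the START of the string.
starts_with_index = bool(re.match(link_regex, '/index/1'))
starts_with_view = bool(re.match(link_regex, '/view/Zambia-251'))

# Hypothetical link that merely contains '/index' somewhere in the middle.
login_link = '/user/login?_next=/index/1'
match_login = bool(re.match(link_regex, login_link))
search_login = bool(re.search(link_regex, login_link))

print(starts_with_index)  # True
print(starts_with_view)   # True
print(match_login)        # False: the link does not start with /index or /view
print(search_login)       # True: re.search scans the whole string
```

This is also why the crawler tests the link against link_regex while it is still relative: an absolute URL beginning with http:// would not match this pattern with re.match.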

V. Run

First start an interactive Python session by running the following command in PyCharm's Terminal window or in a Windows command prompt (DOS window):

C:\Python27\python.exe -i 1-4-4-regular_expression.py

Call the link_crawler() function:

>>> link_crawler('http://example.webscraping.com', '/(index|view)')

Output:

Downloading:  http://example.webscraping.com
Downloading:  /index/1
Traceback (most recent call last):
  File "1-4-4-regular_expression.py", line 50, in <module>
    link_crawler('http://example.webscraping.com', '/(index|view)')
  File "1-4-4-regular_expression.py", line 36, in link_crawler
    html = download(url)
  File "1-4-4-regular_expression.py", line 13, in download
    html = urllib2.urlopen(request).read()
  File "C:\Python27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 423, in open
    protocol = req.get_type()
  File "C:\Python27\lib\urllib2.py", line 285, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: /index/1

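The run failed with an error while downloading the URL /index/1. This /index/1 is a relative link on the target site: it is only the path part of the full URL, with no protocol or server part, so our download() function cannot fetch it. Relative links work when browsing in a web browser, because the browser knows which page you are currently viewing. urllib2 has no such context, so the download fails.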
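
A quick way to see what is missing is to split both forms of the URL into their parts. The following is a minimal sketch; has_scheme_and_host is a hypothetical helper name, and the Python 3 import fallback is an addition so the snippet also runs outside Python 2.

```python
try:
    from urlparse import urlsplit        # Python 2, as used in this post
except ImportError:
    from urllib.parse import urlsplit    # Python 3 fallback (an assumption)

def has_scheme_and_host(url):
    """Hypothetical helper: True if the URL names both a protocol and a server."""
    parts = urlsplit(url)
    return bool(parts.scheme and parts.netloc)

absolute = has_scheme_and_host('http://example.webscraping.com/index/1')
relative = has_scheme_and_host('/index/1')
print(absolute)  # True
print(relative)  # False: only the path is present
```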

VII. Improving the code

So, for urllib2 to be able to locate the page, we need to convert relative links into absolute links, which solves the problem.
Python has a module that provides this: urlparse.
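
Before wiring it into the crawler, it may help to see what urljoin does on its own. This is a small sketch; the Python 3 import fallback is an addition so the snippet runs on either version.

```python
try:
    from urlparse import urljoin        # Python 2, as used in this post
except ImportError:
    from urllib.parse import urljoin    # Python 3 fallback (an assumption)

seed_url = 'http://example.webscraping.com'

# A root-relative link is resolved against the base URL.
joined = urljoin(seed_url, '/index/1')
print(joined)  # http://example.webscraping.com/index/1

# A link that is already absolute is returned unchanged.
unchanged = urljoin(seed_url, 'http://example.webscraping.com/view/Zambia-251')
print(unchanged)  # http://example.webscraping.com/view/Zambia-251
```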

The improved link_crawler() function:

import urlparse
def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex """
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            if re.match(link_regex, link):
                link = urlparse.urljoin(seed_url, link)
                crawl_queue.append(link)

VIII. Run:

Run the program:

>>> link_crawler('http://example.webscraping.com', '/(index|view)')

Output:

Downloading:  http://example.webscraping.com
Downloading:  http://example.webscraping.com/index/1
Downloading:  http://example.webscraping.com/index/2
Downloading:  http://example.webscraping.com/index/3
Downloading:  http://example.webscraping.com/index/4
Downloading:  http://example.webscraping.com/index/5
Downloading:  http://example.webscraping.com/index/6
Downloading:  http://example.webscraping.com/index/7
Downloading:  http://example.webscraping.com/index/8
Downloading:  http://example.webscraping.com/index/9
Downloading:  http://example.webscraping.com/index/10
Downloading:  http://example.webscraping.com/index/11
Downloading:  http://example.webscraping.com/index/12
Downloading:  http://example.webscraping.com/index/13
Downloading:  http://example.webscraping.com/index/14
Downloading:  http://example.webscraping.com/index/15
Downloading:  http://example.webscraping.com/index/16
Downloading:  http://example.webscraping.com/index/17
Downloading:  http://example.webscraping.com/index/18
Downloading:  http://example.webscraping.com/index/19
Downloading:  http://example.webscraping.com/index/20
Downloading:  http://example.webscraping.com/index/21
Downloading:  http://example.webscraping.com/index/22
Downloading:  http://example.webscraping.com/index/23
Downloading:  http://example.webscraping.com/index/24
Downloading:  http://example.webscraping.com/index/25
Downloading:  http://example.webscraping.com/index/24
Downloading:  http://example.webscraping.com/index/25
Downloading:  http://example.webscraping.com/index/24
Downloading:  http://example.webscraping.com/index/25
Downloading:  http://example.webscraping.com/index/24

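From the output you can see that, although pages now download without errors, the same pages are downloaded over and over. Why? Because these URLs link to one another. If two pages each contain a link to the other, this program bounces between them in an endless loop.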

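So the program needs a further improvement to avoid crawling the same link twice: we record which links have already been seen and skip any link that is already in that record.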

IX. Further improving the `link_crawler()` function:

def link_crawler(seed_url, link_regex):
    crawl_queue = [seed_url]
    # keep track which URL's have seen before
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            # check if link matches expected regex
            if re.match(link_regex, link):
                # form absolute link
                link = urlparse.urljoin(seed_url, link)
                # check if have already seen this link
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)

X. Run:

>>> link_crawler('http://example.webscraping.com', '/(index|view)')

Output:

Downloading:  http://example.webscraping.com
Downloading:  http://example.webscraping.com/index/1
Downloading:  http://example.webscraping.com/index/2
Downloading:  http://example.webscraping.com/index/3
Downloading:  http://example.webscraping.com/index/4
Downloading:  http://example.webscraping.com/index/5
Downloading:  http://example.webscraping.com/index/6
Downloading:  http://example.webscraping.com/index/7
Downloading:  http://example.webscraping.com/index/8
Downloading:  http://example.webscraping.com/index/9
Downloading:  http://example.webscraping.com/index/10
Downloading:  http://example.webscraping.com/index/11
Downloading:  http://example.webscraping.com/index/12
Downloading:  http://example.webscraping.com/index/13
Downloading:  http://example.webscraping.com/index/14
Downloading:  http://example.webscraping.com/index/15
Downloading:  http://example.webscraping.com/index/16
Downloading:  http://example.webscraping.com/index/17
Downloading:  http://example.webscraping.com/index/18
Downloading:  http://example.webscraping.com/index/19
Downloading:  http://example.webscraping.com/index/20
Downloading:  http://example.webscraping.com/index/21
Downloading:  http://example.webscraping.com/index/22
Downloading:  http://example.webscraping.com/index/23
Downloading:  http://example.webscraping.com/index/24
Downloading:  http://example.webscraping.com/index/25
Downloading:  http://example.webscraping.com/view/Zimbabwe-252
Downloading:  http://example.webscraping.com/view/Zambia-251
Downloading:  http://example.webscraping.com/view/Yemen-250
Downloading:  http://example.webscraping.com/view/Western-Sahara-249

The program now behaves as intended: it crawls all of the locations and stops as expected. We finally have a usable crawler.
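
The termination behavior can also be checked without touching the network. The sketch below repeats the logic of the final link_crawler() against a tiny in-memory site. FAKE_SITE, fake_download, crawl, and the example.test URLs are all hypothetical names invented for this test; the two index pages deliberately link to each other.

```python
import re
try:
    from urlparse import urljoin        # Python 2, as used in this post
except ImportError:
    from urllib.parse import urljoin    # Python 3 fallback (an assumption)

# Hypothetical in-memory site: URL -> HTML source.
FAKE_SITE = {
    'http://example.test': '<a href="/index/1">first</a>',
    'http://example.test/index/1':
        '<a href="/index/2">next</a> <a href="/about">about</a>',
    'http://example.test/index/2': '<a href="/index/1">prev</a>',
}

def fake_download(url):
    """Stand-in for download(): look the page up instead of fetching it."""
    return FAKE_SITE.get(url, '')

def get_links(html):
    """Return a list of links from html."""
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)

def crawl(seed_url, link_regex):
    """Same steps as link_crawler(), but returns the URLs in visit order."""
    crawl_queue = [seed_url]
    seen = set(crawl_queue)
    visited = []
    while crawl_queue:
        url = crawl_queue.pop()
        visited.append(url)
        for link in get_links(fake_download(url)):
            if re.match(link_regex, link):
                link = urljoin(seed_url, link)
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)
    return visited

visited = crawl('http://example.test', '/(index|view)')
print(visited)
```

Each page is visited exactly once even though /index/1 and /index/2 link to each other, and /about is never visited because it does not match the pattern.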

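Summary:
With this, we have covered three ways of crawling the source of all the URLs in a site or in a web page. These are only first-pass programs, and we may still run into the following problems:
1 . If a website declares URLs that must not be crawled, then to respect the site's rules we need to design the crawler to follow its robots.txt file.
2 . Google cannot be reached from mainland China, so if we want to access it through a proxy, we need to configure a proxy for our crawler.
3 . If our crawler hits a website too quickly, the target server may ban it, so we need to throttle the download rate.
4 . Some pages contain something like a calendar in which every date is a URL link, and we have no wish to crawl such meaningless content. Dates go on without end, so for our crawler this is a spider trap, and we need to avoid falling into it.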
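
For the fourth problem, one common approach (not necessarily the one a later post in this series will use) is to limit how many links deep the crawler may go from the seed. The sketch below simulates an endless calendar in memory; fake_download, crawl_with_depth, and the /calendar/N paths are hypothetical, and links are kept relative to keep the example short.

```python
import re

def fake_download(url):
    """Hypothetical endless calendar: every page links to the next day."""
    day = int(url.rsplit('/', 1)[1])
    return '<a href="/calendar/%d">next day</a>' % (day + 1)

def crawl_with_depth(seed_url, max_depth):
    """Crawl from seed_url, following links at most max_depth levels deep."""
    crawl_queue = [seed_url]
    seen = {seed_url: 0}  # URL -> depth at which it was discovered
    visited = []
    while crawl_queue:
        url = crawl_queue.pop()
        html = fake_download(url)
        visited.append(url)
        depth = seen[url]
        if depth == max_depth:
            continue  # page is downloaded, but its links are not followed
        for link in re.findall('<a[^>]+href=["\'](.*?)["\']', html):
            if link not in seen:
                seen[link] = depth + 1
                crawl_queue.append(link)
    return visited

visited = crawl_with_depth('/calendar/1', 3)
print(visited)  # ['/calendar/1', '/calendar/2', '/calendar/3', '/calendar/4']
```

Without the depth check, this loop would never end, because every page introduces a link that has not been seen before.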

We need to solve these four problems before we have the final version of the crawler.
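
As a starting point for the first of these problems, Python's standard library includes a robots.txt parser. The sketch below feeds it a hypothetical robots.txt from a string rather than fetching one from a real site; the user agent names and example.test URLs are also made up for this example.

```python
try:
    import robotparser                  # Python 2, as used in this post
except ImportError:
    from urllib import robotparser      # Python 3 fallback (an assumption)

# Hypothetical robots.txt content, not taken from any real site.
robots_txt = """
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /trap
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

bad_allowed = rp.can_fetch('BadCrawler', 'http://example.test/index/1')
good_allowed = rp.can_fetch('GoodCrawler', 'http://example.test/index/1')
trap_allowed = rp.can_fetch('GoodCrawler', 'http://example.test/trap')
print(bad_allowed)   # False: BadCrawler is banned from the whole site
print(good_allowed)  # True
print(trap_allowed)  # False: /trap is disallowed for every user agent
```

A crawler could call can_fetch() before each download and skip any URL for which it returns False.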
