Implementing a Simple Email Address Crawler (Python)

This article is a repost. 2014-08-11 16:40. Tags: crawler / python

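I often get questions about email crawlers, and there are signs that people who want to scrape contact details from web pages are very interested in the topic. In this article I want to show how to implement a simple email crawler in Python. The crawler is simple, but you can learn a lot from the example (especially if you are about to write a new crawler yourself).

I deliberately kept the code simple so that the main idea comes across as clearly as possible, and so that you can add your own features when you need them. Simple as it is, it is a complete implementation of scraping email addresses from the web. Note that the code in this article is written for Python 3.

Right, let's work through it step by step. I will build it up piece by piece with comments, and post the complete code at the end.

First, import all the necessary libraries. In this example BeautifulSoup and Requests are third-party libraries, while urllib, collections and re are built in.

BeautifulSoup makes it easier to search an HTML document, and Requests makes it easier to perform web requests.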

from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re

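Next I define a collection that holds the URLs to crawl, for example http://www.huazeming.com/. You can of course start from any page that visibly contains email addresses, and from as many pages as you like. This collection could be a plain Python list, but I chose the deque type because it suits our needs better: URLs are appended at one end and removed from the other.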

# a queue of urls to be crawled
new_urls = deque(['http://www.themoscowtimes.com/contact_us/'])
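To make the choice of deque concrete, here is a minimal sketch of the first-in, first-out behaviour the crawler relies on. The example.com URLs are placeholders for illustration, not addresses from the article.

```python
from collections import deque

# Placeholder URLs, used only for illustration
queue = deque(["http://example.com/"])

# New links are appended on the right...
queue.append("http://example.com/about")
queue.append("http://example.com/contact")

# ...and the oldest one is taken from the left, so pages are
# visited in the order in which they were discovered.
first = queue.popleft()
print(first)       # http://example.com/
print(len(queue))  # 2
```

A plain list would also work, but list.pop(0) has to shift every remaining element, while deque.popleft() takes constant time.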

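Next, we need to store the URLs we have already processed so that we do not process them twice. I chose the set type because a set guarantees that its elements are unique.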

# a set of urls that we have already crawled
processed_urls = set()

Define a set of emails to store the addresses we collect:

# a set of crawled emails
emails = set()

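Let's start crawling! We loop, taking URLs off the queue and processing them one at a time until the queue is empty. As soon as we take a URL off the queue we add it to the set of processed URLs, so that we do not forget later.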

# process urls one by one until we exhaust the queue
while len(new_urls):
    # move next url from the queue to the set of processed urls
    url = new_urls.popleft()
    processed_urls.add(url)
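The queue-plus-set pattern can be tried out without touching the network. The sketch below runs the same loop over a small in-memory link graph; fake_site and crawl_order are hypothetical names made up for this illustration.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> links found on it
fake_site = {
    "/": ["/about", "/contact"],
    "/about": ["/", "/contact"],
    "/contact": ["/"],
}

def crawl_order(start):
    """Return pages in the order the queue-and-set loop visits them."""
    new_urls = deque([start])
    processed_urls = set()
    order = []
    while new_urls:
        # move the next url from the queue to the set of processed urls
        url = new_urls.popleft()
        processed_urls.add(url)
        order.append(url)
        for link in fake_site.get(url, []):
            # skip links that are already queued or already visited
            if link not in new_urls and link not in processed_urls:
                new_urls.append(link)
    return order

print(crawl_order("/"))  # ['/', '/about', '/contact']
```

Even though the pages link back to each other, every page is visited exactly once and the loop terminates.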

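Then we need to extract the base URL from the current address, so that when we find a relative link in the document we can convert it to an absolute one.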

# extract base url and path to resolve relative links
parts = urlsplit(url)
base_url = "{0.scheme}://{0.netloc}".format(parts)
path = url[:url.rfind('/')+1] if '/' in parts.path else url
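To see what these two values look like, the same expressions can be wrapped in a small helper and run on a sample URL. The function name base_and_path is an assumption made for this sketch; it does not appear in the article.

```python
from urllib.parse import urlsplit

def base_and_path(url):
    """Return (base_url, path) computed the same way as in the crawler."""
    parts = urlsplit(url)
    # scheme and host only, e.g. http://host
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    # everything up to and including the last slash of the url
    path = url[:url.rfind('/') + 1] if '/' in parts.path else url
    return base_url, path

print(base_and_path("http://www.themoscowtimes.com/contact_us/index.php"))
# ('http://www.themoscowtimes.com', 'http://www.themoscowtimes.com/contact_us/')
```

For a URL that has no path at all, such as http://example.com, path is the URL itself with no trailing slash, so joining it with a relative link would need an extra '/'.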

Next we fetch the page content from the web. If we run into an error, we skip the page and move on to the next one.

# get url's content
print("Processing %s" % url)
try:
    response = requests.get(url, timeout=10)
except requests.exceptions.RequestException:
    # ignore pages that cannot be fetched (bad url, connection error, timeout, ...)
    continue

Once we have the page content, we find all the email addresses in it and add them to the result set. We use a regular expression to extract the addresses:

# extract all email addresses and add them into the resulting set
new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
emails.update(new_emails)
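The extraction step can be tested on its own with a short piece of HTML. The sample text and its addresses are invented for this sketch; the pattern is the one used above.

```python
import re

# Same pattern as in the crawler, compiled once; re.I makes it case-insensitive
EMAIL_RE = re.compile(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", re.I)

# Invented sample page content
sample = """
<p>Editor: <a href="mailto:Editor@example.com">Editor@example.com</a></p>
<p>Sales: sales.team@example.org, or call us.</p>
"""

# The address that appears twice is stored only once in the set
found = set(EMAIL_RE.findall(sample))
print(sorted(found))  # ['Editor@example.com', 'sales.team@example.org']
```

The pattern is deliberately loose: it also matches strings that only look like addresses, such as the file name logo@2x.png, so the results may need filtering.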

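After extracting the email addresses from the current page, we find the links to other pages in it and add them to the queue of URLs waiting to be processed. Here we use the BeautifulSoup library to parse the page's HTML.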

     
# create a BeautifulSoup object for the html document
soup = BeautifulSoup(response.text, "html.parser")

The library's find_all method extracts elements by HTML tag name.

# find and process all the anchors in the document
for anchor in soup.find_all("a"):

Some a tags on a page may not contain a URL, though, so we need to allow for that.

# extract link url from the anchor
link = anchor.attrs["href"] if "href" in anchor.attrs else ''

If the link starts with a slash, we treat it as a relative link and prepend the base URL:

# add base url to relative links
if link.startswith('/'):
    link = base_url + link

At this point we have a valid URL (one that starts with http). If it is not already in the queue and has not been processed before, we add it to the queue:

# add the new url to the queue if it's of HTTP protocol, not enqueued and not processed yet
if link.startswith('http') and not link in new_urls and not link in processed_urls:
    new_urls.append(link)
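As an alternative to the manual string handling, the standard library's urllib.parse.urljoin can resolve any kind of relative link against the current page. This is not what the article's code does; it is a sketch of one possible variation, and resolve_link is a hypothetical helper name.

```python
from urllib.parse import urljoin, urlsplit

def resolve_link(page_url, href):
    """Return an absolute http(s) URL for href, or None if it is not crawlable."""
    absolute = urljoin(page_url, href)
    # keep only web pages; mailto:, javascript: and similar links are dropped
    if urlsplit(absolute).scheme in ("http", "https"):
        return absolute
    return None

page = "http://example.com/contact_us/index.php"  # placeholder page URL
print(resolve_link(page, "/about"))                # http://example.com/about
print(resolve_link(page, "staff.html"))            # http://example.com/contact_us/staff.html
print(resolve_link(page, "mailto:a@example.com"))  # None
```

The result can then go through the same "not in the queue and not processed yet" check before being appended to new_urls.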

OK, that's it. Here is the complete code:

from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re

# a queue of urls to be crawled
new_urls = deque(['http://www.themoscowtimes.com/contact_us/index.php'])

# a set of urls that we have already crawled
processed_urls = set()

# a set of crawled emails
emails = set()

# process urls one by one until we exhaust the queue
while len(new_urls):

    # move next url from the queue to the set of processed urls
    url = new_urls.popleft()
    processed_urls.add(url)

    # extract base url to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/')+1] if '/' in parts.path else url

    # get url's content
    print("Processing %s" % url)
    try:
        response = requests.get(url, timeout=10)
    except requests.exceptions.RequestException:
        # ignore pages that cannot be fetched (bad url, connection error, timeout, ...)
        continue

    # extract all email addresses and add them into the resulting set
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)

    # create a BeautifulSoup object for the html document
    soup = BeautifulSoup(response.text, "html.parser")

    # find and process all the anchors in the document
    for anchor in soup.find_all("a"):
        # extract link url from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''
        # resolve relative links
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        # add the new url to the queue if it was not enqueued nor processed yet
        if not link in new_urls and not link in processed_urls:
            new_urls.append(link)

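This crawler is fairly simple and leaves out some features (such as saving the email addresses to a file), but it shows the basic principles of writing an email crawler. You can try to improve the program yourself.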
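As one possible improvement, the collected addresses could be written to a text file. The sketch below is only an illustration: save_emails, the sample addresses and the output location are assumptions, not part of the original program.

```python
import tempfile
from pathlib import Path

def save_emails(emails, filename):
    """Write the addresses to a text file, one per line, sorted; return the count."""
    lines = sorted(emails)
    Path(filename).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)

# Hypothetical sample data; in the crawler this would be the `emails` set
sample = {"b@example.com", "a@example.com"}

# Written to the system temp directory here so the example leaves no stray files
out = Path(tempfile.gettempdir()) / "emails.txt"
save_emails(sample, out)
print(out.read_text(encoding="utf-8"))
```

Because the crawl loop can run for a very long time, it may be more practical to call such a function periodically inside the loop rather than only after it ends.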

And of course, if you have any questions or suggestions, feel free to point them out!

English original: A Simple Email Crawler in Python
