A simple Python crawler that scrapes email addresses

Reposted article, originally published 2013-04-23 22:07. Tag: python

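Recently my teacher set an exercise: write a crawler that fetches roughly 100 web pages and extracts the email addresses in them.

So I spent a few days getting familiar with Python and ended up with the very simple crawler below. It has all sorts of problems...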

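First, a note on installing the Python libraries, because I wasted a fair amount of time on this.

Start with pip and distribute, the two tools used to install and manage Python libraries. For details see http://jiayanjujyj.iteye.com/blog/1409819

On Windows, run python distribute_setup.py from the command line (in the directory that contains distribute_setup.py). After that, the easy_install command can be used to install other modules.

pyquery depends on lxml. That module needs a C compiler on the local machine, and if VS or MinGW-related tools are installed it is easy to hit all kinds of installation failures. In theory, with the right configuration, lxml gets built with the local compiler. In practice I simply could not get it to install, so I ended up at http://www.lfd.uci.edu/~gohlke/pythonlibs/.

On that site, find the build that matches your Python version, install it, and you are done.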

 1 import urllib2
 2 import re
 3 from pyquery import PyQuery as pq
 4 from lxml import etree
 5 import sys
 6 import copy
 7 ##reload(sys)
 8 ##sys.setdefaultencoding("utf8")     
 9 
10 
11 mailpattern = re.compile('[^\._:>\\-][\w\.-]+@(?:[A-Za-z0-9]+\.)+[A-Za-z]+')
12 #mailpattern = re.compile('[A-Za-z0-9]+@(?:[A-Za-z0-9]+\.)+[A-Za-z]+')
13 
14 url = "http://www.xxx.cn"
15 firstUrls = []# to store the urls
16 secondUrls = []
17 count = 1 # to count levels
18 furls = open("E:/py/crawler/urlsRecord.txt","a")
19 fmail = open("E:/py/crawler/mailresult.txt","a")
20 
21 
22 
23 def geturls(data):   #the function to get the urls in the html
24     urls = []
25     d = pq(data)
26     label_a = d.find('a')  # use pyquery to find the <a> tags
27     if label_a:
28         label_a_href = d('a').map(lambda i, e:pq(e)('a').attr('href'))
29         for u in label_a_href:
30             if u[0:10]!="javascript" :  
31                 if u[0:4] == "http":
32                     urls.append(u)
33                 else:
34                     urls.append(url + u)              
35         for u in urls:
36             furls.write(u)
37             furls.write('\n')
38     return urls
39         
40                 
41 def savemails(data): # the function to save the emails
42     mailResult = mailpattern.findall(data)
43     if mailResult:
44         for u in mailResult:
45             print u
46             fmail.write(u)
47             fmail.write('\n')
48 
49 def gethtml(url):
50     fp = urllib2.urlopen(url)
51     mybytes =fp.read()
52     myWebStr  = mybytes.decode("gbk")   # the bytes read here must be decoded into text
53     fp.close()
54     return myWebStr
55 
56    
57 furls.write(url+'\n')
58 
59 myWebStr = gethtml(url)
60 if myWebStr:
61     savemails(myWebStr)
62     firstUrls = geturls(myWebStr)
63 if firstUrls:
64     for i in range(0,len(firstUrls)):
65         html = gethtml(firstUrls[i])
66         if html:
67             savemails(html)
68 ##        tempurls = geturls(html)        # meant to crawl one more level here, but it was painfully slow, so I stopped
69 ##        if tempurls:
70 ##            nexturls = nexturls + tempurls
71         
72 ##    if nexturls:
73 ##        for i in range(0,len(nexturls)):
74 ##            nexthtml = gethtml(nexturls[i])
75 ##            if nexthtml:
76 ##                savemails(nexthtml)
77          
80 fmail.close()
81 furls.close()
82
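
One detail of geturls in the listing above: relative links are made absolute with url + u, which produces a malformed address whenever the href has no leading slash (for example "http://www.xxx.cn" + "news/1.html"). A minimal sketch of the same step using urljoin, written for Python 3 (in Python 2 the function lives in the urlparse module). The helper name absolutize and the sample hrefs are hypothetical and not part of the original program.

```python
from urllib.parse import urljoin

def absolutize(base_url, hrefs):
    """Hypothetical helper: turn href values into absolute URLs."""
    urls = []
    for href in hrefs:
        # Skip javascript: pseudo-links, as the original listing does.
        if href.lower().startswith("javascript"):
            continue
        # urljoin leaves absolute URLs alone and resolves relative ones.
        urls.append(urljoin(base_url, href))
    return urls

base = "http://www.xxx.cn"
sample_hrefs = ["/about.html", "news/1.html",
                "http://example.com/x", "javascript:void(0)"]
result = absolutize(base, sample_hrefs)
```
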

The problems with the program as it stands:

1. If it is run as is, an encoding error occurs:

Traceback (most recent call last):
  File "E:\py\crawler.py", line 67, in <module>
    savemails(html)
  File "E:\py\crawler.py", line 46, in savemails
    fmail.write(u)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u81f3' in position 0: ordinal not in range(128)

I googled it, and the cause is an encoding problem.

reload(sys)
sys.setdefaultencoding("utf8")

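This fixes it. These are the two commented-out lines, 7 and 8, in the listing at the top. But then a new problem appeared: the error above no longer occurs, yet the print statement on line 45 stops producing output, and so does every other print statement, wherever it is placed. Why is that...

At line 46, I tried printing the value that caused the problem and found that the linked page contained: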
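
As an alternative to changing the process-wide default encoding, the output file can be opened with an explicit encoding, so that every write encodes the text itself. Below is a minimal sketch under stated assumptions: it uses Python 3 syntax (io.open takes the same arguments on Python 2.6+), and the helper name save_lines and the temporary path are hypothetical. On the vanishing print output, one commonly reported explanation is that reload(sys) rebinds sys.stdout to the interpreter's original stream, which matters in shells such as IDLE that install their own sys.stdout. The post does not say how the script was launched, so treat that as an assumption.

```python
import io
import os
import tempfile

def save_lines(path, lines):
    """Hypothetical helper: append text lines to a UTF-8 encoded file."""
    # encoding="utf-8" makes the file object encode text on write,
    # so no implicit ASCII conversion is attempted.
    with io.open(path, "a", encoding="utf-8") as f:
        for line in lines:
            f.write(line + u"\n")

# Temporary location used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "mailresult.txt")
save_lines(path, [u"\u81f3huaweibiancheng@163.com"])

# Read the file back with the same encoding to check the round trip.
with io.open(path, "r", encoding="utf-8") as f:
    content = f.read()
```

With this approach the sys.setdefaultencoding call is not needed, so print is left untouched.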

至huaweibiancheng@163.com

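fmail.write(u) then fails whenever it hits a string like this. I looked it up, and the Unicode code point of '至' is indeed 81f3 (checked at http://ipseeker.cn/tools/pywb.php).

At this point I wondered: can write() not handle Chinese at all? I tested with the following code: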
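
The leading '至' ends up in the match because the first character class of mailpattern is negated: it accepts any character other than . _ : > and -, which includes Chinese characters. The sketch below, in Python 3 syntax, compares the pattern from the listing with a tighter variant. The name strict_pattern and its exact character classes are an illustrative assumption, not a complete email grammar.

```python
import re

# Same regex as mailpattern in the listing, written as a raw string.
original_pattern = re.compile(
    r'[^\._:>\-][\w\.-]+@(?:[A-Za-z0-9]+\.)+[A-Za-z]+')

# Hypothetical tighter variant: the local part must start with an
# ASCII letter or digit, so a preceding CJK character is not captured.
strict_pattern = re.compile(
    r'[A-Za-z0-9][A-Za-z0-9._-]*@(?:[A-Za-z0-9-]+\.)+[A-Za-z]+')

# u"\u81f3" is the character '至' from the post.
text = u"\u81f3huaweibiancheng@163.com"

loose_matches = original_pattern.findall(text)
strict_matches = strict_pattern.findall(text)
```

Here loose_matches keeps the leading '至', while strict_matches contains only the ASCII address, which can be written to a file without touching any encoding settings.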

poem = '至huaweibiancheng@163.com'

f = open("E:/py/poem.txt","w")
f.write(poem)
f.close()

f = open("E:/py/poem.txt",'r')

while True:
    line = f.readline()
    if len(line) == 0:
        break
    print (line)

f.close()

Result:
至huaweibiancheng@163.com

Next I changed the code to poem = u'\u81f3', and exactly the same error appeared:

Traceback (most recent call last):
  File "E:\py\test.py", line 212, in <module>
    f.write(poem)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u81f3' in position 0: ordinal not in range(128)

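In other words, in the fetched page the text exists as the unicode object u'\u81f3' followed by huaweibiancheng@163.com, and writing that to the file fails.

Could someone knowledgeable please explain...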
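
The two tests differ in the type of the string rather than in write() itself. In Python 2, the plain literal '至...' is already a byte string in the source file's encoding and is written as is, whereas u'\u81f3' is a unicode object that has to be encoded first, and the implicit encoding is ASCII. A minimal sketch of that encoding step, in Python 3 syntax (the variable names are illustrative):

```python
text = u"\u81f3huaweibiancheng@163.com"

# Encoding to ASCII fails, which is the error shown in the traceback.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly to UTF-8 succeeds and yields bytes that can be
# written to a file opened in binary mode.
utf8_bytes = text.encode("utf-8")
```
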
