A small Python crawler: automatically downloading documents from 三億文庫

Reposted article. Posted 2014-07-10 14:02, filed under Python.

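I'm new to Python and wrote a script that fetches a web page and then automatically downloads the documents listed on it. I'd like to share it here.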

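First, open the URL of one of the download categories on 三億文庫, for example Professional Materials (IT / Computers / Internet): http://3y.uu456.com/bl-197?od=1&pn=0. The number after pn= in the link selects the page, so later we will use iurl = 'http://3y.uu456.com/bl-197?od=1&pn=' followed by that number to fetch each page.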

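Most sites number their pages 1, 2, 3, ..., but 三億文庫 uses the offsets 0, 25, 50, ... instead, so the page number has to be converted when building the URL: page i maps to pn = (i-1)*25.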
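
The page-to-offset conversion can be sketched on its own as follows. This is a minimal illustration, not part of the original script: the names PAGE_SIZE, page_offset and page_url are hypothetical, and the step of 25 is taken from the 0, 25, 50 pattern observed in the URLs.

```python
# Step between consecutive pn values observed in the list URLs (0, 25, 50, ...).
PAGE_SIZE = 25
LIST_URL = 'http://3y.uu456.com/bl-197?od=1&pn='


def page_offset(page):
    """Map a 1-based page number to the pn offset used in the URL."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    return (page - 1) * PAGE_SIZE


def page_url(page, base=LIST_URL):
    """Build the list-page URL for a 1-based page number."""
    return base + str(page_offset(page))


for page in (1, 2, 3):
    print('%d -> %s' % (page, page_url(page)))
```

Keeping the conversion in one small function means the rest of the crawler can keep thinking in ordinary page numbers.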

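Right-click the page and view its source: each document is marked up with an <a> tag, where href holds the link to the document and title holds the document name. A regular expression is all we need to pull these out.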

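However, the address we extract, e.g. "bp-602d123348d7c1c708a14sqb-1.html", is not the real download address. Clicking through to the document's download page shows that the real download path is "dlDoc-602d123348d7c1c708a14sqb-1-toword.doc". So we only need to extract the document ID, 602d123348d7c1c708a14sqb-1, and splice it into that pattern. The corresponding anchors on the list page and on the download page look like this: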

<p>
　　<a href="bp-602d123348d7c1c708a14sqb-1.html" title="視頻會議系統" target="_blank">視頻會議系統</a>
</p>

<a rel="nofollow" target="_blank" href="dlDoc-602d123348d7c1c708a14sqb-1-toword.doc">視頻會議系統-第1頁.doc</a>
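
The extraction and URL construction described above can be tried offline against the list-page snippet. This is a rough sketch assuming Python 3 (so the non-ASCII title is an ordinary str) rather than the Python 2.7 used by the script; LINK_RE, DOWNLOAD_BASE and extract_documents are hypothetical names introduced only for illustration.

```python
import re

# Matches the list-page anchor and captures the document ID and the title.
LINK_RE = re.compile(
    r'<a href="bp-(.*?)\.html" title="(.*?)" target="_blank">', re.S)

# Prefix of the real download path seen on the download page.
DOWNLOAD_BASE = 'http://3y.uu456.com/dlDoc-'


def extract_documents(html):
    """Return (doc_id, title, download_url) for every document link in html."""
    return [(doc_id, title, DOWNLOAD_BASE + doc_id + '-toword.doc')
            for doc_id, title in LINK_RE.findall(html)]


sample = '''<p>
    <a href="bp-602d123348d7c1c708a14sqb-1.html" title="視頻會議系統" target="_blank">視頻會議系統</a>
</p>'''

for doc_id, title, url in extract_documents(sample):
    print(doc_id, url)
```

The dot before html is escaped so that it only matches a literal ".", and the lazy (.*?) groups stop at the first closing quote instead of running on to a later link.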

The code is as follows:

# -*- coding: utf-8 -*-
#-----------------------------------------------------
#   Purpose:  save each visited page as an HTML file and download the documents listed on it
#   Author:   chenbjin
#   Date:     2014-07-10
#   Language: Python 2.7.6
#   Platform: Linux (Ubuntu)
#-----------------------------------------------------

import string
import urllib
import urllib2
import re
import os

#Fetch pages begin_page..end_page, save them in the threeuPage folder, and download the documents they list into the threeuFile folder.
def threeu_page(burl,url,begin_page,end_page) :
    
    #The directory to save web page
    sPagePath = './threeuPage'
    if not os.path.exists(sPagePath) : 
        os.mkdir(sPagePath)

    #The directory to save downloaded files
    sFilePath = './threeuFile'
    if not os.path.exists(sFilePath) : 
        os.mkdir(sFilePath)

    for i in range(begin_page,end_page+1) :
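        #The site pages by offset: page 1 -> pn=0, page 2 -> pn=25, ...
        pn = (i-1)*25
        #Zero-pad the page number to five digits for the file name, e.g. 00001.html
        sName = sPagePath + '/'+ str(i).zfill(5) + '.html'
        print 'Spidering page ' + str(i) + ', saving to ' + sName + '...'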
        f = open(sName,'w+')  
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = { 'User-Agent' : user_agent }
        request = urllib2.Request(url+str(pn),headers = headers)

        try: 
            con = urllib2.urlopen(request, timeout=10).read()
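            #Use a regular expression to pull out each document's ID and title
            myItems = re.findall(r'<a href="bp-(.*?)\.html" title="(.*?)" target="_blank">(.*?)</a>',con,re.S)
            #print "Total : ",len(myItems)
            for item in myItems :
                #item[0] is the document ID, item[1] the title (decoded from GBK)
                print 'Downloading ' +item[0] + "  "+ item[1].decode('gbk') + '...'
                #Build the real download URL from the document ID and fetch the file
                durl = burl+item[0]+'-toword.doc'
                urllib.urlretrieve(durl,sFilePath+'/'+item[1].decode('gbk')+'.doc')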
        except urllib2.URLError,e :
            print e
        else:
            f.write(con)
        f.close()

#burl is the download URL prefix; iurl is the list page for "Professional Materials > IT/Computers > Internet" on 三億文庫
burl = 'http://3y.uu456.com/dlDoc-'
iurl = 'http://3y.uu456.com/bl-197?od=1&pn='
ibegin = 1
iend = 1
threeu_page(burl,iurl,ibegin,iend)
#end
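
The script above targets Python 2.7, where urllib2 is a separate module. On Python 3 the same request can be built with urllib.request, which absorbed urllib2. The sketch below only constructs the request and a file path, without touching the network. build_request and safe_filename are hypothetical helpers added for illustration; in particular, replacing characters such as "/" in titles is an assumption about what file names should look like, not something the original script does.

```python
import os
import re
from urllib.request import Request

# Same User-Agent string the script sends.
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'


def build_request(url):
    """Build (but do not send) a request carrying the browser-like User-Agent."""
    return Request(url, headers={'User-Agent': USER_AGENT})


def safe_filename(title, directory='./threeuFile', ext='.doc'):
    """Turn a document title into a path inside directory.

    Path separators and other characters that are awkward in file names
    are replaced with underscores (an illustrative choice).
    """
    cleaned = re.sub(r'[\\/:*?"<>|\r\n]+', '_', title).strip() or 'untitled'
    return os.path.join(directory, cleaned + ext)


req = build_request('http://3y.uu456.com/bl-197?od=1&pn=0')
print(req.full_url)
print(req.get_header('User-agent'))
print(safe_filename('IT/Internet'))
```

Passing the request to urllib.request.urlopen(req, timeout=10) would then play the role of the urllib2.urlopen call in the script.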

References:

1. Python crawler beginner tutorial: http://blog.csdn.net/column/details/why-bug.html
