Writing a Crawler Script in Python and Scheduling It with APScheduler

Reposted article. 2013-04-11 23:27. Tags: python / MongoDB

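A while ago I taught myself Python, and as a beginner I wanted to write something of my own for practice. I had learned that Python makes writing crawler scripts very convenient, and I had also recently been studying MongoDB, so everything I needed was in place.

The requirement is as follows. The crawler fetches a page on JD's e-book site, where some free e-books are added every day. The crawler emails me the titles of each day's newly added free books as soon as they appear, so that I know to go and download them.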

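I. Approach:

　　1. The crawler script fetches the day's free book information.

　　2. The fetched book information is compared with what is already in the database. If a book already exists, nothing is done; if it does not exist, it is inserted, and its details are stored in MongoDB.

　　3. Whenever an insert is performed, the newly added data is sent out by email.

　　4. The APScheduler scheduling framework is used to run the Python script on a schedule.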
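
Steps 2 and 3 can be sketched as follows. This is only a minimal illustration in Python 3, not the script from this article: a plain dict stands in for the MongoDB collection, and the function name `record_new_books`, the variable `store`, and the example.com URLs are all hypothetical.

```python
def record_new_books(books, store):
    """Insert unseen books into `store` and return the names that were added.

    `books` is a list of dicts with "href", "bookname" and "date" keys;
    `store` is a dict keyed by href that stands in for the MongoDB collection.
    """
    added = []
    for book in books:
        if book["href"] in store:
            continue  # already recorded: do nothing
        store[book["href"]] = book
        added.append(book["bookname"])
    return added


# Hypothetical sample data: one book already stored, two fetched today.
store = {
    "http://example.com/1": {
        "href": "http://example.com/1", "bookname": "Book A", "date": "2013-04-10",
    },
}
today = [
    {"href": "http://example.com/1", "bookname": "Book A", "date": "2013-04-11"},
    {"href": "http://example.com/2", "bookname": "Book B", "date": "2013-04-11"},
]

new_names = record_new_books(today, store)
if new_names:
    # Only notify when something was actually inserted.
    print("Today's free books: " + ", ".join(new_names))
```

Keying on the link rather than the title mirrors the `href` lookup used later in the article, so running the job a second time inserts nothing and sends no mail.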

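II. Key points of the script:

1. A simple Python crawler

The modules used here are urllib2, for fetching pages, and sgmllib, for parsing them. They are imported as follows: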

import urllib2
from sgmllib import SGMLParser

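The urlopen() method fetches the page's HTML source, which is stored in content. The main job of the ListHref class is to parse the HTML, that is, to process a semi-structured document of HTML type.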

content = urllib2.urlopen('http://sale.jd.com/act/yufbrhZtjx6JTV.html').read()
listhref = ListHref()
listhref.feed(content)

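The code of the ListHref class can be found in the full listing at the end; only a few key points are covered here.

The ListHref class inherits from SGMLParser and overrides some of its internal methods. SGMLParser breaks HTML into useful pieces, such as start tags and end tags. Once it has broken some data out into a useful piece, it calls one of its own methods according to what it found. To use the parser, you subclass SGMLParser and override these methods of the parent class.

SGMLParser parses HTML into different kinds of data and tags, and calls a separate method for each kind:
Start tag (Start_tag)
An HTML tag that starts a block, such as <html>, <head>, <body> or <pre>, or a standalone tag such as <br> or <img>. In this example, when it finds a start tag <a>, SGMLParser looks for a method named start_a or do_a. If one is found, SGMLParser calls it with the tag's list of attributes; otherwise it calls unknown_starttag with the tag name and the list of attributes.
End tag (End_tag)
An HTML tag that ends a block, such as </html>, </head>, </body> or </pre>. In this example, when it finds the end tag </a>, SGMLParser looks for a method named end_a. If one is found, SGMLParser calls it; otherwise it calls unknown_endtag with the tag name.
Text data (Text data)
A block of text. When the input does not fall into any of the other categories, handle_data is called with the text.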

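The following categories are not used in this article:
Character reference (Character reference)
An escaped character referenced by its decimal or equivalent hexadecimal code. When one is found, SGMLParser calls handle_charref with the character.
Entity reference (Entity reference)
An HTML entity, such as &copy;. When one is found, SGMLParser calls handle_entityref with the name of the entity.
Comment (Comment)
An HTML comment, enclosed between <!-- and -->. When one is found, SGMLParser calls handle_comment with the body of the comment.
Processing instruction (Processing instruction)
An HTML processing instruction, enclosed between <? ... >. When one is found, SGMLParser calls handle_pi with the body of the instruction.
Declaration (Declaration)
An HTML declaration, such as a DOCTYPE, enclosed between <! ... >. When one is found, SGMLParser calls handle_decl with the body of the declaration.

For details, see the API documentation: http://docs.python.org/2/library/sgmllib.html?highlight=sgmlparser#sgmllib.SGMLParser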
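
The start-tag, end-tag and text-data callbacks described above can be tried without a network connection. The sketch below is an assumption-laden Python 3 illustration rather than the article's code: sgmllib was removed in Python 3, so it uses the standard library's `html.parser.HTMLParser` instead, which passes every tag to `handle_starttag` and `handle_endtag` rather than to per-tag methods such as `start_a`. The class name `LinkTextFinder`, the sample page and the link text "Free today" are hypothetical.

```python
from html.parser import HTMLParser


class LinkTextFinder(HTMLParser):
    """Collect the href of every <a> element whose text equals `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.in_a = False
        self.current_href = ""
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_a = True
            # attrs is a list of (name, value) pairs; href may be absent.
            self.current_href = dict(attrs).get("href") or ""

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_a = False

    def handle_data(self, data):
        # Record the link only when inside <a>, the text matches,
        # and the tag actually carried an href.
        if self.in_a and self.current_href and data.strip() == self.wanted:
            self.hrefs.append(self.current_href)


# Hypothetical sample page; the third link has no href and is skipped.
page = (
    "<html><body>"
    '<a href="/book/1.html">Free today</a>'
    '<a href="/book/2.html">Buy now</a>'
    '<a name="top">Free today</a>'
    "</body></html>"
)

finder = LinkTextFinder("Free today")
finder.feed(page)
finder.close()
print(finder.hrefs)
```

Tracking an "inside the tag" flag between the start and end callbacks is the same idea as the is_a attribute in the ListHref class of the full listing.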

2. Working with MongoDB from Python

First install PyMongo, the Python driver for MongoDB. Download: https://pypi.python.org/pypi/pymongo/2.5

Import the module:

import pymongo

Connect to the database server at 127.0.0.1 and switch to the database mydatabase:

mongoCon=pymongo.Connection(host="127.0.0.1",port=27017)
db= mongoCon.mydatabase

Look up a book's record in the database; book is the collection being queried:

bookInfo = db.book.find_one({"href":bookItem.href})

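Insert the book information into the database. Python supports Chinese text, but encoding and decoding it is still fairly involved; for details on encoding and decoding, see http://blog.csdn.net/mayflowers/article/details/1568852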

b = {
    "bookname": bookItem.bookname.decode('gbk').encode('utf8'),
    "href": bookItem.href,
    "date": bookItem.date
}
db.book.insert(b, safe=True)
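
The `decode('gbk').encode('utf8')` step converts bytes from one encoding to another by way of Unicode text. The following is a minimal sketch in Python 3, where the difference between bytes and text is explicit; the sample title is hypothetical and the snippet does not touch the database.

```python
# Bytes as they might arrive from a GBK-encoded page (hypothetical sample).
gbk_bytes = "免费电子书".encode("gbk")

# Step 1: decode the GBK bytes into Unicode text.
text = gbk_bytes.decode("gbk")

# Step 2: encode the text as UTF-8 bytes.
utf8_bytes = text.encode("utf-8")

# The same five characters take a different number of bytes in each encoding.
print(len(text), len(gbk_bytes), len(utf8_bytes))
```

Decoding with the wrong codec either raises UnicodeDecodeError or silently produces garbled text, so it helps to know which encoding the page actually uses before converting.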

For more on PyMongo, see the API documentation: http://api.mongodb.org/python/2.0.1/

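3. Sending email from Python

Import the mail modules:

# Import smtplib for the actual sending function
import smtplib
from email.mime.text import MIMEText

In the function that follows, "localhost" is the address of the mail server. The message and its headers are set up like this:

msg = MIMEText(context)  # body of the plain-text message
msg['Subject'] = sub  # subject
msg['From'] = "my@vmail.cn"  # sender
msg['To'] = COMMASPACE.join(mailto_list)  # list of recipients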

def send_mail(mailto_list, sub, context):
    COMMASPACE = ', '
    mail_host = "localhost"  # address of the SMTP server
    me = "my@vmail.cn"  # sender address
    # Create a text/plain message
    msg = MIMEText(context)
    msg['Subject'] = sub
    msg['From'] = me
    msg['To'] = COMMASPACE.join(mailto_list)

    # Connect to the SMTP server, send the message, then end the session
    send_smtp = smtplib.SMTP(mail_host)
    send_smtp.sendmail(me, mailto_list, msg.as_string())
    send_smtp.quit()

Reference documentation: http://docs.python.org/2/library/email.html?highlight=smtplib#
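
The message-building half of send_mail can be checked without a mail server. The sketch below is a hedged Python 3 illustration: the helper name `build_message` and the second recipient address are hypothetical, and the charset argument is an addition so that non-ASCII book titles would survive.

```python
from email.mime.text import MIMEText


def build_message(sender, recipients, subject, body):
    """Return a text/plain message with the usual headers filled in."""
    msg = MIMEText(body, "plain", "utf-8")
    msg["Subject"] = subject
    msg["From"] = sender
    # The To header is a single string; the list itself is what
    # smtplib's sendmail() takes as its recipient argument.
    msg["To"] = ", ".join(recipients)
    return msg


msg = build_message(
    "my@vmail.cn",
    ["mailto@mail.cn", "second@example.com"],  # second address is made up
    "Free Book is Update",
    "Today free books are: Book A, Book B",
)
print(msg["To"])
print(msg.get_content_type())
```

Keeping message construction separate from the SMTP call makes it possible to inspect the headers before anything is sent.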

4. The Python scheduling framework APScheduler

Download: https://pypi.python.org/pypi/APScheduler/2.1.0

Official documentation: http://pythonhosted.org/APScheduler/#faq

API: http://pythonhosted.org/APScheduler/genindex.html

Installation: download and unpack the archive, then run python setup.py install. Import the module:

from apscheduler.scheduler import Scheduler

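APScheduler is fairly simple to configure. This example uses only the add_interval_job method, which runs the job script each time a given interval has elapsed; here the interval is 30 minutes. For an example article, see http://flykite.blog.51cto.com/4721239/832036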

# Start the scheduler  
sched = Scheduler()
sched.daemonic = False  
sched.add_interval_job(job,minutes=30)  
sched.start()

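About the daemonic option:

APScheduler creates a thread, and by default that thread has daemon=True, that is, it is a daemon thread.

In the code above, without sched.daemonic = False the script would not run on schedule.

Without sched.daemonic = False, the scheduler creates a daemon thread. The Scheduler instance is created, but because the script itself finishes very quickly, the main thread ends almost at once; the scheduler thread has not yet had a chance to run a job and is terminated together with the main thread (this follows from the relationship between daemon threads and the main thread). For the script to work properly, the scheduler thread must be non-daemon: sched.daemonic = False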
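
The effect of a non-daemon thread can be seen with the standard library alone. The sketch below is not APScheduler code; it is a minimal Python 3 illustration, with the hypothetical helper `run_every`, of an interval loop running in a non-daemon thread. Unlike a real scheduler it calls the job once immediately, and the interval is shortened to a hundredth of a second so that the demo ends quickly.

```python
import threading


def run_every(interval_seconds, job, stop_event):
    """Call `job` repeatedly, pausing `interval_seconds` between calls."""
    while not stop_event.is_set():
        job()
        # wait() returns early if stop_event is set during the pause.
        stop_event.wait(interval_seconds)


runs = []
stop = threading.Event()


def job():
    runs.append(len(runs) + 1)
    if len(runs) == 3:
        stop.set()  # stop after three runs so the demo terminates


# daemon=False: the interpreter waits for this thread before exiting.
# With daemon=True the process could end as soon as the main thread
# finished, before the job had run at all.
worker = threading.Thread(target=run_every, args=(0.01, job, stop), daemon=False)
worker.start()
worker.join()
print(runs)
```

The main thread here has nothing else to do after starting the worker, which is the same situation as the script in this article once sched.start() has returned.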

Appendix: the complete script

#-*- coding: UTF-8 -*-
import urllib2
from sgmllib import SGMLParser
import pymongo
import time
# Import smtplib for the actual sending function
import smtplib
from email.mime.text import MIMEText
from apscheduler.scheduler import Scheduler

#get freebook hrefs
class ListHref(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.is_a = ""
        self.name = []
        self.freehref=""
        self.hrefs=[]

    def start_a(self, attrs):
        self.is_a = 1
        href = [v for k, v in attrs if k == "href"]
        # An <a> tag may have no href (e.g. a named anchor); avoid IndexError
        self.freehref = href[0] if href else ""

    def end_a(self):
        self.is_a = ""

    def handle_data(self, text):
        if self.is_a == 1 and text.decode('utf8').encode('gbk')=="限時免費":
            self.hrefs.append(self.freehref)
#get freebook Info
class FreeBook(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.is_title=""
        self.name = ""
    def start_title(self, attrs):
        self.is_title = 1
    def end_title(self):
        self.is_title = ""
    def handle_data(self, text):
        if self.is_title == 1:            
            self.name=text
#Mongo Store Module
class freeBookMod:
    def __init__(self, date, bookname ,href):
        self.date=date
        self.bookname=bookname
        self.href=href


def get_book(bookList):
    content = urllib2.urlopen('http://sale.jd.com/act/yufbrhZtjx6JTV.html').read()
    listhref = ListHref()
    listhref.feed(content)

    for href in listhref.hrefs:
        content = urllib2.urlopen(str(href)).read()
        listbook=FreeBook()
        listbook.feed(content)
        name = listbook.name
        n= name.index('》')
        #print (name[0:n+2])
        freebook=freeBookMod(time.strftime('%Y-%m-%d',time.localtime(time.time())),name[0:n+2],href)
        bookList.append(freebook)
    return bookList

def record_book(bookList,context,isSendMail):
    # DataBase Operation
    mongoCon=pymongo.Connection(host="127.0.0.1",port=27017)
    db= mongoCon.mydatabase
    for bookItem in bookList:
        bookInfo = db.book.find_one({"href":bookItem.href})

        if not bookInfo:
            b={
               "bookname":bookItem.bookname.decode('gbk').encode('utf8'),
               "href":bookItem.href,
               "date":bookItem.date
               }
            db.book.insert(b,safe=True)
            isSendMail=True
            context=context+bookItem.bookname.decode('gbk').encode('utf8')+','
    return context,isSendMail  

#Send Message
def send_mail(mailto_list, sub, context):
    COMMASPACE = ', '
    mail_host = "localhost"  # address of the SMTP server
    me = "my@vmail.cn"  # sender address
    # Create a text/plain message
    msg = MIMEText(context)
    msg['Subject'] = sub
    msg['From'] = me
    msg['To'] = COMMASPACE.join(mailto_list)

    # Connect to the SMTP server, send the message, then end the session
    send_smtp = smtplib.SMTP(mail_host)
    send_smtp.sendmail(me, mailto_list, msg.as_string())
    send_smtp.quit()

#Main job for scheduler  
def job():
    bookList = []
    isSendMail = False
    context = "Today free books are"
    mailto_list = ["mailto@mail.cn"]
    bookList = get_book(bookList)
    context, isSendMail = record_book(bookList, context, isSendMail)
    if isSendMail:
        send_mail(mailto_list, "Free Book is Update", context)


if __name__=="__main__":      
    # Start the scheduler  
    sched = Scheduler()
    sched.daemonic = False  
    sched.add_interval_job(job,minutes=30)  
    sched.start()
