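Getting Started with Python Crawlers: A Small Hands-On Project (automatically private-messaging the commenters on a cnblogs post with a random joke; full code at the end of the post)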

Reposted article. 2016-03-03 15:43. Tags: crawler / python

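My earlier posts were all about solutions to problems I ran into while crawling, and they rarely touched on a real case. This time, taking cnblogs (博客園) as the subject, I will write a script that automatically private-messages the people who comment on a blog post. (Anyone who comments on this post will be messaged automatically too. If you do not want that but still have a question, please send me a private message.)

1). Choose the blog post to monitor. Here I use http://www.cnblogs.com/hearzeus/p/5226546.html as the example; later I will change it to the address of this post.

2). Get the commenters on the post.

Open the Network panel of the browser console and you will see the following:

Analysis shows that the request that fetches the commenters is:

http://www.cnblogs.com/mvc/blog/GetComments.aspx?postId=5226546&blogApp=hearzeus&pageIndex=0&anchorCommentId=0&_=1456989055561

The Python code is as follows: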

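import json
import urllib
import urllib2

def getCommentsHtml(index):
    # Request one page of comments; index is the page number (pageIndex).
    url = "http://www.cnblogs.com/mvc/blog/GetComments.aspx"
    params = {
        "postId":"5226546",
        "blogApp":"hearzeus",
        "pageIndex":str(index),
        "anchorCommentId":"0",
        "_":"1456908852216"  # the key is "_", as in the captured URL, not "_="
    }
    url_params = urllib.urlencode(params)
    # Passing data= makes urllib2 send the parameters as a POST body.
    # The response is JSON; the commenter list is the HTML in "commentsHtml".
    return json.loads(urllib2.urlopen(url,data=url_params).read())['commentsHtml']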
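For comparison, here is a minimal Python 3 sketch that rebuilds the captured GET URL from its query parameters without sending anything. The post's own code targets Python 2; `build_comments_url` and `BASE_URL` are hypothetical names introduced only for this illustration.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.cnblogs.com/mvc/blog/GetComments.aspx"

def build_comments_url(post_id, blog_app, page_index, timestamp="1456989055561"):
    """Return the GET URL for one page of comments (hypothetical helper)."""
    params = {
        "postId": post_id,
        "blogApp": blog_app,
        "pageIndex": page_index,
        "anchorCommentId": 0,
        "_": timestamp,  # the key is "_" on its own, as in the captured URL
    }
    # urlencode joins the pairs as key=value&key=value in insertion order.
    return BASE_URL + "?" + urlencode(params)

url = build_comments_url("5226546", "hearzeus", 0)
print(url)
```

With the values from the example post, the printed URL matches the one captured in the Network panel, so only `pageIndex` has to change from page to page.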

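You can walk through all the commenters by varying index. If the comments fill only one page and I set index to 2, no data comes back. By comparing a response that has data with one that does not, you can pick out a distinguishing feature that tells the crawler the traversal is finished. The check I use is: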

if html.count(u"comment_date") < 1:
    print "Traversal finished: " + str(i)

That is, whether the response contains the keyword "comment_date" decides whether the traversal has finished.
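The stop condition can be sketched on its own in Python 3, with the network call replaced by a stand-in. `collect_pages` and `fake_fetch` are hypothetical names, and the sample HTML is made up; only the "comment_date" marker comes from the post.

```python
def collect_pages(fetch_page, max_pages=10, marker="comment_date"):
    """Fetch pages 0, 1, 2, ... until one lacks `marker` or max_pages is reached."""
    pages = []
    for index in range(max_pages):
        html = fetch_page(index)
        if marker not in html:
            break  # a page without the marker means there are no more comments
        pages.append(html)
    return pages

# Stand-in for the real request: two pages of comments, then empty pages.
def fake_fetch(index):
    return '<span class="comment_date">2016-03-03</span>' if index < 2 else ""

pages = collect_pages(fake_fetch)
print(len(pages))
```

Passing the fetch function in as an argument keeps the loop testable without touching the site.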

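If we open this link directly in the browser, we can see the response, as shown in the figure below:

Pasting it into a JSON viewer (http://www.bejson.com/jsonviewernew/) shows the following:

The figure shows that the useful data is in the commentsHtml field. It also shows that the user list comes back as HTML, so the return value still has to be parsed.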
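As a small illustration of that step, the Python 3 sketch below pulls the HTML fragment out of a JSON body. The sample body is made up; only the commentsHtml key is taken from the post, and `extract_comments_html` is a hypothetical helper name.

```python
import json

# Made-up response body; only the "commentsHtml" key is taken from the post.
raw = '{"commentsHtml": "<div class=\\"post\\">...</div>"}'

def extract_comments_html(body):
    """Parse the JSON body and return the HTML fragment holding the comments."""
    data = json.loads(body)
    # Fall back to an empty string if the field is absent.
    return data.get("commentsHtml", "")

html = extract_comments_html(raw)
print(html)
```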

After working through it step by step, I arrived at the following parsing code:

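#parseHtml.py
#encoding=utf8
from bs4 import BeautifulSoup

def parse(html):
    # Return the commenter names found in the commentsHtml fragment.
    soup = BeautifulSoup(html,"html.parser")
    posts = soup.find_all("div","post")
    heads = soup.find_all("div","posthead")  # look these up once, not on every iteration
    name_list = []
    for i in range(len(posts)):
        # The third <a> in each "posthead" div holds the commenter's name.
        name_list.append(heads[i].find_all("a")[2].string)
    return name_list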
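For readers without BeautifulSoup installed, here is a rough Python 3 sketch of the same rule (take the third `<a>` inside each `<div class="posthead">`) using the standard library's `html.parser` instead. This is a swapped-in technique, not the post's method. The sample markup is invented to match that rule, and `PostHeadNames` and `parse_names` are hypothetical names.

```python
from html.parser import HTMLParser

class PostHeadNames(HTMLParser):
    """Collect the text of the third <a> inside each <div class="posthead">."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._div_depth = 0      # > 0 while inside a posthead div
        self._a_count = 0        # <a> tags seen in the current posthead
        self._in_target = False  # True while inside the third <a>
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self._div_depth > 0:
                self._div_depth += 1          # nested div inside a posthead
            elif "posthead" in classes:
                self._div_depth = 1
                self._a_count = 0
        elif tag == "a" and self._div_depth > 0:
            self._a_count += 1
            if self._a_count == 3:
                self._in_target = True
                self._buf = []

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth > 0:
            self._div_depth -= 1
        elif tag == "a" and self._in_target:
            self._in_target = False
            self.names.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self._in_target:
            self._buf.append(data)

def parse_names(html):
    parser = PostHeadNames()
    parser.feed(html)
    parser.close()
    return parser.names

# Invented markup: two comments, each with three links in its posthead.
SAMPLE = (
    '<div class="post"><div class="posthead">'
    '<a href="#1">#1</a> <a href="#r">reply</a> <a href="/u/alice/">alice</a>'
    '</div><div class="postbody">hi</div></div>'
    '<div class="post"><div class="posthead">'
    '<a href="#2">#2</a> <a href="#r">reply</a> <a href="/u/bob/">bob</a>'
    '</div><div class="postbody">yo</div></div>'
)
print(parse_names(SAMPLE))
```

The BeautifulSoup version is much shorter, which is a fair argument for keeping it when the dependency is available.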

3). Save the user names, so that nobody is sent a private message twice. The code is as follows:

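#FileOperation.py
#encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

def checkName(name):
    # Return True if name is already recorded, i.e. the user was messaged before.
    # The file ../src/comments must already exist.
    file = open("../src/comments")
    contents = file.read().split("\n")
    file.close()
    # Compare whole lines; a substring test would also match longer names.
    return name in contents

def wirteName(name):
    # Append name to the record file, one name per line.
    file = open("../src/comments","a")
    file.write(name+"\n")
    file.close()
    return True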

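The checkName function checks whether the user has already been sent a private message.

The wirteName function (sic) writes the user to the text file after a private message has been sent successfully.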
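The same bookkeeping can be sketched in Python 3 with a set, which makes the lookup an exact match and tolerates a missing file. `load_names` and `record_name` are hypothetical helpers, and the temporary directory is only there so the example can run anywhere.

```python
import os
import tempfile

def load_names(path):
    """Return the set of names already recorded; empty if the file is missing."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def record_name(path, name):
    """Append one name per line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(name + "\n")

path = os.path.join(tempfile.mkdtemp(), "comments")
record_name(path, "alice_w")
seen = load_names(path)
# Exact matching: "alice" is not treated as already messaged.
print("alice_w" in seen, "alice" in seen)
```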

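4). Send the private message (you can capture this endpoint yourself by sending a private message on cnblogs, using the same method as above). The code is as follows: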

#sendMessage.py
#encoding=utf8
import urllib
import urllib2

def send(name,content):
    # POST one private message to the user called name.
    url = "http://msg.cnblogs.com/ajax/msg/send"
    header={
        "Cookie":"**********"  # your own cookie from a logged-in cnblogs session
    }
    params = {
        "incept":name,
        "title":"Script private message",
        "content":content  # pass the text itself; backticks would send its repr()
    }
    url_param = urllib.urlencode(params)
    request = urllib2.Request(url=url,headers=header,data=url_param)
    print urllib2.urlopen(request).read()

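a). The cookie in header has to be obtained after logging in to cnblogs; it is the blurred-out part of the figure below.

b). The name of each entry in params shows what it does.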
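To show how the header and params end up in the request, here is a Python 3 sketch that builds the request object and inspects it without sending it. `build_message_request` is a hypothetical helper, and the cookie value is a placeholder that a real session cookie would replace.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_message_request(name, content, cookie):
    """Build (but do not send) the POST request for one private message."""
    params = {"incept": name, "title": "Script message", "content": content}
    body = urlencode(params).encode("utf-8")  # form-encoded bytes
    return Request(
        "http://msg.cnblogs.com/ajax/msg/send",
        data=body,                 # a request with a body is sent as POST
        headers={"Cookie": cookie},
    )

req = build_message_request("some_user", "hello", cookie="PLACEHOLDER")
print(req.get_method(), req.data)
```

Actually sending it would be a call to `urllib.request.urlopen(req)`, which is left out here on purpose.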

5). Timer

A Python timer; example code:

import threading
def sayhello():
    print "hello world"
    t = threading.Timer(2.0, sayhello)
    t.start()
    return
sayhello()
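The example above reschedules itself forever. As a variation, the Python 3 sketch below repeats a call a fixed number of times and then signals completion, which makes it easier to try out. `repeat` is a hypothetical helper, and the 0.01 second interval is only chosen to keep the demo short.

```python
import threading

def repeat(interval, func, times):
    """Call func `times` times, `interval` seconds apart; return an Event set at the end."""
    done = threading.Event()

    def tick(remaining):
        func()
        if remaining > 1:
            # Schedule the next call, just as sayhello() reschedules itself.
            t = threading.Timer(interval, tick, args=(remaining - 1,))
            t.daemon = True
            t.start()
        else:
            done.set()

    tick(times)
    return done

calls = []
done = repeat(0.01, lambda: calls.append("tick"), times=3)
done.wait(timeout=2)
print(len(calls))
```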

Appendix: full code

#spider.py
#encoding=utf8
import urllib2
import urllib
import json
import parseHtml
import sendMessage
import FileOperation
import threading

def getCommentsHtml(index):
    url = "http://www.cnblogs.com/mvc/blog/GetComments.aspx"
    params = {
        "postId":"5226546",#change this; do not monitor my post
        "blogApp":"hearzeus",#change this; do not monitor my blog
        "pageIndex":str(index),
        "anchorCommentId":"0",
        "_":"1456908852216"
    }
    url_params = urllib.urlencode(params)
    return json.loads(urllib2.urlopen(url,data=url_params).read())['commentsHtml']

def getCommentsUser(html):
    return parseHtml.parse(html)

def sendHello(name):
    sendMessage.send(name,"Message sent by a script. Sorry for any disturbance.")

def main():
    # Walk up to 10 comment pages and message every commenter not yet recorded.
    for i in range(10):
        html = getCommentsHtml(i)
        if html.count(u"comment_date") < 1:
            print "Traversal finished: " + str(i)
            # Past the last page: run main() again in 10 seconds.
            t = threading.Timer(10.0, main)
            t.start()
            return
        list_name = getCommentsUser(html)
        for name in list_name:
            if FileOperation.checkName(name) != True:
                sendHello(name)
                FileOperation.wirteName(name)

main()

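The other three .py files are all given above.

Note:

Be sure to change the monitored blog page. Do not monitor mine!!!!

I have said it three times!!!!!!

That's all.

a). The code is for learning and discussion only.

b). If there are mistakes, please point them out.

c). Please credit the source when reposting.

2016/3/4 10:36, latest update

I expect I will be banned before long.

2016/3/4 10:42 update

Some readers have been testing automatic replies, so here is the updated code with automatic reply: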

#encoding=utf8
import urllib
import urllib2

def sendLetter(name,content):# automatic private message
    url = "http://msg.cnblogs.com/ajax/msg/send"
    header={
        "Cookie":""  # fill in your own cookie
    }
    params = {
        "incept":name,
        "title":"Script private message",
        "content":content
    }
    url_param = urllib.urlencode(params)
    request = urllib2.Request(url=url,headers=header,data=url_param)
    print urllib2.urlopen(request).read()

def sendComments(parentid,contents):# automatic reply
    url = "http://www.cnblogs.com/mvc/PostComment/Add.aspx"
    header={
        "Cookie":""  # fill in your own cookie
    }
    params = {
        "blogApp":"hearzeus",
        "postId":"5238867",
        "body":contents,
        "parentCommentId":parentid
    }
    url_param = urllib.urlencode(params)
    print url_param
    request = urllib2.Request(url=url,headers=header,data=url_param)
    print urllib2.urlopen(request).read()

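2016/3/4 11:01 update

I have been banned.

2016/3/4 11:34 update

Automatic replies also fail when they are sent too quickly, so I have decided not to deploy this.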
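One common way to avoid sending too quickly is to pause between calls. The Python 3 sketch below shows the idea. `send_throttled` is a hypothetical helper, and the 5 second delay is an arbitrary assumption, since the post does not say what rate the site tolerates.

```python
import time

def send_throttled(names, send, delay=5.0, sleep=time.sleep):
    """Call send(name) for each name, pausing `delay` seconds between calls."""
    sent = []
    for position, name in enumerate(names):
        if position > 0:
            sleep(delay)  # wait before every call except the first
        send(name)
        sent.append(name)
    return sent

# Demo with a recording stand-in for sleep, so nothing actually waits.
pauses = []
sent = send_throttled(["a", "b", "c"], print, delay=5.0, sleep=pauses.append)
print(pauses)
```

Taking `sleep` as a parameter is only a convenience for trying the function out; in real use the default `time.sleep` applies.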

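A final word:

This post is only a simple example; understanding it is enough. There is no need to keep testing, because I have already turned these features off on the server.

Note: change the Cookie in sendMessage.py to your own.

https://yunpan.cn/cYy5a9aJ3wLaW (extraction code: f746)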
