🆙🆙

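I posted a rental listing in a Douban group and found it was quickly buried under other posts. Bumping it by hand was exhausting 😭, so I decided to automate the bumping and free up my hands!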
Analyzing the comment request

Inspecting the request in the Chrome Network panel

add_comment

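The comment URL is https://www.douban.com/group/topic/129122199/add_comment
The request carries 5 parameters; among them, ck is a value taken from the cookie and rv_comment is the comment text
A 302 response is a redirect, which here means the comment was posted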
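
Since ck has to be pulled out of the cookie string, here is a minimal sketch of one way to do that without relying on a trailing semicolon. The helper name `extract_ck` and the sample cookie are made up for illustration; only the `ck` cookie name comes from the request analysis above.

```python
def extract_ck(cookie_header):
    """Return the value of the `ck` cookie, or None if it is absent."""
    for part in cookie_header.split(";"):
        # Each part looks like "name=value"; partition keeps any "=" inside the value.
        name, sep, value = part.strip().partition("=")
        if sep and name == "ck":
            return value
    return None


# Hypothetical cookie string, for illustration only.
sample_cookie = "bid=abc123; ll=108288; ck=XyZ9"
print(extract_ck(sample_cookie))
```

Returning None when ck is missing makes it easy to stop early with a clear message instead of sending a request that is bound to fail.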

Simulating the request in Python:

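import re
import requests

# The specific Douban post
url = "https://www.douban.com/group/topic/129122199/"
# Reply endpoint for the post: the post URL followed by add_comment
# (url already ends with "/", so no extra slash is added)
comment_url = url + "add_comment"
# Placeholder: paste the Cookie header from a logged-in browser session.
# It must contain "ck=...;", otherwise the findall below returns an empty list.
cookie = 'cookie'
referer = url
agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
headers = {
    "Host": "www.douban.com",
    "Referer": referer,
    'User-Agent': agent,
    "Cookie": cookie
}
params = {
    "rv_comment": '🆙',
    "ck": re.findall("ck=(.*?);", headers["Cookie"])[-1],
    'start': '0',
    'submit_btn': '發送'
}
# allow_redirects=False keeps the 302 visible instead of following it
response = requests.post(comment_url, headers=headers, allow_redirects=False, data=params, verify=False)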

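Just run it.

After a few runs, however, the status code comes back as 200 and the post is not bumped. The requests have triggered Douban's anti-crawler protection.

Douban's captcha is triggered

When a comment is submitted with the captcha showing, the request also carries the captcha-solution and captcha-id fields.

So far I have found that even with the comments spaced 1 minute apart, the captcha always appears once 3 comments have been posted.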
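
The two status codes seen so far can be turned into a small decision helper. This is only a sketch based on the behaviour observed above (302 when the comment goes through, 200 when the captcha page comes back); the function name `bump_outcome` and the labels it returns are my own.

```python
def bump_outcome(status_code):
    """Map the status code of the add_comment request to an outcome label."""
    if status_code == 302:
        # Redirect back to the post: the comment was accepted.
        return "bumped"
    if status_code == 200:
        # The page was served again, which in my tests meant a captcha was required.
        return "captcha_required"
    return "unexpected"


for code in (302, 200, 403):
    print(code, bump_outcome(code))
```

This only works when the request is sent with allow_redirects=False; otherwise requests follows the redirect and the final status is 200 in both cases.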

Solving the captcha

If there is a captcha in the way, we solve the captcha.

tesserocr

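Recognizing image captchas requires the tesserocr library, so here is a short introduction.

tesserocr is a Python OCR library, but it is really just a Python API wrapper around tesseract. The core is still tesseract, so tesseract has to be installed before tesserocr. The word tesseract (/ˈtesərækt/) means "hypercube", the four-dimensional analogue of the cube in geometry, also known as the "8-cell". Tesseract is also the name of a widely used open-source OCR engine.

On a Mac, install it with brew (the --all-languages option matches Homebrew at the time of writing; newer Homebrew releases dropped formula options and ship the extra language data in the separate tesseract-lang formula):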

brew install tesseract --all-languages

Then install tesserocr:

brew install imagemagick
pip install tesserocr pillow

The Python code is as follows:

import tesserocr
from PIL import Image

if __name__ == '__main__':
    # Create the Image object
    image = Image.open("/Users/liwenhao/Desktop/douban-captcha-example1.jpeg")
    # Call tesserocr's image_to_text() with the image object to run recognition
    result = tesserocr.image_to_text(image)
    print(result)

The image being recognized:

douban-captcha-example1

It could not be recognized.

Let's try a simpler image:

The result:

5594

So Tesseract can only handle simple captchas and is not suited to Douban's.

Let's try online recognition services instead.

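Baidu OCR

Official integration docs: Text Recognition, Python SDK documentation

Key point: it is free
General recognition (including ID cards and bank cards): 500 calls/day,
High-accuracy recognition: 50 calls/day,
Driver's license, vehicle license, tickets, business license, general receipts: 200 calls/day each

Note: Python 2.7+ and 3+ are supported

Setup steps:

Sign up for a Baidu account;
Enable the text recognition service; open the page and click "Use now": cloud.baidu.com/product/ocr…
After step 2 there should be an information-confirmation prompt; once confirmed, you land on your user home page. Scroll down and click Text Recognition:
Click "Create application", fill in the required fields, and confirm. Then open "My applications"; the API Key and Secret Key shown there will be needed:
Click the user center in the top-right corner; the user ID is also needed:
With the required information ready, install the SDK with pip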
```
pip install baidu-aip
```

A quick test

import json
from aip import AipOcr


# Read the image file as bytes
def get_file_content(file_path):
    with open(file_path, 'rb') as fp:
        return fp.read()


if __name__ == '__main__':
    APP_ID = ' '
    API_KEY = ' '
    SECRET_KEY = ' '
    client = AipOcr(APP_ID, API_KEY, SECRET_KEY)
    image = get_file_content('/Users/liwenhao/Desktop/douban-captcha-example2.jpg')
    # Call general text recognition (high accuracy) with a local image
    result = json.dumps(client.basicAccurate(image))
    print(result)

The image being recognized:

douban-captcha-example1

The result:

{"log_id": 3968431492157876638, "words_result_num": 1, "words_result": [{"words": " minute:"}]}

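The result shows that this captcha was recognized.

words_result_num is the number of recognition results
words_result is the array of location and recognition results
words is the recognized string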
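
To use the result as a captcha answer, the recognized strings have to be pulled out of that structure. A minimal sketch, assuming the result dict has the shape shown above; the helper name `extract_words` is hypothetical, and stripping whitespace is my own choice because the sample result carries a leading space.

```python
import json


def extract_words(ocr_result):
    """Join the recognized strings from a result dict shaped like the one above."""
    items = ocr_result.get("words_result", [])
    return "".join(item.get("words", "").strip() for item in items)


sample = json.loads(
    '{"log_id": 3968431492157876638, "words_result_num": 1, '
    '"words_result": [{"words": " minute:"}]}'
)
print(extract_words(sample))
```

An empty string then signals that nothing was recognized, which is the cue to fall back to another method.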

Let's try another one

douban-captcha-example2

The result:

{"log_id": 5251449865676063710, "words_result_num": 0, "words_result": []}

Nothing was recognized. More complex captchas can still fail, but at least the service is free.

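Chaojiying (超級鷹)

When OCR fails, a captcha-solving platform is needed; one of the better-known ones is Chaojiying.

Chaojiying charges by volume and gets cheaper in bulk. The standard price is 1 yuan = 1000 points, and different captcha types cost different numbers of points; details are at www.chaojiying.com/price.html

The Python code is as follows: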

from hashlib import md5
import requests
import json


# Recognize a captcha through Chaojiying
def recognition_captcha(filename, code_type):
    # Read the image bytes and close the file afterwards
    with open(filename, 'rb') as f:
        im = f.read()
    params = {
        'user': 'account',
        'pass2': md5('password'.encode('utf8')).hexdigest(),
        'softid': 'softid',
        'codetype': code_type
    }
    headers = {
        'Connection': 'Keep-Alive',
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
    }
    files = {'userfile': ('ccc.jpg', im)}
    resp = requests.post('http://upload.chaojiying.net/Upload/Processing.php',
                         data=params, files=files, headers=headers).json()
    return resp


# Example call
if __name__ == '__main__':
    print(json.dumps(recognition_captcha('/Users/liwenhao/Desktop/douban-captcha-example2.jpg', 1006)))

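The captcha uploaded is the one Baidu OCR failed to recognize above:

douban-captcha-example2

The result:

{"err_str": "OK", "err_no": 0, "md5": "0475b05654c376deb409bfef7eee75cd", "pic_id": "8054415552001300054", "pic_str": "yacvmd"}

The captcha text yacvmd came back, although it took about 5s. Later tests showed that Douban's simpler captchas are solved in under 1s, so on both speed and accuracy I settled on the Chaojiying platform.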
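
Before the answer is used, it is worth checking the error fields of the response. A small sketch, assuming the response always has the shape shown above; the helper name `captcha_text` and the failing sample are hypothetical.

```python
def captcha_text(resp):
    """Return pic_str when err_no is 0, otherwise raise with the reported err_str."""
    if resp.get("err_no") != 0:
        raise ValueError("recognition failed: %s" % resp.get("err_str", "unknown error"))
    return resp["pic_str"]


ok_response = {
    "err_str": "OK",
    "err_no": 0,
    "md5": "0475b05654c376deb409bfef7eee75cd",
    "pic_id": "8054415552001300054",
    "pic_str": "yacvmd",
}
print(captcha_text(ok_response))
```

Raising on a non-zero err_no keeps a failed recognition from being submitted as an empty captcha answer.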

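WeChat notification on failure

Whatever method is used, failures can still happen. I can't keep polling by hand, checking every few hours whether the previous 🆙 attempts went through, so an asynchronous notification is needed. I first thought of email, then discovered Server醬 (ServerChan), a handy tool that sends WeChat notifications and is very simple to use.

See Server醬 for details.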
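
Sending a notification comes down to requesting a URL that contains your SCKEY and the message. Below is a sketch of building that URL with proper escaping. The sc.ftqq.com address and the text parameter match what the full code in this post uses; the optional desp (message body) parameter is how I remember the Server醬 docs, so treat it as an assumption, and `build_notify_url` is a made-up helper name.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def build_notify_url(sckey, title, detail=None):
    """Build the notification URL; `title` goes into text, `detail` into desp."""
    query = [("text", title)]
    if detail:
        query.append(("desp", detail))
    return "https://sc.ftqq.com/%s.send?%s" % (sckey, urlencode(query))


# "YOUR_SCKEY" is a placeholder, not a real key.
print(build_notify_url("YOUR_SCKEY", "douban failed", "status code 200"))
```

The resulting URL can be passed to requests.get or requests.post; urlencode takes care of spaces and non-ASCII characters in the message.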

Full code

Written for Python 2

# -*- coding: utf-8 -*-
import os
import requests
import urllib3
import re
from hashlib import md5
import random
from lxml import html
import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s.%(msecs)03d %(levelname)s: %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S')
urllib3.disable_warnings()


# Download the captcha image
def download_captcha(captcha_url, agent):
    # findall returns a list
    captcha_name = re.findall("id=(.*?):", captcha_url)
    filename = "douban_%s.jpg" % (str(captcha_name[0]))
    logging.info("File name: " + filename)
    # Create the local file in binary write mode
    with open(filename, 'wb') as f:
        header = {
            'User-Agent': agent,
            'Referer': captcha_url
        }
        f.write(requests.get(captcha_url, headers=header).content)
    logging.info("%s downloaded" % filename)
    return filename


# Recognize a captcha through Chaojiying
def recognition_captcha(filename, code_type):
    with open(filename, 'rb') as f:
        im = f.read()
    params = {
        'user': 'account',
        'pass2': md5('password'.encode('utf8')).hexdigest(),
        'softid': 'softid',
        'codetype': code_type
    }
    headers = {
        'Connection': 'Keep-Alive',
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
    }
    files = {'userfile': ('ccc.jpg', im)}
    resp = requests.post('http://upload.chaojiying.net/Upload/Processing.php',
                         data=params, files=files, headers=headers).json()
    # Error handling: return the text only on success (None otherwise)
    if resp.get('err_no', 0) == 0:
        return resp.get('pic_str')


# Check the result and send a WeChat notification on failure
def result_verification(response):
    if response.status_code == 302:
        logging.info("Douban bump succeeded")
    else:
        logging.info(response.status_code)
        logging.info(response)
        # Replace YOUR_SCKEY with your own Server醬 key
        url = "https://sc.ftqq.com/YOUR_SCKEY.send?text=douban_failed" + \
              str(random.randint(0, 1000))
        requests.post(url)
        logging.info("Douban bump failed; failure notice sent to WeChat")


# Bump the Douban post
def douban_ding():
    # The specific Douban post
    url = "https://www.douban.com/group/topic/129122199/"
    # Reply endpoint: the post URL followed by add_comment (url already ends with "/")
    comment_url = url + "add_comment"
    # Placeholder: paste the Cookie header from a logged-in session (must contain "ck=...;")
    cookie = 'cookie'
    referer = url
    agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
    headers = {
        "Host": "www.douban.com",
        "Referer": referer,
        'User-Agent': agent,
        "Cookie": cookie
    }
    params = {
        "rv_comment": '🆙',
        "ck": re.findall("ck=(.*?);", headers["Cookie"])[-1],
        'start': '0',
        'submit_btn': '發送'
    }
    # Load the post page first to see whether a captcha is being shown
    response = requests.get(url, headers=headers, verify=False).content.decode('utf-8')
    selector = html.fromstring(response)
    captcha_image = selector.xpath("//img[@id=\"captcha_image\"]/@src")
    if captcha_image:
        logging.info("Captcha found, downloading it")
        captcha_id = selector.xpath("//input[@name=\"captcha-id\"]/@value")
        filename = download_captcha(captcha_image[0], agent)
        captcha_solution = recognition_captcha(filename, 1006)
        os.remove(filename)
        params['captcha-solution'] = captcha_solution
        # xpath returns a list; send the first value
        params['captcha-id'] = captcha_id[0]
    else:
        logging.info("No captcha")
    response = requests.post(comment_url, headers=headers, allow_redirects=False, data=params, verify=False)
    result_verification(response)


if __name__ == '__main__':
    douban_ding()

Output:

Run 1:

2018-12-30 16:09:35.589 INFO: No captcha
2018-12-30 16:09:36.436 INFO: Douban bump succeeded

Run 4:

2018-12-30 16:13:02.135 INFO: Captcha found, downloading it
2018-12-30 16:13:02.135 INFO: File name: douban_OJGsVa0hST4O2WhFA0VpMnR9.jpg
2018-12-30 16:13:02.554 INFO: douban_OJGsVa0hST4O2WhFA0VpMnR9.jpg downloaded
2018-12-30 16:13:09.687 INFO: Douban bump succeeded

Screenshot of the result:

Keep the bumping frequency under control, or the account can easily get muted.
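
One simple way to keep the frequency in check is to refuse to bump until a minimum interval has passed since the last one. A minimal sketch; the two-hour default is an arbitrary assumption of mine, not a limit published by Douban, and `should_bump` is a made-up helper name.

```python
import time

# Assumed minimum gap between bumps; pick a value you are comfortable with.
MIN_INTERVAL_SECONDS = 2 * 60 * 60


def should_bump(last_bump_ts, now=None, min_interval=MIN_INTERVAL_SECONDS):
    """Return True if no bump has happened yet or the minimum interval has passed."""
    if last_bump_ts is None:
        return True
    if now is None:
        now = time.time()
    return now - last_bump_ts >= min_interval


last = time.time() - 30 * 60  # pretend the last bump was 30 minutes ago
print(should_bump(last))
```

Storing the timestamp of the last successful bump in a small file and calling this check first keeps a scheduled job from posting more often than intended.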

Auto-bumping posts with Python: no more manual bumping