Writing an E-Hentai Crawler in Python + Aria2 That Automatically Picks the Best Download Method


Foreword

There are already plenty of E-Hentai crawlers around, but most of them can only download image by image, and images occasionally fail to load. As anyone familiar with the site knows, many galleries also offer BitTorrent downloads, and the torrent usually packs higher-resolution files than the default image viewer; on the other hand, if you only download torrents, you will run into galleries with no torrent at all or with dead seeds. This article builds an E-Hentai crawler that automatically detects the best download source for each gallery and saves it locally. The crawler uses a database as its buffer, can run as a background service, scales out to distributed deployment easily, and is robust against network errors.

Requirements

Python 3, MySQL, and Aria2 with RPC remote access enabled.

Aria2 is a powerful command-line download tool with a web management UI; it runs on both Windows and Linux. For an introduction and setup instructions, see

https://blog.csdn.net/yhcad/article/details/86561233

http://aria2c.com/usage.html

https://aria2.github.io/manual/en/html/aria2c.html
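As a quick sanity check that the RPC interface is reachable, it helps to know the shape of the JSON-RPC request bodies Aria2 expects. A minimal sketch (the endpoint and token are the example values used later in this article):

```python
import json

def rpc_payload(method, token, params=None, req_id="check"):
    # Build an Aria2 JSON-RPC 2.0 request body; the secret token is passed
    # as the first positional parameter.
    return json.dumps({
        "jsonrpc": "2.0",
        "id": req_id,
        "method": method,
        "params": ["token:%s" % token] + (params or []),
    })

# e.g.: requests.post("http://127.0.0.1:6800/jsonrpc",
#                     rpc_payload("aria2.getVersion", "12345678"))
```

If the POST above returns a JSON object with a "result" key, RPC access is configured correctly.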

Basic configuration

Create a table in MySQL with the following structure.

Field descriptions:

Field name      Meaning
id              primary key
comic_name      gallery title
starttime       time the download started
endtime         time the download finished
status          current download status
checktimes      retry count after errors
raw_address     e-hentai gallery page URL
failed_links    image page URLs that temporarily failed due to network hiccups
failed_paths    local image paths corresponding to the failed pages
inserttime      time the URL entered the database
oldpage         gid of the corresponding Aria2 task
filepath        bt download path
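A possible DDL matching the field list above (a sketch: the column types and lengths are our assumptions, not from the original post; the UNIQUE key on comic_name is inferred from the IntegrityError handling in the database module, which treats a duplicate title as "already downloaded"):

```sql
CREATE TABLE comic_urls (
    id           INT AUTO_INCREMENT PRIMARY KEY,
    comic_name   VARCHAR(512),
    starttime    DATETIME,
    endtime      DATETIME,
    status       VARCHAR(16) DEFAULT 'pending',
    checktimes   INT DEFAULT 0,
    raw_address  VARCHAR(512),
    failed_links TEXT,
    failed_paths TEXT,
    inserttime   DATETIME DEFAULT CURRENT_TIMESTAMP,
    oldpage      VARCHAR(64),
    filepath     VARCHAR(512),
    UNIQUE KEY uniq_name (comic_name(191))
) CHARACTER SET utf8mb4;
```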


In the rest of this article, the MySQL database is assumed to be named comics_local and the table comic_urls.

Aria2 runs as a background service, with RPC address 127.0.0.1:6800 and token 12345678.

The following Python packages are needed (zipfile ships with the standard library and needs no installation; beautifulsoup4 and lxml are required by the page-parsing code):

pip install pymysql requests filetype wget beautifulsoup4 lxml

Project code

Workflow

The crawler service works as follows: the user puts the E-Hentai links to fetch (of the form https://e-hentai.org/g/xxxxxxx/yyyyyyyyyy/ ) into the table's raw_address field and sets the status field to pending. The crawling service, driven by background polling or callbacks, pulls pending links from the database and visits each page. If the page offers a bt torrent, the torrent is downloaded and handed to Aria2; otherwise the images are downloaded directly (preferring the high-resolution versions).

In image download mode, if everything goes well, the status field is set to finished at the end; if something goes wrong, the field is set to the corresponding error state and the entry is handled again on the next poll/invocation.

In bt download mode, a separate background process periodically asks Aria2 for the download status. Once Aria2 reports completion, the target file is unzipped and the status field is set to finished; if a problem such as a dead seed occurs, the Aria2 task is removed and the entry switches to image download mode.

Database module

This module wraps a few MySQL operations. Keeping to the same interface, MySQL could be swapped for another store such as Redis, which in turn enables distributed deployment.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
filename: sql_module.py

Created on Sun Sep 22 23:24:39 2019

@author: qjfoidnh
"""

import pymysql
from pymysql.err import IntegrityError

class MySQLconn_url(object):
    def __init__(self):

        self.conn = pymysql.connect(
                host='127.0.0.1',
                port=3306,
                user='username',
                passwd='password',
                db='comics_local'
                )
        self.conn.autocommit(True) # enable autocommit; DBAs would frown on this in production
        self.cursor = self.conn.cursor(cursor=pymysql.cursors.DictCursor)
        # have MySQL return rows as dicts

    def __del__(self):

        self.conn.close()

    # Fetch one row with the given status
    def fetchoneurl(self, mode="pending", tabname='comic_urls'):

        sql = "SELECT * FROM %s \
                WHERE status = '%s'" %(tabname, mode)
        self.conn.ping(True) # reconnect if the long-lived connection has timed out
        try:
            self.cursor.execute(sql)
        except Exception as e:
            return e
        else:
            item = self.cursor.fetchone()
            if not item:
                return None
            if mode=="pending" or mode=='aria2':
                if item['checktimes']<3:
                    sql = "UPDATE %s SET starttime = now(), status = 'ongoing' \
                    WHERE id = %d" %(tabname, item['id'])
                else:
                    sql = "UPDATE %s SET status = 'error' \
                    WHERE id = %d" %(tabname, item['id'])
                    if mode=='aria2':
                        # too many failed checks: treat the seed as dead and requeue in image mode
                        sql = "UPDATE %s SET status = 'pending', checktimes = 0, raw_address=CONCAT('chmode',raw_address) \
                    WHERE id = %d" %(tabname, item['id'])
                    self.cursor.execute(sql)
                    return 'toomany'
            elif mode=="except":
                sql = "UPDATE %s SET status = 'ongoing' \
                WHERE id = %d" %(tabname, item['id'])
            try:
                self.cursor.execute(sql)
            except Exception as e:
                self.conn.rollback()
                return e
            else:
                return item

    # Update the status field of the row with the given id
    def updateurl(self, itemid, status='finished', tabname='comic_urls'):
        sql = "UPDATE %s SET endtime = now(),status = '%s' WHERE id = %d" %(tabname, status, itemid)
        self.conn.ping(True)
        try:
            self.cursor.execute(sql)
        except Exception as e:
            self.conn.rollback()
            return e
        else:
            return itemid

    # Update the status and retry-count fields of the row with the given id
    def reseturl(self, itemid, mode, count=0, tabname='comic_urls'):

        sql = "UPDATE %s SET status = '%s', checktimes=checktimes+%d WHERE id = %d" %(tabname, mode, count, itemid)
        self.conn.ping(True)
        try:
            self.cursor.execute(sql)
        except Exception as e:
            print(e)
            self.conn.rollback()
            return e
        else:
            return itemid

    # Write the list of image URLs that have not finished downloading
    def fixunfinish(self, itemid, img_urls, filepaths, tabname='comic_urls'):

        img_urls = "Š".join(img_urls) # join with an uncommon Latin letter so a real URL can never be split wrongly
        filepaths = "Š".join(filepaths)
        sql = "UPDATE %s SET failed_links = '%s', failed_paths = '%s', status='except' WHERE id = %d" %(tabname, img_urls, filepaths, itemid)
        self.conn.ping(True)
        try:
            self.cursor.execute(sql)
        except Exception as e:
            self.conn.rollback()
            return e
        else:
            return 0

    # After one repair pass, update the unfinished list
    def resetunfinish(self, itemid, img_urls, filepaths, tabname='comic_urls'):
        failed_num = len(img_urls)
        if failed_num==0:
            sql = "UPDATE %s SET failed_links = null, failed_paths = null, status = 'finished', endtime = now() WHERE id = %d" %(tabname, itemid)
        else:
            img_urls = "Š".join(img_urls) # same uncommon-letter separator as above
            filepaths = "Š".join(filepaths)
            sql = "UPDATE %s SET failed_links = '%s', failed_paths = '%s', status = 'except' WHERE id = %d" %(tabname, img_urls, filepaths, itemid)
        self.conn.ping(True)
        try:
            self.cursor.execute(sql)
        except Exception as e:
            self.conn.rollback()
            return e
        else:
            return failed_num

    # Fill in the gallery title for a row
    def addcomicname(self, address, title, tabname='comic_urls'):
        sql = "UPDATE %s SET comic_name = '%s' WHERE raw_address = '%s'" %(tabname, title, address) # no id is available at the call site, so locate the row by address -- one of only two places in this project that do
        self.conn.ping(True)
        try:
            self.cursor.execute(sql)
        except IntegrityError:
            self.conn.rollback()
            sql_sk = "UPDATE %s SET status = 'skipped' \
                    WHERE raw_address = '%s'" %(tabname, address)
            self.cursor.execute(sql_sk)
            return Exception(title+" Already downloaded!")
        except Exception as e:
            self.conn.rollback()
            return e
        else:
            return 0

    # Look up the Aria2 gid for a URL
    def fetchonegid(self, address, tabname='comic_urls'):
        sql = "SELECT * FROM %s \
                WHERE raw_address = '%s'" %(tabname, address)
        self.conn.ping(True)
        try:
            self.cursor.execute(sql)
        except Exception as e:
            return e
        else:
            item = self.cursor.fetchone()
            if not item:
                return None
            else:
                return item.get('oldpage')

    # Store the Aria2 gid and bt download path for a row. The main loop calls
    # this method but the original listing omitted it; the body below is a
    # reconstruction from how it is used.
    def replaceurl(self, itemid, gid, address, filepath='', tabname='comic_urls'):
        sql = "UPDATE %s SET oldpage = '%s', filepath = '%s' WHERE id = %d" %(tabname, gid, filepath, itemid)
        self.conn.ping(True)
        try:
            self.cursor.execute(sql)
        except Exception as e:
            self.conn.rollback()
            return e
        else:
            return itemid

mq = MySQLconn_url()
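The failed-link bookkeeping above flattens a list of URLs into one text column using the uncommon letter "Š" as separator; the round trip is just join/split. A minimal illustration:

```python
# Serialize a list of failed image-page URLs into one TEXT column and back.
# "Š" is chosen because it never appears in a real e-hentai URL, so the
# split can never break a URL apart.
SEP = "Š"

def pack(urls):
    return SEP.join(urls)

def unpack(blob):
    return blob.split(SEP)

links = ["https://e-hentai.org/s/aaa/1", "https://e-hentai.org/s/bbb/2"]
assert unpack(pack(links)) == links
```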

First-stage processing module

This module does the initial handling of an E-Hentai link: it fetches the gallery title, decides the download type, invokes the second-stage module, and returns a status value describing the result to the main function.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
filename: init_process.py

Created on Sun Sep 22 21:20:54 2019

@author: qjfoidnh
"""

from settings import *
from tools import Get_page, Download_img
from second_process import Ehentai
from checkdalive import removetask
from sql_module import mq
import time
import os


# Try again to download the images in the unfinished list to their target paths
def fixexcepts(itemid, img_urls, filepaths):
    img_urls_new = list()
    filepaths_new = list()
    img_urls = img_urls.split("Š") # restore the lists from their string form
    filepaths = filepaths.split("Š")
    for (imglink, path) in zip(img_urls, filepaths):
        try:
            content = Get_page(imglink, cookie=cookie_ehentai(imglink))
            if not content:
                img_urls_new.append(imglink)
                filepaths_new.append(path)
                continue
            time.sleep(10)
            try:
                img_src = content.select_one("#i7 > a").get('href') # high-quality image
            except AttributeError: # no high-quality version offered
                img_src = content.select_one("img[id='img']").get("src") # normal-quality image
            src_name = content.select_one("#i2 > div:nth-of-type(2)").text.split("::")[0].strip() # image file name
            raw_path = path
            if os.path.exists(raw_path+'/'+src_name):
                continue
            http_code = Download_img(img_src, raw_path+'/'+src_name, cookie=cookie_ehentai(imglink))
            if http_code!=200:
                raise Exception("Network error!")
        except Exception:
            img_urls_new.append(imglink)
            filepaths_new.append(path)
    result = mq.resetunfinish(itemid, img_urls_new, filepaths_new)
    return result


class DownEngine:
    def __init__(self):
        pass

    # Pick the preferred download mode for the address, fetch the gallery
    # title, write it to the database, and hand over to the second-stage module
    def engineEhentai(self, address):
        if 'chmode' in address:
            mode='normal'
            removetask(address=address)
        else:
            mode='bt'
        address = address.replace('chmode', '')
        content = Get_page(address, cookie=cookie_ehentai(address))
        if not content:
            return 2
        warning = content.find('h1', text="Content Warning")
        # E-Hentai asks for confirmation on some sensitive content
        if warning:
            address += '?nw=session'
            content = Get_page(address, cookie=cookie_ehentai(address))
            if not content:
                return 2
        title = content.select_one("h1[id='gj']").text
        if not len(title): # some galleries have no Japanese title; use the English one
            title = content.select_one("h1[id='gn']").text
            if not len(title):
                return 2

        title = title.replace("'", '"') # a single quote in the title would break the SQL statement
        title_st = mq.addcomicname(address, title)
        if type(title_st)==Exception:
            return title_st

        ehentai = Ehentai(address, title, mode=mode)
        result = ehentai.getOthers()
        return result

Second-stage processing module

Called by the first-stage module, this module locates the bt torrent or all the image pages of a gallery according to preset rules, then downloads them. In bt mode it hands the torrent file and download metadata straight to Aria2; in image mode it downloads every image in the gallery in order.

Coroutine-based request libraries such as grequests are deliberately not used, because E-Hentai bans IPs that issue too many requests in a short time; in fact we even have to insert a delay between downloads. If you need more throughput, consider shortening that delay and running several processes behind different proxies.
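That pacing can be factored into a tiny helper. A sketch (the helper name is ours; the 10-second default mirrors the sleep calls in the listing below):

```python
import time

def throttled(iterable, interval=10.0):
    """Yield items no faster than one per `interval` seconds, so the site
    never sees a burst of requests from our IP."""
    last = None
    for item in iterable:
        if last is not None:
            remaining = interval - (time.monotonic() - last)
            if remaining > 0:
                time.sleep(remaining)
        last = time.monotonic()
        yield item

# hypothetical usage:
# for page_link in throttled(page_links):
#     content = Get_page(page_link, cookie=...)
```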

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
filename: second_process.py

Created on Mon Sep 23 20:35:48 2019

@author: qjfoidnh
"""

import time
import datetime
import requests
from tools import Get_page, Download_img, postorrent
from checkdalive import getinfos
from settings import proxies
from settings import *
import re
import os
from logger_module import logger

formatted_today = lambda: datetime.date.today().strftime('%Y-%m-%d')+'/' # today's date as a string, used to name the day's folder


# Escape characters in the gallery title that could cause trouble in paths
def legalpath(path):
    path = list(path)
    path_raw = path[:]
    for i in range(len(path_raw)):
        if path_raw[i] in [' ','[',']','(',')','/','\\']:
            path[i] = '\\'+ path[i]
        elif path_raw[i]==":":
            path[i] = '-'
    return ''.join(path)

class Ehentai(object):
    def __init__(self, address, comic_name, mode='normal'):
        self.head = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
                     'accept-encoding': 'gzip, deflate, br',
                     'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
                     'upgrade-insecure-requests': '1',
                     'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
                     }
        self.address = address
        self.mode = mode
        self.gid = address.split('/')[4]
        self.tid = address.split('/')[5]
        self.content = Get_page(address, cookie=cookie_ehentai(address))
        self.comic_name = legalpath(comic_name)
        self.raw_name = comic_name.replace("/"," ")
        self.raw_name = self.raw_name.replace(":","-")
        self.src_list = []
        self.path_list = []

    # main download routine
    def getOthers(self):
        if not self.content:
            return 2
        today = formatted_today()
        logger.info("E-hentai: %s start!" %self.raw_name)
        complete_flag = True
        pre_addr = re.search(r'(e.+org)', self.address).group(1)
        if self.mode=='bt': # torrent mode
            content = Get_page("https://%s/gallerytorrents.php?gid=%s&t=%s"%(pre_addr,self.gid,self.tid), cookie=cookie_ehentai(self.address))
            torrents = content.find_all(text="Seeds:")
            if not torrents:
                self.mode = 'normal' # no torrent offered: fall back to image mode
            else:
                torrents_num = [int(tag.next_element) for tag in torrents]
                target_index = torrents_num.index(max(torrents_num)) # pick the torrent with the most seeders
                torrent_link = content.select('a')[target_index].get('href')
                torrent_name = content.select('a')[target_index].text.replace('/',' ')

                # e-hentai and exhentai differ slightly here
                if 'ehtracker' in torrent_link:
                    req = requests.get(torrent_link, proxies=proxies)
                    if req.status_code==200:
                        with open(abs_path+'bttemp/'+torrent_name+'.torrent', 'wb') as ft:
                            ft.write(req.content)
                    id = postorrent(abs_path+'bttemp/'+torrent_name+'.torrent', dir=abs_path+today)
                    if id:
                        filepath = getinfos().get(id).get('filepath')
                        return {'taskid':id, 'filepath':filepath}
                    else: self.mode = 'normal'

                # e-hentai and exhentai differ slightly here
                elif 'exhentai' in torrent_link:

                    req = requests.get(torrent_link, headers={'Host': 'exhentai.org',
                                                              'Referer': "https://%s/gallerytorrents.php?gid=%s&t=%s"%(pre_addr,self.gid,self.tid),
                                                              'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'},
                                                              cookies=cookie_ehentai(self.address), proxies=proxies)
                    if req.status_code==200:
                        with open(abs_path+'bttemp/'+torrent_name+'.torrent', 'wb') as ft:
                            ft.write(req.content)
                        id = postorrent(abs_path+'bttemp/'+torrent_name+'.torrent', dir=abs_path+today)
                        if id:
                            filepath = getinfos().get(id).get('filepath')
                            return {'taskid':id, 'filepath':filepath}
                        else:
                            self.mode = 'normal'
                    else:
                        self.mode = 'normal'

        page_tag1 = self.content.select_one(".ptds")
        page_tags = self.content.select("td[onclick='document.location=this.firstChild.href']")
        indexslen = len(page_tags)//2-1
        if indexslen <=0:
            indexslen = 0
        pagetags = page_tags[0:indexslen]
        pagetags.insert(0, page_tag1)

        # Galleries longer than 8 thumbnail pages may not expose links to every
        # page directly, so build the links from the URL pattern instead
        last_page = pagetags[-1]
        last_link = last_page.a.get('href')
        page_links = [pagetag.a.get('href') for pagetag in pagetags]
        try:
            last_number = int(re.findall(r'\?p=([0-9]+)',last_link)[0])
        except IndexError:
            pass # the gallery has only one page: nothing special to do
        else:
            if last_number>=8:
                templete_link = re.findall(r'(.+\?p=)[0-9]+',last_link)[0]
                page_links = [templete_link+str(page+1) for page in range(last_number)]
                page_links.insert(0, page_tag1.a.get('href'))

        for page_link in page_links:
            content = Get_page(page_link, cookie=cookie_ehentai(self.address))
            if not content:
                return 2
            imgpage_links = content.select("div[class='gdtm']") # one kind of thumbnail container
            if not imgpage_links:
                imgpage_links = content.select("div[class='gdtl']") # sometimes the other kind
            for img_page in imgpage_links:
                try:
                    imglink = img_page.div.a.get('href') # first kind
                except AttributeError:
                    imglink = img_page.a.get('href') # second kind
                content = Get_page(imglink, cookie=cookie_ehentai(self.address))
                if not content:
                    complete_flag = False
                    self.src_list.append(imglink)
                    self.path_list.append(abs_path+today+self.raw_name)
                    continue
                try:
                    img_src = content.select_one("#i7 > a").get('href') # high-quality image
                except AttributeError:
                    img_src = content.select_one("img[id='img']").get("src") # normal-quality image
                src_name = content.select_one("#i2 > div:nth-of-type(2)").text.split("::")[0].strip() # image file name
                raw_path = abs_path+today+self.raw_name
                try:
                    os.makedirs(raw_path)
                except FileExistsError:
                    pass
                if os.path.exists(raw_path+'/'+src_name):
                    continue
                http_code = Download_img(img_src, raw_path+'/'+src_name, cookie=cookie_ehentai(self.address))
                if http_code!=200:
                    time.sleep(10)
                    complete_flag = False
                    self.src_list.append(imglink)
                    self.path_list.append(raw_path)
                    continue
                else:
                    time.sleep(10)
        if not complete_flag:
            logger.warning("E-hentai: %s ONLY PARTLY finished downloading!" %self.raw_name)
            return (self.src_list, self.path_list)

        else:
            logger.info("E-hentai: %s has COMPLETELY finished downloading!" %self.raw_name)
            return 1

Download status polling module

This module periodically queries Aria2 for the progress of the entries using the bt strategy. When a task is found finished, the zip file is extracted and the status field in the database is set to finished. If the download progress is found to be 0 three times in a row, the seed is considered dead: a mode-switch marker is added to the entry so that the first-stage module handles it in image download mode next time.
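The decision logic is small enough to sketch on its own (the function and value names here are ours; the three-strike threshold matches the description above):

```python
def next_action(status, completed_len, check_count, max_checks=3):
    """Decide what to do with one bt entry after polling Aria2:
    unzip finished tasks, give a stalled task up to `max_checks` strikes,
    then switch it to image download mode."""
    if status == 'finished':
        return 'unzip'
    if completed_len == 0 and status != 'waiting':
        if check_count + 1 >= max_checks:
            return 'switch-to-image-mode'  # seed considered dead
        return 'retry'
    return 'wait'
```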

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
filename: checkdalive.py

Created on Mon Sep 23 21:20:09 2019

@author: qjfoidnh
"""

import os
from settings import current_path
os.chdir(current_path)
from sql_module import mq
import requests
from settings import aria2url, aria2token
import time
import json
import base64
import zipfile
import filetype

# Query Aria2 for the status of all active and waiting tasks
def getinfos():
    id_str = "AriaNg_%s_0.043716476479668254"%str(int(time.time())) # any unique string works; no particular format is required
    id = str(base64.b64encode(id_str.encode('utf-8')), 'utf-8')
    id_str2 = "AriaNg_%s_0.053716476479668254"%str(int(time.time()))
    id2 = str(base64.b64encode(id_str2.encode('utf-8')), 'utf-8')
    data = json.dumps({"jsonrpc":"2.0","method":"aria2.tellActive","id":id,"params":["token:%s"%aria2token,["gid","totalLength","completedLength","uploadSpeed","downloadSpeed","connections","numSeeders","seeder","status","errorCode","verifiedLength","verifyIntegrityPending","files","bittorrent","infoHash"]]})
    data2 = json.dumps({"jsonrpc":"2.0","method":"aria2.tellWaiting","id":id2,"params":["token:%s"%aria2token,0,1000,["gid","totalLength","completedLength","uploadSpeed","downloadSpeed","connections","numSeeders","seeder","status","errorCode","verifiedLength","verifyIntegrityPending","files","bittorrent","infoHash"]]})
    req = requests.post(aria2url, data)
    req2 = requests.post(aria2url, data2)
    if req.status_code!=200:
        return
    else:
        status_dict = dict()
        results = req.json().get('result')
        results2 = req2.json().get('result')
        if results2: # guard against a failed second query
            results.extend(results2)
        for res in results:
            status = res.get('status')
            completelen = int(res.get('completedLength'))
            totallen = int(res.get('totalLength'))
            filepath = res.get('files')[0].get('path').replace('//','/').replace("'","\\'")
            if completelen==totallen and completelen!=0:
                status = 'finished'
            status_dict[res.get('gid')] = {'status':status, 'completelen':completelen, 'filepath':filepath}
    return status_dict

# Walk the bt entries in the database and act on their Aria2 status
def getdownloadings(status_dict):
    item = mq.fetchoneurl(mode='aria2')
    checkingidlist = list()
    while item:
        if item=='toomany':
            item = mq.fetchoneurl(mode='aria2')
            continue
        gid = item.get('oldpage')
        gid = gid or 'default'
        complete = status_dict.get(gid, {'status':'finished'})
        if complete.get('status')=='finished':
            mq.updateurl(item['id'])
            filepath = item['filepath']
            flag = unzipfile(filepath)
            removetask(taskid=gid)
        elif complete.get('completelen')==0 and complete.get('status')!='waiting':
            mq.reseturl(item['id'], 'checking', count=1) # no progress: count a strike
            checkingidlist.append(item['id'])
        else:
            mq.reseturl(item['id'], 'checking')
            checkingidlist.append(item['id'])
        item = mq.fetchoneurl(mode='aria2')
    for id in checkingidlist:
        mq.reseturl(id, 'aria2')

# Extract a downloaded zip archive
def unzipfile(filepath):
    kind = filetype.guess(filepath)
    if kind is None or kind.extension!='zip':
        return None
    f = zipfile.ZipFile(filepath, 'r')
    flist = f.namelist()
    depstruct = [len(file.strip('/').split('/')) for file in flist]
    # loose top-level files are extracted next to the zip; archives wrapped
    # in a single folder are extracted into a folder named after the zip
    if len(depstruct)>1 and depstruct[0]==1 and depstruct[1]!=1:
        try:
            f.extractall(path=os.path.dirname(filepath))
        except Exception:
            return None
        else:
            return True
    else:
        try:
            f.extractall(path=os.path.splitext(filepath)[0])
        except Exception:
            return None
        else:
            return True

# Remove a finished task from Aria2's queue so later tasks are not blocked
def removetask(taskid=None, address=None):
    id_str = "AriaNg_%s_0.043116476479668254"%str(int(time.time()))
    id = str(base64.b64encode(id_str.encode('utf-8')), 'utf-8')
    if taskid:
        data = json.dumps({"jsonrpc":"2.0","method":"aria2.forceRemove","id":id,"params":["token:%s"%aria2token,taskid]})
    if address:
        taskid = mq.fetchonegid(address)
        if taskid:
            data = json.dumps({"jsonrpc":"2.0","method":"aria2.forceRemove","id":id,"params":["token:%s"%aria2token,taskid]})
        else:
            data = json.dumps({"jsonrpc":"2.0","method":"aria2.forceRemove","id":id,"params":["token:%s"%aria2token,"default"]})
    req = requests.post(aria2url, data)


if __name__=="__main__":
    res = getinfos()
    if res:
        getdownloadings(res)

Utility module

This module defines helpers that are called in several places, such as Get_page(), which fetches and parses a page.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
filename: tools.py

Created on Mon Sep 23 20:57:31 2019

@author: qjfoidnh
"""


import requests
import time
from bs4 import BeautifulSoup as Bs
from settings import head, aria2url, aria2token
from settings import proxies
import json
import base64

# A wrapper around requests.get that returns a BeautifulSoup object
def Get_page(page_address, headers={}, cookie=None):
    pic_page = None
    innerhead = head.copy()
    innerhead.update(headers)
    try:
        pic_page = requests.get(page_address, headers=innerhead, proxies=proxies, cookies=cookie, verify=False)
    except Exception:
        return None
    if not pic_page:
        return None
    pic_page.encoding = 'utf-8'
    text_response = pic_page.text
    content = Bs(text_response, 'lxml')

    return content

# Send a torrent file to the Aria2 service, base64-encoded
def postorrent(path, dir):
    with open(path, 'rb') as f:
        b64str = str(base64.b64encode(f.read()), 'utf-8')
    url = aria2url
    id_str = "AriaNg_%s_0.043716476479668254"%str(int(time.time())) # any value works, as long as each call generates a fresh id
    id = str(base64.b64encode(id_str.encode('utf-8')), 'utf-8').strip('=')
    req = requests.post(url, data=json.dumps({"jsonrpc":"2.0","method":"aria2.addTorrent","id":id,"params":["token:"+aria2token, b64str,[],{'dir':dir, 'allow-overwrite':"true"}]}))
    if req.status_code==200:
        return req.json().get('result')
    else:
        return False

# Download an image file; returns the HTTP status code (or the exception on failure)
def Download_img(page_address, filepath, cookie=None):

    try:
        pic_page = requests.get(page_address, headers=head, proxies=proxies, cookies=cookie, timeout=8, verify=False)
        if pic_page.status_code==200:
            pic_content = pic_page.content
            with open(filepath, 'wb') as file:
                file.write(pic_content)
        return pic_page.status_code
    except Exception as e:
        return e

Logging module

A thin wrapper around logging; writing logs to a file makes it easy to monitor the running service.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
filename: logger_module.py

Created on Mon Sep 23 21:18:37 2019

@author: qjfoidnh
"""

import logging


LOG_FORMAT = "%(asctime)s - %(filename)s -Line: %(lineno)d - %(levelname)s: %(message)s"
logging.basicConfig(filename='downloadsys.log', level=logging.INFO, format=LOG_FORMAT, filemode='a')

logger = logging.getLogger(__name__)

Settings

This file defines configuration such as the proxy address, cookie values, and download paths.

Although no login is required, requests without cookies are easily treated as malicious by E-Hentai. Open the developer tools in a browser, visit any E-Hentai link, and read the Cookie field from the Network tab. If you would rather not paste cookies by hand, consider rewriting Get_page in tools.py around requests' Session object, which keeps cookies automatically.
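A sketch of that Session-based approach (make_session is a hypothetical helper; pass it the raw Cookie header copied from the browser, and cookies set by the site afterwards persist across requests on their own):

```python
import requests

def make_session(cookie_header=""):
    # One shared Session: cookies sent back by the server are stored
    # automatically, and any cookies pasted from the browser are pre-loaded.
    s = requests.Session()
    s.headers.update({'user-agent': 'Mozilla/5.0'})
    for pair in filter(None, (p.strip() for p in cookie_header.split(';'))):
        key, _, value = pair.partition('=')
        s.cookies.set(key, value)
    return s

# s = make_session("nw=1; ipb_member_id=xxxxxx")
# page = s.get("https://e-hentai.org/...")  # no explicit cookies= needed
```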

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
filename: settings.py

Created on Mon Sep 23 21:06:33 2019

@author: qjfoidnh
"""

abs_path = "/home/user/e-hentai/"
# directory for downloaded files. Linux-style path shown here; on Windows mind
# backslash escaping. The directory must exist beforehand and the trailing '/' is required

current_path = "/home/user/e-hentai/"
# location of the project code; not necessarily the same as the directory above

# aria2 configuration
aria2url = "http://127.0.0.1:6800/jsonrpc"
aria2token = "12345678"

# common browser headers
head = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}

cookie_raw_ehentai = '''nw=1; __cfduid=xxxxxxxxxxxx; ipb_member_id=xxxxxx; ipb_pass_hash=xxxxx;xxxxxxxx'''
# cookies copied from the browser look roughly like this; likewise for exhentai

cookie_raw_exhentai = '''xxxxxxxx'''

# proxy address; E-Hentai may require a proxy to reach, and only http proxies
# are supported here. If you can access the site directly, set proxies = None
proxies = {"http": "http://localhost:10808", "https": "http://localhost:10808"}
# proxies = None


def cookieToDict(cookie):
    '''
    Convert a cookie string copied from the browser into a dict
    '''
    itemDict = {}
    items = cookie.split(';')
    for item in items:
        key, _, value = item.strip().partition('=') # partition, in case a value itself contains '='
        itemDict[key] = value
    return itemDict

def cookie_ehentai(address):
    if "e-hentai" in address:
        return cookieToDict(cookie_raw_ehentai)
    elif "exhentai" in address:
        return cookieToDict(cookie_raw_exhentai)
    else:
        return cookieToDict(cookie_raw_ehentai)

Main function

The main function takes entries from the database and updates them according to the results.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
filename: access.py

Created on Mon Sep 23 20:18:01 2019

@author: qjfoidnh
"""

import time
from sql_module import mq
from init_process import DownEngine, fixexcepts
import os
from logger_module import logger


if __name__ =="__main__":

    engine = DownEngine()
    On = True
    print("Process %d started..." % os.getpid())
    while On:

        # first handle entries whose downloads ended incomplete
        item = mq.fetchoneurl(mode="except")
        if type(item)==Exception:
            logger.error(item)
        elif not item:
            pass
        else:
            img_srcs = item['failed_links']; filepaths = item['failed_paths']; itemid = item['id']; raw_address = item['raw_address']

            res = fixexcepts(itemid, img_srcs, filepaths)
            if type(res)!=int:
                logger.error(res)
                continue
            elif res==0:
                logger.info("process %d: entry %d fully repaired. URL: %s" %(os.getpid(), itemid, raw_address))
            elif res>0:
                logger.warning("process %d: entry %d repair incomplete, %d images still failing. URL: %s" %(os.getpid(), itemid, res, raw_address))


        item = mq.fetchoneurl()
        if item=='toomany': # the entry exceeded its retry limit
            continue
        if type(item)==Exception:
            logger.error(item)
            continue
        elif not item:
            time.sleep(600)
            continue
        else:
            raw_address = item['raw_address']; itemid = item['id']
            logger.info("process %d: entry %d download started. URL: %s" %(os.getpid(), itemid, raw_address))
            res = engine.engineEhentai(raw_address)
            if type(res)==Exception:
                logger.warning("process %d: entry %d engine error: %s" %(os.getpid(), itemid, str(res)))
                mq.reseturl(itemid, 'skipped')
                continue

            if type(res)==tuple and len(res)==2:
                response = mq.fixunfinish(itemid, res[0], res[1])
                if response==0:
                    logger.warning("process %d: entry %d partly failed; marked as except. URL: %s" %(os.getpid(), itemid, raw_address))
                else:
                    logger.warning("process %d: entry %d partly failed, and marking the status field also failed: %s. URL: %s" %(os.getpid(), itemid, str(response), raw_address))

            elif type(res)==dict:
                if 'taskid' in res:
                    response = mq.reseturl(itemid, 'aria2')
                    mq.replaceurl(itemid, res['taskid'], item['raw_address'], filepath=res['filepath'])

            elif res==1:
                response = mq.updateurl(itemid)
                if type(response)==int:
                    logger.info("process %d: entry %d finished downloading. URL: %s" %(os.getpid(), itemid, raw_address))
                else:
                    logger.warning("process %d: entry %d finished downloading, but updating the status field failed: %s. URL: %s" %(os.getpid(), itemid, str(response), raw_address))
            elif res==2:
                response = mq.reseturl(itemid, 'pending', count=1)
                if type(response)==int:
                    logger.info("process %d: entry %d initial request failed; status reset. URL: %s" %(os.getpid(), itemid, raw_address))
                else:
                    logger.warning("process %d: entry %d initial request failed, and resetting the status field also failed. URL: %s" %(os.getpid(), itemid, raw_address))
            elif res==9:
                response = mq.reseturl(itemid, 'aria2')
                if type(response)==int:
                    logger.info("process %d: entry %d handed to the aria2 downloader. URL: %s" %(os.getpid(), itemid, raw_address))
                else:
                    logger.warning("process %d: entry %d handed to aria2, but updating the status field failed. URL: %s" %(os.getpid(), itemid, raw_address))

            time.sleep(10)

Usage

Put all the files in one directory, fill in the settings, and run the main function.

Also add checkdalive.py to Task Scheduler or crontab so that it runs periodically (every half hour or hour is recommended).

Then simply write the gallery links into the raw_address field with status set to pending (a separate script or plugin can do this; the author built a Chrome extension that queues an E-Hentai page with one click). The service picks the entries up in its polling loop, and before long the files appear under the configured directory.
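Queueing a gallery by hand then amounts to a single INSERT (using the example table name from this article; the URL is a placeholder):

```sql
INSERT INTO comic_urls (raw_address, status, checktimes, inserttime)
VALUES ('https://e-hentai.org/g/xxxxxxx/yyyyyyyyyy/', 'pending', 0, now());
```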

Afterword

This crawler was split out of a larger system the author built, which is why the overall logic is relatively complex; over its long development it also accumulated some redundant or suboptimal code. Feel free to strip the features you do not need, or to optimize.

For example, if throughput is not a concern, you can drop the Aria2 path entirely and use image download mode everywhere; an in-memory store such as Redis would actually suit the failed-link bookkeeping better than MySQL; and a check for expired proxies or banned IPs could be added, among other things.

For anyone with some crawling experience the code itself is not complicated; the real substance lies in the design and in the handling of all kinds of failures. Compared with other E-Hentai crawlers the author has seen, this one holds up well in stability and extensibility.

 

