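Finals are coming up, but I really don't feel like studying! Aaaaaaargh!!!!
So I wrote a crawler instead. The target site is https://yande.re, which hosts anime artwork (some of the images are a little suggestive........a treat for otaku).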
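The GitHub project is at https://github.com/MyBules/yande_pider
The multithreaded code comes in two versions: one spreads multiple threads across multiple pages, the other runs multiple threads within a single page.
Here is the first version: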
''' Multi-page multithreading: each thread downloads a different set of pages '''
import os          # filesystem helpers
import re          # regular expressions
import threading
import urllib.request

# The author's download directory; change it to suit your machine.
IMG_ROOT = "E:\\Python\\爬蟲\\yande\\img"


# Fetch a URL and return the raw response bytes, or None if the request fails
def open_url(url):
    try:
        req = urllib.request.Request(url)
        req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36")
        with urllib.request.urlopen(req) as response:
            return response.read()
    except Exception:
        print(url + " request failed")
        return None


def mkdir(path):
    '''
    Create the directory if it does not exist yet.
    :param path: directory path
    :return: True if the directory was created, False if it already existed
    '''
    # Strip surrounding whitespace and any trailing backslash
    path = path.strip().rstrip("\\")
    if not os.path.exists(path):
        os.makedirs(path)
        print(path + ' created')
        return True
    else:
        print(path + ' already exists')
        return False


# Download every .jpg linked from one listing page into its own folder
def Yande1(page):
    url = 'https://yande.re/post?page=' + str(page)
    folder = os.path.join(IMG_ROOT, "page" + str(page))
    mkdir(folder)
    html = open_url(url)
    if html is None:
        # The listing page itself could not be fetched, so skip it
        return
    html = html.decode('utf-8', 'ignore')
    img_adds = re.findall(r'<a class="directlink largeimg" href="([^"]+\.jpg)"', html)
    # Files are named 1.jpg, 2.jpg, ... in the order the links appear
    for imgs, img_url in enumerate(img_adds, start=1):
        filename = os.path.join(folder, str(imgs) + '.jpg')
        img_html = open_url(img_url)
        if img_html is None:
            continue
        with open(filename, 'wb') as f:
            f.write(img_html)
        print(img_url + ' downloaded......')


exitflag = 0  # set to a true value to make the worker threads stop early


class myThread(threading.Thread):
    def __init__(self, threadID, name, pages):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.pages = pages

    def run(self):
        print("Starting thread: " + self.name)
        get_img(self.name, self.pages)
        print("Exiting thread: " + self.name)


def get_img(threadname, pages):
    for page in pages:
        if exitflag:
            # threadname is only a string, so returning is what ends the thread
            return
        Yande1(page)


if __name__ == '__main__':
    pages1 = int(input('First page to download: '))
    pages2 = int(input('Last page to download: '))
    mkdir(IMG_ROOT)
    # Split the pages among three threads by page number modulo 3
    list1 = []
    list2 = []
    list3 = []
    for i in range(pages1, pages2 + 1):
        if i % 3 == 0:
            list3.append(i)
        elif i % 3 == 1:
            list1.append(i)
        else:
            list2.append(i)
    threads = [
        myThread(1, "thread-1", list1),
        myThread(2, "thread-2", list2),
        myThread(3, "thread-3", list3),
    ]
    for t in threads:
        t.start()
    # Wait for all three workers before leaving the main thread
    for t in threads:
        t.join()
    print("Exiting main thread")
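The script hard-codes three threads and assigns pages by page number modulo 3. The same idea can be written for any number of threads. This is only a minimal sketch: split_pages and n_workers are names made up for this illustration and are not part of the project.

```python
def split_pages(first_page, last_page, n_workers):
    """Deal page numbers out to n_workers buckets in round-robin order."""
    buckets = [[] for _ in range(n_workers)]
    for page in range(first_page, last_page + 1):
        # (page - 1) % n_workers sends pages 1, 2, 3 to buckets 0, 1, 2,
        # which matches list1, list2, list3 in the script when n_workers is 3.
        buckets[(page - 1) % n_workers].append(page)
    return buckets


if __name__ == "__main__":
    print(split_pages(1, 7, 3))  # prints [[1, 4, 7], [2, 5], [3, 6]]
```

Each bucket could then be handed to one myThread instance in place of list1, list2 and list3.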
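In my tests, the two approaches ran at about the same speed.
The second version is in the GitHub project linked above. If you are here to learn, the code for the second version is still worth a read.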
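The second version is not reproduced in this post. As a rough guess at what "multiple threads within a single page" could look like, and explicitly not the code from the repository, the sketch below downloads all images of one page concurrently with the standard library's ThreadPoolExecutor. The names download_page_images, fetch, max_workers and fake_fetch are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def download_page_images(img_urls, fetch, max_workers=4):
    """Fetch every image URL of one page concurrently.

    fetch is any callable that takes a URL and returns its content;
    the open_url function from the first version could be passed in.
    Returns a list of (url, result) pairs in the same order as img_urls.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetch on several URLs at once
        # but yields the results in input order.
        results = list(pool.map(fetch, img_urls))
    return list(zip(img_urls, results))


if __name__ == "__main__":
    # Stand-in downloader so the sketch runs without network access.
    def fake_fetch(url):
        return None if url.endswith("bad.jpg") else url.encode("ascii")

    for url, data in download_page_images(["a.jpg", "bad.jpg", "c.jpg"], fake_fetch):
        print(url, "failed" if data is None else f"{len(data)} bytes")
```

Passing the downloader in as a parameter keeps the threading logic separate from the network code, so the sketch can be tried with a fake function before pointing it at a real site.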