Today, while reviewing Liao's multithreading material, I saw that someone in the comments had written a multithreaded spider: http://www.tendcode.com/article/jiandan-meizi-spider-2/. I clicked through, and the analysis is very thorough, but the code runs to nearly 200 lines.
So I studied the site myself and thought: emmmm, selenium + PhantomJS would handle this directly, and I wrote some code. Then I discovered that, wow, selenium no longer supports PhantomJS, since Chrome and Firefox now ship with headless modes. After reading through a few blogs, I ended up scraping the site with this:
```python
import unittest
import requests
from random import randint
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class ooxx_spider(unittest.TestCase):

    def setUp(self):
        # headless Chrome replaces the now-unsupported PhantomJS
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--disable-gpu')
        self.driver = webdriver.Chrome('E:/chromedriver.exe', chrome_options=chrome_options)

    def test_spider(self):
        for i in range(1, 80):
            url = 'http://jandan.net/ooxx/page-' + str(i)
            self.driver.get(url)
            print(url)
            # every image inside the comment list
            elems = self.driver.find_elements_by_xpath('//*[@class="commentlist"]/li/div/div/div/p/img')
            for j in elems:
                self.save_img(j.get_attribute('src'))
            print('Page {} scraped successfully'.format(i))

    def save_img(self, res):
        suffix = res.split('.')[-1]
        destination = 'picture/' + str(randint(1, 1000)) + str(randint(1, 1000)) + '.' + suffix
        r = requests.get(res)
        with open(destination, 'wb') as f:
            f.write(r.content)

    def tearDown(self):
        self.driver.close()


if __name__ == '__main__':
    unittest.main()
```
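A side note on `save_img`: naming files with two random ints can collide and tells you nothing about the source. A small sketch of an alternative (my own variation, not from the original post) derives a stable filename from the image URL itself, so re-runs map the same image to the same file:

```python
from urllib.parse import urlparse
import os


def filename_from_url(url):
    """Derive a stable local filename from an image URL.

    Uses the last path segment of the URL, so the same image
    always maps to the same file and duplicates are avoided.
    """
    path = urlparse(url).path        # e.g. '/large/abc123.jpg'
    name = os.path.basename(path)    # 'abc123.jpg'
    return 'picture/' + name


# hypothetical image URL, for illustration only
print(filename_from_url('http://wx1.sinaimg.cn/large/abc123.jpg'))
# picture/abc123.jpg
```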
Following up with the multiprocessing code.
The core code:
```python
from multiprocessing import Pool

def test_multiscraping(self):
    p = Pool()  # default size is the number of CPU cores; e.g. pass Pool(2) for a dual core
    # assuming 4 worker processes here, so range(5): the 5th task waits for a free worker
    for i in range(5):
        p.apply_async(scraping, args=(i,))
    p.close()
    p.join()
```
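Since the snippet above omits the `scraping` function, here is a self-contained sketch of the same `Pool.apply_async` pattern. The page-range split and the `scraping` body are my assumptions for illustration, not the original post's code:

```python
from multiprocessing import Pool

PAGES_PER_WORKER = 16  # hypothetical split: 80 pages across 5 tasks


def scraping(i):
    """Task i handles pages [i*16 + 1, (i+1)*16].

    A real worker would drive its own headless Chrome instance
    over these pages; here we just return the assigned range.
    """
    start = i * PAGES_PER_WORKER + 1
    end = (i + 1) * PAGES_PER_WORKER
    return (start, end)


if __name__ == '__main__':
    p = Pool(4)  # 4 worker processes
    results = [p.apply_async(scraping, args=(i,)) for i in range(5)]
    p.close()    # no more tasks will be submitted
    p.join()     # wait for all workers to finish
    for r in results:
        print(r.get())
```

`apply_async` returns an `AsyncResult` immediately; `close()` plus `join()` then blocks until every submitted task has finished.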
My CPU is too weak for this; I'll test it on a classmate's machine tonight (shedding a poor man's tears).