爬取煎蛋XXOO妹子圖片


今天回憶廖大的多線程的時候,看到下面有人寫了個多線程的爬蟲http://www.tendcode.com/article/jiandan-meizi-spider-2/,點進去看了下,分析的很仔細,寫了接近200行代碼吧

讓后我就研究了一下這個網站,emmmm,selenium + PhantomJS不就直接搞定了嘛,然后就寫了段code:

然后發現,哇,selenium不支持PhantomJS了,因為chrome和firefox自帶了headless的訪問,然后就去各個blog看,最后爬下了這個網站:

 1 import unittest
 2 import requests
 3 import time
 4 import re
 5 from random import randint
 6 from selenium import webdriver
 7 from selenium.webdriver.chrome.options import Options
 8 from selenium.webdriver.common.keys import Keys
 9 
10 class ooxx_spider(unittest.TestCase):
11 
12     def setUp(self):
13         chrome_options = Options()
14         chrome_options.add_argument('--headless')
15         chrome_options.add_argument('--disable-gpu')
16         self.driver = webdriver.Chrome('E:/chromedriver.exe', chrome_options=chrome_options)
17 
18     def test_spider(self):
19         for i in range(1, 80):
20             url = 'http://jandan.net/ooxx/' + 'page-' + str(i)
21             self.driver.get(url)
22             print(url)
23             elem = self.driver.find_elements_by_xpath('//*[@class="commentlist"]/li/div/div/div/p/img')#/li/div/div/div/p/img
24             for j in elem:
25                 self.save_img(j.get_attribute('src'))
26             print('第{}頁爬取成功'.format(i))
27 
28     def save_img(self, res):
29         suffix = res.split('.')[-1]
30         destination = 'picture/' + str(randint(1, 1000)) + str(randint(1, 1000)) + '.'+ suffix
31         r = requests.get(res)
32         with open(destination, 'wb') as f:
33             f.write(r.content)
34 
35     def tearDown(self):
36         self.driver.close()
37 
38 if __name__ == '__main__':
39     unittest.main()

補上多線程的代碼

核心代碼:

1 def test_multiscraping(self):
2         p = Pool()#默認大小是cpu的核數,你可以修改比如說雙核Pool(2)
3         #這里假設我是4個進程,所以range(5)
4         for i in range(5):
5             p.apply_async(scraping, args = (i, ))
6         p.close()
7         p.join()

cpu太垃圾了,晚上回去用同學的cpu測試一下(留下了窮人的眼淚)


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM