Scrapy is a mature crawling framework, and once you try it you will find it far less difficult than you might imagine. Even for small projects, Scrapy can be more convenient, simpler, and more efficient than requests, urllib, or urllib2. Without further ado, this post walks through how to use Scrapy to crawl images from the Meizitu site and save them to your hard drive. Installing Python and Scrapy, and Scrapy's internals, are not covered here; consult Google or Baidu to learn more.
一. Development tools
PyCharm 2017
Python 2.7
Scrapy 1.5.0
requests
二. The crawling process
1. Create the mzitu project
From the "E:\Code\PythonSpider>" directory, run the scrapy startproject mzitu command to create a new crawler project:
```
scrapy startproject mzitu
```
After the command completes, the generated directory structure looks like this:
```
├── mzitu
│   ├── mzitu
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       ├── __init__.py
│   │       └── Mymzitu.py
│   └── scrapy.cfg
```
2. Enter the mzitu project and edit items.py
Define title, which stores the name of the image gallery (used as the directory name).
Define img, which stores the image URL.
Define name, which stores the image file name.
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class MzituItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    img = scrapy.Field()
    name = scrapy.Field()
```
3. Edit spiders/Mymzitu.py
```python
# -*- coding: utf-8 -*-
import scrapy
from mzitu.items import MzituItem
from lxml import etree
import requests
import sys
reload(sys)
sys.setdefaultencoding('utf8')


class MymzituSpider(scrapy.Spider):
    def get_urls():
        url = 'http://www.mzitu.com'
        headers = {}
        headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
        r = requests.get(url, headers=headers)
        html = etree.HTML(r.text)
        urls = html.xpath('//*[@id="pins"]/li/a/@href')
        return urls

    name = 'Mymzitu'
    allowed_domains = ['www.mzitu.com']
    start_urls = get_urls()

    def parse(self, response):
        item = MzituItem()
        #item['title'] = response.xpath('//h2[@class="main-title"]/text()')[0].extract()
        item['title'] = response.xpath('//h2[@class="main-title"]/text()')[0].extract().split('(')[0]
        item['img'] = response.xpath('//div[@class="main-image"]/p/a/img/@src')[0].extract()
        item['name'] = response.xpath('//div[@class="main-image"]/p/a/img/@src')[0].extract().split('/')[-1]
        yield item

        next_url = response.xpath('//div[@class="pagenavi"]/a/@href')[-1].extract()
        if next_url:
            yield scrapy.Request(next_url, callback=self.parse)
```
We want to crawl the "latest" galleries on the Meizitu site, whose main URL is http://www.mzitu.com. Inspecting the page source shows that each gallery's URL sits inside an <li> tag; the get_urls function above extracts them and returns a list of URLs. One point worth stressing: to write crawlers in Python you must master at least one of re, XPath, or Beautiful Soup, otherwise you cannot even get started. Here we use XPath to extract the URLs; both lxml and Scrapy support it.
```python
def get_urls():
    url = 'http://www.mzitu.com'
    headers = {}
    headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
    r = requests.get(url, headers=headers)
    html = etree.HTML(r.text)
    urls = html.xpath('//*[@id="pins"]/li/a/@href')
    return urls
```
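To see what this XPath selection does in isolation, here is a minimal stand-alone sketch using only the standard library's xml.etree.ElementTree on a hand-written fragment. The fragment and its URLs are made up for illustration, and ElementTree only supports a limited XPath subset and strict XML, so for real (often malformed) HTML pages you would keep using lxml as above:

```python
import xml.etree.ElementTree as ET

# A simplified, hypothetical stand-in for the site's gallery list markup
snippet = """
<ul id="pins">
  <li><a href="http://www.mzitu.com/101"></a></li>
  <li><a href="http://www.mzitu.com/102"></a></li>
</ul>
"""

root = ET.fromstring(snippet)
# Equivalent in spirit to //*[@id="pins"]/li/a/@href:
# select each <li>'s <a> child and read its href attribute
urls = [a.get('href') for a in root.findall('./li/a')]
print(urls)
```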
name defines the spider's name, allowed_domains defines the list of domains the spider is allowed to crawl, and start_urls defines the list of URLs to start crawling from.
```python
name = 'Mymzitu'
allowed_domains = ['www.mzitu.com']
start_urls = get_urls()
```
Parse each image detail page to get the gallery title, the image URL, and the image file name, then grab the next-page link and keep crawling in a loop:
```python
def parse(self, response):
    item = MzituItem()
    #item['title'] = response.xpath('//h2[@class="main-title"]/text()')[0].extract()
    item['title'] = response.xpath('//h2[@class="main-title"]/text()')[0].extract().split('(')[0]
    item['img'] = response.xpath('//div[@class="main-image"]/p/a/img/@src')[0].extract()
    item['name'] = response.xpath('//div[@class="main-image"]/p/a/img/@src')[0].extract().split('/')[-1]
    yield item

    next_url = response.xpath('//div[@class="pagenavi"]/a/@href')[-1].extract()
    if next_url:
        yield scrapy.Request(next_url, callback=self.parse)
```
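The two split() calls in parse() carry all the naming logic, so they are worth seeing on their own. A minimal sketch with made-up sample strings (the real values come from the page's h2 title, which ends in a page counter in fullwidth parentheses, and from the img src attribute):

```python
# Hypothetical values standing in for the extracted page strings
title_text = u'Gallery title(28)'  # <h2 class="main-title"> text with page counter
img_url = 'http://i.meizitu.net/2018/01/07b01.jpg'  # made-up <img src> URL

# Keep only the gallery name: cut at the fullwidth '(' before the counter
title = title_text.split(u'(')[0]
# Use the last path segment of the image URL as the file name
name = img_url.split('/')[-1]

print(title)  # Gallery title
print(name)   # 07b01.jpg
```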
4. Edit pipelines.py to download the images
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import requests
import os

class MzituPipeline(object):
    def process_item(self, item, spider):
        headers = {
            'Referer': 'http://www.mzitu.com/'
        }
        local_dir = 'E:\\data\\mzitu\\' + item['title']
        local_file = local_dir + '\\' + item['name']
        if not os.path.exists(local_dir):
            os.makedirs(local_dir)
        with open(local_file, 'wb') as f:
            f.write(requests.get(item['img'], headers=headers).content)
        return item
```
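The pipeline above builds Windows paths by hand with string concatenation. A portable sketch of the same directory-and-file logic using os.path.join (the helper name and the temp base directory are my own for illustration, not part of the project):

```python
import os
import tempfile

def build_local_path(base_dir, title, name):
    # Create base_dir/title/ if needed and return the full file path,
    # mirroring what MzituPipeline does with '\\' concatenation
    local_dir = os.path.join(base_dir, title)
    if not os.path.exists(local_dir):
        os.makedirs(local_dir)
    return os.path.join(local_dir, name)

base = tempfile.mkdtemp()
path = build_local_path(base, 'some-gallery', '01.jpg')
print(path)
```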
5. Add a RotateUserAgentMiddleware class to middlewares.py
```python
import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # the default user_agent_list composes chrome,IE,firefox,Mozilla,opera,netscape
    # for more user agent strings, you can find it in http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
```
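One detail of process_request worth noting: setdefault only sets the User-Agent header when none is present, so a value set earlier in the request cycle is not overwritten. The same semantics can be sketched with a plain dict and placeholder agent strings (the "UA-..." values are stand-ins, not real user agents):

```python
import random

# Placeholder strings standing in for real browser user-agent strings
user_agent_list = ["UA-chrome", "UA-firefox", "UA-opera"]

headers = {}
headers.setdefault('User-Agent', random.choice(user_agent_list))
chosen = headers['User-Agent']

# A second setdefault leaves the existing value untouched
headers.setdefault('User-Agent', 'UA-other')
print(headers['User-Agent'])
```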
6. Configure settings.py
```python
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

DOWNLOADER_MIDDLEWARES = {
    'mzitu.middlewares.MzituDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'mzitu.middlewares.RotateUserAgentMiddleware': 400,
}
```
7. Run the crawler
Enter the E:\Code\PythonSpider\mzitu directory and run the scrapy crawl Mymzitu command to start the crawler:
The run results and the complete code are available at: https://github.com/Eivll0m/PythonSpider/tree/master/mzitu