Installing Scrapy
- Linux
  - pip install scrapy
- Windows
  - pip install wheel
  - Download the matching Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
  - cd into the directory of the downloaded file and run: pip install <downloaded filename>
  - pip install pywin32
  - pip install scrapy
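- To confirm the installation succeeded, you can check the installed version from the command line; a quick sketch:

scrapy version                                          # prints the installed Scrapy version
python -c "import scrapy; print(scrapy.__version__)"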
Creating the project and spider file
- Create the project
  scrapy startproject crawPro
- Create the spider file
  Enter the project directory: cd crawPro
  scrapy genspider -t crawl craw5i5j www.xxx.com  # www.xxx.com is a placeholder start URL; it is commented out in the spider file later
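- After these two commands, the generated project layout should look roughly like this (craw5i5j.py is the spider created by genspider):

crawPro/
    scrapy.cfg            # deploy configuration
    crawPro/
        __init__.py
        items.py          # item field definitions
        middlewares.py    # downloader middlewares (the UA pool goes here)
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            craw5i5j.py   # the CrawlSpider edited below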
Writing the spider file
- Comment out the placeholder start URL (in the code below, allowed_domains is commented out and start_urls points at the real listing page).
- Because the first page and the following pages use different URL patterns, add a second rule/link extractor; both link extractors use regular expressions.
- Parameter follow=True means the spider keeps following matching links, so every page number is reached.
- Parameter callback='parse_item': the response of every extracted URL is parsed by the parse_item method.
- Parameter response: the returned data is a Selector; use extract_first() to take the first value and extract() to take all values, returned as a list (a scrapy shell sketch for trying these selectors follows the code listings below).
- File items.py must define the item fields; see the code.
- Import the item (from crawPro.items import CrawproItem), instantiate it with item = CrawproItem(), and load the parsed values into it.
- Finally, yield item to hand the populated item on to the pipeline.
- craw5i5j.py code

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor  # link extractor
from scrapy.spiders import CrawlSpider, Rule     # Rule: rule parser object
from crawPro.items import CrawproItem


class Craw5i5jSpider(CrawlSpider):
    name = 'craw5i5j'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://nj.5i5j.com/xiaoqu/pukouqu/']

    # Link extractors: with follow=False they only extract the matching links
    # from the start-URL pages. The allow parameter is a regular expression.
    link = LinkExtractor(allow=r'^https://nj.5i5j.com/xiaoqu/pukouqu/n\d+/$')
    link1 = LinkExtractor(allow=r'^https://nj.5i5j.com/xiaoqu/pukouqu/$')

    rules = (
        # Rule objects: a LinkExtractor instance plus the callback that parses each response
        Rule(link, callback='parse_item', follow=True),
        Rule(link1, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for li in response.xpath('//div[@class="list-con-box"]/ul/li'):
            xq_name = li.xpath(".//h3[@class='listTit']/a/text()").extract_first().strip()
            xq_chengjiao = li.xpath(".//div[@class='listX']/p/span[1]/a/text()").extract_first().strip()
            xq_danjia = li.xpath(".//div[@class='listX']/div[@class='jia']/p[@class='redC']//text()").extract_first().strip()
            xq_zongjia = li.xpath(".//div[@class='listX']/div[@class='jia']/p[2]/text()").extract_first().strip()

            item = CrawproItem()
            item['xq_name'] = xq_name
            item['xq_chengjiao'] = xq_chengjiao
            item['xq_danjia'] = xq_danjia
            item['xq_zongjia'] = xq_zongjia
            yield item
- items.py code

import scrapy


class CrawproItem(scrapy.Item):
    # define the fields for your item here like:
    xq_name = scrapy.Field()
    xq_chengjiao = scrapy.Field()
    xq_danjia = scrapy.Field()
    xq_zongjia = scrapy.Field()
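- The XPath expressions used in parse_item can be tested interactively before running the full crawl. A sketch using scrapy shell (the selectors mirror the ones in craw5i5j.py; the page structure may of course have changed since):

scrapy shell "https://nj.5i5j.com/xiaoqu/pukouqu/"
>>> li = response.xpath('//div[@class="list-con-box"]/ul/li')[0]
>>> li.xpath(".//h3[@class='listTit']/a/text()").extract_first()      # first matching value
>>> li.xpath(".//div[@class='listX']/p/span[1]/a/text()").extract()   # all matching values, as a list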
Writing the pipeline file
- Override the parent method def open_spider(self, spider): it opens the output file once, so the file is not reopened for every item (for a database, this is where the connection would be opened).
- Override the parent method def close_spider(self, spider): it closes the file opened in the first step (for a database, close the connection here).
- The method process_item formats each item and writes it to the file (or performs the database write); a database variant is sketched after the code below.
- Code

class CrawproPipeline(object):
    fp = None

    # Overridden from the parent class: called only once, so the file is opened once.
    def open_spider(self, spider):
        self.fp = open("1.txt", 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(item["xq_name"] + "\t" + item["xq_chengjiao"] + "\t"
                      + item["xq_danjia"] + "\t" + item["xq_zongjia"] + "\n")
        return item

    def close_spider(self, spider):
        self.fp.close()
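- As noted above, the same three hooks map directly onto a database: open the connection in open_spider, write in process_item, and close in close_spider. A minimal sketch using the standard-library sqlite3 module (the pipeline class, database file, and table name are illustrative, not part of the original project; it would also need to be registered in ITEM_PIPELINES):

import sqlite3


class CrawproSqlitePipeline(object):
    # illustrative database variant of the file pipeline above
    conn = None
    cur = None

    def open_spider(self, spider):
        # open the connection once, when the spider starts
        self.conn = sqlite3.connect('xiaoqu.db')
        self.cur = self.conn.cursor()
        self.cur.execute(
            'CREATE TABLE IF NOT EXISTS xiaoqu '
            '(xq_name TEXT, xq_chengjiao TEXT, xq_danjia TEXT, xq_zongjia TEXT)')

    def process_item(self, item, spider):
        # one INSERT per item yielded by the spider
        self.cur.execute(
            'INSERT INTO xiaoqu VALUES (?, ?, ?, ?)',
            (item['xq_name'], item['xq_chengjiao'], item['xq_danjia'], item['xq_zongjia']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # close the connection once, when the spider finishes
        self.conn.close()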
Configuring the middleware
- Set up a user-agent (UA) pool.
- In middlewares.py, find the method def process_request(self, request, spider): and add the UA pool there.

# Inside CrawproDownloaderMiddleware in middlewares.py.
# Note: add "import random" at the top of middlewares.py for random.choice().
def process_request(self, request, spider):
    # Called for each request that goes through the downloader middleware.

    # Must either:
    # - return None: continue processing this request
    # - or return a Response object
    # - or return a Request object
    # - or raise IgnoreRequest: process_exception() methods of
    #   installed downloader middleware will be called
    user_agents = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
        'Opera/8.0 (Windows NT 5.1; U; en)',
        'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
        'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
    ]
    # pick a random User-Agent for every outgoing request
    request.headers['User-Agent'] = random.choice(user_agents)
    # print(request.headers)
    return None
Configuring settings.py
- ROBOTSTXT_OBEY = False: do not obey the robots.txt protocol.
- Set the user agent: USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
- Set a download delay suited to the target site so the crawl is not too fast: DOWNLOAD_DELAY = 3
- Enable the downloader middleware (a consolidated settings.py excerpt follows at the end of this section)

DOWNLOADER_MIDDLEWARES = {
    'crawPro.middlewares.CrawproDownloaderMiddleware': 543,
}
- Enable the pipeline

ITEM_PIPELINES = {
    'crawPro.pipelines.CrawproPipeline': 300,
}
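- Putting the pieces together, the relevant excerpt of settings.py would look roughly like this (values as described above):

# settings.py (relevant excerpt)
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
ROBOTSTXT_OBEY = False      # do not obey robots.txt
DOWNLOAD_DELAY = 3          # seconds between requests; tune to the target site

DOWNLOADER_MIDDLEWARES = {
    'crawPro.middlewares.CrawproDownloaderMiddleware': 543,
}

ITEM_PIPELINES = {
    'crawPro.pipelines.CrawproPipeline': 300,
}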
Running the spider
- From the command line: scrapy crawl craw5i5j --nolog (suppresses log output)
- From the command line: scrapy crawl craw5i5j (with log output)
- After the run, check that the output file was created (or that the rows appear in the database).
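- Besides the command line, the spider can also be started from a plain Python script, which is handy for debugging in an IDE. A minimal sketch (the file name run.py is arbitrary; place it next to scrapy.cfg):

# run.py -- start the craw5i5j spider programmatically
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == '__main__':
    process = CrawlerProcess(get_project_settings())  # loads settings.py
    process.crawl('craw5i5j')                         # spider name defined in craw5i5j.py
    process.start()                                   # blocks until the crawl finishes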