(1) Preface
How it works: our code simulates a login request to the website, that is, it submits the form carrying the login credentials (username, password, and so on).
Implementation: to send a POST request while crawling, we use FormRequest, a subclass of Request. If the POST should go out at the very start of the crawl, we override the start_requests() method instead of relying on start_urls (which issues GET requests). A minimal sketch of this pattern follows.
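A minimal sketch of the pattern (the login URL and form field names here are placeholders, not taken from a real site):

import scrapy


class LoginSpider(scrapy.Spider):
    name = 'login_demo'

    def start_requests(self):
        # Overriding start_requests() makes the very first request a POST,
        # replacing the GET requests Scrapy would generate from start_urls.
        yield scrapy.FormRequest(
            url='http://example.com/login',  # placeholder login endpoint
            formdata={'username': 'user', 'password': 'pass'},  # placeholder credentials
            callback=self.after_login,
        )

    def after_login(self, response):
        self.logger.info('Logged in, landed on %s', response.url)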
(2) Simulated login to Renren (example 1)
1. Create the project:
scrapy startproject renren
cd renren
scrapy genspider spider renren.com
2. Modify settings.py:
# -*- coding: utf-8 -*-

# Scrapy settings for renren project
#
# https://doc.scrapy.org/en/latest/topics/settings.html

BOT_NAME = 'renren'

SPIDER_MODULES = ['renren.spiders']
NEWSPIDER_MODULE = 'renren.spiders'

# Do not obey robots.txt rules
ROBOTSTXT_OBEY = False

# Wait 1 second between requests to the same website
DOWNLOAD_DELAY = 1

# Override the default request headers with a browser-like User-Agent
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36',
}

# All remaining generated settings (concurrency, cookies, middlewares,
# pipelines, AutoThrottle, HTTP cache, ...) are left at their commented-out
# defaults and are omitted here.
3. Modify spider.py:
# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['renren.com']
    start_urls = ['http://renren.com/']  # unused once start_requests() is overridden

    def start_requests(self):
        # Send the POST login request first, instead of the default GETs
        url = 'http://www.renren.com/PLogin.do'
        data = {
            'email': '827832075@qq.com',
            'password': '56571218lu',
        }  # build the form data
        request = scrapy.FormRequest(url, formdata=data, callback=self.parse_page)
        yield request

    def parse_page(self, response):
        # Once logged in, request the target profile page with the session cookies
        url2 = 'http://www.renren.com/880792860/profile'
        request = scrapy.Request(url2, callback=self.parse_profile)
        yield request

    def parse_profile(self, response):
        # Write the profile page to a file; the with-block closes it automatically
        with open('baobeier.html', 'w', encoding='utf-8') as f:
            f.write(response.text)
4. Run the spider:
1 #author: "xian" 2 #date: 2018/6/13 3 from scrapy import cmdline 4 cmdline.execute('scrapy crawl spider'.split())
5. Result (we successfully logged in and crawled Bao Bei'er's Renren profile page)
(3) Automatic captcha recognition with the Alibaba Cloud captcha service (service page: https://market.aliyun.com/products/57126001/cmapi014396.html#sku=yuncode=839600006)
Testing the service: we again use the captcha from the Douban login page:
1 #author: "xian" 2 #date: 2018/6/13 3 from urllib import request 4 from base64 import b64encode 5 import requests 6 7 captcha_url = 'https://www.douban.com/misc/captcha?id=oL8chJoRiCTIikzwtEECZNGH:en&size=s' 8 9 request.urlretrieve(captcha_url ,'captcha.png') 10 11 recognize_url = 'http://jisuyzmsb.market.alicloudapi.com/captcha/recognize?type=e' 12 13 formdata = {} 14 with open('captcha.png','rb') as f: 15 data = f.read() 16 pic = b64encode(data) 17 formdata['pic'] = pic 18 19 appcode = '614a1376aa4340b7a159d551d4eb0179' 20 headers = { 21 'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8', 22 'Authorization':'APPCODE ' + appcode, 23 } 24 25 response = requests.post(recognize_url,data = formdata,headers =headers) 26 print(response.json()) #返回json格式
Result: with the Alibaba Cloud service, the captcha was recognized automatically.
(4) Using the Alibaba Cloud service to solve the captcha and simulate logging in to Douban
1. Create the project:

scrapy startproject douban
cd douban
scrapy genspider spider douban.com
2. Modify settings.py, making the same three changes as in the Renren example (a sketch of the excerpt follows this list):
Do not obey the robots.txt protocol
Set the default request headers
Set a delay between requests
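For reference, the corresponding settings.py excerpt, mirroring the Renren settings shown earlier:

# settings.py (relevant excerpt)
ROBOTSTXT_OBEY = False  # do not obey robots.txt

DOWNLOAD_DELAY = 1      # one-second delay between requests

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36',
}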
3. Modify spider.py:
# -*- coding: utf-8 -*-
import scrapy
from urllib import request
from PIL import Image  # image library (used by the manual-recognition fallback below)
from base64 import b64encode  # base64 encoding for the captcha image
import requests


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['douban.com']
    start_urls = ['https://accounts.douban.com/login']  # start URL
    login_url = 'https://accounts.douban.com/login'  # login page URL
    profile_url = 'https://www.douban.com/people/179834288/'  # personal profile page URL
    editsignature_url = 'https://www.douban.com/j/people/179834288/edit_signature'  # signature-editing endpoint

    def parse(self, response):
        formdata = {
            'source': 'None',
            'redir': 'https://www.douban.com',
            'form_email': '827832075@qq.com',
            'form_password': '56571218lu',
            'remember': 'on',
            'login': '登錄',
        }  # the fixed part of the form data
        captcha_url = response.css('img#captcha_image::attr(src)').get()  # captcha image URL
        if captcha_url:  # the login page does not always show a captcha
            captcha = self.recognize_captcha(captcha_url)  # recognize the captcha
            formdata['captcha-solution'] = captcha  # fill in the captcha-solution form field
            captcha_id = response.xpath('//input[@name = "captcha-id"]/@value').get()  # hidden captcha-id form field
            formdata['captcha-id'] = captcha_id
        yield scrapy.FormRequest(url=self.login_url, formdata=formdata, callback=self.parse_after_login)  # submit the form data

    def parse_after_login(self, response):  # callback after the login POST
        if response.url == 'https://www.douban.com':  # a redirect to the home page means the login succeeded
            yield scrapy.Request(self.profile_url, callback=self.parse_profile)  # request the personal profile page
            print('Login succeeded!')
        else:
            print('Login failed!')

    def parse_profile(self, response):  # parse the personal profile page
        print(response.url)
        if response.url == self.profile_url:  # confirm we actually reached the profile page
            ck = response.xpath('//input[@name = "ck"]/@value').get()  # hidden ck token
            formdata = {
                'ck': ck,
                'signature': '積土成山,風雨興焉!',
            }  # build the form data for the new signature
            # Submit the form data. The callback must be specified here: without
            # it the response would fall back to the default parse(), which would
            # try to log in again and end with a login-failure message.
            yield scrapy.FormRequest(self.editsignature_url, formdata=formdata, callback=self.parse_None)
        else:
            print('Failed to reach the profile page!')

    def parse_None(self, response):
        pass

    # Manual alternative: show the captcha image and let a human type it in.
    # def recognize_captcha(self, image_url):
    #     request.urlretrieve(image_url, 'captcha.png')
    #     image = Image.open('captcha.png')
    #     image.show()
    #     captcha = input('Please enter the captcha: ')
    #     return captcha

    def recognize_captcha(self, image_url):  # automatic recognition via the Alibaba Cloud service above; see its user guide
        captcha_url = image_url

        request.urlretrieve(captcha_url, 'captcha.png')  # download the captcha image

        recognize_url = 'http://jisuyzmsb.market.alicloudapi.com/captcha/recognize?type=e'

        formdata = {}
        with open('captcha.png', 'rb') as f:
            data = f.read()
            pic = b64encode(data)
            formdata['pic'] = pic  # base64-encoded image as the 'pic' form field

        appcode = '614a1376aa4340b7a159d551d4eb0179'
        headers = {
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Authorization': 'APPCODE ' + appcode,
        }

        response = requests.post(recognize_url, data=formdata, headers=headers)
        result = response.json()
        code = result['result']['code']  # the recognized captcha text
        return code
Finally, run the spider project.
Create a main.py to make debugging easier:
1 #author: "xian" 2 #date: 2018/6/13 3 from scrapy import cmdline 4 cmdline.execute('scrapy crawl spider'.split())
Run result (excerpt):
We can see that the signature on my profile page has been changed to the one we set (the program ran successfully!)
(5) Summary
1. To send a POST request in Scrapy, the recommended approach is scrapy.FormRequest, passing the form data via its formdata argument.
2. To send a POST request at the very start of the crawl, override the start_requests() method.
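Beyond these two points: when the form lives on a page the spider has already downloaded (as with the Douban login page above), Scrapy's FormRequest.from_response() can build the POST request from the page itself, pre-filling hidden fields such as captcha-id. A minimal sketch, with placeholder credentials:

import scrapy


class FromResponseDemo(scrapy.Spider):
    name = 'from_response_demo'
    start_urls = ['https://accounts.douban.com/login']

    def parse(self, response):
        # from_response() copies the form's hidden <input> fields automatically,
        # so only the visible fields need to be supplied here.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                'form_email': 'you@example.com',  # placeholder credentials
                'form_password': 'your-password',
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        self.logger.info('Landed on %s', response.url)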