Scraping data with Scrapy and PhantomJS

Reposted article. 2018-01-08 14:50. Tags: crawler / python / PhantomJS / selenium / scrapy

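Environment: python2.7 + scrapy + selenium + PhantomJS

Goal: test scrapy together with PhantomJS

Target content: pages that load more content through JavaScript

Approach: enable the middleware in the settings file, then modify the process_request function and add the PhantomJS steps inside it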
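
Why does changing process_request work? When a downloader middleware's process_request returns a response object, Scrapy uses that response and skips its own download; when it returns None, the request continues down the chain. The snippet below is only a toy model of that rule in plain Python, not Scrapy's actual code. The names fetch, render_with_browser and plain_download are made up for illustration, and requests are plain dicts.

```python
def fetch(request, middlewares, download):
    """Toy model of a downloader-middleware chain (NOT Scrapy's real code)."""
    for process_request in middlewares:
        response = process_request(request)
        if response is not None:
            # A middleware built the response itself, so the normal download is skipped.
            return response
    # No middleware answered: fall back to the ordinary downloader.
    return download(request)


def render_with_browser(request):
    # Stand-in for the PhantomJS step; it only handles requests flagged as JS-heavy.
    if request.get('js'):
        return 'rendered:' + request['url']
    return None


def plain_download(request):
    # Stand-in for Scrapy's built-in downloader.
    return 'downloaded:' + request['url']
```

In the real middleware the "rendered" branch is where PhantomJS opens the page and its page source is wrapped in an HtmlResponse.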

Step 1:

settings.py

DOWNLOADER_MIDDLEWARES = {
    'dbdm.middlewares.DbdmSpiderMiddleware': 543,
}

The module and class names depend on your project; different names do not affect how this works.
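
As a rough guide to reading that setting: each key is the import path of a middleware class and each value is its order. Scrapy calls process_request in increasing order of that number. The sketch below uses hypothetical names (myproject, PhantomJSMiddleware) in place of the project-specific ones.

```python
# settings.py (sketch; 'myproject' and 'PhantomJSMiddleware' are hypothetical names)
DOWNLOADER_MIDDLEWARES = {
    # '<package>.<module>.<ClassName>': <order>
    'myproject.middlewares.PhantomJSMiddleware': 543,
}
```

The class name itself does not matter to Scrapy. What counts is that the class is listed under DOWNLOADER_MIDDLEWARES and defines process_request.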

Step 2:

---------- PhantomJS enabled by default

middlewares.py

selenium and the other modules used below must be imported at the top of the file:
import time

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

# ... (some code omitted)

    @classmethod
    def process_request(cls, request, spider):
        # if 'PhantomJS' in request.meta:
        driver = webdriver.PhantomJS('E:\\p_python\\Scripts\\phantomjs\\bin\\phantomjs.exe')
        try:
            driver.get(request.url)
            if request.url == 'https://movie.douban.com/tag':
                driver.find_element_by_xpath('//*[@id="app"]/div/div[1]/div[1]/ul[1]/li[5]/span').click()
                time.sleep(5)
                # find_elements_* returns an empty list instead of raising when nothing matches
                if driver.find_elements_by_xpath('//*[@id="app"]/div/div[1]/a'):
                    click_more(driver)
            content = driver.page_source.encode('utf-8')
            # Debug: dump the rendered page to a file
            # print(content)
            # file = open(path.join(d, '1.txt'), 'w')
            # file.write(content)
            # file.close()
        finally:
            # Always shut PhantomJS down, even if the page interaction fails
            driver.quit()
        return HtmlResponse(request.url, encoding='utf-8', body=content, request=request)

def click_more(driver, i=1):
    # Click the "load more" link, wait for the new content, then repeat while the link still exists
    driver.find_element_by_xpath('//*[@id="app"]/div/div[1]/a').click()
    print('%d  click' % i)
    time.sleep(5)
    i = i + 1
    try:
        driver.find_element_by_xpath('//*[@id="app"]/div/div[1]/a')
    except NoSuchElementException:
        print('click Over!!')
        return
    click_more(driver, i)

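The code above is only a test; adapt it to your own project. As written, PhantomJS is launched for every URL by default. You can add a check so that it is launched only when needed.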
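
The recursive click_more helper can also be written as a loop with an upper bound, which avoids deep recursion on pages with many "load more" clicks. This is only a sketch under assumptions: click_until_gone, max_clicks and pause are names introduced here, and the driver is assumed to be a Selenium 2/3 style driver that offers find_elements_by_xpath.

```python
import time

# Same XPath as the "load more" link used in the article's test code.
MORE_BUTTON_XPATH = '//*[@id="app"]/div/div[1]/a'


def click_until_gone(driver, xpath=MORE_BUTTON_XPATH, max_clicks=50, pause=5.0):
    """Click the element at `xpath` until it disappears or `max_clicks` is reached.

    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        # find_elements_* returns an empty list when nothing matches.
        buttons = driver.find_elements_by_xpath(xpath)
        if not buttons:
            break
        buttons[0].click()
        clicks += 1
        # Give the page time to load the extra content before looking again.
        time.sleep(pause)
    return clicks
```

Inside process_request it would be called as click_until_gone(driver) in place of click_more(driver). The cap keeps a page whose button never disappears from blocking the crawl forever.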

----------- Enable PhantomJS only when needed

Check for a key in request.meta

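selenium must be imported at the top of the file:
from scrapy.http import HtmlResponse
from selenium import webdriver

# ... (some code omitted)

    @classmethod
    def process_request(cls, request, spider):
        # Render with PhantomJS only when the spider has set the 'PhantomJS' key
        if 'PhantomJS' in request.meta:
            driver = webdriver.PhantomJS('E:\\p_python\\Scripts\\phantomjs\\bin\\phantomjs.exe')
            driver.get(request.url)
            content = driver.page_source.encode('utf-8')
            driver.quit()
            return HtmlResponse(request.url, encoding='utf-8', body=content, request=request)
        # Otherwise nothing is returned (None), so Scrapy downloads the request as usual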
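
One detail of a presence-only check is worth spelling out: a request whose meta contains the key with the value False still counts as "has the key". If you would rather let the value decide, a small helper along these lines could be used. needs_browser is a hypothetical name, and the meta argument is any dict-like object such as request.meta.

```python
def needs_browser(meta, key='PhantomJS'):
    """Return True only when the flag is present in `meta` and truthy."""
    return bool(meta.get(key, False))
```

In the middleware the condition would then read `if needs_browser(request.meta):`, so a spider can switch rendering off for a single request by setting the key to False.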

The key is set in the spider file:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from phantomjs_test.items import PhantomscrapyItem

class PhantomjsTestSpider(CrawlSpider):
    name = 'phantomjs_test'
    allowed_domains = ['book.com']
    start_urls = ['http://book.com/']
    # all_urls = []   deduplication does not seem to be needed
    rules = (
        # Follow all paginated list pages.
        # The callback is not named 'parse' because CrawlSpider uses parse() internally to apply the rules.
        Rule(LinkExtractor(allow=r'/story/p/[2-9]*'), callback='parse_list', follow=True),
        # Follow all detail pages inside them
        # Rule(LinkExtractor(allow=r'/detail/p/[2-9]*'), callback='parse_item', follow=True),
    )

    # Collect every article URL from a list page
    def parse_list(self, response):
        url_list = response.xpath('//a/@href').extract()
        for url in url_list:
            # urljoin turns relative links into absolute URLs
            request = scrapy.Request(url=response.urljoin(url), callback=self.parse_item, dont_filter=True)
            # Tell the middleware to render this request with PhantomJS
            request.meta['PhantomJS'] = True
            yield request

    def parse_item(self, response):
        item = PhantomscrapyItem()
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        # The XPath was left blank in the original; supply the expression for your own page
        item['bookName'] = response.xpath()
        items = []
        items.append(item)
        return items

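That is the difference between launching PhantomJS by default and launching it only when a condition is met; choose whichever suits the pages you are crawling. The code still needs polishing before it is convenient to use.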
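
One practical note on the list-page callback: href values pulled out of a page are often relative, while a Scrapy Request needs an absolute URL. The sketch below shows how relative links resolve against the page URL using the standard library. It is written for Python 3; on Python 2.7 the import would be `from urlparse import urljoin`. absolute_links is a hypothetical helper and the example paths are made up.

```python
from urllib.parse import urljoin


def absolute_links(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


links = absolute_links(
    'http://book.com/story/p/2',
    ['/detail/p/15', 'http://book.com/detail/p/16'],
)
# links == ['http://book.com/detail/p/15', 'http://book.com/detail/p/16']
```

Inside a spider, response.urljoin(href) does the same job with the response's own URL as the base.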
