Scrapy in Practice (8): Connecting Scrapy to Selenium to Scrape JD.com Product Data

This article is a repost. 2019-01-31 21:31. Category: Python crawlers (Scrapy)

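Goal: using JD.com product data as the example, this article shows how to connect the Scrapy framework to Selenium in order to scrape it.

Background:

JD.com pages are loaded dynamically by JavaScript, so a plain HTTP request does not return the product data we want. We therefore use Selenium to simulate a user's browser, obtain the rendered page source, and then parse that source to extract the data.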

Step 1: Define the fields to extract, that is, set up the items.py file of the Scrapy project.

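import scrapy
from scrapy import Field


class ProductItem(scrapy.Item):
    dp = Field()       # shop name
    title = Field()    # product name
    price = Field()    # product price
    comment = Field()  # number of comments
    url = Field()      # product URL
    type = Field()     # type tag, used to tell kinds of products apart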


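These are the fields we extract, in order: shop name, product name, product price, comment count, product URL, and type (to distinguish what kind of product it is).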


Step 2:
Define which page the crawl starts from, that is, write the spider file of the Scrapy project. The code is as follows:

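from urllib.parse import quote

import scrapy
from scrapy import Request


class JingdongSpider(scrapy.Spider):
    name = 'jingdong'
    # The search pages are served from search.jd.com, so allow jd.com.
    allowed_domains = ['jd.com']
    base_url = 'https://search.jd.com/Search?keyword='

    def start_requests(self):
        for keyword in self.settings.get('KEYWORDS'):
            for page in range(1, self.settings.get('MAX_PAGE') + 1):
                url = self.base_url + quote(keyword)
                # Every page of a keyword shares one URL, so dont_filter=True
                # keeps the duplicate filter from dropping page 2 onwards.
                yield Request(url=url, callback=self.parse,
                              meta={'page': page}, dont_filter=True)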

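The initial URL is the JD.com product search page. The search terms come from KEYWORDS, which is configured as a list in the settings file. Because a search can return many pages, the number of pages to crawl is also configured in the settings file, through the MAX_PAGE variable.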

KEYWORDS=['iPad']
MAX_PAGE=2
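
To see what start_requests produces for these settings, the URL construction can be run on its own. The following is a minimal sketch that uses only the standard library; build_search_requests is a hypothetical helper introduced for illustration and is not part of the project.

```python
from urllib.parse import quote

BASE_URL = 'https://search.jd.com/Search?keyword='


def build_search_requests(keywords, max_page):
    """Return (url, page) pairs in the order the spider would yield them.

    Hypothetical helper: it mirrors the spider's two nested loops
    without creating any Scrapy Request objects.
    """
    pairs = []
    for keyword in keywords:
        for page in range(1, max_page + 1):
            # quote() percent-encodes characters that are unsafe in a URL,
            # which matters for non-ASCII keywords.
            pairs.append((BASE_URL + quote(keyword), page))
    return pairs


if __name__ == '__main__':
    for url, page in build_search_requests(['iPad'], 2):
        print(page, url)
```

All pairs for one keyword carry the same URL and differ only in the page number, which is why the spider passes the page through meta and sets dont_filter=True.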



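If we ran the project at this point, the URLs would go straight to the downloader, which cannot return the data we want, so we handle these requests in a downloader middleware.

Step 3:
Hook Selenium into a downloader middleware that returns the rendered page source directly, so the downloader never fetches the page.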

import time
from logging import getLogger

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class SeleniumMiddleware(object):

    def __init__(self, timeout=None):
        self.logger = getLogger(__name__)
        self.timeout = timeout
        self.browser = webdriver.Chrome()
        self.browser.set_window_size(1400, 700)
        self.browser.set_page_load_timeout(self.timeout)
        self.wait = WebDriverWait(self.browser, self.timeout)

    def __del__(self):
        # quit() closes every window and also stops the driver process.
        self.browser.quit()

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the middleware from the project settings.
        return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'))

    def process_request(self, request, spider):
        '''
        Load the page with Selenium, wrap the rendered source in an
        HtmlResponse, and return it so the spider can parse it and extract data.
        Returning a response here means the downloader never fetches the page.
        HtmlResponse documentation:
        :param request: the request being processed
        :param spider: the spider that issued the request
        :return: an HtmlResponse (status 500 if a wait timed out)
        '''
        self.logger.debug('Chrome is starting')
        page = request.meta.get('page', 1)
        try:
            self.browser.get(request.url)
            if page > 1:
                # Page-number input box; presence_of_element_located waits until it has been loaded.
                page_input = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > input')))
                # Confirm button for the page jump; element_to_be_clickable waits until it can be clicked.
                submit = self.wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > a')))
                page_input.clear()
                page_input.send_keys(str(page))
                submit.click()  # click the button
                time.sleep(5)

                # Wait until the highlighted page number contains the requested page;
                # text_to_be_present_in_element checks that the text appears in the element.
                self.wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#J_bottomPage > span.p-num > a.curr'), str(page)))
            # Wait for #J_goodsList (the product data) to load before returning the page source.
            self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#J_goodsList')))
            return HtmlResponse(url=request.url, body=self.browser.page_source, request=request, encoding='utf-8', status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, status=500, request=request)


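__init__ and the class method only perform initialization and need no further comment, so we focus on process_request().
First we set the browser's wait timeout. We then locate the page-number input box and the page-jump confirm button (assigned to submit), type the page number into the input box, wait for the page to load, and return an HtmlResponse to the spider for parsing. The page never passes through the downloader: we construct an HtmlResponse, a subclass of Response, and return it directly. (When a downloader middleware returns a Response object from process_request, the process_request methods of lower-priority middlewares are no longer called; the process_response() methods are run instead. This example defines no process_response() of its own, so the result goes straight to the spider for parsing.)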
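
The short-circuit behaviour described in the parentheses can be illustrated without Scrapy. The following is a simplified model in plain Python, not Scrapy's actual implementation; every class and function name in it is hypothetical, and requests and responses are plain strings.

```python
class RenderingMiddleware:
    """Stands in for SeleniumMiddleware: it answers the request itself."""

    def process_request(self, request):
        return 'rendered:' + request

    def process_response(self, request, response):
        return response


class LaterMiddleware:
    """A middleware that sits after the rendering one in the chain."""

    def __init__(self):
        self.calls = []

    def process_request(self, request):
        self.calls.append('process_request')
        return None  # None means "carry on with the next middleware"

    def process_response(self, request, response):
        self.calls.append('process_response')
        return response


def handle(request, middlewares, download):
    """Simplified model of a request passing through the middleware chain."""
    response = None
    for mw in middlewares:
        response = mw.process_request(request)
        if response is not None:
            break  # the remaining process_request methods are skipped
    if response is None:
        response = download(request)  # only reached if no middleware answered
    for mw in reversed(middlewares):
        response = mw.process_response(request, response)
    return response


def fake_download(request):
    return 'downloaded:' + request


if __name__ == '__main__':
    later = LaterMiddleware()
    result = handle('https://search.jd.com/Search?keyword=iPad',
                    [RenderingMiddleware(), later], fake_download)
    print(result)
    print(later.calls)
```

In this model the later middleware never sees process_request and the download function is never called, but its process_response still runs on the returned response.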

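Step 4:
Write the callback named in the Request objects of Step 2, which parses the page and extracts the data we want. Here we parse with BeautifulSoup. The code is as follows: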

# Requires at the top of the spider file: from bs4 import BeautifulSoup
def parse(self, response):
    soup = BeautifulSoup(response.text, 'lxml')
    # Each product on the results page is an <li class="gl-item">.
    lis = soup.find_all(name='li', class_="gl-item")
    for li in lis:
        proc_dict = {}
        # Shop name; entries without one are skipped.
        dp = li.find(name='span', class_="J_im_icon")
        if dp:
            proc_dict['dp'] = dp.get_text().strip()
        else:
            continue
        # The SKU id appears in the class of the price node and the id of the comment node.
        sku = li.attrs['data-sku']
        title = li.find(name='div', class_="p-name p-name-type-2")
        proc_dict['title'] = title.get_text().strip()
        price = li.find(name='strong', class_="J_" + sku)
        proc_dict['price'] = price.get_text()
        comment = li.find(name='a', id="J_comment_" + sku)
        proc_dict['comment'] = comment.get_text() + '條評論'  # the suffix means "comments"
        url = 'https://item.jd.com/' + sku + '.html'
        proc_dict['url'] = url
        proc_dict['type'] = 'JINGDONG'
        yield proc_dict

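Step 5:
Once the page data has been extracted, it is sent to the item pipeline for processing, cleaning, storage, and so on, so we now need to define an item pipeline. Here we store the data in a MongoDB database.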

import pymongo


class MongoPipeline(object):

    def __init__(self, mongo_url, mongo_db, collection):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db
        self.collection = collection

    @classmethod
    def from_crawler(cls, crawler):
        # Read the MongoDB configuration from the settings file.
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB'),
            collection=crawler.settings.get('COLLECTION')
        )

    def open_spider(self, spider):
        # Open the connection once, when the spider starts.
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = self.collection
        # insert_one replaces Collection.insert, which PyMongo 4 removed.
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

The class method from_crawler reads the MongoDB configuration from the settings file, __init__ stores it, and process_item writes each item to MongoDB.
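
To try the pipeline hooks without a running database, the same three methods can be exercised against an in-memory store. This is a sketch under that assumption: InMemoryPipeline is a hypothetical stand-in that swaps MongoDB for a plain dict, and the hooks are called by hand in the order Scrapy calls them (open_spider once, process_item for each item, close_spider once).

```python
class InMemoryPipeline:
    """Hypothetical stand-in for MongoPipeline that keeps items in a dict."""

    def __init__(self, collection):
        self.collection = collection
        self.store = None

    def open_spider(self, spider):
        # MongoPipeline opens the database connection at this point.
        self.store = {}

    def process_item(self, item, spider):
        # Copy the item to a plain dict, as MongoPipeline does before inserting.
        self.store.setdefault(self.collection, []).append(dict(item))
        return item  # returning the item hands it on to any later pipeline

    def close_spider(self, spider):
        # MongoPipeline closes the client here; nothing needs releasing.
        pass


if __name__ == '__main__':
    pipeline = InMemoryPipeline(collection='products')  # assumed name
    pipeline.open_spider(spider=None)
    pipeline.process_item({'title': 'sample', 'type': 'JINGDONG'}, spider=None)
    pipeline.close_spider(spider=None)
    print(pipeline.store)
```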

Step 6:
1. Configure the settings file with every setting the project uses: KEYWORDS, MAX_PAGE, SELENIUM_TIMEOUT (the page-load timeout), MONGO_URL, MONGO_DB, and COLLECTION.
2. Edit the settings to activate the downloader middleware and the item pipeline.

DOWNLOADER_MIDDLEWARES = {
   'scrapyseleniumtest.middlewares.SeleniumMiddleware': 543,
}

ITEM_PIPELINES = {
   'scrapyseleniumtest.pipelines.MongoPipeline': 300,
}
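
For reference, the custom settings listed in item 1 might look like the fragment below. KEYWORDS and MAX_PAGE repeat the values used earlier; the remaining values are assumptions chosen for illustration and should be adapted to your environment.

```python
# settings.py (fragment)
KEYWORDS = ['iPad']        # search terms, as configured earlier
MAX_PAGE = 2               # pages to crawl per keyword, as configured earlier
SELENIUM_TIMEOUT = 20      # seconds; assumed value
MONGO_URL = 'localhost'    # assumed: MongoDB running locally on the default port
MONGO_DB = 'jingdong'      # assumed database name
COLLECTION = 'products'    # assumed collection name
```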

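That completes all the code and configuration the project needs. After running the project, check the data in MongoDB to confirm that the crawl succeeded.

Full code for this project: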

https://gitee.com/liangxinbin/Scrpay/tree/master/scrapyseleniumtest
