Python Crawler: Scraping JD.com Product Information for a Given Keyword

This article is a repost. 2019-05-25 12:21 | Crawler / Python

Goal: scrape JD.com product information for a given keyword and save it to MongoDB.

Fields: title, url, store, store_url, item_id, price, comments_count, comments

Tools: requests, lxml, pymongo, concurrent.futures

Analysis:

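1. https://search.jd.com/Search?keyword=耳機&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=er%27ji&page=1&s=56&click=0 is the URL JD.com redirects to when searching for 耳機 (headphones). Its key parameters are:

  keyword: the search keyword

  enc: the string encoding

  page: the page number; note that search pages are numbered with odd values only (1, 3, 5, ...)

  The simplified URL is therefore https://search.jd.com/Search?keyword=耳機&enc=utf-8&page=1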
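The simplified search URL and the odd page numbering can be sketched as follows. This is a minimal illustration, not the article's own code: the helper names `search_page_number` and `build_search_url` are hypothetical, and the keyword is percent-encoded explicitly with the standard library.

```python
from urllib.parse import urlencode

SEARCH_BASE = 'https://search.jd.com/Search'


def search_page_number(index):
    """Map a 0-based result-page index to the odd `page` value (0 -> 1, 1 -> 3, ...)."""
    return index * 2 + 1


def build_search_url(keyword, index):
    """Build the simplified search URL with only keyword, enc and page."""
    query = urlencode({
        'keyword': keyword,
        'enc': 'utf-8',
        'page': search_page_number(index),
    })
    return SEARCH_BASE + '?' + query


if __name__ == '__main__':
    for i in range(3):
        print(build_search_url('耳機', i))
```

Keeping the index-to-page mapping in one small function makes it harder to request an even page number by accident.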

2. Analysing the XPath of each field shows that the search page only yields title, url, store, store_url, item_id and price. comments_count and comments require separate requests.
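The field extraction can be tried offline on a small sample. The sketch below makes several assumptions: the markup is a hypothetical, heavily simplified stand-in that only borrows the class names used in this article, and it uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra dependencies. ElementTree supports only a subset of XPath (for instance, no `contains()`), so the class check on `li` is done in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real search page is far larger.
SAMPLE = """
<ul>
  <li class="gl-item">
    <div class="p-price"><strong><i>199.00</i></strong></div>
    <div class="p-name p-name-type-2">
      <a href="//item.jd.com/100004325476.html"><em>Sample <font>headphones</font></em></a>
    </div>
    <div class="p-shop"><span><a href="//mall.jd.com/index-1.html">Sample Store</a></span></div>
  </li>
  <li class="gl-item">
    <div class="p-price"><strong><i>9.90</i></strong></div>
  </li>
</ul>
"""


def parse_items(markup):
    """Extract the search-page fields from every complete gl-item entry."""
    results = []
    for li in ET.fromstring(markup.strip()).iter('li'):
        if 'gl-item' not in li.get('class', ''):
            continue
        link = li.find(".//div[@class='p-name p-name-type-2']/a")
        shop = li.find(".//div[@class='p-shop']/span/a")
        price = li.find(".//div[@class='p-price']//i")
        if link is None or shop is None or price is None:
            continue  # incomplete entry (e.g. an ad): skip it
        url = 'https:' + link.get('href')  # hrefs are protocol-relative
        results.append({
            'title': ''.join(link.itertext()),
            'url': url,
            'store': shop.text,
            'store_url': 'https:' + shop.get('href'),
            'item_id': url.split('/')[-1][:-5],  # drop the trailing ".html"
            'price': price.text,
        })
    return results


if __name__ == '__main__':
    for item in parse_items(SAMPLE):
        print(item)
```

The second `li` in the sample lacks a name and a shop, so it is dropped, which mirrors how incomplete ad entries are handled during the crawl.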

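3. Open a product detail page, click the product comments tab, and open the developer tools. Click through to the next page of comments: apart from media responses, the new requests include only one additional js response, so the comment content is presumably contained in it.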

4. The URL of that request, once simplified, is https://sclub.jd.com/comment/productPageComments.action?productId=100004325476&score=0&sortType=5&page=0&pageSize=10, where:

  productId: the product id, which can simply be taken from the detail page URL

  page: the comment page number
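Taking the id from a detail page URL and filling in the comment URL might look like this. It is a sketch under assumptions: the helper names are hypothetical, and the regular expression assumes detail URLs of the form https://item.jd.com/<digits>.html, as in the example above.

```python
import re

COMMENT_URL = ('https://sclub.jd.com/comment/productPageComments.action'
               '?productId=%s&score=0&sortType=5&page=%d&pageSize=10')


def item_id_from_url(detail_url):
    """Return the numeric id from a detail URL, or None if it does not match."""
    match = re.search(r'/(\d+)\.html', detail_url)
    return match.group(1) if match else None


def build_comment_url(item_id, page):
    """Fill productId and the 0-based comment page into the simplified URL."""
    return COMMENT_URL % (item_id, page)


if __name__ == '__main__':
    item_id = item_id_from_url('https://item.jd.com/100004325476.html')
    print(build_comment_url(item_id, 0))
```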

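5. From the above, we first need to obtain the product ids from the search page and then use each id to fetch the comments. When scraping comments, note that the server checks the Referer request header, i.e. comments are only returned for requests that appear to come from the product detail page, so the request headers are built from item_id for every request.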
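Building the per-item headers can be sketched as below. The function name and the short default User-Agent are placeholders for illustration; in practice a full browser User-Agent string would be passed in.

```python
def build_comment_headers(item_id, user_agent='Mozilla/5.0'):
    """Return request headers whose Referer points at the item's detail page."""
    return {
        'Referer': 'https://item.jd.com/%s.html' % item_id,
        'User-Agent': user_agent,  # placeholder default, replace with a real UA
    }


if __name__ == '__main__':
    print(build_comment_headers('100004325476'))
```

The resulting dict can be passed as the `headers` argument of `requests.get`.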

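6. Insert the basic information into the database first; once the comment data has been fetched, complete the record by looking it up via the indexed item_id.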
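The two-stage write can be illustrated by building the filter and update documents separately from the database call. This is a sketch with a hypothetical helper name; it only constructs plain dicts, so it runs without a MongoDB server.

```python
def build_comment_update(item_id, comments_count, comments):
    """Return (filter, update) documents for completing a stored item."""
    filter_doc = {'item_id': item_id}
    update_doc = {
        # overwrite the count each time a comment page is fetched
        '$set': {'comments_count': comments_count},
        # append each comment to the array unless it is already present
        '$addToSet': {'comments': {'$each': list(comments)}},
    }
    return filter_doc, update_doc


if __name__ == '__main__':
    flt, upd = build_comment_update('100004325476', '10+', ['Good sound'])
    print(flt)
    print(upd)
```

With pymongo, the pair would then be passed as `collection.update_one(flt, upd, upsert=True)`. Because `$addToSet` skips values that already exist in the array, fetching the same comment page twice does not duplicate comments.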

Code:

import requests
from lxml import etree
import pymongo
from concurrent import futures


class CrawlDog:
    def __init__(self, keyword):
        """
        Initialise the crawler.
        :param keyword: the search keyword
        """
        self.keyword = keyword
        self.mongo_client = pymongo.MongoClient(host='localhost')
        self.mongo_collection = self.mongo_client['spiders']['jd']
        self.mongo_collection.create_index([('item_id', pymongo.ASCENDING)])

    def get_index(self, page):
        """
        Fetch one search page, store the basic fields, and yield the product ids.
        :param page: page number of the search page
        :return: generator of product ids
        """
        url = 'https://search.jd.com/Search?keyword=%s&enc=utf-8&page=%d' % (self.keyword, page)
        index_headers = {
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,'
                      'application/signed-exchange;v=b3',
            'accept-encoding': 'gzip, deflate, br',
            'Accept-Charset': 'utf-8',
            'accept-language': 'zh,en-US;q=0.9,en;q=0.8,zh-TW;q=0.7,zh-CN;q=0.6',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/74.0.3729.169 Safari/537.36'
        }
        rsp = requests.get(url=url, headers=index_headers).content.decode()
        rsp = etree.HTML(rsp)
        items = rsp.xpath('//li[contains(@class, "gl-item")]')
        for item in items:
            try:
                info = dict()
                info['title'] = ''.join(item.xpath('.//div[@class="p-name p-name-type-2"]//em//text()'))
                info['url'] = 'https:' + item.xpath('.//div[@class="p-name p-name-type-2"]/a/@href')[0]
                info['store'] = item.xpath('.//div[@class="p-shop"]/span/a/text()')[0]
                # the href is protocol-relative, so the prefix needs the colon
                info['store_url'] = 'https:' + item.xpath('.//div[@class="p-shop"]/span/a/@href')[0]
                # last path segment without the trailing ".html"
                info['item_id'] = info.get('url').split('/')[-1][:-5]
                info['price'] = item.xpath('.//div[@class="p-price"]//i/text()')[0]
                info['comments'] = []
                self.mongo_collection.insert_one(info)
                yield info['item_id']
            # In practice some entries are ads, and some of the fields above are empty for them
            except IndexError:
                print('incomplete item info, drop!')
                continue

    def get_comment(self, params):
        """
        Fetch the comments for the given product id.
        :param params: dict with item_id (product id) and page (comment page number)
        :return:
        """
        url = 'https://sclub.jd.com/comment/productPageComments.action?productId=%s&score=0&sortType=5&page=%d&' \
              'pageSize=10' % (params['item_id'], params['page'])
        comment_headers = {
            'Referer': 'https://item.jd.com/%s.html' % params['item_id'],
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/74.0.3729.169 Safari/537.36'
        }
        rsp = requests.get(url=url, headers=comment_headers).json()
        comments_count = rsp.get('productCommentSummary').get('commentCountStr')
        comments = rsp.get('comments')
        comments = [comment.get('content') for comment in comments]
        self.mongo_collection.update_one(
            # locate the matching document
            {'item_id': params['item_id']},
            {
                '$set': {'comments_count': comments_count},  # add the comments_count field
                '$addToSet': {'comments': {'$each': comments}}  # add each comment to the comments field
            },
            upsert=True)

    def main(self, index_pn, comment_pn):
        """
        Run the crawl.
        :param index_pn: number of search pages to crawl
        :param comment_pn: number of comment pages to crawl per product
        :return:
        """
        # argument list for the search-page function (odd page numbers)
        il = [i * 2 + 1 for i in range(index_pn)]
        # create a pool of threads to run the crawl
        with futures.ThreadPoolExecutor(15) as executor:
            # get_index is a generator function, so drain it with list() inside the
            # worker; otherwise the requests would only run when iterated below
            res = executor.map(lambda page: list(self.get_index(page)), il)
        for item_ids in res:
            # argument list for the comment-page function
            cl = [{'item_id': item_id, 'page': page} for item_id in item_ids for page in range(comment_pn)]
            with futures.ThreadPoolExecutor(15) as executor:
                executor.map(self.get_comment, cl)


if __name__ == '__main__':
    # Test run: crawl only two search pages and two comment pages
    test = CrawlDog('耳機')
    test.main(2, 2)
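The JSON handling in get_comment can be exercised offline as a sketch. The payload below is hypothetical and models only the keys the code reads (`productCommentSummary`, `commentCountStr`, `comments`, `content`); the real response contains many more fields. Unlike the chained `.get()` calls above, this variant tolerates a missing summary or comment list instead of raising.

```python
import json

# Hypothetical payload with only the keys read by get_comment.
SAMPLE_RESPONSE = json.dumps({
    'productCommentSummary': {'commentCountStr': '10+'},
    'comments': [{'content': 'Good sound'}, {'content': 'Fast delivery'}],
})


def parse_comment_payload(text):
    """Return (comments_count, [comment text, ...]) from a response body."""
    data = json.loads(text)
    summary = data.get('productCommentSummary') or {}
    comments = data.get('comments') or []
    return summary.get('commentCountStr'), [c.get('content') for c in comments]


if __name__ == '__main__':
    print(parse_comment_payload(SAMPLE_RESPONSE))
```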

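Summary: your IP may get blocked while crawling. During testing, access to the comment pages was blocked; using a proxy solves this problem. A later post will focus on how to use proxies.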
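As a rough preview, requests accepts a `proxies` mapping per call. The sketch below only builds that mapping; the helper name and the local proxy address are hypothetical, and no request is sent.

```python
def build_proxies(proxy_url):
    """Return the scheme-to-proxy mapping accepted by the `proxies` argument of requests."""
    return {'http': proxy_url, 'https': proxy_url}


if __name__ == '__main__':
    # Hypothetical proxy address, for illustration only
    proxies = build_proxies('http://127.0.0.1:8888')
    print(proxies)
    # With requests installed, the mapping would be used like this:
    # requests.get(url, headers=comment_headers, proxies=proxies, timeout=10)
```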
