Scraping product information from JD.com with Scrapy


Software environment:

gevent (1.2.2)
greenlet (0.4.12)
lxml (4.1.1)
pymongo (3.6.0)
pyOpenSSL (17.5.0)
requests (2.18.4)
Scrapy (1.5.0)
SQLAlchemy (1.2.0)
Twisted (17.9.0)
wheel (0.30.0)

 

1. Create the Scrapy project.
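A minimal sketch of this step (the project name MyScrapy is an assumption, chosen to match the pipeline path used in the settings below; use your own name if it differs):

scrapy startproject MyScrapy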

2. Create the JD spider. Go into the project directory and run:

scrapy genspider jd www.jd.com

This creates a .py file under the spiders directory with the name you gave: jd.py. This file is where you write the spider's request and response logic.
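The generated jd.py looks roughly like the skeleton below (the exact template depends on the Scrapy version). One adjustment worth labelling: allowed_domains is relaxed here from the generated 'www.jd.com' to 'jd.com', so that the requests to search.jd.com and item.jd.com made later are not dropped by the offsite middleware.

import scrapy


class JdSpider(scrapy.Spider):
    name = 'jd'
    # relaxed from the generated 'www.jd.com' so that search.jd.com and
    # item.jd.com requests are not filtered as offsite
    allowed_domains = ['jd.com']
    start_urls = ['https://www.jd.com/']

    def parse(self, response):
        pass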

3. Configure jd.py

Analyze the URL pattern of JD's search page:
https://search.jd.com/Search?
Since the keyword may be Chinese, it has to be URL-encoded.
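For illustration, urlencode turns a Chinese keyword into a safe query string (the keyword here is just an example):

from urllib.parse import urlencode

urlencode({"keyword": "筆記本", "enc": "utf-8"})
# -> 'keyword=%E7%AD%86%E8%A8%98%E6%9C%AC&enc=utf-8'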
1. First write a start_requests method that sends the initial request and hands the result to the callback parse_index; Scrapy passes the response to the callback, and its type is <class 'scrapy.http.response.html.HtmlResponse'>.
# needs at the top of jd.py: from urllib.parse import urlencode
def start_requests(self):
    # Build the search URL for the keyword
    url = 'https://search.jd.com/Search?'
    # The keyword may be Chinese, so URL-encode it
    url += urlencode({"keyword": self.keyword, "enc": "utf-8"})
    # Wrap it in scrapy.Request; the response will be passed to the
    # callback parse_index
    yield scrapy.Request(url,
                         callback=self.parse_index,
                         )
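The method reads self.keyword, which is never defined in the post. One common way to supply it (an assumption, not shown in the original) is a command-line spider argument, which Scrapy sets as an instance attribute:

scrapy crawl jd -a keyword=iphone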
2. parse_index takes the response, extracts all of the product detail page URLs from it, iterates over them, and sends a request for each one with parse_detail as the callback to handle the result.
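parse_index itself is not listed in the post; a minimal sketch, assuming JD's search results put each product in an li.gl-item with the detail link inside div.p-img (adjust the XPath against the live page):

def parse_index(self, response):
    # Collect every product detail link on the search result page
    detail_urls = response.xpath('//li[@class="gl-item"]//div[@class="p-img"]/a/@href').extract()
    for url in detail_urls:
        # The hrefs are protocol-relative ("//item.jd.com/<sku>.html"),
        # so let urljoin complete them before requesting the detail page
        yield scrapy.Request(response.urljoin(url), callback=self.parse_detail)

parse_detail, which handles each detail page, is shown next: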
# needs at the top of jd.py: import json, requests
# and: from MyScrapy.items import JdItem
def parse_detail(self, response):
    """
    Callback for the requests sent by parse_index: receives the detail
    page response and parses it.
    :param response:
    :return:
    """
    jd_url = response.url
    sku = jd_url.split('/')[-1].strip(".html")
    # The price is fetched via JSONP; its request URL can be found among the
    # script requests in the browser's developer tools
    price_url = "https://p.3.cn/prices/mgets?skuIds=J_" + sku
    response_price = requests.get(price_url)
    # extraParam={"originid":"1"}  skuIds=J_3726834
    # The delivery information is also fetched via JSONP, but I have not found
    # out how its parameters are generated, so a fixed parameter is used here;
    # if anyone knows, please let me know
    express_url = "https://c0.3.cn/stock?skuId=3726834&area=1_72_4137_0&cat=9987,653,655&extraParam={%22originid%22:%221%22}"
    response_express = requests.get(express_url)
    response_express = json.loads(response_express.text)['stock']['serviceInfo'].split('>')[1].split('<')[0]
    title = response.xpath('//*[@class="sku-name"]/text()').extract_first().strip()
    price = json.loads(response_price.text)[0]['p']
    delivery_method = response_express
    # Put the fields on an Item so they can be stored later
    item = JdItem()
    item['title'] = title
    item['price'] = price
    item['delivery_method'] = delivery_method

    # Return the item; when the engine sees an Item it hands it to the pipelines
    return item
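For reference, the price endpoint returns a small JSON array; judging from the [0]['p'] access above, the body looks roughly like the following, where p is the current price (the values and extra fields here are purely illustrative):

[{"id": "J_3726834", "p": "5399.00", "m": "6399.00"}]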

4. Configure items.py

import scrapy


class JdItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # fields filled in by parse_detail
    title = scrapy.Field()
    price = scrapy.Field()
    delivery_method = scrapy.Field()

5. Configure pipelines.py

from pymongo import MongoClient
from scrapy.exceptions import DropItem


class MongoPipeline(object):
    """
    Pipeline that saves the scraped items into MongoDB.
    """

    def __init__(self, db, collection, host, port, user, pwd):
        """
        Store the connection settings.
        :param db: database name
        :param collection: collection (table) name
        :param host: server IP
        :param port: server port
        :param user: username for login
        :param pwd: password for login
        """
        self.db = db
        self.collection = collection
        self.host = host
        self.port = port
        self.user = user
        self.pwd = pwd

    @classmethod
    def from_crawler(cls, crawler):
        """
        Classmethod used by Scrapy to read the configuration from settings.
        :param crawler:
        :return:
        """
        db = crawler.settings.get('DB')
        collection = crawler.settings.get('COLLECTION')
        host = crawler.settings.get('HOST')
        port = crawler.settings.get('PORT')
        user = crawler.settings.get('USER')
        pwd = crawler.settings.get('PWD')

        return cls(db, collection, host, port, user, pwd)

    def open_spider(self, spider):
        """
        Runs once when the spider starts.
        :param spider:
        :return:
        """
        # Connect to the database
        self.client = MongoClient("mongodb://%s:%s@%s:%s" % (
            self.user,
            self.pwd,
            self.host,
            self.port
        ))

    def process_item(self, item, spider):
        """
        Store the item in the database.
        :param item:
        :param spider:
        :return:
        """
        # Convert the item into a plain dict
        d = dict(item)
        # Skip records that have any empty field
        if all(d.values()):
            # Save into MongoDB (Collection.save works on pymongo 3.x)
            self.client[self.db][self.collection].save(d)
        return item

        # To drop the item so that no later pipeline processes it:
        # raise DropItem()
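A quick way to check that items are actually landing in MongoDB (a sketch, assuming the connection settings from the next step):

from pymongo import MongoClient

client = MongoClient("mongodb://root:123@127.0.0.1:27017")
# print one stored product document
print(client["jd"]["goods"].find_one())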

 

6. Settings (settings.py)

# MongoDB connection settings read by MongoPipeline.from_crawler
DB = "jd"
COLLECTION = "goods"
HOST = "127.0.0.1"
PORT = 27017
USER = "root"
PWD = "123"
# register the pipeline; the number (0-1000) sets its order, lower runs first
ITEM_PIPELINES = {
   'MyScrapy.pipelines.MongoPipeline': 300,
}

 

 

 

