Web Scraping Example: Fetching Product Information from Taobao Pages

Reposted article. 2020-10-08 09:45 1457


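I. Complete code:

Song Tian's MOOC course includes an example that searches for product pages. After working through it, I found that his original source code no longer retrieves anything. This is because the Taobao search page has required login since 2019, so scraping product data requires a logged-in account. The request therefore needs both a header and a cookie.

The login cookie can be found in the browser's developer tools after logging in to Taobao (remember to refresh the page).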
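
As an aside, a cookie string copied from the developer tools can also be split into a dictionary and passed through the cookies parameter of requests.get rather than as a raw header. The following is a minimal sketch, assuming a hypothetical helper named cookie_header_to_dict and made-up cookie values; it only handles the plain name=value; name=value form.

```python
def cookie_header_to_dict(raw_cookie):
    """Split a 'name=value; name=value' cookie string into a dict.

    Hypothetical helper for illustration; not part of the original code.
    """
    cookies = {}
    for pair in raw_cookie.split(';'):
        # partition splits at the first '=', so '=' inside a value is kept
        name, sep, value = pair.strip().partition('=')
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


# Made-up values; a real string would be copied from the browser.
example = cookie_header_to_dict('token=abc123; theme=dark; sig=a=b')
print(example)
```

The resulting dictionary could then be supplied as requests.get(url, cookies=example, ...).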

The complete code comes first:

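import requests
import re


def getHTMLText(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        header = {
            # Browser identification
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
            # Login cookie copied from the browser (value elided)
            'cookie': '_samesite_flag_=t*********************kmn'
        }
        r = requests.get(url, timeout=30, headers=header)
        r.raise_for_status()  # raise on HTTP 4xx/5xx responses
        r.encoding = r.apparent_encoding  # decode with the encoding detected from the body
        return r.text
    except requests.RequestException:
        return ""


def parsePage(ilt, html):
    """Append every [price, title] pair found in html to ilt."""
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            # Split on the first colon only, then drop the surrounding quotes
            price = plt[i].split(':', 1)[1].strip('"')
            title = tlt[i].split(':', 1)[1].strip('"')
            ilt.append([price, title])
    except IndexError:
        print("Parsing failed: fewer titles than prices were found")


def printGoodsList(ilt):
    """Print the collected products as a numbered table."""
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Product name"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))


def main():
    goods = '手表'  # search keyword ("watch")
    depth = 2  # number of result pages to fetch
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44 * i)  # 44 items per page
            html = getHTMLText(url)
            parsePage(infoList, html)
        except Exception:
            continue  # skip a page that fails and move on
    printGoodsList(infoList)


if __name__ == '__main__':
    main()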

II. Code walkthrough:

The code is divided into three parts:

1. Fetch the product page ==> getHTMLText

2. Parse the product page ==> parsePage

3. Print the product information ==> printGoodsList

(1) getHTMLText

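def getHTMLText(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        header = {
            # Browser identification
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
            # Login cookie copied from the browser (value elided)
            'cookie': '_samesite_flag_=t*********************kmn'
        }
        r = requests.get(url, timeout=30, headers=header)
        r.raise_for_status()  # raise on HTTP 4xx/5xx responses
        r.encoding = r.apparent_encoding  # decode with the encoding detected from the body
        return r.text
    except requests.RequestException:
        return ""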

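1. First, define the header, which carries the login information (cookie) and the browser information (user-agent).

2. r.raise_for_status() raises an exception when the request fails (an HTTP 4xx or 5xx status), so execution jumps from try to except and the program as a whole stays robust.

3. r.encoding = r.apparent_encoding sets the encoding used to decode the response to apparent_encoding, the encoding detected from the response content itself.

4. Finally, return r.text, the decoded text of the response.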
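
To see why step 3 matters, the sketch below decodes the same bytes with two different encodings, using only the standard library. The GBK-encoded sample is an assumption chosen for illustration and decode_body is a hypothetical helper; in requests, the detection itself is what apparent_encoding provides.

```python
def decode_body(raw_bytes, encoding):
    """Decode response bytes, replacing anything that cannot be decoded."""
    return raw_bytes.decode(encoding, errors='replace')


raw = '手表'.encode('gbk')            # bytes as a GBK-encoded page might send them
wrong = decode_body(raw, 'latin-1')   # a wrong guess yields unreadable text
right = decode_body(raw, 'gbk')       # the matching encoding restores the text

print(repr(wrong))
print(right)
```

Decoding with the wrong encoding does not raise an error here; it silently produces garbage, which is why the encoding is set before r.text is read.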

(2) parsePage

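def parsePage(ilt, html):
    """Append every [price, title] pair found in html to ilt."""
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            # Split on the first colon only, then drop the surrounding quotes
            price = plt[i].split(':', 1)[1].strip('"')
            title = tlt[i].split(':', 1)[1].strip('"')
            ilt.append([price, title])
    except IndexError:
        print("Parsing failed: fewer titles than prices were found")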

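This is the core of the program: it searches and parses the text returned in r.text.

First, the regular expression:

r'\"view_price\"\:\"[\d\.]*\"'

Searching Taobao for 書包 (schoolbag) and inspecting the page shows that every product price is preceded by the key view_price. Likewise, every product title is preceded by the key raw_title.

The many backslashes in \"view_price\"\:\"[\d\.]*\" are escape characters. The pattern matches strings of the form "view_price":"[\d.]*", that is, the key followed by a quoted run of digits and dots, and findall returns every such match in the page. (The backslashes before the quotes and the colon are optional in this raw string, but harmless.)

Similarly, for the product title either title or raw_title could be used. raw_title was chosen because title appears twice within a single product's data.

In the end, plt and tlt hold the prices and titles of all products, and their indexes correspond one to one.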
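
The two patterns can be tried without any network access. The sketch below runs them on a hypothetical fragment shaped like the data described above; the sample text is made up for illustration and is not real Taobao output.

```python
import re

# Hypothetical fragment shaped like the data embedded in a search page.
sample_html = (
    '{"raw_title":"Sample watch A","view_price":"129.00"},'
    '{"raw_title":"Sample watch B","view_price":"88"}'
)

plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', sample_html)
tlt = re.findall(r'\"raw_title\"\:\".*?\"', sample_html)

print(plt)  # ['"view_price":"129.00"', '"view_price":"88"']
print(tlt)  # ['"raw_title":"Sample watch A"', '"raw_title":"Sample watch B"']
```

Each list element still contains the key and the quotes; the value is extracted in the next step.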

for i in range(len(plt)):
    # Split on the first colon only, then drop the surrounding quotes
    price = plt[i].split(':', 1)[1].strip('"')
    title = tlt[i].split(':', 1)[1].strip('"')
    ilt.append([price, title])

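Finally, the price and title of every product are appended to the list ilt.

Each element of plt has the form "view_price":"129". The split method separates it at the colon, the element at position [1] is the value, and removing the surrounding double quotes leaves the product price.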
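
The extraction step can be tried on its own. The sketch below uses a hypothetical helper, extract_value, and a made-up title containing a colon, to show why splitting only on the first colon is the safer choice: a plain split(':') would cut such a title short.

```python
def extract_value(match):
    """Return the value from a '"key":"value"' match, without its quotes."""
    # Split once, on the first colon, so colons inside the value survive.
    return match.split(':', 1)[1].strip('"')


price = extract_value('"view_price":"129"')
title = extract_value('"raw_title":"Watch: steel strap"')
truncated = '"raw_title":"Watch: steel strap"'.split(':')[1]

print(price)      # 129
print(title)      # Watch: steel strap
print(truncated)  # "Watch
```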

III. Printing the product price information: printGoodsList

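def printGoodsList(ilt):
    """Print the collected products as a numbered table."""
    tplt = "{:4}\t{:8}\t{:16}"  # column widths: number, price, name
    print(tplt.format("No.", "Price", "Product name"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))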

The format template tplt is defined first; count then serves as the running serial number for each row.
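
To make the effect of the width specifiers visible, the sketch below formats one header and one row with the same template and prints them with repr so the padding shows. The row values are made up for illustration.

```python
tplt = "{:4}\t{:8}\t{:16}"

# By default, numbers are right-aligned and strings are left-aligned.
header = tplt.format("No.", "Price", "Product name")
row = tplt.format(1, "129.00", "Sample watch")

print(repr(header))
print(repr(row))
```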

IV. Running the main() function

def main():
    goods = '手表'  # search keyword ("watch")
    depth = 2  # number of result pages to fetch
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44 * i)  # 44 items per page
            html = getHTMLText(url)
            parsePage(infoList, html)
        except Exception:
            continue  # skip a page that fails and move on
    printGoodsList(infoList)

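The URL of a Taobao search page shows that the search endpoint is search?q=

Each page holds 44 items, and the s parameter gives the offset of the first item: 0 (or omitted) for the first page, 44 for the second, and 88 for the third.

So we define depth as the number of pages to fetch; the loop runs once per page.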

https://s.taobao.com/search?q=%E4%B9%A6%E5%8C%85&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=6&ntoffset=6&p4ppushleft=1%2C48&s=0
https://s.taobao.com/search?q=%E4%B9%A6%E5%8C%85&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=44
https://s.taobao.com/search?q=%E4%B9%A6%E5%8C%85&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&bcoffset=3&ntoffset=0&p4ppushleft=1%2C48&s=88
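
The paging rule can be sketched as a small URL builder. This is a minimal sketch, assuming that only the q and s parameters are needed (as in main() above); build_search_urls and ITEMS_PER_PAGE are hypothetical names introduced for illustration. urlencode from the standard library percent-encodes the keyword the same way as in the URLs above.

```python
from urllib.parse import urlencode

ITEMS_PER_PAGE = 44  # page size observed in the URLs above


def build_search_urls(keyword, depth):
    """Return the search URL for each of the first `depth` result pages."""
    base = 'https://s.taobao.com/search?'
    return [
        base + urlencode({'q': keyword, 's': ITEMS_PER_PAGE * page})
        for page in range(depth)
    ]


urls = build_search_urls('书包', 3)
for url in urls:
    print(url)
```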

Running the program prints a table of serial numbers, prices, and product names.
