Python Web Scraping: A Step-by-Step Walkthrough of Scraping Vipshop (唯品會) Product Information


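Scraping Vipshop Product Information in Practice

  • 1. Target URL and page analysis

  • 2. First attempt at scraping

  • 3. Scraping in practice

    • 3.1 Scraping the product ids

    • 3.2 Building the URL for the product id request

    • 3.3 Converting the product id data and verifying the count

    • 3.4 Fetching the product details

  • 4. Full code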

 

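1. Target URL and page analysis

Searching the Vipshop site for 護膚套裝 (skincare set) returns a results page.

Scrolling down shows that product data is loaded automatically as you approach the bottom of the page. This is Ajax at work, and it suggests that the product information is served from a JSON endpoint. Scrolling all the way to the bottom reveals the pagination buttons.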

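2. First attempt at scraping

Start by capturing the network traffic to find the URL that actually serves the product data. Right-click the page and choose Inspect, open the Network tab, and refresh the page so that the requests are listed. Filter through them to find the ones that carry product information; most of the relevant content turns out to be in requests whose URLs contain callback.

Of these seven requests only four are useful. The second one, rank, contains the ids of all products on the current page.

The remaining three v2 requests split these 120 products into groups (product indices start at 0).

That completes the search for the real data endpoints behind the 120 products on the search page. Next, try scraping one of these URLs to see what comes back, then work out the pattern to see whether all the data on the page can be fetched the same way.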
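The split described above can be sketched as follows. This is only an illustration, assuming the 120 ids are divided into consecutive batches of at most 50, which would give groups of 50, 50 and 20. The helper name `chunk_ids` and the placeholder ids are hypothetical and not part of the site's API.

```python
def chunk_ids(ids, size=50):
    """Split a list of ids into consecutive batches of at most `size` items."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


# Placeholder ids standing in for the 120 pids returned by the rank request.
placeholder_ids = [str(i) for i in range(120)]
batches = chunk_ids(placeholder_ids)
print([len(batch) for batch in batches])  # [50, 50, 20]
```

Requesting one batch per call rather than one id per call would keep the number of requests per page at three instead of 120.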

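Set the request headers with User-Agent, Cookie and Referer (all visible under the Headers tab of the request), assign the endpoint URL to a variable, and send the request. The code below starts with the v2 request that returns 20 products.

To find the cookie, clear the callback filter and select the first request in the default list, suggest.

Note: set the headers according to what your own browser sends.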

import requests

headers = {
    'Cookie': 'vip_city_code=104101115; vip_wh=VIP_HZ; vip_ipver=31; user_class=a; mars_sid=ff7be68ad4dc97e589a1673f7154c9f9; VipUINFO=luc%3Aa%7Csuc%3Aa%7Cbct%3Ac_new%7Chct%3Ac_new%7Cbdts%3A0%7Cbcts%3A0%7Ckfts%3A0%7Cc10%3A0%7Crcabt%3A0%7Cp2%3A0%7Cp3%3A1%7Cp4%3A0%7Cp5%3A0%7Cul%3A3105; mars_pid=0; visit_id=98C7BA95D1CA0C0E518537BD0B4ABEA0; vip_tracker_source_from=; pg_session_no=5; mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375',
    'Referer': 'https://category.vip.com/suggest.php?keyword=%E6%8A%A4%E8%82%A4&ff=235|12|1|1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}

url = 'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets3&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds=6918324165453150280%2C6918256118899745105%2C6918357885382468749%2C6918449056102396358%2C6918702822359352066%2C6918479374036836673%2C6918814278458725896%2C6918585149106754305%2C6918783763771922139%2C6917924417817122013%2C6918747787667990790%2C6918945825686792797%2C6918676686121468885%2C6918690813799719966%2C6917924776628925583%2C6918808484587649747%2C6918524324182323338%2C6917924083191145365%2C6917924119199990923%2C6917924081998898069%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600158865440'
html = requests.get(url,headers=headers)
print(html.text)

 

The output matches what the browser shows for the same request.
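As an alternative to pasting the whole Cookie header as one string, the copied value can be turned into a dictionary, which requests accepts through its `cookies=` argument. The sketch below uses only a few of the cookie pairs shown above; the helper name `cookie_string_to_dict` is hypothetical.

```python
def cookie_string_to_dict(raw):
    """Turn 'a=1; b=2' (as copied from the Cookie request header) into a dict."""
    cookies = {}
    for part in raw.split(';'):
        name, sep, value = part.strip().partition('=')
        if sep:  # skip fragments that contain no '='
            cookies[name] = value
    return cookies


raw_cookie = 'vip_wh=VIP_HZ; vip_ipver=31; user_class=a'
print(cookie_string_to_dict(raw_cookie))
```

The result could then be passed as `requests.get(url, headers=headers, cookies=cookie_string_to_dict(raw_cookie))`, with the Cookie entry dropped from `headers`.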


With one request working, compare the actual request URLs of the three v2 requests to find the pattern:

'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets3&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds=6918324165453150280%2C6918256118899745105%2C6918357885382468749%2C6918449056102396358%2C6918702822359352066%2C6918479374036836673%2C6918814278458725896%2C6918585149106754305%2C6918783763771922139%2C6917924417817122013%2C6918747787667990790%2C6918945825686792797%2C6918676686121468885%2C6918690813799719966%2C6917924776628925583%2C6918808484587649747%2C6918524324182323338%2C6917924083191145365%2C6917924119199990923%2C6917924081998898069%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600158865440'
'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets1&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds=6918241720044454476%2C6917919624790589569%2C6917935170607219714%2C6918794091804350029%2C6918825617469761228%2C6918821681541400066%2C6918343188631192386%2C6918909902880919752%2C6918944714357405314%2C6918598446593061836%2C6917992439761061707%2C6918565057324098974%2C6918647344809112386%2C6918787811445699149%2C6918729979027610590%2C6918770949378056781%2C6918331290238460382%2C6918782319292540574%2C6918398146810241165%2C6918659293579989333%2C6917923814107067291%2C6918162041180009111%2C6918398146827042957%2C6917992175963801365%2C6918885216264034310%2C6918787811496047181%2C6918273588862755984%2C6917924752735125662%2C6918466082515404493%2C6918934739456193886%2C6917924837261255565%2C6918935779609622221%2C6917920117494382747%2C6917987978233958977%2C6917923641027928222%2C6918229910205674453%2C6917970328155673856%2C6918470882161509397%2C6918659293832008021%2C6918750646128649741%2C6917923139576259723%2C6918387987850605333%2C6917924445491982494%2C6918790938962557837%2C6918383695533143067%2C6918872378378761054%2C6918640250037793602%2C6918750646128641549%2C6917937020463562910%2C6917920520629265102%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600158865436'
'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets2&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds=6918690813782926366%2C6918447252612175371%2C6918159188446941835%2C6918205147496443989%2C6918006775182997019%2C6918710130501497419%2C6917951703208964235%2C6918936224464094528%2C6918394023211385035%2C6918872268898919262%2C6918397905200202715%2C6918798460682221086%2C6918800888595138517%2C6917919413703328321%2C1369067222846365%2C6917924520139822219%2C6918904223283803413%2C6918507022166130843%2C6918479374087209281%2C6917924176900793243%2C6918750646145443341%2C6918449056102412742%2C6918901362318117467%2C6918570897095177292%2C6917924520223884427%2C6918757924517328902%2C6918398146827051149%2C6918789686747831253%2C6918476662192264973%2C6917919300445017109%2C6917919922739126933%2C6917920155539928286%2C6918662208810186512%2C6917923139508970635%2C6918859281628675166%2C6918750645658871309%2C6918820034693202694%2C6918689681141637573%2C6917919916536480340%2C6918719763326603415%2C6918659293579997525%2C6917920335390225555%2C6918589584225669211%2C6918386595131470421%2C6918640034622429077%2C6917923665227256725%2C6918331290238476766%2C6917924054840074398%2C6917924438479938177%2C6917920679932125915%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600158865437'

 


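Comparing the three URLs shows that the essential difference is the productIds parameter in the middle (the callback name and the trailing _ timestamp also vary). So once all the product ids are known, all the product details can be fetched. That is the URL pattern.

Conveniently, all the product ids are in the second request, rank. So request that URL first to get the ids, then rebuild the v2 URL and fetch the product details.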
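The pattern above can be sketched with `urllib.parse.urlencode`, which takes care of escaping the commas between ids as %2C. This is a minimal sketch that includes only a subset of the query parameters from the captured URLs; which parameters the server actually requires has not been verified here, so the remaining ones (api_key, mars_cid, extParams and so on) would be added to the dictionary in the same way. The helper name `build_detail_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = 'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2'


def build_detail_url(product_ids, callback='getMerchandiseDroplets1'):
    """Build a v2 detail URL for a batch of product ids (subset of parameters)."""
    params = {
        'callback': callback,
        'app_name': 'shop_pc',
        'app_version': '4.0',
        'warehouse': 'VIP_HZ',
        'client': 'pc',
        # ids are comma-separated; urlencode escapes each comma as %2C
        'productIds': ','.join(product_ids),
        'scene': 'search',
        'standby_id': 'nature',
    }
    return BASE_URL + '?' + urlencode(params)


url = build_detail_url(['6918324165453150280', '6918256118899745105'])
print(url)
```

Unlike the captured URLs, this sketch does not append a trailing comma after the last id.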

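3. Scraping in practice

3.1 Scraping the product ids

To support pagination, look for the parameter that controls paging. The first page holds 120 items, and its pageOffset parameter is 0.

On the second page pageOffset is 120, on the third it is 240, and so on: each page adds 120, while the other parameters stay almost unchanged.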
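The relationship between page number and pageOffset can be written down directly. The sketch below assumes 1-based page numbers and a fixed page size of 120 (the batchSize value in the rank URL); the helper name `page_offsets` is hypothetical.

```python
PAGE_SIZE = 120  # items per page, matching batchSize=120 in the rank URL


def page_offsets(first_page, last_page, page_size=PAGE_SIZE):
    """Return the pageOffset values for 1-based pages first_page..last_page."""
    return [(page - 1) * page_size for page in range(first_page, last_page + 1)]


print(page_offsets(1, 3))  # [0, 120, 240]
```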

3.2 Building the URL for the product id request

The request code is therefore as follows:

import requests
import json
headers = {
    'Cookie': 'vip_province_name=%E6%B2%B3%E5%8D%97%E7%9C%81; vip_city_name=%E4%BF%A1%E9%98%B3%E5%B8%82; vip_city_code=104101115; vip_wh=VIP_HZ; vip_ipver=31; user_class=a; mars_sid=ff7be68ad4dc97e589a1673f7154c9f9; VipUINFO=luc%3Aa%7Csuc%3Aa%7Cbct%3Ac_new%7Chct%3Ac_new%7Cbdts%3A0%7Cbcts%3A0%7Ckfts%3A0%7Cc10%3A0%7Crcabt%3A0%7Cp2%3A0%7Cp3%3A1%7Cp4%3A0%7Cp5%3A0%7Cul%3A3105; mars_pid=0; visit_id=98C7BA95D1CA0C0E518537BD0B4ABEA0; vip_tracker_source_from=; pg_session_no=5; mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375',
    'Referer': 'https://category.vip.com/suggest.php?keyword=%E6%8A%A4%E8%82%A4&ff=235|12|1|1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
n = 1  # number of pages to request; could be read with input() instead
for num in range(120,(n+1)*120,120):  # starts at the second page (offset 120); set the first argument to 0 to start at page 1
    url = f'https://mapi.vip.com/vips-mobile/rest/shopping/pc/search/product/rank?callback=getMerchandiseIds&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&standby_id=nature&keyword=%E6%8A%A4%E8%82%A4%E5%A5%97%E8%A3%85&lv3CatIds=&lv2CatIds=&lv1CatIds=&brandStoreSns=&props=&priceMin=&priceMax=&vipService=&sort=0&pageOffset={num}&channelId=1&gPlatform=PC&batchSize=120&_=1600158865435'
    html = requests.get(url,headers=headers)
    print(html.text)

 

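The output confirms that the product id data is retrieved successfully.

3.3 Converting the product id data and verifying the count

Next, parse the JSON. The response is JSONP, that is, the JSON payload is wrapped in a call to the callback function, so the wrapper has to be stripped before the payload can be loaded into a Python object: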

import json

# note: the code below runs inside the for loop
start = html.text.index('{')
end = html.text.index('})')+1
json_data = json.loads(html.text[start:end])
print(json_data)

 

The output contains the product ids we are after.
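The index-based slicing above relies on the first occurrence of `})` marking the end of the payload, which would cut the data short if that character pair ever appeared inside a string value. A slightly more defensive sketch uses a regular expression to strip the outermost callback wrapper instead. The helper name `unwrap_jsonp` and the sample string are hypothetical; the sample only mimics the data, products and pid fields used in this article.

```python
import json
import re


def unwrap_jsonp(text):
    """Extract and parse the JSON payload from 'callbackName({...})'."""
    # Everything between the first '(' and the last ')' is the payload.
    match = re.search(r'^[^(]*\((.*)\)\s*;?\s*$', text.strip(), re.DOTALL)
    if match is None:
        raise ValueError('response does not look like JSONP')
    return json.loads(match.group(1))


sample = 'getMerchandiseIds({"data": {"products": [{"pid": "123"}]}})'
print(unwrap_jsonp(sample)['data']['products'])  # [{'pid': '123'}]
```

Inside the loop this would replace the three slicing lines with `json_data = unwrap_jsonp(html.text)`.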


Check that this is the full set of products, that is, that the number of ids retrieved (the pid field) equals 120:

# also inside the for loop
print(json_data['data']['products'])
print('')
print(len(json_data['data']['products']))

 

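The count checks out. Note that the first print outputs a list of dictionaries.

3.4 Fetching the product details

Now loop over the ids and request each product. To build product_url, delete all the ids from the productIds value of the captured URL and fill the empty slot with the format method: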

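# inside the for loop above
product_ids = json_data['data']['products']  # list of dicts, each with a 'pid' key
for product_id in product_ids:
    print('product id',product_id['pid'])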
    product_url = 'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets3&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds={}%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600164018137'.format(product_id['pid'])
    product_html = requests.get(product_url,headers = headers)
    print(product_html.text)

 

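Only part of the output is shown.

As with the product id response, the detail data has to be converted before anything can be extracted from it, for example the product name, brand and price: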

# shown here for the first 10 products as an example
product_start = product_html.text.index('{')
product_end = product_html.text.index('})')+1
product_json_data = json.loads(product_html.text[product_start:product_end])
product_info_data = product_json_data['data']['products'][0]
# print(product_info_data)
product_title = product_info_data['title']
product_brand = product_info_data['brandShowName']
product_price = product_info_data['price']['salePrice']
print('Product name: {}, brand: {}, discounted price: {}'.format(product_title,product_brand,product_price))

 

The fields are retrieved correctly. The title, brand and sale price are used as examples here; more detailed fields are available in the same record.
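Direct indexing as above raises a KeyError as soon as one product lacks a field. A minimal sketch of a more tolerant extraction, assuming the same three keys (title, brandShowName and price.salePrice), is shown below. The helper name `extract_fields` is hypothetical, and the sample record contains placeholder values rather than real data.

```python
def extract_fields(product):
    """Pull title, brand and sale price out of one product dict, tolerating gaps."""
    price = product.get('price') or {}
    return {
        'title': product.get('title', ''),
        'brand': product.get('brandShowName', ''),
        'sale_price': price.get('salePrice', ''),
    }


# Hypothetical record shaped like the fields used above; values are placeholders.
sample_product = {
    'title': 'example title',
    'brandShowName': 'example brand',
    'price': {'salePrice': '99'},
}
print(extract_fields(sample_product))
print(extract_fields({'title': 'no price listed'}))
```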


The last step is to write the data to a local file:

with open('vip.txt','a+',encoding = 'utf-8') as f:
    f.write('Product name: {}, brand: {}, discounted price: {}\n'.format(product_title,product_brand,product_price))

 

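The data has now been scraped and saved locally.

4. Full code

The whole process can be wrapped in functions, and the data can also be stored as csv or xlsx. Only plain-text (txt) storage is shown here.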
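For the csv option mentioned above, a minimal sketch with the standard-library csv module could look like this. The column names, the helper name `save_rows` and the output file name vip.csv are assumptions chosen for illustration, and the sample row holds placeholder values.

```python
import csv

FIELDS = ['title', 'brand', 'sale_price']


def save_rows(rows, path):
    """Write a list of dicts to a CSV file with a header row."""
    # utf-8-sig writes a BOM, which helps Excel detect the encoding of Chinese text
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


rows = [{'title': 'example title', 'brand': 'example brand', 'sale_price': '99'}]
save_rows(rows, 'vip.csv')
```

Collecting the rows in a list and writing them once per page also avoids reopening the output file for every product.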

import requests
import json

headers = {
    'Cookie': 'vip_province_name=%E6%B2%B3%E5%8D%97%E7%9C%81; vip_city_name=%E4%BF%A1%E9%98%B3%E5%B8%82; vip_city_code=104101115; vip_wh=VIP_HZ; vip_ipver=31; user_class=a; mars_sid=ff7be68ad4dc97e589a1673f7154c9f9; VipUINFO=luc%3Aa%7Csuc%3Aa%7Cbct%3Ac_new%7Chct%3Ac_new%7Cbdts%3A0%7Cbcts%3A0%7Ckfts%3A0%7Cc10%3A0%7Crcabt%3A0%7Cp2%3A0%7Cp3%3A1%7Cp4%3A0%7Cp5%3A0%7Cul%3A3105; mars_pid=0; visit_id=98C7BA95D1CA0C0E518537BD0B4ABEA0; vip_tracker_source_from=; pg_session_no=5; mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375',
    'Referer': 'https://category.vip.com/suggest.php?keyword=%E6%8A%A4%E8%82%A4&ff=235|12|1|1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}

n = 1  # number of pages to scrape
for num in range(0,n*120,120):
    url = f'https://mapi.vip.com/vips-mobile/rest/shopping/pc/search/product/rank?callback=getMerchandiseIds&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&standby_id=nature&keyword=%E6%8A%A4%E8%82%A4%E5%A5%97%E8%A3%85&lv3CatIds=&lv2CatIds=&lv1CatIds=&brandStoreSns=&props=&priceMin=&priceMax=&vipService=&sort=0&pageOffset={num}&channelId=1&gPlatform=PC&batchSize=120&_=1600158865435'
    html = requests.get(url,headers=headers)
    # print(html.text)

    start = html.text.index('{')
    end = html.text.index('})')+1
    json_data = json.loads(html.text[start:end])
    product_ids = json_data['data']['products']
    for product_id in product_ids:
        print('product id',product_id['pid'])
        product_url = 'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets3&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds={}%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600164018137'.format(product_id['pid'])
        product_html = requests.get(product_url,headers = headers)
        product_start = product_html.text.index('{')
        product_end = product_html.text.index('})')+1
        product_json_data = json.loads(product_html.text[product_start:product_end])
        product_info_data = product_json_data['data']['products'][0]
        # print(product_info_data)
        product_title = product_info_data['title']
        product_brand = product_info_data['brandShowName']
        product_price = product_info_data['price']['salePrice']
        print('Product name: {}, brand: {}, discounted price: {}'.format(product_title,product_brand,product_price))
        with open('vip.txt','a+',encoding = 'utf-8') as f:
            f.write('Product name: {}, brand: {}, discounted price: {}\n'.format(product_title,product_brand,product_price))

 

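With n = 4, running the code again produces a vip.txt whose line count (checked by opening the file in Sublime Text) is exactly the total number of products on four pages. That completes the Vipshop product scraper.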


 
