Scraping approach:
1. Enter a keyword in the JD.com home-page search box; "电脑" (computer) is used as the example here.
2. Scrape the 600 products across the first ten result pages, collecting for each one: product name, price, shop link, sample image, description, shop name, and any current promotions (e.g. free shipping, flash sale).
3. While scraping the result pages, grab each product's id, follow it to the product detail page, and scrape 50 of the product's comments, its comment tags, and the total number of reviews along with the positive, negative, and neutral review counts.
4. Store each product's information as a JSON object and write it to a local txt file.
5. Process the data to compute each shop's sales volume, total revenue, and average price, sort the rankings, write the sorted results to local txt files, and display them with ECharts (a sketch of the data export follows the note below).
6. For products whose positive-review rate exceeds 70%, collect and tally the buyers' comment tags to see which qualities of this product category are most appreciated, and generate a word-cloud file (a sketch follows the script at the end).
7. Problems encountered while scraping:
7.1: Some products' information cannot be retrieved from the search pages: when a discounted product is in a flash sale, its listing is highlighted in red, so its CSS class differs from the others.
Solution: classify the listings by observing how many class variants appear on the page and run a separate extraction pass for each class, so that no data is lost (see the first sketch after this list).
7.2: On the product detail page, user comments are loaded dynamically and have to be fetched separately, and the response is not standard JSON but a jQuery-style wrapper: callback name + product id + ({payload}).
Solution: after fetching the response, slice the string to strip everything outside the braces; only then can json.loads() parse it (see the second sketch after this list).
7.3: The statistics are polluted by dirty data. For example, some flash-sale computers are listed at 100 yuan, but that is only a deposit, so the value cannot be used in price calculations and must be discarded.
Solution: sanity-check every computer's price and drop implausible values, i.e. anything under 1,000 or over 50,000 yuan (see the third sketch after this list).
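A minimal sketch of the class-variant handling from 7.1, matching on the exact class attribute. The two class strings are the ones the full script below uses; any further variants spotted on the page would simply be added to ITEM_CLASSES.

from bs4 import BeautifulSoup

ITEM_CLASSES = {"gl-item", "gl-item gl-item-presell"}  # normal vs. flash-sale listings

def extract_items(page_html):
    """Collect product <li> nodes whichever class variant they carry."""
    soup = BeautifulSoup(page_html, "html.parser")
    items = []
    for li in soup.find_all("li"):
        # Match the exact class attribute so every listing, including the
        # red-styled flash-sale variant, is collected exactly once.
        if " ".join(li.get("class", [])) in ITEM_CLASSES:
            items.append(li)
    return items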
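A sketch of the unwrapping from 7.2. Slicing between the first "(" and the last ")" is safer than deleting characters globally, since comment bodies can themselves contain parentheses or spaces; fetchJSON_comment98 is the callback name the comment endpoint returns, and unwrap_jsonp is a hypothetical helper name.

import json

def unwrap_jsonp(text):
    """Strip a wrapper like fetchJSON_comment98({...}); and parse the payload."""
    start = text.index("(") + 1   # the first "(" opens the payload
    end = text.rindex(")")        # the last ")" closes it
    return json.loads(text[start:end])

data = unwrap_jsonp('fetchJSON_comment98({"productCommentSummary": {"commentCount": 42}});')
print(data["productCommentSummary"]["commentCount"])  # 42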
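Last, a sketch of the price sanity check from 7.3, with the thresholds stated in the solution; the helper name is_plausible_price is hypothetical.

def is_plausible_price(raw_price):
    """Return the price as a float if it looks like a real computer price, else None."""
    try:
        price = float(str(raw_price).strip())
    except ValueError:
        return None                        # non-numeric junk
    if 1000 <= price <= 50000:             # the thresholds from the solution above
        return price
    return None

assert is_plausible_price("5999.00") == 5999.0
assert is_plausible_price("100") is None   # a deposit, not a full price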
Note: the crawler written here can scrape any JD product category and stores the data locally once a run finishes. If it were wrapped with Django, the program would need nothing but a single search box: type in the product you want to look up and it generates the information file, the price and sales rankings, and the ECharts charts in one click, all downloadable to your machine. So whenever we want to buy something, this program lays the product information out clearly, instead of us inspecting items one by one and comparing comments and shop ratings all over the place, which saves a great deal of trouble.
Of course, this program is for learning purposes only, not for commercial use.
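Before the script itself, one sketch for the ECharts display mentioned in step 5. It assumes, hypothetically, that the chart page reads a plain categories/values JSON file; the function name export_for_echarts and the file name echarts_data.json are illustrative only.

import json

def export_for_echarts(ranking, path="echarts_data.json"):
    """ranking: a sorted list of (shop_name, value) pairs, e.g. a revenue ranking."""
    payload = {
        "categories": [name for name, _ in ranking],  # x-axis labels
        "values": [value for _, value in ranking],    # bar heights
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)

export_for_echarts([("ShopA", 120000.0), ("ShopB", 98000.0)])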
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2020/5/7 11:17
# @Author : dddchongya
# @Site :
# @File : ComputerFromJD.py
# @Software: PyCharm
import json
import os

import requests
from bs4 import BeautifulSoup as bst

HEADERS = {
    # Identifies what device and browser the request claims to come from.
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
}


def GetComment(product_id):
    """Fetch comment data for one product: hot tags, review counts, comment texts."""
    url = "https://club.jd.com/comment/productPageComments.action"
    param = {
        'callback': 'fetchJSON_comment98',
        'productId': product_id,
        'score': 0,
        'sortType': 5,
        'page': 1,
        'pageSize': 10,
        'isShadowSku': 0,
        'rid': 0,
        'fold': 1,
    }
    label = []
    comments = []
    commentnumber = {}
    for page in range(1, 6):  # five pages x pageSize 10 = the 50 comments of step 3
        param["page"] = page
        res = requests.get(url, params=param, headers=HEADERS)
        text = res.text
        # The endpoint answers with JSONP: fetchJSON_comment98({...}); -- slice the
        # payload out from between the first "(" and the last ")" (see 7.2).
        jsondata = json.loads(text[text.index("(") + 1:text.rindex(")")])
        if page == 1:
            # The hot tags only need to be read once ...
            for tag in jsondata["hotCommentTagStatistics"]:
                label.append(tag["name"] + ":" + str(tag["count"]))
            # ... and so do the review counts.
            summary = jsondata["productCommentSummary"]
            commentnumber["commentCount"] = summary["commentCount"]
            commentnumber["defaultGoodCount"] = summary["defaultGoodCount"]
            commentnumber["goodCount"] = summary["goodCount"]
            commentnumber["poorCount"] = summary["poorCount"]
            commentnumber["generalCount"] = summary["generalCount"]
            commentnumber["afterCountStr"] = summary["afterCount"]
            commentnumber["showCount"] = summary["showCount"]
        for comment in jsondata["comments"]:
            comments.append(comment["content"].replace("\n", ""))
    return {"commentnumber": commentnumber, "label": label, "comments": comments}


def GetMoreInformation(product_id):
    """Fetch the product detail page (an unused stub in this script)."""
    url = "https://item.jd.com/" + product_id + ".html"
    res = requests.get(url, headers=HEADERS)
    return bst(res.content, "html.parser")


def GetGoodResone(LsComputer):
    """Tally hot-comment tags across products whose positive-review rate exceeds 70%."""
    labells = []
    labellist = {}
    for item in LsComputer:
        counts = item['comments']['commentnumber']
        if counts['commentCount'] == 0:
            continue  # products without reviews cannot be rated
        positive = counts['goodCount'] + counts['defaultGoodCount']
        if positive / float(counts['commentCount']) > 0.7:
            labells.append(item['comments']["label"])
    # Each entry looks like "name:count"; accumulate the counts per tag name.
    for tags in labells:
        for tag in tags:
            name, count = tag.split(":")
            labellist[name] = labellist.get(name, 0) + float(count)
    result = sorted(labellist.items(), key=lambda x: x[1], reverse=False)
    with open(os.path.join(os.getcwd(), 'label_ranking_positive_over_70pct.txt'),
              'w', encoding="utf-8") as f:
        for row in result:
            f.write(str(row))
            f.write('\r\n')
    return labellist
def GetMaxSalesShop(LsComputer):
    """Rank shops by revenue, sales volume, and average price, and save each ranking."""
    shopcount = {}      # revenue: review count x price, summed per shop
    shopsalecount = {}  # sales volume, with the review count as a proxy for sales
    shopprice = {}      # every accepted price per shop, for the average
    for item in LsComputer:
        shop = item["ShopName"]
        shopcount.setdefault(shop, 0)
        shopsalecount.setdefault(shop, 0)
        shopprice.setdefault(shop, [])
        raw = item["Price"].replace("\n", "").replace(" ", "").replace("\t", "")
        if len(raw) < 5:
            continue
        try:
            price = float(raw[0:-3])  # drop the ".00" decimals
        except ValueError:
            continue
        if price < 1000 or price > 50000:
            continue  # the sanity filter of 7.3: deposits and other dirty prices
        sales = item["comments"]["commentnumber"]["commentCount"]
        shopcount[shop] += sales * price
        shopsalecount[shop] += sales
        shopprice[shop].append(price)
    # Average price per shop.
    shopprice2 = {}
    for shop, prices in shopprice.items():
        if len(prices) != 0:
            shopprice2[shop] = sum(prices) / len(prices)

    def save_ranking(data, title, filename):
        result = sorted(data.items(), key=lambda x: x[1], reverse=False)
        print()
        print(title)
        for row in result:
            print(row)
        with open(os.path.join(os.getcwd(), filename), 'w', encoding="utf-8") as f:
            for row in result:
                f.write(str(row))
                f.write('\r\n')

    save_ranking(shopcount, "Revenue ranking:", 'revenue_ranking.txt')
    save_ranking(shopsalecount, "Sales volume ranking:", 'sales_volume_ranking.txt')
    save_ranking(shopprice2, "Average price ranking:", 'average_price_ranking.txt')


# Any JD search link works here; the keyword is 电脑 (computer).
BASE_URL = ('https://search.jd.com/Search?keyword=%E7%94%B5%E8%84%91&enc=utf-8'
            '&qrst=1&rt=1&stop=1&vt=2&wq=%E7%94%B5%E8%84%91&page=')

# The two list-item class variants of 7.1: normal items and flash-sale items.
ITEM_CLASSES = {"gl-item", "gl-item gl-item-presell"}


def parse_item(item):
    """Extract one product's fields from its <li> node on the search page."""
    CustomUrl = item.find("div", class_="p-img").find("a").get("href")
    if "https:" not in str(CustomUrl):
        CustomUrl = "https:" + CustomUrl
    # The product id is encoded in the price tag's class, e.g. "J_100012043978".
    product_id = item.find("div", class_="p-price").find("strong").get("class")
    product_id = product_id[0].replace("J", "").replace("_", "")
    # Comment data comes from a separate request (see GetComment above).
    Comments = GetComment(product_id)
    ImgUrl = "https:" + str(
        item.find("div", class_="p-img").find("img").get("source-data-lazy-img"))
    Price = item.find("div", class_="p-price").find("i").getText()
    Describe = item.find("div", class_="p-name p-name-type-2").find("em").getText()
    # The first item on a page may lack a shop link, leaving ShopName empty.
    ShopName = item.find("div", class_="p-shop").find("a")
    if ShopName is not None:
        ShopName = str(ShopName.getText())
    # A product can carry several promotion icons (free shipping, flash sale, ...).
    BusinessMode = [i.getText() for i in item.find("div", class_="p-icons").findAll("i")]
    return {
        "CustomUrl": CustomUrl,
        "ImgUrl": ImgUrl,
        "Price": Price,
        "Describe": Describe,
        "ShopName": ShopName,
        "BusinessMode": BusinessMode,
        "comments": Comments,
    }


LsComputer = []
for k in range(10):
    # JD numbers its result pages with odd page parameters: 1, 3, ..., 19 cover
    # the first ten pages of step 2.
    res = requests.get(BASE_URL + str(2 * k + 1), headers=HEADERS)
    page = bst(res.content, "html.parser")
    for item in page.findAll("li"):
        # Match the exact class attribute so both variants of 7.1 are collected
        # and nothing is counted twice.
        if " ".join(item.get("class", [])) in ITEM_CLASSES:
            LsComputer.append(parse_item(item))

# Step 4: write every product to a local txt file as pretty-printed JSON.
with open(os.path.join(os.getcwd(), 'json.txt'), 'w', encoding="utf-8") as f:
    for item in LsComputer:
        f.write(json.dumps(item, indent=4, ensure_ascii=False))

GetMaxSalesShop(LsComputer)
GetGoodResone(LsComputer)
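The script ends by writing the label ranking to a text file; the word-cloud file from step 6 still has to be produced. A minimal sketch using the third-party wordcloud package (pip install wordcloud), fed with the tag-frequency dict GetGoodResone returns; the font path and the output name label_cloud.png are hypothetical, and the font must cover CJK glyphs or Chinese labels will render as boxes.

from wordcloud import WordCloud

def make_label_cloud(labellist, out_path="label_cloud.png"):
    """labellist: dict of tag name -> accumulated count, as built by GetGoodResone."""
    cloud = WordCloud(
        font_path="C:/Windows/Fonts/simhei.ttf",  # hypothetical path; any CJK font works
        width=800,
        height=600,
        background_color="white",
    ).generate_from_frequencies(labellist)
    cloud.to_file(out_path)

# For example, after the run above:
# make_label_cloud(GetGoodResone(LsComputer))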