Scraping JD.com product information with a Python crawler


Crawling approach:
1. Enter a keyword in the JD.com home-page search box; "電腦" (computer) is used as the example.
2. Crawl ten pages of search results, 600 products in total, collecting the product name, price, shop link, sample image, description, shop name, and current promotions (e.g. free shipping, flash sale).
3. While crawling the search pages, obtain each product's id, use it to reach the product detail page, and crawl 50 comments per product along with the comment tags and the total, positive, negative, and neutral comment counts.
4. Store each product's information as JSON and write it to a local txt file.
5. Aggregate the data to compute each shop's sales volume, total revenue, and average price; sort each ranking, write it to a local txt file, and display the data with ECharts.
6. For products whose positive-review rate exceeds 70%, collect and count the buyers' review tags to find which qualities of this product category are most appreciated, and generate a word-cloud file.
7. Problems encountered while crawling:
7.1: Some products on the search page cannot be extracted directly: flash-sale items are highlighted in red, so their CSS class differs from the normal one.
Solution: enumerate the class variants that appear on the page and run a separate extraction loop per class, so no data is lost.
7.2: On the product detail page the user comments are loaded dynamically and must be fetched separately, and the response is not standard JSON but a JSONP wrapper of the form jQuery + product id + ({content}).
Solution: after fetching the response, strip everything outside the outer {} braces so that json.loads() can parse it.
7.3: The statistics contain dirty data: some flash-sale computers are listed at 100 yuan, but that is only a deposit, so the value must be discarded rather than used in calculations.
Solution: sanity-check each price and drop values outside a plausible range, e.g. below 1,000 or above 50,000 yuan.
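The class-variant handling in 7.1 can be sketched like this. The HTML snippet below is a hypothetical, simplified stand-in for a JD search page; the point is that one CSS selector per class variant keeps flash-sale items from being silently dropped:

```python
from bs4 import BeautifulSoup

# hypothetical snippet: a regular item and a flash-sale item carrying an extra class
page = """
<ul>
  <li class="gl-item"><em>regular laptop</em></li>
  <li class="gl-item gl-item-presell"><em>flash-sale laptop</em></li>
</ul>
"""
soup = BeautifulSoup(page, "html.parser")

names = []
# one selector per class variant; :not() keeps the two groups disjoint
for selector in ("li.gl-item:not(.gl-item-presell)", "li.gl-item.gl-item-presell"):
    for item in soup.select(selector):
        names.append(item.find("em").get_text())
print(names)  # ['regular laptop', 'flash-sale laptop']
```

Using CSS selectors with `:not()` also avoids double-counting: a plain `find_all("li", {"class": "gl-item"})` would match the flash-sale item too, since it carries both classes.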
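The fixes for 7.2 and 7.3 can be sketched as two small helpers (the function names are illustrative, not part of the original script):

```python
import json

def strip_jsonp(raw):
    """Extract the JSON object from a JSONP response such as
    'fetchJSON_comment98({"comments": []});'."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in response")
    return json.loads(raw[start:end + 1])

def is_sane_price(price, low=1000, high=50000):
    """Drop deposit-only or absurd listings, e.g. a 100-yuan 'price'
    that is really a pre-sale deposit."""
    try:
        value = float(price)
    except ValueError:
        return False
    return low <= value <= high

data = strip_jsonp('fetchJSON_comment98({"productCommentSummary": {"commentCount": 42}});')
print(data["productCommentSummary"]["commentCount"])  # 42
print(is_sane_price("100"))   # False: a deposit, not the real price
print(is_sane_price("5999"))  # True
```

Slicing between the first `{` and the last `}` is safer than chained `str.replace` calls, which would also delete parentheses and semicolons appearing inside the comment text itself.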

Note: the crawler written here works for any JD.com product and stores the crawled data locally. With Django, the program could be driven by a single search box: enter the product you want to look up and it generates the information file, the price and sales rankings, and the ECharts visualizations in one click, all downloadable. So whenever we want to buy something, this program can present the product information clearly, instead of us checking items one by one and comparing reviews and shop ratings by hand.
Of course, this program is for learning purposes only, not for commercial use.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2020/5/7 11:17
# @Author  : dddchongya
# @Site    : 
# @File    : ComputerFromJD.py
# @Software: PyCharm
import requests
from bs4 import BeautifulSoup as bst
import json
import os
def GetComment(id):
    param = {
        'callback': 'fetchJSON_comment98',
        'productId': id,
        'score': 0,
        'sortType': 5,
        'page': 1,
        'pageSize': 10,
        'isShadowSku': 0,
        'rid': 0,
        'fold': 1,
    }
    url="https://club.jd.com/comment/productPageComments.action"
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
        # identifies the device and browser the request claims to come from
    }
    CommentLs = {}
    first_page = True
    label = []
    comments = []
    commentnumber = {}
    for i in range(0, 5):  # five pages of 10 comments each (JD comment pages start at 0)
        param["page"] = i
        res_songs = requests.get(url, params=param, headers=headers)
        raw = res_songs.text
        # the response is JSONP ("fetchJSON_comment98({...});"): keep only the JSON object
        jsondata = json.loads(raw[raw.find("{"): raw.rfind("}") + 1])
        if first_page:
            # the tags only need to be fetched once
            hotCommentTagStatistics = jsondata["hotCommentTagStatistics"]
            for j in hotCommentTagStatistics:
                label.append(j["name"] + ":" + str(j["count"]))
            # likewise for the comment counts
            productCommentSummary = jsondata["productCommentSummary"]
            commentnumber["commentCount"] = productCommentSummary["commentCount"]
            commentnumber["defaultGoodCount"] = productCommentSummary["defaultGoodCount"]
            commentnumber["goodCount"] = productCommentSummary["goodCount"]
            commentnumber["poorCount"] = productCommentSummary["poorCount"]
            commentnumber["generalCount"] = productCommentSummary["generalCount"]
            commentnumber["afterCountStr"] = productCommentSummary["afterCount"]
            commentnumber["showCount"] = productCommentSummary["showCount"]
            first_page = False

        comment = jsondata["comments"]
        for j in comment:
            comments.append(j["content"].replace("\n", ""))
    CommentLs["commentnumber"] = commentnumber
    CommentLs["label"] = label
    CommentLs["comments"] = comments
    return CommentLs


def GetMoreInformation(id):
    url = "https://item.jd.com/" + id + ".html"
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
        # identifies the device and browser the request claims to come from
    }
    res = requests.get(url, headers=headers)
    # detail-page parsing is left as a stub; return the parsed page for later use
    return bst(res.content, "html.parser")

def GetGoodResone(LsComputer):
    labells = []
    label = set()
    labellist = {}
    # keep only products with a positive-review rate above 70%
    for i in LsComputer:
        summary = i['comments']['commentnumber']
        if summary['commentCount'] and \
                (summary['goodCount'] + summary['defaultGoodCount']) / float(summary['commentCount']) > 0.7:
            labells.append(i['comments']["label"])
    # collect the distinct tag names ("name:count" strings)
    for i in labells:
        for j in i:
            label.add(j.split(":")[0])
    for i in label:
        labellist[i] = 0
    # sum the counts per tag across all qualifying products
    for j in labells:
        for k in j:
            labellist[k.split(":")[0]] = labellist[k.split(":")[0]] + float(k.split(":")[1])
    result = sorted(labellist.items(), key=lambda x: x[1])
    with open(os.path.join(os.getcwd(), 'tag_ranking_good_rate_over_70.txt'), 'w', encoding="utf-8") as f:
        for i in result:
            f.write(str(i))
            f.write('\r\n')

def GetMaxSalesShop(LsComputer):
    shop=set()
    for i in LsComputer:
        shop.add(i["ShopName"])
    shopcount={}
    shopsalecount={}
    shopprice={}
    for i in shop:
        shopcount[i]=0
        shopsalecount[i]=0
        shopprice[i] = []
    for i in shop:
        for j in LsComputer:
            if j["ShopName"] == i:
                if len(j["Price"]) >= 5:  # skip malformed or empty price strings
                    price = j["Price"][0:-3].replace("\n", "").replace(" ", "").replace("\t", "")
                    # revenue: comment count (used as a proxy for sales) * price
                    shopcount[i] = shopcount[i] + j["comments"]["commentnumber"]["commentCount"] * float(price)
                    # collect prices to compute the average later
                    shopprice[i].append(price)
                    # sales volume
                    shopsalecount[i] = shopsalecount[i] + j["comments"]["commentnumber"]["commentCount"]
    shopprice2 = {}
    for i in shopprice:
        if len(shopprice[i]) != 0:
            total = 0
            for j in shopprice[i]:
                total = total + float(j)
            shopprice2[i] = total / len(shopprice[i])

    print()
    result = sorted(shopcount.items(), key=lambda x: x[1])
    print("Revenue ranking:")
    for i in result:
        print(i)
    with open(os.path.join(os.getcwd(), 'revenue_ranking.txt'), 'w', encoding="utf-8") as f:
        for i in result:
            f.write(str(i))
            f.write('\r\n')

    print()
    result = sorted(shopprice2.items(), key=lambda x: x[1])
    print("Average price ranking:")
    for i in result:
        print(i)
    with open(os.path.join(os.getcwd(), 'average_price_ranking.txt'), 'w', encoding="utf-8") as f:
        for i in result:
            f.write(str(i))
            f.write('\r\n')

    print()
    result = sorted(shopsalecount.items(), key=lambda x: x[1])
    print("Sales volume ranking:")
    for i in result:
        print(i)
    with open(os.path.join(os.getcwd(), 'sales_volume_ranking.txt'), 'w', encoding="utf-8") as f:
        for i in result:
            f.write(str(i))
            f.write('\r\n')

# any JD search URL can be used here
url = 'https://search.jd.com/Search?keyword=%E7%94%B5%E8%84%91&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E7%94%B5%E8%84%91&page='
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
    # identifies the device and browser the request claims to come from
}
# spoofed request headers
LsComputer = []
# first pass: flash-sale (presell) items, which carry a different class
for k in range(1, 11):
    page_url = url + str(2 * k - 1)  # JD uses odd page numbers: 1, 3, 5, ...
    res = requests.get(page_url, headers=headers)
    html = bst(res.content, "html.parser")
    items = html.findAll("li", {"class": "gl-item gl-item-presell"})
    for item in items:
        ComputerInformation = {}
        CustomUrl = item.find("div", {"class": "p-img"}).find("a").get("href")
        if "https:" not in str(CustomUrl):
            CustomUrl = "https:" + CustomUrl
        # the product id is embedded in the price tag's class, e.g. "J_100012345"
        id = item.find("div", {"class": "p-price"}).find("strong").get("class")
        id = id[0].replace("J", "").replace("_", "")
        # fetch the comment data
        Comments = GetComment(id)

        ImgUrl = "https:" + str(item.find("div", {"class": "p-img"}).find("img").get("source-data-lazy-img"))

        Price = str(item.find("div", {"class": "p-price"}).find("i"))[3:-4]

        Describe = str(item.find("div", {"class": "p-name p-name-type-2"}).find("em").getText())

        # the first item on a page may have no shop name
        ShopName = item.find("div", {"class": "p-shop"}).find("a")
        if ShopName is not None:
            ShopName = str(ShopName.getText())

        # a shop may carry several promotion icons
        Mode = item.find("div", {"class": "p-icons"}).findAll("i")
        BusinessMode = []
        for i in Mode:
            BusinessMode.append(i.getText())

        ComputerInformation["CustomUrl"] = CustomUrl
        ComputerInformation["ImgUrl"] = ImgUrl
        ComputerInformation["Price"] = Price
        ComputerInformation["Describe"] = Describe
        ComputerInformation["ShopName"] = ShopName
        ComputerInformation["BusinessMode"] = BusinessMode
        ComputerInformation["comments"] = Comments
        LsComputer.append(ComputerInformation)


# second pass: regular items
for k in range(1, 11):
    page_url = url + str(2 * k - 1)  # JD uses odd page numbers: 1, 3, 5, ...
    res = requests.get(page_url, headers=headers)
    html = bst(res.content, "html.parser")
    items = html.findAll("li", {"class": "gl-item"})
    for item in items:
        ComputerInformation = {}
        CustomUrl = item.find("div", {"class": "p-img"}).find("a").get("href")
        if "https:" not in str(CustomUrl):
            CustomUrl = "https:" + CustomUrl
        # the product id is embedded in the price tag's class, e.g. "J_100012345"
        id = item.find("div", {"class": "p-price"}).find("strong").get("class")
        id = id[0].replace("J", "").replace("_", "")
        # fetch the comment data
        Comments = GetComment(id)

        ImgUrl = "https:" + str(item.find("div", {"class": "p-img"}).find("img").get("source-data-lazy-img"))

        Price = str(item.find("div", {"class": "p-price"}).find("i"))[3:-4]

        Describe = str(item.find("div", {"class": "p-name p-name-type-2"}).find("em").getText())

        # the first item on a page may have no shop name
        ShopName = item.find("div", {"class": "p-shop"}).find("a")
        if ShopName is not None:
            ShopName = str(ShopName.getText())

        # a shop may carry several promotion icons
        Mode = item.find("div", {"class": "p-icons"}).findAll("i")
        BusinessMode = []
        for i in Mode:
            BusinessMode.append(i.getText())

        ComputerInformation["CustomUrl"] = CustomUrl
        ComputerInformation["ImgUrl"] = ImgUrl
        ComputerInformation["Price"] = Price
        ComputerInformation["Describe"] = Describe
        ComputerInformation["ShopName"] = ShopName
        ComputerInformation["BusinessMode"] = BusinessMode
        ComputerInformation["comments"] = Comments
        LsComputer.append(ComputerInformation)

# write the data to a file
with open(os.path.join(os.getcwd(), 'json.txt'), 'w', encoding="utf-8") as f:
    for i in LsComputer:
        f.write(json.dumps(i, indent=4, ensure_ascii=False))

GetMaxSalesShop(LsComputer)

 

