Scraping approach:
1. Enter a keyword in the JD.com home-page search box; "电脑" (computer) is used as the example here.
2. Scrape the 600 products across the first ten result pages, collecting for each one: product name, price, shop link, sample image, description, shop name, and any current promotions (e.g. free shipping, flash sale).
3. While scraping the result pages, grab each product's id, follow it to the product detail page, and scrape 50 of the product's comments, its comment tags, and the total number of reviews along with the positive, negative, and neutral review counts.
4. Store each product's information as a JSON object and write it to a local txt file.
5. Process the data to compute each shop's sales volume, total revenue, and average price, sort the rankings, write the sorted results to local txt files, and display them with ECharts (a sketch of the data export follows the note below).
6. For products whose positive-review rate exceeds 70%, collect and tally the buyers' comment tags to see which qualities of this product category are most appreciated, and generate a word-cloud file (a sketch follows the script at the end).
7. Problems encountered while scraping:
7.1: Some products' information cannot be retrieved from the search pages: when a discounted product is in a flash sale, its listing is highlighted in red, so its CSS class differs from the others.
Solution: classify the listings by observing how many class variants appear on the page and run a separate extraction pass for each class, so that no data is lost (see the first sketch after this list).
7.2: On the product detail page, user comments are loaded dynamically and have to be fetched separately, and the response is not standard JSON but a jQuery-style wrapper: callback name + product id + ({payload}).
Solution: after fetching the response, slice the string to strip everything outside the braces; only then can json.loads() parse it (see the second sketch after this list).
7.3: The statistics are polluted by dirty data. For example, some flash-sale computers are listed at 100 yuan, but that is only a deposit, so the value cannot be used in price calculations and must be discarded.
Solution: sanity-check every computer's price and drop implausible values, i.e. anything under 1,000 or over 50,000 yuan (see the third sketch after this list).
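A minimal sketch of the class-variant handling from 7.1, matching on the exact class attribute. The two class strings are the ones the full script below uses; any further variants spotted on the page would simply be added to ITEM_CLASSES.

from bs4 import BeautifulSoup

ITEM_CLASSES = {"gl-item", "gl-item gl-item-presell"}  # normal vs. flash-sale listings

def extract_items(page_html):
    """Collect product <li> nodes whichever class variant they carry."""
    soup = BeautifulSoup(page_html, "html.parser")
    items = []
    for li in soup.find_all("li"):
        # Match the exact class attribute so every listing, including the
        # red-styled flash-sale variant, is collected exactly once.
        if " ".join(li.get("class", [])) in ITEM_CLASSES:
            items.append(li)
    return items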
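A sketch of the unwrapping from 7.2. Slicing between the first "(" and the last ")" is safer than deleting characters globally, since comment bodies can themselves contain parentheses or spaces; fetchJSON_comment98 is the callback name the comment endpoint returns, and unwrap_jsonp is a hypothetical helper name.

import json

def unwrap_jsonp(text):
    """Strip a wrapper like fetchJSON_comment98({...}); and parse the payload."""
    start = text.index("(") + 1   # the first "(" opens the payload
    end = text.rindex(")")        # the last ")" closes it
    return json.loads(text[start:end])

data = unwrap_jsonp('fetchJSON_comment98({"productCommentSummary": {"commentCount": 42}});')
print(data["productCommentSummary"]["commentCount"])  # 42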
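Last, a sketch of the price sanity check from 7.3, with the thresholds stated in the solution; the helper name is_plausible_price is hypothetical.

def is_plausible_price(raw_price):
    """Return the price as a float if it looks like a real computer price, else None."""
    try:
        price = float(str(raw_price).strip())
    except ValueError:
        return None                        # non-numeric junk
    if 1000 <= price <= 50000:             # the thresholds from the solution above
        return price
    return None

assert is_plausible_price("5999.00") == 5999.0
assert is_plausible_price("100") is None   # a deposit, not a full price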
Note: the crawler written here can scrape any JD product category and stores the data locally once a run finishes. If it were wrapped with Django, the program would need nothing but a single search box: type in the product you want to look up and it generates the information file, the price and sales rankings, and the ECharts charts in one click, all downloadable to your machine. So whenever we want to buy something, this program lays the product information out clearly, instead of us inspecting items one by one and comparing comments and shop ratings all over the place, which saves a great deal of trouble.
Of course, this program is for learning purposes only, not for commercial use.
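Before the script itself, one sketch for the ECharts display mentioned in step 5. It assumes, hypothetically, that the chart page reads a plain categories/values JSON file; the function name export_for_echarts and the file name echarts_data.json are illustrative only.

import json

def export_for_echarts(ranking, path="echarts_data.json"):
    """ranking: a sorted list of (shop_name, value) pairs, e.g. a revenue ranking."""
    payload = {
        "categories": [name for name, _ in ranking],  # x-axis labels
        "values": [value for _, value in ranking],    # bar heights
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)

export_for_echarts([("ShopA", 120000.0), ("ShopB", 98000.0)])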
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2020/5/7 11:17
# @Author : dddchongya
# @Site :
# @File : ComputerFromJD.py
# @Software: PyCharm
import json
import os

import requests
from bs4 import BeautifulSoup as bst

HEADERS = {
    # Identifies what device and browser the request claims to come from.
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
}


def GetComment(product_id):
    """Fetch comment data for one product: hot tags, review counts, comment texts."""
    url = "https://club.jd.com/comment/productPageComments.action"
    param = {
        'callback': 'fetchJSON_comment98',
        'productId': product_id,
        'score': 0,
        'sortType': 5,
        'page': 1,
        'pageSize': 10,
        'isShadowSku': 0,
        'rid': 0,
        'fold': 1,
    }
    label = []
    comments = []
    commentnumber = {}
    for page in range(1, 6):  # five pages x pageSize 10 = the 50 comments of step 3
        param["page"] = page
        res = requests.get(url, params=param, headers=HEADERS)
        text = res.text
        # The endpoint answers with JSONP: fetchJSON_comment98({...}); -- slice the
        # payload out from between the first "(" and the last ")" (see 7.2).
        jsondata = json.loads(text[text.index("(") + 1:text.rindex(")")])
        if page == 1:
            # The hot tags only need to be read once ...
            for tag in jsondata["hotCommentTagStatistics"]:
                label.append(tag["name"] + ":" + str(tag["count"]))
            # ... and so do the review counts.
            summary = jsondata["productCommentSummary"]
            commentnumber["commentCount"] = summary["commentCount"]
            commentnumber["defaultGoodCount"] = summary["defaultGoodCount"]
            commentnumber["goodCount"] = summary["goodCount"]
            commentnumber["poorCount"] = summary["poorCount"]
            commentnumber["generalCount"] = summary["generalCount"]
            commentnumber["afterCountStr"] = summary["afterCount"]
            commentnumber["showCount"] = summary["showCount"]
        for comment in jsondata["comments"]:
            comments.append(comment["content"].replace("\n", ""))
    return {"commentnumber": commentnumber, "label": label, "comments": comments}


def GetMoreInformation(product_id):
    """Fetch the product detail page (an unused stub in this script)."""
    url = "https://item.jd.com/" + product_id + ".html"
    res = requests.get(url, headers=HEADERS)
    return bst(res.content, "html.parser")


def GetGoodResone(LsComputer):
    """Tally hot-comment tags across products whose positive-review rate exceeds 70%."""
    labells = []
    labellist = {}
    for item in LsComputer:
        counts = item['comments']['commentnumber']
        if counts['commentCount'] == 0:
            continue  # products without reviews cannot be rated
        positive = counts['goodCount'] + counts['defaultGoodCount']
        if positive / float(counts['commentCount']) > 0.7:
            labells.append(item['comments']["label"])
    # Each entry looks like "name:count"; accumulate the counts per tag name.
    for tags in labells:
        for tag in tags:
            name, count = tag.split(":")
            labellist[name] = labellist.get(name, 0) + float(count)
    result = sorted(labellist.items(), key=lambda x: x[1], reverse=False)
    with open(os.path.join(os.getcwd(), 'label_ranking_positive_over_70pct.txt'),
              'w', encoding="utf-8") as f:
        for row in result:
            f.write(str(row))
            f.write('\r\n')
    return labellist
def GetMaxSalesShop(LsComputer):
    """Rank shops by revenue, sales volume, and average price, and save each ranking."""
    shopcount = {}      # revenue: review count x price, summed per shop
    shopsalecount = {}  # sales volume, with the review count as a proxy for sales
    shopprice = {}      # every accepted price per shop, for the average
    for item in LsComputer:
        shop = item["ShopName"]
        shopcount.setdefault(shop, 0)
        shopsalecount.setdefault(shop, 0)
        shopprice.setdefault(shop, [])
        raw = item["Price"].replace("\n", "").replace(" ", "").replace("\t", "")
        if len(raw) < 5:
            continue
        try:
            price = float(raw[0:-3])  # drop the ".00" decimals
        except ValueError:
            continue
        if price < 1000 or price > 50000:
            continue  # the sanity filter of 7.3: deposits and other dirty prices
        sales = item["comments"]["commentnumber"]["commentCount"]
        shopcount[shop] += sales * price
        shopsalecount[shop] += sales
        shopprice[shop].append(price)
    # Average price per shop.
    shopprice2 = {}
    for shop, prices in shopprice.items():
        if len(prices) != 0:
            shopprice2[shop] = sum(prices) / len(prices)

    def save_ranking(data, title, filename):
        result = sorted(data.items(), key=lambda x: x[1], reverse=False)
        print()
        print(title)
        for row in result:
            print(row)
        with open(os.path.join(os.getcwd(), filename), 'w', encoding="utf-8") as f:
            for row in result:
                f.write(str(row))
                f.write('\r\n')

    save_ranking(shopcount, "Revenue ranking:", 'revenue_ranking.txt')
    save_ranking(shopsalecount, "Sales volume ranking:", 'sales_volume_ranking.txt')
    save_ranking(shopprice2, "Average price ranking:", 'average_price_ranking.txt')


# Any JD search link works here; the keyword is 电脑 (computer).
BASE_URL = ('https://search.jd.com/Search?keyword=%E7%94%B5%E8%84%91&enc=utf-8'
            '&qrst=1&rt=1&stop=1&vt=2&wq=%E7%94%B5%E8%84%91&page=')

# The two list-item class variants of 7.1: normal items and flash-sale items.
ITEM_CLASSES = {"gl-item", "gl-item gl-item-presell"}


def parse_item(item):
    """Extract one product's fields from its <li> node on the search page."""
    CustomUrl = item.find("div", class_="p-img").find("a").get("href")
    if "https:" not in str(CustomUrl):
        CustomUrl = "https:" + CustomUrl
    # The product id is encoded in the price tag's class, e.g. "J_100012043978".
    product_id = item.find("div", class_="p-price").find("strong").get("class")
    product_id = product_id[0].replace("J", "").replace("_", "")
    # Comment data comes from a separate request (see GetComment above).
    Comments = GetComment(product_id)
    ImgUrl = "https:" + str(
        item.find("div", class_="p-img").find("img").get("source-data-lazy-img"))
    Price = item.find("div", class_="p-price").find("i").getText()
    Describe = item.find("div", class_="p-name p-name-type-2").find("em").getText()
    # The first item on a page may lack a shop link, leaving ShopName empty.
    ShopName = item.find("div", class_="p-shop").find("a")
    if ShopName is not None:
        ShopName = str(ShopName.getText())
    # A product can carry several promotion icons (free shipping, flash sale, ...).
    BusinessMode = [i.getText() for i in item.find("div", class_="p-icons").findAll("i")]
    return {
        "CustomUrl": CustomUrl,
        "ImgUrl": ImgUrl,
        "Price": Price,
        "Describe": Describe,
        "ShopName": ShopName,
        "BusinessMode": BusinessMode,
        "comments": Comments,
    }


LsComputer = []
for k in range(10):
    # JD numbers its result pages with odd page parameters: 1, 3, ..., 19 cover
    # the first ten pages of step 2.
    res = requests.get(BASE_URL + str(2 * k + 1), headers=HEADERS)
    page = bst(res.content, "html.parser")
    for item in page.findAll("li"):
        # Match the exact class attribute so both variants of 7.1 are collected
        # and nothing is counted twice.
        if " ".join(item.get("class", [])) in ITEM_CLASSES:
            LsComputer.append(parse_item(item))

# Step 4: write every product to a local txt file as pretty-printed JSON.
with open(os.path.join(os.getcwd(), 'json.txt'), 'w', encoding="utf-8") as f:
    for item in LsComputer:
        f.write(json.dumps(item, indent=4, ensure_ascii=False))

GetMaxSalesShop(LsComputer)
GetGoodResone(LsComputer)
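The script ends by writing the label ranking to a text file; the word-cloud file from step 6 still has to be produced. A minimal sketch using the third-party wordcloud package (pip install wordcloud), fed with the tag-frequency dict GetGoodResone returns; the font path and the output name label_cloud.png are hypothetical, and the font must cover CJK glyphs or Chinese labels will render as boxes.

from wordcloud import WordCloud

def make_label_cloud(labellist, out_path="label_cloud.png"):
    """labellist: dict of tag name -> accumulated count, as built by GetGoodResone."""
    cloud = WordCloud(
        font_path="C:/Windows/Fonts/simhei.ttf",  # hypothetical path; any CJK font works
        width=800,
        height=600,
        background_color="white",
    ).generate_from_frequencies(labellist)
    cloud.to_file(out_path)

# For example, after the run above:
# make_label_cloud(GetGoodResone(LsComputer))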