Scraping 30,000 Laptop Records Across Brands, Then Analyzing and Visualizing Them

Reposted article. Originally published 2021-03-12 13:30.


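Introduction

The previous article, 【教你用python爬取『京東』商品數據，原來這么簡單！】, showed how to scrape product data from the JD (京東) store.

This article shows how to scrape roughly 30,000 laptop records covering many brands from JD, run a statistical analysis on them, and finish with visualizations (the charts look great!).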

What this article covers:

Scraping all laptop listings from JD

Saving the data to Excel

Analyzing the Excel data with pandas

Drawing a variety of charts

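Scraping the data

1. Analyzing the links

The previous article scraped a single kind of product. Here we need to scrape many brands, and each brand has its own link, so the links have to be analyzed first.

Looking at the link, the ev parameter carries the brand name, so changing only the ev parameter is enough to scrape laptop data for a different brand.

Pitfall:

Do not drop the parenthesized part of the brand name, as in 聯想（lenovo）. Without it, the data for some brands cannot be scraped (verified by testing).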
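As a minimal sketch of the point above, the snippet below builds a search URL for one brand and percent-encodes the Chinese characters explicitly. The function name build_search_url is hypothetical; the URL layout and parameter names are copied from the code later in this article, not from any official JD documentation.

```python
from urllib.parse import quote

BASE_URL = "https://search.jd.com/search"

def build_search_url(brand, page=1, s=1):
    """Build a search URL whose ev parameter selects one brand."""
    keyword = quote("筆記本")
    # Keep the full brand name, parenthesized alias included
    ev = "exbrand_" + quote(brand)
    return (f"{BASE_URL}?keyword={keyword}&wq={keyword}"
            f"&ev={ev}&page={page}&s={s}&click=1")

url = build_search_url("聯想（lenovo）")
print(url)
```

Switching brands then only means passing a different brand string; everything else in the URL stays the same.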

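The total amount of data (number of pages) also differs from brand to brand, so it has to be recorded per brand. A dictionary is defined here to store (1) the brand name and (2) its total page count.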

brand_dict={
    '聯想（lenovo）':100,
    'ThinkPad':100,
    '戴爾（DELL）':100,
    '惠普（HP）':100,
    '華為（HUAWEI）':100,
    'Apple':100,
    '小米（MI）':47,
    '宏碁（acer）':43,
    '榮耀（HONOR）':21,
    '機械革命（MECHREVO）':31,
    '微軟（Microsoft）':100,
    'LG':3,
    '神舟（HASEE）':34,
    'VAIO':3,
    '三星（SAMSUNG）':47,
}

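2. Fetching laptop data for each brand

# Author: 李運辰, WeChat official account: python爬蟲數據分析挖掘
# Walk through every results page of every brand
def getpage(brand_dict):
    for brand, total_pages in brand_dict.items():
        page = 1  # the page parameter advances by 2 per results page
        s = 1     # the s parameter advances by 60 per results page
        for i in range(1, int(total_pages) + 1):
            url = ("https://search.jd.com/search?keyword=筆記本&wq=筆記本&ev=exbrand_" + str(brand)
                   + "&page=" + str(page) + "&s=" + str(s) + "&click=1")
            try:
                getlist(url, brand)
            except Exception as e:
                # Skip a failed page instead of abandoning the rest of the brand
                print("品牌=" + str(brand) + ",當前頁數=" + str(i) + ",failed: " + str(e))
            page = page + 2
            s = s + 60
            print("品牌=" + str(brand) + ",頁數=" + str(total_pages) + ",當前頁數=" + str(i))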

A try-except is added so that a failed page does not terminate the whole program!
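The paging arithmetic in getpage is easy to miss: for each results page, the page parameter grows by 2 and the s parameter grows by 60. The sketch below isolates that arithmetic in a small generator. The name page_params is hypothetical, and the step sizes are simply the ones used in the code above.

```python
def page_params(total_pages):
    """Yield the (page, s) query values for each results page, as in getpage."""
    for i in range(total_pages):
        yield 1 + 2 * i, 1 + 60 * i

print(list(page_params(3)))  # [(1, 1), (3, 61), (5, 121)]
```

The fifth pair produced this way is (9, 241), which matches the sample URL (page=9, s=241) left as a comment in getlist.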

3. Parsing the data on each page

import requests
from lxml import etree

# Fetch the product data on one results page
# (headers, a dict of request headers, is assumed to be defined elsewhere)
def getlist(url, brand):
    global count
    # url="https://search.jd.com/search?keyword=筆記本&wq=筆記本&ev=exbrand_聯想%5E&page=9&s=241&click=1"
    res = requests.get(url, headers=headers)
    res.encoding = 'utf-8'
    text = res.text
    selector = etree.HTML(text)
    items = selector.xpath('//*[@id="J_goodsList"]/ul/li')  # renamed to avoid shadowing the built-in list
    for i in items:
        title = i.xpath('.//div[@class="p-name p-name-type-2"]/a/em/text()')[0]
        price = i.xpath('.//div[@class="p-price"]/strong/i/text()')[0]

Only the product title and the product price are extracted here.
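Indexing an XPath result with [0] raises IndexError when a listing lacks the expected element, and the price arrives as text rather than a number. A minimal sketch of two defensive helpers follows. The names first_text and parse_price are hypothetical, and the sample values are made up; the helpers work on plain lists of strings, which is the shape of what xpath() returns for text() queries.

```python
def first_text(matches, default=""):
    """Return the first XPath text match, stripped, or a default if none matched."""
    return matches[0].strip() if matches else default

def parse_price(raw):
    """Convert a price string such as '5999.00' to float; None if not numeric."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None

print(first_text(["  ThinkPad X1  "]))  # a list like the one xpath() returns
print(first_text([]))                   # no match: falls back to the default
print(parse_price("5999.00"))
print(parse_price(""))
```

With helpers like these, a listing without a price can be skipped or stored as empty instead of aborting the page.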

4. Saving the data to Excel

Define the Excel header row

import openpyxl
outwb = openpyxl.Workbook()
outws = outwb.create_sheet(index=0)
outws.cell(row=1, column=1, value="index")
outws.cell(row=1, column=2, value="brand")
outws.cell(row=1, column=3, value="title")
outws.cell(row=1, column=4, value="price")
count = 2  # next row to write; row 1 holds the header

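Write the data and save it as 筆記本電腦-李運辰.xls

# Inside the loop in getlist, after title and price have been extracted
outws.cell(row=count, column=1, value=str(count-1))
outws.cell(row=count, column=2, value=str(brand))
outws.cell(row=count, column=3, value=str(title))
outws.cell(row=count, column=4, value=str(price))
count = count + 1  # advance to the next row so earlier rows are not overwritten
# Save. openpyxl always writes the .xlsx format, so an .xlsx extension is the safer choice
outwb.save("筆記本電腦-李運辰.xls")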
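The analysis section later in this article reads a CSV file with pd.read_csv, while the code above saves an Excel workbook. One possible way to bridge the two, sketched here with Python's standard csv module, is to write the same four columns straight to CSV. The function name save_rows_csv, the output file name, and the sample rows are all hypothetical.

```python
import csv
import os
import tempfile

def save_rows_csv(path, rows):
    """Write (brand, title, price) tuples to a CSV file with an index column."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["index", "brand", "title", "price"])
        for i, (brand, title, price) in enumerate(rows, start=1):
            writer.writerow([i, brand, title, price])

# Made-up sample rows, for illustration only
sample_rows = [("ThinkPad", "ThinkPad sample title", "5999.00"),
               ("Apple", "Apple sample title", "9999.00")]
out_path = os.path.join(tempfile.gettempdir(), "laptops_sample.csv")
save_rows_csv(out_path, sample_rows)
```

Collecting the rows in a list and writing them once also avoids saving the whole file after every single record.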

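With that, the data has been scraped.

Next we run a statistical analysis on the data and then draw the charts.

Visualization and analysis

1. Number of records per brand

Reading the Excel data with pandas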

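import pandas as pd

# Read the data. Note that this is a CSV file, which assumes the workbook saved earlier was exported to CSV
df_all = pd.read_csv("筆記本電腦-李運辰.csv", engine="python")
df = df_all.copy()
# Reset the index
df = df.reset_index(drop=True)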

Statistics

# Count the records per brand, largest first
brand_counts = df.groupby('brand')['price'].count().sort_values(ascending=False).reset_index()
brand_counts.columns = ['品牌', '數據量']  # brand, record count
name = brand_counts['品牌'].tolist()
dict_values = brand_counts['數據量'].tolist()

Visualization

from pyecharts import options as opts
from pyecharts.charts import Bar
from pyecharts.globals import ThemeType

# Method chaining
c = (
    Bar(
        init_opts=opts.InitOpts(  # initial options
            theme=ThemeType.MACARONS,
            animation_opts=opts.AnimationOpts(
                animation_delay=1000, animation_easing="cubicOut"  # initial animation delay and easing
            ),
        )
    )
    .add_xaxis(name)  # x axis
    .add_yaxis("展示每個品牌的數據量", dict_values)  # y axis
    .set_global_opts(
        title_opts=opts.TitleOpts(  # title options and position
            title='', subtitle='',
            title_textstyle_opts=opts.TextStyleOpts(
                font_family='SimHei', font_size=25, font_weight='bold', color='red',
            ),
            pos_left="90%", pos_top="10",
        ),
        # x-axis name; rotating the labels keeps long brand names readable
        xaxis_opts=opts.AxisOpts(name='品牌', axislabel_opts=opts.LabelOpts(rotate=45)),
        yaxis_opts=opts.AxisOpts(name='數據量'),
    )
    .render("展示每個品牌的數據量.html")
)

2. Comparing maximum prices

Statistics

# Highest price per brand, largest first
brand_maxprice = df.groupby('brand')['price'].max().sort_values(ascending=False).reset_index()
brand_maxprice.columns = ['品牌', '最高價']  # brand, maximum price
name = brand_maxprice['品牌'].tolist()
dict_values = brand_maxprice['最高價'].tolist()

Visualization

# Strip the parenthesized English name from each brand
for i in range(len(name)):
    if "（" in name[i]:
        name[i] = name[i][:name[i].index("（")]
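The same alias-stripping step recurs in the sections that follow, so one way to avoid repeating the loop is a small helper. This is only a sketch: the function name strip_alias is hypothetical, and it assumes the alias always starts at the first full-width opening parenthesis, as in the brand dictionary above.

```python
def strip_alias(brand):
    """Drop a full-width parenthesized alias, e.g. '聯想（lenovo）' -> '聯想'."""
    return brand.split("（", 1)[0]

names = ["聯想（lenovo）", "ThinkPad", "戴爾（DELL）"]
print([strip_alias(n) for n in names])
```

Names without a parenthesized part, such as ThinkPad, pass through unchanged.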


# Method chaining
c = (
    Bar(
        init_opts=opts.InitOpts(  # initial options
            theme=ThemeType.MACARONS,
            animation_opts=opts.AnimationOpts(
                animation_delay=1000, animation_easing="cubicOut"  # initial animation delay and easing
            ),
        )
    )
    .add_xaxis(name)  # x axis
    .add_yaxis("最高價格對比", dict_values)  # y axis
    .set_global_opts(
        title_opts=opts.TitleOpts(  # title options and position
            title='', subtitle='',
            title_textstyle_opts=opts.TextStyleOpts(
                font_family='SimHei', font_size=25, font_weight='bold', color='red',
            ),
            pos_left="90%", pos_top="10",
        ),
        # x-axis name; rotating the labels keeps long brand names readable
        xaxis_opts=opts.AxisOpts(name='品牌', axislabel_opts=opts.LabelOpts(rotate=45)),
        yaxis_opts=opts.AxisOpts(name='最高價'),
    )
    .render("最高價格對比.html")
)

3. Average price

Statistics

# Mean price per brand, largest first
brand_meanprice = df.groupby('brand')['price'].mean().sort_values(ascending=False).reset_index()
brand_meanprice.columns = ['品牌', '價格均值']  # brand, mean price
name = brand_meanprice['品牌'].tolist()
dict_values = brand_meanprice['價格均值'].tolist()


# Strip the parenthesized English name from each brand
for i in range(len(name)):
    if "（" in name[i]:
        name[i] = name[i][:name[i].index("（")]


# Truncate the mean prices to integers
dict_values = [int(v) for v in dict_values]
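To make explicit what the three groupby statistics in sections 1 to 3 compute (record count, maximum price, and mean price truncated to an integer), here is a minimal sketch in plain Python over a few made-up rows. The function name brand_stats and the sample prices are hypothetical; the real analysis runs pandas on the scraped file.

```python
from collections import defaultdict
from statistics import mean

def brand_stats(rows):
    """Return {brand: (count, max_price, int(mean_price))} from (brand, price) pairs."""
    prices = defaultdict(list)
    for brand, price in rows:
        prices[brand].append(float(price))
    return {b: (len(p), max(p), int(mean(p))) for b, p in prices.items()}

# Made-up prices, for illustration only
rows = [("Apple", 9999), ("Apple", 12999),
        ("小米（MI）", 4299), ("小米（MI）", 5000), ("小米（MI）", 3700)]
stats = brand_stats(rows)
# Sort by mean price, highest first, like sort_values(ascending=False)
for brand, (count, top, avg) in sorted(stats.items(), key=lambda kv: kv[1][2], reverse=True):
    print(brand, count, top, avg)
```

Note that int() truncates rather than rounds, which is also what the conversion loop above does.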

Visualization

# Method chaining
c = (
    Bar(
        init_opts=opts.InitOpts(  # initial options
            theme=ThemeType.MACARONS,
            animation_opts=opts.AnimationOpts(
                animation_delay=1000, animation_easing="cubicOut"  # initial animation delay and easing
            ),
        )
    )
    .add_xaxis(name)  # x axis
    .add_yaxis("價格均值對比", dict_values)  # y axis
    .set_global_opts(
        title_opts=opts.TitleOpts(  # title options and position
            title='', subtitle='',
            title_textstyle_opts=opts.TextStyleOpts(
                font_family='SimHei', font_size=25, font_weight='bold', color='red',
            ),
            pos_left="90%", pos_top="10",
        ),
        # x-axis name; rotating the labels keeps long brand names readable
        xaxis_opts=opts.AxisOpts(name='品牌', axislabel_opts=opts.LabelOpts(rotate=45)),
        yaxis_opts=opts.AxisOpts(name='價格均值'),
    )
    .render("價格均值對比.html")
)

4. Title word clouds for each brand

Extracting the text

import os

os.makedirs("text", exist_ok=True)  # the output folder must exist before writing
for brand, titles in df.groupby('brand')['title']:
    # Strip the parenthesized English name
    brandname = brand[:brand.index("（")] if "（" in brand else brand
    print(brandname)
    text = "".join(titles.tolist())
    # Remove the brand name, line breaks, tabs, spaces and 【】 brackets
    for junk in (brand, brandname, "\n", "\r", "\t", "【", "】", " "):
        text = text.replace(junk, "")
    #print(text)
    with open("text/" + str(brandname) + ".txt", "a+") as f:
        f.write(text)

Here the title text of each brand is written to its own txt file.
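The cleaning step can also be sketched as a standalone function that is easy to try on a couple of titles. The function name clean_titles and the sample titles are hypothetical. One difference from the code above is labeled here on purpose: the regular expression \s removes every kind of whitespace, not only spaces, tabs, and line breaks.

```python
import re

def clean_titles(titles, brand):
    """Join titles and strip the brand name, whitespace and 【】 brackets."""
    short = brand.split("（", 1)[0]  # brand name without the parenthesized alias
    text = "".join(titles).replace(brand, "").replace(short, "")
    return re.sub(r"[\s【】]", "", text)

# Made-up titles, for illustration only
sample = ["小米（MI） RedmiBook 14\n", "【新品】小米 Pro 15\t"]
print(clean_titles(sample, "小米（MI）"))
```

Removing the brand name first keeps it from dominating every brand's word cloud.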

Visualization

import os
import jieba
from stylecloud import gen_stylecloud

def an4_pic():
    # Word cloud icons (shapes), one per brand file
    fa_list = ['fas fa-play', 'fas fa-audio-description', 'fas fa-circle', 'fas fa-eject', 'fas fa-stop',
               'fas fa-video', 'fas fa-volume-off', 'fas fa-truck', 'fas fa-apple-alt', 'fas fa-mountain',
               'fas fa-tree', 'fas fa-database', 'fas fa-wifi', 'fas fa-mobile', 'fas fa-plug']
    # Load the stop words once instead of once per file
    with open("stopword.txt", "r", encoding='UTF-8') as f:
        stopwords = [line.strip("\r\n") for line in f]
    stopwords = [w for w in stopwords if w]  # skip blank lines
    # Start drawing
    for z, filename in enumerate(os.listdir("text")):
        print(filename)
        with open("text/" + filename, "r") as f:
            text = f.readlines()[0]
        for word in stopwords:
            text = text.replace(word, "")
        word_list = jieba.cut(text)
        result = " ".join(word_list)  # join the tokens with spaces
        # Build the Chinese word cloud
        icon_name = fa_list[z % len(fa_list)]  # cycle if there are more files than icons
        # A Chinese font is required, otherwise the characters render incorrectly
        gen_stylecloud(text=result, icon_name=icon_name, font_path='simsun.ttc',
                       output_name=filename.replace(".txt", "") + "詞雲圖.png")
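The two small pieces of logic in an4_pic, stop-word removal and picking one icon per file, can be tried in isolation without jieba or stylecloud. The sketch below uses hypothetical names (remove_stopwords, pick_icon) and made-up stop words. Keep in mind that the removal is plain substring replacement done before tokenization, so a stop word is also deleted when it appears inside a longer word.

```python
def remove_stopwords(text, stopwords):
    """Delete every stop word from text by plain substring replacement."""
    for word in stopwords:
        if word:  # skip blank lines from the stop-word file
            text = text.replace(word, "")
    return text

def pick_icon(icons, index):
    """Cycle through the icon list so any number of files gets an icon."""
    return icons[index % len(icons)]

stop_lines = ["筆記本\n", "電腦\r\n", "\n"]  # lines as they would be read from a file
stopwords = [line.strip("\r\n") for line in stop_lines]
print(remove_stopwords("輕薄筆記本電腦14英寸", stopwords))
print(pick_icon(["fas fa-play", "fas fa-circle"], 2))
```

Cycling with the modulo operator means the function keeps working even if the text folder ends up holding more files than there are icons.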