Taobao Snack Category Analysis (Taobao Crawler + Data Analysis)

Reposted from the original post, 2019-04-09 22:12, 數據分析筆記 (Data Analysis Notes)

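Preface: The crawler in this post uses the search keyword "美食" (gourmet food). During the analysis I found that items matching "零食" (snacks) sell far more than items matching "美食", so the data itself makes this analysis one-sided from the start. This post mainly serves as a record of the code and the analysis process.

For the actual conclusions, see the next post (its crawler uses the keyword "零食").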

https://www.cnblogs.com/little-monkey/p/10822369.html

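I. Crawler

The crawler is adapted from Cui Qingcai's crawler video tutorial and uses selenium to crawl Taobao. (I originally wanted to use the 火車采集器 scraping tool, but after trying it I found it could not capture the fields in Taobao's URLs.)

selenium fully simulates manual clicks, so in principle it can crawl everything that is visible on Taobao.

The crawler code also draws on https://www.cnblogs.com/hustcser/p/8744355.html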

import re
import time
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from pyquery import PyQuery as pq
import pymongo

MONGO_URL = 'localhost'
MONGO_DB = 'taobao'
MONGO_TABLE = 'product2'
KEYWORD = '美食'
PAGE_NUM=35            # number of pages to crawl

client=pymongo.MongoClient(MONGO_URL)
db=client[MONGO_DB]

browser = webdriver.Chrome()
wait=WebDriverWait(browser, 10)

def search():
    print('Searching...')
    try:
        browser.get('https://s.taobao.com/search?q=%E7%BE%8E%E9%A3%9F&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&sort=sale-desc&bcoffset=-30&p4ppushleft=%2C44&ntoffset=-30&fs=1&filter_tianmao=tmall&s=0')
        page_num=PAGE_NUM
        get_products()      # parse the first results page
        return page_num
    except TimeoutException:
        return search()

# Go to the given results page
def next_page(page_number):
    print('Turning to page %s' % page_number)
    time.sleep(3)
    try:
        input = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#mainsrp-pager > div > div > div > div.form > input")))
        submit = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#mainsrp-pager > div > div > div > div.form > span.btn.J_Submit')))
        input.clear()
        input.send_keys(page_number)
        submit.click()
        wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR,'#mainsrp-pager > div > div > div > ul > li.item.active > span'),str(page_number)))
        get_products()
    except TimeoutException:
        next_page(page_number)

# Parse the results page
def get_products():
    # Wait until the item elements have loaded
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#mainsrp-itemlist .items .item')))
    html=browser.page_source
    doc=pq(html)
    items=doc("#mainsrp-itemlist .items .item").items()
    for item in items:
        product={
            # 'image': item.find('.pic .img').attr('src'),
            'title': item.find('.title').text(),
            'price': item.find('.price').text(),
            'deal': item.find('.deal-cnt').text()[:-3],   # drop the 3-character suffix after the number (e.g. "人付款")
            'location': item.find('.location').text(),
            'shop': item.find('.shop').text()
        }
        print(product)
        save_to_mongo(product)

def save_to_mongo(result):
    try:
        # insert_one replaces Collection.insert, which was removed in PyMongo 4
        if db[MONGO_TABLE].insert_one(result):
            print('Saved to MongoDB',result)
    except Exception:
        print('Failed to save to MongoDB',result)

def main():
    try:
        page_num=search()
        for i in range(2,page_num+1):
            next_page(i)
    except Exception:
        print('Something went wrong')
    finally:
        browser.close()

if __name__ == '__main__':
    main()
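
The URL passed to `browser.get` hard-codes the percent-encoded keyword, and the `KEYWORD` constant is never used. Below is a minimal sketch of building the URL from the keyword instead. `build_search_url` is a hypothetical helper, it keeps only a few of the query parameters from the original URL, and whether Taobao still accepts this reduced set has not been verified.

```python
from urllib.parse import urlencode


def build_search_url(keyword, offset=0):
    """Build a Taobao search URL for `keyword` (hypothetical helper)."""
    params = {
        'q': keyword,                # urlencode percent-encodes this as UTF-8
        'sort': 'sale-desc',         # sort by sales, as in the original URL
        'filter_tianmao': 'tmall',   # Tmall-only filter, as in the original URL
        's': offset,                 # result offset, 0 for the first page
    }
    return 'https://s.taobao.com/search?' + urlencode(params)


print(build_search_url('美食'))
# https://s.taobao.com/search?q=%E7%BE%8E%E9%A3%9F&sort=sale-desc&filter_tianmao=tmall&s=0
```

With a helper like this, switching the crawl to "零食" would only require changing `KEYWORD`.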

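Note: after the program starts, the Taobao login page pops up. Scan the QR code with the mobile Taobao app to log in, then wait for the crawl to finish, and finally export a CSV file from MongoDB.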
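
The post does not say which tool was used for the CSV export. One possible way, sketched here under that assumption, is to write the documents out with the standard `csv` module. `export_to_csv`, `FIELDS`, and the output path in the usage comment are hypothetical names. Unlike the file used later in this post, this sketch leaves out the `_id` column.

```python
import csv

# Fields stored by the crawler for each product
FIELDS = ['title', 'price', 'deal', 'location', 'shop']


def export_to_csv(docs, path, fields=FIELDS):
    """Write an iterable of dicts to a UTF-8 CSV file and return the row count."""
    count = 0
    with open(path, 'w', newline='', encoding='utf-8') as f:
        # extrasaction='ignore' skips keys that are not listed, such as MongoDB's _id
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction='ignore')
        writer.writeheader()
        for doc in docs:
            writer.writerow(doc)
            count += 1
    return count


# With the crawler's collection this might look like (assumes MongoDB is running):
# export_to_csv(db[MONGO_TABLE].find(), 'data/products.csv')
```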

The result is as follows (screenshot not preserved):

II. Data Processing

import pandas as pd

data= pd.read_csv('data/淘寶美食35頁.csv', encoding='utf8',engine='python')
data.drop('_id',axis=1, inplace=True)                             # drop the _id column
data['price']=data['price'].str.replace('¥', '', regex=False)     # strip the '¥' from the price column

# Extract province and city from the location column, then drop location
data['province']=data.location.apply(lambda x:x.split()[0])
# Municipalities such as 上海 have no separate city part, so reuse the first token
data['city']=data.location.apply(lambda x:x.split()[1] if len(x.split())>1 else x.split()[0])
data.drop('location',axis=1, inplace=True)

# Convert data types
data['price']=data.price.astype('float64')
for i in ['province','city']:
    data[i]=data[i].astype('category')

The output is as follows (screenshot not preserved):
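
The two cleaning rules in this step can be tried on single values without pandas. This is a small sketch with hypothetical helper names (`split_location`, `parse_price`). It assumes a location is either "province city" separated by whitespace or a single name such as a municipality.

```python
def split_location(location):
    """Return (province, city); a single-token location is used for both."""
    parts = location.split()
    province = parts[0]
    city = parts[1] if len(parts) > 1 else parts[0]
    return province, city


def parse_price(text):
    """Strip the yuan sign and convert the rest to float."""
    return float(text.replace('¥', '').strip())


print(split_location('四川 成都'))   # ('四川', '成都')
print(split_location('上海'))        # ('上海', '上海')
print(parse_price('¥19.90'))         # 19.9
```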

III. Data Mining and Analysis

[Data processing]

import jieba

# Load the prepared custom words (jieba's built-in dictionary is not enough for this data)
add_words = pd.read_excel('data/add_words.xlsx',header=None)
add_words_list = add_words[0].tolist()
for w in add_words_list:                    # register each custom word
   jieba.add_word(w , freq=1000)

# Load the stop-word list
with open('data/stop.csv', 'r', encoding='utf-8') as f:
   stopwords = [line.strip() for line in f]

# Tokenize each title with lcut
title=data.title.values.tolist()     # convert to a list
title_s=[]
for line in title:
    title_cut=jieba.lcut(line)
    title_s.append(title_cut)
     
# Remove redundant words, plan 1: filter out stop words
title_clean = []
for line in title_s:
   line_clean = []
   for word in line:
      if word not in stopwords:
         line_clean.append(word)
   title_clean.append(line_clean)
# Remove redundant words, plan 2: define them directly, then call lcut afterwards
#removes =['熟悉', '技術', '職位', '相關', '工作', '開發', '使用','能力','優先','描述','任職']
#for w in removes:
#   jieba.del_word(w)
   
# Deduplicate the words within each list of title_clean so that every word appears once per title, e.g. 【麻辣小魚干香辣小魚干】 -> 【麻辣, 香辣, 小魚干】
# The deduplicated title_clean_dist is a 2-D list, i.e. [[...], [...], ...]
title_clean_dist = []  
for line in title_clean:   
   line_dist = []
   for word in line:
      if word not in line_dist:
         line_dist.append(word)
   title_clean_dist.append(line_dist)
   
# Flatten title_clean_dist into a 1-D list
allwords_clean_dist = []
for line in title_clean_dist:
   for word in line:
      allwords_clean_dist.append(word)

# Turn the list allwords_clean_dist into a DataFrame
df_allwords_clean_dist = pd.DataFrame({'allwords': allwords_clean_dist})

# Count how often each filtered, deduplicated word occurs
word_count = df_allwords_clean_dist.allwords.value_counts().reset_index()
word_count.columns = ['word','count']      # set the column names
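
The stop-word filtering, per-title deduplication, flattening, and counting steps above can also be written in one pass. This is a sketch with `collections.Counter`, not the author's code. `count_title_words` is a hypothetical helper and the sample titles are made up for illustration.

```python
from collections import Counter


def count_title_words(token_lists, stopwords):
    """Count in how many titles each word appears, ignoring stop words."""
    stop = set(stopwords)              # set lookup is faster than scanning a list
    counts = Counter()
    for tokens in token_lists:
        kept = [w for w in tokens if w not in stop]
        unique = list(dict.fromkeys(kept))   # dedupe while keeping word order
        counts.update(unique)                # each title contributes a word at most once
    return counts


# Made-up tokenized titles, as jieba.lcut might return them
sample = [['麻辣', '小魚干', '香辣', '小魚干'], ['香辣', '牛肉干', '的']]
counts = count_title_words(sample, ['的'])
print(counts.most_common(1))   # [('香辣', 2)]
```

Because each title is deduplicated before counting, the count is the number of listings that mention a word, which matches what `word_count` holds.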

[Word cloud]

# Word cloud visualization
import numpy as np
from PIL import Image
from wordcloud import WordCloud
import matplotlib.pyplot as plt

txt = " ".join(allwords_clean_dist)                # join the list into a str for the word cloud
food_mask=np.array(Image.open("data/mask.png"))
wc = WordCloud(font_path='data/simhei.ttf',        # font (needed to render Chinese)
               background_color="white",           # background color
               max_words=1000,                     # maximum number of words shown
               max_font_size=100,                  # largest font size
               min_font_size=5,                    # smallest font size
               random_state=42,                    # random seed
               collocations=False,                 # skip bigrams, which avoids repeated words
               mask=food_mask,                     # shape mask
               width=1000,height=800,margin=2,     # image size and word margin; width/height are ignored when mask is set
              )
wc.generate(txt) 

plt.figure(dpi=200)
plt.imshow(wc, interpolation='catrom',vmax=1000)
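plt.axis("off")                                 # hide the axes

plt.rcParams['figure.dpi'] = 600                # applies to figures created from here on
plt.savefig('E:\\1標題詞雲.png', dpi=600)         # pass dpi here; the rcParams change does not affect the existing figure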

Next, visualize the sales associated with each keyword.

[Sales statistics]

ws_sum=[]
for w in word_count.word:
    i=0
    s_list=[]
    for t in title_clean_dist:
        if w in t:
            s_list.append(data.deal[i])
        i+=1
    ws_sum.append(sum(s_list))

df_sum=pd.DataFrame({'ws_sum':ws_sum})
df_word_sum=pd.concat([word_count,df_sum],axis=1,ignore_index=True)
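df_word_sum.columns=['word','count','ws_sum']    # word, occurrences, total sales of items whose title contains the word

df_word_sum=df_word_sum[df_word_sum.word.str.strip()!='']        # drop the whitespace token (row 8 in the author's data)
df_word_sum=df_word_sum.sort_values('ws_sum',ascending=True)     # sort ascending
df_ws=df_word_sum.tail(40)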

index=np.arange(df_ws.word.size)
plt.figure(figsize=(16,13))
plt.barh(index,df_ws.ws_sum,align='center')
plt.yticks(index,df_ws.word,fontsize=11)

# Add data labels
for y,x in zip(index,df_ws.ws_sum):
    plt.text(x,y,'%.0f' %x,ha='left',va='center',fontsize=11)    # ha: center, left, or right; va: top, bottom, center, or baseline
    
plt.savefig('E:\\2銷量詞匯排行.png')
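
The nested loop that fills `ws_sum` rescans every title for every word. As a sketch of an alternative, the same totals can be accumulated in a single pass over the titles. `sales_by_word` is a hypothetical helper, and the titles and sales figures below are made up for illustration.

```python
from collections import defaultdict


def sales_by_word(token_lists, deals):
    """Sum the sales of every item whose title contains each word."""
    totals = defaultdict(int)
    for tokens, deal in zip(token_lists, deals):
        for word in set(tokens):       # set() guards against a word repeated in one title
            totals[word] += deal
    return dict(totals)


titles = [['麻辣', '小魚干'], ['香辣', '小魚干'], ['麻辣', '牛肉干']]
deals = [100, 50, 30]
totals = sales_by_word(titles, deals)
print(totals['小魚干'])   # 150
print(totals['麻辣'])     # 130
```

Note that an item's sales are credited to every word in its title, so the totals across words add up to more than the total sales.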

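[Analysis 1] Market overview based on Taobao titles (what sellers favor) and sales (what buyers favor)

1. Words such as "美食" (gourmet food), "零食" (snacks), and "小吃" (street snacks) appear in a large share of listings.

2. By mealtime, breakfast items sell the most, and afternoon tea also has some room.

3. By region, Sichuan-Chongqing food leads the country, followed by Hunan, Yunnan, Beijing, and Xiamen. Nanjing, Anhui, Shanghai, and Huangshan also have some market.

4. By type, pastries > bread > dried meat > egg products, so Chinese consumers appear to prefer baked sweets over meat and egg products.

5. By style, "特產" (local specialty) and "網紅" (internet-famous) form the first tier: local specialties have the most sellers, while internet-famous items have the highest sales. In the second tier, "traditional" ranks above "nutritious". The deeper reason why shoppers love local specialties and why internet-famous food tops the sales chart may be publicity: when buying snacks, buyers lean toward things they have already heard of.

6. By packaging, whole-box packaging is the most popular, followed by small packs. Gift bundles, loose items, gift boxes, and wholesale are common among sellers but sell less than whole-box and small packs, which is also related to the strong sales of pastries and bread.

7. By flavor, 麻辣 (numbing-spicy) and 香辣 (fragrant-spicy) are the most popular, which matches the standing of Sichuan-Chongqing food.

8. By brand, sales rank 良品鋪子 > 三只松鼠 > 百草味. These three giants lead the snack market and have started to build a reputation.

[Summary 1] For food, strong publicity can greatly boost sales. Sichuan-Chongqing food leads the country partly because of its 麻辣 and 香辣 flavors, and even more because of city labels such as "street snacks", "beautiful women", and "internet-famous". Refined foods such as pastries and bread are widely popular with office workers.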

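In the same way, run the [shop-name analysis]

# Load the prepared custom words (jieba's built-in dictionary is not enough for this data)
add_words = pd.read_excel('data/add_words.xlsx',header=None)
add_words_list = add_words[0].tolist()
for w in add_words_list:                    # register each custom word
   jieba.add_word(w , freq=1000)

# Remove redundant words: define them directly, then call lcut afterwards
removes =['來', '和', '有']
for w in removes:
   jieba.del_word(w)

# Tokenize each shop name with lcut
shop=data.shop.values.tolist()     # convert to a list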
shop_s=[]
for line in shop:
    shop_cut=jieba.lcut(line)
    shop_s.append(shop_cut)
    
shop_clean_dist = []  
for line in shop_s:   
   line_dist = []
   for word in line:
      if word not in line_dist:
         line_dist.append(word)
   shop_clean_dist.append(line_dist)

# Flatten shop_clean_dist into a 1-D list
shop_list = []
for line in shop_clean_dist:
   for word in line:
      shop_list.append(word)

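txt = " ".join(shop_list)                # join the list into a str for the word cloud
sh = WordCloud(font_path='data/simhei.ttf',     # font (needed to render Chinese)
               background_color="white",          # background color
               max_words=100,                  # maximum number of words shown
               max_font_size=100,                  # largest font size
               min_font_size=5,                 # smallest font size
               random_state=42,                 # random seed
               collocations=False,                 # skip bigrams, which avoids repeated words
               width=600,height=400,margin=2,     # image size and word margin; scale with plt.figure(dpi=xx) below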
              )
sh.generate(txt) 
plt.figure(dpi=200)
plt.imshow(sh, interpolation='catrom',vmax=1000)
plt.axis("off")

# Turn the list shop_list into a DataFrame
shop_list = pd.DataFrame({'allwords': shop_list})

# Count how often each deduplicated word occurs
shop_count = shop_list.allwords.value_counts().reset_index()
shop_count.columns = ['shop','count']      # set the column names

# Sales statistics
sh_sum=[]
for w in shop_count.shop:
    i=0
    s_list=[]
    for t in shop_clean_dist:
        if w in t:
            s_list.append(data.deal[i])
        i+=1
    sh_sum.append(sum(s_list))

df_sum=pd.DataFrame({'sh_sum':sh_sum})
df_word_sum=pd.concat([shop_count,df_sum],axis=1,ignore_index=True)
df_word_sum.columns=['shop','count','shop_sum']    # shop-name word, occurrences, total sales of items from shops whose name contains the word

df_word_sum.sort_values('shop_sum',inplace=True,ascending=True)    # sort ascending
df_sh=df_word_sum.tail(30)

index=np.arange(df_sh.shop.size)
plt.figure(figsize=(12,9))
plt.barh(index,df_sh.shop_sum,align='center')
plt.yticks(index,df_sh.shop,fontsize=11)

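[Analysis 2] Shop-name analysis

1. Most shop names contain words such as "旗艦店" (flagship store), "食品" (food), and "專營" (specialty store).

2. 良品鋪子 tops the list. Although it has few items (only 14), its sales are high, even higher than those of 天貓超市 (Tmall Supermarket).

[Summary 2] A shop name should preferably contain "旗艦店", which makes it more convincing to buyers.

[Price distribution]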

data_p=data[data['price']<150]
print('Share of items priced under 150: %.3f'%(len(data_p)/len(data)))

plt.figure(figsize=(7,5))
plt.hist(data_p['price'],bins=25)
plt.xlabel('Price',fontsize=12)
plt.ylabel('Number of items',fontsize=12)
plt.title('Number of items by price',fontsize=12)

Share of items priced under 150: 0.990

data_s=data[data['deal']<20000] 
data_s=data_s[data_s['deal']>200]
print('Share of items with sales between 200 and 20000: %.3f'%(len(data_s)/len(data)))

plt.figure(figsize=(7,5))
plt.hist(data_s['deal'],bins=25)
plt.xlabel('Sales',fontsize=12)
plt.ylabel('Number of items',fontsize=12)
plt.title('Number of items by sales',fontsize=12)

Share of items with sales between 200 and 20000: 0.419

# Split price into 12 equal-frequency groups with qcut
data['group']=pd.qcut(data.price,12)
df_group=data.group.value_counts().reset_index()

# Group by the group column and take the mean of deal (sales)
df_sg=data[['deal','group']].groupby('group').mean().reset_index()

# Draw the bar chart
index=np.arange(df_sg.group.size)
plt.figure(figsize=(18,5))
plt.bar(index,df_sg.deal)
plt.xticks(index,df_sg.group,fontsize=11)
plt.xlabel('Price range',fontsize=12)
plt.ylabel('Average sales',fontsize=12)
plt.title('Average sales by price range',fontsize=12)

fig,ax=plt.subplots()
ax.scatter(data['price'],data['deal'])
ax.set_xlabel('Price')
ax.set_ylabel('Sales')
ax.set_title('Effect of price on sales')

data['GMV']=data['price']*data['deal']     # revenue = price x sales

fig,ax=plt.subplots()
ax.scatter(data['price'],data['GMV'])
ax.set_xlabel('Price')
ax.set_ylabel('Revenue')
ax.set_title('Effect of price on revenue')

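[Analysis 3] Effect of pricing, based on price, sales, and revenue

1. Items priced at around 25 yuan are the most numerous.

2. The higher the sales, the fewer the items, which forms a long-tail distribution.

3. The 16.8 to 19 yuan price range has the highest average sales, and the 29.9 to 34.9 yuan range is the second peak.

4. Items priced between 15 and 45 yuan do well in both sales and revenue. Snacks priced above 50 yuan do not sell well.

[Summary 3] For the snack market, a "small profits, quick turnover" strategy fits best; the under-50-yuan segment has the largest market share.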
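
The price ranges in this analysis come from `pd.qcut`, which cuts at quantile values so that each group holds roughly the same number of items. To illustrate the idea without pandas, here is a simplified sketch that splits items by price rank instead. `mean_sales_by_price_band` is a hypothetical helper, the numbers are made up, and unlike `qcut` this version may place items with identical prices in different groups.

```python
def mean_sales_by_price_band(prices, deals, bands):
    """Split items into `bands` near-equal-sized groups by price rank and
    return (lowest price, highest price, mean sales) for each group."""
    order = sorted(range(len(prices)), key=lambda i: prices[i])
    n = len(order)
    result = []
    for b in range(bands):
        idx = order[b * n // bands:(b + 1) * n // bands]
        if not idx:                    # more bands than items
            continue
        band_prices = [prices[i] for i in idx]
        band_deals = [deals[i] for i in idx]
        result.append((min(band_prices), max(band_prices),
                       sum(band_deals) / len(band_deals)))
    return result


# Illustrative values only, not taken from the crawled data
prices = [9.9, 15.0, 18.8, 25.0, 39.9, 68.0]
deals = [300, 500, 900, 700, 400, 100]
for low, high, mean in mean_sales_by_price_band(prices, deals, 3):
    print(low, high, mean)
# 9.9 15.0 400.0
# 18.8 25.0 800.0
# 39.9 68.0 250.0
```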

[Regional distribution]

plt.figure(figsize=(12,4))
data.province.value_counts().plot(kind='bar')
plt.xticks(rotation=0)    # keep the tick labels horizontal
plt.xlabel('Province',fontsize=12)
plt.ylabel('Number of items',fontsize=12)
plt.title('Number of items by province',fontsize=12)

pro_sales=data.pivot_table(index='province',values='deal',aggfunc='mean')   # mean sales per province
pro_sales.sort_values('deal',inplace=True,ascending=False)
pro_sales=pro_sales.reset_index()

index=np.arange(pro_sales.deal.size)
plt.figure(figsize=(12,4))
plt.bar(index,pro_sales.deal)
plt.xticks(index,pro_sales.province,rotation=0)    # keep the tick labels horizontal
plt.xlabel('Province',fontsize=12)
plt.ylabel('Average sales',fontsize=12)
plt.title('Average sales per item by province',fontsize=12)

pro_sales.to_excel('data/pro_sales.xlsx',index=False)

city_sales=data.pivot_table(index='city',values='deal',aggfunc='sum')   # total sales per city
city_sales.sort_values('deal',inplace=True,ascending=False)
city_sales=city_sales.reset_index()
city_sales=city_sales[:30]

index=np.arange(city_sales.deal.size)
plt.figure(figsize=(12,4))
plt.bar(index,city_sales.deal)
plt.xticks(index,city_sales.city,rotation=0)    # keep the tick labels horizontal
plt.xlabel('City',fontsize=12)
plt.ylabel('Total sales',fontsize=12)
plt.title('Total sales by city',fontsize=12)

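[Analysis 4] Regional differences

1. Fujian has the most snack listings (perhaps driven by 沙縣小吃?), followed by Shanghai.

2. Hubei has relatively few listings, yet its average sales per item are the highest. Among cities, Wuhan and Shanghai have the strongest sales.

3. Sichuan-Chongqing food, which dominated the rankings in [Analysis 1], has only mediocre sales here. Popular in name but not in purchases?

4. The cities with high sales are mostly economically developed southern cities. Does a strong economy drive food sales? Southern snacks sell better than northern ones, presumably because they come in more varieties.

[Postscript] Many readers have messaged me asking for stop.csv (the stop-word file; you can define your own stop words or find a common stop-word list online), so I have uploaded both "stop.csv" and "淘寶美食35頁".

Link: https://pan.baidu.com/s/1UYr3kkFcNmjipaR9XSJeiQ
Extraction code: gzfq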
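
A province with only a few listings can top the average-sales ranking on the strength of one or two bestsellers, so it helps to read the mean together with the item count. A minimal sketch of that summary follows; `summarize_by_province` is a hypothetical helper and the numbers are made up for illustration.

```python
from collections import defaultdict


def summarize_by_province(rows):
    """rows: iterable of (province, deal). Return {province: (count, mean sales)}."""
    buckets = defaultdict(list)
    for province, deal in rows:
        buckets[province].append(deal)
    return {p: (len(v), sum(v) / len(v)) for p, v in buckets.items()}


# Illustrative values only, not taken from the crawled data
rows = [('福建', 100), ('福建', 300), ('福建', 200), ('湖北', 900)]
summary = summarize_by_province(rows)
print(summary['福建'])   # (3, 200.0)
print(summary['湖北'])   # (1, 900.0)
```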
