[Hands-on Project] Tmall Repeat Purchase Prediction: Data Exploration

This article is a repost. 2022-03-21 23:00. Category: Machine Learning / Hands-on Projects

Importing tools and reading the data

Importing tools

import numpy as np                                     
import pandas as pd                                    
import matplotlib.pyplot as plt                          
import seaborn as sns                                    
from scipy import stats                                  
import warnings                                          
warnings.filterwarnings("ignore")                        
%matplotlib inline

Reading the data

test_data = pd.read_csv('./data_format1/test_format1.csv')       # test set
train_data = pd.read_csv('./data_format1/train_format1.csv')     # training set
user_info = pd.read_csv('./data_format1/user_info_format1.csv')  # user profile data
user_log = pd.read_csv('./data_format1/user_log_format1.csv')    # user behavior log
# user_info = pd.read_csv('./data_format1/user_info_format1.csv').drop_duplicates()                          # drop duplicate rows from the user profile data
# user_log = pd.read_csv('./data_format1/user_log_format1.csv').rename(columns={"seller_id":'merchant_id'})  # rename the seller_id column of the behavior log to merchant_id

Inspecting sample rows

train_data.head(5)

	user_id	merchant_id	label
0	34176	3906
1	34176	121
2	34176	4356
3	34176	2217
4	230784	4818

test_data.head(5)

	user_id	merchant_id	prob
0	163968	4605
1	360576	1581
2	98688	1964
3	98688	3645
4	295296	3361

user_info.head(5)

	user_id	age_range	gender
0	376517	6.0
1	234512	5.0
2	344532	5.0
3	186135	5.0
4	30230	5.0

user_log.head(5)

	user_id	item_id	cat_id	seller_id	brand_id	time_stamp	action_type
0	328862	323294	833	2882	2661.0	829
1	328862	844400	1271	2882	2661.0	829
2	328862	575153	1271	2882	2661.0	829
3	328862	996875	1271	2882	2661.0	829
4	328862	1086186	1271	1253	1049.0	829

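Univariate data analysis

Data types and data size

User information data

The dataset has 2 float64 columns and 1 int64 column
Data size: 9.7 MB
The dataset has 424,170 rows

User behavior data

The dataset has 6 int64 columns and 1 float64 column
Data size: 2.9 GB
The dataset has 54,925,330 rows

User purchase training data

All columns are int64
Data size: 6 MB
The dataset has 260,864 rows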
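The article lists these figures without showing how they were obtained. A minimal sketch of one way to collect the same three facts (columns per dtype, row count, in-memory size) with pandas is shown here. The frame `toy_info` and its values are made up for illustration and only mimic the column layout of `user_info`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for user_info; the values are invented for illustration.
toy_info = pd.DataFrame({
    "user_id": np.arange(1, 6, dtype="int64"),
    "age_range": [6.0, 5.0, np.nan, 0.0, 3.0],
    "gender": [1.0, 0.0, 2.0, np.nan, 1.0],
})

dtype_counts = toy_info.dtypes.astype(str).value_counts()  # number of columns per dtype
n_rows = len(toy_info)                                      # number of records
n_bytes = int(toy_info.memory_usage(deep=True).sum())       # in-memory size in bytes

print(dtype_counts)
print(n_rows, "rows,", n_bytes, "bytes")
```

Calling `toy_info.info()` prints a similar summary in one step. The age and gender columns come out as float64 because they contain missing values, which matches the dtype split reported for the user information data.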

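Checking missing values

Missing age

The share of rows with an empty age value is 0.5%
Rows where the age is missing or equals the default value 0: 95,131

(user_info.shape[0]-user_info['age_range'].count())/user_info.shape[0]             # count() returns the number of non-null values; shape[0] is the number of rows
user_info[user_info['age_range'].isna() | (user_info['age_range'] == 0)].count()   # isna() flags null values; this line counts the rows whose age is missing or 0
user_info.groupby(['age_range'])[['user_id']].count()                              # groupby() groups the rows by age_range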

Missing gender

The share of rows with an empty gender value is 1.5%
Rows where the gender is missing or equals the default value 2: 95,131

(user_info.shape[0]-user_info['gender'].count())/user_info.shape[0]
user_info[user_info['gender'].isna() | (user_info['gender'] == 2)].count()
user_info.groupby(['gender'])[['user_id']].count()

Age or gender missing (at least one of the two)

106,330 rows in total

user_info[user_info['age_range'].isna() | (user_info['age_range'] == 0) | user_info['gender'].isna() | (user_info['gender'] == 2)].count()
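The missing-value checks above all follow the same pattern: a boolean mask that is true when a field is null or holds its "unknown" default (0 for age, 2 for gender). The sketch below replays that pattern on a small invented frame so the arithmetic can be followed by hand. `toy_info` and its values are assumptions for illustration, not data from the competition files.

```python
import numpy as np
import pandas as pd

# Invented rows that mimic the layout of user_info.
toy_info = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "age_range": [6.0, np.nan, 0.0, 3.0, 4.0, 0.0],
    "gender": [1.0, 0.0, 2.0, np.nan, 1.0, 0.0],
})

# True where the field is null or carries its "unknown" default value.
age_unknown = toy_info["age_range"].isna() | (toy_info["age_range"] == 0)
gender_unknown = toy_info["gender"].isna() | (toy_info["gender"] == 2)

age_null_rate = toy_info["age_range"].isna().mean()            # share of null ages
n_age_unknown = int(age_unknown.sum())                         # null or 0
n_gender_unknown = int(gender_unknown.sum())                   # null or 2
n_either_unknown = int((age_unknown | gender_unknown).sum())   # at least one unknown

print(age_null_rate, n_age_unknown, n_gender_unknown, n_either_unknown)
```

Summing the mask gives a single row count. Calling `.count()` on the filtered frame, as the article does, instead reports the non-null count of every column, so the `user_id` entry of that output is the row count to read.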

User behavior log

The brand_id field has 91,015 missing values

user_log.isna().sum()

user_id            0
item_id            0
cat_id             0
seller_id          0
brand_id       91015
time_stamp         0
action_type        0
dtype: int64

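Observing the data distribution

Overall summary statistics

user_info.describe()
user_log.describe()                    # returns summary statistics for these two core tables, used to inspect the range, scale and spread of each column as groundwork for later model selection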

	user_id	age_range	gender
count	424170.000000	421953.000000
mean	212085.500000	2.930262
std	122447.476179	1.942978
min	1.000000	0.000000
25%	106043.250000	2.000000
50%	212085.500000	3.000000
75%	318127.750000	4.000000
max	424170.000000	8.000000

	user_id	item_id	cat_id	seller_id	brand_id	time_stamp	action_type
count	5.492533e+07	5.492533e+07	5.492533e+07	5.492533e+07	5.483432e+07	5.492533e+07
mean	2.121568e+05	5.538613e+05	8.770308e+02	2.470941e+03	4.153348e+03	9.230953e+02
std	1.222872e+05	3.221459e+05	4.486269e+02	1.473310e+03	2.397679e+03	1.954305e+02
min	1.000000e+00	1.000000e+00	1.000000e+00	1.000000e+00	1.000000e+00	5.110000e+02
25%	1.063360e+05	2.731680e+05	5.550000e+02	1.151000e+03	2.027000e+03	7.300000e+02
50%	2.126540e+05	5.555290e+05	8.210000e+02	2.459000e+03	4.065000e+03	1.010000e+03
75%	3.177500e+05	8.306890e+05	1.252000e+03	3.760000e+03	6.196000e+03	1.109000e+03
max	4.241700e+05	1.113166e+06	1.671000e+03	4.995000e+03	8.477000e+03	1.112000e+03

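Distribution of positive and negative samples

label_gp = train_data.groupby('label')['user_id'].count()                                                   # count the rows with label 0 and label 1
print('Number of positive and negative samples:\n',label_gp)
_,axe = plt.subplots(1,2,figsize=(12,6))                                                                    # one row, two panels; set the figure size
train_data.label.value_counts().plot(kind='pie',autopct='%1.1f%%',shadow=True,explode=[0,0.1],ax=axe[0])    # pie chart in the first panel
sns.countplot(x='label',data=train_data,ax=axe[1])                                                          # count plot (bar chart) in the second panel; x is passed by keyword, as newer seaborn versions require

The classes are clearly imbalanced, so some measure is needed to deal with the imbalance.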
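The article does not say which measure it has in mind. As one common option, and purely as an illustrative assumption, the sketch below derives inverse-frequency class weights with the "balanced" heuristic n_samples / (n_classes * class_count), then maps them to per-row sample weights. `toy_train` is invented data with a 9:1 label split.

```python
import pandas as pd

# Invented labels with a 9:1 imbalance between class 0 and class 1.
toy_train = pd.DataFrame({"label": [0] * 9 + [1]})

counts = toy_train["label"].value_counts()   # rows per class
# "Balanced" heuristic: n_samples / (n_classes * class_count).
class_weight = (len(toy_train) / (len(counts) * counts)).to_dict()
# One weight per training row, usable by estimators that accept sample weights.
sample_weight = toy_train["label"].map(class_weight)

print(class_weight)
```

With this weighting the minority class is up-weighted in proportion to how rare it is, and the weights sum back to the number of rows. Resampling (down-sampling the majority class or up-sampling the minority class) is another option and may suit some models better.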

Exploring the effect of merchant, user, gender and age on repeat purchases

Merchants vs. repeat purchase

print('Top 5 merchants\nMerchant\tPurchases')
print(train_data.merchant_id.value_counts().head(5))
train_data_merchant = train_data.copy()      # DataFrame.copy() is a deep copy by default, so the column added below does not modify train_data
train_data_merchant['TOP5'] = train_data_merchant['merchant_id'].map(lambda x: 1 if x in [4044,3828,4173,1102,4976] else 0)      # the lambda flags the rows that belong to one of the top-5 merchants
train_data_merchant = train_data_merchant[train_data_merchant['TOP5']==1]
plt.figure(figsize=(8,6))
plt.title('Merchant VS Label')
ax = sns.countplot(x='merchant_id',hue='label',data=train_data_merchant)
for p in ax.patches:
    height = p.get_height()                  # bar height = number of rows for this merchant/label pair

Top 5 merchants
Merchant	Purchases
4044    3379
3828    3254
4173    2542
1102    2483
4976    1925
Name: merchant_id, dtype: int64
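The code above hard-codes the five merchant IDs after reading them from the printed output. A hedged alternative, shown on invented data, takes the IDs straight from `value_counts()` and filters with `isin`, so the selection stays correct if the data changes. The names `toy_train`, `top_n` and `label_counts` are illustrative and not from the article.

```python
import pandas as pd

# Invented training rows; merchants 10 and 20 each appear three times.
toy_train = pd.DataFrame({
    "user_id":     [1, 2, 3, 4, 5, 6, 7, 8],
    "merchant_id": [10, 10, 10, 20, 20, 30, 40, 20],
    "label":       [0, 1, 0, 0, 0, 1, 0, 1],
})

top_n = 2
# IDs of the top_n most frequent merchants, taken from the data itself.
top_ids = toy_train["merchant_id"].value_counts().head(top_n).index
top_rows = toy_train[toy_train["merchant_id"].isin(top_ids)]

# Rows per (merchant, label) pair: the same numbers a count plot with hue='label' draws.
label_counts = top_rows.groupby(["merchant_id", "label"]).size().unstack(fill_value=0)
print(label_counts)
```

The `label_counts` table holds the bar heights of the count plot in numeric form, which is handy when the exact values matter more than the picture.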

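Distribution of the merchant repeat-purchase rate

merchant_repeat_buy = [ rate for rate in train_data.groupby(['merchant_id'])['label'].mean() if rate <= 1 and rate > 0]   # mean label per merchant, keeping only rates in (0, 1]
plt.figure(figsize=(8,4))
ax=plt.subplot(1,2,1)
sns.distplot(merchant_repeat_buy, fit=stats.norm)              # histogram with a fitted normal curve; distplot is deprecated since seaborn 0.11
ax=plt.subplot(1,2,2)
res = stats.probplot(merchant_repeat_buy, plot=plt)            # probability (Q-Q) plot, a way to check a sample against a theoretical distribution (here the normal). The red line is the normal reference and the blue points are the sample; the closer the points lie to the line, the better the sample matches the normal distribution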
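To make the reading of the probability plot concrete, the sketch below calls `stats.probplot` without plotting and compares the correlation coefficient `r` of the fitted line for two synthetic samples. The samples are randomly generated for illustration and have nothing to do with the Tmall data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)   # drawn from a normal distribution
skewed_sample = rng.exponential(scale=1.0, size=500)       # strongly right-skewed

# probplot returns (theoretical quantiles, ordered sample values) and the
# least-squares fit (slope, intercept, r) of the reference line.
_, (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print("r for the normal sample:", r_normal)
print("r for the skewed sample:", r_skewed)
```

An `r` close to 1 corresponds to points hugging the reference line. A skewed sample bends away from the line and yields a visibly lower `r`.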

Distribution of the repeat-purchase rate for users with at least one repeat purchase

user_repeat_buy = [rate for rate in train_data.groupby(['user_id'])['label'].mean() if rate <= 1 and rate > 0] 

plt.figure(figsize=(8,6))

ax=plt.subplot(1,2,1)
sns.distplot(user_repeat_buy, fit=stats.norm)            
ax=plt.subplot(1,2,2)
res = stats.probplot(user_repeat_buy, plot=plt)

Over the past six months the user repeat-purchase rate is low; most users buy only once.
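Note that the list comprehension above keeps only users whose rate is greater than 0, so the plots leave out everyone who never bought again. A small sketch on invented data shows how the per-user rate is formed and how to measure the share of users that the filter removes. `toy_train` and the variable names are assumptions for illustration.

```python
import pandas as pd

# Invented (user, label) rows; label 1 marks a repeat purchase at some merchant.
toy_train = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3, 4],
    "label":   [0, 0, 1, 0, 1, 0, 0],
})

user_rate = toy_train.groupby("user_id")["label"].mean()   # repeat rate per user
share_never_repeat = (user_rate == 0).mean()                # users dropped by the `rate > 0` filter
repeaters = user_rate[user_rate > 0]                        # what the plots above actually show

print(user_rate)
print("share of users with no repeat purchase:", share_never_repeat)
```

Reporting `share_never_repeat` next to the plots backs the statement that most users buy only once with a number.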

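User gender vs. repeat purchase

train_data_user_info = train_data.merge(user_info, on=['user_id'], how='left')    # assumed definition (not shown in the source): attach age_range and gender to every training row
plt.figure(figsize=(8,8))
plt.title('Gender VS Label')
ax = sns.countplot(x='gender',hue='label',data=train_data_user_info)
for p in ax.patches:
    height = p.get_height()                  # bar height = number of rows for this gender/label pair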

Distribution of the repeat-purchase rate by gender

repeat_buy = [rate for rate in train_data_user_info.groupby(['gender'])['label'].mean()] 

plt.figure(figsize=(8,4))

ax=plt.subplot(1,2,1)
sns.distplot(repeat_buy, fit=stats.norm)
ax=plt.subplot(1,2,2)
res = stats.probplot(repeat_buy, plot=plt)

User age vs. repeat purchase

plt.figure(figsize=(8,8))
plt.title('Age VS Label')
ax = sns.countplot(x='age_range',hue='label',data=train_data_user_info)

Distribution of the repeat-purchase rate by age

repeat_buy = [rate for rate in train_data_user_info.groupby(['age_range'])['label'].mean()] 

plt.figure(figsize=(8,4))

ax=plt.subplot(1,2,1)
sns.distplot(repeat_buy, fit=stats.norm)
ax=plt.subplot(1,2,2)
res = stats.probplot(repeat_buy, plot=plt)

The repeat-purchase rate differs across age ranges.
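Because there are only a handful of age ranges, a small table of rate and group size per range can be easier to read than a distribution plot over a few points. The sketch below builds such a table on invented data. It assumes the training rows are joined to the user profile with a left merge on `user_id`; all names and values are illustrative.

```python
import numpy as np
import pandas as pd

# Invented training rows and user profiles.
toy_train = pd.DataFrame({
    "user_id":     [1, 2, 3, 4, 5, 6],
    "merchant_id": [10, 10, 20, 20, 30, 30],
    "label":       [0, 1, 0, 0, 1, 1],
})
toy_info = pd.DataFrame({
    "user_id":   [1, 2, 3, 4, 5, 6],
    "age_range": [2.0, 2.0, 3.0, 3.0, 3.0, np.nan],
    "gender":    [0.0, 1.0, 0.0, 1.0, 1.0, 2.0],
})

# Left merge keeps every training row even when profile fields are missing.
merged = toy_train.merge(toy_info, on="user_id", how="left")

# Repeat rate and number of rows per age range; rows with a null age are
# left out by groupby's default dropna=True.
rate_by_age = merged.groupby("age_range")["label"].agg(["mean", "count"])
print(rate_by_age)
```

Keeping the `count` column next to the rate helps judge whether a difference between two age ranges rests on enough rows to be meaningful.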
