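This post segments customers by their characteristics using cluster analysis. The idea is similar to the RFM model, but more indicators are available than just the three RFM ones. The data used here is customer data from an airline, and the indicators include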

Here is the code:
import pandas as pd

data = pd.read_csv('air_data.csv')
# Basic summary statistics of the data
data.describe()

# Null counts per column; some attributes turn out to have quite a few nulls
data.isnull().sum().sort_values(ascending=False).head(10)
The columns with the most null values are as follows:
# Collect the null count, maximum, and minimum of each column
max_data = data.max()
min_data = data.min()
null_data = data.isnull().sum()
data_count = pd.DataFrame({'max_data': max_data, 'min_data': min_data, 'null_data': null_data})

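# Data cleaning
# Drop records whose fare is null
data = data[data['SUM_YR_1'].notnull() & data['SUM_YR_2'].notnull()]
# Alternative: data.dropna(subset=['SUM_YR_2', 'SUM_YR_1'])
# Keep only records with a nonzero fare, or with total distance and average discount both equal to 0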
index1 = data['SUM_YR_1'] != 0
index2 = data['SUM_YR_2'] != 0
index3 = (data['SEG_KM_SUM'] == 0) & (data['avg_discount'] == 0)
data = data[index1 | index2 | index3].copy()  # copy so that new columns can be added without a SettingWithCopyWarning
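The filter above is easier to follow on a few rows. The sketch below applies the same three conditions to a made-up table (the values and the names `toy`, `paid_year1`, `paid_year2`, and `no_travel` are illustrative assumptions, not part of the airline data) and shows which rows survive.

```python
import pandas as pd

# Toy records; the values are invented purely for illustration.
toy = pd.DataFrame({
    'SUM_YR_1':     [100.0, 0.0, 0.0, 0.0],
    'SUM_YR_2':     [0.0, 250.0, 0.0, 0.0],
    'SEG_KM_SUM':   [1200, 3000, 0, 800],
    'avg_discount': [0.8, 0.6, 0.0, 0.5],
})

paid_year1 = toy['SUM_YR_1'] != 0          # fare paid in the first year
paid_year2 = toy['SUM_YR_2'] != 0          # fare paid in the second year
# No distance flown and no discount recorded: a zero fare is consistent here
no_travel = (toy['SEG_KM_SUM'] == 0) & (toy['avg_discount'] == 0)

keep = paid_year1 | paid_year2 | no_travel
kept = toy[keep]
dropped = toy[~keep]

print(kept.index.tolist())     # rows 0, 1 and 2 are kept
print(dropped.index.tolist())  # row 3 is dropped: zero fare but distance flown
```

Only the last row is removed: it has a zero fare in both years while still showing distance flown and a discount, which is the inconsistent case the rule is meant to catch.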
# Compute the five LRFMC indicators
data['FFP_DATE'] = pd.to_datetime(data['FFP_DATE'])
data['LOAD_TIME'] = pd.to_datetime(data['LOAD_TIME'])
data['L'] = data['LOAD_TIME'] - data['FFP_DATE']
data['R'] = data['LAST_TO_END']
data['F'] = data['FLIGHT_COUNT']
data['M'] = data['SEG_KM_SUM']
data['C'] = data['avg_discount']
finall_data = data.loc[:, ['L', 'R', 'F', 'M', 'C']]
finall_data['L'] = finall_data['L'].dt.days  # convert to days
# Standardize each column (z-score)
finall_data = (finall_data - finall_data.mean(axis=0)) / finall_data.std()
# Cluster analysis
from sklearn.cluster import KMeans
model = KMeans(n_clusters=5)
model.fit(finall_data)
model.cluster_centers_
model.labels_
finall_data['label'] = model.labels_
# Cluster centers as a table; columns[:-1] skips the 'label' column added above
center = pd.DataFrame(model.cluster_centers_, columns=finall_data.columns[:-1])
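Finally, the distribution of the indicators across the customer clusters is shown below; marketing can then be targeted at each cluster.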
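One way to read the clusters is to look at how many customers fall into each and to convert the standardized centers back into the original units. The following is a minimal sketch on synthetic data, not the airline data set: the random values, the names `raw`, `scaled`, `sizes`, `centers_z`, and `centers_raw`, and the `n_init` and `random_state` settings are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the L, R, F, M, C table (assumed values, not real data)
rng = np.random.default_rng(0)
raw = pd.DataFrame(
    rng.normal(loc=[2000, 180, 12, 17000, 0.7],
               scale=[800, 90, 6, 9000, 0.15],
               size=(300, 5)),
    columns=['L', 'R', 'F', 'M', 'C'],
)

# Same z-score standardization as in the post; keep mean and std to undo it later
mean, std = raw.mean(), raw.std()
scaled = (raw - mean) / std

# random_state fixes the initialization so that repeated runs give the same clusters
model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(scaled)

# Number of customers in each cluster
sizes = pd.Series(model.labels_).value_counts().sort_index()

# Centers in standardized units, then mapped back to the original units
centers_z = pd.DataFrame(model.cluster_centers_, columns=scaled.columns)
centers_raw = centers_z * std + mean

print(sizes)
print(centers_raw.round(2))
```

Centers in standardized units show which indicators sit above or below average for a cluster, while the converted centers give values (days, flights, kilometers) that are easier to translate into a marketing rule.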

