User Purchase Behavior Analysis (K-means Clustering Model)

This article is a repost. 2019-11-27 20:23, Data Analysis

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score

# Load the data: four tables
prior = pd.read_csv("order_products__prior.csv")
products = pd.read_csv("products.csv")
orders = pd.read_csv("orders.csv")
aisles = pd.read_csv("aisles.csv")

# Merge the four tables into one, joining on the shared key columns
_mg = pd.merge(prior, products, on="product_id")
_mg = pd.merge(_mg, orders, on="order_id")
mt = pd.merge(_mg, aisles, on="aisle_id")

mt.head()

	order_id	product_id	add_to_cart_order	reordered	product_name	aisle_id	department_id	user_id	eval_set	order_number	order_dow	order_hour_of_day	days_since_prior_order	aisle
0	2	33120	1	1	Organic Egg Whites	86	16	202279	prior	3	5	9	8.0	eggs
1	26	33120	5	0	Organic Egg Whites	86	16	153404	prior	2	0	16	7.0	eggs
2	120	33120	13	0	Organic Egg Whites	86	16	23750	prior	11	6	8	10.0	eggs
3	327	33120	5	1	Organic Egg Whites	86	16	58707	prior	21	6	9	8.0	eggs
4	390	33120	28	1	Organic Egg Whites	86	16	166654	prior	48	0	12	9.0	eggs
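The merge chain above can be checked on a few rows before running it on the full files. The following is a minimal sketch using invented toy tables (the values are not from the real dataset); `validate="many_to_one"` is an optional pandas argument, added here as an assumption about the data, that raises an error if a lookup table contains duplicate keys.

```python
import pandas as pd

# Toy stand-ins for the four CSV files (all values are invented).
prior = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [10, 11, 10]})
products = pd.DataFrame({
    "product_id": [10, 11],
    "product_name": ["eggs A", "tea B"],
    "aisle_id": [86, 94],
})
orders = pd.DataFrame({"order_id": [1, 2], "user_id": [100, 200]})
aisles = pd.DataFrame({"aisle_id": [86, 94], "aisle": ["eggs", "tea"]})

# Same join keys as the notebook; each lookup table must have unique keys.
merged = (
    prior.merge(products, on="product_id", how="inner", validate="many_to_one")
         .merge(orders, on="order_id", how="inner", validate="many_to_one")
         .merge(aisles, on="aisle_id", how="inner", validate="many_to_one")
)
n_rows = len(merged)
print(merged)
```

Because every join is many-to-one, the merged table should have exactly as many rows as the toy `prior` table.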

# Cross-tabulation (a specialized grouping tool): count each user's purchases per aisle
cross = pd.crosstab(mt["user_id"], mt["aisle"])

cross.head(10)

aisle	air fresheners candles	asian foods	baby accessories	baby bath body care	baby food formula	bakery desserts	baking ingredients	baking supplies decor	beauty	beers coolers	...	spreads	tea	tofu meat alternatives	tortillas flat bread	trail mix snack mix	trash bags liners	vitamins supplements	water seltzer sparkling water	white wines	yogurt
user_id
1	0	0	0	0	0	0	0	0	0	0	...	1	0	0	0	0	0	0	0	0	1
2	0	3	0	0	0	0	2	0	0	0	...	3	1	1	0	0	0	0	2	0	42
3	0	0	0	0	0	0	0	0	0	0	...	4	1	0	0	0	0	0	2	0	0
4	0	0	0	0	0	0	0	0	0	0	...	0	0	0	1	0	0	0	1	0	0
5	0	2	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	3
6	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
7	0	0	0	0	0	0	2	0	0	0	...	0	0	0	0	0	0	0	0	0	5
8	0	1	0	0	0	0	1	0	0	0	...	0	0	0	0	0	0	0	0	0	0
9	0	0	0	0	6	0	2	0	0	0	...	0	0	0	0	0	0	0	2	0	19
10	0	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	2

10 rows × 134 columns
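To see what the cross-tabulation computes, here is a small sketch on invented data. It also shows a `groupby` formulation that should give the same counts; the names `mt_toy`, `cross_toy`, and `alt` are illustrative only.

```python
import pandas as pd

# One row per purchased item: which user bought from which aisle (invented data).
mt_toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "aisle": ["eggs", "eggs", "tea", "tea", "yogurt"],
})

# Rows are users, columns are aisles, cells are purchase counts.
cross_toy = pd.crosstab(mt_toy["user_id"], mt_toy["aisle"])

# The same table built by grouping, counting, and pivoting the aisle level.
alt = mt_toy.groupby(["user_id", "aisle"]).size().unstack(fill_value=0)

print(cross_toy)
```

User 1 bought twice from the eggs aisle and once from tea, so that row holds 2, 1, 0.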

# Principal component analysis: keep enough components to explain 90% of the variance
pca = PCA(n_components=0.9)

data = pca.fit_transform(cross)

data.shape  # 134 original columns; after PCA only 27 components are kept

(206209, 27)
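When `n_components` is a float between 0 and 1, scikit-learn keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below illustrates this on synthetic data (a random low-rank matrix standing in for the user-by-aisle table, which is an assumption made only for the demo).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 samples, 20 features driven by 3 latent factors plus noise.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 20))

pca = PCA(n_components=0.9)
reduced = pca.fit_transform(X)

# Cumulative share of variance explained by the retained components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_, cumulative)
```

Inspecting `cumulative` shows how much variance each additional component contributes, which helps judge whether 0.9 is a sensible threshold for a given dataset.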

# Use only a small subset of the data to keep the computation fast
x = data[:500]
x.shape

(500, 27)

# Assume the users fall into four clusters
km = KMeans(n_clusters=4)
km.fit(x)

predict = km.predict(x)
predict

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,
       2, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 2, 1, 1, 1, 1, 1, 0, 2, 1,
       1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,
       1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 0, 3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 2, 1, 1, 1, 1, 2, 0, 2, 1, 2, 1, 1, 1, 1, 1, 1,
       2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 0, 2, 1, 1, 1, 1, 1,
       1, 1, 3, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       2, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 2, 1, 0, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1,
       1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1,
       1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 3, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 3, 1, 1, 1, 2, 1, 1, 0, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 2, 0, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,
       1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1])
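A long label array is hard to read, and K-means results depend on the random initialization. The sketch below, run on synthetic data of the same shape (an assumption for the demo, not the real PCA output), fixes `random_state` for repeatable results and summarizes the labels as cluster sizes.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
x_toy = rng.normal(size=(500, 27))  # synthetic stand-in for the PCA output

# random_state makes the run repeatable; n_init=10 restarts from 10 initializations.
km_toy = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = km_toy.fit_predict(x_toy)

# Number of samples assigned to each of the four clusters.
sizes = np.bincount(labels, minlength=4)
print(sizes)
```

A very uneven size distribution, as the label array above suggests for the real data, is worth checking before interpreting the clusters.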

# Plot the clustering result
plt.figure(figsize=(10, 10))

<Figure size 720x720 with 0 Axes>

# Build a list of four colors, one per cluster
colored = ['orange', 'blue', 'purple', 'green']
colr = [colored[i] for i in predict]

# Use PCA component index 3 for the x-axis and component index 18 for the y-axis
plt.scatter(x[:, 3], x[:, 18], color=colr)
plt.xlabel('3')
plt.ylabel('18')
plt.show()

# Evaluate clustering quality with the silhouette coefficient
silhouette_score(x, predict)

0.6115021999326935
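The silhouette coefficient can also help choose the number of clusters instead of assuming four. The following sketch is one possible approach, shown on synthetic blobs (invented data with three obvious groups): it fits K-means for several values of k and keeps the one with the highest score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs of 50 points each.
centers = np.array([[0.0, 0.0], [8.0, 8.0], [0.0, 8.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit K-means for each candidate k and record the silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Scores lie between -1 and 1, and higher values indicate tighter, better-separated clusters.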

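K-means is often called Lloyd's algorithm. It is the most classic model in data clustering and one of the easier ones to understand. The algorithm runs in four stages.

1. First, randomly choose K points in the feature space as the initial cluster centers.
2. Then, for each data point, use its feature vector to find the nearest of the K cluster centers and label the point with that center.
3. Next, once every data point has been labeled, recompute the K cluster centers from the new assignments: each new centroid is the mean of all samples assigned to the corresponding previous centroid.
4. Finally, compare the old and new centroids. If no data point's cluster assignment has changed since the previous iteration, stop; otherwise return to step 2 and repeat.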
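The four steps above can be sketched in NumPy as follows. This is an illustrative implementation, not scikit-learn's code. The function name `lloyd_kmeans` is hypothetical, and as a simplifying assumption the initial centers are drawn from the data points themselves rather than from arbitrary points in feature space.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Minimal Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once the assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Usage on two synthetic blobs of 20 points each.
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(20, 2)),
])
labels_demo, centroids_demo = lloyd_kmeans(X_demo, k=2)
print(labels_demo)
```

In practice `sklearn.cluster.KMeans` is preferable, since it adds smarter initialization (k-means++) and multiple restarts.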

K-means is equivalent to the expectation-maximization algorithm with a small, all-equal, diagonal covariance matrix.
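The link to expectation-maximization can be made concrete with the standard textbook argument, which is not derived in the original post. Assume a Gaussian mixture with mixing weights $\pi_k$, means $\mu_k$, and a shared covariance $\epsilon I$ (these symbols are introduced here for illustration). The E-step responsibility of component $k$ for point $x_n$ then collapses to a hard assignment as $\epsilon \to 0$:

```latex
\gamma_{nk}
  = \frac{\pi_k \exp\!\left(-\lVert x_n - \mu_k \rVert^2 / 2\epsilon\right)}
         {\sum_{j=1}^{K} \pi_j \exp\!\left(-\lVert x_n - \mu_j \rVert^2 / 2\epsilon\right)}
  \;\xrightarrow{\;\epsilon \to 0\;}\;
  \begin{cases}
    1 & \text{if } k = \arg\min_j \lVert x_n - \mu_j \rVert^2, \\
    0 & \text{otherwise.}
  \end{cases}
```

With these hard responsibilities, the M-step update $\mu_k = \sum_n \gamma_{nk} x_n / \sum_n \gamma_{nk}$ becomes the mean of the points assigned to cluster $k$, which is exactly the centroid update in K-means.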
