Clustering Evaluation Metrics (1) | Adjusted Rand Index and Silhouette Coefficient
1 Adjusted Rand Index (ARI; requires ground-truth labels)
Rand Index
Given a set of \(n\) objects \(S=\left\{O_{1}, O_{2}, \ldots, O_{n}\right\}\), let \(U=\left\{u_{1}, \ldots, u_{R}\right\}\) and \(V=\left\{v_{1}, \ldots, v_{C}\right\}\) be two different partitions of \(S\) satisfying \(\bigcup_{i=1}^{R} u_{i}=S=\bigcup_{j=1}^{C} v_{j}\) and \(u_{i} \cap u_{i^{*}}=\emptyset=v_{j} \cap v_{j^{*}}\) for \(1 \leq i \neq i^{*} \leq R\), \(1 \leq j \neq j^{*} \leq C\).
Suppose U is the external reference, i.e. the true labels (true_label), and V is the clustering result. Define four pair statistics:
- a: the number of pairs of points that are in the same class in U and in the same cluster in V
- b: the number of pairs of points that are in the same class in U but in different clusters in V
- c: the number of pairs of points that are in different classes in U but in the same cluster in V
- d: the number of pairs of points that are in different classes in U and in different clusters in V
The Rand index is then

$$
RI = \frac{a+d}{a+b+c+d} = \frac{a+d}{\binom{n}{2}}
$$

RI lies in \([0, 1]\) and equals 1 when the clustering result perfectly matches the reference partition.
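To make the four pair counts concrete, here is a brute-force sketch that enumerates all \(\binom{n}{2}\) pairs; the helper rand_index is our own, for illustration only:

from itertools import combinations

def rand_index(labels_true, labels_pred):
    # Count the four pair statistics a, b, c, d by enumerating all pairs.
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_u = labels_true[i] == labels_true[j]
        same_v = labels_pred[i] == labels_pred[j]
        if same_u and same_v:
            a += 1  # same class in U, same cluster in V
        elif same_u:
            b += 1  # same class in U, different clusters in V
        elif same_v:
            c += 1  # different classes in U, same cluster in V
        else:
            d += 1  # different classes in U, different clusters in V
    return (a + d) / (a + b + c + d)

print(rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))  # 0.6666666666666666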
Adjusted Rand Index (ARI)
The problem with the Rand index is that for two random partitions its value is not a constant close to 0. Hubert and Arabie proposed the adjusted Rand index in 1985. ARI takes the generalized hypergeometric distribution as the null model: the partitions U and V are assumed to be random, subject to the constraint that the number of data points in each class and in each cluster is fixed.
Let \(n_{ij}\) denote the number of data points that are in both class \(u_{i}\) and cluster \(v_{j}\), let \(n_{i\cdot}\) be the number of points in class \(u_{i}\), and \(n_{\cdot j}\) the number of points in cluster \(v_{j}\), as in the following contingency table:

| | \(v_1\) | \(v_2\) | \(\cdots\) | \(v_C\) | Sums |
| --- | --- | --- | --- | --- | --- |
| \(u_1\) | \(n_{11}\) | \(n_{12}\) | \(\cdots\) | \(n_{1C}\) | \(n_{1\cdot}\) |
| \(u_2\) | \(n_{21}\) | \(n_{22}\) | \(\cdots\) | \(n_{2C}\) | \(n_{2\cdot}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\vdots\) |
| \(u_R\) | \(n_{R1}\) | \(n_{R2}\) | \(\cdots\) | \(n_{RC}\) | \(n_{R\cdot}\) |
| Sums | \(n_{\cdot 1}\) | \(n_{\cdot 2}\) | \(\cdots\) | \(n_{\cdot C}\) | \(n\) |
The adjusted Rand index is:

$$
ARI = \frac{\sum_{i,j}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{n_{i\cdot}}{2}+\sum_{j}\binom{n_{\cdot j}}{2}\right] - \left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right]\Big/\binom{n}{2}}
$$
ARI is thus a chance-corrected form of RI, normalized as (index − expected index) / (max index − expected index). Note that the quantity \(a\) in RI, the number of pairs grouped together in both partitions, can be written as \(\sum_{i, j}\binom{n_{ij}}{2}\).
ARI ∈ [−1, 1]. Larger values mean the clustering result agrees better with the ground truth. More broadly, ARI measures how well two label assignments agree.
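The formula can be checked directly against sklearn. A minimal sketch that computes ARI from the contingency table (the helper ari is ours; math.comb requires Python 3.8+):

from math import comb
from collections import Counter
from sklearn.metrics import adjusted_rand_score

def ari(labels_true, labels_pred):
    n = len(labels_true)
    n_ij = Counter(zip(labels_true, labels_pred))  # contingency cells n_ij
    n_i = Counter(labels_true)                     # row sums n_i.
    n_j = Counter(labels_pred)                     # column sums n_.j
    sum_ij = sum(comb(v, 2) for v in n_ij.values())
    sum_i = sum(comb(v, 2) for v in n_i.values())
    sum_j = sum(comb(v, 2) for v in n_j.values())
    expected = sum_i * sum_j / comb(n, 2)          # expected index under the random model
    max_index = (sum_i + sum_j) / 2
    return (sum_ij - expected) / (max_index - expected)

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(ari(labels_true, labels_pred))                  # ~0.2424
print(adjusted_rand_score(labels_true, labels_pred))  # same value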
- Advantages:
  - For any number of clusters and samples, a random labeling gets an ARI very close to 0;
  - bounded in [-1, 1]: negative values indicate a poor result, and the closer to 1 the better;
  - can be used to compare different clustering algorithms.
- Disadvantage:
  - ARI requires the ground-truth labels.
Python implementation
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

# Basic usage
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # 0.24242424242424246

# Independent of the label names
labels_pred = [1, 1, 0, 0, 3, 3]
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # 0.24242424242424246

# Symmetric in its arguments
score = metrics.adjusted_rand_score(labels_pred, labels_true)
print(score)  # 0.24242424242424246

# A perfect match scores 1
labels_pred = labels_true[:]
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # 1.0
References
https://blog.csdn.net/qq_42887760/article/details/105728101
https://blog.csdn.net/sinat_30203515/article/details/82634778
2 Silhouette Coefficient (for when true class information is unknown)
The silhouette coefficient applies when the true class information is unknown. For a single sample, let a be its mean distance to the other samples in the same cluster, and b its mean distance to the samples in the nearest other cluster; the silhouette coefficient is

$$
s = \frac{b-a}{\max(a, b)}
$$

For a set of samples, the silhouette coefficient is the mean of the per-sample silhouette coefficients. It ranges over \([-1, 1]\); the closer together same-cluster samples are and the farther apart different-cluster samples are, the higher the score.
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

X = [[1,2,3],[1,2,3],[4,6,7],[4,5,6],[2,3,5]]
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_
print(labels)  # [2 2 1 1 0]
print(metrics.silhouette_score(X, labels, metric='euclidean'))  # 0.6371196617847901
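To see where the aggregate score comes from, silhouette_samples returns the per-sample values \(s=(b-a)/\max(a,b)\), whose mean is exactly silhouette_score; this snippet reuses X and labels from the block above:

from sklearn.metrics import silhouette_samples

per_sample = silhouette_samples(X, labels, metric='euclidean')
print(per_sample)         # one value in [-1, 1] per sample
print(per_sample.mean())  # equals the silhouette_score above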
References
https://cloud.tencent.com/developer/article/1010857
Clustering Evaluation Metrics (2) | Adjusted Mutual Information and Homogeneity, Completeness and V-measure
Adjusted Mutual Information (requires the true labels of the data points)
Theory
https://blog.csdn.net/qq_42122496/article/details/106193859
https://cloud.tencent.com/developer/article/1010857
Mutual Information (MI) also measures how well two label assignments agree. MI-based measures require the true class information. MI is non-negative, NMI is normalized into [0, 1], and AMI ranges over [-1, 1]; for all of them, a larger value means the clustering result agrees better with the ground truth.
Code
from sklearn.metrics.cluster import entropy, mutual_info_score, normalized_mutual_info_score, adjusted_mutual_info_score

MI = lambda x, y: mutual_info_score(x, y)
# NMI and AMI are computed here with the arithmetic average; logs are natural (base e).
NMI = lambda x, y: normalized_mutual_info_score(x, y, average_method='arithmetic')
AMI = lambda x, y: adjusted_mutual_info_score(x, y, average_method='arithmetic')
A = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
B = [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 3, 3, 3]
#print(entropy(A))
#print(MI(A, B))
print(NMI(A, B))#0.36456177185718985
print(AMI(A, B))#0.2601812253892505
C = [1, 1, 2, 2, 3, 3, 3]
D = [1, 1, 1, 2, 1, 1, 1]
print(NMI(C, D))#0.28483386264113447
print(AMI(C, D))#0.05674883175532439
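With the arithmetic averaging used above, the normalization is simply NMI = MI / ((H(x) + H(y)) / 2); AMI additionally subtracts the mutual information expected by chance before normalizing. A quick check with the A and B defined above:

# NMI (arithmetic averaging) is MI divided by the average entropy.
print(MI(A, B) / ((entropy(A) + entropy(B)) / 2))  # matches NMI(A, B)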
Homogeneity, completeness and V-measure
https://www.jianshu.com/p/e1ee1f336d35
- homogeneity (float): score between 0.0 and 1.0; 1.0 stands for a perfectly homogeneous labeling
- completeness (float): score between 0.0 and 1.0; 1.0 stands for a perfectly complete labeling
- v_measure (float): the harmonic mean of the first two
A clustering result satisfies homogeneity if all of its clusters contain only data points that are members of a single class. A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster. Both scores take values between 0.0 and 1.0, with larger values being desirable.
from sklearn import metrics
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(metrics.homogeneity_score(labels_true, labels_pred))
print(metrics.completeness_score(labels_true, labels_pred))
print(metrics.v_measure_score(labels_true, labels_pred))
# beta defaults to 1.0; beta < 1 weights homogeneity more strongly, beta > 1 weights completeness more strongly:
print(metrics.v_measure_score(labels_true, labels_pred, beta=0.6))
print(metrics.v_measure_score(labels_true, labels_pred, beta=1.8))
# All three scores can be computed at once
print(metrics.homogeneity_completeness_v_measure(labels_true, labels_pred))
labels_pred = [0, 0, 0, 1, 2, 2]
print(metrics.homogeneity_completeness_v_measure(labels_true, labels_pred))
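These scores have simple entropy-based definitions: homogeneity = 1 − H(C|K)/H(C), completeness = 1 − H(K|C)/H(K), and with beta = 1 the V-measure is their harmonic mean. Below is a minimal sketch using the identity H(C|K) = H(C) − I(C;K); the helper homogeneity_completeness is ours, and zero-entropy edge cases are ignored:

from sklearn.metrics.cluster import entropy, mutual_info_score

def homogeneity_completeness(labels_true, labels_pred):
    # homogeneity = 1 - H(C|K)/H(C) = I(C;K)/H(C); completeness analogously.
    mi = mutual_info_score(labels_true, labels_pred)
    h = mi / entropy(labels_true)
    c = mi / entropy(labels_pred)
    return h, c

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
h, c = homogeneity_completeness(labels_true, labels_pred)
print(h, c, 2 * h * c / (h + c))  # matches the three sklearn scores for this pair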
Clustering Evaluation Metrics (3) | Fowlkes-Mallows Score and Calinski-Harabasz Index
Fowlkes-Mallows scores
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fowlkes_mallows_score.html
The Fowlkes-Mallows index (FMI) is defined as the geometric mean of the pairwise precision and recall:
FMI = TP / sqrt((TP + FP) * (TP + FN))
The score ranges from 0 to 1. A high value indicates good similarity between the two clusterings.
from sklearn.metrics.cluster import fowlkes_mallows_score
print(fowlkes_mallows_score([0, 0, 1, 1], [0, 0, 1, 1]))#1.0
print(fowlkes_mallows_score([0, 0, 1, 1], [1, 1, 0, 0]))#1.0
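The pair counting behind the formula can be written out directly (a brute-force sketch; the helper fmi is ours, not part of sklearn):

from math import sqrt
from itertools import combinations

def fmi(labels_true, labels_pred):
    # TP: pairs together in both labelings; FP: together only in labels_true;
    # FN: together only in labels_pred.
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1
        elif same_true:
            fp += 1
        elif same_pred:
            fn += 1
    return tp / sqrt((tp + fp) * (tp + fn))

print(fmi([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0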
Calinski-Harabasz Index (no ground-truth labels needed)
When the true cluster labels are unknown, the Calinski-Harabasz index can serve as an evaluation metric: the smaller the within-cluster covariance and the larger the between-cluster covariance, the higher the Calinski-Harabasz score.
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

X = [[1,2,3],[1,2,5],[2,4,7],[1,2,8]]
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_
print(metrics.calinski_harabasz_score(X, labels))  # 4.125
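As a cross-check of the definition, here is a minimal re-implementation from the within- and between-cluster dispersions, reusing X and labels from the block above (the helper calinski_harabasz is ours):

def calinski_harabasz(X, labels):
    # CH = (between-group dispersion / (k - 1)) / (within-group dispersion / (n - k)).
    X = np.asarray(X, dtype=float)
    n, k = len(X), len(np.unique(labels))
    overall_mean = X.mean(axis=0)
    between = within = 0.0
    for lab in np.unique(labels):
        cluster = X[labels == lab]
        center = cluster.mean(axis=0)
        between += len(cluster) * ((center - overall_mean) ** 2).sum()
        within += ((cluster - center) ** 2).sum()
    return (between / (k - 1)) / (within / (n - k))

print(calinski_harabasz(X, labels))  # should match metrics.calinski_harabasz_score(X, labels)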
Clustering Evaluation Metrics (4) | Summary and Worked Example
Summary

| Metric | Needs ground-truth labels? | Range |
| --- | --- | --- |
| Adjusted Rand Index (ARI) | Yes | [-1, 1] |
| Adjusted Mutual Information (AMI) | Yes | [-1, 1] |
| Homogeneity / Completeness / V-measure | Yes | [0, 1] |
| Fowlkes-Mallows score (FMI) | Yes | [0, 1] |
| Silhouette Coefficient | No | [-1, 1] |
| Calinski-Harabasz Index | No | non-negative, unbounded |

For all of these metrics, a higher value indicates a better (or better-matching) clustering.
Worked example
from sklearn.cluster import AffinityPropagation
from sklearn import metrics
from sklearn.datasets import make_blobs
# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                            random_state=0)
# #############################################################################
# Compute Affinity Propagation
af = AffinityPropagation(preference=-50, random_state=0).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)
print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
% metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
% metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
% metrics.silhouette_score(X, labels, metric='sqeuclidean'))
# #############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle
plt.close('all')
plt.figure(1)
plt.clf()
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in X[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()